Artificial intelligence applied to bailout decisions in financial systemic risk management
Daniele Petrone, Neofytos Rodosthenous, and Vito Latora
School of Mathematical Sciences, Queen Mary University of London, Mile End Road, London E1 4NS, UK ∗
(Dated: February 4, 2021)

We describe the bailout of banks by governments as a Markov Decision Process (MDP) where the actions are equity investments. The underlying dynamics is derived from the network of financial institutions linked by mutual exposures, and the negative rewards are associated to the banks' defaults. Each node represents a bank and is associated to a probability of default per unit time (PD) that depends on its capital and is increased by the default of neighbouring nodes. Governments can control the systemic risk of the network by providing additional capital to the banks, lowering their PD at the expense of an increased exposure in case of their failure. Considering the network of European global systemically important institutions, we find the optimal investment policy that solves the MDP, providing direct indications to governments and regulators on the best course of action to limit the effects of financial crises.

PACS numbers: 02.70.-c, 64.60.aq, 05.40.-a, 07.05.Mh, 89.65.Gh
I. INTRODUCTION
In times of crisis, as during the recession of 2008 or the economic disruption triggered by the COVID-19 pandemic, governments face difficult decisions regarding bailing out strategically important companies. In particular, large banks are critical for the stability of the financial system and are closely monitored by central banks and governments. As an example, to rescue Royal Bank of Scotland (RBS) in 2008-2009, the UK government became the majority shareholder of the bank, purchasing shares for a total of 45.5 billion pounds [5]. The government achieved its objectives to stabilise the financial system, and no depositor in UK banks lost money. However, the cost for taxpayers has been estimated by the Office for Budget Responsibility (OBR) to be in the region of 27 billion pounds as of March 2018 [2]. The price of RBS shares plummeted after the purchase and the government has since sold part of its investment at a loss. Was the government intervention value for money? The National Audit Office (NAO) is the UK's public spending watchdog and in December 2009 released the report "Maintaining financial stability across the United Kingdom's banking system" [3], where it analysed the government support for the banking sector and concluded that: "If the support measures had not been put in place, the scale of the economic and social costs if one or more major UK banks had collapsed is difficult to envision. The support provided to the banks was therefore justified, but the final cost to the taxpayer of the support will not be known for a number of years". The NAO did not produce an estimate of the impact in case of inaction by the government. In this paper, we propose a mathematical framework that allows a quantitative comparison between investment decisions available to the government.

∗ Electronic address: [email protected]
Our framework is based on the following three building blocks: (a) a dynamical network model of the financial system with a contagion mechanism between financial institutions; (b) a set of allowed government interventions to control the network; and (c) a quantitative way to assess the government actions at each time step. A network model [11][14][19][20] is essential, as the main concern is not the direct cost of a default but the systemic risk that it entails. Systemic risk can be defined as the risk that a large part of the financial system is disrupted, and as such it requires connections between financial institutions that can transfer the distress along the network [12][13][16]. The contagion mechanism that we use is the impact that a bank default has on other banks [6]. The impact can be due to direct losses in bilateral credit exposures [23][22] (for example, if they had lent money to the defaulting bank), or indirect losses due to fire sales of assets by the defaulting bank [15], which would lower the market value of similar assets in the balance sheets of the other financial institutions. The impact lowers the capital buffer of the affected banks, weakening the network and its ability to withstand future shocks. In particular, the probability of default per unit time (PD) of the nodes (banks or financial institutions) would increase, hence increasing the expected loss in the network [6]. One main novelty of our model is that we allow the network to be controlled by government investments in the capital of the banks. Such an investment would, conversely, decrease the PD of the banks that receive the additional capital, but also increase the loss for the government in case of their default. In our framework, the connection between the change in PD and the variation in the amount of capital is provided by the Merton model of credit risk [18].
To follow the evolution in time of the network, we simulate the default process given the PD of the nodes and their tendency of defaulting during the same time step. Finally, we use artificial intelligence techniques [24][26][28] to assess the optimality of government decisions (no investment vs different amounts of investment), recasting the system as a Markov Decision Process (MDP) [29] where the actions (controls) are government investments at each time step.

The paper is structured as follows. In Section II A we describe the network of financial institutions, its dynamics and contagion mechanism, and then in Section II B we introduce a Markov Decision Process based on the network, in order to model government interventions on bailed-out banks. We continue in Section II C by presenting our strategy to solve the MDP, namely finding the optimal government investment decision for each state of the network and time. Section III contains our results, obtained by applying our model to a homogeneous network organised as a Krackhardt kite graph (see Fig. 1) and to the network of the European Global Systemically Important Institutions. We have found that a pre-existing investment in a distressed node makes it convenient for the government to intervene again to try to save the invested capital (creating moral hazard, as the node could act haphazardly relying on the implicit government guarantee). Moreover, by changing the parameter α, which accounts for the taxpayers' loss in case a bank defaults, we have observed that there is a 'critical' value that separates networks for which the inaction of the government is the best option from networks where an investment by the government would be the optimal decision, as it would lower the overall expected loss of the system. Finally, we provide our conclusions in Section IV.

II. OUR FRAMEWORK

A. Network of financial institutions
We consider a network G with a set I = {1, ..., N} of nodes representing financial institutions. Each node i ∈ I is characterised at time t by a probability of default PD_i(t) ∈ (0, 1] per time interval ∆t, a total asset W_i(t) and an equity E_i(t) (such that E_i(t) ≤ W_i(t)), which is the capital used by node i as a buffer to withstand financial losses. The edges w_ij of the network represent the exposure of node i to the default of node j, for all i ≠ j ∈ I. To take into account government interventions aimed at limiting the overall losses, we use an adaptation of the 'PD Model' described in [6], extending it to allow the possibility for the nodes (banks) to incur positive shocks, via investments in the nodes, rather than just negative shocks due to the default of other nodes. The focus has also changed from the one in [6], as we are now exclusively interested in the losses incurred by the taxpayers, disregarding the losses sustained by private investors. In the following, we will measure the time in discrete time steps that are multiples of ∆t, i.e. t + 1 is equivalent to t + ∆t.

We define the total impact I_i(t) on node i at time t, due to the default of other nodes j ∈ I \ {i} in the network, as

I_i(t) := Σ_{j ∈ I \ {i}} w_ij(t) δ_j(t), for all i ∈ I,   (1)

where δ_j(t) = 1 if and only if node j defaults at time t, and δ_j(t) = 0 otherwise. The impact I_i(t) represents a loss for the total asset W_i, which in turn also decreases the equity E_i of node i, hence reducing their values at time t + 1. This can be seen from the accounting equation for each node i, namely

W_i(t) = E_i(t) + B_i(t),   (2)

which states that the total asset W_i is equal at all times to the equity E_i plus the total liability B_i. Note that B_i is not affected by the losses, as it is comprised of loans from other banks, deposits, etc., that are due in full unless bank i defaults. Hence, we have

∆W_i(t) = ∆E_i(t),   (3)

where we define ∆X_i(t) := X_i(t + 1) − X_i(t).
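As an illustration, the contagion bookkeeping in equations (1)-(3) can be sketched in a few lines (the dict-of-dicts exposure representation and the function names are our own, not part of the model specification):

```python
def impact(w, defaulted, i):
    """Total impact I_i(t) of eq. (1): sum of the exposures w[i][j]
    of node i to the nodes j that defaulted at time t."""
    return sum(w_ij for j, w_ij in w[i].items() if j in defaulted)

def apply_impact(W_i, E_i, I_i):
    """The impact is a loss for the total asset W_i, and, since the
    liability B_i is unchanged, the equity drops by the same amount
    (eqs (2)-(3): Delta W_i = Delta E_i)."""
    return W_i - I_i, E_i - I_i

# Example: node 1 is exposed to nodes 2 and 3; node 3 defaults.
w = {1: {2: 1.0, 3: 2.5}}
I1 = impact(w, defaulted={3}, i=1)
W1, E1 = apply_impact(100.0, 3.0, I1)
```

The surviving node absorbs the exposure to the defaulted node out of its equity buffer, which is the mechanism that weakens the network after each default.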
We can therefore write

W_i(t + 1) − W_i(t) = −I_i(t) + ∆J_i(t),   (4)
E_i(t + 1) − E_i(t) = −I_i(t) + ∆J_i(t),   (5)

where ∆J_i(t) denotes the potential increase in the current investment J_i(t) of the government in node i at time t. On the other hand, the probability of default PD_i(t) of node i is increased by the impact I_i(t) at time t, since part of the capital buffer (equity E_i) is lost. In order to model the effect of the impact I_i(t) on PD_i(t), we use the Merton model for credit risk [18] to calculate the 'implied probability of default' PD^M as a function of the parameters of each node:
PD^M(W, E, µ, σ) := 1 − Φ( (log(W/(W − E)) + µ − σ²/2) / σ ),   (6)

where the term W − E represents the total liability B of each bank, Φ is the univariate standard Gaussian distribution, µ is the drift and σ is the volatility of the geometric Brownian motion associated to the total asset W in the Merton model. We then use (6) to obtain

PD_i(t) := max{ PD^M(W_i(t), E_i(t), µ_i, σ_i), PD^M_floor,i },   (7)

where we introduced the fixed number PD^M_floor,i representing the lower bound of the
PD_i that is used to exclude unreasonably low probabilities of default. For example, it is a standard assumption for the PD_i of a bank i to be greater than or equal to the probability of default of the country where it is based; in this context, the latter is the probability of a country defaulting on its debt.

Now, if node i loses an amount of capital I_i(t) at some time t greater than or equal to its buffer E_i(t), the total asset W_i(t) becomes less than its liability B_i(t) and it is convenient for the shareholders to exercise their option to default. In practice, when this occurs, we set PD_i(t + 1) = 1 and node i will default at time t + 1. Moreover, recall that node i may also default at any time t with probability PD_i(t) due to its own individual characteristics given by (7); see also the default mechanism in (11) below.

Now, when node i defaults, we denote by LGD_i the 'Loss Given Default' of node i, which is a fixed number representing the percentage of the investments J_i in node i by the government that cannot be recovered after a default. In case of default of node i, we further assume that, in addition to the aforementioned loss of investments, the taxpayers' loss L_i also comprises a fixed percentage α_i (for convenience) of the total asset W_i of the node. That is, the taxpayers' overall loss L_i is given by

L_i := α_i W_i + J_i LGD_i.   (8)

To complete our framework, we need to specify the probability of more than one default happening during the same time step, given the PD_i of each node i obtained by (7). For example, if the nodes were independent, the probability of nodes i and j defaulting at the same time step, denoted by PD_[ij], would be the product of the individual probabilities PD_i and PD_j.
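The implied probability of default (6), the floored PD (7) and the taxpayers' loss (8) can be sketched as follows (a minimal stdlib-only sketch; the function names and example figures are ours):

```python
from math import erf, log, sqrt

def Phi(x):
    """Univariate standard Gaussian distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def pd_merton(W, E, mu, sigma):
    """Implied one-period probability of default, eq. (6):
    1 - Phi((log(W/(W-E)) + mu - sigma^2/2) / sigma)."""
    return 1.0 - Phi((log(W / (W - E)) + mu - 0.5 * sigma ** 2) / sigma)

def pd_node(W, E, mu, sigma, pd_floor):
    """Eq. (7): the Merton PD, floored at PD^M_floor."""
    return max(pd_merton(W, E, mu, sigma), pd_floor)

def taxpayer_loss(alpha, W, J, lgd):
    """Eq. (8): L = alpha * W + J * LGD."""
    return alpha * W + J * lgd

# With W = 100 and E = 3 (the ballpark figures of Section III),
# mu = 0 and sigma = 1%, the implied one-step PD is of the order of 0.1%.
pd_example = pd_merton(100.0, 3.0, 0.0, 0.01)
```

Note how losing equity (smaller E at fixed W) pushes the Merton PD up, which is exactly the contagion channel of the model.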
In this paper, we allow nodes to depend on each other and use a Gaussian latent variable model [17] to calculate the probabilities of simultaneous defaults of two or more nodes. To be more precise, the probability of a finite subset of nodes {i, j, k, ...} ⊆ I in the network G defaulting at the same time is given by the following integral

PD_[i,j,k,...] := ∫_D Φ'_N(u; Σ) du,   (9)

where Φ'_N is the standardised multivariate Gaussian density function with zero mean and a symmetric correlation matrix Σ ∈ [−1, 1]^{N×N}, given by

Φ'_N(u; Σ) := exp{ −(1/2) uᵀ Σ⁻¹ u } / sqrt( (2π)^N |Σ| ),   (10)

and |Σ| is the determinant of Σ. We further note that the integration domain D in (9) is the Cartesian product of the intervals (−∞, Φ⁻¹(PD_i)] for each node i that belongs to the set of defaulting nodes, and the intervals (−∞, ∞) for the remaining nodes, where Φ is the univariate standard Gaussian distribution.

In the sequel, this model will also be used to simulate the default mechanism. To be more precise, by sampling values x_1, ..., x_N of the random vector X = (X_1, X_2, ..., X_N)ᵀ with the multivariate Gaussian distribution mentioned above, at each time step t, we will assume that node i defaults according to the rule:

x_i < Φ⁻¹(PD_i(t)) ⟺ δ_i(t) = 1.   (11)

B. Formulation of the banks' bailout problem as a Markov Decision Process
We describe the government decisions of bailing out banks as a Markov Decision Process (MDP) driven by the network framework described above. We assume that the government estimates that the crisis will likely be over at time M, and that in any case it will be able to sell the shares of the rescued banks to the private sector for a price that is similar to the purchasing price. We define the 4-tuple (S, A_s, P_a, R_a) of the set S of all the states, the set A_s of all actions available from state s ∈ S, the transition probabilities P_a(s, s') = P(s_{t+1} = s' | s_t = s, a_t = a) between state s at any time t and state s' at time t + 1 having taken action a ∈ A_s at time t, and the rewards (negative losses in our model) R_a(s, s') received after taking action a at any time t while being at state s and landing in state s' at time t + 1, where s, s' ∈ S. Furthermore, a constant discount factor γ is defined with 0 ≤ γ < 1, so that rewards obtained sooner are more relevant in the calculation of the cumulative reward CR over M steps. The latter is therefore defined by

CR := Σ_{t=0}^{M−1} γᵗ R_{a_t}(s_t, s'_{t+1}).   (12)

In the remainder of this section, we expand on the 4-tuple (S, A_s, P_a, R_a) that defines our MDP.

MDP states.
The states s_t ∈ S, at each time t, are defined by three main pillars: (a) all the parameters of the network G (W_i(t), E_i(t), PD_i(t), LGD_i, α_i, µ_i, J_i, σ_i, w_ij, Σ_ij, for i, j ∈ {1, ..., N}, where w_ii = 0); (b) an indexed set I_def(t) ⊆ I containing all nodes defaulted prior to time t; and (c) the time to maturity M − t.

MDP actions.
The MDP actions in our model are injections of capital a_t → ∆J^a(t) = (∆J^a_1(t), ∆J^a_2(t), ..., ∆J^a_N(t)) by the government into the nodes (1, 2, ..., N). On one hand, these additional resources make the nodes more resilient, hence diminishing their probability of default via (4)-(7); on the other hand, they will be at risk in case of default, since they increase each J_i in (8). These actions are the control variables of the government when trying to minimise the losses of the network (i.e. maximise the expected CR in (12); see Section II C for more details). We further assume that the government investments (relative to action a_t) decided at time t are implemented immediately, so that the probability of default PD_i(t) given by (7), the default mechanism in (11) and subsequently the impacts I_i(t), for each node i ∈ I, are implemented using the updated (increased) capital E_i(t) = E_i(t) + ∆J^a_i(t), total asset W_i(t) = W_i(t) + ∆J^a_i(t) and government investment J_i(t) = J_i(t) + ∆J^a_i(t).

MDP transition probabilities.
Within our framework, a node that has defaulted does not contribute to future losses and cannot become active again, i.e. the cardinality of the set of defaulted nodes |I_def(t)| is a non-decreasing function of time t. Hence, the transition probability P_a(s, s') from state s to s' will be non-zero only for states s' that: (a) have at least as many defaulted nodes as state s; and (b) are "reachable", in the sense that their PD_i(t + 1), W_i(t + 1) and E_i(t + 1), for i ∈ I \ I_def(t + 1) (the set of remaining active nodes in s'), take values that are coherent with equations (4)-(7) after calculating the impacts I_i(t) from the nodes i ∈ I_def(t + 1) \ I_def(t). In order to illustrate the above, we consider the following example.

Example 1
Let us consider a network with three nodes I = {1, 2, 3} and w_ij = 1 for all i ≠ j ∈ I, at a time t, such that node 3 ∈ I_def(t) has already defaulted, while the remaining nodes have W_i(t) = 100, E_i(t) = 3 and PD_i(t) = 0.001 for i ∈ I \ I_def(t) = {1, 2}. In case the government does not intervene, the states s' that can be reached are the ones where: (i) all the nodes default at time t, i.e. I_def(t + 1) = I; (ii) nodes 1 and 2 are still active and W_i(t + 1), E_i(t + 1) and PD_i(t + 1) for i ∈ {1, 2} are the same as for state s; (iii) node 1 defaults at time t while node 2 remains active, i.e. I_def(t + 1) = {1, 3}, W_2(t + 1) = 99 and E_2(t + 1) = 2 (since the impact I_2(t) = w_21 = 1), and PD_2(t + 1) needs to take the value calculated via (7) using the W_2(t + 1) and E_2(t + 1) inputs; and (iv) node 2 defaults at time t but node 1 remains active, which is analogous to (iii) by swapping indices 1 and 2. Now, if the government decides to invest, i.e. a → (∆J^a_1(t), ∆J^a_2(t)) in nodes 1 and 2, respectively, at time t, we need to update the capitals E_i(t) = E_i(t) + ∆J^a_i(t) and total assets W_i(t) = W_i(t) + ∆J^a_i(t) for i ∈ {1, 2} according to the government intervention, and then use the updated E_i(t), W_i(t) to perform the same analysis as above to identify the reachable states.

For states s'_{t+1} with a non-zero transition probability P_{a_t}(s_t, s'_{t+1}), we can calculate the latter via the Gaussian latent variable model; thus they will depend exclusively on the parameters PD_i and Σ_ij with i, j ∈ I \ I_def(t). To be more precise, we first create an intermediate state s by applying the government investments relative to action a_t to state s_t; hence each node i of s will have an increased capital E_i(t) = E_i(t) + ∆J^a_i(t), an increased total asset W_i(t) = W_i(t) + ∆J^a_i(t), an increased government investment J_i(t) = J_i(t) + ∆J^a_i(t) and a probability of default given by (7) with inputs W_i(t) and E_i(t).
Using the intermediate state s with updated E_i(t), W_i(t), J_i(t) and updated PD_i(t), we calculate the transition probability via the following integral (see also (9)):

P_{a_t}(s_t, s'_{t+1}) := ∫_D Φ'_{|I \ I_def(t)|}(u; Σ_sub) du,   (13)

where Φ' is the density given by (10), with dimension equal to the cardinality of the set of surviving nodes |I \ I_def(t)| ≤ N. Moreover, the integration domain D in (13) is the Cartesian product of the intervals (−∞, Φ⁻¹(PD_i)] for the additional defaulted nodes i ∈ I_def(t + 1) \ I_def(t) and the intervals [Φ⁻¹(PD_i), ∞) for all the remaining active nodes i ∈ I \ I_def(t + 1) at state s'_{t+1}, upon recalling the default mechanism in (11). The matrix Σ_sub is the sub-matrix of the original correlation matrix Σ after removing the rows and the columns corresponding to the defaulted nodes i ∈ I_def(t) at state s_t.

MDP rewards.
In our model, the "rewards" take non-positive values, since their overall maximisation has to translate, for our MDP, into the minimisation of the overall taxpayers' losses L_i(t) given by (8), for all nodes i ∈ I \ I_def(t) at each time t. Namely,

R_{a_t}(s_t, s'_{t+1}) := − Σ_{i ∈ I \ I_def(t)} ( α_i W_i(t) + J_i(t) LGD_i ) δ_i(t),   (14)

where only the nodes defaulting at time t with δ_i(t) = 1 contribute to the sum of losses; hence the "reward" is 0 if there are no additional defaults at time t. The expected "reward" depends on the action taken, as it influences the total asset W_i and investment J_i corresponding to the intermediate state s described previously, as well as the PD_i, i.e. the probability of having δ_i(t) = 1.

C. Solving the Markov Decision Process
Solving the MDP means finding the optimal action for each possible state s_t. In our context, we expect our model to indicate whether the government should intervene and, if so, which amount it should invest for a given configuration of the financial system network. To find a solution and describe it mathematically, we need to define a few concepts, as described below.

Optimal policy.
The optimal policy π*(s_t) → a*_t is a function that returns the optimal action a* ∈ A_s for each state s at time t. The optimal action is the one that obtains the maximum expected cumulative reward as defined in (12).

Optimal value function.
The optimal value function V*(s_t) is defined by

V*(s_t) := E_{π*}[ CR | s_t ].   (15)

This is the expected cumulative reward starting from state s_t and following the optimal policy π* for all of the successive time steps until the end of the episode (recall that a full episode consists of M time steps). One way to obtain this expected cumulative reward is to run the MDP starting at s_t multiple times and average the results. Given the definition of π*, V*(s_t) represents the maximum expected cumulative reward that can be obtained starting from s_t.

Optimal action value function.
The optimal action value function Q*(s_t, a_t) is the expected cumulative reward we obtain if we first take action a_t at state s_t and then follow the optimal policy π* for all of the successive steps from t + 1 until the end of the episode M. It is defined by

Q*(s_t, a_t) := E_{π*}[ CR | s_t, a_t ].   (16)

Similarly to the previous paragraph, this represents the maximum expected cumulative reward that can be obtained starting from s_t after taking action a_t. Notice that finding Q* is equivalent to solving the MDP, since the optimal action for each state s_t (hence the optimal policy π*) can be obtained by

a*_t = argmax_{a_t} Q*(s_t, a_t).   (17)

Relationships between Q* and V*. From the definitions of V*(s_t) and Q*(s_t, a_t), it follows that

V*(s_t) = max_{a_t} Q*(s_t, a_t),   (18)

i.e. the maximum cumulative reward from s_t is the one corresponding to the maximum value of Q* after looking at all the potential alternative actions a_t. Conversely, we can write Q*(s_t, a_t) in terms of V*(s_t) as:

Q*(s_t, a_t) = Σ_{s'_{t+1}} P_{a_t}(s_t, s'_{t+1}) ( R_{a_t}(s_t, s'_{t+1}) + γ V*(s'_{t+1}) ).   (19)

In other words, Q*(s_t, a_t) can be expressed as the immediate expected reward at time t, given by Σ_{s'_{t+1}} P_{a_t}(s_t, s'_{t+1}) R_{a_t}(s_t, s'_{t+1}), plus the expected cumulative reward from time t + 1 onwards, given by γ Σ_{s'_{t+1}} P_{a_t}(s_t, s'_{t+1}) V*(s'_{t+1}). Merging together equations (18) and (19), we then obtain the Bellman Optimality Equation:

V*(s) = max_a { Σ_{s'} P_a(s, s') ( R_a(s, s') + γ V*(s') ) }.   (20)

Our strategy to solve the MDP.
We have a complete description of our MDP (in particular, we have the transition probabilities P_{a_t}(s_t, s'_{t+1}) and the rewards R_{a_t}(s_t, s'_{t+1})); hence, in theory, we could enumerate all the possible states, use Dynamic Programming and the Value Iteration algorithm [30] to find V*, and calculate Q* via equation (19), thus solving the MDP. However, this is not a scalable approach, due to the complexity of the MDP states and the very large number of successor states s' for all but trivial networks. Instead, we use a Fitted Value Iteration algorithm [25] that involves: (a) devising a parametric representation V*(s, β) for the optimal value function V*(s) in (20), where β is a placeholder for a set of parameters to fit (see Sections II C 1 and V B); (b) using the approximate Bellman Optimality Equation (i.e. substituting V*(s') with V*(s', β) in (20)) to fit β, so that eventually V*(s) ≈ V*(s, β^fit) via a learning process (see Section II C 2); and finally (c) calculating Q*(s, a) from V*(s, β^fit), hence solving the MDP.
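To see why the Bellman recursion (20) pins down the optimal policy, here is exact Value Iteration on a deliberately tiny two-state toy (a sketch of the textbook algorithm [30], not the authors' implementation; the toy states, actions and numbers are ours):

```python
def value_iteration(states, actions, P, R, gamma=0.98, tol=1e-9):
    """Iterate the Bellman optimality operator (20) to a fixed point.
    P[s][a] is a dict {s': prob}; R[s][a][s'] is the reward."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {}
        for s in states:
            V_new[s] = max(
                sum(p * (R[s][a][s2] + gamma * V[s2])
                    for s2, p in P[s][a].items())
                for a in actions[s])
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

# Toy: a single "bank" that is 'alive' or (absorbingly) 'dead'.
# Investing lowers the default probability but raises the loss on default.
states = ['alive', 'dead']
actions = {'alive': ['none', 'invest'], 'dead': ['stay']}
P = {'alive': {'none':   {'dead': 0.10, 'alive': 0.90},
               'invest': {'dead': 0.01, 'alive': 0.99}},
     'dead':  {'stay': {'dead': 1.0}}}
R = {'alive': {'none':   {'dead': -10.0, 'alive': 0.0},
               'invest': {'dead': -12.0, 'alive': 0.0}},
     'dead':  {'stay': {'dead': 0.0}}}
V = value_iteration(states, actions, P, R)
```

In this toy the 'invest' action dominates despite its larger loss-given-default, mirroring the trade-off of the paper; the point of the section, however, is that enumerating states this way does not scale to realistic networks.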
1. Value function approximation
In order to solve our MDP using the Fitted Value Iteration algorithm, we need a parametric representation of the optimal value function V*(s_t). In our case, V*(s_t) is minus the minimum expected cumulative losses from s_t (i.e. the maximum expected cumulative reward from state s_t) incurred between time t and the end-of-episode time M (see also (14)-(15)). The greater the number of nodes in the financial network and the number of residual steps m := M − t, the greater is the potential for additional losses. It is natural to try to express V*(s_t) as a sum of the loss contributions due to each individual node at each of the remaining m time steps. Hence, we introduce the matrix Z := (Z_ik(s_t)) with i ∈ I \ I_def(t) and k ∈ {1, ..., m}, where each element Z_ik(s_t) represents the approximate expected loss due to the default of node i at time t + k − 1, taking into account potential government investments. Our final ansatz for the parametric representation V*(s_t, β) of V*(s_t) is that it is given by a linear combination of the elements Z_ik, in which the coefficients β are arranged in a matrix that can change with time, i.e. β ≡ β_t := (β_ik(t)). Namely,

V*(s_t, β_t) := − Σ_{i,k} β_ik(t) Z_ik(s_t),   (21)

for i ∈ I \ I_def(t), k ∈ {1, ..., m}. We then let the system learn the parameters β_ik(t) that maximise the expected cumulative reward (minimise the losses). In the following section, we describe how we fit the parameters β_ik(t) to achieve the aforementioned task, while in Section V B we detail our choice of Z(s_t) in terms of the characteristics of the network.
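The ansatz (21) is straightforward to evaluate; a minimal sketch (the list-of-lists layout for Z and β is our own choice):

```python
def V_approx(beta_t, Z):
    """Ansatz (21): V*(s_t, beta_t) = - sum_{i,k} beta_ik(t) * Z_ik(s_t),
    where Z_ik(s_t) is the approximate expected loss of node i at
    step t + k - 1."""
    return -sum(b * z
                for row_b, row_z in zip(beta_t, Z)
                for b, z in zip(row_b, row_z))

# With the natural initialisation beta_ik = 1, the ansatz reduces to
# minus the total approximate expected loss (hypothetical Z values).
Z = [[0.5, 0.25],
     [0.2, 0.05]]
beta = [[1.0, 1.0],
        [1.0, 1.0]]
v = V_approx(beta, Z)
```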
2. Learning process
In order to learn the parameters β_ik(t), we use the Bellman Optimality Equation (20) and define the "Bellman value" as its right-hand side after substituting V*(s') with the approximation V*(s', β) from the previous section. Namely, we define

V^B(s_t, β_{t+1}) := max_{a_t} { Σ_{s'_{t+1}} P_{a_t}(s_t, s'_{t+1}) ( R_{a_t}(s_t, s'_{t+1}) + γ V*(s'_{t+1}, β_{t+1}) ) }.   (22)

We can initialise β with β_ik(t) = 1 for all i, k, t, as a natural starting point due to our initial approximation of expected direct losses Z_ik(s_t) in (21) (see also Section V B for more details). We can then compare V*(s_t, β_t) from (21) with V^B(s_t, β_{t+1}) from (22) at state s_t (starting from the initial state s at time 0 and moving forward to time t), and adjust β so that the two values come closer. Afterwards, we move to another state s'_{t+1} and repeat the same procedure until the difference between V* and V^B is "small enough" within the subset of the state space S that is reachable from s. Notice, however, that the above approach does not converge in general, unless we use specific learning strategies. The issue is that V^B itself depends on β, which is what we want to fit, potentially triggering a divergent loop. To resolve this issue, we notice that V*(s_t, β_t) depends on β at time t, while the corresponding V^B(s_t, β_{t+1}) is a function of β at time t + 1. Using this fact, if we fit β backwards in time, then V*(s_t, β_t) is compared with a value V^B(s_t, β_{t+1}) that is fixed (because β_{t+1} would have already been fitted), thus solving the convergence problem.

The primary issue that we now need to address is to find a way to calculate V^B from (22), despite the fact that the set of states s' that can be reached from state s is huge, even for relatively small networks.
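The backward fitting schedule can be sketched as follows (all arguments are hypothetical stand-ins for the paper's components: a representative portfolio of states per time step, a feature map, the Bellman value (22) and a regression routine):

```python
def fit_backwards(M, portfolio, features, bellman_value, fit):
    """Backward schedule: when beta_t is fitted, beta_{t+1} is already
    fixed, which avoids the divergent feedback loop noted above."""
    beta = {M: None}                       # V*(s_t) = 0 for t >= M
    for t in range(M - 1, -1, -1):
        X = [features(s, t) for s in portfolio[t]]
        y = [bellman_value(s, t, beta[t + 1]) for s in portfolio[t]]
        beta[t] = fit(X, y)                # e.g. a ridge regression
    return beta

# Trivial stand-ins, just to show the schedule runs back from M-1 to 0.
beta = fit_backwards(
    M=3,
    portfolio={t: ['s'] for t in range(3)},
    features=lambda s, t: [1.0],
    bellman_value=lambda s, t, beta_next: -1.0,
    fit=lambda X, y: sum(y) / len(y))
```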
We first notice that Σ_{s'} P_a(s, s') R_a(s, s') is the 'one-step' expected reward, which can be rewritten in terms of the nodes of the network:

Σ_{s'} P_a(s, s') R_a(s, s') = − Σ_{i ∈ I \ I_def} PD^a_i L^a_i,   (23)

with

PD^a_i := PD^M(W_i + ∆J^a_i, E_i + ∆J^a_i, µ_i, σ_i),   (24)
L^a_i := α_i (W_i + ∆J^a_i) + (J_i + ∆J^a_i) LGD_i.   (25)

Secondly, the term Σ_{s'} P_a(s, s') V*(s', β) can be estimated via Monte Carlo simulations, which involve (a) sampling s' using the distribution P_a(s) defined by the probability mass function P_a(s, s'), and (b) calculating the expected value E_{P_a(s)}[V*(s', β)] by averaging the values V*(s', β). Essentially,

Σ_{s'} P_a(s, s') V*(s', β) ∼ E_{P_a(s)}[V*(s', β)].   (26)

However, it is not feasible to calculate P_a(s, s') for all the states s' that can be reached from s after taking action a, due to the huge number of these states s'. Once again, we use our knowledge of the underlying network dynamics to describe the right-hand side of (26) in terms of nodes defaulting instead of MDP transition probabilities. We observe that the transition probability P_a(s, s') was defined through the Gaussian latent variable model (see (13)) and that there is a one-to-one correspondence between additional nodes defaulting from state s and the state s' reached given action a. In particular, we denote by G^a_s the probability distribution of states s' which are derived by using our Gaussian latent variable model in order to first simulate which nodes default via the default mechanism in (11) and then to obtain the corresponding state s'.
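The Monte Carlo estimate (26), with successor states generated by the Gaussian latent variable model, can be sketched as follows (we use a one-factor correlation structure as a simple stand-in for the full matrix Σ; all names and figures are ours):

```python
import random
from math import erf, sqrt

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def phi_inv(p, lo=-10.0, hi=10.0):
    """Inverse standard normal CDF by bisection (stdlib only)."""
    for _ in range(80):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if Phi(mid) < p else (lo, mid)
    return (lo + hi) / 2.0

def sample_defaults(thresholds, rho, rng):
    """One draw of the default indicators (11): node i defaults iff
    its latent Gaussian falls below Phi^{-1}(PD_i).  The latent
    variables share a common factor with pairwise correlation rho."""
    z = rng.gauss(0.0, 1.0)                 # common (systemic) factor
    return [1 if sqrt(rho) * z + sqrt(1.0 - rho) * rng.gauss(0.0, 1.0) < c
            else 0
            for c in thresholds]

def mc_expected_value(pds, rho, value_of_state, n=20000, seed=0):
    """Eq. (27): estimate E[V*(s', beta)] by averaging the value of
    simulated successor states s'."""
    rng = random.Random(seed)
    thresholds = [phi_inv(pd) for pd in pds]
    total = sum(value_of_state(sample_defaults(thresholds, rho, rng))
                for _ in range(n))
    return total / n
```

With two nodes of PD = 0.1 and a loss of 10 per default, the estimated expected one-step value comes out near −2, regardless of rho, while the correlation shifts probability mass towards joint defaults.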
Since G^a_s is equivalent to P_a(s), due to the aforementioned one-to-one correspondence, we can therefore rewrite (26) as

Σ_{s'} P_a(s, s') V*(s', β) ∼ E_{G^a_s}[V*(s', β)].   (27)

Putting all these together (using essentially (23) and (27)), we can eventually rewrite V^B(s_t, β_{t+1}) from (22), for all t ∈ [0, M − 1], in the form

V^B(s_t, β_{t+1}) = max_{a_t} { − Σ_{i ∈ I \ I_def(t)} PD^{a_t}_i L^{a_t}_i + γ E_{G^{a_t}_{s_t}}[ V*(s'_{t+1}, β_{t+1}) ] }.   (28)

Now, given that our episode ends at time step M, we observe that V*(s_t) = 0 for all t ≥ M. Hence, at time M − 1, we have that V^B(s_{M−1}, β_M) ≡ V^B(s_{M−1}), since it will not depend on β, and we can thus write

V^B(s_{M−1}) = max_{a_{M−1}} { − Σ_{i ∈ I \ I_def(M−1)} PD^{a_{M−1}}_i L^{a_{M−1}}_i } = V*(s_{M−1}),   (29)

where the latter equality follows from (20) and (23). Now that we can calculate the exact optimal value function V* for each state at time M − 1, we notice from (28) that V^B(s_{M−2}, β_{M−1}) ≡ V^B(s_{M−2}) is also independent of β, namely

V^B(s_{M−2}) = max_{a_{M−2}} { − Σ_{i ∈ I \ I_def(M−2)} PD^{a_{M−2}}_i L^{a_{M−2}}_i + γ E_{G^{a_{M−2}}_{s_{M−2}}}[ V*(s'_{M−1}) ] }.   (30)

We then fit β backwards in time for the decreasing sequence of time steps (M − 2, ..., 0), comparing V*(s_t, β_t) with V^B(s_t, β_{t+1}). Firstly, for time step M − 2, we compare V*(s_{M−2}, β_{M−2}) with V^B(s_{M−2}), for all the states in the representative portfolio, and we fit β_{M−2}. Then, for time step M − 3, we calculate

V^B(s_{M−3}, β_{M−2}) = max_{a_{M−3}} { − Σ_{i ∈ I \ I_def(M−3)} PD^{a_{M−3}}_i L^{a_{M−3}}_i + γ E_{G^{a_{M−3}}_{s_{M−3}}}[ V*(s'_{M−2}, β_{M−2}) ] }   (31)

and compare it with V*(s_{M−3}, β_{M−3}), for all the states of the representative portfolio, to obtain once again β_{M−3} via a ridge regression. We continue the procedure backwards in time until we successfully obtain β^fit, i.e. the fitted β_t for each time t.
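A minimal sketch of the ridge-regression fitting step (the paper fits the whole matrix β_ik; for illustration we show the closed form for a single coefficient, with hypothetical data):

```python
def ridge_fit_1d(x, y, lam=1e-3):
    """Closed-form ridge regression for a single coefficient b in
    y ~ b*x: b = sum(x*y) / (sum(x*x) + lam), i.e. the minimiser of
    sum (y - b*x)^2 + lam*b^2.  In the fitting step, x would collect
    a Z-feature of each state in the representative portfolio and y
    the corresponding Bellman values V^B."""
    return (sum(xi * yi for xi, yi in zip(x, y))
            / (sum(xi * xi for xi in x) + lam))

# If the Bellman values are exactly -0.8 times a single Z-feature,
# the fit recovers beta ~ 0.8 in the ansatz V = -beta * Z.
Z = [1.0, 2.0, 3.0]
vB = [-0.8, -1.6, -2.4]
beta = ridge_fit_1d([-z for z in Z], vB)
```

The regularisation term lam keeps the fit stable when features are nearly collinear, which is the usual motivation for preferring ridge over plain least squares here.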
3. Solution of MDP
Finally, we can solve the MDP by combining all the above results to calculate Q*(s_t, a_t). In particular, by substituting V*(s'_{t+1}) in (19) with its approximation V*(s'_{t+1}, β^fit_{t+1}) obtained in Section II C 2, and applying the same analysis performed to obtain (28), we conclude that

Q*(s_t, a_t) ≈ − Σ_{i ∈ I\I_def(t)} PD_i^{a_t} L_i^{a_t} + γ E_{G^{a_t}_{s_t}}[ V*(s'_{t+1}, β^fit_{t+1}) ],   (32)

which provides the solution to the MDP.

III. RESULTS
The main result of this paper is the creation of the framework itself. A professional calibration of our model would require the effort of a central bank or a government office. To show how our model works, we explore two instances of our framework: in Section III A we use a network with homogeneous nodes organised as the Krackhardt kite [1] (KK) graph (Fig. 1), while in Section III B we use the network of the European Global Systemically Important Institutions (GSIIs) obtained from the data in the European Banking Authority website [31] (EBA network).
A. Krackhardt kite network
FIG. 1: Krackhardt kite (KK) graph. In our example we consider: set of nodes I = {1, ..., 10}, total assets W_i(0) = 100, capital E_i(0) = 3, µ_i = 0, LGD_i = 1, for all i ∈ I, PD_i(0) = 0.01 for i ∈ {4, 8, 10} and PD_i(0) = 0.001 for i ∈ I \ {4, 8, 10}.

We assume the number of steps to be M = 7, the discount factor γ = 0.98 and, for each node i, J_i = 0 (unless otherwise specified), µ_i = 0 (assuming conservatively that the expected value of the assets' return is zero) and α_i = α (the same for each node). The government can invest only in "risky" banks i with "relatively high" PD_i (in our examples, "risky" banks will have PD_i > 0.009), and we take a common value for the pairwise asset correlations ρ_ij, i ≠ j ∈ I \ I_def. The value of σ_i, for each node i, is calculated at time t = 0 from PD_i(0), E_i(0) and W_i(0), by inverting (6). Finally, we set a probability-of-default floor PD^floor_i for each node i.

Actions are indicated as <node>@<capital investment as a tenth of a percent of the total assets W>. For example, 8@05 means an investment of 50 bp of W, i.e. 0.005 W, in node 8. An action that considers all the nodes is indicated with <node> = 0. Hence 0@15 stands for an investment of 1.5% W_i in each "risky" node i ∈ I \ I_def. The common theme is that adding external resources makes the network more resilient, but they can be lost in a subsequent default, which creates a trade-off for the decision maker. For relatively low values of α it is generally not convenient to invest, while for relatively high values of α the best action is to invest an amount of capital that makes the network sufficiently resilient. In the EBA network (see Section III B), we have shown that there exists a "critical" α_c that splits the space of α-values into two "regimes" of low/high values.
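The compact action notation can be made concrete with a small helper. This parser is purely illustrative (the paper defines only the string format, not any code), and the return convention, a mapping node → investment as a fraction of that node's total assets W, is our assumption.

```python
def parse_action(code, risky_nodes):
    """Decode an action such as '8@05' or '0@15'.

    '<node>@<amount>' invests amount tenths of a percent of the total
    assets W in the given node; node 0 targets every risky node.
    Returns a dict mapping node -> investment as a fraction of W.
    """
    node_str, amount_str = code.split("@")
    node = int(node_str)
    fraction = int(amount_str) / 1000.0   # tenths of a percent of W
    targets = risky_nodes if node == 0 else [node]
    return {i: fraction for i in targets}
```

For example, `parse_action("8@05", [4, 8, 10])` yields `{8: 0.005}` (an investment of 50 bp of W in node 8), while `parse_action("0@15", [4, 8, 10])` spreads 1.5% W over every risky node.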
FIG. 2: The picture refers to the KK network and shows the optimal action value at time t = 0 for different actions and values of α. In the legend, 0@0 means no investment, 0@05 means investing 0.5 (i.e. 0.5% W) in all the nodes with PD > 0.009 (i.e. nodes 4, 8, 10), and similarly for the other actions. For small values of α, the best action is not to invest (0@0); as α increases, so does the convenience of investing more capital. For α = 1e−02 the best action is to invest 1.5 in nodes 4, 8 and 10 (0@15).
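For concreteness, the example network above can be assembled as a small data structure. The sketch uses the standard labelling of the Krackhardt kite (whether it coincides with the labelling of Fig. 1 is our assumption), stores the oriented, homogeneous edges as an undirected neighbour list for simplicity, and the field names are our illustrative choices.

```python
# Krackhardt kite with 1-based node labels; nodes 4, 8, 10 are "risky".
KITE_EDGES = [
    (1, 2), (1, 3), (1, 4), (1, 6), (2, 4), (2, 5), (2, 7), (3, 4),
    (3, 6), (4, 5), (4, 6), (4, 7), (5, 7), (6, 7), (6, 8), (7, 8),
    (8, 9), (9, 10),
]

def kk_network():
    """Build the KK example: W_i(0) = 100, E_i(0) = 3, and an initial PD
    of 0.01 for the risky nodes, 0.001 for all the others."""
    risky = {4, 8, 10}
    return {
        i: {
            "W": 100.0,                 # total assets W_i(0)
            "E": 3.0,                   # capital E_i(0)
            "PD": 0.01 if i in risky else 0.001,
            "neighbours": sorted({b for a, b in KITE_EDGES if a == i}
                                 | {a for a, b in KITE_EDGES if b == i}),
        }
        for i in range(1, 11)
    }
```

Node 4 sits at the centre of the kite (six neighbours) while node 10 is the peripheral tail, which is exactly the central-vs-peripheral contrast explored below.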
We have chosen this particular network to assess whether our algorithm can distinguish between central and peripheral nodes. All the nodes have W(0) = 100, E(0) = 3, µ = 0, LGD = 1 (we assume, conservatively, that all the investment would be lost in case of a default), and PD(0) = 0.001 for all but nodes 4, 8 and 10, which have PD(0) = 0.01. The edges between nodes are oriented and homogeneous, assuming the value w_ij = 1 ∀ i ≠ j. We have restricted the potential investment amounts, for each node, to be: 0, 0.5% W, 1% W, 1.5% W or 2% W. Furthermore, the government can choose to invest in a single node or in all the nodes at each time step, provided that the nodes are considered distressed. In our example, a node i is defined as risky or distressed if PD_i > 0.009. The optimal action values at t = 0 are shown in Fig. 2. For small values of α the best action is not to invest, while for α = 0.01 the best action is to invest 1.5 in all the risky nodes (0@15). It is interesting to note that action 0@2 (i.e. investing the maximum amount, 2, in all the risky nodes) is never the best choice, while 0@05 is always the worst, as it provides too little capital to each node to make it resilient. In Fig. 3(a) we can see that the optimal action values corresponding to investments in different nodes tend to converge as the time to the end of the episode decreases, because the contagion has less time to propagate and the node position becomes less and less relevant. In Fig. 3 we focus our analysis on nodes 4 and 10, comparing the case with no pre-existing investments with the case of a pre-existing investment J(0) = 0.5 at node 10, starting at t = 0 (time to end = 7).
FIG. 3: The pictures refer to the Krackhardt kite (KK) graph and show the optimal action value vs the time to the end of the episode: a) KK network with no pre-existing investments; b) KK network with a pre-existing investment of 0.5 at node 10.

B. European GSII network
We use the data from the European Banking Authority website about the Global Systemically Important Institutions, relative to the year 2014 (EBA network) [4]. The data does not contain the complete bilateral network (as this is considered business-sensitive information) but aggregates of credit exposures vs other financial institutions. For our analysis, we have used the algorithm described in our previous paper [6] (see also [27]) to reconstruct the network.

We set a common LGD for all the nodes, and the volatilities σ have been obtained by inverting (6) at time zero, assuming each σ_i remains constant during the simulation. We have restricted the potential investment amounts to be: 0, 0.5% W, 1% W, 1.5% W, 2% W, 2.5% W or 3% W. Furthermore, if the government decides to invest, it needs to provide additional capital to all the risky nodes, defined in our example as the nodes with PD above a fixed threshold, and we set J_i(0) = 0 ∀ i. We define the 'Convenience' to intervene as:

Convenience(s_t) := max_{a_t ≠ a^0_t} { Q*(s_t, a_t) } − Q*(s_t, a^0_t),   (33)

with a^0_t representing the action at time t corresponding to no investment. In Fig. 5(a) we report the Convenience vs the time (number of steps) to the end of the episode. We have found that the Convenience is positive and almost constant for large values of α, and negative and decreasing for small values of α. The optimal action values at time t = 0 are shown as a function of α in Fig. 6(a). For α > α_c, 0@05 (i.e. investing 0.5% W in the risky nodes) becomes the best action. It is also interesting to notice that the optimal action becomes "0@10" for higher values of α. In Fig. 6(b) we show the optimal action values at time t = 0 when the capital of the banks has been halved. We notice that the value α_c at which a government intervention becomes favourable is lower in this stressed case.

FIG. 4: Maximum spanning tree of the EBA graph. Each node represents a financial institution (see Table I). The graph has been reconstructed from aggregated data available at the European Banking Authority (EBA) website and it can be different from the actual network of bilateral exposures. The darker edges identify stronger exposures. The nodes with higher PD are Monte dei Paschi di Siena (MPS) and BFA.
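Given the optimal action values, the Convenience of (33) and the critical level α_c can be computed mechanically. In the sketch below, `q_values` is a hypothetical mapping from action codes to Q*(s_0, a) and `q_of_alpha` recomputes it for each α on a grid; both are stand-ins for the MDP solution of Section II, not part of the paper.

```python
def convenience(q_values, no_action="0@0"):
    """Convenience(s) = max over a != a0 of Q*(s, a), minus Q*(s, a0)."""
    best_other = max(v for a, v in q_values.items() if a != no_action)
    return best_other - q_values[no_action]

def critical_alpha(q_of_alpha, alphas, no_action="0@0"):
    """Smallest alpha on the grid where intervening beats inaction."""
    for alpha in sorted(alphas):
        if convenience(q_of_alpha(alpha), no_action) > 0:
            return alpha
    return None                          # inaction is best everywhere
```

On a grid of α values, α_c is simply the first grid point at which the Convenience turns positive; a finer grid (or bisection) refines the estimate.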
FIG. 5: (a) The Convenience, expressed in millions of EUR in the charts, and defined as the difference between the optimal action value corresponding to the best government intervention and the optimal action value associated to inaction (see (33)), is almost constant vs the time to the end of the episode when positive, and a decreasing function when negative. (b) If we stress the EBA network by halving the capital of the nodes, we obtain a chart similar to a), but the minimum value of α for which the Convenience is positive is lower.
FIG. 6: a) The chart refers to the 'EBA network' (European Global Systemically Important Institutions). The optimal action value at time t = 0 is reported as a function of α for different actions. 0@0 means no investment. As α increases, 0@0 becomes less and less convenient and, for α above the critical value α_c, the best action becomes 0@05 (i.e. investing 0.5% W in each of the risky nodes). For even higher values of α, the best action becomes 0@10 (i.e. investing 1% W in each of the risky nodes). b) The chart refers to a distressed version of the 'EBA network' where the capital of the banks has been halved. The value of α_c at which a government intervention is convenient is lower than in a).

IV. CONCLUDING REMARKS
We have shown how to cast a bank bailout decision by a government into an action in a Markov Decision Process (MDP), where the states of the MDP are defined in terms of the underlying network of financial exposures and the MDP dynamics is derived from the network dynamics. In our example, which uses the data relative to the European Global Systemically Important Institutions, we have found that government interventions do not improve the expected loss of the financial network if the loss for the taxpayer as a fraction of the bank total assets, α, satisfies α < α_c, and that α_c becomes lower as the distress of the network increases. It is evident from our analysis that the parameter α plays a central role in systemic risk modelling and, even if there are works [9][10] studying the impact for the taxpayers linked to a bank default, additional analyses need to be performed for its reliable estimation. Using a simplified Krackhardt kite network, we have found that the government becomes biased toward investing in a risky node if it had already invested in it in the past. The government therefore needs to evaluate carefully a potential investment: the rescued bank could increase its risky investments knowing that it would be bailed out in case it became distressed again, thus leading to moral hazard.

V. MATERIAL AND METHODS

A. Representative portfolio of MDP states
In Section II C 1 we have expressed the approximated value function V*(s_t, β) as a linear combination of terms Z_ik(s_t) with coefficients β_ik(t), given by (21). In order to fit these β_ik(t), we first identify a representative portfolio of MDP states that can be reached, at time t, from the initial state s_0, and for which we can calculate the Bellman value V^B using (28)–(29). Equating V*(s_t, β_t) with the corresponding V^B(s_t, β_{t+1}), for each state s_t in the portfolio, we derive a set of linear equations that we use to obtain the coefficients β_ik(t) via a ridge regression (with a 5-fold cross-validation). The states in the representative portfolio, at time t, are obtained from the initial state s_0, after changing the time to maturity from M to M − t (i.e. the states are 'moved' forward in time) and forcing a set U of nodes to default. The representative portfolio contains: (a) the state corresponding to U = ∅; plus (b) all the states corresponding to U = {i} for i ∈ I (i.e. with one additional defaulted node with respect to s_0); plus (c) a selection of states corresponding to |U| > 1, sampled with probability proportional to exp(−|U|) (a greater importance is given to states with a smaller number of additional defaults, as they are more likely to be reached in an actual simulation). In addition, we obtain elements of the representative portfolio by performing a government action on s_0 and then moving the corresponding state to time t (i.e. to time to maturity M − t). The number of states in the representative portfolio needs to be chosen taking into account the trade-off between stable results and computational resources.

B. Value function parametrisation
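The choice of the default sets U that define the representative portfolio can be sketched as follows. The function name, the cap on |U| and the number of samples are our illustrative choices; only the exp(−|U|) weighting comes from the description above.

```python
import itertools
import math
import random

def representative_default_sets(nodes, max_size=3, n_samples=20, seed=0):
    """Default sets U for the representative portfolio: the empty set,
    every singleton, and a random selection of larger sets drawn (with
    replacement) with weight proportional to exp(-|U|), so that fewer
    additional defaults are favoured."""
    rng = random.Random(seed)
    nodes = list(nodes)
    sets = [frozenset()] + [frozenset({i}) for i in nodes]
    larger = [frozenset(c)
              for k in range(2, max_size + 1)
              for c in itertools.combinations(nodes, k)]
    weights = [math.exp(-len(u)) for u in larger]
    sets += rng.choices(larger, weights=weights, k=n_samples)
    return sets
```

Each set U is then applied to the initial state (moved forward to time t) to produce one state of the portfolio, alongside the states obtained by applying a government action.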
In this section, we detail our choice of the (Z_ik(s_t)) used in our ansatz for the value function approximation V*(s_t, β) in (21), with node i ∈ I \ I_def(t) and step k ∈ {1, ..., m = M − t} until the end of the episode. We introduce the auxiliary matrix Z, with elements Z_ik(s_t; a_1, ..., a_k) representing the approximated contribution of the expected direct loss, due to the default of node i, at time t + k − 1, taking into account the government actions a_j at time t + j − 1, for all j = 1, ..., k. That is, we define

Z_ik := PD_ik L_ik, if k = 1;
Z_ik := PD_ik L_ik γ^{k−1} Π_{r=1}^{k−1} (1 − PD_ir), if k > 1.

Here, the value PD_ik is the modified probability of default and the value L_ik is the modified loss, associated to the node i at time t + k − 1, that take into account the expected cumulative impact I_ik and the potential cumulative investment J_ik from the government on node i, from time t up to time t + k − 1. Note that, for k > 1, node i can contribute to the expected loss only if it has not defaulted in the previous time steps (hence the presence of the survival probabilities (1 − PD_ir)). To be more precise, we firstly define

PD_ik := PD(W_i + J_ik − I_ik, E_i + J_ik − I_ik, µ_i, σ_i, PD^floor_i).

Then, the cumulative impact I_ik depends on the modified probability of default of all the nodes j ∈ H := I \ (I_def(t) ∪ {i}) and is defined by

I_ik := 0, if k = 1;
I_ik := Σ_{j ∈ H} PD_j1 w_ij, if k = 2;
I_ik := I_{i,k−1} + Σ_{j ∈ H} PD_{j,k−1} w_ij Π_{r=1}^{k−2} (1 − PD_jr), if k > 2.

Moreover, the cumulative government investment J_ik in node i is a function of the actions (a_1, ..., a_k) that the government can take between t and t + k − 1:

J_ik(a_1, ..., a_k) := Σ_{r=1}^{k} ΔJ^{a_r}_i(t + r − 1),

and the modified loss is

L_ik := α_i (W_i + J_ik − I_ik) + (J_i + J_ik) LGD_i.

In light of the above equations, we observe that the Z_ik = Z_ik(s_t; a_1, ..., a_k) depend on the actions (a_1, ..., a_k) via the terms J_ik(a_1, ..., a_k) involved in both PD_ik and L_ik. We now call a^0 the action corresponding to no additional government investment and define the total expected direct loss, for all i ∈ I \ I_def(t) and k ∈ {1, ..., m}, as

TL(s_t; a_1, a_2, ..., a_m) := Σ_{i,k} Z_ik(s_t; a_1, a_2, ..., a_k).

Then, the specific matrix Z = (Z_ik(s_t)) involved in our value function approximation is defined by

Z_ik(s_t) := Z_ik(s_t; ā_1, ..., ā_k),

where each ā_j is calculated sequentially for each j ∈ {1, ..., m} as follows:

ā_1 := argmin_{a_1} TL(s_t; a_1, a^0_2, ..., a^0_m)
ā_2 := argmin_{a_2} TL(s_t; ā_1, a_2, a^0_3, ..., a^0_m)
...
ā_m := argmin_{a_m} TL(s_t; ā_1, ..., ā_{m−1}, a_m).

[1] Krackhardt, D. Assessing the Political Landscape: Structure, Cognition, and Power in Organizations. Adm. Sci. Q., 342–369 (1990).
[2] Office for Budget Responsibility. Available at http://cdn.obr.uk/EFO-MaRch 2018.pdf
Sci. Rep., 5561 (2018).
[7] Altman, E., Resti, A. & Sironi, A. Default recovery rates in credit risk modelling: a review of the literature and empirical evidence. Econ. Notes.
[9] OECD Journal: Financial Market Trends (2016).
[10] Cariboni, J., Fontana, A., Langedijk, S., Maccaferri, S., Pagano, A., Giudici, M., Rancan, M. & Schich, S. Reducing and sharing the burden of bank failures. OECD Journal: Financial Market Trends (2016).
[11] Caccioli, F., Barucca, P. & Kobayashi, T. Network models of financial systemic risk: a review. J. Comput. Soc. Sc., 81–114 (2018).
[12] Gai, P. & Kapadia, S. Contagion in financial networks. Proc. Royal Soc. A.
Phys. Rep., 175–308 (2006).
[15] Cont, R. & Wagalath, L. Running for the exit: distressed selling and endogenous correlation in financial markets. Math. Finance, 718–741 (2013).
[16] Battiston, S., Puliga, M., Kaushik, R., Tasca, P. & Caldarelli, G. DebtRank: too central to fail? Financial networks, the FED and systemic risk. Sci. Rep., 541 (2012).
[17] O'Kane, D. The gaussian latent variable model, in Modelling Single-name and Multi-name Credit Derivatives.
[18] J. Finance, 449–470 (1974).
[19] Lehar, A. Measuring systemic risk: A risk management approach. J. Bank. Finance, 2577–2603 (2005).
[20] Furfine, C. Interbank exposures: Quantifying the risk of contagion. J. Money Credit Bank., 111–128 (2003).
[21] Huang, X., Zhou, H. & Zhu, H. A framework for assessing the systemic risk of major financial institutions. J. Bank. Finance, 2036–2049 (2009).
[22] Upper, C. Simulation methods to assess the danger of contagion in interbank markets. J. Financial Stab., 111–125 (2011).
[23] Upper, C. & Worms, A. Estimating bilateral exposures in the German interbank market: Is there a danger of contagion? Eur. Econ. Rev., 827–849 (2004).
[24] Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, 2018).
[25] Gordon, G. & Mitchell, T. M. Approximate solutions to Markov decision processes (1999).
[26] O'Halloran, S. & Nowaczyk, N. An Artificial Intelligence Approach to Regulating Systemic Risk. Front. Artif. Intell., 1–14 (2019).
[27] Anand, K., Craig, B. & Von Peter, G. Filling in the blanks: Network structure and interbank contagion. Quant. Finance, 625–636 (2015).
[28] Kou, G., Chao, X., Peng, Y., Alsaadi, F. & Herrera-Viedma, E. Machine Learning Methods for Systemic Risk Analysis In Financial Sectors. Technol. Econ. Dev. Econ., 1–27 (2019).
[29] Bellman, R. E. A Markovian decision process. J. Math. Mech. 6(5)