Causal Discovery for Causal Bandits utilizing Separating Sets
Arnoud A.W.M. de Kroon
Korteweg-de Vries Institute for Mathematics, University of Amsterdam, Amsterdam, The Netherlands
Danielle Belgrave
Microsoft Research, Cambridge, United Kingdom
Joris M. Mooij∗
Korteweg-de Vries Institute for Mathematics, University of Amsterdam, Amsterdam, The Netherlands
Abstract
The Causal Bandit is a variant of the classic Bandit problem in which an agent must identify the best action in a sequential decision-making process, where the reward distribution of the actions displays a non-trivial dependence structure that is governed by a causal model. All methods proposed thus far in the literature rely on exact prior knowledge of the causal model to obtain improved estimators for the reward. We formulate a new causal bandit algorithm that is the first to no longer rely on explicit prior causal knowledge and instead uses the output of causal discovery algorithms. This algorithm relies on a new estimator based on separating sets, a causal structure already known in the causal discovery literature. We show that given a separating set, this estimator is unbiased and has lower variance than the sample mean. We derive a concentration bound and construct a UCB-type algorithm based on this bound, as well as a Thompson sampling variant. We compare our algorithms with traditional bandit algorithms on simulation data. On these problems, our algorithms show a significant boost in performance.
1 Introduction

In recent years, there have been several works on the Causal Bandit problem (Lattimore et al., 2016; Sen et al., 2017; Yabe et al., 2018; Lee and Bareinboim, 2018). This is a variant of the classical multi-armed bandit problem, where an underlying structural causal model (Pearl, 2009) is assumed between observed variables.

In the bandit problem, we iteratively choose an arm from a set of arms to play, after which we observe a reward variable conditional on the chosen arm. In classical bandits, the rewards for the arms are assumed to be independent. If we assume the rewards are generated by a causal model, the rewards are no longer independent. We can use this additional structure to improve our performance.

Consider the following example. We play a video game with two buttons A and B. The game is played in rounds, and in each round we have to choose which combination of the buttons we push. Then, the game program generates a randomly chosen cute animal S which appears on the screen, for example a giraffe or a zebra, conditional on the buttons pressed. Afterwards, a random cuteness score Y is generated by the program. The distribution of this score is conditional on the animal that appeared on the screen. Our goal is then to find out which combinations of buttons to press to maximize the total cuteness score achieved over the course of this game. The generative process of this game can be represented by a causal graph, which is depicted in Figure 1.

This is an example of a causal bandit, where we choose actions and obtain rewards, but in addition to a reward variable, we also observe additional variables after choosing our action. All observed variables are generated through a causal mechanism.

∗ JMM was supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement 639466).

Preprint. Under review.
If the probability distributions over the animals that appear on the screen do not overlap between button combinations, then the reward signals of the button combinations are independent of each other and this game can be considered a classical bandit problem. However, if the distributions overlap, we can share information from the reward signal between these button combinations to better estimate the expected reward value of each combination.

As a concrete example, consider starting the game with no prior knowledge of the conditional distributions. We press button A once, observe a giraffe, and gain some cuteness points; we then press button B, again observe a giraffe, but now gain a different number of cuteness points. A traditional bandit algorithm would estimate the expected value of button A with the sample mean over the single observation obtained by pressing A. However, it is intuitively obvious that it is a better strategy to separately model the distribution of the animal that appears on the screen and the expected reward given the animal, thus estimating the value of button A from both observations, since both rewards followed the appearance of a giraffe. We will refer to this sharing of data between actions as 'information leakage'.

Recent approaches to this problem have shown greatly improved regret bounds compared to naïve approaches that treat it like a classical bandit problem, by leveraging information leakage (Lattimore et al., 2016; Sen et al., 2017; Yabe et al., 2018). However, they all rely on perfect prior knowledge of the causal structure. In this work, we formulate a causal bandit algorithm which drops the assumption of prior causal knowledge.

In the example, the screen S plays a core role. Once we know how the buttons influence what appears on the screen, we no longer need to know which buttons were pressed to estimate the expected reward: the screen separates the action from the reward.
This corresponds with the separating set concept known from the causality literature (Spirtes et al., 2000; Magliacane et al., 2018; Rojas-Carulla et al., 2018), which (assuming faithfulness) is defined as a set S that renders a target variable Y independent of a context variable I when conditioned upon: I ⊥⊥ Y | S, where the context variable encodes which interventions are performed. We formulate a Causal Bandit algorithm based on separating sets, where we separately model how actions (i.e. interventions) influence the separating set S and the expected reward given S. This will turn out to yield an unbiased estimator with improved variance compared to a naïve sample mean estimator, on the condition that S is a correct separating set.

Formulating the algorithm in terms of separating sets allows us to combine it with any causal discovery algorithm that can estimate separating sets from data, and thereby drop the assumption of prior causal knowledge. We formulate a concentration bound for our estimator, and construct an Upper Confidence Bound algorithm based on this bound. We then show greatly improved cumulative regret performance compared to classical bandit algorithms in simulation studies.
Figure 1: The causal graph for our example game. I_A and I_B are intervention variables encoding interventions on buttons A and B, the screen content is encoded by S, and Y is the reward (cuteness score). {S} is a separating set for Y and {I_A, I_B}, since {I_A, I_B} ⊥_G Y | S.

2 Preliminaries

In this section we introduce the required preliminaries regarding causality and causal bandits.

2.1 Causal modeling and graph definitions
We will very briefly introduce the elements of the theory of graphical causal modeling that are used in this work. An in-depth introduction can for example be found in Pearl (2009).

We will denote tuples of variables with a bold capital letter, e.g. X = (X_i)_{i=1}^n, and will use a lower case letter x for a value assigned to X. The domain of X is denoted by D(X). We assume that we observe variables generated through an acyclic Structural Causal Model M = ⟨V, E, F, P[E]⟩, with a tuple of endogenous variables V and a tuple of independent exogenous variables E with probability distribution P[E]. The values of V are defined by the tuple of functions F, where for each V_i ∈ V there is an f_{V_i} ∈ F such that V_i = f_{V_i}(pa(V_i), E_i). Here pa(V_i) ⊆ V \ {V_i} are the direct causes ("parents") of V_i and E_i ⊂ E is a subset of the exogenous variables. We explicitly allow for confounders (since the E_i can overlap), but exclude cycles, though it would be straightforward to include them (see e.g. Mooij et al., 2016). For simplicity, we assume all variables henceforth to be discrete.

Each SCM has an associated graph G = ⟨V, E⟩, which is acyclic if and only if the SCM is acyclic, where V is a set of nodes corresponding to the endogenous variables and E is a set of edges. If V_i directly influences V_j according to f_{V_j}, then there is a directed edge V_i → V_j ∈ E. There is a bidirected edge V_i ↔ V_j ∈ E if they share exogenous noise variables, i.e., if E_i ∩ E_j ≠ ∅. We adopt the default family relationships: pa, ch, an, and de for parents, children, ancestors and descendants respectively, where for an and de we include the variable itself.

We may now reason about performing interventions on the variables V_i. In the SCM causal modeling framework, interventions are defined by altering the functional dependencies of the SCM. For example, we may force the value of a variable to a specific value ξ.
This is called a perfect intervention, and the joint probability is then notated as P[V | do(V_i = ξ)]. One may also define other types of interventions, for example soft interventions, which alter the functional dependency f_{V_i} but may keep a functional relationship instead of just setting the variable to a value.

Here we make use of context variables (Mooij et al., 2016) to model interventions. We introduce I as the set of context variables. We will consider graphs G = (V ∪ I, E) with additional vertices I corresponding to the different interventions. If I_i ∈ I encodes an intervention on nodes T_i ⊆ V, we set I_i to ∅ if we do not perform this intervention, and to a different value ξ for each possible version of intervention I_i (for example to different perfect intervention values in the domain of T_i). Furthermore, we add an edge I_i → V_i to E for each V_i ∈ T_i. For example, we can model a perfect intervention do(V_i = ζ) by intervention variable I_i if we modify f_{V_i} to:

f*_{V_i} = ζ if I_i = ζ,  and  f*_{V_i} = f_{V_i}(pa(V_i), E_i) if I_i = ∅.

Then, if we perform some combination of interventions, this corresponds to choosing a vector of values ζ of the same size as the number of intervention variables, where some values may be ∅, resulting in P[V | do(I = ζ)]. Note that with this formalism, P[V | do(I = ζ)] = P[V | I = ζ], because the intervention variables are exogenous.

We define a path between nodes V_0 and V_n as a tuple ⟨V_0, e_1, V_1, e_2, ..., e_n, V_n⟩, with V_i ∈ V, e_i ∈ E, where each node occurs at most once and e_i is an edge with endpoints V_{i−1} and V_i. V_k is called a collider on a path if there is a subpath ⟨V_{k−1}, e_k, V_k, e_{k+1}, V_{k+1}⟩ where the edges e_k and e_{k+1} meet head to head at node V_k. Otherwise this node is called a non-collider.
The endpoints of a path are also referred to as non-colliders. Using the definitions of paths and colliders, one defines d-separation:

Definition 1 (d-separation). We say a path ⟨V_0, e_1, ..., e_n, V_n⟩ in graph G = (V, E) is blocked by C ⊆ V if:
(i) its first or last node is in C, or
(ii) it contains a collider on a node not in an(C), or
(iii) it contains a non-collider in C.
If for sets A, B ⊆ V all paths from nodes in A to nodes in B are blocked by C ⊆ V, we say that A is d-separated from B by C, and write A ⊥_G B | C.

Consider an acyclic SCM M with graph G. Let P_M be the probability distribution induced by this model. Then the Directed Global Markov Property holds for subsets A, B, C ⊆ V:

A ⊥_G B | C  ⟹  A ⊥⊥_{P_M} B | C.

These conditional independencies are the core information provided by causal reasoning that we exploit in this work. While our algorithm itself does not explicitly assume the converse (called faithfulness), this is assumed by many causal discovery algorithms; thus we henceforth assume faithfulness as well.
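As an illustration of Definition 1, d-separation in a DAG (without bidirected edges) can be decided with the classic moralization criterion: A ⊥_G B | C iff C separates A and B in the moralized graph restricted to an(A ∪ B ∪ C). A minimal sketch, with the graph encoded as a node → parent-set mapping (this encoding is our own choice for illustration, not the paper's):

```python
def ancestors(dag, nodes):
    """All ancestors of `nodes` (including the nodes themselves).
    `dag` maps each node to the set of its parents."""
    result, frontier = set(nodes), list(nodes)
    while frontier:
        v = frontier.pop()
        for p in dag.get(v, ()):
            if p not in result:
                result.add(p)
                frontier.append(p)
    return result

def d_separated(dag, a, b, c):
    """Check A ⊥_G B | C in a DAG (no bidirected edges) via moralization."""
    a, b, c = set(a), set(b), set(c)
    anc = ancestors(dag, a | b | c)
    # Moral graph: undirected parent-child edges, plus edges "marrying"
    # the parents of each node.
    adj = {v: set() for v in anc}
    for v in anc:
        parents = [p for p in dag.get(v, ()) if p in anc]
        for p in parents:
            adj[v].add(p)
            adj[p].add(v)
        for p in parents:
            for q in parents:
                if p != q:
                    adj[p].add(q)
    # A and B are d-separated by C iff every path from A to B hits C.
    seen, frontier = set(a), list(a)
    while frontier:
        v = frontier.pop()
        for w in adj[v]:
            if w in c or w in seen:
                continue
            if w in b:
                return False
            seen.add(w)
            frontier.append(w)
    return True
```

On the graph of Figure 1 this confirms that {S} d-separates {I_A, I_B} from Y, while conditioning on the collider S renders the buttons dependent.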
2.2 Causal bandits

The multi-armed bandit problem is one of the classic problems studied in the sequential decision making literature (Lai and Robbins, 1985). In this setting, an agent decides on which arm to pull and receives a reward corresponding to that arm. Classically, the rewards of the arms are considered independent, which gives rise to strategies like ε-greedy, UCB (Auer et al., 2002; Cappé et al., 2013) and Thompson Sampling (Thompson, 1933).

Lattimore, Lattimore, and Reid (2016) introduced the Causal Bandit problem as follows. Consider an agent in a sequential decision making process consisting of T trials. In each trial, the agent chooses an assignment of values ζ to intervention variables I (also referred to as choosing an arm). It then observes variables from P[V | I = ζ], according to an SCM M = ⟨V, E, F, P[E]⟩ with corresponding graph G = (V ∪ I, E). One of the endogenous variables Y ∈ V is the target variable. Thus, when choosing an arm for trial N + 1, the agent has observed data D_N = {(ζ_n, v_n)}_{n=1}^N, which are pairs of intervention node values ζ and realizations of V. In this paper, we assume all variables to be discrete and Y to be binary. Let Y_n denote the target variable observed in trial n. The goal is then to minimize the cumulative regret R = Σ_{n=1}^T [max_ζ E[Y | I = ζ] − Y_n].

As a convenience, we introduce notation to count the number of samples in our data for which a certain predicate p holds. Let N_{D_N}(p) = |{(ζ_n, v_n) ∈ D_N | (ζ_n, v_n) ⊨ p}|. For example, N_{D_N}(Y = 1, I = ζ) is the number of samples in dataset D_N for which we performed intervention ζ and observed the value 1 for the reward variable Y.

3 Related work

Two types of algorithms have been proposed to solve this problem: those relying on information leakage and those that prune the action space based on the structure of the causal graph. The initial paper by Lattimore et al.
(2016) was able to give improved bounds on simple regret for the causal bandit problem compared to traditional methods which assume independent arms. This was done by utilizing information leakage: the reward obtained under one intervention may provide information about other interventions. The authors construct an importance sampling estimator based on this principle that assumes full prior knowledge of the probability distribution of all variables besides the target variable. Using this, the authors derive an improved simple regret bound. Sen et al. (2017) focused on applying more advanced techniques from the bandit literature. For example, they analyze gap dependent bounds and apply dynamic clipping, where they divide the T trials into phases and apply a different clipping constant for each phase. These advances lead to sometimes exponentially better regret than the algorithm by Lattimore et al. (2016).

Yabe et al. (2018) extend Lattimore et al.'s work in a different direction. They consider only binary variables and perfect interventions on subsets of nodes. They use the full knowledge of the graph to estimate the probabilities P(V | pa(V), I = ζ) for each node V ∈ V. Interestingly, they only require prior knowledge of the graph and estimate all required probability distributions from data acquired from the actual bandit.

More recently, Lee and Bareinboim (2018) introduced a new method for the causal multi-armed bandit problem. They consider perfect interventions on subsets of nodes of the causal graph. Because they only consider perfect interventions, it is sometimes impossible for some interventions to perform better than other interventions more downstream, and thus they may be pruned.

One thing that all existing approaches have in common is that they assume the causal relationships to be known beforehand, an assumption that is often not met in practice.

4 Separating sets lead to improved estimators
In this section, we generalize the intuition we had about the example game in the introduction to an estimator based on a separating set with favorable properties compared to direct sample mean estimation. We then derive a concentration bound for this estimator.
The core strategy we have seen in Causal Bandits in previous work is to exploit very specific knowledge about the causal structure in order to construct estimators that share information between arms. In order to make Causal Bandits suitable for causal discovery, we introduce a novel information sharing estimator that relies on less specific knowledge about the causal graph, exploiting information leakage to share data between interventions.

Recall our initial example. The core realization we made is that we may separately estimate the relationship between the combination of buttons pressed and the screen, and between the screen and the score. More generally, we say that a set of variables S is a separating set for intervention variables I and target variable Y if I ⊥_G Y | S. By the Markov property and faithfulness, this is equivalent to the conditional independence I ⊥⊥_{P_M} Y | S.

If S is a separating set, we have for all possible interventions do(I = ζ) the following identity by the law of total expectation, where the second equality uses the independence:

E[Y | I = ζ] = E[ E[Y | S, I = ζ] | I = ζ ] = E[ E[Y | S] | I = ζ ].

We introduce separate estimators µ̂(s) for E[Y | S = s] and p̂(s | ζ) for P[S = s | I = ζ]. Inspired by the above identity, we then propose the following information sharing estimator for E[Y | I = ζ]:

µ̂_IS(ζ | D_N; S) := Σ_{s ∈ D(S)} µ̂(s | D_N) p̂(s | ζ, D_N).   (1)

Since S is discrete and Y is binary, the sample percentages and mean are the obvious candidates to estimate these quantities. Thus we define:

p̂(s | ζ, D_N) := N_{D_N}(S = s, I = ζ) / N_{D_N}(I = ζ),   (2)

µ̂(s | D_N) := N_{D_N}(Y = 1, S = s) / N_{D_N}(S = s).   (3)

To further understand the proposed estimator, we estimate its bias and variance. In the appendix we show that the following theorem holds:

Theorem 3.1.
If we calculate µ̂_IS(ζ | D_N; S) from dataset D_N as defined above, I ⊥⊥_{P_M} Y | S, and there is at least one sample from each possible intervention, then the information sharing estimator (1) is unbiased and there exists a constant α*(ζ, D_N) ∈ [0, 1] such that its variance conditional on the number of samples from each intervention is given by:

V[µ̂_IS(ζ | D_N; S)] = (1 / N_{D_N}(I = ζ)) ( V_{s∼P[S=s|I=ζ]}[ E[Y | S = s] ]   (4)
    + (1 − α*(ζ, D_N)) E_{s∼P[S=s|I=ζ]}[ E[Y | S = s](1 − E[Y | S = s]) ] ).

Proof.
See appendix.

It is easy to see that if α*(ζ, D_N) = 0 (which for example happens if no data with I ≠ ζ is available), then V[µ̂_IS(ζ | D_N; S)] = E[Y | I = ζ](1 − E[Y | I = ζ]) / N_{D_N}(I = ζ), which is the variance of the naïve sample mean calculated only from data where I = ζ. Thus the information sharing estimator always performs at least as well as the sample mean. We can therefore see the variance of the information sharing estimator as consisting of two terms: the first term, which can only be reduced by data where I = ζ, and the second term, which can also be reduced by data where I ≠ ζ. (Here, we abuse notation by omitting explicit conditioning on {N_{D_N}(I = ζ)}.) Indeed, the first term equals the variance of our estimator if the µ̂(s | D_N) were perfect estimates of the expectations E[Y | S = s]. The second term can be reduced by adding data where I ≠ ζ, depending on the overlap of the distributions over S between the interventions. In the appendix we provide lower bounds on α*(ζ, D_N) under different assumptions to better understand when this estimator behaves well; e.g., we show that if we condition on N_{D_N}(S = s) ≥ c · N_{D_N}(S = s, I = ζ) for all s and some positive c, then α*(ζ, D_N) is bounded from below by a constant that grows with c.

The true mean µ(ζ) = E[Y | I = ζ] is a function of the parameters µ(s) = E[Y | S = s] and p(s | ζ) = P[S = s | I = ζ] through the relation µ(ζ) = Σ_{s ∈ D(S)} p(s | ζ) µ(s). We derive a concentration bound by constraining p(s | ζ) and µ(s) individually with high probability, using Hoeffding's bound and the bound on multinomial variables from Weissman et al. (2003). We may then use a union bound on these individual events to obtain a simultaneous multidimensional region Θ of high probability for all parameters. We can then solve the maximization problem

P[ µ(ζ) ≤ max_{(µ*(s), p*(s|ζ)) ∈ Θ} Σ_{s ∈ D(S)} p*(s | ζ) µ*(s) ] ≥ P[ (µ(s), p(s | ζ)) ∈ Θ ]

to obtain a concentration bound.
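A minimal sketch of the information sharing estimator (1) with the plug-in estimates (2) and (3), computed from raw counts. The record format, a list of (ζ, s, y) triples, is our own illustration:

```python
def info_sharing_estimate(data, zeta):
    """Information sharing estimate of E[Y | I = zeta], eq. (1).

    `data` is a list of (zeta_n, s_n, y_n) triples, where s_n is a hashable
    value of the separating-set variables S and y_n is 0 or 1."""
    n_zeta = sum(1 for z, _, _ in data if z == zeta)
    if n_zeta == 0:
        raise ValueError("need at least one sample from this intervention")
    estimate = 0.0
    for s0 in {s for _, s, _ in data}:
        # p_hat(s | zeta): frequency of S = s among samples of this arm, eq. (2)
        p_hat = sum(1 for z, s, _ in data if z == zeta and s == s0) / n_zeta
        if p_hat == 0.0:
            continue
        # mu_hat(s): mean of Y given S = s, pooled over ALL arms, eq. (3)
        pooled = [y for _, s, y in data if s == s0]
        estimate += p_hat * sum(pooled) / len(pooled)
    return estimate
```

On the two observations of the introductory example (one press of each button, both followed by a giraffe, with rewards 0 and 1 as stand-in values), the sample mean for button A is 0, whereas the information sharing estimate pools both rewards and returns 0.5.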
For δ > 0, let us define ucb(µ̂(s | D_N)) = µ̂(s | D_N) + √( log(2|D(S)|/δ) / (2 N_{D_N}(S = s)) ). Moreover, let ∆p̂(ζ) = √( |S| log(4/δ) / (2 N_{D_N}(I = ζ)) ). Then the following theorem holds:

Theorem 3.2.
If we calculate µ̂_IS(ζ | D_N; S) as in (1) from dataset D_N and I ⊥_G Y | S, then:

P[ E[Y | I = ζ] ≥ Σ_{s ∈ D(S)} p̂(s | ζ, D_N) ucb(µ̂(s | D_N))   (5)
    + ∆p̂(ζ) ( max_{s ∈ D(S)} ucb(µ̂(s | D_N)) − min_{s′ ∈ D(S)} ucb(µ̂(s′ | D_N)) ) ] ≤ δ.

Proof.
See appendix.
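A sketch of the resulting index computation: the separating-set upper confidence bound of Theorem 3.2 (reading |S| in ∆p̂ as the size of the domain D(S)), alongside a standard Hoeffding-style UCB, keeping whichever valid upper bound is tightest. The (ζ, s, y) data format is our own illustration:

```python
import math

def ss_ucb_index(data, zeta, delta):
    """Separating-set UCB index from Theorem 3.2."""
    support = sorted({s for _, s, _ in data})
    n_zeta = sum(1 for z, _, _ in data if z == zeta)
    ucbs = {}
    for s0 in support:
        pooled = [y for _, s, y in data if s == s0]
        mu_hat = sum(pooled) / len(pooled)
        # ucb(mu_hat(s)) = mu_hat(s) + sqrt(log(2|D(S)|/delta) / (2 N(S=s)))
        ucbs[s0] = mu_hat + math.sqrt(
            math.log(2 * len(support) / delta) / (2 * len(pooled)))
    # Delta_p_hat(zeta) = sqrt(|D(S)| log(4/delta) / (2 N(I=zeta)))
    delta_p = math.sqrt(len(support) * math.log(4 / delta) / (2 * n_zeta))
    index = sum(
        (sum(1 for z, s, _ in data if z == zeta and s == s0) / n_zeta) * ucbs[s0]
        for s0 in support)
    return index + delta_p * (max(ucbs.values()) - min(ucbs.values()))

def standard_ucb_index(data, zeta, delta):
    """Standard Hoeffding UCB on the per-arm sample mean."""
    rewards = [y for z, _, y in data if z == zeta]
    return sum(rewards) / len(rewards) + math.sqrt(
        math.log(2 / delta) / (2 * len(rewards)))

def index(data, zeta, delta):
    """Keep the tightest of the two valid upper confidence bounds."""
    return min(ss_ucb_index(data, zeta, delta),
               standard_ucb_index(data, zeta, delta))
```

The bandit then plays the arm with the highest index, exactly as in standard UCB; only the per-arm bound changes.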
With a concentration bound in hand, we may now define our Separating Set Causal Bandit UCB algorithm. Note that while the bound of equation (21) is often tighter than the standard bound used in UCB for Bernoulli variables, this is not always the case. Therefore, our algorithm chooses the tightest bound available to it, and falls back to the sample mean µ̂_SM(ζ | D_N) if the bound of equation (21) is not tighter than the standard UCB bound. To reduce computational cost, we re-run the causal discovery algorithm only each time the number of iterations has increased by a fixed amount. In each iteration, for each possible intervention, we calculate the standard UCB, and for each separating set we calculate the UCB based on the bound of Theorem 3.2. We then choose as index the UCB which is closest to its corresponding estimate (i.e. the one with the smallest width), and we pick the action with the highest index. The full description of the algorithm is in the appendix. Since our UCB algorithm uses the same confidence level as normal UCB and its bound is at least as tight, we can show the following theorem:

Theorem 4.1.
If we run the separating set causal bandit UCB algorithm as defined in the appendix on dataset D_N and I ⊥_G Y | S, it has the same cumulative regret upper bound as normal UCB.

Proof. See appendix.

In practice, as we will see in section 5, this algorithm may perform much better than normal UCB. We also test a Thompson sampling variant of the Separating Set Causal Bandit algorithm. Here, instead of using an index based on an upper confidence bound, given a separating set, we model the parameters P[S = s | I = ζ] using a Dirichlet prior and the parameters P[Y = 1 | S = s] using a Beta prior. We can then apply Thompson sampling by sampling the parameters from their posterior distributions and calculating the resulting expected value. Given a sample (ζ, s, Y), we can update each of the posteriors separately and naturally. A full specification of this variant is in the appendix.

We may combine our novel algorithms with any causal discovery algorithm which outputs separating sets. The methods we used in our experiments are described in the following paragraphs.

Since we have full interventional data, we can directly test for all sets S whether they have the separating set property, i.e., whether I ⊥⊥_{P_M} Y | S. Our baseline causal discovery method is to directly test for separating sets from data in this way. Here we make use of the G-test for conditional independence of discrete variables (Neapolitan, 2004) with a p-value threshold α proportional to 1/√N.

A state-of-the-art causal discovery algorithm for small numbers of variables is ASD-JCI123kt (Mooij et al., 2016). It is a particular implementation of the Joint Causal Inference framework (Mooij et al., 2016), which pools data over contexts. This allows it to simultaneously handle data from different sources, e.g. different interventional distributions.
ASD-JCI123kt is a hybrid causal discovery algorithm that scores how well each hypothetical causal graph matches the (strengths of the) observed dependences in the pooled data, giving more weight to stronger dependences. As an independence test, we again make use of the G-test for conditional independence of discrete variables with the same p-value threshold α. Contrary to the direct testing baseline, ASD-JCI123kt combines all conditional independence test results in order to score the underlying causal graph(s). Since the algorithm makes use of an Answer Set Program (ASP) building on work by Hyttinen et al. (2014), it is straightforward to query the ASP optimizer for separating sets (e.g., how much evidence there is that variable V_i is independent of V_j given S), by applying the feature scoring approach proposed by Magliacane et al. (2016). We accept a set S as a valid separating set if for all I ∈ I, the confidence score for the independence I ⊥⊥_{P_M} Y | S output by ASD-JCI123kt is positive.

5 Experiments

We now proceed to simulate several Causal Bandit problems. For each experiment, initially all algorithms uniformly pick arms to ensure that all statistical tests are well behaved. Both causal discovery algorithms have hyperparameters in the form of a test threshold α, and ASD-JCI123kt furthermore has a score threshold parameter t. We only test one set of these parameters, where we set α proportional to 1/√N (which is somewhat reasonable from experience) and t = 0, due to the high cost of hyperparameter tuning in this setting. We leave hyperparameter tuning as a further optimization challenge for the future. We compare to UCB and Thompson Sampling baselines, as well as versions of our algorithm with knowledge of a separating set, namely the parents of the target variable.

First, we simulate data inspired by our running example game. We generate data as follows. We consider two buttons A and B, with corresponding intervention nodes I_A and I_B.
If I_A = ∅ or I_B = ∅, we let our younger brother decide whether to press the corresponding button, which he does independently with a fixed probability. If we set I_A to 0, we intervene such that button A is not pressed (i.e., do(A = 0)); if we set I_A to 1, we press the button (i.e., do(A = 1)). Similarly for I_B. Thus there are 3 × 3 = 9 possible actions in this bandit. The screen is a binary variable, generated with a probability P[S = 1 | A = a, B = b] depending on the buttons pressed. Finally, we generate Y according to P[Y = 1 | S = s].

Furthermore, we generated all acyclic causal graphs G = (V, E) over binary variables with no confounders and compare the cumulative regret, with a similar sampling strategy. We allow perfect interventions on all subsets of variables excluding the target variable, resulting in 27 possible actions. We only generate graphs where Y has at least one parent (otherwise the regret is always 0). Permutations of the variables excluding Y are disregarded. Full simulation details are provided in the appendix, including a final simulation study on larger graphs.

Figure 2: (a) Simulation results on the game example causal bandit over 150 runs. Shaded areas are estimated standard errors. (b) Sensitivity and false positive rate for our causal discovery methods.

Figure 3: (a) Simulation results on all DAGs. We generated a parameter sample for each of the graphs. Shaded areas are estimated standard errors. (b) Sensitivity and false positive rate for our causal discovery methods.

Results for the experiments are shown in figures 2 and 4. As can be seen, traditional UCB is outclassed by all our information sharing (IS) based algorithms that rely on causal discovery. In regimes where the causal discovery methods perform well, Thompson sampling (TS) is also clearly beaten by our IS TS variants, and sometimes beaten by our IS UCB variants.
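As a concrete illustration of the Thompson sampling variant used in these experiments, which models P[S = s | I = ζ] with a Dirichlet posterior and P[Y = 1 | S = s] with a Beta posterior: one posterior draw of an arm's expected reward might look as follows. This is a sketch assuming uniform Dirichlet(1) and Beta(1, 1) priors (a choice of ours, not stated in the text):

```python
import random

def thompson_draw(counts_s, counts_y, support, rng=random):
    """One Thompson draw of E[Y | I = zeta] for a given arm.

    counts_s: {s: number of times S = s was observed under this arm}
    counts_y: {s: (ones, zeros)} pooled counts of Y given S = s
    Samples p(s | zeta) from Dirichlet(1 + counts) via normalized Gamma
    draws, and E[Y | S = s] from Beta(1 + ones, 1 + zeros)."""
    gammas = {s: rng.gammavariate(1 + counts_s.get(s, 0), 1.0) for s in support}
    total = sum(gammas.values())
    value = 0.0
    for s in support:
        ones, zeros = counts_y.get(s, (0, 0))
        value += (gammas[s] / total) * rng.betavariate(1 + ones, 1 + zeros)
    return value
```

The arm with the highest draw is played; on observing a sample (ζ, s, y), the relevant counts each increase by one, which is the separate, natural conjugate update mentioned in the text.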
The video game example is a structure which seems particularly easy to identify, and therefore the performance of all our IS algorithms is superior on this problem once our causal discovery methods converge.

Unfortunately, in the experiment with all DAGs, the sensitivity of ASD-JCI123kt converges poorly, likely due to suboptimal hyperparameter settings. Compared to traditional UCB, our methods show increased performance even with somewhat unreliable causal knowledge. However, TS seems to converge quickly for our parameterization strategy, after which the mistakes made by ASD-JCI123kt are comparatively too costly. Direct testing does converge, and therefore our methods using direct testing perform very well, with the TS variant converging almost immediately. Surprisingly, we see that even somewhat unreliable causal knowledge may lead to great performance gains, and that this is clearly a very fruitful direction to pursue. However, when the causal discovery methods do not converge properly, our estimates are no longer unbiased, and in that case there are no convergence guarantees.

6 Discussion

We have shown that exploiting separating sets in causal bandit problems may yield significantly improved performance compared to traditional UCB and Thompson Sampling. We proved that given correct separating sets, our algorithm has the same regret bound as UCB. In case the causal graph (and hence the correct separating sets) is not known, we employed causal discovery algorithms to estimate separating sets from the data in an online fashion. In our simulation experiments, we found that when the causal discovery methods perform reasonably well, our algorithms that rely on them clearly outperform their baseline bandit counterparts.

Our estimator and algorithm may be applied whenever we know of a separating set. This furthermore makes it applicable with pre-existing knowledge less specific than a full causal graph.
Furthermore, there is potential in extending this work to contextual bandits and more general reinforcement learning, if we formulate an equivalent definition of separating set in these settings. Our approach also turns the Causal Bandit into an interesting task in which to utilize and compare different causal discovery algorithms.
References
Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz, et al. Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.

Antti Hyttinen, Frederick Eberhardt, and Matti Järvisalo. Constraint-based causal discovery: Conflict resolution with answer set programming. In UAI, pages 340–349, 2014.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Finnian Lattimore, Tor Lattimore, and Mark D. Reid. Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems, pages 1181–1189, 2016.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

Sanghack Lee and Elias Bareinboim. Structural causal bandits: Where to intervene? In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2573–2583. Curran Associates, Inc., 2018.

Sara Magliacane, Tom Claassen, and Joris M. Mooij. Ancestral causal inference. In Proceedings of Advances in Neural Information Processing Systems (NIPS 2016), pages 4466–4474, Barcelona, Spain, 2016.

Sara Magliacane, Thijs van Ommen, Tom Claassen, Stephan Bongers, Philip Versteeg, and Joris M. Mooij. Domain adaptation by using causal inference to predict invariant conditional distributions. In Advances in Neural Information Processing Systems, pages 10846–10856, 2018.

Joris M. Mooij, Sara Magliacane, and Tom Claassen. Joint causal inference from multiple contexts. arXiv preprint arXiv:1611.10351, 2016.

R.E. Neapolitan. Learning Bayesian Networks. Pearson, 2004.

Judea Pearl. Causality. Cambridge University Press, 2009.

Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer learning. Journal of Machine Learning Research, 19(36):1–34, 2018.

Rajat Sen, Karthikeyan Shanmugam, Alexandros G. Dimakis, and Sanjay Shakkottai. Identifying best interventions through online importance sampling. arXiv preprint arXiv:1701.02789, 2017.

Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, 2nd edition, 2000.

William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J. Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep., 2003.

Akihiro Yabe, Daisuke Hatano, Hanna Sumita, Shinji Ito, Naonori Kakimura, Takuro Fukunaga, and Ken-Ichi Kawarabayashi. Causal bandits with propagating inference. arXiv preprint arXiv:1806.02252, 2018.

Appendix

A Proof of theorems
In this appendix we prove the theorems stated in the main paper. It is convenient to introduce vectorized notation for the relevant quantities. Consider our estimator given a particular separating set $S$ with domain $D(S)$. We define the following vectors indexed by $D(S)$, such that the value at index $s \in D(S)$ is given by:
\begin{align}
\left(\hat{p}_S(\zeta \mid \mathcal{D}_N)\right)_s &= \hat{p}(s \mid \zeta, \mathcal{D}_N), \tag{6} \\
\left(p_S(\zeta)\right)_s &= P[S = s \mid I = \zeta], \tag{7} \\
\left(\hat{\mu}_S(\mathcal{D}_N)\right)_s &= \hat{\mu}(s \mid \mathcal{D}_N), \tag{8} \\
\left(\mu_S\right)_s &= \mathbb{E}[Y \mid S = s], \tag{9} \\
\left(N_{S,\mathcal{D}_N}(p)\right)_s &= N_{\mathcal{D}_N}(S = s \wedge p). \tag{10}
\end{align}
With this in hand, we can write the definition of our information sharing estimator (1) as an inner product:
\begin{equation}
\hat{\mu}_{IS}(\zeta \mid \mathcal{D}_N; S) = \hat{p}_S^{\top}(\zeta \mid \mathcal{D}_N)\, \hat{\mu}_S(\mathcal{D}_N). \tag{11}
\end{equation}

A.1 Proof of Theorem 3.1
We set out to prove the theorem:
Theorem 3.1.
If we calculate $\hat{\mu}_{IS}(\zeta \mid \mathcal{D}_N; S)$ from dataset $\mathcal{D}_N$ as defined in (1), $I \perp\!\!\!\perp_{P^M} Y \mid S$, and there is at least one sample from each possible intervention, then the information sharing estimator (1) is unbiased and there exists a constant $\alpha^* \in [0, 1]$ such that its variance conditional on the number of samples from each intervention is given by:
\begin{align}
\mathbb{V}[\hat{\mu}_{IS}(\zeta \mid \mathcal{D}_N; S)] = \frac{1}{N_{\mathcal{D}_N}(I = \zeta)} \Big( &\mathbb{V}_{s \sim P[S = s \mid I = \zeta]}\big[\mathbb{E}[Y \mid S = s]\big] \tag{12} \\
&+ (1 - \alpha^*(\zeta, \mathcal{D}_N))\, \mathbb{E}_{s \sim P[S = s \mid I = \zeta]}\big[\mathbb{E}[Y \mid S = s](1 - \mathbb{E}[Y \mid S = s])\big] \Big). \nonumber
\end{align}

We consider the information sharing estimator (11) calculated from data generated by the random process of the interaction of the learner's policy with a bandit environment, which we denote $P_\nu$, where we assume we have at least one sample from each possible intervention $\zeta \in D(I)$. We first show that the two vectors in (11) are uncorrelated, which has as an immediate corollary that the information sharing estimator is unbiased. This follows from the law of total expectation:
\begin{align*}
\mathbb{E}_{P_\nu}\big[\hat{p}_S^{\top}(\zeta \mid \mathcal{D}_N)\, \hat{\mu}_S(\mathcal{D}_N)\big] &= \mathbb{E}_{P_\nu}\Big[\mathbb{E}_{P_\nu}\big[\hat{p}_S^{\top}(\zeta \mid \mathcal{D}_N)\, \hat{\mu}_S(\mathcal{D}_N) \mid N_{S,\mathcal{D}_N}(I = \zeta), N_{S,\mathcal{D}_N}(I \neq \zeta)\big]\Big] \\
&= \mathbb{E}_{P_\nu}\Big[\hat{p}_S^{\top}(\zeta \mid \mathcal{D}_N)\, \mathbb{E}_{P_\nu}\big[\hat{\mu}_S(\mathcal{D}_N) \mid N_{S,\mathcal{D}_N}(I = \zeta), N_{S,\mathcal{D}_N}(I \neq \zeta)\big]\Big] \\
&= \mathbb{E}_{P_\nu}\big[\hat{p}_S^{\top}(\zeta \mid \mathcal{D}_N)\big]\, \mu_S = p_S^{\top}(\zeta)\, \mu_S = \mathbb{E}[Y \mid I = \zeta],
\end{align*}
where in the second line we use that $\hat{p}_S^{\top}(\zeta \mid \mathcal{D}_N)$ is deterministic conditional on $N_{S,\mathcal{D}_N}(I = \zeta)$ and thus factors out of the inner expectation. In the third line, we use that conditional on the counts $N_{\mathcal{D}_N}(S = s)$ (which are a deterministic function of the vectors we condition on), $\hat{\mu}(s \mid \mathcal{D}_N)$ is just the mean of $N_{\mathcal{D}_N}(S = s)$ Bernoulli variables and thus unbiased, so the inner expectation evaluates to the vector of true means $\mu_S$ and factors out.
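As a concrete aside, the inner-product estimator (11) analyzed here is straightforward to compute from raw samples. The following is a minimal Python sketch (the flat data layout and the function name are our own illustrative choices, not from the paper); it pools rewards across all interventions when estimating $\mu_S$, which is exactly where information is shared:

```python
import numpy as np

def info_sharing_estimate(interventions, s_values, rewards, zeta):
    """Information sharing estimator hat_mu_IS(zeta | D_N; S), cf. eq. (11).

    hat_p(s | zeta) is estimated only from rounds where I == zeta, while
    hat_mu(s) pools rewards over *all* rounds with S == s, including rounds
    played under other interventions (the "information leakage").
    """
    interventions = np.asarray(interventions)
    s_values = np.asarray(s_values)
    rewards = np.asarray(rewards, dtype=float)

    on_zeta = interventions == zeta
    n_zeta = on_zeta.sum()
    estimate = 0.0
    # Summing over the values of S observed under zeta matches the support
    # of hat_p_S(zeta | D_N); unobserved values receive weight zero anyway.
    for s in np.unique(s_values[on_zeta]):
        p_hat = np.sum(on_zeta & (s_values == s)) / n_zeta  # hat_p(s | zeta, D_N)
        mu_hat = rewards[s_values == s].mean()              # hat_mu(s | D_N), pooled
        estimate += p_hat * mu_hat
    return estimate
```

Note that when all samples come from $\zeta$ itself, the estimator reduces to the ordinary sample mean of the rewards observed under $\zeta$, consistent with the unbiasedness argument above.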
The same conditioning argument via the law of total expectation also shows that $\mathbb{E}_{P_\nu}[\hat{\mu}_S(\mathcal{D}_N)] = \mu_S$, from which it follows that the expectation factors and thus the vectors are uncorrelated. Finally, in the fourth line, the elements of $\hat{p}_S^{\top}(\zeta \mid \mathcal{D}_N)$ can be seen as the mean of at least one Bernoulli variable (by assumption) and are thus unbiased, from which the unbiasedness of the information sharing estimator follows.

We analyze the variance using a similar strategy, where we add conditioning through the law of total variance. Consider conditioning on the number of samples for each intervention, i.e., on the set $\{N(I = \zeta)\}_{\zeta \in D(I)}$:
\begin{align*}
\mathbb{V}_{P_\nu}[\hat{\mu}_{IS}(\zeta \mid \mathcal{D}_N; S)] &= \mathbb{E}_{P_\nu}\big[\mathbb{V}_{P_\nu}\big[\hat{\mu}_{IS}(\zeta \mid \mathcal{D}_N; S) \mid \{N(I = \zeta)\}_{\zeta \in D(I)}\big]\big] + \mathbb{V}_{P_\nu}\big[\mathbb{E}_{P_\nu}\big[\hat{\mu}_{IS}(\zeta \mid \mathcal{D}_N; S) \mid \{N(I = \zeta)\}_{\zeta \in D(I)}\big]\big] \\
&= \mathbb{E}_{P_\nu}\big[\mathbb{V}_{P_\nu}\big[\hat{\mu}_{IS}(\zeta \mid \mathcal{D}_N; S) \mid \{N(I = \zeta)\}_{\zeta \in D(I)}\big]\big],
\end{align*}
where on the first line the second term is zero, since the estimator is unbiased whenever we have at least one sample for each possible intervention, which holds by assumption. Thus it suffices to analyze the estimator conditioned on $\{N(I = \zeta)\}_{\zeta \in D(I)}$, and then take the expectation of the resulting expression w.r.t. $P_\nu$ and the random variables $\{N(I = \zeta)\}_{\zeta \in D(I)}$ for a given policy and bandit. We omit explicit conditioning on $\{N(I = \zeta)\}_{\zeta \in D(I)}$ to reduce clutter, and analyze the inner variance:
\[
\mathbb{V}_{P_\nu}\big[\hat{\mu}_{IS}(\zeta \mid \mathcal{D}_N; S) \mid \{N(I = \zeta)\}_{\zeta \in D(I)}\big] = \mathbb{V}_I\big[\hat{\mu}_{IS}(\zeta \mid \mathcal{D}_N; S)\big].
\]
Again, we use the law of total variance, adding the same conditioning we used to show unbiasedness:
\begin{align*}
\mathbb{V}_I\big[\hat{\mu}_{IS}(\zeta \mid \mathcal{D}_N; S)\big] &= \mathbb{E}_I\big[\mathbb{V}_I\big[\hat{p}_S^{\top}(\zeta \mid \mathcal{D}_N)\, \hat{\mu}_S(\mathcal{D}_N) \mid N_{S,\mathcal{D}_N}(I = \zeta), N_{S,\mathcal{D}_N}(I \neq \zeta)\big]\big] \\
&\quad + \mathbb{V}_I\big[\mathbb{E}_I\big[\hat{p}_S^{\top}(\zeta \mid \mathcal{D}_N)\, \hat{\mu}_S(\mathcal{D}_N) \mid N_{S,\mathcal{D}_N}(I = \zeta), N_{S,\mathcal{D}_N}(I \neq \zeta)\big]\big].
\end{align*}
Now note that in both terms, $\hat{p}_S^{\top}(\zeta \mid \mathcal{D}_N)$ is non-random given the conditioning. In the second term this vector factors out by linearity of expectation. In the first term, the individual elements of $\hat{\mu}_S$ are uncorrelated with each other since they are calculated from disjoint subsets of the data, so this vector factors out of the variance elementwise squared. This yields:
\begin{align}
\mathbb{V}_I\big[\hat{\mu}_{IS}(\zeta \mid \mathcal{D}_N; S)\big] &= \mathbb{E}_I\Big[\big(\hat{p}_S^{\top}(\zeta \mid \mathcal{D}_N)\big)^2\, \mathbb{V}_I\big[\hat{\mu}_S(\mathcal{D}_N) \mid N_{S,\mathcal{D}_N}(I = \zeta), N_{S,\mathcal{D}_N}(I \neq \zeta)\big]\Big] \tag{13} \\
&\quad + \mathbb{V}_I\big[\hat{p}_S^{\top}(\zeta \mid \mathcal{D}_N)\, \mathbb{E}_I\big[\hat{\mu}_S(\mathcal{D}_N) \mid N_{S,\mathcal{D}_N}(I = \zeta), N_{S,\mathcal{D}_N}(I \neq \zeta)\big]\big], \nonumber
\end{align}
where the square of the vector in the first term is elementwise, and the variance of a vector in the first term is just the vector of diagonal elements of the covariance matrix, i.e., there are no covariance terms. In the second term, we may again use that the inner expectation is unbiased, by the same argument as before. For the first term, the variance vector $\mathbb{V}_I\big[\hat{\mu}_S(\mathcal{D}_N) \mid N_{S,\mathcal{D}_N}(I = \zeta), N_{S,\mathcal{D}_N}(I \neq \zeta)\big] = \mu_S \otimes (1 - \mu_S) \oslash N_{S,\mathcal{D}_N}(\top)$ is also well defined as the variance of a sample mean of a set of Bernoulli random variables, where $\otimes$ denotes the elementwise product, $\oslash$ elementwise division, and $N_{S,\mathcal{D}_N}(\top) = N_{S,\mathcal{D}_N}(I = \zeta) + N_{S,\mathcal{D}_N}(I \neq \zeta)$.
Substituting this into (13) yields:
\begin{align}
\mathbb{V}_I\big[\hat{\mu}_{IS}(\zeta \mid \mathcal{D}_N; S)\big] &= \mathbb{E}_I\Big[\big(\hat{p}_S(\zeta \mid \mathcal{D}_N)\big)^2 \oslash N_{S,\mathcal{D}_N}(\top)\Big]^{\top} \mu_S \otimes (1 - \mu_S) \tag{14} \\
&\quad + \mathbb{V}_I\big[\hat{p}_S^{\top}(\zeta \mid \mathcal{D}_N)\, \mu_S\big]. \nonumber
\end{align}
Interestingly, the second term corresponds to the variance of our information sharing estimator if we were given perfect oracle estimates of $\mu_S$. Since the advantage gained by the information sharing estimator comes from better estimation of $\mu_S$, this term can be seen as a base error that cannot be reduced through information leakage.

We evaluate the variance of the second term. Let $s_1, \ldots, s_{N(I = \zeta)}$ be the one-hot encoded values of $S$ observed in the subset of our data where $I = \zeta$. These are i.i.d. categorical variables, and since $\hat{p}_S^{\top}(\zeta \mid \mathcal{D}_N) = \frac{1}{N_{\mathcal{D}_N}(I = \zeta)} \sum_{k=1}^{N_{\mathcal{D}_N}(I = \zeta)} s_k^{\top}$, the second term becomes, by independence:
\begin{align}
\mathbb{V}_I\big[\hat{p}_S^{\top}(\zeta \mid \mathcal{D}_N)\, \mu_S\big] = \mathbb{V}_I\Bigg[\frac{1}{N_{\mathcal{D}_N}(I = \zeta)} \sum_{k=1}^{N_{\mathcal{D}_N}(I = \zeta)} s_k^{\top} \mu_S\Bigg] = \frac{1}{N_{\mathcal{D}_N}(I = \zeta)} \mathbb{V}_I[s^{\top} \mu_S] = \frac{1}{N_{\mathcal{D}_N}(I = \zeta)} \mathbb{V}_{s \sim P[S = s \mid I = \zeta]}\big[\mathbb{E}[Y \mid S = s]\big]. \tag{15}
\end{align}
Let us now turn our attention to the first term of equation (14). This is an inner product between vectors, where the left factor is an expectation of a vector. Consider an element of this expectation vector at index $s \in D(S)$:
\begin{align}
\Big(\mathbb{E}_I\Big[\big(\hat{p}_S(\zeta \mid \mathcal{D}_N)\big)^2 \oslash N_{S,\mathcal{D}_N}(\top)\Big]\Big)_s &= \mathbb{E}_I\bigg[\frac{\hat{p}(s \mid \zeta, \mathcal{D}_N)^2}{N_{\mathcal{D}_N}(S = s)}\bigg] = \mathbb{E}_I\bigg[\frac{\hat{p}(s \mid \zeta, \mathcal{D}_N)}{N_{\mathcal{D}_N}(I = \zeta)} \cdot \frac{N_{\mathcal{D}_N}(S = s, I = \zeta)}{N_{\mathcal{D}_N}(S = s)}\bigg] \nonumber \\
&= \mathbb{E}_I\bigg[\frac{\hat{p}(s \mid \zeta, \mathcal{D}_N)}{N_{\mathcal{D}_N}(I = \zeta)} \bigg(1 - \frac{N_{\mathcal{D}_N}(S = s, I \neq \zeta)}{N_{\mathcal{D}_N}(S = s)}\bigg)\bigg] \nonumber \\
&= \frac{1}{N_{\mathcal{D}_N}(I = \zeta)} \mathbb{E}_I\big[\hat{p}(s \mid \zeta, \mathcal{D}_N)(1 - \alpha(s, \zeta, \mathcal{D}_N))\big], \tag{16}
\end{align}
where:
\begin{equation}
\alpha(s, \zeta, \mathcal{D}_N) := \frac{N_{\mathcal{D}_N}(S = s, I \neq \zeta)}{N_{\mathcal{D}_N}(S = s)}. \tag{17}
\end{equation}
Note that $\alpha(s, \zeta, \mathcal{D}_N)$ equals $0$ if $N_{\mathcal{D}_N}(S = s, I \neq \zeta) = 0$ (i.e., there is no additional data with $I \neq \zeta$ to use for information sharing for this value of $s$), and goes to $1$ as $N_{\mathcal{D}_N}(S = s, I \neq \zeta)$ goes to $\infty$ while we keep $N_{\mathcal{D}_N}(S = s, I = \zeta)$ fixed, since in the denominator $N_{\mathcal{D}_N}(S = s) = N_{\mathcal{D}_N}(S = s, I = \zeta) + N_{\mathcal{D}_N}(S = s, I \neq \zeta)$.

Substituting (15) and (16) into (14), the variance of the information sharing estimator then becomes
\begin{equation}
\mathbb{V}_I\big[\hat{\mu}_{IS}(\zeta \mid \mathcal{D}_N; S)\big] = \frac{1}{N_{\mathcal{D}_N}(I = \zeta)} \Big( \mathbb{E}_I\big[\hat{p}_S(\zeta, \mathcal{D}_N) \otimes (1 - \alpha(\zeta, \mathcal{D}_N))\big]^{\top} \mu_S \otimes (1 - \mu_S) + \mathbb{V}_{s \sim P[S = s \mid I = \zeta]}\big[\mathbb{E}[Y \mid S = s]\big] \Big), \tag{18}
\end{equation}
where we define $\alpha(\zeta, \mathcal{D}_N)$ as the vectorized version of $\alpha(s, \zeta, \mathcal{D}_N)$, indexed by $s \in D(S)$ such that $\big(\alpha(\zeta, \mathcal{D}_N)\big)_s = \alpha(s, \zeta, \mathcal{D}_N)$. Let us now consider the term that has $\alpha(\zeta, \mathcal{D}_N)$ as a factor when we expand the parentheses inside the expectation of the first term. From its definition, we see that $\alpha(\zeta, \mathcal{D}_N)$ is elementwise upper bounded by $1$ (attained in the limit of infinitely many samples with $I \neq \zeta$ and a finite number of samples with $I = \zeta$, for all values of $s$), and elementwise lower bounded by $0$, attained if we have no samples with $I \neq \zeta$. Therefore, since all values are nonnegative, if we define:
\begin{equation}
\alpha^*(\zeta, \mathcal{D}_N) = \frac{\mathbb{E}_I\big[\hat{p}_S(\zeta, \mathcal{D}_N) \otimes \alpha(\zeta, \mathcal{D}_N)\big]^{\top} \mu_S \otimes (1 - \mu_S)}{\mathbb{E}_I\big[\hat{p}_S(\zeta, \mathcal{D}_N)\big]^{\top} \mu_S \otimes (1 - \mu_S)}, \tag{19}
\end{equation}
then $\alpha^*(\zeta, \mathcal{D}_N)$ is upper bounded by $1$, since $\alpha(\zeta, \mathcal{D}_N)$ is elementwise upper bounded by $1$, in which case the numerator and denominator are equal. Furthermore, $\alpha^*(\zeta, \mathcal{D}_N)$ is lower bounded by $0$, since all values involved are nonnegative.
Then $\alpha^*(\zeta, \mathcal{D}_N) \in [0, 1]$, and substitution of $\alpha^*(\zeta, \mathcal{D}_N)$ into (18) yields:
\begin{align}
\mathbb{V}_I\big[\hat{\mu}_{IS}(\zeta \mid \mathcal{D}_N; S)\big] &= \frac{1}{N_{\mathcal{D}_N}(I = \zeta)} \Big( (1 - \alpha^*(\zeta, \mathcal{D}_N))\, \mathbb{E}_I\big[\hat{p}_S(\zeta, \mathcal{D}_N)\big]^{\top} \mu_S \otimes (1 - \mu_S) + \mathbb{V}_{s \sim P[S = s \mid I = \zeta]}\big[\mathbb{E}[Y \mid S = s]\big] \Big) \nonumber \\
&= \frac{1}{N_{\mathcal{D}_N}(I = \zeta)} \Big( (1 - \alpha^*(\zeta, \mathcal{D}_N))\, p_S^{\top}(\zeta)\, \mu_S \otimes (1 - \mu_S) + \mathbb{V}_{s \sim P[S = s \mid I = \zeta]}\big[\mathbb{E}[Y \mid S = s]\big] \Big) \nonumber \\
&= \frac{1}{N_{\mathcal{D}_N}(I = \zeta)} \Big( \mathbb{V}_{s \sim P[S = s \mid I = \zeta]}\big[\mathbb{E}[Y \mid S = s]\big] + (1 - \alpha^*(\zeta, \mathcal{D}_N))\, \mathbb{E}_{s \sim P[S = s \mid I = \zeta]}\big[\mathbb{E}[Y \mid S = s](1 - \mathbb{E}[Y \mid S = s])\big] \Big), \tag{20}
\end{align}
which is what was to be shown. The value of $\alpha^*(\zeta, \mathcal{D}_N)$ is a complicated inner product depending on the model parameters; it is a measure of the expected relative sizes of $N_{\mathcal{D}_N}(S = s, I = \zeta)$ and $N_{\mathcal{D}_N}(S = s, I \neq \zeta)$ for the values of $s$ where $P[S = s \mid I = \zeta]$ is large.

It is easy to see that $\alpha^*(\zeta, \mathcal{D}_N) \geq \min_s \alpha(s, \zeta)$, since then $\alpha(\zeta, \mathcal{D}_N) \geq \min_s \alpha(s, \zeta)$ elementwise, and we may factor $\min_s \alpha(s, \zeta)$ out of the expectation in the numerator of (19), after which the fraction cancels. An interesting case arises if we condition on knowing $\{N_{\mathcal{D}_N}(S = s, I = \zeta)\}_{s \in D(S)}$. Define $c$ to be the largest real number such that for all $s \in D(S)$ it holds that $N_{\mathcal{D}_N}(S = s, I \neq \zeta) \geq c\, N_{\mathcal{D}_N}(S = s, I = \zeta)$. From its definition (17), we then see that $\alpha(s, \zeta, \mathcal{D}_N) \geq \frac{c}{c+1}$, and thus $\alpha^*(\zeta, \mathcal{D}_N) \geq \frac{c}{c+1}$.

The relative sizes of $\mathbb{V}_{s \sim P[S = s \mid I = \zeta]}\big[\mathbb{E}[Y \mid S = s]\big]$ and $\mathbb{E}_{s \sim P[S = s \mid I = \zeta]}\big[\mathbb{E}[Y \mid S = s](1 - \mathbb{E}[Y \mid S = s])\big]$ signify how much additional data with $I \neq \zeta$ helps in estimating $\mathbb{E}[Y \mid I = \zeta]$. Interestingly, sharing data through information leakage is not always beneficial.
Specifically, if $\mathbb{E}[Y \mid S = s](1 - \mathbb{E}[Y \mid S = s]) = 0$ for all $s$ in the support of $P[S = s \mid I = \zeta]$, then there is no error due to misestimation of $\mathbb{E}[Y \mid S = s]$ (these conditional means are then deterministic, so a single sample suffices), and all error of the information sharing estimator stems from misestimation of $P[S = s \mid I = \zeta]$. In this case, no amount of additional data with $I \neq \zeta$ can help. In the best case, however, the term $\mathbb{V}_{s \sim P[S = s \mid I = \zeta]}\big[\mathbb{E}[Y \mid S = s]\big]$ may be $0$, in which case all error is reducible through information leakage. This happens, for example, if $P[S = s \mid I = \zeta]$ is deterministic, in which case there is no error due to misestimation of these probabilities.

A.2 Proof of Theorem 3.2
We now set out to show the following theorem:
Theorem 3.2.
If we calculate $\hat{\mu}_{IS}(\zeta \mid \mathcal{D}_N; S)$ as in (1) from dataset $\mathcal{D}_N$ and $I \perp_G Y \mid S$, then:
\begin{equation}
P\bigg[\mathbb{E}[Y \mid I = \zeta] > \sum_{s \in D(S)} \hat{p}(s \mid \zeta, \mathcal{D}_N)\, \mathrm{ucb}(\hat{\mu}(s \mid \mathcal{D}_N)) + \Delta_{\hat{p}}(\zeta) \Big( \max_{s \in D(S)} \mathrm{ucb}(\hat{\mu}(s \mid \mathcal{D}_N)) - \min_{s' \in D(S)} \mathrm{ucb}(\hat{\mu}(s' \mid \mathcal{D}_N)) \Big)\bigg] < \delta. \tag{21}
\end{equation}

Given that $S$ is a separating set, the true mean $\mu(\zeta) = \mathbb{E}[Y \mid I = \zeta]$ is a function of the true parameters $p_S(\zeta)$ and $\mu_S$:
\[
\mu(\zeta) = p_S(\zeta)^{\top} \mu_S.
\]
We will construct an upper bound for $\mu(\zeta)$ by constraining the parameters to some set with high probability, i.e., with high probability $(p_S(\zeta), \mu_S) \in \Theta = \Theta_p \times \Theta_\mu$. Then:
\begin{equation}
P\bigg[\mu(\zeta) \leq \sup_{(p_S^*, \mu_S^*) \in \Theta} (p_S^*)^{\top} \mu_S^*\bigg] \geq P\big[(p_S(\zeta), \mu_S) \in \Theta\big]. \tag{22}
\end{equation}
We construct $\Theta$ using existing concentration bounds for the individual elements of this decomposition of $\mu(\zeta)$. Let us first consider $\mu_S$. We construct an estimator $\hat{\mu}_S(\mathcal{D}_N)$ for these parameters, where each element of this vector is a sample mean of $N_{\mathcal{D}_N}(S = s)$ Bernoulli variables. Thus, we can use the standard Chernoff-Hoeffding bound for each individual index $s \in D(S)$ of the vector:
\[
P\Bigg[(\mu_S)_s \geq \big(\hat{\mu}_S(\mathcal{D}_N)\big)_s + \sqrt{\frac{\log(1/\delta_{\mu(\zeta)})}{2\, N_{\mathcal{D}_N}(S = s)}}\Bigg] \leq \delta_{\mu(\zeta)}.
\]
Then, we can take the union bound of this event over the indices to obtain a vectorized complementary version:
\[
P\Bigg[\mu_S < \hat{\mu}_S(\mathcal{D}_N) + \sqrt{\tfrac{1}{2} \log(1/\delta_{\mu(\zeta)}) \oslash N_{S,\mathcal{D}_N}(\top)}\Bigg] \geq 1 - |D(S)|\, \delta_{\mu(\zeta)},
\]
where the square root and the inequality are elementwise, i.e., $a < b$ means that $a_s < b_s$ for all $s$. We will refer to the complement of the event inside the probability as $B_\mu$. We now turn to bounding $p_S(\zeta)$. This is a multinomial variable, i.e., $\hat{p}_S(\zeta \mid \mathcal{D}_N) \sim \frac{1}{N_{\mathcal{D}_N}(I = \zeta)} \mathrm{Multinomial}\big(N_{\mathcal{D}_N}(I = \zeta), p_S(\zeta)\big)$. Then by Weissman, Ordentlich, Seroussi, Verdu, and Weinberger (2003):
\[
P\Bigg[\big\|\hat{p}_S(\zeta \mid \mathcal{D}_N) - p_S(\zeta)\big\|_1 \geq \sqrt{\frac{2\, |D(S)| \log(2/\delta_{p_S})}{N_{\mathcal{D}_N}(I = \zeta)}}\Bigg] \leq \delta_{p_S}.
\]
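As an aside, both concentration radii used to build $\Theta$ are cheap to compute in practice. A small Python sketch (the function names are our own illustrative choices; the $L_1$ radius follows the simplified form of the Weissman et al. (2003) bound displayed above):

```python
import numpy as np

def hoeffding_radius(n_s, delta_mu):
    """Chernoff-Hoeffding radius for one Bernoulli mean (mu_S)_s,
    estimated from n_s samples at confidence level delta_mu."""
    return np.sqrt(np.log(1.0 / delta_mu) / (2.0 * n_s))

def l1_radius(domain_size, n_zeta, delta_p):
    """L1-deviation radius for the empirical multinomial hat_p_S(zeta | D_N),
    estimated from n_zeta samples, via the Weissman et al. (2003) bound."""
    return np.sqrt(2.0 * domain_size * np.log(2.0 / delta_p) / n_zeta)
```

Both radii shrink at the usual $1/\sqrt{n}$ rate; the $L_1$ radius additionally grows with $|D(S)|$, which is why large separating sets yield loose bounds.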
We will refer to the event inside the probability of the Weissman et al. bound as $B_p$. Then $P[B_\mu \cup B_p] \leq |D(S)|\, \delta_{\mu(\zeta)} + \delta_{p_S} := \delta$. It is an interesting optimization problem to choose the values of $\delta_{\mu(\zeta)}$ and $\delta_{p_S}$ so as to minimize the width of the resulting confidence interval. Out of convenience, however, we simply pick them such that the confidence is 'evenly spread out', i.e., we set $|D(S)|\, \delta_{\mu(\zeta)} = \delta_{p_S} = \delta/2$ and set all $\delta_{\mu(\zeta)}$ equal to each other. So, if we define regions corresponding to the complements of $B_p$ and $B_\mu$:
\begin{align*}
\Theta_p &= \Bigg\{ p_S(\zeta) \;\Big|\; \big\|\hat{p}_S(\zeta \mid \mathcal{D}_N) - p_S(\zeta)\big\|_1 < \sqrt{\frac{2\, |D(S)| \log(4/\delta)}{N_{\mathcal{D}_N}(I = \zeta)}} \Bigg\}, \\
\Theta_\mu &= \Bigg\{ \mu_S \;\Big|\; \mu_S < \hat{\mu}_S(\mathcal{D}_N) + \sqrt{\tfrac{1}{2} \log\Big(\tfrac{2\, |D(S)|}{\delta}\Big) \oslash N_{S,\mathcal{D}_N}(\top)} \Bigg\},
\end{align*}
then indeed, by the union bound:
\begin{equation}
P\big[(p_S(\zeta), \mu_S) \in \Theta\big] \geq 1 - \delta. \tag{23}
\end{equation}
It then remains to maximize $p_S(\zeta)^{\top} \mu_S$ over $\Theta = \Theta_p \times \Theta_\mu$. Let us define
\[
\mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N)) = \hat{\mu}_S(\mathcal{D}_N) + \sqrt{\tfrac{1}{2} \log\Big(\tfrac{2\, |D(S)|}{\delta}\Big) \oslash N_{S,\mathcal{D}_N}(\top)} \quad \text{and} \quad \Delta_{\hat{p}}(\zeta) = \sqrt{\frac{|D(S)| \log(4/\delta)}{2\, N_{\mathcal{D}_N}(I = \zeta)}}.
\]
We will prove the following lemma:

Lemma A.1.
The optimization problem
\[
\sup_{(p_S^*, \mu_S^*) \in \Theta} (p_S^*)^{\top} \mu_S^*,
\]
under the constraints (where inequalities are elementwise)
\[
p_S^* \geq 0, \qquad \|p_S^*\|_1 = 1,
\]
is upper bounded by:
\[
\hat{p}_S^{\top}(\zeta \mid \mathcal{D}_N)\, \mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N)) + \Delta_{\hat{p}}(\zeta) \Big( \max_{s \in D(S)} \big(\mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N))\big)_s - \min_{s' \in D(S)} \big(\mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N))\big)_{s'} \Big).
\]

Proof.
First note that since all elements of $p_S^*(\zeta)$ are nonnegative, and we have elementwise upper bounds for $\mu_S^*$, to maximize w.r.t. $\mu_S^*$ we can simply pick the maximum possible value for each element of $\mu_S^*$ in $\Theta_\mu$, which is given by $\mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N))$. Thus our maximization problem reduces to:
\begin{equation}
\sup_{(p_S^*, \mu_S^*) \in \Theta} (p_S^*)^{\top} \mu_S^* = \sup_{p_S^* \in \Theta_p} (p_S^*)^{\top} \mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N)). \tag{24}
\end{equation}
Let us define $\Delta p_S = p_S^* - \hat{p}_S(\zeta \mid \mathcal{D}_N)$. Let $\Delta_+$ contain the positive elements of $\Delta p_S$ and $\Delta_-$ the absolute values of the negative elements of $\Delta p_S$ (with zeros elsewhere). Then $p_S^* = \hat{p}_S(\zeta \mid \mathcal{D}_N) + \Delta_+ - \Delta_-$. Substituting this into (24) yields:
\begin{equation}
\sup_{p_S^* \in \Theta_p} (p_S^*)^{\top} \mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N)) = \sup_{p_S^* \in \Theta_p} \big(\hat{p}_S^{\top}(\zeta \mid \mathcal{D}_N) + \Delta_+^{\top} - \Delta_-^{\top}\big)\, \mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N)). \tag{25}
\end{equation}
Now, since $\|p_S^*\|_1 = 1$ and $\|\hat{p}_S(\zeta)\|_1 = 1$ and all values are nonnegative, it follows that $\|\Delta_+\|_1 = \|\Delta_-\|_1$ and $\|p_S^* - \hat{p}_S(\zeta)\|_1 = \|\Delta_+\|_1 + \|\Delta_-\|_1$. Looking at the region we are maximizing over, we see that $\|p_S^* - \hat{p}_S(\zeta)\|_1 \leq \sqrt{\frac{2\, |D(S)| \log(4/\delta)}{N_{\mathcal{D}_N}(I = \zeta)}}$, and thus $\|\Delta_+\|_1 = \|\Delta_-\|_1 \leq \sqrt{\frac{|D(S)| \log(4/\delta)}{2\, N_{\mathcal{D}_N}(I = \zeta)}} = \Delta_{\hat{p}}(\zeta)$. Furthermore, it is straightforward to check that for nonnegative $\Delta_+$ and $\Delta_-$, we have $\Delta_+^{\top}\, \mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N)) \leq \|\Delta_+\|_1 \max_{s \in D(S)} \big(\mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N))\big)_s$ and $\Delta_-^{\top}\, \mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N)) \geq \|\Delta_-\|_1 \min_{s' \in D(S)} \big(\mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N))\big)_{s'}$.
Combining these facts with (25) yields:
\begin{equation}
\sup_{p_S^* \in \Theta_p} (p_S^*)^{\top} \mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N)) \leq \hat{p}_S^{\top}(\zeta \mid \mathcal{D}_N)\, \mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N)) + \Delta_{\hat{p}}(\zeta) \Big( \max_{s \in D(S)} \big(\mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N))\big)_s - \min_{s' \in D(S)} \big(\mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N))\big)_{s'} \Big), \tag{26}
\end{equation}
which finishes the proof.

We may now easily combine (22), (23), and (26) to obtain a vectorized version of the theorem statement:
\begin{equation}
P\bigg[\mu(\zeta) \leq \hat{p}_S^{\top}(\zeta \mid \mathcal{D}_N)\, \mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N)) + \Delta_{\hat{p}}(\zeta) \Big( \max_{s \in D(S)} \big(\mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N))\big)_s - \min_{s' \in D(S)} \big(\mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N))\big)_{s'} \Big)\bigg] > 1 - \delta, \tag{27}
\end{equation}
which finishes the proof of the theorem. We may follow the exact same argument, solving a minimization problem in (22) and taking the reverse Chernoff-Hoeffding bound, defining $\mathrm{lcb}(\hat{\mu}_S(\mathcal{D}_N)) = \hat{\mu}_S(\mathcal{D}_N) - \sqrt{\tfrac{1}{2} \log\big(\tfrac{2\, |D(S)|}{\delta}\big) \oslash N_{S,\mathcal{D}_N}(\top)}$, to obtain a symmetric lower bound:
\begin{equation}
P\bigg[\mu(\zeta) \geq \hat{p}_S^{\top}(\zeta \mid \mathcal{D}_N)\, \mathrm{lcb}(\hat{\mu}_S(\mathcal{D}_N)) - \Delta_{\hat{p}}(\zeta) \Big( \max_{s \in D(S)} \big(\mathrm{lcb}(\hat{\mu}_S(\mathcal{D}_N))\big)_s - \min_{s' \in D(S)} \big(\mathrm{lcb}(\hat{\mu}_S(\mathcal{D}_N))\big)_{s'} \Big)\bigg] > 1 - \delta. \tag{28}
\end{equation}

A.3 Proof of Theorem 4.1
We set out to prove the theorem:
Theorem 4.1.
If we run the separating set causal bandit UCB algorithm as defined in Appendix B.1 on a discrete causal bandit with binary rewards, and if $I \perp_G Y \mid S$, it has the same cumulative regret upper bound as standard UCB on that bandit.

To see why this is true, consider the proof of the regret bound for UCB in chapter 8.1 of Lattimore and Szepesvári (2020). To ease notation, that proof is stated for $1$-subgaussian variables, while our problem concerns Bernoulli variables, which are $\frac{1}{2}$-subgaussian; this causes an extra factor of $\frac{1}{4}$ inside the square root of our bounds, but this is an uninteresting technicality. Following that proof and its notation, let $\mu_i$ be the expected reward of action $i$, and let $\mu^* = \max_i \mu_i = \mu_1$, indexing the optimal action by $1$ for convenience. Then $\Delta_i = \mu^* - \mu_i$ is the expected regret of action $i$. We consider the regret decomposition at timestep $n$:
\[
R_n = \sum_{i : \Delta_i > 0} \Delta_i\, \mathbb{E}[T_i(n)],
\]
where $T_i(n)$ is the number of times we have chosen action $i$ up to timestep $n$. Let $\hat{\mu}_i(t)$ be the estimated reward of action $i$ (corresponding to some intervention $\zeta_i$) computed by the Information Sharing UCB algorithm at timestep $t$, and let $\mathrm{ucb}_i(t)$ be the calculated additive upper bound bonus (i.e., the best_width calculated in the Information Sharing UCB algorithm). We can then upper bound $T_i(n)$ as follows:
\begin{equation}
T_i(n) = \sum_{t=1}^{n} \mathbb{1}\{A_t = i\} \leq \sum_{t=1}^{n} \mathbb{1}\{\hat{\mu}_1(t-1) + \mathrm{ucb}_1(t-1) \leq \mu_1 - \varepsilon\} + \sum_{t=1}^{n} \mathbb{1}\{\hat{\mu}_i(t-1) + \mathrm{ucb}_i(t-1) > \mu_1 - \varepsilon \text{ and } A_t = i\}, \tag{29}
\end{equation}
where $A_t$ is the chosen action; the first term counts the number of times the index of the optimal arm is at most $\mu_1 - \varepsilon$, and the second term counts the number of times that $A_t = i$ while its index is larger than $\mu_1 - \varepsilon$. Let us start by analyzing the expectation of the first term of (29):
\begin{equation}
\mathbb{E}\Bigg[\sum_{t=1}^{n} \mathbb{1}\{\hat{\mu}_1(t-1) + \mathrm{ucb}_1(t-1) \leq \mu_1 - \varepsilon\}\Bigg]. \tag{30}
\end{equation}
Let us assume we picked the information sharing estimator for $\hat{\mu}_1$ at timestep $t-1$. From the construction of our algorithm we know that $\mathrm{ucb}_1(t-1) \leq \sqrt{\frac{\log f(t)}{2\, T_1(t-1)}}$, where $f(t) = 1 + t \log^2(t)$, and for the information sharing estimator, $\mathrm{ucb}_i(t-1)$ is given through the definition of $\mathrm{idx}(\zeta, \mathcal{D}_N; S)$ in (43) (after some simplification) by:
\begin{align}
\mathrm{ucb}_i(t-1) ={}& \hat{p}^{\top}(\zeta_i, \mathcal{D}_N) \sqrt{\tfrac{1}{2} \log\Big(\tfrac{2\, |D(S)|}{\delta}\Big) \oslash N_{S,\mathcal{D}_N}(\top)} \tag{31} \\
&+ \sqrt{\frac{|D(S)| \log(4/\delta)}{2\, N_{\mathcal{D}_N}(I = \zeta_i)}} \Big( \max_s \big(\hat{\mu}_S(\mathcal{D}_N)\big)_s - \min_{s'} \big(\hat{\mu}_S(\mathcal{D}_N)\big)_{s'} \Big), \tag{32}
\end{align}
where $\zeta_i$ is the intervention corresponding to action $i$, and $S$ is the separating set chosen by the algorithm at timestep $t-1$ for that intervention. Let us now introduce some further simplifying notation. Let:
\begin{align*}
K_i(t-1) &= \hat{p}^{\top}(\zeta_i, \mathcal{D}_N) \sqrt{\tfrac{1}{2} \oslash N_{S,\mathcal{D}_N}(\top)}, \\
L_i(t-1) &= \sqrt{\frac{|D(S)|}{2\, N_{\mathcal{D}_N}(I = \zeta_i)}} \Big( \max_s \big(\hat{\mu}_S(\mathcal{D}_N)\big)_s - \min_{s'} \big(\hat{\mu}_S(\mathcal{D}_N)\big)_{s'} \Big).
\end{align*}
Then:
\begin{equation}
\mathrm{ucb}_i(t-1) = K_i(t-1) \sqrt{\log(1/\delta) + \log(2\, |D(S)|)} + L_i(t-1) \sqrt{\log(1/\delta) + \log(4)}. \tag{33}
\end{equation}
Now, since we picked this upper bound over the sample mean bound, by construction of the algorithm it must hold at the chosen confidence parameter at timestep $t$, given by $\delta(t) = 1/f(t)$, that:
\begin{equation}
\mathrm{ucb}_i(t-1) = K_i(t-1) \sqrt{\log(f(t)) + \log(2\, |D(S)|)} + L_i(t-1) \sqrt{\log(f(t)) + \log(4)} \leq \sqrt{\frac{\log f(t)}{2\, T_i(t-1)}}, \tag{34}
\end{equation}
which implies that $K_i(t-1) + L_i(t-1) \leq \sqrt{\frac{1}{2\, T_i(t-1)}}$. Now consider taking the derivative of $\mathrm{ucb}_i(t-1)$ with respect to $\log(1/\delta)$:
\begin{align}
\frac{\partial}{\partial(\log(1/\delta))} \mathrm{ucb}_i(t-1) &= \frac{K_i(t-1)}{2\sqrt{\log(1/\delta) + \log(2\, |D(S)|)}} + \frac{L_i(t-1)}{2\sqrt{\log(1/\delta) + \log(4)}} \tag{35} \\
&\leq \frac{K_i(t-1) + L_i(t-1)}{2\sqrt{\log(1/\delta)}} \tag{36} \\
&\leq \frac{1}{2\sqrt{2\, T_i(t-1) \log(1/\delta)}} = \frac{\partial}{\partial(\log(1/\delta))} \sqrt{\frac{\log(1/\delta)}{2\, T_i(t-1)}}. \tag{37}
\end{align}
This shows that the information sharing bound grows more slowly as $\log(1/\delta)$ grows than the traditional UCB bound for $\frac{1}{2}$-subgaussian variables. Let us now consider $\mathrm{ucb}_i$ as a function of $\delta$ and calculate $\delta'_\varepsilon$ such that:
\begin{equation}
P\big[\hat{\mu}_i(t-1) + \mathrm{ucb}_i(t-1, 1/f(t)) + \varepsilon \leq \mu_i\big] = P\big[\hat{\mu}_i(t-1) + \mathrm{ucb}_i(t-1, \delta'_\varepsilon) \leq \mu_i\big] \leq \delta'_\varepsilon, \tag{38}
\end{equation}
i.e., we solve for $\delta'_\varepsilon$ such that $K_i(t-1)\sqrt{\log(1/\delta'_\varepsilon) + \log(2\, |D(S)|)} + L_i(t-1)\sqrt{\log(1/\delta'_\varepsilon) + \log(4)} = \mathrm{ucb}_i(t-1, 1/f(t)) + \varepsilon$. We may then compare this to analyzing the same event for a $\frac{1}{2}$-subgaussian variable as in the book, i.e., in that case we solve for $\delta^*_\varepsilon$ such that:
\[
\sqrt{\frac{\log(1/\delta^*_\varepsilon)}{2\, T_i(t-1)}} = \sqrt{\frac{\log f(t)}{2\, T_i(t-1)}} + \varepsilon.
\]
Then, since we have shown that the information sharing bound grows more slowly than the $\frac{1}{2}$-subgaussian bound as $\delta$ decreases, a larger decrease of $\delta$ below $1/f(t)$ is needed to add $\varepsilon$ to the information sharing bound, so it must hold that $\delta'_\varepsilon \leq \delta^*_\varepsilon$. Thus when we analyze the event
\begin{equation}
\mathbb{E}\Bigg[\sum_{t=1}^{n} \mathbb{1}\Bigg\{\hat{\mu}_1(t-1) + \sqrt{\frac{\log f(t)}{2\, T_1(t-1)}} \leq \mu_1 - \varepsilon\Bigg\}\Bigg], \tag{39}
\end{equation}
where we consider $\hat{\mu}_1(t-1)$ a $\frac{1}{2}$-subgaussian variable, the resulting upper bound must also be an upper bound for the event $\hat{\mu}_1(t-1) + \mathrm{ucb}_1(t-1, 1/f(t)) + \varepsilon \leq \mu_1$ analyzed with our concentration bound. Substituting this into (39) yields:
\begin{equation}
\mathbb{E}\Bigg[\sum_{t=1}^{n} \mathbb{1}\{\hat{\mu}_1(t-1) + \mathrm{ucb}_1(t-1) \leq \mu_1 - \varepsilon\}\Bigg] \leq \mathbb{E}\Bigg[\sum_{t=1}^{n} \mathbb{1}\Bigg\{\hat{\mu}_1(t-1) + \sqrt{\frac{\log f(t)}{2\, T_1(t-1)}} \leq \mu_1 - \varepsilon\Bigg\}\Bigg], \tag{40}
\end{equation}
where, again very importantly, on the right-hand side we treat $\hat{\mu}_1(t-1)$ as a $\frac{1}{2}$-subgaussian variable. This analysis was conditional on $\hat{\mu}_1(t-1)$ being an information sharing estimator; however, (40) is trivially true if the algorithm reverted to the sample mean. From here, we may continue the analysis of the first term exactly as in the book, as the expression is exactly the same modulo the $\frac{1}{4}$ factor inside the square root due to the fact that a Bernoulli variable is $\frac{1}{2}$-subgaussian, while the book focuses on $1$-subgaussian variables to simplify notation. For the analysis of the second term, the following inequality trivially holds:
\begin{align}
&\mathbb{E}\Bigg[\sum_{t=1}^{n} \mathbb{1}\{\hat{\mu}_i(t-1) + \mathrm{ucb}_i(t-1) \geq \mu_1 - \varepsilon \text{ and } A_t = i\}\Bigg] \tag{41} \\
&\quad\leq \mathbb{E}\Bigg[\sum_{t=1}^{n} \mathbb{1}\Bigg\{\hat{\mu}_i(t-1) + \sqrt{\frac{\log f(t)}{2\, T_i(t-1)}} \geq \mu_1 - \varepsilon \text{ and } A_t = i\Bigg\}\Bigg], \tag{42}
\end{align}
where $\hat{\mu}_i(t-1)$ may be the result of the algorithm picking an information sharing estimator or reverting to the sample mean. Consider the information sharing estimator. Since its variance is upper bounded by that of the sample mean, and it is bounded in $[0, 1]$, by Hoeffding's lemma it is itself a $\frac{1}{4\sqrt{n}}$-subgaussian variable, which implies that the corresponding concentration corollary from the book also holds if $\hat{\mu}_i(t-1)$ is an information sharing estimator. Thus, we may simply continue the analysis in the book regardless of the estimator chosen. This concludes the proof of the theorem.

B Algorithm specification
In this section of the appendix, we specify our information sharing Causal Bandit Separating Set algorithms: a UCB variant and a Thompson Sampling variant. Both algorithms use the upper bound (27). To ease notation, let us define:
\begin{equation}
\mathrm{idx}(\zeta, \mathcal{D}_N; S) = \hat{p}_S^{\top}(\zeta \mid \mathcal{D}_N)\, \mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N)) + \Delta_{\hat{p}}(\zeta) \Big( \max_{s \in D(S)} \big(\mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N))\big)_s - \min_{s' \in D(S)} \big(\mathrm{ucb}(\hat{\mu}_S(\mathcal{D}_N))\big)_{s'} \Big). \tag{43}
\end{equation}
One important detail when calculating this quantity is the effect of perfect interventions with known targets. Specifically, consider an intervention $\zeta$ corresponding to perfect interventions on nodes $O \subseteq V \setminus \{Y\}$. If $S \cap O = O'$ is not the empty set, then under intervention $\zeta$ the values of $O'$ are fixed. This limits the possible support of $P[S = s \mid I = \zeta]$, and in the calculation of (43) we may then restrict ourselves to the values of $s$ that are possible given the known targets of $\zeta$, instead of the full domain $D(S)$, effectively reducing the dimension of the problem of estimating $P[S = s \mid I = \zeta]$ by $|O'|$. This prior knowledge is used when calculating (43) in our experiments and when we construct our Thompson Sampling index.

Both of our algorithms take as input a causal discovery algorithm disc_sep_set that, given a dataset $\mathcal{D}_N$, the set of possible interventions $D(I)$ and the target variable $Y$, attempts to infer the sets $S$ such that $Y \perp_G I \mid S$, and returns those inferred separating sets.

B.1 Information Sharing UCB

We first define our UCB variant using the information sharing estimator; the full details are in Algorithm 1. First, the algorithm retrieves all separating sets. Then, for each possible intervention, it initializes the best found width so far to the width of the standard naïve UCB bound for Bernoulli variables, and sets the best found index to that of the standard naïve UCB algorithm. For that intervention, it then checks for each separating set whether the bound (27) is tighter, after which it stores as index for that intervention the index corresponding to the tightest bound. As intervention for that round, we then pick the intervention with the highest index.

Algorithm 1 Information Sharing UCB
Input: data $\mathcal{D}_N = \{(\zeta_n, v_n)\}_{n=1}^N$, set of possible interventions $D(I)$, target variable $Y$; separating set algorithm disc_sep_set
Output: next action to pick at iteration $N + 1$
  $\delta = 1/(1 + N \log^2(N))$
  initialize array index$[\zeta]$ for $\zeta \in D(I)$
  S_set $\leftarrow$ disc_sep_set$(\mathcal{D}_N, Y, D(I))$
  for all $\zeta \in D(I)$ do
    best_width $= \sqrt{\log(1/\delta) / (2\, N_{\mathcal{D}_N}(I = \zeta))}$
    best_index $= \hat{\mu}_{SM}(\zeta \mid \mathcal{D}_N) +$ best_width
    for all $S \in$ S_set$(\zeta)$ do
      new_index $= \mathrm{idx}(\zeta, \mathcal{D}_N; S)$
      new_width $=$ new_index $- \hat{\mu}_{IS}(\zeta \mid \mathcal{D}_N; S)$
      if new_width $<$ best_width then
        best_index $=$ new_index
        best_width $=$ new_width
      end if
    end for
    index$[\zeta] =$ best_index
  end for
  return $\arg\max_{\zeta}$ index$[\zeta]$

B.2 Information Sharing TS

Next we define the Thompson Sampling variant of our algorithm. Given a true separating set, it is very natural to construct a Thompson Sampling estimator as follows. We impose a Dirichlet prior on $p_S(\zeta)$ with parameter vector $\alpha = \mathbf{1}$, and on each element of $\mu_S$ we impose a Beta prior with parameters $\alpha = \beta = 1$, where the parameters are chosen to maximize entropy. We may then calculate the posteriors of these distributions given our data, which are simple and closed form, sample parameters from them, calculate the corresponding mean given this parameter sample, and use that as an index.

Unfortunately, our chosen causal discovery methods are inherently frequentist, and thus we did not implement an end-to-end Bayesian approach. Instead, we rely on our Information Sharing UCB algorithm to select a separating set, after which we construct our Thompson Sampling index assuming that the separating set is correct. If no separating set is found, we revert to a traditional direct Thompson Sampling index for that intervention. The full details are given in Algorithm 2: we first run a variant of Algorithm 1 that only selects, for each intervention, the separating set that Algorithm 1 would have picked to construct its index; then, if a separating set is found, we construct a Thompson Sampling index for that intervention.

C Experiments
In this section of the appendix, we specify the details of the experimental design of the experiments with all DAGs of 4 nodes. Furthermore, we detail a final experiment on larger graphs of 6 nodes.

C.1 All DAGs of 4 nodes experiment

When we generate all DAGs of 4 nodes in the manner described in section 5.1, we end up with 234 DAGs. For each generated graph $G = (V, E)$, we model each node as a binary variable, and add an intervention node $I_V$ for each variable $V \in V \setminus \{Y\}$. Each intervention node $I_V$ models a perfect intervention on $V$: we can either set $I_V = \emptyset$, which corresponds to not intervening on $V$; we can set it to $1$, which corresponds to intervening on $V$ such that $V = 1$; or we can set it to $0$, which corresponds to intervening on $V$ such that $V = 0$. For each variable $V$, we generate a random binary target vector $t_V$ of size $|pa(V)|$ with uniform distribution. Let $\mathrm{match}(t_V, pa(V))$ be a function that counts the number of parents of $V$ whose values match the target vector. If we do not intervene on $V$, we then sample $V$ according to:
\begin{equation}
P[V = 1 \mid pa(V)] = \frac{1 + \mathrm{match}(t_V, pa(V))}{2 + |pa(V)|}. \tag{44}
\end{equation}
That is, the probability of $V = 1$ grows with the number of parents of $V$ that match the target vector.

Algorithm 2 Information Sharing TS
Input: data $\mathcal{D}_N = \{(\zeta_n, v_n)\}_{n=1}^N$, set of possible interventions $D(I)$, target variable $Y$; separating set algorithm disc_sep_set
Output: next action to pick at iteration $N + 1$
  $\delta = 1/(1 + N \log^2(N))$
  initialize array index$[\zeta]$ for $\zeta \in D(I)$
  S_set $\leftarrow$ disc_sep_set$(\mathcal{D}_N, Y, D(I))$
  for all $\zeta \in D(I)$ do
    best_width $= \sqrt{\log(1/\delta) / (2\, N_{\mathcal{D}_N}(I = \zeta))}$
    best_sep_set $=$ NULL
    for all $S \in$ S_set$(\zeta)$ do
      new_index $= \mathrm{idx}(\zeta, \mathcal{D}_N; S)$
      new_width $=$ new_index $- \hat{\mu}_{IS}(\zeta \mid \mathcal{D}_N; S)$
      if new_width $<$ best_width then
        best_width $=$ new_width
        best_sep_set $= S$
      end if
    end for
    if best_sep_set $\neq$ NULL then
      sample $\tilde{p} \sim \mathrm{Dirichlet}\big(N_{S,\mathcal{D}_N}(I = \zeta) + \mathbf{1}\big)$
      sample $\tilde{\mu} \sim \big(\mathrm{Beta}(N_{\mathcal{D}_N}(Y = 1, S = s) + 1,\, N_{\mathcal{D}_N}(Y = 0, S = s) + 1)\big)_{s \in D(S)}$
      index$[\zeta] = \tilde{p}^{\top} \tilde{\mu}$
    else
      index$[\zeta] \sim \mathrm{Beta}(N_{\mathcal{D}_N}(Y = 1, I = \zeta) + 1,\, N_{\mathcal{D}_N}(Y = 0, I = \zeta) + 1)$  (traditional Thompson Sampling index)
    end if
  end for
  return $\arg\max_{\zeta}$ index$[\zeta]$

C.2 Experiment with DAGs of 6 nodes