Adaptive Combinatorial Allocation
Maximilian Kasy ∗ Alexander Teytelboym † November 5, 2020
Abstract
We consider settings where an allocation has to be chosen repeatedly, returns are unknown but can be learned, and decisions are subject to constraints. Our model covers two-sided and one-sided matching, even with complex constraints. We propose an approach based on Thompson sampling. Our main result is a prior-independent finite-sample bound on the expected regret for this algorithm. Although the number of allocations grows exponentially in the number of participants, the bound does not depend on this number. We illustrate the performance of our algorithm using data on refugee resettlement in the United States.
Adaptive experimentation uses information obtained in the course of an experiment in order to optimize the treatment assignment for later study participants. For example, if job seekers arrive at a job center over time, a policymaker can use the outcomes of earlier job seekers in order to improve the assignment of labor market interventions for later participants (Caria et al., 2020). Using this information, adaptive experimentation can for instance be used to maximize the welfare of study participants, or to inform subsequent policy choices (Kasy and Sautmann, 2020).

In many policy settings, however, policymakers do not simply choose between a few interventions. Instead, they need to select an entire allocation of resources among participants. These resources are typically scarce, and feasible allocations can be described by combinatorial constraints. For example, if the policymaker wants to allocate students to classrooms when classroom composition affects student outcomes (Graham et al., 2010), she must ensure that all students are assigned to classrooms, that the capacity of classrooms is not exceeded, and that the allocation respects the demographic composition of students in the population. If the policymaker wants to allocate children to foster families when families impact the outcomes of the children, she needs to ensure that siblings are placed together and that foster homes are close to schools and family homes (MacDonald, 2019; Robinson-Cortés, 2019). If the policymaker wants to allocate tenants to social housing, she needs to ensure that housing matches the

∗ Department of Economics, University of Oxford, [email protected]. † Department of Economics, University of Oxford, [email protected]. We thank Daniel Privitera and Manos Perdikakis for excellent research assistance.
In our model, the decision-maker selects options (e.g., matches) but is constrained to selecting only allocations (combinations of options; e.g., matchings) that satisfy the resource constraints (e.g., a one-to-one matching). Participants arrive in batches every period. The decision-maker selects an allocation and observes the outcome of each selected option. The outcome of each option results in a reward. The decision-maker's objective is to maximize the expected cumulative rewards from all the options she picked over time; equivalently, the decision-maker aims to minimize expected regret, i.e., the expected difference between the rewards of the optimal combination of options and of the chosen combination in each period. This type of setting has been called a combinatorial semi-bandit setting with linear rewards in the literature (Audibert et al., 2011). "Combinatorial" because the decision-maker can choose combinations of options; "semi-bandit" because the decision-maker can observe the outcomes of every match, not just of the entire matching; and "linear rewards" because the objective function is the sum of the rewards of all matches made.

Our main theoretical result is a bound on the worst-case regret obtained by using Thompson sampling in our setting. Thompson sampling is a classic heuristic for standard bandit problems; it requires that each action is picked with probability equal to the posterior probability that this action is optimal. Our theoretical result is appealing for three reasons. First, the worst-case expected regret does not depend on the batch size even though the number of possible actions (i.e., allocations) grows exponentially in the batch size. Second, our bound holds in finite samples and does not rely on asymptotic approximations. Third, our bound is prior-independent, allowing for arbitrary statistical dependence across match outcomes; this is particularly relevant for matching contexts.
We apply our approach to the problem of matching resettled refugees to local communities in the United States (Bansak et al., 2018; Trapp et al., 2020). Our data cover the placement of all refugees resettled by HIAS, a US resettlement agency.

Our regret bound allows for arbitrary statistical dependence of the prior across match outcomes; this contrasts with existing bounds, which require prior independence across options, as discussed below. A classic benchmark in the bandit literature is the asymptotic lower bound on the regret of any bandit algorithm, which was derived by Lai and Robbins (1985). Wang and Chen (2018) provide a distribution-dependent regret bound for the Thompson algorithm in the combinatorial semi-bandit setting; our result is distribution-free. Adaptive experimentation using the Thompson algorithm has been proposed for applications such as drug trials (Berry, 2006), recommender systems (Kawale et al., 2015), and customer acquisition (Schwartz et al., 2017). More recently, adaptive experimentation has been deployed in field experiments in development contexts (Kasy and Sautmann, 2020; Caria et al., 2020).

The rest of the paper is organized as follows. Section 2 describes our combinatorial semi-bandit setting and the Thompson heuristic; Section 2.1 then discusses several examples covered by this general framework. Section 3 gives our main theoretical result and the intuition for its proof. Section 4 covers several considerations for implementation in practice, including the choice of model and prior, methods for sampling from the posterior, and statistical inference. Section 5 discusses calibrated simulations based on the motivating applications. Appendix A provides a brief review of information theory, which is needed for the proof of our main result. All proofs can be found in Appendix B.
We denote all random variables with capital letters (e.g., A) and the realizations of random variables with lower-case letters (e.g., a).

Feasible options and actions
The decision-maker has access to options (e.g., matches or projects) j ∈ {1, . . . , J}, but only has sufficient resources to select M ≤ J of these. We denote by A ⊆ {a ∈ {0, 1}^J : ‖a‖_1 = M} a collection of feasible combinations of options. A is a strict subset if the decision-maker faces additional allocation constraints. The decision-maker's action a ∈ A is a feasible combination of options.

Timing, potential outcomes, and observability

The program takes place in a finite number of periods t = 1, . . . , T. In each period, there is a vector Y_t ∈ [0, 1]^J of potential outcomes, where Y_jt is the potential outcome for option j in period t. The vectors Y_t are i.i.d. across periods. We denote the average potential outcome (or average structural function) for option j by Θ_j, that is, Θ_j = E[Y_jt | Θ]. The decision-maker holds a prior belief over the vector Θ ∈ [0, 1]^J, where we allow for arbitrary dependence of this prior across the options j.

In each period, the decision-maker chooses an action A_t ∈ {0, 1}^J. If the decision-maker chooses action a, they observe the outcomes of the chosen options j (the options j for which a_j = 1), i.e., the vector

Y_t(a) = (a_j · Y_jt : j = 1, . . . , J).    (1)

We assume "stable unit treatment values" (i.e., no spillovers or interference) across options j, in the sense that Y_jt does not depend on the chosen action a_{j't} for any j'. Note that this assumption is consistent with settings where Y_jt is itself the equilibrium outcome of interactions across multiple individuals comprising an option j, as is the case for example in the applications to peer effects or matching discussed below.

Given our assumption about observability, the information available at the beginning of period t is given by

F_t = {(A_{t'}, Y_{t'}(A_{t'})) : 1 ≤ t' < t}.    (2)

Throughout this paper, the subscript t on E_t indicates that the expectation is evaluated under the posterior distribution P_t(·) = P(· | F_t), where F_t is the information available at the beginning of period t. The decision-maker can choose their action A_t at the beginning of each period t based on the information F_t, as well as possibly based on a randomization device that is statistically independent across periods and independent of the sequence of potential outcomes (Y_t)_{t=1}^T.

Objective and policy
If the decision-maker chooses action a in period t, they receive a reward equal to ⟨a, Y_t⟩. Therefore, upon taking action a, the decision-maker's expected reward given Θ, which is the same across periods t, equals

R(a) = E_t[⟨a, Y_t⟩ | Θ] = ⟨a, Θ⟩.    (3)

The decision-maker would like to maximize cumulative expected rewards,

E[ Σ_{t=1}^T R(A_t) ].    (4)

The expectation in this expression is taken over the randomness in the choice of actions A_t, the sampling distribution of potential outcomes Y_t, and over the prior distribution of Θ. Denote by A* a feasible action that maximizes the expected reward conditional on Θ (but not conditional on the vector Y_t), that is,

A* ∈ argmax_{a ∈ A} R(a) = argmax_{a ∈ A} ⟨a, Θ⟩.    (5)

The decision-maker's objective is equivalent to minimizing expected regret at T,

E[ Σ_{t=1}^T ( R(A*) − R(A_t) ) ].    (6)

Solving this dynamic stochastic combinatorial optimization problem is computationally quite costly. Rather than solving it, we propose that the decision-maker adopt the following heuristic policy. In each period, the decision-maker should take an action a from the feasible set A according to the posterior probability that a is optimal, that is, for each a ∈ A,

P_t(A_t = a) = P_t(A* = a).    (7)

This implies in particular that E_t[A_t] = E_t[A*]. This heuristic approach is known as Thompson sampling, and was originally introduced by Thompson (1933) for treatment assignment in adaptive experiments.
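As an illustration, the following sketch implements the sampling rule in Equation (7) in a deliberately simple special case: independent Beta priors over Bernoulli outcomes, and a feasible set consisting of all M-subsets of the options. The conjugate independent prior and all numerical values are illustrative assumptions; the applications below use hierarchical priors and constrained feasible sets.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

J, M, T = 6, 2, 500                       # options, batch size, periods
theta = rng.uniform(0.2, 0.8, size=J)     # true (unknown) mean rewards

# Independent Beta(1, 1) priors per option: a simplification relative to the
# paper, which allows arbitrary prior dependence across options.
alpha = np.ones(J)
beta = np.ones(J)

# Feasible actions: all M-subsets of the J options (no extra constraints here).
actions = [np.array([1 if j in s else 0 for j in range(J)])
           for s in itertools.combinations(range(J), M)]

best = max(a @ theta for a in actions)    # reward of the optimal action A*
regret = 0.0
for t in range(T):
    theta_hat = rng.beta(alpha, beta)               # one posterior draw
    a = max(actions, key=lambda a: a @ theta_hat)   # A_t = argmax <a, theta_hat>
    y = rng.binomial(1, theta)                      # potential outcomes Y_t
    alpha += a * y                                  # update only the observed
    beta += a * (1 - y)                             #   options (semi-bandit)
    regret += best - a @ theta

print(regret / T)   # average per-period regret shrinks as T grows
```

Sampling one draw of Θ from the posterior and maximizing ⟨a, Θ̂⟩ over the feasible set selects each action with exactly the posterior probability that it is optimal, which is all that Equation (7) requires.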
In the following, we discuss several examples that are covered by our general framework, and thus in particular by the regret bound provided in Theorem 1 below. These examples correspond to practically relevant policy problems. They also illustrate how various combinatorial allocation problems that have been studied in the literature fit into our framework, such as assignment to peers, one-to-one matching, many-to-one matching, knapsack problems, etc.

For each of these examples, several options might correspond to the same underlying parameter, so that Θ_j = Θ_j' with prior probability 1, for some j, j'. In the case of one-to-one matching, for instance, each matched pair corresponds to one option, but Θ_j is the same for all matched pairs j with the same observed covariates on both sides of the match.

Example 1 (Refugee resettlement). American refugee resettlement agencies need to make weekly decisions about the allocation of arriving refugee families to local communities. An action a is a matching of refugee families to local communities. The number of options J is the number of distinct matches between different family-locality pairs, and the batch size M is equal to the number of refugee families arriving in a given week. We will consider this example in greater detail in Section 5 below.

Example 2 (Foster care). Foster families are typically able to host several foster children at the same time (MacDonald, 2019; Robinson-Cortés, 2019). An action a is a many-to-one matching between families and children. The feasible actions a require that no family receives more children than it can host, that all siblings are matched to the same foster family, and that children are hosted near their school and activities. The parameters Θ_j are again perfectly dependent across options j that are observationally identical, i.e., across matches of children and families with the same observed covariates.

Example 3 (Classroom composition).
Suppose that a policymaker would like to choose the gender composition of classrooms in order to maximize student performance (Graham et al., 2010). Assume students are of two types, boys and girls. Classrooms have a fixed number of students. An action a allocates (i.e., groups) the students into classrooms. Classroom identity does not matter for student outcomes, but the identity of peers does. The number of options J is equal to the number of classroom-sized subsets of the set of all students. The batch size M is equal to the number of classrooms. If students are observationally indistinguishable from each other, except for gender, then the prior exhibits perfect dependence across classrooms with the same number of girls and boys.

We next state our main theoretical result. This result provides tight worst-case guarantees for the expected regret of Thompson sampling in our setup.
Theorem 1.
Under the assumptions of Section 2,

E[ Σ_{t=1}^T ( R(A*) − R(A_t) ) ] ≤ √( (J·T·M / 2) · [ log (J choose M) + 1 ] ).

Discussion of Theorem 1
Several features of the regret bound in Theorem 1 are worth emphasizing. First, this bound is a finite-sample bound. There is no large-sample limit and no remainder term. Second, this bound does not depend on the prior distribution for Θ in any way. Furthermore, it allows for prior distributions with arbitrary statistical dependence across the components of Θ, as required by our motivating examples. Third, this bound implies that Thompson sampling in our setting achieves the efficient rate of convergence for regret: as shown by Audibert et al. (2014), the minimax regret in our setting grows at a rate of √(JTM), up to logarithmic terms.

Theorem 1 bounds the worst-case expected regret across all possible priors, summed across units. To get the worst-case expected regret per unit, divide this expression by TM, which yields the bound

√( J · [ log (J choose M) + 1 ] / (2·T·M) ).

This bound goes to 0 at a rate of 1 over the square root of the sample size, that is, at a rate of 1/√(TM). The theorem furthermore shows that this worst-case expected regret grows, as a function of the number of possible options J, like √J (neglecting the logarithmic term). Remarkably, worst-case regret does not grow in the batch size M. This is despite the fact that the setup of Section 2 allows for action sets of size (J choose M). For comparison, application of the worst-case regret bound for Thompson sampling in bandits with dependent arms provided by Proposition 3 in Russo and Van Roy (2016) yields a much larger bound, which grows in proportion to √( (J choose M) · log (J choose M) ). Instead, the regret bound in Theorem 1 grows like that for a simple multi-armed bandit with J arms.

Intuition for the proof of Theorem 1
The proof of Theorem 1 is provided in Appendix B. This proof builds on several definitions and standard results from information theory, which are reviewed in Appendix A. Here we just sketch some of the key steps in our proof.

First, we use Pinsker's inequality in order to relate expected regret to the information about the optimal action A* provided by observations, where information is measured by the KL-divergence of posteriors and priors. Pinsker's inequality implies, for Bernoulli random variables B and B', that (E[B] − E[B'])² ≤ (1/2) · D_KL(B, B'). Lemma 1 applies Pinsker's inequality to terms showing up in the definition of expected regret, which are of the form E_t[Θ_j | A*_j = 1] − E_t[Θ_j]. This use of Pinsker's inequality is at the core of the proofs in Russo and Van Roy (2016).

Second, following some of the ideas introduced in Bubeck and Sellke (2020), Lemma 2 relates the KL-divergence to the entropy of the events A*_j = 1. The combination of these two lemmas allows us to bound the expected regret for option j in terms of the entropy reduction for the posterior of A*_j.

Third and lastly, Lemma 3 shows that the total reduction of entropy across the options j, and across the time periods t, can be no more than the sum of the prior entropies of the events A*_j = 1, which is bounded by M · [ log (J choose M) + 1 ]. The proof of Theorem 1 then combines these three lemmas.

Relationship to the literature
Our proof builds on the information-theoretic approach pioneered by Russo and Van Roy (2016) (in particular Lemmas 1 and 2, as well as Proposition 6 therein) and some variations of this approach proposed by Bubeck and Sellke (2020) (in particular Lemma 13). Despite the close relationship to their arguments, Theorem 1 differs from the bounds provided in these papers as follows. The closest result in Russo and Van Roy (2016) is their Proposition 6. Their result, however, requires statistical independence of the prior and posterior distributions for the components of Θ at all times t. By contrast, Theorem 1 above allows for arbitrary dependence. This is especially relevant for the matching setting, where independence in the prior distribution would be quite hard to justify.

The closest result in Bubeck and Sellke (2020) is their Theorem 21. Their result is asymptotic, rather than providing an exact finite-sample bound. The main interest of Bubeck and Sellke (2020) is an asymptotic refinement of regret bounds that scales in the best achievable regret, allowing for the latter to converge to 0; this is something which our result does not aim to do.

Implementation of Thompson sampling for matching problems
In order to achieve good performance in practice, our proposed procedure relies on specifying an appropriate model for the data-generating process, and an appropriate prior distribution for the underlying parameters. We generally advocate for the use of default priors that are diffuse and symmetric across types, while incorporating reasonable assumptions about the dependency structure between different options j.

Table 1 proposes some variants of models and priors for matching settings, covering our leading motivating examples, including those used in our empirical applications. For each of these variants, we assume that the options j consist of two-sided matches between types u_j and types v_j. For each possible match (option), the potential outcomes Y_jt are drawn from some distribution with mean Θ_j. We need to specify this distribution of Y_jt, as well as a joint prior distribution of the parameters Θ_j across j.

Each of these models assumes that the option effect Θ_j is determined by the sum of type effects Γ^u_{u_j} and Γ^v_{v_j}, plus an interaction effect Γ^{uv}_{u_j,v_j}. For continuous outcomes, we assume that Θ_j is directly given by this sum. For binary or discrete outcomes, we assume that Θ_j is given by the logit link function applied to this sum.

For the model for outcomes with discrete bounded support, the distribution of Y_jt is governed by the mean parameter Θ_j as well as a dispersion parameter m. The latter is necessary to allow for larger dispersion relative to a more restrictive Binomial model, which might put excessive weight on the information content of single observations.

In order to implement Thompson sampling, we need to sample from the posterior for Θ. This posterior is also relevant for statistical inference on parameter values. Such inference is often a secondary goal, in addition to the primary goal of maximizing participant outcomes. Such inference might be Bayesian, using the same posterior distributions that go into the assignment algorithm.
Alternatively, such inference might be based on permutation tests, as described below.
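To illustrate the binary-outcome specification of Table 1, the following sketch draws one parameter vector Θ from the hierarchical prior. The hyper-parameter values are illustrative assumptions, and the 8 × 17 type structure is borrowed from the application in Section 5.

```python
import numpy as np

rng = np.random.default_rng(2)

# Binary-outcome model of Table 1:
# Theta_j = logistic(Gamma^u_{u_j} + Gamma^v_{v_j} + Gamma^{uv}_{u_j, v_j}).
n_u, n_v = 8, 17                          # refugee types and affiliates (Section 5)
tau_u, tau_v, tau_uv, mu = 0.5, 0.5, 0.3, -1.0   # illustrative hyper-parameters

gamma_u = rng.normal(0, tau_u, n_u)       # type effects, one per u-type
gamma_v = rng.normal(0, tau_v, n_v)       # type effects, one per v-type
gamma_uv = rng.normal(mu, tau_uv, (n_u, n_v))    # interaction effects

# One parameter per (u, v) pair; options sharing (u_j, v_j) share Theta_j.
eta = gamma_u[:, None] + gamma_v[None, :] + gamma_uv
theta = 1 / (1 + np.exp(-eta))            # success probabilities in (0, 1)

print(theta.shape)                        # (8, 17): the 136 parameters of Section 5
```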
Markov Chain Monte Carlo and Bayesian inference
For hierarchical priors, such as those discussed in Section 4.1, posterior distributions are not available in closed form, in general. We can, however, sample from the posterior for Θ using Markov Chain Monte Carlo (MCMC) methods. Such MCMC methods only require us to specify the posterior up to a multiplicative constant (typically, up to the denominator of the posterior density, which is given by the marginal density of the observed data). MCMC methods are based on constructing a Markov chain whose stationary distribution equals the posterior distribution of interest.

Table 1: Models and priors for matching
Continuous outcomes:
  Y_jt ∼ N(Θ_j, σ²)
  Θ_j = Γ^u_{u_j} + Γ^v_{v_j} + Γ^{uv}_{u_j,v_j}
  Γ^u_{u_j} ∼ N(0, τ_u²), Γ^v_{v_j} ∼ N(0, τ_v²), Γ^{uv}_{u_j,v_j} ∼ N(µ, τ_uv²)

Binary outcomes:
  Y_jt ∼ Bernoulli(Θ_j)
  Θ_j = 1 / (1 + exp(−(Γ^u_{u_j} + Γ^v_{v_j} + Γ^{uv}_{u_j,v_j})))
  Γ^u_{u_j} ∼ N(0, τ_u²), Γ^v_{v_j} ∼ N(0, τ_v²), Γ^{uv}_{u_j,v_j} ∼ N(µ, τ_uv²)

Discrete outcomes with bounded support {0, . . . , ȳ}:
  Y_jt ∼ Beta-Binomial(α_j, β_j, ȳ), with α_j = m · Θ_j, β_j = m · (1 − Θ_j)
  Θ_j = 1 / (1 + exp(−(Γ^u_{u_j} + Γ^v_{v_j} + Γ^{uv}_{u_j,v_j})))
  Γ^u_{u_j} ∼ N(0, τ_u²), Γ^v_{v_j} ∼ N(0, τ_v²), Γ^{uv}_{u_j,v_j} ∼ N(µ, τ_uv²)

Notes:
For each of these cases, we assume that the components of Γ^u, Γ^v, Γ^{uv} are mutually independent given the hyper-parameters. The hyper-parameters are given by σ, τ_u, τ_v, τ_uv, and µ for continuous outcomes, and by τ_u, τ_v, τ_uv, and µ for binary outcomes and for discrete outcomes with bounded support. We propose to use some diffuse prior for these hyper-parameters.

Let Θ̂_t be a draw from the posterior given F_t generated by MCMC, after a sufficiently long warm-up period. Choose

A_t = argmax_{a ∈ A} ⟨a, Θ̂_t⟩.    (8)

Then A_t follows the distribution required for Thompson sampling, that is, it satisfies Equation (7).

In order to form 1 − α credible sets for the parameters Θ_j given the history F_t, sample a large number of draws Θ̂_t from the posterior, and form a credible interval based on the α/2 and 1 − α/2 quantiles of Θ̂_{j,t} across these draws.
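As a minimal illustration of MCMC from an unnormalized posterior, the following sketch runs a random-walk Metropolis-Hastings chain for a single logit-scale effect with a Gaussian prior and binomial data. The prior, the data, and the proposal scale are all illustrative assumptions; a full implementation would sample the entire hierarchical model (e.g., in Stan; Carpenter et al., 2017).

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy target: posterior of a logit-scale effect gamma with prior N(0, 1),
# given k successes in n trials with success probability logistic(gamma).
# Only the *unnormalized* log-posterior is needed, as the text notes.
k, n = 13, 40

def log_post(g):
    theta = 1 / (1 + np.exp(-g))
    return -0.5 * g**2 + k * np.log(theta) + (n - k) * np.log(1 - theta)

# Random-walk Metropolis-Hastings.
draws = []
g = 0.0
for _ in range(20000):
    prop = g + rng.normal(0, 0.5)                     # symmetric proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(g):
        g = prop                                      # accept
    draws.append(g)
draws = np.array(draws[5000:])                        # discard warm-up

# Posterior draws of Theta, and a 95% credible interval from quantiles.
theta_draws = 1 / (1 + np.exp(-draws))
lo, hi = np.quantile(theta_draws, [0.025, 0.975])
print(round(theta_draws.mean(), 2), round(lo, 2), round(hi, 2))
```

The same draws serve both purposes discussed above: a single draw feeds the argmax step in Equation (8), and the collection of draws yields credible intervals.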
Randomization inference

An alternative to Bayesian inference is randomization (permutation) inference. In the context of treatment effect estimation, randomization inference can be used to test the sharp null hypothesis that treatment does not affect any outcome, so that for instance Y_i(1) = Y_i(0) for all units i and treatment values 1 and 0. In our setting, a sharp null hypothesis of this kind allows us to impute the outcome vector Y_t(a) for any counterfactual action a ∈ A under the null hypothesis H_0, given knowledge of Y_t(A_t) for the realized action A_t. In many cases of interest, there might be more than one plausible way to specify such a null hypothesis and the corresponding counterfactual outcome vectors.

To illustrate, consider the case of one-to-one matching (of refugees to local communities, say), where each option j corresponds to a match of a refugee family to a local community. We could formalize the null hypothesis that "the matching does not matter" in two different ways. We could consider the hypothesis that refugee outcomes are the same, no matter which community they are allocated to. Or we could consider the hypothesis that outcomes in a community are the same, no matter which refugees are allocated to be there.

Given some specification of counterfactual outcomes, we can sample counterfactual histories F̃_t by re-running the Thompson sampling algorithm iteratively. In each period t' < t, draw Θ̃_{t'} and the corresponding Ã_{t'} from the posterior given F̃_{t'}. Impute a counterfactual outcome vector Y_{t'}(Ã_{t'}), based on the null hypothesis to be tested. Update the history F̃_{t'} by adding Ã_{t'} and Y_{t'}(Ã_{t'}), and iterate for the next period. Once t' = t, calculate a realization of the test statistic as a function of F̃_t. Repeat this process to generate a sampling distribution of the test statistic, and corresponding critical values and p-values for testing the null hypothesis under consideration.
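The re-running procedure can be sketched in heavily simplified form as follows: an unconstrained independent-Beta Thompson sampler, outcomes imputed under the sharp null that the allocation does not matter, and a spread-of-means test statistic. All of these choices, including the "realized" statistic value, are illustrative assumptions rather than the specification used in the application.

```python
import numpy as np

rng = np.random.default_rng(3)
J, M, T = 5, 2, 40

# Hypothetical logged data: M units arrive each period, each with a binary
# outcome. Under the sharp null "the allocation does not matter", a unit's
# outcome is the same under every action, so the logged outcome of the m-th
# unit can be imputed for whichever option a counterfactual run assigns it to.
unit_outcomes = rng.binomial(1, 0.5, size=(T, M))

def rerun(rng):
    """One counterfactual history: a simplified independent-Beta Thompson
    sampler with outcomes imputed under H0. Returns a test statistic, here
    the spread of per-option empirical mean outcomes."""
    alpha, beta = np.ones(J), np.ones(J)
    succ, tries = np.zeros(J), np.zeros(J)
    for t in range(T):
        theta_hat = rng.beta(alpha, beta)
        chosen = np.argsort(theta_hat)[-M:]   # assign the M units to top-M options
        y = unit_outcomes[t]                  # imputed outcomes under H0
        alpha[chosen] += y
        beta[chosen] += 1 - y
        succ[chosen] += y
        tries[chosen] += 1
    means = succ / np.maximum(tries, 1)
    return means.max() - means.min()

# Sampling distribution of the statistic under H0, and a permutation p-value
# for a placeholder realized statistic value.
null_draws = np.array([rerun(np.random.default_rng(s)) for s in range(200)])
realized = 0.4                                # hypothetical realized statistic
p_value = np.mean(null_draws >= realized)
print(round(float(p_value), 3))
```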
Applications

The United States has historically been the world's largest destination of resettled refugees, with 78,340 admitted in 2016. There is substantial evidence that the initial match between refugees and local communities dramatically affects the socioeconomic outcomes of refugees (Bansak et al., 2018; Trapp et al., 2020). However, local community capacities are tightly regulated by the US government. As a result, one US resettlement agency (HIAS) optimizes the placement of resettled refugees using its recommendation system, called Annie MOORE. However, Annie's estimates of refugee employment are static and come from a LASSO regression run annually (Trapp et al., 2020).

Here, we draw on the data used by Annie MOORE in order to run calibrated simulations for our proposed procedure subject to realistic constraints, hoping to inform actual refugee placement by Annie MOORE in the future.
Data
Our data cover all refugees resettled by HIAS between October 2005 and September 2020. There were a total of 37,149 refugees constituting 15,523 cases (i.e., families). We focus on the employment of primary applicants in each family. For each primary applicant in the family, we observe three binary variables: whether the applicant is of prime working age (25-54), their gender, and whether they are English-speaking. We also observe the family size, and the affiliate where the family was resettled. We furthermore observe whether the primary applicant had any US ties. Applicants with US ties (e.g., US-resident friends or family) are automatically resettled to the community where their US ties reside. Applicants without US ties can be resettled to any of the communities where HIAS operates. Finally, we observe whether or not the primary applicant was employed within 90 days of arrival. This is a key metric used by the State Department to assess the performance of American resettlement agencies.

There are 57 affiliates in our data. We drop any affiliate with fewer than 150 resettled cases over the whole period under consideration, leaving us with 17 affiliates. All affiliates are anonymized. Based on the available observables, we classify refugees into 8 "types" u, while treating each of the 17 affiliates as a separate "type" v. This means that there are 8 · 17 = 136 parameters (probabilities of finding employment) that we might wish to learn.

As noted above, the affiliates have a limited capacity for hosting refugees, where capacity is the total number of resettled people (primary applicants and their family members). The capacities of affiliates can sometimes change throughout the course of the year. For our simulations, we conservatively set the available annual capacities to be 110% of the total number of refugees without US ties actually resettled to each affiliate in a given year. The monthly

In their analysis, Trapp et al. (2020) also pool some affiliates because of small numbers of observations. Resettlement agencies are allowed to exceed their official capacity by 10% without further approval; however, in practice, agencies occasionally have to seek approval for capacity extensions and reductions because of stochastic refugee flows. Dynamic capacity management is beyond the scope of this paper.
Simulation design
Our simulated matching process works as follows. For each month t in the available data, we consider all the refugees who were resettled by HIAS in this month. We match the refugees with US ties to their actual affiliates. For all the refugees without US ties, we match them to affiliates using the Thompson algorithm discussed above. This matching has to satisfy the capacity constraints of affiliates described above, and the varying sizes of refugee families. Solving for the optimal matching for a given draw Θ̂ from the posterior defines a so-called multiple knapsack problem. Any leftover capacity and any unmatched refugees without US ties are carried over to the next month.

We then impute counterfactual outcomes for the refugees allocated using the Thompson algorithm, based on calibrated parameter values Θ. These calibrated parameter values are set to the posterior mean for the Bayesian hierarchical model for binary outcomes described in Section 4.1.
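For intuition, the following sketch solves a toy instance of the resulting multiple knapsack problem by brute force: each family is either assigned to a locality or carried over, subject to locality capacities measured in people. Family sizes, capacities, and the posterior draw are made-up inputs, and realistically sized instances require an integer-programming solver rather than enumeration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)

F, L = 5, 3                            # families this month, localities (toy sizes)
size = rng.integers(1, 5, F)           # family sizes -- hypothetical data
cap = np.array([6, 5, 4])              # remaining locality capacities -- hypothetical
theta_hat = rng.uniform(0, 1, (F, L))  # one posterior draw of match values

# Brute-force search over assignments: assign[f] is a locality index, or -1
# if the family is carried over to the next month.
best_val, best_assign = -1.0, None
for assign in itertools.product(range(-1, L), repeat=F):
    load = np.zeros(L)
    val = 0.0
    for f, l in enumerate(assign):
        if l >= 0:
            load[l] += size[f]         # capacity is used up by whole families
            val += theta_hat[f, l]     # expected reward of this match
    if np.all(load <= cap) and val > best_val:
        best_val, best_assign = val, assign

print(best_assign, round(best_val, 3))
```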
Results

WORK IN PROGRESS.

References
Agrawal, S. and Goyal, N. (2012). Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1.

Audibert, J.-Y., Bubeck, S., and Lugosi, G. (2011). Minimax policies for combinatorial prediction games. In Proceedings of the 24th Annual Conference on Learning Theory, pages 107–132.

Audibert, J.-Y., Bubeck, S., and Lugosi, G. (2014). Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45.

Bansak, K., Ferwerda, J., Hainmueller, J., Dillon, A., Hangartner, D., Lawrence, D., and Weinstein, J. (2018). Improving refugee integration through data-driven algorithmic assignment. Science, 359(6373):325–329.

Berry, D. A. (2006). Bayesian clinical trials. Nature Reviews Drug Discovery, 5(1):27–36.

Bubeck, S. and Sellke, M. (2020). First-order Bayesian regret analysis of Thompson sampling. In Algorithmic Learning Theory, pages 196–233.

Caria, S., Gordon, G., Kasy, M., Osman, S., Quinn, S., and Teytelboym, A. (2020). Adaptive treatment assignment in experiments for policy choice. Mimeo.

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1).

Delacrétaz, D., Kominers, S. D., Teytelboym, A., et al. (2019). Matching mechanisms for refugee resettlement. Technical report.

Gittins, J. C. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society: Series B (Methodological), 41(2):148–164.

Graham, B. S., Imbens, G. W., and Ridder, G. (2010). Measuring the effects of segregation in the presence of social spillovers: A nonparametric approach. Technical report, National Bureau of Economic Research.

Kasy, M. and Sautmann, A. (2020). Adaptive treatment assignment in experiments for policy choice. Econometrica, forthcoming.

Kaufmann, E., Korda, N., and Munos, R. (2012). Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, pages 199–213. Springer.

Kawale, J., Bui, H. H., Kveton, B., Tran-Thanh, L., and Chawla, S. (2015). Efficient Thompson sampling for online matrix-factorization recommendation. In Advances in Neural Information Processing Systems, pages 1297–1305.

Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22.

MacDonald, D. E. (2019). Foster care: A dynamic matching approach. Technical report, Mimeo.

MacKay, D. J. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press.

Robinson-Cortés, A. (2019). Who gets placed where and why? An empirical framework for foster care placement.

Russo, D. and Van Roy, B. (2016). An information-theoretic analysis of Thompson sampling. The Journal of Machine Learning Research, 17(1):2442–2471.

Schwartz, E. M., Bradlow, E. T., and Fader, P. S. (2017). Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science, 36(4):500–522.

Thakral, N. (2016). The public-housing allocation problem: Theory and evidence from Pittsburgh. Technical report, Harvard University.

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.

Trapp, A. C., Teytelboym, A., Martinello, A., Andersson, T., Ahani, N., et al. (2020). Placement optimization in refugee resettlement. Operations Research.

van Dijk, W. (2019). The socio-economic consequences of housing assistance. Technical report, University of Chicago Kenneth C. Griffin Department of Economics.

Waldinger, D. (2018). Targeting in-kind transfers through market design: A revealed preference analysis of public housing allocation. Technical report, New York University.

Wang, S. and Chen, W. (2018). Thompson sampling for combinatorial semi-bandits. arXiv preprint arXiv:1803.04623.
A brief review of information theory
In this section, we review some basic definitions and facts about entropy, mutual information, and KL-divergence. For further background, see MacKay (2003) (in particular chapter 8), as well as Section 3 in Russo and Van Roy (2016). For our purposes, it is enough to restrict attention to the Bernoulli case, so that we can introduce the following definitions in elementary form. Let A be a Bernoulli random variable with expectation p, and let A' be a Bernoulli random variable with expectation q. We overload notation by allowing the arguments A and p to be used interchangeably.

• Entropy:
  H(A) = H(p) = −[ p log(p) + (1 − p) log(1 − p) ].    (A1)

• KL divergence:
  D_KL(A, A') = D_KL(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)).    (A2)

• Pinsker's inequality:
  (E[A] − E[A'])² = |p − q|² ≤ (1/2) · D_KL(p, q) = (1/2) · D_KL(A, A').    (A3)

• Mutual information as expected divergence of the posterior: For any random variable or vector F, let p(f) = E[A | F = f]. Then
  I(A; F) = E[ D_KL(p(F), p) ].    (A4)

• Conditional entropy:
  H(A | F) = E[ H(p(F)) ].    (A5)

• Entropy reduction form of mutual information:
  I(A; F) = H(A) − H(A | F).    (A6)

• Data processing inequality: For any transformation g(F) of a random variable or vector F,
  I(A; g(F)) ≤ I(A; F).    (A7)

• Chain rule of mutual information:
  I(A; (F, G)) = I(A; F) + I(A; G | F).    (A8)
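The identities above can be verified numerically for a small Bernoulli example; the joint distribution below is arbitrary.

```python
import numpy as np

def H(p):
    """Entropy of a Bernoulli(p), in nats -- Equation (A1)."""
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q) -- Equation (A2)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Joint distribution: F ~ Bernoulli(w); conditional on F = f, A ~ Bernoulli(p_f).
w, p0, p1 = 0.3, 0.2, 0.7
p = w * p0 + (1 - w) * p1                 # marginal success probability E[A]

# (A4): mutual information as expected KL divergence of the posterior.
mi_div = w * kl(p0, p) + (1 - w) * kl(p1, p)
# (A5)-(A6): entropy-reduction form, I(A; F) = H(A) - H(A | F).
mi_ent = H(p) - (w * H(p0) + (1 - w) * H(p1))
assert abs(mi_div - mi_ent) < 1e-12       # the two forms agree

# (A3): Pinsker's inequality for Bernoullis.
assert (p0 - p1) ** 2 <= 0.5 * kl(p0, p1)

print(round(mi_ent, 4))
```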
Proofs

For ease of reference, we begin by restating our notation and assumptions.

$Y_t, \Theta, A_t \in \mathbb{R}^J$: Outcome, parameter, and action vectors
$A_t \in \mathcal{A} \subseteq \{a \in \{0,1\}^J : \|a\|_1 = M\}$: Feasible allocations and batch size
$Y_{jt} \in [0,1]$: Bounded outcomes
$\Theta = E_t[Y_t \mid \Theta]$: Parameters are expectations of outcomes
$\bar\Theta_t = E_t[\Theta] = E_t[Y_t]$: Prior expectation of the parameter (at $t$)
$R(a) = E_t[\langle a, Y_t \rangle \mid \Theta] = \langle a, \Theta \rangle$: Linear (combinatorial) expected rewards
$Y_t(a) = (a_j \cdot Y_{jt} : j = 1, \dots, J)$: Observable outcomes (semi-bandit)
$A^* \in \operatorname{argmax}_{a \in \mathcal{A}} R(a) = \operatorname{argmax}_{a \in \mathcal{A}} \langle a, \Theta \rangle$: Optimal action
$\bar\Theta^*_{jt} = E_t[\Theta_j \mid A^*_j = 1] = E_t[Y_{jt} \mid A^*_j = 1]$: Conditional expectation of parameters
$p_t = E_t[A^*]$: Expected optimal action

For Thompson sampling we have that $A_t$ has the same distribution as $A^*$, and therefore $E_t[A_t] = E_t[A^*] = p_t$. We next prove three preliminary lemmas, before combining them in the proof of Theorem 1 itself.
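To fix ideas, here is a minimal sketch of Thompson sampling in this semi-bandit setup, under simplifying assumptions that are ours rather than the paper's: independent Beta-Bernoulli posteriors for each component, and the feasible set $\mathcal{A} = \{a \in \{0,1\}^J : \|a\|_1 = M\}$, so that maximizing $\langle a, \theta \rangle$ reduces to selecting the $M$ largest sampled components. Richer constraint sets would replace this top-$M$ step with an integer program.

```python
import random

def thompson_step(alpha, beta_, M):
    """Draw theta from the (independent Beta) posterior, then maximize
    <a, theta> over {a in {0,1}^J : ||a||_1 = M}; with this simple
    feasible set the argmax just selects the M largest components."""
    theta = [random.betavariate(a, b) for a, b in zip(alpha, beta_)]
    return sorted(range(len(theta)), key=lambda j: theta[j], reverse=True)[:M]

def update(alpha, beta_, chosen, outcomes):
    """Semi-bandit feedback: only components with A_jt = 1 reveal Y_jt."""
    for j, y in zip(chosen, outcomes):
        alpha[j] += y        # success count
        beta_[j] += 1 - y    # failure count

# Simulate T rounds against fixed true parameters (illustrative values).
random.seed(0)
J, M, T = 10, 3, 2000
truth = [j / (J - 1) for j in range(J)]   # true Theta_j in [0, 1]
alpha, beta_ = [1.0] * J, [1.0] * J       # uniform Beta(1, 1) priors
for t in range(T):
    chosen = thompson_step(alpha, beta_, M)
    outcomes = [1 if random.random() < truth[j] else 0 for j in chosen]
    update(alpha, beta_, chosen, outcomes)
```

Because $A_t$ is obtained by maximizing over an independent draw from the posterior, it has the same distribution as $A^*$ given the history, which is the property used throughout the proofs.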
Lemma 1 (Bounding regret by the component-wise information).
\[ E_t[R(A^*) - R(A_t)] \le \sqrt{J \cdot \sum_{j=1}^J p_{jt}^2 \cdot D_{KL}\left(\bar\Theta^*_{jt}, \bar\Theta_{jt}\right)}. \]

Proof of Lemma 1:
\begin{align*}
E_t[R(A^*) - R(A_t)] &= E_t[\langle A^* - A_t, \Theta \rangle] \tag{B1} \\
&= \langle p_t, \bar\Theta^*_t \rangle - \langle p_t, \bar\Theta_t \rangle \tag{B2} \\
&\le \sqrt{J \cdot \sum_{j=1}^J p_{jt}^2 \cdot \left(\bar\Theta^*_{jt} - \bar\Theta_{jt}\right)^2} \tag{B3} \\
&\le \sqrt{J \cdot \sum_{j=1}^J p_{jt}^2 \cdot D_{KL}\left(\bar\Theta^*_{jt}, \bar\Theta_{jt}\right)} \tag{B4}
\end{align*}
These steps hold for the following reasons.
(B1) By definition of $R$.
(B2) By splitting the inner product, and using (i) iterated expectations, conditioning on $A^*_j = 1$ for each component $j$ in turn, and (ii) independence of $A_t$ and $\Theta$ and the definition of Thompson sampling.
(B3) By Cauchy-Schwarz (for the inner product with a $J$-vector of ones).
(B4) By Pinsker's inequality (A3), applied to Bernoulli random variables with expectations $\bar\Theta^*_{jt}$ and $\bar\Theta_{jt}$. $\Box$

Lemma 2 (Divergence and component-wise information gain).
\[ p_{jt}^2 \cdot D_{KL}\left(\bar\Theta^*_{jt}, \bar\Theta_{jt}\right) \le I_t\left(A^*_j; Y_t(A_t), A_t\right). \]

Proof of Lemma 2:
For the purpose of this proof, construct a Bernoulli random variable $\tilde Y_{jt}$ with expectation $Y_{jt}$, independently of everything else. Note that $E_t[\tilde Y_{jt}] = \bar\Theta_{jt}$. The divergence $D_{KL}(\bar\Theta^*_{jt}, \bar\Theta_{jt})$ can be interpreted as the KL-divergence between the distribution of $\tilde Y_{jt}$ conditional on $A^*_j = 1$ and the (unconditional) distribution of $\tilde Y_{jt}$. Taking the expectation over $A^*_j$ of the KL-divergence yields the mutual information between $A^*_j$ and $\tilde Y_{jt}$:
\[ I_t(A^*_j; \tilde Y_{jt}) = p_{jt} \cdot D_{KL}\left(E_t[\Theta_j \mid A^*_j = 1], \bar\Theta_{jt}\right) + (1 - p_{jt}) \cdot D_{KL}\left(E_t[\Theta_j \mid A^*_j = 0], \bar\Theta_{jt}\right), \tag{B5} \]
and thus
\begin{align*}
p_{jt}^2 \cdot D_{KL}\left(\bar\Theta^*_{jt}, \bar\Theta_{jt}\right) &\le p_{jt} \cdot I_t(A^*_j; \tilde Y_{jt}) \tag{B6} \\
&\le p_{jt} \cdot I_t(A^*_j; Y_{jt}) \tag{B7} \\
&= I_t(A^*_j; A_{jt} \cdot Y_{jt}, A_{jt}) \tag{B8} \\
&\le I_t(A^*_j; Y_t(A_t), A_t). \tag{B9}
\end{align*}
These steps hold for the following reasons.
(B6) Because the second term in Equation (B5) is non-negative, so that $p_{jt} \cdot D_{KL}(\bar\Theta^*_{jt}, \bar\Theta_{jt}) \le I_t(A^*_j; \tilde Y_{jt})$; multiplying both sides by $p_{jt}$ yields the claim.
(B7) By the data-processing inequality, applied to the mapping from $Y_{jt}$ to $\tilde Y_{jt}$.
(B8) By the law of iterated expectations, applied to $I_t(A^*_j; A_{jt} \cdot Y_{jt}, A_{jt})$, averaging over the distribution of $A_{jt}$ (under Thompson sampling).
(B9) By the data-processing inequality, again. $\Box$

Lemma 3 (Bounding the sum of component-wise information).
\[ E\left[ \sum_{t=1}^T \sum_{j=1}^J I_t\left(A^*_j; Y_t(A_t), A_t\right) \right] \le M \cdot \left[ \log\left(\tfrac{J}{M}\right) + 1 \right]. \]

Proof of Lemma 3:
\begin{align*}
E\left[ \sum_{t=1}^T \sum_{j=1}^J I_t\left(A^*_j; Y_t(A_t), A_t\right) \right] &= \sum_{j=1}^J I\left(A^*_j; (Y_t(A_t), A_t : t = 1, \dots, T)\right) \tag{B10} \\
&\le \sum_{j=1}^J H(A^*_j) \tag{B11} \\
&= -\sum_{j=1}^J \left[ p_{j,1} \log(p_{j,1}) + (1 - p_{j,1}) \log(1 - p_{j,1}) \right] \tag{B12} \\
&\le J \cdot \left( \tfrac{M}{J} \log\left(\tfrac{J}{M}\right) + \tfrac{J - M}{J} \log\left(\tfrac{J}{J - M}\right) \right) \tag{B13} \\
&\le M \cdot \left[ \log\left(\tfrac{J}{M}\right) + 1 \right] \tag{B14}
\end{align*}
These steps hold for the following reasons.
(B10) The chain rule of mutual information.
(B11) The entropy reduction form of mutual information and the non-negativity of (conditional) entropy.
(B12) The definition of entropy for $A^*_j$, where $p_{j,1}$ denotes the prior expectation of $A^*_j$.
(B13) Jensen's inequality, using the concavity of entropy and $\sum_j p_{j,1} = M$.
(B14) The inequality $\log(1 + x) \le x$, for $x = M/(J - M)$. $\Box$

Proof of Theorem 1:
\begin{align*}
E\left[ \sum_{t=1}^T \left( R(A^*) - R(A_t) \right) \right] &= E\left[ \sum_{t=1}^T E_t\left[ R(A^*) - R(A_t) \right] \right] \tag{B15} \\
&\le E\left[ \sum_{t=1}^T \sqrt{J \sum_{j=1}^J I_t\left(A^*_j; Y_t(A_t), A_t\right)} \right] \tag{B16} \\
&\le \sqrt{J T \cdot E\left[ \sum_{t=1}^T \sum_{j=1}^J I_t\left(A^*_j; Y_t(A_t), A_t\right) \right]} \tag{B17} \\
&\le \sqrt{J T M \cdot \left[ \log\left(\tfrac{J}{M}\right) + 1 \right]}. \tag{B18}
\end{align*}
These steps hold for the following reasons.
(B15) The law of iterated expectations.
(B16) Lemmas 1 and 2.
(B17) Cauchy-Schwarz (for the inner product with a $T$-vector of ones), together with Jensen's inequality to bring the expectation inside the square root.
(B18) Lemma 3. $\Box$
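Steps (B12) to (B14) involve only elementary entropy calculations, so they can be checked numerically. The sketch below (our construction; the particular vector `p` is an arbitrary illustrative choice with $\sum_j p_j = M$ and each $p_j \le 1$) verifies the chain of inequalities:

```python
import math

def entropy(p):
    """Bernoulli entropy in natural logs, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

J, M = 20, 4
# An arbitrary vector with sum(p) = M and p_j <= 1, as implied by ||A*||_1 = M.
p = [M * j / (J * (J + 1) / 2) for j in range(1, J + 1)]
assert abs(sum(p) - M) < 1e-9 and max(p) <= 1.0

sum_entropies = sum(entropy(pj) for pj in p)   # sum of entropies, as in (B12)
jensen_bound = J * entropy(M / J)              # Jensen step, as in (B13)
final_bound = M * (math.log(J / M) + 1)        # final bound, as in (B14)
assert sum_entropies <= jensen_bound <= final_bound
```

The middle inequality holds for any such vector by concavity of entropy, and the last one uses $\log(1 + x) \le x$ exactly as in the proof.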