A Multi-Arm Bandit Approach To Subset Selection Under Constraints
Ayush Deva
IIIT Hyderabad, [email protected]
Kumar Abhishek
IIIT Hyderabad, [email protected]
Sujit Gujar
IIIT Hyderabad, [email protected]
Abstract
We explore the class of problems where a central planner needs to select a subset of agents, each with a different quality and cost. The planner wants to maximize its utility while ensuring that the average quality of the selected agents is above a certain threshold. When the agents' qualities are known, we formulate our problem as an integer linear program (ILP) and propose a deterministic algorithm, namely DPSS, that provides an exact solution to our ILP.

We then consider the setting where the qualities of the agents are unknown. We model this as a Multi-Arm Bandit (MAB) problem and propose DPSS-UCB to learn the qualities over multiple rounds. We show that after a certain number of rounds, τ, DPSS-UCB outputs a subset of agents that satisfies the average quality constraint with a high probability. Next, we provide bounds on τ and prove that after τ rounds, the algorithm incurs a regret of O(ln T), where T is the total number of rounds. We further illustrate the efficacy of DPSS-UCB through simulations.

To overcome the computational limitations of DPSS, we propose a polynomial-time greedy algorithm, namely GSS, that provides an approximate solution to our ILP. We also compare the performance of DPSS and GSS through experiments.

Introduction

Almost all countries have cooperative societies that cater to developing sectors such as agriculture and handicrafts. We observed that some cooperatives, especially those that are consumer-oriented, such as Coop (Switzerland), or artisan cooperatives who operate their own stores, lack a well-defined system to procure products from their many members (manufacturers, artisans, or farmers). Since production is highly decentralized and usually not standardized, each producer has a different quality and cost of produce depending on various factors such as workmanship and the scale at which it operates.
The central planner (say, the cooperative manager) has to carefully trade off each producer's quality against its cost to decide the quantity to procure from each producer so that it is most beneficial for the society as a whole.

This problem is not limited to cooperatives; it is also faced in other familiar marketplaces. E-commerce platforms, like Amazon and Alibaba, have several sellers registered on their platform. For each product, the platform needs to select a subset of sellers to display on its page while ensuring that it avoids low-quality sellers and does not display only the searched product's high-cost variants. Similarly, a supermarket chain may need to decide the number of apples to procure from the regional apple farmers, each with a different quality of produce, to maximize profits while ensuring that quality standards are met.

We formulate this as a subset selection problem where a central planner needs to select a subset of these sellers/producers, whom we refer to as agents. In this paper, we associate each agent with its quality and cost of production. The agent's quality refers to the average quality of the units it produces; however, the quality of an individual unit of its product could be stochastic, especially for artistic and farm products. Thus, it becomes difficult to design an algorithm that guarantees constraint satisfaction on the realized qualities of the individual units procured. Towards this, we show that we achieve probably approximately correct (PAC) results by satisfying our constraint on the expected average quality of the units procured. Every unit procured from these agents generates revenue that is a function of its quality. The planner aims to maximize its utility (i.e., revenue minus cost) while ensuring that the procured units' average quality is above a certain threshold to guarantee customer satisfaction and retention [1, 29].
When the agents' qualities are known, we model our problem as an Integer Linear Program (ILP) and propose a novel algorithm, DPSS, that provides an exact solution to our ILP.

Often, the quality of the agents is unknown to the planner beforehand. An e-commerce platform may not know its sellers' quality at the time of registration, and an artisan's quality of work may be hard to estimate until its products are procured and sold in the market. Thus, the planner needs to carefully learn the qualities by procuring units from the agents across multiple rounds while minimizing its utility loss. Towards this, we model our setting as a Multi-Arm Bandit (MAB) problem, where each agent represents an independent arm with an unknown parameter (here, quality). To model our subset selection problem, we consider the variant of the classical MAB setting where we may select more than one agent in a single round. This setting is popularly referred to as a Combinatorial MAB (CMAB) problem [12, 17, 24]. In studying CMAB, we consider the semi-bandit feedback model, where the algorithm observes the quality realizations corresponding to each of the selected arms and the overall utility for selecting the subset of arms. The problem becomes more interesting when we also need to ensure our quality constraint in a CMAB problem. We position our work with respect to the existing literature in Section 2.

Typically, in a CMAB problem, the planner's goal is to minimize the expected regret, i.e., the difference between the expected cumulative utility of the best offline algorithm with known distributions of the agents' qualities and the expected cumulative reward of the algorithm. However, the traditional definition of regret is not suitable in our setting, as an optimal subset of agents (in terms of utility) may violate the quality constraint. Thus, we modify the regret definition to make it compatible with our setting.
We propose a novel, UCB-inspired algorithm, DPSS-UCB, that addresses the subset selection problem when the agents' qualities are unknown. We show that after a certain threshold number of rounds, τ, the algorithm satisfies the quality constraint with a high probability in every subsequent round, and under the revised regret definition, it incurs a regret of O(ln T), where T is the total number of rounds.

To address the computational challenges of DPSS, which has a time complexity of O(2^n), we propose a greedy-based algorithm, GSS, that runs in polynomial time, O(n ln n), where n is the number of agents. We show that while the approximation ratio of the utility achieved by GSS to that of DPSS can be arbitrarily small in the worst case, GSS achieves almost the same utility as DPSS in practice, which makes it a practical alternative to DPSS, especially when n is large.

In summary, our contributions are:

• We propose a framework, SS-UCB, to model the subset selection problem under constraints when the properties (here, qualities) of the agents are unknown to the central planner. In our setting, both the objective function and the constraint depend on the unknown parameter.

• We first formulate our problem as an ILP assuming the agents' qualities to be known and propose a novel, deterministic algorithm, namely DPSS (Algorithm 1), to solve the ILP.

• Using DPSS, we design DPSS-UCB, which addresses the setting where the agents' qualities are unknown. We prove that after a certain number of rounds, τ = O(ln T), DPSS-UCB satisfies the quality constraint with high probability. We also prove that it achieves a regret of O(ln T) (Theorem 2).

• To address the computational limitation of DPSS, we propose an alternative greedy approach, GSS and GSS-UCB, for the known and the unknown settings, respectively.
We show that while the greedy approach may not be optimal, it performs well in practice with a huge computational gain that allows our framework to scale to settings with a large number of agents.

The remainder of the paper is organized as follows. In Section 2, we discuss related work. In Section 3, we define our model and solve the setting where the qualities of the agents are known. In Section 4, we address the problem when the qualities of the agents are unknown. In Section 5, we propose a greedy approach to our problem. In Section 6, we discuss our simulation-based analysis, and we conclude the paper in Section 7.
Related Work

Subset selection is a well-studied class of problems that finds applications in many fields, for example, in retail, vehicle routing, and network theory. Usually, these problems are modeled as knapsack problems, where a central planner needs to select a subset of agents that maximizes its utility under budgetary constraints [33]. There are several variations of the knapsack problem studied in the literature, such as robustness [26], dynamic knapsacks [25], and knapsacks with multiple constraints [27]. In this paper, we consider a variant where the constraint is not additive, i.e., adding another agent to a subset does not always increase the average quality.

When online learning is involved, the stochastic multi-armed bandit (MAB) problem captures the exploration vs. exploitation trade-off effectively [19, 23, 21, 22, 28, 32, 31, 6]. The classical MAB problem involves learning the optimal agent from a set of agents with fixed but unknown reward distributions [7, 28, 3, 30]. Combinatorial MAB (CMAB) [10, 16, 8, 13, 9] is an extension of the classical MAB problem in which multiple agents can be selected in any round. In [10, 17, 11], the authors consider a CMAB setting where they assume the availability of a feasible set of subsets to select from. The key difference with our setting is that our constraint itself depends on the unknown parameter (quality) that we are learning through MAB. Thus, the feasible subsets that satisfy the constraint need to be learned, unlike in these previous works. [10, 17, 11] also assume the availability of an oracle that outputs an optimal subset given the estimates of the parameter as input, whereas we design such an oracle for our problem. Bandits with Knapsacks (BwK) is another interesting extension that introduces constraints in the standard bandit setting [4, 5, 2, 22] and finds applications in dynamic pricing, crowdsourcing, etc. (see [4, 2]).
Typically, in BwK, the objective is to learn the optimal agent(s) under a budgetary constraint (e.g., a limited number of selections) that depends solely on the agents' costs. However, we consider a setting where the selected subset needs to satisfy a quality constraint that depends on the learned quantities.

The closest work to ours is [22], where the authors present an assured accuracy bandit (AAB) framework in which the objective is to minimize cost while ensuring a target accuracy level in each round. While they do consider a constraint setting similar to ours, the objective function in [22] depends only on the agents' costs and not on the qualities of the agents that are unknown. Hence, our setting is different and more general with respect to both AAB and CMAB, as in our setting both the constraint and the utility function depend on the unknown parameter.
Subset Selection With Known Qualities of Agents
Here we assume that the agents' qualities are known and consider the problem where a central planner C needs to procure multiple units of a particular product from a fixed set of agents. Each agent is associated with a quality and a cost of production. C's objective is to procure units from the agents such that the average quality of all the units procured meets a certain threshold. We assume that there is no upper limit on the number of units it can procure as long as the quality threshold is met.

In Section 3.1, we define the notation required to describe our model; we formulate the problem as an integer linear program (ILP) in Section 3.3 and propose a solution to it in Section 3.4.
1. There is a fixed set of agents N = {1, 2, . . . , n} available for selection for procurement by planner C.
2. Agent i has a cost of production, c_i, and a capacity, k_i (the maximum number of units it can produce).
3. The quality of the j-th unit produced by agent i is denoted by Q_ij, which we model as a Bernoulli random variable.
4. For any agent i, the probability that Q_ij is 1 is given by q_i, i.e., E[Q_ij] = q_i for any unit j procured from agent i. q_i is also referred to as the quality of the agent in the rest of the paper.
5. The utility for C of procuring a single unit from agent i is denoted by r_i, which is equal to its expected revenue minus its cost of production, i.e., r_i = R q_i − c_i, where R is a proportionality constant.
6. The quantity of products procured by C from the i-th agent is given by x_i.
7. The average quality of the products procured by C is therefore equal to (Σ_{i∈N} Σ_{j=1}^{x_i} Q_ij) / (Σ_{i∈N} x_i).
8. We define q_av = (Σ_{i∈N} x_i q_i) / (Σ_{i∈N} x_i), which is the expected average quality of the units procured by C.
9. C needs to ensure that the average quality of all the units procured is above a certain threshold, α ∈ [0, 1].
10. The utility of C is given by z = Σ_{i∈N} x_i r_i.

Usually, an individual unit's quality Q_ij may not be quantifiable and can only be characterized by observing whether the unit was sold; hence, we model it as a Bernoulli random variable. We assume expected revenue to be proportional to the quality of the product. This is a reasonable assumption: if q_i is the probability of the product being sold and R is the price of the product, its expected revenue is R q_i.

In our setting, the average quality (Section 3.1, point 7) depends on Q_ij, which is stochastic in nature. In such a stochastic framework, it is more natural to work with expected terms than with a sequence of realized values.
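These quantities can be made concrete with a small numeric sketch; the agent values below are our own toy numbers, not from the paper:

```python
import random

def expected_avg_quality(q, x):
    # q_av = (sum_i x_i * q_i) / (sum_i x_i)   (point 8)
    return sum(qi * xi for qi, xi in zip(q, x)) / sum(x)

def realized_avg_quality(q, x, rng):
    # Average of one Bernoulli(q_i) draw per procured unit (point 7)
    draws = [rng.random() < q[i] for i in range(len(q)) for _ in range(x[i])]
    return sum(draws) / len(draws)

def utility(q, c, x, R):
    # z = sum_i x_i * r_i, with r_i = R*q_i - c_i   (points 5 and 10)
    return sum(xi * (R * qi - ci) for qi, ci, xi in zip(q, c, x))

q = [0.9, 0.6, 0.8]   # toy agent qualities
c = [0.5, 0.2, 0.4]   # toy per-unit costs
x = [1, 1, 2]         # units procured from each agent
R = 1.0

print(expected_avg_quality(q, x))
print(utility(q, c, x, R))
print(realized_avg_quality(q, x, random.Random(0)))
```

Running `realized_avg_quality` repeatedly shows the realized average fluctuating around q_av, which is exactly the gap that the next lemma bounds.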
Towards this, we show that by instead ensuring our quality constraint on the expected average quality, q_av, we can still achieve approximate constraint satisfaction with a high probability. Formally, we present the following lemma.

Lemma 1. The probability that the average quality is less than α − ε, given that q_av ≥ α, can be bounded as follows:

P( (Σ_{i∈N} Σ_{j=1}^{x_i} Q_ij) / (Σ_{i∈N} x_i) < α − ε | q_av ≥ α ) ≤ exp(−2ε²m),

where m = Σ_{i∈N} x_i and ε is a constant.

Proof.
Let V = (Σ_{i∈N} Σ_{j=1}^{x_i} Q_ij) / (Σ_{i∈N} x_i). Then

E[V] = (Σ_{i∈N} Σ_{j=1}^{x_i} E[Q_ij]) / (Σ_{i∈N} x_i) = (Σ_{i∈N} q_i x_i) / (Σ_{i∈N} x_i) = q_av.

Therefore,

P(V < α − ε | E[V] ≥ α) ≤ P(V < E[V] − ε) = P(V − E[V] < −ε) ≤ exp(−2ε²m).

The last step follows from Hoeffding's inequality [20]. □

From the above lemma, we see that by ensuring q_av ≥ α, we can achieve probably approximately correct (PAC) results on our constraint. Hence, for the rest of the paper, we work with q_av ≥ α as our quality constraint (QC).

When the qualities of the agents are known, the planner's subset selection problem can be formulated as an ILP in which it needs to decide the number of units, x_i, to procure from each agent i so as to maximize its utility (objective function) while satisfying the quality and capacity constraints. The optimization problem can be described as follows:

max_{x_i}  Σ_{i∈N} (R q_i − c_i) x_i
s.t.  q_av = (Σ_{i∈N} q_i x_i) / (Σ_{i∈N} x_i),
      q_av ≥ α,
      0 ≤ x_i ≤ k_i  ∀ i ∈ N,
      x_i ∈ Z  ∀ i ∈ N.    (1)

To solve the ILP, we propose a dynamic-programming-based algorithm called DPSS. For ease of exposition, we take k_i = 1, i.e., each agent has a unit capacity of production. This is a reasonable assumption that does not change our algorithm's results, since for an agent with k_i > 1 we can treat each unit as a separate agent, and the proofs and discussion henceforth follow. Formally, the algorithm proceeds as follows:

1. Divide the agents into one of four categories:
(a) S1: agents with q_i ≥ α and r_i ≥ 0;
(b) S2: agents with q_i < α and r_i ≥ 0;
(c) S3: agents with q_i ≥ α and r_i < 0;
(d) S4: agents with q_i < α and r_i < 0.
2. Let x = {x_i}_{i∈N} be the selection vector, where x_i = 1 if the i-th agent is selected and 0 otherwise.
3. Since an agent in S1 has a positive utility and above-threshold quality, set x_i = 1 for all i ∈ S1. Let d = Σ_{i∈S1} (q_i − α) be the excess quality accumulated.
4. Similarly, all agents in S4 have a negative utility and below-threshold quality; hence x_i = 0 for all i ∈ S4.
5. Let G be the set of the remaining agents (those in S2 and S3). For each agent i ∈ G, define d_i = q_i − α. We then need to select the agents in G that maximize the utility while keeping the accumulated excess quality non-negative, i.e., d + Σ_{i∈G} x_i d_i ≥ 0.
6. For agents in G, select according to the DP function defined in Algorithm 1 (lines 8–16). Here, d_te denotes the excess quality accumulated before choosing the next agent, and x_te refers to the selections made so far in the DP formulation.
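As a concrete reference, the steps above can be sketched in Python. This is our own illustrative implementation of the partition-and-search idea (brute force over G, mirroring the O(2^n) cost of the DP recursion), not the paper's code:

```python
from itertools import product

def dpss_sketch(q, c, alpha, R):
    # Exact subset selection with unit capacities: maximize total utility
    # subject to the average quality of the selected agents being >= alpha.
    n = len(q)
    r = [R * q[i] - c[i] for i in range(n)]
    S1 = [i for i in range(n) if q[i] >= alpha and r[i] >= 0]  # always take
    S4 = [i for i in range(n) if q[i] < alpha and r[i] < 0]    # always drop
    G = [i for i in range(n) if i not in S1 and i not in S4]   # S2 ∪ S3
    d = sum(q[i] - alpha for i in S1)        # excess quality from S1
    best_val, best_sel = 0.0, []
    for bits in product((0, 1), repeat=len(G)):
        sel = [G[j] for j, b in enumerate(bits) if b]
        # feasible iff the accumulated excess quality stays non-negative
        if d + sum(q[i] - alpha for i in sel) >= -1e-12:
            val = sum(r[i] for i in sel)
            if val > best_val:
                best_val, best_sel = val, sel
    x = [0] * n
    for i in S1 + best_sel:
        x[i] = 1
    return x, sum(r[i] for i in S1) + best_val
```

For example, with q = [0.9, 0.6, 0.95], c = [0.4, 0.1, 1.2], α = 0.8, and R = 1, the sketch pairs the profitable below-threshold agent with the unprofitable high-quality one and returns x = (1, 1, 1) with utility 0.75.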
Algorithm 1 DPSS
Inputs: N, α, R, costs c = {c_i}_{i∈N}, qualities q = {q_i}_{i∈N}
Output: quantities procured x = (x_1, . . . , x_n)
1: Initialization: ∀ i ∈ N, r_i = R q_i − c_i; z = 0
2: Segregate S1, S2, S3, S4 as described in Section 3.4
3: ∀ i ∈ S1: x_i = 1; z = z + r_i
4: d = Σ_{i∈S1} (q_i − α)
5: ∀ i ∈ S4: x_i = 0
6: G = S2 ∪ S3
7: ∀ i ∈ G: d_i = q_i − α
8: function DP(i, d_te, x_te, x*, z_te, z*)
9:   if i == |G| and d_te < 0 then return x*, z*
10:  if i == |G| and d_te ≥ 0 then
11:    if z_te > z* then z* = z_te; x* = x_te
12:    return x*, z*
13:  x*, z* = DP(i + 1, d_te, [x_te, 0], x*, z_te, z*)
14:  x*, z* = DP(i + 1, d_te + d_i, [x_te, 1], x*, z_te + r_i, z*)
15:  return x*, z*
16: x_G, z_G = DP(0, d, [ ], [ ], 0, 0)
17: ∀ i ∈ G: x_i = x_{G,i}
18: return x

Subset Selection With Unknown Qualities of Agents

In the previous section, we assumed that the qualities of the agents, q_i, are known to C. We now consider the setting where the q_i are unknown beforehand and can only be learned by selecting the agents. We model it as a CMAB problem with semi-bandit feedback and QC.

We introduce additional notation to model our problem. As in the previous setting, we assume that we are given a fixed set of agents, N, each with its own average quality of produce, q_i, and cost of produce, c_i. Additionally, our algorithm proceeds in discrete rounds t = 1, . . . , T. For a round t:

• Let x^t ∈ {0, 1}^n be the selection vector at round t, where x^t_i = 1 if agent i is selected in round t and x^t_i = 0 otherwise.
• The algorithm selects a subset of agents, S^t ⊆ N, referred to as a super-arm henceforth, where S^t = {i ∈ N | x^t_i = 1}. Let s^t be the cardinality of the selected super-arm, i.e., s^t = |S^t|.
• Let w^t_i denote the number of rounds agent i has been selected until round t, i.e., w^t_i = Σ_{y≤t} x^y_i.
• For each agent i ∈ S^t, the planner C observes its realized quality X^j_i, where j = w^t_i and E[X^j_i] = q_i. For an agent i ∉ S^t, we do not observe its realized quality (semi-bandit setting).
• The empirical mean estimate of q_i at round t is denoted by q̂^t_i = (1/w^t_i) Σ_{j=1}^{w^t_i} X^j_i. The upper confidence bound (UCB) estimate is denoted by (q̂^t_i)+ = q̂^t_i + √(3 ln t / w^t_i).
• The utility to C at round t is given by r_q(S^t) = Σ_{i∈S^t} (R q_i − c_i), where q = {q_1, q_2, . . . , q_n} is the quality vector.
• The expected average quality of the selected super-arm at round t is given by q^t_av = (1/s^t) Σ_{i∈S^t} q_i.

Following Lemma 1, we continue to work with the expected average quality instead of the realized average quality.

In this section, we propose an abstract framework, SS-UCB, for the subset selection problem with a quality constraint. SS-UCB assumes that there exists an offline subset selection algorithm, SSA (e.g., DPSS), which takes a vector of qualities, q′, and costs, c′, along with the target quality threshold, α′, and the proportionality constant, R, as input and returns a super-arm that satisfies the quality constraint (QC) with respect to q′ and α′.

SS-UCB runs in two phases: (i) Exploration, where all the agents are explored for a certain threshold number of rounds, τ; and (ii) Explore-exploit, where we invoke SSA (line 10, Algorithm 2) with {(q̂^t_i)+}_{i∈N}, {c_i}_{i∈N}, α + ε, and R as the input parameters and select accordingly. We invoke SSA with a slightly higher target threshold, α + ε, so that our algorithm is more conservative while selecting the super-arm, in order to ensure QC with a high probability (discussed in Section 4.3). As we shall see in Section 4.3, the higher the value of ε, the sooner the SSA satisfies QC with a high probability, but this comes at the cost of a loss in utility.
Thus, the value of ε must be appropriately selected based on the planner's preferences.

We refer to the algorithm as DPSS-UCB when we use DPSS (Algorithm 1) as the SSA in the SS-UCB framework. We show that DPSS-UCB outputs a super-arm that satisfies the QC with high probability (w.h.p.) after a certain threshold number of rounds, τ, and incurs a regret of O(ln T).

Algorithm 2
SS-UCB
1: Inputs: N, α, ε, R, costs c = {c_i}_{i∈N}
2: For each agent i, maintain w^t_i, q̂^t_i, (q̂^t_i)+
3: τ ← 3 ln T / ε²; t = 0
4: while t ≤ τ (Explore Phase) do
5:   Play the super-arm S^t = N
6:   Observe qualities X^j_i ∀ i ∈ S^t and update w^t_i, q̂^t_i
7:   t ← t + 1
8: while t ≤ T (Explore-Exploit Phase) do
9:   For each agent i, set (q̂^t_i)+ = q̂^t_i + √(3 ln t / w^t_i)
10:  S^t = SSA({(q̂^t_i)+}_{i∈N}, c, α + ε, R)
11:  Observe qualities X^j_i ∀ i ∈ S^t and update w^t_i, q̂^t_i
12:  t ← t + 1

We provide Probably Approximately Correct (PAC) [18, 14] bounds on DPSS-UCB satisfying QC after τ rounds:

Theorem 2.
For τ = 3 ln T / ε², if each agent is explored for τ rounds, then if we invoke DPSS with target threshold α + ε and {(q̂^t_i)+}_{i∈N} as the input, the QC is approximately met with high probability:

P( q^t_av < α − ε | (1/s^t) Σ_{i∈S^t} (q̂^t_i)+ ≥ α + ε, t > τ ) ≤ exp(−2ε²t),

where ε is the tolerance parameter and reflects the planner's ability to tolerate a slightly lower average quality than required.

Henceforth, a super-arm will be called correct if it satisfies the QC approximately as described above.

Proof.
The proof is divided into two parts. First, we show that for each round t > τ, the difference between the average of the (q̂^t_i)+ and the average of the q̂^t_i over the agents i in the selected super-arm S^t is less than ε. Second, we show that if the average of the q̂^t_i is guaranteed to be above the threshold, then the average of the q_i over the selected agents is not less than α − ε with a high probability.

Lemma 3.
The difference between the average of the (q̂^t_i)+ and the average of the q̂^t_i over the agents i in S^t is less than ε, for all t > τ.

Proof.
We have

(1/s^t) Σ_{i∈S^t} ((q̂^t_i)+ − q̂^t_i) = (1/s^t) Σ_{i∈S^t} √(3 ln t / w^t_i) ≤ √(3 ln t) / √(w^t_min),

where w^t_min = min_i w^t_i. Since for t ≤ τ we explore all the agents, w^τ_i = τ. Now, since w^t_i ≥ w^τ_i for all t > τ, we have w^t_min ≥ τ for t > τ. Hence,

√(3 ln t) / √(w^t_min) ≤ √(3 ln T) / √τ.

For τ = 3 ln T / ε², we get

(1/s^t) Σ_{i∈S^t} ((q̂^t_i)+ − q̂^t_i) ≤ ε. □

Lemma 4. For all t > τ,

P( q^t_av < α − ε | (1/s^t) Σ_{i∈S^t} q̂^t_i ≥ α ) ≤ exp(−2ε²t).

Proof.
Let Y^t = (1/s^t) Σ_{i∈S^t} q̂^t_i. Since E[q̂^t_i] = E[X^j_i] = q_i, we have E[Y^t] = q^t_av. Hence,

P( E[Y^t] < α − ε | Y^t ≥ α ) ≤ P( Y^t ≥ E[Y^t] + ε ) ≤ exp(−2ε²w^t),

where w^t = Σ_{i∈S^t} w^t_i, i.e., the total number of pulls of the selected agents until round t. Since we pull at least one agent in each round, w^t ≥ t. Thus, for all t > τ,

P( q^t_av < α − ε | (1/s^t) Σ_{i∈S^t} q̂^t_i ≥ α ) ≤ exp(−2ε²t). □

From Lemma 3 and Lemma 4, the proof follows. □
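To make the two phases of Algorithm 2 concrete, here is a minimal runnable sketch of the SS-UCB loop. The simple threshold-based `toy_ssa` is only a stand-in for the DPSS oracle, and the Bernoulli sampler and all numeric values are our own assumptions:

```python
import math
import random

def ss_ucb(n, alpha, eps, R, costs, ssa, sample, T, rng):
    # Explore every agent for tau = ceil(3 ln T / eps^2) rounds, then call
    # the offline oracle `ssa` on UCB estimates with threshold alpha + eps.
    tau = math.ceil(3 * math.log(T) / eps ** 2)
    w = [0] * n          # pull counts w_i^t
    s = [0.0] * n        # running sums of observed qualities
    history = []
    for t in range(1, T + 1):
        if t <= tau:     # explore phase: play the full super-arm N
            S_t = list(range(n))
        else:            # explore-exploit phase: UCB estimates + oracle call
            ucb = [s[i] / w[i] + math.sqrt(3 * math.log(t) / w[i])
                   for i in range(n)]
            S_t = ssa(ucb, costs, alpha + eps, R)
        for i in S_t:    # semi-bandit feedback on each selected agent
            w[i] += 1
            s[i] += sample(i, rng)
        history.append(S_t)
    return history

def toy_ssa(ucb, costs, threshold, R):
    # Stand-in oracle: keep agents whose UCB quality clears the threshold
    # and whose estimated per-unit utility is non-negative.
    return [i for i in range(len(ucb))
            if ucb[i] >= threshold and R * ucb[i] - costs[i] >= 0]

qualities = [0.9, 0.2]
def sample(i, rng):
    return 1.0 if rng.random() < qualities[i] else 0.0

hist = ss_ucb(2, alpha=0.4, eps=0.4, R=1.0, costs=[0.3, 0.05],
              ssa=toy_ssa, sample=sample, T=200, rng=random.Random(0))
```

With these values, τ = 100, so the first 100 rounds play the full super-arm; afterwards the low-quality agent is typically filtered out once its UCB estimate falls below α + ε.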
In this section, we propose a regret definition for our problem setting that encapsulates the QC. We then upper bound the regret incurred by DPSS-UCB to be of the order O(ln T).

We define the regret incurred by an algorithm A in round t as follows:

Reg^t(A) = r_q(S*) − r_q(S^t)   if S^t satisfies QC,
Reg^t(A) = L                    otherwise,

where S* = argmax_{S∈S_f} r_q(S) and L = max_{S∈S_f} (r_q(S*) − r_q(S)) is some constant. Here, S_f is the set of feasible subsets that satisfy QC, i.e., S_f = { S | S ⊆ N and (Σ_{i∈S} q_i) / |S| ≥ α }. Hence, the cumulative regret in T rounds incurred by the algorithm is

Reg(A) = Σ_{t=1}^{T} Reg^t(A).    (2)
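Under unit capacities, this regret definition can be computed directly for a toy instance; the feasible set S_f is enumerated by hand here, and all values are our own:

```python
def round_regret(q, costs, R, alpha, S_t, feasible):
    # Modified regret: utility gap to the best feasible super-arm when S_t
    # satisfies QC, otherwise the fixed penalty L (the largest such gap).
    def util(S):
        return sum(R * q[i] - costs[i] for i in S)
    best = max(util(S) for S in feasible)
    L = max(best - util(S) for S in feasible)
    qc_ok = len(S_t) > 0 and sum(q[i] for i in S_t) / len(S_t) >= alpha
    return best - util(S_t) if qc_ok else L

q, costs, R, alpha = [0.9, 0.5], [0.3, 0.1], 1.0, 0.7
feasible = [(0,), (0, 1)]   # the subsets with average quality >= 0.7
print(round_regret(q, costs, R, alpha, (0, 1), feasible))  # 0.0 (optimal)
print(round_regret(q, costs, R, alpha, (1,), feasible))    # penalty L: QC violated
```

Note how playing the infeasible super-arm (1,) is charged the constant L rather than its (smaller) utility gap, which is exactly what makes the definition QC-aware.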
We now analyse the regret when the algorithm A is DPSS-UCB:

Reg(A) = Σ_{t=1}^{τ} Reg^t(A) + Σ_{t=τ+1}^{T} Reg^t(A)
       ≤ L·τ + Σ_{t=τ+1}^{T} Reg^t(A)
       ≤ L·(3 ln T / ε²) + Σ_{t=τ+1}^{T} Reg^t(A).

Since our algorithm ensures that S^t satisfies the approximate QC for t > τ with probability greater than 1 − σ, where σ = exp(−2ε²t), we have

E[Reg(A)] ≤ L·(3 ln T / ε²) + Σ_{t≥τ} (1 − σ)(r_q(S*) − r_q(S^t)) + Σ_{t≥τ} σL,    (3)

where S^t ∈ S_f; we denote the middle term by Reg_u(T). Now,

Σ_{t≥τ} σL = Σ_{t≥τ} L e^{−2ε²t} ≤ L e^{−2ε²τ} / (1 − e^{−2ε²}) ∼ O(1/T^a),

where a = 6 for τ = 3 ln T / ε².

Now we bound the cumulative regret incurred in rounds t > τ in which QC is satisfied, i.e., Reg_u(T). Here we adapt the regret proof given by [10]. We highlight the similarities and differences of our setting with theirs and use them to bound Reg_u(T).

Bounding Reg_u(T): [10] proposed the CUCB algorithm to tackle the CMAB problem, which they prove to have a regret upper bound of O(ln T). The following is the CMAB problem setting considered in [10]:

• There exists a constrained set of super-arms χ ⊆ 2^N available for selection.
• There exists an offline (η, ν)-approximation oracle (η, ν ≤ 1) such that, for a given quality vector q′ as input, it outputs a super-arm S such that P(r_{q′}(S) ≥ η · opt_{q′}) ≥ ν, where opt_{q′} is the optimal reward for quality vector q′.
• Their regret bounds hold for any reward function that satisfies the properties of monotonicity and bounded smoothness (defined below).
• Similar to our setting, they assume a semi-bandit feedback mechanism.

We now state the reasons that allow us to adopt the regret analysis of [10] to bound Reg_u(T):

1. We have shown that after τ rounds, we obtain the constrained set of super-arms χ (i.e., the set of super-arms that satisfy QC), which forms a well-defined constrained set to select from in future rounds (t > τ).
2. We remark that the utility function considered in our problem setting satisfies both of the required properties, namely:
(i) Monotonicity:
The expected reward of playing any super-arm S ∈ χ is monotonically non-decreasing with respect to the quality vector; i.e., if q and q′ are two quality vectors such that q_i ≤ q′_i for all i ∈ N, then r_q(S) ≤ r_{q′}(S) for all S ∈ χ. Since our reward function is linear, it is trivial to note that it is monotone in the qualities.
(ii) Bounded Smoothness:
There exists a strictly increasing (and thus invertible) function f(·), called the bounded smoothness function, such that for any two quality vectors q and q′, we have |r_q(S) − r_{q′}(S)| ≤ f(Λ) whenever max_{i∈S} |q_i − q′_i| ≤ Λ. As our reward function is linear in the qualities, f(Λ) = nR·Λ is a bounded smoothness function for our setting, where n is the number of agents.
3. Oracle: Analogous to the oracle assumption in [10], we have assumed the existence of an algorithm SSA (Section 4.2). For DPSS-UCB, we use DPSS (Algorithm 1) as our SSA. As DPSS provides an exact solution, it acts as an (η, ν)-approximate oracle for DPSS-UCB with η = 1 = ν.

However, to ensure that χ consists of all the correct super-arms, one additional property needs to be satisfied, namely the ε-separatedness property.

Definition 1.
We say q = (q_1, q_2, . . . , q_n) satisfies ε-separatedness if, for all S ⊆ N with U(S) = (1/|S|) Σ_{i∈S} q_i, we have U(S) ∉ (α − ε, α).

This says that there is no super-arm S such that α − ε < (1/|S|) Σ_{i∈S} q_i < α. It is important for DPSS-UCB that ε-separatedness holds because, if there existed such a super-arm with average quality in (α − ε, α), DPSS-UCB would include it in χ due to the tolerance parameter ε while it would violate the QC.

Theorem 5.
If the qualities of the agents satisfy ε-separatedness, then Reg_u(T) is bounded by O(ln T).

Proof. Following the proof in [10], we define some parameters. A super-arm S is bad if r_q(S) < opt_q. Define S_B as the set of bad super-arms. For a given agent i ∈ [n], define:

Δ^i_min = opt_q − max{ r_q(S) | S ∈ S_B, i ∈ S },
Δ^i_max = opt_q − min{ r_q(S) | S ∈ S_B, i ∈ S },

and let Δ_min = min_i Δ^i_min and Δ_max = max_i Δ^i_max. Using the same proof as in [10], we can show that V_T, the expected number of times we play a sub-optimal agent until round T, is upper bounded as

V_T ≤ n·l_T + Σ_{t=1}^{T} 2n·t^{−2} ≤ (6 ln T / (f^{−1}(Δ_min))²)·n + (π²/3 + 1)·n,

where l_T = 6 ln T / (f^{−1}(Δ_min))². Hence, we can bound the regret as

Reg_u(T) ≤ V_T · Δ_max ≤ (6 ln T / (f^{−1}(Δ_min))² + π²/3 + 1)·n·Δ_max = (6 ln T / (Δ_min/(nR))² + π²/3 + 1)·n·Δ_max. □

Substituting the result of Theorem 5 into Equation 3, we prove that DPSS-UCB incurs a regret of O(ln T).

A Greedy Approach

In the previous sections, we proposed a framework and a dynamic-programming-based algorithm to solve our subset selection problem both when the agents' qualities are known and when they are not. Since DPSS explores all possible combinations of the selection vector and the utility associated with each, the complexity of DPSS is O(2^n), which makes it difficult to scale when n is large.

To overcome this limitation, we propose a greedy-based approach to our problem. When the qualities of the agents are known, we propose GSS, which runs in polynomial time, O(n log n), and provides an approximate solution to our ILP. We then use GSS as our SSA in the SS-UCB framework and propose GSS-UCB as an alternative to DPSS-UCB in the setting where the qualities of the agents are unknown.

Greedy algorithms have been proven effective at providing approximate solutions to ILP problems such as 0-1 knapsack.
They do so by solving linearly relaxed variants of an ILP, such as the fractional knapsack, and removing any fractional unit from the solution. We propose a similar algorithm for our subset selection problem by allowing x_i ∈ [0, 1]. Consider, for example, n = 2 agents with qualities q = [0. , 0. ], costs c = [10, ], and α = 0.7. Allowing fractional units to be taken, the optimal solution would be to take x_1 = 1, x_2 = 0. . Formally, GSS proceeds as follows:

1. Segregate S1, S2, S3, S4, as described in Section 3.4.
2. Select all agents in S1. Let d = Σ_{i∈S1} (q_i − α) be the excess quality accumulated and, as before, drop all agents in S4.
3. For the agents in S2, sort them in decreasing order of revenue gained per unit loss in quality, r_i/(α − q_i). Similarly, for the agents in S3, sort them in increasing order of revenue lost per unit gain in quality, |r_i|/(q_i − α).
4. Select units (possibly fractional) from agents in S2 until the total loss of quality is no more than d. Essentially, we use the agents in S2 to increase revenue while ensuring that the average quality stays above the threshold.
5. For agents in S2 with remaining fractional units, pair them up with an equivalent fractional unit of an agent in S3 that balances the loss in average quality.
6. When the revenue gained per unit loss in quality from the first non-exhausted agent in S2 is less than the revenue lost per unit gain in quality from the first non-exhausted agent in S3, terminate the algorithm. An agent is exhausted if its unit of produce is completely selected.
7. For any agent in S3 with a fractional unit, take the complete unit instead. For all other agents, remove any fractional units selected.
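A simplified Python sketch of the greedy idea follows; it covers steps 1-4 and the final rounding only (the fractional S2/S3 pairing of steps 5-6 is omitted), and the instance values in the usage note are our own:

```python
def gss_sketch(q, c, alpha, R):
    # Greedy subset selection with unit capacities: take all of S1, then
    # spend the quality surplus d on S2 agents with the best revenue per
    # unit of quality deficit, and drop any leftover fractional unit.
    n = len(q)
    r = [R * q[i] - c[i] for i in range(n)]
    x = [0.0] * n
    d = 0.0
    for i in range(n):
        if q[i] >= alpha and r[i] >= 0:      # S1: select fully
            x[i] = 1.0
            d += q[i] - alpha
    S2 = sorted((i for i in range(n) if q[i] < alpha and r[i] >= 0),
                key=lambda i: r[i] / (alpha - q[i]), reverse=True)
    for i in S2:                             # consume the surplus greedily
        x[i] = min(1.0, d / (alpha - q[i]))
        d -= x[i] * (alpha - q[i])
        if d <= 1e-12:
            break
    x = [int(v) for v in x]                  # step 7: drop fractional units
    return x, sum(r[i] * x[i] for i in range(n))
```

With q = [0.9, 0.6, 0.65], c = [0.4, 0.2, 0.5], α = 0.8, and R = 1, the sketch returns x = (1, 0, 0) with utility 0.5: the half-unit of the second agent that the relaxation would take is discarded by the rounding step.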
GSS
Inputs: N, α, R, costs c = [c_i], qualities q = [q_i]
Output: quantities procured x = (x_1, …, x_n)
Initialization: ∀i ∈ N, r_i = R·q_i − c_i
  Segregate S_1, S_2, S_3, S_4 as described in Section 3.4
  ∀i ∈ S_1: x_i = 1;  d = Σ_{i∈S_1} (q_i − α)
  ∀i ∈ S_2: x_i = 0
  L_3 = sort(S_3) in decreasing order of r_i/(α − q_i)
  L_4 = sort(S_4) in increasing order of |r_i|/(q_i − α)
  p = 0, q = 0
  while d > 0 and p < |S_3| do
      i = L_3[p]
      if α − q_i ≤ d then x_i = 1; d = d − (α − q_i); p = p + 1
      else x_i = d/(α − q_i); d = 0
  while p < |S_3| and q < |S_4| do
      i = L_3[p], j = L_4[q]
      a = r_i/(α − q_i), b = |r_j|/(q_j − α)
      if a ≤ b then break
      w = min((1 − x_i)(α − q_i), (1 − x_j)(q_j − α))
      x_i = x_i + w/(α − q_i); x_j = x_j + w/(q_j − α)
      if x_i == 1 then p = p + 1
      if x_j == 1 then q = q + 1
  if 0 < x_j < 1 then x_j = 1
  return ⌊x⌋

While GSS is computationally more efficient than DPSS, it is important to note that it may not always return the optimal subset of agents. The following example shows that GSS does not have a constant approximation ratio with respect to the optimal solution. Consider n = 3 agents with qualities q = [1, 0.98, 0.97] and α = 0.99, with costs c = [R − ε, c_2, c_3] chosen so that r = [ε, r_2, r_3] (recall r_i = R·q_i − c_i) and the revenue per unit loss in quality, r_i/(α − q_i), is higher for the third agent than for the second. GSS then pairs agent 3 with agent 1 first; but agent 1's excess quality (0.01) can absorb only half a unit of agent 3, whose deficit is 0.02, and once this fractional unit is dropped, GSS selects only agent 1, for a utility of ε. The optimal solution instead selects agents 1 and 2 completely, for a utility of ε + r_2. Since ε can take an arbitrarily small value, the ratio between the utility achieved by GSS and that achieved by DPSS can be arbitrarily small.

However, through experiments we show that, in practice, GSS gives close-to-optimal solutions at a huge computational benefit, which allows us to scale our framework to a large number of agents, such as in an e-commerce setting.

When we use GSS as the SSA in our SS-UCB framework, we refer to the algorithm as GSS-UCB. While the regret analysis may not necessarily hold, since GSS does not have a constant approximation ratio, we show that in practice it performs as well as DPSS-UCB in both (i) achieving constraint satisfaction after τ rounds and (ii) the regret incurred thereafter. We show this via experiments, as discussed in Section 6.

In this section, we compare the performance of GSS with DPSS in the setting where the quality of the agents is known. In Figure 1a, we compare the ratio of the utility achieved by GSS (z_gss) to the utility achieved by DPSS (z_dpss) while ensuring the QC is met. In Figure 2, we present a box plot of the distribution of the ratios of these utilities over 1000 iterations for α = 0.7. To compare the performance of GSS for much larger values of n, we compare it against the utility achieved by an ILP solver (z_ilp), namely the COIN-OR Branch and Cut solver (CBC) [15], since the computational limitations of DPSS made it infeasible to run experiments for large values of n. These results are presented in Figure 1b. Lastly, in Table 1, we compare the ratio of the time taken by GSS (t_gss) with respect to DPSS (t_dpss) and the ILP solver (t_ilp) for different values of n, with α set to 0.7.

For different values of n, the number of agents, and α, the quality threshold, we generate agents with q_i and c_i both drawn from U[0, 1]. In Figure 1, we report the average ratio over 1000 iterations for each (n, α) pair, while in Figure 2, we plot the distribution of the ratios obtained in each of the 1000 iterations for different values of n, with α set to 0.7. We use R = 1 for all our experiments.

As can be seen from Figures 1a and 1b, the average ratio of both z_gss/z_dpss and z_gss/z_ilp lies approximately in [0.94, 1.0], with a median of 1.0 across values of n, with only a few outliers and rare instances in which the ratio drops below 0.2, as evident from Figure 2. This indicates that GSS performs almost as well as DPSS in practice, with an exponentially improving computational performance in terms of time complexity with respect to DPSS and an almost 50x improvement over the ILP solver as well. This establishes the efficacy of GSS for practical use at scale.

Table 1: Ratios of running times, t_dpss : t_gss and t_ilp : t_gss, for different values of n.

In this section, we present experimental results for DPSS-UCB and GSS-UCB towards the following:

1. Constraint satisfaction: As discussed in Section 4.3, DPSS-UCB satisfies the QC approximately, with high probability, after τ = T_ε rounds. Here, α + ε is the target constraint of the planner when α is the required average quality threshold. Towards this, we plot the average number of iterations in which DPSS-UCB and GSS-UCB return a subset that satisfies the QC at each round of our experiment, for different values of ε.
2. Regret incurred for t > τ: We show that the regret incurred by our algorithm for t > τ follows a curve upper bounded by O(ln T). Towards this, we plot the cumulative regret against the round t, for τ < t ≤ T.

To carry out these experiments, we generated n = 10 agents with both q_i, c_i ∼ U[0, 1], and set α to a moderate value: for a high value of α, the number of super-arms satisfying the QC is very low, leaving hardly anything to learn, whereas for a low value, the number of super-arms satisfying the QC is very high but of little practical interest.

(Figure 1: Performance of GSS for different values of α: (a) w.r.t. DPSS, (b) w.r.t. the ILP solver. Figure 2: GSS vs. DPSS ratio distribution. Figure 3: Regret incurred for t > τ. Figure 4: Constraint satisfaction at each round.)

In Figure 4, we perform the experiment over a varied range of values of ε, whereas in Figure 3, we set ε = 0.01. We average our results over 1000 iterations of each experiment. For example, in Figure 4, a value of 0.4 at round t denotes that in 40% of the iterations, the QC was satisfied at round t. For both experiments, R = 1 and T = 100000.

The higher the value of ε, the higher the target constraint and thus the more conservative our algorithm is in selecting the subset of agents. Therefore, we achieve correctness quickly, which is evident from Figure 4. In all three cases, the algorithm achieves correctness in close to 100% of the iterations after T_ε rounds (indicated by the vertical dotted line), which justifies our value of τ. Similarly, the regret incurred by DPSS-UCB for t > τ follows a curve upper bounded by O(ln T). The regret incurred by GSS-UCB is slightly lower than that of DPSS-UCB, which further establishes the efficacy of our greedy approach.

In this paper, we addressed the class of problems where a central planner has to select a subset of agents that maximizes its utility while ensuring a quality constraint. We first considered the setting where the agents' qualities are known and proposed DPSS, which provides an exact solution to our problem. When the qualities are unknown, we modeled our problem as a CMAB problem with semi-bandit feedback. We proposed SS-UCB as a framework to address this problem, in which both the constraint and the objective function depend on the unknown parameter, a setting not previously considered in the literature. Using DPSS as our SSA in SS-UCB, we proposed DPSS-UCB, which incurs O(ln T) regret and achieves constraint satisfaction with high probability after τ = O(ln T) rounds. To address the computational limitations of DPSS, we proposed GSS, which allows us to scale our framework to a large number of agents. Via simulations, we showed the efficacy of GSS.

The SS-UCB framework proposed in this paper can be used to design and compare other approaches to this class of problems, which finds applications in many fields.
It can also easily be extended to solve other interesting variants of the problem, such as (i) settings where the pool of agents to choose from is dynamic, with new agents entering, and (ii) settings where an agent selected in a particular round is not available for the next few rounds (sleeping bandits), possibly due to lead time in procuring the units, a setting that is very common in the operations research literature. Our work can also be extended to include strategic agents, where the planner needs to design a mechanism to elicit the agents' cost of production truthfully.

References

[1] Kontogeorgos Achilleas and Semos Anastasios. Marketing aspects of quality assurance systems: The organic food sector case. British Food Journal, 110(8):829–839, 2008.
[2] Shipra Agrawal and Nikhil R. Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, pages 989–1006, 2014.
[3] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[4] Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 207–216. IEEE, 2013.
[5] Ashwinkumar Badanidiyuru, John Langford, and Aleksandrs Slivkins. Resourceful contextual bandits. In Conference on Learning Theory, pages 1109–1134, 2014.
[6] Arpita Biswas, Shweta Jain, Debmalya Mandal, and Y. Narahari. A truthful budget feasible multi-armed bandit mechanism for crowdsourcing time critical tasks. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, AAMAS '15, pages 1101–1109, Richland, SC, 2015. International Foundation for Autonomous Agents and Multiagent Systems.
[7] Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[8] Shouyuan Chen, Tian Lin, Irwin King, Michael R. Lyu, and Wei Chen. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems 27, pages 379–387, 2014.
[9] Wei Chen, Wei Hu, Fu Li, Jian Li, Yu Liu, and Pinyan Lu. Combinatorial multi-armed bandit with general reward functions. In Advances in Neural Information Processing Systems, pages 1659–1667, 2016.
[10] Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and applications. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 151–159, Atlanta, Georgia, USA, 2013. PMLR.
[11] Wei Chen, Yajun Wang, Yang Yuan, and Qinshi Wang. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. The Journal of Machine Learning Research, 17(1):1746–1778, 2016.
[12] Richard Combes, Mohammad Sadegh Talebi Mazraeh Shahi, Alexandre Proutiere, et al. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems, pages 2116–2124, 2015.
[13] Richard Combes, Mohammad Sadegh Talebi Mazraeh Shahi, Alexandre Proutiere, and Marc Lelarge. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems 28, pages 2116–2124, 2015.
[14] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In International Conference on Computational Learning Theory, pages 255–270. Springer, 2002.
[15] John J. Forrest, Stefan Vigerske, Haroldo Gambini Santos, Ted Ralphs, Lou Hafer, Bjarni Kristjansson, jpfasano, EdwinStraver, Miles Lubin, rlougee, jpgoncal1, h-i-gassmann, and Matthew Saltzman. coin-or/cbc: Version 2.10.5, March 2020.
[16] Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Learning multiuser channel allocations in cognitive radio networks: A combinatorial multi-armed bandit formulation. In IEEE Symposium on New Frontiers in Dynamic Spectrum, pages 1–9, 2010.
[17] Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, pages 1466–1478, 2012.
[18] David Haussler. Probably Approximately Correct Learning. University of California, Santa Cruz, Computer Research Laboratory, 1990.
[19] Chien-Ju Ho, Shahin Jabbari, and Jennifer Wortman Vaughan. Adaptive task assignment for crowdsourced classification. In International Conference on Machine Learning, pages 534–542, 2013.
[20] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
[21] Shweta Jain, Satyanath Bhat, Ganesh Ghalme, Divya Padmanabhan, and Y. Narahari. Mechanisms with learning for stochastic multi-armed bandit problems. Indian Journal of Pure and Applied Mathematics, 47(2):229–272, 2016.
[22] Shweta Jain, Sujit Gujar, Satyanath Bhat, Onno Zoeter, and Y. Narahari. A quality assuring, cost optimal multi-armed bandit mechanism for expertsourcing. Artificial Intelligence, 254:44–63, 2018.
[23] David R. Karger, Sewoong Oh, and Devavrat Shah. Iterative learning for reliable crowdsourcing systems. In Advances in Neural Information Processing Systems, pages 1953–1961, 2011.
[24] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, pages 535–543, 2015.
[25] Jason D. Papastavrou, Srikanth Rajagopalan, and Anton J. Kleywegt. The dynamic and stochastic knapsack problem with deadlines. Management Science, 42(12):1706–1718, 1996.
[26] Robert P. Rooderkerk and Harald J. van Heerde. Robust optimization of the 0–1 knapsack problem: Balancing risk and return in assortment optimization. European Journal of Operational Research, 250(3):842–854, 2016.
[27] Prabhakant Sinha and Andris A. Zoltners. The multiple-choice knapsack problem. Operations Research, 27(3):503–515, 1979.
[28] Aleksandrs Slivkins. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 2019.
[29] Milé Terziovski, Danny Samson, and Douglas Dow. The business value of quality management systems certification: Evidence from Australia and New Zealand. Journal of Operations Management, 15(1):1–18, 1997.
[30] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
[31] Long Tran-Thanh, Sebastian Stein, Alex Rogers, and Nicholas R. Jennings. Efficient crowdsourcing of unknown experts using bounded multi-armed bandits. Artificial Intelligence, 214:89–111, 2014.
[32] Long Tran-Thanh, Matteo Venanzi, Alex Rogers, and Nicholas R. Jennings. Efficient budget allocation with accuracy guarantees for crowdsourcing classification tasks. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, pages 901–908. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
[33] G. J. Zaimai. Optimality conditions and duality for constrained measurable subset selection problems with minmax objective functions.