Dynamic Learning with Frequent New Product Launches: A Sequential Multinomial Logit Bandit Problem
Junyu Cao, Industrial Engineering and Operations Research, University of California, Berkeley
Wei Sun, IBM Research
Abstract
Motivated by the phenomenon that companies introduce new products to keep abreast of customers’ rapidly changing tastes, we consider a novel online learning setting where a profit-maximizing seller needs to learn customers’ preferences through offering recommendations, which may contain existing products and new products that are launched in the middle of a selling period. We propose a sequential multinomial logit (SMNL) model to characterize customers’ behavior when product recommendations are presented in tiers. For the offline version with known customers’ preferences, we propose a polynomial-time algorithm and characterize the properties of the optimal tiered product recommendation. For the online problem, we propose a learning algorithm and quantify its regret bound. Moreover, we extend the setting to incorporate a constraint which ensures every new product is learned to a given accuracy. Our results demonstrate that the tier structure can be used to mitigate the risks associated with learning new products.
Keywords: sequential, multinomial logit model, bandit, new product, dynamic
arXiv [cs.LG], Apr
Introduction
Facing increasingly savvy customers whose preferences are rapidly changing, companies that choose to play it safe by remaining with traditional product lines risk being overtaken by competitors more in tune with their customers. A coping strategy adopted by companies is to frequently launch new products and learn from the market responses. Between the tried-and-true existing products and new products with little or no history, companies face a dilemma: they have to offer new products in order to understand the changing market dynamics so as to improve longer-term profitability, yet they may have to sacrifice short-term profitability. The central question is, how can a company quickly learn customers’ preferences while mitigating the risks inherent in new products?

We approach this question as an online learning task. We consider a seller whose goal is to maximize cumulative profit over a selling horizon $T$. She will introduce several new products at different times during the selling period. For every customer, the seller determines some products to offer (we use “recommend” and “offer” interchangeably in this work), which may include the existing and/or new products. Based on the customer’s response, the seller updates her belief on the latent customers’ preferences (also known as product valuations), and uses the information to optimize the product selection for the next customer. As we will show in the paper, many new products with relatively low profit will never be offered under a pure profit-maximizing objective.

In reality, companies often intentionally price new products low to gain exposure and to entice customers to give them a try. Thus, many new products may have relatively low profits, yet learning from these products is crucial for understanding customers’ preferences and enabling companies to make better business decisions in the future. To model such behavior, we impose a constraint, termed the “minimum learning criterion”, which requires every new product to be offered and learned to a given accuracy. A direct implication is that the seller will bear an additional cost of learning, as she makes less money from these products. It is natural to ask what can be done to reduce such cost.

We will show that a judicious choice of presenting products is capable of mitigating some costs associated with learning new products. In our setting, products are presented in tiers, indicating the precedence in which customers discover them. For a given customer, a seller first offers the
SMNL Bandit problem. The contribution of our work is threefold:
1. We propose a novel SMNL model to capture customers’ sequential choice behavior. For the offline problem with known customers’ preferences, we provide a polynomial-time algorithm to solve the profit-maximization problem, and characterize the properties of the optimal tiered product offering.
2. For an online setting where new products are frequently launched at different times during the selling horizon, we propose an online learning algorithm for the SMNL Bandit problem, and characterize its regret bound.
3. We extend the online setting to incorporate a constraint which ensures all new products are learned to a given accuracy, and demonstrate how the tier structure in product presentation can be exploited to mitigate risks with new products.
Literature review
The first stream of work that our paper is related to is assortment optimization. It refers to the problem of selecting a set of products to offer to a group of customers so as to maximize the revenue when customers make purchases according to their preferences. It is a central topic in economics, marketing, and the operations management research literature. We refer the reader to Kök et al., 2008 for a comprehensive review. Talluri and Van Ryzin, 2004 is the first paper that models customers’ preferences with the MNL model for the assortment planning problem. Flores et al., 2018 study the assortment optimization problem with a different sequential choice model known as the perception-adjusted Luce model and characterize the optimal assortment for the offline problem. Besides modelling customers’ preferences differently, we also study the problem in the online setting and investigate the learning policy with new products.

Another related topic is the multi-armed bandit (MAB) problem (e.g., Robbins, 1985; Sutton et al., 1998). Our problem falls under the combinatorial setting (Chen et al., 2013) since the retailer’s decision is a combination of different products. A naive approach is to treat each possible combination as an arm. However, the number of arms increases exponentially with the number of products under this approach. Other combinatorial bandit work assuming linear reward (Auer, 2002; Rusmevichientong and Tsitsiklis, 2010) or independent rewards (Chen et al., 2013) cannot be directly applied to our model. Recent work on assortment optimization (such as Cheung and Simchi-Levi, 2017; Agrawal et al., 2017a,b; Sauré and Zeevi, 2013; Rusmevichientong et al., 2010) extends the MNL assortment problem from the offline setting to the online setting, where customers’ preferences are unknown a priori and need to be learned. Our work is most closely related to Agrawal et al., 2017a, but with the following key differences. Firstly, we consider multi-tiered assortments. Despite the ubiquity of such assortments in practice, there is little formal analysis in the literature on either the offline optimization problem or the online learning algorithms. Our work helps to bridge this gap. Secondly, we focus on learning in conjunction with new product launches, where we differentiate two cases depending on whether all new products need to be learned.
Problem formulation
In this section, we will formally set up our problem. We will first introduce the SMNL model, which describes the customers’ behavior, and then formulate the profit maximization problem that the seller needs to solve.
Discrete choice models such as the popular MNL model are derived under the assumption that a utility-maximizing customer chooses the product with the highest valuation among an available choice set $S$ (Train, 2009). In an SMNL model, the choice set consists of multiple tiers of products. For ease of notation, we will present a two-tier model where the choice set consists of two sets, i.e., $\mathbf{S} := (S_1, S_2)$. We will refer to $S_1$ and $S_2$ as the priority tier and the secondary tier respectively, as products in $S_1$ enjoy greater visibility. Note that all our results can be generalized to incorporate more tiers.

Customers arrive at discrete times $t = 1, \cdots, T$. A customer arriving at time $t$ is presented with a choice set $\mathbf{S}_t$ that is selected by the seller. Under the SMNL model, a customer first considers products from the priority tier $S_1$. If none are selected, she will then consider the secondary tier $S_2$ and decide whether to select any product from $S_2$. Note that no-purchase is also one of the choices that the customer can make. The probability that a customer purchases product $i$ is denoted as $p_i(\mathbf{S})$ and no-purchase as $p_0(\mathbf{S})$, i.e.,

$$
p_i(\mathbf{S}) =
\begin{cases}
\dfrac{v_i}{1 + \sum_{j \in S_1} v_j}, & \text{if } i \in S_1, \\[2ex]
\dfrac{1}{1 + \sum_{j \in S_1} v_j} \cdot \dfrac{v_i}{1 + \sum_{j \in S_2} v_j}, & \text{if } i \in S_2, \\[2ex]
\dfrac{1}{1 + \sum_{j \in S_1} v_j} \cdot \dfrac{1}{1 + \sum_{j \in S_2} v_j}, & \text{if } i = 0, \\[2ex]
0, & \text{otherwise,}
\end{cases}
$$

where $v_i$ is the product valuation or customers’ preference for product $i$, which is assumed to be less than 1. For a product from $S_1$, its purchase probability follows that of a standard MNL model. On the other hand, the probability of purchasing a product from $S_2$ is the joint probability of two events, i.e., the customer has not selected any product from $S_1$ and the customer selects a product from $S_2$.

Knowing customers’ purchase probability $p_i(\mathbf{S})$ when offering $\mathbf{S}$, the seller needs to select a subset of products from all available products to form $S_1$ and $S_2$. We assume there are two pre-determined sets of product candidates, $X_1$ and $X_2$. We want to point out that the two candidate sets need not be mutually exclusive, and can completely overlap each other. A seller has the flexibility to assign products as candidates for the priority tier based on sales, trendiness, inventory, and other business criteria.

Denote the profit of product $i$ by $r_i$ and the profit obtained from $\mathbf{S}$ by $R(\mathbf{S})$. The expected profit can be expressed as

$$
E[R(\mathbf{S})] = \sum_{i \in S_1 \cup S_2} r_i\, p_i(\mathbf{S}) = \frac{\sum_{i \in S_1} r_i v_i}{1 + \sum_{i \in S_1} v_i} + \frac{1}{1 + \sum_{i \in S_1} v_i} \cdot \frac{\sum_{i \in S_2} r_i v_i}{1 + \sum_{i \in S_2} v_i}.
$$

The seller’s optimization problem is to select two subsets of products $S_1$ and $S_2$ from the candidate sets $X_1$ and $X_2$ respectively. That is,

$$
\max_{\mathbf{S}} \; E[R(\mathbf{S})] \quad \text{s.t. } S_k \subseteq X_k, \ \forall k \in \{1, 2\}. \tag{3.1}
$$

We use $\mathbf{S}^* = (S_1^*, S_2^*)$ to denote the optimal tiered product offering.

We begin this section with a simple example to compare a two-tiered product offering with its single-tiered counterpart.
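The SMNL purchase probabilities and the expected-profit expression from the problem formulation can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the dictionary-based data layout, the convention that key `0` stands for no-purchase, and the assumption that the two tiers are disjoint are choices made here, not part of the paper. The numerical values (two products with profits 10 and 1 and valuations 0.1 and 1) are illustrative.

```python
def choice_probabilities(tier1, tier2, v):
    """Purchase probabilities under a two-tier SMNL model.

    tier1, tier2: lists of product ids (assumed disjoint; id 0 is reserved).
    v: dict mapping product id -> valuation.
    Returns a dict of probabilities; key 0 is the no-purchase option.
    """
    # Probability that the customer buys nothing from the priority tier.
    leave_tier1 = 1.0 / (1.0 + sum(v[i] for i in tier1))
    # Probability of buying nothing from the secondary tier, given it is seen.
    leave_tier2 = 1.0 / (1.0 + sum(v[j] for j in tier2))

    probs = {i: v[i] * leave_tier1 for i in tier1}
    probs.update({j: leave_tier1 * v[j] * leave_tier2 for j in tier2})
    probs[0] = leave_tier1 * leave_tier2
    return probs


def expected_profit(tier1, tier2, r, v):
    """Expected profit of the tiered offering (tier1, tier2)."""
    probs = choice_probabilities(tier1, tier2, v)
    return sum(r[i] * p for i, p in probs.items() if i != 0)


# Illustrative instance: a high-profit, low-valuation product and a
# low-profit, high-valuation product.
r = {1: 10.0, 2: 1.0}
v = {1: 0.1, 2: 1.0}
one_tier = expected_profit([1, 2], [], r, v)   # both products in a single tier
two_tier = expected_profit([1], [2], r, v)     # product 1 first, product 2 second
```

Passing an empty second tier recovers the standard single-tier MNL profit, which makes the two layouts directly comparable.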
Example 1.
Suppose there are two products with profit $r_1 = 10$, $r_2 = 1$ and valuation $v_1 = 0.1$, $v_2 = 1$ respectively. The optimal one-tier recommendation is to offer both products simultaneously and the corresponding expected profit is given by $E[R(\{1,2\})] = \frac{r_1 v_1 + r_2 v_2}{1 + v_1 + v_2} = \frac{10 \times 0.1 + 1 \times 1}{1 + 0.1 + 1} = 0.95$. The optimal two-tier recommendation is to offer product 1 on the priority tier and product 2 on the secondary tier. The resulting profit is $E[R((\{1\},\{2\}))] = \frac{10 \times 0.1}{1.1} + \frac{1}{1.1} \times \frac{1 \times 1}{1 + 1} = 1.36 > E[R(\{1,2\})]$.

This example shows that the tiered structure offers flexibility in presenting products, which translates into higher profit. Intuitively, the tiered recommendation prioritizes products with higher profits to be shown first. We can formalize this observation by analyzing the seller’s problem (3.1) in an offline setting where the product valuation $v_i$ is given.

We now introduce two definitions which will help us characterize the properties of the optimal tiered product offering.
Definition 4.1 (Profit-ordered set)
We call $S_k \subseteq X_k$ a profit-ordered set if $\min_{i \in S_k} r_i \ge \max_{i \in X_k \setminus S_k} r_i$, for $k \in \{1, 2\}$.
Definition 4.2 (Profit-ordered by tier)
If there exist $i \in S_1$ and $j \notin S_2$ such that $r_i < r_j$, then $\mathbf{S} = (S_1, S_2)$ is not profit-ordered by tier. Otherwise, it is profit-ordered by tier.
Example 4.3
Suppose X = { , , , } , X = { , , , } , with profit r i = i for all i , thenthe sets S = ( { , } , { , , } ) , ( { } , { , , , } ) are both profit-ordered by tier while the sets S = ( { , , } , { , , } ) , { (6 , , (9 , } are not. Proposition 4.4
The optimal product offering $\mathbf{S}^*$ to the optimization problem (3.1) is a profit-ordered set in each tier. In addition, $\mathbf{S}^*$ is profit-ordered by tier.

Due to the space constraint, we only include proof sketches for the key results in the paper. All detailed proofs can be found in the supplementary material.
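Because each tier of an optimal offering is a profit-ordered set, one way to search for an optimum is to scan only the profit-ordered prefixes of each candidate set rather than all subsets. The sketch below illustrates that idea under some assumptions made here: the two candidate sets are disjoint, ties in profit are broken arbitrarily by the sort, and the function names and the small instance are hypothetical. A brute-force search over all subsets is included only as a sanity check for small inputs.

```python
from itertools import combinations


def expected_profit(tier1, tier2, r, v):
    """Expected profit of the two-tier offering (tier1, tier2) under SMNL."""
    v1 = sum(v[i] for i in tier1)
    v2 = sum(v[j] for j in tier2)
    first = sum(r[i] * v[i] for i in tier1) / (1.0 + v1)
    second = sum(r[j] * v[j] for j in tier2) / ((1.0 + v1) * (1.0 + v2))
    return first + second


def profit_ordered_prefixes(candidates, r):
    """All prefixes (including the empty one) of the candidates sorted by profit."""
    ordered = sorted(candidates, key=lambda i: r[i], reverse=True)
    return [ordered[:n] for n in range(len(ordered) + 1)]


def best_by_thresholds(x1, x2, r, v):
    """Scan pairs of profit-ordered prefixes: (|X1|+1) * (|X2|+1) evaluations."""
    pairs = [(s1, s2)
             for s1 in profit_ordered_prefixes(x1, r)
             for s2 in profit_ordered_prefixes(x2, r)]
    best = max(pairs, key=lambda s: expected_profit(s[0], s[1], r, v))
    return best, expected_profit(best[0], best[1], r, v)


def best_by_brute_force(x1, x2, r, v):
    """Exhaustive search over all subset pairs; exponential, for checking only."""
    def subsets(xs):
        return [list(c) for n in range(len(xs) + 1) for c in combinations(xs, n)]
    return max(expected_profit(s1, s2, r, v)
               for s1 in subsets(x1) for s2 in subsets(x2))


# Hypothetical instance with disjoint candidate sets X1 = {0, 1}, X2 = {2, 3}.
r = {0: 1.0, 1: 0.8, 2: 0.5, 3: 0.4}
v = {0: 0.3, 1: 0.2, 2: 0.4, 3: 0.3}
offering, value = best_by_thresholds([0, 1], [2, 3], r, v)
```

On this instance the prefix scan and the exhaustive search reach the same expected profit, which is what the profit-ordered property predicts.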
Proof sketch:
We show that $S_1^*$ is profit-ordered by contradiction. Suppose there exists an $\mathbf{S}^*$ with $i \in S_1^*$ and $r_i < E[R(\mathbf{S}^*)]$; then we show that removing this product will increase the expected profit. Hence, $\mathbf{S}^*$ is not optimal. A similar argument is used to show that if $i \notin S_1^*$ and $r_i > E[R(\mathbf{S}^*)]$, then adding it to the offering will increase the profit. Next, we apply the same argument to $S_2^*$ to obtain the desired result.

To prove that $\mathbf{S}^*$ is profit-ordered by tier, notice that the expected profit of $\mathbf{S}^*$ is at least as large as that of only offering $S_2^*$, since $\mathbf{S} = (\emptyset, S_2^*)$ is also a feasible solution. We have shown that each tier in $\mathbf{S}^*$ is a profit-ordered set, i.e., for any $j \in S_1^*$, $r_j \ge E[R(\mathbf{S}^*)]$, and for any $i \notin S_2^*$, $r_i < E[R(S_2^*)]$. Therefore, $r_i \le E[R(S_2^*)] \le E[R(\mathbf{S}^*)] \le r_j$ for any $i \notin S_2^*$, $j \in S_1^*$. This completes the proof. ■

Proposition 4.4 implies that a two-tier optimal recommendation can be characterized by a pair of profit thresholds $(\theta_1, \theta_2)$ with $\theta_1 \ge \theta_2$, where $r_i \ge \theta_1$ and $r_j \ge \theta_2$ for any $i \in S_1$ and $j \in S_2$. Therefore, the seller’s optimization problem is polynomial-time solvable, since there are at most $|X_1||X_2|$ pairs of profit thresholds to enumerate. In retail, as prices are discrete and often end with 9 or .99, there are far fewer unique price points than products, and the actual search space of profit thresholds is significantly smaller.

The profit-ordered structure of the optimal tiered recommendation provides important insights regarding the placement of a new product. We will generalize the result to a setting with multiple tiers.
Proposition 4.5
Denote the optimal recommendation before and after including a new product with profit $r_m$ to a candidate set as $\mathbf{S}^* = (S_1^*, S_2^*, \cdots, S_W^*)$ and $\hat{\mathbf{S}}^*$, respectively. Define $\mathbf{S}^*_j = (S_j^*, S_{j+1}^*, \cdots, S_W^*)$. The following properties hold.
a.) $E[R(\hat{\mathbf{S}}^*_j)] \ge E[R(\hat{\mathbf{S}}^*_{j+1})]$ for any $j = 1, \cdots, W-1$.
b.) If $E[R(\mathbf{S}^*_j)] < r_m < E[R(\mathbf{S}^*_{j-1})]$ for some $j$, then $m \in \hat{S}^*_j$ but $m \notin \hat{S}^*_1 \cup \hat{S}^*_2 \cup \cdots \cup \hat{S}^*_{j-1}$.
c.) If $r_m < E[R(\mathbf{S}^*_W)]$, then $m \notin \hat{\mathbf{S}}^*$.

Proposition 4.5 states that, for a two-tier product offering, unless a new product’s profit is higher than $E[R(S_2^*)]$, where $S_2^*$ refers to what is currently being offered on the secondary tier, it will not be included. Therefore, this product will never be introduced or learned. As discussed in the introduction, many new products could have relatively low profit, but learning is crucial for providing insights to improve long-term profitability. This provides motivation for us to investigate an online learning task with a constraint to ensure all new products are learned to a given accuracy, which we will discuss in Section 6.

In the previous section, we have assumed that valuations of products are known. In practice, these quantities are not given to the seller and have to be learned.
5.1 Online setup
We consider a general setting where $K$ new products are introduced at different time stamps during a selling horizon $T$. We allow several products to be launched at the same time. We use regret to measure the performance of a learning algorithm, where the regret for a policy $\pi$ is defined as

$$
Reg_\pi(T; v) = E_\pi\left[\sum_{t=1}^{T} \big( R_t(\mathbf{S}^*, v) - R_t(\mathbf{S}_t, v) \big)\right],
$$

where $\mathbf{S}^*$ is the optimal tiered product offering when $v$ is known, while $\mathbf{S}_t$ is the tiered recommendation offered to the customer arriving at time $t$. $R_t(\mathbf{S}, v)$ denotes the profit accrued at time $t$ when offering recommendation $\mathbf{S}$.

For our learning task, we extend the framework in Agrawal et al., 2017a, which proposed a UCB-based algorithm for an online learning task with an MNL model. We want to emphasize that the tiered structure in the SMNL model significantly complicates the analysis as the decisions across the tiers are interdependent. Next, we will describe a counting process to derive an unbiased estimator of $v_i$ for $i \in \mathbf{S}$.

We divide the time horizon into epochs for the priority and the secondary tier respectively, i.e., $\mathcal{L}_1$ and $\mathcal{L}_2$. Let $\mathcal{L} = \mathcal{L}_1 \cup \mathcal{L}_2$. In each epoch $l \in \mathcal{L}_k$ for $k = 1, 2$, we offer the same product selection $S^l_k$ for tier $k$ until a no-purchase in $S^l_k$ occurs. An epoch is labeled as $l$ if and only if $l$ epochs have been completed before $t$. Let $\varepsilon^k_l$ contain all time steps during epoch $l$ when $S^l_k$ is shown to a customer.
Example 5.1
Figure 1 illustrates the counting process with an example, which shows the purchase decisions of 9 customers, i.e., $t = 1, \cdots, 9$. The first customer selects a product from the priority tier, and the second customer selects a product from the secondary tier, and so on. The table in Figure 1 shows how epochs are labeled for different tiers. Here $\mathcal{L}_1$ contains four epochs and $\mathcal{L}_2$ contains two. For $\mathcal{L}_1$, the epoch count at time $t$ is the same as the total number of no-purchases from both tiers before time $t$. Thus, when $t = 6$, epoch $l = 3$ since there is a total of 3 no-purchases across both tiers by $t = 5$. Note that for the secondary tier $k = 2$, we only keep track of the time steps and the epoch count when $S^l_2$ is shown to a customer (i.e., the customer does not purchase any product from $S^l_1$). The time steps belonging to each of the six epochs, i.e., the sets $\varepsilon^k_l$, are listed in Figure 1.

Figure 1: An illustrative example.

For any time step $t$, we use $c^k_t$ to denote the purchase decision of customer $t$ on tier $k$, i.e., $\mathbb{1}(c^k_t = i) = 1$ if the customer purchased product $i \in S_k$, and 0 otherwise. For any product $i \in S^l_1$ and $j \in S^l_2$, define $\hat{v}^{(1)}_{i,l} = \sum_{t \in \varepsilon^1_l} \mathbb{1}(c^1_t = i)$ and $\hat{v}^{(2)}_{j,l} = \sum_{t \in \varepsilon^2_l} \mathbb{1}(c^2_t = j)$ as the number of times the product is purchased in epoch $l$ as part of the priority or secondary tier selection respectively.

Let $\mathcal{T}^k_i(l)$ be the set of epochs which contain product $i$ in the tier $k$ offering before epoch $l$. Define $T^k_i(l) = |\mathcal{T}^k_i(l)|$, which denotes the number of epochs which contain $i$ in the tier $k$ offering before epoch $l$. Let $T_i(l) = T^1_i(l) + T^2_i(l)$ be the total number of epochs which contain $i$ in the tiered recommendation before epoch $l$. We compute $\bar{v}_{i,l}$ as the average number of times product $i$ is purchased per epoch, i.e.,

$$
\bar{v}_{i,l} = \frac{1}{T_i(l)} \left( \sum_{\tau \in \mathcal{T}^1_i(l)} \hat{v}^{(1)}_{i,\tau} + \sum_{\tau \in \mathcal{T}^2_i(l)} \hat{v}^{(2)}_{i,\tau} \right). \tag{5.1}
$$

Lemma 5.2
$\hat{v}^{(k)}_{i,l}$ are i.i.d. geometric random variables with mean $v_i$ (equivalently, success probability $1/(1+v_i)$) for any $l$ and $k = 1, 2$. Therefore, they are unbiased i.i.d. estimators of $v_i$.

Define the upper confidence bound on $v_i$ as follows,

$$
v^{UCB}_{i,l} := \bar{v}_{i,l} + \sqrt{\bar{v}_{i,l}\, \frac{48 \log(K(l - l_{i,0}) + 1)}{T_i(l)}} + \frac{48 \log(K(l - l_{i,0}) + 1)}{T_i(l)}, \tag{5.2}
$$

where $l_{i,0}$ is the initial launch epoch of product $i$, $\bar{v}_{i,l}$ is defined in Equation (5.1), and $K$ is the total number of products.

We briefly describe our UCB-based algorithm: in each epoch $l$, we use $v^{UCB}_l$ to compute the optimal product offering. Denote by $\tilde{\mathbf{S}}_l$ the optimal product set when the value of products is $v^{UCB}_l$, and by $\mathbf{S}^*$ the optimal set selected from the entire candidate sets including the new product. To bound the profit difference between $\mathbf{S}^*$ and $\tilde{\mathbf{S}}_l$, we derive the following result.
Lemma 5.3
Assume $0 \le v_i \le v^{UCB}_i$ for all $i = 1, \cdots, K$. Suppose $\mathbf{S}^*$ is an optimal tiered recommendation when the parameters of the SMNL model are given by $v$. Then $E[R(\mathbf{S}^*, v^{UCB})] \ge E[R(\mathbf{S}^*, v)]$.

Lemma 5.3 is a key step in the regret analysis for this UCB-based algorithm. Since $\tilde{\mathbf{S}}_l$ is optimal under $v^{UCB}$, we have $E[R(\tilde{\mathbf{S}}_l, v^{UCB})] \ge E[R(\mathbf{S}^*, v^{UCB})]$. Combined with Lemma 5.3, on the “large probability” event that $0 \le v_i \le v^{UCB}_i$ for all $i = 1, \cdots, K$, we can bound the difference $E[R(\mathbf{S}^*, v)] - E[R(\tilde{\mathbf{S}}_l, v)]$ by $E[R(\tilde{\mathbf{S}}_l, v^{UCB})] - E[R(\tilde{\mathbf{S}}_l, v)]$. We will expand the regret analysis with more details in the next section, where we impose an additional constraint on our learning task, as the current setting is a special case when the constraint is absent.
Regret analysis with the minimum learning criterion
As we have discussed in Section 4, by default a new product will only be included in the product offering if its profit $r_m \ge E[R(S_2)]$, where $S_2$ is the current product offering at the secondary tier. In other words, new products with profit $r_m < E[R(S_2)]$ will never be offered and are deprived of the learning opportunity. To have a more realistic setting, we will formally define a minimum learning constraint. We will then investigate a learning algorithm and quantify its resulting regret, starting with a single new product and later generalizing to multiple new products.

We impose a constraint in our learning task to ensure that every product will be offered at least a certain number of times, allowing us to learn its valuation to a certain accuracy. More specifically, we require the estimated valuation $\bar{v}_i$ of every new product to be within $\epsilon$ of the true $v_i$ with probability at least $1 - \alpha$, where $\epsilon$ and $\alpha$ are two pre-determined parameters. We derive the following lemma which specifies the number of epochs $M$ needed to achieve a given level of estimation accuracy.
Lemma 6.1 (Minimum learning criterion)
For any $\epsilon$ and $\alpha > 0$, if the number of epochs $M \ge \frac{192 \log(2/\alpha + 1)}{(-1 + \sqrt{1 + 4\epsilon})^2}$, then $\bar{v}_i$ is within the $\epsilon$ confidence bound of $v_i$ with probability at least $1 - \alpha$. That is, $P(|\bar{v}_{i,l} - v_i| > \epsilon) < \alpha$ if $T_i(l) > \frac{192 \log(2/\alpha + 1)}{(-1 + \sqrt{1 + 4\epsilon})^2}$.

We want to emphasize that the constraint only affects a subset of new products which are otherwise excluded from being offered due to their relatively low profitability. Once they are offered and $M$ samples have been collected, they will be dropped from future product recommendations. On the other hand, new products (along with some existing products) with relatively high profit will continue to be offered after $M$ epochs and the estimates of their product valuations will be further improved. This echoes what typically happens after product launches, where companies choose to continue or stop certain new products based on market response.
6.2 Learning with $r_m < E[R(S_2)]$
In this section, we focus on a setting where a single new product with low profit is launched in the middle of a selling horizon. Part of our goal is to determine the best way to include this product in learning.

By Proposition 4.5, this low-profit product will be excluded from learning by default. In order to satisfy the minimum learning criterion, this new product will have to be offered for $M$ epochs, where $M$ is determined by Lemma 6.1. There are two possible strategies for us to learn this new product, i.e., either assigning it to the priority tier or to the secondary tier.

Which strategy is better is not immediately clear. While the duration of an epoch is shorter when a product is placed on the priority tier, it could also mean that more of this product will be purchased, and hence more profit loss and higher regret. On the other hand, even though a product placed on the secondary tier might make fewer sales, the duration of a single epoch could be much longer and the resulting regret could still be high, since other products (in addition to the new product) also contribute to the total regret. We now formally compare the two strategies by quantifying the corresponding regrets incurred during a single epoch.
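The epoch budget $M$ from Lemma 6.1 can be evaluated numerically to get a feel for how demanding the minimum learning criterion is. The helper below is a rough sketch that assumes the bound reads $M \ge 192 \log(2/\alpha + 1) / (\sqrt{1 + 4\epsilon} - 1)^2$; the denominator is a reading of the printed expression and should be checked against the original lemma. The function name `min_epochs` is hypothetical.

```python
import math


def min_epochs(eps, alpha):
    """Smallest integer M satisfying the (assumed) bound of Lemma 6.1.

    Assumed form: M >= 192 * log(2 / alpha + 1) / (sqrt(1 + 4 * eps) - 1) ** 2.
    eps: target accuracy for the valuation estimate (must be positive).
    alpha: allowed failure probability (must lie in (0, 1)).
    """
    if eps <= 0.0 or not 0.0 < alpha < 1.0:
        raise ValueError("need eps > 0 and 0 < alpha < 1")
    bound = 192.0 * math.log(2.0 / alpha + 1.0) / (math.sqrt(1.0 + 4.0 * eps) - 1.0) ** 2
    return math.ceil(bound)


if __name__ == "__main__":
    # Tighter accuracy or higher confidence both raise the required epochs.
    for eps in (0.05, 0.1, 0.2):
        print(eps, min_epochs(eps, alpha=0.05))
```

Under this reading the budget grows roughly like $1/\epsilon^2$ for small $\epsilon$ and only logarithmically in $1/\alpha$, so accuracy is far more expensive than confidence.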
Strategy 1: Assigning new product to the priority tier
Let $S_1' = S_1 \cup \{m\}$, $S_2' = S_2$, and $\mathbf{S}' = (S_1', S_2')$. Let $N$ denote the number of times $S_1'$ has been shown to customers until a no-purchase occurs. Note that $N$ follows the geometric distribution with mean $1 + \sum_{j \in S_1'} v_j$, which depends on the valuation of all products in $S_1'$.

Define the regret function during one epoch when the new product is included in the first tier as $G^{(1)}(\mathbf{S}, v)$, i.e.,

$$
G^{(1)}(\mathbf{S}, v) := E\left[\sum_{t=1}^{N} \big( R_t(\mathbf{S}^*, v) - R_t(\mathbf{S}', v) \big)\right],
$$

where $\mathbf{S} = (S_1, S_2)$ and $\mathbf{S}' = (S_1', S_2) = (S_1 \cup \{m\}, S_2)$.
Strategy 2: Assigning new product to the secondary tier
Let $S_1'' = S_1$, $S_2'' = S_2 \cup \{m\}$, and $\mathbf{S}'' = (S_1'', S_2'')$. Here $N$ denotes the number of times $\mathbf{S}''$ has been shown to customers until a no-purchase from the entire product offering (i.e., both tiers) occurs. $N$ follows the geometric distribution with mean $(1 + \sum_{j \in S_1} v_j)(1 + \sum_{j \in S_2''} v_j)$.

Similarly, we define the corresponding regret function as follows,

$$
G^{(2)}(\mathbf{S}, v) := E\left[\sum_{t=1}^{N} \big( R_t(\mathbf{S}^*, v) - R_t(\mathbf{S}'', v) \big)\right], \tag{6.1}
$$

where $\mathbf{S} = (S_1, S_2)$ and $\mathbf{S}'' = (S_1, S_2'') = (S_1, S_2 \cup \{m\})$.

To compare the two strategies, we first need to determine the optimal action under a given strategy, then evaluate its “best” loss. The strategy which yields the lower regret is then considered the “better” strategy. Let $\mathbf{Q}^*$ and $\mathbf{Q}'^*$ denote the optimal solutions that minimize the regret $G^{(1)}$ and $G^{(2)}$, respectively, i.e., $\mathbf{Q}^* = \arg\min_{\mathbf{S}} G^{(1)}(\mathbf{S}, v)$ and $\mathbf{Q}'^* = \arg\min_{\mathbf{S}} G^{(2)}(\mathbf{S}, v)$.
Theorem 6.2
The optimal solution to $G^{(1)}$ and $G^{(2)}$ is the same as $\mathbf{S}^*$. That is, $\mathbf{Q}^* = \mathbf{Q}'^* = \mathbf{S}^*$. In addition, we have $G^{(1)}(\mathbf{S}^*, v) \ge G^{(2)}(\mathbf{S}^*, v) = v_m \big( E[R(S_2^*)] - r_m \big)$.

The implication of Theorem 6.2 is twofold. Firstly, it shows that the optimal offerings excluding the new product are identical for both strategies, irrespective of which tier the new product has been added to. In addition, they are also the same as the optimal offering $\mathbf{S}^*$ before the new product is added. In other words, there is no need to re-solve the optimization problem with the added new product. Thus, it provides a simple learning algorithm for a new product with $r_m < E[R(S_2^*)]$: it is optimal to just add it to the secondary tier of the existing optimal product offering to satisfy the learning criterion.

Secondly, Theorem 6.2 also shows that with this optimal product offering $\mathbf{S}^*$, the regret is lower when the new product is added to the secondary tier. This result highlights the advantage of showcasing product recommendations in multiple tiers, in the sense that we incur a smaller loss by displaying new products with higher risks (i.e., lower profit) on tiers with lower priorities.

This section focuses on a general setting similar to the one addressed in Section 5.1, except with the minimum learning constraint in place.
Algorithm 1
Exploration-Exploitation algorithm for SMNL-bandit with new products
Initialization: input $M$, $T$; $l_2 = 0$;
repeat
    input product sets $X_1$ and $X_2$;
    $\mathcal{N} = \{ i : T^1_i(l_2) + T^2_i(l_2) < M \}$;
    compute $\tilde{\mathbf{S}}_{l_2}$ given valuation $v^{UCB}_{l_2}$;
    $H_{l_2} = \emptyset$;
    for $i \in \mathcal{N}$ do
        if $i \notin \tilde{\mathbf{S}}_{l_2}$ then
            $H_{l_2} = H_{l_2} \cup \{i\}$;
        end if
    end for
    offer $(\tilde{S}^{l_2}_1, \tilde{S}^{l_2}_2 \cup H_{l_2})$, observe the purchasing decision $c_t = c^1_t \cup c^2_t$;
    $l_1 = l_2$;
    repeat
        if $c^1_t = \emptyset$ then
            compute $\hat{v}^{(1)}_{i,l_1} = \sum_{t \in \varepsilon^1_{l_1}} \mathbb{1}(c^1_t = i)$;
            update $\mathcal{T}^1_i(l_1) = \{ \tau \le l_1 \mid i \in S^\tau_1 \}$, $T^1_i(l_1) = |\mathcal{T}^1_i(l_1)|$, no. of epochs until $l_1$ that offered product $i$ in the first tier;
            update $\bar{v}_{i,l_1}$ and $v^{UCB}_{i,l_1}$ according to Eq (5.1) and Eq (5.2);
            $l_1 = l_1 + 1$;
            compute $\tilde{S}^{l_1}_1$ given $\tilde{S}^{l_2}_2$ and $v^{UCB}_{l_1}$ and offer $(\tilde{S}^{l_1}_1, \tilde{S}^{l_2}_2 \cup H_{l_2})$, observe the purchasing decision $c_t$;
            $\varepsilon^1_{l_1} = \varepsilon^1_{l_1} \cup t$;
        else
            offer $(\tilde{S}^{l_1}_1, \tilde{S}^{l_2}_2 \cup H_{l_2})$, observe the purchasing decision $c_t$;
            $\varepsilon^1_{l_1} = \varepsilon^1_{l_1} \cup t$; $\varepsilon^2_{l_2} = \varepsilon^2_{l_2} \cup t$;
        end if
        $t = t + 1$;
    until $t = T$ or $c_t = \emptyset$
    compute $\hat{v}^{(1)}_{i,l_1} = \sum_{t \in \varepsilon^1_{l_1}} \mathbb{1}(c^1_t = i)$; $\varepsilon^1_{l_1} = \varepsilon^1_{l_1} \cup t$;
    update $\mathcal{T}^1_i(l_1) = \{ \tau \le l_1 \mid i \in S^\tau_1 \}$, $T^1_i(l_1) = |\mathcal{T}^1_i(l_1)|$, no. of epochs until $l_1$ that offered product $i$ in the first tier;
    $l_1 = l_1 + 1$;
    compute $\hat{v}^{(2)}_{i,l_2} = \sum_{t \in \varepsilon^2_{l_2}} \mathbb{1}(c^2_t = i)$; $\varepsilon^2_{l_2} = \varepsilon^2_{l_2} \cup t$;
    update $\mathcal{T}^2_i(l_2) = \{ \tau \le l_2 \mid i \in \tilde{S}^\tau_2 \cup H_\tau \}$, $T^2_i(l_2) = |\mathcal{T}^2_i(l_2)|$, no. of epochs until $l_2$ that offered product $i$ in the second tier;
    update $\bar{v}_{i,l_2}$ and $v^{UCB}_{i,l_2}$ according to Eq (5.1) and Eq (5.2);
    $l_2 = l_1 + 1$; $t = t + 1$;
until $t = T$

We propose Algorithm 1 to dynamically offer the recommendation which simultaneously explores and exploits. In Algorithm 1, for each epoch $l$, we compute the optimal tiered recommendation $\tilde{\mathbf{S}}_l$ given valuation $v^{UCB}_l$. Based on Theorem 6.2, for any new product $i \in \mathcal{N}$ which is not included in $\tilde{\mathbf{S}}_l$, we add it to the second tier $\tilde{S}^l_2$. At the end of each epoch, we update $\bar{v}_l$ and $v^{UCB}_l$, which will be used to compute the recommendation for the next epoch.

We are now ready to present an upper bound on the regret for Algorithm 1. We provide a proof sketch here and the detailed proof can be found in the Supplementary Material.
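The rule of appending under-explored low-profit products to the secondary tier can be checked numerically on a small instance. The sketch below computes the expected regret of a single epoch for each placement as the expected epoch length times the per-customer profit gap (Wald's identity, since customers are i.i.d. within an epoch), and compares it with the closed forms $v_m(E[R(\mathbf{S}^*)] - r_m)$ for the priority tier and $v_m(E[R(S_2^*)] - r_m)$ for the secondary tier, which a direct calculation suggests. The instance, the product ids, and all function names are hypothetical, and the base offering is taken to be the optimal offering for these illustrative numbers.

```python
def tier_profit(tier, r, v):
    """Expected profit of a single MNL tier shown on its own."""
    return sum(r[i] * v[i] for i in tier) / (1.0 + sum(v[i] for i in tier))


def expected_profit(tier1, tier2, r, v):
    """Expected profit of the two-tier offering (tier1, tier2) under SMNL."""
    v1 = sum(v[i] for i in tier1)
    return (sum(r[i] * v[i] for i in tier1) + tier_profit(tier2, r, v)) / (1.0 + v1)


def epoch_regret(base, new_product, tier, r, v):
    """Expected regret over one epoch when new_product is appended to `tier`.

    base: (tier1, tier2) offering used as the benchmark.
    tier: 1 for the priority tier, 2 for the secondary tier.
    """
    s1, s2 = base
    if tier == 1:
        trial = (s1 + [new_product], s2)
        # Epoch ends at the first no-purchase in the priority tier.
        mean_length = 1.0 + sum(v[i] for i in trial[0])
    else:
        trial = (s1, s2 + [new_product])
        # Epoch ends at the first no-purchase across both tiers.
        mean_length = (1.0 + sum(v[i] for i in s1)) * (1.0 + sum(v[j] for j in trial[1]))
    gap = expected_profit(s1, s2, r, v) - expected_profit(trial[0], trial[1], r, v)
    return mean_length * gap


# Hypothetical instance; product 4 is the new low-profit product m.
r = {0: 1.0, 1: 0.8, 2: 0.5, 3: 0.4, 4: 0.1}
v = {0: 0.3, 1: 0.2, 2: 0.4, 3: 0.3, 4: 0.5}
base = ([0, 1], [2, 3])
g1 = epoch_regret(base, 4, 1, r, v)  # new product on the priority tier
g2 = epoch_regret(base, 4, 2, r, v)  # new product on the secondary tier
```

On this instance the secondary-tier placement gives the smaller per-epoch regret, in line with Theorem 6.2.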
Theorem 6.3 (Performance bound for Algorithm 1)
The regret during time $[0, T]$ is bounded above by

$$
Reg_\pi(T; v) \le C K \log(KT) + C \sqrt{T K \log(KT)} + M \sum_{i \in X} v_i (r_{\max} - r_i),
$$

for some constant $C$, where $r_{\max}$ is the highest profit of products among $X$, and $K$ is the total number of products.

Proof sketch: We first rewrite the regret in terms of the epochs. Note that one learning epoch on the secondary tier may correspond to multiple learning epochs on the priority tier. Let $\kappa(l)$ denote the set of epochs on tier 1 which correspond to epoch $l \in \mathcal{L}_2$. In the example shown in Figure 1, $\kappa(0)$ and $\kappa(3)$ each contain two priority-tier epochs. Thus, the regret until time $T$ can be expressed as

$$
Reg_\pi(T; v) = E_\pi\left[\sum_{l \in \mathcal{L}_2} \sum_{j \in \kappa(l)} \sum_{t \in \varepsilon^1_j} \Big( R_t(\mathbf{S}^*_j, v) - R_t\big((\tilde{S}^j_1, \tilde{S}^l_2 \cup H_l), v\big) \Big)\right],
$$

where the set $H_l$ denotes the set of new products with low profit which are added to the second tier at epoch $l \in \mathcal{L}_2$ to satisfy the minimum learning criterion.

Define the “large probability” event $A_l = \bigcap_{i=1}^{K} \Big\{ v^{UCB}_{i,l} - C \sqrt{\frac{v_i \log(K(l - l_{i,0}) + 1)}{T_i(l)}} - C \frac{\log(K(l - l_{i,0}) + 1)}{T_i(l)}$

Experiment 1 (Robustness study)
We consider a setting where $X$ contains 80 products with profit $r_i$ uniformly distributed on [0,1] and 20 products with $r_i$ uniformly distributed on [0,0.2]. We compare four scenarios, in which the product valuation $v_i$ is uniformly distributed on [0,0.1], [0,0.2], [0,0.3], and [0,0.5]. A new product is introduced after every 800 time steps. We set $M = 100$ for the minimum learning criterion.

Figure 2 shows the results based on 10 independent simulations for different distributions of $v$. The average regrets are 129.87, 243.38, 348.31, and 620.14 for the four scenarios. Notice that both the mean and variance of the regret increase with the support of $v$. This implies that the learning process is harder when the product valuations $v$ lie on a larger support and have higher variability.
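Experiments of this kind rest on two building blocks: simulating a customer's choice under the SMNL model, and turning purchase counts per epoch into valuation estimates and upper confidence bounds. The sketch below illustrates both for a fixed two-tier offering. It is not the authors' experimental code: the function names, the fixed offering, the disjoint tiers, the seed, and the instance are assumptions made here, and the small bias from a final unfinished epoch is ignored.

```python
import math
import random


def pick(tier, v, rng):
    """MNL choice within one tier; returns a product id or None for no-purchase."""
    options = [None] + list(tier)
    weights = [1.0] + [v[i] for i in tier]  # weight 1 for the no-purchase option
    return rng.choices(options, weights=weights, k=1)[0]


def estimate_valuations(tier1, tier2, v, num_customers, seed=0):
    """Simulate customers on a fixed offering and estimate v by epoch averages.

    A tier-1 epoch ends at each no-purchase in tier 1; a tier-2 epoch ends at
    each no-purchase in tier 2. The estimate for a product is its purchase
    count divided by the number of completed epochs of its tier.
    """
    rng = random.Random(seed)
    purchases = {i: 0 for i in list(tier1) + list(tier2)}
    epochs = {1: 0, 2: 0}
    for _ in range(num_customers):
        first = pick(tier1, v, rng)
        if first is not None:
            purchases[first] += 1
            continue
        epochs[1] += 1                  # customer moved on to the secondary tier
        second = pick(tier2, v, rng)
        if second is not None:
            purchases[second] += 1
        else:
            epochs[2] += 1              # customer left without buying anything
    estimates = {i: purchases[i] / max(epochs[1], 1) for i in tier1}
    estimates.update({j: purchases[j] / max(epochs[2], 1) for j in tier2})
    return estimates, epochs


def ucb(v_bar, num_epochs, epoch_index, num_products, launch_epoch=0):
    """Upper confidence bound in the form of Eq (5.2)."""
    bonus = 48.0 * math.log(num_products * (epoch_index - launch_epoch) + 1.0) / num_epochs
    return v_bar + math.sqrt(v_bar * bonus) + bonus


if __name__ == "__main__":
    true_v = {0: 0.3, 1: 0.2, 2: 0.4, 3: 0.3}
    est, counts = estimate_valuations([0, 1], [2, 3], true_v, num_customers=100000)
    for i in sorted(true_v):
        print(i, true_v[i], round(est[i], 3))
```

The per-epoch purchase counts have mean $v_i$, so the averages settle near the true valuations as epochs accumulate, while the confidence bonus shrinks at rate $1/\sqrt{T_i(l)}$.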
Experiment 2 (Comparison with a explore-then-exploit benchmark) The benchmarkwe consider is adapted from Saur´e and Zeevi, 2013. As shown in Section 4, there are at most | X || X | candidates which are profit-ordered by tier. In the exploration phase of the benchmarkalgorithm, every candidate whose profit is higher than the current optimum is offered for at least γ log( t ) times, where γ is a tuning parameter. In the exploitation phase, the algorithm uses theestimated parameters to determine a tiered offering with the highest expected profit and offer it toall customers.For the experiment, consider the setting that X contains 12 products, where the profit r i of 8 of17 T R eg r e t v~unif[0,0.1] T R eg r e t v~unif[0,0.2] T R eg r e t v~unif[0,0.3] T R eg r e t v~unif[0,0.5] Figure 2: Comparison of regrets generated under Algorithm 1 for four different scenarios.them are uniformly distributed on [0,1], and that of 4 products on [0,0.2]. The valuation v i isuniformly distributed on [0,0.1]. For ease of comparison, all products are launched at t = 0. Set M = 100.Figure 3 shows the results based on 10 independent simulation. It depicts the superiority of ouralgorithm over the benchmark, where the average regrets are 14.39 and 247.78 under Algorithm 1and the benchmark respectively. Experiment 3 (Comparison with an alternative learning strategy for new products) We have shown in Algorithm 1 that new products with profit lower than E [ R ( S ∗ )] will be added tothe secondary tier. In this experiment, we compare it with an alternative strategy where those newproducts with low profit will be randomly added to either tier with equal probability for learning.To be precise, we consider a setting where X contains 20 products with profit uniformly distributedon [0.5,1] and valuation on [0,0.1]. X contains 30 products with profit uniformly distributed on[0,0.6] and valuation on [0,0.2]. 
We compute the optimal product offering as the current offering based on these values. Next, we assume 15 new products with profit uniformly distributed on [0,0.55] and valuation on [0,0.3] are launched at time $t = 0$. For the benchmark, new products with profit below $E[R(S^*)]$ will be randomly added to one of the tiers. Set $M = 300$.

[Figure 3: Comparison of Algorithm 1 with an explore-then-exploit benchmark algorithm. Panels: Algorithm 1, Benchmark; axes: T, Regret.]

[Figure 4: Comparison with an alternative learning strategy for new products. Panels: Adding to tier 2, Adding randomly; axes: T, Regret.]

As shown in Figure 4, the average regrets are 102.21 under Algorithm 1 and 178.00 under the alternative strategy. This highlights the benefit of a tiered offering: the secondary tier can be used to mitigate some of the profit risk incurred when learning about new products.

Conclusion

In this work, we studied a product selection problem with an SMNL model which specifies the order in which products are presented. For the offline setting where the product valuations are known, a polynomial-time algorithm was provided. For the online setting, we analyzed a novel setup where multiple new products could arrive in the middle of a selling period. We proposed an online algorithm and characterized its regret, both with and without the minimum learning criterion.

There are several future directions for this work. For instance, products' valuations may vary with time, especially for fashion and technology products. Thus, there is a need for an online algorithm that learns the dynamic valuations. In addition, it would be interesting to utilize customer attribute data and historical sales data to provide personalized recommendations.

References

Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. (2017a). MNL-bandit: A dynamic learning approach to assortment selection. arXiv preprint arXiv:1706.03880.
Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. (2017b).
Thompson sampling for the MNL-bandit. arXiv preprint arXiv:1706.00977.
Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422.
Chen, W., Wang, Y., and Yuan, Y. (2013). Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pages 151–159.
Cheung, W. C. and Simchi-Levi, D. (2017). Thompson sampling for online personalized assortment optimization problems with multinomial logit choice models.
Flores, A., Berbeglia, G., and Van Hentenryck, P. (2018). Assortment optimization under the sequential multinomial logit model. European Journal of Operational Research.
Kök, A. G., Fisher, M. L., and Vaidyanathan, R. (2008). Assortment planning: Review of literature and industry practice. In Retail Supply Chain Management, pages 99–153. Springer.
Robbins, H. (1985). Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pages 169–177. Springer.
Rusmevichientong, P., Shen, Z.-J. M., and Shmoys, D. B. (2010). Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Operations Research, 58(6):1666–1680.
Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411.
Sauré, D. and Zeevi, A. (2013). Optimal dynamic assortment planning with demand learning. Manufacturing & Service Operations Management, 15(3):387–404.
Sutton, R. S., Barto, A. G., et al. (1998). Reinforcement learning: An introduction. MIT Press.
Talluri, K. and Van Ryzin, G. (2004). Revenue management under a general discrete choice model of consumer behavior. Management Science, 50(1):15–33.
Train, K. E. (2009).