Reserve Price Optimization for First Price Auctions
Zhe Feng*ᵃ, Sébastien Lahaieᵇ, Jon Schneiderᵇ, and Jinchao Yeᵇ

ᵃ John A. Paulson School of Engineering and Applied Sciences, Harvard University. zhe [email protected]
ᵇ Google Inc, NYC. slahaie, jschnei, [email protected]

* Supported by NSF award CCF-1841550 and a Google PhD Fellowship. This work was done while the first author was an intern at Google Inc, NYC.
June 28, 2020
Abstract
The display advertising industry has recently transitioned from second- to first-price auctions as its primary mechanism for ad allocation and pricing. In light of this, publishers need to re-evaluate and optimize their auction parameters, notably reserve prices. In this paper, we propose a gradient-based algorithm to adaptively update and optimize reserve prices based on estimates of bidders' responsiveness to experimental shocks in reserves. Our key innovation is to draw on the inherent structure of the revenue objective in order to reduce the variance of gradient estimates and improve convergence rates in both theory and practice. We show that revenue in a first-price auction can be usefully decomposed into a demand component and a bidding component, and introduce techniques to reduce the variance of each component. We characterize the bias-variance trade-offs of these techniques and validate the performance of our proposed algorithm through experiments on synthetic data and real display ad auction data from the Google ad exchange.
Introduction

A reserve price in an auction specifies a minimum acceptable winning bid, below which the item remains with the seller. The reserve price may correspond to some outside offer, or the value of the item to the seller itself, and more generally may be set to maximize expected revenue [25]. In a data-rich environment like online advertising auctions it becomes possible to learn a revenue-optimal reserve price over time, and there is a substantial literature on optimizing reserve prices for second-price auctions, which have been commonly used to allocate ad space [27, 23, 24].

In this work we examine the problem of reserve price optimization in first-price (i.e., pay-your-bid) auctions, motivated by the fact that all the major ad exchanges have recently transitioned to this auction format as their main ad allocation mechanism [10, 6].¹ First-price auctions have grown in favor because they are considered more transparent, in the sense that there is no uncertainty in the final price upon winning [5]. Unless restrictive assumptions are met, there is in theory no revenue ranking between first- and second-price auctions [19], and there is no guarantee that reserve prices optimized for second-price auctions will continue to be effective in a first-price setting.

From a learning standpoint, the shift from second- to first-price auctions introduces several new challenges. In a second-price auction, truthful bidding is a dominant strategy no matter what the reserve. The bidders' value distributions are therefore readily available, and bids stay static (in principle) as the reserve is varied. In a first-price auction, in contrast, bidders have an incentive to shade their values when placing their bids, and bid-shading strategies can vary by bidder. The gain from setting a reserve price now comes if (and only if) it induces higher bidding, so an understanding of bidder responsiveness becomes crucial to setting effective reserves.

Bid adjustments in response to a reserve price can occur at different timescales. If a bidder observes that it wins too few auctions because of the reserve price, it may increase its bid in the long term (in a matter of hours up to weeks). Our focus here is on setting reserve prices by taking into account immediate bidder responses to reserves. We assume that each bidder has a fixed, unknown bidding function b(r, v) that depends on its private value v and the observed auction reserve r. This agrees with practice in display ad auctions because the reserve r is normally sent out in the 'bid request' message to potential bidders [18]. To the extent that the bid function responds to r, first-price reserves can potentially show an immediate positive effect on revenue.

¹ The full reasons for the transition are complex, and include the rise of "header bidding" [30]. A header bidding auction is a first-price auction usually triggered by code in a webpage header (hence the name).

Our Results
We propose a gradient-based approach to adaptively improve and optimize reserve prices, where we perturb current reserves upwards and downwards (e.g., by 10%) on random slices of traffic to obtain gradient estimates.

Our key innovation is to draw on the inherent structure of the revenue objective in order to reduce the variance of gradient estimates and improve convergence rates in both theory (e.g., see Corollary 4.1) and practice. We show that revenue in a first-price auction can be usefully decomposed into two terms: a demand curve component which depends only on the bidder's value distribution, and a bidding component whose variance can be reduced based on natural assumptions on bidding functions.

A demand curve is a simpler, more structured object than the original revenue objective (e.g., it is downward-sloping), so the demand component lends itself to parametric modeling to reduce the variance. We offer two variance reduction techniques for the bidding component, referred to as bid truncation and quantile truncation.² Bid truncation can strictly decrease variance with no additional bias assuming the right bidding model, whereas quantile truncation may introduce bias but is less sensitive to assumptions on the bidding model.

We evaluate our approach over synthetic data where bidder values are drawn uniformly, and also over real bid distributions collected from the logs of the Google ad exchange with different bidder response models. Our experimental results confirm that the combination of variance reduction on both objective components leads to the fastest convergence rate. For the demand component, a simple logistic model works well over the synthetic (i.e., uniform) data, but a flexible neural net is needed over the semi-synthetic data. For the bidding component, we find that quantile truncation is much more robust to assumptions on the bidding model.

² Variance reduction of the bidding component relies on the insight that bids far above the reserves are little affected by them (under natural bidding models), so these bids can be filtered out when computing gradient estimates; changes in such bids are likely due to noise rather than any effect of reserves.

Related Work
This paper connects with the rich literature on reserve price optimization for auctions, e.g., [25, 29]. How to set optimal reserve prices in second-price auctions based on access to bidders' historical bid data has been an increasingly popular research direction in the machine learning community, e.g., [26, 23, 24]. Another related line of work uses no-regret learning for revenue optimization against strategic bidders in repeated second-price auctions, e.g., [3, 17]. In this paper, instead of assuming bidders act strategically, we assume each bidder has a fixed bidding function in response to reserves. This is a common assumption in large market settings and in the dynamic pricing literature [21].

The algorithms developed in this paper are related to the literature on online convex optimization with bandit feedback [11, 16, 1, 2]. However, there are two key differences with our work: (1) the revenue function in a first-price auction is non-convex, and (2) the seller cannot obtain perfect revenue feedback under perturbed reserves with just a single query (i.e., auction); the seller needs multiple queries to achieve accurate estimates with high confidence. Our algorithm is also related to zeroth-order stochastic gradient methods [14, 4, 13, 20], which we discuss in detail later in Section 3.
Preliminaries

We consider a setting where a seller repeatedly sells a single item to a set of m bidders via a first-price auction. In such an auction, the seller first sends out a reserve price r to all bidders. Each bidder i then submits a bid b_i. The bidder with the highest bid larger than r wins the item and pays their bid; if no bidder bids above r, the item goes unallocated. Note that the type of reserve price we consider in this work is anonymous, in the sense that each bidder sees the same reserve price.

Each bidder i has a private valuation v_i ∈ [0, 1]³ for the item, where each value v_i is drawn independently (but not necessarily identically) from some unknown distribution. In a first-price auction, only the highest bid matters for both allocation and pricing. Thus, to simplify the notation, we write v = max_i v_i to denote the maximum value; v is drawn i.i.d. from an unknown distribution F across each auction. Our analysis from here on will refer to this 'representative' highest bidder. (See Appendix A for a rigorous justification of why we can reduce multiple bidders to a single bidder.)

We write b(r, v) to denote the maximum bid when the reserve price is r and the maximum value is v, and B(r) to denote the distribution of b(r, v) for a fixed r when v is drawn according to F. The main goal of the seller considered in this work is to learn the optimal reserve price r ∈ [0, 1] maximizing the expected revenue

    µ(r) = E_{v∼F}[ b(r, v) · 1{b(r, v) ≥ r} ].    (1)

Note that there is no reason for a bidder to bid a positive value less than the reserve r: such a bid is guaranteed to lose. Therefore, without loss of generality we can assume that if b(r, v) < r, then b(r, v) = 0. This allows us to write the revenue simply as µ(r) = E_{b∼B(r)}[b] = E_{v∼F}[b(r, v)]. In this paper, we focus on maximizing the revenue function µ(r) in the steady state.

³ The restriction to valuations in [0, 1] is without loss of generality; our analysis can easily be applied to any bounded valuation setting.
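For concreteness, the objective in Eq. (1) can be estimated by straightforward Monte Carlo simulation. The following is a minimal Python sketch; the value sampler sample_value and the bidding function bid are hypothetical placeholders for the unknown distribution F and a bidder response model, not part of the paper's formal setup.

```python
import numpy as np

def estimate_revenue(reserve, bid, sample_value, n=100_000, seed=0):
    """Monte Carlo estimate of mu(r) = E_{v~F}[ b(r,v) * 1{b(r,v) >= r} ].

    `bid(r, v)` and `sample_value(rng)` are caller-supplied placeholders
    for a bidder response model and the maximum-value distribution F.
    """
    rng = np.random.default_rng(seed)
    values = np.array([sample_value(rng) for _ in range(n)])
    bids = np.array([bid(reserve, v) for v in values])
    # Bids below the reserve lose and yield zero revenue.
    return float(np.mean(np.where(bids >= reserve, bids, 0.0)))
```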
Response Models

We begin by describing some general properties of bidding functions that hold for any utility-maximizing bidders (see [22] for further discussion).

Definition 2.1. A bidding function b(r, v) satisfies the following properties: 1) b(r, v) ≤ v for all v; 2) b(r, v) ≥ r for v ≥ r; 3) b(r, v) = 0 for v < r; 4) b(r, v) is non-decreasing in v for all r.

In some of our algorithms, we would like to impose additional constraints on the response model which, while not a consequence of utility-maximizing behavior, are likely to hold in practice. One such constraint is diminishing sensitivity in value of bid to reserve. This says that bidders with a larger value will change their bid less in response to a change in reserves.
Definition 2.2 (Diminishing sensitivity of bid to reserve). If v_H > v_L ≥ r, then for any δ > 0 such that v_L ≥ r + δ, we have

    b(r + δ, v_H) − b(r, v_H) ≤ b(r + δ, v_L) − b(r, v_L).

One natural and concrete example of a response model is a bidder that increases its bid to the reserve as long as the reserve is below its value. We refer to this as the perfect response model, formally defined as follows.
Definition 2.3. A perfect response bidding function takes the form:

    b(r, v) = b(0, v)   if b(0, v) ≥ r,
              r         if b(0, v) < r ≤ v,
              0         if v < r.

Note that the perfect response model is based on the original bid of the bidder under reserve price 0, namely b(0, v). If b(0, v) is already above the reserve, then this bidder is unaffected by the reserve. Note that the perfect response model satisfies the diminishing sensitivity property.

In practice, bidders are unlikely to exactly follow the perfect response model; for example, bidders will often increase their bid to some amount strictly above the reserve r so as to remain competitive with other bidders. For this reason, we propose a relaxation of the perfect response model which we call the ε-bounded response model: the bid is at most ε greater than what it would have been under the perfect response model if b(0, v) < r ≤ v (see also Definition C.6). Note that the ε-bounded response model becomes the perfect response model when ε = 0.

The first-price auction setting introduces several challenges for setting reserve prices. First, the seller cannot observe true bidder values because truthful bidding is not a dominant strategy in a first-price auction. Second, how the bidders will react to different reserves is unknown to the seller: the only information that the seller receives is bids drawn from distribution B(r) when the seller sets a reserve price r.

One natural idea, and the approach we take in this paper, is to optimize the reserve price via gradient descent. Gradient descent is only guaranteed to converge to the optimal reserve when our objective is convex (or at least, unimodal), which is not necessarily true for an arbitrary revenue function. However, gradient descent has a number of practical advantages for reserve price optimization, including:

1. Gradient descent allows us to incorporate prior information we may have about the location of a good reserve price (possibly significantly reducing the overall search cost).
2. The adaptivity of gradient descent allows us to quickly converge to a local optimum and follow this optimum if it changes over time, significantly saving on search cost (over global methods such as grid search).
3. In practice, many revenue curves have a unique local optimum (see Section 5), so gradient descent is likely to converge to the optimal reserve.

More specifically, since the seller has no direct access to the gradients (i.e., first-order information) of µ(r), we consider approaches that fit in the framework of zeroth-order stochastic optimization. Our framework, summarized in Algorithm 1, proceeds in rounds. In round t where the current reserve is r_t, the seller selects a perturbation size β_t and randomly sets the reserve price to either (1 + β_t)r_t or (1 − β_t)r_t on separate slices of experiment traffic, until it has received n_t samples from both B((1 + β_t)r_t) and B((1 − β_t)r_t). The seller then uses these 2n_t samples to estimate the gradient Ĝ_t of the revenue curve µ(r) at r_t, and updates the reserve price based on this gradient estimate using learning rate (step size) α_t.

Algorithm 1: Zeroth-order stochastic projected gradient framework for reserve optimization.
    Input: Initial reserve r_1 ∈ (0, 1]; number of rounds T (a variable to be fixed later).
    Output: Reserve prices r_1, r_2, ..., r_{T+1}.
    for t = 1, 2, ..., T do
        Set a reserve price of r_t^+ = (1 + β_t)r_t in n_t auctions.
        Set a reserve price of r_t^- = (1 − β_t)r_t in n_t auctions.
        Construct an estimate Ĝ_t of the gradient of revenue at r_t, based on the feedback of the experiments.
        Update the reserve: r_{t+1} = Π(r_t + α_t Ĝ_t), where Π(x) = argmin_{z∈(0,1]} |z − x|.
    end for

We assume that we have access to a fixed total number of samples N = Σ_{t=1}^T n_t (the number of iterations T is a variable that will be fixed later). There is then a trade-off between n_t (i.e., the number of samples per iteration) and T (the number of iterations available to optimize the reserve price).

Zeroth-order stochastic gradient descent is a well-studied problem [14, 4, 13, 20]. In this paper, we focus on taking advantage of the structure of b(r, v) to construct good discrete gradient estimates Ĝ_t, as this aspect is specific to the problem of reserve price optimization.
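To make the framework concrete, here is a minimal Python sketch of the loop in Algorithm 1. The bid sampler sample_bids and the estimator grad_estimate are hypothetical placeholders standing in for the experiment traffic and for any of the discrete-gradient estimators developed below; the fixed perturbation size and projection bounds are illustrative choices, not the paper's tuned settings.

```python
import numpy as np

def optimize_reserve(r1, sample_bids, grad_estimate, T=200, N=200_000,
                     beta=0.1, lo=0.01, hi=1.0):
    """Zeroth-order stochastic projected gradient ascent on mu(r) (Algorithm 1).

    sample_bids(r, n): draws n bids from B(r) (one experiment slice).
    grad_estimate(x_plus, x_minus, r_plus, r_minus): discrete-gradient estimate.
    """
    n_t = N // T                 # samples per perturbed reserve per round
    alpha = 1.0 / np.sqrt(T)     # learning rate, Theta(1 / sqrt(T))
    r = r1
    reserves = [r]
    for _ in range(T):
        r_plus, r_minus = (1 + beta) * r, (1 - beta) * r
        x_plus = sample_bids(r_plus, n_t)    # traffic slice at raised reserve
        x_minus = sample_bids(r_minus, n_t)  # traffic slice at lowered reserve
        g_hat = grad_estimate(x_plus, x_minus, r_plus, r_minus)
        r = float(np.clip(r + alpha * g_hat, lo, hi))  # projection onto C
        reserves.append(r)
    return reserves
```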
Specifically, we tackle the following problem, which we term the discrete gradient problem:

• Input: n samples X_1^+, ..., X_n^+ drawn i.i.d. from B(r^+) and n samples X_1^-, ..., X_n^- drawn i.i.d. from B(r^-), for known r^+ > r^-.
• Output: An estimator Ĝ for the discrete derivative (µ(r^+) − µ(r^-))/(r^+ − r^-). This estimator has bias Bias(Ĝ) and variance Var(Ĝ), where

    Bias(Ĝ) = | E[Ĝ] − (µ(r^+) − µ(r^-))/(r^+ − r^-) |.

Solutions to the discrete gradient problem with small bias and variance directly translate into faster convergence rates for our gradient descent. We provide a detailed convergence result in Theorem C.2 in Appendix C.1. We summarize this result informally as follows.

Theorem 3.1 (Informal Restatement of Theorem C.2). If for all t, Bias(Ĝ_t) ≤ B and Var(Ĝ_t) ≤ V, then for optimal choices of α_t and n_t (and fixing β_t = δ/(2r_t)), Algorithm 1 satisfies

    min_{t∈[T]} |P_C^t|² = Õ( T^{−1/2} + δ² + B² + V + (T/N)² ).

Here P_C^t can be thought of as the true gradient at round t (see Definition C.1 in the Appendix). Intuitively, we want to design an estimator and choose our parameters α_t, β_t, n_t so as to trade off between δ, B, and V. In the following sections, we show how to do this for a variety of bidder response models.

Naive Gradient Estimation
The simplest method for estimating the discrete gradient is to take the difference between the average revenue from bids from B(r^+) and the average revenue from bids from B(r^-). More formally, we compute the discrete gradient as

    Ĝ = ( Σ_{i=1}^n X_i^+ − Σ_{i=1}^n X_i^- ) / ( n(r^+ − r^-) ).    (2)

We show that Ĝ has the following properties.

Theorem 3.2.
Assume that r^+ − r^- = δ. Then Bias(Ĝ) = 0 and Var(Ĝ) ≤ 1/(2δ²n).

This leads to the following convergence rate via Theorem 3.1.
Corollary 3.1.
Using this estimator Ĝ, and setting T = N^{1/2} and δ = Θ(N^{−1/8}), Algorithm 1 achieves convergence

    min_{t∈[T]} |P_C^t|² ≤ Õ( N^{−1/4} ).

Although there are no matching lower bounds, this is the best known asymptotic convergence rate for zeroth-order optimization over a non-convex objective [14, 4]. The naive gradient estimation approach has the advantage that it works regardless of the response model, is simple to compute (it uses only revenue information and not individual bids), and leads to an unbiased estimator for the discrete derivative. The disadvantage is that the variance of this estimator can be large (especially as we take δ small). In the following section, we show how to address this by taking into account the inherent structure of the revenue objective based on an underlying bidder response model.
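A minimal sketch of the naive estimator of Eq. (2) follows; it plugs directly into the grad_estimate slot of the Algorithm 1 sketch above.

```python
import numpy as np

def naive_gradient(x_plus, x_minus, r_plus, r_minus):
    """Naive discrete-gradient estimator of Eq. (2): the difference between
    the average revenues at the two perturbed reserves, divided by the gap."""
    n = len(x_plus)
    return float((np.sum(x_plus) - np.sum(x_minus)) / (n * (r_plus - r_minus)))
```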
In this section, we first introduce another representation of the revenue formula by decomposing it into a demand component and a bidding component. We then propose techniques to reduce the variance of the discrete gradient of each component.

We can decompose the revenue µ(r) in the following way.

Theorem 4.1.
We have that

    µ(r) = E_{v∼F}[ max(b(r, v) − r, 0) ] + r · Pr_{v∼F}[ v ≥ r ].    (3)

Define E(r) = E_{v∼F}[max(b(r, v) − r, 0)] and D(r) = Pr_{v∼F}[v ≥ r], so that µ(r) = E(r) + rD(r). These two terms capture two different aspects of bidder behavior which contribute to revenue. The function D(r) amounts to a "demand curve" which gives the proportion of values that clear the reserve r, and therefore the proportion of auctions that are bid on at r. If the auction were just a simple posted-price auction (i.e., the winner is charged the quoted price r), then the demand component rD(r) would be the associated revenue. However, in a first-price auction the winning bidder pays its bid, not the reserve. Therefore the bidding component E(r) captures the excess contribution from bids greater than the reserve.

To construct a good estimator Ĝ for the discrete gradient of µ(r), it suffices to construct good estimators Ĝ_E and Ĝ_D for the discrete gradients of E(r) and rD(r) respectively, and then output Ĝ = Ĝ_E + Ĝ_D. Note that Bias(Ĝ) ≤ Bias(Ĝ_E) + Bias(Ĝ_D) and Var(Ĝ) ≤ 2(Var(Ĝ_D) + Var(Ĝ_E)), so it suffices to bound the bias and variance of each component separately.

We begin by discussing how to estimate the gradient Ĝ_D of the demand component of revenue. One method of doing so is by estimating D(r) with a parametric function f_θ(r), and using this approximation to estimate the gradient Ĝ_D. (See Appendix B for additional justification for why this is likely to be possible and helpful.) Suppose that we have access to additional historical data S with which we can fit our parametric class to D(r); let θ̂ be the resulting learned parameter. This learned demand function gives rise to the following estimator Ĝ_D:

    Ĝ_D = ( r^+ f_θ̂(r^+) − r^- f_θ̂(r^-) ) / ( r^+ − r^- ).    (4)

Note that this decreases the overall variance (the variance of Ĝ_D is 0, because the randomness of Ĝ_D comes only from the historical samples S, which are independent of the samples obtained in the current round), at the cost of a possible increase in bias (due to inaccuracy in estimating D(r)).
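Given any fitted demand model f_hat (e.g., a logistic curve trained on historical (reserve, demand) pairs), the estimator of Eq. (4) is a one-liner. The sketch below is illustrative; the model interface is our own assumption.

```python
def demand_gradient(f_hat, r_plus, r_minus):
    """Discrete gradient of the demand component r * D(r), as in Eq. (4),
    using a fitted parametric demand model f_hat(r) approximating D(r)."""
    return (r_plus * f_hat(r_plus) - r_minus * f_hat(r_minus)) / (r_plus - r_minus)
```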
In this section we propose a variance reduction method to achieve a better estimator Ĝ_E for a variety of bidder models.

Variance reduction via bid truncation. We first consider the special case of the perfect response (and more generally, the ε-bounded response) bidding model. In the perfect response model, if you were going to bid b > r^+ when the reserve was r^+, you will place the same bid b when the reserve is r^-. This means that large bids (bids larger than r^+) do not contribute in expectation to µ(r^+) − µ(r^-), but they do add noise to our gradient estimation. By filtering these out, we can reduce the variance of our estimator while keeping our estimator unbiased.

Since we only apply this filtering when estimating the bidding component E(r) but not the demand component rD(r), we must be careful when implementing this. Note that a large bid b > r^+ contributes b − r^+ to E(r^+) and b − r^- to E(r^-), and therefore −(r^+ − r^-) to E(r^+) − E(r^-). We can therefore construct an unbiased estimator for E(r^+) − E(r^-) by computing the contribution of unfiltered bids (b < r^+) from both B(r^+) and B(r^-), and then adding r^+ − r^- for each filtered bid in B(r^-) (or equivalently, each filtered bid in B(r^+); under perfect response, the fraction of filtered bids is equal in both models in expectation). Note that every bid from B(r^+) is either filtered or has excess 0, so we can write this gradient Ĝ_E entirely in terms of bids from B(r^-). Formally, we define the truncated bid Y_i^- as

    Y_i^- = max(X_i^- − r^-, 0)   if X_i^- ≤ r^+,
            r^+ − r^-             otherwise.

Our estimator for the discrete gradient of E(r) is then given by

    Ĝ_E = − ( Σ_{i=1}^n Y_i^- ) / ( n(r^+ − r^-) ).    (5)

Since any bid in an ε-bounded model only differs from one in the perfect response model by at most ε, we can apply this same estimator to an ε-bounded response model. The following theorem characterizes the bias and variance of the estimator for the ε-bounded response model.
Assume that r^+ − r^- = δ. Then the estimator Ĝ_E in Eq. (5) for the ε-bounded response model satisfies Bias(Ĝ_E) ≤ 2ε/δ and Var(Ĝ_E) ≤ 1/(4n).

Note that the bias of the estimator Ĝ_E is 0 for the perfect response model. The complete proof is given in Appendix C.5. Combining the above results for Ĝ_E and Ĝ_D, we have the following improved convergence result for the ε-bounded response model.

Corollary 4.1.
Suppose Bias(Ĝ_D) ≤ ε_D/δ. Using the estimator Ĝ_E proposed in Eq. (5) for the ε-bounded response model, and setting T = N^{2/3} and δ = Θ(√(ε + ε_D)), Algorithm 1 achieves convergence

    min_{t∈[T]} |P_C^t|² ≤ Õ( ε + ε_D + N^{−1/3} ).

For perfect response bidding models, the above convergence rate is strictly faster than the convergence rate of the naive estimator in Corollary 3.1 (the state-of-the-art convergence rate for zeroth-order stochastic gradient descent), at the price of additional bias coming from demand estimation. However, we show that this bias has a practically negligible effect on revenue in our experiments.
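A minimal sketch of the bid-truncation estimator of Eq. (5), which needs only the bids observed at the lowered reserve r^-:

```python
import numpy as np

def bid_truncation_gradient(x_minus, r_plus, r_minus):
    """Bid-truncation estimator of Eq. (5) for the bidding component E(r).
    Each filtered bid (above r+) contributes r+ - r-; each unfiltered bid
    contributes its excess over r-. Unbiased under perfect response."""
    x_minus = np.asarray(x_minus)
    y = np.where(x_minus <= r_plus,
                 np.maximum(x_minus - r_minus, 0.0),  # unfiltered bids
                 r_plus - r_minus)                    # filtered (large) bids
    return float(-np.sum(y) / (len(x_minus) * (r_plus - r_minus)))
```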
Variance reduction via quantile truncation.
In Eq. (5), we reduced the variance of Ĝ_E by truncating all bids at the fixed threshold t = r^+. In general, this does not quite work: for bidder response models that are far from perfect response, this truncation can introduce a very large bias. Here we demonstrate one technique for constructing good estimators Ĝ_E as long as the bidding function b(r, v) possesses diminishing sensitivity in value to reserve.

Instead of truncating in bid space, we will instead want to truncate in value space to reduce the variance. Even though we cannot directly truncate by values, since b(r, v) is monotonically increasing in v, quantiles of bids (e.g., of B(r^+) and B(r^-)) directly correspond to quantiles of values (of F). Instead of setting a threshold t directly on the value, it is therefore equivalent to truncate at a fixed quantile of the bid distribution.

To achieve this, we first sort the X_i^+ and X_i^- in ascending order. Then we compute Ĝ_E as

    Ĝ_E = ( Σ_{i=1}^{qn} max(X_i^+ − r^+, 0) − Σ_{i=1}^{qn} max(X_i^- − r^-, 0) ) / ( n(r^+ − r^-) ) − (1 − q),    (6)

where q ∈ [0, 1] is the quantile threshold used to truncate bids. The following theorem characterizes the bias and variance of the above Ĝ_E.

Theorem 4.3.
Let r^+ − r^- = δ, t = F^{−1}(q), and t̃ = F^{−1}(q + 2n^{−1/3}). Then the estimator Ĝ_E in Eq. (6) satisfies

    Bias(Ĝ_E) ≤ (1 − q)( b(r^+, t) − b(r^-, t) )/δ + O(n^{−1/3}/δ),
    Var(Ĝ_E) ≤ 2t̃²/(nδ²) + O(n^{−4/3}δ^{−2}).

Unlike with bid truncation, with quantile truncation we have a clear bias-variance tradeoff as we change q: larger values of q decrease the bias (both by decreasing 1 − q and by decreasing b(r^+, t) − b(r^-, t), which is decreasing due to diminishing sensitivity) but lead to larger variance. Since one can estimate this bound on the bias (by approximating b(r^+, t) − b(r^-, t) via Y_{qn}^+ − Y_{qn}^-), it is possible to choose q to optimize this bias-variance tradeoff as one sees fit (for example, to minimize B² + V in Theorem 3.1). We show a convergence rate result for this quantile truncation approach in Corollary C.1 in Appendix C.7.
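A minimal sketch of the quantile-truncation estimator of Eq. (6); the default quantile q is an illustrative choice, since the paper leaves q to be tuned against the bias-variance tradeoff above.

```python
import numpy as np

def quantile_truncation_gradient(x_plus, x_minus, r_plus, r_minus, q=0.9):
    """Quantile-truncation estimator of Eq. (6) for the bidding component:
    keep only the qn smallest bids from each slice, and correct for the
    truncated top (1 - q) quantile with the additive -(1 - q) term."""
    n = len(x_plus)
    k = int(q * n)
    xp = np.sort(np.asarray(x_plus))[:k]   # qn smallest bids from B(r+)
    xm = np.sort(np.asarray(x_minus))[:k]  # qn smallest bids from B(r-)
    excess_diff = (np.sum(np.maximum(xp - r_plus, 0.0))
                   - np.sum(np.maximum(xm - r_minus, 0.0)))
    return float(excess_diff / (n * (r_plus - r_minus)) - (1.0 - q))
```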
Experiments

We evaluate the performance of our algorithms on synthetic and semi-synthetic data sets. Due to space limitations, we present the complete experimental results in Appendix D.
The data generation process consists of two parts: a base bid distribution specifying the distribution of bids when no reserve is set, and a response model describing how a bidder with bid b would update its bid in response to a reserve of r.
Response models. We assume that in the absence of a reserve, bidders bid a constant fraction γ of their value v (i.e., b = γv), which we refer to as linear shading. We consider linear shading combined with perfect response and with ε-bounded response, which we implement by adding a uniform [0, ε] random variable to the bid. We also examine equilibrium bidding for n i.i.d. bidders with uniformly distributed valuations [19]:

    b = ( r^n + (n − 1)v^n ) / ( n v^{n−1} ).
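The response models above can be simulated in a few lines of Python. The sketch below follows Definition 2.3, Definition C.6, and the equilibrium formula; the shading factor and ε defaults are illustrative, not necessarily the values used in the paper's experiments.

```python
import numpy as np

def linear_shading_bid(v, gamma=0.3):
    """Linear shading: bid a constant fraction gamma of the value (b = gamma * v)."""
    return gamma * v

def perfect_response_bid(r, v, base_bid=linear_shading_bid):
    """Perfect response (Definition 2.3): keep the base bid if it clears the
    reserve, raise it to r when r <= v, and abstain (bid 0) when v < r."""
    b0 = base_bid(v)
    if b0 >= r:
        return b0
    return r if r <= v else 0.0

def eps_bounded_response_bid(r, v, base_bid=linear_shading_bid, eps=0.05, rng=None):
    """eps-bounded response (Definition C.6): like perfect response, but a
    bidder forced up by the reserve overshoots it by z ~ Uniform[0, eps]."""
    rng = rng or np.random.default_rng()
    b0 = base_bid(v)
    if b0 >= r:
        return b0
    return r + rng.uniform(0.0, eps) if r <= v else 0.0

def equilibrium_bid(r, v, n=2):
    """Symmetric equilibrium bid for n i.i.d. uniform [0, 1] bidders facing
    reserve r [19]: b = (r^n + (n - 1) v^n) / (n v^(n-1)) for v >= r."""
    if v < r or v <= 0.0:
        return 0.0
    return (r**n + (n - 1) * v**n) / (n * v**(n - 1))
```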
Synthetic data. In our synthetic data sets, the (base) bid distribution is the uniform [0, 1] distribution, and we consider the perfect response, ε-bounded response, and equilibrium bidding models. In the simulations, we apply a constant shading factor, together with a fixed ε for the ε-bounded response model. For equilibrium bidding, we assume that each auction contains n = 2 bidders.
Semi-synthetic data. For our semi-synthetic data sets, we separately collected the empirical distributions of winning bids over one day for 20 large publishers on a major display ad exchange. Each distribution was filtered for outliers and normalized to the interval [0, 1]. Here we consider the perfect response and ε-bounded response models, since there is no closed-form solution for the equilibrium bidding strategy. We use 0.3 as the constant shading factor for the semi-synthetic data.
Gradient descent algorithms. We examine five different gradient descent algorithms:

(I) Naive GD: naive gradient descent using the gradient estimator in Eq. (2);
(II) Naive GD with bid truncation: gradient descent using the gradient estimator in Eq. (5) for the bidding component, and a naive estimate of the demand component;⁴
(III) Naive GD with quantile truncation: gradient descent using the gradient estimator in Eq. (6) for the bidding component, and a naive estimate of the demand component;
(IV) Demand modeling with bid truncation: same as the second variant, but with a parametric model of the demand curve to estimate the demand component of the gradient;
(V) Demand modeling with quantile truncation: same as the third variant, but with a parametric model of the demand curve to estimate the demand component of the gradient.

The parameters used in these algorithms are specified in Appendix D.

⁴ For the naive estimate, we can form the unbiased estimator Ĝ_D = (D̂(r^+) − D̂(r^-))/(r^+ − r^-), where D̂(r^+) = (1/n) Σ_i 1{X_i^+ ≥ r^+} and similarly for D̂(r^-).
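For illustration, each variant is assembled by summing a bidding-component estimator and a demand-component estimator. The sketch below wires up variant (III) from the earlier quantile_truncation_gradient sketch and the naive demand estimator of the footnote; the function names are ours.

```python
import numpy as np

def naive_demand_gradient(x_plus, x_minus, r_plus, r_minus):
    """Naive unbiased demand-component estimator (see footnote):
    (r+ D_hat(r+) - r- D_hat(r-)) / (r+ - r-) with empirical demand D_hat."""
    d_plus = float(np.mean(np.asarray(x_plus) >= r_plus))
    d_minus = float(np.mean(np.asarray(x_minus) >= r_minus))
    return (r_plus * d_plus - r_minus * d_minus) / (r_plus - r_minus)

def variant_iii_gradient(x_plus, x_minus, r_plus, r_minus, q=0.9):
    """Variant (III), naive GD with quantile truncation: quantile-truncated
    bidding component plus the naive demand component."""
    return (quantile_truncation_gradient(x_plus, x_minus, r_plus, r_minus, q)
            + naive_demand_gradient(x_plus, x_minus, r_plus, r_minus))
```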
Demand curve estimation.
To reduce variance following the ideas of Section 4, we need a model Ĝ_D for the demand component of the discrete gradient. Instead of estimating Ĝ_D from historical data, we adaptively learn the demand curve during the training process. Concretely, at each round t, we observe new (reserve, demand) pairs from the 2n_t samples and retrain our demand curve using all the samples observed up to the current round. We use this trained demand curve to compute Ĝ_D based on Eq. (4). For the synthetic data, a simple logistic regression can effectively learn the demand curve. However, the semi-synthetic data required a more flexible model, so for this case we model demand using a fully connected neural network with 1 hidden layer, 15 hidden nodes, and ReLU activations.

Figure 1: Revenue as a function of round t for (a) synthetic data with perfect response, (b) synthetic data with equilibrium response, and (c) semi-synthetic data with perfect response.

Figure 2: Revenue as a function of round t for synthetic data with equilibrium response.
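As a minimal sketch of this adaptive scheme (assuming scikit-learn for convenience; the class and its interface are our own construction, not the paper's code), a logistic demand model refit each round can serve as the f_hat in the Eq. (4) estimator:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class AdaptiveDemandModel:
    """Logistic demand curve D(r), refit each round on all (reserve,
    cleared) observations so far. Under the response models considered,
    a bid clears the reserve exactly when the value does. A small neural
    network can be swapped in for harder bid distributions."""
    def __init__(self):
        self.reserves, self.cleared = [], []
        self.model = LogisticRegression()

    def observe(self, reserve, bids):
        for b in np.asarray(bids):
            self.reserves.append([reserve])
            self.cleared.append(1 if b >= reserve else 0)
        # Refitting requires both outcomes to have been observed at least once.
        self.model.fit(self.reserves, self.cleared)

    def __call__(self, r):
        return float(self.model.predict_proba([[r]])[0, 1])
```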
Effectiveness of gradient descent. First, we confirm that gradient descent can effectively find optimal reserves in our models. For each semi-synthetic model, we construct the revenue curve as a function of the reserve under the assumed response models. We find that 19 out of the 20 revenue curves have a clear single local maximum (the remaining curve has 2). In all cases (synthetic and semi-synthetic models), the revenue learned by the naive gradient descent algorithm is at least 95% of the revenue at the optimal reserve, which indicates that gradient descent can efficiently find the optimal reserve in these cases despite the lack of convexity.
Effectiveness of variance reduction methods.
We first evaluate the performance of the quantile-based variance reduction method. We run the algorithm variants (I), (III) and (V) under synthetic data and semi-synthetic data with multiple bidder response models. Figures 1a and 1c show the revenue achieved by the three algorithms over time under the perfect response model. We find that quantile-based variance reduction leads to a more stable training process which converges faster than naive gradient descent. Figure 1b evaluates the performance of the three algorithm variants under synthetic data and an equilibrium response model, with similar conclusions. Overall, quantile-based variance reduction outperforms naive gradient descent. Moreover, with the addition of demand curve estimation, algorithm variant (V) achieves better revenue and converges to an optimal reserve faster than the other two algorithms, in agreement with our theoretical guarantees.

We next consider variance reduction using bid truncation, which is used in algorithm variants (II) and (IV). Bid truncation is tailored to perfect response and performs the best overall for this response model, in accordance with the theoretical guarantees, but quantile truncation is competitive and often performs as well over the semi-synthetic data (see Appendix D for a detailed comparison). Under the equilibrium response model, bid truncation can in fact hinder the training process and lead to a substantially suboptimal reserve price (see Figure 2). In summary, quantile-based variance reduction coupled with good demand-curve estimation is the method of choice to achieve good reserve prices under a range of different bid distributions and bidder response models.
References

[1] Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), June 2010.
[2] Alekh Agarwal, Dean P. Foster, Daniel J. Hsu, Sham M. Kakade, and Alexander Rakhlin. Stochastic convex optimization with bandit feedback. In Advances in Neural Information Processing Systems 24, pages 1035–1043, 2011.
[3] Kareem Amin, Afshin Rostamizadeh, and Umar Syed. Learning prices for repeated auctions with strategic buyers. In Advances in Neural Information Processing Systems 26, 2013.
[7] Theor. Comput. Sci., 324:137–146, 2003.
[8] Stéphane Boucheron and Maud Thomas. Concentration inequalities for order statistics. Electronic Communications in Probability, 17, 2012.
[9] N. Cesa-Bianchi, C. Gentile, and Y. Mansour. Regret minimization for reserve prices in second-price auctions. IEEE Transactions on Information Theory, 61(1):549–564, January 2015.
[10] Yuyu Chen. Programmatic advertising is preparing for the first-price auction era. https://digiday.com/marketing/programmatic-advertising-readying-first-price-auction-era, October 2017. Accessed: 2020-01-29.
[11] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '05, pages 385–394, Philadelphia, PA, USA, 2005.
[12] James E. Gentle. Computational Statistics, volume 308. Springer, 2009.
[13] Saeed Ghadimi. Conditional gradient type methods for composite nonlinear and stochastic optimization. Mathematical Programming, 173(1):431–464, January 2019.
[14] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
[15] Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155:267–305, 2013.
[16] Elad Hazan and Kfir Y. Levy. Bandit convex optimization: Towards tight bounds. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS'14, pages 784–792, 2014.
[17] Zhiyi Huang, Jinyan Liu, and Xiangning Wang. Learning optimal reserve price against non-myopic bidders. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
[19] Vijay Krishna. Auction Theory. Academic Press, 2009.
[20] S. Liu, X. Li, P. Chen, J. Haupt, and L. Amini. Zeroth-order stochastic projected gradient descent for nonconvex optimization. In 2018 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 1179–1183, November 2018.
[21] Jieming Mao, Renato Paes Leme, and Jon Schneider. Contextual pricing for Lipschitz buyers. In Advances in Neural Information Processing Systems 31, pages 5643–5651, 2018.
[22] Steven A. Matthews. A technical primer on auction theory I: Independent private values. Discussion Papers 1096, Northwestern University, Center for Mathematical Studies in Economics and Management Science, May 1995.
[23] Mehryar Mohri and Andrés Muñoz Medina. Learning algorithms for second-price auctions with reserve. J. Mach. Learn. Res., 17(1):2632–2656, January 2016.
[24] Andres Munoz and Sergei Vassilvitskii. Revenue optimization with approximate bid predictions. In Advances in Neural Information Processing Systems 30, pages 1858–1866, 2017.
[25] R. Myerson. Optimal auction design. Mathematics of Operations Research, 6:58–73, 1981.
[26] Michael Ostrovsky and Michael Schwarz. Reserve prices in internet advertising auctions: A field experiment. In Proceedings of the 12th ACM Conference on Electronic Commerce, EC '11, pages 59–60, 2011.
[27] Renato Paes Leme, Martin Pal, and Sergei Vassilvitskii. A field guide to personalized reserve prices. In Proceedings of the 25th International Conference on World Wide Web, pages 1093–1102, 2016.
[28] Sashank J. Reddi, Suvrit Sra, Barnabás Póczos, and Alexander J. Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In NIPS, 2016.
[29] John G. Riley and William F. Samuelson. Optimal auctions. American Economic Review, 71(3):381–392, 1981.
[30] Mark Weiss. Digiday research: Header bidding and first-price auctions boost publisher revenues. https://digiday.com/media/digiday-research-header-bidding-and-first-price-auctions-boost-publisher-revenues, January 2019. Accessed: 2020-01-29.
Appendix
A From multiple bidders to a single bidder
Our auction can contain multiple bidders, each with their own value distribution F_i and bid function b_i(r, v_i). But when setting reserve prices, we only care about the maximum bid; more specifically, the distribution of the maximum bid at each reserve. Thus it is useful to abstract away the set of bidders in the auction as a single "mega-bidder" whose value is the maximum of all the bidders' values and who always bids the maximum of all the bids.

Theorem A.1.
Let F be the distribution of max(v_1, v_2, ..., v_m) (where each v_i is independently drawn from F_i) and let B(r) be the distribution of max(b_1(r, v_1), b_2(r, v_2), ..., b_m(r, v_m)). Then there exists a bid function b(r, v) such that the distribution of b(r, v) when v ∼ F is equal to the distribution B(r).

Proof. If B_r(b) is the CDF of B(r) and F(v) is the CDF of F, let b(r, v) = B_r^{−1}(F(v)). This guarantees that if v ∼ F, then b(r, v) ∼ B(r). ∎

Note that this reduction also preserves the properties of Definition 2.1. For example, if b_i(r, v_i) ≤ v_i for every bidder i, then the induced b(r, v) also satisfies b(r, v) ≤ v.
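The construction in the proof has a direct empirical analogue: map a value through the empirical CDF of values, then through the empirical quantile function of bids. The sketch below is a hypothetical illustration of this mapping, not code from the paper.

```python
import numpy as np

def mega_bidder_bid(r, v, value_samples, bid_samples_at_r):
    """Empirical version of the Theorem A.1 construction b(r, v) = B_r^{-1}(F(v)),
    using samples of the maximum value (for F) and of the maximum bid at
    reserve r (for B(r))."""
    u = float(np.mean(np.asarray(value_samples) <= v))          # empirical F(v)
    return float(np.quantile(np.asarray(bid_samples_at_r), u))  # empirical B_r^{-1}
```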
B Demand Function Estimation

As in Section 3, it is possible to form a naive unbiased estimate of the demand component via the estimator D̂(r) = (1/n) Σ_{i=1}^n 1(X_i ≥ r). The variance of the resulting unbiased estimator Ĝ_D is then bounded by (see Theorem C.5)

    Var(Ĝ_D) ≤ (r^+)² / (2δ²n).

Note that for small r, the variance guarantee here is significantly better than the variance guarantee in Theorem 3.2. Thus, in instances where the optimal reserve is small (and hence we mostly test small r^+), combining this naive estimator with better estimators for Ĝ_E (like the ones we explore in the next section) can already lead to better convergence rates overall.

To obtain even better estimators, we can leverage the following two facts about the demand function. First, the demand function only depends on the value distribution F of the bidders, and not their specific bidding behavior. Since we expect values to be relatively stable in comparison to bidding behavior, this means that we can reasonably use data from previous rounds to learn the demand function and inform the calculation of Ĝ_D (whereas the naive gradient update only uses data from the current round). Second, we expect the demand function D(r) to be simpler and more nicely structured than the full revenue function µ(r) (for example, D(r) is weakly decreasing in r), and therefore more amenable to parametric modeling.

C Omitted Proofs
C.1 Formal convergence rate
To show a convergence result for a non-convex problem with constraints, a measure called the gradient mapping is widely used in the literature, e.g., [15, 28, 20]. We define the gradient mapping used in this paper as follows.

Definition C.1 (Gradient Mapping). Let f(x) be a differentiable function defined on [0, 1], let C ⊆ (0, 1] be a convex space, and let Π_C be the projection operator defined as

    Π_C(x) = argmin_{z∈C} |z − x|²,  ∀x ∈ ℝ.    (7)

The gradient mapping P_C is then defined as

    P_C(r, ĝ, α) := (1/α)( Π_C(r + αĝ) − r ),

where ĝ is a gradient estimate of µ(r) (it can be biased), r ∈ C is the reserve, and α is the learning rate.

The gradient mapping P_C can be interpreted as a projected gradient, which offers a feasible update from the previous reserve r. Indeed, if the projection operator is the identity function, the gradient mapping just returns the gradient.

Theorem C.2.
Suppose µ(r) is L-smooth. Let Ĝ_t = (1/n_t) Σ_{i=1}^{n_t} Ĝ_{i,t} be the gradient estimator at time t, where |Ĝ_{i,t}| ≤ C almost surely for all i ∈ [n_t]. Fix α_t = Θ(1/√T), β_t = δ/(2r_t), and n_t = N/T. Assume for all t that Bias(Ĝ_t) ≤ B and Var(Ĝ_t) ≤ V. Then with probability at least 1 − δ′, Algorithm 1 satisfies

    min_{t∈[T]} |P_C(r_t, µ′(r_t), α_t)|² ≤ O( √(1/T) + ( δ + B + CT ln(2T/δ′)/N + √(V ln(2T/δ′)) )² ),

where P_C is the gradient mapping.

To prove Theorem C.2, we first show some useful inequalities, summarized in Lemma C.3 and Lemma C.4.
Lemma C.3 (Bernstein Inequality). Let G = (1/n) Σ_{i=1}^n G_i be the random variable estimating the revenue's gradient (the G_i can be correlated), where |G_i| ≤ C almost surely, and let V = Var[G]. Then

    | (1/n) Σ_{i=1}^n G_i − E[ (1/n) Σ_{i=1}^n G_i ] | ≤ (2C ln(2T/δ′))/(3n) + √(2V ln(2T/δ′))

holds with probability at least 1 − δ′/T.

Proof. By Bernstein's inequality, we have

    P( | (1/n) Σ_{i=1}^n G_i − (1/n) Σ_{i=1}^n E[G_i] | ≤ z ) ≥ 1 − 2 exp( − z² / ( 2(V + Cz/(3n)) ) ).

Setting 2 exp( − z²/(2(V + Cz/(3n))) ) = δ′/T and solving for z, we have

    z = ( (2C ln(2T/δ′))/(3n) + √( ((2C ln(2T/δ′))/(3n))² + 8V ln(2T/δ′) ) ) / 2
      ≤ (2C ln(2T/δ′))/(3n) + √(2V ln(2T/δ′)) =: z′,

using √(a + b) ≤ √a + √b. Therefore, we have

    P( | (1/n) Σ_i G_i − (1/n) Σ_i E[G_i] | ≤ z′ ) ≥ P( | (1/n) Σ_i G_i − (1/n) Σ_i E[G_i] | ≤ z ) ≥ 1 − δ′/T. ∎

Lemma C.4.
For any t ∈ [T], we have

    Ĝ_t · P_C(r_t, Ĝ_t, α_t) ≥ |P_C(r_t, Ĝ_t, α_t)|²,
    |P_C(r_t, µ′(r_t), α_t) − P_C(r_t, Ĝ_t, α_t)| ≤ |µ′(r_t) − Ĝ_t|.

Proof. Since C is a convex space, for any x ∈ C and z ∈ ℝ we have (x − Π_C(z)) · (z − Π_C(z)) ≤ 0. Taking z = r_t + α_t Ĝ_t and x = r_t, we have

    ( r_t − Π_C(r_t + α_t Ĝ_t) ) · ( r_t + α_t Ĝ_t − Π_C(r_t + α_t Ĝ_t) ) ≤ 0,

which implies

    α_t Ĝ_t ( Π_C(r_t + α_t Ĝ_t) − r_t ) ≥ | r_t − Π_C(r_t + α_t Ĝ_t) |².

Thus, we have Ĝ_t · P_C(r_t, Ĝ_t, α_t) ≥ |P_C(r_t, Ĝ_t, α_t)|².

Again, since C is a convex space, for all x, z ∈ ℝ we have |Π_C(x) − Π_C(z)| ≤ |x − z|. Then we can prove the second inequality:

    |P_C(r_t, µ′(r_t), α_t) − P_C(r_t, Ĝ_t, α_t)| = (1/α_t) | Π_C(r_t + α_t µ′(r_t)) − Π_C(r_t + α_t Ĝ_t) | ≤ |µ′(r_t) − Ĝ_t|. ∎

Proof of Theorem C.2.
Denote r_t^+ = (1 + β_t)r_t and r_t^- = (1 − β_t)r_t. First we bound the deviation of Ĝ_t from µ′(r_t). The error |µ′(r_t) − Ĝ_t| can be decomposed as follows:

    |µ′(r_t) − Ĝ_t| ≤ | µ′(r_t) − (µ(r_t^+) − µ(r_t^-))/(r_t^+ − r_t^-) | + | E[Ĝ_t] − (µ(r_t^+) − µ(r_t^-))/(r_t^+ − r_t^-) | + | E[Ĝ_t] − Ĝ_t |.    (8)

We bound the three terms separately. For the first term, we have

    µ′(r_t) − (µ(r_t^+) − µ(r_t^-))/(r_t^+ − r_t^-) = ( (µ(r_t) + µ′(r_t)β_t r_t − µ(r_t^+)) + (µ(r_t^-) − µ(r_t) + µ′(r_t)β_t r_t) ) / ( r_t^+ − r_t^- ).

By the smoothness of µ(r),

    | µ(r_t^+) − µ(r_t) − µ′(r_t)·β_t r_t | ≤ (L/2)·β_t² r_t²,
    | µ(r_t^-) − µ(r_t) + µ′(r_t)·β_t r_t | ≤ (L/2)·β_t² r_t².

Thus, we get

    | µ′(r_t) − (µ(r_t^+) − µ(r_t^-))/(r_t^+ − r_t^-) | ≤ (1/(2β_t r_t))·L β_t² r_t² = (L/2) β_t r_t.

The second term is bounded by the bias bound B. Combining with Lemma C.3, for any fixed t ∈ [T],

    | µ′(r_t) − Ĝ_t | ≤ (L/2)β_t r_t + B + (2C ln(2T/δ′))/(3n_t) + √(2V ln(2T/δ′)) =: B_{δ′,t}    (9)

holds with probability at least 1 − δ′/T.

The L-smoothness of the revenue function µ(r) implies the following inequalities, for any t = 1, ..., T:

    µ(r_{t+1}) ≥ µ(r_t) + µ′(r_t)·(r_{t+1} − r_t) − (L/2)(r_{t+1} − r_t)²
              ≥ µ(r_t) + α_t µ′(r_t)·P_C(r_t, Ĝ_t, α_t) − (L/2)α_t²|Ĝ_t|²
              = µ(r_t) + α_t Ĝ_t·P_C(r_t, Ĝ_t, α_t) + α_t(µ′(r_t) − Ĝ_t)·P_C(r_t, Ĝ_t, α_t) − (L/2)α_t²|Ĝ_t|²
              ≥ µ(r_t) + α_t|P_C(r_t, Ĝ_t, α_t)|² + α_t(µ′(r_t) − Ĝ_t)·P_C(r_t, Ĝ_t, α_t) − (L/2)α_t²|Ĝ_t|².

The second inequality holds because r_{t+1} − r_t = α_t P_C(r_t, Ĝ_t, α_t) and |r_{t+1} − r_t| ≤ α_t|Ĝ_t|, and the last inequality is based on the first inequality in Lemma C.4. Rearranging the above inequalities, we have

    α_t|P_C(r_t, Ĝ_t, α_t)|² ≤ µ(r_{t+1}) − µ(r_t) + α_t(Ĝ_t − µ′(r_t))·P_C(r_t, Ĝ_t, α_t) + (L/2)α_t²|Ĝ_t|²
                             ≤ µ(r_{t+1}) − µ(r_t) + (α_t/2)|Ĝ_t − µ′(r_t)|² + (α_t/2)|P_C(r_t, Ĝ_t, α_t)|² + (L/2)α_t²|Ĝ_t|²,    (10)

where the last inequality is based on the Cauchy-Schwarz inequality. Rearranging Equation (10), we have

    (α_t/2)|P_C(r_t, Ĝ_t, α_t)|² ≤ µ(r_{t+1}) − µ(r_t) + (α_t/2)|Ĝ_t − µ′(r_t)|² + (L/2)α_t²|Ĝ_t|².    (11)

Using the Cauchy-Schwarz inequality and Lemma C.4, we can bound |P_C(r_t, µ′(r_t), α_t)|² in the following way:

    |P_C(r_t, µ′(r_t), α_t)|² ≤ 2|P_C(r_t, Ĝ_t, α_t)|² + 2|P_C(r_t, µ′(r_t), α_t) − P_C(r_t, Ĝ_t, α_t)|²
                              ≤ 2|P_C(r_t, Ĝ_t, α_t)|² + 2|µ′(r_t) − Ĝ_t|².

Then we can write down the following bound:

    (α_t/2)|P_C(r_t, µ′(r_t), α_t)|² ≤ α_t|P_C(r_t, Ĝ_t, α_t)|² + α_t|µ′(r_t) − Ĝ_t|²
        ≤ 2(µ(r_{t+1}) − µ(r_t)) + 2α_t|Ĝ_t − µ′(r_t)|² + Lα_t²|Ĝ_t|²
        ≤ 2(µ(r_{t+1}) − µ(r_t)) + 2α_t|Ĝ_t − µ′(r_t)|² + 2Lα_t²( |Ĝ_t − µ′(r_t)|² + |µ′(r_t)|² ).    (12)

Let r* ∈ C be the point with the minimum absolute gradient, i.e., r* = argmin_{r∈C} |µ′(r)|. Notice that |µ′(r_t)|² ≤ 2|µ′(r*)|² + 2|µ′(r*) − µ′(r_t)|² ≤ 2L² + 2|µ′(r*)|² =: L*. Summing inequality (12) from t = 1 to T and applying inequality (9), with probability at least 1 − δ′ we have

    Σ_{t=1}^T (α_t/2)|P_C(r_t, µ′(r_t), α_t)|² ≤ 2(µ(r_{T+1}) − µ(r_1)) + Σ_{t=1}^T 2α_t B_{δ′,t}² + Σ_{t=1}^T 2Lα_t²( B_{δ′,t}² + L* ).

Setting α_t = Θ(1/√T), β_t = δ/(2r_t), and n_t = N/T, and using the facts that Σ_{t=1}^T α_t = Θ(√T) and Σ_{t=1}^T α_t² = Θ(1), we get

    min_{t∈[T]} |P_C(r_t, µ′(r_t), α_t)|² ≤ (1/Σ_{t=1}^T α_t) Σ_{t=1}^T α_t |P_C(r_t, µ′(r_t), α_t)|²
        ≤ O( √(1/T) + ( δ + B + CT ln(2T/δ′)/N + √(V ln(2T/δ′)) )² ). ∎

C.2 Proof of Theorem 3.2
Proof.
Since the X_i^+ and X_i^- are independent random samples from B(r^+) and B(r^-) respectively,

    E[Ĝ] = ( µ(r^+) − µ(r^-) ) / ( r^+ − r^- ).

For the variance, since each X_i^+ and X_i^- is bounded in [0, 1], the variance of each X_i^+ and X_i^- is at most 1/4, which implies

    Var(Ĝ) ≤ ( n·(1/4) + n·(1/4) ) / ( n²δ² ) = 1/(2δ²n). ∎

C.3 Variance bounds for Ĝ_D

In this subsection we bound the variance of the unbiased demand estimator

    Ĝ_D = ( r^+ Σ_{i=1}^n 1(X_i^+ ≥ r^+) − r^- Σ_{i=1}^n 1(X_i^- ≥ r^-) ) / ( n(r^+ − r^-) ).

Since E[ (1/n) Σ_{i=1}^n 1(X_i^+ ≥ r^+) ] = D(r^+) and E[ (1/n) Σ_{i=1}^n 1(X_i^- ≥ r^-) ] = D(r^-), it follows immediately that

    E[Ĝ_D] = ( r^+ D(r^+) − r^- D(r^-) ) / ( r^+ − r^- ),

and therefore that Bias(Ĝ_D) = 0. In the following theorem, we bound Var(Ĝ_D).

Theorem C.5.
Let δ = r^+ − r^-. We have that Var(Ĝ_D) ≤ (r^+)²/(2nδ²).

Proof. Since the X_i^+ and X_i^- are independent random variables, their variances are additive, so

    Var(Ĝ_D) = (1/(n²δ²)) ( n(r^+)² Var(1(X_i^+ ≥ r^+)) + n(r^-)² Var(1(X_i^- ≥ r^-)) ) ≤ (r^+)²/(2nδ²),

where in the last step we have used the fact that the variance of a Bernoulli random variable is bounded above by 1/4 (and that r^- ≤ r^+). ∎

C.4 Proof of Theorem 4.1
Proof.
Note that

    µ(r) = E_{v∼F}[ b(r, v) ]
         = E_{v∼F}[ max(b(r, v), r) − r·1{b(r, v) = 0} ]
         = E_{v∼F}[ max(b(r, v), r) ] − r·E_{v∼F}[ 1{b(r, v) = 0} ]
         = E_{v∼F}[ max(b(r, v), r) ] − r·Pr_{v∼F}[ v < r ]
         = E_{v∼F}[ max(b(r, v), r) ] − r( 1 − Pr_{v∼F}[ v ≥ r ] )
         = E_{v∼F}[ max(b(r, v), r) − r ] + r·Pr_{v∼F}[ v ≥ r ]. ∎

C.5 Proof of Theorem 4.2
In the following proof we slightly abuse notation by writing B_r for the CDF of the bid distribution at reserve price r. We also formally state the definition of the ε-bounded response model here.

Definition C.6 (ε-bounded response model). An ε-bounded response bidding function takes the form

    b(r, v) = b(0, v)   if b(0, v) ≥ r,
              r + z     if b(0, v) < r ≤ v,
              0         if v < r,

where z ∈ [0, ε], and z can be a random variable.

Proof. To bound the variance, note that each Y_i^- is constrained to the interval [0, r^+ − r^-]. Since this interval has length at most δ, the variance of each Y_i^- is at most δ²/4, so

    Var(Ĝ_E) ≤ n(δ²/4)/(n²δ²) = 1/(4n).

We then focus on bounding the bias. We have

    E_v[ max(b(r^+, v) − r^+, 0) − max(b(r^-, v) − r^-, 0) ]
        = ∫ max(b, r^+) dB_{r^+} − ∫ max(b, r^-) dB_{r^-} − (r^+ − r^-)
        = ∫_0^{r^+ + ε} max(b, r^+) dB_{r^+} − ∫_0^{r^+ + ε} max(b, r^-) dB_{r^-}
          + ∫_{r^+ + ε}^1 max(b, r^+) dB_{r^+} − ∫_{r^+ + ε}^1 max(b, r^-) dB_{r^-} − (r^+ − r^-)
        = ∫_0^{r^+} max(b, r^+) dB_{r^+} − ∫_0^{r^+} max(b, r^-) dB_{r^-}
          + ∫_{r^+}^{r^+ + ε} max(b, r^+) dB_{r^+} − ∫_{r^+}^{r^+ + ε} max(b, r^-) dB_{r^-} − (r^+ − r^-)
        = r^+ B_{r^+}(r^+) − ∫_0^{r^+} max(b, r^-) dB_{r^-} − (r^+ − r^-) + ∫_{r^+}^{r^+ + ε} b dB_{r^+} − ∫_{r^+}^{r^+ + ε} b dB_{r^-}.

The third equality holds because for b > r^+ + ε we have B_{r^+}(b) = B_{r^-}(b) by the property of the ε-bounded response, so the tail integrals above r^+ + ε cancel. The fourth equality is based on max(b(r^+, v), r^+) = r^+ when v ≤ r^+ and F(r^+) = B_{r^+}(r^+). We then consider E[Y_i^-], where

    E[Y_i^-] = E[ max(X_i^- − r^-, 0)·1{X_i^- ≤ r^+} ] + E[ (r^+ − r^-)·1{X_i^- > r^+} ]
             = ∫_0^{r^+} max(b, r^-) dB_{r^-} + r^+ (1 − B_{r^-}(r^+)) − r^-.

Before bounding the bias of Ĝ_E, we state some useful equations based on integration by parts:

    ∫_{r^+}^{r^+ + ε} b dB_{r^+} = (r^+ + ε) B_{r^+}(r^+ + ε) − r^+ B_{r^+}(r^+) − ∫_{r^+}^{r^+ + ε} B_{r^+}(b) db,    (13)
    ∫_{r^+}^{r^+ + ε} b dB_{r^-} = (r^+ + ε) B_{r^-}(r^+ + ε) − r^+ B_{r^-}(r^+) − ∫_{r^+}^{r^+ + ε} B_{r^-}(b) db.    (14)

Based on the definition of the ε-bounded response, B_{r^-}(r^+ + ε) = B_{r^+}(r^+ + ε). Then we have

    | E[Ĝ_E] − (E(r^+) − E(r^-))/(r^+ − r^-) |
        = | −E[Y_i^-]/(r^+ − r^-) − (E(r^+) − E(r^-))/(r^+ − r^-) |
        = (1/(r^+ − r^-)) | r^+ B_{r^+}(r^+) − r^+ B_{r^-}(r^+) + ∫_{r^+}^{r^+ + ε} b dB_{r^+} − ∫_{r^+}^{r^+ + ε} b dB_{r^-} |
        = (1/(r^+ − r^-)) | ∫_{r^+}^{r^+ + ε} B_{r^+}(b) db − ∫_{r^+}^{r^+ + ε} B_{r^-}(b) db |
          (based on Equations (13) and (14), as well as B_{r^-}(r^+ + ε) = B_{r^+}(r^+ + ε))
        ≤ 2ε / (r^+ − r^-),

where the final inequality holds because B_{r^+}(b), B_{r^-}(b) ≤ 1 for all b ∈ [r^+, r^+ + ε]. ∎

C.6 Proof of Theorem 4.3
We start with the following helpful auxiliary lemmas.
Lemma C.7.
Let Y_1, Y_2, ..., Y_n be n i.i.d. uniform random variables, and let Y_(k) be the k-th smallest of the Y_i. Then with probability at least 1 − n^{−1/3},

    | Y_(k) − k/(n + 1) | ≤ n^{−1/3}.

Proof. From the theory of order statistics [12], we know that Y_(k) ∼ Beta(k, n + 1 − k). It is known that E[Y_(k)] = k/(n + 1) and that Var(Y_(k)) ≤ 1/(4n). The statement immediately follows from Chebyshev's inequality. ∎

Lemma C.8. Let f : [0, 1] → [0, 1] be an increasing function and let Y_1, Y_2, ..., Y_n be n i.i.d. uniform random variables. Let X_i = f(Y_i), and let S_k be the random variable equal to the sum of the k smallest X_i. Then

    | (1/n) E[S_{qn}] − ∫_0^q f(x) dx | ≤ O(n^{−1/3}).

Proof.
Let Z_1, Z_2, ..., Z_{qn} be (a random permutation of) the qn smallest Y_i (so S_{qn} = Σ_i f(Z_i)). Note that conditioned on Z_{qn+1} = r, the Z_i are independently distributed according to U([0, r]). In particular, we have that

    (1/n) E[ S_{qn} | Z_{qn+1} = r ] = (q/r) ∫_0^r f(x) dx,

which, since ∫_0^r f(x) dx ≤ r, differs from ∫_0^r f(x) dx by at most |q − r|. From Lemma C.7, we know that with probability at least 1 − n^{−1/3}, Z_{qn+1} ∈ [q − 2n^{−1/3}, q + 2n^{−1/3}]. Since f(x) ∈ [0, 1], the map r ↦ ∫_0^r f(x) dx is 1-Lipschitz, and therefore (conditioned on Z_{qn+1} ∈ [q − 2n^{−1/3}, q + 2n^{−1/3}])

    | ∫_0^q f(x) dx − ∫_0^{Z_{qn+1}} f(x) dx | ≤ 2n^{−1/3}.

On the other hand, in the (at most n^{−1/3} probability) case where Z_{qn+1} ∉ [q − 2n^{−1/3}, q + 2n^{−1/3}], (1/n) S_{qn} is still bounded in [0, 1]. Combining these cases proves the claim. ∎

Lemma C.9.
Let X_1, ..., X_n be an i.i.d. collection of n random variables, and let X_(k) be the k-th smallest of the X_i (so X_(1) ≤ X_(2) ≤ ··· ≤ X_(n)). Then, if S_k = Σ_{i=1}^k X_(i), we have that

    Var(S_k) ≤ n E[ (X_(k))² ].

Proof. The Efron-Stein inequalities (see Theorem 2 of [8]) state that for any collection of n random variables X_i and any measurable functions f : ℝⁿ → ℝ and f_i : ℝ^{n−1} → ℝ, we have that

    Var(f(X)) ≤ Σ_{i=1}^n E[ (f(X) − f_i(X_{−i}))² ],

where X_{−i} is the (n − 1)-dimensional vector (X_1, X_2, ..., X_{i−1}, X_{i+1}, ..., X_n). Let f(X) equal the sum of the k smallest entries in X, and let f_i(X′) equal the sum of the k − 1 smallest entries in X′. Note that for this choice of f and f_i, f(X) = S_k, and 0 ≤ f(X) − f_i(X_{−i}) ≤ X_(k) (since the k − 1 smallest entries of X_{−i} are a subset of the k smallest entries in X). It follows that Var(S_k) ≤ n E[(X_(k))²], as desired. ∎

We can now proceed to prove Theorem 4.3.

Proof of Theorem 4.3.
We begin by bounding the variance of our estimator Ĝ_E, starting with Var(Σ_{i=1}^{qn} Y_i^-). Since these Y_i^- are sorted, Lemma C.9 implies that Var(Σ_{i=1}^{qn} Y_i^-) ≤ n E[(Y_{qn}^-)²]. Since t̃ = F^{−1}(q + 2n^{−1/3}), by Lemma C.7, with probability at least 1 − n^{−1/3} we have Y_{qn}^- ≤ t̃, and therefore Var(Σ_{i=1}^{qn} Y_i^-) ≤ n t̃² + n^{2/3}. Similarly, Var(Σ_{i=1}^{qn} Y_i^+) ≤ n t̃² + n^{2/3}. Since the sets of random variables Y_i^- and Y_i^+ are independent, we have that

    Var(Ĝ_E) ≤ 2t̃²/(nδ²) + O(n^{−4/3} δ^{−2}).

We now proceed to bound the bias of Ĝ_E. First, note that by Lemma C.8, we have that

    | (1/n) E[ Σ_{i=1}^{qn} Y_i^- ] − ∫_0^q max(b(r^-, F^{−1}(x)) − r^-, 0) dx | ≤ O(n^{−1/3}),

and therefore

    | (1/n) E[ Σ_{i=1}^{qn} Y_i^- ] − ∫_0^t max(b(r^-, v) − r^-, 0) dF(v) | ≤ O(n^{−1/3}).

Likewise,

    | (1/n) E[ Σ_{i=1}^{qn} Y_i^+ ] − ∫_0^t max(b(r^+, v) − r^+, 0) dF(v) | ≤ O(n^{−1/3}).

It follows that

    | E[Ĝ_E] − ( (1/δ) ∫_0^t ( max(b(r^+, v) − r^+, 0) − max(b(r^-, v) − r^-, 0) ) dF(v) − (1 − q) ) | ≤ O(n^{−1/3}/δ).    (15)

On the other hand, note that

    E(r^+) − E(r^-) = ∫_0^1 ( max(b(r^+, v) − r^+, 0) − max(b(r^-, v) − r^-, 0) ) dF(v)
                    = ∫_0^t ( max(b(r^+, v) − r^+, 0) − max(b(r^-, v) − r^-, 0) ) dF(v)
                      + ∫_t^1 ( max(b(r^+, v) − r^+, 0) − max(b(r^-, v) − r^-, 0) ) dF(v).

Now, note that for v ≥ r^+ we have max(b(r^+, v) − r^+, 0) − max(b(r^-, v) − r^-, 0) = b(r^+, v) − b(r^-, v) − (r^+ − r^-). Since Pr[v ≥ t] = 1 − q, it follows that

    ∫_t^1 ( max(b(r^+, v) − r^+, 0) − max(b(r^-, v) − r^-, 0) ) dF(v)
        = ∫_t^1 ( b(r^+, v) − b(r^-, v) − (r^+ − r^-) ) dF(v)
        = ∫_t^1 ( b(r^+, v) − b(r^-, v) ) dF(v) − (1 − q)(r^+ − r^-)
        ∈ [ −(1 − q)δ, −(1 − q)δ + (b(r^+, t) − b(r^-, t))(1 − q) ].

Here the last line follows since b(r^+, v) − b(r^-, v) is decreasing in v (due to diminishing sensitivity to reserve) but always non-negative. Combining this with Equation (15), we have that

    | E[Ĝ_E] − (1/δ)( E(r^+) − E(r^-) ) | ≤ ( b(r^+, t) − b(r^-, t) )(1 − q)/δ + O(n^{−1/3}/δ),

as desired. ∎

C.7 Convergence Rate of Quantile Truncation

Corollary C.1.
C.7 Convergence Rate of Quantile Truncation

Corollary C.1. Suppose $\mathrm{Bias}(\hat G_D) \le \varepsilon_D/\delta$. Using the estimator $\hat G_E$ proposed in Eq. (5) for the response model with the diminishing sensitivity property, for any fixed quantile $q$, setting $T = N^{1/3}$ and $\delta = \Theta\big(\sqrt{\varepsilon_D + 1 - q}\big)$, Algorithm 1 achieves convergence

$$\min_{t \in [T]} |\mathcal{P}_t - \mathcal{C}| \le \tilde O\left( \varepsilon_D + 1 - q + \frac{F^{-1}(q + N^{-1/3})}{\varepsilon_D + 1 - q} \cdot N^{-1/3} \right).$$

D Additional Experiments
In this section, we list the parameters used in the algorithms and report some additional experiments. For the additional experiments, we first compare the two truncation methods under perfect response models, and then show the complete results on all 20 semi-synthetic data sets with perfect response. Finally, we test other response models on synthetic data and on one of the semi-synthetic data sets (the first data set). In addition to the figures showing the revenue curves learned by our algorithms, we also report the average revenue over the first several rounds of training in Tables 1, 2, and 3.
Set up. For all the algorithms, we set the learning rate to 0.05, the minimum reserve price to 0.1, the maximum reserve price to 5.0, and the perturbation size to a fixed constant $\beta_t$. In each round, every algorithm observes a fresh batch of bids at the two perturbed reserves (50 samples for $r^-$ and 50 for $r^+$), and one full training run forms one trial. We repeat 50 trials for each algorithm and, in the figures, report the mean (solid line) and 95% confidence interval (translucent colored area) of the revenue achieved during training. To obtain the revenue curves learned by the algorithms over time, the revenue at each point is estimated from 10,000 bids randomly drawn from the bid distribution and bidder response model.
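For concreteness, the evaluation step can be sketched as follows (our own illustration under the stated parameters, not the paper's code): revenue at a candidate reserve is estimated as the average first-price payment over 10,000 bids drawn from the bid distribution and response model. The bidder model `sample_bids` and the toy shading rule are hypothetical placeholders.

```python
# Sketch of the revenue evaluation behind the reported curves. `sample_bids`
# is a hypothetical stand-in for the bid distribution + response model.
import numpy as np

LEARNING_RATE = 0.05                 # matches the setup above
RESERVE_MIN, RESERVE_MAX = 0.1, 5.0  # reserve price bounds from the setup
N_EVAL_BIDS = 10_000                 # bids drawn to estimate revenue at one reserve

def estimate_revenue(sample_bids, r, rng):
    # First-price revenue at reserve r: the bid is paid whenever it clears
    # the reserve, and the impression goes unsold otherwise.
    bids = sample_bids(r, N_EVAL_BIDS, rng)  # model may respond to the reserve
    return np.where(bids >= r, bids, 0.0).mean()

rng = np.random.default_rng(1)
toy_bidder = lambda r, m, g: 0.6 * g.uniform(0.0, 5.0, size=m)  # toy shading rule
print(estimate_revenue(toy_bidder, 1.0, rng))
```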
Comparison of bid truncation and quantile truncation for perfect response models. We show the average revenue achieved by the algorithms with bid truncation and quantile truncation on synthetic data with perfect response in Table 1 (first 50 rounds) and on semi-synthetic data with perfect response in Table 2 (first 20 rounds) and Table 3 (first 50 rounds). Bid truncation (Algorithm (IV)) performs best in most cases, but quantile truncation (Algorithm (V)) is also competitive and performs well under perfect response on both the synthetic and semi-synthetic data sets. Bid truncation does not significantly outperform quantile truncation, since the algorithms draw only a small number of samples at each round and quantile truncation is more stable in this regime.
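As a rough contrast between the two variance-reduction methods compared here (our own sketch; the paper's exact estimators are defined earlier in the text, so the clipping reading of bid truncation is an assumption):

```python
# One simple reading of the two truncation methods, applied to the
# transformed bids Y_i on one side of the gradient estimate.
import numpy as np

def truncate_by_bid(y, t):
    # Bid truncation: clip each value at a fixed bid threshold t.
    return np.minimum(np.asarray(y), t)

def truncate_by_quantile(y, q):
    # Quantile truncation: keep only the lowest q fraction of the sample;
    # the cutoff adapts to the observed bids rather than being fixed.
    y = np.sort(np.asarray(y))
    return y[: int(q * len(y))]
```

With only about 50 samples per round, the adaptive quantile cutoff varies less across rounds than a fixed bid threshold, which is consistent with the stability observation above.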
Performance on all 20 semi-synthetic data sets with perfect response.

We evaluate the performance of the five algorithms on 20 semi-synthetic data sets with perfect response. Here we again repeat 50 trials for each algorithm and report the mean and 95% confidence interval of the revenue. The results are summarized in Table 2 and Table 3, which record the average revenue over the first 20 and the first 50 rounds, respectively. The revenue is normalized by the optimal revenue of each data set (evaluated empirically by grid search). We find that on semi-synthetic data, the variance-reduction methods (bid truncation and quantile truncation) improve the revenue achieved by the algorithms. Interestingly, naive gradient descent with bid truncation also works well on several semi-synthetic data sets; this is because our demand-modeling approach relies on a good estimator $\hat G_D$, and early in training the simple neural network sometimes cannot learn the demand curve accurately.
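The paper's demand component fits a simple neural network to the demand curve; as a stand-in, the toy sketch below (our own illustration, with a low-degree polynomial in place of the network) fits the empirical demand at several probed reserves and differentiates the fit, which is the role the demand model plays in $\hat G_D$. All names and the polynomial choice are assumptions.

```python
# Toy sketch of a demand-modeling step: estimate demand D(r) = Pr[bid >= r]
# at a grid of reserves, fit a smooth curve, and differentiate it. A cubic
# polynomial stands in for the paper's small neural network.
import numpy as np

def fit_demand_curve(reserves, bids, degree=3):
    demand = np.array([(bids >= r).mean() for r in reserves])  # empirical D(r)
    d_hat = np.poly1d(np.polyfit(reserves, demand, degree))    # smooth fit
    return d_hat, d_hat.deriv()                                # \hat D, \hat D'

rng = np.random.default_rng(2)
bids = rng.lognormal(mean=0.0, sigma=0.5, size=5000)
grid = np.linspace(0.1, 5.0, 25)
d_hat, d_prime = fit_demand_curve(grid, bids)
print(d_hat(1.0), d_prime(1.0))
```

A poor early fit of this curve yields a poor $\hat G_D$, which is exactly the failure mode noted above.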
Figure 3: Plots of reserve price as a function of round t for (a) synthetic data with no response, and (b) one semi-synthetic data set with no response.

No response model.
In the no response model, bidders do not change their bids based on the reserve price, i.e., for all $r > 0$ and $r' \ge 0$, $B_r(r') = B(r')$. This can be regarded as a perfect response model with linear shading factor 1. In this case, we expect the algorithms to converge to the lower bound on reserve prices, 0.1. For the no response model, we set the perturbation size to a small fixed constant.
Mixture of no response and perfect response models. We also consider another non-perfect response model: a mixture of the perfect response and no response models. We assume the bidder does not respond to the reserve price with probability 0.1 and follows the perfect response model with probability 0.9. Figure 4 shows the revenue curves learned by the algorithms on synthetic data and on one semi-synthetic data set. We find that quantile-based variance reduction speeds up training and converges to the optimal reserve faster than naive gradient descent. Since this is a mixture of the perfect response and no response models, the revenue achieved by the bid truncation method is worse than that of the quantile-based approach. Through these experiments, we find that quantile truncation is not sensitive to the response model, whereas bid truncation is very sensitive to even a slightly non-perfect response model.
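To fix ideas, here is a stylized sketch (our own, not the paper's code) of the bidder response models used in these experiments. The reserve-independent base bid b0(v) and the "meet the reserve whenever the value clears it" reading of perfect response are assumptions; the paper's exact bidding functions are defined earlier in the text.

```python
# Stylized response models (assumed forms, for illustration only).
import numpy as np

def no_response_bid(base_bid, value, r):
    return base_bid  # ignores the reserve entirely

def perfect_response_bid(base_bid, value, r):
    # Assumed reading: raise the bid to meet the reserve whenever the
    # bidder's value clears it; otherwise keep the (losing) base bid.
    return np.where(value >= r, np.maximum(base_bid, r), base_bid)

def mixture_response_bid(base_bid, value, r, rng, p_respond=0.9):
    # Respond perfectly w.p. 0.9; ignore the reserve w.p. 0.1.
    respond = rng.random(len(base_bid)) < p_respond
    return np.where(respond, perfect_response_bid(base_bid, value, r), base_bid)

rng = np.random.default_rng(3)
v = rng.uniform(0.0, 5.0, size=8)
print(mixture_response_bid(0.6 * v, v, r=1.0, rng=rng))
```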
05 andthe bias term z ∼ Unif [0 , ε ] in ε -bounded response model (see definition C.6). We visualize therevenue learned by algorithms for synthetic data and semi-synthetic data in Figure 5. The figuresshow that for ε -bounded response, the quantile truncation still works better than naive gradientdescent. Since we have demonstrated that bid truncation is sensitive to non-perfect response model,the performance of the bid truncation is worse than quantile truncation in ε -bounded response. Response model Algorithm (I) Algorithm (II) Algorithm (III) Algorithm (IV) Algorithm (V) rev rev rev rev rev
Perfect response 86.7 ± ± ± ± ± Equilibrium response 95.5 ± ± ± ± ± ε -bounded response 86.4 ± ± ± ± ± Mixture response 90.5 ± ± ± ± ± Table 1:
Table 1: Average revenue over the first 50 rounds of the five algorithms on synthetic data with different response models. The revenue is normalized by the optimal revenue, and we repeat 50 trials for each algorithm to report the 95% confidence interval.

Figure 4: Plots of reserve price and revenue as a function of round t for (a) synthetic data and (b) one semi-synthetic data set, with mixture response, where the bidder uses the perfect response model with probability 0.9 and otherwise does not respond to the reserve.

Figure 5: Plots of revenue as a function of round t for (a) synthetic data with ε-bounded response, and (b) one semi-synthetic data set with ε-bounded response, where ε = 0.05 and the bias term z ∼ Unif[0, ε].

Semi-synthetic data sets   Algorithm (I)   Algorithm (II)   Algorithm (III)   Algorithm (IV)   Algorithm (V)
(1)     73.9 ±
(2)
(3)
(4)
(5)
(6)
(7)     86.4 ±
(8)     95.0 ±
(9)
(10)
(11)
(12)
(13)
(14)
(15)
(16)
(17)
(18)
(19)
(20)

Table 2: Average revenue over the first 20 rounds of the five algorithms on the semi-synthetic data sets with perfect response. The revenue is normalized by the optimal revenue of each data set. We repeat 50 trials to obtain the 95% confidence intervals.

Semi-synthetic data sets   Algorithm (I)   Algorithm (II)   Algorithm (III)   Algorithm (IV)   Algorithm (V)
(1)     75.1 ±
(2)     93.4 ±
(3)     93.9 ±
(4)
(5)
(6)
(7)
(8)     97.3 ±
(9)
(10)
(11)
(12)
(13)    95.6 ±
(14)
(15)
(16)
(17)    96.9 ±
(18)
(19)    84.7 ±
(20)    97.0 ±

Table 3: