Buying Data Over Time: Approximately Optimal Strategies for Dynamic Data-Driven Decisions
Nicole Immorlica∗   Ian A. Kash†   Brendan Lucier‡

Abstract
We consider a model where an agent has a repeated decision to make and wishes to maximize their total payoff. Payoffs are influenced by an action taken by the agent, but also an unknown state of the world that evolves over time. Before choosing an action each round, the agent can purchase noisy samples about the state of the world. The agent has a budget to spend on these samples, and has flexibility in deciding how to spread that budget across rounds. We investigate the problem of choosing a sampling algorithm that optimizes total expected payoff. For example: is it better to buy samples steadily over time, or to buy samples in batches? We solve for the optimal policy, and show that it is a natural instantiation of the latter. Under a more general model that includes per-round fixed costs, we prove that a variation on this batching policy is a 1/2-approximation.

∗Microsoft Research, [email protected]
†Department of Computer Science, University of Illinois at Chicago, [email protected]
‡Microsoft Research, [email protected]

1 Introduction
The growing demand for machine learning practitioners is a testament to the way data-driven decision making is shaping our economy. Data has proven so important and valuable because so much about the current state of the world is a priori unknown. We can better understand the world by investing in data collection, but this investment can be costly; deciding how much data to acquire can be a non-trivial undertaking, especially in the face of budget constraints. Furthermore, the value of data is typically not linear. Machine learning algorithms often see diminishing returns to performance as their training dataset grows [22, 10]. This non-linearity is further complicated by the fact that a data-driven decision approach is typically intended to replace some existing method, so its value is relative to the prior method's performance.

As a motivating example for these issues, consider a politician who wishes to accurately represent the opinion of her constituents. These constituents have a position on a policy, say the allocation of funding to public parks. The politician must choose her own position on the policy or abstain from the discussion. If she states a position, she experiences a disutility that is increasing in the distance of her position from that of her constituents. If she abstains, she incurs a fixed cost for failing to take a stance. To help her make an optimal decision she can hire a polling firm that collects data on the participants' positions.

We focus on the dynamic element of this story. In many decision problems, the state of the world evolves over time. In the example above, the opinions of the constituents might change as time passes, impacting the optimal position of the politician. As a result, data about the state of the world becomes stale. Furthermore, many decisions are not made a single time; instead, decisions are made repeatedly. In our example, the politician can update funding levels each fiscal quarter. When faced with budget constraints on data collection and the issue of data staleness, decisions need to be made about when to collect data and when to save budget for the future, and whether to make decisions based on stale data or apply a default, non-data-driven policy. Our main contribution is a framework that models the impact of such budget constraints on data collection strategies. In our example, the politician has a budget for data collection. A polling firm charges a fixed cost to initiate a poll (e.g., create the survey) plus a fee per surveyed participant. The politician may not have enough budget to hire the firm to survey every constituent every quarter. Should she then survey fewer constituents every quarter? Or survey a larger number of constituents every other quarter, counting on the fact that opinions do not drift too rapidly?

We initiate the study with arguably the simplest model that exhibits this tension. The state of the world (constituents' opinions) is hidden but drawn from a known prior distribution, then evolves stochastically. Each round, the decision-maker (politician) can collect one or more noisy samples that are correlated with the hidden state at a cost affine in the number of samples (conduct a poll). Then she chooses an action and incurs a loss. Should the decision-maker not exhaust her budget in a given round, she can bank it for future rounds.
A sampling algorithm describes an online policy for scheduling the collection of samples given the budget and past observations. We instantiate this general framework by assuming Gaussian prior, perturbations, and sample noise. We capture the decisions that need to be made as the problem of estimating the current state value, using the classic squared loss to capture the cost of making a decision using imprecise estimates.

A Gaussian prior is justified in our running example if we assume a large-population limit of constituents' opinions. That the prior estimate of drift is also Gaussian is likewise motivated as the number of periods grows large. We discuss alternative distributional assumptions on the prior, perturbations, and noise in Section 6.
To illustrate our technical model, suppose the hidden state (constituents' average opinion) is initially drawn from a mean-zero Gaussian of variance 1. In each round, the state is subject to mean-zero Gaussian noise of variance 1 (the constituents update their opinions), which is added to the previous round's state. Also, any samples we choose to take are subject to mean-zero Gaussian noise of variance 1 (polls are imperfect). Our budget for samples is 1 per period, and one can either guess at the hidden state (incurring a penalty equal to the squared loss) or pass and take a default loss of 3/4. What is the expected average loss of the policy that takes a single sample each round, and then takes the optimal action? As it turns out, the expected loss is precisely φ − 1 ≈ 0.618, where φ is the golden ratio (1 + √5)/2 (see Section 3.5 for the analysis). However, this is not optimal: saving up the allotted budget and taking two samples every other round leads to an expected loss of √2/2 − 1/8 ≈ 0.582. The intuition behind the improvement is that taking a single sample every round beats the outside option, but not by much; it is better to beat the outside option significantly on even-numbered rounds (by taking 2 samples), then simply use the outside option on odd-numbered rounds. It turns out that one cannot improve on this by saving up for 3 or more rounds to take even more samples all at once. However, one can do better by alternating between taking no samples for two periods and then two samples each for two periods, which results in a long-run average loss of ≈ 0.577.

As we can see from the example above, the space of policies to consider is quite large. One simple observation is that since samples become stale over time it is never optimal to collect samples and then take the outside option (i.e., the default fixed-cost action) in the same round; it would be better to defer data collection to later rounds where decisions will be made based on data. As a result, a natural class of policies to consider is those which alternate between collecting samples and saving budget. Such "on-off" policies can be thought of as engaging in "data drives" while neglecting data collection the rest of the time.

Our main result is that these on-off policies are asymptotically optimal, with respect to all dynamic policies. Moreover, it suffices to collect samples at a constant rate during the sampling part of the policy's period. Our argument is constructive, and we show how to compute an asymptotically optimal policy. This policy divides time into exponentially-growing chunks and collects data in the latter end of each chunk.

The solution above assumes that costs are linear in the number of samples collected. We next consider a more general model with a fixed up-front cost for the first sample collected in each round. This captures the costs associated with setting up the infrastructure to collect samples on a given round, such as hiring a polling firm which uses a two-part tariff. Under such per-round costs, it can be suboptimal to sample in sequential periods (as in an on-off policy), as this requires paying the fixed cost twice. For this generalized cost model, we consider simple and approximately optimal policies. When evaluating performance, we compare against a null "baseline" policy that eschews data collection and simply takes the outside option every period. We define the value of a policy to be its improvement over this baseline, so that the null policy has a value of 0 and every policy has non-negative value.
While this is equivalent to simply comparing the expected costs of policies, this alternative measure is intended to capture how well a policy leverages the extra value obtainable from data; we feel that this more accurately reflects the relative performance of different policies.

We focus on a class of lazy policies that collect samples only at times when the variance of the current estimate is worse than the outside option. This class captures a heuristic based on a threshold rule: the decision-maker chooses to collect data when they do not have enough information to gain over the outside option. We show the optimal lazy policy is a 1/2-approximation to the optimal policy. The result is constructive, and we show how to compute an asymptotically optimal lazy policy. Moreover, this approximation factor is tight for lazy policies.

To derive these results, we begin with the well-known fact that the expected loss under the squared loss cost function is the variance of the posterior. We use an analysis based on Kalman filters [23], which are used to solve localization problems in domains such as astronautics [27], robotics [35], and traffic monitoring [37], to characterize the evolution of variance given a sampling policy. We show how to maximize value using geometric arguments and local manipulations to transform an optimal policy into either an on-off policy or a lazy policy, respectively.

We conclude with two extensions. We described our results for a discrete-time model, but one might instead consider a continuous-time variant in which samples, actions, and state evolution occur continuously. We show how to extend all of our results to such a continuous setting. Second, we describe a non-Gaussian instance of our framework, where the state of the world is binary and switches with some small probability each round. We solve for the optimal policy, and show that (like the Gaussian model) it is characterized by non-uniform, bursty sampling.

We motivated our framework with a toy example of a politician polling his or her constituents. But we note that the model is general and applies to other scenarios as well. For example, suppose a phone uses its GPS to collect samples, each of which provides a noisy estimate of location (reasonably approximated by Gaussian noise). The "cost" of collecting samples is energy consumption, and the budget constraint is that the GPS can only reasonably use a limited portion of the phone's battery capacity. The worse the location estimate is, the less useful this information is to apps; sufficiently poor estimates might even have negative value. However, as an alternative, apps always have the outside option of providing location-unaware functionality. Our analysis shows that it is approximately optimal to extrapolate from existing data to estimate the user's location most of the time, and only use the GPS in "bursts" once the noise of the estimate exceeds a certain threshold. Note that in this scenario the app never observes the "ground truth" of the phone's location. Similarly, our model might capture the problem faced by a firm that runs user studies when deciding which features to include in a product, given that such user studies are expensive to run and preferences may shift within the population of customers over time.
Our results provide insight into the trade-offs involved in designing data collection policies in dynamic settings. We construct policies that navigate the trade-off between cost of data collection and freshness of data, and show how to optimize data collection schedules in a setting with Gaussian noise. But perhaps our biggest contribution is conceptual, in providing a framework in which these questions can be formalized and studied. We view this work as a first step toward a broader study of the dynamic value of data. An important direction for future work is to consider other models of state evolution and/or sampling within our framework, aimed at capturing other applications. For example, if the state evolves in a heavy-tailed manner, as in the non-Gaussian instance explored in Section 6, then we show it is beneficial to take samples regularly in order to detect large, infrequent jumps in state value, and then adaptively take many samples when such a jump is evident. We solve this extension only for a simple two-state Markov chain. Can we quantify the dynamic value of data and find an (approximately) optimal and simple data collection policy in a general Markov chain?
1.1 Related Work

While we are not aware of other work addressing the value of data in a dynamic setting, there has been considerable attention paid to the value of data in static settings. Arrieta-Ibarra et al. [4] argue that the data produced by internet users is so valuable that they should be compensated for their labor. Similarly, there is growing appreciation for the value of the data produced on crowdsourcing platforms like Amazon Mechanical Turk [6, 20]. Other work has emphasized that not all crowdsourced data is created equal and studied the way tasks and incentives can be designed to improve the quality of information gathered [17, 31]. Similarly, data can have non-linear value if individual pieces are substitutes or complements [8]. Prediction markets can be used to gather information over time, with participants controlling the order in which information is revealed [11].

There is a growing line of work attempting to determine the marginal value of training data for deep learning methods. Examples include training data for classifying medical images [9] and chemical processes [5], as well as for more general problems such as estimating a Gaussian distribution [22]. These studies consider the static problem of learning from samples, and generally find that additional training data exhibits decreasing marginal value. Koh and Liang [25] introduced the use of influence functions to quantify how the performance of a model depends on individual training examples. While we assume samples are of uniform quality, other work has studied agents who have data of different quality or cost [29, 7, 16]. Another line studies the way that data is sold in current marketplaces [33], as well as proposing new market designs [28]. This includes going beyond markets for raw data to markets which acquire and combine the outputs of machine learning models [34].

Our work is also related to statistical and algorithmic aspects of learning a distribution from samples. A significant body of recent work has considered problems of learning Gaussians using a minimal number of noisy and/or adversarial samples [21, 13, 14, 26, 15]. In comparison, we are likewise interested in learning a hidden Gaussian from which we obtain noisy samples (as a step toward determining an optimal action), but instead of robustness to adversarial noise we are concerned with optimizing the split of samples across time periods in a purely stochastic setting.

Our investigation of data staleness is closely related to the issue of concept drift in streaming algorithms; see, e.g., Chapter 3 of [2]. Concept drift refers to scenarios where the data being fed to an algorithm is pulled from a model that evolves over time, so that, for example, a solution built using historical data will eventually lose accuracy. Such scenarios arise in problems of histogram maintenance [18], dynamic clustering [3], and others. One problem is to quantify the amount of drift occurring in a given data stream [1]. Given that such drift is present, one approach to handling concept drift is via sliding-window methods, which limit dependence on old data [12].
The choice of window size captures a tension between using a lot of stale data or a smaller amount of fresh data. However, in work on concept drift one typically cannot control the rate at which data is collected.

Another concept related to staleness is the "age of information." This captures scenarios where a source generates frequent updates and a receiver wishes to keep track of the current state, but due to congestion in the transmission technology (such as a queue or database locks) it is optimal to limit the rate at which updates are sent [24, 32]. Minimizing the age of information can be captured as a limit of our model where a single sample suffices to provide perfect information. Recent work has examined variants of the model where generating updates is costly [19], but the focus in this literature is more on the management of the congestible resource. Closer to our work, several recent papers have eliminated the congestible resource and studied issues such as an energy budget that is stochastic and has limited storage capacity [38] and pricing schemes for when sampling costs are non-uniform [36, 39]. Relative to our work these papers have simpler models of the value of data and focus on features of the sampling policy given the energy technology and pricing scheme, respectively.
2 Model

We first describe our general framework, then describe a specific instantiation of interest in Section 2.1.

Time occurs in rounds, indexed by t = 1, 2, . . . . There is a hidden state variable x_t ∈ Ω that evolves over time according to a stochastic process. The initial state x_1 is drawn from a known distribution F. Write m_t for the (possibly randomized) evolution mapping applied at round t, so that x_{t+1} ← m_t(x_t). In every round, the decision-maker chooses an action y_t ∈ A, and then suffers a loss ℓ(y_t, x_t) that depends on both the action and the hidden state. The evolution functions (m_t) and loss function ℓ are known to the decision-maker, but neither the state x_t nor the loss ℓ(y_t, x_t) is directly observed. Rather, on each round before choosing an action, the decision-maker can request one or more independent samples that are correlated with x_t, drawn from a known distribution Γ(x_t).

Samples are costly, and the decision-maker has a budget that can be used to obtain samples. The budget is B per round, and can be banked across rounds. A sampling policy results in a number of samples s_t taken in each round t, which can depend on all previous observations. The cost of taking s_t samples in round t is C(s_t) ≥ 0. We assume that C is non-decreasing and C(0) = 0. A sampling policy is valid if ∑_{t=1}^T C(s_t) ≤ B · T for all T. For example, C(s_t) = s_t corresponds to a cost of 1 per sample, and setting C(s_t) = s_t + z · 1[s_t > 0] adds an additional cost of z for each round in which at least one sample is collected.

To summarize: on each round, the decision-maker chooses a number of samples s_t to observe, then chooses an action y_t. Their loss ℓ(y_t, x_t) is then realized, the value of x_t is updated to x_{t+1}, and the process proceeds with the next round. The goal is to minimize the expected long-run average of ℓ(y_t, x_t), in the limit as t → ∞, subject to ∑_{t=1}^T C(s_t) ≤ B · T for all T ≥ 1.

2.1 A Gaussian Instantiation

We will be primarily interested in the following instantiation of our general framework. The hidden state variable is a real number (i.e.,
Ω = R) and the decision-maker's goal is to estimate the hidden state in each round. The initial state is x_1 ∼ N(0, ρ_0), a Gaussian with mean 0 and variance ρ_0 > 0. Moreover, the evolution process m_t sets x_{t+1} = x_t + δ_t, where each δ_t ∼ N(0, ρ) independently.

Assuming that the ground truth for ℓ(y_t, x_t) is unobserved captures scenarios like our political example, and approximates settings where the decision-maker only gets weak feedback, feedback at a delay, or feedback in aggregate over a long period of time. Observing the loss provides additional information about x_{t+1}, and this could be considered a variant of our model where the decision-maker gets some number of samples "for free" each round from observing a noisy version of the loss.
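To fix ideas, here is a minimal generative sketch of this instantiation (our own code, not from the paper; all parameter values are arbitrary): a random-walk hidden state observed through noisy samples.

```python
# Generative process of the Gaussian instantiation: a random-walk hidden
# state and noisy samples of it (all parameter values are illustrative).
import random

RHO0, RHO, SIGMA, T = 1.0, 0.2, 1.0, 10

x = random.gauss(0.0, RHO0 ** 0.5)             # x_1 ~ N(0, rho_0)
for t in range(1, T + 1):
    samples = [x + random.gauss(0.0, SIGMA ** 0.5) for _ in range(2)]
    obs = ", ".join(f"{s:+.3f}" for s in samples)
    print(f"round {t}: x_t = {x:+.3f}, samples = [{obs}]")
    x += random.gauss(0.0, RHO ** 0.5)         # x_{t+1} = x_t + delta_t
```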
We recall that the decision-maker knows the evolution process (and hence ρ) but does not directly observe the realizations δ_t.

Each sample in round t is drawn from N(x_t, σ), where σ > 0. Some of our results will also allow fractional sampling, where we think of an α ∈ (0, 1) fraction of a sample as a sample drawn from N(x_t, σ/α). The action space is A = R ∪ {⊥}. If the decision-maker chooses y_t ∈ R, her loss is the squared error of her estimate, (y_t − x_t)². If she is too unsure of the state, she may instead take a default action y_t = ⊥, which corresponds to not making a guess; this results in a constant loss of c > 0. Let G_t be a random variable whose law is the decision-maker's posterior after observing whatever samples are taken in round t as well as all previous samples. The decision-maker's subjective expected loss when guessing y_t ∈ R is E[(y_t − G_t)²]. This is well known to be minimized by taking y_t = E[G_t], and furthermore the expected loss is then E[(E[G_t] − G_t)²] = Var(G_t). It is therefore optimal to guess y_t = E[G_t] if and only if Var(G_t) < c, and otherwise pass.

We focus on deriving approximately optimal sampling algorithms. To do so, we need to track the variance of G_t as a function of the sampling strategy. As the sample noise and random state perturbations are all zero-mean Gaussians, G_t is a zero-mean Gaussian as well, and the evolution of its variance has a simple form.

Lemma 1.
Let v_t be the variance of G_t and suppose each δ_t ∼ N(0, ρ) independently, and that each sample is subject to zero-mean Gaussian noise with variance σ. Then, if the decision-maker takes s samples in round t + 1, the variance of G_{t+1} is

v_{t+1} = (v_t + ρ) / (1 + (s/σ)(v_t + ρ)).

The proof, which is deferred to the appendix along with all other proofs, follows from our model being a special case of the model underlying a Kalman filter.

The optimization problem therefore reduces to choosing a number of samples s_t to take in each round t in order to minimize the long-run average of min(v_t, c), the loss of the optimal action. That is, the goal is to minimize lim sup_{T→∞} (1/T) ∑_{t=1}^T min(v_t, c), where we take the superior limit so that the quantity is defined even when the average is not convergent. We choose C(s_t) = s_t + z · 1[s_t > 0], so this optimization is subject to the budget constraint that, at each time T ≥ 1, ∑_{t=1}^T (s_t + z · 1[s_t > 0]) ≤ BT. This captures two kinds of information acquisition costs faced by the decision-maker. First, she faces a cost per sample, which we have normalized to one. Second, she faces a fixed cost z (which may be 0) on each day she chooses to take samples, expressed in terms of the number of samples that could instead have been taken on some other day had this cost not been paid. This captures the costs associated with setting up the infrastructure to collect samples on a given round, such as getting data collectors to the location where they are needed, hiring a polling firm which uses a two-part tariff, or establishing a satellite connection to begin using a phone's GPS.

A useful baseline is the cost of a policy that takes no samples and simply chooses the outside option at all times. We refer to this as the null policy. The value of a sampling policy s, denoted Val(s), is defined to be the difference between its cost and the cost of the null policy: Val(s) = lim inf_{T→∞} (1/T) ∑_{t=1}^T max(c − v_t, 0). Note that maximizing value is equivalent to minimizing cost, which we illustrate in Section 3.1. We say that a policy is α-approximate if its value is at least an α fraction of the optimal policy's value.

One can view fractional sampling as modeling scenarios where the value of any one single sample is quite small, i.e., has high variance, so that a single "unit" of variance is derived from taking many samples (e.g., sampling a single constituent in our polling example). It also captures settings where it is possible to obtain samples of varying quality with different levels of investment.
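As a concrete rendering of these definitions, the following sketch (our own code; all names and the simulation horizon are our choices) checks budget validity under C(s_t) = s_t + z · 1[s_t > 0] and estimates the long-run cost and value of a periodic schedule. It uses the identity max(c − v, 0) = c − min(v, c), so value equals c minus the average cost.

```python
# Minimal helpers for the objects just defined: budget validity under
# C(s) = s + z * 1[s > 0], and long-run average cost/value of a periodic
# sampling schedule (Lemma 1 drives the variance update).

def is_valid(schedule, B, z, horizon=10_000):
    """Check the prefix budget constraint sum_{t<=T} C(s_t) <= B*T."""
    spent = 0.0
    for t in range(1, horizon + 1):
        s = schedule[(t - 1) % len(schedule)]
        spent += s + (z if s > 0 else 0.0)
        if spent > B * t + 1e-9:
            return False
    return True

def cost_and_value(schedule, rho, sigma, c, v0=1.0, rounds=100_000):
    """Long-run average of min(v_t, c) and the corresponding value."""
    v, cost = v0, 0.0
    for t in range(rounds):
        s = schedule[t % len(schedule)]
        v = (v + rho) / (1.0 + (s / sigma) * (v + rho))  # Lemma 1
        cost += min(v, c)
    avg = cost / rounds
    return avg, c - avg

print(is_valid([0, 2], B=1.0, z=0.0))          # True: budget banked, then spent
print(cost_and_value([0, 2], 1.0, 1.0, 0.75))  # roughly (0.582, 0.168)
```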
3 The Evolution of Variance

Before moving on to our main results, we show how to analyze the evolution of the variance resulting from a given sampling policy. We first illustrate our model with a particularly simple class of policies: those where s_t takes on only two possible values. We then analyze arbitrary periodic policies, and show via contraction that they result in convergence to a periodic variance evolution.

3.1 Two-Rate Policies

To visualize the problem, we begin by plotting the result of an example policy where the spending rate is constant for some interval of rounds, then shifts to a different constant spending rate. Figure 1 illustrates one such policy. The spending rates are indicated as alternating line segments, while the variance is an oscillating curve, always converging toward the fixed point associated with the current spending rate. Note that this particular policy is periodic, in the sense that the final variance is the same as the initial variance. The horizontal line gives one possible value for the cost of the outside option. Given this, the optimal policy is to guess whenever the orange curve is below the green line and take the outside option whenever it is above it. Thus, the loss associated with this spending policy is given by the orange shaded area in Figure 1. Minimizing this loss is equivalent to maximizing the green shaded area, which corresponds to the value of the spending policy. The null policy, which takes no samples and has variance greater than c always (possibly after an initial period if v_0 < c), has value 0.

3.2 Periodic Policies

We next consider policies that are periodic. A periodic policy with period R has the property that s_t = s_{t+R} for all t ≥ 1. Such policies are natural and have useful structure. In a periodic policy, the variance (v_t) converges uniformly to being periodic in the limit as t → ∞. This follows because the impact of sampling on variance is a contraction map.

Definition 1.
Given a normed space X with norm ‖·‖, a mapping Ψ : X → X is a contraction map if there exists a k < 1 such that, for all x, y ∈ X, ‖Ψ(x) − Ψ(y)‖ ≤ k‖x − y‖.
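Lemma 2 below shows that applying one full period of a policy that takes at least one sample acts as a contraction on the initial variance. As a quick numerical illustration (our own sketch; the period-4 schedule is arbitrary):

```python
# Iterating one period of a fixed schedule from two different initial
# variances; the gap shrinks geometrically, as a contraction map requires.

RHO, SIGMA = 1.0, 1.0
SCHEDULE = [0, 0, 2, 2]  # one period, R = 4

def one_period(v):
    for s in SCHEDULE:
        v = (v + RHO) / (1.0 + (s / SIGMA) * (v + RHO))  # Lemma 1
    return v

x, y = 0.01, 25.0
for n in range(6):
    print(f"after {n} periods: |x - y| = {abs(x - y):.3e}")
    x, y = one_period(x), one_period(y)
```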
Lemma 2. Fix a sampling policy s and a time R ≥ 1, and suppose that s takes a strictly positive number of samples in some round t ≤ R. Let Ψ be the mapping defined as follows: supposing that v_0 = x and v is the variance function resulting from sampling policy s, set Ψ(x) := v_R. Then Ψ is a contraction map over the non-negative reals, under the absolute value norm.

It is well known that a contraction map has a unique fixed point, to which repeated application converges. Since we can view the impact of a periodic sampling policy as repeated application of Ψ to the initial variance in order to obtain v_0, v_R, v_{2R}, . . . , we conclude that the variance will converge uniformly to a periodic function for which v_t = v_{t+R}. Thus, for the purpose of evaluating long-run average cost, it will be convenient (and equivalent) to replace the initial condition on v with a periodic boundary condition v_0 = v_R, and then choose s to minimize the average cost over a single period, (1/R) ∑_{t=1}^R min{v_t, c}, subject to the budget constraint that, at any round T ≤ R, we have ∑_{t=1}^T s_t ≤ BT.

3.3 Lazy Policies

Write ṽ_t = v_{t−1} + ρ for the variance that would be obtained in round t if s_t = 0. We say that a policy is lazy if s_t = 0 whenever ṽ_t < c. That is, samples are collected only at times where the variance would otherwise be at or above the outside option value c. Intuitively, we can think of such a policy as collecting a batch of samples in one round, then "free-riding" off of the resulting information in subsequent rounds. The free-riding occurs until the posterior variance grows large enough that it becomes better to select the outside option, at which point the policy may collect another batch of samples.

If a policy is lazy, then its variance function v increases by ρ whenever ṽ_t < c, with downward steps only at times corresponding to when samples are taken. Furthermore, the value of such a policy decomposes among these sampling instances: for any t where s_t > 0, resulting in a variance of v_t < c, if we write h = ⌊(c − v_t)/ρ⌋ then we can attribute a value of ρh(h + 1)/2 + (h + 1)(c − v_t − ρh). Geometrically, this is the area of the "discrete-step triangle" formed between the increasing sequence of variances and the constant line at c, over the time steps t, . . . , t + h + 1.

3.4 On-Off Policies

An on-off policy is a periodic policy parameterized by a time interval T and a sampling rate S. Roughly speaking, the policy alternates between intervals where it samples at a rate of S each round, and intervals where it does not sample. The two interval lengths sum to T, and the length of the sampling interval is set as large as possible subject to budget feasibility. More formally, the policy sets s_t = 0 for all t ≤ (1 − α) · T, where α = min{B/S, 1} ∈ [0, 1], and s_t = S for all t such that (1 − α)T < t ≤ T. This policy is then repeated, on a cycle of length T. The fraction α is chosen to be as large as possible, subject to the budget constraint.

3.5 The Introductory Example

We can now justify the simple example we presented in the introduction, where ρ = σ = 1, B = 1, and c = 0.75. The policy that takes a single sample each round is periodic with period 1, and hence will converge to a variance that is likewise equal each round. This fixed-point variance, v*, satisfies v* = (v* + 1)/(1 + (v* + 1)) by Lemma 1. Solving for v* yields v* = (√5 − 1)/2 ≈ 0.618 < 0.75, which is the average cost per round.

If instead the policy takes k samples every k rounds, this results in a variance that is periodic with period k. After the round in which samples are taken, the fixed-point variance satisfies v* = (v* + k)/(1 + k(v* + k)), again by Lemma 1.
Solving for v*, and noting that v* + 1 ≥ 1 > c, yields that the cost incurred by this policy is minimized when k = 2.

To solve for the policy that alternates between taking no samples for two rounds, followed by taking two samples on each of two rounds, suppose the long-run periodic variances are v_0, v_1, v_2, v_3, where samples are taken on rounds 2 and 3. Then we have v_1 = v_0 + 1, v_2 = (v_1 + 1)/(1 + 2(v_1 + 1)), v_3 = (v_2 + 1)/(1 + 2(v_2 + 1)), and v_0 = v_3 + 1. Combining this sequence of equations yields 4v_0² + 4v_0 − 13 = 0, which we can solve to find v_0 = (√14 − 1)/2 ≈ 1.371. Plugging this into the equations for v_1, v_2, v_3 and taking the average of min{v_i, 0.75} over i ∈ {0, 1, 2, 3} yields the reported average cost of ≈ 0.577.
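The computations in this subsection are easy to verify numerically. The script below (our own sketch; the horizon is arbitrary) iterates the Lemma 1 recursion for the three schedules from the introduction, and evaluates the k-samples-every-k-rounds family via its fixed-point equation.

```python
# Verify Section 3.5: long-run average losses of the introductory policies
# (rho = sigma = 1, c = 3/4), plus the k-samples-every-k-rounds family.
import math

RHO, SIGMA, C = 1.0, 1.0, 0.75

def average_loss(schedule, rounds=100_000):
    v, total = 1.0, 0.0
    for t in range(rounds):
        s = schedule[t % len(schedule)]
        v = (v + RHO) / (1.0 + (s / SIGMA) * (v + RHO))  # Lemma 1
        total += min(v, C)
    return total / rounds

print(f"one sample per round:  {average_loss([1]):.4f}")           # ~0.6180
print(f"two every other round: {average_loss([0, 2]):.4f}")        # ~0.5821
print(f"off, off, two, two:    {average_loss([0, 0, 2, 2]):.4f}")  # ~0.5766

# k samples every k rounds: fixed point of v = (v+k)/(1+k(v+k)), i.e.
# v^2 + k v - 1 = 0; one round at v*, then k-1 rounds at the outside option.
for k in range(1, 6):
    v = (-k + math.sqrt(k * k + 4)) / 2.0
    print(f"k = {k}: average cost = {(min(v, C) + (k - 1) * C) / k:.4f}")
```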
4 On-Off Policies are Optimal for Linear Costs

In this section we show that when the cost of sampling is linear in the total number of samples taken (i.e., z = 0), and when fractional sampling is allowed, the supremum value over all on-off policies is an upper bound on the value of any policy. This supremum is achieved in the limit as the time interval T grows large. So, while no individual policy achieves the supremum, one can get arbitrarily close with an on-off policy of sufficiently long period. Proofs appear in Appendix C.

We begin with some definitions. For a given period length T > 0, write s^T for the on-off policy of period T with optimal long-run average value. Recall Val(s^T) is the value of policy s^T. We first argue that larger time horizons lead to better on-off policies.

Lemma 3. With fractional samples, for all T > T′, we have Val(s^T) > Val(s^{T′}).

Write V* = sup_T Val(s^T). Lemma 3 implies that V* = lim_{T→∞} Val(s^T) as well. We show that no policy satisfying the budget constraint can achieve value greater than V*.
Theorem 1. With fractional samples, the value of any valid policy s is at most V*.

The proof of Theorem 1 proceeds in two steps. First, for any given time horizon T, it is suboptimal to move from having variance below the outside option to above the outside option; one should always save up budget over the initial rounds, then keep the variance below c from that point onward. This follows because the marginal sample cost of reducing variance diminishes as variance grows, so it is more sample-efficient to recover from very high variance once than to recover from moderately high variance multiple times.

Second, one must show that it is asymptotically optimal to keep the variance not just below c, but uniform. This is done by a potential argument, illustrating that a sequence of moves aimed at "smoothing out" the sampling rate can only increase value and must terminate at a uniform policy. The difficulty is that a sample affects not only the value in the round it is taken, but in all subsequent rounds. We make use of an amortization argument that appropriately credits value to samples, and use this to construct the sequence of adjustments that increase overall value while bringing the sampling sequence closer to uniform in an appropriate metric.

We also note that it is straightforward to compute the optimal on-off policy for a given time horizon T, by choosing the sampling rate that maximizes [value per round] × [fraction of time the policy is "on"]. One can implement a policy whose value asymptotically approaches V* by repeated doubling of the time horizon. Alternatively, since lim_{T→∞} Val(s^T) = V*, s^T will be an approximately optimal policy for sufficiently large T.

Recall that z is the fixed per-round cost of taking a positive number of samples. Even when z = 0, there is still a positive per-sample cost.
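A rough sketch of that computation (our own code; the grid of candidate rates and the horizon are illustrative choices): for a fixed period T, try candidate "on" rates S, set the on-fraction to α = min{B/S, 1}, and keep the schedule with the largest simulated value.

```python
# Grid search for a good on-off policy of period T (z = 0, rho = sigma = 1).
# "Off" rounds come first so that budget is banked before it is spent.

RHO, SIGMA, C, B, T = 1.0, 1.0, 0.75, 1.0, 12

def value(schedule, rounds=50_000):
    v, gain = 1.0, 0.0
    for t in range(rounds):
        s = schedule[t % len(schedule)]
        v = (v + RHO) / (1.0 + (s / SIGMA) * (v + RHO))  # Lemma 1
        gain += max(C - v, 0.0)
    return gain / rounds

best_rate, best_val = None, -1.0
for S in [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 6.0]:
    on = round(min(B / S, 1.0) * T)          # length of the "on" interval
    val = value([0.0] * (T - on) + [S] * on)
    if val > best_val:
        best_rate, best_val = S, val
print(f"best on-rate S = {best_rate}, value = {best_val:.4f}")
```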
Figure 2: The variance curve of a policy compared against the outside option c (green). The squares (drawn in blue) cover the gap between the curves, except possibly when |v_t − c| < ε (for technical reasons). The lazy policy samples on rounds corresponding to the left edge of each square, bringing the variance to each square's bottom-left corner.

5 Approximately Optimal Lazy Policies

In the previous section we solved for the optimal policy when z = 0, meaning that there is no fixed per-round cost when sampling. We now show that for general z, lazy policies are approximately optimal, obtaining at least 1/2 of the value of the optimal policy. All proofs are deferred to Appendix D.

We begin with a lemma that states that, for any valid sampling policy and any sequence of timesteps, it is possible to match the variance at those timesteps with a policy that only samples at precisely those timesteps, and the resulting policy will be valid.

Lemma 4. Fix any valid sampling policy s (not necessarily lazy) with resulting variances (v_t), and any sequence of timesteps t_1 < t_2 < · · · < t_ℓ < · · ·. Then there is a valid policy s′ such that {t | s′_t > 0} ⊆ {t_1, . . . , t_ℓ, . . .}, resulting in variances (v̆_t) with v̆_{t_i} ≤ v_{t_i} for all i.

The intuition is that if we take all the samples we would have spent between timesteps t_ℓ and t_{ℓ+1} and instead spend them all at t_{ℓ+1}, the result will be a (weakly) lower variance at t_{ℓ+1}.
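A toy numerical check of this rebalancing (our own code; the schedule and checkpoint rounds are arbitrary choices):

```python
# Lemma 4 illustration: deferring samples onto designated rounds weakly
# lowers the variance at those rounds (rho = sigma = 1, illustrative data).

RHO, SIGMA = 1.0, 1.0

def variances(schedule, v0=1.0):
    v, out = v0, []
    for s in schedule:
        v = (v + RHO) / (1.0 + (s / SIGMA) * (v + RHO))
        out.append(v)
    return out

original    = [1, 1, 1, 1, 1, 1]   # one sample per round
checkpoints = [2, 5]               # designated rounds (0-indexed)
batched     = [0, 0, 3, 0, 0, 3]   # same totals, pushed onto the checkpoints

v_orig, v_batch = variances(original), variances(batched)
for t in checkpoints:
    print(f"round {t}: original v = {v_orig[t]:.4f}, batched v = {v_batch[t]:.4f}")
# The batched policy has (weakly) lower variance at both checkpoints.
```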
We next show that any policy can be converted into a lazy policy at a loss of at most half of its value.

Theorem 2. The optimal lazy policy is 1/2-approximate.

See Figure 2 for an illustration of the intuition behind the result. Consider an arbitrary policy s, with resulting variance sequence (v_t). Imagine covering the area between (v_t) and c with squares, drawn left to right with their upper faces lying on the outside option line, each chosen just large enough so that v_t never falls below the area covered by the squares. The area of the squares is an upper bound on Val(s). Consider a lazy policy that drops a single atom on the left endpoint of each square, bringing the variance to the square's lower-left corner. The value of this policy covers at least half of each square. Moreover, Lemma 4 implies this policy is (approximately) valid, as it matches variances from the original policy, possibly shifted early by a constant number of rounds. This shifting can introduce non-validity; we fix this by delaying the policy's start by a constant number of rounds without affecting the asymptotic behavior.

Figure 3: Simulating the optimal policy for the non-Gaussian extension. The round number is on the horizontal axis. The hidden state of the world is binary and evolves stochastically (blue). The optimal policy tracks a posterior distribution over the hidden state (red), and takes samples in order to maintain a tuned level of certainty (dashed green). Note that most rounds have only a small number of samples, with occasional spikes triggered adaptively in response to uncertainty.

The factor 1/2 in Theorem 2 is tight. To see this, fix the value of c and allow the budget B to grow arbitrarily large. Then the optimal value tends to c as the budget grows, since the achievable variance on all rounds tends to 0. However, the lazy policy cannot achieve value greater than c/2, as this is what would be obtained if the variance reached 0 on the rounds on which samples are taken.

Finally, while this result is non-constructive, one can compute a policy whose value approaches an upper bound on the optimal lazy policy, in a similar manner to the optimal on-off policy. One can show the best lazy policy over any finite horizon has an "off" period (with no sampling) followed by an "on" period (where v_t ≤ c). One can then solve for the optimal number of samples to take whenever ṽ_t > c by optimizing either value per unit of (fixed plus per-sample) sampling cost, or by fully exhausting the budget, whichever is better. See Lemma 8 in the appendix for details.

6 Extensions

We describe two extensions of our model in the appendix. First, we consider a continuous-time variant where samples can be taken continuously subject to a flow cost, in addition to being requested as discrete atoms. The decision-maker selects actions continuously, and aims to minimize loss over time. All of our results carry forward to this continuous extension.

Second, returning to discrete time, we consider a non-Gaussian instance of our framework. In this model, there is a binary hidden state of the world, which flips each round independently with some small probability ε > 0. The decision-maker's action in each round is to guess the hidden state of this simple two-state Markov process, and the objective is to maximize the fraction of time that this guess is made correctly. Each sample is a binary signal correlated with the hidden state, matching the state of the world with probability 1/2 + δ where δ > 0.
The decision-maker can adaptively request samples in each round, subject to the accumulating budget constraint, before making a guess.

In this extension, as in our Gaussian model, the optimal policy collects samples non-uniformly. In fact, the optimal policy has a simple form: it sets a threshold θ > 0 and takes samples until the entropy of the posterior distribution falls below θ. Smaller θ leads to higher accuracy, but also requires more samples on average, so the best policy will set θ as low as possible subject to the budget constraint. Notably, the result of this policy is that sampling tends to occur at a slow but steady rate, keeping the entropy around θ, except for occasional spikes of samples in response to a perceived change in the hidden state. See Figure 3 for a visualization of a numerical simulation with a fixed per-round sample budget.

More generally, whenever the state evolves in a heavy-tailed manner, it is tempting to take samples regularly in order to detect large, infrequent jumps in state value, and then adaptively take many samples when such a jump is evident. This simple model is one scenario where such behavior is optimal. More generally, can we quantify the dynamic value of data and find an (approximately) optimal data collection policy for more complex Markov chains, or other practical applications?
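As a concrete illustration of this two-state extension, the following simulation (our own sketch; the flip probability, signal quality, entropy threshold, and per-round budget are arbitrary choices) exhibits the described behavior: a trickle of samples keeps the posterior entropy near the threshold, with adaptive bursts after suspected flips.

```python
# Two-state extension: the hidden bit flips w.p. EPS each round; each sample
# matches the state w.p. 1/2 + DELTA. Policy: buy samples until the posterior
# entropy drops below THETA, subject to a banked per-round budget.
import math, random

EPS, DELTA, THETA, BUDGET = 0.02, 0.15, 0.35, 3.0

def entropy(p):
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

state, p, bank, correct, T = 1, 0.5, 0.0, 0, 20_000
for _ in range(T):
    if random.random() < EPS:
        state = 1 - state                      # hidden state flips
    p = p * (1 - EPS) + (1 - p) * EPS          # posterior drifts toward 1/2
    bank += BUDGET
    while entropy(p) > THETA and bank >= 1.0:  # adaptive sampling
        bank -= 1.0
        b = state if random.random() < 0.5 + DELTA else 1 - state
        like1 = 0.5 + DELTA if b == 1 else 0.5 - DELTA
        p = p * like1 / (p * like1 + (1 - p) * (1.0 - like1))
    correct += int((p >= 0.5) == (state == 1))
print(f"guess accuracy = {correct / T:.3f}")
```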
References

[1] Charu C. Aggarwal. A framework for diagnosing changes in evolving data streams. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD '03, pages 575–586, New York, NY, USA, 2003. ACM.

[2] Charu C. Aggarwal. Data Streams: Models and Algorithms (Advances in Database Systems). Springer-Verlag, Berlin, Heidelberg, 2006.

[3] Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. A framework for clustering evolving data streams. In Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29, VLDB '03, pages 81–92. VLDB Endowment, 2003.

[4] Imanol Arrieta-Ibarra, Leonard Goff, Diego Jiménez-Hernández, Jaron Lanier, and E. Glen Weyl. Should we treat data as labor? Moving beyond "free". In AEA Papers and Proceedings, volume 108, pages 38–42, 2018.

[5] Claudia Beleites, Ute Neugebauer, Thomas Bocklitz, Christoph Krafft, and Jürgen Popp. Sample size planning for classification models. Analytica Chimica Acta, 760:25–33, 2013.

[6] Michael Buhrmester, Tracy Kwang, and Samuel D. Gosling. Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6(1):3–5, 2011.

[7] Yiling Chen, Nicole Immorlica, Brendan Lucier, Vasilis Syrgkanis, and Juba Ziani. Optimal data acquisition for statistical estimation. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 27–44. ACM, 2018.

[8] Yiling Chen and Bo Waggoner. Informational substitutes. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 239–247. IEEE, 2016.

[9] Junghwan Cho, Kyewook Lee, Ellie Shin, Garry Choy, and Synho Do. How much data is needed to train a medical image deep learning system to achieve necessary high accuracy? CoRR, abs/1511.06348, 2015.

[10] Corinna Cortes, Lawrence D. Jackel, Sara A. Solla, Vladimir Vapnik, and John S. Denker. Learning curves: Asymptotic values and rate of convergence. In Advances in Neural Information Processing Systems, pages 327–334, 1994.

[11] Bo Cowgill, Justin Wolfers, and Eric Zitzewitz. Using prediction markets to track information flows: Evidence from Google. In AMMA, page 3, 2009.

[12] Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Maintaining stream statistics over sliding windows (extended abstract). In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '02, pages 635–644, Philadelphia, PA, USA, 2002. Society for Industrial and Applied Mathematics.

[13] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in high dimensions without the computational intractability. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 655–664, 2016.

[14] I. Diakonikolas, D. M. Kane, and A. Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 73–84, 2017.

[15] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robustly learning a Gaussian: Getting optimal error, efficiently. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '18, pages 2683–2702, Philadelphia, PA, USA, 2018. Society for Industrial and Applied Mathematics.

[16] Fang Fang, Maxwell Stinchcombe, and Andrew Whinston. "Putting your money where your mouth is" - a betting platform for better prediction. Review of Network Economics, 6(2), 2007.

[17] Simon Fothergill, Helena Mentis, Pushmeet Kohli, and Sebastian Nowozin. Instructing people for training gestural interactive systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1737–1746. ACM, 2012.

[18] Anna C. Gilbert, Sudipto Guha, Piotr Indyk, Yannis Kotidis, S. Muthukrishnan, and Martin J. Strauss. Fast, small-space algorithms for approximate histogram maintenance. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, STOC '02, pages 389–398, New York, NY, USA, 2002. ACM.

[19] Shugang Hao and Lingjie Duan. Regulating competition in age of information under network externalities. IEEE Journal on Selected Areas in Communications, 38(4):697–710, 2020.

[20] Panagiotis G. Ipeirotis. Analyzing the Amazon Mechanical Turk marketplace. XRDS: Crossroads, The ACM Magazine for Students, 17(2):16–21, 2010.

[21] Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Efficiently learning mixtures of two Gaussians. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, STOC '10, pages 553–562, New York, NY, USA, 2010. Association for Computing Machinery.

[22] H. M. Kalayeh and D. A. Landgrebe. Predicting the required number of training samples. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(6):664–667, Nov 1983.

[23] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.

[24] Sanjit Kaul, Roy Yates, and Marco Gruteser. Real-time status: How often should one update? In 2012 Proceedings IEEE INFOCOM, pages 2731–2735. IEEE, 2012.

[25] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pages 1885–1894, 2017.

[26] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665–674, Los Alamitos, CA, USA, Oct 2016. IEEE Computer Society.

[27] Ern J. Lefferts, F. Landis Markley, and Malcolm D. Shuster. Kalman filtering for spacecraft attitude estimation. Journal of Guidance, Control, and Dynamics, 5(5):417–429, 1982.

[28] Chao Li and Gerome Miklau. Pricing aggregate queries in a data marketplace. In WebDB, pages 19–24, 2012.

[29] Annie Liang, Xiaosheng Mu, and Vasilis Syrgkanis. Dynamic information acquisition from multiple sources. arXiv preprint arXiv:1703.06367, 2017.

[30] Sam Roweis and Zoubin Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305–345, 1999.

[31] Nihar Bhadresh Shah and Denny Zhou. Double or nothing: Multiplicative incentive mechanisms for crowdsourcing. In Advances in Neural Information Processing Systems, pages 1–9, 2015.

[32] Xiaohui Song and Jane W.-S. Liu. Performance of multiversion concurrency control algorithms in maintaining temporal consistency. In Proceedings Fourteenth Annual International Computer Software and Applications Conference, pages 132–133. IEEE Computer Society, 1990.

[33] Florian Stahl, Fabian Schomm, and Gottfried Vossen. The data marketplace survey revisited. Technical report, Working Papers, ERCIS - European Research Center for Information Systems, 2014.

[34] Amos Storkey. Machine learning markets. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 716–724, 2011.

[35] Sebastian Thrun. Probabilistic algorithms in robotics. AI Magazine, 21(4):93, 2000.

[36] Xuehe Wang and Lingjie Duan. Dynamic pricing for controlling age of information. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 962–966. IEEE, 2019.

[37] Daniel B. Work, Olli-Pekka Tossavainen, Sébastien Blandin, Alexandre M. Bayen, Tochukwu Iwuchukwu, and Kenneth Tracton. An ensemble Kalman filtering approach to highway traffic estimation using GPS enabled mobile devices. In Decision and Control, 2008. CDC 2008. 47th IEEE Conference on, pages 5062–5068. IEEE, 2008.

[38] Xianwen Wu, Jing Yang, and Jingxian Wu. Optimal status update for age of information minimization with an energy harvesting source. IEEE Transactions on Green Communications and Networking, 2(1):193–204, 2017.

[39] Meng Zhang, Ahmed Arafa, Jianwei Huang, and H. Vincent Poor. How to price fresh data. arXiv preprint arXiv:1904.06899, 2019.
A Appendix: A Continuous Model
We now define a continuous version of our optimization problem, which is useful for modeling big-data situations in which the value of individual data samples is small, but the budget is large enough to allow the accumulation of large datasets. Our continuous model will correspond to a limit of discrete models as the variance of the sampling errors and the budget B grow large.

Our first step is to consider relaxing the discrete model to allow fractional samples. This extends Lemma 1 so that s can be fractional. Note that since the effect of taking s samples, from Lemma 1, depends on the ratio s/σ, we can think of taking a fraction α < 1 of a sample with variance σ as equivalent to taking a single sample with variance σ/α. With this equivalence in mind, we can without loss of generality scale the variance of samples so that σ = 1; this requires only that we interpret the sample budget and numbers of samples taken as scaled in units of inverse variance.

Next, in the continuous model, the hidden state x_t evolves continuously for t ≥ 0. The initial prior is Gaussian with some fixed variance v_0. At each time t we will have a posterior distribution over the hidden state. Write v(t) for the variance of the posterior distribution at time t ≥ 0. In particular, we have the initial condition v(0) = v_0.

Samples can be collected continuously over time at a specified density, as well as in atoms at discrete points of time. As discussed above, we assume without loss of generality that the variance of a single sample is equal to 1. Write s(t) for the density at which samples are extracted at time t, and write a(t) for the mass of samples collected as an atom at time t. Assume atoms are collected at times {t_i, i ∈ N}, i.e., a(t) > 0 only if t ∈ {t_i}. Both s and a, as well as the times {t_i}, are chosen by the decision-maker.

To derive the evolution of the hidden state and the variance v(t) of the posterior at time t, we interpret this continuous model as a limit of the following discretization. Partition time into intervals of length ε, say [t, t + ε) for each t a multiple of ε. We will consider a discrete problem instance, with discrete rounds corresponding to times ε, 2ε, 3ε, . . . . At round i, corresponding to time t = i · ε, a zero-mean Gaussian with variance ε is added to the state. I.e., we take ρ = ε in our discrete model. We then imagine drawing ∫_{(i−1)ε}^{iε} s(t) dt + ∑_{j : t_j ∈ [(i−1)ε, iε)} a(t_j) samples at round i, corresponding to time t = i · ε. This represents the samples that would have been drawn over the course of the interval [(i − 1)ε, iε). We will also take the per-round budget in this discrete approximation to be εB, so that the approximation is valid if the continuous policy satisfies its budget requirement. Note that we can think of this discretization as an approximation to the continuous problem, with the same budget B, but with time scaled by a factor of ε so that a single round in the discrete model corresponds to a time interval of length ε in the continuous model.

Approximate s by a function that is constant over intervals of length ε, say equal to s(t) over the time interval [t, t + ε) for each t a multiple of ε. Suppose for now that there are no atoms over this period. We are then drawing εs(t) samples at time t + ε.
By Lemma 1, this causes the variance to drop by a factor of roughly 1 + εs(t)v(t). The new variance at time t + ε is therefore

v(t + ε) = (v(t) + ε) / (1 + εs(t)(v(t) + ε)).

As this change occurs over a time window of length ε, the average rate of change of v over this interval is

(v(t + ε) − v(t))/ε = (1/ε) [ (v(t) + ε)/(1 + εs(t)(v(t) + ε)) − v(t) ] = (1 − s(t)v(t)² − εs(t)v(t)) / (1 + εs(t)(v(t) + ε)).

Taking ε → 0, the instantaneous rate of change of v at t is v′(t) = 1 − s(t) · v(t)². The variance function v is therefore described by the differential equation above, for any t at which a(t) = 0.

If there is an atom at t, so that a(t) > 0, then for sufficiently small ε the number of samples in the range [t, t + ε) is instead a(t) + εs(t). As ε → 0, this introduces a discontinuity in v(·) at t, since the number of samples taken does not vanish in the limit. With this in mind, we will take the convention that v(t) represents lim_{t′→t⁺} v(t′), the right-limit of v; this informally corresponds to the variance "after" having taken atoms at t. We then define ṽ(t) = lim_{t′→t⁻} v(t′) to be the variance "before" applying any such atom. Lemma 1 then yields that

v(t) = ṽ(t) / (1 + a(t)ṽ(t)).

We emphasize that, under this notation, v(t) represents the variance after having applied the atom at time t, if any. These discontinuities combined with the differential equation above provide a full characterization of the evolution of the variance v(·), given s(·) and a(·) and the initial condition v(0) = v_0.

The total number of samples acquired over a time period [0, T] is then ∫_0^T s(t) dt + ∑_{i : t_i ∈ [0,T]} a(t_i). We normalize the cost per sample to 1 as in the discrete case, but modeling the fixed costs is more subtle. In particular, consider some intermediate discretization. If we apply the fixed cost f at each interval, we get the counterintuitive result that taking s samples today is less expensive than taking s/2 samples in the morning and s/2 in the afternoon, when logically the two should have at least similar costs. On the other hand, if we scale the cost to be f/2 we could only ever take samples in the morning and implement the same policy as we would have at the "day" level at half the fixed cost. To avoid this we make the fixed cost history-dependent. If the fixed cost was not paid in the previous interval, the fixed cost is f. If the fixed cost was paid in the previous interval, the cost is instead εf. This ensures that the cost of implementing a policy from the original level of discretization at a finer level is exactly the same, while keeping policies which spread out samples evenly over the interval at a similar cost. Furthermore, to allow similar properties to hold when considering multiple possible levels of discretization, we allow the decision-maker to pay the fixed cost even in periods when samples are not taken. Thus, for example, taking samples in the early morning and early afternoon but none in the late morning has the same cost as taking the same samples in the morning and in the afternoon.
This interpretation has natural analogs in some of our example scenarios, such as maintaining the satellite lock for the GPS even if limited numbers of samples are taken.

In the continuous limit, this cost model becomes a flow cost of f, which is paid at all times samples are taken as well as during intervals of length at most 1 when samples are not taken. There is also a fixed cost f when sampling resumes after an interval of length greater than 1. Let φ be a measure which has density f when the flow cost is paid and an atom of mass f at times when the fixed cost is paid. Our budget requirement is that ∫_0^T s(t) dt + ∑_{i : t_i ∈ [0,T]} a(t_i) + ∫_0^T dφ ≤ BT for all T ≥ 0.

The optimization problem in the continuous setting is to choose the functions s(t) and a(t) that minimize the long-run average cost incurred by the decision-maker,

lim sup_{T→∞} (1/T) ∫_0^T min(v(t), c) dt,

subject to the budget constraint. We again define the null policy to be the sampling policy that takes no samples (i.e., s(t) = 0 and a(t) = 0 for all t) and selects the outside option at all times, for an average cost of c. Again, we define the value of a policy to be the difference between its average cost and the average cost of the null policy:

lim inf_{T→∞} (1/T) ∫_0^T max(c − v(t), 0) dt.

We say a policy is α-approximate if it achieves an α fraction of the value of the optimal policy.
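As a sanity check on these dynamics, one can integrate the differential equation v′(t) = 1 − s(t)v(t)² with a simple Euler scheme, applying the atom update v ← v/(1 + a·v) at designated times. The sketch below is our own; the flow schedule and atom are illustrative and we do not enforce the budget constraint.

```python
# Euler integration of v'(t) = 1 - s(t) v(t)^2, with atoms applied as
# v <- v / (1 + a v). Flow schedule and atom times are illustrative.

DT, T_END = 1e-3, 12.0
ATOMS = [(8.0, 1.5)]                           # (time, mass) pairs, in order

def flow(t):
    return 2.0 if (t % 4.0) >= 3.0 else 0.0    # sample in the last quarter of each cycle

v, t, i = 1.0, 0.0, 0
for step in range(int(T_END / DT)):
    if i < len(ATOMS) and t >= ATOMS[i][0]:
        v = v / (1.0 + ATOMS[i][1] * v)        # discontinuous drop at an atom
        i += 1
    v += DT * (1.0 - flow(t) * v * v)          # the differential equation above
    t += DT
    if step % 2000 == 1999:                    # report every 2 time units
        print(f"t = {t:.1f}: v = {v:.3f}")
```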
B Appendix: Proofs from Section 2

Lemma 1.
Let v_t be the variance of G_t and suppose each δ_t is drawn from a zero-mean Gaussian F_t with variance ρ, and that each sample is subject to zero-mean Gaussian noise with variance σ. Then, if the decision-maker takes s samples in round t + 1, the variance of G_{t+1} is

v_{t+1} = (v_t + ρ) / (1 + (s/σ)(v_t + ρ)).

Proof.
Our model is a special case of the model underlying a Kalman filter. There, generally, the evolution of the state can depend on a linear transformation of x_t, a control input, and some Gaussian noise. In our model the transformation of x_t is the identity, there is no control input, and by assumption the Gaussian noise has mean 0 and variance ρ. Similarly, our sampling model corresponds to the observation model assumed by a Kalman filter.

Therefore, using the standard update rules for a Kalman filter [30], the innovation variance at time t + 1 (i.e., the variance of the posterior after x_t is updated by δ_t ∼ F_{t+1} but before observing the samples) is ṽ_{t+1} = v_t + ρ. (Alternatively this can be observed directly, as we are summing two Gaussians.) This matches the desired quantity for the case s = 0, where no samples are taken. For s = 1, we can again apply the standard update rules for a Kalman filter to get a posterior variance of ṽ_{t+1}/(1 + ṽ_{t+1}/σ), as desired. By induction, if the decision-maker instead takes s > 1 samples, the posterior variance will be ṽ_{t+1}/(1 + sṽ_{t+1}/σ), as desired.
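A quick Monte Carlo check of this formula (our own script, with arbitrary parameters): draw the post-drift state from its prior, draw s noisy samples, form the Gaussian posterior mean, and compare its mean squared error to the claimed posterior variance.

```python
# Monte Carlo check of Lemma 1: the MSE of the posterior mean should match
# (v + rho) / (1 + (s / sigma) * (v + rho)). Parameters are arbitrary.
import random

V, RHO, SIGMA, S, N = 0.7, 0.4, 1.3, 3, 200_000
vt = V + RHO                                   # variance after drift
predicted = vt / (1.0 + (S / SIGMA) * vt)

mse = 0.0
for _ in range(N):
    x = random.gauss(0.0, vt ** 0.5)           # state after drift, prior N(0, vt)
    obs = [x + random.gauss(0.0, SIGMA ** 0.5) for _ in range(S)]
    post_mean = (sum(obs) / SIGMA) / (1.0 / vt + S / SIGMA)
    mse += (post_mean - x) ** 2
print(f"empirical MSE {mse / N:.4f} vs predicted {predicted:.4f}")
```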
C Appendix: Proofs from Sections 3 and 4

All of these results hold in both the discrete and continuous models with essentially the same proofs. Therefore, we provide a unified treatment of them for both cases.
Lemma 2.
Fix a sampling policy $s$ and a time $R > 0$, in either the continuous or discrete setting, and suppose that $s$ takes a strictly positive number of samples in $(0, R]$. Let $\Psi$ be the mapping defined as follows: supposing that $v_0 = x$ and $v$ is the variance function resulting from sampling policy $s$, set $\Psi(x) := v(R)$. Then $\Psi$ is a contraction map over the non-negative reals, under the absolute value norm.

Proof. We need to show $|\Psi(x) - \Psi(y)| \le k \cdot |x - y|$ whenever $x > y$, where $k < 1$ is some constant that depends on $s$. We'll prove this first for a discrete policy. Write $v^x_t$ and $v^y_t$ for the variance at time $t$ with starting condition $x$ and $y$, respectively. Take $v^x_0 = x$ and $v^y_0 = y$ for notational convenience. We then have that $v^x_1 = \frac{x+\rho^2}{1+(s_1/\sigma^2)(x+\rho^2)}$ and $v^y_1 = \frac{y+\rho^2}{1+(s_1/\sigma^2)(y+\rho^2)}$ by Lemma 1. We then have that $v^x_1 \ge v^y_1$, and moreover
\[ v^x_1 - v^y_1 = \frac{x-y}{(1+(s_1/\sigma^2)(x+\rho^2))(1+(s_1/\sigma^2)(y+\rho^2))} \le x-y, \]
and the inequality is strict if $s_1 > 0$. We can therefore apply induction on the rounds in $(0,R]$, plus the assumption that at least one of these rounds has a positive number of samples, to conclude that $\Psi(x) - \Psi(y) < x-y$. To bound the value of $k$, suppose that $t \ge 1$ is the first round in which a positive number of samples is taken. Then we can find some sufficiently small $\epsilon > 0$ so that $v^x_{t-1} > \epsilon$ and $s_t/\sigma^2 > \epsilon$. Then we will have
\[ \Psi(x)-\Psi(y) \le v^x_t - v^y_t = \frac{v^x_{t-1} - v^y_{t-1}}{(1+(s_t/\sigma^2)(v^x_{t-1}+\rho^2))(1+(s_t/\sigma^2)(v^y_{t-1}+\rho^2))} \le \frac{x-y}{1+(s_t/\sigma^2)(v^x_{t-1}+\rho^2)} < \frac{x-y}{1+\epsilon^2}, \]
and hence we have $k \le \frac{1}{1+\epsilon^2} < 1$ as required.

To extend to continuous policies, take $v^x(t)$ and $v^y(t)$ for the variances under the two respective start conditions, and note first that there must exist some sufficiently small $\epsilon$ such that $v^x(t) > \epsilon$ for all $t \in [0,R]$, and for which the total mass of samples taken over the range $(0,R]$ is at least $\epsilon$. Take any discretization of the range $[0,R]$, say into $r > 0$ rounds, and consider the corresponding discretization of the continuous policy, so that the sum of the number of samples taken over all discrete rounds in the interval $[0,R]$ is at least $\epsilon$. As above, take $v^x_i$ and $v^y_i$ to be the variances resulting from these discretized policies after $i$ discrete rounds. Say $s_i$ samples are taken at round $i$. Then, considering each round in sequence and applying the same reasoning as in the discrete case above, we have that
\[ v^x_r - v^y_r \le (x-y)\cdot\prod_{i=1}^r \frac{1}{1+s_i v^x_{i-1}} \le (x-y)\cdot\prod_{i=1}^r \frac{1}{1+s_i\epsilon} \le (x-y)\cdot\frac{1}{1+\epsilon\sum_{i=1}^r s_i} \le \frac{x-y}{1+\epsilon^2}. \]
Thus, for each such discretization, we have a contraction by a factor of at least $\frac{1}{1+\epsilon^2}$. Taking a limit of such discretizations, we conclude that this holds in the continuous limit as well, so that $\Psi(x)-\Psi(y) < \frac{1}{1+\epsilon^2}(x-y)$ as required.

It is well known that a contraction mapping has a unique fixed point, and that repeated application will converge to that fixed point. Since we can view the impact of a periodic sampling policy as repeated application of the mapping $\Psi$ to the initial variance in order to obtain $v(0), v(R), v(2R), \ldots$, we conclude that the variance will converge uniformly to a periodic function for which $v(t) = v(t+R)$. Thus, for the purpose of evaluating the long-run average cost of a periodic policy, it will be convenient (and equivalent) to replace the initial condition on $v$, $v(0) = v_0$, with a periodic boundary condition $v(0) = v(R)$, and then choose $s$ to minimize the average cost over a single period:
\[ \frac{1}{R}\int_0^R \min\{v(t), c\}\,dt, \]
subject to the budget constraint that, at any time $T \in (0,R]$, we have $\int_0^T s(t)\,dt + \sum_{i: t_i \in (0,T]} a(t_i) \le BT$.
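The contraction is easy to observe numerically. In this sketch (our own illustration, using a hypothetical one-atom periodic policy in the normalized continuous model), the gap between two trajectories started at $x$ and $y$ shrinks geometrically, so iterating $\Psi$ converges to the periodic fixed point used above:

```python
def apply_period(v, R, atom):
    # One period of a continuous-model policy: drift for R time units at
    # rate 1, then a single atom of the given size at the end of the period.
    v = v + R
    return v / (1 + atom * v)   # atom: v <- v~/(1 + a*v~)

x, y, R, atom = 10.0, 0.1, 2.0, 1.5
for k in range(8):
    print(k, abs(x - y))        # gap shrinks by a constant factor < 1
    x, y = apply_period(x, R, atom), apply_period(y, R, atom)
```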
For the remainder of the proofs in this section, we will allow fractional sampling rates even in the discrete setting. Recall that one can define an $\alpha \in (0,1)$ fraction of a sample to be a sample in which the noise variance is increased by a factor of $1/\alpha$.

Lemma 3. With fractional samples, for all $T > T'$, we have $\mathrm{Val}(s_T) \ge \mathrm{Val}(s_{T'})$.

Proof. If $s$ is an on-off policy with period $T'$, then the policy that uses the same "on" sampling rate with a period of $T$ has weakly better average value. This is because the variance of this policy is decreasing over the "on" period, so it is lowest at the end of the period. Thus the optimal on-off policy with period $T$ has weakly better average value than $s$.

We will write $V^* = \sup_T \mathrm{Val}(s_T)$. From the lemma above, we have that $V^* = \lim_{T\to\infty} \mathrm{Val}(s_T)$ as well.
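As an illustration of how $\mathrm{Val}(s_T)$ can be evaluated, the sketch below (our own; the "on" window length is an arbitrary choice, not the optimal one) simulates an on-off policy, iterating a few periods first so the variance is near its periodic fixed point:

```python
def onoff_value(T, on_len, rate, c, dt=1e-3, burn_periods=50):
    # On-off policy with period T: sample at constant `rate` during the last
    # on_len units of each period. Normalized dynamics: v' = 1 - s * v^2.
    def step(v, t):
        s = rate if (t % T) >= T - on_len else 0.0
        return v + dt * (1 - s * v * v)
    v = c
    for _ in range(burn_periods):          # approach the periodic fixed point
        for i in range(int(T / dt)):
            v = step(v, i * dt)
    value = 0.0
    for i in range(int(T / dt)):           # average value over one period
        value += dt * max(c - v, 0.0)
        v = step(v, i * dt)
    return value / T

B, c = 1.0, 4.0
for T in (1.0, 2.0, 4.0, 8.0):
    on = min(T, 1.0)                       # hypothetical "on" window length
    print(T, onoff_value(T, on, rate=B * T / on, c=c))
```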
The following lemma will be useful for analyzing non-periodic policies.

Lemma 5. For any policy $s$, there is a sequence of policies $\{s_T\}$ for $T = 1, 2, \ldots$, where $s_T$ is periodic with period $T$, such that $\lim_{T\to\infty} \mathrm{Val}(s_T) \ge \mathrm{Val}(s)$.

Proof. (sketch) Fix $T$, let $W_T$ be the average value of policy $s$ over rounds $[0,T]$, and let $s_T$ be a periodic policy that mimics policy $s$ over rounds $[0,T]$. Note that $\lim_{T\to\infty} W_T = \mathrm{Val}(s)$. The difference between $W_T$ and $\mathrm{Val}(s_T)$ is driven by the initial condition: by Lemma 2, $\mathrm{Val}(s_T)$ is simply the average period value under the boundary condition $v_0 = v_T$, whereas $s$ may have some alternative initial condition (say $v_0 = \hat v_0$). If the initial condition for $s$ lies below that of $s_T$, we will modify policy $s_T$ into a new periodic policy, as follows. First, we'll note the (constant) number of extra samples needed on round 0 to reach variance $\hat v_0$ from an initially unbounded variance. Our modified policy will first wait the (constant) number of rounds (say $r$) needed to acquire this much budget, without sampling. Then, starting at round $r$, it will bring the variance to $\hat v_0$ (using this accumulated budget) and then simulate policy $s$ over rounds $[0, T-r]$. Relative to $W_T$, this policy obtains all of the value except for that accumulated by policy $s$ over rounds $[T-r, T]$, which is at most a constant. This policy's value therefore matches $W_T$ in the limit as $T$ grows large.

The following lemma shows that if there are intervals during which one takes the outside option, then it is better to have them occur at the beginning of the range $[0,R]$. The intuition is that it is cheaper to reduce the variance to $c$ from a large value once than to reduce it from a small value to $c$ many times. (Note that we omit $t_i = 0$ from the summation over atoms, to handle the edge case where there is an atom at time $T$, and hence at time 0 as well, which should not be counted twice.)

Lemma 6. Consider a valid policy, given by $s(\cdot)$ and $a(\cdot)$, and any time $R > 0$. Then there is another valid policy $s^*(\cdot), a^*(\cdot)$ and a time $r \in [0,R]$ such that $a^*(t) = s^*(t) = 0$ for all $t < r$, $v^*(t) \le c$ for all $t \in [r, R]$, and the average value of $s^*(\cdot), a^*(\cdot)$ up to time $R$ is at least the average value of the original policy up to time $R$. This is true in both the discrete and continuous models (with fractional samples).

Proof. We will write the following proof in the continuous model, but we note that the same proof applies to the discrete model with minor adjustments to the notation. Let $s(\cdot)$ and $a(\cdot)$ be a sampling policy, and suppose that it does not satisfy the conditions of $s^*$ and $a^*$ in the lemma statement. Write $v(\cdot)$ for the resulting variance, and suppose that $t_0$ is the infimum of all times for which $v(t) < c$. Note that we can assume that $s(t) = 0$ for any $t$ such that $v(t) > c$, without loss.

Suppose the time interval $[t_1, t_2)$ is the earliest maximal interval following $t_0$ such that $v(t) \ge c$ for all $t \in [t_1, t_2)$. In other words, $[t_1, t_2)$ is an interval during which the decision-maker would choose the outside option, and this interval occurs after some point at which the decision-maker has not chosen the outside option. Such an interval must exist, since we assumed that the given policy does not satisfy the conditions of the lemma.

Our strategy will be to transform this policy into a different policy that is closer to satisfying the conditions of the lemma.
Roughly speaking, we will do this by "shifting" the interval $[t_1, t_2)$ so that it lies before $t_0$: we will push the sampling policy over the range $[t_0, t_1)$ forward $(t_2 - t_1)$ units of time. This will result in a policy with one fewer interval of time in which the variance lies above $c$. And, as we will show, this policy has the same total value as the original and is valid. We can apply this construction to each such interval to construct the policy $s^*$ and $a^*$ required by the lemma.

Let us more formally describe what we mean by shifting the interval $[t_1, t_2)$. We have that $v(t_1) = c$ and $a(t_2) > 0$ from the definitions of $t_1$ and $t_2$. Write $\delta = t_2 - t_1$. Then we have $\tilde v(t_2) = c + \delta$. Let $\gamma > 0$ be such that $\frac{c+\delta}{1+\gamma(c+\delta)} = c$. That is, $\gamma$ is the size of atom such that, if $a(t_2) = \gamma$, then we would have $v(t_2) = c$. Since in fact we have $v(t_2) \le c$ by maximality of the interval, and since $s(t) = 0$ for all $t \in (t_1, t_2)$ by assumption, it must be that $a(t_2) \ge \gamma$. On the other hand, let $\gamma' > 0$ be such that $\frac{\tilde v(t_0)+\delta}{1+\gamma'(\tilde v(t_0)+\delta)} = \tilde v(t_0)$. Since $\tilde v(t_0) \ge c$, we must have $\gamma' \le \gamma$.

We are now ready to describe the shifted policy, given by $s^*$ and $a^*$. We set $s^*(t) = a^*(t) = 0$ for all $t < t_0 + \delta$; $a^*(t_0+\delta) = a(t_0) + \gamma'$; $a^*(t) = a(t-\delta)$ and $s^*(t) = s(t-\delta)$ for all $t \in (t_0+\delta, t_2)$; $a^*(t_2) = a(t_2) - \gamma$; and $a^*(t) = a(t)$ and $s^*(t) = s(t)$ for all $t > t_2$. Roughly speaking, the new policy "moves" the interval $[t_1, t_2)$, where the variance lies above $c$, to occur before the sampling behavior that began at time $t_0$. It also reduces the atom at $t_2$ and increases the atom at $t_0$ (if any); the amounts are chosen so that $v^*(t_2) = v(t_2)$, as we shall see.

We claim that $v^*(t) \ge c$ for all $t < t_0 + \delta$, $v^*(t) = v(t-\delta)$ for $t \in [t_0+\delta, t_2)$, and $v^*(t) = v(t)$ for $t \ge t_2$. This will imply that the average value of policy $s^*, a^*$ is equal to that of the original policy, since they differ only in that a portion of the variance curve lying below $c$ has been shifted by $\delta$. That $v^*(t) \ge c$ for all $t < t_0+\delta$ follows from the definition of $a^*$. That $v^*(t_0+\delta) = v(t_0)$ follows because $\tilde v^*(t_0+\delta) = \tilde v(t_0) + \delta$, and $a^*(t_0+\delta)$ consists of an atom $\gamma'$ that shifts the variance from $\tilde v(t_0)+\delta$ to $\tilde v(t_0)$, plus another atom $a(t_0)$ that shifts the variance from $\tilde v(t_0)$ to $v(t_0)$. Given that $v^*(t_0+\delta) = v(t_0)$, we also have $v^*(t+\delta) = v(t)$ for all $t \in [t_0, t_1)$, as the policy $a^*$ is simply $a$ shifted by $\delta$ within this range. Finally, we have $\tilde v^*(t_2) = c$, and $a^*(t_2) = a(t_2) - \gamma$, which is precisely the size of atom needed to shift the variance from $c$ to $v(t_2)$. So we have that $v^*(t) = v(t)$ for all $t \ge t_2$, as $a^*$ and $a$ coincide for all such $t$.

We conclude that $s^*$ and $a^*$ have the same average value as $s$ and $a$. Also, $s^*$ and $a^*$ use at most as much total budget as $s$ and $a$, and shift some usage of budget to later points in time, so the new policy is valid. Finally, the new policy has at least one fewer maximal interval in which the variance lies strictly above $c$. By repeating this construction inductively, we obtain the policy required by the lemma.
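For concreteness, the two atom sizes used in the shift have simple closed forms (a direct calculation from the atom update $v = \tilde v/(1+\gamma\tilde v)$, i.e., from the fact that an atom of size $\gamma$ adds $\gamma$ to the precision $1/\tilde v$):

```latex
\frac{c+\delta}{1+\gamma(c+\delta)} = c
\;\Longrightarrow\;
\gamma = \frac{1}{c} - \frac{1}{c+\delta} = \frac{\delta}{c\,(c+\delta)},
\qquad
\gamma' = \frac{\delta}{\tilde v(t_0)\,\bigl(\tilde v(t_0)+\delta\bigr)}.
```

Since $x \mapsto \delta/(x(x+\delta))$ is decreasing in $x$ and $\tilde v(t_0) \ge c$, the comparison $\gamma' \le \gamma$ is immediate.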
We are now ready to show that no policy that satisfies the budget constraint can achieve value greater than $V^*$. This establishes our first main claim, that on-off policies are optimal when $z = 0$.

Theorem 1. With fractional samples, the value of any valid policy $s$ is at most $V^*$.

Proof. We will prove this claim under the discrete model. Taking the limit over ever-finer discretizations then establishes the result for the continuous model as well.
Choose some $T > 0$, and fix any policy $s$ that is periodic with period $T$. We will show that the average value of $s$ is at most $\mathrm{Val}(s_T) + o(1)$, where the asymptotic notation is with respect to $T \to \infty$. Taking the limit as $T \to \infty$ and applying Lemma 5 above will then complete the result.

Our approach to showing that the average value is at most $\mathrm{Val}(s_T) + o(1)$ will be to convert $s$ into an on-off policy $s'$ of period $T$, without decreasing its average value. Recall from Lemma 2 (and the discussion following its proof) that the long-run average value of $s'$ is simply the average period value under the periodic boundary condition $v'_0 = v'_T$. Moreover, when constructing $s'$, we can without loss of generality relax the validity condition to be that at most $BT$ samples are taken at any point over the interval $[0,T]$. This is because we can strengthen any policy under this weaker condition to satisfy the original budget constraint by delaying the start of the policy for $T$ rounds without taking any samples, and only then starting policy $s'$. This will have the same long-run average value as our relaxed policy, again by Lemma 2.

We now describe a sequence of operations to convert $s$ into an on-off policy. First, by Lemma 6, we can assume that $s$ spends no samples in the range $[0, T')$ for some $T' < T$, and then has $v_t \le c$ for all $t \in [T', T]$. We will show that, given any policy of this form, one can convert it into an on-off policy without degrading the total value over the range $[0,T]$ by more than a constant.

We will apply a potential argument. Given a policy with variances given by $v_t$, we will write $\bar v$ for the average variance in the range $[T', T]$. That is,
\[ \bar v := \frac{1}{T - T' + 1}\sum_{T' \le t \le T} v_t. \]
Let $v^+ = \max_{T' \le t \le T} v_t$ and $v^- = \min_{T' \le t \le T} v_t$. That is, $v^+$ is the maximum variance and $v^-$ is the minimum variance achieved during the interval where the variance lies below $c$. Note that we must have $v^+ \ge \bar v$ and $v^- \le \bar v$. Let $\Psi$ be the total number of timesteps $t$ between $T'$ and $T$ in which the variance is equal to either $v^-$ or $v^+$, plus 1 if $v^+ = v^- = \bar v$. Then $\Psi$ is an integer lying between 2 and $T - T' + 2$. Also, $\Psi = T - T' + 2$ only if $v^+ = v^-$, which implies that all variances are precisely equal to $\bar v$. We will show how to modify a policy $s$ (in which $v^+ > v^-$) into a new policy $s'$ so that $\Psi$ strictly increases, without changing the average policy value.

Write $A$ for the set of timesteps with variance equal to $v^+$, and $B$ for the set of timesteps with variance equal to $v^-$. Say $|A| = a$ and $|B| = b$. Note that $A \cap B = \emptyset$, since we are assuming $v^+ > v^-$. We will update the sampling policy so that, roughly speaking, the variances of the timesteps in $A$ are all decreased, and the variances of the timesteps in $B$ are all increased, until either a new timestep becomes either a maximal or a minimal point, or until all timesteps have variance $\bar v$.
More formally, our update is parameterized by some $\epsilon > 0$, and satisfies the following conditions:
• at all timesteps $t \in A$, the variance is reduced by $\epsilon/a$,
• at all timesteps $t \in B$, the variance is increased by $\epsilon/b$,
• at timesteps not in $A \cup B$, the variance is unchanged,
• $\epsilon$ is maximal so that all elements of $A$ still have the maximum variance, and all elements of $B$ still have the minimum variance.

Note that by the final condition, after making this change, either there will be one more maximal timestep or one more minimal timestep, or else the minimum equals the maximum. In either case, $\Psi$ will strictly increase. Moreover, this update does not change the average variance of the policy.

It remains to show that we can implement this update without increasing the total spend of the policy. To see this, consider updating just a single timestep $t \in A$. The change involves adding samples at time $t$ to decrease the variance by $\epsilon/a$, then removing samples from time $t+1$ to offset the resulting decrease at that point. To decrease the variance from $v_t$ to $v_t - \epsilon/a$ requires an extra $\frac{\epsilon/a}{v_t(v_t - \epsilon/a)}$ samples. The number of samples that can be saved in the subsequent round is the amount required to move the variance from $v_t + 1$ to $v_t + 1 - \epsilon/a$, which is $\frac{\epsilon/a}{(v_t+1)(v_t+1-\epsilon/a)}$. The net increase in samples is therefore
\[ \frac{\epsilon/a}{v_t(v_t-\epsilon/a)} - \frac{\epsilon/a}{(v_t+1)(v_t+1-\epsilon/a)}. \]
Applying this operation to all timesteps in $A$ (of which there are $a$), and recalling that they all have variance $v^+$, we have that the total cost in samples is
\[ \frac{\epsilon}{v^+(v^+-\epsilon/a)} - \frac{\epsilon}{(v^++1)(v^++1-\epsilon/a)}. \]
A similar calculation yields that the total number of samples saved by increasing the variance by $\epsilon/b$ for all timesteps in $B$ is
\[ \frac{\epsilon}{v^-(v^-+\epsilon/b)} - \frac{\epsilon}{(v^-+1)(v^-+1+\epsilon/b)}. \]
To compare the two, define $g(u,w) = \frac{1}{uw} - \frac{1}{(u+1)(w+1)}$, so that the total cost equals $\epsilon\, g(v^+, v^+-\epsilon/a)$ and the total savings equals $\epsilon\, g(v^-, v^-+\epsilon/b)$. For positive $u$ and $w$, $g$ is strictly decreasing in each argument, since $\frac{\partial}{\partial u} g(u,w) = -\frac{1}{u^2 w} + \frac{1}{(u+1)^2(w+1)} < 0$ (and symmetrically in $w$). Recalling that $v^+ - \epsilon/a \ge v^- + \epsilon/b$ (by the maximality rule for $\epsilon$) and $v^+ > v^-$, we conclude that
\[ \epsilon\, g(v^-, v^-+\epsilon/b) > \epsilon\, g(v^+, v^+-\epsilon/a), \]
and hence the total number of samples saved is greater than the total number of samples spent in making this change. Thus, the new policy also satisfies the average budget constraint.

Repeating this procedure, we conclude that we must eventually reach a state in which the resulting policy has constant variance equal to $\bar v$ in the range $[T', T]$. This policy has $s_t = \frac{1}{\bar v(\bar v + 1)}$ for all $t \in (T', T]$, and possibly a larger number of samples at time $T'$. The on-off policy that sets $s'_t = \frac{1}{\bar v(\bar v + 1)}$ for all $t \in [T', T]$ is therefore also valid. Moreover, from our previous analysis, this on-off policy has average value within $o(1)$ of policy $s$. We conclude that the value of $s$ is at most $\mathrm{Val}(s_T) + o(1)$, as required.
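The budget accounting in this potential argument can be sanity-checked numerically; a small sketch with illustrative values of $v^+$, $v^-$, $a$, $b$, and $\epsilon$ (the helper names are ours):

```python
def cost_to_drop(v, d):
    # Samples needed to move the variance from v down to v - d in one round
    # (unit noise variance): 1/(v - d) - 1/v = d / (v (v - d)).
    return d / (v * (v - d))

def net_savings(v_plus, v_minus, a, b, eps):
    # Spend at the a maximal timesteps (variance v+), save at the b minimal
    # timesteps (variance v-); each change is offset in the following round.
    spend = a * (cost_to_drop(v_plus, eps / a)
                 - cost_to_drop(v_plus + 1, eps / a))
    saved = b * (cost_to_drop(v_minus + eps / b, eps / b)
                 - cost_to_drop(v_minus + 1 + eps / b, eps / b))
    return saved - spend

print(net_savings(3.0, 1.0, a=2, b=3, eps=0.1))  # positive when v+ > v-
```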
D Appendix: Proofs from Section 5 in the Continuous Model
It is technically more convenient to present these results for the continuous model first, as that allows us to avoid rounding issues. Therefore we first prove these results for the continuous model, and then in Appendix E provide proofs for the discrete versions of those that do not have a unified proof here. We first prove a structural result about transforming policies.
Lemma 4.
Fix any valid sampling policy (not necessarily lazy) with resulting variance function $v(\cdot)$, and any sequence of timestamps $t_1 < t_2 < \ldots < t_\ell < \ldots$. Then there is a valid policy that spends samples only in atoms, with $\{t \mid a(t) > 0\} \subseteq \{t_1, \ldots, t_\ell, \ldots\}$, resulting in a variance function $\breve v(\cdot)$ with $\breve v(t_i) \le v(t_i)$ for all $i$.

Proof. We will prove this result for the continuous model. The proof for the discrete model follows similarly, and is given in Appendix E.

For the continuous model, we'll first prove the result for the case of a single timestep $t_1$. The result then follows by repeated application to each subsequent $t_i$ inductively. Let $s(t)$ and $a(t)$ be the continuous sampling rate and atoms of the original policy, respectively, with resulting variance $v(t)$. Assume first that $s(t) = 0$ for all $t$ and that the atoms in the interval $[0, t_1]$ occur at times $0 < \tau_1 < \tau_2 < \ldots < \tau_k = t_1$, with corresponding numbers of samples $a(\tau_1), \ldots, a(\tau_k)$. Note that the assumption $\tau_k = t_1$ is without loss, as we could set $a(\tau_k) = 0$. If $k = 1$ then we are done, so assume $k \ge 2$. We will show that the alternative policy which lumps together the first two atoms, given by $a_2$, where $a_2(\tau_1) = 0$, $a_2(\tau_2) = a(\tau_1) + a(\tau_2)$, and $a_2(\tau_i) = a(\tau_i)$ for all $i > 2$, results in a variance function $v_2$ such that $v_2(\tau_i) \le v(\tau_i)$ for all $i \ge 2$. This will complete the claim, by repeated application to the first non-zero atom in the sequence.

Recall that $\tilde v(\tau_1) = v(0) + \tau_1$ is the variance just prior to applying the atom at $\tau_1$. Then we have that $v(\tau_1) = \frac{\tilde v(\tau_1)}{1 + a(\tau_1)\tilde v(\tau_1)}$. Similarly, $\tilde v(\tau_2) = v(\tau_1) + (\tau_2 - \tau_1)$, so
\[ \tilde v(\tau_2) = \frac{\tilde v(\tau_1)}{1 + a(\tau_1)\tilde v(\tau_1)} + \tau_2 - \tau_1. \]
We then have that $v(\tau_2) = \tilde v(\tau_2)/(1 + a(\tau_2)\tilde v(\tau_2))$.

Alternatively, with $a_2$ we have $\tilde v_2(\tau_1) = v(0) + \tau_1$. If we let $\tilde{\tilde v}(\tau_2)$ denote the variance after having applied an atom of size $a(\tau_1)$ at time $\tau_2$, but before applying an atom of size $a(\tau_2)$, we have
\[ \tilde{\tilde v}(\tau_2) = \frac{\tilde v(\tau_1) + \tau_2 - \tau_1}{1 + a(\tau_1)(\tilde v(\tau_1) + \tau_2 - \tau_1)}. \]
We will then have that $v_2(\tau_2) = \tilde{\tilde v}(\tau_2)/(1 + a(\tau_2)\tilde{\tilde v}(\tau_2))$. We now note that $\tilde{\tilde v}(\tau_2) \le \tilde v(\tau_2)$, since
\[ \tilde v(\tau_2) = \frac{\tilde v(\tau_1)}{1 + a(\tau_1)\tilde v(\tau_1)} + \tau_2 - \tau_1 \ge \frac{\tilde v(\tau_1) + \tau_2 - \tau_1}{1 + a(\tau_1)\tilde v(\tau_1)} \ge \frac{\tilde v(\tau_1) + \tau_2 - \tau_1}{1 + a(\tau_1)(\tilde v(\tau_1) + \tau_2 - \tau_1)} = \tilde{\tilde v}(\tau_2). \]
We can therefore conclude that $v(\tau_2) \ge v_2(\tau_2)$. This further implies that $v(\tau_i) \ge v_2(\tau_i)$ for all $i > 2$ as well, since $a_2(\tau_i) = a(\tau_i)$ for all $i > 2$ and, inductively, the variance is weakly lower under $v_2$ than under $v$ just prior to each atom, and hence is lower after the application of each atom as well.

We now turn to the more general case where $s(t)$ is not identically 0. We again consider the case of a single timestep $t_1$, and the more general result will follow inductively. We can view $s$ as the limit, as $\epsilon \to 0$, of a sequence of discretized policies that only use atoms, and only at times that are multiples of $\epsilon$. We will take our sequence to be $\epsilon = t_1/k$ for $k = 1, 2, 3, \ldots$, so that time $t_1$ is present in each of these discretizations. For each such $\epsilon = t_1/k$, take $s_k$ to be the discretized version of $s$, so that $s_k \to s$ as $k \to \infty$. Applying our result above to policy $s_k$, we have that for each $s_k$ there is a policy that matches the variance of $s_k$ at time $t_1$, and that only applies an atom at $t_1$. Taking the limit as $k$ grows, we conclude that there is a policy that only takes an atom at time $t_1$, whose variance at $t_1$ is no greater than $v(t_1)$, the variance generated by policy $s$.
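The lumping step at the heart of Lemma 4 is easy to verify numerically; the sketch below compares the variance at each atom time before and after merging the first two atoms (the times and sizes are illustrative):

```python
def variance_at_atoms(v0, times, sizes):
    # Continuous model: variance drifts at rate 1 between atoms and jumps
    # via v <- v~/(1 + a*v~) at each atom; returns v right after each atom.
    out, v, prev = [], v0, 0.0
    for t, a in zip(times, sizes):
        v += t - prev               # drift since the previous atom
        v = v / (1 + a * v)         # apply the atom
        out.append(v)
        prev = t
    return out

times, sizes, v0 = [0.5, 1.2, 2.0], [1.0, 2.0, 0.5], 3.0
orig = variance_at_atoms(v0, times, sizes)
merged = variance_at_atoms(v0, times, [0.0, sizes[0] + sizes[1], sizes[2]])
print(orig)    # v(tau_1), v(tau_2), v(tau_3)
print(merged)  # merged policy is weakly lower at tau_2 and tau_3
```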
We can now prove our main approximation result for the continuous setting without costs. We actually prove two versions, as a stronger bound is possible for the case of $z = 0$.

Theorem 2 (Version 1). The optimal lazy policy is $1/2$-approximate, in the continuous setting and with $z = 0$.

Proof. Write $s^*(t), a^*(t)$ for an optimal policy, and $v^*(t)$ for the corresponding variance function. That is, $s^*$ is a policy that minimizes
\[ \limsup_{T\to\infty}\frac{1}{T}\int_{t=0}^T \min\{c, v^*(t)\}\,dt \]
subject to budget constraints.

We note that, without loss of generality, we can assume that $s^*(t) = 0$ whenever $v^*(t) > c$. That is, the policy has no continuous sampling when the variance is above the outside option. This follows from our structural lemma above, since any policy can be replaced by one that takes no samples in an interval where the variance lies above the outside option, and instead uses an atom at the end of such an interval to bring the variance back down to the outside option level $c$.

We now define a lazy policy that approximates $s^*, a^*$ up to a fixed $\epsilon > 0$. We do so by defining a sequence of intervals iteratively. We begin by setting $t_0 = 0$. For each $t_i$, we will define $t_{i+1} > t_i$ as follows. If $v^*(t) > c - \epsilon$ for all $t \in [t_i, t_i + \epsilon]$, take $t_{i+1} = \inf\{t > t_i : v^*(t) \le c - \epsilon\}$. Otherwise, choose $t_{i+1} = t_i + \delta$, where
\[ \delta = \inf\{\delta' > \epsilon : c - v^*(t) \le \delta' \;\;\forall t \in [t_i, t_i + \delta')\}. \]
Note that in this latter case we must always have $\delta \ge \epsilon$, and hence $t_{i+1} \ge t_i + \epsilon$. We also must have $\delta \le c$, since certainly $v^*(t) \ge 0$ everywhere.

For each $i \ge 0$, let $m_i = \arg\inf_{m \in (t_i, t_{i+1}]}\{v^*(m)\}$. That is, $m_i$ is the time in the subinterval $(t_i, t_{i+1}]$ where $v^*$ takes its lowest value.

Consider the policy that applies atoms at times $t_0, t_1, \ldots$ and at times $m_0, m_1, \ldots$, so as to match the variance of $v^*$ at each of those times. By Lemma 4, this policy is valid.

Next consider the policy that applies atoms only at times $t_0, t_1, \ldots$, and applies those atoms so that $v(t_i) = v^*(m_i)$ for each $i$. This policy is not necessarily valid, but we claim that this policy can only ever go budget negative by at most $c \cdot B$, the amount of budget accrued in $c$ time units. This is because within each interval $[t_i, t_{i+1}]$, this policy uses no more budget than the previous policy. It may cause budget to be spent earlier than before, within the same window. However, it only differs from the previous policy on subintervals where $v^*(m_i) < c - \epsilon$, and hence can only shift the spending of budget earlier by at most $c$ time units as, in these cases, $\delta$ (the difference between $t_i$ and $t_{i+1}$) is at most $c$. Thus, this new policy can go budget-negative, but never by more than $cB$.

We next claim this policy is lazy. Indeed, for each sub-interval $[t_i, t_{i+1}]$, we have that $v(t_i) \ge c - \delta$ where $\delta = t_{i+1} - t_i$. Thus, our policy has the property that $\lim_{t \to t_{i+1}^-} v(t) \ge c$, so each atom occurs at a point where the variance is at or above $c$. Note that this makes use of the fact that variance drifts upward at a rate of 1 per unit time, in the continuous model.

We claim this (budget-infeasible) policy is $1/2$-approximate, up to an additive $\epsilon$ term on the average cost. Since the policy is lazy, its value (relative to the outside option) is the area of a sequence of isosceles right-angled triangles. By construction, the squares that form the completion of those triangles cover the entirety of the value of $s^*, a^*$, except possibly for regions where $v^*(t) \ge c - \epsilon$. See Figure 2 for a visualization.
So our policy is $1/2$-approximate, possibly excluding regions where $v^*$ has an average contribution of at most $\epsilon$.

Finally, we note we can transform this policy into a budget-feasible one without loss in the approximation factor. First, if we shift our policy to start at time $c$, rather than time 0, then it is precisely valid. As this decreases its value by at most a bounded amount, the loss in average value over a time horizon $T$ vanishes as $T \to \infty$. So we have a valid $1/2$-approximation subject to an arbitrarily small additional additive loss. Taking $\epsilon \to 0$, and noting that there is a universally optimal lazy policy, gives us that the optimal lazy policy is a $1/2$-approximation.
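As a rough numeric illustration of this bound (our own, under the continuous normalization with $z = 0$), one can compare the budget-balanced lazy policy, whose per-cycle value is a triangle, against the feasible benchmark of sampling at a constant rate $B$:

```python
import math

def constant_rate_value(B, c):
    # Sampling at constant rate B: v' = 1 - B v^2 has steady state
    # v = 1/sqrt(B), for a long-run average value of c - 1/sqrt(B).
    return max(c - 1 / math.sqrt(B), 0.0)

def lazy_value(B, c):
    # Budget-balanced lazy policy: an atom of size a = (B c^2 - 1)/c taken
    # whenever v = c drops the variance to c/(1 + a c) = 1/(B c); each cycle
    # contributes a triangle, for an average value of (c - 1/(B c)) / 2.
    assert B * c * c > 1, "budget too small for a lazy atom at level c"
    return (c - 1 / (B * c)) / 2

B, c = 1.0, 4.0
print(lazy_value(B, c) / constant_rate_value(B, c))  # 0.625 here
```

The ratio approaches $1/2$ as $Bc^2$ grows, consistent with the factor in the theorem (the constant-rate policy is only one feasible benchmark, so this is an upper bound on the gap, not a proof).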
Theorem 2 (Version 2). The optimal lazy policy is $1/3$-approximate, in the continuous setting and with $z > 0$.

Proof. We consider the same construction as in Version 1. The only step that could increase costs is the step that shifts atoms to the "left endpoint" of the corresponding square. This might change the distance between atoms, possibly increasing costs by requiring us to pay the fixed startup cost more times.

To fix this, we'll change the definition of a square so that a new square cannot begin until the original policy takes a sample. This might "extend" some squares to the right, forming rectangles. The value generated from the extended part of the rectangle can be at most half the area of the square, since by definition the original policy isn't sampling during this time, so its variance rises at rate 1. This change therefore increases the approximation factor to at most 3, since now one "triangle" is covering a square plus one extra triangle.

Having made this change, we know by definition that the original policy sampled at the left endpoint of each rectangle. We can therefore sample (only) at the left endpoint of each rectangle, without increasing costs relative to the original policy. Note that this can increase costs relative to the policy that takes atoms at the minimum-variance point within each rectangle; so our cost comparison will only be with respect to the original policy.

We show how to extend these results to the discrete setting in Appendix E. We note that, unlike in the continuous setting, we do not suffer a loss in the approximation factor for the discrete setting when adding fixed costs.

We next show how to compute the best lazy policy, a result which was sketched in Section 5 but not formally stated.
Lemma 7. One can compute an asymptotically optimal valid lazy policy in closed form.

Proof.
We will consider some large $R$ and solve for a policy that maximizes average value up to time $R$, then take a limit as $R \to \infty$. By Lemma 6, we can assume the optimal policy sets $a(t) = 0$ for all $t < r$, and has $v(t) \le c$ for all $t \ge r$, where $r \le R$.

For a given atom of size $s$, say taken at time $t \ge r$, recall that the subsequent atom will be taken at time $t + (c - \frac{c}{1+sc})$, the next time at which the variance is equal to $c$. We will therefore define the cost of this atom as $s + \min\{f,\, f\cdot(c - \frac{c}{1+sc})\}$. This is the cost of the $s$ samples, plus the cost of either maintaining flow until the next atom is taken (if $c - \frac{c}{1+sc} < 1$) or of paying the start-up cost when the next atom is taken (otherwise). The sum of costs over all atoms is equal to the total cost of the policy. We therefore define the value density of the atom, the value generated between this atom and the next divided by the atom's cost, to be
\[ \frac{\tfrac{1}{2}\left(c - \tfrac{c}{1+sc}\right)^2}{s + \min\left\{f,\; f\cdot\left(c - \tfrac{c}{1+sc}\right)\right\}}. \]
This expression can be maximized with respect to $s$ by considering separately the cases $(c - \frac{c}{1+sc}) > 1$ and $(c - \frac{c}{1+sc}) \le 1$. The resulting solution will be the optimal choice of atom size, assuming that it does exhaust the total budget, i.e., when the total cost of the resulting policy is at least $BR$.

If the optimal value of $s$ corresponds to a policy that does not exhaust the total budget, then this means that it is time, rather than budget, that is the binding constraint. In this case the optimal policy takes $r = 0$ and chooses $s$ so that the budget is exhausted at time $R$. That is, $s$ is chosen so that $s + \min\{f,\, f\cdot(c - \frac{c}{1+sc})\} = B\cdot(c - \frac{c}{1+sc})$, meaning that the total cost of an atom equals the budget acquired over the time interval between that atom and the next. Again, one can solve for $s$ by considering the cases $(c - \frac{c}{1+sc}) > 1$ and $(c - \frac{c}{1+sc}) \le 1$. This $s$ will correspond to the optimal sampling policy, which will take atoms at regular intervals up to time $R$.
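A small grid search illustrates the maximization in the proof (our own sketch; in practice one solves the two cases $\Delta(s) > 1$ and $\Delta(s) \le 1$ in closed form, where $\Delta(s) = c - \frac{c}{1+sc}$):

```python
def value_density(s, c, f):
    # Value per unit cost of a lazy atom of size s taken at variance c:
    # the variance drops to c/(1+sc), then drifts back up to c over D units.
    D = c - c / (1 + s * c)
    return (D * D / 2) / (s + min(f, f * D))

c, f = 4.0, 0.5
best = max((value_density(k / 1000, c, f), k / 1000) for k in range(1, 4001))
print(best)   # (best density, best atom size s) on a coarse grid
```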
E Appendix: Proofs from Section 5 in the Discrete Model

In this section we complete the proofs of statements that hold in both the discrete and continuous models that we previously proved only in the continuous model. We begin with Lemma 4.
Lemma 4.
Fix any valid (not necessarily lazy) sampling policy with resulting variances $v_t$, and any sequence of timesteps $t_1 < t_2 < \ldots < t_k < \ldots$. Then there is a valid policy that spends samples only at timesteps that lie in the set $\{t_1, \ldots, t_k, \ldots\}$, resulting in variances $\breve v_t$ with $\breve v_{t_i} \le v_{t_i}$ for all $i$.

Proof. We'll prove this for the interval $[0, t_1]$, and the result then follows by repeated application to each subsequent $t_i$ inductively. Assume that the original policy spends samples in the interval $(0, t_1]$ at times $0 < a_1 < a_2 < \ldots < a_k = t_1$, with corresponding numbers of samples $s_1, \ldots, s_k$, where $s_i \ge 1$ for each $i$. Note that the assumption $a_k = t_1$ is without loss, as we could set $s_k = 0$. If $k = 1$ then we are done, so assume $k \ge 2$. We will show that the alternative policy with samples $s'_1, \ldots, s'_k$, where $s'_1 = 0$, $s'_2 = s_1 + s_2$, and $s'_i = s_i$ for all $i > 2$, results in a variance function $v'$ such that $v'(a_i) \le v(a_i)$ for all $i \ge 2$. This will complete the claim, by repeated application to the first non-zero atom in the sequence.

Write $\bar v_1 = v(0) + a_1$ for the variance just prior to taking the samples at $a_1$. Then we have that $v(a_1) = \frac{\bar v_1}{1 + s_1\bar v_1}$. Writing $\bar v_2 = v(a_1) + (a_2 - a_1)$ for the variance just prior to taking the samples at $a_2$, we have
\[ \bar v_2 = \frac{\bar v_1}{1 + s_1\bar v_1} + a_2 - a_1. \]
We then have that $v(a_2) = \bar v_2/(1 + s_2\bar v_2)$.

Alternatively, if we write $\bar v'_2$ for the variance of the policy with $s'_1 = 0$, after having taken $s_1$ samples at time $a_2$ but before taking $s_2$ more samples, then
\[ \bar v'_2 = \frac{\bar v_1 + a_2 - a_1}{1 + s_1(\bar v_1 + a_2 - a_1)}. \]
We will then have that $v'(a_2) = \bar v'_2/(1 + s_2\bar v'_2)$. We now note that $\bar v'_2 \le \bar v_2$, since
\[ \bar v_2 = \frac{\bar v_1}{1 + s_1\bar v_1} + a_2 - a_1 \ge \frac{\bar v_1 + a_2 - a_1}{1 + s_1\bar v_1} \ge \frac{\bar v_1 + a_2 - a_1}{1 + s_1(\bar v_1 + a_2 - a_1)} = \bar v'_2. \]
We can therefore conclude that $v(a_2) \ge v'(a_2)$. This further implies that $v(a_i) \ge v'(a_i)$ for all $i > 2$ as well, since $s'_i = s_i$ for all $i > 2$ and, inductively, the variance is weakly lower under $v'$ than under $v$ just prior to each set of samples, and hence is lower after the application of each set of samples.

We next complete the proof of our main approximation result in the discrete setting.
Theorem 2. In the discrete model, the optimal lazy policy is $1/2$-approximate.

Proof. Write $s^*_t$ for the optimal sampling policy, and $v^*_t$ for the corresponding variances. We note that, without loss of generality, we can assume that if $s^*_t > 0$ then $v^*_t < c$. That is, the policy does not take samples if the resulting variance is still above the outside option. This follows from our structural lemma above, since any policy can be replaced by one that takes no samples until the variance would be below $c$, and then takes all the forgone samples at that point.

We now define a lazy policy that approximates $s^*$. We do so by defining a sequence of intervals iteratively. We begin by setting $t_0 = 0$. For each $t_i$, we will define $t_{i+1} > t_i$ as follows. Choose $t_{i+1} = t_i + \delta$, where
\[ \delta = \inf\{\delta' \in \mathbb{N}_+ : c - v^*_t \le \delta' \;\;\forall t \in [t_i, t_i + \delta')\}. \]
We must have $\delta \le c$, since certainly $v^*_t \ge 0$ everywhere.

For each $i \ge 0$, let $m_i = \arg\inf_{m \in (t_i, t_{i+1}]}\{v^*_m\}$. That is, $m_i$ is the time in the subinterval $(t_i, t_{i+1}]$ where $v^*$ takes its lowest value.

Consider the policy that takes samples at times $t_0, t_1, \ldots$ and at times $m_0, m_1, \ldots$, so as to match the variance of $v^*$ at each of those times. By Lemma 4, this policy uses no more budget than $s^*$ at any given point of time, and it only takes samples on rounds when the original policy took samples, so it is valid even when $z > 0$.

Next consider the policy that applies atoms only at times $t_0, t_1, \ldots$, and applies those atoms so that $v_{t_i} = v^*_{m_i}$ for each $i$. This policy is not necessarily valid, but we claim that this policy can only ever go budget negative by at most $c \cdot B$, the amount of budget accrued in $c$ time units. Proof of claim: within each interval $(t_i, t_{i+1}]$, this policy uses no more budget than the previous one. It may cause budget to be spent earlier than before, within the same window. However, it can only shift the spending of budget earlier by at most $c$ time units. Thus, this new policy can go budget-negative, but never by more than $cB$.

We next claim this policy is lazy. Indeed, for each sub-interval $[t_i, t_{i+1}]$, we have that $v_{t_i} \ge c - \delta$ where $\delta = t_{i+1} - t_i$. Thus, our policy has the property that the variance just prior to the atom at $t_{i+1}$ is at least $c$, so each atom occurs at a point where the variance is at or above $c$.

Finally, we claim that our policy is $1/2$-approximate. Since the policy is lazy, its value (relative to the outside option) is lower bounded by the area of a sequence of isosceles right-angled triangles. By construction, the squares that form the completion of those triangles cover the entirety of the value of $s^*$.

To complete the proof, we need to restore validity by shifting the policy to start at time $c$, rather than time 0. As this decreases its value by at most a bounded amount, the loss in average value over a time horizon $T$ vanishes as $T \to \infty$.
Lemma 8. One can compute a valid policy whose asymptotic value is at least the value of the optimal lazy policy.

Proof.
We will consider some large $R$ and solve for a policy that maximizes average value up to time $R$, then take a limit as $R \to \infty$. By Lemma 6, we can assume the optimal policy sets $s_t = 0$ for all $t < r$, and has $v(t) \le c$ for all $t \ge r$, where $r \le R$. Suppose for now that we are allowed to take fractional samples.

Consider a lazy policy which takes samples until the variance is $v_r$ at time $r$, and then, every time it is due to take samples, takes the number $s$ such that the variance becomes $v_r$ again. By construction, this means $s$ samples are taken at each time other than $r$ at which they are taken, which happens every $\delta = \lceil c - v_r \rceil$ periods. Therefore, $s$ satisfies the equation $v_r = \frac{v_r + \delta}{1 + s(v_r + \delta)}$, or $s = \frac{\delta}{v_r(v_r+\delta)}$. We therefore define the value density of this policy as
\[ \frac{\sum_{i=0}^{\delta-1}(c - v_r - i)}{\frac{\delta}{v_r(v_r+\delta)} + f}. \tag{1} \]
We can then optimize over $v_r$ in the same manner as in Lemma 7 to find the optimal way for a lazy policy to exhaust its budget. For any given $R$ there may be a slight suboptimality, due to the need to spend some samples to reach $v_r$ the first time and to some budget that does not get spent if there is not budget for an integer number of spending periods, but these go to zero for $R$ large.

This gives an asymptotically valid lazy policy using fractional samples. Note that the objective in Equation (1) is quasiconcave in $v_r$ for each fixed $\delta$ (the derivative is a quadratic in $v_r$ with a negative coefficient on the $v_r^2$ term and positive coefficients on the other two terms). Therefore the optimal integer choice for a given $\delta$ can be found by "rounding" $v_r$ up or down by the smallest amount that gives an integer $s$, and choosing the better of the two. While the resulting policy of waiting and then taking $s$ samples every $\delta$ periods may not be lazy, it is asymptotically so for large $R$ by Lemma 2.
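A grid-search sketch of the optimization in Equation (1) (our own illustration; as in the proof, one would then round $v_r$ so that $s$ is an integer, and the budget constraint determines which density is actually attainable):

```python
import math

def density(v_r, c, f):
    # Equation (1): value per unit cost of a lazy cycle that restores the
    # variance to v_r every delta = ceil(c - v_r) rounds.
    delta = math.ceil(c - v_r)
    s = delta / (v_r * (v_r + delta))          # (fractional) samples per cycle
    return sum(c - v_r - i for i in range(delta)) / (s + f)

c, f = 6.0, 0.5
best = max((density(k / 100, c, f), k / 100) for k in range(50, 600))
print(best)   # (best density, best return level v_r) on a coarse grid
```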
F Appendix: Optimal Policy for the Non-Gaussian Extension

In this section we solve for the form of the optimal policy in the binary extension discussed in Section 6.

First we recall the model. There is a binary hidden state of the world, $x_t \in \{0, 1\}$, which flips each round independently with some small probability $\epsilon > 0$. The decision-maker's action in each round is to guess the hidden state, and the objective is to maximize the fraction of time that this guess is made correctly. Write $y_t \in \{0, 1\}$ for the guess made in round $t$. Each sample is a binary signal correlated with the hidden state, equal to $x_t$ with probability $\frac{1}{2} + \delta$ where $\delta > 0$. The decision-maker can adaptively request samples in each round, subject to the budget constraint (of $B > 0$ samples per round on average), before making a guess. Note that sampling is adaptive: the decision-maker can observe the outcome of one sample in a round before choosing whether to take the next. While this adaptivity was unimportant in the Gaussian setting (since the sample outcomes were not payoff-relevant, only the induced variance), it is significant for non-Gaussian evolution.

After sampling in each round, the decision-maker has a posterior distribution $G_t$ over the state of the world. We'll write $G_t^k$ for the posterior after $k \ge 0$ samples have been taken in round $t$. Note that $G_t^k$ is fully described by the probability that $x_t = 1$, which we will denote by $p_t^k$. We claim that there is an optimal policy that sets a threshold $\theta \in (0, 1/2)$ and, after having taken $k \ge 0$ samples in round $t$, takes another sample if and only if $p_t^k \in [\theta, 1 - \theta]$. We call such policies threshold policies.
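A simulation sketch of a threshold policy (our own illustration; here $\theta$ is set by hand, whereas in the model it would be tuned so that the average number of samples per round matches the budget $B$):

```python
import random

def run_threshold_policy(theta, eps, delta, rounds, seed=0):
    # Binary model: the state flips w.p. eps each round; each sample equals
    # the state w.p. 1/2 + delta. Sample while the posterior p = Pr[x_t = 1]
    # lies in [theta, 1 - theta], then guess the likelier state.
    rng = random.Random(seed)
    x, p = 1, 0.5
    correct = samples = 0
    for _ in range(rounds):
        if rng.random() < eps:
            x = 1 - x
        p = p * (1 - eps) + (1 - p) * eps       # prior update for the flip
        while theta <= p <= 1 - theta:
            samples += 1
            y = x if rng.random() < 0.5 + delta else 1 - x
            like1 = 0.5 + delta if y == 1 else 0.5 - delta
            like0 = 0.5 - delta if y == 1 else 0.5 + delta
            p = p * like1 / (p * like1 + (1 - p) * like0)  # Bayes update
        correct += (p >= 0.5) == (x == 1)
    return correct / rounds, samples / rounds   # accuracy, samples per round

print(run_threshold_policy(theta=0.2, eps=0.01, delta=0.2, rounds=20000))
```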
Theorem 3. For any $\epsilon$ and $\delta$, there is an optimal threshold policy.

Proof (sketch). Fix some large $T$. We will evaluate performance over the first $T$ rounds, and show optimality up to a loss that is vanishing as $T$ grows large.

We first note that it suffices to consider policies that are admissible subject to a budget constraint that binds in expectation. To see this, take a policy that satisfies the budget constraint in expectation; we will construct a policy with the same asymptotic value that satisfies the budget constraint ex post. To do so, we'll make two changes to the budget-in-expectation policy. First, delay the policy's execution by $\Theta(\sqrt{T})$ rounds. This has negligible impact on asymptotic performance, but begins the policy with a pool of funds to pull from. Standard concentration bounds will then imply that the policy will exceed its ex post budget exponentially rarely; whenever it does, simply pause execution for another $\Theta(\sqrt{T})$ rounds to recover the pool of funds and then continue. The resulting policy satisfies the budget ex post, and the impact of such pauses on asymptotic performance will vanish in the limit as $T$ grows large.

Next, we claim that it suffices to consider policies whose action after having taken $k \ge 0$ samples in round $t$ depends only on $p_t^k$. That is, the actions are otherwise history-independent. This is because the optimal policy starting from a state with posterior $p_t^k$ depends only on $p_t^k$, and in particular does not depend on the number of samples that have been taken in round $t$ or any previous round (as the constraints bind only in expectation). Thus, for any optimal policy that sometimes takes an additional sample when the posterior is $p_t^k$ and sometimes does not, it would likewise be optimal to simply ignore the history and choose a decision independently at random, according to a distribution consistent with how frequently each choice is made when the posterior is $p_t^k$ over the long-run execution of the policy.

We next claim that the long-run payoff of the optimal policy, starting in a round $t$ where the posterior begins at $p_t$, is weakly increasing in $|p_t - 1/2|$. This follows from the fact that $p_t$ being farther from $1/2$ corresponds to additional certainty about the hidden state. For any $1/2 < p'_t < p_t$, any policy with posterior mean $p_t$ could choose to "forget" information and behave as though its posterior is $p'_t$, and achieve at least as high a payoff (in expectation) as a policy whose true posterior in round $t$ is $p'_t$.

We next claim that the total long-run payoff of guessing after having taken $k$ samples is increasing and weakly concave in $p_t^k$, for $p_t^k \ge 1/2$. (The case $p_t^k < 1/2$ will follow similarly by symmetry.) The fact that payoffs are increasing follows by backward induction: this is certainly true in the last round $T$, as the payoff in round $T$ is strictly increasing. Then, for any $t < T$, the payoff of the subsequent threshold policy likewise depends only on the value of the posterior at the beginning of the round (as argued above), which is increasing in $p_t^k$ and will be at least $1/2$ if $p_t^k \ge 1/2$. Concavity follows from the fact that in-round payoffs increase linearly in $p_t^k$, but inter-round reversion to the mean is more pronounced for larger $p_t^k$, so an increase in $p_t^k$ has a sublinear effect on the payoffs from subsequent rounds.

Since the payoff function is increasing in $p_t^k$ in each round, the optimal policy will be monotone: in each round, there will be a threshold above which a guess is made, and below which samples are taken.
This choice of thresholds will be made to optimize long-run payoff given the average budget constraint. Since the payoffs are concave in $p_t^k$, and this value function is identical across rounds, the optimal choice of thresholds will likewise be uniform across rounds. By symmetry, if $\theta$ is chosen as the threshold for $p_t^k > 1/2$, the threshold for $p_t^k < 1/2$ will be $1 - \theta$.