Optimal allocation of finite sampling capacity in accumulator models of multi-alternative decision making
Jorge Ramírez-Ruiz and Rubén Moreno-Bote
Center for Brain and Cognition, and Department of Information and Communication Technologies, Universitat Pompeu Fabra, Barcelona, Spain
Serra Húnter Fellow Programme, Universitat Pompeu Fabra, Barcelona, Spain
Abstract
When facing many options, we narrow down our focus to very few of them. Although behaviors like this can be a sign of heuristics, they can actually be optimal under limited cognitive resources. Here we study the problem of how to optimally allocate limited sampling time to multiple options, modelled as accumulators of noisy evidence, to determine the most profitable one. We show that the effective sampling capacity of an agent increases with both available time and the discriminability of the options, and optimal policies undergo a sharp transition as a function of this capacity. For small capacity, it is best to allocate time evenly to exactly five options and to ignore all the others, regardless of the prior distribution of rewards. For large capacities, the optimal number of sampled accumulators grows sub-linearly, closely following a power law for a wide variety of priors. We find that allocating equal times to the sampled accumulators is better than using uneven time allocations. Our work highlights that multi-alternative decisions are endowed with breadth-depth tradeoffs, demonstrates how their optimal solutions depend on the amount of limited resources and the variability of the environment, and shows that narrowing down to a handful of options is always optimal for small capacities.
The problem of allocating finite resources to determine the best of several options is common in decision making, from deciding which vaccine candidates to fund for further research to choosing a movie for Saturday night. In these cases, planning, and thus resource allocation, needs to be made in advance, well before feedback about the success of the choice is observed. Consequently, two important questions arise: How many options should we examine? And, for how long? When resources are limited, such as number of participants or weekend free time, a decision maker should balance breadth, how many options to sample, and depth, how much to sample each. This ubiquitous decision making problem under constrained resources is what has been called the breadth-depth (BD) dilemma [1–3].

In the face of many alternatives, humans quickly narrow down the number of considered options to around two to five [4–8], and, when presented with more than 6 options, experienced overload produces suboptimal choices in certain conditions [9, 10]. Models describe this behavior by assuming that considering more options incurs search or mental costs [8, 11, 12], but why people consider small sets in a wide range of environments is still a matter of debate. While this could be explained by strict small capacity limits in attention or working memory [13, 14], the nature of this small capacity would still need to be addressed [15]. Another possibility is that capacity is not necessarily small, but rather that sampling few options and ignoring the vast majority is actually an optimal policy that favors depth over breadth [3]. This possibility is supported by the fact that neuronal resources devoted to decision making are not precisely low, as dozens of brain areas and several billions of neurons are involved in even simple decision making tasks [16–19].
Thus, processing bottlenecks could be reflections of close-to-optimal policies.

Bounded rationality accounts [20–23] surmise that many features of cognition arise from the finite limits of the nervous system. This must also be the case for the nature of the policies chosen by people in decision making, but oftentimes the constraints imposed by the limited resources are not made explicit. Indeed, choices between two or three options have typically been modelled as optimal stopping problems [24–30], where agents should optimally balance the prospect of learning the value of the options with the costs of sampling them, but they do so without computational or capacity constraints. The effect of resource limitations on decision making might not be important when there are only two or three available options, but it might be critical when going beyond those low numbers. In that case, the allocation of resources might be governed by two-stage processes [8, 11, 31, 32], instead of purely sequential processes, where the first decision is about the subset of options that will be considered for further processing.

Here we study whether narrowing attention to a few options results from optimally allocating finite resources. To this end, we consider an infinitely divisible sampling resource (e.g. time), such that there are no bounds on the number of alternatives that can be considered. In our model, an agent first allocates finite sampling time over an arbitrarily large number of options, modelled as accumulators of noisy evidence, with the only restriction that the total sampling time is fixed. Accumulation of evidence runs in parallel and independently for each accumulator, and only their final states are observed. Based on the observations, the agent picks the one with the highest expected drift rate, which defines the utility of the choice. The goal of the agent is to optimize the allocation of sampling time such that expected utility is maximized.
We identify a critical variable in the problem, which we simply call capacity, that increases with the actual size of the resources of the agent as well as with the discriminability between options, and we find that this capacity separates two distinct regimes of optimal allocation. When sampling capacity is small, the optimal policy is to sample exactly five options, regardless of the prior. In contrast, when capacity is large, the number of options to sample grows with capacity in a sub-linear fashion that depends on the prior. We find a duality between allocated time and allocated precision, such that all our results generalize to allocating precision while keeping the sampling time fixed. Finally, we show that even allocations are optimal, and thus better than more complex asymmetric time allocations over the considered options. Overall, our results suggest that decisional bottlenecks can be a byproduct of optimal policies in the face of uncertainty.

Multi-accumulator model
We consider an environment that generates many options (N ≫ 1) from which to choose (Fig. 1, top), each one characterized by a 'drift' parameter µ_i (i = 1, ..., N), unknown to the agent. All drifts µ_i are drawn identically and independently from a prior probability distribution p_θ(µ), known to the agent and assumed to have finite mean and variance.

The agent learns about the options, in order to choose between them, by allocating sampling time t_i to each (Fig. 1, bottom). The critical aspect of our model is that sampling times t_i ≥ 0 need to be allocated before feedback about the drifts is received, and with the constraint that the total sampling time equals a finite available time T,

∑_{i=1}^{N} t_i = T.   (1)

In practice, the agent needs to decide on the number of options M ≤ N to be sampled and their corresponding sampling times t_i > 0 for i ≤ M, while the remaining options i > M are ignored by giving them no sampling time, t_i = 0. The ordering of the options is irrelevant, as they are initially indistinguishable, and thus we take the first M as those that are sampled.

Figure 1: A multi-accumulator model with finite sampling resources. The environment produces a large number of options, each characterized by a drift µ_i, unknown to the agent and drawn from a prior distribution characterized by hyperparameters θ, which is known to the agent. The agent has a finite resource T that they divide and allocate across options, ∑_i t_i = T, in order to sample them. In practice, the agent allocates finite sampling time to a finite number M of accumulators to infer their unknown drifts. After allocation, evidence (red lines) is optimally integrated by the accumulators. The agent observes the integrated evidence x_i at the allocated time t_i, infers the drifts for each of the accumulators and chooses the one that is deemed to have the highest drift (in this case, µ_M; green box).

We assume that non-sampled
options cannot be chosen, although a 'default' option can be added to our framework with no change to our main results.

Once total sampling time is allocated, noisy evidence about the drift µ_i of each of the sampled options i ≤ M is integrated by independent accumulators (Fig. 1, middle) according to the drift-diffusion process

dx_i(t)/dt = µ_i + η_i(t),   (2)

where x_i(t) is the accumulated evidence up to time t with initial condition x_i(0) = 0, and η_i(t) is a Gaussian white noise with zero mean and fixed variance σ², independent and identical for all the accumulators.

The result of the accumulation is the total evidence x_i at time t_i, both of which are observed by the agent and constitute the sufficient statistics for the unknown drift µ_i [33]. With these observations, the agent builds the posterior distribution of the drifts by using Bayes rule as

p(µ_i | x_i, t_i, σ, θ) = L(µ_i | x_i, t_i, σ) p_θ(µ_i) / p(x_i | t_i, σ, θ),   (3)

where L(µ_i | x_i, t_i, σ) = N(x_i | µ_i t_i, σ² t_i) is the likelihood function for the drift, p_θ(µ) is the prior distribution and p(x_i | t_i, σ, θ) = ∫ dµ N(x_i | µ t_i, σ² t_i) p_θ(µ) is the marginal distribution of the evidence, which serves as a normalization constant. The posterior mean of the drift becomes µ̂_i ≡ µ̂_i(x_i, t_i, σ, θ) ≡ ∫ dµ µ p(µ | x_i, t_i, σ, θ), which is finally used by the agent to choose the option with the highest inferred drift (Fig. 1, middle, green box), to optimize utility,

U(M, x, t, σ, θ) = max_{i ≤ M} µ̂_i(x_i, t_i, σ, θ),

where x = (x_1, ..., x_M) is the vector of observations for the M accumulators with allocated times t = (t_1, ..., t_M). To avoid notation clutter, from now on we will stop writing the dependence on σ and θ of the various functions and leave it implicit. The last expression is the utility of the choice of accumulator, which depends on the observations and allocation times.
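The generative model and the Bayes-optimal choice are easy to simulate when the prior is Gaussian, since the conjugate posterior mean has a closed form. The sketch below is illustrative only: all parameter values (mu0, sig0, sigma, T, M) are assumptions for the example, not taken from the paper's figures. It draws drifts from the prior, integrates Eq. (2) to its endpoint x_i = µ_i t_i + σ√t_i ξ_i, and scores the drift of the option with the highest posterior mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions for this sketch):
mu0, sig0 = 0.0, 1.0   # mean and standard deviation of the Gaussian prior
sigma = 1.0            # standard deviation of the accumulation noise
T, M = 2.0, 4          # total sampling time and number of sampled options

def posterior_mean(x, t):
    """Posterior mean of a drift given evidence x at time t, for the conjugate
    pair: likelihood N(x | mu*t, sigma^2*t) times Gaussian prior N(mu0, sig0^2)."""
    c = sig0**2 * t / sigma**2            # per-option capacity ratio
    return (mu0 + c * (x / t)) / (1.0 + c)

def one_trial():
    t = np.full(M, T / M)                                   # even time allocation
    mu = rng.normal(mu0, sig0, size=M)                      # latent drifts
    x = mu * t + sigma * np.sqrt(t) * rng.normal(size=M)    # endpoint of Eq. (2)
    return mu[np.argmax(posterior_mean(x, t))]              # drift of chosen option

U_hat = np.mean([one_trial() for _ in range(20000)])
print(f"Monte Carlo expected utility for M = {M}: {U_hat:.3f}")
```

Averaging the chosen drift over many trials approximates the expected utility of this particular even allocation; the shrinkage of the posterior mean toward the prior mean reflects how little each option is sampled.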
However, before time is allocated, the observations x themselves are unknown to the agent. Therefore, the expected utility of a given allocation t is given by taking the expectation of the above utility over all possible observations as

Û(M, t) ≡ E[ max_{i ≤ M} µ̂_i | t ] = ∫ dx_1 ... dx_M p(x_1, ..., x_M | t) max_i µ̂_i(x_i, t_i),   (4)

where, using the independence of the accumulators, p(x_1, ..., x_M | t) = ∏_{k ≤ M} p(x_k | t_k) is the product of the marginal distributions of the evidences.

Optimally inferring the drifts from observations is readily accessible through Bayesian inference as shown above. Thus, the main, and harder, objective of the agent is to optimize the allocation policy, i.e. to select both the number of sampled accumulators M ≤ N and the time t_i allocated to each, in order to maximize expected reward, while satisfying the total sampling time constraint in Eq. (1). This is accomplished by optimizing the utility with respect to M and t = (t_1, ..., t_M) as

(M*, t*) = argmax_{M, t} Û(M, t).   (5)

Capacity and time-precision duality
While time is the resource that the agent allocates, we found a dimensionless scale that expresses their actual sampling capacity, i.e. their ability to sample and differentiate between drifts, which we call capacity C (Fig. 2). As the agent integrates noisy evidence through Eq. (2), the likelihood of the drift µ_i for accumulator i is proportional to a Gaussian (Fig. 2a, orange curve) with mean x_i/t_i and variance σ²/t_i, L(µ_i | x_i, t_i, σ) ∝ N(µ_i | x_i/t_i, σ²/t_i) [33]. Its variance σ²/t_i shows how the sampling time and the variance of the sampling noise are related when inferring the drift µ_i, since the Gaussian gets broader with increasing σ² or decreasing time t_i. In fact, the sampling capacity of the agent should capture this duality. Thus, having a fixed capacity could be interpreted as having a fixed noise variance σ² for all accumulators and allocating time T between them (Fig. 2b, left), or as having a fixed sampling time T for each of the accumulators and allocating precision 1/σ² between them (Fig. 2b, right).

Moreover, the posterior in Eq. (3) depends on the prior as well (Fig. 2a, cyan curve). For fixed evidence, the broader the prior is, the easier it is to differentiate between sampled drifts, since the expected squared distance between two drifts drawn from the same distribution is twice its variance Var[p_θ(µ)]. Therefore, we define the capacity allocated to option i as the ratio between the precision of the observation and the precision of the prior,

c_i = Var[p_θ(µ)] / Var[N(µ_i | x_i/t_i, σ²/t_i)] = σ₀² t_i / σ²,   (6)

where σ₀² ≡ Var[p_θ(µ)] denotes the variance of the prior. Adding the individual capacities results in the total sampling capacity of the agent,

C = ∑_i c_i = σ₀² T / σ².
(7)

For the rest of this article, we stick to the interpretation of allocating capacity as dividing the total time T while fixing the accumulation noise σ², such that the variable we can control is the sampling time allocated to each option, keeping in mind that all the results presented below can be readily reinterpreted as dividing precision while giving all options the same sampling time.

Even sampling
Optimally dividing sampling capacity C among options is an a priori hard problem due to its high dimensionality. However, we show in a section below, through numerical simulations and for Gaussian priors, that the optimal allocation lies within the family of even allocations, where M options receive equal sampling time t_i = t ≡ T/M, while the remaining ones are given no time. Thus, finding the optimal policy reduces to finding the optimal number M of accumulators to sample.

Figure 2: Time/precision duality and the notion of capacity. (a) The likelihood of the drift µ (in orange) given the evidence has variance σ²/t_i, and the prior distribution of the drifts (in cyan) has variance σ₀². These quantities determine capacity as in Eq. (6). (b) Time and sampling noise are intricately related (see text). In this example, allocating time T/2 to each accumulator under fixed precision 1/σ² (left) is equivalent to allocating precision 1/(2σ²) to each accumulator under fixed sampling time T (right). (c) Small capacity means that the variance of the observation is much larger than the variance of the prior, indicating that it is difficult to confidently identify the best drift from the observations. (d) In the large capacity limit, it is easier to differentiate the good drifts from the poor ones.

In this section, we exploit the structure of even sampling. First, the posterior mean of the drift µ̂_i(x_i, t), computed from Eq. (3), is a monotonically increasing function of the evidence x_i for any prior (see proof in Sec. 4.1 of the Methods). Therefore, the option that maximizes the posterior mean µ̂_i is the one that has the highest evidence x_i(t), as all M sampled options are given the same sampling time t. This allows us to work by maximizing evidence instead of maximizing the posterior means of the drifts in Eq. (4).
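This monotonicity can also be checked numerically for priors without conjugate structure. The following is a minimal sketch using an illustrative bimodal mixture prior (an assumption for the example, not the paper's exact bimodal prior of Eq. (21)) and a grid approximation of the posterior of Eq. (3):

```python
import numpy as np

# Numerical check that the posterior mean of the drift increases with the
# evidence x; parameter values (sigma, t, mixture components) are illustrative.
sigma, t = 1.0, 0.5
mu = np.linspace(-8.0, 8.0, 4001)               # grid over drift values
prior = np.exp(-(mu + 1.0) ** 2 / 2) + np.exp(-(mu - 1.0) ** 2 / 2)
prior /= prior.sum()                            # normalize on the grid

def post_mean(x):
    # posterior ∝ N(x | mu*t, sigma^2*t) * prior(mu), cf. Eq. (3)
    like = np.exp(-(x - mu * t) ** 2 / (2 * sigma**2 * t))
    post = like * prior
    post /= post.sum()
    return float((mu * post).sum())

xs = np.linspace(-3.0, 3.0, 41)
means = np.array([post_mean(x) for x in xs])
assert np.all(np.diff(means) > 0)               # increasing in the evidence x
print("posterior mean rises monotonically with the evidence x")
```

Because the posterior mean is increasing in x, ranking options by their evidence is equivalent to ranking them by their inferred drifts under even sampling.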
Secondly, by changing variables y ≡ max_i x_i, and using the probability distribution of the maximum y, denoted p_max(y | t, σ, θ), the expected utility in Eq. (4) can be recast as the one-dimensional integral

Û(M, t) = ∫ dy p_max(y | t) µ̂(y, t).

Finally, given that the M options are sampled evenly, the probability distribution of the maximum can be simplified by using the cumulative distribution of the evidence x for an arbitrary accumulator, F_x(y | t) = ∫_{−∞}^{y} dx′ p(x′ | t), where p(x | t) is the marginal of the evidence x of the accumulator, as

p_max(y | t) = d/dy [F_x(y | t)]^M.   (8)

With all the above, the expected utility in Eq. (4) can thus be written as

Û(M, t) = M ∫ dy [F_x(y | t)]^{M−1} p(y | t) µ̂(y, t).   (9)
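For a Gaussian prior, Eq. (9) can be reduced to a one-dimensional integral over the standardized maximum of M unit normals, which makes a numerical search for the optimal M straightforward. The sketch below is a hedged illustration: the values of mu0, sig0 and the capacities C are assumptions for the example, and it uses the fact that the Gaussian posterior mean has standard deviation σ₀√(c/(1+c)) across observations, with c = C/M the capacity per sampled option:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Illustrative prior parameters (assumptions for this sketch):
mu0, sig0 = 0.0, 1.0

def expected_max(M):
    # E[max of M iid standard normals] = M * ∫ y φ(y) Φ(y)^(M-1) dy
    f = lambda y: M * y * norm.pdf(y) * norm.cdf(y) ** (M - 1)
    val, _ = quad(f, -12, 12)
    return val

def utility(M, C):
    # Posterior means are shrunk toward mu0, with spread sig0*sqrt(c/(1+c))
    # where c = C/M is the per-option capacity under even sampling.
    c = C / M
    return mu0 + sig0 * np.sqrt(c / (1.0 + c)) * expected_max(M)

Ms = np.arange(1, 200)
for C in (0.01, 1.0, 100.0):
    M_star = int(Ms[np.argmax([utility(M, C) for M in Ms])])
    print(f"C = {C:6.2f}: optimal number of sampled options M* = {M_star}")
```

At very small capacity (C = 0.01) the maximum falls at M* = 5, while at large capacity the optimum lies far above five and keeps growing with C.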
Figure 3: Expected utility as a function of the number of sampled accumulators exhibits the breadth-depth tradeoff. Results for the Gaussian prior case (µ₀ = 0., σ₀ = 1) for three different capacities (C = 1, 10, 100). Blue points denote the maxima. Note the logarithmic horizontal scale (points, Monte Carlo simulations; lines, theoretical predictions, Eq. 10).

When the prior distribution is a Gaussian with mean µ₀ and variance σ₀², it is possible to identify the total capacity C = σ₀² T / σ² explicitly, and Eq. (9) simplifies to

Û(M, C) = µ₀ + M σ₀ √(C/(M + C)) ∫_{−∞}^{∞} dy [Φ(y)]^{M−1} N(y | 0, 1) y,   (10)

where Φ(y) = ½[1 + erf(y/√2)] is the cumulative distribution function of a standard normal distribution.

Plotting the utility in Eq. (10) as a function of the number of sampled accumulators M reveals a clear breadth-depth tradeoff (Fig. 3). At the depth limit, M = 1, only one accumulator is sampled and it is given all sampling time T. In this case, the expected utility is simply the expected value of the prior, µ₀ (Fig. 3, left point), since there is no choice to be made between accumulators. At the breadth extreme, M/C → ∞, the evidence gathered for each accumulator is very noisy because each has been allocated a very short sampling time, and thus choosing any will amount to an expected utility again equal to the prior mean (rightmost points). Therefore, for all capacities, there is an intermediate optimal value for the number of accumulators to sample, M*.

Sharp transition between the small and large capacity regimes
Our main result is that the optimal allocation policies are qualitatively different at small and large capacity, and that there is an abrupt transition between the two regimes. We provide useful asymptotic analytical expressions for the utility in Eq. (9) and the optimal M* in both limits and describe their characteristic features.

The limit C ≪ 1 corresponds to the case where the uncertainty in the observation, σ²/T, is much larger than the variance of the prior, σ₀², i.e. the Gaussian likelihood is much wider than the prior (Fig. 2c). In this limit, we find that the utility in Eq. (9) can be expanded as a series in powers of √C, which at first order is given by (see Sec. 4.2)

Û(M, C) = µ₀ + σ₀ √(C/(2π)) [ √M ∫_{−∞}^{∞} dz z exp(−z²/2) (½ + ½ erf(z/√2))^{M−1} ] + O(C).   (11)

Remarkably, this expression holds for any prior distribution as long as capacity is small enough. Using Extreme Value Theory (see Sec. 4.4), and noting that the only dependence on M is in the quantity in the square brackets, we find that utility decreases with M for large M. On the other hand, it is easy to see that expected utility attains its lowest value when M = 1. Thus, as utility is positive, the optimal M should be attained at some intermediate value. We find numerically that the optimum happens when

M*(C ≪ 1) = 5.

In summary, at small capacity the optimal number of sampled accumulators is constant and equal to five, regardless of the prior and the value of capacity. We have confirmed this strong prediction by direct numerical integration of Eq. (9) using different prior distributions, including Gaussian, uniform and bimodal (Fig. 4), and it also holds even when a non-sampled, default, option can be chosen (diamond markers).

Figure 4: The optimal number of sampled accumulators undergoes qualitatively different behaviors at small and large capacity values. Results come from searching for the maximum expected utility via Monte Carlo simulations (points) and numerical integration (lines) for Gaussian (Eq. 10), uniform (Eq. 20) and bimodal (Eq. 21) priors (illustrated in inset). In addition, diamond markers indicate simulations with a 'default' option. We used µ₀ = 0. for all priors and, for the Gaussian prior, σ₀² was chosen to match the variance of the uniform distribution. For the bimodal prior, the variance of each mode equals σ₀². The dashed red line corresponds to the asymptotic limit in the Gaussian prior case, Eq. (13), M* = C/W(C). The dashed gray line is the best power-law fit for the uniform prior case (exponent ≈ 0.33).

The opposite limit C ≫ 1 corresponds to the case where the precision of the observation is much greater than that of the prior (Fig. 2d). Intuitively, this means that the quality of the observations is good enough to likely differentiate the drifts between two randomly chosen accumulators, and thus we expect the optimal number of accumulators to increase with increasing capacity, giving a qualitatively different behaviour from the small capacity limit. To study this limit, we assume that the optimal number of sampled options M* increases with C, an assumption that is consistent with the results shown below, and thus we study the behaviour of Eq. (9) for large M. In this limit, we can find an analytical expression for the expected utility in Eq. (10) when the prior distribution is Gaussian,

Û(M, C) → µ₀ + σ₀ b_M √(C/(M + C)),   (12)

where b_M = (2 log(M) − log(log(M)) − log(4π))^{1/2} (see Sec. 4.4). By relaxing M to be continuous, we can maximize expected utility, and we find that the optimal number of sampled options for large capacity satisfies, up to leading order, the implicit equation M* log(M*) = C. After inverting it, the optimal number of sampled options is

M*(C ≫ 1) = C / W(C),   (13)

where W(C) is the Lambert W function. This asymptotic limit provides a very good approximation to the optimal M* at large C obtained from direct numerical integration of Eq. (10) (Fig. 4; red dashed line, theory; blue points, simulations). For prior distributions other than the Gaussian, we rely on numerical integration of Eq. (9) (see Secs. 4.5 and 4.6 for analytical expressions). For a uniform prior, the optimal number of sampled options increases as a power law with an exponent close to 1/3 (Fig. 4, pink), while for a bimodal prior the optimal number increases in a similar fashion to the Gaussian prior case (green). While differences in asymptotic limits are due to the presence of bounded or unbounded drifts in the priors, in all cases the increase is sub-linear, indicating that increasingly longer times are allocated to each of the sampled accumulators as capacity increases.

The above results show that there are two distinct regimes, one at small and another at large capacities, characterized by qualitatively different optimal allocations: while at small capacity the optimal number of sampled options should be five regardless of the prior, at large capacity the optimal number of sampled options grows sublinearly regardless of the tested prior. Further, we observe that there is an abrupt transition between the two regimes as capacity grows, with a bump being observed at intermediate capacity values.

Even allocation is optimal
Above we have assumed that we could find the optimal time allocation within the subset of even allocations, such that, given finite total time T, an agent just needs to determine how many options will be sampled and split time equally among all of them. Conveniently, this set is discrete and thus amenable to an effective search for the optimum. However, in general, the set of allocation policies is the infinite-dimensional simplex ∑_i t_i = T, t_i ≥ 0 for all i, as a priori the agent could unevenly split time among options in any arbitrary way. Despite its infinite dimensionality, we have seen in the case of even sampling that it is optimal to ignore (infinitely) many options, such that t_i > 0 only for i ∈ {1, ..., M}, with finite M, and then we say that there are M active dimensions.

To address the most general case, using the above intuitions we first generalize the expected utility, Eq. (9), to the case when allocated time is unevenly distributed among the M accumulators, as

Û(M, t) = ∫_{−∞}^{∞} dy (d/dy)[ ∏_{i=1}^{M} F_x(y | t_i) ] µ̂(y, t_i),   (14)

where F_x(y | t_i) is the cumulative distribution function of the evidence when using t_i sampling time. Our goal is then, for every M, to find the allocation t that maximizes Eq. (14) under the capacity equality constraint and the inequalities t_i ≥ 0 for all i, and then select the optimal M, the one that achieves the highest utility.

In this more general setup, an even allocation corresponds to the symmetrical point in M active dimensions given by t^e_M, where t^e_{M,i} = T/M for i = 1, ..., M (the superscript reflects 'even' allocation). As the expected utility in Eq. (14) is symmetric under any permutation t_j ↔ t_k for any j and k, all its partial derivatives have to be equal at t^e_M. Therefore, every even allocation for each M corresponds to a critical point of the constrained optimization problem (see Sec. 4.7).

We still need to characterize these critical points in order to show that the global maximum is indeed an even allocation. We first remember that the optimal number of active dimensions M needs to be found, and thus it is useful to see how expected utility varies as a function of M. To do this, we note that any (M − 1)-dimensional simplex is in fact part of the border of an M-dimensional simplex.

Figure 5: Even allocations correspond to critical points of utility lying at the center of M-simplices. (a) In one dimension, there is only one point that complies with the constraint. (b)
For M = 2 dimensions, the constraints define a line segment or 1-simplex. The circle depicts the symmetric critical point t^e_2. (c) For M = 3, the constraints form a triangle or 2-simplex. The black triangle is the symmetric critical point t^e_3. The colors at the extremes reflect the minimum and maximum utility reached in this simplex, which was computed with Monte Carlo simulations of Eq. (4) for the Gaussian prior with T = 0., σ = 1, σ₀ = 1, µ₀ = 0. (d) Expected utility computed along directions that go orthogonally from t^e_M to t^e_{M+1} (as illustrated with orange arrows in panel c, same parameters). The red dot shows the maximum occurring at t^e_5. (e) Using the stochastic projected gradient ascent detailed in Sec. 4.7, we initialized the algorithm at random points (ten shown here) in a high-dimensional simplex and measured the coefficient of variation (CV) of the allocation vector at every step of the algorithm until convergence, for various values of capacity. Zero CV implies even allocation.

For example, for M = 2, the constraints describe a line segment, or 1-simplex, where we have the symmetric critical point t^e_2 = (T/2, T/2) (Fig. 5b, black circle). However, the line t_1 + t_2 = T is one of the 3 edges of the triangle, or 2-simplex (Fig. 5c: pink lines are the edges of the triangle), where in fact we have another symmetric critical point in its interior (black triangle). With this, we can 'visualize' the infinite-dimensional nature of this problem, since all critical points of the utility lie at the edges of a higher dimensional simplex.

To assess the landscape of expected utility in high-dimensional simplices, we can evaluate it at all symmetric critical points t^e_M and along directions that go orthogonally between them (Fig. 5c, orange arrows). Thus, we devised a one-dimensional path that allows us to continuously connect all symmetric critical points, and applied it in the small capacity limit C ≪ 1. As we move from the 1-simplex to higher dimensional simplices (as in Fig. 5c), we find that utility first increases, reaching a maximum at the even allocation in M = 5 dimensions, and then decreases (Fig. 5d). Therefore, the critical points t^e_2, t^e_3 and t^e_4 are 'saddle'-like points, as they are maxima in the interior of their corresponding simplex and minima as one moves to the interior of the higher dimensional simplex.

Although the above analysis suggests that the optimum lies at an even allocation point, it is still unclear whether there are other critical points that are asymmetrical and have a larger utility. To argue that the presence of non-symmetrical local optima is unlikely, we used a stochastic gradient projection method [34] that maximizes expected utility subject to the constraints, and applied it to the Gaussian prior case (see Sec. 4.7 in Methods for details). Indeed, we find for various capacities that, regardless of the initial condition, i.e. random initial allocations, maximum utility is attained when time is evenly divided (Fig. 5e), and the global maxima coincide with the ones found in the previous sections.

Discussion

We have studied a model of multi-alternative decision making where an agent can allocate finite sampling resources to options and choose the best one amongst them. We found that the capacity of the agent depends on both the amount of sampling resources, i.e. time or precision, as well as on the discriminability of the options in the environment. As a function of capacity, optimal policies undergo an abrupt transition: at small capacity, allocating time to a handful of options is optimal; at large capacity, the number of options grows sub-linearly, well below the actual sampling capacity of the agent.
Our results show that decision bottlenecks, such as option-narrowing, can arise from optimal policies in the face of uncertainty, and provide so far untested predictions on choice behaviors in multi-alternative decision making as a function of capacity.

Seemingly strict limits pervade cognition, from the so-called attentional bottleneck [35–37], through working memory [13, 14, 38–40], to executive control [41–43]. These limits might result from using scarce neuronal resources or from using them inefficiently. However, a likely alternative is that bottlenecks reflect strategies that make optimal use of limited but large resources. Indeed, past work has recognized that some apparent limits, most notably dual-tasking bottlenecks [44, 45], could be the result of optimal allocation of finite resources to avoid overlap and interference between the different representations needed to solve the two tasks [45–47]. Further, it has been recognized that the narrow focus of attention could be at the heart of the solution to the binding problem, by integrating separate features into a coherent object [48], and thus its narrowness might reflect a function more than a limitation. Our work follows this line of argument and provides for the first time a quantitative account of why it is optimal for an agent to consider a handful of options in the face of uncertainty, well above two but well below 10. In addition, our results shed light on why people might ignore hundreds of accessible options and focus resources on a very small number of options [8–10]. Thus, some of the seemingly strict limits in decision making can be the result of optimal policies that favour depth over breadth in the processing of options.

It has long been recognized that people often consider a small set of options while ignoring many others [4, 8, 11, 12].
In the 'consumer' literature this is explained by arguing that small consideration sets are favored because they optimally balance the probability of finding a good option in the set with the search and mental costs incurred in adding new options to that set. These models thus assume that resources are not limited, but are costly. In contrast, the assumptions in our work do not explicitly tune the cost of sampling; rather, an implicit cost arises naturally from the strict capacity constraint, which depends intrinsically on the agent as well as extrinsically on the environment. A more fundamental distinction is that previous work did not focus on allocating resources intensively into the options, such that the only decision was whether to include an option in the set or not, without considering the amount of resources allocated to it. This distinction makes that problem drastically different from the tradeoffs of the BD dilemma considered here. This can explain why transitions of optimal policies as a function of the agent's parameters have not been reported before.

Previous work has characterized optimal BD tradeoffs in multi-alternative choices like the ones studied here, but by assuming that agents have a finite 'discrete' capacity [3]. Our assumption of a continuous resource (e.g. time) that can be infinitely divided has allowed us to uncover qualitatively novel optimal policies at small capacity. This is because a discrete small capacity can never produce numbers of sampled options above that capacity. Our modeling assumptions are also different and more in line with current theories of decision making based on accumulators of evidence [25, 27, 33, 49] that trade accuracy against time. Here we have not considered, however, a sequential process where accumulation of evidence can be stopped at any time, a topic that should be addressed in the future.
In any event, any agent with finite capacity cannot avoid the problem of first deciding how many options to allocate capacity to, as dividing resources into too-small portions is clearly suboptimal, and thus BD tradeoffs as described above will generally be at play. Secondly, a sequential decision process can ensue after the first decision of how many options to sample, while we have assumed that allocated times cannot be reallocated on the fly. Although this too would be a very relevant extension of our work, it is important to note that in many decisions it is actually hard, if not impossible, to reallocate already assigned resources, be they neuronal, temporal or economic, and thus our framework applies most closely to those circumstances.

Bounded rationality accounts [20–23] propose that cognition results from the finite limits of the nervous system from which it emerges. Our work follows this line of research in two ways. First, we propose that agents indeed have a finite sampling capacity that can be arbitrarily allocated to the available options. However, an important assumption in our work is that while the intrinsic resources of an agent might seem large, the interaction of the agent with the environment might render their effective decision-making capacity small. Therefore, capacity is not an absolute quantity that describes an agent, but a relative quantity that contextualizes the agent and characterizes how well they are suited to solve a given task in the world. An important contribution of our work is to show that optimal policies depend on effective capacity in a highly non-linear way, such that small-capacity agents would behave qualitatively differently from large-capacity agents (and even the behavior of the same agent operating in different environments could be qualitatively different). This is clearly a prediction that can be tested with humans where time or other resources are constrained and varied on a trial-by-trial basis.
Secondly, agents perform the allocation before feedback is received, which relates to a bounded-optimal agent that is optimized at ‘design’ time; this eliminates the paradox of perfect rationality by not letting the agent optimize their decisions at run-time [50], an argument that further supports the validity and relevance of two-stage decisions.

Another important result of our work is that evenly dividing time among a small set of options is optimal when they are initially indistinguishable. This optimal division of resources coincides with the 1/N heuristic rule [51] or equality heuristic [52], which has proven to be implemented in human decision making and highly efficient as a portfolio strategy [53]. In our case, the fact that options are drawn from the same prior (known to the agent) contributes to the optimality of the even allocation. Although the optimal allocation over non-identically distributed options is not addressed here, this heuristic can be efficient in such situations [54]. It is important to realize that the optimally low numbers of considered options have been found in the case where their values are not known in advance and come from the same distribution. If agents have strong preferences or additional information about the expected values of the options, then the number of considered alternatives will be further reduced. This shows once again that a low number of considered options can hardly be taken as evidence of a decisional bottleneck and is more in line with an optimal tradeoff between breadth and depth.

Finally, our results can have important implications for the optimal wiring of neural networks in the brain [16–19]. First, as just a few options should be considered at the same time, it is expected that only those would be encoded in different, albeit possibly overlapping, pools of neurons.
Thus, although models consisting of two or three pools that compete for dominance through mutual inhibition can be a sensible idea for binary and ternary decision making [25, 55–60], extrapolating this to many more options (e.g., more than 10) by splitting neurons into corresponding pools would hardly be optimal. Our results are, in contrast, consistent with the opposite view, which posits that a single pool of neurons is sufficient for decision making [61]. In this framework, a single pool encodes just one of the available options, the one that is under the focus of attention. Previously attended options produce a background activity against which the current option is compared, and other options fall outside the representation of the neural network [61–65]. Thus, comparison and selection between options occur through a temporal contrast, rather than through mutual inhibition between simultaneously encoded options. This model can be readily extrapolated to many options, with the only dilemma being how to divide time or precision among few or many options (as in Fig. 1), thus addressing the associated BD tradeoffs. The debate between the one-pool and several-pools models remains open [61, 66], but electrophysiology experiments with many options should be able to arbitrate between the two hypotheses under the new computational constraints that we have identified here.

Acknowledgments

This work is supported by the Howard Hughes Medical Institute (HHMI, ref 55008742), MINECO (Spain; BFU2017-85936-P) and ICREA Academia (2016) to R.M.-B., and MINECO/ESF (Spain; PRE2018-084757) to J.R.-R. J.R.-R. would like to thank Fred Callaway for helpful suggestions at early stages of this work.
Code availability
All the numerical work performed to generate the various figures is available as documented Julia code, along with a guided notebook, at this public GitHub repository.
References

[1] Dwight P Miller. The depth/breadth tradeoff in hierarchical computer menus. In Proceedings of the Human Factors Society Annual Meeting, volume 25, pages 296–300. SAGE Publications, Los Angeles, CA, 1981.
[2] Ellis Horowitz and Sartaj Sahni. Fundamentals of Computer Algorithms. Computer Science Press, 1978.
[3] Rubén Moreno-Bote, Jorge Ramírez-Ruiz, Jan Drugowitsch, and Benjamin Y Hayden. Heuristics and optimal solutions to the breadth–depth dilemma. Proceedings of the National Academy of Sciences, 117(33):19799–19808, 2020.
[4] John W Payne. Task complexity and contingent processing in decision making: An information search and protocol analysis. Organizational Behavior and Human Performance, 16(2):366–387, 1976.
[5] Richard W Olshavsky. Task complexity and contingent processing in decision making: A replication and extension. Organizational Behavior and Human Performance, 24(3):300–316, 1979.
[6] Lee Roy Beach. Broadening the definition of decision making: The role of prechoice screening of options. Psychological Science, 4(4):215–220, 1993.
[7] Irwin P Levin, JD Jasper, and Wendy S Forbes. Choosing versus rejecting options at different stages of decision making. Journal of Behavioral Decision Making, 11(3):193–210, 1998.
[8] John R Hauser and Birger Wernerfelt. An evaluation cost model of consideration sets. Journal of Consumer Research, 16(4):393–408, 1990.
[9] Sheena S Iyengar and Mark R Lepper. When choice is demotivating: Can one desire too much of a good thing? Journal of Personality and Social Psychology, 79(6):995, 2000.
[10] Benjamin Scheibehenne, Rainer Greifeneder, and Peter M Todd. Can there ever be too many options? A meta-analytic review of choice overload. Journal of Consumer Research, 37(3):409–425, 2010.
[11] Nitin Mehta, Surendra Rajiv, and Kannan Srinivasan. Price uncertainty and consumer search: A structural model of consideration set formation. Marketing Science, 22(1):58–84, 2003.
[12] George J Stigler. The economics of information. Journal of Political Economy, 69(3):213–225, 1961.
[13] George A Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2):81, 1956.
[14] Nelson Cowan, Emily M Elliott, J Scott Saults, Candice C Morey, Sam Mattox, Anna Hismjatullina, and Andrew RA Conway. On the capacity of attention: Its estimation and its role in working memory and cognitive aptitudes. Cognitive Psychology, 51(1):42–100, 2005.
[15] Timothy F Brady, Viola S Störmer, and George A Alvarez. Working memory is not fixed-capacity: More active storage capacity for real-world objects than for simple stimuli. Proceedings of the National Academy of Sciences, 113(27):7459–7464, 2016.
[16] Matthew FS Rushworth, MaryAnn P Noonan, Erie D Boorman, Mark E Walton, and Timothy E Behrens. Frontal cortex and reward-guided learning and decision-making. Neuron, 70(6):1054–1069, 2011.
[17] Markus Siegel, Timothy J Buschman, and Earl K Miller. Cortical information flow during flexible sensorimotor decisions. Science, 348(6241):1352–1355, 2015.
[18] Timothy J Vickery, Marvin M Chun, and Daeyeol Lee. Ubiquity and specificity of reinforcement signals throughout the human brain. Neuron, 72(1):166–177, 2011.
[19] Seng Bum Michael Yoo and Benjamin Yost Hayden. Economic choice as an untangling of options into actions. Neuron, 99(3):434–447, 2018.
[20] Herbert A Simon. Theories of bounded rationality. Decision and Organization, 1(1):161–176, 1972.
[21] Stuart Russell and Eric Wefald. Principles of metareasoning. Artificial Intelligence, 49(1-3):361–395, 1991.
[22] Samuel J Gershman, Eric J Horvitz, and Joshua B Tenenbaum. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science, 349(6245):273–278, 2015.
[23] Thomas L Griffiths, Falk Lieder, and Noah D Goodman. Rational use of cognitive resources: Levels of analysis between the computational and the algorithmic. Topics in Cognitive Science, 7(2):217–229, 2015.
[24] Roger Ratcliff and Bennet B Murdock. Retrieval processes in recognition memory. Psychological Review, 83(3):190, 1976.
[25] Joshua I Gold and Michael N Shadlen. The neural basis of decision making. Annual Review of Neuroscience, 30, 2007.
[26] Ian Krajbich and Antonio Rangel. Multialternative drift-diffusion model predicts the relationship between visual fixations and choice in value-based decisions. Proceedings of the National Academy of Sciences, 108(33):13852–13857, 2011.
[27] Jan Drugowitsch, Rubén Moreno-Bote, Anne K Churchland, Michael N Shadlen, and Alexandre Pouget. The cost of accumulating evidence in perceptual decision making. Journal of Neuroscience, 32(11):3612–3628, 2012.
[28] Satohiro Tajima, Jan Drugowitsch, Nisheet Patel, and Alexandre Pouget. Optimal policy for multi-alternative decisions. Nature Neuroscience, 22(9):1503–1511, 2019.
[29] Anthony I Jang, Ravi Sharma, and Jan Drugowitsch. Optimal policy for attention-modulated decisions explains human fixation behavior. bioRxiv, 2020.
[30] Frederick Callaway, Antonio Rangel, and Thomas L Griffiths. Fixation patterns in simple choice are consistent with optimal use of cognitive resources. PsyArXiv preprint, https://doi.org/10.31234/osf.io/57v6k, 2020.
[31] Allan D Shocker, Moshe Ben-Akiva, Bruno Boccara, and Prakash Nedungadi. Consideration set influences on consumer decision-making and choice: Issues, models, and suggestions. Marketing Letters, 2(3):181–197, 1991.
[32] John H Roberts and James M Lattin. Development and testing of a model of consideration set composition. Journal of Marketing Research, 28(4):429–440, 1991.
[33] Rubén Moreno-Bote. Decision confidence and uncertainty in diffusion models with partially correlated neuronal integrators. Neural Computation, 22(7):1786–1811, 2010.
[34] Roger Fletcher. Practical Methods of Optimization. John Wiley & Sons, 2013.
[35] J Anthony Deutsch and Diana Deutsch. Attention: Some theoretical considerations. Psychological Review, 70(1):80, 1963.
[36] Anne M Treisman. Strategies and models of selective attention. Psychological Review, 76(3):282, 1969.
[37] Steven Yantis and James C Johnston. On the locus of visual selection: Evidence from focused attention tasks. Journal of Experimental Psychology: Human Perception and Performance, 16(1):135, 1990.
[38] Steven J Luck and Edward K Vogel. Visual working memory capacity: from psychophysics and neurobiology to individual differences. Trends in Cognitive Sciences, 17(8):391–400, 2013.
[39] Wei Ji Ma, Masud Husain, and Paul M Bays. Changing concepts of working memory. Nature Neuroscience, 17(3):347, 2014.
[40] Timothy F Brady, Talia Konkle, and George A Alvarez. A review of visual memory capacity: Beyond individual items and toward structured representations. Journal of Vision, 11(5):4–4, 2011.
[41] Amitai Shenhav, Sebastian Musslick, Falk Lieder, Wouter Kool, Thomas L Griffiths, Jonathan D Cohen, and Matthew M Botvinick. Toward a rational and mechanistic account of mental effort. Annual Review of Neuroscience, 40:99–124, 2017.
[42] DA Norman and T Shallice. Attention to action. In RJ Davidson, GE Schwartz, and D Shapiro, editors, Consciousness and Self-Regulation, pages 1–18, 1986.
[43] Brianna J Sleezer, Meghan D Castagno, and Benjamin Y Hayden. Rule encoding in orbitofrontal cortex and striatum guides selection. Journal of Neuroscience, 36(44):11223–11237, 2016.
[44] Rico Fischer and Franziska Plessow. Efficient multitasking: parallel versus serial processing of multiple tasks. Frontiers in Psychology, 6:1366, 2015.
[45] David E Meyer and David E Kieras. A computational theory of executive cognitive processes and multiple-task performance: Part I. Basic mechanisms. Psychological Review, 104(1):3, 1997.
[46] Samuel F Feng, Michael Schwemmer, Samuel J Gershman, and Jonathan D Cohen. Multitasking versus multiplexing: Toward a normative account of limitations in the simultaneous execution of control-demanding behaviors. Cognitive, Affective, & Behavioral Neuroscience, 14(1):129–146, 2014.
[47] Ariel Zylberberg, Stanislas Dehaene, Pieter R Roelfsema, and Mariano Sigman. The human Turing machine: a neural framework for mental programs. Trends in Cognitive Sciences, 15(7):293–300, 2011.
[48] Anne Treisman. Feature binding, attention and object perception. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 353(1373):1295–1306, 1998.
[49] Roger Ratcliff and Philip L Smith. A comparison of sequential sampling models for two-choice reaction time. Psychological Review, 111(2):333, 2004.
[50] Stuart J Russell and Devika Subramanian. Provably bounded-optimal agents. Journal of Artificial Intelligence Research, 2:575–609, 1994.
[51] Gerd Gigerenzer and Wolfgang Gaissmaier. Heuristic decision making. Annual Review of Psychology, 62:451–482, 2011.
[52] David M Messick. Equality as a decision heuristic. In Psychological Perspectives on Justice: Theory and Applications, pages 11–31, 1993.
[53] Victor DeMiguel, Lorenzo Garlappi, and Raman Uppal. Optimal versus naive diversification: How inefficient is the 1/N portfolio strategy? The Review of Financial Studies, 22(5):1915–1953, 2009.
[54] Warren Thorngate. Efficient decision heuristics. Behavioral Science, 25(3):219–225, 1980.
[55] Paul Cisek and John F Kalaska. Neural mechanisms for interacting with a world full of action choices. Annual Review of Neuroscience, 33:269–298, 2010.
[56] Robert M Roe, Jerome R Busemeyer, and James T Townsend. Multialternative decision field theory: A dynamic connectionist model of decision making. Psychological Review, 108(2):370, 2001.
[57] Marius Usher and James L McClelland. The time course of perceptual choice: the leaky, competing accumulator model. Psychological Review, 108(3):550, 2001.
[58] Rubén Moreno-Bote, John Rinzel, and Nava Rubin. Noise-induced alternations in an attractor network model of perceptual bistability. Journal of Neurophysiology, 98(3):1125–1139, 2007.
[59] Anne K Churchland, Roozbeh Kiani, and Michael N Shadlen. Decision-making with multiple alternatives. Nature Neuroscience, 11(6):693–702, 2008.
[60] Xiao-Jing Wang. Decision making in recurrent neuronal circuits. Neuron, 60(2):215–234, 2008.
[61] Benjamin Y Hayden and Rubén Moreno-Bote. A neuronal theory of sequential economic choice. Brain and Neuroscience Advances, 2:2398212818766675, 2018.
[62] Ian Krajbich, Carrie Armel, and Antonio Rangel. Visual fixations and the computation and comparison of value in simple choice. Nature Neuroscience, 13(10):1292–1298, 2010.
[63] Seung-Lark Lim, John P O'Doherty, and Antonio Rangel. The decision value computations in the vmPFC and striatum use a relative value code that is guided by visual attention. Journal of Neuroscience, 31(37):13214–13223, 2011.
[64] A David Redish. Vicarious trial and error. Nature Reviews Neuroscience, 17(3):147, 2016.
[65] Erin L Rich and Jonathan D Wallis. Decoding subjective decisions from orbitofrontal cortex. Nature Neuroscience, 19(7):973–980, 2016.
[66] Sébastien Ballesta and Camillo Padoa-Schioppa. Economic decisions through circuit inhibition. Current Biology, 29(22):3814–3824, 2019.
[67] Laurens De Haan and Ana Ferreira. Extreme Value Theory: An Introduction. Springer Science & Business Media, 2007.
[68] Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
Methods
Comments and mathematical proofs supporting claims in the manuscript.
4.1 The posterior mean of the drift is monotonic in the evidence

Here we prove that the posterior mean of the drift, which is a random variable with probability distribution defined by Bayes' rule, Eq. (3), is a monotonically increasing function of the evidence $x$. We have seen that the expected value of a drift $\mu$ given the accumulated evidence $x$, for any option (and thus here dropping indices), is given by
\[
\hat{\mu}(x,t,\sigma,\theta) = \frac{\int d\mu\, \mu\, \mathcal{N}(x \mid \mu t, \sigma^2 t)\, p_\theta(\mu)}{p(x \mid t,\sigma,\theta)}, \tag{15}
\]
where $p_\theta(\mu)$ is the prior probability of the drifts, with hyperparameters $\theta$, and $p(x \mid t,\sigma,\theta)$ is the marginalized probability distribution of the evidence. To know whether $\hat{\mu}(x,t,\sigma,\theta)$ is an increasing function of $x$, we simply differentiate and check that the derivative is always positive. Using $\frac{d}{dx}\mathcal{N}(x \mid \mu t,\sigma^2 t) = -\frac{x-\mu t}{\sigma^2 t}\,\mathcal{N}(x \mid \mu t,\sigma^2 t)$,
\[
\frac{d\hat{\mu}}{dx}
= \frac{\int d\mu\, \mu \left(-\frac{x-\mu t}{\sigma^2 t}\right) \mathcal{N}(x \mid \mu t,\sigma^2 t)\, p_\theta(\mu)}{p(x \mid t,\sigma,\theta)}
- \frac{\int d\mu\, \mu\, \mathcal{N}(x \mid \mu t,\sigma^2 t)\, p_\theta(\mu) \int d\mu \left(-\frac{x-\mu t}{\sigma^2 t}\right) \mathcal{N}(x \mid \mu t,\sigma^2 t)\, p_\theta(\mu)}{p(x \mid t,\sigma,\theta)^2}
= \frac{1}{\sigma^2}\left(\mathbb{E}[\mu^2 \mid x,t,\sigma,\theta] - \mathbb{E}[\mu \mid x,t,\sigma,\theta]^2\right)
= \frac{\operatorname{Var}[p(\mu \mid x,t,\sigma,\theta)]}{\sigma^2} > 0,
\]
where
\[
\mathbb{E}[\mu^n \mid x,t,\sigma,\theta] = \frac{\int d\mu\, \mu^n\, \mathcal{N}(x \mid \mu t,\sigma^2 t)\, p_\theta(\mu)}{\int d\mu\, \mathcal{N}(x \mid \mu t,\sigma^2 t)\, p_\theta(\mu)}.
\]
Since the variance is the expected value of a positive quantity, we conclude that the expected value of the drift is a monotonically increasing function of the observed accumulated evidence $x$ for any prior.

4.2 Small-capacity limit of the utility

Here we show that in the small-capacity limit, the utility in Eq. (9) can be written as in Eq. (11) for any regular prior distribution. Our strategy is to study the limiting behaviors of the cumulative density function (described below in Sec. 4.3) and of the posterior mean of the drift (detailed in this section) that appear in Eq. (9) as $C = \sigma_0^2 T/\sigma^2$ goes to zero. From Bayes' rule, Eq. (3), the posterior mean of the drift is given by
\[
\hat{\mu}(x,t,\sigma,\theta) = \frac{1}{\sqrt{2\pi\sigma^2 t}}\, \frac{\int d\mu\, \mu \exp\left(-\frac{1}{2\sigma^2 t}(\mu t - x)^2\right) p_\theta(\mu)}{p(x \mid t,\sigma,\theta)}. \tag{16}
\]
Let us focus on the numerator, which we will interpret as the expectation value of $\mu \exp\left(-\frac{1}{2\sigma^2 t}(\mu t - x)^2\right)$ with respect to the prior. We assume the prior to be such that this expectation is finite for all $x$ and that all its moments are finite (e.g., Gaussian and uniform distributions). We define $z \equiv z(x) \equiv (x - \bar{\mu} t)/\sqrt{\sigma^2 t}$ and $\mu_s \equiv (\mu t - \bar{\mu} t)/\sqrt{\sigma^2 t}$, where $\bar{\mu}$ is the prior mean, and by adding and subtracting $\bar{\mu} t$ in the exponent we can write the numerator in the above equation as
\[
\mathbb{E}_\theta\!\left[\mu \exp\left(-\tfrac{1}{2\sigma^2 t}(\mu t - x)^2\right)\right]
= \exp\left(-\tfrac{z^2}{2}\right) \mathbb{E}_\theta\!\left[\mu \exp\left(z\mu_s - \tfrac{\mu_s^2}{2}\right)\right]
= \exp\left(-\tfrac{z^2}{2}\right)\left\{ \bar{\mu}\, \mathbb{E}_\theta\!\left[\exp\left(z\mu_s - \tfrac{\mu_s^2}{2}\right)\right] + \sqrt{\tfrac{\sigma^2}{t}}\, \mathbb{E}_\theta\!\left[\mu_s \exp\left(z\mu_s - \tfrac{\mu_s^2}{2}\right)\right]\right\},
\]
where we have used $\mu = \bar{\mu} + \sqrt{\sigma^2/t}\, \mu_s$. Next, we note that the exponential in the expectations is the generating function of the (probabilists') Hermite polynomials, and thus
\[
\exp\left(z\mu_s - \frac{\mu_s^2}{2}\right) = \sum_{n=0}^{\infty} \mathrm{He}_n(z)\, \frac{\mu_s^n}{n!}.
\]
By replacing the exponential with the infinite series in the above expectation, Eq. (16), we obtain
\[
p(x \mid t,\sigma,\theta)\, \hat{\mu}(x,t,\sigma,\theta)
= \frac{\mathcal{N}(z \mid 0,1)}{\sqrt{\sigma^2 t}} \left\{ \sum_{n=0}^{\infty} \frac{\mathbb{E}_\theta[\mu_s^n]}{n!}\, \bar{\mu}\, \mathrm{He}_n(z) + \sum_{n=0}^{\infty} \frac{\mathbb{E}_\theta[\mu_s^{n+1}]}{n!} \sqrt{\frac{\sigma^2}{t}}\, \mathrm{He}_n(z) \right\}
= \frac{\mathcal{N}(z \mid 0,1)}{\sqrt{\sigma^2 t}} \sum_{n=0}^{\infty} \frac{1}{n!}\, \mathbb{E}_\theta\!\left[\frac{(\mu-\bar{\mu})^n}{(\sigma/\sqrt{t})^n}\right] \left( \bar{\mu}\, \mathrm{He}_n(z) + \sqrt{\frac{\sigma^2}{t}}\, n\, \mathrm{He}_{n-1}(z) \right)
= \frac{\mathcal{N}(z \mid 0,1)}{\sqrt{\sigma^2 t}} \sum_{n=0}^{\infty} \frac{1}{n!} \left(\frac{C}{M}\right)^{\!\frac{n-1}{2}} \mathbb{E}_\theta\!\left[\frac{(\mu-\bar{\mu})^n}{\sigma_0^n}\right] \left( \sqrt{\frac{C}{M}}\, \bar{\mu}\, \mathrm{He}_n(z) + \sigma_0\, n\, \mathrm{He}_{n-1}(z) \right),
\]
where we have used that all the moments of the prior are finite, so that the sum is well defined, and that under even allocation, $t = T/M$, one has $\sigma_0^2 t/\sigma^2 = C/M$. Note that to obtain the second equality we have shifted the index $n+1 \to n$ in the second sum and used that the term $n\, \mathrm{He}_{n-1}(z)$ is zero for $n = 0$.

We now insert the above series into the expression for the utility in Eq. (9) to obtain
\[
\hat{U}(t,\sigma,\theta) = \int dx\, \frac{d}{dx}\left\{[F_x(x \mid t,\sigma,\theta)]^M\right\} \hat{\mu}(x,t,\sigma,\theta)
= \int dx\, M\, [F_x(x \mid t,\sigma,\theta)]^{M-1}\, p(x \mid t,\sigma,\theta)\, \hat{\mu}(x,t,\sigma,\theta)
= \int dz\, M\, [F_z(z \mid t,\sigma,\theta)]^{M-1}\, \mathcal{N}(z \mid 0,1) \left[ \bar{\mu} + \sqrt{\frac{C}{M}}\, \sigma_0\, z \right] + O(C),
\]
where in the second line it is implicit that $z$ depends on $x$, and in the last line we have made a linear transformation of variables from $x$ to $z = z(x)$. We also note that, as the integral in the last line only involves polynomials in $z$ weighted by the standard normal (and by a cumulative, which is bounded to lie in $[0,1)$), the integrals are finite, and thus we can truncate the series at the first leading order, which is of order $\sqrt{C}$. It remains to check whether the corrections to the cumulative density function $F_z(z \mid t,\sigma,\theta)$ are of order $\sqrt{C}$ or larger, and we show below in Sec. 4.3 that they are in fact of order $C$, such that
\[
F_z(z \mid t,\sigma,\theta) = \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{z}{\sqrt{2}}\right)\right] + O(C).
\]
With all this, we can approximate the utility up to order $\sqrt{C}$ as
\[
\hat{U}(C,M,\bar{\mu}) = \bar{\mu} + \frac{\sigma_0 \sqrt{C}}{\sqrt{M}} \int_{-\infty}^{\infty} dz\, M \left[\frac{1}{2}\left(1 + \mathrm{erf}\left(\frac{z}{\sqrt{2}}\right)\right)\right]^{M-1} \mathcal{N}(z \mid 0,1)\, z + O(C), \tag{17}
\]
which is identical to Eq. (11).

4.3 Small-capacity limit of the evidence distribution

Here, we find an approximation to the marginalized probability distribution of the evidence at small capacity. From Bayes' rule and the law of the unconscious statistician,
\[
p(x \mid t,\sigma,\theta) = \int d\mu\, \mathcal{N}(x \mid \mu t, \sigma^2 t)\, p_\theta(\mu) = \mathbb{E}_\theta\left[\mathcal{N}(x \mid \mu t, \sigma^2 t)\right].
\]
To compute this expectation, we follow the same procedure as in Sec. 4.2.
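As a quick numerical sanity check of the Hermite generating-function identity that drives the expansions above, the truncated series can be compared to the closed-form exponential; this is a minimal sketch, with arbitrary test points and truncation order:

```python
import math

def hermite_he(n_max, z):
    """Probabilists' Hermite polynomials He_0..He_{n_max} at z, via the
    recurrence He_{n+1}(z) = z*He_n(z) - n*He_{n-1}(z)."""
    he = [1.0, z]
    for n in range(1, n_max):
        he.append(z * he[n] - n * he[n - 1])
    return he[: n_max + 1]

def generating_series(z, u, n_terms=40):
    """Truncated series sum_n He_n(z) u^n / n!, which should approach
    exp(z*u - u**2/2) as n_terms grows."""
    he = hermite_he(n_terms, z)
    return sum(he[n] * u**n / math.factorial(n) for n in range(n_terms + 1))

z, u = 1.3, 0.7
print(abs(generating_series(z, u) - math.exp(z * u - u * u / 2)))  # negligibly small
```

The fast decay of the series coefficients is what justifies truncating the utility expansion at the leading orders in the small-capacity limit.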
We define $z \equiv (x - \bar{\mu} t)/\sqrt{\sigma^2 t}$ and $\mu_s \equiv (\mu t - \bar{\mu} t)/\sqrt{\sigma^2 t}$ as before, and add and subtract $\bar{\mu} t$ in the exponent, to obtain
\[
p(x \mid t,\sigma,\theta) = \frac{1}{\sqrt{2\pi\sigma^2 t}}\, \mathbb{E}_\theta\!\left[\exp\left(-\frac{1}{2\sigma^2 t}(x - \mu t)^2\right)\right]
= \frac{1}{\sqrt{2\pi\sigma^2 t}} \exp\left(-\frac{z^2}{2}\right) \mathbb{E}_\theta\!\left[\exp\left(z\mu_s - \frac{\mu_s^2}{2}\right)\right].
\]
Next, we again identify the exponential generating function of the Hermite polynomials, and thus obtain a series for the probability distribution of the evidence,
\[
p(x \mid t,\sigma,\theta) = \frac{\exp\left(-z^2/2\right)}{\sqrt{2\pi\sigma^2 t}} \sum_{n=0}^{\infty} \frac{1}{n!}\, \mathbb{E}_\theta[\mu_s^n]\, \mathrm{He}_n(z)
= \frac{\exp\left(-z^2/2\right)}{\sqrt{2\pi\sigma^2 t}} \sum_{n=0}^{\infty} \frac{1}{n!} \left(\frac{C}{M}\right)^{\!n/2} \mathbb{E}_\theta\!\left[\frac{(\mu - \bar{\mu})^n}{\sigma_0^n}\right] \mathrm{He}_n(z).
\]
We see that to leading order the distribution of the evidence is a normal distribution, while the term of order $\sqrt{C}$ vanishes because $\mathbb{E}_\theta[\mu - \bar{\mu}] = 0$. Therefore, its cumulative in the variable $z = z(x)$ is, up to order $\sqrt{C}$,
\[
F_z(z \mid t,\sigma,\theta) = \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{z}{\sqrt{2}}\right)\right] + O(C).
\]
This expression has been used in Sec. 4.2.

4.4 Asymptotic limit of the relevant integral

In this subsection we obtain the asymptotic limit, $M \to \infty$, of the integral
\[
I(M) = \int_{-\infty}^{\infty} dy\, y\, \frac{d}{dy} \Phi^M(y),
\]
appearing in Eqs. (11) and (12), where $\Phi(y)$ is the standard normal cumulative distribution function, $\Phi(y) = \frac{1}{2}\left(1 + \mathrm{erf}\left(y/\sqrt{2}\right)\right)$. Using Extreme Value Theory [67], it can be shown that this cumulative distribution function belongs to the Gumbel class of the generalized extreme value distributions,
\[
\lim_{M \to \infty} \Phi^M(a_M y + b_M) = G(y),
\]
where $G(y) = \exp(-\exp(-y))$, $b_M = \sqrt{2\log M - \log\log M - \log 4\pi}$ and $a_M = 1/b_M$. Using this result, the integral develops quite easily,
\[
I(M) \to \int_{-\infty}^{\infty} dy\, y\, \frac{d}{dy}\, G\!\left(\frac{y - b_M}{a_M}\right)
= \int_{-\infty}^{\infty} dy\, (a_M y + b_M)\, \frac{d}{dy} G(y)
= a_M \int_{-\infty}^{\infty} dy\, y\, e^{-y} e^{-e^{-y}} + b_M \int_{-\infty}^{\infty} dy\, \frac{d}{dy} G(y),
\]
so that
\[
I(M \to \infty) = \frac{\gamma}{b_M} + b_M,
\]
where $\gamma \approx 0.577$ is Euler's constant.

4.5 Uniform prior

For this choice of prior, drifts are all drawn independently and identically from a uniform probability distribution between zero and one, that is, $p(\mu_i) = \Theta(\mu_i)\,\Theta(1-\mu_i)$, where $\Theta(x)$ is the Heaviside step function. We can substitute this prior into Eq. (3) to obtain the posterior probability distribution for the drifts,
\[
p(\mu_i \mid x_i, \sigma, t_i, \theta) =
\begin{cases}
\dfrac{\mathcal{N}\left(\mu_i \mid \frac{x_i}{t_i}, \frac{\sigma^2}{t_i}\right)}{\int_0^1 \mathcal{N}\left(\mu_i' \mid \frac{x_i}{t_i}, \frac{\sigma^2}{t_i}\right) d\mu_i'} & \mu_i \in [0,1], \\[2ex]
0 & \text{otherwise}.
\end{cases}
\]
This produces the expectation value for each drift (that of a truncated normal),
\[
\hat{\mu}_i(x_i, t_i, \sigma) \equiv \mathbb{E}[\mu_i \mid x_i, t_i, \sigma]
= \frac{x_i}{t_i} + \frac{\sigma}{\sqrt{2\pi t_i}}\; \frac{\exp\left(-\frac{x_i^2}{2\sigma^2 t_i}\right) - \exp\left(-\frac{(x_i - t_i)^2}{2\sigma^2 t_i}\right)}{\frac{1}{2}\left[\mathrm{erf}\left(\frac{x_i}{\sqrt{2\sigma^2 t_i}}\right) - \mathrm{erf}\left(\frac{x_i - t_i}{\sqrt{2\sigma^2 t_i}}\right)\right]}, \tag{18}
\]
where the denominator is related to the probability distribution of the evidence $x_i$, which we can find by marginalizing over drifts,
\[
p(x_i \mid t_i, \sigma) = \int_0^1 \frac{d\mu_i}{\sqrt{2\pi\sigma^2 t_i}} \exp\left(-\frac{1}{2\sigma^2 t_i}(x_i - \mu_i t_i)^2\right)
= \frac{1}{2 t_i}\left[\mathrm{erf}\left(\frac{x_i}{\sqrt{2\sigma^2 t_i}}\right) - \mathrm{erf}\left(\frac{x_i - t_i}{\sqrt{2\sigma^2 t_i}}\right)\right]. \tag{19}
\]
We will use from now on the assumption of even time allocation, $t_i = t = T/M$ for all $i$. The cumulative probability distribution for the evidence in Eq. (19) is, integrating by parts,
\[
F(x \mid t,\sigma) = \int_{-\infty}^{x} p(x' \mid t,\sigma)\, dx'
= \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{x - t}{\sqrt{2\sigma^2 t}}\right)\right] + t\, p(x \mid t,\sigma)\, \hat{\mu}(x, t, \sigma),
\]
where in the last equality we have rewritten the solution in a convenient form. Hence, the product of the expected value with the probability density can be rewritten in terms of the cumulative function,
\[
\hat{\mu}(x,t,\sigma)\, p(x \mid t,\sigma) = \frac{1}{t}\, F(x \mid t,\sigma) - \frac{1}{2t}\left[1 + \mathrm{erf}\left(\frac{x - t}{\sqrt{2\sigma^2 t}}\right)\right],
\]
and using Eq. (9) we get the expression for the utility,
\[
\hat{U}(M, t, \sigma) = \frac{M}{t} \int_{-\infty}^{\infty} dx\, [F(x \mid t,\sigma)]^{M-1} \left\{ F(x \mid t,\sigma) - \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{x - t}{\sqrt{2\sigma^2 t}}\right)\right] \right\}. \tag{20}
\]

4.6 Bimodal Gaussian prior

The expected utility for the bimodal Gaussian prior with modes $\mu_1$ and $\mu_2$, each with variance $\sigma_0^2$, is quite similar to the unimodal case, Eq. (10), and follows from a straightforward application of Eq. (9). The probability distribution of the evidence marginalized over drifts is
\[
p(x \mid t,\sigma,\theta) = \frac{1}{2}\,\mathcal{N}(x \mid \mu_1 t, \sigma^2 t + \sigma_0^2 t^2) + \frac{1}{2}\,\mathcal{N}(x \mid \mu_2 t, \sigma^2 t + \sigma_0^2 t^2).
\]
Therefore the cumulative is
\[
F(x \mid t,\sigma,\theta) = \frac{1}{2}\,\Phi(x \mid \mu_1 t, \sigma^2 t + \sigma_0^2 t^2) + \frac{1}{2}\,\Phi(x \mid \mu_2 t, \sigma^2 t + \sigma_0^2 t^2),
\]
where $\Phi(x \mid \mu_m, \sigma_m^2)$ is the normal cumulative distribution for one mode. The expected value of the drift is a bit more involved, since the posterior distribution over drifts is a mixture of the two modes,
\[
p(\mu \mid x,t,\sigma,\theta) = \frac{1}{2\, p(x \mid t,\sigma,\theta)} \sum_{i=1,2} \mathcal{N}\!\left(x \mid \mu_i t, \sigma^2 t + \sigma_0^2 t^2\right)\, \mathcal{N}(\mu \mid \hat{\mu}_i, \sigma_t^2),
\]
where $1/\sigma_t^2 = t/\sigma^2 + 1/\sigma_0^2$ and $\hat{\mu}_i = \sigma_t^2\left(\mu_i/\sigma_0^2 + x/\sigma^2\right)$.
Consequently, the expected value is
\[
\hat{\mu}(x,t,\sigma,\theta) = \frac{1}{2\, p(x \mid t,\sigma,\theta)} \sum_{i=1,2} \hat{\mu}_i\, \mathcal{N}\!\left(x \mid \mu_i t, \sigma^2 t + \sigma_0^2 t^2\right),
\]
and the utility follows from Eq. (9),
\[
\hat{U}(M,t,\sigma,\theta) = \frac{M}{2} \int_{-\infty}^{\infty} dx\, F(x \mid t,\sigma,\theta)^{M-1} \sum_{i=1,2} \hat{\mu}_i\, \mathcal{N}\!\left(x \mid \mu_i t, \sigma^2 t + \sigma_0^2 t^2\right). \tag{21}
\]
This expression is numerically integrated and used in Fig. 4.

4.7 Optimal time allocations under the capacity constraint

To maximize utility, Eq. (14), under the time constraint, we can make use of unconstrained optimization through Lagrange multipliers. We construct the Lagrangian
\[
\mathcal{L}(\mathbf{t}, \theta, \lambda, \boldsymbol{\mu}) = \hat{U}(\mathbf{t}, \sigma, \theta) + \lambda\, h(\mathbf{t}) + \boldsymbol{\mu} \cdot \mathbf{g}(\mathbf{t}), \tag{22}
\]
where $h(\mathbf{t}) = \sum_{i=1}^{M} t_i - T$ is the equality constraint that defines the hyperplane and $g_i(\mathbf{t}) = t_i \geq 0$ are the inequality constraints forcing all times to be non-negative, thus defining the simplex. The quantities $\lambda$ and $\boldsymbol{\mu}$ are the Lagrange multipliers. In other words, maximizing utility, Eq. (14), subject to the initial constraints can be done by optimizing the Lagrangian, Eq. (22), with respect to $\mathbf{t}$, $\lambda$ and $\boldsymbol{\mu}$ subject to the Karush-Kuhn-Tucker conditions [68]
\[
g_i(\mathbf{t}) \geq 0 \quad \text{for all } i, \tag{23a}
\]
\[
\mu_j \geq 0 \quad \text{for all } j, \tag{23b}
\]
\[
\boldsymbol{\mu} \cdot \mathbf{g}(\mathbf{t}) = 0. \tag{23c}
\]
We notice that the first two conditions imply that the third can be rewritten as $\mu_i t_i = 0$ for all $i$. By optimizing the Lagrangian, Eq. (22), we obtain the following system of equations:
\[
\nabla_{\mathbf{t}}\, \hat{U}(\mathbf{t}^*, \sigma, \theta) + \lambda^* \mathbf{1} + \boldsymbol{\mu}^* = \mathbf{0}, \tag{24}
\]
where $\mathbf{t}^*, \lambda^*, \boldsymbol{\mu}^*$ denote the critical points of the Lagrangian. We note that the symmetric point in $M$ active dimensions, denoted $\mathbf{t}^e_M$, where $t^e_{M,i} = T/M$ for $i = 1, \ldots, M$, is a critical point of the Lagrangian. This is because the partial derivatives of the utility have to be equal at $\mathbf{t}^e_M$, and since this point lies in the interior of the $(M-1)$-simplex, $\mu^e_i = 0$ for $i = 1, \ldots, M$. Therefore $\mathbf{t}^e_M$ complies with Eq. (24) and is indeed a critical point.

Next, we detail the gradient ascent method used to obtain Fig. 5e. As explained above, we want to optimize utility, Eq. (14), subject to a set of equality, Eq. (1), and inequality constraints, $t_i \geq 0$, as described in the section "Even allocation is optimal" of the Results. As all our constraints are linear, we can make use of the gradient projection method [34]. In this case, we obtain the gradient of the utility in Eq. (10) and project it onto the $(M-1)$-simplex such that the capacity constraint in Eq. (1) is satisfied. Due to the linear capacity equality constraint, this projection is simply given by the linear operator
\[
\Pi = \mathrm{Id}_{M \times M} - \frac{1}{M}\, \mathbf{1}_{M \times M},
\]
where $\mathrm{Id}_{M \times M}$ is the $M \times M$ identity matrix and $\mathbf{1}_{M \times M}$ is the $M \times M$ matrix of ones. Therefore, we can maximize utility by updating $\mathbf{t}^{(k)}$ appropriately,
\[
\mathbf{t}^{(k+1)} = \mathbf{t}^{(k)} + \eta\, \Pi\left(\nabla_{\mathbf{t}}\, \hat{U}(\mathbf{t}^{(k)}, \sigma, \theta)\right), \tag{25}
\]
where the default step size $\eta$ is a small fixed fraction of $T$, $k$ is the iteration number, and $\theta$ corresponds to the parameters of the Gaussian prior. The utility for an arbitrary time allocation $\mathbf{t}$ for the Gaussian prior case is, using Eq. (14),
\[
\hat{U}(\mathbf{t}, \sigma, \theta) = \bar{\mu} + \sigma_0 \sum_{i=1}^{M} \int_{-\infty}^{\infty} dy\, y\, \frac{\exp\left(-\frac{y^2}{2\tilde{\sigma}_i^2}\right)}{\sqrt{2\pi \tilde{\sigma}_i^2}} \prod_{j \neq i} \left\{ \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{y}{\sqrt{2}\, \tilde{\sigma}_j}\right)\right] \right\}, \tag{26}
\]
where $\tilde{\sigma}_i^2 = \sigma_0^2 t_i / (\sigma_0^2 t_i + \sigma^2)$. We can therefore compute the derivative of this expression with respect to all components $t_i$ and numerically integrate the resulting expressions.

In addition to the linear capacity constraint, we have to enforce the inequality constraints, $t_i \geq 0$, which we do by using an active set of constraints. To implement it, we start in a relatively high-dimensional $(M-1)$-simplex, choosing $M$ to be $M^*$, where $M^*$ is the optimal number of accumulators to sample in the even-sampling case (which is estimated beforehand through exploration, see main text). If and whenever any of the components of $\mathbf{t}^{(k+1)}$ derived from Eq. (25) approaches a border ($t^{(k+1)}_i \approx \tau$ for some $i$ and small threshold $\tau$), the step size is decreased until that component effectively reaches zero. In such a case, this dimension is added to the active-constraint set (we inactivate the dimension), thus downgrading the simplex to a lower dimension. In this way, our algorithm only reduces the initial dimension of the simplex and never extends it. To initially activate the $M^*$ dimensions, for any random initial condition $\mathbf{t}_0$ we make sure that all the components are greater than our threshold, $t_{0,i} > \tau$ for all $i = 1, \ldots, M^*$. Finally, in order to avoid potentially getting trapped in local maxima, we add noise at every iteration as follows: at every step $k$ of Eq. (25), with small probability $\epsilon$, we push the component $t^{(k)}_i$ of a randomly chosen dimension $i$ by a small magnitude $\delta$ and pull the component $t^{(k)}_j$ of another randomly chosen dimension $j$ by the same magnitude, so that the capacity constraint remains satisfied.
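The projected-gradient update of Eq. (25) can be sketched as follows. The toy quadratic utility, step size, and iteration count here are illustrative assumptions standing in for the numerically integrated utility of Eq. (26); the active-set and noise steps described above are omitted for brevity:

```python
# Projected gradient ascent on the hyperplane sum(t_i) = T (sketch of Eq. 25).
# A toy concave utility -sum((t_i - 0.5)^2) stands in for U-hat, an assumption;
# its constrained maximum is the even point t_i = T/M + (0.5 - 0.5) shifted to 0.6.

def utility_grad(t):
    # gradient of the toy utility -sum((t_i - 0.5)^2)
    return [-2.0 * (ti - 0.5) for ti in t]

def project(g):
    """Apply Pi = Id - (1/M) * ones-matrix: subtract the mean of the
    gradient so the update stays on the hyperplane sum(t_i) = T."""
    m = sum(g) / len(g)
    return [gi - m for gi in g]

M, T, eta = 5, 3.0, 1e-2
t = [T / M + 0.001 * (i - 2) for i in range(M)]   # perturbed start; sum is still T
for _ in range(2000):
    step = project(utility_grad(t))
    t = [ti + eta * si for ti, si in zip(t, step)]

print(round(sum(t), 9))   # capacity conserved: 3.0
```

Because the projection removes the mean of the gradient, every update moves within the constraint hyperplane, and for this symmetric toy objective the iterates converge to the even allocation, mirroring the result that the symmetric point is a critical point of the Lagrangian.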