Human Inference in Changing Environments With Temporal Structure
Arthur Prat-Carrabin†, Robert C. Wilson‡, Jonathan D. Cohen, and Rava Azeredo da Silveira*

Laboratoire de Physique de l'École Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, 75005 Paris, France
IOB, Faculty of Science, University of Basel, Basel, Switzerland
Princeton Neuroscience Institute, Princeton University, Princeton, USA
Department of Neurobiology, Weizmann Institute of Science, Rehovot, Israel
† Present address: Department of Economics, Columbia University, USA
‡ Present address: Department of Psychology and Cognitive Science Program, University of Arizona, Tucson, USA
* For correspondence: [email protected]
Abstract
To make informed decisions in natural environments that change over time, humans must update their beliefs as new observations are gathered. Studies exploring human inference as a dynamical process that unfolds in time have focused on situations in which the statistics of observations are history-independent. Yet temporal structure is everywhere in nature, and yields history-dependent observations. Do humans modify their inference processes depending on the latent temporal statistics of their observations? We investigate this question experimentally and theoretically using a change-point inference task. We show that humans adapt their inference process to fine aspects of the temporal structure in the statistics of stimuli. As such, humans behave qualitatively in a Bayesian fashion, but, quantitatively, deviate away from optimality. Perhaps more importantly, humans behave suboptimally in that their responses are not deterministic, but variable. We show that this variability itself is modulated by the temporal statistics of stimuli. To elucidate the cognitive algorithm that yields this behavior, we investigate a broad array of existing and new models that characterize different sources of suboptimal deviations away from Bayesian inference. While models with 'output noise' that corrupts the response-selection process are natural candidates, human behavior is best described by sampling-based inference models, in which the main ingredient is a compressed approximation of the posterior, represented through a modest set of random samples and updated over time. This result comes to complement a growing literature on sample-based representation and learning in humans.

©2021, American Psychological Association. This paper is not the copy of record and may not exactly replicate the final, authoritative version of the article. Please do not copy or cite without authors' permission. The final article will be available, upon publication, via its DOI: 10.1037/rev0000276

In a variety of inference tasks, human subjects use sensory cues as well as prior information in a manner consistent with Bayesian models. In tasks requiring the combination of a visual cue (such as the shape, position, texture, or motion of an object) with a haptic [1, 2], auditory [3], proprioceptive [4], or a secondary visual cue [5, 6, 7], human subjects weigh information coming from each cue according to its uncertainty, in agreement with an optimal, probabilistic approach. Moreover, subjects appear also to integrate optimally prior knowledge on spatial [8, 9] and temporal [10, 11] variables relevant to inference, in line with Bayes' rule.

The Bayesian paradigm hence offers an elegant and mathematically principled account of the way in which humans carry out inference in the presence of uncertainty. In most experimental designs, however, successive trials are unrelated to each other. Yet, in many natural situations, the brain receives a stream of evidence from the environment: inference, then, unfolds in time. Moreover, natural mechanisms introduce sophisticated temporal statistics in the course of events (e.g., rhythmicity in locomotion, day-night cycles, and various structures found in speech). Are these temporal dynamics used by the brain to refine its online inference of the state of the environment?

Furthermore, most studies that support a Bayesian account of human inference discuss average behaviors of subjects, and, thereby, side-step the issue of the variability in human responses.
While an optimal Bayesian model yields a unique, deterministic action in response to a given set of observations, human subjects exhibit noisy, and thus suboptimal, responses. Methods commonly used to model response variability, such as 'softmax' and probability-matching response-selection strategies, or, more recently, stochastic inference processes, correspond to different forms of departure from Bayesian optimality. One would like to identify the nature of the deviations from Bayesian models that can account for the observed discrepancies from optimality in human behavior.

To explore these questions, we use an online inference task based on a 'change-point' paradigm, i.e., with random stimuli originating from a hidden state that is subject to abrupt, occasional variations, which are referred to as 'change points'. A growing theoretical and experimental literature examines inference problems for this class of signals [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. All these studies, with the exception of the work of [13], focus on the history-independent case of random change points that obey Poisson temporal statistics. Such problems are characterized by the absence of temporal structure: the probability of occurrence of a change point does not depend on the realization of past change points. [15, 23] and [22] extend their studies beyond this simple framework by considering 'hierarchical-Poisson' models in which the change probability is itself subject to random variations; but, here also, the occurrence of a change point does not depend on the timing of earlier change points. The experimental studies among the ones cited above have investigated the way in which human subjects and rodents infer hidden states, and whether they learn history-independent change probabilities.

Because of the pervasiveness of temporal structure in natural environments, we decided to study human inference in the presence of 'history-dependent' statistics in which the occurrence of a change point depends on the timing of earlier change points. This introduces considerable complexity in the optimal inference model (as the hidden state is no longer Markovian), and serves as a first step toward a more ecological approach to human inference. For the purpose of comparison, we consider two different statistics of change points: the first one is the Poisson statistics commonly used in earlier studies; the second is the simplest non-Markovian statistics, in which the probability of a change point is a function of the timing of the preceding change point. This setup allows us to examine the effect of the latent temporal structure on both human behavior and model responses.

In these two contrasting conditions, the behavior of the Bayesian model and that of human subjects exhibit both similarities and discrepancies. A salient departure from optimality exhibited by subjects is the variability in their responses. What is more, the shape of the distribution of responses is not constant, but, rather, subject to modulations during the course of inference.
The standard deviation and skewness of the empirical response distribution are correlated with those of the optimal, Bayesian posterior; this suggests that the randomness in subjects' responses does not reflect some 'passive' source of noise but is in fact related to the uncertainty of the Bayesian observer.

To account for this non-trivial variability in human responses and other deviations from optimality, we investigate in what ways approximations of the Bayesian model alter behavior, in our task. The optimal estimation of a hidden state can be split into two steps: Bayesian posterior inference (computing optimally the belief distribution over the state space) and optimal response selection (using the belief distribution to choose the response that maximizes the expected reward). Suboptimal models introduce systematic errors or stochastic errors in the inference step or in the response-selection step, or in both, thus impacting behavior. Models discussed in the change-point literature, along with new models we introduce, provide a wide range of such deviations from optimality, which we compare to experimental data. This allows us to assess how different sources of suboptimality impact behavior, and to what extent they can capture the salient features in human behavior.

The paper is outlined as follows. We first present the main aspects of our task, in which subjects observe a visual stimulus and infer an underlying, changing, hidden state. The susceptibility of subjects to a new stimulus is shown to differ appreciably between the two conditions (with and without latent temporal structure), and to adapt to the statistics of change points. We then analyze the variability in the subjects' responses, and show how it is modulated over the course of inference. After deriving the optimal, Bayesian solution of the inference problem in the context of our task, we examine its behavior in comparison with experimental data. We then turn to investigating a broad family of suboptimal models. In particular, motivated by the form of the variability present in our human data, we examine stochastic perturbations in both the inference step and in the response-selection step. These models reflect different forms of sampling: model subjects either perform inference using samples of probability distributions or select responses by sampling; the former option includes models with limited memory as well as sequential Monte Carlo (particle-filter) models. Finally, we discuss model fitting, from which we conclude that humans carry out stochastic approximations of the optimal Bayesian calculations through sampling-based inference (rather than sampling-based response selection).

Our observations confirm and extend the results reported in the change-point literature on human inference in the context of Poisson statistics, by exploring a more ecological [27, 28, 29, 30, 31, 32, 33, 34, 35], non-Poisson, temporally structured environment. Likewise, our results come to complement those of a number of studies on perception and decision-making that also investigate inference from stimuli with temporal statistics [36, 37, 38, 10, 11, 39]. Our experimental results demonstrate that humans learn implicitly the temporal statistics of stimuli.
Moreover, our work highlights the variability ubiquitous in behavioral data, and shows that this variability itself exhibits structure: it depends on the temporal statistics of the signal, and it is modulated over the course of inference. We find that a model in which the Bayesian posterior is approximated with a set of samples captures the behavioral variability during inference. This proposal adds to the growing literature on cognitive 'sample-based representations' of probability distributions [40, 41, 42, 43]. Our results suggest that the brain carries out complex inference by manipulating a modest number of samples, selected as a low-dimensional approximation of the optimal, Bayesian posterior.

Results
In our computer-based task, subjects are asked to infer, at successive trials, t, the location, on a computer screen, of a hidden point, the state, s_t, based on an on-screen visual stimulus, x_t, presented as a white dot on a horizontal line (Fig. 1A,B). Subjects can only observe the white dots, whose positions are generated around the hidden state according to a likelihood probability, g(x_t | s_t) (Fig. 1C,E, blue distribution). The state itself, s_t, follows a change-point process, i.e., it is constant except when it 'jumps' to a new location, which happens with probability q_t (the 'hazard rate' or 'change probability'). The dynamics of change points are, hence, determined by the change probability, q_t. To examine the behavior of models and human subjects in different 'environments', we choose two kinds of signals which differ in their temporal structure. History-independent (HI) signals are memoryless, Poisson signals: q_t is constant and equal to 0.1. Consequently, the intervals between two change points last, on average, for 10 trials, and the distribution of these intervals is geometric (Fig. 1D, blue bars). Conversely, history-dependent (HD) signals are characterized by temporal correlation. Change points also occur every 10 trials, on average, but the distribution of the duration of inter-change-point intervals is peaked around 10. This corresponds to a change probability, q_t, that is an increasing function of the number of trials since the last change point, a quantity referred to as the 'run-length', τ_t. We thus denote it by q(τ_t). In HD signals, change points occur in a manner similar to a 'jittered periodic' process, though the regularity is not readily detected by subjects.

When a change point occurs, the state randomly jumps to a new state, s_{t+1}, according to a state transition probability, a(s_{t+1} | s_t) (Fig. 1C,E, green distribution). The likelihood, g, and the state transition probability, a, overlap, thus allowing for ambiguity when a new stimulus is viewed: is it a random excursion about the current state, or has the state changed? At each trial, subjects click with a mouse to give their estimate, ŝ_t, of the state. The reward they receive for each response is a decreasing function, R, of the distance between the state and the estimate, |ŝ_t − s_t|: one reward point if the estimate falls within a given, short distance from the state, 0.25 point if it falls within twice that distance, and 0 point otherwise (Fig. 1E). The task is presented as a game to subjects: they are told that someone is throwing snowballs at them. They cannot see this hidden person (whose location is the state, s_t), but they observe the snowballs as white dots on the screen (the stimulus, x_t). After several tutorial runs (in some of which the state is shown), they are instructed to use the snowballs to guess the location of the person (i.e., produce an estimate, ŝ_t). Additional details on the task are provided in Methods.

A typical example of a subject's responses is displayed in Fig. 2A. To describe the data, we focus, throughout this paper, on three quantities: the learning rate, defined as the ratio of the 'correction', ŝ_{t+1} − ŝ_t, to the 'surprise', x_{t+1} − ŝ_t; the repetition propensity, defined as the proportion of trials in which the learning rate vanishes (ŝ_{t+1} = ŝ_t); and the standard deviation of the responses of the subjects. The learning rate represents a normalized measure of the susceptibility of a subject to a new stimulus.
If the new estimate, ŝ_{t+1}, is viewed as a weighted average of the previous estimate, ŝ_t, and the new stimulus, x_{t+1}, the learning rate is the weight given to x_{t+1}. A learning rate of 0 means that the subject has not changed its estimate upon observing the new stimulus; a learning rate of 0.5 means that the new estimate is equidistant from the previous estimate and the new stimulus; and a learning rate of 1 means that the new estimate coincides with the new stimulus, and the past is ignored (Fig. 2A).
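For concreteness, the two quantities just defined can be computed from arrays of stimuli and responses as in the following sketch (the array names x and s_hat are ours, not the authors'). To mirror the analyses below, one would additionally restrict trials to those whose surprise falls in the [8, 18] window.

```python
import numpy as np

def learning_rates(x, s_hat):
    """Learning rate: ratio of correction (s_hat[t+1] - s_hat[t])
    to surprise (x[t+1] - s_hat[t])."""
    correction = s_hat[1:] - s_hat[:-1]
    surprise = x[1:] - s_hat[:-1]
    valid = surprise != 0                    # guard against division by zero
    return correction[valid] / surprise[valid]

def repetition_propensity(s_hat):
    """Fraction of trials on which the previous estimate is repeated exactly."""
    return np.mean(s_hat[1:] == s_hat[:-1])
```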
Figure 1: Inference task and change probability, q, in the HI and HD conditions. A. The various elements in the task appear on a horizontal white line in the middle of a gray screen. a: subject's pointer (green disk). b: new stimulus (white disk). c: state (red disk, only shown during tutorial runs). d: position of subject's previous click (green dot). e: for half of subjects, previous stimuli appear as dots decaying with time. B. Successive steps of the task: 1, 2: a new stimulus is displayed; to attract the subject's attention, it appears as a large, white dot for 400 ms, after which it becomes smaller. 3: the subject moves the pointer. 4: the subject clicks to provide an estimate of the position of the state. After 100 ms, a new stimulus appears, initiating the next trial. C. The position of the stimulus on the horizontal axis, x_t, is generated randomly around the current state, s_t, according to the triangle-shaped likelihood function, g(x_t | s_t). The state itself is constant except at change points, at which a new state, s_{t+1}, is generated around s_t from the bimodal-triangle-shaped transition probability, a(s_{t+1} | s_t). The run-length, τ_t, is defined as the number of trials since the last change point. Change points occur with the change probability q(τ_t) (orange bars), which depends on the run-length in the HD condition (depicted here). D. Top panel: change probability, q(τ), as a function of the run-length, τ. It is constant and equal to 0.1 in the HI condition, while it increases with the run-length in the HD condition. Consequently, the distribution of intervals between two consecutive change points (bottom panel) is geometric in the HI condition whereas it is peaked in the HD condition; in both conditions, the average duration of inter-change-point intervals is 10. E. Compared extents of the likelihood, g(x_t | s_t) (green), the state transition probability, a(s_{t+1} | s_t) (blue), the 'shot' resulting from a click (green dot), and the radii of the 1-point (red disk) and 0.25-point (grey circle) reward areas. A shot overlapping the red (gray) area yields 1 (0.25) point.
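The generative process depicted in Fig. 1 can be summarized by the following sketch. It is a simplification, not the authors' code: Gaussian densities stand in for the triangular likelihood and the bimodal transition probability, the HD hazard is given an assumed logistic form, σ_g = 8.165 is taken from the text, and jump_scale is a hypothetical placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

def hazard_hi(tau):
    return 0.1                                   # constant change probability

def hazard_hd(tau, lam=1.0, T=10.0):
    # assumed logistic form: slope lam, mean interval T (see the paper's
    # Methods for the exact parameterization)
    return 1.0 / (1.0 + np.exp(-lam * (tau - T)))

def generate_signal(n_trials, hazard, sigma_g=8.165, jump_scale=15.0):
    states = np.empty(n_trials)
    stimuli = np.empty(n_trials)
    s, tau = 0.0, 0
    for t in range(n_trials):
        if rng.random() < hazard(tau):           # change point: new state
            s += rng.choice([-1.0, 1.0]) * jump_scale
            tau = 0
        else:
            tau += 1
        states[t] = s
        stimuli[t] = s + sigma_g * rng.standard_normal()  # noisy observation
    return states, stimuli

states, stimuli = generate_signal(1000, hazard_hd)
```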
Figure 2: Human learning rates depend on the temporal statistics (HI or HD) of the stimulus. A. Illustration of the learning rate using a sample of a subject's responses (red line). The 'surprise' (blue arrow) is the difference, x_{t+1} − ŝ_t, between the estimate at trial t, ŝ_t (red), and the new stimulus at trial t+1, x_{t+1} (blue). The 'correction' (red arrow) is the difference between the estimate at trial t and the estimate at trial t+1, ŝ_{t+1} − ŝ_t. The 'learning rate' is the ratio of correction to surprise. B. Average learning rates in HI (blue) and HD (orange) conditions, at short run-lengths (˜τ ∈ [5, …]) and long run-lengths (˜τ ∈ [9, …]). In the HD condition the change probability increases with the run-length, which advocates for higher learning rates at long run-lengths. C. Average learning rates in HI (blue) and HD (orange) conditions, vs. run-length ˜τ. Shaded bands indicate the standard error of the mean. B, C. Stars indicate p-values of one-sided Welch's t-tests, which do not assume equal population variance. Three stars: p < 0.01; two stars: p < 0.05; one star: p < 0.1. Bonferroni-Holm correction [44] is applied in panel B.

Our data show that for human subjects the learning rate is not constant, and can vary from no correction at all (learning rate ≈ 0) to full correction (learning rate ≈ 1). Because the true run-length, τ, is not directly accessible to subjects, our analyses use ˜τ, a similar quantity derived from the subjects' data (see Methods). Unless otherwise stated, we focus our analyses on cases in which the surprise, x_{t+1} − ŝ_t, is in the [8,18] window, in which there is appreciable ambiguity in the signal.

A first observation emerging from our data is that the learning rate changes with the run-length, in a quantitatively different fashion depending on the condition (HI or HD). In the HI condition, learning rates at short run-length (˜τ ∈ [5, …]) are significantly higher than at long run-length (˜τ ∈ [9, …]), i.e., the learning rate decreases with run-length (Fig. 2B, blue bars). In the HD condition, the opposite occurs: learning rates are significantly higher at long run-lengths (Fig. 2B, orange bars), indicating that subjects modify their inference depending on the temporal structure of the signal. In addition, at short run-lengths, learning rates are significantly lower in the HD condition than in the HI condition; this suggests that subjects take into account the fact that a change is less likely at short run-lengths in the HD condition. The opposite holds at long run-lengths: HD learning rates are markedly larger than HI ones (Fig. 2B).

Inspecting the dependence of the learning rate on the run-length (Fig. 2C), we note that the HD learning-rate curve adopts a 'smile shape', unlike the monotonic curve in the HI condition. (A statistical analysis confirms that these curves have significantly different shapes; see Methods.) The HI curve is consistent with a learning rate that simply decreases as additional information is accumulated on the state. In the HD condition, initially the learning rate is suppressed, then boosted at longer run-lengths, reflecting the modulation in the change probability.

These observations demonstrate that subjects adapt their learning rate to the run-length, and that in the HD condition subjects make use of the temporal structure in the signal. These results are readily intuited: shortly after a change point, the learning rate should be high, as little is known about the new state, while at longer run-lengths the learning rate should tend to zero as the state is more and more precisely inferred. This decreasing behavior is observed, but only in the HI condition. The HD condition introduces an opposing effect: as the run-length grows, new stimuli are increasingly likely to divulge the occurrence of a new state, which advocates for adopting a higher learning rate. This tradeoff is reflected in our data in the 'smile shape' of the HD learning-rate curve (Fig. 2C; these trends subsist at longer run-lengths, see Supplementary Fig. 17). The increase in learning rate at long run-lengths is reminiscent of the behavior of a driver waiting at a red light: as time passes, the light is increasingly likely to turn green; as a result, the driver is increasingly susceptible to react and start the car.

A closer look at the data presented in the previous section reveals that in a number of trials the learning rate vanishes, i.e., ŝ_{t+1} = ŝ_t. The distribution of the subjects' corrections, ŝ_{t+1} − ŝ_t, exhibits a distinct peak at zero (Fig. 3A). In other words, in a fraction of trials, subjects click twice consecutively on the same pixel. We call such a response a 'repetition', and the fraction of repetition trials the 'repetition propensity'.
The latter varies with the run-length: it increases with τ in both HI and HD conditions, before decreasing in the HD condition for long run-lengths (Fig. 3B).

What may cause the subjects' repetition behavior? The simplest explanation is that, after observing a new stimulus, a subject may consider that the updated best estimate of the state lands on the same pixel as in the previous trial. The width of one pixel in arbitrary units of our state space is 0.28. As a comparison, the triangular likelihood, g, has a standard deviation, σ_g, of 8.165. An optimal observer estimating the center of a Gaussian density of standard deviation σ_g, using 10 samples from this density, comes up with a posterior density with standard deviation σ_g/√10 ≈ 2.6. Therefore, after observing even 10 successive stimuli, the subjects' resolution is not as fine as a pixel (it is, in fact, about 10 times coarser). This indicates that the subjects' repetition propensity is higher than the optimal one (the behavior of the optimal model, presented below, indeed exhibits a lower average repetition propensity than that of the subjects). Another possible explanation is that even though the new estimate falls on a nearby location, a motor cost prohibits a move if it is not sufficiently extended to be 'worth it' [45, 46, 47, 48]. A third, heuristic explanation is that humans are subject to a 'repetition bias' according to which they repeat their response irrespective of their estimate of the state.

Regardless of its origin, the high repetition propensity in data raises the question of whether it dominates the behavior of the average learning rate. As a control, we excluded all occurrences of repetitions in subjects' data and carried out the same analyses on the truncated dataset. We reached identical conclusions, namely, significant discrepancies between the HI and HD learning rates at short and long run-lengths, albeit with, naturally, higher average rates overall (see Supplementary Fig. 18).

Figure 3: Human repetition propensity depends on the temporal statistics, and dynamically on the run-length. A. Histogram of subject corrections (difference between two successive estimates, ŝ_{t+1} − ŝ_t), in the HI (blue) and HD (orange) conditions. The width of bins corresponds to one pixel on the screen, thus the peak at zero represents the repetition events (ŝ_{t+1} = ŝ_t). B. Repetition propensity, i.e., proportion of occurrences of repetitions in the responses of subjects, as a function of run-length, in the HI (blue) and HD (orange) conditions. Stars indicate p-values of Fisher's exact test of equality of the repetition propensities between the two conditions, at each run-length.
In the previous two sections, we have examined two aspects of the distribution of responses: the average learning rate and the probability of response repetition. We now turn to the variability in subjects' responses. Although all subjects were presented with identical series of stimuli, x_t, their responses at each trial were not the same (Fig. 4A). This variability appears in both HI and HD conditions. The distribution of responses around their averages at each trial has a width comparable to that of the likelihood distribution, g(x_t | s_t) (Fig. 4B). More importantly, the variability in the responses (as measured by the standard deviation) is not constant, but decreases with successive trials following a change point, at short run-lengths (Fig. 4C). Comparing the HI and HD conditions, we observe that for run-lengths shorter than 7, the standard deviation in the HD condition is significantly lower than that in the HI condition. At longer run-lengths, the two curves cross and the variability in the HD condition becomes significantly higher than in the HI condition. The HD curve adopts, again, a 'smile shape' (Fig. 4C). What is the origin of the response variability? Because it changes with the run-length and the HI vs. HD condition, it cannot be explained merely by the presence of noise independent from the inference process, such as pure motor noise. In order to encompass human behavior in a theoretical framework and to investigate potential sources of this inference-dependent variability, we start by comparing the recorded behavior with that of an optimal observer.

We derive the optimal solution for the task of estimating the hidden state, s_t, given random stimuli, x_t. The first step (the 'inference step') is to derive the optimal posterior distribution over the state, s_t, using Bayes' rule. Because the state is a random variable coupled with the run-length, τ_t, another random variable, it is convenient to derive the Bayesian update equation for the (s_t, τ_t) pair (more precisely, the (s_t, τ_t) pair verifies the Markov property, whereas s_t alone does not, in the HD condition). We denote by x^t the string of stimuli received between trial 1 and trial t, and by p_t(s, τ | x^t) the probability distribution over (s, τ), at trial t, after having observed the stimuli x^t. At trial t+1, Bayes' rule yields p_{t+1}(s, τ | x^{t+1}) ∝ g(x_{t+1} | s) p_{t+1}(s, τ | x^t). Furthermore, we have the general transition equation,

\[
p_{t+1}(s, \tau \mid x^t) = \sum_{\tau_t} \int p_{t+1}(s, \tau \mid s_t, \tau_t)\, p_t(s_t, \tau_t \mid x^t)\, \mathrm{d}s_t, \tag{1}
\]

given by the change-point statistics. As the transition probability, p_{t+1}(s, τ | s_t, τ_t), can be expressed using q(τ_t) and a(s | s_t) (see Methods for details), we can reformulate the update equation as

\[
p_{t+1}(s, \tau \mid x^{t+1}) = \frac{1}{Z_{t+1}}\, g(x_{t+1} \mid s) \left[ \mathbb{1}_{\tau = 0} \sum_{\tau_t} q(\tau_t) \int a(s \mid s_t)\, p_t(s_t, \tau_t \mid x^t)\, \mathrm{d}s_t + \mathbb{1}_{\tau > 0} \bigl(1 - q(\tau - 1)\bigr)\, p_t(s, \tau - 1 \mid x^t) \right], \tag{2}
\]

where 𝟙_C = 1 if condition C is true, 0 otherwise; and Z_{t+1} is a normalization constant. This equation includes two components: a 'change-point' one (τ = 0) and a 'no change-point' one (τ > 0). We call the model that performs this Bayesian update of the posterior the OptimalInference model.
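For illustration, Eq. (2) can be implemented on a discretized (s, τ) grid as in the sketch below. This is our sketch, not the authors' implementation: g_lik is the likelihood g(x_{t+1} | s) on the state grid, A discretizes the transition probability a(s | s_t), q holds the change probability per run-length, and the largest run-length is simply truncated by the finite grid.

```python
import numpy as np

def bayes_update(post, g_lik, A, q):
    """One step of Eq. (2). post[tau, s] = p_t(s, tau | x^t);
    g_lik[s] = g(x_{t+1} | s); A[s, s0] = a(s | s0); q[tau] = q(tau)."""
    new = np.zeros_like(post)
    # 'change-point' branch (tau = 0): weight histories by q(tau_t),
    # marginalize over (s_t, tau_t), propagate through a(s | s_t)
    new[0] = A @ (q[:, None] * post).sum(axis=0)
    # 'no change-point' branch (tau > 0): run-length increments by one
    # (mass at the largest run-length is truncated on this finite grid)
    new[1:] = (1.0 - q[:-1, None]) * post[:-1]
    new *= g_lik[None, :]                        # multiply by the likelihood
    return new / new.sum()                       # normalization (Z_{t+1})
```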
Finally, following the inference step just presented (i.e., the computation of the posterior), a 'response-selection step' determines the behavioral response. At trial t and for a response ŝ_t, the expected reward is

\[
\mathbb{E}_s[R] = \int R(|\hat{s}_t - s|)\, p_t(s \mid x^t)\, \mathrm{d}s.
\]

The optimal strategy selects the response, ŝ_t, that maximizes this quantity. Before exploring the impact of relaxing the optimality in the inference step, in the response-selection step, or both, we examine, first, the behavior of the optimal model.
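The optimal ('Max') selection can be sketched as follows, applied to the marginal posterior over states; the reward radii r1 and r2 = 2·r1 are placeholders for the task's 1-point and 0.25-point radii, whose actual values are given in Methods.

```python
import numpy as np

def max_response(s_grid, marginal_post, r1=2.0, r2=4.0):
    """Return the estimate maximizing E[R] = ∫ R(|s_hat - s|) p(s | x^t) ds."""
    d = np.abs(s_grid[:, None] - s_grid[None, :])     # |candidate - state|
    reward = np.where(d <= r1, 1.0, np.where(d <= r2, 0.25, 0.0))
    expected = reward @ marginal_post                 # E[R] per candidate
    return s_grid[np.argmax(expected)]
```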
Figure 4: The variability in subjects' responses is modulated during inference, and these modulations depend on the temporal statistics of the stimulus. A. Responses of subjects in an example of 5 consecutive stimuli. In this example, there is no change point and the state (green) is constant. At each trial (from top to bottom), subjects observe the stimuli (blue) and provide their responses (red bars). A histogram of the locations of the responses is obtained by counting the number of responses in bins of width 3 (light red). B. Distribution of the responses of subjects around their average (red), compared to the likelihood, g (blue), and the state transition probability, a (green). C. Standard deviation of the responses of subjects vs. run-length, ˜τ, in the HI (blue) and HD (orange) conditions. Stars indicate p-values of Levene's test of equality of variance between the two conditions, at each ˜τ. Shaded bands indicate the standard error of the standard deviation [49].

Equipped with the optimal model for our inference task, we compare its output to experimental data. For short run-lengths (τ < …), the learning rates in both HI and HD conditions decrease as a function of the run-length, and the HD learning rates are lower than their HI counterparts. They increase, however, at longer run-lengths (τ ≥ …) and ultimately exceed the HI learning rates; these, by contrast, decrease monotonically (Fig. 5A, solid line). These trends are similar to those observed in behavioral data (Fig. 5A, dashed line). Hence, the modulation of the subjects' learning rates with the temporal statistics of the stimuli, and over the course of inference, is consistent, at least qualitatively, with that of a Bayesian observer.

Although a Bayesian observer can, in principle, hold a continuous posterior distribution, we discretize, instead, the posterior, in order to reproduce the experimental condition of a pixelated screen.
Figure 5: The optimal model captures qualitatively the behavior of the learning rate and of the repetition propensity in subjects, but does not account for their variability. A. Average learning rate as a function of the run-length. In the HI condition, the learning rate decreases with the run-length, for both the optimal model and the subjects. In the HD condition, learning rates in the optimal model are lower than in the HI condition, for short run-lengths, and higher for long run-lengths. The learning rate of subjects exhibits a similar smile shape, in the HD condition. B. Repetition propensity, i.e., proportion of repetition trials, as a function of the run-length. C. Standard deviation of the responses of the subjects (dashed lines) and of the optimal model (solid lines), and standard deviation of the optimal, Bayesian posterior distribution (long dashes), as a function of the run-length. The optimal model is deterministic and, thus, exhibits no variability in its responses. The optimal posterior distribution, however, has a positive standard deviation, which decreases with the run-length, in the HI condition, and exhibits a smile shape, in the HD condition.

This discretization allows for repetitions. The repetition propensity of the optimal model varies with the run-length: it increases with τ in both HI and HD conditions, and decreases in the HD condition for long run-lengths, a pattern also found in experimental data (Fig. 5B).

Hence, the optimal model captures the qualitative trends in learning rate and repetition propensity present in the responses of the subjects. Quantitative differences, however, remain. The learning rates of the subjects, averaged over both HI and HD conditions, are 43% higher than the average learning rate in the optimal model, and the average repetition propensity of the subjects is 9 percentage points higher than that in the optimal model.

The optimal model captures qualitatively the modulations of learning rate and repetition propensity in subjects, but it is deterministic (at each trial, the optimal estimate is a deterministic function of past stimuli) and, as such, it does not capture the variability inherent to the behavior of subjects. The modulations of the behavioral variability as a function of the run-length and of the temporal structure of the signal (Fig. 4C) are a sign that the variability evolves as the inference process unfolds. The standard deviation of the optimal Bayesian posterior decreases with the run-length, in the HI condition: following a change point, the posterior becomes narrower as new stimuli are observed. In the HD condition, the standard deviation of the posterior exhibits a 'smile shape' as a function of the run-length: it decreases until the run-length reaches 5, then increases for larger run-lengths (Fig. 5C). This behavior is similar to that of the standard deviation of the responses of the subjects. In fact, the standard deviation of the Bayesian posterior and that of subjects' responses across trials are significantly correlated, both in the HI condition (Pearson's r = …, p < …) and in the HD condition (r = …, p < …). In other words, when the Bayesian posterior is wide there is more variability in the responses of subjects, and vice-versa (Fig. 6A).
Figure 6: Both width and skewness of the distribution of subjects' responses are correlated with those of the Bayesian posterior. Empirical standard deviation (A) and skewness (B) of subjects' responses as a function of the standard deviation and skewness of the Bayesian posterior, in the HI (blue) and HD (orange) conditions, and linear regressions (ordinary least squares; dashed lines). On 85% of trials, the standard deviation of the Bayesian posterior is lower than the value marked by the vertical grey line. Shaded bands indicate the standard error of the mean.

Turning to higher moments of the distribution of subjects' responses, we find that the skewness of this distribution appears, also, to grow in proportion to the skewness of the Bayesian posterior (Fig. 6B). The correlation between these two quantities is positive and significant in the two conditions (HI: r = …, p < …; HD: r = …, p < …). (These results are not driven by the boundedness of the response domain, which could have artificially skewed the distribution of responses; see Supplementary Fig. 19.) Thus, not only the width, but also the asymmetry in the distribution of subjects' responses is correlated with that of the Bayesian posterior. These observations support the hypothesis that the behavioral variability in the data is at least in part related to the underlying inference and decision processes.

In what follows, we introduce an array of suboptimal models, with the aim of resolving the qualitative and quantitative discrepancies between the behavior of the optimal model and that of the subjects. In particular, we formulate several stochastic models which include possible sources of behavioral variability. Two scenarios are consistent with the modulations of the magnitude and asymmetry of the variability with the width and skewness of the Bayesian posterior: stochasticity in the inference step (i.e., in the computation of the posterior) and stochasticity in the response-selection step (i.e., in the computation of an estimate from the posterior). The models we examine below cover both these scenarios.

In the previous sections, we have examined the learning rate of the subjects, their repetition propensity, and the variability in their responses; comparison of the behaviors of these quantities to those of the Bayesian, optimal model revealed similarities (namely, the qualitative behaviors of the learning rate and of the repetition propensity) and discrepancies (namely, quantitative differences in these two quantities, and lack of variability in the optimal model). Although the latter call for a non-Bayesian account of human behavior, the former suggest not to abandon the Bayesian approach altogether (in favor, for instance, of ad hoc heuristics). Thus, we choose to examine a family of sub-optimal models obtained from a sequence of deviations away from the Bayesian model, each of which captures potential cognitive limitations hampering the optimal performance.

In the Bayesian model, three ingredients enter the generation of a response upon receiving a stimulus: first, the belief on the structure of the task and on its parameters; second, the inference algorithm which produces a posterior on the basis of stimuli; third, the selection strategy which maps the posterior into a given response. The results presented above, exhibiting the similarity between the standard deviation of the Bayesian posterior and the standard deviation of the responses of the subjects (Fig. 5C), point to a potential departure from the optimal selection strategy, in which the posterior is sampled rather than maximized. This sampling model, which we implement (see below), captures qualitatively the modulated variability of responses; sizable discrepancies in the three quantities we examine, however, remain (see Methods).
Hence, we turn to the other ingredients of the estimation process, and we undertake a systematic analysis of the effects on behavior of an array of deviations away from optimality.

Below, we provide a conceptual presentation of the resulting models; we fit them to experimental data, and comment on what the best-fitting models suggest about human inference and estimation processes. For the detailed mathematical descriptions of the models, and an analysis of the ways in which their predictions depart from the optimal behavior, we refer the reader to the Methods section.
Models with erroneous beliefs on the statistics of the signal
Our first model challenges the assumption, made in the optimal model, of a perfectly faithful representation of the set of parameters governing the statistics of the signal. Although subjects were exposed in training phases to blocks of stimuli in which the state, s_t, was made visible, they may have learned the parameters of the generative model incorrectly. We explore this possibility, and, here, we focus on the change probability, q(τ), which governs the dynamics of the state. (We found that altering the value of this parameter had a stronger impact on behavior than altering the values of any of the other parameters.) In the HD condition, q(τ) is a sigmoid function shaped by two parameters: its slope, λ = 1, which characterizes 'how suddenly' change points become likely, as a function of τ; and the average duration of inter-change-point intervals, T = 10. In the HI condition, q = 0.1 is constant; it can also be interpreted as an extreme case of a sigmoid in which λ = 0 and T = 1/q = 10. We implement a suboptimal model, referred to as IncorrectQ, in which these two quantities, λ and T, are treated as free parameters, thus allowing for a broad array of different beliefs in the temporal structure of the signal (Fig. 7A).
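A sketch of the IncorrectQ belief family follows: a hazard q(τ; λ, T) and the interval distribution it implies (as in Fig. 7A). The logistic form below is an assumption of this illustration; the paper's exact sigmoid is specified in its Methods.

```python
import numpy as np

def q_belief(tau, lam, T):
    """Believed change probability as a function of run-length."""
    tau = np.asarray(tau, dtype=float)
    if lam == 0:
        return np.full_like(tau, 1.0 / T)       # flat hazard: HI-like belief
    return 1.0 / (1.0 + np.exp(-lam * (tau - T)))  # assumed sigmoid shape

def interval_distribution(lam, T, max_tau=50):
    """P(interval): survive each run-length without a change, then change."""
    tau = np.arange(max_tau)
    q = q_belief(tau, lam, T)
    survival = np.concatenate(([1.0], np.cumprod(1.0 - q)[:-1]))
    return survival * q

p_hd = interval_distribution(lam=1.0, T=10.0)   # peaked around 10
p_hi = interval_distribution(lam=0.0, T=10.0)   # geometric
```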
Models with limited memory

Aside from operating with an inexact representation of the generative model, human subjects may use a suboptimal form of inference. In the HD condition, the optimal model maintains 'in memory' a probability distribution over the entire (s, τ)-space (see Eq. (2)), thus keeping track of a rapidly increasing number of possible histories, each characterized by a sequence of run-lengths. Such a process entails a large computational and memory load. We explore suboptimal models that alleviate this load by truncating the number of possible scenarios stored in memory; this is achieved through various schemes of approximations of the posterior distribution. More specifically, in the following three suboptimal models, the true (marginal) probability of the run-lengths, p_t(τ | x^t), is replaced by an approximate probability distribution.

A first, simple way of approximating the marginal distribution of the run-lengths is to consider only its mean, i.e., to replace it by a Kronecker delta which takes the value 1 at an estimate of the expected value of the run-lengths. [16] introduce a suboptimal model based on this idea, some details of which depend on the specifics of the task; we implement a generalization of this model, adapted to the parameters of our task. We call it the τMean model.

Figure 7: Illustration of the erroneous beliefs in the IncorrectQ model and of the approximations made in the τMean, τNodes, and τMaxProb models. A.
Change probability, q(τ), as a function of the run-length (first row), and distribution of intervals between two consecutive change points (second row), for various beliefs on the parameters of the change probability: the slope, λ, and the average duration of intervals, T. For a vanishing slope (λ = 0), the change probability is constant and equal to 1/T (first panel). With T = 10 this corresponds to the HI condition (blue lines). For a positive slope (λ > 0), the change probability increases with the run-length (i.e., a change point becomes more probable as the time since the last change point increases), and the distribution of intervals between two successive change points is peaked. The HD condition (orange lines) corresponds to λ = 1 and T = 10. B. Schematic illustration of the marginal distribution of the run-length, p(τ), in each model considered. The OptimalInference model assigns a probability to each possible value of the run-length, τ, and optimally updates this distribution upon receiving stimuli (first panel). The τMean model uses a single run-length which tracks the inferred expected value, ¯τ_t (second panel). The τNodes model holds in memory a limited number, N_τ, of fixed hypotheses on τ ("nodes"), and updates a probability distribution over these nodes; N_τ = 2 in this example (third panel). The τMaxProb model reduces the marginal distribution by discarding less likely run-lengths; in this example, 2 run-lengths are stored in memory at any given time (fourth panel).

While the optimal marginal distribution of the run-lengths, p_t(τ | x^t), spans the whole range of possible values of the run-length, it is approximated, in the τMean model, by a delta function parameterized by a single value, which we call the 'approximate expected run-length' and which we denote by ¯τ_t. Upon the observation of a new stimulus, x_{t+1}, the updated approximate expected run-length, ¯τ_{t+1}, is computed as a weighted average between two values of the run-length, 0 and ¯τ_t + 1, which correspond to the two possible scenarios: with and without a change point at trial t+1. Each scenario is weighted according to the probability of a change point at trial t+1, given the stimulus, x_{t+1}. This model has no free parameter (Fig. 7B, second panel).

In a second limited-memory model, contrary to the τMean model just presented, the support of the distribution of the run-lengths is not confined to a single value. This model generalizes the one introduced by [17]. In this model, the marginal distribution of the run-lengths, p_t(τ | x^t), is approximated by another discrete distribution defined over a limited set of constant values, called 'nodes' (Fig. 7B, third panel). We call this model τNodes. A difference with the previous model (τMean) is that the support of the distribution is fixed, i.e., the set of nodes remains constant as time unfolds, whereas in the τMean model the single point of support, ¯τ_t, depends on the stimuli received. The details of the implementation of this algorithm, and, in particular, of how the approximate marginal distribution of the run-lengths is updated upon receiving a new stimulus, are provided in Methods. The model is parameterized by the number of nodes, N_τ, and the values of the nodes. We implement it with up to five nodes.

The two models just presented are drawn from the literature. We propose a third suboptimal model that relieves the memory load in the inference process.
We also approximate, in this model, the marginal distribution of the run-lengths, p_t(τ | x^t), by another, discrete distribution. We call N_τ the size of the support of our approximate distribution, i.e., the number of values of the run-length at which the approximate distribution does not vanish. A simple way to approximate p_t(τ | x^t) is to identify the N_τ most likely run-lengths, and set the probabilities of the other run-lengths to zero. More precisely, if, at trial t, the run-length takes a given value, τ_t, then, upon the observation of a new stimulus, at trial t+1 it can only take one of two values: 0 (if there is a change point) or τ_t + 1 (if there is no change point). Hence, if the approximate marginal distribution of the run-lengths at trial t is non-vanishing for N_τ values, then the updated distribution is non-vanishing for N_τ + 1 values. We approximate this latter distribution by identifying the most unlikely run-length, argmin p_{t+1}(τ | x^{t+1}), setting its probability to zero, and renormalizing the distribution. In other words, at each step, the N_τ most likely run-lengths are retained while the least likely run-length is eliminated. We call this algorithm τMaxProb (Fig. 7B, fourth panel). It is parameterized by the size of the support of the marginal distribution, N_τ, which can be understood as the number of 'memory slots' in the model.
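In code, the τMaxProb truncation reduces to the following sketch, where the marginal p_{t+1}(τ | x^{t+1}) is represented, for illustration only, as a dict mapping run-lengths to probabilities.

```python
def prune_maxprob(p_tau, n_tau):
    """Keep the n_tau most likely run-lengths; renormalize the rest."""
    while len(p_tau) > n_tau:
        least_likely = min(p_tau, key=p_tau.get)
        del p_tau[least_likely]                 # discard, as in tau-MaxProb
    z = sum(p_tau.values())
    return {tau: p / z for tau, p in p_tau.items()}
```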
A model with limited run-length memory through sampling-based inference

The five models considered hitherto (OptimalInference, IncorrectQ, τMean, τNodes, and τMaxProb) are deterministic: a given sequence of stimuli implies a given sequence of responses, in marked contrast with the variability exhibited in the responses of subjects. To account for this experimental observation, we suggest several models in which stochasticity is introduced in the generation of a response. Response stochasticity can stem from the inference step, the response-selection step, or both. We present, first, a model with stochasticity in the inference step.

This model, which we call τSample, is a stochastic version of the τMaxProb model: instead of retaining deterministically the N_τ most likely run-lengths at each trial, the τSample model samples N_τ run-lengths using the marginal distribution of the run-lengths, p_t(τ | x^t). More precisely, if at trial t+1 the marginal distribution of the run-lengths, p_{t+1}(τ | x^{t+1}), is non-vanishing for N_τ + 1 values, then a run-length is sampled from the distribution [1 − p_{t+1}(τ | x^{t+1})]/z_{t+1}, where z_{t+1} is a normalization factor, and the probability of this run-length is set to zero (Fig. 8). In other words, while the τMaxProb model eliminates the least likely run-length deterministically, the τSample model eliminates one run-length stochastically, in such a fashion that less probable run-lengths are more likely to be eliminated. The τSample model has one parameter, N_τ, the size of the support of the marginal distribution of the run-lengths.
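The τSample variant replaces the deterministic elimination with a stochastic one, sampling the run-length to discard with probability proportional to 1 − p (same illustrative dict representation as in the previous sketch).

```python
import numpy as np

rng = np.random.default_rng(0)

def prune_sample(p_tau, n_tau):
    """Stochastically discard run-lengths until n_tau remain; renormalize."""
    while len(p_tau) > n_tau:
        taus = list(p_tau)
        w = np.array([1.0 - p_tau[t] for t in taus])
        drop = rng.choice(taus, p=w / w.sum())  # unlikely tau dropped more often
        del p_tau[drop]
    z = sum(p_tau.values())
    return {tau: p / z for tau, p in p_tau.items()}
```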
Stochastic inference model with sampling in time and in state space: the particle filter

Although the τMean, τNodes, τMaxProb, and τSample models introduced above relieve the memory load by prescribing a form of truncation on the set of run-lengths, inference in these models is still executed on a continuous state space labeled by s (or, more precisely, on a discrete space with resolution as fine as a pixel). Much as subjects may retain only a compressed representation of probabilities along the τ axis, it is conceivable that they may not maintain a full probability function over the 1089-pixel-wide s axis, as they carry out the behavioral task. Instead, they may infer using a coarser spatial representation, in order to reduce their memory and computational loads.

Figure 8: Posterior density over three successive trials for the OptimalInference model, the τSample model with N_τ = 2, and the ParticleFilter model with ten particles. The three panels correspond to the three successive trials. Each row except the last one corresponds to a different run-length, τ. In these rows, the horizontal bars show the marginal probability of the run-length, p(τ | x^t). The posterior (i.e., the joint distribution of the run-length and the state, p(s, τ | x^t)) is shown as a function of the state, s, for the OptimalInference model (blue shaded curve), the τSample model (pink line), and the ParticleFilter model (orange vertical bars). The marginal probability of the run-length, p(τ | x^t), for the OptimalInference model, is additionally reflected in the hue of the curve (darker means higher probability). For the ParticleFilter model, the heights of the bars are proportional to the weights of the particles. When the states, s, of two or more particles coincide, a single bar is shown with a height proportional to the sum of the weights. The last row shows the marginal distributions of the states, p(s | x^t) = Σ_τ p(s, τ | x^t), along with the location of the stimulus at each trial (red vertical line). At trial t (left panel), the probability of the run-length τ = 5 dominates in the three models. In the τSample model, it vanishes at run-lengths from 0 to 3, and it is very small for τ = 4. In the ParticleFilter model, the run-lengths of the ten particles are all 5, and thus the probability of all other run-lengths is zero. At trial t+1 (middle panel), upon observation of the new stimulus, x_{t+1}, the marginal probability of the vanishing run-length (τ = 0), which corresponds to a 'change-point' scenario, becomes appreciable in the OptimalInference model (top row). The probability of the run-length τ = 6 (a 'no change-point' scenario) is however higher. As a result, a 'bump' appears in the marginal distribution of the state, around the new stimulus (bottom row). In the τSample model, the optimal update of the posterior results in a non-vanishing probability for three run-lengths (τ = 0, 5, and 6), more than the number of 'memory slots' available (N_τ = 2). One run-length is thus randomly chosen, and its marginal probability is set to zero; in the particular instantiation of the model presented here, the run-length τ = 0 is chosen, and thus the resulting marginal probability of run-length is non-vanishing for τ = 5 and 6 only. In the ParticleFilter model, the stochastic update of the particles results in seven particles adopting a vanishing run-length, and the probability of a 'change-point' scenario (τ = 0) becomes higher than that of the 'no change-point' scenario (τ = 6) supported by the remaining three particles. The various marginal distributions of the states obtained in these three models (bottom row) illustrate how the τSample model and the ParticleFilter model approximate the optimal posterior: the τSample model assigns a negligible probability to a set of states whose probability is substantial under the OptimalInference model, while the ParticleFilter yields a coarse approximation reduced to a support of ten states (as opposed to a continuous distribution).

Monte Carlo algorithms perform such approximations by way of randomly sampling the spatial distribution; sequential Monte Carlo methods, or 'particle filters', were developed in the 1990s to address Hidden Markov Models, a class of hidden-state problems within which falls our inference task [50, 51, 52]. Particle filters approximate a distribution by a weighted sum of delta functions. In our case, a particle i at trial t is a triplet, (s_t^i, τ_t^i, w_t^i), composed of a state, a run-length, and a weight; a particle filter with N_P particles approximates the posterior, p_t(s, τ | x^t), by the distribution

\[
\tilde{p}_t(s, \tau \mid x^t) = \sum_{i=1}^{N_P} w_t^i\, \delta(s - s_t^i)\, \delta_{\tau, \tau_t^i}, \tag{3}
\]

where δ(s − s_t^i) is a Dirac delta function, and δ_{τ,τ_t^i} a Kronecker delta. In other words, a distribution over the (s, τ) space is replaced by a (possibly small) number of points, or samples, in that space, along with their probability weights.

To obtain the approximate posterior at trial t+1 upon the observation of a new stimulus, x_{t+1}, we note, first, that the Bayesian update (Eq. (2)) of the approximate posterior, \tilde{p}_t(s, τ | x^t), is a mixture (a weighted sum) of the N_P Bayesian updates of each single particle (i.e., Eq. (2) with the prior, p_t(s, τ | x^t), replaced, for each particle, by δ(s − s_t^i) δ_{τ,τ_t^i}). Then, we sample independently each component of the mixture (i.e., each Bayesian update of a particle), to obtain stochastically the updated particles, (s_{t+1}^i, τ_{t+1}^i), and to each particle is assigned the weight of the corresponding component in the mixture. The details of the procedure just sketched, in particular the derivation of the mixture and of its weights, and how we handle the difficulties arising in practical applications of the particle filter algorithm, can be found in Methods. This model, which we call ParticleFilter, has a single free parameter: the number of particles, N_P (Fig. 8).
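For illustration, a generic bootstrap-style particle-filter step for this task is sketched below; the paper's actual update (the mixture decomposition and its weights, plus safeguards against degeneracy) is detailed in its Methods and differs from this simplification. hazard, sample_transition, and likelihood are assumed, vectorized task-specific functions.

```python
import numpy as np

rng = np.random.default_rng(0)

def pf_step(s, tau, w, x_new, hazard, sample_transition, likelihood):
    """One update of N_P particles (s, tau, w), as in Eq. (3)."""
    n = len(s)
    change = rng.random(n) < hazard(tau)         # change point per particle?
    s = np.where(change, sample_transition(s), s)
    tau = np.where(change, 0, tau + 1)
    w = w * likelihood(x_new, s)                 # reweight by the new stimulus
    w = w / w.sum()
    if 1.0 / np.sum(w**2) < n / 2:               # resample if ESS collapses
        idx = rng.choice(n, size=n, p=w)
        s, tau, w = s[idx], tau[idx], np.full(n, 1.0 / n)
    return s, tau, w
```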
The τ Sample and
Par-ticleFilter models presented above reduce the dimensionality of the inference problem by pruningstochastically the posterior, in the inference step. But, as we pointed out, the behavior of the stan-dard deviation of the responses of the subjects, as compared to that of the width of the Bayesianposterior (Fig. 5C), hints at a more straightforward mechanism at the origin of response variabil-ity. The model we now introduce features stochasticity not in the inference step, but rather inthe response-selection step of an otherwise optimal model. In this model, the response is sampledfrom the marginal posterior on the states, p t ( s | x t ) , i.e., the response, ˆ s t , is a random variablewhose density is the posterior. This contrasts with the optimal response-selection strategy, whichmaximizes the expected score based on the Bayesian posterior, and which was implemented in allthe models presented above. Henceforth, we denote the optimal, deterministic response-selectionstrategy by Max , and the suboptimal, stochastic strategy just introduced by
Sampling. It has no free parameters.

Another source of variability in the response-selection step might originate in limited motor precision in the execution of the task. To model this motor variability, in some implementations of our models we include a fixed, additive, Gaussian noise, parameterized by its standard deviation, σ_m, to obtain the final estimate. Both this motor noise and the Sampling strategy entail stochasticity in response selection. The former, however, has a fixed standard deviation, σ_m, while the variance of the latter depends on the posterior, which varies over the course of inference (Fig. 5C). When we include motor noise in the Max or in the Sampling strategy, we refer to these as NoisyMax and NoisySampling, respectively.
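The four response-selection rules can be summarized in a short sketch. This is an illustration under our own assumptions: the state grid, the enemy radius value, and the function names are placeholders, and only the reward-zone structure (1 point within the radius, 0.25 within twice the radius) follows the task description in Methods.

```python
import numpy as np

rng = np.random.default_rng(1)
GRID = np.arange(0.0, 300.0, 1.0)   # discretized state space of the task

def expected_score(posterior, radius=9.0):
    """Expected score of each candidate response under the task's reward
    zones; radius=9.0 is an arbitrary placeholder, not the task's value."""
    dist = np.abs(GRID[:, None] - GRID[None, :])
    reward = np.where(dist <= radius, 1.0,
                      np.where(dist <= 2 * radius, 0.25, 0.0))
    return reward @ posterior

def select(posterior, strategy, sigma_m=1.0):
    """The four response-selection rules; `posterior` is p_t(s | x_t),
    normalized on GRID; sigma_m is the motor-noise standard deviation."""
    if strategy == "Max":          # optimal: maximize the expected score
        return GRID[np.argmax(expected_score(posterior))]
    if strategy == "Sampling":     # draw the response from the posterior
        return rng.choice(GRID, p=posterior)
    if strategy == "NoisyMax":     # Max plus fixed-width Gaussian noise
        return select(posterior, "Max") + rng.normal(0.0, sigma_m)
    # "NoisySampling": posterior sampling plus the same Gaussian noise
    return select(posterior, "Sampling") + rng.normal(0.0, sigma_m)
```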
In sum, we have described four response-selection strategies (Max, Sampling, NoisyMax, and NoisySampling) and seven inference strategies, of which five are deterministic (OptimalInference, IncorrectQ, τMean, τNodes, and τMaxProb) and two are stochastic (τSample and ParticleFilter). We can combine any inference strategy with any response-selection strategy: thus, we have at hand 4 × 7 = 28 different models, 27 of which are suboptimal, obtained from pairings of the inference and selection strategies. We label each of the 28 models by the combination of the two names referring to the two steps in the process, e.g., ParticleFilter+Sampling.
Table 1: Model fitting favors the ParticleFilter inference strategy with NoisyMax response selection. Ratios of the normalized mean squared error (NMSE) in each model to that of the best-fitting model, ParticleFilter+NoisyMax. Each model is a combination of an inference strategy (columns) with a response-selection strategy (rows). The second-best model, also a ParticleFilter but with a NoisySampling response-selection strategy, yields an NMSE 46% higher than that of the best-fitting model.
The 27 suboptimal models introduced in the previous section yield a range of discrepancies from the optimal behavior. The ways in which each deviation from the optimal model impacts behavior are examined in Methods; here, we ask how well these models account for the behavior of human subjects. Whereas the optimal model, OptimalInference+Max, computes the Bayesian posterior (OptimalInference) and selects the maximizing response (Max), the suboptimal models mimic cognitive limitations that may prevent the brain from reaching optimality: incorrect belief in the temporal structure of the signal (IncorrectQ); compressed representation of the Bayesian posterior, either deterministically (τMean, τNodes, and τMaxProb) or stochastically (τSample and ParticleFilter); and noise introduced in the response-selection step, with a width either scaling with that of the posterior (Sampling), or constant (NoisyMax), or a combination of the two (NoisySampling).

To evaluate the ability of each of these models to account for human behavior, we compare quantitatively their respective outputs with the responses of human subjects. For the three quantities we examine (the learning rate, the repetition propensity, and the standard deviation of the responses), we compute the normalized mean squared error (NMSE), sometimes referred to as the 'Fraction of Variance Unexplained' in the context of linear regressions. It is defined, for a given model and for each quantity, as the ratio of the mean squared error between the model output and the data to the variance of the quantity under scrutiny in the behavioral data. We fit each of our models to human data, using the average of the three NMSEs as our error measure. (We note that the OptimalInference inference strategy is a special case of all the other inference strategies except τMean; thus its NMSE cannot be lower than that of these strategies. Likewise, the Max and Sampling response-selection strategies are special cases of the NoisyMax and NoisySampling strategies, respectively.)
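As a short illustration of this error measure, the sketch below computes the NMSE and the three-measure average; the dictionary keys are our own labels for the three per-run-length curves.

```python
import numpy as np

def nmse(model_curve, data_curve):
    """Normalized mean squared error: MSE between model and data, divided
    by the variance of the data (a 'fraction of variance unexplained')."""
    m, d = np.asarray(model_curve), np.asarray(data_curve)
    return np.mean((m - d) ** 2) / np.var(d)

def fitting_error(model, data):
    """Average NMSE over the three summary statistics used in the text;
    `model` and `data` map measure names to per-run-length curves."""
    keys = ("learning_rate", "repetition_propensity", "std_responses")
    return np.mean([nmse(model[k], data[k]) for k in keys])
```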
We find that the five best-fitting models make use of stochastic compression in the inference step, through either the τSample approximation or the ParticleFilter approximation (Table 1). These models all reproduce the qualitative trends in the behavior of subjects with respect to our three measures: for the learning rate and the standard deviation, the 'smile shape' of the HD curve, which crosses a decreasing HI curve; for the repetition propensity, conversely, an inverted-U shape of the HD curve, which crosses an increasing HI curve (Fig. 9; results from the τSample+Max and ParticleFilter+Sampling models are not shown, but the corresponding curves are similar). The τSample and ParticleFilter strategies have one or two parameters, depending on whether they include motor noise. Other models, including all models with a deterministic inference step, have an error at least 30% higher than the best five models (and 2.45 times higher than the best model), despite the fact that other strategies come with up to five parameters (Table 1).

The best-fitting model is ParticleFilter+NoisyMax with N_P = 9 particles. The fitted standard deviation, σ_m, of the Gaussian motor noise is approximately equal to . pixels; as a consequence, in about half of the trials, the noise component is within the width of a pixel, and thus has no impact. The second-best model also follows a ParticleFilter inference strategy, with N_P = 14, combined with a NoisySampling response selection (with σ_m = 0. pixels).

The third and fourth best-fitting models use the τSample inference strategy, with N_τ = 1, and the NoisyMax (with σ_m = 0. pixels, for the third one) and the Max (for the fourth one) selection strategies. At any given trial, these two models retain only a single assumption, τ_t, on the run-length. Upon receiving a new stimulus, x_{t+1}, a model subject computes p_change = p_{t+1}(τ = 0 | x_{t+1}) and 1 − p_change = p_{t+1}(τ = τ_t + 1 | x_{t+1}), and decides whether there was a change point by sampling this simple, Bernoulli distribution. This sampling process, over a marginalization of the posterior, is similar to that in the particle-filter model, which samples over the full (s, τ)-dependent posterior.
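The Bernoulli decision just described can be written in a few lines. In this sketch, the two unnormalized branch masses are assumed to have been obtained from the Bayesian update restricted to the two retained run-length hypotheses; the function name and signature are ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def tau_sample_step(tau, m_change, m_stay):
    """One tauSample step with N_tau = 1 (a single run-length hypothesis).

    m_change and m_stay are the unnormalized posterior masses of the two
    branches of the Bayesian update (tau = 0 and tau = tau_t + 1); once
    normalized, they give p_change and 1 - p_change."""
    p_change = m_change / (m_change + m_stay)
    return 0 if rng.random() < p_change else tau + 1   # Bernoulli draw
```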
As a consequence of sampling, the τSample strategy also exhibits variability, which behaves in a fashion similar to the variability in the ParticleFilter strategy (Fig. 9, bottom right). As for response selection, we note that with the Sampling and NoisySampling selection strategies (instead of the Max and NoisyMax strategies), these models do not perform as well, and result in errors larger by 86% (Sampling vs. Max) and 96% (NoisySampling vs. NoisyMax). In fact, for all seven inference models, the NoisyMax response-selection strategy results in errors lower than or equal to (but more often, lower than) those of the other three selection strategies (Max, Sampling, and NoisySampling) (Table 1). This suggests that the variability in human responses does not originate from a posterior-sampling strategy in the response-selection step but, rather, from an intrinsically stochastic inference process. In order to seek further validation of this finding, we explore, below, a generalization of the Sampling strategy.
Robustness of the results
To substantiate the picture that emerges from the results summarized above, we perform two supplementary analyses. First, we investigate whether a generalized Sampling strategy yields smaller errors than the NoisyMax strategy. Second, we consider our choice of fitting-performance measure (the average of the NMSEs in the three quantities we examine), and we check for the robustness of model fitting to changes in the relative weights of each quantity in the fitting-performance measure.
Figure 9: Behavior of the three best-fitting models. In HI (blue curves) and HD (yellow curves) conditions, average learning rate (first column), repetition propensity (second column), and standard deviation of responses (third column), as a function of run-length, for the subjects (solid lines) and the three best-fitting models (dashed lines): ParticleFilter+NoisyMax (first row), ParticleFilter+NoisySampling (second row), and τSample+NoisyMax (third row).

Sampling from the posterior function is only one of many possible sampling strategies for response selection. Furthermore, in practice, sampling may be difficult to tease apart from maximizing a perturbed posterior function. [53] argue that, for some forms of random perturbation of a posterior probability density, maximizing the randomly perturbed function yields results similar to sampling from a modified posterior density function obtained as a power of the correct posterior: p_κ(s) ∝ p(s | x_t)^κ. To establish the equivalence, the exponent, κ, is chosen to be inversely related to the magnitude of the perturbing noise. Sampling from the modified posterior yields a behavior that interpolates between posterior sampling (for κ = 1) and maximizing (for κ → ∞); it yields a family of softmax operations over the posterior [54, 55]. Another interpretation of this sampling strategy is proposed by [2]: in the case of an integer κ and a Gaussian posterior, the mean of κ samples drawn from the posterior is a Gaussian random variable with a standard deviation equal to that of the posterior scaled by 1/√κ, i.e., its distribution is the posterior raised to the power κ, and normalized. Hence, in the Gaussian case, sampling from the exponentiated posterior can be interpreted as drawing κ samples from the unexponentiated posterior and taking their mean. We implement this strategy of response selection by sampling a modified posterior, which we denote κSampling.
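A minimal sketch of this κSampling rule, assuming a normalized posterior on a discretized state grid (the grid and names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)

def kappa_sample(grid, posterior, kappa):
    """Draw a response from the exponentiated posterior, p(s | x_t)^kappa.

    kappa = 1 recovers the Sampling strategy; kappa -> infinity approaches
    the Max strategy, yielding a softmax-like family over the posterior."""
    p = posterior ** kappa
    p /= p.sum()          # renormalize the exponentiated posterior
    return rng.choice(grid, p=p)
```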
We find that it performs better than the Sampling strategy, as expected, since the Sampling strategy is a special case of κSampling (with the parameter, κ, set to unity). However, for all seven inference models, the κSampling strategy, which has one free parameter, performs worse than the NoisyMax strategy, which also has a single parameter (Fig. 10B). Hence, a random, additive perturbation of the maximization strategy remains a better account of human behavior than a posterior-sampling strategy.
Figure 10: Model fitting is robust to the measure used for model comparison. Normalized mean squared error (NMSE) of fitted models. A. NMSE of models fitted to subjects' data, averaged over the three measures (learning rate, repetition propensity, and standard deviation of responses), grouped by inference model. B. NMSE between fitted models and subjects' data, averaged over two out of the three measures (Std + LearningRate, Repeat + LearningRate, and Repeat + Std).
Our results, which suggest that the variability in the responses of subjects originates in the inference step rather than in the response-selection step, rely upon the fitting-performance measure used for model comparison. We chose a measure that weighted equally the three NMSEs (on the learning rate, repetition propensity, and standard deviation), so as to obtain a model performing well on all fronts, but that choice was arbitrary. Hence, one may be concerned, for instance, that the goodness-of-fit of the ParticleFilter model is due to our choice of weighting of the errors. As a control, we computed the three 'two-measure errors', each excluding one of the three measures and averaging the errors in the two remaining ones. We found that, regardless of the choice of the combination, the relative order of the models in terms of performance stays identical, with only a few exceptions. Most importantly, the ParticleFilter remains, in all three cases of error combinations, the best-fitting model (Fig. 10B).
Another concern regarding the choice of the NMSE as our performance measure for model comparison is that it does not take into account the number of parameters in the models. A standard method to fit and compare models is to maximize the log-likelihood of each model and compute its Bayesian Information Criterion (BIC), which includes a penalty that grows with the number of parameters in the model [56]. In several of our models (and in many models in the literature), the responses in successive trials, conditioned on the stimuli presented to the subject, are independent; as a result, the log-likelihood over all trials is the sum of the log-likelihoods for each trial, taken separately. This obtains for all the models in which the inference strategy is deterministic (OptimalInference, IncorrectQ, τMean, τNodes, and τMaxProb). It does not apply, however, to the models with stochastic inference strategies (τSample and
ParticleFilter): in these models, successive responses, conditional on the observed stimuli, are not independent, as they depend on the realization of the stochastic process that governs inference. To compute the BIC, it is therefore necessary to compute, first, the distribution of the possible realizations of the stochastic inference process. The difficulty here lies in the fact that the space of these realizations grows exponentially with the number of trials in an experimental run.

In the context of our task, in which the subjects undergo 1000 trials in a run, an exact computation of the BIC is prohibitive. In order to circumvent this problem, we propose to approximate the log-likelihood of a model by way of a Monte-Carlo estimation of the log-likelihoods of short sequences of responses. This approach limits the computational load of the estimation while taking into account the sequential dependence of the responses. We report, here, the results of this estimation scheme using short sequences of 10 successive trials; in our investigations, we repeated the calculations for different choices, which yielded comparable results. We detail the procedure in Methods. We mention that, even though models with temporal correlations such as the particle filter have been used to capture cognitive processes, to the best of our knowledge Bayesian model selection using the BIC has not been applied to them, except in the case of a binary categorization task [57]. The approximate approach we propose may thus be of use beyond the confines of the specifics of our problem.
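The skeleton of such an estimator might look as follows. This is only our guess at the structure of the procedure (the actual one is detailed in the paper's Methods); the function sim_block_prob, which must return the joint probability of a block of responses for one fresh realization of the stochastic inference, is a placeholder. The bic function implements the standard definition, BIC = k ln n − 2 ln L̂.

```python
import numpy as np

def mc_loglik(stimuli, responses, sim_block_prob, n_sim=1000, block=10):
    """Monte-Carlo log-likelihood estimate over blocks of `block` trials.

    sim_block_prob(stim, resp) -> joint probability of the observed
    responses given the stimuli, for one realization of the stochastic
    inference process (placeholder name and signature)."""
    total = 0.0
    for start in range(0, len(stimuli), block):
        stim = stimuli[start:start + block]
        resp = responses[start:start + block]
        probs = [sim_block_prob(stim, resp) for _ in range(n_sim)]
        total += np.log(np.mean(probs))   # average over realizations
    return total

def bic(loglik, n_params, n_obs):
    """Standard BIC: penalize parameters, reward likelihood."""
    return n_params * np.log(n_obs) - 2.0 * loglik
```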
Figure 11: Bayesian model selection favors the ParticleFilter inference strategy. Difference between the BIC of each model and that of the best-fitting model, for the models combining one of the seven inference strategies with the NoisyMax or the NoisySampling response-selection strategies. The two best-fitting models make use of the ParticleFilter inference strategy.

In the models that do not feature a (Gaussian) motor noise, some responses of the subjects have a vanishing probability, and thus these models have an infinite BIC. Hence, we look at the BICs of the models equipped with the
NoisyMax or the NoisySampling response-selection strategies. We find that the three best-fitting models involve a stochastic approximation of the Bayesian inference: the two best-fitting models make use of the
ParticleFilter inference strategy, and the third best-fitting model uses a τSample inference strategy (Fig. 11). We note that with the NMSE metric the three best-fitting models were also the two ParticleFilter models followed by a τSample model. Thus, both model-fitting approaches suggest that human inference evolves according to a stochastic compression of the posterior. The best-fitting model is ParticleFilter+NoisySampling, with N_P = 8 particles, and its BIC is smaller than that of the second-best-fitting model, the ParticleFilter+NoisyMax model with N_P = 4 particles. This result is consistent with the best-fitting numbers of particles obtained when minimizing the NMSE, which were also relatively modest, although slightly larger (N_P = 14 with NoisySampling and N_P = 9 with NoisyMax).

Taken together, our results suggest that variability in human behavior, at least in the context of our task, is dictated primarily by stochasticity in the inference step (i.e., in the manipulation and update of probabilities) rather than by 'output noise' such as stochasticity in the response-selection step or motor noise. This view agrees with the conclusion of a recent study of a cue-combination task [58]; its authors argue that a "dominant fraction" of human choice suboptimality arises from random fluctuations in the inference step.
This study investigates the behavior of human subjects in an online inference task, and examines mechanisms that can account for behavioral trends found in the experimental data. An important aspect of this task is that it makes use of both a history-independent (HI) condition with no temporal structure, and a history-dependent (HD) condition in which a hidden state is almost periodic and, hence, highly structured in time (Fig. 1D). We find that subjects display different behaviors in the two conditions, adapting their learning rate to the temporal structure of the hidden state. We also note a propensity in subjects to repeat their response in consecutive trials; this repetition propensity increases with the run-length, and in the HD condition drops again for larger run-lengths. Moreover, we observe that subjects exhibit a greater variability in their responses shortly after a change point, in both conditions, and at long run-lengths in the HD condition; i.e., the variability in behavior also depends on the temporal statistics of the stimuli.

The distinctive behaviors of the learning rate and the repetition propensity in the HI and HD conditions are reproduced qualitatively by a Bayesian model of inference which yields optimal updates of the probability density of the hidden state. As for the variability in subjects' responses, we find that its behavior is similar to that of the standard deviation of the Bayesian posterior. We therefore use the Bayesian model as a starting point to elaborate variant models which can account for the trends exhibited in human responses. We find that the variability in human behavior, and its dynamics, can be reproduced by suboptimal models in which inference is executed in a stochastic manner. Specifically, the τSample and the ParticleFilter models alter the optimal inference step by maintaining an approximate version of the posterior, by means of random sampling. This alteration of the optimal model at once introduces variability in the behavior and relieves the memory capacity, through sampling either in the 'time dimension' (in the τSample model) or in the 'time and space dimensions' (in the ParticleFilter model).

The behavioral patterns that arise in our task in the HI condition are also found in other experiments. [25] and [26] both conducted an online inference task, with change points that occurred with constant probability (similarly to our HI condition). We examined the responses of their subjects in the context of the respective tasks, and we observed very similar behavioral trends: the learning rate decreases as a function of the run-length, while the repetition propensity increases. As for the empirical standard deviation of subjects' responses, we note that in these studies some subjects were presented several times with the same sequence of stimuli, in different sessions, thus allowing for the examination of the variability of responses within subjects. We find, here also, that the within-subject standard deviation of responses shows the same modulations as the standard deviation of the Bayesian posterior (Supplementary Fig. 21).
In order to make appropriate decisions in relation to their environment, humans and animals must infer the state of the surrounding world on the basis of the sensory signals they receive. If these signals are noisy and if the environment is changing, their inference task is complicated by the fact that a new stimulus may reflect either noise or a change in the underlying state. However, if events in the world present some kind of temporal structure, such as in our HD signal, it is possible to use this structure to refine one's inference. Conversely, if events follow a Poisson process, as in the HI signal, their occurrences present no particular temporal structure, and what just happened conveys no information on what is likely to happen next. Hence, there is a fundamental difference between the HI and HD conditions, which impacts the inference of an optimal observer.

Many natural events are not Poisson-distributed in time, and exhibit strong regularities. [27], [28, 29], and [30] have recorded the motor activity of both rodents and human subjects over the course of several days. In both species, they found that the time intervals between motion events were distributed as a power law, a distribution characterized by a long tail, leading to bursts, or clusters, of events followed by long waiting epochs. The durations of motion episodes also exhibited heavy tails. These kinds of distributions are incompatible with Poisson processes, which yield exponentially distributed inter-event epochs. Moreover, both rodent and human activity exhibited long-range autocorrelations, another feature that cannot be explained by a Poisson process. A particular form of autocorrelation is periodicity, which occurs in a wide range of phenomena. In the context of human motor behavior, walking is a highly rhythmical natural activity [31, 32]. More complex patterns exist (neither clustered nor periodic), such as in human speech, which presents a variety of temporal structures, whether at the level of syllables, stresses, or pauses [33, 34, 35]. In all these examples, natural mechanisms produce series of temporally structured events. The ubiquity of history-dependent statistics of events in nature begs for explorations of inference mechanisms in their presence. For the purposes of our experiment, we chose an idealized temporal signal that combined several advantages: it featured a prominent form of history dependence, approximate periodicity; it was not easily distinguishable from the other, history-independent signal used in the task; and it was amenable to modeling.

In studies of perception and decision-making, in both humans and animals, history-dependent signals have been used widely. In a number of experiments [36, 37, 38, 10, 11], a first event (a sensory cue, or a motor action such as a lever press) is followed by a second event, such as the delivery of a reward, or a 'go' signal triggering the next behavior. The time elapsed between these two events, the 'reward delay' or the 'waiting time', is randomized and sampled from distributions that, depending on the study, vary in mean, variance, or shape. For instance, both [36] and [37] use unimodal and bimodal temporal distributions. Because of the stochasticity of the waiting time, the probability of occurrence of the second event varies with time, similarly to the probability of a change point in our HD condition; these studies explore whether variations of this probability are used by human and animal subjects. [36]'s recordings from the V4 cortical area in rhesus monkeys indicate that, for both unimodal and bimodal waiting-time distributions, the attentional modulation of sensory neurons varies consistently with the event probability. [37] note that the reaction times of macaques are inversely related to the event probability, for both unimodal and bimodal distributions, and that the activity of neurons in the lateral intraparietal (LIP) area is correlated with the evolution of this probability over time. [38] manipulate another aspect of the distribution of reward delays: between blocks of trials, the standard deviation of this distribution is varied, while the mean is left unchanged. Mice, in this situation, are shown to adapt their waiting times to this variability of reward delays, consistently with a probabilistic-inference model of reward timing.

Akin to the tasks just outlined are 'ready-set-go' time-reproduction tasks, in which subjects are asked to estimate the random time interval between 'ready' and 'set' cues, and to reproduce it immediately afterwards. [10] and [11] show that human subjects combine optimally the cue (consisting in the perceived ready-set interval) with their prior on the interval length. Different priors are learned in training runs: they differ by the variances of the interval distributions [10] or by their means [11]. In both cases, subjects integrate the prior in a fashion consistent with Bayesian inference. Adopting a different approach, [39] show that attentional resources can be dynamically allocated to points in time at which input is expected: when asked to detect auditory stimuli (beeps) of low intensity embedded in a continuous white noise, human subjects perform better when detecting periodic beeps rather than random beeps, suggesting that they are able to identify the temporal regularity and use it in their detection process.

In all these studies, the event of interest has a probability of occurrence that varies with time. The resulting temporal structure in the signal appears to be captured by human and animal subjects, and reflected in behavior and in its neural correlates. Various probability distributions used in the reported tasks can be compared directly to our HD, sigmoid-shaped change probability, with adjusted parameters. In line with these studies, our results confirm that human subjects adapt their behavior depending on the temporal structure of stimuli. Additionally, we provide a comparison between two different conditions: an HD condition akin to a 'jittered periodic' process, and the Poisson, HI condition; the latter produces a memoryless process. Importantly, it plays the role of a benchmark from the point of view of probability theory: in discrete time it yields a geometric distribution, and in continuous time it yields an exponential distribution; both distributions maximize the entropy, subject to the constraint of a fixed event rate. In this study, we compared a specific, temporally structured HD condition to this benchmark, HI condition.
Our first observation is that the average learning rate of subjects and their repetition propensity are captured by a Bayesian model. The Bayesian paradigm has been viewed as an extension of logic that enables reasoning with propositions whose truth or falsity is uncertain [59, 60]. In cognitive science, it has successfully accounted for a wide range of observations, including cue combination in humans [5, 4, 1, 61, 3, 6, 7, 2], sensorimotor control [8, 9, 62], integration of temporal statistics [37, 36, 10, 11, 38], perceptual multistability [63, 41, 42], and various aspects of cognition [64, 65, 66, 40, 67].

The literature on Bayesian online inference in cognition, where belief is updated iteratively as a function of incoming information, is growing; examples can be found in word segmentation [68], sentence processing [69], and conditioning [70], as well as in the change-point literature [16, 17, 15, 21, 22, 23, 25, 26]. In change-point tasks, subjects are presented with a long sequence of consecutive inference problems (1000 of them, in our case). Each trial is a slightly different task, in which one has to handle the uncertainty resulting from the belief distribution, from the signal likelihood, and from the possibility of a change point. The latter, in the HD condition, bears the added complexity of a change-point probability, q(τ), that depends on the time of the last change point. How these uncertainties are handled determines the behavior, and in particular to what extent an observer reacts to a new stimulus: either shift the estimate towards it, or not move at all. We quantify this response through the learning rate and the repetition propensity, and we find that the ideal Bayesian observer and the subjects obey similar trends (Fig. 5A-B).

The success of the Bayesian paradigm, however, is limited, and comes with three shortcomings. First, subjects do not behave quantitatively like the ideal Bayesian observer, and hence there remains unexplained suboptimality. Second, we find variability in the responses of subjects (Fig. 5C), an observation incompatible with optimal Bayesian inference. Third, inference problems in the real world are complex and high-dimensional, rendering Bayesian reasoning computationally heavy and memory-intensive. This suggests that humans use approximations when carrying out inference and estimation tasks [71, 72, 73]. These three observations call for the investigation of alternatives to the optimal Bayesian paradigm. Our study explores several scenarios.

Although the behavior of the subjects and that of the Bayesian model differ in that the former exhibits variability while the latter is deterministic, the temporal modulation of the human variability follows a course similar to that of the standard deviation of the Bayesian posterior (Fig. 5C), and both the standard deviation and the skewness of the distribution of subjects' responses are correlated with those of the Bayesian posterior (Fig. 6). Therefore, it is natural to propose that response selection operates by sampling the Bayesian posterior instead of maximizing the expected score. Decision by posterior sampling, or 'probability matching', has been suggested by other decision-making experiments [74, 75, 76] and, more recently, by perceptual experiments [77, 2, 41]. Although close to optimal in some specific paradigms [78], sampling is suboptimal in the context of our behavioral task. When fitting models to human data, we observe that the
ParticleFilter+Sampling and the ParticleFilter+NoisySampling models yield larger NMSEs than the ParticleFilter+NoisyMax model (the best-fitting model), by 90% and 48%, respectively. More generally, we observe that for each of our seven inference strategies, the NoisyMax response-selection strategy results in better fits (lower NMSE) than the Sampling and the NoisySampling strategies (Fig. 10A). Relaxing the Sampling model by allowing the posterior to be exponentiated before sampling, as in the κSampling strategy, does not yield better fits than the NoisyMax strategy either. We conclude that posterior sampling accounts less successfully for our data than a simple perturbation of the optimal maximization strategy by an additive, fixed-width, Gaussian noise.

Ad hoc repetition probability
Aside from posterior sampling and motor noise, the so-called 'rational inattention' approach, which has been gaining ground in economics [79, 80, 81], suggests a different account of the variability of responses in decision-making tasks. In complement to the study of Bayesian and approximate-Bayesian models, we have examined models inspired by that approach. We summarize, here, our results, and provide a more detailed discussion in Methods. Rational-inattention models posit the existence of a cognitive cost which prevents subjects from making optimal decisions. In a standard formulation of the approach, this cost is assumed to be proportional to the mutual information between a subject's mental representation (of quantities relevant to produce responses) and the external variables relevant to the decision (here, the sequence of presented stimuli). The subject optimizes the 'information structure', i.e., the distribution of the mental representation conditional on the observed stimuli, under the cognitive cost. The optimal distribution of responses depends on both the (prior) distribution of stimuli and the form of the reward function. We implement this model and compute its BIC (see Methods for details). We find that it is much larger (by 29,656) than the BIC of the OptimalInference+NoisyMax model, which itself has a larger BIC than most of our other models (Fig. 11). Hence, a direct application of a rational-inattention approach does not provide a better account of the behavioral data than the addition of a Gaussian noise in response selection following optimal inference.

Faced with a similar issue, [26] introduced a variant of the rational-inattention model that is particularly relevant to our study, as it applies to a sequential inference task with (history-independent) change points. In this variant, the response selection is split into a two-stage decision process: first, the subject decides whether to repeat the previous response; second, only if the decision is made not to repeat, the subject chooses the location of a new response; and both decision stages are subject to cognitive costs. This presents a difference with the models that we have analyzed so far, in that an ad hoc probability of repetition is included explicitly in the model, whereas our approach, instead, was to study deviations from optimality that resulted from deterministic or stochastic approximations of a Bayesian scheme.

We have analyzed our data using a model similar to the one proposed by [26]. In order to evaluate the relevance of a rational-inattention information structure, we have also studied a model that includes a two-stage decision process but does not involve cognitive costs. Specifically, this model combines the OptimalInference strategy with a strategy of response selection in which at each trial the model subject chooses, with fixed probability, whether to repeat the previous response. The probability of a repetition is constant in this last model, whereas in the rational-inattention model it depends on the stimulus history and on the location of the slider at the beginning of the trial.

These two models yield a BIC lower than that of our previously best-fitting model, suggesting that a two-stage response-selection process is worth considering as a candidate mechanism for sequential decision-making. The previously best-fitting model, however, makes use of the ParticleFilter inference strategy, whereas the models just presented rely on the OptimalInference strategy. Hence, we implement an array of models that combine the same two-stage response-selection processes with, instead, the ParticleFilter inference strategy. With this inference strategy, the model in which the subject chooses whether to repeat with a fixed repetition probability results in a lower BIC (by 1,398) than the model in which the repetition probability is governed by a rational-inattention cognitive cost. Moreover, as in the other analyses conducted above, the models with a ParticleFilter inference strategy all yield substantially lower BICs than their counterparts that make use of the OptimalInference strategy. It thus appears that while the introduction of an explicit repetition probability in models improves their explanatory power, deriving this repetition probability from a simple form of cognitive cost does not provide a better account of the behavioral data than positing a fixed repetition probability.

An explicit repetition probability alone is insufficient to capture human inference in our task. Instead, stochastic compression of beliefs, as illustrated by stochastic pruning or particle filtering, results in a closer match with experimental observations. (We provide details on the rational-inattention models, the fixed-repetition-probability models, and their BICs in Methods.)
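For illustration, the fixed-repetition-probability variant of the two-stage selection just described can be sketched as follows; the stage-2 choice of a new location (here, the posterior maximum) and the numeric value of p_repeat are our own simplifying assumptions, since in practice this parameter is fitted to the data.

```python
import numpy as np

rng = np.random.default_rng(4)

def two_stage_select(grid, posterior, prev_response, p_repeat=0.3):
    """Two-stage response selection with a fixed repetition probability.

    Stage 1: repeat the previous response with probability p_repeat.
    Stage 2: otherwise, choose a new location (here, the posterior
    maximum); p_repeat = 0.3 is an arbitrary illustration value."""
    if rng.random() < p_repeat:
        return prev_response
    return grid[np.argmax(posterior)]
```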
After rejecting the rational-inattention and the sampling hypotheses for response selection, we are left with an unexplained modulation of the variability: specifically, the relation between the magnitude of behavioral variability and the width of the Bayesian posterior at successive times (Fig. 4). Noisy maximization, which makes use of an additive random perturbation with fixed variance, leads to behavioral variability with constant variance; it is thus insufficient to explain the experimental observations. If modulated variability does not originate in the response-selection step, it must derive from the inference step. Out of the models we consider, the five best-fitting models implement either the ParticleFilter or the τSample inference strategy, both of which rely on sampling during inference. These strategies capture the trends in the variability of human responses, in both HI and HD conditions (Fig. 9).

Both these inference strategies reduce the memory load in the inference problem by stochastically trimming the posterior, in a fashion akin to the 'pruning' model proposed by [82] in the context of a decision-tree task. In their decision model, the evaluation of a possible sequence of decisions is more likely to be curtailed (thus alleviating the dimensionality of the problem) if it appears to have a low value. Similarly, the ParticleFilter and the τSample inference strategies ignore with a higher probability possible run-lengths and states that are less likely to be correct. Furthermore, we note that the τMaxProb inference strategy also relies on pruning unlikely run-lengths, but it deterministically eliminates the most unlikely, in contrast to its stochastic counterpart, τSample, which yields a better fit of the data. In explore-exploit problems, 'Thompson sampling' [83] refers to a strategy in which one 'explores' by randomly choosing an action with the probability that this action maximizes the reward (instead of deterministically choosing the action most likely to maximize the reward). Several studies have reported that the responses of human subjects in explore-exploit tasks appeared consistent with Thompson sampling [84, 85, 86]. In this perspective, the stochastic pruning of the posterior in our best-fitting models appears as an exploration strategy, deployed during inference. Beyond the specifics of the pruning or exploration mechanisms, the main conceptual point, here, is that behavioral biases may result from an approximation to a Bayesian scheme that relieves the memory load. A similar picture has been advanced in the context of prediction tasks, where 'over-reaction' (effectively a biased, enhanced learning rate) results from the compression of information stored in memory [87, 88, 89].

Aside from stochasticity in the inference step and in the response-selection step, noise in the sensory observation is a possible alternative account of behavioral variability, discussed, among others, by [90] and [58]. The design of our experiment, however, minimized perceptual ambiguity: the stimulus we presented to the subject at each trial was a white dot that clearly contrasted with the background, and which remained on the screen until the subject responded (the subject was thus free to look at it for as long as he or she wished). By contrast, in the two studies just mentioned, the stimuli consisted of low-contrast gratings presented for one second or less. We presume that our experimental design limited perceptual noise.

It would nevertheless be interesting to know whether and how perceptual noise may contribute to behavioral variability in a sequential experiment, and how it may couple with stochasticity in inference. A possibility is that the magnitude of perceptual noise is constant throughout the task, in which case it would be expected to contribute an equal amount of variability at all run-lengths; if so, it would not account for the modulations of variability that we record in our task. Another possibility is that perceptual noise itself adapts dynamically during the task. Under this hypothesis, we speculate that the magnitude of perceptual noise would decrease if uncertainty increases; if so, it would result in an effect on behavioral variability opposite to the observed effect. Although we cannot exclude that an effect of this nature is at play, it does not appear to offset completely the modulations of the variability, which can be understood in terms of an approximate Bayesian inference. In sum, in the present setting of the experiments, a natural explanation of the behavioral variability in terms of perceptual ambiguity seems unlikely.
Comparing the behavior of the best-fitting model and that of the subjects, we note that there remain discrepancies between the two, particularly at long run-lengths in the HD condition. The increase in the subjects' learning rate, and the reduction in their repetition propensity, at these run-lengths and in this condition, are not as sharp as those of the best-fitting model (Fig. 9). A candidate explanation of these deviations is that the subjects hold an inexact belief on the shape of the change probability, q(τ), as a function of the run-length. The analysis of the IncorrectQ inference strategy, in Methods, examines the case of a model subject who believes that the change probability increases more slowly (λ < 1) than it actually does and that the average interval length is greater (T > 10) than it actually is in our task. The learning rate of this model subject does not increase as abruptly, and the repetition propensity does not decrease as quickly, as those of the best-fitting model, similarly to the behavior of the actual subjects (Fig. 12, middle panels). Moreover, among the five deterministic inference strategies considered, IncorrectQ is the best-fitting strategy, regardless of the response-selection strategy it is combined with, and with both model-comparison measures: NMSE (Table 1) and BIC (Fig. 11). This suggests that, aside from the stochastic compression of the posterior, the subjects' deviations from optimality may also result, to some extent, from an incorrect belief in the temporal structure of the signal.
The ParticleFilter strategy is noteworthy in a number of respects. First, it is our best-fitting model. Second, it constitutes a generic approach to inference; it was reported to account successfully for other inference and learning behaviors, such as category learning [91, 92], conditioning in pigeons [70], sentence processing [69], hidden-state inference [14], and visual tracking of multiple objects [93]. Third, out of all the models we consider, it is by far the least demanding on memory: with nine particles, one needs to store 27 numbers (for s, τ, and the weight of each particle) in memory. As a comparison, the optimal model stores a discretized probability distribution over the (s, τ) space, which amounts to about 16,000 numbers. (The optimal posterior could be well approximated with less memory-intensive methods, but this would require further hypotheses.) Previous uses of particle-filter methods in the context of a variety of cognitive tasks yielded best-fitting numbers of particles which ranged from one to several hundred: from one to 400 particles, with a mean of 56, in [14]; 130 particles (but 70 when subjects simultaneously perform a distractor task) in [94]; around 20 particles in [69]; 20 particles in [22]; and as few as one particle in [70] and [92]. In addition, [95] consider only the case of a single particle. We note that our analysis of the ParticleFilter inference strategy, detailed in Methods, reveals that a model with just one particle fails to reproduce the decreasing learning rates, in the HI condition, and the smile shape of the learning rates, in the HD condition, while models with two or more particles do capture these behavioral trends (Fig. 15B). Hence, in contrast to the last three studies cited, we find that a particle-filter model with a single particle is qualitatively inconsistent with the behavior of human subjects, at least in the context of our task.

A fourth aspect of particle filters is that they provide a natural interpretation of the high repetition propensity observed in subjects (Fig. 5B). As the support of the probability distribution is reduced to N_P = 9 points on the (s, τ) plane, there is a fair chance that the posterior-maximizing particle at trial t, (s_t, τ_t), remains the posterior-maximizing particle at trial t+1. Hence, the response s_t is likely to be repeated. In a similar spirit, particle filters have been shown to account for order effects in category learning [92] and for observations about online sentence comprehension (such as the processing of 'garden-path sentences' [69]).

The success of particle filters, also known as sequential Monte Carlo methods, in accounting for human behavior in an online inference task adds to a growing literature on sample-based representations in cognitive processes [40, 41, 42, 43]. Monte Carlo methods, which approximate probability distributions with sets of samples, constitute a major element of a family of techniques used in machine learning to address a wide range of problems (inference, optimization, numerical integration, etc.); they have also been put forth as candidate cognitive algorithms [96, 72]. Moreover, they account for a range of cognitive biases in the laboratory, such as base-rate neglect, conjunction fallacy, and the unpacking effect, as well as for human performance in complex, real-world tasks, and for specific observations such as response variability and autocorrelation in perception and reasoning tasks [42, 73]. At the implementation level, sample-based representations are well suited to learning in neural networks [97]. Here, the variability in neural activity can be interpreted in terms of sampling-based representations of probability [98, 97, 99, 42], and a number of neural-network models performing probability sampling have been proposed [100, 41, 101, 102, 103].

The computer-based task was programmed and run with Psychopy [104]. In this task, white dots appeared on a horizontal line in the middle of a grey screen. Subjects were told that these white dots were snowballs thrown by a hidden person, the 'enemy' (also located on the horizontal line). The horizontal location of a snowball was the stimulus, x_t, and the position of the hidden person was the state, s_t. The state space was arbitrarily chosen to be [0, 300]; this scale did not appear on the screen. By clicking with a mouse (whose pointer moved on the horizontal axis only), subjects could indicate where they thought the hidden person was (i.e., give their estimate, ŝ_t, of the state). The time of response was not constrained. A green dot provided visual feedback of the location of the click. After 100 ms, a new white dot appeared, starting the next trial (Fig. 1A-B). If a subject's 'shot' was within a fixed distance around the state (the radius of the enemy), the subject was rewarded with 1 point. If the shot was 'outside the enemy' but within a distance equal to twice the enemy radius, the reward was 0.25 point (Fig. 1E). Otherwise, the reward was zero. Subjects were not informed of the reward immediately after each shot, as this would have provided additional information on the location of the state. The total score was given every 100 trials, to allow for an assessment of average self-performance and to foster motivation.

We ran the computer-based task on 30 paid subjects; all gave informed consent. The study was approved by Princeton University's Institutional Review Board for Human Subjects. The sample size was determined so as to be comparable to that used in similar experiments [16, 18]. Four subjects performed significantly worse than the others: their average error, defined as the absolute difference between their estimate and the state, |ŝ_t − s_t|, was 10.4 (standard deviation (s.d.): 0.93), while the average error of the other 26 subjects was 6.5 (s.d.: 0.62). Because of this difference of more than 5 standard deviations, these four subjects were excluded from the analyses. Hence, a total of 26 subjects were included in the analyses. Our conclusions remain unchanged if all 30 subjects are included in the analyses (see Supplementary Fig. 20).

The stimulus, x_t, was generated around the state, s_t, according to the likelihood probability, g(x_t | s_t), which was chosen to be triangular, centered at s_t, and of half-width 20. The state, s_t, was piecewise-constant with respect to time, i.e., constant in the absence of a change point. In the HI condition, the probability of a change point, q_t, was constant and equal to 10%. In the HD condition, q_t depended on the run-length, τ_t, defined as the number of trials since the last change point, and had a sigmoid shape: q_t(τ_t) = 1/(1 + e^{−(τ_t − 10)}). At τ_t = 0 (i.e., immediately after a change point), the probability of another change point was very small. Six trials after a change point it was still small, less than 2%, before growing appreciably (50% at τ_t = 10, 95% at τ_t = 13).
This led to more regular intervals between change points than in the HI condition, with a change point roughly every 10 trials (Fig. 1C-D). The average number of trials between two change points in both conditions was 10. At a change point, the state randomly jumped to a new state, s_{t+1}, according to the state-transition probability, a(s_{t+1} | s_t). This distribution was chosen to be bimodal, symmetric, and centered at s_t (two triangles of half-width 20 each, centered at s_t ± d, where d = 25; Fig. 1E). This prevented new states from being too close to, or too far from, the previous state, which would have made change-point detection too difficult or too obvious.
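The two generative processes just described can be summarized in a short simulation sketch. The parameter values follow the task description above; the handling of states falling outside [0, 300] (clipping) is our own assumption, as the paper does not specify the boundary behavior.

```python
import numpy as np

rng = np.random.default_rng(5)

def generate_signal(n_trials=1000, hd=True, lam=1.0, T=10.0,
                    d=25.0, w=20.0, lo=0.0, hi=300.0):
    """States and stimuli of the task; hd=False gives the HI condition,
    with constant change probability q = 1/T = 0.1."""
    s, tau = rng.uniform(lo, hi), 0
    states, stimuli = [], []
    for _ in range(n_trials):
        q = 1.0 / (1.0 + np.exp(-lam * (tau - T))) if hd else 1.0 / T
        if rng.random() < q:                 # change point: bimodal jump
            s = s + rng.choice([-1.0, 1.0]) * d + rng.triangular(-w, 0.0, w)
            s = min(max(s, lo), hi)          # keep s within [0, 300]
            tau = 0
        else:
            tau += 1
        states.append(s)
        stimuli.append(s + rng.triangular(-w, 0.0, w))   # x_t ~ g(x | s_t)
    return np.array(states), np.array(stimuli)
```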
All subjects did the task in both HI and HD conditions; 14 started with the HD task and 16 started with the HI task (no significant differences were found in the results between these two groups). Subjects were not told the specificity of each condition. An explanatory text indicated that there were "differences" between the conditions, but no further indications were given. Each condition started with a series of explanations and tutorial runs. In a first tutorial run, the enemy (i.e., the state) was visible and moved according to the current (HI or HD) condition, and successive snowballs appeared without any action from the user (as in passive video viewing). In a second run, the enemy was still visible and subjects had to click at each trial, after which the next snowball would appear. This run was a very simple version of the actual task, because subjects were seeing the state. In a third run, the half-width of the triangular likelihood, g(x_t | s_t), was 10, i.e., half the value it took in the actual task. In this run, the state was not visible, except after a change point: upon the occurrence of a change point, the position of the state before the change point was shown, along with the shots of the subject since the previous change point. This run had two goals: first, to emphasize the timing of change points, and, second, to allow for self-performance assessment and to illustrate that a strategy consisting in 'following the white dots', i.e., clicking on the stimuli, was inefficient. A fourth tutorial run was an 'easy' version of the actual task: the state was always hidden, but the likelihood, g, had a half-width of 10. A fifth and last tutorial run reproduced the third run, but with the likelihood, g, of half-width 20. During the task, 15 subjects (7 among the HD-first group and 8 among the HI-first group) were also shown the positions of past stimuli, as white dots with decreasing contrast, gradually merging with the grey background (Fig. 1). The other 15 subjects were not shown past stimuli. No significant differences were found in the data between the two groups. The number of stimuli presented in the tutorial runs totaled 297 for each condition. During the actual task, there were 1000 trials in each condition, leading to a total of 2000 data points per subject.

As subjects did not know the true run-length, τ, we computed an empirical run-length, τ̃, based on the responses of subjects. Whereas the true run-length is defined as the number of trials since the last change point, the empirical run-length is defined as the number of trials since the last 'large correction'; a large correction is defined as a correction with absolute value larger than the 90th percentile of corrections. This percentile level is chosen in relation to the average frequency of change points, 1 for every 10 trials, in both HI and HD conditions. On some occasions, a subject 'misses' a change point: the run-length and the empirical run-length, consequently, differ; for instance, τ̃ = 10, while τ = 0 or 1. In such a case, because the change point did occur, the subject experiences a large surprise and is thus likely to subsequently opt for a large correction, i.e., to increase the learning rate. In the HD condition, because of the temporal statistics of change points, this situation is more likely to occur at empirical run-lengths around 10. Hence, this effect could bias the learning rates to higher values at these empirical run-lengths, in this condition. This effect, however, does not originate in the inference process, but rather in the temporal statistics of the HD signal. In other words, even an observer whose inference algorithm is not adapted to the HD condition would have higher learning rates at empirical run-lengths around 10. In the results presented, we removed all trials with a (true) run-length of 0 or 1, in order to avoid this artifact.

In order to provide statistical evidence of the smile shape of the learning rates in the HD condition (and of the absence of a smile shape in the HI condition), we regress the learning rate on the run-lengths, with a quadratic term. For the HD condition, we find that the coefficient of the quadratic term is positive and significantly different from zero (0.0046; p-value = 1e-11). For the HI condition, this coefficient is smaller and we cannot reject, at a significance level of 5%, the null hypothesis that it is zero (0.0013; p-value = 0.068). Moreover, the difference between the two quadratic coefficients is statistically significant (F-test p-value = 5.7e-4). The coefficient of the linear term is significantly negative (p-value < 1e-2).
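The empirical run-length defined above can be computed from a subject's sequence of responses in a few lines; this is an illustrative reading of the definition, and the function name is ours.

```python
import numpy as np

def empirical_run_lengths(responses, pct=90):
    """Empirical run-length: trials since the last 'large correction',
    a correction with absolute value above the pct-th percentile."""
    corrections = np.abs(np.diff(responses))
    threshold = np.percentile(corrections, pct)
    tau_emp, out = 0, [0]
    for c in corrections:
        tau_emp = 0 if c > threshold else tau_emp + 1
        out.append(tau_emp)
    return np.array(out)
```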
We derive the Bayesian update equation for a general case with q = q(s_t, τ_t), p(x_t | s_t, τ_t) = g(x_t | s_t, τ_t), and p(s_{t+1} | τ_{t+1} = 0, s_t, τ_t) = a(s_{t+1} | s_t, τ_t), which includes the case used in our task, with q = q(τ_t), g = g(x_t | s_t), and a = a(s_{t+1} | s_t).

Our goal is to obtain an update rule for the posterior, p_t(s, τ | x_t), upon the observation of a new stimulus, x_{t+1}. Bayes' rule yields

$$p_{t+1}(s, \tau \,|\, x_{t+1}) = \frac{1}{Z_{t+1}} \, g(x_{t+1} \,|\, s, \tau) \, p_{t+1}(s, \tau \,|\, x_t), \qquad (4)$$

where Z_{t+1} = p_{t+1}(x_{t+1} | x_t) is a normalization constant. The third term in this product can be written as

$$p_{t+1}(s, \tau \,|\, x_t) = \sum_{\tau_t} \int_{s_t} p_{t+1}(s, \tau \,|\, s_t, \tau_t) \, p_t(s_t, \tau_t \,|\, x_t) \, \mathrm{d}s_t. \qquad (5)$$

The transition probability, p_{t+1}(s, τ | s_t, τ_t), is determined by q and a. An absence of change point occurs with probability 1 − q(s_t, τ_t), and in such a case a state (s_t, τ_t) evolves into the state (s_{t+1} = s_t, τ_{t+1} = τ_t + 1). In the case of a change point, an event which occurs with probability q(s_t, τ_t), possible states at t+1 have the form (s_{t+1}, τ_{t+1} = 0). Hence the transition probability from (s_t, τ_t) to (s, τ) at t+1 is given by

$$p_{t+1}(s, \tau \,|\, s_t, \tau_t) = \mathbb{1}_{\tau=0} \, q(s_t, \tau_t) \, a(s_t, \tau_t, s) + \mathbb{1}_{\tau=\tau_t+1,\, s=s_t} \, \big(1 - q(s_t, \tau_t)\big). \qquad (6)$$

Combining Equations (4), (5), and (6), we obtain the Bayesian update equation, as

$$p_{t+1}(s, \tau \,|\, x_{t+1}) = \frac{1}{Z_{t+1}} \, g(x_{t+1} \,|\, s, \tau) \Big[ \mathbb{1}_{\tau=0} \sum_{\tau_t} \int_{s_t} q(s_t, \tau_t) \, a(s_t, \tau_t, s) \, p_t(s_t, \tau_t \,|\, x_t) \, \mathrm{d}s_t \;+\; \mathbb{1}_{\tau>0} \, \big(1 - q(s, \tau - 1)\big) \, p_t(s, \tau - 1 \,|\, x_t) \Big]. \qquad (7)$$

In the special case with q = q(τ_t), a = a(s_{t+1} | s_t), and g = g(x_t | s_t), we obtain the slightly simpler Eq. (2). In addition, we note that, in the HI condition, the change probability is constant, q(τ) = q; in this condition, we can marginalize over the variable τ to obtain a closed recursion over the state posterior, as

$$p_{t+1}(s \,|\, x_{t+1}) = \frac{1}{Z_{t+1}} \, g(x_{t+1} \,|\, s) \left[ q \int_{s_t} a(s \,|\, s_t) \, p_t(s_t \,|\, x_t) \, \mathrm{d}s_t + (1 - q) \, p_t(s \,|\, x_t) \right]. \qquad (8)$$
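As a numerical illustration of Eq. (7), the sketch below carries out one update step on a discretized (s, τ) grid; the discretization, the truncation of the largest run-length bin, and the argument names are our own choices.

```python
import numpy as np

def bayes_update(post, x, q_tau, trans, lik_x):
    """One step of the update of Eq. (7) on a discretized (s, tau) grid.

    post: array p_t(s, tau | x_t) of shape (n_s, n_tau);
    q_tau: vector of change probabilities q(tau), length n_tau;
    trans[i, j] ~ a(s_i | s_j); lik_x: vector g(x_{t+1} | s) on the grid.
    Run-lengths beyond the last grid column are truncated in this sketch.
    """
    new = np.zeros_like(post)
    # Change-point branch: all (s_t, tau_t) mass flows to tau = 0 ...
    source = (post * q_tau[None, :]).sum(axis=1)   # sum over tau_t
    new[:, 0] = trans @ source                     # ... through a(s | s_t)
    # No-change branch: the state is unchanged and tau increments.
    new[:, 1:] = post[:, :-1] * (1.0 - q_tau[None, :-1])
    new *= lik_x[:, None]                          # multiply by g(x | s)
    return new / new.sum()                         # normalize (Z_{t+1})
```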
IncorrectQ model

In the IncorrectQ model, the two quantities governing the shape of q(τ), λ and T, are treated as free parameters, and we explore how varying these parameters impacts behavior, as compared to the OptimalInference model.

Keeping T constant at 10, and varying λ from 0 (HI condition) to 1 (HD condition), we find that the learning rate as a function of the run-length gradually morphs from the HI, monotonically decreasing curve to the HD, non-monotonic and 'smile-shaped' curve (Fig. 12B). A similar behavior obtains at any fixed value of T, with the difference that the minimum of the HD curve is shifted to smaller τ for smaller T, and to larger τ for larger T. In other words, for fixed T, a higher value of λ, i.e., a sharper slope of the change probability, leads to a higher learning rate at run-lengths comparable to T. (Note that, for T = 20, the minimum of the learning rate occurs at run-lengths beyond those shown, hence the non-monotonicity is not apparent in Fig. 12B.) Conversely, for a fixed λ > 0, the minimum of the learning rate occurs at a run-length comparable to T, which, precisely, determines when change points become likely. Finally, for λ = 0, the change probability is constant and there is no increase in the learning rates; these are, however, slightly higher for smaller T, because in that case the change probability, q = 1/T, is larger, so a new stimulus is more likely to be interpreted as stemming from a change point.

A subtlety that any analysis has to grapple with is that the statistics of responses depend not only on the inference process, but also, of course, on the statistics of the stimuli. To tease the two effects apart, for each IncorrectQ model subject (with differing values of λ and of T) we computed the response behavior in the presence of either signal: the HI signal, characterized by a constant change probability, q = 0.1, and the HD signal, characterized by a change probability, q(τ), varying with the run-length as a sigmoid with parameters λ = 1 and T = 10. We note that the impact on behavior of changing the signal is modest, as compared to the impact of changing the model subject's beliefs (Fig. 12B). This indicates that the discrepancy in human behavior in the HI and HD conditions does not originate primarily from the statistics of the signals, but rather from the different beliefs on the temporal statistics of the signals held by the subjects.

Paralleling the behavior of the learning rates, the repetition propensity in the HD condition peaks earlier or later depending on the value of T, and its shallowness depends on the value of λ (Fig. 12C). A belief in a shorter average inter-change-point interval, T, leads to a smaller repetition propensity: assuming frequent change points enhances the frequency of changes in one's estimate.

Human subjects correctly believe that q is not constant in the HD condition, and they use this belief in their inference process, but they may hold an inexact representation of the shape of q(τ) (Fig. 12). This, however, is not sufficient to capture the data quantitatively: subjects exhibit both higher learning rates and more frequent repetitions than in the optimal model (Fig. 5), an observation that cannot be explained by manipulating λ and T in the IncorrectQ model; in the latter, high learning rates are accompanied by a lower repetition propensity, and vice versa. Thus, leaving aside the issue of variability, an erroneous belief on the change probability, q(τ), is insufficient to model the experimental data.
Figure 12: Illustration of the IncorrectQ model with various beliefs on the shape of the change probability. A. Examples of beliefs in IncorrectQ models. Change probability, q(τ) (top line), and resulting inter-change-point interval distribution (bottom line), for constant q (left column), sigmoid-shaped q(τ) with slope parameter λ = 0.5 (middle) and λ = 1 (right); and for average interval length, T, of 6 (dashed line), 10 (solid), and 20 (dotted). The 'true' HI signals used in the task correspond to λ = 0, T = 10, while the HD signals correspond to λ = 1, T = 10. B, C. Average learning rate (B) and repetition propensity (C) as a function of the run-length, in IncorrectQ models performing optimal inference with various beliefs on the change probability, q(τ), and presented with HI signals (blue) and HD signals (orange).

τMean model

Derivation
This model is a generalization of the model introduced by [16]. The approximate joint probability of the state and the run-length, which we denote by ˜p_t(s, τ | x^t), is assumed, in this model, to vanish at all values of the run-length, except for one, which we call the 'approximate expected run-length' and which we denote by τ̄_t. Hence,

˜p_t(s, τ | x^t) = δ_{τ, τ̄_t} ˜p_t(s, τ̄_t | x^t),   (9)

where δ_{τ, τ̄_t} is the Kronecker delta. As in the optimal model (see Eq. (2)), we use Bayes' rule and the parameters of the task to derive the update equation, as

p_{t+1}(s, τ | x^{t+1}) = (1/Z_{t+1}) g(x_{t+1} | s) [ 𝟙_{τ=0} q(τ̄_t) ∫_{s_t} a(s | s_t) p_t(s_t, τ̄_t | x^t) ds_t + 𝟙_{τ=τ̄_t+1} (1 − q(τ̄_t)) p_t(s, τ̄_t | x^t) ].   (10)

This distribution is non-vanishing for two values of the run-length, 0 and τ̄_t + 1, which correspond to the two possible scenarios: with and without a change point at trial t. We use this distribution to compute the approximate expected run-length at trial t + 1, τ̄_{t+1}, and the approximate posterior at trial t + 1, ˜p_{t+1}(s, τ | x^{t+1}). First, we obtain the probability of a change point at trial t + 1,

Ω_{t+1} ≡ p_{t+1}(τ = 0 | x^{t+1}) = (1/Z_{t+1}) q(τ̄_t) ∫_s g(x_{t+1} | s) ∫_{s_t} a(s | s_t) p_t(s_t, τ̄_t | x^t) ds_t ds,   (11)

and we use it to compute the approximate expected run-length at trial t + 1:

τ̄_{t+1} = Ω_{t+1} · 0 + (1 − Ω_{t+1})(τ̄_t + 1).   (12)

Second, we approximate the posterior (Eq. (10)) by marginalizing it over the run-lengths, and multiplying the result by a Kronecker delta which takes the value 1 at τ̄_{t+1}:

p_{t+1}(s | x^{t+1}) = Σ_τ p_{t+1}(s, τ | x^{t+1}) = p_{t+1}(s, τ = 0 | x^{t+1}) + p_{t+1}(s, τ = τ̄_t + 1 | x^{t+1}),
˜p_{t+1}(s, τ | x^{t+1}) = δ_{τ, τ̄_{t+1}} p_{t+1}(s | x^{t+1}).   (13)

This model has no free parameter.
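A compact way to see the bookkeeping in Eqs. (10)-(13) is to carry only a state posterior together with the scalar τ̄_t. The sketch below is a minimal illustration under the same assumptions as the earlier grid sketch (Gaussian g, uniform change-point transition); q_belief stands for the believed change probability q(τ).

```python
def tau_mean_step(post_s, tau_bar, x, q_belief):
    """One tau-Mean update; post_s approximates p_t(s | x^t) at run-length tau_bar."""
    lik = g(x, S)
    cp = q_belief(tau_bar) * lik / len(S)            # tau = 0 branch of Eq. (10)
    nc = (1.0 - q_belief(tau_bar)) * lik * post_s    # tau = tau_bar + 1 branch
    Z = cp.sum() + nc.sum()
    omega = cp.sum() / Z                             # change-point probability, Eq. (11)
    tau_bar_new = (1.0 - omega) * (tau_bar + 1.0)    # Eq. (12)
    post_s_new = (cp + nc) / Z                       # marginal over tau, Eq. (13)
    return post_s_new, tau_bar_new

# Example belief: q_belief = lambda tau: q(tau, lam=1.0, T=10.0)
```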
Behavior

While the optimal marginal distribution of the run-lengths, p_t(τ | x^t), spans the whole range of possible values of the run-length, it is approximated, in the τMean model, by a delta function on a single value, τ̄_t. In the HI condition, the change probability, q, does not depend on the run-length, τ; the approximate joint distribution evaluated at τ = τ̄_t, ˜p_t(s, τ̄_t | x^t), is equal to the optimal posterior distribution on the state, p_t(s | x^t). (Compare Eq. (8) to the combination of Eqs. (10) and (13).) As a result, the τMean model computes the optimal posterior on the state, and, thus, the responses in this model are the same as those in the optimal model, i.e., the τMean model is optimal in the HI condition. In the HD condition, by contrast, the change probability, q(τ), depends on the run-length. The τMean model evaluates this function at only one run-length, τ̄_t, an approximation of the mean run-length; as compared to the optimal model, it fails to capture fully the consequences of the dependence of the change probability on the run-length (Fig. 1D). The learning rates in this model are higher than the optimal ones for short run-lengths, and lower than the optimal ones for long run-lengths (Fig. 13B); and the repetition propensities are lower than the optimal ones for short run-lengths, and higher than the optimal ones for long run-lengths (Fig. 13C).

τNodes model

Derivation
This model generalizes the one introduced by [17]. In this paper, the authors interpret a change-point setting similar to ours as a 'message-passing' graph where run-lengths are nodes, weighted by their marginal probability p_t(τ | x^t), edges are characterized by the change probability, q, and 'messages' are passed along edges from one node to another. More precisely, we compute the marginal probability of the run-length, using Eqs. (5) and (6):

p_{t+1}(τ | x^t) = Σ_{τ_t} q(τ_t) p_t(τ_t | x^t)  if τ = 0;  (1 − q(τ − 1)) p_t(τ − 1 | x^t)  otherwise.   (14)

Hence, at trial t + 1, the weight of a node τ (i.e., the marginal probability of the corresponding run-length) is equal, if τ = 0, to a sum of the marginal probabilities of all nodes at trial t, τ_t, weighted by their corresponding change probabilities, q(τ_t); and, if τ > 0, it is the probability of the node τ − 1, at trial t, weighted by the probability that there was no change, 1 − q(τ − 1). Taking a different view, one can reformulate these weighted sums of probabilities as transfers of probability masses, as follows. Each node τ sends two 'messages': a 'no-change-point' message is sent to node τ + 1 so as to set its weight to (1 − q(τ)) p_t(τ | x^t), and a 'change-point' message is sent to node τ = 0 to increase its probability by q(τ) p_t(τ | x^t). This is the message-passing algorithm. [17] assume a change probability, q, that is constant; a likelihood, g, which belongs to the exponential family; and a state transition probability, a(s_{t+1}), that does not depend on the previous state, s_t, and which is a conjugate prior of the likelihood. They show that in this case each node can be seen as implementing a delta-rule, and the optimal Bayesian model amounts to the weighted sum of these delta-rules.

The authors then 'reduce' this model by removing nodes and accordingly revise the message-passing algorithm and each node's update rule. We only focus on the aspects of the model that will be used in our τNodes implementation. The set of new nodes comprises 'virtual' run-lengths l ∈ {l_0, l_1, ..., l_N}. A node l_i, with i ≠ 0, now sends three messages: one to l_0, one to the next node l_{i+1}, and one to itself. The 'change-point' message remains the same as in the previous algorithm, i.e., the quantity q(l_i) p_t(l_i | x^t) is sent to l_0 (i.e., this quantity is added to the probability of this node). The 'no-change-point' message is now split in two, one message being sent to the next node, l_{i+1}, and the other one being a self-passing message (i.e., sent to itself, l_i). The authors seek the relative weight w(l_i) assigned to the self-passing message which gives an average run-length increase of 1 (i.e., E(l_{t+1} | l_t = l_i, no change) = l_i + 1). They find w(l_i) = (l_{i+1} − l_i − 1)/(l_{i+1} − l_i) for i ≠ N, and w(l_N) = 1. The next node, l_{i+1}, hence receives the message (1 − w(l_i))(1 − q(l_i)) p_t(l_i | x^t). With the assumptions mentioned above on q, g, and a, the model can again be understood as 'a mixture of delta-rules'.

We implement these ideas in our τNodes model. Instead of p_t(s, τ | x^t), we consider the probability distribution p_t(s, l | x^t) and apply the same Bayesian and marginalization equations used in Eqs. (4) and (5). The main difference of the new model resides in the transition probability, p_{t+1}(s, l | s_t, l_t), which becomes, for l_t = l_i,

p_{t+1}(s, l | s_t, l_t = l_i) = 𝟙_{l=l_0} q(s_t, l_i) a(s_t, l_i, s) + 𝟙_{l=l_i, s=s_t} (1 − q(s_t, l_i)) w(l_i) + 𝟙_{l=l_{i+1}, s=s_t} (1 − q(s_t, l_i))(1 − w(l_i)).   (15)

Combining Eq. (15) with Eqs. (4) and (5) adapted to l, we obtain an update equation similar to the full Bayesian update equation (Eq. (7)), with an additional term corresponding to the ability of nodes to send self-passing messages. The model is parameterized by the number of nodes, N_τ, but also by the values of the nodes. We chose the possible values of the nodes to be in the set {0, 2.5, 5, 7.5, 10, 12.5}. When fitting the model for a given N_τ, all the models corresponding to every possible choice of N_τ nodes within these values were computed, and the best-fitting one was chosen.
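As a quick check of the self-passing weights, the sketch below computes w(l_i) for the node set used here; with nodes equally spaced 2.5 apart, each w(l_i) equals 0.6, so the expected run-length increase per no-change trial is (1 − 0.6) × 2.5 = 1, as required.

```python
# Self-passing weights for the tau-Nodes reduction (node values from the text).
nodes = [0.0, 2.5, 5.0, 7.5, 10.0, 12.5]
w = [(nodes[i + 1] - nodes[i] - 1.0) / (nodes[i + 1] - nodes[i])
     for i in range(len(nodes) - 1)] + [1.0]        # w(l_N) = 1
# Sanity check: expected increase (1 - w) * (l_{i+1} - l_i) equals 1.
assert all(abs((1 - w[i]) * (nodes[i + 1] - nodes[i]) - 1.0) < 1e-9
           for i in range(len(nodes) - 1))
```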
Behavior

In the HI condition, the model computes the optimal posterior on the state, p_t(s | x^t). Thus, as for the τMean model above, the responses in the τNodes model are the same as those in the optimal model, in the HI condition. In the HD condition, the greater the number of nodes, the more faithfully the model approximates optimal behavior (Fig. 13B, C). The learning rates are higher than the optimal ones for short run-lengths, and lower than the optimal ones for long run-lengths (Fig. 13B). The repetition propensities are higher than the optimal ones for long run-lengths; for short run-lengths, they are appreciably closer to the optimal ones, in the model with five nodes, than in the model with one node (Fig. 13C).

τMaxProb model

Derivation
In the τMaxProb model, we assume that we have, at trial t, an approximation of the joint distribution of the state and the run-length, which we denote by ˜p_t(s, τ | x^t), and we assume that the approximate marginal distribution of the run-lengths, ˜p_t(τ | x^t), is non-vanishing for no more than N_τ values of the run-length. Upon receiving a new stimulus, we perform a Bayesian update of the approximate joint distribution, ˜p_t(s, τ | x^t), as in Eq. (2), and obtain the posterior, p_{t+1}(s, τ | x^{t+1}), from which we derive the marginal distribution of the run-lengths, p_{t+1}(τ | x^{t+1}). If, at trial t, the run-length takes a given value, τ_t, then, at trial t + 1, it can only take one of two values: 0 (if there is a change point) or τ_t + 1 (if there is no change point). Hence, if the marginal distribution at trial t, ˜p_t(τ | x^t), is non-vanishing for at most N_τ values, as we assume, then the updated distribution, p_{t+1}(τ | x^{t+1}), is non-vanishing for at most N_τ + 1 values. In the case that this distribution is non-vanishing for fewer than N_τ + 1 values, we do not perform further approximations, at this stage. In the other, more generic case, i.e., if N_τ + 1 values of the run-length have a non-zero probability, then we identify the most unlikely run-length, τ* = arg min p_{t+1}(τ | x^{t+1}), and we approximate the posterior as

˜p_{t+1}(s, τ | x^{t+1}) = 0  if τ = τ*;  (1/Z) p_{t+1}(s, τ | x^{t+1})  if τ ≠ τ*,   (16)

where Z is a normalization constant equal to 1 − p_{t+1}(τ* | x^{t+1}).
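The pruning step of Eq. (16) is straightforward on the grid representation used in the earlier sketches; the function below is a minimal illustration of it.

```python
def prune_max_prob(post, n_tau):
    """Eq. (16): if more than n_tau run-lengths survive, zero out the least
    likely one and renormalize by Z = 1 - p(tau* | x)."""
    marg = post.sum(axis=1)                      # marginal p(tau | x)
    alive = np.flatnonzero(marg > 0.0)
    if len(alive) > n_tau:
        tau_star = alive[np.argmin(marg[alive])]
        post = post.copy()
        post[tau_star, :] = 0.0
        post /= post.sum()
    return post
```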
Behavior

Even with N_τ = 1 (one memory slot), the τMaxProb model captures qualitatively the optimal, high learning rates for large run-lengths, in the HD condition (Fig. 13D, second panel, dotted line). However, in other situations (HD condition for shorter run-lengths, and HI condition for all run-lengths), change points are not likely (q < 0.5). Hence, in most cases, a vanishing run-length, i.e., the hypothesis of a change point, minimizes the marginal distribution, p_{t+1}(τ | x^{t+1}), and its probability vanishes in our approximation: ˜p_{t+1}(τ = 0 | x^{t+1}) = 0. In other words, change points tend to go by undetected. Consequently, suppressed learning rates and enhanced repetition propensity obtain in a model with a single memory slot (Fig. 13D, E).

To compare our suboptimal models to the OptimalInference model, we compute their normalized mean squared errors (NMSE) with regard to the responses of the optimal model (as opposed to the responses of the human subjects, as we do in the main text). With an NMSE for learning rates at 0.21, the τMaxProb model with N_τ = 1 is closer to optimality than the τMean model (NMSE of 0.88) and the τNodes model with one node (NMSE of 0.95), in the HD condition (Fig. 13F).
The τMaxProb model, however, leads to a larger error for the repetition propensity (NMSE of 1.81), as compared to the τMean (0.14) and τNodes (0.61) models (Fig. 13G). Adding a second memory slot allows for a better approximation of the marginal distribution, p_t(τ | x^t), in the τMaxProb model, as demonstrated by its close-to-optimal behavior with N_τ = 2, both in terms of learning rates (NMSE of 0.043; compare to the τNodes model: 0.89) and repetition propensity (0.097; compare to the τNodes model: 0.29) (Fig. 13F, G).

Figure 13: Behavior of the limited-memory models, as compared to the OptimalInference model. A. Schematic illustration of the marginal distribution of the run-length, p(τ), in each model considered. The OptimalInference model assigns a probability to each possible value of the run-length, τ, and optimally updates that distribution upon receiving stimuli (first panel). The τMean model uses a single run-length which tracks the inferred expected value, ⟨τ⟩ (second panel). The τNodes model holds in memory a limited number, N_τ, of fixed hypotheses on τ ("nodes"), and updates a probability distribution over these nodes; N_τ = 2 in this example (third panel). The τMaxProb model reduces the marginal distribution by discarding less likely run-lengths; in this example, 2 run-lengths are stored in memory at any given time (fourth panel). B, C. Average learning rate (B), and repetition propensity (C), as functions of the run-length, in the OptimalInference model, the τMean model, and the τNodes model with N_τ = 1, 2, and 5, in the HD condition. The HI condition is not displayed as the τMean and τNodes models do not differ from the OptimalInference model in this condition. D, E. Average learning rate (D), and repetition propensity (E), as functions of the run-length, in the OptimalInference model and the τMaxProb model with N_τ = 1, 2, and 5, in the HI condition (top panels) and in the HD condition (bottom panels). F, G. Normalized Mean Squared Error relative to the learning rate (F) and the repetition propensity (G), as compared to the OptimalInference model, for the τMean model, which has no free parameter, and for the τNodes and τMaxProb models with N_τ = 1, 2, and 5.

τSample model
The τSample model is identical to the τMaxProb model, except that the run-length τ* is chosen randomly, i.e., sampled from the distribution [1 − p_{t+1}(τ | x^{t+1})]/z_{t+1}, where z_{t+1} is a normalization factor. The stochastic nature of the update rule on the probable run-lengths influences the learning rate and the repetition propensity. In the case N_τ = 1, there is, at trial t, a single run-length, τ_t, with non-vanishing probability. At trial t + 1, the model subject chooses randomly between the no-change-point scenario, with τ_{t+1} = τ_t + 1, and the change-point scenario, with τ_{t+1} = 0. Hence, the model can incur 'false positives' (a change-point scenario is opted for in the absence of a true change point) and 'false negatives' (a true change point goes undetected by the model subject), and these occur stochastically. In most trials, the change-point scenario is less likely than the no-change-point scenario; in the τMaxProb model, the former would be eliminated, but it occurs with some probability in the τSample model, leading to false positives, which induce higher learning rates. Accordingly, the average learning rates of the τSample model are higher than the optimal ones (Fig. 14A). The false negatives, in which change points go undetected, result, as in the τMaxProb model, in higher repetition propensities (Fig. 14C). With increasing memory capacity, N_τ, the behavior of the model approaches optimality, as reflected in the decrease of the NMSEs for the learning rates (Fig. 14B) and the repetition propensities (Fig. 14D).

A qualitatively new aspect brought in by the τSample model is the stochasticity in the inference step, which is reflected in behavioral variability and measured by the standard deviation of the responses of a model subject. Quantitatively, false negatives have a large impact on the behavioral variability. A model subject can however correct for a false negative during the few trials that follow a true change point, i.e., at short run-lengths. This occurs randomly, in the τSample model, resulting in variability in responses. At longer run-lengths, the posterior probability of a change point, p_t(τ = 0 | x^t), is dominated by the shape of the change probability, q(τ), rather than by the observed evidence. In the HI condition, q is constant; hence, the variability reaches a plateau at large run-lengths (Fig. 14E, top panel). In the HD condition, as q(τ) is an increasing function of the run-length, the variability increases at long run-lengths, resulting in the 'smile shape' of the curve (Fig. 14E, bottom panel). As the parameter N_τ is increased, the behavior of the model approaches optimality, and, correspondingly, the standard deviation of the responses of the model subject decreases (Fig. 14F).
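The only change with respect to the τMaxProb pruning is the random choice of the discarded run-length; a minimal sketch of this stochastic variant, in the same grid setting as the earlier sketches, is:

```python
def prune_sample(post, n_tau, rng):
    """tau-Sample pruning: discard a run-length tau* drawn with probability
    proportional to 1 - p(tau | x), then renormalize."""
    marg = post.sum(axis=1)
    alive = np.flatnonzero(marg > 0.0)
    if len(alive) > n_tau:
        p_drop = 1.0 - marg[alive]
        p_drop /= p_drop.sum()                   # the normalization factor z
        tau_star = rng.choice(alive, p=p_drop)
        post = post.copy()
        post[tau_star, :] = 0.0
        post /= post.sum()
    return post
```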
Figure 14: Behavior of the τSample model, as compared to the OptimalInference model. A, C, E. Average learning rate (A), repetition propensity (C), and standard deviations of responses (E), as a function of run-length, in the OptimalInference model and the τSample model with N_τ = 1, 2, and 5, in the HI condition (top panels) and in the HD condition (bottom panels). B, D. Normalized Mean Squared Error on learning rates (B), and on repetition propensity (D), as compared to the OptimalInference model, for the τSample model, as a function of N_τ. F. Standard deviation of the responses of the τSample model, as a function of N_τ.

ParticleFilter

Derivation

The ParticleFilter approximates the posterior by a weighted sum of delta functions (Eq. (3)). To obtain the approximate posterior at trial t + 1, upon receiving a new observation, x_{t+1}, we start by writing the Bayesian update (Eq. (2)) of the approximate posterior at trial t, ˜p_t(s, τ | x^t), as

p_{t+1}(s, τ | x^{t+1}) = (1/Z_{t+1}) g(x_{t+1} | s) Σ_{τ_t} ∫_{s_t} p_{t+1}(s, τ | s_t, τ_t) ˜p_t(s_t, τ_t | x^t) ds_t,   (17)

with the transition probability, p_{t+1}(s, τ | s_t, τ_t), defined in Eq. (6). Injecting the expression of the approximate posterior at trial t (Eq. (3)), we can rewrite the Bayesian update as a sum of N_P functions:

p_{t+1}(s, τ | x^{t+1}) = (1/Z_{t+1}) Σ_{i=1}^{N_P} w^i_t g(x_{t+1} | s) p_{t+1}(s, τ | s^i_t, τ^i_t),   (18)

where

p_{t+1}(s, τ | s^i_t, τ^i_t) = Σ_{τ_t} ∫_{s_t} p_{t+1}(s, τ | s_t, τ_t) δ(s_t − s^i_t) δ_{τ_t, τ^i_t} ds_t.   (19)

The interpretation of this form becomes apparent if we introduce, for each particle, a probability distribution over (s, τ), defined as

π_{t+1}(s, τ | s^i_t, τ^i_t, x_{t+1}) ≡ g(x_{t+1} | s) p_{t+1}(s, τ | s^i_t, τ^i_t) / p(x_{t+1} | s^i_t, τ^i_t),   (20)

where the denominator is obtained by normalization,

p(x_{t+1} | s^i_t, τ^i_t) = Σ_τ ∫_s g(x_{t+1} | s) p_{t+1}(s, τ | s^i_t, τ^i_t) ds.   (21)

The distribution π_{t+1}(s, τ | s^i_t, τ^i_t, x_{t+1}) is none other than the Bayesian update of a single particle (i.e., Eq. (17) with the approximate prior, ˜p_t(s, τ | x^t), replaced by δ(s − s^i_t) δ_{τ, τ^i_t}), and the full Bayesian update is a weighted sum of these N_P functions:

p_{t+1}(s, τ | x^{t+1}) = Σ_{i=1}^{N_P} [w^i_t p(x_{t+1} | s^i_t, τ^i_t) / Z_{t+1}] π_{t+1}(s, τ | s^i_t, τ^i_t, x_{t+1}).   (22)

To complete the definition of the particle filter, we have to formulate a prescription for selecting the N_P particles at trial t + 1. Following the literature, instead of sampling the full Bayesian update, p_{t+1}(s, τ | x^{t+1}), we sample independently each component of the mixture, π_{t+1}(s, τ | s^i_t, τ^i_t, x_{t+1}), to obtain the updated particles, (s^i_{t+1}, τ^i_{t+1}). To each sample, i.e., to each particle, is assigned the weight of the corresponding component in the mixture, w^i_{t+1} = w^i_t p(x_{t+1} | s^i_t, τ^i_t)/Z_{t+1}. In the rare cases in which p(x_{t+1} | s^i_t, τ^i_t) = 0, i.e., if new data invalidate particle i, and thus π_{t+1}(s, τ | s^i_t, τ^i_t, x_{t+1}) = 0, we resample a new particle i from the other particles.

In practical applications of particle filters, there exists a 'weight degeneracy' risk, whereby the weight of one particle may overwhelm the combined weight of the others. A common method to mitigate this shortcoming is called 'resampling'. It is a stochastic method in which the particles with high weights are likely to be duplicated, while the particles with low weights are likely to be eliminated. To achieve this, we use the N_P-dimensional categorical distribution parameterized by the N_P weights of the particles, i.e., p(j) = w^j_t. We sample this distribution N_P times, and obtain, thus, a set of N_P indexes, {j_i}_{i=1}^{N_P}. We use those to define the new N_P particles: for each particle i, we replace (s^i_t, τ^i_t) by (s^{j_i}_t, τ^{j_i}_t), and we set all the weights to 1/N_P. In other words, the set of particles is randomly sampled with replacement, N_P times. Particles with low weights are unlikely to survive this scheme, as compared to particles with high weights. For the sake of simplicity, we resample at each trial.

A possible consequence of resampling is the 'sample impoverishment' problem, i.e., the loss of particle diversity (all particles end up bearing the same state).
A common procedure in the particle-filter literature that addresses this problem is 'particle rejuvenation', which increases the variability of particles by 'jittering' their parameters. Here, however, this issue is mitigated naturally by the structure of our problem, as a new state is sampled from the distribution a(s | s_t) every time a new particle carries a change-point run-length (τ = 0), thus renewing the set of particles. In addition, introducing a rejuvenation step implies choosing arbitrarily a transition kernel and an acceptance rule for candidate particles (usually, the Metropolis-Hastings rule is adopted). Many kernels used in the literature come with additional parameters. While the rejuvenation method would be an interesting addition to our model, the performance of our implementation of the particle-filter model does not warrant the introduction of this new layer of complexity.
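For orientation, a bootstrap-style step is sketched below. It is a simplification, not the scheme above: rather than sampling each particle's conditional posterior π_{t+1} (Eq. (20)), it propagates particles through the transition prior and reweights by the likelihood, a standard variant; the uniform state draw at change points and the Gaussian likelihood are assumptions carried over from the earlier sketches.

```python
def particle_step(parts, x, rng, lam=1.0, T=10.0, sigma=0.1):
    """parts: array of shape (N_P, 2) with columns (s, tau). One bootstrap
    step: propagate through the transition prior, weight by g(x|s), resample."""
    new = parts.copy()
    for i, (s_i, tau_i) in enumerate(parts):
        if rng.random() < q(tau_i, lam, T):      # change point
            new[i] = (rng.random(), 0.0)         # uniform a(s | s_t) assumed
        else:                                    # no change point
            new[i] = (s_i, tau_i + 1.0)
    w = np.exp(-0.5 * ((x - new[:, 0]) / sigma) ** 2)   # likelihood weights
    w /= w.sum()
    idx = rng.choice(len(new), size=len(new), p=w)      # multinomial resampling
    return new[idx]
```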
Behavior

With a single particle (N_P = 1), the posterior is reduced to a unique sample, (s_t, τ_t), and, thus, the model subject has access to a single 'hypothesis' on the change probability, q(τ_t). The particle can then evolve in one of two ways: either it opts, with this probability, for a change-point scenario, in which the new stimulus, x_{t+1}, is the only information available on the new state, along with the prior transition probability, and thus the learning rate is close to 1; or, with probability 1 − q(τ_t), a no-change-point scenario is opted for, and the particle stays put (s_{t+1} = s_t), i.e., the learning rate vanishes. As a result, when averaged over several instantiations of the particle filter, the behavior of the learning rate as a function of run-length resembles that of the change probability, q(τ), i.e., constant in the HI condition, and increasing in the HD condition (Fig. 15B), a behavior qualitatively different from either that of the OptimalInference model or the human responses. But it is sufficient to add no more than a second particle for the model to capture the main trends in the
ParticleFilter model, as a functionof number of particles, N P . learning rate (a decreasing learning rate in the HI condition and smile shape in the HD condition).The NMSE drops sharply from 2.2 for N P = 1 to less than 0.7 for N P = 2 . As additional particlesare included in the model, the latter approaches optimality (the NMSE becomes less than 0.1 for N P ≥ ) (Fig. 15C).As mentioned, sampling in the particle filter induces variability in behavior: two particle filtersreceiving the same sequence of observations do not respond with the same sequence of estimates.Since the stochasticity stems from the sampling of an (approximate) posterior, the resulting vari-ability scales with the width of the posterior. As measured by the standard deviation of responses,it decreases with the run-length, in the HI condition. In the HD condition, it decreases at shortrun-lengths before increasing at longer run-lengths (Fig. 15F). This behavior reproduces, at leastqualitatively, that of the subjects (compare to Fig. 5C). The greater the number of particles in aparticle filter, the closer the latter approximates the OptimalInference model; the standard deviationof the responses is a decreasing function of the number of particles (Fig. 15G).Since it operates on a low-dimensional spatial representation, the particle filter naturally predictsa higher repetition propensity than the
OptimalInference model does. More specifically, the posterior is non-vanishing for only a finite (possibly small) set of values at each trial, and it is more likely than in the optimal case that the model subject's estimate remains unchanged following stimulus
presentation. This effect is quantitatively appreciable, and leads to repetition propensities which are multiples of those in the OptimalInference model. Again, the repetition propensities decrease toward their optimal values as the number of particles, N_P, increases; the corresponding NMSE decreases accordingly, from N_P = 1 to N_P = 150 (Fig. 15D, E).

Figure 15: Illustration of the ParticleFilter model and its behavior. A. Distribution of particles during inference, compared with the optimal posterior distribution, for particle filters with N_P = 1 and N_P = 150. Each particle is a point in the (s, τ) plane, equipped with a weight. Only the spatial components s are represented here, as vertical bars (grey for N_P = 1, green for N_P = 150). Bar heights are proportional to the corresponding weights, but some are truncated due to the choice of scale, which emphasizes weight diversity. Upon receiving a new stimulus, x_{t+1} (blue), a particle i is updated by sampling π_{t+1}(s, τ | s^i_t, τ^i_t, x_{t+1}). This may or may not involve a change point, in which case s^i_{t+1} ≠ s^i_t. B, D, F. Average learning rate (B), repetition propensity (D), and standard deviations of responses (F), as a function of run-length, in the OptimalInference model and the ParticleFilter model with N_P = 1, 2, 10, and 100, in the HI condition (top panels) and in the HD condition (bottom panels). C, E. Normalized Mean Squared Error on learning rates (C), and on repetition propensity (E), as compared to the OptimalInference model, for the ParticleFilter model, as a function of the number of particles, N_P. G. Standard deviation of the responses of the ParticleFilter model, as a function of the number of particles, N_P.

Sampling model
In the Sampling model, instead of using the Bayesian posterior to maximize its expected score, a model subject samples its response from the marginal posterior on the states, p_t(s | x^t). In spite of this suboptimal selection rule, the average learning rate as a function of the run-length has a behavior similar to the optimal one (decreasing in the HI condition, smile-shaped in the HD condition), albeit with higher average values (Fig. 16A). The repetition propensity also behaves similarly to the optimal one, but is suppressed in magnitude due to sampling (Fig. 16B). Finally, as expected by construction, the Sampling model leads to behavioral variability, and the amplitude of the latter scales with the width of the posterior distribution (Fig. 16C).

Figure 16: Behavior of the Sampling model, as compared to the OptimalInference model. A, B. Average learning rate (A), and repetition propensity (B), in the Sampling model (dashed lines) and the OptimalInference model (solid lines), as a function of run-length, in the HI and HD conditions. C. Standard deviations of responses, as a function of run-length, of the Sampling model in the HI and HD conditions. The OptimalInference model exhibits no variability.
Normalized Mean Squared Error

This section provides some details on the Normalized Mean Squared Error we use to compare the results of the various models to the OptimalInference model and to human data. Let y_i(τ) be the value at run-length τ of the quantity of interest i (learning rate, repetition propensity, or standard deviation of responses), as observed in data or as resulting from the optimal model, and ŷ_i(τ) the value resulting from a suboptimal model. The mean squared error is MSE(ŷ_i) = (1/n) Σ_τ (ŷ_i(τ) − y_i(τ))², where n is the number of run-lengths. We want to be able to compare the errors for different quantities of interest. By dividing the MSE by the variance of y_i, we obtain the Normalized Mean Squared Error, which is translation-invariant and scale-invariant:

NMSE_i = MSE(ŷ_i)/Var[y_i] = Σ_τ (ŷ_i(τ) − y_i(τ))² / Σ_τ (ȳ_i − y_i(τ))²,   (23)

where ȳ_i is the average of y_i(τ). For model fitting, we then use the average of this quantity over the three quantities of interest ("three-error measure") or over two of them ("two-error measure").
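The measure of Eq. (23) is a one-liner; a minimal sketch:

```python
def nmse(y_hat, y):
    """Normalized Mean Squared Error of Eq. (23); y_hat, y: arrays over run-lengths."""
    return np.sum((y_hat - y) ** 2) / np.sum((np.mean(y) - y) ** 2)
```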
OptimalInference model and to human data. Let y i ( τ ) be the value at run-length τ of the quantity of interest i (learning rate, repetition propensity,or standard deviation of responses), as observed in data or as resulting from the optimal model,and ˆ y i ( τ ) the value resulting from a suboptimal model. The mean squared error is M SE (ˆ y i ) = n (cid:80) τ (ˆ y i ( τ ) − y i ( τ )) , where n is the number of run-lengths. We want to be able to compare theerrors for different quantities of interest. By dividing the M SE by the variance of y i , we obtain theNormalized Mean Squared Error, which is translation-invariant and scale-invariant: N M SE i = M SE (ˆ y i )Var[ y i ] = (cid:80) τ (ˆ y i ( τ ) − y i ( τ )) (cid:80) τ (¯ y i − y i ( τ )) , (23)where ¯ y i is the average y i ( τ ) . For model fitting, we then use the average of this quantity over thethree quantities of interest (“three-error measure”) or over two of them (“two-error measure”).43 .10 Approximate formulation of the Bayesian Information Criterion For a given subject in a given condition (HI or HD), we denote the probability of a sequence of T responses, ˆ s T , by p (ˆ s T ) . In the models with deterministic inference, the joint probability ofresponses is the product of the probabilities of each of the responses in the successive trials: p (ˆ s T ) = T (cid:89) t =1 p (ˆ s t ) . (24)This independence condition does not hold for the models with stochastic inference. We describe,here, how we overcome this issue in the case of the ParticleFilter model, which is the most involvedmodel we consider and the most costly computationally. We denote by N the number of particles,and we define the ‘internal state’ of a particle filter at time t as the joint states of its N particles,each defined by a location, s t , a run-length, τ t , and a weight, w t . We denote the internal state by σ t ,and the sequence of the internal states throughout a run with T trials by σ T . The probability of asubject’s sequence of responses is expressed in terms of the probability of the sequences of internalstates, as p (ˆ s T ) = (cid:88) σ T p (ˆ s T | σ T ) p ( σ T ) . (25)We note that conditional on the internal state, the responses are independent: p (ˆ s T | σ T ) = T (cid:89) t =1 p (ˆ s t | σ t ) . (26)In other words, a single realization of the particle filter can be thought of as a model with de-terministic inference. To compute the probability of responses, however, we must determine thedistribution of the realizations of the internal states of the particle filters, p ( σ T ) . The supportof this distribution is the Cartesian product of the N T internal states of the particle filter, and,hence, its size grows exponentially with
N T . Estimating a probability distribution over this spaceseems computationally intractable; we can, however, carry out an approximate calculation of theprobability distribution.Our method of approximation relies on a Monte-Carlo estimation: we run M = 500 simulationsof the inference model ( ParticleFilter or τ Sample ), and consequently we obtain M points σ T inthe space of possible sequences of internal states. For each realization of the internal state, weknow the probability distribution of the response, conditional on the state, p (ˆ s t | σ t ) . A Monte-Carloapproximation of the probability of a sequence of T responses, ˜ p (ˆ s T ) , is then obtained as ˜ p (ˆ s T ) = 1 M (cid:88) σ T p (ˆ s T | σ T ) , (27)i.e., ˜ p (ˆ s T ) = 1 M (cid:88) σ T T (cid:89) t =1 p (ˆ s t | σ t ) , (28)where the sum is taken over the M sampled sequences of internal states. This empirical approxi-mation is satisfactory provided M is sufficiently large, so as to overcome the exponential growth ofthe number of possible sequences.For any sampled sequence of internal states, σ T , it is extremely likely that at least one response, ˆ s t , has vanishingly small probability given the corresponding internal state, σ t , i.e., p (ˆ s t | σ t ) ≈ . In44ther words, given a sequence of responses, ˆ s T , it is extremely unlikely that any one of M sequencesof internal states produces ˆ s T , i.e., it is prohibitively improbable that any one of M sequences ofinternal states account simultaneously for responses. Thus, if we carry out the MonteCarloapproximation naively, we underestimate severely the likelihood of the data. (We emphasize that,in that case, the low value of the likelihood does not reflect an inherent inability of the model toaccount for the data, but rather the poor sampling of the internal states in the model.)We can circumvent this practical problem by dividing our experimental runs into shorter se-quences: while our sample is too small for obtaining a useful approximation of the density of possiblesequences of successive internal states, we can treat this density over shorter sequences . In theextreme case, we can consider the (Monte-Carlo-approximated) likelihood of just one response, ˆ s t ,at a trial t : ˜ p (ˆ s t ) = 1 M (cid:88) σ T p (ˆ s t | σ t ) . (29)Here, the M samples, σ T , are used to estimate a one-dimensional density, instead of a -dimensional density. The joint likelihood of all responses can then be approximated as the productof the likelihoods of each response, i.e., ˜ p (ˆ s T ) ≈ T (cid:89) t =1 ˜ p (ˆ s t ) . (30)This approximation, however, in effect makes the crude assumption that successive responses areindependent, and thus neglects the sequential dependence in responses that a model may predict.For instance, this approximation vastly underestimates the likelihood of a model that correctlypredicts an appreciable probability of repetitions.In order to obtain an approximation of the likelihood that takes into account the sequentialdependence of responses, and which can be computed on the basis of M samples, we choose tocompute the likelihood of responses over sequences of successive trials, ˜ p (ˆ s t : t +9 ) = 1 M (cid:88) σ T p (ˆ s t : t +9 | σ t : t +9 ) . (31)We fit the models by evaluating how well they reproduce the responses in all the 10-trial-longsequences. More precisely, we associate to each model a BIC calculated as BIC = − (cid:34) (cid:89) t =1 , ,... 
˜ p (ˆ s t : t +9 ) (cid:35) + k ln n, (32)where k is the number of parameters in the model under consideration, and n is the number of datapoints. The specific choice of sequences with 10 trials is arbitrary; in our analyses, we repeatedthe calculations for different choices, which yielded comparable results. We chose to illustrate thischoice as it corresponds to sequences no shorter than the mean inter-change-point duration, thatoptimizes the precision of the approximation.We note that our approximation of the ParticleFilter model’s BIC could be interpreted as a(possibly still approximate) calculation of the BIC of a different model. Specifically, the latterwould include a particular form of a particle-rejuvenation procedure in which all the particles arereplaced, every 10 trials, by as many new particles, each randomly sampled from the distributionof possible particles, at these trials. The rejuvenation kernel would thus be independent of therejuvenated particle (a possibility considered in the particle-filter literature, see [105]), and the45cceptance rate would be equal to 1, each 10 trials, and to 0 at other trials; thus, it would also beindependent of the particle (which is at odds with the usual form of the rejuvenation proceduresintroduced in the literature).
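Under the assumption that one has stored, for M simulations, the per-trial conditional response likelihoods p(ŝ_t | σ_t) in an array cond_lik of shape (M, T), Eqs. (31) and (32) can be sketched as follows (the array name and shapes are illustrative, not from the paper).

```python
def block_log_likelihood(cond_lik, block=10):
    """Monte-Carlo likelihood of Eq. (31), accumulated over 10-trial blocks."""
    M, T = cond_lik.shape
    total = 0.0
    for t0 in range(0, T - block + 1, block):
        block_lik = cond_lik[:, t0:t0 + block].prod(axis=1).mean()
        total += np.log(block_lik)
    return total

def bic(cond_lik, k, n, block=10):
    """Eq. (32): BIC from the block-wise Monte-Carlo likelihood."""
    return -2.0 * block_log_likelihood(cond_lik, block) + k * np.log(n)
```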
Rational-inattention models

The rational-inattention models we present are inspired by the model introduced by [26]. We adopt their notation for the new quantities introduced here to describe the response-selection process. The major new ingredient in this model is that the subject, after having observed a sequence of stimuli, x^t, is assumed to choose, first, whether to adjust or to repeat the current estimate (the 'repetition variable'); the repetition variable is a Bernoulli random variable parameterized by the probability of adjusting, denoted by Λ_t(x^t, ŝ_{t−1}) (and the probability of a repetition is thus 1 − Λ_t(x^t, ŝ_{t−1})). Second, conditional on adjusting, the subject randomly chooses a new estimate (the 'location variable'), sampled from a distribution which we denote by μ_t(ŝ_t | x^t). Thus, the model subject's distribution of responses conditional on the observed stimuli and on the preceding response is

p(ŝ_t | x^t, ŝ_{t−1}) = (1 − Λ_t(x^t, ŝ_{t−1})) δ(ŝ_t − ŝ_{t−1}) + Λ_t(x^t, ŝ_{t−1}) μ_t(ŝ_t | x^t),   (33)

where δ is the Dirac delta function.

Following the rational-inattention approach, we assume that the distribution of responses conditional on the observed stimuli and on the preceding response, p(ŝ_t | x^t, ŝ_{t−1}), maximizes, under a constraint detailed below, the expected reward. Although the response at trial t, ŝ_t, affects the rewards in subsequent trials, through the ensuing distributions of responses (Eq. (33)), for the sake of computational simplicity we carry out a 'greedy' optimization in the model. Specifically, we assume that the distribution of responses conditional on the observed stimuli and on the preceding response, p(ŝ_t | x^t, ŝ_{t−1}), is obtained by considering only the immediate reward in expectation over the state, the response, and the sequence of past stimuli, i.e., the quantity

R̄ ≡ ∫···∫ p(x^t) ∫ p(ŝ_t | x^t, ŝ_{t−1}) ∫ p_t(s | x^t) R(ŝ, s) ds dŝ_t dx_1 ··· dx_t,   (34)

where R(ŝ, s) is the reward obtained if the estimate is ŝ and the correct state is s.

In the absence of a constraint, the optimal response is obtained by maximizing the expected reward implied by the sequence of past stimuli, ∫ p_t(s | x^t) R(ŝ, s) ds, which we denote by r(ŝ_t | x^t). We assume, however, that it is costly for the subject to choose with precision the repetition variable and the location variable, and this hampers the ability to obtain this optimal estimate. Following [26], we assume that the repetition variable (distributed according to 1 − Λ_t(x^t, ŝ_{t−1})), and the location variable in trials in which the estimate is not repeated (distributed according to μ_t(ŝ_t | x^t)), each bear a cognitive cost proportional to measures of the amount of information on the sequence of stimuli involved in choosing the repetition variable and the location variable, respectively, defined as

I_1 = ∫···∫ p(x^t) D_KL(Λ_t(x^t, ŝ_{t−1}) || Λ̃) dx_1 ··· dx_t   (35)

and

I_2 = ∫···∫ p(x^t) Λ_t(x^t, ŝ_{t−1}) D_KL(μ_t(· | x^t) || μ̃) dx_1 ··· dx_t,   (36)

where Λ̃ and μ̃ are the unconditional (not conditional on x^t) probability of adjusting and the distribution of estimates, respectively. The distributions Λ_t(x^t, ŝ_{t−1}) and μ_t(ŝ_t | x^t) are obtained
as those that maximize the quantity

R̄ − ψ_1 I_1 − ψ_2 I_2,   (37)

which expresses a trade-off between expected reward and cognitive costs, where ψ_1 and ψ_2 are numerical coefficients specifying the strength of the information-theoretic costs.

The solution to the optimization problem just posed is given by

μ_t(ŝ_t | x^t) = (1/Z(x^t)) μ̃(ŝ_t) exp(ψ_2^{−1} r(ŝ_t | x^t)),   (38)

and

ln[Λ_t(x^t, ŝ_{t−1})/(1 − Λ_t(x^t, ŝ_{t−1}))] = ln[Λ̃/(1 − Λ̃)] + ψ_1^{−1} (ψ_2 ln Z(x^t) − r(ŝ_{t−1} | x^t)),   (39)

where

Z(x^t) = ∫ μ̃(ŝ) exp(ψ_2^{−1} r(ŝ | x^t)) dŝ.   (40)

The unconditional distribution of estimates, μ̃(ŝ), can be approximated by a uniform distribution on the space of responses; for the sake of simplicity, we use this approximation in our calculations.

We implement four variants of this model and compute their BICs (Table 2). In a first variant, no cost weighs on the repetition variable (ψ_1 = 0), but a cognitive cost prevents the model subject from choosing the optimal response (ψ_2 ≠ 0, Table 2, second row). By contrast, in the second variant of the model the repetition variable is subject to a cost (ψ_1 ≠ 0), while the location variable is not (ψ_2 = 0); the latter follows, however, a NoisyMax response-selection strategy (Table 2, third row). (If the location variable, instead, were optimal, the model would assign a vanishing probability to most of the subjects' responses, and consequently would yield an infinite BIC.) In a third variant of the model, there are attentional costs weighing on both the repetition and the location variable (ψ_1 ≠ 0 and ψ_2 ≠ 0, Table 2, fifth row). In the fourth variant of the model, the location variable is subject to a cognitive cost (ψ_2 ≠ 0); the repetition variable, however, is not derived in a rational-inattention approach, but instead is random and governed by a fixed repetition probability (the first variant, above, is a special case of this fourth variant of the model, corresponding to a repetition probability set to zero; Table 2, sixth row). For the sake of comparison, we implement a fifth model that combines a fixed repetition probability with a NoisyMax strategy for the location variable (this model does not feature any cognitive cost; Table 2, fourth row).

The five models just presented make use of the OptimalInference strategy. We implement, in addition, five other models in which the repetition variable and the location variable are chosen as in these five models, but the OptimalInference strategy is replaced by the ParticleFilter inference strategy (Table 2, last column).

Table 2: BICs of the rational-inattention models and the models with fixed repetition probability, combined with the OptimalInference and the ParticleFilter inference strategies.
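On a discretized response space, the solution of Eqs. (38)-(40) can be evaluated directly; the sketch below is a minimal illustration, with a uniform μ̃ (as in the text), and with the expected-reward profile r, the coefficients ψ_1 and ψ_2, and the unconditional adjustment probability Λ̃ supplied by the caller (all names are illustrative).

```python
def rational_inattention_policy(r, psi1, psi2, lam_tilde, prev_idx):
    """Eqs. (38)-(40): r[i] = r(s_i | x^t) on the response grid; prev_idx is
    the index of the previous estimate. Returns (Lambda_t, mu_t)."""
    mu_tilde = np.ones_like(r) / len(r)            # uniform unconditional mu
    unnorm = mu_tilde * np.exp(r / psi2)
    Z = unnorm.sum()                               # grid version of Eq. (40)
    mu = unnorm / Z                                # Eq. (38)
    logit = (np.log(lam_tilde / (1.0 - lam_tilde))
             + (psi2 * np.log(Z) - r[prev_idx]) / psi1)   # Eq. (39)
    Lambda = 1.0 / (1.0 + np.exp(-logit))
    return Lambda, mu
```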
Acknowledgments

This work was supported by the CNRS through UMR8550, the Global Scholar Program at Princeton University, and the Visiting Faculty Program at the Weizmann Institute of Science. A.P.C. was supported by a PhD fellowship of the Fondation Pierre-Gilles de Gennes pour la Recherche. This work was granted access to the HPC resources of MesoPSL financed by the Région Île-de-France and the project Equip@Meso (reference ANR-10-EQPX-29-01) of the program Investissements d'Avenir supervised by the Agence Nationale pour la Recherche.
References

[1] Marc O Ernst and Martin S Banks. Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870):429–433, jan 2002.
[2] Peter W. Battaglia, Daniel Kersten, and Paul R. Schrater. How haptic size sensations improve distance perception. PLoS Computational Biology, 7(6):e1002080, jun 2011.
[3] Peter W. Battaglia, Robert A. Jacobs, and Richard N Aslin. Bayesian integration of visual and auditory signals for spatial localization. Journal of the Optical Society of America A, 20(7):1391, 2003.
[4] Robert J. van Beers, Anne C. Sittig, and Jan J. Denier van der Gon. Integration of Proprioceptive and Visual Position-Information: An Experimentally Supported Model. Journal of Neurophysiology, 81(3):1355–1364, 1999.
[5] Robert A. Jacobs. Optimal integration of texture and motion cues to depth. Vision Research, 39(21):3621–3629, oct 1999.
[6] James M Hillis, Simon J Watt, Michael S Landy, and Martin S Banks. Slant from texture and disparity cues: Optimal cue combination. Journal of Vision, 4(12):1, dec 2004.
[7] David C Knill. Robust cue integration: A Bayesian model and evidence from cue-conflict studies with stereoscopic and figure cues to slant. Journal of Vision, 7(7):5, may 2007.
[8] Konrad Paul Körding and Daniel M Wolpert. Bayesian integration in sensorimotor learning. Nature, 427(6971):244–247, jan 2004.
[9] Konrad Paul Körding and Daniel M Wolpert. Bayesian decision theory in sensorimotor control. Trends in Cognitive Sciences, 10(7):319–326, jul 2006.
[10] Makoto Miyazaki, Daichi Nozaki, and Yasoichi Nakajima. Testing Bayesian models of human coincidence timing. Journal of Neurophysiology, 94(1):395–399, 2005.
[11] Mehrdad Jazayeri and Michael N Shadlen. Temporal context calibrates interval timing. Nature Neuroscience, 13(8):1020–1026, 2010.
[12] Ryan Prescott Adams and David J. C. MacKay. Bayesian Online Changepoint Detection. ArXiv e-prints, pages 1–7, oct 2007.
[13] Paul Fearnhead and Zhen Liu. On-line inference for multiple changepoint problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4):589–605, sep 2007.
[14] Scott D Brown and Mark Steyvers. Detecting and predicting changes. Cognitive Psychology, 58(1):49–67, feb 2009.
[15] Robert C Wilson, Matthew R Nassar, and Joshua I Gold. Bayesian Online Learning of the Hazard Rate in Change Point Problems. Neural Computation, pages 1–25, 2010.
[16] Matthew R Nassar, Robert C Wilson, B Heasly, and Joshua I Gold. An Approximately Bayesian Delta-Rule Model Explains the Dynamics of Belief Updating in a Changing Environment. Journal of Neuroscience, 30(37):12366–12378, 2010.
[17] Robert C Wilson, Matthew R Nassar, and Joshua I Gold. A Mixture of Delta-Rules Approximation to Bayesian Inference in Change-Point Problems. PLoS Computational Biology, 9(7):e1003150, 2013.
[18] Matthew R Nassar, Katherine M Rumsey, Robert C Wilson, Kinjan Parikh, Benjamin Heasly, and Joshua I Gold. Rational regulation of learning dynamics by pupil-linked arousal systems. Nature Neuroscience, 15(7):1040–1046, 2012.
[19] Alan Veliz-Cuba, Zachary P. Kilpatrick, and Krešimir Josić. Stochastic models of evidence accumulation in changing environments. SIAM Review, 58(2):264–289, 2016.
[20] Adrian E. Radillo, Alan Veliz-Cuba, Krešimir Josić, and Zachary P. Kilpatrick. Evidence Accumulation and Change Rate Inference in Dynamic Environments. Neural Computation, 29(6):1561–1610, jun 2017.
[21] Christopher M. Glaze, Joseph W. Kable, and Joshua I. Gold. Normative evidence accumulation in unpredictable environments. eLife, 4:1–27, 2015.
[22] Christopher M. Glaze, Alexandre L.S. Filipowicz, Joseph W. Kable, Vijay Balasubramanian, and Joshua I. Gold. A bias-variance trade-off governs individual differences in on-line learning in an unpredictable environment. Nature Human Behaviour, 2(3):213–224, 2018.
[23] Alex T. Piet, Ahmed El Hady, and Carlos D. Brody. Rats adopt the optimal timescale for evidence integration in a dynamic environment. Nature Communications, 9(1):1–12, 2018.
[24] Adrian E. Radillo, Alan Veliz-Cuba, Krešimir Josić, and Zachary P. Kilpatrick. Performance of normative and approximate evidence accumulation on the dynamic clicks task. (516088):1–29, feb 2019.
[25] C R Gallistel, Monika Krishan, Ye Liu, Reilly Miller, and Peter E Latham. The perception of probability. Psychological Review, 121(1):96–123, 2014.
[26] Mel Win Khaw, Luminita Stevens, and Michael Woodford. Discrete adjustment to a changing environment: Experimental evidence. Journal of Monetary Economics, 91:88–103, 2017.
[27] L. A. Nunes Amaral, D. J. Bezerra Soares, L. R. da Silva, L. S. Lucena, M. Saito, H. Kumano, N. Aoyagi, and Y. Yamamoto. Power law temporal auto-correlations in day-long records of human physical activity. Europhys. Lett., 66(3):448–454, 2004.
[28] Toru Nakamura, Ken Kiyono, Kazuhiro Yoshiuchi, Rika Nakahara, Zbigniew R Struzik, and Yoshiharu Yamamoto. Universal Scaling Law in Human Behavioral Organization. Physical Review Letters, 99:138103, 2007.
[29] Toru Nakamura, Toru Takumi, Atsuko Takano, Naoko Aoyagi, Kazuhiro Yoshiuchi, Zbigniew R. Struzik, and Yoshiharu Yamamoto. Of Mice and Men — Universality and Breakdown of Behavioral Organization. PLoS ONE, 3(4):1–8, 2008.
[30] C Anteneodo and D R Chialvo. Unraveling the fluctuations of animal motor activity. Chaos, 19:1–7, 2009.
[31] J M Hausdorff, C K Peng, Z Ladin, J Y Wei, and A L Goldberger. Is walking a random walk? Evidence for long-range correlations in stride interval of human gait. Journal of Applied Physiology, 78(1):349–358, 1995.
[32] Lori Griffin, Damien J. West, and Bruce J. West. Random stride intervals with memory. Journal of Biological Physics, 26(3):185–202, 2000.
[33] Franck Ramus, Marina Nespor, and Jacques Mehler. Correlates of linguistic rhythm in the speech signal. Cognition, 73(3):265–292, 1999.
[34] E Low, E Grabe, and F Nolan. Quantitative characterisations of speech rhythm. Language and Speech, 46:377–401, 2000.
[35] Estelle Campione and Jean Véronis. A Large-Scale Multilingual Study of Silent Pause Duration. Speech Prosody 2002, pages 199–202, 2002.
[36] Peter Janssen and Michael N Shadlen. A representation of the hazard rate of elapsed time in macaque area LIP. Nature Neuroscience, 8(2):234–241, feb 2005.
[37] Geoffrey M Ghose and John H R Maunsell. Attentional modulation in visual cortex depends on task timing. Nature, 419(6907):616–620, oct 2002.
[38] Yi Li and Joshua Tate Dudman. Mice infer probabilistic models for timing. Proceedings of the National Academy of Sciences, 110(42):17154–17159, oct 2013.
[39] Sanne ten Oever, Charles E. Schroeder, David Poeppel, Nienke van Atteveldt, and Elana Zion-Golumbic. Rhythmicity and cross-modal temporal cues facilitate detection. Neuropsychologia, 63:43–50, oct 2014.
[40] Noah Goodman, Joshua B. Tenenbaum, Jacob Feldman, and Thomas L. Griffiths. A Rational Analysis of Rule-Based Concept Learning. Cognitive Science: A Multidisciplinary Journal, 32(1):108–154, jan 2008.
[41] Rubén Moreno-Bote, David C Knill, and Alexandre Pouget. Bayesian sampling in visual perception. Proceedings of the National Academy of Sciences, 108(30):12491–12496, jul 2011.
[42] Samuel J. Gershman, Edward Vul, and Joshua B. Tenenbaum. Multistability and Perceptual Inference. Neural Computation, 24(1):1–24, 2012.
[43] Edward Vul, Noah Goodman, Thomas L. Griffiths, and Joshua B. Tenenbaum. One and Done? Optimal Decisions From Very Few Samples. Cognitive Science, 38(4):599–637, 2014.
[44] Sture Holm. A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics, 6(2):65–70, 1978.
[45] P Morasso. Spatial control of arm movements. Experimental Brain Research, 42(2):223–227, 1981.
[46] Daniel M Wolpert. Computational approaches to motor control. Trends in Cognitive Sciences, 1(6):1–5, 1997.
[47] Reza Shadmehr. Computational Approaches to Motor Control. In Encyclopedia of Neuroscience, volume 3, pages 9–17. 2009.
[48] Lionel Rigoux and Emmanuel Guigon. A Model of Reward- and Effort-Based Optimal Decision Making and Motor Control. PLoS Computational Biology, 8(10), 2012.
[49] Sangtae Ahn and J. A. Fessler. Standard Errors of Mean, Variance, and Standard Deviation Estimators. EECS Department, University of Michigan, pages 1–2, 2003.
[50] N.J. Gordon, D.J. Salmond, and A.F.M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F Radar and Signal Processing, 140(2):107, 1993.
[51] M.S. Arulampalam, Simon Maskell, Neil Gordon, and Tim Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002.
[52] Arnaud Doucet and Adam M Johansen. A Tutorial on Particle Filtering and Smoothing: Fifteen years later. In D. Crisan and B. Rozovsky, editors, The Oxford Handbook of Nonlinear Filtering, pages 1–39. Oxford University Press, 2008.
[53] Luigi Acerbi, Sethu Vijayakumar, and Daniel M Wolpert. On the Origins of Suboptimality in Human Probabilistic Inference - S2 - Noisy Probabilistic Inference. PLoS Computational Biology, 10(6):1–5, 2014.
[54] Edward Vul. Sampling in human cognition - Chapter 1. Dissertation Abstracts International: Section B: The Sciences and Engineering, 72(4-B):2471, 2011.
[55] Angela J Yu and He Huang. Maximizing Masquerading as Matching in Human Visual Search Choice Behavior. Decision, 1(4):275–287, 2014.
[56] Gideon Schwarz. Estimating the Dimension of a Model. The Annals of Statistics, 6(2):461–464, mar 1978.
[57] Kevin Lloyd, Adam Sanborn, David Leslie, and Stephan Lewandowsky. Why Higher Working Memory Capacity May Help You Learn: Sampling, Search, and Degrees of Approximation. Cognitive Science, 43(12), 2019.
[58] Jan Drugowitsch, Valentin Wyart, Anne-Dominique Devauchelle, and Etienne Koechlin. Computational Precision of Mental Inference as Critical Source of Human Choice Suboptimality. Neuron, 92(6):1398–1411, 2016.
[59] Richard T. Cox. Probability, Frequency and Reasonable Expectation. American Journal of Physics, 14(1):1, 1946.
[60] Edwin T Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003.
[61] David C Knill. Mixture models and the probabilistic structure of depth cues. Vision Research, 43(7):831–854, 2003.
[62] Max Berniker and Konrad Kording. Bayesian approaches to sensory integration for motor control. Wiley Interdisciplinary Reviews: Cognitive Science, 2(4):419–428, 2011.
[63] Rashmi Sundareswara and Paul R. Schrater. Perceptual multistability predicted by search model for Bayesian decisions. Journal of Vision, 8(5):12.1–19, may 2008.
[64] Thomas L. Griffiths and Joshua B. Tenenbaum. Optimal Predictions in Everyday Cognition. Psychological Science, 17(9):767–773, sep 2006.
[65] Aaron P. Blaisdell, Kosuke Sawa, Kenneth J. Leising, and Michael R. Waldmann. Causal reasoning in rats. Science, 311(5763):1020–1022, 2006.
[66] Alan A. Stocker and Eero P. Simoncelli. A Bayesian Model of Conditioned Perception. Advances in Neural Information Processing Systems, 20(May):1409–1416, 2008.
[67] Thomas L. Griffiths and Joshua B. Tenenbaum. Predicting the future as Bayesian inference: People combine prior knowledge with observations when estimating duration and extent. Journal of Experimental Psychology: General, 140(4):725–743, 2011.
[68] Lisa Pearl, Sharon Goldwater, and Mark Steyvers. Online Learning Mechanisms for Bayesian Models of Word Segmentation. Research on Language and Computation, 8(2-3):1–26, sep 2011.
[69] Roger Levy, Florencia Reali, and Thomas L. Griffiths. Modeling the effects of memory on human online sentence processing with particle filters. Proceedings of NIPS 2008, pages 1–8, 2008.
[70] Nathaniel D Daw and Aaron C Courville. The pigeon as particle filter. Advances in Neural Information Processing Systems, pages 369–376, 2008.
[71] Samuel J. Gershman, Eric J. Horvitz, and Joshua B. Tenenbaum. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science, 349(6245):273–278, 2015.
[72] Samuel J. Gershman and Jeffrey M. Beck. Complex Probabilistic Inference: From Cognition to Neural Computation. In Computational Models of Brain and Behavior, pages 1–17. 2016.
[73] Adam N. Sanborn and Nick Chater. Bayesian Brains without Probabilities. Trends in Cognitive Sciences, xx:1–11, 2016.
[74] R J Herrnstein. Relative and absolute strength of response as a function of frequency of reinforcement. Journal of the Experimental Analysis of Behavior, 4(3):267–272, jul 1961.
[75] Derek J. Koehler and Greta James. Probability matching in choice under uncertainty: Intuition versus deliberation. Cognition, 113(1):123–127, 2009.
[76] Stephanie Denison, Elizabeth Bonawitz, Alison Gopnik, and Thomas L. Griffiths. Rational variability in children's causal inferences: The Sampling Hypothesis. Cognition, 126(2):280–300, 2013.
[77] David R Wozny, Ulrik R Beierholm, and Ladan Shams. Probability Matching as a Computational Strategy Used in Perception. PLoS Computational Biology, 6(8):e1000871, aug 2010.
PLoS Computational Biology , 6(8):e1000871, aug2010.[78] Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson Sampling: An Asymptoti-cally Optimal Finite Time Analysis.
Algorithmic Learning Theory , (1):15, may 2012.[79] Christopher A. Sims. Implications of rational inattention.
Journal of Monetary Economics ,50(3):665–690, 2003.[80] Michael Woodford. Information-constrained state-dependent pricing.
Journal of MonetaryEconomics , 56(SUPPL.):S100–S124, 2009.[81] Christopher A. Sims. Rational Inattention and Monetary Economics - Handbook of MonetaryEconomics.
Handbook of Monetary Economics , 3(2007):155–181, 2011.[82] Quentin J M Huys, Neir Eshel, Elizabeth O Nions, Luke Sheridan, Peter Dayan, andP Jonathan. Bonsai Trees in Your Head : How the Pavlovian System Sculpts Goal-DirectedChoices by Pruning Decision Trees.
PLOS Computational Biology , 8(3), 2012.[83] William R. Thompson. On the Likelihood That One Unknown Probability Exceeds AnotherIn View Of The Evidence Of Two Samples.
Biometrika , 25(3):285–294, 1933.[84] Eric Schulz, Emmanouil Konstantinidis, and Maarten Speekenbrink. Learning and decisionsin contextual multi-armed bandit tasks.
Proceedings of the 37th Annual Conference of theCognitive Science Society. Austin, TX: Cognitive Science Society , 1:2122–2127, 2015.[85] Maarten Speekenbrink and Emmanouil Konstantinidis. Uncertainty and exploration in arestless bandit problem.
Topics in Cognitive Science , 7(2):351–367, 2015.[86] Samuel J. Gershman. Deconstructing the human algorithms for exploration.
Cognition ,173(December 2017):34–42, 2018.[87] Rava Azeredo da Silveira and Michael Woodford. Noisy Memory and Over-Reaction to News.
AEA Papers and Proceedings , 109:557–561, may 2019.[88] Nathaniel Neligh. Rational Memory with Decay. 2019.[89] Hassan Afrouzi, Spencer Yongwook Kwon, and Yueran Ma. A Model of Costly Recall. 2020.[90] Alan A. Stocker and Eero P. Simoncelli. Sensory adaptation within a Bayesian framework forperception.
Advances in Neural Information Processing Systems , 18:1291–1298, 2006.[91] Adam N. Sanborn, Thomas L. Griffiths, and Daniel J. Navarro. A More Rational Model ofCategorization.
Proceedings of the 28th Annual Conference of the Cognitive Science Society ,pages 1–6, 2006.[92] Adam N. Sanborn, Thomas L. Griffiths, and Daniel J. Navarro. Rational approximationsto rational models: Alternative algorithms for category learning.
Psychological Review ,117(4):1144–1167, 2010.[93] Edward Vul, Michael C Frank, Joshua B. Tenenbaum, and George Alvarez. Explaining hu-man multiple object tracking as resource-constrained approximate inference in a dynamicprobabilistic model.
Advances in Neural Information Processing Systems , 22:1–9, 2009.5394] Pratiksha Thaker, Joshua B. Tenenbaum, and Samuel J. Gershman. Online learning of sym-bolic concepts.
Journal of Mathematical Psychology , 77:10–20, 2017.[95] Neil R Bramley, Peter Dayan, Thomas L Griffiths, and David A Lagnado. FormalizingNeurath’s ship: Approximate algorithms for online causal learning.
Psychological Review ,124(3):301–338, 2017.[96] Adam N. Sanborn. Types of approximation for probabilistic cognition: Sampling and varia-tional.
Brain and Cognition , pages 8–11, jul 2015.[97] József Fiser, Pietro Berkes, Gergő Orbán, and Máté Lengyel. Statistically optimal percep-tion and learning: from behavior to neural representations.
Trends in Cognitive Sciences ,14(3):119–130, mar 2010.[98] Patrik O Hoyer and Aapo Hyvärinen. Interpreting neural response variability as monte carlosampling of the posterior.
Advances in neural information processing systems , (1):293–300,2003.[99] Lars Buesing, Johannes Bill, Bernhard Nessler, and Wolfgang Maass. Neural dynamics assampling: A model for stochastic computation in recurrent networks of spiking neurons.
PLoSComputational Biology , 7(11):e1002211, nov 2011.[100] L Shi and Thomas L. Griffiths. Neural implementation of hierarchical Bayesian inference byimportance sampling.
Advances in Neural Information Processing Systems 22 , pages 1669–1677, 2009.[101] Guillaume Hennequin, Laurence Aitchison, and Máté Lengyel. Fast Sampling-Based Inferencein Balanced Neuronal Networks.
Advances in Neural Information Processing Systems , pages1–9, 2014.[102] Cristina Savin and Sophie Denève. Spatio-temporal representations of uncertainty in spikingneural networks.
Advances in Neural Information Processing Systems , pages 1–9, 2014.[103] Laurence Aitchison and Máté Lengyel. The Hamiltonian Brain: Efficient Probabilistic In-ference with Excitatory-Inhibitory Neural Circuit Dynamics.
PLoS Computational Biology ,12(12):1–24, 2016.[104] Jonathan W Peirce. Generating stimuli for neuroscience using psychopy.
Frontiers in Neu-roinformatics , 2(10), 2009.[105] Nicholas Chopin. A sequential particle filter method for static models.
Biometrika , 89(3):539–552, aug 2002. 54
Supplementary tables: statistical tests

Table 3: p-values for the one-sided statistical tests of equality of the means of the learning rates in the HI and HD conditions, for each run-length ˜τ (Fig. 2C). W: Welch’s test. MWU: Mann-Whitney’s U test. S: Student’s test. The N(HI) and N(HD) lines report the numbers of observations. The second half of the table reports the same quantities when excluding all occurrences of repetitions (Suppl. Fig. 18B). [Table body: per-run-length p-values and observation counts; numerical values omitted.]

Table 4: Learning-rate averages under the various HI/HD and short/long run-length conditions, and p-values for the one-sided statistical tests of equality of the means (Fig. 2B). The first two columns indicate the HI/HD and short/long run-length conditions (˜τ ∈ [5, …] and ˜τ ∈ [9, …]); the averages of the learning rates at trials satisfying these two conditions are reported in the two Avg columns. Columns W, MWU, and S provide the p-values for the tests of equality of the means between the two conditions (W: Welch’s test; MWU: Mann-Whitney’s U test; S: Student’s test). The two N columns report the number of observations for each condition. The second half of the table reports the same quantities when excluding all occurrences of repetitions (Suppl. Fig. 18A). [Table body: rows compare pairs of conditions (Condition 1 vs. Condition 2); numerical values omitted.]

Table 5: p-values for Levene’s statistical test of equality of the variances of the corrections in the HI and HD conditions, for each run-length (Fig. 4C). The N(HI) and N(HD) lines report the number of observations. The second part of the table reports the same quantities when excluding all occurrences of repetitions (Suppl. Fig. 18C). [Table body: numerical values omitted.]

Table 6: p-values for Fisher’s exact test of independence of the repetition propensities in the HI and HD conditions, at each run-length (Fig. 3B). The N(HI) and N(HD) lines report the number of observations. [Table body: numerical values omitted.]
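For concreteness, tests of this kind are available in standard libraries; the sketch below shows how the per-run-length comparisons reported in Tables 3–6 could be reproduced in Python with SciPy. It is a minimal sketch under assumed inputs, not the authors’ analysis code: lr_hi and lr_hd stand in for the learning rates observed at one run-length in the HI and HD conditions, and the 2×2 contingency table of repetition counts is a placeholder. The tables report one-sided tests; SciPy’s `alternative` argument selects the sidedness.

```python
# Minimal sketch of the tests reported in Tables 3-6 (assumed inputs,
# not the authors' code). lr_hi / lr_hd: learning rates observed at a
# given run-length in the HI / HD conditions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
lr_hi = rng.normal(0.4, 0.2, size=200)  # placeholder samples
lr_hd = rng.normal(0.6, 0.2, size=180)  # placeholder samples

# Equality of means (Tables 3 and 4): Welch's t-test, Mann-Whitney U,
# Student's t-test; pass alternative='less' or 'greater' for the
# one-sided versions reported in the tables.
p_w = stats.ttest_ind(lr_hi, lr_hd, equal_var=False).pvalue
p_mwu = stats.mannwhitneyu(lr_hi, lr_hd).pvalue
p_s = stats.ttest_ind(lr_hi, lr_hd, equal_var=True).pvalue

# Equality of variances of the corrections (Table 5): Levene's test.
p_levene = stats.levene(lr_hi, lr_hd).pvalue

# Independence of repetition propensities (Table 6): Fisher's exact
# test on a 2x2 table of (repeat, no-repeat) counts per condition.
counts = [[30, 170], [55, 125]]          # placeholder counts
p_fisher = stats.fisher_exact(counts)[1]

print(p_w, p_mwu, p_s, p_levene, p_fisher)
```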
Supplementary figures

[Figure 17 panel: learning rate as a function of run-length, HI and HD, with significance stars.]

Figure 17: Learning rates at run-lengths greater than 10.
To curb the fluctuations due to the decreasing amount of data at high run-lengths, we use a sliding window of size 3 over the run-lengths (i.e., we pool together the learning rates corresponding to three consecutive run-lengths); the number reported on the x-axis is the center of this window. In the HD condition, the learning rate after run-length 10 remains large and appears to increase. In the HI condition, the learning rate keeps decreasing after run-length 10. Although the learning rate in this condition should presumably plateau eventually, our subjects do not appear to reach this stage over the run-lengths for which we have sufficient data for analysis (up to 15).
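The sliding-window pooling used in this figure is simple to implement; below is a minimal sketch under an assumed data layout (lr_by_runlength, a hypothetical mapping from each run-length to the array of learning rates observed at it).

```python
# Pool learning rates over windows of three consecutive run-lengths
# (sketch; lr_by_runlength is a hypothetical dict mapping run-length
# to a NumPy array of the learning rates observed at it).
import numpy as np

def windowed_learning_rates(lr_by_runlength, width=3):
    half = width // 2
    run_lengths = sorted(lr_by_runlength)
    centers, means = [], []
    for center in run_lengths:
        window = [t for t in run_lengths if abs(t - center) <= half]
        pooled = np.concatenate([lr_by_runlength[t] for t in window])
        centers.append(center)  # value reported on the x-axis
        means.append(pooled.mean())
    return centers, means
```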
[Figure 18 panels: (A) average learning rates for the HI/HD and short/long run-length conditions; (B) learning rates as a function of run-length ˜τ; (C) standard deviations of responses as a function of ˜τ; HI and HD, with significance stars.]

Figure 18: Human learning rates and standard deviations of responses, excluding all occurrences of repetitions. (A,B) as in Figs. 2B,C; (C) as in Fig. 4C.

[Figure 19 panel: skewness of subjects’ responses vs. skewness of the Bayesian posterior, for posteriors supported within [75,225]; legend: HI, HD, linear regression.]
Figure 19: Skewness in subjects’ responses away from the bounds of the response range.
Same analysis of the skewness as in Fig. 6B, but restricted to the trials in which the support of the Bayesian posterior is entirely contained in the middle interval whose width is half that of the state space (i.e., [75,225], to be compared with the state space, [0,300]). In both conditions, the correlation between the skewness of the Bayesian posterior and the empirical skewness of the subjects’ responses is positive and significant (HI: Pearson’s r = 0.23, p = 2e-12; HD: r = 0.14, p = 4e-4).
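A restricted correlation of this kind can be computed along the following lines. This is a sketch with hypothetical per-trial inputs (the support bounds and skewness of the Bayesian posterior, and the skewness of the corresponding subjects’ responses), not the authors’ code.

```python
# Sketch: Pearson correlation between the skewness of the Bayesian
# posterior and that of subjects' responses, restricted to trials in
# which the posterior support lies within [75, 225] (hypothetical
# per-trial inputs).
import numpy as np
from scipy import stats

def restricted_skew_correlation(support_min, support_max,
                                posterior_skew, response_skew,
                                lo=75.0, hi=225.0):
    keep = (np.asarray(support_min) >= lo) & (np.asarray(support_max) <= hi)
    x = np.asarray(posterior_skew)[keep]
    y = np.asarray(response_skew)[keep]
    return stats.pearsonr(x, y)  # returns (r, p-value)
```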
[Figure 20 panels: (A) learning rate, (B) repetition propensity, and (C) standard deviation of responses, each as a function of run-length, HI and HD, with significance stars.]
Figure 20: Human learning rates, repetition propensities, and standard deviations of responses, from the complete contingent of subjects, including the four subjects excluded from the analysis presented in the main text (see Methods). (A) as in Fig. 2C; (B) as in Fig. 3B; (C) as in Fig. 4C.
[Figure 21 panels, rows A and B: learning rate, repetition propensity, and standard deviation of responses (subjects, with the Bayesian posterior on the right-hand scale), each as a function of run-length (0–80).]

Figure 21: Behavior of human subjects in the online inference tasks conducted by [25] (A) and [26] (B).
Left panels: learning rates as a function of the run-length, in trials in which the surprise is greater than … . Middle panels: repetition propensities. Right panels: standard deviations of responses, averaged across subjects. In these studies, the stimulus is binary and Bernoulli-distributed; the task is to infer the parameter of the Bernoulli distribution, which is subject to change points occurring with a constant probability of 0.5% (to be compared with 10% in the HI condition of our task). Moreover, some subjects were presented several times with the same sequence of stimuli (in different sessions), which makes it possible to examine the variability of responses within subjects (right panels: across-subject mean of the within-subject standard deviations). In all panels, in order to mitigate noise, we pool the responses in windows of five consecutive run-lengths.
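The within-subject variability measure described above (the across-subject mean of within-subject standard deviations, pooled over windows of five run-lengths) could be computed along these lines. This is a sketch under an assumed data layout, not the authors’ analysis code: responses is a hypothetical array of shape (subjects, sessions, trials) for subjects who saw the same stimulus sequence in several sessions, and run_length gives each trial’s run-length.

```python
# Sketch: across-subject mean of the within-subject standard
# deviations, pooled over windows of five consecutive run-lengths
# (hypothetical inputs, not the authors' analysis code).
import numpy as np

def pooled_within_subject_sd(responses, run_length, window=5):
    # responses: (n_subjects, n_sessions, n_trials); run_length: (n_trials,)
    sd = responses.std(axis=1)       # within-subject SD across sessions
    mean_sd = sd.mean(axis=0)        # average over subjects, per trial
    bins = np.asarray(run_length) // window
    return {int(b): mean_sd[bins == b].mean() for b in np.unique(bins)}
```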