Informational entropy refinement as a stochastic mechanism for sequential decision-making in humans
Javier Cristín, Vicenç Méndez and Daniel Campos

(Dated: February 18, 2021)

While perceptual decision making in humans is often considered to be governed by evidence accumulator models (like drift-diffusion), the mechanisms driving harder situations, where prospection of future scenarios is necessary, remain largely unknown. Here, experimental and computational evidence is given in favour of a mechanism in which prospection of the possible future payoffs associated with each available choice could be used, through the internal estimation of the corresponding Shannon entropy S; the decision would then be triggered as soon as S reaches a threshold which ensures that a choice is reliable enough. We illustrate this idea using a task in which subjects have to navigate sequentially through a maze on the computer screen while avoiding trajectory overlaps, forcing them to use memory and prospection skills that we indirectly capture through eye-tracking. Comparison of the experimental data with that from virtual (ideal) subjects allows us to verify that the performances observed, as well as the distributions of decision-making times, of humans are compatible with the aforementioned mechanism.

I. INTRODUCTION
In our daily life, we constantly find ourselves in situations that imply making decisions: what I am going to eat, which film I will see, or whether I am on time for the bus. In all these situations we must evaluate the different options available and find a way to choose the best one. While these situations lie within the field of psychology, in recent years there has been a growing interdisciplinary interest in decision-making. Its neural correlates constitute at present an important subject in cognitive and behavioral neuroscience [1–4]. Also, the strategies to improve the efficiency of our decisions constitute an important subject in game theory [5, 6] and econophysics. Last but not least, ideas from statistical physics and/or complex systems have also made their way into the field; while most contributions to date focus on decision-making at the level of groups or collectives (see [7–11] for some reviews), tentative works suggesting physical principles that could be involved in individual decisions do also exist [12–16].

Large efforts have been put into understanding the dynamics and the characteristics of perceptual decisions, that is, those where sensory information provides direct evidence for choosing between the options available, as in the famous random dot motion task [17, 18]. As a result, a correspondence between such sensory information and the neuronal responses responsible for the evidence accumulation in the brain is assumed to be identifiable, providing a physical measure or correlate of the neural processes involved. Alternatively, value-based or preferential decision-making identifies (though the exact definition changes from one field to another, or among authors) situations where a deliberative process, possibly involving some subjective estimates, is necessary in order to reach the decision, as for example when a subject is asked to choose between two food items.
So, in such cases the neural correlate becomes obviously more difficult to identify.

We can still introduce a third class of situations in which an objective answer to the task does exist, but such answer is not trivial to reach from sensory information alone because the task involves consecutive decisions, such that the present output has an influence over the next ones. Following some existing literature (see, e.g., [19, 20] and references therein), we will denote this as sequential decision-making. This obviously requires a higher cognitive capacity and a more reflective response by the subject in order to process the information. This is the typical case involved in tasks like playing board games such as chess, or solving mazes or tasks presented in some intelligence tests. Note that all these decisions involve building a bunch of future possibilities, that is, carrying out a prospection of the possible situations that would occur after a single decision, in a tree-like fashion. In our work, then, we will use the term prospection to denote such hypothetical, or mental, simulations of future events, which in general require (and are actually coupled to) high memory and abstraction capacities [21–23].

There have been many efforts focused on developing a theoretical framework to explain the mechanisms and dynamics of perceptual decision making. Most of them lie within the so-called accumulator framework, in which cognitive evidence (described through some kind of stochastic process) is gained throughout time until it reaches a given threshold, which then triggers the decision.
The paradigmatic example of such approaches is the Drift-Diffusion Model (DDM) [24], where a random variable denoting the relative evidence between the different options is assumed to be driven by the combination of a diffusion process (which introduces cognitive fluctuations or noise in the process) and a drift process that accounts for the evidence accumulation towards the correct answer gained through sensory information. Nowadays it is widely accepted among psychologists that the success of the DDM at describing such simpler situations is overwhelming [25, 26], though in some cases it requires modifications or extensions to be useful, such as introducing time-dependent thresholds [27] or dynamic changes of the drift according to the current state of the variable [28]. Furthermore, recent works have shown that value-based decisions can also be accommodated within this framework by introducing some additional details or assumptions [29, 30].

On the contrary, equivalent stochastic mechanisms able to capture the dynamics of sequential or more complex decision processes are scarce [31]. Here we will provide experimental evidence that those highly cognitive decision processes in humans, at least those of a certain type, are compatible with a stochastic framework in which not evidence but information (or informational entropy, to be specific) is implicitly being computed by the individual. For this, we study the efficiency of subjects at solving a particular navigation task through a maze on the computer screen, combined with eye-tracking data to assess the corresponding behavioral time dynamics. We do not introduce any explicit costs for prospecting or analyzing information (as there are no time, or other, constraints involved).
Thus we pose an extreme situation where decisions are mostly driven by optimization of the internal prospection process, rather than by any speed-accuracy tradeoff or any other principle related to fast or efficient sampling.

We will use principles from statistical physics and information theory to show that human decision-making in such contexts can be adequately described by a process of (informational) entropy refinement until a reliability threshold in the entropy is reached. In Section II we present our framework and discuss its main conceptual differences with the accumulator models typically used to study perceptual decision-making. In Section III we present the experimental design and discuss the corresponding performance of the subjects. Comparison of those efficiencies to those obtained by (virtual) random subjects allows us to prove (Section IV) that the performances obtained by real human subjects can only be explained if prospection is actually being used during the task, and we can even quantify to a certain extent the level of prospection that individuals are using in each case. Finally, we study in Section V the statistical properties of the response time dynamics observed during the task to provide quantitative evidence that human behavior in our task is driven by the entropy threshold mechanism mentioned above. The conclusions from our study are presented in Section VI.

II. THEORETICAL FRAMEWORK
A relevant problem in any decision-making process is to establish a criterion to identify when one has enough information to make an accurate decision. This is the question that led Wald to develop the well-known Sequential Probability Ratio Test (SPRT), which provides a criterion that minimizes the time required to make a decision based on accumulation of evidence. Given two different options (A and B, each with its corresponding reliability), we define the probabilities to decide each one of the options as p_A and p_B (which are assumed to be mutually exclusive options, so p_A + p_B = 1).

FIG. 1. Scheme for the accumulator and reliability mechanisms. The left column corresponds to the evidence accumulation for perceptual decision making (upper half), with Wald's ratio W as the quantifying magnitude (lower half). The right column corresponds to the sequential decision-making mechanism (upper half), with the Shannon entropy S as the quantifying magnitude (lower half). The label t_d on the horizontal axis corresponds to the decision time.

The SPRT is given in terms of the log-likelihood function

W = ln(p_A / p_B)    (1)

and establishes that the decision should be taken as soon as the cumulative of W(t) computed through evidence accumulation, Σ_i W_i = Σ_i ln(p_{A,i}) − Σ_i ln(p_{B,i}), exceeds (or falls below) a given threshold W_th. Here, p_{A,i} and p_{B,i} are defined as the probabilities for each option (either A or B) when estimated through the accumulation of sensory evidence (assuming for convenience that such a process can be divided into discrete steps). Consequently, the SPRT criterion above establishes the sufficient information to decide, and actually the DDM can be seen as a particular continuum implementation of the test [32, 33].

The SPRT, in fact, also admits an interpretation in terms of information theory. If we redefine the probabilities of the options A and B as p_A = 1 − p and p_B = p, then

W = ln((1 − p)/p) = ∂S/∂p,    (2)

where we have introduced in the last step the Shannon entropy S = −p ln p − (1 − p) ln(1 − p) (the extension from binary options to ternary, or more complex, decisions is straightforward).
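The identity in Eq. (2) can be verified by direct differentiation of the Shannon entropy:

```latex
\frac{\partial S}{\partial p}
  = \frac{\partial}{\partial p}\Bigl[-p\ln p-(1-p)\ln(1-p)\Bigr]
  = -\ln p - 1 + \ln(1-p) + 1
  = \ln\!\left(\frac{1-p}{p}\right) = W .
```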
Thus, the SPRT criterion can be seen as a threshold on the cumulative variation of entropy with respect to variations in p during the process of evidence accumulation.

Our proposal for sequential decision-making is that, in the absence of any explicit tradeoffs that constrain the prospection of future possibilities, it is the entropy itself, not its variations, that is the relevant magnitude for reaching a state at which the decision can be taken reliably. That is, we propose that one takes the decision once S has reached a given threshold S_th. Note that entropy cannot be accumulated (contrary to the case of evidence), since S is a bounded function. Instead, our mechanism suggests that the initial state of the subject is characterized by a maximum entropy (or maximum uncertainty), and the progressive information acquisition and prospection provide a better estimation of p_A and p_B, such that when the entropy decays below a threshold S_th the decision is taken. Accordingly, the evidence accumulation typical of the SPRT is replaced by an entropy refinement process (Fig. 1, right panels).

To implement this mechanism we need to solve the problem of how the information processed by the subject may be translated into the probabilities p_A and p_B appearing in S. In the case of perceptual decisions, probabilities are assigned in a way proportional to sensory evidence (for example, in the random dot motion task evidence is proportional to the time the subject gazes at points moving in one given direction and/or to the number of different points detected in that direction of motion). But how can this be implemented in contexts where not sensory evidence, but prospection and subsequent reflection, drive the decision?

If the information obtained during the prospection process results in a clear reward or payoff (we denote these rewards as E_A and E_B for the options A and B), then the question reduces to finding the mapping from the E's to the probabilities.
At this point, we introduce the hypothesis that the prospection process is used by humans as a way to estimate the mean value of the payoff, μ, that can be obtained from each choice. If that is the case, then a direct implementation of the Maximum Entropy Principle (MEP) [34] from information theory states that, for an estimate of μ, the most neutral (or unbiased) choice of probabilities that we can assign to each option reads

p_{A,B} = e^{−β E_{A,B}/μ} / Z,    (3)

where β is a positive constant (which appears as a Lagrange multiplier when applying the MEP) and Z is a normalization factor. Note that this is equivalent to the canonical or Maxwell-Boltzmann statistics employed in statistical physics, and interestingly it leads to W(t) = −β(E_A − E_B)/μ, so the traditional SPRT could be interpreted in this case as a way to impose a threshold on the difference of the cumulative rewards E_A and E_B computed through prospection.
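To make the mechanism concrete, the mapping (3) and the two stopping rules can be sketched in a few lines of code. This is a minimal illustration, not the exact implementation used in our analysis: payoff samples are Gaussian, μ is taken as the known mean payoff, and all parameter values are arbitrary.

```python
import math
import random

def entropy(p):
    """Shannon entropy of a binary distribution (natural log)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def samples_to_decide(mu_a=0.6, mu_b=0.4, beta=1.0, s_th=0.3, w_th=5.0,
                      n_max=100000, rng=None):
    """Accumulate unit-variance Gaussian payoff samples around mu_a and
    mu_b, map the cumulative payoffs E_A, E_B to canonical probabilities
    (Eq. 3) with mu = (mu_a + mu_b) / 2, and return the number of samples
    needed by (i) the SPRT rule |W| > W_th and (ii) the entropy
    refinement rule S < S_th."""
    rng = rng or random.Random(0)
    mu = 0.5 * (mu_a + mu_b)           # estimate of the mean payoff
    e_a = e_b = 0.0                    # cumulative payoffs E_A and E_B
    n_sprt = n_ent = None
    for n in range(1, n_max + 1):
        e_a += rng.gauss(mu_a, 1.0)
        e_b += rng.gauss(mu_b, 1.0)
        w = -beta * (e_a - e_b) / mu   # Wald's ratio, W = ln(p_A / p_B)
        p_a = 1.0 / (1.0 + math.exp(-w))   # canonical probability of A
        s = entropy(p_a)
        if n_sprt is None and abs(w) > w_th:
            n_sprt = n
        if n_ent is None and s < s_th:
            n_ent = n
        if n_sprt is not None and n_ent is not None:
            break
    return n_sprt, n_ent
```

Repeating this simulation many times yields the distributions of n compared in Fig. 2 for the two criteria.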
FIG. 2. Probability distribution of the number of necessary samples (n) required to reach the corresponding decision threshold, both for the entropic and Wald's algorithms.

A. Binary toy model
Before moving to experimental results, it can be illustrative to explore some properties of our mechanism (based on a threshold of the Shannon entropy S) compared to the classical SPRT criterion.

We propose the following toy model to compare both. We denote the mean rewards for options A and B as μ_A and μ_B. Then we assume that, through the prospection process, successive estimates of the energies E_A and E_B are obtained by the subject in such a way that each sample made during prospection follows a Gaussian distribution of unit variance centered at μ_A and μ_B, respectively. The cumulative energy so obtained is used to compute the probabilities in (3). The fundamental question to answer is what number of samples n is necessary to overpass (or fall below) a threshold, either in W (for the SPRT mechanism) or in S (in our proposal), as a way to decide between options A and B.

The SPRT case exhibits a distribution of the number of samples that depends strongly on the distance between the means Δμ ≡ μ_A − μ_B (Fig. 2). Instead, for the criterion of entropy refinement the distribution of necessary samples exhibits a power-law behavior P(n) ∝ n^{−3}, for a wide range of Δμ and S_th. At this point, we have demonstrated at least one fundamental difference between using energetic or entropic accumulators, so we can use it as a proxy to discriminate between these two criteria.

III. ENVIRONMENTS FOR PROSPECTION
Sequential decision making requires a mental processing of the acquired information which can be deeply complex and hard to capture, even with the help of brain activity monitoring. However, simple situations in which sensory information can be assumed to somehow reflect such information processing can help reconstruct the actual mechanisms behind it. With this purpose we have designed a particular navigation task in a maze, monitored with the help of eye-trackers. Efficient navigation strategies in a maze involve a prospection process through visual inspection of the possible paths available within the sight distance, thus involving a gathering of non-local evidence. In our context, we prepare the setup in such a way that we can indirectly estimate the paths prospected by the subjects with the help of the eye-tracking data.
A. Experimental Design
18 clinically normal adults (11 women and 8 men) aged from 18 to 45 carried out the experiment. In the first part of the task, subjects are presented with a discrete 7x7 regular lattice on the screen, representing a discrete set of 49 patches where the navigation process takes place (Fig. 3, upper panel on the right). The patches are linked through paths connecting them only to neighbouring patches (4 paths per node, except at the boundaries, where there are only 2 or 3). Among all possible paths, we remove a fraction of them (20%, preventing isolated regions from being formed in the structure), so introducing some level of heterogeneity into the lattice (Fig. 3, left panels).

The subjects are asked to visit the maximum number of patches of the resulting graph within 49 steps, starting from the center of the structure (one step is defined as a transition between connected patches in the graph). They do this by clicking with the mouse on the patch to which they want to move next (Fig. 3, middle panels, show some realizations of the resulting trajectories). The heterogeneity of the graph then makes the process nontrivial (note that for a regular lattice the optimal strategy to cover the maximum number of nodes would simply be to perform a ladder-like trajectory).

To assess the subjects' performance under different levels of complexity, the patches of the rectangular lattice are then reorganized in a circular way, in two different manners. In the first case (Circular Ordered), we keep the order of the rows of the first rectangular graph (Fig. 3 b)). For the third graph (Circular Disordered), we place the nodes following a circular structure but with the maximum visual disorder (Fig. 3 c)). We remark that topologically the three structures are completely equivalent. However, intuitively we expect a growing difficulty to gather the relevant information and to prospect as the visual distribution of patches and paths gets disordered, resulting in a lower performance (that is, a lower number of patches visited).
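The construction of such a heterogeneous lattice can be sketched as follows. This is a minimal illustration (the actual generation procedure is detailed in Appendix A); the rejection rule used here is just one plausible way to prevent isolated regions:

```python
import random
from collections import deque

def build_maze(size=7, remove_frac=0.2, rng=None):
    """Build a size x size lattice as a list of nodes and a list of kept
    links, removing a fraction of the links while rejecting any removal
    that would disconnect the graph (no isolated regions)."""
    rng = rng or random.Random(1)
    nodes = [(i, j) for i in range(size) for j in range(size)]
    edges = [((i, j), (i + 1, j)) for i in range(size - 1) for j in range(size)]
    edges += [((i, j), (i, j + 1)) for i in range(size) for j in range(size - 1)]

    def connected(edge_set):
        # breadth-first search from an arbitrary node
        adj = {v: [] for v in nodes}
        for a, b in edge_set:
            adj[a].append(b)
            adj[b].append(a)
        seen, queue = {nodes[0]}, deque([nodes[0]])
        while queue:
            for w in adj[queue.popleft()]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        return len(seen) == len(nodes)

    kept = list(edges)
    target = int(len(edges) * (1 - remove_frac))
    candidates = kept[:]
    rng.shuffle(candidates)
    for e in candidates:
        if len(kept) <= target:
            break
        trial = [x for x in kept if x != e]
        if connected(trial):       # never create isolated regions
            kept = trial
    return nodes, kept
```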
Additionally, we rotated the rectangular structure by 90º, 180º and 270º (with the corresponding Circular Ordered and Circular Disordered reorganizations) to randomize the task (so 12 cases in total are presented to each subject) without changing the topology of the structure. A more detailed explanation of the organization of the structures is given in Appendix A. The final dataset comprised 72 trajectories for each graph (Rectangular, Circular Ordered and Circular Disordered).

During the task, we cannot directly infer the paths prospected by the subjects from their trajectory. As a proxy for prospection, then, we use eye fixations measured through a commercial eye-tracker (Tobii X2-30, at 30 Hz) (Fig. 3, right panels). We use this to analyze (i) the number of patches at which the subject gazes between consecutive steps, and (ii) the time the gaze remains on particular patches, as a way to infer how resources are being invested throughout the prospection time (see Appendix B for further details).

B. Decision-making time dynamics
The global performance of the individuals on the three different versions of the graph is computed as the number of patches covered during the entire trajectory (Fig. 4 a)). For the Rectangular graph, the subjects visited on average about 37 of the total 49 patches. For the Circular Ordered graph they covered about 29 patches on average, and the coverage was lowest for the Circular Disordered graph.

To characterize where subjects gaze while deciding, we define the bond distance d_b as the minimum number of steps/bonds between the current patch and the one the individual is gazing at. The corresponding experimental distributions of d_b are found to be completely different for the three visual organizations considered (Fig. 4 c)). The distance (in bonds) between any two patches is exactly the same in all cases; it is then clear that the individuals do not prospect equally in the three cases: while in the Rectangular case a large amount of time is invested in gazing at nearby patches, in the Circular cases (especially the Disordered one) frequent gazes at distant patches (in terms of bonds) are observed. This must be attributed either to (i) distractions caused by the presence of patches which are close in the screen configuration (though they are not easily accessible from the current one), or (ii) the difficulty of easily identifying the patches which are available in the next few steps. An efficient prospection should combine an intensive exploration of closer patches with a smaller (but not necessarily negligible) exploration of further ones. We illustrate that idea in the inset of Fig. 4 c), where the cumulative probability of gazing at nearby patches (defined as those with d_b ≤ 4) is shown to decrease drastically as a function of the visual complexity of the task.
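The bond distance d_b is simply the graph distance between patches, which can be computed by breadth-first search. A minimal sketch (the adjacency mapping `adj` below is a tiny stand-in for the actual lattice):

```python
from collections import deque

def bond_distance(adj, source, target):
    """Minimum number of bonds between two patches, by BFS
    on an adjacency mapping {node: [neighbours]}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            return dist[v]
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return None  # target unreachable from source

# example: a 4-patch chain 0-1-2-3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```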
FIG. 3.
Scheme of the experimental setup. The first column corresponds to the three patch graphs. The paths correspond to the allowed movements between patches (the absence of a path means the nodes are not connected). The color of the patches is introduced to facilitate the understanding of the spatial reorganization. The second column shows one individual trajectory within each structure (the color code corresponds to the current step of the 49-step trajectory). The third column shows the locations at which the individual has gazed during the trajectory (the color code corresponds to the current step of the 49-step trajectory).

IV. QUANTIFYING PROSPECTION DURING NAVIGATION
As a way to quantify and refine the ideas above, we propose to compare the subjects' performance in our task to that of virtual subjects following an algorithm which is able to automatically prospect all the information available within a certain topological distance d_p (the prospection length) in the lattice. Note that in a classical random-walk algorithm, instead, the walker would select the paths completely at random, making uninformed decisions.

Our prospective walker then assigns a payoff E_i to the neighbouring patch where the prospected path begins. This payoff is equal to the fraction of already visited patches that the prospected path crosses (so E ∈ [0, 1], with E = 0 representing a prospected path for which all sites are still unvisited, and E = 1 representing a path for which all sites are already visited).

To compute the number of newly visited patches, it is then necessary that the walker keeps its previous trajectory in memory. To implement this in a realistic way, we consider that previous visits to a patch are kept in the walker's memory during a characteristic number of steps τ_m.

FIG. 4. a) Average number of covered patches after the 49-step trajectory for each one of the graphs. b) Average decision time, accounting for all the movements of the trajectory, for each one of the graphs. c) Distribution of the bond distance d_b between the current patch and the patches that have been gazed at before making a movement, for each one of the graphs. The inset corresponds to the cumulative probability of gazing at patches at a distance d_b ≤ 4.

As a result, if a certain choice would imply moving to regions that, according to the prospection length and the memory capacity of the walker, are already visited, then the payoff associated with that option will be large. Instead, if a certain choice is seen to drive the walker to a region with a large number of non-covered patches, the corresponding payoff will be lower.

The option with lower energy E_i will then be the one which is assigned a larger probability p_i, according to (3). In a situation where all the energies are equal, the walker has no reliable information to make its decision. The decision of when to move to the next patch is taken by the random walker according to the decision criterion described above in Section II. That is, successive prospections of paths are carried out, and the values of the payoffs E_i and the probabilities p_i are continuously updated. The walker then computes the corresponding Shannon entropy S = −Σ_i p_i ln p_i, and when the computed value falls below a threshold S_th, the walker makes a decision according to the probabilities p_i computed at that time (see Appendix A for technical details).

The rules above already allow the algorithm to avoid overlaps in its trajectory (more or less efficiently, in terms of the parameters d_p and τ_m) without explicitly requiring it to maximize the number of visited sites, as we do with the human subjects. As the performance of the algorithm is independent of the visual organization of the lattice (Rectangular or Circular), we can use it as a reference model against which to compare the human performance in our experiment, so assessing the prospection mechanisms that are presumably being used by the human subjects.

By exploring the range of d_p and τ_m values in the algorithm, we can divide the parameter phase space into four regions (Fig. 5 a)). In region I the algorithm produces an average number of visited patches lower than the individuals in any of the experiments.
Region II produces a performance which lies between the results obtained for the Circular Ordered and Circular Disordered cases. Region III overcomes the results for the Circular Ordered performance but not for the Rectangular one. Region IV, finally, outperforms all the experimental results.

Hence, we conclude that relatively large values of both τ_m and d_p are necessary to match or improve the coverage in the Rectangular graph, suggesting that individuals remember the visited patches and predict future outcomes efficiently in this case. So, the ability to prospect is necessary to justify the subjects' performance found in the experiments. For the Circular structures the individuals are probably not able to track the paths to distant patches (there is not sufficient local gathering); in consequence, the value of d_p necessary to reproduce their performance is lower than in the Rectangular case (a further comparison is provided in Appendix A).

For certain values of d_p and τ_m the random walker actually reproduces the distribution of performances obtained from the experiments. In Fig. 5 we show results for the cases d_p = 6 with τ_m^R = 70, d_p = 3 with τ_m^CO = 7, and d_p = 2 with τ_m^CD = 5, which provide the best fits to the experimental data. We analyze the dynamical trends of the experimental trajectories and of the trajectories for the fitted parameters (Fig. 5 b)). The number of covered patches presents in all cases a monotonic growth, which is reduced as the trajectory progresses and overlaps consequently appear. The experimental curves and those obtained from the algorithm with the parameters mentioned above agree almost perfectly. Thus, the algorithm reproduces the dynamic performance of the human subjects during the experiment. Likewise, the distribution of the final performances so obtained is also in perfect agreement with the experimental data (Fig. 5 c)).
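One decision of the prospective walker can be sketched as follows. This is a simplified illustration of the procedure detailed in Appendix A: the function names, the path-sampling scheme and all parameter values are illustrative choices, not the exact implementation.

```python
import math
import random

def prospect_paths(adj, start, d_p):
    """Enumerate simple paths of (up to) d_p steps from `start`, grouped
    by the neighbouring patch through which each path begins."""
    paths, stack = {}, [[start, n] for n in adj[start]]
    while stack:
        path = stack.pop()
        ext = [path + [n] for n in adj[path[-1]] if n not in path]
        if len(path) - 1 == d_p or not ext:   # full length or dead end
            paths.setdefault(path[1], []).append(path[1:])
        else:
            stack.extend(ext)
    return paths

def decide(adj, current, visited, d_p=2, beta=4.0, s_th=0.5, rng=None):
    """One decision of the walker: the payoff E_i of each option is the
    fraction of already visited patches on a prospected path, the
    probabilities follow Eq. (3), and the move is triggered once the
    Shannon entropy of the options falls below s_th (or after a fixed
    budget of prospections)."""
    rng = rng or random.Random(2)
    options = prospect_paths(adj, current, d_p)
    for _ in range(200):                 # successive prospections
        e = {i: sum(q in visited for q in rng.choice(ps)) / d_p
             for i, ps in options.items()}
        mu = sum(e.values()) / len(e) or 1.0   # mean payoff estimate
        z = {i: math.exp(-beta * ei / mu) for i, ei in e.items()}
        tot = sum(z.values())
        p = {i: zi / tot for i, zi in z.items()}
        s = -sum(pi * math.log(pi) for pi in p.values() if pi > 0)
        if s < s_th:                     # entropy threshold reached
            break
    return rng.choices(list(p), weights=list(p.values()))[0]
```

On a graph where one neighbour leads to fresh patches and the other back into visited territory, the entropy drops quickly and the walker almost always chooses the fresh direction.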
V. THE EXPERIMENTAL DATA IS COMPATIBLE WITH THE ENTROPIC REFINEMENT MECHANISM
In Section II we introduced a decision-making criterion based on the acquisition of sufficient decision reliability, that is, on a threshold for the informational entropy. The Gaussian toy model explored there exhibited power-law scaling (with exponent γ = −3) for the number of samples between decisions under the entropy refinement mechanism. Actually, the algorithm presented in the previous Section for reproducing the dynamics of subjects in the Rectangular and Circular structures also exhibits the same behavior (Fig. 7). This exponent is then the reference against which we compare the experimental distributions of i) the times between decisions, ii) the times during which the subjects gaze at the same patch, and iii) the number of different patches gazed at before making a decision (n).

FIG. 5. a) Diagram of the walker's covered patches in comparison with the experimental results. Regime I corresponds to a worse average performance than in all geometries. Regime II corresponds to a better performance than in Circular Disordered. Regime III corresponds to a better performance than in Circular Ordered and Disordered. Regime IV corresponds to a better performance than in all geometries. b) Evolution of the average number of remaining non-visited nodes during the subject and walker performance. c) Distribution of the final number of covered patches obtained from the subject and walker performance. The dots correspond to the experimental data while the solid lines correspond to the walker mechanism.

FIG. 6. a) Distribution of times the subject stares at a certain patch. b) Distribution of decision times (between consecutive movements). c) Distribution of the number of gathered patches between consecutive movements.

Despite the different performances found for the three levels (Rectangular, Circular Ordered and Disordered), as reported in the previous Sections, the time distributions in all these cases exhibit extremely similar properties (Fig. 6). This suggests a common underlying mechanism for decision-making. What is more, all the distributions fit closely the power-law decay P(t) ∼ t^{−3} (or P(n) ∼ n^{−3}), in agreement with the predictions from our information-theoretical criterion based on S. The only significant differences appear for smaller decision times, which seem to be scarce in the Circular Disordered case (as suggested from Fig. 4 b)).

Intuitively, the decision time may be understood as the sum of the times that the individual has been gazing at each individual patch. Then it could be that the power law emerges either from (i) the distribution of times the subject keeps looking at a given patch, or (ii) the number of patches that are gazed at between decisions. Either case would provide an explanation for the scale-free feature of the decision time distributions as a consequence of another distribution. It is, however, the case that both distributions (for the number of patches gazed at and the gazing times) seemingly present the scale-free decay (Fig. 6, middle and right panels). So, the underlying mechanism yielding the power-law distribution for decision times is apparently a nontrivial combination of both.

FIG. 7. Distribution of the number of prospections n performed by the walker to force the entropy S to fall below the threshold S_th.

FIG. 8. a) Energy difference ΔE between the options (directions) with more and less accumulated energy when the decision is made. The x-axis groups the decisions by their corresponding decision time. Linear fits are shown for each graph (Rectangular, Circular Ordered and Circular Disordered). b) Shannon entropy S at the moment the decision is made. The x-axis groups the decisions by their corresponding decision time. Linear fits are shown for each graph.

To provide further evidence of the compatibility of the experimental results with our information-theoretical (entropy refinement) criterion, we quantify the payoffs E_i and the corresponding entropy S at the instant at which each decision is made by the subject (Fig. 8). We recall that the SPRT criterion for canonical probabilities (3) is equivalent to imposing that a threshold in the energy difference ΔE triggers the decision, while for our criterion it is the entropy S which must reach a fixed threshold. For the experimental navigation task, we find that the difference ΔE (computed between the choices with lower and higher payoffs at the decision time) is a monotonically growing function of the decision time, so longer decisions require longer payoff accumulation (Fig. 8 a)). On the contrary, the informational entropy S remains approximately independent of the decision time, suggesting that this magnitude is really an invariant for all decisions and so supporting the view that a threshold in S may trigger the decision (Fig. 8 b)). This seems to be particularly robust for longer decisions, while shorter ones deviate more from this trend.

VI. CONCLUSIONS
Navigation efficiency in higher organisms should take into account the criteria they use to prospect the outcomes of their available options, and how these are weighted to reach a behavioral decision. Here we have studied the decision-making dynamics subject to a process of information gathering and have provided evidence that the mechanism triggering the decision relies on a threshold in the informational entropy based on the probabilities estimated through prospection. While this mechanism shares some ideas with the classical SPRT criterion typically used for perceptual decision making, we claim that in sequential decision-making scenarios it is information refinement or reliability (rather than evidence accumulation) that is the fundamental magnitude.

Our hybrid analysis, comparing the navigation abilities of virtual subjects (algorithms) to those of human subjects, provides an approximate quantitative characterization of the cognitive memory and the prospection ability that should be required of the subject during the task. Furthermore, the distribution of decision (or gazing) times, together with the study of the final values of S reached at the moment of the decisions (Section V), allows us to think that the criterion proposed here can account to a significant extent for how information is being processed by the subjects during the task. In this respect, note that traditionally mean times to decision, as well as the ratio of the times corresponding to choosing option A or B (for binary decisions), have been studied in detail by psychologists. On the contrary, decision time distributions are rarely computed in decision-making experiments. Here we show that such distributions can be used as a signature to discriminate between models.
The SPRT criterion, as well as the other alternative mechanisms we have explored (not reported here), has been found unable to reproduce the power-law decay P(t) ∼ t^{−3} characteristic of our experiments. Regarding the experimental robustness of our results, since neural correlates able to provide a detailed description of how prospection and information-gathering processes are carried out are probably unattainable for sequential decision-making, eye-tracking remains a convenient proxy for them. Still, the combination of such data with EEG or other physiological sensors could probably be used to refine our ideas and provide more reliable estimates of the dynamics in sequential decision-making and/or navigation tasks, also in more realistic environments than the on-screen navigation used here. We expect that our results can stimulate research in this line in order to test the general validity of our information-theoretical criterion based on entropy refinement.

[1] J. I. Gold and M. N. Shadlen. The Neural Basis of Decision Making.
Annu. Rev. Neurosci. 30, 535-574 (2007).
[2] S. P. Kelly and R. G. O'Connell. The neural processes underlying perceptual decision making in humans: Recent progress and future directions. J. Physiol. Paris 109, 27-37 (2015).
[3] S. Blakemore and T. W. Robbins. Decision-making in the adolescent brain. Nat. Neurosci. 15 (2012).
[4] B. De Martino, D. Kumaran, B. Seymour and R. J. Dolan. Frames, Biases, and Rational Decision-Making in the Human Brain. Science 313, 684-687 (2006).
[5] D. Lee. Game theory and neural basis of social decision making. Nat. Neurosci. 11, 404-409 (2008).
[6] D. Lee, M. L. Conroy, B. P. McGreevy and D. J. Barraclough. Reinforcement learning and decision making in monkeys during a competitive game. Cognitive Brain Research 22, 45-58 (2004).
[7] I. Couzin, J. Krause, N. Franks et al. Effective leadership and decision-making in animal groups on the move. Nature 433, 513-516 (2005).
[8] A. J. W. Ward, D. J. T. Sumpter, I. D. Couzin, P. J. B. Hart and J. Krause. Quorum decision-making facilitates information transfer in fish shoals. PNAS 105, 6948-6953 (2008).
[9] G. Valentini, E. Ferrante and M. Dorigo. The Best-of-n Problem in Robot Swarms: Formalization, State of the Art, and Novel Perspectives. Front. Robot. AI (2017).
[10] L. Conradt and C. List. Group decisions in humans and animals: a survey. Phil. Trans. R. Soc. B 364, 719-742 (2009).
[11] S. Redner. Reality-inspired voter models: A mini-review. C. R. Phys. 20, 275-292 (2019).
[12] P. A. Ortega and D. A. Braun. Thermodynamics as a theory of decision-making with information-processing costs. Proc. R. Soc. A 469, 20120683 (2013).
[13] V. I. Yukalov and D. Sornette. Self-organization in Complex Systems as Decision Making. Adv. Complex Syst. 17, 1450016 (2014).
[14] P. Schwartenbeck, T. FitzGerald, R. J. Dolan and K. Friston. Exploration, novelty, surprise, and free energy minimization. Front. Psychol. 4:710 (2013).
[15] É. Roldán, I. Neri, M. Dörpinghaus, H. Meyr and F. Jülicher. Decision Making in the Arrow of Time. Phys. Rev. Lett. 115, 250602 (2015).
[16] M. Favre, A. Wittwer, H. R. Heinimann, V. I. Yukalov and D. Sornette. Quantum Decision Theory in Simple Risky Choices. PLoS ONE 11, e0168045 (2016).
[17] K. H. Britten, M. N. Shadlen, W. T. Newsome and J. A. Movshon. The analysis of visual motion: a comparison of neuronal and psychophysical performance. J. Neurosci. 12, 4745-4765 (1992).
[18] J. D. Roitman and M. N. Shadlen. Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. J. Neurosci. 22, 9475-9489 (2002).
[19] E. M. Tartaglia, A. M. Clarke and M. H. Herzog. What to Choose Next? A Paradigm for Testing Human Sequential Decision Making. Front. Psychol. 8:312 (2017).
[20] D. Zhang and R. Gu. Behavioral preference in sequential decision-making and its association with anxiety. Hum. Brain Mapp. 39, 2482-2499 (2018).
[21] D. T. Gilbert and T. D. Wilson. Prospection: Experiencing the future. Science 317, 1351-1354 (2007).
[22] T. Suddendorf and M. C. Corballis. The evolution of foresight: What is mental time travel, and is it unique to humans? Behav. Brain Sci. 30, 299 (2007).
[23] B. E. Pfeiffer and D. J. Foster. Hippocampal place-cell sequences depict future paths to remembered goals. Nature 497, 74-81 (2013).
[24] R. Ratcliff, P. L. Smith, S. D. Brown and G. McKoon. Diffusion Decision Model: Current Issues and History. Trends Cogn. Sci. 20, 260-281 (2016).
[25] R. Ratcliff and G. McKoon. The Diffusion Decision Model: Theory and Data for Two-Choice Decision Tasks. Neural Comput. 20, 873-922 (2008).
[26] A. Rangel et al. The Drift Diffusion Model can account for the accuracy and reaction time of value-based choices under high and low time pressure. Judgment and Decision Making 5, 437-449 (2010).
[27] M. L. Pedersen, M. J. Frank and G. Biele. The drift diffusion model as the choice rule in reinforcement learning. Psychon. Bull. Rev. 24, 1234-1251 (2017).
[28] L. Fontanesi, S. Gluth, M. S. Spektor et al. A reinforcement learning diffusion decision model for value-based decisions. Psychon. Bull. Rev. 26, 1099-1121 (2019).
[29] A. Roxin. Drift-diffusion models for multiple-alternative forced-choice decision making. J. Math. Neurosci. 9, 5 (2019).
[30] S. Tajima, J. Drugowitsch and A. Pouget. Optimal policy for value-based decision-making. Nat. Commun. 7, 12400 (2016).
[31] K. P. Nguyen, K. Josić and Z. P. Kilpatrick. Optimizing sequential decisions in the drift-diffusion model. J. Math. Psychol. 88, 32-47 (2019).
[32] R. Bogacz, E. Brown, J. Moehlis, P. Holmes and J. D. Cohen. The physics of optimal decision making: a formal analysis of models of performance in two-alternative forced-choice tasks. Psychol. Rev. 113, 700 (2006).
[33] A. Wald and J. Wolfowitz. Optimum character of the sequential probability ratio test. Ann. Math. Stat. 19, 326-339 (1948).
[34] E. T. Jaynes. Information Theory and Statistical Mechanics I. Phys. Rev. 106, 620-630 (1957).

APPENDIX A: EXPERIMENTAL METHODS
The final dataset comprised 72 validated trajectories for each graph (Rectangular, Circular Ordered and Circular Disordered, each with four possible orientations: 0º, 90º, 180º, 270º) coming from 18 adult subjects. The experiment was carried out on a 21" monitor (HP E233). The subjects were requested to keep a distance of about 50 cm from the monitor and to try to avoid sharp physical movements. We used a commercial eye-tracker (Tobii X2-30, at 30 Hz) to record the positions of the eye fixations (Fig. 3, right panels). The subjects were not required to take off their glasses, but doing so was recommended to avoid interference with the eye-tracker. The subjects were not urged to finish the experiment within any given time; they were simply given a fixed number of movements (49) to explore the maze, and were asked to cover the maximum number of patches possible (one movement is defined as a transition between connected neighbour patches in the graph). The 49 patches in the graph presented to the subjects are linked through paths connecting them only to neighbour patches. Among all possible movements between first neighbours, we removed a fraction of them (20%, while preventing isolated regions from forming in the structure), thus introducing some level of heterogeneity in the graph (Fig. 3, left panels). All paths started from the same (central) node of the structure, and all subjects were presented exactly the same structures in order to avoid any difference between them that could distort the comparison of their performances. So, the only difference between the three structures (Rectangular, Circular Ordered and Circular Disordered) was in the visualization of the graph on the screen, as explained in the main text. Movements between patches in the graph were carried out by clicking with the mouse on the next patch.
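As an illustration, the construction just described (a 7x7 grid of patches with nearest-neighbour links, 20% of which are pruned while keeping the structure connected) can be sketched as follows. This is our own minimal reconstruction, not the code used to generate the experimental graphs, and all function names are ours:

```python
import random

def build_maze(side=7, removal_fraction=0.2, rng=random):
    """Grid of side x side patches with nearest-neighbour links; a fraction
    of the links is removed while keeping the graph connected, so no
    isolated regions are formed and some heterogeneity is introduced."""
    nodes = [(x, y) for x in range(side) for y in range(side)]
    edges = set()
    for x, y in nodes:
        if x + 1 < side:
            edges.add(((x, y), (x + 1, y)))
        if y + 1 < side:
            edges.add(((x, y), (x, y + 1)))

    def connected(edge_set):
        # depth-first search from an arbitrary node
        adj = {n: [] for n in nodes}
        for a, b in edge_set:
            adj[a].append(b)
            adj[b].append(a)
        seen, stack = {nodes[0]}, [nodes[0]]
        while stack:
            for nxt in adj[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return len(seen) == len(nodes)

    candidates = sorted(edges)
    rng.shuffle(candidates)
    to_remove = int(removal_fraction * len(edges))
    for edge in candidates:
        if to_remove == 0:
            break
        if connected(edges - {edge}):   # never isolate a region
            edges -= {edge}
            to_remove -= 1
    return nodes, edges
```

Because a connected graph with more edges than a spanning tree always contains removable cycle edges, a single shuffled pass is guaranteed to reach the target number of removals here (16 of the 84 links for a 7x7 grid at 20%).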
To facilitate visualization of the options available (especially in the Circular Disordered case, where visualization could be difficult), the current patch of the individual was depicted in a different color (green, with the rest of the patches appearing in blue) and the links available at each movement were emphasized (with thicker solid lines). On the contrary, the subjects had no visual guides to distinguish between visited and non-visited patches, so they needed to use their memory skills to avoid overlaps. A screen recording of a particular realization of the task for each structure (Rectangular, Circular Ordered, Circular Disordered) is provided as a Supplementary Material file to facilitate understanding.

APPENDIX B: PROSPECTIVE ALGORITHM

A. Definition
We propose an algorithm in which virtual subjects are able to prospect those paths available within d_p steps in the lattice (we call this parameter the prospection length). For each path prospected, the walker assigns a payoff E_i to the neighbour patch at which that path starts (for a simple visualization, see Fig. 9). The payoff is taken to be equal to the fraction of already visited patches that the prospected path crosses (so E is bounded between 0 and 1, with E = 0 for a path that does not cover any visited patches, and E = 1 if all patches covered by the path have been previously visited).

The walker keeps its previous trajectory in memory during a characteristic number of steps τ_m. In particular, each visit is remembered by the virtual subject during a time τ obtained from the exponential distribution P(τ) = (1/τ_m) e^{−τ/τ_m}. After this time, the walker will forget that this particular patch has been previously visited, and it will count as a non-visited patch when computing the corresponding payoffs.

As a result, the algorithm will assign higher payoffs to the options that it remembers having visited and/or that are adjacent to regions that it remembers having visited. So, according to Eq. (3), the probability to choose those options will decrease, leading the virtual subject to regions that are still unvisited (or at least that it does not remember having visited before). A larger prospection length d_p allows the walker to sample the state of further regions and to compute the payoff using the information of distant patches, but this will only be efficient if the memory parameter τ_m is large enough.

Successive prospections of the paths available in each direction are carried out at random among all possible ones of length d_p, and so the values of the payoffs and the probabilities p_i are continuously updated. Note that for a given value of d_p the number of paths that can be prospected is of the order of ∼ 4^{d_p} (assuming that all links between neighbour patches are available).
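The ingredients above (payoff as the fraction of remembered visits along a prospected path, exponential forgetting with mean τ_m, and choice probabilities that decrease with payoff) can be sketched in Python roughly as follows, together with the entropy-threshold stopping rule described next in the text. This is an illustrative reconstruction, not the original simulation code: the canonical (Boltzmann) form p_i ∝ e^{−βE_i} is assumed for Eq. (3), the running average over successive prospections is our own choice of update, and all function and parameter names (including the inverse temperature beta) are ours:

```python
import math
import random

def record_visit(memory, patch, now, tau_m, rng=random):
    # a visit is retained for a time tau drawn from the exponential
    # distribution P(tau) = (1/tau_m) exp(-tau/tau_m); we store the
    # expiry time after which the patch is forgotten
    memory[patch] = now + rng.expovariate(1.0 / tau_m)

def path_payoff(path, memory, now):
    # payoff E of a prospected path: fraction of its patches the walker
    # still remembers having visited, so 0 <= E <= 1
    hits = sum(1 for patch in path if memory.get(patch, -1.0) > now)
    return hits / len(path)

def canonical_probs(payoffs, beta=1.0):
    # assumed canonical form of Eq. (3): p_i proportional to exp(-beta*E_i),
    # so options leading through remembered (visited) regions are less likely
    weights = [math.exp(-beta * e) for e in payoffs]
    z = sum(weights)
    return [w / z for w in weights]

def shannon_entropy(probs):
    # S = -sum_i p_i ln p_i
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_decision(sample_path, options, memory, now,
                     s_th=0.5, beta=1.0, max_prospections=100, rng=random):
    """Prospect one random path per option per round, updating running
    payoff estimates, until the entropy of the choice probabilities
    falls below s_th (or the cap of 100 prospections is reached)."""
    estimates = [0.0] * len(options)
    for n in range(1, max_prospections + 1):
        for i, opt in enumerate(options):
            e = path_payoff(sample_path(opt, rng), memory, now)
            estimates[i] += (e - estimates[i]) / n   # running average
        probs = canonical_probs(estimates, beta)
        if shannon_entropy(probs) < s_th:
            break
    choice = rng.choices(range(len(options)), weights=probs)[0]
    return options[choice], n, probs
```

With four equally likely options the entropy starts near its maximum ln 4 ≈ 1.39 and decreases as the payoff estimates separate, so the decision fires as soon as the probabilities become reliable enough.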
The decision of when to move to the next patch is taken by the random walker according to the decision criterion described above in Section II. After each single prospection of one path in each direction, the walker computes the corresponding Shannon entropy S = −Σ_i p_i ln p_i; if the computed value falls below a fixed threshold S_th, the walker makes the decision according to the probabilities p_i computed at that time (we have checked that deciding according to the highest probability instead does not qualitatively change the walker dynamics). On the contrary, if S > S_th then the prospection process continues. However, we additionally introduce a rule such that the maximum number of prospections is limited to 100, to avoid (extremely unusual) situations in which S would never decay below S_th because all options available persistently exhibit very similar payoffs. We have carefully checked that this rule does not modify any of the reported results in a significant way.

FIG. 9.
Scheme of three different prospection paths corresponding to three different prospection lengths: d_p = 1 (black), d_p = 2 (yellow) and d_p = 3 (green). The patches that are already visited are marked in red (so they add 1 to the payoff), while the unvisited ones appear in blue (so they add 0 to the payoff). The payoff assigned in each one of the paths depicted would be a) E_f = 1 to option f for d_p = 1, b) E_h = (0 + 0)/2 = 0 to option h for d_p = 2, and c) E_b = (0 + 0 + 1)/3 = 1/3 to option b for d_p = 3.

B. Phase diagram
The results of the algorithm shown above have been obtained under the same conditions as in the task presented to the human subjects; that is, for 49-step trajectories through the 7x7 lattice, as a function of the parameters τ_m and d_p. In particular, we study the mean coverage time T_Cov (that is, the mean time required to cover all sites in the lattice). Minimization of this magnitude then gives an estimation of the search/navigation efficiency of the algorithm.

The main conclusion we can extract (as one can deduce from the results in Fig. 10) is that the ability to prospect future paths (so, having a large d_p) is useless unless the individual has good memory skills (that is, a large τ_m value in our context). This makes sense: when the walker cannot remember the previously visited patches (low values of τ_m), the optimal strategy consists of removing prospection (d_p = 1), since in that case the information provided by further patches represents just useless noise, as the walker always sees them as non-visited patches. On the other hand, for large τ_m the walker can correctly identify the previously visited patches, so progressively higher prospection lengths d_p are found to optimize the coverage of the structure and the search for a target.

FIG. 10.
Prospection length that minimizes the mean coverage time T_Cov of the walker, as a function of the memory time τ_m, when the walker is placed in the same structure as in the experimental design.

C. Distributed prospection lengths
Assigning a constant prospection length d_p to all the prospected paths may seem rather unrealistic. Human individuals are expected instead to prospect paths of different lengths depending on the specific situation (complexity, number of choices available, etc.). The results reported in Fig. 6 c) also support this statement (the number of gazed patches is not fixed to a constant value but exhibits a variation which spans almost one order of magnitude).

We have then studied our algorithm for the case when a distribution of d_p is introduced instead of a constant value. We have tried in particular a distribution P(d_p) ∝ d_p^{−γ} (for d_p ≥ 1), with Σ_{d_p=1}^∞ P(d_p) = 1 to guarantee normalization. For γ → ∞, the paths are then fixed to d_p = 1, so the prospection algorithm reduces to identifying whether the neighbour patches have been visited or not. In the opposite limit of small γ, large d_p values become frequent (in practice we impose an upper limit on d_p, since much larger values would be absurd, given the 7x7 maze we have used). Figure 11 shows that sampling a small (but not negligible) number of long paths combined with a majority of short paths (as happens for intermediate γ values) is sufficient to recover the same dynamics as obtained for a large fixed d_p. This can be seen by comparing the results obtained for lower values of γ to those for large values of d_p, which are extremely similar. This result is remarkable from an evolutionary perspective, since it suggests that improving navigation efficiency would not necessarily require processing much more information continually (note again that the number of paths available for prospection grows in our structure as 4^{d_p}, and it is in general n^{d_p} for a sequential decision task in which n choices are given to the subject at any step, so processing costs grow very fast). Instead, having the ability to carry out longer prospections and using it only when needed would be enough to increase efficiency significantly.

FIG. 11.
Final number of covered patches after a 49-step trajectory as a function of the memory time τ_m, for the cases of a) fixed d_p and b) fixed γ.

By exploring the whole range of γ and τ_m values, we can divide the parameter space into four regions (Fig. 12 a)), analogously to what was shown above for fixed d_p (see Fig. 5). Region I produces an averaged performance that visits fewer patches than the individuals in any of the experimental graphs. Region II produces a performance which lies between the results obtained for the Circular Ordered and Circular Disordered graphs. Region III overcomes the results for the Circular Ordered performance but not for the Rectangular one. Region IV, finally, outperforms all the experimental results. The regions are equivalent to those obtained for fixed path lengths, but with a smaller number of prospected patches. Again, this shows that distributed values of d_p can be used to obtain higher navigation efficiencies without consuming much more information-processing time.

FIG. 12.
Diagram in which the averaged number of covered patches is compared with the experimental results. Graph a) is for the fixed prospection length d_p and graph b) for the γ distribution.

D. Robustness of the power-law exponent for the distribution of decision times
We have reported in the main text that the decision time for the walker, defined as the number of prospected paths required to make S < S_th, exhibits the same power-law distribution (with exponent −3) as the Gaussian toy model when the entropic mechanism is used, with the parameters τ_m and d_p fitted from the experiments. Here we provide an analysis to check that the −3 exponent is robust against the particular value of S_th used in the algorithm.

First, in Fig. 13 a) we show the explicit dependence on the entropy threshold, and verify that the power-law behavior is kept as long as reasonable values of this parameter are chosen, extreme choices of S_th being the exception; in our results above S_th is fixed to a single representative value. Second, Fig. 13 b) shows that neither d_p nor τ_m modify significantly the ∼ n^{−3} behavior, as long as some significant level of memory and prospection is kept.

FIG. 13. a) Distribution of the number of prospections performed by the walker before making a movement, for the fixed parameters d_p = 6 and τ_m = 100 and a variable entropy threshold S_th. b) Distribution of the number of prospections performed by the walker before making a movement, for a fixed entropy threshold and different values of d_p and τ_m.

We stress that the classical SPRT criterion, as well as the other variations we have numerically explored, is unable to reproduce the −3 exponent observed in P(n). This, together with the robustness analysis reported here, provides significant support to the entropy-threshold criterion proposed here.

VII. AUTHOR CONTRIBUTIONS
J.C., V.M. and D.C. conceived and designed the study. J.C. and D.C. carried out the experiment. J.C. performed the numerical simulations. J.C., V.M. and D.C. carried out the mathematical analysis. J.C., V.M. and D.C. wrote and reviewed the paper. All authors gave final approval for publication and agree to be held accountable for the work performed therein.