Informational entropy refinement as a stochastic mechanism for sequential decision-making in humans
Javier Cristín, Vicenç Méndez and Daniel Campos

(Dated: February 18, 2021)

While perceptual decision making in humans is often considered to be governed by evidence accumulator models (like drift-diffusion), the mechanisms driving harder situations, where prospection of future scenarios is necessary, remain largely unknown. Here, experimental and computational evidence is given in favour of a mechanism in which prospection of the possible future payoffs associated with each available choice could be used, through the internal estimation of the corresponding Shannon entropy S; the decision would then be triggered as soon as S reaches a threshold which ensures that a choice is reliable enough. We illustrate this idea using a task in which subjects have to navigate sequentially through a maze on the computer screen while avoiding trajectory overlaps, forcing them to use memory and prospection skills that we indirectly capture through eye-tracking. Comparison of the experimental data with that from virtual (ideal) subjects allows us to verify that the performances observed, as well as the distributions of decision-making times, of humans are compatible with the aforementioned mechanism.

I. INTRODUCTION
In our daily life, we constantly find ourselves in situations that imply making decisions: what I am going to eat, which film I will see, or whether I am on time for the bus. In all these situations we must evaluate the different options available and find a way to choose the best one. While these situations lie within the field of psychology, in recent years there has been a growing interdisciplinary interest in decision-making. Its neural correlates constitute at present an important subject in cognitive and behavioral neuroscience [1–4]. Also, the strategies to improve the efficiency of our decisions constitute an important subject in game theory [5, 6] and econophysics. Last but not least, ideas from statistical physics and/or complex systems have also made their way into the field; while most contributions to date focus on decision-making at the level of groups or collectives (see [7–11] for some reviews), tentative works suggesting physical principles that could be involved in individual decisions do also exist [12–16].

Large efforts have been put into understanding the dynamics and the characteristics of perceptual decisions, that is, those where sensory information provides direct evidence for choosing between the options available, as in the famous random dot motion task [17, 18]. As a result, a correspondence between such sensory information and the neuronal responses responsible for the evidence accumulation in the brain is assumed to be identifiable, providing a physical measure or correlate of the neural processes involved. Alternatively, value-based or preferential decision-making identifies (though the exact definition changes from one field to another, or among authors) situations where a deliberative process, possibly involving some subjective estimates, is necessary in order to reach the decision, as for example when a subject is asked to choose between two food items.
So, in such cases the neural correlate becomes obviously more difficult to identify.

We can still introduce a third class of situations in which an objective answer to the task does exist, but such answer is not trivial to reach from sensory information alone because the task involves consecutive decisions, such that the present output has an influence over the next ones. Following some existing literature (see, e.g., [19, 20] and references therein), we will denote this as sequential decision-making. This obviously requires a higher cognitive capacity and a more reflective response by the subject in order to process the information. This is the typical case involved in tasks like playing board games such as chess, or solving mazes or tasks presented in some intelligence tests. Note that all these decisions involve building a bunch of future possibilities, that is, carrying out a prospection of the possible situations that would occur after a single decision, in a tree-like fashion. In our work, then, we will use the term prospection to denote such hypothetical, or mental, simulations of future events, which in general require (and are actually coupled to) high memory and abstraction capacities [21–23].

There have been many efforts focused on developing a theoretical framework to explain the mechanisms and dynamics of perceptual decision making. Most of them lie within the so-called accumulator framework, in which cognitive evidence (described through some kind of stochastic process) is gained throughout time until it reaches a given threshold, which then triggers the decision.
The paradigmatic example of such approaches is the Drift-Diffusion Model (DDM) [24], where a random variable denoting the relative evidence between the different options is assumed to be driven by the combination of a diffusion process (which introduces cognitive fluctuations or noise in the process) and a drift process that accounts for the evidence accumulation towards the correct answer gained through sensory information. Nowadays it is widely accepted among psychologists that the success of the DDM at describing such simpler situations is overwhelming [25, 26], though in some cases it requires modifications or extensions to be useful, such as introducing time-dependent thresholds [27] or dynamic changes of the drift according to the current state of the variable [28]. Furthermore, recent works have shown that value-based decisions can also be accommodated within this framework by introducing some additional details or assumptions [29, 30].

On the contrary, equivalent stochastic mechanisms able to capture the dynamics of sequential or more complex decision processes are scarce [31]. Here we will provide experimental evidence that those highly cognitive decision processes in humans, at least those of a certain type, are compatible with a stochastic framework in which not evidence but information (or informational entropy, to be specific) is implicitly being computed by the individual. For this, we study the efficiency of subjects at solving a particular navigation task through a maze on the computer screen, combined with eye-tracking data to assess the corresponding behavioral time dynamics. We do not introduce any explicit costs for prospecting or analyzing information (as there are no time, or other, constraints involved).
Thus we pose an extreme situation where decisions are mostly driven by optimization of the internal prospection process, rather than by any speed-accuracy tradeoff or any other principle related to fast or efficient sampling.

We will use principles from statistical physics and information theory to show that human decision-making in such contexts can be adequately described by a process of (informational) entropy refinement until a reliability threshold in the entropy is reached. In Section II we present our framework and discuss its main conceptual differences with the accumulator models typically used to study perceptual decision-making. In Section III we present the experimental design and discuss the corresponding performance of the subjects. Comparison of those efficiencies to those obtained by (virtual) random subjects allows us to prove (Section IV) that the performances obtained by real human subjects can only be explained if prospection is actually being used during the task, and we can even quantify to a certain extent the level of prospection that individuals are using in each case. Finally, we study in Section V the statistical properties of the response time dynamics observed during the task to provide quantitative evidence that human behavior in our task is driven by the entropy threshold mechanism mentioned above. The conclusions from our study are presented in Section VI.

II. THEORETICAL FRAMEWORK
A relevant problem in any decision-making process is to establish a criterion to identify when one has enough information to make an accurate decision. This is the question that led Wald to develop the well-known Sequential Probability Ratio Test (SPRT), which provides a criterion that minimizes the time required to make a decision based on accumulation of evidence. Given two different options (A and B, each with its corresponding reliability), we define the probabilities to decide each one of the options as p_A and p_B (which are assumed to be mutually exclusive options, so p_A + p_B = 1).

FIG. 1. Scheme for the accumulator and reliability mechanisms. The left column corresponds to the evidence accumulation for perceptual decision making (upper half), with Wald's ratio W as the quantifying magnitude (lower half). The right column corresponds to the sequential decision-making mechanism (upper half), with the Shannon entropy S as the quantifying magnitude (lower half). The label t_d on the horizontal axis corresponds to the decision time.

The SPRT is given in terms of the log-likelihood function

W = ln(p_A / p_B)    (1)

and establishes that the decision should be taken as soon as the cumulative of W(t) computed through evidence accumulation, Σ_i W_i = Σ_i ln(p_{A,i}) − Σ_i ln(p_{B,i}), exceeds (or falls below) a given threshold W_th. Here, p_{A,i} and p_{B,i} are defined as the probabilities for each option (either A or B) when estimated through the accumulation of sensory evidence (assuming for convenience that such a process can be divided into discrete steps). Consequently, the SPRT criterion above establishes the sufficient information to decide, and actually the DDM can be seen as a particular continuum implementation of the test [32, 33].

The SPRT, in fact, also admits an interpretation in terms of information theory. If we redefine the probabilities of the options A and B as p_A = 1 − p and p_B = p, then

W = ln((1 − p)/p) = ∂S/∂p,    (2)

where we have introduced in the last step the Shannon entropy S = −p ln p − (1 − p) ln(1 − p) (the extension from binary options to ternary, or more complex, decisions is straightforward).
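The identity in Eq. (2) can be verified by direct differentiation of the Shannon entropy:

```latex
\frac{\partial S}{\partial p}
  = \frac{\partial}{\partial p}\Bigl[-p\ln p-(1-p)\ln(1-p)\Bigr]
  = -\ln p - 1 + \ln(1-p) + 1
  = \ln\!\left(\frac{1-p}{p}\right) = W .
```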
Thus, the SPRT criterion can be seen as a threshold on the cumulative variation of entropy with respect to variations in p during the process of evidence accumulation.

Our proposal for sequential decision-making is that, in the absence of any explicit tradeoffs that constrain the prospection of future possibilities, it is the entropy itself, not its variations, that is the relevant magnitude for reaching a state at which the decision can be taken reliably. That is, we propose that one takes the decision once S has reached a given threshold S_th. Note that entropy cannot be accumulated (contrary to the case of evidence), since S is a bounded function. Instead, our mechanism suggests that the initial state of the subject is characterized by a maximum entropy (or maximum uncertainty), and the progressive information acquisition and prospection provide a better estimation of p_A and p_B, such that when the entropy decays below a threshold S_th the decision is taken. Accordingly, the evidence accumulation typical of the SPRT is replaced by an entropy refinement process (Fig. 1, right panels).

To implement this mechanism we need to solve the problem of how the information processed by the subject may be translated into the probabilities p_A and p_B appearing in S. In the case of perceptual decisions, probabilities are assigned in a way proportional to sensory evidence (for example, in the random dot motion task evidence is proportional to the time the subject gazes at points moving in one given direction and/or to the number of different points detected in that direction of motion). But how can this be implemented in contexts where not sensory evidence, but prospection and subsequent reflection, drive the decision?

If the information obtained during the prospection process results in a clear reward or payoff (we denote these rewards as E_A and E_B for the options A and B), then the question reduces to finding the mapping from the E's to the probabilities.
At this point, we introduce the hypothesis that the prospection process is used by humans as a way to estimate the mean value of the payoff, μ, that can be obtained from each choice. If that is the case, then a direct implementation of the Maximum Entropy Principle (MEP) [34] from information theory states that, for an estimate of μ, the most neutral (or unbiased) choice of probabilities that we can assign to each option reads

p_{A,B} = e^{−β E_{A,B}/μ} / Z,    (3)

where β is a positive constant (which appears as a Lagrange multiplier when applying the MEP) and Z is a normalization factor. Note that this is equivalent to the canonical or Maxwell-Boltzmann statistics employed in statistical physics, and interestingly it leads to W(t) = −β(E_A − E_B)/μ, so the traditional SPRT could be interpreted in this case as a way to impose a threshold on the difference of the cumulative rewards E_A and E_B computed through prospection.
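To make the mechanism concrete, the mapping (3) and the two stopping rules can be sketched in a few lines of code. This is a minimal illustration, not the exact implementation used in our analysis: payoff samples are Gaussian, μ is taken as the known mean payoff, and all parameter values are arbitrary.

```python
import math
import random

def entropy(p):
    """Shannon entropy of a binary distribution (natural log)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def samples_to_decide(mu_a=0.6, mu_b=0.4, beta=1.0, s_th=0.3, w_th=5.0,
                      n_max=100000, rng=None):
    """Accumulate unit-variance Gaussian payoff samples around mu_a and
    mu_b, map the cumulative payoffs E_A, E_B to canonical probabilities
    (Eq. 3) with mu = (mu_a + mu_b) / 2, and return the number of samples
    needed by (i) the SPRT rule |W| > W_th and (ii) the entropy
    refinement rule S < S_th."""
    rng = rng or random.Random(0)
    mu = 0.5 * (mu_a + mu_b)           # estimate of the mean payoff
    e_a = e_b = 0.0                    # cumulative payoffs E_A and E_B
    n_sprt = n_ent = None
    for n in range(1, n_max + 1):
        e_a += rng.gauss(mu_a, 1.0)
        e_b += rng.gauss(mu_b, 1.0)
        w = -beta * (e_a - e_b) / mu   # Wald's ratio, W = ln(p_A / p_B)
        p_a = 1.0 / (1.0 + math.exp(-w))   # canonical probability of A
        s = entropy(p_a)
        if n_sprt is None and abs(w) > w_th:
            n_sprt = n
        if n_ent is None and s < s_th:
            n_ent = n
        if n_sprt is not None and n_ent is not None:
            break
    return n_sprt, n_ent
```

Repeating this simulation many times yields the distributions of n compared in Fig. 2 for the two criteria.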
FIG. 2. Probability distribution of the number of necessary samples (n) required to reach the corresponding decision threshold, both for the entropic and Wald's algorithms.

A. Binary toy model
Before moving to experimental results, it can be illustrative to explore some properties of our mechanism (based on a threshold of the Shannon entropy S) compared to the classical SPRT criterion.

We propose the following toy model to compare both. We denote the mean rewards for options A and B as μ_A and μ_B. Then we assume that, through the prospection process, successive estimates of the energies E_A and E_B are obtained by the subject in such a way that each sample made during prospection follows a Gaussian distribution of unit variance centered at μ_A and μ_B, respectively. The cumulative energy so obtained is used to compute the probabilities in (3). The fundamental question to answer is what number of samples n is necessary to overpass (or fall below) a threshold, either in W (for the SPRT mechanism) or in S (in our proposal), as a way to decide between options A and B.

The SPRT case exhibits a distribution of the number of samples that depends strongly on the distance between the means Δμ ≡ μ_A − μ_B (Fig. 2). Instead, for the criterion of entropy refinement the distribution of necessary samples exhibits a power-law behavior P(n) ∝ n^{−3}, for a wide range of Δμ and S_th. At this point, we have demonstrated at least one fundamental difference between using energetic or entropic accumulators, so we can use it as a proxy to discriminate between these two criteria.

III. ENVIRONMENTS FOR PROSPECTION
Sequential decision making requires a mental processing of the acquired information which can be deeply complex and hard to capture, even with the help of brain activity monitoring. However, simple situations in which sensory information can be assumed to somehow reflect such information processing can help reconstruct the actual mechanisms behind it. With this purpose we have designed a particular navigation task in a maze, monitored with the help of eye-trackers. Efficient navigation strategies in a maze involve a prospection process through visual inspection of the possible paths available within the sight distance, thus involving a gathering of non-local evidence. In our context, we prepare the setup in such a way that we can indirectly estimate the paths prospected by the subjects with the help of the eye-tracking data.
A. Experimental Design
18 clinically normal adults (11 women and 8 men) aged from 18 to 45 carried out the experiment. In the first part of the task, subjects are presented with a discrete 7x7 regular lattice on the screen, representing a discrete set of 49 patches where the navigation process takes place (Fig. 3, upper panel on the right). The patches are linked through paths connecting them only to neighbouring patches (4 paths per node, except at the boundaries, where there are only 2 or 3). Among all possible paths, we remove a fraction of them (20%, preventing isolated regions from being formed in the structure), so introducing some level of heterogeneity into the lattice (Fig. 3, left panels).

The subjects are asked to visit the maximum number of patches of the resulting graph within 49 steps, starting from the center of the structure (one step is defined as a transition between connected patches in the graph). They do this by clicking with the mouse on the patch to which they want to move next (Fig. 3, middle panels, show some realizations of the resulting trajectories). The heterogeneity of the graph then makes the process nontrivial (note that for a regular lattice the optimal strategy to cover the maximum number of nodes would simply be to perform a ladder-like trajectory).

To assess the subjects' performance under different levels of complexity, the patches of the rectangular lattice are then reorganized in a circular way, in two different manners. In the first case (Circular Ordered), we keep the order of the rows of the first rectangular graph (Fig. 3 b)). For the third graph (Circular Disordered), we place the nodes following a circular structure but with the maximum visual disorder (Fig. 3 c)). We remark that topologically the three structures are completely equivalent. However, intuitively we expect a growing difficulty to gather the relevant information and to prospect as the visual distribution of patches and paths gets disordered, resulting in a lower performance (that is, a lower number of patches visited).
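The construction of such a heterogeneous lattice can be sketched as follows. This is a minimal illustration (the actual generation procedure is detailed in Appendix A); the rejection rule used here is just one plausible way to prevent isolated regions:

```python
import random
from collections import deque

def build_maze(size=7, remove_frac=0.2, rng=None):
    """Build a size x size lattice as a list of nodes and a list of kept
    links, removing a fraction of the links while rejecting any removal
    that would disconnect the graph (no isolated regions)."""
    rng = rng or random.Random(1)
    nodes = [(i, j) for i in range(size) for j in range(size)]
    edges = [((i, j), (i + 1, j)) for i in range(size - 1) for j in range(size)]
    edges += [((i, j), (i, j + 1)) for i in range(size) for j in range(size - 1)]

    def connected(edge_set):
        # breadth-first search from an arbitrary node
        adj = {v: [] for v in nodes}
        for a, b in edge_set:
            adj[a].append(b)
            adj[b].append(a)
        seen, queue = {nodes[0]}, deque([nodes[0]])
        while queue:
            for w in adj[queue.popleft()]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        return len(seen) == len(nodes)

    kept = list(edges)
    target = int(len(edges) * (1 - remove_frac))
    candidates = kept[:]
    rng.shuffle(candidates)
    for e in candidates:
        if len(kept) <= target:
            break
        trial = [x for x in kept if x != e]
        if connected(trial):       # never create isolated regions
            kept = trial
    return nodes, kept
```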
Additionally, we rotated the rectangular structure by 90º, 180º and 270º (with the corresponding Circular Ordered and Circular Disordered reorganizations) to randomize the task (so 12 cases in total are presented to each subject) without changing the topology of the structure. A more detailed explanation of the organization of the structures is given in Appendix A. The final dataset comprised 72 trajectories for each graph (Rectangular, Circular Ordered and Circular Disordered).

During the task, we cannot directly infer the paths prospected by the subjects from their trajectory. As a proxy for prospection, then, we use eye fixations measured through a commercial eye-tracker (Tobii X2-30, at 30 Hz) (Fig. 3, right panels). We use this to analyze (i) the number of patches at which the subject gazes between consecutive steps, and (ii) the time the gaze remains on particular patches, as a way to infer how resources are being invested throughout the prospection time (see Appendix B for further details).

B. Decision-making time dynamics
The global performance of the individuals on the three different versions of the graph is computed as the number of patches covered during the entire trajectory (Fig. 4 a)). For the Rectangular graph, the subjects visited on average about 37 of the total 49 patches. For the Circular Ordered graph they covered about 29 patches on average, and the coverage was lowest for the Circular Disordered graph.

To characterize where subjects gaze while deciding, we define the bond distance d_b as the minimum number of steps/bonds between the current patch and the one the individual is gazing at. The corresponding experimental distributions of d_b are found to be completely different for the three visual organizations considered (Fig. 4 c)). The distance (in bonds) between any two patches is exactly the same in all cases; it is then clear that the individuals do not prospect equally in the three cases: while in the Rectangular case a large amount of time is invested in gazing at nearby patches, in the Circular cases (especially the Disordered one) frequent gazes at distant patches (in terms of bonds) are observed. This must be attributed either to (i) distractions caused by the presence of patches which are close in the screen configuration (though they are not easily accessible from the current one), or (ii) the difficulty of easily identifying the patches which are available in the next few steps. An efficient prospection should combine an intensive exploration of closer patches with a smaller (but not necessarily negligible) exploration of further ones. We illustrate that idea in the inset of Fig. 4 c), where the cumulative probability of gazing at nearby patches (defined as those with d_b ≤ 4) is shown to decrease drastically as a function of the visual complexity of the task.
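The bond distance d_b is simply the graph distance between patches, which can be computed by breadth-first search. A minimal sketch (the adjacency mapping `adj` below is a tiny stand-in for the actual lattice):

```python
from collections import deque

def bond_distance(adj, source, target):
    """Minimum number of bonds between two patches, by BFS
    on an adjacency mapping {node: [neighbours]}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            return dist[v]
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return None  # target unreachable from source

# example: a 4-patch chain 0-1-2-3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```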
FIG. 3.
Scheme of the experimental setup. The first column corresponds to the three patch graphs. The paths correspond to the allowed movements between patches (the absence of a path means the nodes are not connected). The color of the patches is introduced to facilitate the understanding of the spatial reorganization. The second column shows one individual trajectory within each structure (the color code corresponds to the current step of the 49-step trajectory). The third column shows the locations at which the individual has gazed during the trajectory (the color code corresponds to the current step of the 49-step trajectory).

IV. QUANTIFYING PROSPECTION DURING NAVIGATION
As a way to quantify and refine the ideas above, we propose to compare the subjects' performance in our task to that of virtual subjects following an algorithm which is able to automatically prospect all the information available within a certain topological distance d_p (the prospection length) in the lattice. Note that in a classical random-walk algorithm, instead, the walker would select the paths completely at random, making uninformed decisions.

Our prospective walker then assigns a payoff E_i to the neighbouring patch where the prospected path begins. This payoff is equal to the fraction of already visited patches that the prospected path crosses (so E ∈ [0, 1], with E = 0 representing a prospected path for which all sites are still unvisited, and E = 1 representing a path for which all sites are already visited).

To compute the number of newly visited patches, it is then necessary that the walker keeps its previous trajectory in memory. To implement this in a realistic way, we consider that previous visits to a patch are kept in the walker's memory during a characteristic number of steps τ_m.

FIG. 4. a) Average number of covered patches after the 49-step trajectory for each one of the graphs. b) Average decision time, accounting for all the movements of the trajectory, for each one of the graphs. c) Distribution of the bond distance d_b between the current patch and the patches that have been gazed at before making a movement, for each one of the graphs. The inset corresponds to the cumulative probability of gazing at patches at a distance d_b ≤ 4.

As a result, if a certain choice would imply moving to regions that, according to the prospection length and the memory capacity of the walker, are already visited, then the payoff associated with that option will be large. Instead, if a certain choice is seen to drive the walker to a region with a large number of non-covered patches, the corresponding payoff will be lower.

The option with lower energy E_i will then be the one which is assigned a larger probability p_i, according to (3). In a situation where all the energies are equal, the walker has no reliable information to make its decision. The decision of when to move to the next patch is taken by the random walker according to the decision criterion described above in Section II. That is, successive prospections of paths are carried out, and the values of the payoffs E_i and the probabilities p_i are continuously updated. The walker then computes the corresponding Shannon entropy S = −Σ_i p_i ln p_i, and when the computed value falls below a threshold S_th, the walker makes a decision according to the probabilities p_i computed at that time (see Appendix A for technical details).

The rules above already allow the algorithm to avoid overlaps in its trajectory (more or less efficiently, in terms of the parameters d_p and τ_m) without explicitly requiring it to maximize the number of visited sites, as we do with the human subjects. As the performance of the algorithm is independent of the visual organization of the lattice (Rectangular or Circular), we can use it as a reference model against which to compare the human performance in our experiment, so assessing the prospection mechanisms that are presumably being used by the human subjects.

By exploring the range of d_p and τ_m values in the algorithm, we can divide the parameter phase space into four regions (Fig. 5 a)). In region I the algorithm produces an average number of visited patches lower than the individuals in any of the experiments.
Region II produces a performance which lies between the results obtained for the Circular Ordered and Circular Disordered cases. Region III overcomes the results for the Circular Ordered performance but not for the Rectangular one. Region IV, finally, outperforms all the experimental results.

Hence, we conclude that relatively large values of both τ_m and d_p are necessary to match or improve the coverage in the Rectangular graph, suggesting that individuals remember the visited patches and predict future outcomes efficiently in this case. So, the ability to prospect is necessary to justify the subjects' performance found in the experiments. For the Circular structures the individuals are probably not able to track the paths to distant patches (there is not sufficient local gathering); in consequence, the value of d_p necessary to reproduce their performance is lower than in the Rectangular case (a further comparison is provided in Appendix A).

For certain values of d_p and τ_m the random walker actually reproduces the distribution of performances obtained from the experiments. In Fig. 5 we show results for the cases d_p = 6 with τ_m^R = 70, d_p = 3 with τ_m^CO = 7, and d_p = 2 with τ_m^CD = 5, which provide the best fits to the experimental data. We analyze the dynamical trends of the experimental trajectories and of the trajectories for the fitted parameters (Fig. 5 b)). The number of covered patches presents in all cases a monotonic growth, which is reduced as the trajectory progresses and overlaps consequently appear. The experimental curves and those obtained from the algorithm with the parameters mentioned above agree almost perfectly. Thus, the algorithm reproduces the dynamic performance of the human subjects during the experiment. Likewise, the distribution of the final performances so obtained is also in perfect agreement with the experimental data (Fig. 5 c)).
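One decision of the prospective walker can be sketched as follows. This is a simplified illustration of the procedure detailed in Appendix A: the function names, the path-sampling scheme and all parameter values are illustrative choices, not the exact implementation.

```python
import math
import random

def prospect_paths(adj, start, d_p):
    """Enumerate simple paths of (up to) d_p steps from `start`, grouped
    by the neighbouring patch through which each path begins."""
    paths, stack = {}, [[start, n] for n in adj[start]]
    while stack:
        path = stack.pop()
        ext = [path + [n] for n in adj[path[-1]] if n not in path]
        if len(path) - 1 == d_p or not ext:   # full length or dead end
            paths.setdefault(path[1], []).append(path[1:])
        else:
            stack.extend(ext)
    return paths

def decide(adj, current, visited, d_p=2, beta=4.0, s_th=0.5, rng=None):
    """One decision of the walker: the payoff E_i of each option is the
    fraction of already visited patches on a prospected path, the
    probabilities follow Eq. (3), and the move is triggered once the
    Shannon entropy of the options falls below s_th (or after a fixed
    budget of prospections)."""
    rng = rng or random.Random(2)
    options = prospect_paths(adj, current, d_p)
    for _ in range(200):                 # successive prospections
        e = {i: sum(q in visited for q in rng.choice(ps)) / d_p
             for i, ps in options.items()}
        mu = sum(e.values()) / len(e) or 1.0   # mean payoff estimate
        z = {i: math.exp(-beta * ei / mu) for i, ei in e.items()}
        tot = sum(z.values())
        p = {i: zi / tot for i, zi in z.items()}
        s = -sum(pi * math.log(pi) for pi in p.values() if pi > 0)
        if s < s_th:                     # entropy threshold reached
            break
    return rng.choices(list(p), weights=list(p.values()))[0]
```

On a graph where one neighbour leads to fresh patches and the other back into visited territory, the entropy drops quickly and the walker almost always chooses the fresh direction.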
V. THE EXPERIMENTAL DATA IS COMPATIBLE WITH THE ENTROPIC REFINEMENT MECHANISM
In Section II we introduced a decision-making criterion based on the acquisition of sufficient decision reliability, that is, on a threshold for the informational entropy. The Gaussian toy model explored there exhibited power-law scaling (with exponent γ = −3) for the number of samples between decisions under the entropy refinement mechanism. Actually, the algorithm presented in the previous Section for reproducing the dynamics of subjects in the Rectangular and Circular structures also exhibits the same behavior (Fig. 7). This exponent is then the reference against which we compare the experimental distributions of i) the times between decisions, ii) the times during which the subjects gaze at the same patch, and iii) the number of different patches gazed at before making a decision (n).

FIG. 5. a) Diagram of the walker's covered patches in comparison with the experimental results. Regime I corresponds to a worse average performance than in all geometries. Regime II corresponds to a better performance than in Circular Disordered. Regime III corresponds to a better performance than in Circular Ordered and Disordered. Regime IV corresponds to a better performance than in all geometries. b) Evolution of the average number of remaining non-visited nodes during the subject and walker performance. c) Distribution of the final number of covered patches obtained from the subject and walker performance. The dots correspond to the experimental data while the solid lines correspond to the walker mechanism.

FIG. 6. a) Distribution of times the subject stares at a certain patch. b) Distribution of decision times (between consecutive movements). c) Distribution of the number of gathered patches between consecutive movements.

Despite the different performances found for the three levels (Rectangular, Circular Ordered and Disordered), as reported in the previous Sections, the time distributions in all these cases exhibit extremely similar properties (Fig. 6). This suggests a common underlying mechanism for decision-making. What is more, all the distributions fit closely the power-law decay P(t) ∼ t^{−3} (or P(n) ∼ n^{−3}), in agreement with the predictions from our information-theoretical criterion based on S. The only significant differences appear for smaller decision times, which seem to be scarce in the Circular Disordered case (as suggested from Fig. 4 b)).

Intuitively, the decision time may be understood as the sum of the times that the individual has been gazing at each individual patch. Then it could be that the power law emerges either from (i) the distribution of times the subject keeps looking at a given patch, or (ii) the number of patches that are gazed at between decisions. Either case would provide an explanation for the scale-free feature of the decision time distributions as a consequence of another distribution. It is, however, the case that both distributions (for the number of patches gazed at and the gazing times) seemingly present the scale-free decay (Fig. 6, middle and right panels). So, the underlying mechanism yielding the power-law distribution for decision times is apparently a nontrivial combination of both.

FIG. 7. Distribution of the number of prospections n performed by the walker to force the entropy S to fall below the threshold S_th.

FIG. 8. a) Energy difference ΔE between the options (directions) with more and less accumulated energy when the decision is made. The x-axis groups the decisions by their corresponding decision time. Linear fits are shown for each graph (Rectangular, Circular Ordered and Circular Disordered). b) Shannon entropy S at the moment the decision is made. The x-axis groups the decisions by their corresponding decision time. Linear fits are shown for each graph.

To provide further evidence of the compatibility of the experimental results with our information-theoretical (entropy refinement) criterion, we quantify the payoffs E_i and the corresponding entropy S at the instant at which each decision is made by the subject (Fig. 8). We recall that the SPRT criterion for canonical probabilities (3) is equivalent to imposing that a threshold in the energy difference ΔE triggers the decision, while for our criterion it is the entropy S which must reach a fixed threshold. For the experimental navigation task, we find that the difference ΔE (computed between the choices with lower and higher payoffs at the decision time) is a monotonically growing function of the decision time, so longer decisions require longer payoff accumulation (Fig. 8 a)). On the contrary, the informational entropy S remains approximately independent of the decision time, suggesting that this magnitude is really an invariant for all decisions and so supporting the view that a threshold in S may trigger the decision (Fig. 8 b)). This seems to be particularly robust for longer decisions, while shorter ones deviate more from this trend.

VI. CONCLUSIONS
Navigation efficiency in higher organisms should take into account the criteria they use to prospect the outcomes of their available options, and how these are weighted to reach a behavioral decision. Here we have studied the decision-making dynamics subject to a process of information gathering and have provided evidence that the mechanism triggering the decision relies on a threshold in the informational entropy based on the probabilities estimated through prospection. While this mechanism shares some ideas with the classical SPRT criterion typically used for perceptual decision making, we claim that in sequential decision-making scenarios it is information refinement or reliability (rather than evidence accumulation) that is the fundamental magnitude.

Our hybrid analysis, comparing the navigation abilities of virtual subjects (algorithms) to those of human subjects, provides an approximate quantitative characterization of the cognitive memory and the prospection ability that should be required of the subject during the task. Furthermore, the distribution of decision (or gazing) times, together with the study of the final values of S reached at the moment of the decisions (Section V), allows us to think that the criterion proposed here can account to a significant extent for how information is being processed by the subjects during the task. In this respect, note that traditionally mean times to decision, as well as the ratio of the times corresponding to choosing option A or B (for binary decisions), have been studied in detail by psychologists. On the contrary, decision time distributions are rarely computed in decision-making experiments. Here we show that such distributions can be used as a signature to discriminate between models.
The SPRT criterion, as well as the other alternative mechanisms we have explored (not reported here), has been found unable to reproduce the power-law decay P(t) ∼ t^{−3} characteristic of our experiments. Regarding the experimental robustness of our results, since neural correlates able to provide a detailed description of how prospection and information-gathering processes are carried out are probably unattainable for sequential decision-making, eye-tracking remains a convenient proxy for them. Still, the combination of such data with EEG or other physiological sensors could probably be used to refine our ideas and provide more reliable estimates of the dynamics in sequential decision-making and/or navigation tasks, also in more realistic environments than the on-screen navigation used here. We expect that our results can stimulate research in this line in order to test the general validity of our information-theoretical criterion based on entropy refinement.

[1] J. I. Gold and M. N. Shadlen. The Neural Basis of Decision Making.
Annu. Rev. Neurosci. 30, 535-574 (2007).
[2] S. P. Kelly and R. G. O'Connell. The neural processes underlying perceptual decision making in humans: Recent progress and future directions. J. Physiol. Paris 109, 27-37 (2015).
[3] S. Blakemore and T. W. Robbins. Decision-making in the adolescent brain. Nat. Neurosci. 15 (2012).
[4] B. De Martino, D. Kumaran, B. Seymour and R. J. Dolan. Frames, Biases, and Rational Decision-Making in the Human Brain. Science 313, 684-687 (2006).
[5] D. Lee. Game theory and neural basis of social decision making. Nat. Neurosci. 11, 404-409 (2008).
[6] D. Lee, M. L. Conroy, B. P. McGreevy and D. J. Barraclough. Reinforcement learning and decision making in monkeys during a competitive game. Cognitive Brain Research 22, 45-58 (2004).
[7] I. Couzin, J. Krause, N. Franks et al. Effective leadership and decision-making in animal groups on the move. Nature 433, 513-516 (2005).
[8] A. J. W. Ward, D. J. T. Sumpter, I. D. Couzin, P. J. B. Hart and J. Krause. Quorum decision-making facilitates information transfer in fish shoals. PNAS 105, 6948-6953 (2008).
[9] G. Valentini, E. Ferrante and M. Dorigo. The Best-of-n Problem in Robot Swarms: Formalization, State of the Art, and Novel Perspectives. Front. Robot. AI (2017).
[10] L. Conradt and C. List. Group decisions in humans and animals: a survey. Phil. Trans. R. Soc. B 364, 719-742 (2009).
[11] S. Redner. Reality-inspired voter models: A mini-review. C. R. Phys. 20, 275-292 (2019).
[12] P. A. Ortega and D. A. Braun. Thermodynamics as a theory of decision-making with information-processing costs. Proc. R. Soc. A 469, 20120683 (2013).
[13] V. I. Yukalov and D. Sornette. Self-organization in Complex Systems as Decision Making. Adv. Complex Syst. 17, 1450016 (2014).
[14] P. Schwartenbeck, T. FitzGerald, R. J. Dolan and K. Friston. Exploration, novelty, surprise, and free energy minimization. Front. Psychol. 4:710 (2013).
[15] É. Roldán, I. Neri, M. Dörpinghaus, H. Meyr and F. Jülicher. Decision Making in the Arrow of Time. Phys. Rev. Lett. 115, 250602 (2015).
[16] M. Favre, A. Wittwer, H. R. Heinimann, V. I. Yukalov and D. Sornette. Quantum Decision Theory in Simple Risky Choices. PLoS ONE 11, e0168045 (2016).
[17] K. H. Britten, M. N. Shadlen, W. T. Newsome and J. A. Movshon. The analysis of visual motion: a comparison of neuronal and psychophysical performance. J. Neurosci. 12, 4745-4765 (1992).
[18] J. D. Roitman and M. N. Shadlen. Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. J. Neurosci. 22, 9475-9489 (2002).
[19] E. M. Tartaglia, A. M. Clarke and M. H. Herzog. What to Choose Next? A Paradigm for Testing Human Sequential Decision Making. Front. Psychol. 8:312 (2017).
[20] D. Zhang and R. Gu. Behavioral preference in sequential decision-making and its association with anxiety. Hum. Brain Mapp. 39, 2482-2499 (2018).
[21] D. T. Gilbert and T. D. Wilson. Prospection: Experiencing the future. Science 317, 1351-1354 (2007).
[22] T. Suddendorf and M. C. Corballis. The evolution of foresight: What is mental time travel, and is it unique to humans? Behav. Brain Sci. 30, 299 (2007).
[23] B. E. Pfeiffer and D. J. Foster. Hippocampal place-cell sequences depict future paths to remembered goals. Nature 497, 74-81 (2013).
[24] R. Ratcliff, P. L. Smith, S. D. Brown and G. McKoon. Diffusion Decision Model: Current Issues and History. Trends Cogn. Sci. 20, 260-281 (2016).
[25] R. Ratcliff and G. McKoon. The Diffusion Decision Model: Theory and Data for Two-Choice Decision Tasks. Neural Comput. 20, 873-922 (2008).
[26] A. Rangel et al. The Drift Diffusion Model can account for the accuracy and reaction time of value-based choices under high and low time pressure. Judgment and Decision Making 5, 437-449 (2010).
[27] M. L. Pedersen, M. J. Frank and G. Biele. The drift diffusion model as the choice rule in reinforcement learning. Psychon. Bull. Rev. 24, 1234-1251 (2017).
[28] L. Fontanesi, S. Gluth, M. S. Spektor et al. A reinforcement learning diffusion decision model for value-based decisions. Psychon. Bull. Rev. 26, 1099-1121 (2019).
[29] A. Roxin. Drift-diffusion models for multiple-alternative forced-choice decision making. J. Math. Neurosci. 9, 5 (2019).
[30] S. Tajima, J. Drugowitsch and A. Pouget. Optimal policy for value-based decision-making. Nat. Commun. 7, 12400 (2016).
[31] K. P. Nguyen, K. Josić and Z. P. Kilpatrick. Optimizing sequential decisions in the drift-diffusion model. J. Math. Psychol. 88, 32-47 (2019).
[32] R. Bogacz, E. Brown, J. Moehlis, P. Holmes and J. D. Cohen. The physics of optimal decision making: a formal analysis of models of performance in two-alternative forced-choice tasks. Psychol. Rev. 113, 700 (2006).
[33] A. Wald and J. Wolfowitz. Optimum character of the sequential probability ratio test. Ann. Math. Stat. 19, 326-339 (1948).
[34] E. T. Jaynes. Information Theory and Statistical Mechanics I. Phys. Rev. 106, 620-630 (1957).

APPENDIX A: EXPERIMENTAL METHODS
The final dataset comprised 72 validated trajectories for each graph (Rectangular, Circular Ordered and Circular Disordered, each with four possible orientations: 0º, 90º, 180º, 270º) coming from 18 adult subjects. The experiment was carried out on a 21" monitor (HP E233). The subjects were requested to keep a distance of about 50 cm from the monitor and to try to avoid sharp physical movements. We used a commercial eye-tracker (Tobii X2-30, at 30 Hz) to record the positions of the eye fixations (Fig. 3, right panels). The subjects were not required to take off their glasses, but doing so was recommended to avoid interference with the eye-tracker. The subjects were not urged to finish the experiment within any given time; they were simply given a fixed number of movements (49) to explore the maze, and were asked to cover the maximum number of patches possible (one movement is defined as a transition between connected neighbour patches in the graph). The 49 patches in the graph presented to the subjects are linked through paths connecting them only to neighbour patches. Among all possible movements between first neighbours, we removed a fraction of them (20%, while preventing isolated regions from forming in the structure), thus introducing some level of heterogeneity in the graph (Fig. 3, left panels). All paths started from the same (central) node of the structure, and all subjects were presented exactly the same structures in order to avoid any difference between them that could distort the comparison of their performances. So, the only difference between the three structures (Rectangular, Circular Ordered and Circular Disordered) was in the visualization of the graph on the screen, as explained in the main text. Movements between patches in the graph were carried out by clicking with the mouse on the next patch.
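As an illustration, the construction just described (a 7x7 grid of patches with nearest-neighbour links, 20% of which are pruned while keeping the structure connected) can be sketched as follows. This is our own minimal reconstruction, not the code used to generate the experimental graphs, and all function names are ours:

```python
import random

def build_maze(side=7, removal_fraction=0.2, rng=random):
    """Grid of side x side patches with nearest-neighbour links; a fraction
    of the links is removed while keeping the graph connected, so no
    isolated regions are formed and some heterogeneity is introduced."""
    nodes = [(x, y) for x in range(side) for y in range(side)]
    edges = set()
    for x, y in nodes:
        if x + 1 < side:
            edges.add(((x, y), (x + 1, y)))
        if y + 1 < side:
            edges.add(((x, y), (x, y + 1)))

    def connected(edge_set):
        # depth-first search from an arbitrary node
        adj = {n: [] for n in nodes}
        for a, b in edge_set:
            adj[a].append(b)
            adj[b].append(a)
        seen, stack = {nodes[0]}, [nodes[0]]
        while stack:
            for nxt in adj[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return len(seen) == len(nodes)

    candidates = sorted(edges)
    rng.shuffle(candidates)
    to_remove = int(removal_fraction * len(edges))
    for edge in candidates:
        if to_remove == 0:
            break
        if connected(edges - {edge}):   # never isolate a region
            edges -= {edge}
            to_remove -= 1
    return nodes, edges
```

Because a connected graph with more edges than a spanning tree always contains removable cycle edges, a single shuffled pass is guaranteed to reach the target number of removals here (16 of the 84 links for a 7x7 grid at 20%).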
To facilitate visualization of the options available (especially in the Circular Disordered case, where visualization could be difficult), the current patch of the individual was depicted in a different color (green, with the rest of the patches appearing in blue) and the links available at each movement were emphasized (with thicker solid lines). On the contrary, the subjects had no visual guides to distinguish between visited and non-visited patches, so they needed to use their memory skills to avoid overlaps. A screen recording of a particular realization of the task for each structure (Rectangular, Circular Ordered, Circular Disordered) is provided as a Supplementary Material file to facilitate understanding.

APPENDIX B: PROSPECTIVE ALGORITHM

A. Definition
We propose an algorithm in which virtual subjects are able to prospect those paths available within d_p steps in the lattice (we call this parameter the prospection length). For each path prospected, the walker assigns a payoff E_i to the neighbour patch at which that path starts (for a simple visualization, see Fig. 9). The payoff is taken to be equal to the fraction of already visited patches that the prospected path crosses (so E is bounded between 0 and 1, with E = 0 for a path that does not cover any visited patches, and E = 1 if all patches covered by the path have been previously visited).

The walker keeps its previous trajectory in memory during a characteristic number of steps τ_m. In particular, each visit is remembered by the virtual subject during a time τ obtained from the exponential distribution P(τ) = (1/τ_m) e^{−τ/τ_m}. After this time, the walker will forget that this particular patch has been previously visited, and it will count as a non-visited patch when computing the corresponding payoffs.

As a result, the algorithm will assign higher payoffs to the options that it remembers having visited and/or that are adjacent to regions that it remembers having visited. So, according to Eq. (3), the probability to choose those options will decrease, leading the virtual subject to regions that are still unvisited (or at least that it does not remember having visited before). A larger prospection length d_p allows the walker to sample the state of further regions and to compute the payoff using the information of distant patches, but this will only be efficient if the memory parameter τ_m is large enough.

Successive prospections of the paths available in each direction are carried out at random among all possible ones of length d_p, and so the values of the payoffs and the probabilities p_i are continuously updated. Note that for a given value of d_p the number of paths that can be prospected is of the order of ∼ 4^{d_p} (assuming that all links between neighbour patches are available).
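The ingredients above (payoff as the fraction of remembered visits along a prospected path, exponential forgetting with mean τ_m, and choice probabilities that decrease with payoff) can be sketched in Python roughly as follows, together with the entropy-threshold stopping rule described next in the text. This is an illustrative reconstruction, not the original simulation code: the canonical (Boltzmann) form p_i ∝ e^{−βE_i} is assumed for Eq. (3), the running average over successive prospections is our own choice of update, and all function and parameter names (including the inverse temperature beta) are ours:

```python
import math
import random

def record_visit(memory, patch, now, tau_m, rng=random):
    # a visit is retained for a time tau drawn from the exponential
    # distribution P(tau) = (1/tau_m) exp(-tau/tau_m); we store the
    # expiry time after which the patch is forgotten
    memory[patch] = now + rng.expovariate(1.0 / tau_m)

def path_payoff(path, memory, now):
    # payoff E of a prospected path: fraction of its patches the walker
    # still remembers having visited, so 0 <= E <= 1
    hits = sum(1 for patch in path if memory.get(patch, -1.0) > now)
    return hits / len(path)

def canonical_probs(payoffs, beta=1.0):
    # assumed canonical form of Eq. (3): p_i proportional to exp(-beta*E_i),
    # so options leading through remembered (visited) regions are less likely
    weights = [math.exp(-beta * e) for e in payoffs]
    z = sum(weights)
    return [w / z for w in weights]

def shannon_entropy(probs):
    # S = -sum_i p_i ln p_i
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_decision(sample_path, options, memory, now,
                     s_th=0.5, beta=1.0, max_prospections=100, rng=random):
    """Prospect one random path per option per round, updating running
    payoff estimates, until the entropy of the choice probabilities
    falls below s_th (or the cap of 100 prospections is reached)."""
    estimates = [0.0] * len(options)
    for n in range(1, max_prospections + 1):
        for i, opt in enumerate(options):
            e = path_payoff(sample_path(opt, rng), memory, now)
            estimates[i] += (e - estimates[i]) / n   # running average
        probs = canonical_probs(estimates, beta)
        if shannon_entropy(probs) < s_th:
            break
    choice = rng.choices(range(len(options)), weights=probs)[0]
    return options[choice], n, probs
```

With four equally likely options the entropy starts near its maximum ln 4 ≈ 1.39 and decreases as the payoff estimates separate, so the decision fires as soon as the probabilities become reliable enough.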
The decision of when to move to the next patch is taken by the random walker according to the decision criterion described above in Section II. After each single prospection of one path in each direction, the walker computes the corresponding Shannon entropy S = −Σ_i p_i ln p_i; if the computed value falls below a fixed threshold S_th, the walker makes the decision according to the probabilities p_i computed at that time (we have checked that deciding according to the highest probability instead does not qualitatively change the walker dynamics). On the contrary, if S > S_th then the prospection process continues. However, we additionally introduce a rule such that the maximum number of prospections is limited to 100, to avoid (extremely unusual) situations in which S would never decay below S_th because all options available persistently exhibit very similar payoffs. We have carefully checked that this rule does not modify any of the reported results in a significant way.

FIG. 9.
Scheme of three different prospection paths corresponding to three different prospection lengths: d_p = 1 (black), d_p = 2 (yellow) and d_p = 3 (green). The patches that are already visited are marked in red (so they add 1 to the payoff), while the unvisited ones appear in blue (so they add 0 to the payoff). The payoff assigned in each one of the paths depicted would be a) E_f = 1 to option f for d_p = 1, b) E_h = (0 + 0)/2 = 0 to option h for d_p = 2, and c) E_b = (0 + 0 + 1)/3 = 1/3 to option b for d_p = 3.

B. Phase diagram
The results of the algorithm shown above have been obtained under the same conditions as in the task presented to the human subjects; that is, for 49-step trajectories through the 7x7 lattice, as a function of the parameters τ_m and d_p. In particular, we study the mean coverage time T_Cov (that is, the mean time required to cover all sites in the lattice). Minimization of this magnitude then gives an estimation of the search/navigation efficiency of the algorithm.

The main conclusion we can extract (as one can deduce from the results in Fig. 10) is that the ability to prospect future paths (so, having a large d_p) is useless unless the individual has good memory skills (that is, a large τ_m value in our context). This makes sense: when the walker cannot remember the previously visited patches (low values of τ_m), the optimal strategy consists of removing prospection (d_p = 1), since in that case the information provided by further patches represents just useless noise, as the walker always sees them as non-visited patches. On the other hand, for large τ_m the walker can correctly identify the previously visited patches, so progressively higher prospection lengths d_p are found to optimize the coverage of the structure and the search for a target.

FIG. 10.
Prospection length that minimizes the mean coverage time T_Cov of the walker, as a function of the memory time τ_m, when the walker is placed in the same structure as in the experimental design.

C. Distributed prospection lengths
Assigning a constant prospection length d_p to all the prospected paths may seem rather unrealistic. Human individuals are expected instead to prospect paths of different lengths depending on the specific situation (complexity, number of choices available, etc.). The results reported in Fig. 6 c) also support this statement (the number of gazed patches is not fixed to a constant value but exhibits a variation which spans almost one order of magnitude).

We have then studied our algorithm for the case when a distribution of d_p is introduced instead of a constant value. We have tried in particular a distribution P(d_p) ∝ d_p^{−γ} (for d_p ≥ 1), with Σ_{d_p=1}^∞ P(d_p) = 1 to guarantee normalization. For γ → ∞, the paths are then fixed to d_p = 1, so the prospection algorithm reduces to identifying whether the neighbour patches have been visited or not. In the opposite limit of small γ, large d_p values become frequent (in practice we impose an upper limit on d_p, since much larger values would be absurd, given the 7x7 maze we have used). Figure 11 shows that sampling a small (but not negligible) number of long paths combined with a majority of short paths (as happens for intermediate γ values) is sufficient to recover the same dynamics as obtained for a large fixed d_p. This can be seen by comparing the results obtained for lower values of γ to those for large values of d_p, which are extremely similar. This result is remarkable from an evolutionary perspective, since it suggests that improving navigation efficiency would not necessarily require processing much more information continually (note again that the number of paths available for prospection grows in our structure as 4^{d_p}, and it is in general n^{d_p} for a sequential decision task in which n choices are given to the subject at any step, so processing costs grow very fast). Instead, having the ability to carry out longer prospections and using it only when needed would be enough to increase efficiency significantly.

FIG. 11.
Final number of covered patches after a 49-step trajectory as a function of the memory time τ_m, for the cases of a) fixed d_p and b) fixed γ.

By exploring the whole range of γ and τ_m values, we can divide the parameter space into four regions (Fig. 12 a)), analogously to what was shown above for fixed d_p (see Fig. 5). Region I produces an averaged performance that visits fewer patches than the individuals in any of the experimental graphs. Region II produces a performance which lies between the results obtained for the Circular Ordered and Circular Disordered graphs. Region III overcomes the results for the Circular Ordered performance but not for the Rectangular one. Region IV, finally, outperforms all the experimental results. The regions are equivalent to those obtained for fixed path lengths, but with a smaller number of prospected patches. Again, this shows that distributed values of d_p can be used to obtain higher navigation efficiencies without consuming much more information-processing time.

FIG. 12.
Diagram in which the averaged number of covered patches is compared with the experimental results. Graph a) is for the fixed prospection length d_p and graph b) for the γ distribution.

D. Robustness of the power-law exponent for the distribution of decision times
We have reported in the main text that the decision time for the walker, defined as the number of prospected paths required to make S < S_th, exhibits the same power-law distribution (with exponent −3) as the Gaussian toy model when the entropic mechanism is used, with the parameters τ_m and d_p fitted from the experiments. Here we provide an analysis to check that the −3 exponent is robust against the particular value of S_th used in the algorithm.

First, in Fig. 13 a) we show the explicit dependence on the entropy threshold, and verify that the power-law behavior is kept as long as reasonable values of this parameter are chosen, extreme choices of S_th being the exception; in our results above S_th is fixed to a single representative value. Second, Fig. 13 b) shows that neither d_p nor τ_m modify significantly the ∼ n^{−3} behavior, as long as some significant level of memory and prospection is kept.

FIG. 13. a) Distribution of the number of prospections performed by the walker before making a movement, for the fixed parameters d_p = 6 and τ_m = 100 and a variable entropy threshold S_th. b) Distribution of the number of prospections performed by the walker before making a movement, for a fixed entropy threshold and different values of d_p and τ_m.

We stress that the classical SPRT criterion, as well as the other variations we have numerically explored, is unable to reproduce the −3 exponent observed in P(n). This, together with the robustness analysis reported here, provides significant support to the entropy-threshold criterion proposed here.

VII. AUTHOR CONTRIBUTIONS
J.C., V.M. and D.C. conceived and designed the study. J.C. and D.C. carried out the experiment. J.C. performed the numerical simulations. J.C., V.M. and D.C. carried out the mathematical analysis. J.C., V.M. and D.C. wrote and reviewed the paper. All authors gave final approval for publication and agree to be held accountable for the work performed therein.