A Bayesian brain model of adaptive behavior: An application to the Wisconsin Card Sorting Task
Marco D'Alessandro, Stefan T. Radev, Andreas Voss, Luigi Lombardi
Marco D’Alessandro
Department of Psychology and Cognitive Science, University of Trento, Corso Bettini, 84, 38068 Rovereto
[email protected]

Stefan T. Radev
Institute of Psychology, Heidelberg University, Hauptstr. 47-51, 69117 Heidelberg
[email protected]

Andreas Voss
Institute of Psychology, Heidelberg University, Hauptstr. 47-51, 69117 Heidelberg
[email protected]

Luigi Lombardi
Department of Psychology and Cognitive Science, University of Trento, Corso Bettini, 84, 38068 Rovereto
[email protected]
November 30, 2020

ABSTRACT
Adaptive behavior emerges through a dynamic interaction between cognitive agents and changing environmental demands. The investigation of information processing underlying adaptive behavior relies on controlled experimental settings in which individuals are asked to accomplish demanding tasks whereby a hidden regularity or an abstract rule has to be learned dynamically. Although performance in such tasks is considered a proxy for measuring high-level cognitive processes, the standard approach consists in summarizing observed response patterns by simple heuristic scoring measures. With this work, we propose and validate a new computational Bayesian model accounting for individual performance in the Wisconsin Card Sorting Test (WCST), a renowned clinical tool to measure set-shifting and deficient inhibitory processes on the basis of environmental feedback. We formalize the interaction between the task’s structure, the received feedback, and the agent’s behavior by building a model of the information processing mechanisms used to infer the hidden rules of the task environment. Furthermore, we embed the new model within the mathematical framework of the Bayesian Brain Theory (BBT), according to which beliefs about hidden environmental states are dynamically updated following the logic of Bayesian inference. Our computational model maps distinct cognitive processes onto separable, neurobiologically plausible, information-theoretic constructs underlying observed response patterns. We assess model identification and expressiveness in accounting for meaningful human performance through extensive simulation studies. We then validate the model on real behavioral data in order to highlight the utility of the proposed model in recovering cognitive dynamics at an individual level. We highlight the potential of our model in decomposing adaptive behavior in the WCST into several information-theoretic metrics revealing the trial-by-trial unfolding of information processing by focusing on two exemplary individuals whose behavior is examined in depth. Finally, we focus on the theoretical implications of our computational model by discussing the mapping between BBT constructs and functional neuroanatomical correlates of task performance. We further discuss the empirical benefit of recovering the assumed dynamics of information processing for both clinical and research practices, such as neurological assessment and model-based neuroscience.

Keywords: Adaptive behavior · Bayesian brain · Cognitive modeling · Wisconsin Card Sorting Task
Introduction
Computational models of cognition provide a way to formally describe and empirically account for mechanistic, process-based theories of adaptive cognitive functioning ([66, 15, 43]). A foundational theoretical framework for describing functional characteristics of neurocognitive systems has recently emerged under the umbrella of Bayesian brain theories ([37, 25]). Bayesian brain theories owe their name to their core assumption that neural computations resemble the principles of Bayesian statistical inference.

In a Bayesian theoretical framework, cognitive agents interact with an uncertain and changeable sensory environment. This requires a cognitive system to infer sensory contingencies based on an internal generative model of the environment. Such a generative model represents subjective hypotheses, or beliefs, about the causal structure of events in the environment ([24, 37]) and forms a basis for adaptive behavior. It is assumed that internal beliefs are constantly updated and refined to match the current state of the world as new observations become available. The core idea behind the Bayesian brain hypothesis is that the computational mechanisms underlying such internal belief updating follow the logic of Bayesian probability theory. In this respect, information about the external world provided by sensory inputs is represented as a conditional probability distribution over a set of environmental states. Consequently, the brain relies on this probabilistic representation of the world to infer the most likely environmental causes (states) which generate those inputs, and such a process follows the computational principles of Bayesian inference ([28, 25, 13]).

To clarify this concept, consider a simple example of a perceptual task in which a cognitive agent is required to judge whether an item depicted on a flat plane is concave or convex. Its judgment is based solely on a set of observed perceptual features, such as shape, orientation, texture, and brightness. Here, the concave-to-convex gradient constitutes the set of environmental states which must be inferred. The internal generative model of the agent codifies beliefs about how different degrees of convexity might give rise to certain configurations of perceptual inputs. From a Bayesian perspective, the problem is solved by inverting the generative model of the environment in order to turn assumptions about how environmental states generate sensory inputs into beliefs about the most likely states (e.g., degree of convexity) given the available sensory information.

Potentially, there are no limitations regarding the complexity of environmental settings (e.g., items and rules in experimental tasks) and cognitive processes to be described in light of the Bayesian brain framework. Indeed, the latter has proven to be a consistent computational modeling paradigm for the investigation of a variety of neurocognitive mechanisms, such as motor control ([29]), oculomotor dynamics ([26]), object recognition ([36]), attention ([18]), perceptual inference ([54, 37]), and multisensory integration ([39]), as well as for providing a foundational theoretical account of general neural systems’ functioning ([44, 24, 23]) and of complex clinical scenarios such as schizophrenia ([64]) and autism spectrum disorder ([32, 42]). For this reason, such a modeling approach might provide a comprehensive and unified framework under which several cognitive impairments can be measured and understood in the light of a general process-based theory of neural functioning.

In this work, we address the challenging problem of modeling adaptive behavior in a dynamic environment. The empirical assessment of adaptive functioning often relies on dynamic reinforcement learning scenarios which require participants to adapt their behavior during the unfolding of a (possibly) demanding task. Typically, these tasks are designed with the aim of figuring out how adaptive behavior unfolds through multiple trials as participants observe certain environmental contingencies, take actions, and receive feedback based on their actions. From a Bayesian theoretical perspective, optimal performance in such adaptive experimental paradigms requires that agents infer the probabilistic model underlying the hidden environmental states. Since these models usually change as the task progresses, agents, in turn, need to adapt their inferred model in order to take optimal actions.

Here, we propose and validate a computational Bayesian model which accounts for the dynamic behavior of cognitive agents in the Wisconsin Card Sorting Test (WCST; [8, 33]), which is perhaps the most widely adopted neuropsychological setting employed to investigate adaptive functioning, due to its specificity in accounting for executive components underlying observed behavior, such as set-shifting, cognitive flexibility, and impulsive response modulation ([10, 2]). For this reason, we consider the WCST a fundamental paradigm for investigating adaptive behavior from a Bayesian perspective.

The environment of the WCST consists of a target card and a set of stimulus cards with geometric figures which vary according to three perceptual features. The WCST requires participants to infer the correct classification principle by trial and error using the examiner’s feedback. The feedback is thought to carry positive or negative information signaling to the agent whether the immediate action was appropriate or not. Modeling adaptive behavior in the WCST from a Bayesian perspective is straightforward, since observable actions emerge from the interaction between the internal probabilistic model of the agent and a set of discrete environmental states.
Performance in the WCST is usually measured via rough summary metrics, such as the number of correct/incorrect responses or pre-defined psychological scoring criteria (see for instance [33]). These metrics are then used to infer the underlying cognitive processes involved in the task. A major shortcoming of this approach is that it simply assumes the cognitive processes to be inferred without specifying an explicit process model. Moreover, summary measures do not utilize the full information present in the data, such as trial-by-trial fluctuations or various interesting agent-environment interactions. For this reason, crude scoring measures are often insufficient to disentangle the dynamics of the relevant cognitive (sub)processes involved. Consequently, an entanglement between processes at the metric level can prevent us from answering interesting research questions about aspects of adaptive behavior.

In our view, a sound computational account of adaptive behavior in the WCST needs to provide at least a quantitative measure of effective belief updating about the environmental states at each trial. This measure should be complemented by a measure of how feedback-related information influences behavior. The first measure should account for the integration of meaningful information. In other words, it should describe how prior beliefs about the current environmental state change after an observation has been made. The second measure should account for signaling the (im)probability of observing a certain environmental configuration (e.g., an (un)expected feedback given a response) ([59]).

Indeed, recent studies suggest that the meaningful information content and the pure unexpectedness of an observation are processed differently at the neural level. Moreover, such a disentanglement appears to be of crucial importance for understanding how new information influences adaptive behavior ([50, 59, 53]). Inspired by these results and previous computational proposals ([38]), we integrate these different information processing aspects into the current model from an information-theoretic perspective.

Our computational cognitive model draws heavily on the mathematical frameworks of Bayesian probability theory and information theory ([58]). First, it provides a parsimonious description of observed data in the WCST via two neurocognitively meaningful parameters, namely, flexibility and information loss (to be motivated and explained in the Model section). Moreover, it captures the main response patterns obtainable in the WCST via different parameter configurations. Second, we formulate a functional connection between cognitive parameters and underlying information processing mechanisms related to belief updating and prediction formation. We formalize and distinguish between Bayesian surprise and Shannon surprise as the main mechanisms for adaptive belief updating. Moreover, we introduce a third quantity, which we name predictive entropy and which quantifies an agent’s subjective uncertainty about the current internal model. Finally, we propose to measure these quantities on a trial-by-trial basis and to use them as a proxy for formally representing the dynamic interplay between agents and environments.

The rest of the paper is organized as follows. First, the WCST is described in more detail and a mathematical representation of the new Bayesian computational model is provided. Afterwards, we explore the model’s characteristics through simulations and perform parameter recovery on simulated data using a powerful Bayesian deep neural network method ([55]). We then apply the model to real behavioral data from an already published dataset. Finally, we discuss the results as well as the main strengths and limitations of the proposed model.
The Wisconsin Card Sorting Test
In a typical WCST ([33, 8]), participants learn to pay attention and respond to relevant stimulus features, while ignoring irrelevant ones, as a function of experimental feedback. In particular, individuals are asked to match a target card with one of four stimulus cards according to a proper sorting principle, or sorting rule. Each card depicts geometric figures that vary in terms of three features, namely, color (red, green, blue, yellow), shape (triangle, star, cross, circle), and number of objects (1, 2, 3, and 4). For each trial, the participant is required to identify the sorting rule which is valid for that trial, that is, which of the three features has to be considered as the criterion for matching the target card with the right stimulus card (see Figure 1). Note that features and sorting rules refer to the same concept; however, a feature codifies a property of the card, whilst the sorting rule refers to the particular feature which is valid for the current trial.
Figure 1: Suppose that the current sorting rule is the feature shape. The target card in the first trial (left box) contains two blue triangles. A correct response requires that the agent matches the target card with the stimulus card containing the single triangle (the arrow represents the correct choice), regardless of the features color and number. The same applies to the second trial (right box), in which matching the target card with the stimulus card containing three yellow crosses is the correct response.

Each response in the WCST is followed by feedback informing the participant whether his/her response is correct or incorrect. After some fixed number of consecutive correct responses, the sorting rule is changed by the experimenter without warning, and participants are required to infer the new sorting rule. Clearly, the most adaptive response would be to explore the remaining possible rules. However, participants sometimes persist in responding according to the old rule and produce what is called a perseverative response.

Methods
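To make the task structure concrete, the environment can be sketched in a few lines of code. This is an illustrative sketch, not the authors' implementation; the card encoding and the names `make_deck` and `feedback` are our own assumptions.

```python
FEATURES = ("color", "shape", "number")

def make_deck():
    """Four canonical stimulus cards, one per value of each feature."""
    colors = ["red", "green", "blue", "yellow"]
    shapes = ["triangle", "star", "cross", "circle"]
    return [{"color": colors[i], "shape": shapes[i], "number": i + 1}
            for i in range(4)]

def feedback(target, stimulus_cards, choice, rule):
    """Return 1 if the chosen card matches the target on the active feature."""
    return int(stimulus_cards[choice][rule] == target[rule])

# Example from Figure 1: the rule is "shape", the target shows two blue triangles.
deck = make_deck()
target = {"color": "blue", "shape": "triangle", "number": 2}
print(feedback(target, deck, choice=0, rule="shape"))  # card 0 is the triangle -> 1
```

A perseverative response then corresponds to repeatedly calling `feedback` with a `choice` that was rewarded under the previous rule after the examiner has silently changed `rule`.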
The Model
The core idea behind our computational framework is to encode the concept of belief into a generative probabilistic model of the environment. Belief updating then corresponds to recursive Bayesian updating of the internal model based on current and past interactions between the agent and its environment. Optimal or sub-optimal actions are selected according to a well-specified or a misspecified internal model and, in turn, cause perceptible changes in the environment.

We assume that the cognitive agent aims to infer the true hidden state of the environment by processing and integrating sensory information from the environment. Within the context of the WCST, the hidden environmental states might change as a function of both the structure of the task and the (often sub-optimal) behavioral dynamics, so the agent constantly needs to rely on environmental feedback and its own actions to infer the current state. We assume that the agent maintains an internal probability distribution over the states at each individual trial of the WCST. The agent then updates this distribution upon making new observations. In particular, the hidden environmental states to be inferred are the three features, s_t ∈ {1, 2, 3}, which refer to the three possible sorting rules in the task environment, such that 1: color, 2: shape, and 3: number of objects. The posterior probability of the states depends on an observation vector x_t = (a_t, f_t), which consists of the pair of the agent’s response a_t ∈ {1, 2, 3, 4}, codifying the action of choosing deck 1, 2, 3, or 4, and the received feedback f_t ∈ {0, 1}, referring to the fact that a given response results in a failure (0) or in a success (1), in a given trial t = 0, ..., T. The discrete response a_t represents the stimulus card indicator being matched with a target card at trial t.
We denote a sequence of observations as x^t = (x_1, x_2, ..., x_t) = ((a_1, f_1), (a_2, f_2), ..., (a_t, f_t)) and set x^0 = ∅ in order to indicate that there are no observations at the onset of the task. Thus, trial-by-trial belief updating is recursively computed according to Bayes’ rule:

    p(s_t \mid x^t) = \frac{p(x_t \mid s_t, x^{t-1})\, p(s_t \mid x^{t-1})}{p(x_t \mid x^{t-1})}  \qquad (1)

Accordingly, the agent’s posterior belief about the task-relevant feature s_t after observing a sequence of response-feedback pairs x^t is proportional to the product of the likelihood of observing a particular response-feedback pair and the agent’s prior belief about the task-relevant feature in the current trial. The likelihood of an observation is computed as follows:

    p(x_t \mid s_t = i, x^{t-1}) = \frac{f_t\, p(a_t \mid s_t = i) + (1 - f_t)\bigl(1 - p(a_t \mid s_t = i)\bigr)}{f_t \sum_j p(a_t \mid s_t = j) + (1 - f_t) \sum_j \bigl(1 - p(a_t \mid s_t = j)\bigr)}  \qquad (2)

where j = 1, 2, 3 and p(a_t | s_t = i) indicates the probability of a matching between the target and the stimulus card, assuming that the current feature is i. Here, we assume the likelihood of a current observation to be independent of previous observations without loss of generality, that is, p(x_t | s_t, x^{t-1}) = p(x_t | s_t).

The prior belief for a given trial t is computed based on the posterior belief generated in the previous trial, p(s_{t-1} | x^{t-1}), and the agent’s belief about the probability of transitions between the hidden states, p(s_t | s_{t-1}). The prior belief can also be considered a predictive probability over the hidden states. The predictive distribution for an upcoming trial is computed according to the Chapman-Kolmogorov equation:

    p(s_{t+1} = k \mid x^t) = \sum_{i=1}^{3} p(s_{t+1} = k \mid s_t = i, \Gamma(t))\, p(s_t = i \mid x^t)  \qquad (3)

where Γ(t) represents a stability matrix describing transitions between the states (to be explained shortly). Thus, the agent combines information from the updated belief (posterior distribution) and the belief about the transition properties of the environmental states to predict the most probable future state. The predictive distribution represents the internal model of the cognitive agent according to which actions are generated.

The stability matrix Γ(t) encodes the agent’s belief about the probability of states being stable or likely to change in the next trial. In other words, the stability matrix reflects the cognitive agent’s internal representation of the dynamic probabilistic model of the task environment.
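The recursion in Eqs. (1)-(3) can be sketched numerically. This is a minimal illustration under our own naming conventions; in particular, treating p(a_t | s_t = i) as a 0/1 match indicator is an assumption of the sketch, not a prescription of the model.

```python
import numpy as np

def likelihood(match, f):
    """Eq. (2): per-state likelihood of an (action, feedback) observation.
    match[i] stands for p(a_t | s_t = i): here a 0/1 indicator of whether
    the chosen card matches the target on feature i (sketch assumption)."""
    match = np.asarray(match, dtype=float)
    num = f * match + (1 - f) * (1 - match)
    den = f * match.sum() + (1 - f) * (1 - match).sum()
    return num / den

def posterior(prior, match, f):
    """Eq. (1): Bayes' rule over the three hidden sorting rules."""
    post = likelihood(match, f) * prior
    return post / post.sum()

def predictive(post, gamma):
    """Eq. (3): Chapman-Kolmogorov step with a row-stochastic stability
    matrix gamma, where gamma[i, k] = p(s_{t+1} = k | s_t = i)."""
    return gamma.T @ post

# Positive feedback on a card matching the target only on feature 1 (color):
prior = np.array([1/3, 1/3, 1/3])
post = posterior(prior, match=[1, 0, 0], f=1)
print(post)  # all mass moves to the color rule: [1. 0. 0.]
```

Negative feedback works symmetrically: `posterior(prior, [1, 0, 0], f=0)` removes belief from the disconfirmed rule and redistributes it over the remaining two.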
It is computed on each trial based on the observed response-feedback pair, x_t, and a matching signal, m_t.

The matching signal m_t is a vector informing the cognitive agent which features are currently relevant (meaningful), such that m^(i)_t = 1 when a positive feedback is associated with a response implying feature s_t = i, and m^(i)_t = 0 otherwise. Note that the matching signal is not a free parameter of the model, but is completely determined by the task contingencies. The matching signal vector allows the agent to compute the state activation level ω^(i)_t ∈ [0, 1] for the hidden state s_t = i, which provides an internal measure of the (accumulated) evidence for each hidden state at trial t. Thus, the activation levels of the hidden states are represented by a vector ω_t. The stability matrix is a square and asymmetric matrix related to the hidden state activation levels such that:

    \Gamma(t) = \begin{pmatrix} \omega^{(1)}_t & \tfrac{1 - \omega^{(1)}_t}{2} & \tfrac{1 - \omega^{(1)}_t}{2} \\ \tfrac{1 - \omega^{(2)}_t}{2} & \omega^{(2)}_t & \tfrac{1 - \omega^{(2)}_t}{2} \\ \tfrac{1 - \omega^{(3)}_t}{2} & \tfrac{1 - \omega^{(3)}_t}{2} & \omega^{(3)}_t \end{pmatrix}  \qquad (4)

where the entries Γ_ii(t) in the main diagonal represent the elements of the activation vector ω_t, and the non-diagonal elements are computed so as to ensure that rows sum to 1. The state activation vector is computed in each trial as follows:

    \begin{pmatrix} \omega^{(1)}_t \\ \omega^{(2)}_t \\ \omega^{(3)}_t \end{pmatrix} = f_t\, \boldsymbol{\omega}^{\delta}_{t-1} \circ \begin{pmatrix} m^{(1)}_t \\ m^{(2)}_t \\ m^{(3)}_t \end{pmatrix} + \lambda (1 - f_t)\, \boldsymbol{\omega}^{\delta}_{t-1} \circ \begin{pmatrix} 1 - m^{(1)}_t \\ 1 - m^{(2)}_t \\ 1 - m^{(3)}_t \end{pmatrix} \circ \begin{pmatrix} \omega^{(1)}_{t-1} \\ \omega^{(2)}_{t-1} \\ \omega^{(3)}_{t-1} \end{pmatrix}  \qquad (5)

This equation reflects the idea that state activations are simultaneously affected by the observed feedback, f_t, and the matching signal vector, m_t. However, the matching signal vector conveys different information based on the current feedback. Matching a target card with a stimulus card makes a feature (or a subset of features) informative for a specific state. The vector m_t contributes to increasing the activation level of a state if the feature is informative for that state when a positive feedback is received, as well as to decreasing the activation level when a negative feedback is received.

The parameter λ ∈ [0, 1] modulates the efficiency to disengage attention from a given state-activation configuration when a negative feedback is processed. We therefore term this parameter flexibility. We also assume that information from the matching signal vector can degrade by slowing down the rate of evidence accumulation for the hidden states. This means that the matching signal vector can be re-scaled based on the current state activation level. The parameter δ ∈ [0, 1] is introduced to achieve this re-scaling. When δ = 0, there is no re-scaling and the updating of the state activation levels relies on the entire information conveyed by m_t. On the other extreme, when δ = 1, several trials have to be accomplished before converging to a given configuration of the state activation levels. Equivalently, higher values of δ affect the entropy of the distribution over hidden states by decreasing the probability of sampling the correct feature. We therefore refer to δ as information loss.

The free parameters λ and δ are central to our computational model, since they regulate the rate at which the internal model converges to the true task environmental model. Eq. (5) can be expressed in compact notation as follows:

    \boldsymbol{\omega}_t = f_t\, \boldsymbol{\omega}^{\delta}_{t-1} \circ \mathbf{m}_t + \lambda \bigl[ (1 - f_t)\, \boldsymbol{\omega}^{\delta}_{t-1} \circ (\mathbf{1} - \mathbf{m}_t) \bigr] \circ \boldsymbol{\omega}_{t-1}  \qquad (6)

Note that the information loss parameter δ affects the amount of information that a cognitive agent acquires from environmental contingencies, irrespective of the type of feedback received. Global information loss thus affects the rate at which the divergence between the agent’s internal model and the true model is minimized. Figure 2 illustrates these ideas.

The probabilistic representation of adaptive behaviour provided by our Bayesian agent model allows us to quantify latent cognitive dynamics by means of meaningful information-theoretic measures.
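The stability matrix and the activation update can be sketched as follows. The update rule below follows our reading of Eqs. (5)-(6) and should be treated as an assumption of the sketch, as should the function names.

```python
import numpy as np

def stability_matrix(omega):
    """Eq. (4): the diagonal holds the activations; the off-diagonal entries
    split the remaining mass evenly so that each row sums to 1."""
    g = np.empty((3, 3))
    for i in range(3):
        g[i, :] = (1.0 - omega[i]) / 2.0
        g[i, i] = omega[i]
    return g

def update_activations(omega, m, f, lam, delta):
    """Eqs. (5)-(6), as we read them: positive feedback pushes matched-state
    activations toward 1 via elementwise powering (faster for small delta);
    negative feedback shifts evidence to unmatched states, scaled by lam."""
    omega, m = np.asarray(omega, float), np.asarray(m, float)
    return f * omega**delta * m + lam * (1 - f) * omega**delta * (1 - m) * omega

# One positive-feedback update starting from undifferentiated activations:
omega = update_activations([0.5, 0.5, 0.5], m=[1, 0, 0], f=1, lam=0.8, delta=0.5)
g = stability_matrix(omega)
print(g[0])  # the matched state becomes more stable: ~[0.707 0.146 0.146]
```

Note how repeated powering drives the matched activation toward 1 (0.5 → 0.5^δ → ...), which is one way to read the convergence behavior the text attributes to δ.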
Information theory has proven to be an effective and natural mathematical language to account for the functional integration of structured cognitive processes and to relate them to brain activity ([38, 27, 14, 65, 23]). In particular, we are interested in three key measures, namely, Bayesian surprise, B_t, Shannon surprise, I_t, and entropy, H_t. The subscript t indicates that we can compute each quantity on a trial-by-trial basis. Each quantity is amenable to a specific interpretation in terms of separate neurocognitive processes. Bayesian surprise B_t quantifies the magnitude of the update from prior belief to posterior belief. Shannon surprise I_t quantifies the improbability of an observation given an agent’s prior expectation. Finally, entropy H_t measures the degree of epistemic uncertainty regarding the true environmental states. Such measures are thought to account for the ability of the agent to manage uncertainty emerging as a function of competing behavioral affordances ([34]). We expect an adaptive system to attenuate uncertainty over environmental states (current features) by reducing the entropy of its internal probabilistic model.

Bayesian surprise can be computed as the Kullback–Leibler (KL) divergence between prior and posterior beliefs about the environmental states. Thus, Bayesian surprise accounts for the divergence between the predictive model for the current trial and the updated predictive model for the upcoming trial. It is computed as follows:

    B_t = \mathrm{KL}\bigl[ p(s_{t+1} \mid x^t) \,\|\, p(s_t \mid x^{t-1}) \bigr] = \sum_{i=1}^{3} p(s_{t+1} = i \mid x^t) \log \frac{p(s_{t+1} = i \mid x^t)}{p(s_t = i \mid x^{t-1})}  \qquad (7)

The Shannon surprise of a current observation given the previous ones is computed as the conditional information content of the observation:

    I_t = -\log p(x_t \mid x^{t-1}) = -\log \sum_{i=1}^{3} p(x_t \mid s_t = i)\, p(s_t = i \mid x^{t-1})  \qquad (8)

Finally, the entropy is computed over the predictive distribution in order to account for the uncertainty in the internal model of the agent in trial t as follows:

    H_t = \mathbb{E}\bigl[ -\log p(s_t \mid x^{t-1}) \bigr] = -\sum_{i=1}^{3} p(s_t = i \mid x^{t-1}) \log p(s_t = i \mid x^{t-1})  \qquad (9)
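The three quantities in Eqs. (7)-(9) are straightforward to compute once the relevant distributions are available. A minimal sketch, assuming strictly positive probabilities so that the logarithms are well defined:

```python
import numpy as np

def bayesian_surprise(new_pred, old_pred):
    """Eq. (7): KL divergence between the updated and the previous
    predictive distribution over the three sorting rules."""
    new_pred, old_pred = np.asarray(new_pred), np.asarray(old_pred)
    return float(np.sum(new_pred * np.log(new_pred / old_pred)))

def shannon_surprise(lik, prior):
    """Eq. (8): negative log marginal probability of the observation,
    where lik[i] = p(x_t | s_t = i) and prior is p(s_t | x^{t-1})."""
    return float(-np.log(np.sum(np.asarray(lik) * np.asarray(prior))))

def entropy(pred):
    """Eq. (9): Shannon entropy of the predictive distribution."""
    pred = np.asarray(pred)
    return float(-np.sum(pred * np.log(pred)))

uniform = np.array([1/3, 1/3, 1/3])
print(entropy(uniform))                     # log(3) = 1.0986..., maximal uncertainty
print(bayesian_surprise(uniform, uniform))  # 0.0, no belief update
```

In a trial-by-trial analysis, these functions would be evaluated on the predictive distributions produced by the belief-updating recursion, yielding the time courses shown in Figure 2.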
Figure 2: Suppose the correct sorting rule is the feature shape. The figure shows the rate of convergence of the predictive distributions to the true task environmental model. The predictive distribution at trial t+1 depends on the sorting action a_t (first row) and the received feedback f_t (second row). Two examples of updating a predictive distribution are shown: one in which information loss is high (δ = 0. , third row), and one in which information loss is low (δ = 0. , fifth row). High information loss slows down the convergence of the internal model to the true environmental model. The gray bar plots represent the predictive probability distribution over the rules from which an action is sampled at each trial. Dotted bars represent the updated predictive distribution after the feedback observation. For each scenario, trial-by-trial information-theoretic measures are shown.
Expression | Name | Description
s_t ∈ {1, 2, 3} | Sorting rule | Card feature relevant for the sorting criterion in trial t.
a_t ∈ {1, 2, 3, 4} | Choice action | Action of choosing one of the four stimulus cards in trial t.
f_t ∈ {0, 1} | Feedback | Indicates whether the action of matching a stimulus to a target card is correct or not in trial t.
x_t = (a_t, f_t) | Observation | Pair of action and feedback which constitutes the agent’s observation in trial t.
Γ(t) | Stability matrix | Matrix encoding the agent’s beliefs about state transitions from trial t to the next trial t+1.
λ ∈ [0, 1] | Flexibility | Parameter encoding the efficiency to disengage attention from a currently attended hidden state when signaled by the environment.
δ ∈ [0, 1] | Information loss | Parameter encoding how efficiently the agent’s internal model converges to the true environmental model based on experience.
m^(i)_t ∈ {0, 1} | Matching signal | Signal indicating whether feature i is relevant in trial t based on the feedback received.
ω^(i)_t ∈ [0, 1] | State activation level | Agent’s internal measure of the accrued evidence for the hidden environmental state i in trial t.
B_t ∈ R+ | Bayesian surprise | Kullback–Leibler divergence between prior and posterior beliefs about hidden environmental states in trial t.
I_t ∈ R+ | Shannon surprise | Information-theoretic surprise encoding the improbability or unexpectedness of an observation in trial t.
H_t ∈ R+ | Entropy | Degree of epistemic uncertainty in the internal model of the environment in trial t.

Table 1: Descriptive summary of all quantities involved in our model representation.

Once the flexibility (λ) and information loss (δ) parameters are estimated from data, the information-theoretic quantities can be easily computed and visualized for each trial of the WCST (see Figure 2). This allows us to rephrase standard neurocognitive constructs in terms of measurable information-theoretic quantities.
Moreover, the dynamics of these quantities, as well as their interactions, can be used for formulating and testing hypotheses about the neurocognitive underpinnings of adaptive behavior in a principled way, as discussed later in the paper. A summary of all quantities relevant for our computational model is provided in Table 1.

Simulations
In this section, we evaluate the expressiveness of the model by assessing its ability to reproduce meaningful behavioral patterns as a function of its two free parameters. We study how the generative model behaves when performing the WCST in a 2-factorial simulated Monte Carlo design in which flexibility (λ) and information loss (δ) are systematically varied.

In this simulation, the Heaton version of the task ([33]) is administered to the Bayesian cognitive agent. In this particular version, the sorting rule (true environmental state) changes after a fixed number of consecutive correct responses. In particular, when the agent correctly matches the target card in 10 consecutive trials, the sorting rule is automatically changed. The task ends after completing a maximum of 128 trials.

Generative Model
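The Heaton-style administration described above can be sketched as a simple driver loop. The function name, the callable interface, and the cyclic rule order are our own assumptions for illustration.

```python
def run_schedule(agent_correct, n_trials=128, streak_to_switch=10, n_rules=3):
    """Heaton-style schedule sketch: the sorting rule changes, without
    warning, after `streak_to_switch` consecutive correct responses.
    `agent_correct(trial, rule)` is any callable returning True/False."""
    rule, streak, completed = 0, 0, 0
    for t in range(n_trials):
        if agent_correct(t, rule):
            streak += 1
            if streak == streak_to_switch:
                completed += 1
                rule = (rule + 1) % n_rules   # silent rule change
                streak = 0
        else:
            streak = 0
    return completed

# A perfectly accurate agent completes 128 // 10 = 12 categories:
print(run_schedule(lambda t, r: True))  # 12
```

Any decision policy, including the Bayesian agent below, can be plugged in through the `agent_correct` callable.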
The cognitive agent’s responses are generated at each time step (trial) by processing the experimental feedback. Its performance depends on the parameters governing the computation of the relevant quantities. The generative algorithm is outlined in Algorithm 1.
Algorithm 1: Bayesian cognitive agent

Set parameters θ = (λ, δ).
Set initial activation levels ω_0 = (0. , 0. , 0. ).
Set initial observation x^0 = ∅ and p(s_1 | x^0) = p(s_1).
for t = 1, ..., T do
    Sample a feature from the prior/predictive internal model, s_t ∼ p(s_t | x^{t-1}).
    Obtain a new observation x_t = (a_t, f_t).
    Compute the state posterior p(s_t | x^t).
    Compute the new activation levels ω_t.
    Compute the stability matrix Γ(t).
    Update the prior/predictive internal model to p(s_{t+1} | x^t).
end for

Simulation 1: Clinical Assessment of the Bayesian Agent

Ideally, the qualitative performance of the Bayesian cognitive agent will resemble human performance. To this aim, we adopt metrics which are usually employed in the clinical assessment of test results in neurological and psychiatric patients ([11, 71, 7, 41]). Thus, agent performance is codified according to a neuropsychological criterion ([33, 21]) which allows us to classify responses into several response types. These response types provide the scoring measures for the test.

Here, we are interested in: 1) non-perseverative errors (E); 2) perseverative errors (PE); 3) the number of trials to complete the first category (TFC); and 4) the number of failures to maintain set (FMS). Perseverative errors occur when the agent applies a sorting rule which was valid before the rule was changed. Usually, detecting a perseverative error is far from trivial, since several response configurations can be observed when individuals are required to shift a sorting rule after completing a category (see [21] for details). On the other hand, non-perseverative errors refer to all errors which do not fit the above description, or in other words, do not occur as a function of changing the sorting rule, such as casual errors.

The number of trials to complete the first category tells us how many trials the agent needs in order to achieve the first sorting principle, and can be seen as an index of conceptual ability ([4, 60]).
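Algorithm 1 can be sketched end-to-end as follows. This is an illustrative reimplementation under simplifying assumptions of our own (unambiguous cards so that each response implies exactly one candidate rule, a cyclic rule order, our reading of the activation update in Eqs. (5)-(6), and a small floor on the activations that keeps ruled-out states recoverable); it is not the authors' code.

```python
import numpy as np

def simulate_agent(lam, delta, n_trials=128, seed=1):
    """Sketch of Algorithm 1 with a built-in Heaton-style schedule."""
    rng = np.random.default_rng(seed)
    omega = np.array([0.5, 0.5, 0.5])     # initial activation levels (assumed)
    pred = np.array([1/3, 1/3, 1/3])      # p(s_1 | x^0): flat prior
    true_rule, streak, n_correct = 0, 0, 0
    for _ in range(n_trials):
        choice = rng.choice(3, p=pred)    # sample a rule from the internal model
        f = int(choice == true_rule)      # examiner's feedback
        m = np.eye(3)[choice]             # matching signal for the implied rule
        lik = f * m + (1 - f) * (1 - m)   # Eq. (2), up to normalization
        post = lik * pred / np.sum(lik * pred)           # Eq. (1)
        omega = f * omega**delta * m + lam * (1 - f) * omega**delta * (1 - m) * omega
        omega = np.clip(omega, 0.01, 1.0)  # floor: an assumption of this sketch
        gamma = np.tile((1 - omega) / 2, (3, 1)).T       # Eq. (4): off-diagonals
        np.fill_diagonal(gamma, omega)                   # Eq. (4): diagonal
        pred = gamma.T @ post             # Eq. (3): next predictive distribution
        n_correct += f
        streak = streak + 1 if f else 0
        if streak == 10:                  # Heaton criterion: silent rule switch
            true_rule, streak = (true_rule + 1) % 3, 0
    return n_correct

print(simulate_agent(lam=0.9, delta=0.1))  # total correct responses out of 128
```

Sweeping `lam` and `delta` over a grid and scoring the resulting response sequences reproduces, in spirit, the 2-factorial Monte Carlo design described above.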
Finally, a failure to maintain set occurs when the agent fails to match cards according to the sorting rule after it can be determined that the agent has acquired the rule. A given sorting rule is assumed to be acquired when the individual correctly sorts at least five cards in a row ([33, 19]). Thus, a failure to maintain set arises whenever a participant suddenly changes the sorting strategy in the absence of negative feedback. Failures to maintain set are mostly attributed to distractibility. We compute this measure by counting the occurrences of first errors after the acquisition of a rule.

We run the generative model by varying flexibility across four levels and information loss across three levels. We generate data from 150 synthetic cognitive agents per parameter combination and compute the standard scoring measures for each agent’s simulated responses. Results from the simulation runs are depicted in Table 2, and a graphical representation is provided in Figure 3.
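The two set-related scoring measures can be approximated from a feedback sequence alone. The functions below are simplified illustrations with names of our own choosing; the clinical criteria, especially for perseveration, are more involved (see [21]).

```python
def trials_to_first_category(feedbacks, criterion=10):
    """Simplified TFC: trials elapsed when the first run of `criterion`
    consecutive correct responses is completed (None if never reached)."""
    streak = 0
    for t, f in enumerate(feedbacks, start=1):
        streak = streak + 1 if f else 0
        if streak == criterion:
            return t
    return None

def failures_to_maintain_set(feedbacks, acquired=5):
    """Simplified FMS: count first errors occurring after at least
    `acquired` consecutive correct responses (rule assumed acquired)."""
    streak, fms = 0, 0
    for f in feedbacks:
        if f:
            streak += 1
        else:
            if streak >= acquired:
                fms += 1
            streak = 0
    return fms

seq = [1] * 6 + [0] + [1] * 10       # six correct, a lapse, then ten correct
print(trials_to_first_category(seq))  # 17
print(failures_to_maintain_set(seq))  # 1
```

Applied to simulated response sequences, such counters yield the E/PE/TFC/FMS profiles summarized in Table 2.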
OVEMBER
30, 2020
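To make the simulation procedure concrete, the agent loop from the algorithm box above can be sketched in code. The following is a minimal, illustrative rendering: the feedback likelihood, the tempering of updates by flexibility λ, and the decay of beliefs toward uniform governed by information loss δ are plausible simplifications of the model's logic, not the paper's exact equations (the activation levels ω_t and stability matrix Γ(t) are omitted here).

```python
import numpy as np

rng = np.random.default_rng(0)
FEATURES = ["color", "form", "number"]  # candidate sorting rules (hidden states)

def simulate_agent(true_rule, lam=0.9, delta=0.1, n_trials=30):
    """Toy rendering of the agent loop; the likelihood and the belief decay
    are illustrative stand-ins, not the paper's exact equations."""
    n = len(FEATURES)
    belief = np.full(n, 1.0 / n)        # p(s | x_{1:t-1}), initially uniform
    history = []
    for _ in range(n_trials):
        s = rng.choice(n, p=belief)     # sample a rule from the internal model
        feedback = 1 if FEATURES[s] == true_rule else 0
        # Likelihood of the feedback under each candidate rule, tempered by
        # flexibility lam (lam -> 1: feedback fully used; lam -> 0: ignored)
        like = np.where(np.arange(n) == s, feedback, (1 - feedback) / (n - 1))
        like = lam * like + (1 - lam) / n
        posterior = belief * like
        posterior /= posterior.sum()    # p(s_t | x_t)
        # Information loss: next-trial prior decays toward uniform at rate delta
        belief = (1 - delta) * posterior + delta / n
        history.append((FEATURES[s], feedback))
    return belief, history

belief, history = simulate_agent("color")  # the agent should settle on "color"
```

With high flexibility and low information loss, the belief mass concentrates on the true rule within a handful of trials; lowering λ or raising δ slows or destabilizes this convergence, mirroring the error patterns discussed below.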
Table 2: Scoring measures as functions of flexibility (λ) and information loss (δ). Cells show the average scores across simulated agents (standard deviations are shown in parentheses).

The simulated performance of our Bayesian cognitive agents demonstrates that different parameter combinations capture different meaningful behavioral patterns. In other words, flexibility and information loss seem to interact in a theoretically meaningful way.

First, overall errors increase when flexibility (λ) decreases, which is reflected by the inverse relation between the number of casual, as well as perseverative, errors and the values of parameter λ. Moreover, this pattern is consistent across all levels of parameter δ. More precisely, information loss (δ) seems to contribute differently to the characterization of the casual and the perseverative components of the error. Perseverative errors are likely to occur after a sorting rule has changed and reflect the inability of the agent to use feedback to disengage attention from the currently attended feature. They therefore result from local cognitive dynamics conditioned on a particular stage of the task (e.g., after completing a series of correct responses).

Second, information loss does not interact with flexibility when perseverative errors are considered. This is due to the fact that high information loss affects general performance by yielding a dysfunctional response strategy which increases the probability of making an error at any stage of the task.
The lack of such an interaction provides evidence that our computational model can disentangle error patterns due to perseveration from those due to general distractibility, according to neuropsychological scoring criteria.

However, in our framework, flexibility (λ) is allowed to yield more general and non-local cognitive dynamics as well. Indeed, λ plays a role whenever belief updating is demanded as a function of negative feedback. An error classified as non-perseverative (e.g., a casual error) by the scoring criteria might still be processed as feedback-related evidence for belief updating. Consistently, the interaction between λ and δ in accounting for casual errors shows that performance worsens when both flexibility and information loss become less optimal, and that this pattern becomes more pronounced for lower values of δ.

On the other hand, a specific effect of information loss (δ) can be observed for the scoring measures related to slow information processing and distractibility. The number of trials to achieve the first category reflects the efficiency of the agent in arriving at the first true environmental model. Flexibility does not contribute meaningfully to the accumulation of errors before completing the first category for some levels of information loss. This is reflected by the fact that the mean number of trials increases as a function of δ, and does not change across levels of λ for low and mid values of δ. A similar pattern applies to failures to maintain set. Both scoring measures index a deceleration of the process of evidence accumulation for a specific environmental configuration, although the latter is a more exhaustive measure of dysfunctional adaptation.

Therefore, an interaction between parameters can be observed when information loss is high. A slow internal model convergence process increases the number of errors due to improper rule sampling from the internal environmental model.
However, internal model convergence also plays a role when a new category has to be accomplished after completing an older one. On the one hand, compromised flexibility increases the number of errors due to inefficient feedback processing. This leads to longer trial windows needed to achieve the first category. On the other hand, when information loss is high, belief updating upon negative feedback is compromised due to high internal model uncertainty. At this point, the probability of erring due to distractibility increases, as accounted for by the failures-to-maintain-set measure.

Finally, the joint effect of δ and λ for high levels of information loss suggests that the roles played by the two cognitive parameters in accounting for adaptive functioning can be entangled when neuropsychological scoring criteria are considered.

Figure 3: Clinical scoring measures as functions of flexibility (λ) and information loss (δ) - simulated scenarios. The different cells show the violin plots for the estimated distribution densities of the scoring measures obtained from the group of synthetic individuals, for the levels of λ across different levels of δ. In particular, they show the distributions of non-perseverative errors (E), perseverative errors (PE), number of trials to complete the first category (TFC), and number of failures to maintain set (FMS) obtained from 150 synthetic agents' simulated responses for each cell of the factorial design.

Simulation 2: Information-Theoretic Analysis of the Bayesian Agent
In the following, we explore a different simulation scenario in which information-theoretic measures are derived to assess the performance of the Bayesian cognitive agent. In particular, we explore the functional relationship between cognitive parameters and the dynamics of the recovered information-theoretic measures by simulating observed responses while varying flexibility across three levels of λ and information loss across three levels of δ. For this simulation scenario, we make no prior assumptions about sub-types of error classification. Instead, we investigate the dynamic interplay between Bayesian surprise, B_t, Shannon surprise, I_t, and entropy, H_t, over the entire course of 128 trials in the WCST.

Figure 4: Information-theoretic measures varying as a function of flexibility λ and information loss δ across 128 trials of the WCST. Optimal belief updating and uncertainty reduction are achieved with low information loss and high flexibility (first row, third column).

Figure 4 depicts results from the nine simulation scenarios. Although an exhaustive discussion of cognitive dynamics should couple information-theoretic measures with patterns of correct and error responses, we focus solely on the information-theoretic time series for illustrative purposes. We refer to the Application section for a more detailed description of the relation between observed responses and estimated information-theoretic measures in the context of data from a real experiment.

Again, the simulated performance of the Bayesian cognitive agent shows that different parameter combinations yield different patterns of cognitive dynamics. Observed spikes and their related magnitudes signal informative task events (e.g., unexpected negative feedback), as accounted for by Shannon surprise, or belief updating, as accounted for by Bayesian surprise.
Finally, entropy encodes the epistemic uncertainty about the environmental model on a trial-by-trial basis.

In general, low information loss (δ) ensures optimal behavior by speeding up internal model convergence, that is, by decreasing the number of trials needed to minimize uncertainty about the environmental states. Low uncertainty reflects two main aspects of adaptive behavior. On the one hand, the probability that a response occurs due to the sampling of improper rules decreases, allowing the agent to prevent random responses due to distractibility. On the other hand, model convergence entails a peaked Shannon surprise when negative feedback occurs, due to the divergence between predicted and actual observations.

Flexibility (λ) plays a crucial role in integrating feedback information in order to enable belief updating. The first row of Figure 4 shows cognitive dynamics related to low information loss, across the levels of flexibility. As can be noticed, there is a positive relation between the magnitude of Bayesian surprise and the level of flexibility, although unexpectedness yields approximately the same amount of signaling, as accounted for by peaked Shannon surprise. From this perspective, surprise and belief updating can be considered functionally separable, where the first depends on the particular internal model probability configuration related to δ, whilst the second depends on flexibility λ.

However, more interesting patterns can be observed when information loss increases. In particular, model convergence slows down and several trials are needed to minimize predictive model entropy. Casual errors might occur within trial windows characterized by high uncertainty, and interactions between entropy and Shannon surprise can be observed in such cases. In particular, Shannon surprise magnitude increases when the model's entropy decreases, that is, during task phases in which the internal model has already converged. As a consequence, negative feedback can be classified as informative or uninformative, based on the uncertainty in the current internal model. This is reflected by the negative relation between entropy and Shannon surprise, as can be noticed by inspecting the graphs depicted in the third row of Figure 4.
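The three trial-wise quantities discussed above can be computed directly from the agent's beliefs. The sketch below uses the standard information-theoretic conventions (Bayesian surprise as the KL divergence from prior to posterior, Shannon surprise as the negative log predictive probability of the observation, and the entropy of the current belief, all in nats); the paper's exact formulations may differ in detail.

```python
import numpy as np

def bayesian_surprise(prior, posterior):
    """KL divergence from the prior to the posterior belief (in nats)."""
    prior, posterior = np.asarray(prior, float), np.asarray(posterior, float)
    m = posterior > 0
    return float(np.sum(posterior[m] * np.log(posterior[m] / prior[m])))

def shannon_surprise(pred_prob):
    """Negative log predictive probability of the observed feedback."""
    return float(-np.log(pred_prob))

def entropy(belief):
    """Shannon entropy of the current belief (epistemic uncertainty, in nats)."""
    belief = np.asarray(belief, float)
    m = belief > 0
    return float(-np.sum(belief[m] * np.log(belief[m])))

# Example trial: unexpected negative feedback under a confident internal model
prior = [0.9, 0.05, 0.05]      # the agent is confident in rule 0
posterior = [0.2, 0.4, 0.4]    # the feedback forces a substantial revision
print(bayesian_surprise(prior, posterior))   # large: strong belief updating
print(entropy(prior), entropy(posterior))    # uncertainty rises after the update
```

In the converged regime (low entropy), an error carries a small predictive probability and hence a large Shannon surprise; whether that surprise translates into a large Bayesian surprise depends on how strongly the posterior actually moves, which is exactly the separation between δ-driven and λ-driven dynamics described above.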
Therefore, the magnitude of belief updating depends on the interplay between entropy and Shannon surprise, and can differ based on the values of the two measures in a particular task phase.

In sum, both simulation scenarios suggest that the simulated behavior of our generative model is in accord with theoretical expectations. Moreover, the flexibility and information loss parameters can account for a wide range of observed response patterns and inferred dynamics of information processing.

Parameter Estimation
In this section, we discuss the computational framework for estimating the parameters of our model from observed behavioral data. Parameter estimation is essential for inferring the cognitive dynamics underlying observed behavior in real-world applications of the model. This section is slightly more technical and can be skipped without significantly affecting the flow of the text.
Computational Framework
Rendering our cognitive model suitable for application in real-world contexts also entails accounting for uncertainty about parameter estimates. Indeed, uncertainty quantification turns out to be a fundamental and challenging goal when first-level quantities, that is, cognitive parameter estimates, are used to recover (second-level) information-theoretic measures of cognitive dynamics. The main difficulties arise when model complexity makes estimation and uncertainty quantification intractable at both the analytical and numerical levels. For instance, in our case, probability distributions for the hidden model are generated at each trial, and the mapping between hidden states and responses changes depending on the structure of the task environment.

Identifying such a dynamic mapping is relatively easy from a generative perspective, but it becomes challenging, if not impossible, when inverse modeling is required. Generally, this problem arises when the likelihood function relating model parameters to the data is not available in closed form or is too complex to be practically evaluated ([61]). To overcome these limitations, we apply the first version of the recently developed BayesFlow method (see [55] for mathematical details). At a high level,
BayesFlow is a simulation-based method that estimates parameters and quantifies estimation uncertainty in a unified Bayesian probabilistic framework when inverting the generative model is intractable. The method is based on recent advances in deep generative modeling and makes no assumptions about the shape of the true parameter posteriors. Thus, our ultimate goal becomes to approximate and analyze the joint posterior distribution over the model parameters. The parameter posterior is given via an application of Bayes' rule:

p(θ | x_{1:T}, m_{1:T}) = p(x_{1:T}, m_{1:T} | θ) p(θ) / ∫ p(x_{1:T}, m_{1:T} | θ) p(θ) dθ    (10)

where we set θ = (λ, δ) and stack all observations and matching signals into the vectors x_{1:T} = (x_1, x_2, ..., x_T) and m_{1:T} = (m_1, m_2, ..., m_T), respectively. The BayesFlow method uses simulations from the generative model to optimize a neural density estimator which learns a probabilistic mapping between raw data and parameters. It relies on the fact that data can easily be simulated by repeatedly running the generative model with different parameter configurations θ sampled from the prior. During training, the neural network estimator iteratively minimizes the divergence between the true posterior and an approximate posterior. Once the network has been trained, we can efficiently obtain samples from the approximate joint posterior distribution of the cognitive parameters of interest, which can be further processed in order to extract meaningful summary statistics (e.g., posterior means, medians, modes, etc.). Importantly, we can apply the same pre-trained inference network to an arbitrary number of real or simulated data sets (i.e., the training effort amortizes over multiple evaluations of the network).

For our purposes of validation and application, we train the network for 50 epochs, which amounts to 50000 forward simulations. As a prior, we use a bivariate continuous uniform distribution, p(θ) ∼ U([0, 1] × [0, 1]). We then validate performance on a separate validation set of 1000 simulated data sets with known ground-truth parameter values. Training the networks took less than a day on a single machine with an NVIDIA® GTX1060 graphics card (CUDA version 10.0) using TensorFlow (version 1.13.1) ([1]). In contrast, obtaining full parameter posteriors from the entire validation set took approximately 1.78 seconds. In what follows, we describe and report all performance validation metrics.
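The amortization logic described above (simulate once for many prior draws, then reuse the trained estimator on any observed data set) can be illustrated with a deliberately crude stand-in for the neural density estimator, here a nearest-neighbor lookup on hand-crafted summary statistics. The simulator, the summary statistics, and the mapping from (λ, δ) to accuracy below are all hypothetical; this is an illustration of the simulate-train-amortize workflow, not the BayesFlow algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(theta, n_trials=64):
    """Hypothetical stand-in for the generative model: a binary correct/error
    sequence whose accuracy depends on theta = (lambda, delta)."""
    lam, delta = theta
    p_correct = 0.25 + 0.7 * lam * (1 - delta)   # illustrative mapping only
    return (rng.random(n_trials) < p_correct).astype(float)

def summarize(x):
    """Summary statistics of a response pattern: mean accuracy, switch rate."""
    x = np.asarray(x, float)
    return np.array([x.mean(), np.abs(np.diff(x)).mean()])

# "Training" phase: run the simulator once for many prior draws (amortization)
thetas = rng.uniform(0.0, 1.0, size=(5000, 2))          # theta ~ U([0, 1]^2)
summaries = np.array([summarize(simulate(th)) for th in thetas])

def estimate(x_obs, k=50):
    """Amortized point estimate: average the k prior draws whose simulated
    summaries are closest to the observed summaries."""
    d = np.linalg.norm(summaries - summarize(x_obs), axis=1)
    return thetas[np.argsort(d)[:k]].mean(axis=0)

x_obs = simulate((0.9, 0.1))    # "observed" data from a flexible agent
print(estimate(x_obs))          # point estimate, reusable for any new data set
```

The up-front simulation cost is paid once; afterwards, `estimate` can be applied to any number of data sets at negligible cost, which is the property that makes fitting all 88 participants below almost instantaneous with the actual trained network.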
Performance Metrics and Validation Results
To assess the accuracy of point estimates, we compute the root mean squared error (RMSE) and the coefficient of determination (R²) between posterior means and true parameter values. To assess the quality of the approximate posteriors, we compute a calibration error ([55]) of the empirical coverage of each marginal posterior. Finally, we implement simulation-based calibration (SBC, [67]) for visually detecting systematic biases in the approximate posteriors.

Point Estimates. Point estimates obtained by posterior means, as well as the corresponding RMSE and R² metrics, are depicted in Figure 5A. Note that point estimates do not have any special status in Bayesian inference, as they can be misleading depending on the shape of the posteriors. However, they are simple to interpret and useful for ease of comparison. We observe that pointwise recovery of λ is better than that of δ. This is mainly due to suboptimal pointwise recovery in the lower (0, . ) range of δ. This pattern is evident in Figure 5A and is due to the fact that δ values in this range produce almost indistinguishable data patterns. Bootstrap estimates yielded an average RMSE of . (SD = 0. ) and an average R² of . (SD = 0. ) for the δ parameter. An average RMSE of . (SD = 0. ) and an average R² of . (SD = 0. ) were obtained for the λ parameter. These results suggest good global pointwise recovery but also warrant the inspection of full posteriors, especially in the low ranges of δ.

Full Posteriors. The average bootstrap calibration error was . (SD = 0. ) for the marginal posterior of δ and . (SD = 0. ) for the marginal posterior of λ. Calibration error is perhaps the most important metric here, as it measures potential under- or overconfidence across all confidence intervals of the approximate posterior (i.e., an α-confidence interval should contain the true parameter value with a probability of α, for all α ∈ (0, 1)).
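A sketch of how such validation metrics can be computed from posterior draws and ground-truth values is given below; the calibration error follows the spirit of the coverage-based metric in [55], though the exact definition used there may differ.

```python
import numpy as np

def rmse_r2(true, est):
    """Root mean squared error and coefficient of determination (R^2)."""
    true, est = np.asarray(true, float), np.asarray(est, float)
    rmse = np.sqrt(np.mean((true - est) ** 2))
    r2 = 1.0 - np.sum((true - est) ** 2) / np.sum((true - true.mean()) ** 2)
    return rmse, r2

def calibration_error(true, draws, alphas=np.linspace(0.05, 0.95, 19)):
    """Median absolute gap between the nominal central-interval coverage alpha
    and the empirical coverage over validation data sets, for one parameter.
    `draws` holds one row of posterior draws per data set."""
    gaps = []
    for a in alphas:
        lo = np.quantile(draws, (1 - a) / 2, axis=1)
        hi = np.quantile(draws, (1 + a) / 2, axis=1)
        coverage = np.mean((true >= lo) & (true <= hi))
        gaps.append(abs(coverage - a))
    return float(np.median(gaps))

# Synthetic check: posteriors calibrated by construction give near-zero error
rng = np.random.default_rng(2)
mu = rng.normal(size=500)                           # latent location per data set
true = mu + rng.normal(size=500)                    # ground truth ~ same posterior
draws = mu[:, None] + rng.normal(size=(500, 2000))  # posterior draws ~ N(mu, 1)
print(calibration_error(true, draws))               # small (sampling noise only)
```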
Thus, a low calibration error indicates a faithful uncertainty representation of the approximate posteriors. Additionally, SBC histograms are depicted in Figure 5B. As shown by ([67]), deviations from the uniformity of the rank statistic (also known as a PIT histogram) indicate systematic biases in the posterior estimates. A visual inspection of the histograms reveals that the posterior means slightly overestimate the true values of δ. This corroborates the pattern seen in Figure 5A for the lower range of δ.

Finally, Figure 5C depicts the full marginal posteriors on two example validation data sets. Even on these two data sets, we observe strikingly different posterior shapes. The marginal posterior of δ obtained from the first data set is slightly left-skewed and has its density concentrated over the (0. , . ) range. On the other hand, the marginal posterior of δ from the second data set is noticeably right-skewed and peaked across the lower range of the parameter. The marginal posteriors of λ appear more symmetric and warrant the use of the posterior mean as a useful summary of the distribution. These two examples underline the importance of investigating full posterior distributions as a means to encode epistemic uncertainty about parameter values. Moreover, they demonstrate the advantage of imposing no distributional assumptions on the resulting posteriors, as their form and sharpness can vary widely depending on the concrete data set.

Application
In this section, we fit the Bayesian cognitive model to real clinical data. The aim of this application is to evaluate the ability of our computational framework to account for dysfunctional cognitive dynamics of information processing in substance dependent individuals (SDI) as compared to healthy controls.
Figure 5: Parameter recovery results on validation data; (A) Posterior means vs. true parameter values; (B) Histograms of the rank statistic used for simulation-based calibration; (C) Example full posteriors for two validation data sets; (D) Example information-theoretic dynamics recovered from the parameter posteriors.

Rationale
The advantage of modeling cognitive dynamics in individuals from a clinical population is that model predictions can be examined in light of available evidence about individual performance. For instance, SDIs are known to demonstrate inefficient conceptualization of the task and dysfunctional, error-prone response strategies. This has been attributed to defective error monitoring and behavior modulation systems, which depend on the functionality of cingulate and frontal brain regions ([40, 70]). Healthy participants, on the other hand, should find the WCST a rather easy and straightforward task and obtain excellent performance. Therefore, we expect our model to consistently capture these characteristics. To test these expectations, we estimate the two relevant parameters λ and δ for both clinical patients and healthy controls from an already published dataset ([7]).

The Data
The dataset used in this application consists of responses collected by administering the standard Heaton version of the WCST ([33]) to healthy participants and SDIs. In this version of the task, the sorting rule changes when a participant collects a series of 10 consecutive correct responses, and the task ends after this has happened six times. Participants in the study consisted of 39 SDIs and 49 healthy individuals. All participants were adults and gave their informed consent for inclusion, which was approved by the appropriate human subjects committee at the University of Iowa. SDIs were diagnosed as substance dependent based on the Structured Clinical Interview for DSM-IV criteria ([20]).

Model Fitting
We fit the Bayesian cognitive agent separately to the data from each participant in order to obtain individual-level posterior distributions. We apply the same
BayesFlow network trained for the previous simulation studies, so obtaining posterior samples for each participant is almost instantaneous (due to amortized inference).
Results
The means of the joint posterior distributions are depicted for each individual in Figure 6 and provide a complete overview of the heterogeneity in cognitive sub-components at both the individual and group levels (individual-level full joint posterior distributions can be found in the
SI Appendix).

Figure 6: Joint posterior mean coordinates of the cognitive parameters, flexibility (λ) and information loss (δ), estimated for each individual. We observe a great heterogeneity in the distribution of posterior means, most pronouncedly for the flexibility parameter. However, a moderate between-subject variability in information loss can still be observed in both groups.

The estimates reveal a rather interesting pattern across both healthy and SDI participants. In particular, individuals with poor flexibility (e.g., low values of λ) can be detected in both the clinical and the control group. However, the group parameter space appears to be partitioned into two main clusters consisting of individuals with high and low flexibility, respectively. As can be noticed, the majority of SDIs belongs to the latter cluster, which suggests that the model is able to capture error-related defective behavior in the clinical population and attribute it specifically to the flexibility parameter. On the other hand, individual performance seems hardly separable along the information loss parameter dimension.

As a further validation, we compare the classification performance of two logistic regression models. The first uses the estimated parameter means as inputs and the participants' binary group assignment (patient vs. control) as the outcome. The second uses the four standard clinical measures (non-perseverative errors (E), perseverative errors (PE), number of trials to complete the first category (TFC), and number of failures to maintain set (FMS)) computed from the sample as inputs, with the same outcome. Since we are interested solely in classification performance and want to mitigate potential overfitting due to the small sample size, we compute leave-one-out cross-validated (LOO-CV) performance for both models. Interestingly, both logistic regression models achieve the same accuracy of . , with a sensitivity of . and a specificity of . . Thus, it appears that our model is able to differentiate between SDIs and healthy individuals as well as the standard clinical measures do.

However, as pointed out in the previous sections, the estimated parameters serve merely as a basis to reconstruct cognitive dynamics by means of the trial-by-trial unfolding of information-theoretic measures. Moreover, cognitive dynamics can only be analysed and interpreted by relying on the joint contribution of both the estimated parameters and individual-specific observed response patterns.

To further clarify this concept, we investigate the reconstructed time series of information-theoretic quantities based on the response patterns of two exemplary individuals (Figure 7).
In particular, Figure 7A depicts the behavioral outcomes of an SDI with sub-optimal performance, where the information-theoretic trajectories are reconstructed by taking the corresponding posterior means (λ̄ = 0. , δ̄ = 0. ), representing compromised flexibility and high information loss. Conversely, Figure 7B shows the information-theoretic path related to the response dynamics of an optimal control participant, according to the parameter set (λ̄ = 0. , δ̄ = 0. ), representing relatively high flexibility and low information loss. Note that, in both cases, the reconstructed information-theoretic measures are based on the estimated posterior means for ease of comparison (see SI Appendix for the full joint posterior densities of the two exemplary individuals and the rest of the sample).
Figure 7: Recovered cognitive dynamics of two exemplary individuals. (A) Trial-by-trial information-theoretic measures of an SDI characterized by very low flexibility and very high information loss; (B) Trial-by-trial information-theoretic measures of a healthy individual characterized by relatively high flexibility and low information loss. Labels C and E indicate correct and error responses.

The results in Figure 7A account for a typical sub-optimal behavior observed in the SDI group, where several errors are produced in different phases of the task. The error patterns produced by such an individual might be induced by a non-trivial interaction between cognitive sub-components. Lower values of flexibility imply that errors are likely to be produced by generating responses from an internal environmental model which is no longer valid. In other words, the agent is unable to rely on local feedback-related information in order to update beliefs about hidden states. On the other hand, higher values of information loss reflect a general inefficiency of belief updating processes due to slow convergence to the optimal probabilistic environmental model. From this perspective, Bayesian surprise B_t and Shannon surprise I_t might play different roles in regulating behavior based on different internal model probability configurations. In addition, errors might be processed differently based on the status of the internal environmental representation, as reflected by the entropy of the predictive model, H_t. Thus, information-theoretic measures allow us to describe cognitive dynamics on a trial-by-trial basis and, further, to disentangle the effects that different feedback-related information processing dynamics exert on adaptive behavior.

The processing of unexpected observations is accounted for by the quantification of surprise upon observing a response-feedback pair which is inconsistent with the current internal model of the task environment.
Negative feedback is maximally informative when errors occur after the internal model has converged to the true task model (grey area, Figure 7A), or when the entropy approaches zero (grey line, Figure 7A). Shannon surprise (orange line) is maximal when errors occur within trial windows in which the agent's uncertainty about environmental states is minimal (orange areas, Figure 7A). However, internal model updates following an informative feedback are not optimally performed, which is reflected by very small Bayesian surprise (blue line, Figure 7A). This can be attributed to impaired flexibility and reflects the fact that, after internal model convergence, informative feedback is not processed adequately and the internal model becomes impervious to change.

Conversely, errors occurring when the agent is uncertain about the true environmental state carry no useful information for belief updating, since the system fails to conceive such errors as unexpected and informative. The information loss parameter plays a crucial role in characterizing this cognitive behavior. The slow convergence to the true environmental model, accompanied by the slow reduction of entropy in the predictive model, leads to a large number of trials required to achieve a good representation of the current task environment (white areas, Figure 7A). Errors occurring within trial windows with large predictive model entropy (green area, Figure 7A) do not affect subsequent behavior, and feedback is maximally uninformative.

Rather different cognitive dynamics can be observed in Figure 7B, which accounts for a typical optimal behavior where the errors produced fall within the trial windows which follow a rule completion (e.g., when the individual completes a sequence of 10 consecutive correct responses), and, thus, the environmental model becomes obsolete. However, high flexibility, λ, allows the agent to rely on local feedback-related information to promptly update beliefs about the hidden states, that is, the most appropriate sorting rule. In this case, negative feedback becomes maximally informative after model convergence (grey area, Figure 7B), and the process of entropy reduction (green line, Figure 7B) is faster (i.e., fewer trials are needed) compared to the sub-optimal behavior scenario.
Since uncertainty about the environmental states decreases faster, Shannon surprise is always highly peaked when errors occur (orange line, Figure 7B), thus ensuring an efficient employment of local feedback-related information. Accordingly, higher values of Bayesian surprise are observed (blue line, Figure 7B), revealing optimal internal model updating.

In general, the role that predictive (internal) model uncertainty plays in characterizing the way the agent processes feedback allows us to disentangle sub-types of errors based on the information they convey for subsequent belief updating. From this perspective, error classification is entirely dependent on the status of the internal environmental model across task phases. Identifying such a dynamic latent process is therefore fundamental, since the error codification criterion evolves with respect to the internal information processing dynamics. Otherwise, the problem of inferring which errors are due to perseverance in maintaining an older (converged) internal model, and which are due to uncertainty about the true environmental state, becomes intractable, or even impossible.

Discussion
Investigating information processing related to changing environmental contingencies is fundamental to understanding adaptive behavior. For this purpose, cognitive scientists mostly rely on controlled settings in which individuals are asked to accomplish (possibly) highly demanding tasks whose demands are assumed to resemble those of natural environments. Even in the most trivial cases, such as the WCST, optimal performance requires integrated and distributed neurocognitive processes. Moreover, these processes are unlikely to be isolated by simple scoring or aggregate performance measures.

In the current work, we developed and validated a new computational Bayesian model which maps distinct cognitive processes into separable information-theoretic constructs underlying observed adaptive behavior. We argue that these constructs could help describe and investigate the neurocognitive processes underlying adaptive behavior in a principled way.

Furthermore, we couple our computational model with a novel neural density estimation method for simulation-based Bayesian inference ([55]). Accordingly, we can quantify the entire information contained in the data about the assumed cognitive parameters via a full joint posterior over plausible parameter values. Based on the joint posterior, a representative summary statistic can be computed to simulate the most plausible unfolding of information-theoretic quantities on a trial-by-trial basis.

Several computational models have been proposed to describe and explain performance in the WCST, ranging from behavioral ([10, 31, 62]) to neural network models ([17, 3, 45, 49]). These models aim to provide psychologically interpretable parameters or biologically inspired network structures, respectively, accounting for specific qualitative patterns of observed data.
Behavioral models, in particular, abstract the main cognitive features underlying individualperformance in the WCST according to different theoretical frameworks (e.g., attentional updating ([10]), or reinforce-ment learning ([62])) and disentangle psychological sub-processes explaining observed task performance. However,the main advantage of our Bayesian model is that it provides both a cognitive and a measurement model which co-exist within the overarching theoretical framework of Bayesian brain theories. More precisely, the presented modelis specifically designed to capture trial-by-trial fluctuations in information processing as described by second-orderinformation-theoretic quantities. The latter can be seen as a multivariate quantitative account of the interaction be- PREPRINT - N
OVEMBER
30, 2020 tween the agent and its environment. Moreover, it is worth noting that such a model representation might not beapplicable outside a Bayesian theoretical framework.Even though our computational model is not a neural model, it might provide a suitable description of cognitivedynamics at a representational and/or a computational level ([47]). This description can then be related to neuralfunctioning underlying adaptive behavioral. Indeed, there is some evidence to suggest that neural processes related tobelief maintenance/updating and unexpectedness are crucial for performance in the WCST. In particular, brain circuitsassociated with cognitive control and belief formation, such as the parietal cortex and prefrontal regions, seem to sharea functional basis with neural substrates involved in adaptive tasks ([50]). Prefrontal regions appear to mediate therelation between feedback and belief updating ([46]) and efficient functioning in such brain structures seems to beheavily dependent on dopaminergic neuromodulation ([52]). Moreover, the dopaminergic system plays a role in theprocessing of salient and unexpected environmental stimuli, in learning based on error-related information, and inevaluating candidate actions ([50, 16, 30]). Accordingly, dopaminergic system functioning has been put in relationwith performance in the WCST ([35, 57]) and shown to be critical for the main executive components involved inthe task, that is, cognitive flexibility and set-shifting ([9, 63]). Further, neural activity in the anterior cingulate cortex(ACC) is increased when a negative feedback occurs in the context of the WCST ([46]). This finding corroborates theview that the ACC is part of an error-detection network which allocates attentional resources to prevent future errors.The ACC might play a crucial role in adaptive functioning by encoding error-related or, more generally, feedback-related information. 
Thus, it could facilitate the updating of internal environmental models ([56]).
This neurobiological evidence suggests that the brain networks involved in the WCST might support adaptive behavior by maintaining and updating an internal model of the environment and by efficiently processing unexpected information. It is noteworthy that both of these processing aspects are incorporated into our computational framework. At this point, we briefly outline the empirical and theoretical potential of the proposed computational framework for investigating adaptive functioning and discuss future research vistas.
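The core updating mechanism discussed above can be illustrated with a minimal sketch. The following is not the paper's full model (which additionally includes flexibility and information-loss parameters); it only shows how beliefs over the three candidate WCST sorting rules could be revised by Bayes' rule after one feedback event, with hypothetical likelihood values chosen for illustration:

```python
import numpy as np

def update_rule_beliefs(beliefs, likelihoods):
    """One Bayesian belief update over candidate sorting rules.

    beliefs     -- prior probability assigned to each rule (color, shape, number)
    likelihoods -- probability of the observed feedback under each rule
    Illustrative sketch only; the full model adds flexibility and
    information-loss parameters not shown here.
    """
    posterior = beliefs * likelihoods
    return posterior / posterior.sum()

# The agent sorts by 'color' and receives positive feedback, which is
# likely only if 'color' is the active rule (hypothetical values).
prior = np.array([1 / 3, 1 / 3, 1 / 3])       # color, shape, number
lik_positive = np.array([0.9, 0.1, 0.1])      # P(positive feedback | rule)
posterior = update_rule_beliefs(prior, lik_positive)
# Belief mass concentrates on the 'color' rule.
```

A single positive feedback already shifts most belief mass onto the rule consistent with it; repeated feedback sharpens the posterior further, while a rule switch (negative feedback under a confident belief) forces a redistribution.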
Model-Based Neuroscience. Recent studies have pointed out the advantage of simultaneously modeling and analyzing neural and behavioral data within a joint modeling framework, so that the latter can inform the former and vice versa ([68, 69, 22]). This involves the development of joint models which encode assumptions about the probabilistic relationships between neural and cognitive parameters. Within our framework, the reconstruction of information-theoretic discrete time series yields a quantitative account of the agent's internal processing of environmental information. Event-related cognitive measures of belief updating, epistemic uncertainty, and surprise can be related to neural measurements by explicitly providing a formal account of the statistical dependencies between neural and cognitive (information-theoretic) quantities. In this way, latent cognitive dynamics can be directly related to neural event-related measures (e.g., fMRI, EEG). Applications in which information-theoretic measures are treated as dependent variables in standard statistical analyses are also possible.
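The three event-related measures named above have standard information-theoretic definitions, sketched below for a pair of hypothetical consecutive belief states over three candidate rules (the belief values are illustrative, not taken from the paper):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats): epistemic uncertainty of a belief state."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    """D_KL(p || q): the magnitude of a belief update from prior q to posterior p."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def surprisal(q, observed):
    """-log of the probability the prior assigned to the realized outcome."""
    return -np.log(q[observed])

# Hypothetical belief states before and after one feedback event.
prior = np.array([0.6, 0.2, 0.2])
posterior = np.array([0.1, 0.8, 0.1])

uncertainty = entropy(posterior)              # residual uncertainty after feedback
belief_update = kl_divergence(posterior, prior)  # how much feedback shifted beliefs
surprise = surprisal(prior, 1)                # how unexpected the winning rule was
```

Computed trial by trial, such quantities form exactly the kind of discrete time series that can serve as regressors for event-related neural measures or as dependent variables in standard statistical analyses.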
Neurological Assessment. Although neuroscientists have considered performance in the WCST as a proxy for measuring high-level cognitive processes, the usual approach to the analysis of human adaptive behavior consists in summarizing response patterns by simple heuristic scoring measures (e.g., occurrences of correct responses and sub-types of errors produced) and classification rules ([21]). However, the theoretical utility of such a summary approach remains questionable. Indeed, adaptive behavior appears to depend on a complex and intricate interplay between multiple network structures ([5, 48, 46, 6, 12]). This poses a great challenge for disentangling high-level cognitive constructs at a model level and for further investigating their relationship with neurobiological substrates. Standard scoring measures might not be able to fulfil these tasks. Moreover, there is a pronounced lack of anatomical specificity in previous research concerning the neural and functional substrates of the WCST ([51]). Thus, there is a need for more sophisticated modeling approaches. For instance, disentangling errors due to perseverative processing of previously relevant environmental models from errors due to uncertainty about task environmental states is important and nontrivial. Sparse and distributed error patterns might arise from several internal model probability configurations. Such internal models are latent and can only be uncovered through cognitive modeling. Therefore, information-based criteria for response (error) classification can enrich clinical evaluation beyond heuristically motivated criteria.
Generalizability. Another important advantage of the proposed computational framework is that it is not confined to the WCST. In fact, one can argue that the seventy-year-old WCST does not provide the only, or even the most suitable, setting for extracting information about cognitive dynamics in general populations or about maladaptive behavior in clinical populations. One can envision tasks which embody probabilistic (uncertain) or even chaotic environments (for instance, with partially observable states or unreliable feedback) and demand integrating information from different modalities ([53, 50]). These settings might prove more suitable for investigating changes in uncertainty-related processing or cross-modal integration than deterministic and fully observable WCST-like settings.
Despite these advantages, our proposed computational framework has certain limitations. A first limitation concerns the fact that the new Bayesian cognitive model accounts for the main dynamics in adaptive tasks by relying on only two parameters. Although such a parsimonious proposal suffices to disentangle latent data-generating processes, a more exhaustive formal description of cognitive sub-components might be envisioned. However, parameter estimation can become challenging in such a scenario, especially when one-dimensional response data is used as a basis for parameter recovery. Second, the information loss parameter appears to be more challenging to estimate than the flexibility parameter in some datasets. There are at least two possible remedies for this problem. On the one hand, global estimation of information loss might be hampered by the model's current functional (algorithmic) formulation and could therefore be improved via an alternative formulation/parameterization. On the other hand, it might be the case that the data obtainable in the simple WCST environment is not particularly informative about this parameter and, in general, not suitable for modeling more complex and non-linear cognitive dynamics. Future work should therefore focus on designing and exploring more data-rich controlled environments which can provide a better starting point for investigating complex latent cognitive dynamics in a principled way. Additionally, the information loss parameter seems to be less effective in differentiating between substance abusers and healthy controls in the particular sample used in this work.
Thus, further model-based analyses of individuals from different clinical populations are needed to fully understand the potential of our two-parameter model as a clinical neuropsychological tool. Finally, in this work we did not perform formal model comparison, as this would require an extensive consideration of various nested and non-nested models within the same theoretical framework and between different theoretical frameworks. We therefore leave this important endeavor to future research.
Conclusions
In conclusion, the proposed model can be considered as the basis for a (bio)psychometric tool for measuring the dynamics of cognitive processes under changing environmental demands. Furthermore, it can be seen as a step towards a theory-based framework for investigating the relation between such cognitive measures and their neural underpinnings. Further investigations are needed to refine the proposed computational model and systematically explore the advantages of the Bayesian brain theoretical framework for empirical research on high-level cognition.
Acknowledgements
We thank Karin Prillinger and Luca D'Alessandro for reading the manuscript and providing useful suggestions which significantly improved the original text.
References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 265–283, 2016.
[2] Julie A Alvarez and Eugene Emory. Executive function and the frontal lobes: a meta-analytic review. Neuropsychology Review, 16(1):17–42, 2006.
[3] Andrew Amos. A computational model of information processing in the frontal cortex and basal ganglia. Journal of Cognitive Neuroscience, 12(3):505–519, 2000.
[4] Peter J Anderson. Towards a developmental model of executive function. In Executive Functions and the Frontal Lobes, pages 37–56. Psychology Press, 2010.
[5] Francisco Barcelo, Carles Escera, Maria J Corral, and Jose A Periáñez. Task switching and novelty processing activate a common neural network for cognitive control. Journal of Cognitive Neuroscience, 18(10):1734–1748, 2006.
[6] Francisco Barceló and Francisco J Rubia. Non-frontal P3b-like activity evoked by the Wisconsin Card Sorting Test. Neuroreport, 9(4):747–751, 1998.
[7] Antoine Bechara and Hanna Damasio. Decision-making and addiction (part I): impaired activation of somatic states in substance dependent individuals when pondering decisions with negative future consequences. Neuropsychologia, 40(10):1675–1689, 2002.
[8] Esta A Berg. A simple objective technique for measuring flexibility in thinking. The Journal of General Psychology, 39(1):15–22, 1948.
[9] Sven Bestmann, Diane Ruge, John Rothwell, and Joseph M Galea. The role of dopamine in motor flexibility. Journal of Cognitive Neuroscience, 27(2):365–376, 2014.
[10] Anthony J Bishara, John K Kruschke, Julie C Stout, Antoine Bechara, David P McCabe, and Jerome R Busemeyer. Sequential learning models for the Wisconsin card sort task: Assessing processes in substance dependent individuals. Journal of Mathematical Psychology, 54(1):5–13, 2010.
[11] David L Braff, Robert Heaton, Julie Kuck, Munro Cullum, John Moranville, Igor Grant, and Sidney Zisook. The generalized pattern of neuropsychological deficits in outpatients with chronic schizophrenia with heterogeneous Wisconsin Card Sorting Test results. Archives of General Psychiatry, 48(10):891–898, 1991.
[12] Bradley R Buchsbaum, Stephanie Greer, Wei-Li Chang, and Karen Faith Berman. Meta-analysis of neuroimaging studies of the Wisconsin card-sorting task and component processes. Human Brain Mapping, 25(1):35–45, 2005.
[13] Christopher L Buckley, Chang Sub Kim, Simon McGregor, and Anil K Seth. The free energy principle for action and perception: A mathematical review. Journal of Mathematical Psychology, 81:55–79, 2017.
[14] Guillem Collell and Jordi Fauquet. Brain activity and cognition: a connection from thermodynamics and information theory. Frontiers in Psychology, 6:818, 2015.
[15] Richard Cooper, John Fox, Jonny Farringdon, and Tim Shallice. A systematic methodology for cognitive modelling. Artificial Intelligence, 85(1-2):3–44, 1996.
[16] Nathaniel D Daw, Samuel J Gershman, Ben Seymour, Peter Dayan, and Raymond J Dolan. Model-based influences on humans' choices and striatal prediction errors. Neuron, 69(6):1204–1215, 2011.
[17] Stanislas Dehaene and Jean-Pierre Changeux. The Wisconsin Card Sorting Test: Theoretical analysis and modeling in a neuronal network. Cerebral Cortex, 1(1):62–79, 1991.
[18] Harriet Feldman and Karl Friston. Attention, uncertainty, and free-energy. Frontiers in Human Neuroscience, 4:215, 2010.
[19] Ivonne J Figueroa and Robert J Youmans. Failure to maintain set: A measure of distractibility or cognitive flexibility? In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, volume 57, pages 828–832. Sage Publications, Los Angeles, CA, 2013.
[20] Michael B First. Structured clinical interview for DSM-IV axis I disorders. Biometrics Research Department, 1997.
[21] Laura A Flashman, Michael D Homer, and David Freides. Note on scoring perseveration on the Wisconsin Card Sorting Test. The Clinical Neuropsychologist, 5(2):190–194, 1991.
[22] Birte U Forstmann, Eric-Jan Wagenmakers, Tom Eichele, Scott Brown, and John T Serences. Reciprocal relations between cognitive neuroscience and formal cognitive models: opposites attract? Trends in Cognitive Sciences, 15(6):272–279, 2011.
[23] Karl Friston. Learning and inference in the brain. Neural Networks, 16(9):1325–1352, 2003.
[24] Karl Friston. A theory of cortical responses. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1456):815–836, 2005.
[25] Karl Friston. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2):127–138, 2010.
[26] Karl Friston, Rick Adams, Laurent Perrinet, and Michael Breakspear. Perceptions as hypotheses: saccades as experiments. Frontiers in Psychology, 3:151, 2012.
[27] Karl Friston, Thomas FitzGerald, Francesco Rigoli, Philipp Schwartenbeck, and Giovanni Pezzulo. Active inference: a process theory. Neural Computation, 29(1):1–49, 2017.
[28] Karl Friston and Stefan Kiebel. Predictive coding under the free-energy principle. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1521):1211–1221, 2009.
[29] Karl J Friston, Jean Daunizeau, James Kilner, and Stefan J Kiebel. Action and behavior: a free-energy formulation. Biological Cybernetics, 102(3):227–260, 2010.
[30] Samuel J Gershman. The successor representation: its computational logic and neural substrates. Journal of Neuroscience, 38(33):7193–7200, 2018.
[31] Jan Gläscher, Ralph Adolphs, and Daniel Tranel. Model-based lesion mapping of cognitive control using the Wisconsin Card Sorting Test. Nature Communications, 10(1):1–12, 2019.
[32] Helene Haker, Maya Schneebeli, and Klaas Enno Stephan. Can Bayesian theories of autism spectrum disorder help improve clinical practice? Frontiers in Psychiatry, 7:107, 2016.
[33] RK Heaton. Wisconsin Card Sorting Test manual; revised and expanded. Psychological Assessment Resources, pages 5–57, 1981.
[34] Jacob B Hirsh, Raymond A Mar, and Jordan B Peterson. Psychological entropy: A framework for understanding uncertainty-related anxiety. Psychological Review, 119(2):304, 2012.
[35] Pei Chun Hsieh, Tzung Lieh Yeh, I Hui Lee, Hui Chun Huang, Po See Chen, Yen Kuang Yang, Nan Tsing Chiu, Ru Band Lu, and Mei-Hsiu Liao. Correlation between errors on the Wisconsin Card Sorting Test and the availability of striatal dopamine transporters in healthy volunteers. Journal of Psychiatry & Neuroscience: JPN, 35(2):90, 2010.
[36] Daniel Kersten, Pascal Mamassian, and Alan Yuille. Object perception as Bayesian inference. Annual Review of Psychology, 55:271–304, 2004.
[37] David C Knill and Alexandre Pouget. The Bayesian brain: the role of uncertainty in neural coding and computation. Trends in Neurosciences, 27(12):712–719, 2004.
[38] Etienne Koechlin and Christopher Summerfield. An information theoretical approach to prefrontal executive function. Trends in Cognitive Sciences, 11(6):229–235, 2007.
[39] Konrad P Körding, Ulrik Beierholm, Wei Ji Ma, Steven Quartz, Joshua B Tenenbaum, and Ladan Shams. Causal inference in multisensory perception. PLoS ONE, 2(9):e943, 2007.
[40] A Kübler, Kevin Murphy, and H Garavan. Cocaine dependence and attention switching within and between verbal and visuospatial working memory. European Journal of Neuroscience, 21(7):1984–1992, 2005.
[41] Oriane Landry and Shems Al-Taie. A meta-analysis of the Wisconsin Card Sort Task in autism. Journal of Autism and Developmental Disorders, 46(4):1220–1235, 2016.
[42] Rebecca P Lawson, Geraint Rees, and Karl J Friston. An aberrant precision account of autism. Frontiers in Human Neuroscience, 8:302, 2014.
[43] Michael D Lee and Eric-Jan Wagenmakers. Bayesian Cognitive Modeling: A Practical Course. Cambridge University Press, 2014.
[44] Tai Sing Lee and David Mumford. Hierarchical Bayesian inference in the visual cortex. JOSA A, 20(7):1434–1448, 2003.
[45] Daniel S Levine, Randolph W Parks, and Paul S Prueitt. Methodological and theoretical issues in neural network models of frontal cognitive functions. International Journal of Neuroscience, 72(3-4):209–233, 1993.
[46] Chuh-Hyoun Lie, Karsten Specht, John C Marshall, and Gereon R Fink. Using fMRI to decompose the neural processes underlying the Wisconsin Card Sorting Test. Neuroimage, 30(3):1038–1049, 2006.
[47] David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Henry Holt and Co., Inc., New York, NY, 1982.
[48] Oury Monchi, Michael Petrides, Valentina Petre, Keith Worsley, and Alain Dagher. Wisconsin card sorting revisited: distinct neural circuits participating in different stages of the task identified by event-related functional magnetic resonance imaging. Journal of Neuroscience, 21(19):7733–7741, 2001.
[49] Oury Monchi, John G Taylor, and Alain Dagher. A neural model of working memory processes in normal subjects, Parkinson's disease and schizophrenia for fMRI design and predictions. Neural Networks, 13(8-9):953–973, 2000.
[50] Matthew M Nour, Tarik Dahoun, Philipp Schwartenbeck, Rick A Adams, Thomas HB FitzGerald, Christopher Coello, Matthew B Wall, Raymond J Dolan, and Oliver D Howes. Dopaminergic basis for signaling belief updates, but not surprise, and the link to paranoia. Proceedings of the National Academy of Sciences, 115(43):E10167–E10176, 2018.
[51] Erika Nyhus and Francisco Barceló. The Wisconsin Card Sorting Test and the cognitive assessment of prefrontal executive functions: a critical update. Brain and Cognition, 71(3):437–451, 2009.
[52] Torben Ott and Andreas Nieder. Dopamine and cognitive control in prefrontal cortex. Trends in Cognitive Sciences, 2019.
[53] Jill X O'Reilly, Urs Schüffelgen, Steven F Cuell, Timothy EJ Behrens, Rogier B Mars, and Matthew FS Rushworth. Dissociable effects of surprise and model update in parietal and anterior cingulate cortex. Proceedings of the National Academy of Sciences, 110(38):E3660–E3669, 2013.
[54] Frederike H Petzschner, Stefan Glasauer, and Klaas E Stephan. A Bayesian perspective on magnitude estimation. Trends in Cognitive Sciences, 19(5):285–293, 2015.
[55] Stefan T. Radev, Ulf K. Mertens, Andreas Voss, Lynton Ardizzone, and Ullrich Köthe. BayesFlow: Learning complex stochastic models with invertible neural networks, 2020.
[56] Matthew FS Rushworth and Timothy EJ Behrens. Choice, uncertainty and value in prefrontal and cingulate cortex. Nature Neuroscience, 11(4):389, 2008.
[57] JK Rybakowski, A Borkowska, PM Czerski, P Kapelski, M Dmitrzak-Weglarz, and J Hauser. An association study of dopamine receptors polymorphisms and the Wisconsin Card Sorting Test in schizophrenia. Journal of Neural Transmission, 112(11):1575–1582, 2005.
[58] Khalid Sayood. Information theory and cognition: A review. Entropy, 20(9):706, 2018.
[59] Philipp Schwartenbeck, Thomas HB FitzGerald, and Ray Dolan. Neural signals encoding shifts in beliefs. Neuroimage, 125:578–586, 2016.
[60] Shailja Singh, Tapas Kumar Aich, and Raju Bhattarai. Wisconsin Card Sorting Test performance impairment in schizophrenia: An Indian study report. Indian Journal of Psychiatry, 59(1):88, 2017.
[61] Scott A Sisson and Yanan Fan. Likelihood-free MCMC. Chapman & Hall/CRC, New York, 2011.
[62] Alexander Steinke, Florian Lange, Caroline Seer, Merle K Hendel, and Bruno Kopp. Computational modeling for neuropsychological assessment of bradyphrenia in Parkinson's disease. Journal of Clinical Medicine, 9(4):1158, 2020.
[63] Christine Stelzel, Ulrike Basten, Christian Montag, Martin Reuter, and Christian J Fiebach. Frontostriatal involvement in task switching depends on genetic differences in D2 receptor density. Journal of Neuroscience, 30(42):14205–14212, 2010.
[64] Klaas E Stephan, Torsten Baldeweg, and Karl J Friston. Synaptic plasticity and dysconnection in schizophrenia. Biological Psychiatry, 59(10):929–939, 2006.
[65] Bryan A Strange, Andrew Duggins, William Penny, Raymond J Dolan, and Karl J Friston. Information theory, novelty and hippocampal responses: unpredicted or unpredictable? Neural Networks, 18(3):225–230, 2005.
[66] Ron Sun. Theoretical status of computational cognitive modeling. Cognitive Systems Research, 10(2):124–140, 2009.
[67] Sean Talts, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. Validating Bayesian inference algorithms with simulation-based calibration. arXiv preprint arXiv:1804.06788, 2018.
[68] Brandon M Turner, Birte U Forstmann, Bradley C Love, Thomas J Palmeri, and Leendert Van Maanen. Approaches to analysis in model-based cognitive neuroscience. Journal of Mathematical Psychology, 76:65–79, 2017.
[69] Brandon M Turner, Birte U Forstmann, Eric-Jan Wagenmakers, Scott D Brown, Per B Sederberg, and Mark Steyvers. A Bayesian framework for simultaneously modeling neural and behavioral data. NeuroImage, 72:193–206, 2013.
[70] Ingo Willuhn, Weiwen Sun, and Heinz Steiner. Topography of cocaine-induced gene regulation in the rat striatum: relationship to cortical inputs and role of behavioural context. European Journal of Neuroscience, 17(5):1053–1066, 2003.
[71] Konstantine K Zakzanis. The subcortical dementia of Huntington's disease. Journal of Clinical and Experimental Neuropsychology, 20(4):565–578, 1998.