The thermodynamics of human reaction times
Fermín Moscoso del Prado Martín

August 23, 2009
Abstract
I present a new approach for the interpretation of reaction time (RT) data from behavioral experiments. From a physical perspective, the entropy of the RT distribution provides a model-free estimate of the amount of processing performed by the cognitive system. In this way, the focus is shifted from the conventional interpretation of individual RTs being either long or short, to their distribution being more or less complex in terms of entropy. The new approach enables the estimation of the cognitive processing load without reference to the informational content of the stimuli themselves, thus providing a more appropriate estimate of the cognitive impact of different sources of information that are carried by experimental stimuli or tasks. The paper introduces the formulation of the theory, followed by an empirical validation using a database of human RTs in lexical tasks (visual lexical decision and word naming). The results show that this new interpretation of RTs is more powerful than the traditional one. The method provides theoretical estimates of the processing loads elicited by individual stimuli. These loads sharply distinguish the responses from different tasks. In addition, it provides upper-bound estimates for the speed at which the system processes information. Finally, I argue that the theoretical proposal, and the associated empirical evidence, provide strong arguments for an adaptive system that systematically adjusts its operational processing speed to the particular demands of each stimulus. This finding is in contradiction with Hick's law, which posits a relatively constant processing speed within an experimental context.

Keywords: cognition | entropy | lexical decision | reaction time | word naming

Ever since its introduction by Donders [1] in the very early days of experimental psychology, reaction time (RT) has been among the most widely used measures of cognitive processing in human and animal behavioral experiments. Very generally speaking, following Donders' seminal work, the logic underlying the analysis of data in RT experiments is that information processing takes time; thus, the average time taken to initiate or complete a task reflects the duration of the process(es) involved in the task. Therefore, if certain types of stimuli, tasks, or groups of subjects elicit longer RTs than others, it is generally inferred that the former involve more cognitive processing than the latter. In this study, I propose a qualitatively different perspective on the understanding of RT data: Rather than focusing on whether some experimental conditions elicit shorter or longer RTs than others, I investigate whether different conditions elicit RT distributions with different degrees of complexity. As I will argue, an increase in the complexity of the RT distribution constitutes an indirect measure of the amount of information processing that has been performed by the system. For this, I take a psychologically naive, model-free approach: Instead of guiding the RT analysis using knowledge about the relevant neural and/or psychological processes that give rise to RTs, I intend to draw inferences on the former by studying only the properties of the latter.

The cognitive system can be considered a system in the thermodynamical sense of the word. In particular, it is an open system that exchanges energy (and information) with its environment. Performing an experimental task involves an exchange of information with the environment.
The experimental instructions and the presentation of stimuli are a source of information. Performing the experimental task requires the processing of this external information, and information processing is costly in energy terms. As discussed by Brillouin [2], the acquisition of information by any part of a system must be offset by a decrease of information somewhere else. In Brillouin's terms, there is a balance between gained and lost 'negentropy', that is, information. (In this study, I follow [2]'s interpretation equating negentropy and information.) Having received energy and information, the stimulus is processed and a response is initiated. Once more, this process involves a further exchange of negentropy and energy with the environment. An ideal system with perfect efficiency could perhaps achieve a perfect balance between the received and the spent negentropy. However, as the efficiency is never perfect, some negentropy will be lost in the process. Eventually, in the case of the cognitive system, this loss of negentropy can be compensated for by a supply of energy, normally by metabolic means, that enables the system to return to its 'resting' state. In short, the processing of experimental stimuli should temporarily increase the entropy of the cognitive system by an amount directly proportional to the amount of information that has been processed, corresponding to the negentropy that was wasted in the process. In essence, a measure of the increases in the entropy of the cognitive system elicited by different experimental conditions or stimuli would provide an estimate of the amount of information that has been processed (see [3] for a detailed physical description of this type of process).

Measuring the overall state of entropy of the cognitive system might not be an easy task, as it would involve a quantification of the uncertainty in the state of all the microscopic units in the system. However, collateral measures of the 'noise' emitted by the system should reflect increases in its state of complexity. This is to say, if the system is in a higher state of complexity, the noises it emits will also increase in their complexity. The random variability of the times at which responses happen in a particular experimental condition can be considered part of this emitted 'noise'. Therefore the uncertainty (i.e., entropy, in its statistical sense [4]) of this distribution can be taken to reflect the state of the system that generated them. My working assumption is that one can measure the entropy of the distribution of RTs in a particular condition (i.e., the temporal entropy), and make inferences about variations in the entropy (in its physical sense) of the underlying system. In short, an increase in the informational entropy of an RT distribution is directly proportional to the amount of information that has been processed.

In a typical repeated measures RT experiment, the differential entropy [4] of the RT distribution can be expressed as a mixed effect model (MEM; see [5, 6, 7] for recent introductions to this technique) with meaningful (and thus very constrained) parameter values (see supplemental materials for the derivation of this equation):

E[−log p(t)] = h₀ + k Σᵢ₌₁ᴺ θᵢ Iᵢ(S, P) + ε + E_S[−log p(t)] + E_P[−log p(t)].  (1)

In this model, the dependent variable is the self-information of the RTs (i.e., −log p(t)), whose expected value is, by definition, the entropy. The intercept of the model (h₀) corresponds to the baseline entropy of the RT distribution, which must always be positive and provides an indication of task complexity.
The fixed effect coefficients (kθᵢ) indicate the relative contribution of the i-th known source of information in the stimuli (Iᵢ), and must all be positive and smaller than or equal to one. In this product, the θᵢ represent the proportion of the i-th source of information that is processed. On the other hand, k is constant for all sources of information, representing the proportion of the wasted negentropy that is reflected in the RT variability. Therefore both k and the θᵢ must also lie within the (0, 1] interval. This has the additional implication that k is bound to be larger than or equal to the largest observed fixed effect value, lest the estimated value for some of the θᵢ would be greater than one. The last two terms on the right-hand side of Eqn. 1 correspond to random effects of the individual stimulus S and participant P. These correspond to other unknown sources of information, linked to the identity of the stimulus or participant, that are not accounted for by the Iᵢ. Finally, ε accounts for the error in the estimations. If estimates of p(t) and of Iᵢ(S, P) can somehow be obtained, this relationship can be tested directly.

Information Theory has a long history in the study of behavior, particularly so in the study of RTs. Very soon after Shannon's development of information theory in telecommunications [4], psychologists were applying it to the study of human RTs. This produced one of the few standing laws of experimental psychology: The time it takes to make a choice is linearly related to the entropy of the possible alternatives, a regularity known as Hick's Law [8, 9, 10] (but see [11, 12, 13] for exceptions, and [14, 15] for related informational approaches). Here, the focus is shifted from the information carried by the stimuli, to the amount of information about them that is actually processed. As I will show, this change of perspective has important implications for the theoretical and empirical validity of Hick's Law.

Here, I investigate the usefulness of the thermodynamical argument to understand behavioral data. I focus on lexical stimuli, as these are less amenable to informational content measurements than plainly perceptual ones (but see [16, 17, 18, 19, 20, 21, 22, 23] for approaches to quantifying different aspects of lexical complexity using information-theoretical measures). The empirical confirmation of the theory makes use of two large sets from the English Lexicon Project database (ELP; [24]) of English visual lexical decision (VLD) and word naming (WN) data. The empirical evidence consists of four analyses. The first one investigates the relationship between the temporal entropy variable (i.e., the complexity of RTs described above) and the traditional average RT (magnitude of RTs) interpretation of response latency data. This serves as a first validation of the plausibility of the approach, and it reveals its relation to Hick's Law. The second part of the analysis tests the power of the method to distinguish between the VLD and WN tasks, despite the great similarity of their average RTs. The third part provides a direct test of the theoretical development expressed in Eqn. 1 for particular properties of the stimuli. Finally, the fourth part goes into further depth about the implications of the relationship between mean RT and temporal entropy, and how these implications provide strong evidence (both theoretical and empirical) against Hick's Law.
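As an illustration of the central quantity, the following minimal R sketch (R is also the language of the analyses reported below; all data here are simulated, and none of the numbers correspond to the ELP materials) estimates the self-information of each response from a Gaussian kernel density estimate, and shows that the mean of these self-informations approximates the differential entropy of the generating distribution:

## Simulate lognormal 'RTs' (median around 600 ms), estimate their density by
## Gaussian KDE on a bounded grid, and read off per-response self-informations
## in bits. All names and parameter values are purely illustrative.
set.seed(1)
rt   <- rlnorm(10000, meanlog = 6.4, sdlog = 0.3)
dens <- density(rt, n = 512, from = 0, to = 4000)   # Gaussian KDE
p    <- approx(dens$x, dens$y, xout = rt)$y         # density at each response
selfinfo <- -log2(p)                                # self-information (bits)
mean(selfinfo)                                      # ~ differential entropy
## Closed-form differential entropy of this lognormal, in bits, for comparison:
(6.4 + 0.5 * log(2 * pi * 0.3^2) + 0.5) / log(2)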
Results
Analysis I: Relation of temporal entropy to RT averages
The first question that arises is how temporal entropies relate to the traditional reading of RTs as being short or long. For individual words, I investigated the relationship between both measures. For this, I used MEM regressions both on the plain (and logged) RTs, and on the RT self-informations. The relation between the random effect adjustments for individual words between models (corrected by the corresponding general intercept) summarized the relationship between the temporal entropy and RT.

Fig. 1 plots the relation between the individual response self-informations and the corresponding RTs. Note that this relation is non-trivial; for the bulk of the responses, it follows a non-linear U-shaped pattern. Therefore one cannot directly assume a simple relation between the results of both measures. In contrast, Fig. 2 compares the estimated mean RT with the temporal entropies for each word in each of the tasks. The dashed lines plot the best fitting linear regression between both measures. The figure suggests that the temporal entropies are linearly related to the average RTs. (The relationship between temporal entropy and log RT was also tested; it revealed a strongly non-linear pattern with lower explained variances.) This (apparently) linear relation is very strong, with explained variance values of 64% in WN and up to 81% in VLD. At first sight, this finding could be understood as a stronger restatement of Hick's Law [8, 9, 10]: These studies provided evidence that the average RT to a stimulus is directly proportional to the informational content of the stimulus. The current results seem to suggest that the amount of information processing caused by the stimuli is also directly proportional to the average RT. This would indicate that there is a constant, possibly task-dependent, information processing speed of the cognitive system. Further support for this interpretation could come from the very similar slopes of the linear regression lines in the figure across visual lexical decision (± SE .05 bits/s) and word naming (around 4 bits/s ± SE .07 bits/s; note that these estimates have been rescaled into bits/s), indicating that this processing speed might be constant even across tasks. This would go one step further than Hick's Law by saying that the average RT would be directly proportional to the amount of information that has been processed (rather than just contained by the stimulus). Although this would be a suggestive interpretation, some caution needs to be taken before accepting this conclusion. Analysis IV will provide a more in-depth study of the issue of information processing speed and its implications for RT distributions. I now turn to an investigation of the effect of the informational contents of stimulus and task complexity on the amount of information processing.
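As a sketch of this procedure (hypothetical data frame and column names; this is not the original analysis script), the per-word estimates can be extracted from the two intercept-only MEMs, and the linear relation of Fig. 2 fitted to them:

## `d` holds one row per response, with columns `rt` and `selfinfo` (the
## KDE-based RT self-information, in bits), and factors `word` and `subject`.
library(lme4)
m_rt <- lmer(rt       ~ 1 + (1 | word) + (1 | subject), data = d)
m_h  <- lmer(selfinfo ~ 1 + (1 | word) + (1 | subject), data = d)
mean_rt <- fixef(m_rt)[1] + ranef(m_rt)$word[, 1]  # per-word mean RT estimates
h_word  <- fixef(m_h)[1]  + ranef(m_h)$word[, 1]   # per-word temporal entropies
summary(lm(h_word ~ mean_rt))                      # the apparently linear relation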
Analysis II: Estimating cognitive processing across two tasks
For this analysis, I selected subsets of the ELP database that contained responses by multiple subjects to the same word stimuli across two tasks. This enabled direct pairwise comparison between RTs across the tasks. Responses in VLD generally have slightly longer latencies than in WN, and this is corroborated in this dataset. Comparing the estimated average RTs to individual words in VLD and WN (as predicted by the MEM models above) revealed that VLD RTs are longer than in WN (average difference 19.82 ms; paired t[1985] > 15, p < .0001), a difference that was also present on logarithmic scale (paired t[1985] > 6, p < .0001). The corresponding difference in temporal entropies was much more marked: responses in VLD were on average .75 bits more complex than in WN (95% CI upper limit .76 bits; paired t[1985] > 113, p < .0001). (These results were also confirmed by non-parametric Wilcoxon signed rank tests.)

The magnitude of the t statistic was much larger for the difference in temporal entropies than for the RTs, whether logged or untransformed. Furthermore, the average difference between conditions was also much larger for the temporal entropies than for the RT measures: Whereas the RTs to words in VLD were 3.1% longer with respect to their WN latencies (and down to .2% in logarithmic scale), they were 8.5% more complex. This suggested that temporal entropy provides a clearer differentiation between the tasks than did the magnitude of the RTs. (This difference was not due to differences in the size or number of participants between both datasets; see Supplemental Materials for details.) This is depicted in Fig. 3. The panels compare the estimated mean RTs (left panel), mean log RTs (middle panel), and temporal entropies (right panel) for WN (horizontal axes) and VLD (vertical axes). The grey dots correspond to individual words, and the dashed line is the identity condition. Whereas in the RT measures the difference between tasks (i.e., asymmetry with respect to the identity line) is barely noticeable, the temporal entropy measure sharply separates the two tasks. Only in 18 out of the 1,986 (less than 1%) studied words did the temporal entropy have a higher value in WN than it did in VLD.
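The pairwise comparisons above follow a standard pattern; a sketch, assuming per-word vectors h_vld and h_wn (temporal entropies aligned on the same 1,986 words; the names are hypothetical):

t.test(h_vld, h_wn, paired = TRUE)       # paired t-test on temporal entropies
wilcox.test(h_vld, h_wn, paired = TRUE)  # non-parametric confirmation
mean(h_wn > h_vld)                       # proportion of words with WN above VLD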
Analysis III: Estimating the information processed about letters and words

Table 1 shows that the MEM models on the temporal self-informations, in both datasets, revealed main effects of the lexical and letter information variables on the temporal entropies that were fully consistent with the theoretical predictions. As predicted above, the fixed-effect regression coefficients (βᵢ = kθᵢ) were in all cases positive and smaller than one. Also, as indicated by the χ² log-likelihood tests, the suggested random slopes (see Materials and Methods) constituted a significant improvement on the basic models. For comparison, the table also includes the results of running MEMs on the log RTs using the same predictors that were used with the temporal self-informations, revealing a nearly identical pattern.

The values of the main effect coefficients provided a lower-bound estimate for the value of the dimensionless coefficient k; recall that this task-specific constant measures the scaling between the entropy of the RTs and the entropy of the system. In each task, this lower bound is the largest of the fixed-effect estimates, k ≥ max{β̂ᵢ} = k⋆. Notice that the k⋆ estimates for both tasks were rather similar (both around .10; see Discussion); in fact, if the standard errors of the estimates were also taken into account, there was no reason to believe that the estimates were at all different.

As discussed above, the k⋆ lower-bounds, combined with the fixed effect estimates of the regression (βᵢ), can be used to estimate the upper-bound for the possible contribution of one bit of information contained by the properties of the stimulus to the amount of information about it that is actually processed by the system (θᵢ ≤ βᵢ/k⋆). In VLD, one bit of lexical information corresponded to a mean of at most one bit of cognitive processing, while one bit of letter information resulted in a maximum of .18 bits of cognitive processing. Similarly, in word naming, one bit of lexical information elicited at most .32 bits of cognitive processing, while each bit of letter information could maximally correspond to a full bit of processing. These estimations put numbers to the intuition that lexical information is more relevant in VLD than it is in WN, and that letter identity information is more important for WN than it is for VLD. In both cases, much of the information that the stimulus contains is not at all processed, presumably because it is not useful for the task at hand.
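In terms of a fitted model, these bounds follow mechanically from the estimates; a sketch (with `m` standing for a fitted Analysis III model whose fixed effects, besides the intercept, are the information measures):

beta   <- fixef(m)[-1]        # fixed-effect estimates, beta_i = k * theta_i
k_star <- max(beta)           # lower bound on k, since all theta_i <= 1
theta_upper <- beta / k_star  # upper bounds on the processed proportions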
Analysis IV: Is information processing speed constant within a task?

I now return to the linear relationship between mean RTs and temporal entropy that was suggested to be taken with caution in Analysis I. The question arises as to the implications that the arguably linear relationship in Fig. 2 would have for the shape of the RT distribution. This question was addressed using the Principle of Maximum Entropy [25, 26]. A Maximum Entropy analysis revealed that, under the assumption of an existing mean μ that is linearly related to the entropy, the most likely relationship between the entropy and the mean is described by (see supplemental materials for the Maximum Entropy derivation of this equation):

κ₁μ + E[log t] − log κ₂ = a + bμ,  (2)

where a, b are the intercept and slope of the assumed linear relation, and κ₁, κ₂ are constants for a given dataset. Notice that a new term E[log t], corresponding to the mean of the log RTs, appeared in the relationship, without it having been assumed a priori. This indicates that, if one assumed that there is a linear relationship between the mean RT and the temporal entropy, one should also assume that there is a linear relationship between the mean log RT and the temporal entropy. This entails a sort of probabilistic reductio ad absurdum: It says that the temporal entropy is linearly related to both the mean RT and the mean log RTs. That is, unless the mean RT and the mean log RT were independent of each other – and they were not – the relationship between mean RT and temporal entropy was most likely to be nonlinear after all, despite the seemingly linear appearance of the plots of Fig. 2.

To investigate this prediction, I performed a linear regression on the temporal entropies of individual words in each task, with both mean RT and mean log RT as co-variates (the effect of mean log RT was considered only after partialling out the effect of mean RT). As was predicted by the Maximum Entropy analysis, both the visual lexical decision and the word naming datasets revealed significant positive linear contributions of mean RT and, crucially, significant additional negative contributions of mean log RT. (As shown in the supplemental materials, the Maximum Entropy method could also reveal an influence of the second moment of the distribution, if one assumed its existence. This would lead to a more realistic Weibull-type RT distribution, whose entropy would also have a linear term for the second moment, in addition to the terms in Eqn. S-21. Additional regressions also confirmed this relation with the second moment, which added approximately a further 5% explained variance to each dataset. I focus the discussion only on the additional contribution of the mean log RTs, as this is sufficient to introduce the necessary non-linearity while being considerably simpler; the estimation of the second moment required additional methods not discussed here. In any case the consideration of the second moments did not produce any significant change on the results to follow.)

The non-linearity is summarized by the additional solid lines in Fig. 2. These estimate what the correction introduced by considering the mean log RT looks like. (The lines were obtained as non-parametric smoothers between the estimated mean RTs and the predicted values of the multiple regressions using both mean RTs and mean log RTs as co-variates.) Both datasets showed a pattern reminiscent of a hockey stick, in which there were small bendings at the bottom of the ranges, with the curves becoming increasingly steep towards the top.
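A sketch of this partialling-out regression, reusing the hypothetical per-word vectors from the Analysis I sketch plus a corresponding mean_logrt:

m_lin <- lm(h_word ~ mean_rt)               # purely linear relation (dashed lines)
m_cor <- lm(h_word ~ mean_rt + mean_logrt)  # adding the mean log RT correction
anova(m_lin, m_cor)                         # contribution of the mean log RT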
The derivative of the temporal entropy with respect to the mean RT provides an estimate of the average instantaneous rate of information processing, in temporal-entropy terms:

r̄_t = ∂h/∂E[t].  (3)

Fig. 4 plots this estimate, obtained as the numerical gradient of the solid lines of Fig. 2, as a function of the temporal entropy. The figure shows that, in both tasks, the information processing rates increased monotonically with the amount of processing; that is, the more processing that was needed, the faster it happened. In addition, there seemed to be three linear regimes for this increase that were rather similar across tasks. It is possible that the third regime, corresponding to the slower slope lines in the high values of temporal entropy, was at least partly a consequence of the truncation point at 4,000 ms. Slow words will be more likely to be affected by this truncation, as a proportionally larger part of their density mass was chopped off, leading to possible underestimations of both the mean RT and the temporal entropy.

Finally, the lower-bound estimates k⋆ in each task obtained in Analysis III open the possibility of estimating the upper-bounds for the range of variation of the overall cognitive information processing speed in each of the tasks, as a function of the rate in terms of temporal entropy:

r̄ = r̄_t / k ≤ r̄_t / k⋆ = (1/k⋆) ∂h/∂E[t] = r̄⋆.  (4)

In VLD this gave a range of upper-bound values (r̄⋆) going from around 33 bits/s to about 61 bits/s, and in WN the range went between 22 bits/s and 60 bits/s.
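The rate estimates of Eqns. 3 and 4 can be sketched as numerical gradients of a smoothed entropy-versus-mean-RT curve (hypothetical inputs as in the earlier sketches, with k_star from the Analysis III sketch; ms are converted to s so that the rates come out in bits/s):

fit <- lowess(mean_rt, h_word)           # non-parametric smoother, as in Fig. 2
r_t <- diff(fit$y) / diff(fit$x / 1000)  # Eqn. 3: dh/dE[t], in bits/s (assumes distinct mean RTs)
r_star <- r_t / k_star                   # Eqn. 4: upper bound on the system's rate
range(r_star)                            # range of upper-bound processing speeds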
Discussion

This study has introduced a novel interpretation for RT experiments. The conventional approach is to consider how long responses take to occur. Instead, I proposed to investigate whether the temporal distribution of responses is more or less complex. I argued that this complexity of the reaction time distribution reflects the underlying state of complexity in the cognitive system, and the empirical evidence has supported this view. This enables a shift from studying how much information is contained in stimuli or tasks, to directly investigating the amount of this information that is actually processed.

As evidenced by the comparison between VLD and WN, the temporal entropy measure is remarkably more sensitive than the traditional RT magnitudes (whether in untransformed or logarithmic scale) in distinguishing tasks with different properties. There has been a growing interest in techniques that enable going beyond the mean in the description of RT data [27, 28, 29, 30, 31, 32, 33]. These proposals consist in studying either the quantiles or the higher moments of the distribution, or the parameters of some distribution family that is assumed a priori, which are estimated separately for different participants and/or experimental conditions. I have proposed a considerably simpler, model-free measure: The entropy of the RT distribution summarizes the cognitively relevant aspects of its shape. The working assumption, from an information processing perspective, is that any variation in the amount of processing must be reflected in the entropy of the distribution. By implication, the temporal entropy should be a sufficient statistic to reflect the effects of different cognitive manipulations. Furthermore, the new measure uses the random effect structure of the experiment, and is thus less sensitive to the sometimes very reduced number of points in each individual condition (see, e.g., [32]).

Entropy is, by definition, an additive measure. Different contributions of independent factors can then be considered in a plainly additive manner. Current practice in the analysis of RT data recommends transforming the RTs prior to statistical analyses, either using a logarithmic or a reciprocal transform. This has the undesirable effect of breaking the additive interpretability of effects, forcing researchers to delve into complex multiplicative processes [33]. In contrast, the approach proposed here remains in the additive domain, while at the same time being able to capture complex aspects of the distributional shape. This is achieved while keeping with a model-free approach, that is, no particular distributional shape needs to be assumed.

An important consequence of this analysis is the conclusion that Hick's law needs to be extended and reformulated. Strictly speaking, the law proposed by Hick, Hyman, and others [8, 9, 10] concerns only the relation between information contained in the stimuli and mean RT. Crucially, in this study, I extend this argument to the information about the stimuli that is actually processed, this is to say, the relevant information. In this case, the relationship between processed information and mean RT is not linear, even though it might seem linear to the naked eye. The non-linearity was suggested by the Maximum Entropy theoretical analysis and confirmed by the empirical data. Note that collectively these results strengthen the argument considerably: The non-linearity is not just obtained from a fit to the data, but was predicted a priori in detail. Although the adjustment might seem small on the average entropies, inspection of Fig. 4 makes it clear that the non-linear relation allows the information processing rate to double in some contexts. This finding suggests an adaptive system where the processing load is dynamically adjusted to the task demands. In a way, it is a 'lazy' system in the sense of Zipf [34].

Not surprisingly, the increase of information processing rate with stimulus complexity is consistent with the findings of Kostić [16] for VLD, even in the estimates of the information processing rate. In a VLD task using Serbian words, Kostić estimated that, depending on the average informational content of the particular stimuli in an experiment, the information processing rate ranged from about 30 to about 100 bits/s. This is consistent with the maximum rate (i.e., r̄⋆) ranging from 30 to 60 bits/s that I have derived. The difference between both estimates is possibly due to the difference between the information contained in the stimuli that Kostić studied, and the information that is processed about them that has been studied here. Notice also that Kostić's estimates refer to aggregated averages over full experiments. In contrast, the technique proposed here enables obtaining estimates for individual stimuli.

The principal contribution of the present study is that the entropy of an RT distribution provides an index of the amount of information processing that has taken place. Stated alternatively, changes in the entropy of RT distributions reflect changes in the underlying state of the cognitive system. I have estimated that, in the tasks under study, a minimum of around 10% of the increase in the entropy of the cognitive system is reflected in an increase of temporal entropy (i.e., the k⋆ estimates). This finding provides a handle by which RT data can be used to establish the link between higher cognitive function and its metabolic counterparts, that was proposed in the seminal study of Kirkaldy [3]. Furthermore, the characterization of behavioral responses in terms of entropy enables the consistent treatment of behavioral and neurophysiological signals using the same theoretical tools (see [35] for a review on Information Theory in the study of neurophysiological signals).

Materials & Methods
Materials
I retrieved from the ELP database [24] the individual lexical decision RTs and word naming latencies to 1,986 nouns and verbs. The set of selected words corresponds to those words used in a previous study that compared the lexical decision and naming tasks [36]. The selection of only a subset of the words keeps the models below tractable. All responses that were not marked as correct in the database were excluded from further analysis. In addition, I also excluded all responses equal to or longer than 4,000 ms., as beyond this limit responses appear truncated in the ELP database. This left a total of 64,087 responses (from 816 different participants) in the lexical decision dataset and 53,403 (from 445 participants) in the naming one. Along with the individual RTs, the surface frequency of the words (extracted from the CELEX database [38]) and their length in characters were also recorded. The word frequency and word length measures were transformed into information-theoretical measures: the self-information of the word, and its average informational content due to its letters. (See supplemental materials for details on these transformations.)
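A minimal sketch of these exclusion criteria (the object and column names are hypothetical, not the actual ELP field names):

d <- subset(raw, accuracy == 1 & rt < 4000)  # correct responses below truncation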
Estimation of the individual RT self-informations

The individual self-information values for each RT in the visual lexical decision and naming datasets were estimated by Kernel Density Estimation (KDE; [39]) with Gaussian kernels, imposing bounds on the distribution at the truncation points of 0 ms. and 4,000 ms., beyond which the density was estimated to have a value of zero (with the adequate normalization to integrate to one in that interval). The estimates combine a direct KDE for the left tail of the distribution, and a retransformed estimate on logarithmic scale for the right tail. The reason for this dual estimate was to avoid the high-frequency noise that KDE introduces on the right tails of heavy-tailed distributions. Estimating the distribution in logarithmic scale greatly attenuates the noise on the right tail [37], but it introduces additional noise in the vicinity of zero, thus the dual estimation. To guarantee a certain smoothness in the transition between the untransformed and the logarithmic KDE, in the area around the distributional mode I used a weighted average between both estimates. From the total support grid of 512 points where the densities were estimated, the 40 points to the left of the mode received a weighting linearly decreasing in favor of the transformed estimate. This pattern was reversed in the 40 points to the right of the mode, where the weighting was linearly increased to favor the logarithmic estimates. The mode itself received a plain .5/.5 average between both estimates.

The individual self-information values correspond to the minus logarithm of the estimated density, using two as the base of the logarithm to obtain an estimate in bits. The individual values for each response were interpolated from the 512 point grid on which the densities were estimated. Fig. 1 shows the result of this process for both datasets under study.
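The following fragment sketches this dual estimation (hypothetical input vector `rt`; an illustration of the idea rather than the original estimation code):

grid  <- seq(1, 4000, length.out = 512)             # bounded support grid
d_lin <- density(rt, n = 512, from = 1, to = 4000)  # KDE on the raw scale
d_log <- density(log(rt), n = 512, from = 0, to = log(4000))  # KDE on log scale
p_lin <- approx(d_lin$x, d_lin$y, xout = grid)$y
p_log <- approx(exp(d_log$x), d_log$y / exp(d_log$x), xout = grid)$y  # p(t) = q(log t)/t
i <- which.max(p_lin)                               # grid index of the mode
w <- as.numeric(seq_along(grid) <= i)               # 1 = use the raw-scale estimate
w[(i - 40):(i + 40)] <- seq(1, 0, length.out = 81)  # linear cross-fade around the mode
p <- w * p_lin + (1 - w) * p_log
p <- p / sum(p * (grid[2] - grid[1]))               # renormalize to integrate to one
selfinfo <- -log2(approx(grid, p, xout = rt)$y)     # per-response self-information (bits)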
MEM regressions

To investigate the relationship between temporal entropy and mean RTs (Analyses I, II, & IV), I fitted two MEM models to the data. In both cases, the models included an intercept and two random effects, one of word and one of subject, as predictors. The first model had the RT as dependent variable (so that the average predictions of the model correspond to the mean RTs), and the second model had the self-information of the RTs as dependent variable (thus, as argued above, the average predictions of the model correspond to temporal entropies). The individual estimates of either mean RT or temporal entropy for each word were computed as the sum of the corresponding model's intercept with the particular random effect adjustment for that word.

To investigate the predictions on the informational content of stimuli (Analysis III), I performed MEM regressions on the temporal self-informations and on the log RTs in each task. As before, these regressions included random effects of both subject and word. In addition, I also included fixed effect covariates measuring each word's lexical self-information and the average information content of its letters. An additional factor needed to be considered in these MEMs. The measure of the informational content of the letters in a word refers to the average case. However, different words will contain combinations of letters that carry more or less information than the average (i.e., they are either more frequent than usual or rarer than usual). Therefore, the effect of the information of the letters can vary with word identity. To address this problem, the MEM models included the possibility of a random slope making the effect of letter-based information variable across words. Similarly, the experience that different people have had with different words varies in both quantitative and qualitative terms. This was accounted for by introducing a random slope that enables the variation of the lexical self-information effect across individual participants.

All MEM regressions were fitted using a restricted maximum likelihood algorithm, as implemented in the R package "lme4" [5].
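As a sketch, the Analysis III models and the random-slope comparison take the following form (hypothetical column names; `selfinfo` is the KDE-based RT self-information, and `lexinfo` and `letinfo` the two stimulus information measures):

library(lme4)
m_basic <- lmer(selfinfo ~ lexinfo + letinfo +
                  (1 | subject) + (1 | word), data = d)
m_full  <- lmer(selfinfo ~ lexinfo + letinfo +
                  (1 + lexinfo | subject) + (1 + letinfo | word), data = d)
anova(m_basic, m_full)  # chi-squared log-likelihood test for the random slopes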
Acknowledgments

The author is indebted to L. B. Feldman, D. Filipović-Đurđević, I. J. Grainger, and M. Mondragón for helpful suggestions.
References

[1] Donders, F. C. (1869) On the speed of mental processes. Attn & Perf, 412–431.
[2] Brillouin, L. (1956) Science and Information Theory. (Academic Press, New York).
[3] Kirkaldy, J. S. (1965) The thermodynamics of the human brain. Biophys J, 981–986.
[4] Shannon, C. E. (1948) A mathematical theory of communication. Bell Syst Tech J, 379–423, 623–656.
[5] Bates, D. M. (2005) Fitting linear mixed models in R. R News, 27–30.
[6] Baayen, R. H. (2007) Analyzing Linguistic Data: A Practical Introduction to Statistics using R. (Cambridge University Press, Cambridge, UK).
[7] Baayen, R. H, Davidson, D. J, & Bates, D. M. (2008) Mixed-effects modeling with crossed random effects for subjects and items. J Mem & Lang, 390–412.
[8] Hick, W. E. (1952) On the rate of gain of information. Q J Exp Psychol, 11–26.
[9] Hyman, R. (1953) Stimulus information as a determinant of reaction time. J Exp Psychol, 188–196.
[10] Hellyer, S. (1963) Stimulus-response coding and the amount of information as determinants of reaction time. J Exp Psychol, 521–522.
[11] Longstreth, L. E, El-Zahhar, N, & Alcorn, M. B. (1985) Exceptions to Hick's law: explorations with a response duration measure. J Exp Psychol: Gen, 417–434.
[12] Welford, A. T. (1987) Comment on "Exceptions to Hick's law: explorations with a response duration measure" (Longstreth, El-Zahhar, & Alcorn, 1985). J Exp Psychol: Gen, 312–314.
[13] Longstreth, L. E & Alcorn, M. B. (1987) Hick's Law versus a power law: reply to Welford. J Exp Psychol: Gen, 315–316.
[14] Norwich, K. E, Seburn, C, & Axelrad, E. (1989) An informational approach to reaction times. Bul Math Biol, 347–358.
[15] Norwich, K. E. (1993/2003) Information, Sensation, and Perception. (Academic Press, San Diego). Online edition by E. Barull, biopsychology.org.
[16] Kostić, A. (2005) The effects of the amount of information on processing of inflected morphology. (Laboratory for Experimental Psychology, University of Belgrade, Belgrade, Serbia), Technical report.
[17] Kostić, A. (1991) Informational approach to processing inflected morphology: Standard data reconsidered. Psychol Res, 62–70.
[18] Kostić, A, Marković, T, & Baucal, A. (2003) in Morphological structure in language processing, eds. Baayen, R. H & Schreuder, R. (Mouton de Gruyter, Berlin), pp. 1–44.
[19] McDonald, S. A & Shillcock, R. C. (2001) Rethinking the word frequency effect: the neglected role of distributional information in lexical processing. Lang & Speech, 295–322.
[20] Moscoso del Prado Martín, F, Kostić, A, & Baayen, R. H. (2004) Putting the bits together: an information theoretical perspective on morphological processing. Cognition, 413–421.
[21] Moscoso del Prado Martín, F. (2007) Co-occurrence and the effect of inflectional paradigms. Lingue e Linguaggio, 247–263.
[22] Filipović-Đurđević, D. (2007) The Polysemy Effect in Serbian Language. PhD in Experimental Psychology (Faculty of Philosophy, University of Belgrade, Serbia).
[23] Milin, P, Filipović-Đurđević, D, & Moscoso del Prado Martín, F. (2009) The simultaneous effects of inflectional paradigms and classes on lexical recognition: evidence from Serbian. J Mem & Lang, 50–64.
[24] Balota, D. A, Yap, M. J, Cortese, M. J, Hutchison, K. A, Kessler, B, Loftis, B, Neely, J. H, Nelson, D. L, Simpson, G. B, & Treiman, R. (2007) The English Lexicon Project. Behav Res Meth, 445–459.
[25] Jaynes, E. T. (1957) Information theory and statistical mechanics. Phys Rev, 620–630.
[26] Jaynes, E. T. (1957) Information theory and statistical mechanics II. Phys Rev, 171–190.
[27] Ratcliff, R. (1978) A theory of memory retrieval. Psychol Rev, 59–108.
[28] Luce, R. D. (1986) Response Times: Their Role in Inferring Elementary Mental Organization. (Oxford University Press, New York).
[29] Heathcote, A, Popiel, S. J, & Mewhort, D. J. K. (1991) Analysis of response time distributions: an example using the Stroop task. Psychol Bul, 340–347.
[30] Van Zandt, T. (2002) Analysis of response time distributions. In Stevens' Handbook of Experimental Psychology (3rd Edition), Volume IV: Methodology in Experimental Psychology, eds. Wixted, J. T & Pashler, H. (Wiley Press, New York), pp. 461–516.
[31] Rouder, J. N, Lu, J, Speckman, P, Sun, D, & Jiang, Y. (2005) A hierarchical model for estimating response time distributions. Psychon Bul & Rev, 195–223.
[32] Balota, D. A, Yap, M. J, Cortese, M. J, & Watson, J. M. (2008) Beyond mean response latency: response time distributional analyses of semantic priming. J Mem & Lang, 495–523.
[33] Holden, J. G, Van Orden, G. C, & Turvey, M. T. (2009) Dispersion of response times reveals cognitive dynamics. Psychol Rev, 318–342.
[34] Zipf, G. (1949) Human Behavior and the Principle of Least Effort. (Addison-Wesley, Reading, MA).
[35] Borst, A & Theunissen, F. E. (1999) Information theory and neural coding. Nature Neurosci, 947–957.
[36] Baayen, R. H, Feldman, L. B, & Schreuder, R. (2006) Morphological influences on the recognition of monosyllabic monomorphemic words. J Mem & Lang, 290–313.
[37] Newman, M. E. J. (2005) Power laws, Pareto distributions and Zipf's law. Contemp Phys, 323–351.
[38] Baayen, R. H, Piepenbrock, R, & Gulikers, L. (1995) The CELEX lexical database (CD-ROM). (Linguistic Data Consortium, University of Pennsylvania, Philadelphia).
[39] Parzen, E. (1962) On estimation of a probability density function and mode. Ann Math Stat, 1065–1076.
[Figure 1 here. Panels: Lexical decision (left), Word naming (right); axes: RT (ms.) by Self-information (bits).]

Figure 1: Relation between reaction time and self-information. The left panel plots the Visual Lexical Decision dataset and the right panel plots the Word Naming one. Reaction time self-informations were estimated combining Gaussian KDE estimated at untransformed scale for the left tails and logarithmic scale for the right tails.
[Figure 2 here. Panels: Visual Lexical Decision (left), Word Naming (right); axes: Mean RT (ms.) by Temporal entropy (bits).]

Figure 2: Relation between average RT and temporal entropy. The left panel plots the visual lexical decision dataset, and the right panel plots the word naming one. The grey dots plot the estimates of mean RT and temporal entropy for individual words. The dashed lines illustrate the best fit of a purely linear relationship between both measures, as would be characteristic of Hick's Law. The solid lines plot this relationship when the effect of the mean log RT is also considered.
[Figure 3 here. Panels: Mean RT (ms.), Mean log RT, Temporal Entropy (bits); Word Naming on the horizontal axes, Visual Lexical Decision on the vertical axes.]

Figure 3: Cross-task comparison. The plots compare the estimated mean RTs (left panel), mean log RTs (middle panel), and temporal entropies (right panel) for word naming (horizontal axes) and lexical decision (vertical axes). The light grey dots correspond to individual words, and the dashed line is the identity condition. The dark grey contours correspond to a two-dimensional KDE. The crosses plot the mean of both tasks for each of the three measures.
[Figure 4 here. Panels: Visual Lexical Decision (left), Word Naming (right); axes: Temporal Entropy (bits) by dh/dE[t] (bits/s).]

Figure 4: Relation between information processing rate and temporal entropy. The left panel plots the lexical decision estimates, and the right panel plots the word naming ones. Both panels plot temporal entropy on their horizontal axes, and an estimate of the average instantaneous processing rate (r̄_t) on the vertical axis. The estimates were obtained as the numerical gradients of the solid black lines in Fig. 2. The rugs at the bottom of the panels illustrate the approximate number of points on which each portion of the curve is estimated.

Table 1: Summary of Mixed-Effect Model Results. Effects of lexical self-information and letter information on the temporal entropy and log RTs as estimated by the MEM models. The upper section of the table provides the estimates of the fixed effects and their associated statistics. The lower section provides the model comparison statistics (χ² log-likelihood tests) comparing models including different combinations of random slopes. The estimation of the degrees of freedom for an MEM is not a straightforward issue and the p-values can be considered slightly lax (cf. [6]); however, for such large degrees of freedom, the possible underestimation of the p-values would be negligible. The effect of letter information in visual lexical decision is only marginally significant (in a two-tailed test); however, it was significant in the basic model without random slopes.

[Table 1 here: fixed-effect estimates (β ± SE, t, p) of lexical self-information and letter information, and χ²[2] log-likelihood comparisons for the random slopes (lexical information by subject, letter information by word), for each of the four models: temporal entropies and log RTs, in lexical decision and word naming.]

Supplemental Materials

Derivation of the MEM regression

If t are the times of responses elicited in a condition C that requires an amount of information processing I_P(C), it can be predicted that:

k I_P(C) ≃ h(t) − h₀ ≥ 0,  (S-1)

where h₀ > 0 is the baseline entropy of the system,
k ∈ (0, 1] is a constant indicating the proportional reflection of the increase of the system's entropy into the RT distribution, and h(t) is the differential entropy of the distribution of RTs:

h(t) = −∫₀^∞ p(t) log p(t) dt.  (S-2)

The inequality on the right side of Eqn. S-1 results from the fact that, in order to process the information, the system necessarily needs to have increased its entropy (h(t) ≥ h₀).

Information processing is the result of an experimental situation. In a particular task, the amount of processing required will be different for different stimuli. Therefore, one can also consider a task-specific informational content of the stimuli themselves, which is external to the cognitive system. The amount of information processing involved in processing a particular experimental condition is bound to be lower than or equal to the total information content of the stimulus; this is to say, the amount of information that is available limits the amount of information that can be processed. Denoting this external informational content of a particular stimulus in a given task by I_S(C), we see that:

I_S(C) ≥ I_P(C) ≥ 0.  (S-3)

Furthermore, if the estimate of the stimulus information content is relatively accurate, the amount of information that is processed should be proportional to what is available:

I_P(C) = θ I_S(C),  (S-4)

where θ ∈ (0, 1] represents the proportion of available information that is processed. Combining Eqns. S-1, S-2, and S-4 one obtains:

θ I_S(C) ≃ k (−∫₀^∞ p(t) log p(t) dt − h₀).  (S-5)

Taking the k proportion to be relatively constant across participants in a given experimental context, in principle, the relationship in Eqn. S-5 could be tested experimentally, providing a direct measurement of the information processing involved in a task across conditions. However, direct application of these expressions to actual experimental data is problematic. The value of the h₀ term is unknown, and it does not seem easy to estimate it (but see also [2]). In addition, the actual distribution of RTs (p(t)) is itself unknown; only a particular sample of RTs obtained in an experiment is available, and it is in most cases rather sparse.

The first problem is circumvented by studying the relative increase in RT entropy elicited by several experimental conditions C₁, C₂, ..., C_n. Between any two given conditions C_i and C_j, from Eqn. S-5 one finds that:

θ [I_S(C_i) − I_S(C_j)] ≃ k [h(t_i) − h(t_j)],  (S-6)

where t_i and t_j refer to the RTs obtained in conditions C_i and C_j. This can be readily extended to a regression situation in which the informational content of the stimuli is varied continuously. In such a case, if t_C represents the RTs for a particular stimulus C:

α + θβ I_S(C) ≃ h(t_C) / k.  (S-7)

Note that the α and β coefficients above are themselves meaningful. On the one hand, the intercept coefficient α reflects the baseline level of temporal entropy scaled by the proportionality constant (h₀/k). This implies that one can force α > 0. On the other hand, the slope coefficient β corresponds to the increase in information processing per processed unit of information. This is so because the θ and k coefficients already index what proportion of the stimulus information is processed, and how this processing relates to the RT distribution entropy. Therefore, trivially, β = 1. Including an explicit error term ε results in:

h₀ + kθ I_S(C) + ε = h(t_C),  (S-8)

with h₀ > 0 and k, θ ∈ (0, 1], so that the product kθ must fall within (0, k]. It is worth noting here that Eqn. S-8 would also enable reasoning in the opposite direction: Given an estimate of the cognitive cost of processing different stimuli, one could also obtain an estimate of their relevant informational load.

The second problem concerns the estimation of the entropy of the RT distribution in a particular condition (h(t_C)). In a typical repeated measures experimental design, several participants respond to different stimuli. In these cases the entropy of the RT distribution is determined not only by the known informational content of the stimuli, but also by differences in the entropy of the RT distributions of particular participants, and by additional informative properties of the stimuli that are unknown to the experimenter or difficult to control for. Thus the entropy of the overall RT distribution is the sum of multiple sources of uncertainty:

h(t) = h(t_C) + h(t_S) + h(t_P),  (S-9)

where h(t_S) is the RT entropy that is intrinsic to the particular stimulus S, and h(t_P) is the RT entropy of a particular participant P. Taking this into account, for a response of an individual participant P to a stimulus S, we need to extend Eqn. S-8 to:

h₀ + kθ I_S(C) + h(t_S) + h(t_P) + ε = h(t).  (S-10)

By definition, the entropies can be reformulated as the expected values of the negated log-probabilities (i.e., the self-informations; [1]) of the RTs. Considering simultaneously the effects of N independent known sources of information:

h₀ + k Σᵢ₌₁ᴺ [θᵢ Iᵢ(C)] + E_S[−log p(t)] + E_P[−log p(t)] + ε = E[−log p(t)].  (S-11)

This corresponds to the expression of a regression model with an intercept, covariates Iᵢ(C), and two random effects, S and P, with the self-information of the individual RTs as dependent variable. The parameters of the regression have a direct interpretation. The intercept corresponds to the baseline entropy of the system for the task at hand; it is thus a measure of overall task complexity. The fixed effect coefficients correspond to the kθᵢ products, that is, the influence of the known informational content of the stimulus on the amount of processing, weighted by the proportion of processing that is reflected in the increase in RT complexity. In general, to ensure that θᵢ ∈ (0, 1], it must hold that k ≥ max{β̂ᵢ} = k⋆, where the β̂ᵢ are the estimated fixed effect coefficients of the regression. This provides a useful lower-bound for this parameter. Furthermore, the limitations above also force all fixed effect coefficients to fall in the range (0, 1].
Informational Content of the Stimuli

The word length counts were transformed into an informational measure using a corpus-based estimate of the average entropy rate of English of 1.23 bits per letter [3], discounting an estimate of .04 bits per character estimated to reflect the information carried by spaces or case information [4]. ([4] actually estimated .06 bits per character for spaces and case, but this was scaled down to account for the difference in the overall estimate with the [3] estimates, which are considered the best available approximations.) Thus, the information content of a word due to its letters was estimated as:

I_L(w) = 1.19 · l(w),  (S-12)

where l(w) is the word length in letters of the word w.

Different words in a language vary with respect to the amount of information that they convey. Generally speaking, quantifying the precise amount of information that is conveyed by a word seems at best very difficult: Coming up with a theoretical, a-priori estimate of the information contained by a word would require the joint consideration of multiple linguistic and contextual factors, many of which are as yet poorly understood. Here, I only consider one simple measure of a word's informativity, its self-information derived from its frequency:

I_F(w) = log [1/f(w)] = −log f(w),  (S-13)

where f(w) is the relative frequency of occurrence of the word w in the CELEX database [5].
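In code, both transformations are one-liners; a sketch assuming a character vector `words` and a vector `freq_rel` of CELEX relative frequencies (hypothetical names):

I_L <- 1.19 * nchar(words)  # letter-based content, bits (1.23 - .04 per letter)
I_F <- -log2(freq_rel)      # lexical self-information; base-2 logs assumed, for bits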
Maximum Entropy Analysis

What does the knowledge that there is a linear relationship between the mean RT and the temporal entropy tell us about the distributional shape? The least biased, most reasonable distributional shape to believe in is the one that satisfies the constraint while introducing as little additional knowledge as possible [6, 7]. If m(t) is the distribution that reflects our full ignorance about the possible values of the RT, one should choose a new distribution p(t) that satisfies the constraints while being as similar as possible to the 'know-nothing' distribution m(t). Mathematically, this is given by the Shannon–Jaynes entropy, that is, the negated Kullback–Leibler divergence [8] between p(t) and m(t):

h_{S−J}(p ‖ m) = −∫₀^{T_max} p(t) log [p(t)/m(t)] dt.  (S-14)

In his introduction of the Transformation Groups argument, Jaynes derived the shape of the full-ignorance prior for the rate parameter r of a Poisson distribution:

m_r(r) ∝ 1/r.  (S-15)

The justification for the necessity of this choice comes from a general consistency argument. Consider two separate observers who were to assign probabilities to the rate of occurrence of an event. The two observers use different mechanisms to measure time, using perhaps different units (e.g., one uses milliseconds and the other uses minutes). If both observers are fully ignorant of the nature of the process, the most reasonable thing would be that they assign an a priori probability distribution that reflects their complete ignorance. By ignorance it is meant that the observers know strictly nothing about the process that generates these events, further than that they might happen with a rate of occurrence equal to or greater than zero. Obviously, as the level of ignorance of the observers is equivalent, any consistent prior distribution would be one by which both observers assign exactly the same probability distribution to the rate, irrespective of the measuring units they each use. This requires a prior probability that is in accord with Eqn. S-15.

Note that the prior distribution in Eqn. S-15 is an improper one: It cannot integrate to one in the domain [0, ∞). However, in Bayesian and Maximum-Entropy analyses this does not constitute a problem, as only the posterior needs to be normalized (cf. [9]). Furthermore, the recorded RTs in any experiment have a practical upper-bound at some time T_max, and in such cases the proposed prior is proper.

The argument of [10] can be readily extended to obtain a full-ignorance prior for the times at which events occur, one that can then be used for the analysis of RT distributions. The rate at which events occur is the reciprocal of the times at which they happen (r = 1/t). Therefore, knowing the prior distribution for the rate, one can directly infer the prior for the times themselves, such that both priors are consistent with each other (e.g., in the example above, a third ignorant observer might have decided to infer the rates from the times, and his state of ignorance must be equivalent to that of the other two observers):

m_t(t) = m_r(r) |dr/dt| = (1/t²) m_r(1/t) = c t / t² = c / t,  (S-16)

where c is the constant part of the prior distribution of the rates. In sum, the ignorance prior for the times must be the same as the ignorance prior for the rates.
The problem is then to maximize Eqn. S-14 subject to the constraints:

∫₀^{T_max} p(t) dt = 1,  (S-17)

∫₀^{T_max} p(t) t dt = μ,  (S-18)

−∫₀^{T_max} p(t) log p(t) dt = a + bμ.  (S-19)

Constraint S-17 is the usual normalization requirement for proper distributions, S-18 represents the assumption of an existing finite mean RT μ, and S-19 expresses the proposed linear relation between the mean RT and the temporal entropy.

This is an optimization problem that can be solved using the method of Lagrange multipliers from variational calculus. This results in a distribution of the form:

p(t) = (κ₂/t) e^{−κ₁t},  κ₁, κ₂ > 0,  (S-20)

which is a power law (with exponent −1) with an exponential cutoff. Plugging Eqn. S-20 into the linear relation constraint of Eqn. S-19, and simplifying using Eqn. S-17 and Eqn. S-18 (note that −log p(t) = −log κ₂ + log t + κ₁t, so that taking expectations yields the entropy directly), one finds that:

κ₁μ + E[log t] − log κ₂ = a + bμ.  (S-21)

It is important to notice here that the distribution in Eqn. S-20 is in fact a rather implausible one for RT distributions; it is monotonically decreasing. The argument is not that this is the best, or even a good, distribution to account for RTs, but rather that it is the most reasonable one to believe in if one assumed only the information that was given. Including further knowledge about the distribution in the form of additional constraints will produce a distribution that is more and more similar to the actual RT one. As an example, consider that one also assumed that the distribution has a known variance (which would also be a safe assumption provided the RTs are truncated at some T_max).

Including information on the second moment of the distribution amounts to adding one further constraint to Eqns. S-17, S-18, and S-19:

∫₀^{T_max} p(t) t² dt = ξ,  (S-22)

where ξ is the finite value of the second moment. In this case, the resulting distribution would be:

p(t) = (κ₃/t) e^{−(κ₁t + κ₂t²)},  (S-23)

and the relation between mean and entropy would now be:

κ₁μ + κ₂ξ + E[log t] − log κ₃ = a + bμ.  (S-24)

Notice that the mean log RT term is still present. The origin of this term lies in the ignorance prior itself. Therefore, even if one included many additional constraints, such as further higher moments, quantile values, or actual observed values, the term will remain there.

Further Analyses on Task Differentiation

As the VLD dataset contained more responses per word than the WN one, a possibility is that the increase in entropy was due to a bias in the entropy estimates introduced by sample size. This possibility was discarded by a re-sampling analysis: Random downsampling of the VLD dataset to the same size as the WN one did not affect the results above in any significant way. Another plausible confound is that the larger number of participants in the VLD dataset (816 vs. 445) could itself have increased the estimated entropies; this possibility was also discarded, with the corresponding comparison being non-significant.
Bibliography

1. Shannon, C. E. (1948) A mathematical theory of communication. Bell Syst Tech J, 379–423, 623–656.
2. Moscoso del Prado Martín, F. (2009) The baseline for response latency distributions. (Submitted manuscript). Available from Nature Precedings: http://hdl.handle.net/10101/npre.2009.3622.1
3. Rosenfeld, R. (1996) A maximum entropy approach to adaptive statistical language modelling. Comp Speech & Lang, 187–228.
4. Brown, P. F, della Pietra, V. J, Mercer, R. L, della Pietra, S. A, & Lai, J. C. (1992) An estimate of an upper bound for the entropy of English. Comp Ling, 31–40.
5. Baayen, R. H, Piepenbrock, R, & Gulikers, L. (1995) The CELEX lexical database (CD-ROM). (Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA).
6. Jaynes, E. T. (1957) Information theory and statistical mechanics. Phys Rev, 620–630.
7. Jaynes, E. T. (1957) Information theory and statistical mechanics II. Phys Rev, 171–190.
8. Kullback, S & Leibler, R. A. (1951) On information and sufficiency. Ann Math Stat, 79–86.
9. Sivia, D. S & Skilling, J. (2006) Data Analysis: A Bayesian Tutorial (2nd Edition). (Oxford University Press, Oxford, UK).
10. Jaynes, E. T. (1968) Prior probabilities. IEEE Trans Sys Sci & Cyb.