11 Beyond Rescorla-Wagner: The Ups and Downs of Learning
Gianluca Calcagni *
Instituto de Estructura de la Materia, CSIC, Madrid, Spain
Justin A. Harris
School of Psychology, University of Sydney, Sydney, Australia
Ricardo Pellón
Facultad de Psicología, UNED, Madrid, Spain
Abstract
We check the robustness of a recently proposed dynamical model of associative Pavlovian learning that extends the Rescorla-Wagner (RW) model in a natural way and predicts progressively damped oscillations in the response of the subjects. Using the data of two experiments, we compare the dynamical oscillatory model (DOM) with a non-associative oscillatory model (NAOM) made of the superposition of the RW learning curve and oscillations. Not only do data clearly show an oscillatory pattern, but they also favour the DOM over the NAOM, thus pointing out that these oscillations are the manifestation of an associative process. The latter is interpreted as the fact that subjects make predictions on trial outcomes more extended in time than in the RW model, but with more uncertainty.
Key words:
Pavlovian conditioning, Rescorla–Wagner model, Dynamical oscillatory model, Individual differences, Bayes information criterion
1. Introduction
Empirical data of Pavlovian conditioning experiments sometimes show a curious oscillatory pattern in which the subject's response fluctuates before reaching the asymptote of learning. Examples apparent to the naked eye are Figure 6 in Ghirlanda and Ibadullaiev (2015) (where, if not oscillations, at least an overshoot is visible), Figure 3 in Miller, Greco, and Vigorito (1981), and, notably, Figure 2 in Zelikowsky and Fanselow (2010). These are not the only available examples and, in fact, oscillations (not to be confused with the shortest-scale phenomenon of post-peak depression; for a discussion, see Calcagni, Caballero-Garrido and Pellón, 2020) may be more common than currently acknowledged, partly because they are not looked for or are dismissed as experimental errors. In turn, they might not be looked for because, to begin with, the existing theories do not predict their occurrence. For instance, the Rescorla–Wagner (RW) model (Rescorla and Wagner, 1972; Wagner and Rescorla, 1972; Wagner and Vogel, 2009), one of the simplest available quantitative descriptions of the learning curve, does not show oscillations. In the present work, we propose and verify a model that extends RW to cover fluctuating responses.

The RW model opened up a new way of thinking about associative Pavlovian learning. While it was not the first mathematical model of learning (Bush and Mosteller, 1951a; 1951b; Estes, 1950), the RW model shifted the focus from response probability to association strength and thereby opened the way to explaining new conditioning phenomena. Like any mathematical model, it has limitations (Miller, Barnet, and Grahame, 1995), which spurred further development of associative learning theory (e.g., Ghirlanda and Enqvist, 2019; Le Pelley, 2004; Mackintosh, 1975; Pearce and Hall, 1980; Wagner, 1981; Wagner and Vogel, 2009).

* Corresponding author: [email protected]
Here we will extend the RW model to consider oscillations in responding, as observed in several studies (Calcagni et al., 2020; Harris, Patterson and Gharaei, 2015; Miller et al., 1981; Zelikowsky and Fanselow, 2010). Because the relevant property of the RW model is the same for both the single-cue and multiple-cue versions of the model, we will focus on the single-cue version and refer to it as RW. The role of individual differences and response variations in conditioning models has long been of concern (Hayes, 1953; Merrill, 1931; Sidman, 1952, for initial publications; Blanco and Moris, 2018; Calcagni et al., 2020; Gallistel, 2012; Gallistel, Fairhurst, and Balsam, 2004; Glautier, 2013; Jaksic et al., 2018; Mazur and Hastie). In the extension proposed here, the dynamical oscillatory model (DOM), long-range oscillations (spanning several tens of trials) eventually disappear. Using the Bayes and Akaike Information Criteria in a standard model-selection procedure, we reanalyze data presented in Harris et al. (2015) and Calcagni et al. (2020) and compare the DOM with the RW model as well as with a non-associative oscillatory model (NAOM) constructed as an ad hoc modification of the RW model designed to mimic response oscillations. We will find up to very strong evidence that the DOM can fit more individual data than the RW model and the NAOM. This analysis will significantly extend and confirm the findings of Calcagni et al. (2020) with more data, more statistics, and the additional, robust cross-check represented by the NAOM.

Throughout this article we differentiate between fluctuations and oscillations. Fluctuations are random deviations from the ideal learning curve that occur on a trial-by-trial time scale. Oscillations, on the contrary, take place on a much longer range and give rise (ideally) to a smooth, differentiable pattern.
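The model-selection logic just described can be illustrated with a minimal Python sketch. The BIC formula below is the standard one for least-squares fits with Gaussian residuals, BIC = k ln N + N ln(RSS/N); the numbers are placeholders of our own, not the experimental values.

```python
import math

def bic(rss, n_points, n_params):
    """BIC for a least-squares fit assuming Gaussian residuals."""
    return n_params * math.log(n_points) + n_points * math.log(rss / n_points)

# Toy numbers: suppose RW (3 parameters) leaves a residual sum of squares
# RSS = 12.0 over 90 trials, while the DOM (5 parameters) leaves RSS = 8.0.
bic_rw = bic(12.0, 90, 3)
bic_dom = bic(8.0, 90, 5)

# A lower BIC wins: the two extra parameters of the DOM must be "paid for"
# by a sufficiently better fit, which here they are.
assert bic_dom < bic_rw
```

The AIC works the same way with the penalty 2k in place of k ln N, so with many trials the BIC penalizes extra parameters more heavily.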
In parallel, we will also explore the nature of the random, trial-by-trial response fluctuations characterizing the data. On the one hand, a spectral analysis will confirm that these fluctuations are described by white noise (Calcagni et al., 2020). On the other hand, in Appendix C we will further explore an associative but nondeterministic model constructed in the same paper with the tools of quantum mechanics, where random response variations are a quantitative part of the theory. Although data are compatible with the predictions of this model, we will be unable to find conclusive evidence in its favor.
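The spectral check mentioned above can be sketched as follows: residuals around the fitted learning curve should have a flat (white-noise) power spectrum, with no dominant slow component. This is a minimal sketch using a plain DFT on synthetic Gaussian residuals; real data would replace the `random.gauss` draws.

```python
import cmath, random

random.seed(0)
N = 512
# Synthetic trial-by-trial residuals (placeholder for real fit residuals)
residuals = [random.gauss(0.0, 1.0) for _ in range(N)]

def periodogram(x):
    """Raw periodogram |DFT|^2 / N over the positive frequencies (DC skipped)."""
    n = len(x)
    power = []
    for k in range(1, n // 2):
        s = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        power.append(abs(s) ** 2 / n)
    return power

p = periodogram(residuals)
# White noise spreads power evenly: the low- and high-frequency halves of the
# spectrum should carry comparable total power.
low, high = sum(p[: len(p) // 2]), sum(p[len(p) // 2 :])
assert 0.5 < low / high < 2.0
```

In practice one would use a library routine (e.g., a Welch-averaged periodogram) for efficiency, but the flatness criterion is the same.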
2. Four models of Pavlovian conditioning
The first model we consider is the RW model, a simple associative model of Pavlovian conditioning that was the result of early efforts in the discipline by researchers such as Hull (1943), Estes (1950), Bush and Mosteller (1951a), and Rescorla and Wagner (1972). The main feature of the RW model is the prediction of the animal's response in the next trial from its response in the last one. Consider an experiment where a conditioned stimulus (CS) with salience α is associated with an unconditioned stimulus (US) with salience β. The finite difference or increment Δv_n := v_n − v_{n−1} of the association between the CS and the US from trial n − 1 to trial n is proportional to the last prediction error made by the subject, i.e., the difference between v_{n−1} and the optimal learning asymptote λ:

Δv_n = αβ(λ − v_{n−1}) ,   (1)

where n = 1, 2, 3, …. Therefore, at each trial the association strength is updated on the current value of v and the value of λ. Equation (1) corresponds to the Rescorla–Wagner model with a single cue (Rescorla and Wagner, 1972) and its solution (e.g., Calcagni, 2018) is

v_n = λ[1 − (1 − αβ)^{n−1}] .   (2)

The learning sequence Eq. (2) is plotted in Fig. 1 for about 50 trials.
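Equations (1) and (2) can be checked with a few lines of code. This is a minimal Python sketch (the function names are ours): it iterates the update rule, Eq. (1), starting from zero association strength, and verifies that the result agrees with the closed-form sequence, Eq. (2), at every trial.

```python
def rw_sequence(ab, lam, n_trials):
    """Association strengths v_1..v_{n_trials} from the update rule, Eq. (1)."""
    v = [0.0]  # v_1 = 0: no association at the first trial
    for _ in range(n_trials - 1):
        v.append(v[-1] + ab * (lam - v[-1]))  # Delta v_n = ab*(lam - v_{n-1})
    return v

def rw_closed_form(ab, lam, n):
    """Eq. (2): v_n = lam*(1 - (1 - ab)**(n - 1))."""
    return lam * (1.0 - (1.0 - ab) ** (n - 1))

ab, lam = 0.2, 1.0  # parameter values as in Fig. 1
seq = rw_sequence(ab, lam, 50)
# The recursion and the closed form agree at every trial:
assert all(abs(seq[n - 1] - rw_closed_form(ab, lam, n)) < 1e-12 for n in range(1, 51))
```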
Figure 1: The learning sequence Eq. (2) of the RW model Eq. (1) (black dots) and the learning curve Eq. (4) of its continuum approximation Eq. (3) (solid curve), with αβ = 0.2 and λ = 1.

Equation (1) is a recursive discrete formula defining the RW model. With many trials, however, we can use a continuum approximation where the progression of learning is measured on a continuous time parameter t instead of a discrete set of trials n. Then, the incremental law Eq. (1) becomes

v̇ + αβ(v − λ) = 0 ,   (3)

where v̇ = dv/dt is the first time derivative of the association strength, the infinitesimal approximation of the finite difference Δv_n/Δn = Δv_n (since Δn = 1). Increments or derivatives can be of first or higher order. The RW model has simple incremental steps, corresponding to first-order derivatives. Later, we will see that the DOM happens to have double incremental steps, corresponding to second-order derivatives. As a matter of nomenclature, we recall that Eq. (1) is a so-called recursive or finite-difference equation, while Eq. (3) is a so-called first-order differential equation. Technically, one can always map a discrete (finite-difference) equation into a continuous (differential) one, independently of the order of the incremental steps or derivatives. In general, however, one is entitled to transform a discrete equation into a continuum one when there are enough points and these points are sufficiently packed together, so that each incremental step is small enough. In other words, the continuum approximation is especially good when the increment Eq. (1) is small, i.e., near the asymptote. We can see this in Fig. 1, where the solid curve is the solution of Eq. (3):

v(t) = λ(1 − e^{−αβt}) .   (4)

This curve perfectly overlaps with the data points after only a few trials. One can check that the number of trials needed to achieve almost exact overlap ranges from about 20–30 trials for αβ = 0.01 to just one trial for αβ = 0.99.
Since 0 < αβ < 1 is the natural range of this product of parameters, we can conclude that the continuum approximation breaks down only when the data set is sparsely populated. The continuum approximation will be relevant when comparing experimental data with the theoretical model. Best fits are usually presented as continuous curves, even when they stem from a discrete model (Hull, 1943; Spence, 1956). In our case, this representation of the RW model, the DOM and the NAOM will be legitimate both mathematically and numerically. Technically, the RW model is a sequence of approximations of λ generated by each trial. These approximations are points that lie on a curve, but the model does not fill in the line between the points. This would be apparent in any experiment with no more than 10 to 20 points. However, when one fits the model with dozens of points, the difference is so small that one can ignore it for all purposes. In our case, we will deal with a minimum of 90 points and a maximum of 540, so this difference will be negligible with respect to the precision achieved by a Bayesian model selection in discriminating between different models. Finally, one may object that the data points fluctuate a lot for typical subjects, which means that their increments Δv_n may not be much smaller than 1. In this case, one would expect the continuum approximation to be violated. However, we will employ various tools of analysis to show a posteriori that this does not happen. First of all, the phenomenon of random trial-by-trial fluctuations goes well beyond the capabilities of the RW or similar models. From the point of view of associative models such as Rescorla–Wagner (1972), Mackintosh (1975), and Pearce–Hall (1980), these fluctuations should be treated as errors. A spectral analysis is a better tool than the learning curve to describe a randomly fluctuating response.
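The claim that the increments become small near the asymptote, which underwrites the continuum approximation, can be made concrete in a short Python sketch. From Eq. (2), the increment is Δv_n = v_n − v_{n−1} = λαβ(1 − αβ)^{n−2}, so the steps shrink geometrically; the tolerance below is our own illustrative choice.

```python
ab, lam = 0.2, 1.0  # illustrative values, as in Fig. 1
# Delta v_n = v_n - v_{n-1} = lam*ab*(1 - ab)**(n - 2) for n = 2, 3, ...
increments = [lam * ab * (1.0 - ab) ** (n - 2) for n in range(2, 52)]

# The per-trial increment decreases monotonically (geometric decay) ...
assert all(d2 < d1 for d1, d2 in zip(increments, increments[1:]))
# ... and by trial ~50 the steps are negligible, so the discrete sequence is
# locally indistinguishable from the smooth curve of Eq. (4).
assert increments[-1] < 1e-4
```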
At any rate, the dispersion of the data points in the experiments we ran was smaller than the asymptote of learning for all subjects, although not much smaller. This means that response fluctuations from one trial to the next were not excessively large. The RW model Eq. (1) stems from the assumption that the subject learning an association between stimuli does so according to the surprisingness or novelty of the outcome, represented by the difference between the asymptote and the current association strength. The less surprising the appearance of the US in contiguity with the CS, the smaller the increase in association strength. It is possible to recast this model in very different terms and to regard it as a dynamical system in the most rigorous meaning of the word. This reformulation shows that the RW model is a special case of a more general model in which behavioral oscillations are common. The payoff in doing so is a first principle governing these and any other associative models based on recursive relations: the principle of least action, well known in physics but recently adapted to the psychological core of Pavlovian conditioning (Calcagni et al., 2020). Independently of whether time is continuous or discrete, the principle of least action postulates that there is a quantity that is minimized during the evolution of the system. This quantity, the action, represents the efficiency with which an organism (or its neural network) adapts to a new situation such as a learning process. A “biological” interpretation of the action might go along the following lines. At the level of brain structure, learning involves the modification of synaptic connections in a dendritic arborization process through the rearrangement, pruning and creation of new synapses or dendrites. Due to the huge number of connections, it would be useful to describe this process in terms of emergent or global degrees of freedom.
The assumption here is that, globally, the learning process is efficient, despite errors in the performance of the subject. This postulate translates into assuming that the organism minimizes (albeit imperfectly) the “energy” spent in the dendritic arborization or synaptic adjustments that take place during learning. Quantifying this “energy” may be difficult, but there is an alternative. The construct of energy, as used here, is directly related to the construct of “action.” Thus, the efficiency postulate leads us to the principle of least action, which describes emergent or global degrees of freedom that are purely behavioral (the association strength is the main one).
The degree of freedom of the action is the association strength v, and the requirement that the action be minimized by the dynamical evolution of the system determines the so-called equation of motion for v. It is relatively easy to write down the action for v in such a way that the equation of motion (1), or Eq. (3) in a continuum setting, of the RW model is recovered. In other words, the action is an expression that tells one how a system evolves. From the action and via the least action principle, one gets the equation of motion, such as the incremental law (1) or its analog (3). A major drawback of the RW model is that it assumes, in a rather minimalistic fashion, that the subject bases its prediction of what will happen in the ongoing trial only on what they “remember” of the previous one (incremental law (1)). A more general situation would allow the subject to base their behavior on more information than is admitted in the RW model. We can quantify this issue with two dynamical, mutually equivalent analogies. The first is to regard the RW model as a spring system. The RW model describes the accumulation of a property, associative strength (v), across successive experiences with the US (captured by λ). v accumulates until it is equal to λ but with negative sign. In this sense, it is like a spring under a load: as the spring compresses, it builds a force acting against the load, until the force from compression equals the weight of the load and the system is in equilibrium. The RW equation is actually a special case of a dynamical system that describes the motion of a spring. However, in a natural setting one would expect the spring to oscillate to equilibrium instead of reaching it monotonically without any recoil. For a load on a spring to avoid oscillations would require that some other external force gradually release the load onto the spring with such precise measure that the compressive force in the spring comes to equal the weight of the load without bouncing. (The actions for the RW model and the DOM can be found in Calcagni et al., 2020. We omit them here to reduce the technical bulk of the presentation.)

The second analogy is that of a ball in a potential well. Assume the continuum approximation for simplicity (nothing relevant changes in the discrete case). The solution v(t) (given by Eq. (4)) can be interpreted as the position of a “particle” or a small ball as it evolves in time when it rolls in a parabolic well U(v) = (αβ)²(v − λ)²/2. The system is dissipative: this potential well has a rough surface that induces a friction force on the ball. In the RW model, the friction is tuned so precisely that the ball stops exactly at the bottom of the well. In other words, we place the ball on one of the slopes of the well and we let it start rolling down simply by a gravitational force pointing downwards. As it rolls down, the ball increases its velocity (technically: it acquires momentum) but, at the same time, the friction force makes it decelerate until it magically stops at the minimum asymptotically in the future. Friction is a property of the medium, while momentum is a property of the ball, and there is no reason, concrete or abstract, why these two qualitatively different properties should be related to each other. However, in the RW model they are. The reason is that the “mass” coefficient (αβ)² in the potential U(v) matches exactly the magnitude of the friction force. All that has been said is valid also in a setting where time is discrete (trials n). From one trial to the next, the ball acquires momentum Δv_n and it reaches the bottom of the potential monotonically and asymptotically. The only difference with respect to the continuum case is that, between one hop and the next, momentum is “frozen” like any other dynamical observable. The idea of adding a friction term to the dynamics and of using the ball-in-the-well analogy is not an entirely novel feature of the DOM.
A momentum term is often added to the backpropagation learning equation in artificial neural networks (McClelland and Rumelhart, 1988). This addition improves the efficiency of the search for the minimum of the error function for a given learning task. In this context, which is very different from ours because it does not aim to fit learning data, the analogy of the ball in a narrow valley is used.

Regardless of whether we work in discrete or continuous time, the RW case is not the most natural and most generic situation that a ball in a parabolic well may experience. When there is a numerical mismatch between friction and potential energy, the ball will roll down to the bottom, climb up the opposite slope to a lesser height, and roll back down from there in a sequence of damped oscillations, eventually resting at the bottom of the well. This is the origin of the DOM: we allow for a mismatch, encoded in a new parameter μ, between friction and ball momentum, and the result is an oscillatory pattern dying out at the bottom. The only but crucial difference between RW and the DOM is the presence of one extra parameter in the potential,

U(v) = [(αβ)² + μ²](v − λ)²/2 ,   (5)

introduced simply because it is the most general thing one could do to define the action. Just like for the other parameters, the theory does not tell us what the value of μ is for a given subject; but observations do. We call this line of reasoning sustaining the DOM the naturalness argument, to distinguish it from two more arguments we will offer shortly. All of this holds for a system evolving with either a continuous or a discrete time parameter. Mathematically and conceptually, there is no difficulty in adopting either view. In the discrete case, the dynamics of the ball is observed as through a sequence of photographic snapshots, while in the continuum case this sequence is so fast that it becomes a movie.

However, since the model is defined on a discretum, we first quote its equation of motion as a recursive formula, the DOM counterpart of the incremental law (1):

Δv_{n+1} − (1 − 2αβ)Δv_n = (α²β² + μ²)(λ − v_{n−1}) .   (6)

This equation reduces to the RW model when μ = 0 (Calcagni et al., 2020). Notice the presence of two increments, one from n to n + 1 and another from n − 1 to n. While in the μ = 0 case the equation can be reverted to expression (1) with only one increment, one cannot make this simplification in the case of the DOM. Therefore, the learning law (6) of the DOM necessarily involves both increments Δv_{n+1} and Δv_n. While in the RW model the subject can predict the outcome of the next trial (future value v_{n+1}) just from the present outcome (present value v_n), the DOM needs a longer memory of past states: the knowledge of the past trial (value v_{n−1}) is also required. In other words, while in the RW model learning (i.e., the momentum Δv_n) is proportional to the prediction error (λ − v_{n−1}), in the DOM learning depends mainly on the prediction error at the next-to-last trial and, to a lesser degree (because the coefficient 1 − 2αβ is smaller than one in absolute value), on the learning at the last trial. However, prediction errors weigh more on learning than in the RW model, since the right-hand side of Eq. (6) is augmented by a term proportional to μ². Therefore, subjects can learn further into the future what is going to happen, but they do so with greater uncertainty. This is the origin of the overshooting of the optimal response and the subsequent readjustments through an oscillatory pattern. We dub this justification of the DOM the errors-in-learning argument; it is perhaps the most compelling one from a psychological point of view. It makes the learning process considerably more flexible, or more realistic, than in the RW model.
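The qualitative behavior of the DOM law, Eq. (6), can be verified with a minimal Python sketch. The choice of first increment below (an RW-like first step) is our own assumption, since Eq. (6) does not fix the initial conditions; the parameter values are illustrative.

```python
def dom_sequence(ab, mu, lam, n_trials, dv0=None):
    """Iterate Eq. (6); v[0] is trial 1. dv0 is the first increment (our choice)."""
    if dv0 is None:
        dv0 = ab * lam  # RW-like first step; an assumption, not fixed by Eq. (6)
    v = [0.0, dv0]
    dv = dv0
    for n in range(1, n_trials - 1):
        # Delta v_{n+1} = (1 - 2ab)*Delta v_n + (ab^2 + mu^2)*(lam - v_{n-1})
        dv = (1 - 2 * ab) * dv + (ab**2 + mu**2) * (lam - v[n - 1])
        v.append(v[-1] + dv)
    return v

v = dom_sequence(ab=0.1, mu=0.3, lam=1.0, n_trials=200)
assert max(v) > 1.0              # the response overshoots the asymptote ...
assert abs(v[-1] - 1.0) < 0.01   # ... and damped oscillations settle at lam

# With mu = 0 the same recursion reproduces the monotonic RW sequence, Eq. (2):
v_rw = dom_sequence(ab=0.1, mu=0.0, lam=1.0, n_trials=200)
assert all(abs(v_rw[i] - (1.0 - 0.9**i)) < 1e-9 for i in range(200))
```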
The infinitesimal version of the combination Δv_{n+1} − Δv_n is the second-order time derivative v̈, so that the continuum limit of Eq. (6) is

v̈ + 2αβv̇ + (α²β² + μ²)(v − λ) = 0 ,   (7)

whose general solution is

v(t) = λ[1 − e^{−αβt}(cos μt + A sin μt)] ,   (8)

where A is a constant amplitude. The learning profile Eq. (8) has a total of five free parameters, while the RW profile (4) has only three. This leads to a second way to express the naturalness argument justifying the DOM, which we may call the fine-tuning argument. For definiteness, we can illustrate it with the continuum version of the models; it is simpler and nonrestrictive. By definition, the initial conditions for a system described by a second-order differential equation are the value of v and its first time derivative at the initial time t = 0 (trial number 1). The RW model can be regarded as a second-order system, i.e., the limit μ → 0 of the DOM equation (7). In fact, taking the time derivative of Eq. (3) and combining it with αβ times Eq. (3), one has

v̈ + 2αβv̇ + α²β²(v − λ) = 0 .   (9)

This is exactly equivalent to the first-order equation (3), which means that one of the two initial conditions has become redundant inasmuch as it overdetermines the dynamics of the model. The most common dynamical systems (all endowed with an action) have equations of motion with second-order derivatives, and one must specify two initial conditions (position and velocity). The DOM is one such system and is quite unremarkable. Also RW (which also follows the least action principle) looks like a second-order system, but its dynamics can be reduced to the first-order system Eq. (3), something quite anomalous and due to the fact that we fine-tuned the dynamics by setting the parameter μ exactly equal to zero. Thus, in the RW case one needs only one initial condition (the position).
This is just another way to describe the contrast between the two-state memory in the DOM (retrodiction of two past states from the knowledge of the present one) and the one-state memory in the RW model. To put it in formulae, call v_RW(λ, α, β; t) the most general solution (4) of the RW model (9). The initial conditions for an excitatory process are

v_RW(λ, α, β; 0) = 0 ,   (10a)
v̇_RW(λ, α, β; 0) = λαβ .   (10b)

The subject starts with zero association strength and a positive incremental rate λαβ. While (10a) tells where the ball starts inside the potential well, (10b) tells with what initial velocity the ball starts to roll down (the subject starts to learn). Now consider the DOM solution (8), denoted as v_DOM(λ, α, β, μ, A; t). Since cos 0 = 1 and sin 0 = 0, the initial conditions are

v_DOM(λ, α, β, μ, A; 0) = 0 ,   (11a)
v̇_DOM(λ, α, β, μ, A; 0) = λ(αβ − μA) .   (11b)

The catch is that the RW initial condition v̇_RW(λ, α, β; 0) = v̇_DOM(λ, α, β, 0, A; 0) is a special case of (11b) achieved by setting μ = 0, so that the ball goes to the bottom of the well monotonically. If one deviates, even infinitesimally, from Eq. (10b), then one does not obtain Eq. (4) as a solution. Theoretically, there is no reason why one should set to zero a parameter that, in general, will take a non-vanishing value. To summarize, the RW model purports to match a dynamical feature (the initial velocity of the ball) with an environmental one (the friction of the basin surface) without explaining why the system is forced into such a special situation. This fine-tuning problem is, in general, a powerful killer of dynamical models. Note that we have implicitly given two equivalent ways to describe the overshoot of the minimum by the ball. One, used in Calcagni et al. (2020) and in Eq. (5), compares the two models at the level of the equations of motion and states that the ball in the DOM rolls down a potential well with a steeper slope than the well in the RW model.
The other way compares the two models at the level of the initial conditions and states that the potential well is the same in both models (by a rescaling αβ → √((αβ)² + μ²) in Eq. (5)) but, by virtue of the same rescaling, the initial momentum Eq. (10b) of the ball in the DOM is greater than the initial momentum in the RW case. Thus, in the DOM the ball cannot brake at the bottom and it undergoes damped oscillations. For completeness, we mention a caveat about oscillations in the RW model. Although they are not present in the single-cue case, when more cues are present response oscillations can take place. In the simplest case, this could happen when discriminating between the context and the compound made of the context and the CS, a two-dimensional system that can in principle sustain oscillations. In general, therefore, a multi-dimensional first-order system is a potential alternative to the one-dimensional second-order DOM. Whether oscillations of multi-cue origin occur with realistic parameter values is an open question, which we will not address in our single-cue settings. To conclude this subsection, we gave three arguments for the DOM, all based upon the least action principle: the naturalness argument, the errors-in-learning argument, and the fine-tuning argument. The first and the last are equivalent and rely on reasoning in terms of the dynamics. The second argument invokes a more flexible learning process and gives a psychological interpretation of the DOM. Another interpretation based on the efficiency of learning was also given. In Calcagni et al. (2020), only the naturalness argument and the efficiency interpretation were provided. Ultimately, empirical evidence is what matters. Many subjects in the experiments we will describe did show a learning curve with a nonzero μ.
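The initial conditions (10)–(11) discussed above are easy to check numerically. This is a minimal Python sketch: it evaluates the DOM solution, Eq. (8), and a finite-difference estimate of its derivative at t = 0; the parameter values are ours.

```python
import math

def v_dom(t, lam, ab, mu, A):
    """Eq. (8): v(t) = lam*[1 - exp(-ab*t)*(cos(mu*t) + A*sin(mu*t))]."""
    return lam * (1.0 - math.exp(-ab * t) * (math.cos(mu * t) + A * math.sin(mu * t)))

lam, ab, mu, A = 1.0, 0.15, 0.4, 0.5  # illustrative values
h = 1e-6
v0 = v_dom(0.0, lam, ab, mu, A)
vdot0 = (v_dom(h, lam, ab, mu, A) - v_dom(0.0, lam, ab, mu, A)) / h

assert abs(v0) < 1e-12                          # Eq. (11a): v(0) = 0
assert abs(vdot0 - lam * (ab - mu * A)) < 1e-4  # Eq. (11b): v'(0) = lam*(ab - mu*A)
# Setting mu = 0 recovers the RW initial condition, Eq. (10b): v'(0) = lam*ab.
```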
Assume, temporarily, that we discover that the majority of individual data in experiments of Pavlovian conditioning display smooth, long-range oscillations in the response and that these data are well fitted by the DOM (which, as will be shown later, is what occurs in our case). One may wonder whether this is evidence in favor of the DOM or just of the presence of oscillations that, after all, could be implemented in many different ways. Clearly, one cannot get oscillations from the RW model. Changing the value of the rate parameter αβ or the asymptote λ only modifies, respectively, the slope and the final height of the learning curve, not its shape. If, on the other hand, one tries to change the rate parameter on a trial-by-trial basis, then one hits the Mackintosh model (1975), where, however, the salience varies monotonically. To get oscillations, we must introduce at least one new parameter, which is the frequency μ of the oscillations. It is a quantitative change (setting to nonzero something that is zero for the RW model) that gives rise to a qualitative change (oscillations versus monotonic evolution). The DOM is a model doing just that. Is it the only possibility? To act as the devil's advocate, let us suppose that the oscillations have nothing to do with learning but reflect some other periodic system. For example, they could be fluctuations in the subject's level of hunger or some other motivational factor that influences response level but is not part of the learning process. We can conceive an alternative oscillatory model (the NAOM) where oscillations are independent of the learning process:

v(t) = λ(1 − e^{−αβt}) + A sin μt + B cos μt .   (12)

In Eq. (12), oscillations are simply added to the RW learning curve, while in the DOM (8) they multiply it as a modulation factor.
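The key mathematical difference between Eqs. (8) and (12) can be made explicit in a short Python sketch: in the DOM the oscillation is modulated by the decaying factor e^{−αβt} and dies out with the learning curve, while in the NAOM the added term A sin μt + B cos μt keeps a constant amplitude forever. Parameter values are illustrative choices of ours.

```python
import math

def v_dom(t, lam, ab, mu, A):
    """DOM, Eq. (8): oscillations multiply the decaying exponential."""
    return lam * (1 - math.exp(-ab * t) * (math.cos(mu * t) + A * math.sin(mu * t)))

def v_naom(t, lam, ab, mu, A, B):
    """NAOM, Eq. (12): undamped oscillations added to the RW curve."""
    return lam * (1 - math.exp(-ab * t)) + A * math.sin(mu * t) + B * math.cos(mu * t)

lam, ab, mu, A, B = 1.0, 0.1, 0.5, 0.3, 0.2
late = range(500, 600)  # "late in training", well past the learning phase
dom_dev = max(abs(v_dom(t, lam, ab, mu, A) - lam) for t in late)
naom_dev = max(abs(v_naom(t, lam, ab, mu, A, B) - lam) for t in late)

assert dom_dev < 1e-6   # DOM: oscillations are fully damped late in training
assert naom_dev > 0.1   # NAOM: oscillations persist with constant amplitude
```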
In Appendix A, we show that this model is non-associative because, although it can be derived from an action, it hides an extra degree of freedom independent of the associative strength v that has the peculiar property of increasing indefinitely in the future and, therefore, has no interpretation in terms of a learning process with an asymptote. For this reason, the NAOM encodes both learning (associative) and motivational (non-associative) elements, both of which can affect the subject's performance. Therefore, the difference between the NAOM and the DOM is that the DOM assumes that oscillatory behavior is an intrinsic part of the learning process, whereas the NAOM also allows for oscillations but treats them as part of performance rather than learning. From a mathematical standpoint, the DOM is simpler, but it requires a significant rethinking of what the learning process is. The NAOM is more complex mathematically, but easier to integrate with our existing theories of learning. For that reason, most people working in the field might be more accepting of the NAOM and might want to know which of the two models explains the data best. The rationale behind the NAOM is the following. Comparing the DOM with the RW model, we will find an overall advantage for the DOM in terms of predictive power. The question, then, is whether this advantage is just because the extra one or two parameters allow the DOM to track fluctuations in responding that cannot be explained by RW. In terms of explaining performance, the DOM fluctuations are probably meaningful, but in terms of the acquisition process they might well be considered as noise. Expressed informally, is the DOM “beating” RW because it can also model something that is effectively noise when it comes to acquisition?
This issue is important because the DOM oscillations are an intrinsic part of its description of acquisition, so for that model they are not independent (or noise). But in the data the oscillations may have nothing to do with acquisition and so, in effect, are just noise. This is what justifies comparing the DOM to the NAOM. (The NAOM is non-associative in the sense of being, on one hand, possibly related to motivation and, on the other hand, in no way related to the associative strength. Therefore, the NAOM does not deal with habituation, sensitization, pseudo-conditioning and other phenomena traditionally categorized as “non-associative.”) The NAOM is RW plus some ability to track oscillations in performance. In the NAOM, the oscillations are not intrinsic to the acquisition process. So, if the DOM sometimes beats RW just because it can track oscillations (even when these are noise) and RW cannot, then replacing RW with the NAOM should deal with this and the DOM should not beat the NAOM. In fact, the NAOM should beat the DOM, because for the NAOM the oscillations are independent of acquisition and so the NAOM can track them right through the training data, whereas for the DOM the oscillations are more constrained, since they are a part of the acquisition process. That is, the DOM can only explain oscillations in responding while the subject is acquiring the response, and these oscillations will dampen down with continued training, whereas the NAOM is better able to explain fluctuations in responding that have nothing to do with acquisition per se. We will find the opposite trend: the DOM is favored over the NAOM in a large number of cases. This will eventually rule out non-associative factors as the principal determinants of the oscillations. In the spirit of Bayesian model selection, we will compare all models among one another instead of only the pairs RW–DOM and DOM–NAOM, regardless of their theoretical justification.
Finally, one could object that the NAOM is not an effective “control” model because its oscillations are not damped, and one would not expect a model that does not lead to a stable performance to fit most data better than the DOM (where oscillations are naturally damped). While this issue may not be easily dismissed for exceptionally long experiments such as the one of Calcagni et al. (2020), it has little or no impact in all other typical cases, where the period of the oscillations is comparable with the time scale of the learning curve, as one can verify a posteriori by looking at the data (including those of Harris et al., 2015). Therefore, since most of the data we use come from an experiment where the total number of trials is of the order of the fluctuations’ wavelength, cases where the NAOM is a better fit than the DOM are not artefacts.

2.4 Quantum RW model

The last model we consider was proposed in Calcagni et al. (2020), to which we refer for details. Since the model is complex, here we only recall its main characteristics and predictions. We consider it here just in order to complete the analysis of Calcagni et al. (2020) with the NAOM; the reader uninterested in, or not acquainted with, our past results can safely skip this part. This model, based on the mathematics of quantum mechanics, is associative just like the RW model and the DOM but, contrary to them, it is nondeterministic: even in the ideal and complete absence of experimental errors, it is intrinsically impossible to predict the exact value of the association strength at any given trial. The source of this uncertainty is an adaptation of Heisenberg’s principle to a Pavlovian setting, so that the more accurately we determine the association strength, the greater the uncertainty that falls upon a measurement of its variation, and vice versa. Despite this, one can still calculate a probability distribution for 𝑣(𝑡). The actual model is constructed by taking an associative model and then applying a “quantization” procedure to it. In Calcagni et al. (2020), we took the RW model as a classical starting point because it is much simpler to quantize than any other model. Its main predictions are:
1- Response variability is described by a Gaussian spectrum, i.e., white noise (confirmed in Calcagni et al., 2020).
2- The asymptote of learning is not a durable achievement and there exists an intrinsic source of variability in the subject response.
3- There exists a universal constant ℎ̅ that can be estimated as ℎ̅ ≃ 2𝛼𝛽𝜎, where 𝜎 is the dispersion of the data.
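The last estimate is a one-line computation once a fit is available. As an illustration, with made-up stand-ins for the fitted salience product 𝛼𝛽 and for the residual dispersion 𝜎:

```python
import numpy as np

# Estimate of the universal constant h-bar ~ 2*alpha*beta*sigma.
# Both alpha_beta and the residuals below are made-up stand-ins for a
# fitted salience product and the data's dispersion around the fit.
alpha_beta = 0.05
residuals = np.array([0.10, -0.15, 0.05, 0.12, -0.08, -0.04, 0.09, -0.09])
sigma = residuals.std(ddof=1)  # dispersion of the data around the fitting curve
h_bar = 2 * alpha_beta * sigma
print(h_bar)
```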
3. Two experiments
We now introduce two experiments of Pavlovian conditioning with which to test the theoretical models of the previous section. A full, more detailed description of these experiments can be found in the original references.
In Experiment 2 of Harris et al. (2015), 16 male Hooded Wistar rats were used, with unrestricted access to water and restricted daily food rations. The US was food delivered in a dispenser (head entries were recorded) and four different CSs were used: white noise, a tone, a flashing light, and a steady light. For each rat, each CS was allocated to one of the following configurations (counterbalanced across rats):
CR10: continuous reinforcement (CS reinforced on 100% of the trials) with random duration of 10 s mean, 30 sessions, 6 trials per session;
CR30: continuous reinforcement with random duration of 30 s mean, 30 sessions, 6 trials per session;
PR10: partial reinforcement (CS reinforced on 33% of the trials) with random duration of 10 s mean, 30 sessions, 18 trials per session;
PR30: partial reinforcement with random duration of 30 s mean, 30 sessions, 18 trials per session.
Trial by trial, the CS duration varied randomly on a uniform distribution with a mean of either 10 s or 30 s, according to the name of the group. The number of reinforced trials per session per CS was the same in all configurations and equal to 6. Each of the 30 sessions consisted of delayed conditioning in which presentations of the four CSs were randomly intermixed: 6 of each of the continuously reinforced CSs and 18 of each of the partially reinforced CSs, for a total of 48 trials per session.
In the experiment of Calcagni et al. (2020), we employed 32 male Wistar Han rats, without food or water restriction, divided into two experimental groups (US: saccharine solution at 0.1% concentration, Group 1, or 0.2% concentration, Group 2) and two control groups. One experimental subject was removed due to poor health. The US was delivered in individual conditioning boxes via a water pump activated by an electrovalve.
In the case of experimental subjects, delivery happened on a variable-time 5 s schedule (VT-5) implemented as a uniform random distribution during the presentation of a 10 s tone (CS). Licks were automatically recorded. We took 90 sessions each comprising 44 trials.
4. Model selection: RW, DOM, and NAOM
Having several theoretical models at hand, we want to see whether one of them explains the data better than the others. The first step is to find the best fit of individual data for each model; the second step is to compare these best fits. In this section, the RW model, the DOM, and the NAOM are treated democratically as three different fitting curves for discrete data (the quantum model will be discussed in Appendix C). This procedure is fair because, on one hand, one makes the same assumptions for all models, i.e., that they can be used as continuous fitting curves; and, on the other hand, we already showed that the approximation of the sequence of points 𝑣ₙ as a curve 𝑣(𝑡) is valid for the large number of trials run in our experiments. Once the three fits are performed for each subject, they are compared using a model selection procedure. The comparison is made by calculating, for each fit, the Bayes Information Criterion (BIC) and the Akaike Information Criterion (AIC), two quantities that depend on the error variance and the number of free parameters. The better the fit, the smaller the IC; the greater the number of free parameters, the larger the IC. The difference between the BIC and the AIC is in the way the number of parameters is penalized. For any given subject, the model with the smaller IC is the most favored, independently of its theoretical justification. In practice, the RW model has two free parameters (the asymptote of learning 𝜆 and the product of the US and CS saliences 𝛼𝛽, here considered as one parameter), the DOM has four (𝜆, 𝛼𝛽, the oscillation frequency 𝜇, and the amplitude 𝐴), and the NAOM has five (𝜆, 𝛼𝛽, 𝜇, and two amplitudes 𝐴 and 𝐵). A priori, a fitting curve with more free parameters will fit the data better, but it will also be penalized more severely in the AIC and BIC.
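The IC computation itself is standard; a minimal sketch under a Gaussian error model for least-squares fits (illustrative, not the original analysis code):

```python
import numpy as np

def aic_bic(residuals, k):
    """AIC and BIC for a least-squares fit under a Gaussian error model:
    n*log(RSS/n) plus a parameter penalty (2k for AIC, k*log(n) for BIC)."""
    n = len(residuals)
    rss = np.sum(np.asarray(residuals) ** 2)
    loglike_term = n * np.log(rss / n)
    return loglike_term + 2 * k, loglike_term + k * np.log(n)

# More parameters never worsen the residual term, but the penalty can
# still make the richer model lose if the residuals do not improve:
res = np.array([0.1, -0.2, 0.15, 0.05, -0.1, 0.2, -0.05, 0.1])
aic_rw, bic_rw = aic_bic(res, k=2)    # RW: lambda and alpha*beta
aic_dom, bic_dom = aic_bic(res, k=4)  # DOM: adds mu and A
print(bic_dom - bic_rw)  # positive: same residuals, heavier penalty
```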
In other words, the AIC and BIC take into account and partially compensate for the advantage of the NAOM over the DOM, and of both over the RW model, in having more parameters available. Preliminarily, we checked some convergence issues of the NAOM. We performed a nonlinear best-fit analysis for all subjects with the parameter 𝜇 constrained to be positive, and the parameters 𝐴 and 𝐵 constrained in several ways (𝐴 ≠ 0, 𝐵 = 0; 𝐴 = 0, 𝐵 ≠ 0; 𝐴 ≠ 0, 𝐵 ≠ 0). However, the problem is that the oscillations of the NAOM are not damped in time, and this forces the nonlinear fit algorithm to solutions that are unsatisfactory in all cases, regardless of the priors on 𝐴 and 𝐵: small-amplitude, high-frequency oscillations which have little to do with the behavior of the subjects. In general, the IC associated with these fits is much larger than for RW or the DOM. There are some cases where the AIC is smaller than for the other fits, but they clearly are artifacts. In fact, one cannot use the NAOM as a model of short-range fluctuations either, since the spectral analysis of Calcagni et al. (2020) proved that these fluctuations are random, while the NAOM is completely deterministic. These results are just artifacts of the prior 𝜇 > 0 given above, and they can be fixed by noting that all the oscillatory best fits found previously have very low-frequency oscillations with 𝜇 < 0.1. Therefore, we implemented the prior 0 < 𝜇 < 0.1, plus the priors 𝐴 ≠ 0, 𝐵 ≠ 0. Fits where 𝜇 was close to 0.1 were double-checked with a longer prior range. Also, in cases where one of the amplitudes of the fit was zero within the fit error, as a double check we redid the analysis by setting that amplitude to zero as a prior, in order to reduce the penalty from the number of free parameters. Fits with very large 𝛼𝛽 are avoided by imposing an upper-bound prior on 𝛼𝛽, while best fits with 𝛼𝛽 approximately zero within the error are ruled out as artifacts.
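Priors of this kind can be implemented as box constraints in a nonlinear least-squares routine. A sketch using scipy, with an illustrative NAOM-style curve and synthetic data; the bound 𝜇 < 0.1 is the one stated above, while the remaining bound values (including the upper bound on 𝛼𝛽) are illustrative choices:

```python
import numpy as np
from scipy.optimize import curve_fit

def naom_curve(t, lam, ab, mu, A, B):
    """NAOM-style fitting curve: RW acquisition plus undamped oscillations
    (illustrative parametrization, not the paper's exact Eq. (12))."""
    return lam * (1.0 - np.exp(-ab * t)) + A * np.sin(mu * t) + B * np.cos(mu * t)

# Synthetic data standing in for one subject's trial-by-trial responses.
rng = np.random.default_rng(0)
t = np.arange(1, 181, dtype=float)
data = naom_curve(t, 1.0, 0.05, 0.05, 0.2, 0.1) + 0.05 * rng.standard_normal(t.size)

# Box constraints playing the role of the priors: mu restricted to the
# low-frequency band mu < 0.1; other bounds are illustrative.
lower = [0.0, 0.0, 0.0, -1.0, -1.0]
upper = [2.0, 1.0, 0.1, 1.0, 1.0]
popt, pcov = curve_fit(naom_curve, t, data, p0=[0.5, 0.1, 0.05, 0.0, 0.0],
                       bounds=(lower, upper))
print("fitted frequency mu =", popt[2])
```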
Oscillatory fits yielding a larger IC, and those that fail to produce a nontrivial model (i.e., where 𝜇 and/or 𝐴 vanish), are discarded too.

4.1 Analysis of Harris et al. (2015) data

The AIC for the DOM and NAOM best fits of the data of the subjects of this experiment is shown in Fig. 2. Tables B1-B4 (Appendix B) include more details of the Bayesian analysis, including the BIC.
Figure 2: The difference between the AIC of the RW model and the DOM (blue circles) or the NAOM (red triangles) for the 16 subjects of Harris et al. (2015). Positive values indicate that the DOM/NAOM is a better fit than the RW model, while for negative values the RW model is a better fit. Comparing the DOM and the NAOM, the fit with better evidence is the one with the larger value. Cases where one could not fit with the DOM or NAOM have no corresponding point.
From these figures and tables, we observe that the best fit of the majority of subjects is either the DOM or the NAOM (with a predominance of the DOM), not the RW model. When the RW model wins, it does so with weak to positive evidence in CR groups and weak to strong evidence in PR groups. This increase in the strength of evidence (but not in the number of favored cases) is probably due to the larger number of data points (three times as many in PR groups as in CR groups). On the other hand, the great majority of oscillatory winners have very strong evidence in favor. Thus, when subjects display oscillations in their learning curve the effect is usually strong, not just a tiny perturbation of the monotonic RW curve. A group view of these results is summarized in Tab. 1.

         CR10   CR30   PR10   PR30   Total
RW       19%    37%    31%    13%    25%
DOM      50%    19%    31%    69%    42%
NAOM     31%    38%    25%    19%    28%
No fit    0%     6%    13%     0%     5%

Table 1: Percentage of subjects following the RW model, the DOM, or the NAOM. The learning curves of subjects CR30-6, PR10-6, and PR10-11 did not follow any of the models.
It is important to note that RW is a subcase of the DOM and of the NAOM. This means that, whenever RW is selected as the most favored fit, the DOM and the NAOM with frequency approximately zero may also be good fits, while if one of the oscillatory models wins, then the frequency cannot be set to zero and RW is not a good fit. Given that the DOM and the NAOM have RW as a reduced form, they are guaranteed to be at least as good as RW if we do not penalize them for having more free parameters. However, they are indeed penalized by the BIC and the AIC, and one cannot invoke chance to trivially explain why they fare better than RW for some subjects. Therefore, since the goal is to fit as many subjects as possible with the same model, Tab. 1 leads us to conclude that the DOM provides a more powerful explanation of the data than RW. Table 1 is also helpful to look for inter-group differences. In CR groups (continuous reinforcement), the number of subjects following the RW model and the NAOM increases when extending the trial duration, while the number of those following the DOM decreases. In PR groups (partial reinforcement), the opposite occurs: the number of subjects following the RW model and the NAOM decreases when extending the trial duration, while the number of those following the DOM increases. This puzzling pattern received an explanation in Calcagni et al. (2020), without including the NAOM. To summarize it here, it seems that longer trials stabilize the behavior, but a partial reinforcement schedule destabilizes it. Here stability means fewer erratic patterns and more monotonic (RW) learning, but it is not accompanied by smaller oscillations in oscillatory patterns. This interpretation implies an ordering of the groups in terms of increasing stability:

PR10 → PR30 → CR10 → CR30 (13)

Including also the NAOM, we can check whether the data reflect this feature by constructing three orderings from Tab. 1: one where the number of subjects following RW increases and of those following the DOM decreases,

PR30 → CR10 → PR10 → CR30,

another where the number of subjects following the NAOM increases,

PR30 → PR10 → CR10 → CR30,

and a third one where the number of subjects with erratic behavior increases:

PR30 ~ CR10 → CR30 → PR10.

Except for the position of PR10 in the sequence, (13) is essentially confirmed. A fourth way is to sum all the best-fit and second-best-fit BIC and AIC within each group (Tab. 2) and then order the groups in terms of decreasing data dispersion (decreasing value of the group sum). Despite being penalized by having fewer subjects with a viable fit, groups PR10 and PR30 have much higher ICs, which is due to the greater dispersion of data of PR responses. By definition, the Information Criteria measure the dispersion of data points with respect to the fitting curve, and they can be regarded as a measure of stability. The stability order found from Tab. 2 reproduces Eq. (13) exactly.
        CR10    CR30    PR10    PR30
BIC
AIC
Total   5848    2848    18694   8983
Table 2: Sum of all the BIC and AIC of the best and second-best fits within each group. To summarize, we identify an order of increasing behavioral stability determined by the trial length and the reinforcement schedule: the longer the trial and the more continuous the reinforcement, the more stable and monotonic the learning curve. Finally, in Figs. 3-6 we report the best fits for all subjects.
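The ordering can be read off mechanically from the Table 2 totals:

```python
# Group IC totals from Table 2 (sum of best and second-best BIC and AIC).
totals = {"CR10": 5848, "CR30": 2848, "PR10": 18694, "PR30": 8983}
# Increasing stability = decreasing dispersion = decreasing IC total.
order = sorted(totals, key=totals.get, reverse=True)
print(" -> ".join(order))  # PR10 -> PR30 -> CR10 -> CR30, as in Eq. (13)
```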
Figure 3: Best-fit plots for the subjects of group CR10 (subjects 1 to 16, left to right and top to bottom). RW: black line; DOM/NAOM: blue line. Horizontal axis: trials (1 to 180). Vertical axis: normalized response. Case 5 is a false positive because the fit is not significant (see Appendix B).
Figure 4: Best-fit plots for the subjects of group CR30 (subjects 1 to 16, left to right and top to bottom). RW: black line; DOM/NAOM: blue line. Horizontal axis: trials (1 to 180). Vertical axis: normalized response. Subject 6 has no fit.
Figure 5: Best-fit plots for the subjects of group PR10 (subjects 1 to 16, left to right and top to bottom). RW: black line; DOM/NAOM: blue line. Horizontal axis: trials (1 to 540). Vertical axis: normalized response. Cases 5, 8 and 13 are false positives because their fit is not significant (see Appendix B), while subjects 6 and 11 have no fit.
Figure 6: Best-fit plots for the subjects of group PR30 (subjects 1 to 16, left to right and top to bottom). RW: black line; DOM/NAOM: blue line. Horizontal axis: trials (1 to 540). Vertical axis: normalized response.

4.2 Analysis of Calcagni et al. (2020) data
Here we do not repeat the analysis done in Calcagni et al. (2020) but expand it to include the NAOM best fits. The results are in Fig. 7 and Tabs. B5 and B6. On the one hand, the NAOM never beats the DOM when the latter beats RW. This reinforces the view that the DOM is typically a better fit than the NAOM. On the other hand, in most cases the RW model still beats the oscillatory models, probably because of the heavy penalty on the latter (4 to 5 parameters versus 2). Note that the fits of the oscillatory and RW models of this subsection were made with session data (90 points), while those in section 4.1 were made with trial data (180 or 540 points). Trial data of Calcagni et al. (2020) have larger dispersion than Harris et al. (2015) trial data, while their dispersion is comparable when the former data are binned into sessions.
Figure 7: The difference between the AIC of the RW model and the DOM (blue circles) or the NAOM (red triangles) for the 15 experimental subjects of Calcagni et al. (2020). Positive values indicate that the DOM/NAOM is a better fit than the RW model, while for negative values the RW model is a better fit. Comparing the DOM and the NAOM, the fit with better evidence is the one with the larger value. Cases where one could not fit with the DOM or NAOM have no corresponding point. The point of subject 2-8 corresponding to the NAOM is at a very negative vertical position and is not shown.
There are four cases (subjects 2-2, 2-5, 2-6 and 2-7) where the NAOM beats RW (and the DOM is beaten by RW). These exceptions do not jeopardize the overall picture of a successful DOM, the reason being that we may expect an ad-hoc model to fit some data better than RW and the DOM. Nevertheless, it is still remarkable that the DOM and the NAOM can fare better than RW despite having at least twice as many parameters. Figure 8 includes only subjects for whom the best fit is the NAOM; the other fits can be found in Calcagni et al. (2020). Figure 8: Subjects where the NAOM is the most favored model (evidence is against the RW model in all cases): 2-2 (top left), 2-5 (top right), 2-6 (bottom left) and 2-7 (bottom right). Horizontal axis: sessions (1 to 90). Vertical axis: normalized response.
The percentages of subjects following the RW model, the DOM, or the NAOM are shown in Tab. 3. As one can see, in Group 1 the NAOM is absent, while in Group 2 it explains half of the data, better than the other two models. Overall, the RW model explains slightly less than half of the data, while the NAOM explains as many subjects as the DOM; counting the RW cases as limiting cases of the oscillatory models (remember that the RW model is a subcase of both the DOM and the NAOM), each oscillatory model accounts for about three quarters of the total number of subjects.

        Group 1   Group 2   Total
RW      86%       12%       47%
DOM     14%       38%       27%
NAOM     0%       50%       27%
Table 3: Percentage of subjects following the RW model, the DOM, or the NAOM. The fact that the US in Group 2 was twice as sweet as that in Group 1 (and was preferred by rats in a pilot discrimination experiment) endorses the theoretical view that the NAOM is a motivational model rather than an associative one. Still, it remains difficult to understand where such an oscillatory motivational pattern might come from. If oscillations were due to motivational factors (performance versus learning), then one would like to keep track of these factors. The first thing one can check is hunger. In this experiment, subjects were not food-deprived and were kept at 100% of their theoretical weight (calculated from an ideal growth curve). Food was delivered in the cage at least half an hour after the rat’s last session of the day. Hunger does not appear to be a relevant factor in the oscillations, which took place over the range of several experimental sessions and were not locally determined by individual sessions. It is also true that the saccharin dispensed in the experimental sessions was more appreciated than the regular food in the cage. Therefore, some additional incentive properties may be attributed to the reinforcer in our experiment. If this were the case, the determinants of the level of oscillations might be critical manipulations related to hunger and to the incentive properties of the reinforcer. Additionally, oscillations should be greater at the start and end of long sessions, given fatigue effects. However, these considerations do not apply to our experiment. We did not find any correlation between session time, response level, and cage food delivery. If there existed a motivational factor, then it would have been something different from food. Moreover, in this experiment all sessions had the same duration and there was no correlation between the number of sessions per day (either 1 or 2) and the response level.
Notice also that any response fluctuation at the beginning or end of a session would be on a far shorter time scale than the long-range oscillations of the DOM and the NAOM. The motivational factor we are trying to find must be on scales much larger than that of session quirks. Finally, and somewhat unrelated to the rest, one may ask whether motivational factors are responsible for response fluctuations happening within a few trials or sessions, but response fluctuations (dispersion of the data) were constant throughout the whole experiment (Calcagni et al., 2020), just as in the case of the data of Harris et al. (2015). Looking at the tables, one sees that the dispersion was not systematically greater for longer trials or sessions.
We have seen that both the DOM and the NAOM capture aspects of the acquisition data, for quite a few rats, that are not modeled by the RW learning curve. What they seem to be explaining is fluctuations in levels of responding over a longer time scale than trial by trial. The DOM attributes these fluctuations to an oscillatory behavior of the learning system, in which the animal goes through a sequence of over- and under-adjusted predictions with respect to the optimal asymptote of learning. As such, the oscillations are largest early in training and progressively damped. The NAOM, on the other hand, implements fluctuations as a loosely specified motivational effect of a non-associative nature. According to Tab. 1, the DOM explains more data than the NAOM. Together with the difficulty in finding a convincing motivational explanation in either experiment, and with the fact that the presence of long-range oscillations in Harris et al. (2015) data is modulated by the duration of the trial and by the type of reinforcement schedule (partial versus continuous), we may conclude that the majority of the observed response oscillations have an associative origin. The key point is that it is impossible to fit a random phase (such as a motivational quirk could produce) with a long-range deterministic pattern. Any motivational factor should be on the same time scale as the oscillations, and we can hardly find one in these experiments, where the only change in the animals’ diet and habits was of the order of a few hours. Convincing evidence that the DOM is better than the NAOM is also given by the model selection statistics. The NAOM has 4 to 5 parameters (depending on whether we set one of the amplitudes to zero), while the DOM has 3 to 4. In general, the NAOM beats the DOM only when it has more parameters, while the DOM beats the NAOM when it has fewer than or the same number of parameters. This outcome is helped by the parameter-count penalty on the NAOM, but not by much.
In many cases, the 4-parameter DOM beats the 4-parameter NAOM with very strong evidence, which means that the lion’s share of the IC is not the number of parameters but the lower dispersion of the data with respect to the fitting curve. To put it simply, the DOM wins by brute force, not because of its lesser penalty. If one simply looks at the goodness of fit, without taking into account the number of parameters, the DOM usually gives better fits than the NAOM. So, while the DOM beats the RW model often, adding oscillations to RW in the form of the NAOM does give it some advantage, but there are still plenty of cases where the DOM beats the NAOM, or is about as good. This is important because the DOM is more constrained. If the best description of responding were really just RW to account for the acquisition part plus an oscillatory process to account for some independent factor (e.g., motivation) that causes fluctuations in performance, then the NAOM should beat the DOM much more often, even more so in the very long experiment of Calcagni et al. (2020). However, this is not the case. Therefore, the overall evidence favoring the DOM justifies a new description of acquisition that includes long-range oscillations of associative origin.

Acknowledgments
We thank our collaborators in Calcagni et al. (2020) and Harris et al. (2015) for participating in collecting the data we analyzed here, and Stefano Ghirlanda for insightful comments on a previous version of the manuscript. R.P. was supported by grant PSI2016-80082-P from Ministerio de Economía y Competitividad, Secretaría de Estado de Investigación, Desarrollo e Innovación, Spanish Government (R.P. Principal Investigator).

Appendix A – The NAOM is non-associative
Equation (12) is the most general solution of the equation of motion

(𝑑²/𝑑𝑡² + 𝜇²)[𝑣̇ + 𝛼𝛽(𝑣 − 𝜆)] = 0 .   (A1)

Since the NAOM stems from a third-order equation of motion, it needs more initial conditions (three in total) than RW and than the DOM to specify the solution (12). This is the reason why (12) has six free parameters, the DOM has five, and the RW model only three. Comparing with (3), it is clear that the RW model is a special case of the NAOM. However, while it is easy to write down an action for the RW model, it is nontrivial for the NAOM. In fact, the equation of motion (A1) is third order in time derivatives, but the simplest form of the action principle only generates equations of motion of even derivative order. There is a trick to obtain differential equations of odd order that will show how the NAOM includes an extra degree of freedom of non-associative character. The action of the RW model (3) and of the DOM is (Calcagni et al., 2020)

𝑆 = ∫₀ᵀ 𝑑𝑡 𝑒^(2𝛼𝛽𝑡) [𝑣̇²/2 − 𝑈(𝑣)] ,   (A2)

where 𝑇 is some final time and the potential 𝑈(𝑣) is given by (5) for the DOM and by (5) with 𝜇 = 0 for the RW model. The action (A2) only depends on one degree of freedom, the association strength 𝑣(𝑡). Varying 𝑆 with respect to 𝑣(𝑡) yields the equation of motion (7) or (9). However, it is not possible to find an action for the NAOM in terms of 𝑣(𝑡) alone, and we must include a new degree of freedom, technically called a Lagrange multiplier, which we will dub 𝑦(𝑡). Then, varying the action

𝑆_NAOM = ∫₀ᵀ 𝑑𝑡 𝑦 (𝑑²/𝑑𝑡² + 𝜇²)[𝑣̇ + 𝛼𝛽(𝑣 − 𝜆)]   (A3)

with respect to 𝑦 immediately yields the equation of motion (A1). The variation with respect to 𝑣 gives an independent equation for the extra degree of freedom,

(𝑑²/𝑑𝑡² + 𝜇²)(𝑦̇ − 𝛼𝛽𝑦) = 0 ,

whose solution is

𝑦(𝑡) = 𝑒^(𝛼𝛽𝑡) + 𝐴 sin 𝜇𝑡 + 𝐵 cos 𝜇𝑡 .   (A4)

A key difference with respect to the profile (12) of the association strength is that 𝑦(𝑡) increases indefinitely in the future.
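These statements can be verified symbolically. A sketch with sympy, reading the oscillatory operator as 𝑑²/𝑑𝑡² + 𝜇² (the reading under which the equation of motion is third order, as stated above) and taking an RW-plus-oscillations profile as a stand-in for Eq. (12):

```python
import sympy as sp

t = sp.symbols('t', positive=True)
mu, ab, A, B, lam, C = sp.symbols('mu ab A B lam C', positive=True)

# Stand-in for the NAOM profile (12): RW curve plus undamped oscillations.
v = lam - C * sp.exp(-ab * t) + A * sp.sin(mu * t) + B * sp.cos(mu * t)
# Assumed third-order operator (d^2/dt^2 + mu^2) acting on [v' + ab*(v - lam)].
inner = sp.diff(v, t) + ab * (v - lam)
eom = sp.diff(inner, t, 2) + mu**2 * inner

# Extra degree of freedom y(t) of Eq. (A4): grows without bound in the future.
y = sp.exp(ab * t) + A * sp.sin(mu * t) + B * sp.cos(mu * t)
inner_y = sp.diff(y, t) - ab * y
eom_y = sp.diff(inner_y, t, 2) + mu**2 * inner_y

print(sp.simplify(eom), sp.simplify(eom_y))  # 0 0
```

Both expressions simplify to zero, confirming that the profile solves (A1) and that (A4) solves the equation for 𝑦(𝑡).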
One might interpret it as motivation, assuming no room for boredom effects in this model. There are other ways to define an action for the NAOM, but they entail more extra degrees of freedom.

Appendix B – BIC and AIC tables
Tables B1-B4 show the BIC and AIC (presented in the format “BIC, AIC”) of the best fits of the three models for all subjects of the Harris et al. (2015) experiment. Calling Δ = (IC model 1) − (IC model 2) the difference in the Bayes or Akaike IC, one finds evidence in favor of model 2 if Δ > 0. This evidence is weak if Δ < 2 (blue color in Tabs. B1-B4), positive if 2 ≤ Δ < 6 (green), strong if 6 ≤ Δ < 10 (orange), and very strong if Δ ≥ 10 (red). In the tables, all numbers are rounded to zero decimals, but all digits were used to calculate Δ. For each IC, “winners” with respect to the second best are in boldfaced color. The order of comparison is always RW – DOM (Δ_RD) and NAOM – RW or NAOM – DOM (Δ_N), depending on which of RW and the DOM is the best or second best in each IC (the comparison in a given IC is made with the second-best model, unless this model is the third best in the other IC). When two models are favored in different ICs, the one with the strongest evidence wins. For subjects CR30-15 and PR10-10, where positive evidence is balanced between RW and the NAOM, we declared RW the winner by Occam’s razor. Trivial RW fits with vanishing 𝛼𝛽 are reported, except when the other fits also fail, in which case all cells are left blank. All cases have a p-value 𝑝 ≪ 0.001 for nonzero 𝜇, except those indicated with an asterisk, which may be false positives since 𝑝 ≈ 1.
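The evidence categories can be wrapped in a small helper with the thresholds just stated (a convenience sketch, not part of the original analysis):

```python
def evidence(delta):
    """Map an IC difference delta = IC(model 1) - IC(model 2) to the
    evidence category for model 2 used in Tables B1-B4."""
    if delta <= 0:
        return "favors model 1"
    if delta < 2:
        return "weak"
    if delta < 6:
        return "positive"
    if delta < 10:
        return "strong"
    return "very strong"

print(evidence(4))   # positive
print(evidence(12))  # very strong
```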
Table B1: BIC and AIC for the RW model, DOM, and NAOM best fits of subjects in Group CR10.
Table B2: BIC and AIC for the RW model, DOM, and NAOM best fits of subjects in Group CR30.
Table B3: BIC and AIC for the RW model, DOM, and NAOM best fits of subjects in Group PR10.
Table B4: BIC and AIC for the RW model, DOM, and NAOM best fits of subjects in Group PR30. The results for the experiment of Calcagni et al. (2020) are given in Tabs. B5 and B6.
Table B5: BIC and AIC for the RW model, DOM, and NAOM best fits of subjects in Group 1. Subject 8 was removed from the group due to poor health.
Table B6: BIC and AIC for the RW model, DOM, and NAOM best fits of subjects in Group 2.

Appendix C – Quantum RW model
We contrasted the characteristic predictions of this model, listed in section 2.4, with the data in Calcagni et al. (2020). Here we only revise, taking the NAOM into account, the one stating that “there exists a universal constant ℎ̅ that can be estimated as ℎ̅ ≃ 2𝛼𝛽𝜎.” We checked this in individual subjects from Harris et al. (2015), considering only those subjects following the RW model. The values of ℎ̅ are given in Tab. C1.
Table C1: Values of ℎ̅ for the experimental subjects of Harris et al. (2015).
Within each group, the estimated average of ℎ̅ is of the same order as its error, which means that the range of values is quite dispersed, at least more than in the experiment by Calcagni et al. (2020), where the average ℎ̅ was significantly nonzero. Therefore, the conclusion is that the Harris et al. (2015) experiment is unable to verify this prediction of the quantum model. The latter would have been ruled out if the group values had been significantly different from one another.

References
Blanco, F., & Moris, J. (2018). Bayesian methods for addressing long-standing problems in associative learning: The case of PREE. Q. J. Exp. Psychol. 71, 1844.
Bush, R.R., & Mosteller, F. (1951a). A mathematical model for simple learning. Psychol. Rev. 58, 313; reprinted in Mosteller (2006), pp. 221–234.
Bush, R.R., & Mosteller, F. (1951b). A model for stimulus generalization and discrimination. Psychol. Rev. 58, 413; reprinted in Mosteller (2006), pp. 235–250.
Calcagni, G. (2018). The geometry of learning. J. Math. Psychol. 84, 74.
Calcagni, G., Caballero-Garrido, E., & Pellón, R. (2020). Behavior stability and individual differences in Pavlovian extended conditioning. Frontiers in Psychology, in press. Electronic preprint at https://arxiv.org/abs/1806.01778 and https://psyarxiv.com/GDR2W.
Estes, W.K. (1950). Toward a statistical theory of learning. Psychol. Rev. 57, 94.
Gallistel, C.R. (2012). On the evils of group averaging: Commentary on Nevin’s “Resistance to extinction and behavioral momentum”. Behav. Proc. 90, 98.
Gallistel, C.R., Fairhurst, S., & Balsam, P. (2004). The learning curve: implications of a quantitative analysis. Proc. Nat. Acad. Sci. USA 101, 13124.
Ghirlanda, S., & Enquist, M. (2019). On the role of responses in Pavlovian acquisition. J. Exp. Psychol.: Anim. Learn. Cog. 45, 59.
Ghirlanda, S., & Ibadullaiev, S. (2015). Solution of the comparator theory of associative learning. Psychol. Rev. 122, 242.
Glautier, S. (2013). Revisiting the learning curve (once again). Front. Psychol. 4, 982.
Harris, J.A., Patterson, A.E., & Gharaei, S. (2015). Pavlovian conditioning and cumulative reinforcement rate. J. Exp. Psychol.: Anim. Learn. Cogn. 41, 137.
Hayes, K.J. (1953). The backward curve: a method for the study of learning. Psychol. Rev. 60, 269.
Hull, C.L. (1943). Principles of Behavior. New York, NY: Appleton-Century-Crofts.
Jaksic, H., Vause, T., Frijters, J.C., & Feldman, M. (2018). A comparison of a novel application of hierarchical linear modeling and nonparametric analysis for single-subject designs. Behav. Anal. Res. Prac. 18, 203.
Le Pelley, M.E. (2004). The role of associative history in models of associative learning: a selective review and a hybrid model. Q. J. Exp. Psychol. 57, 193.
Mackintosh, N.J. (1975). A theory of attention: Variations in the associability of stimuli with reinforcement. Psychol. Rev. 82, 276.
Mazur, J.E., & Hastie, R. (1978). Learning as accumulation: A reexamination of the learning curve. Psychol. Bull. 85, 1256.
McClelland, J.L., & Rumelhart, D.E. (Eds.) (1988). Explorations in Parallel Distributed Processing: A Handbook of Models, Programs, and Exercises. Cambridge, MA: MIT Press.
Merrill, M. (1931). The relationship of individual growth to average growth.
Hum. Biol. 3 , 37. Miller, R.R., Barnet, R.C., & Grahame, N.J. (1995). Assessment of the Rescorla–Wagner model.
Psychol. Bull. 117 , 363. Miller, R.R., Greco, C., & Vigorito, M. (1981). Classical conditioned tail flexion in rats: CR-contingent modification of US intensity as a test of the preparatory response hypothesis.
Anim. Learn. Behav. 9 , 80. Mosteller, F. (2006). S.E. Fienberg & D.C. Hoaglin (Eds.),
Selected Papers of Frederick Mosteller . New York, NY: Springer. Pearce, J.M., & Hall, G. (1980). A model for Pavlovian learning: variations in the effectiveness of conditioned but not of unconditioned stimuli.
Psychol. Rev. 87 , 532. Rescorla, R.A., & Wagner, A.R. (1972). A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement. In A.H. Black & W.F. Prokasy (Eds.),
Classical Conditioning II (pp. 64–99). New York, NY: Appleton-Century-Crofts. Sidman, M. (1952). A note on functional relations obtained from group data.
Psychol. Bull. 49 , 263. Smith, P.L., & Little, D.R. (2018). Small is beautiful: in defense of the small-N design.
Psychon. Bull. Rev. 25 , 2083. Spence, K. (1956).
Behavior Theory and Conditioning . New Haven, CT: Yale University Press. Wagner, A.R. (1981). SOP: a model of automatic memory processing in animal behavior. In N.E. Spear & R.R. Miller (Eds.),
Information Processing in Animals: Memory Mechanisms (pp. 5–47). Hillsdale, NJ: Erlbaum. Wagner, A.R., & Rescorla, R.A. (1972). Inhibition in Pavlovian conditioning: applications of a theory. In M.S. Halliday & R.A. Boakes (Eds.),
Inhibition and Learning (pp. 301–336). London, UK: Academic Press. Wagner, A.R., & Vogel, E.H. (2009). Conditioning: theories.
Encyclopedia of Neuroscience 3 , 49. Young, M.E. (2018). A place for statistics in behavior analysis.
Behav. Anal. Res. Prac. 18 , 193. Zelikowsky, M., & Fanselow, M.S. (2010). Opioid regulation of Pavlovian overshadowing in fear conditioning.