A model of discrete choice based on reinforcement learning under short-term memory
arXiv preprint [econ.EM]
Misha Perepelitsa ([email protected])
Department of Mathematics, University of Houston, 4800 Calhoun Rd., Houston, TX.
Abstract
A family of models of individual discrete choice is constructed by means of statistical averaging of choices made by a subject in a reinforcement learning process, where the subject has a short, k-term memory span. The choice probabilities in these models combine in a non-trivial, non-linear way the initial learning bias and the experience gained through learning. The properties of such models are discussed and, in particular, it is shown that the probabilities deviate from Luce's Choice Axiom, even if the initial bias adheres to it. Moreover, we show that the latter property is recovered as the memory span becomes large.

Two applications in utility theory are considered. In the first, we use the discrete choice model to generate a binary preference relation on simple lotteries. We show that the preferences violate the transitivity and independence axioms of expected utility theory. Furthermore, we establish the dependence of the preferences on frames, with risk aversion for gains and risk seeking for losses. Based on these findings we then propose a parametric model of choice based on the probability maximization principle, as a model for deviations from the expected utility principle. To illustrate the approach we apply it to the classical problem of demand for insurance.
Keywords:
Discrete choice models, Luce's choice axiom, reinforcement learning, expected utility principle
Preprint submitted to Elsevier, August 20, 2019

1. Introduction
The problem of choice is one of the fundamental problems in psychology, economics and behavioral biology. The second half of the last century saw a rapid growth of theoretical work on this subject as well as an increasing amount of experimental data. In economics, the field was dominated by expected utility theory (EUT), its critique based on the experimental evidence, and its ramifications. EUT was put forward by Von Neumann and Morgenstern (1947) as a mathematical formalization of what one can call rational preferences between contingent prospects. EUT is an axiomatic theory that starts out with postulates about preferences among prospects: the completeness, transitivity, continuity and independence (substitution) axioms. The theory derives a utility function u which assigns values to payoffs, and a random prospect X is ranked according to its expectation E[u(X)].

Psychology differs from economics in its approach to choice behavior by assuming a more general description of choices as being probabilistic and dependent on the set of alternatives offered to a subject. The axiomatic treatment of discrete choice was undertaken in a seminal work by Luce (1959), who introduced a choice axiom (Luce's choice axiom) that postulates how the probability to select an alternative from one set is related to the probability to select this alternative from a larger set. Luce's theory establishes the existence of a value function v on a finite set of alternatives T such that the probability to select i from a set S ⊆ T equals

\[ P_S(i) = \frac{v(i)}{\sum_{j \in S} v(j)}, \quad i \in S. \tag{1} \]

Psychologically interpreted, the value function (ratio scale) v(i) is a subject's response strength for alternative i.
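For concreteness, relation (1) amounts to a two-line computation; the numerical response strengths below are hypothetical:

```python
def lca_prob(i, S, v):
    """Luce's choice axiom (1): P_S(i) = v(i) / sum of v over S."""
    return v[i] / sum(v[j] for j in S)

v = {"a": 2.0, "b": 1.0, "c": 1.0}            # hypothetical response strengths
p_ab = lca_prob("a", {"a", "b"}, v)           # 2/3
p_abc = lca_prob("a", {"a", "b", "c"}, v)     # 1/2
# The odds P(a)/P(b) are the same in both choice sets: Arrow's
# independence of irrelevant alternatives, discussed below.
```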
Under Luce's choice axiom, choice probabilities verify the principle of independence of irrelevant alternatives of Arrow (1951), and thus, when the latter is normative, the axiom becomes a reasonable assumption. The utility function and the ratio scale function provide convenient tools for the analysis of decision making. Empirically, the values of these functions can be obtained by comparing pairs of alternatives.

While the first three axioms of EUT are generally accepted, the last one, the independence axiom, drew a significant amount of critique from experimentalists, starting with Allais (1953). Over the years several alternative utility theories were proposed that provide variants of expected utility without the independence axiom or with a weaker version of it. Among them are the generalized expected utility of Machina (1982), the weighted utility theory of Chew and MacCrimmon (1979), the regret theory developed independently by Bell (1982), Fishburn (1982), and Loomes and Sugden (1982), the rank dependent utility theory of Quiggin (1982, 1993), and the dual utility theory of Yaari (1987). Kahneman and Tversky (1979, 1984, 1992), based on their experimental findings, introduced framing effects, value functions, and probability weights into the analysis and incorporated them into prospect theory, which was later developed, using the approach of rank dependent utility theory, into cumulative prospect theory.

In the theory of discrete choice, Luce's choice axiom (LCA) is not a universal imperative either, and there are situations where it does not apply, as in the example provided by Debreu (1960).
This example was further developed by Tversky (1972), who attributed it to the similarity effect and proposed the elimination-by-aspects theory as a refinement of probabilistic decision making. A more detailed discussion of the validity of LCA can be found in Luce (1977), or in the more recent review of Pleskac (2013).

Let us now return to the work of Luce (1959) and mention another of its major contributions, this time to the field of reinforcement learning theories. Learning theories are concerned with subjects building their choice probabilities through experience, by adapting their responses according to received stimuli. Following Bush and Mosteller (1951, 1955), learning models were typically formulated in terms of the choice probabilities P_n(i) to select option i at time n, determined as a function of the probabilities from the last period and the outcome of some random event conditioned on the last choice. Luce (1957), citing the works of Thurstone (1930) and Gulliksen (1953), argued that learning must be formulated in terms of the strength of the response to each alternative, with the choice probabilities being dependent variables of the responses, for example, through relation (1). Several learning models of this type were proposed, typically with a linear law relating the responses to alternative i at times n and n − 1:

\[ U_n(i) = \mu U_{n-1}(i) + (1-\mu)\, u_n(i), \tag{2} \]

where u_n(i) is the response to the stimulus from the prior selection. This approach has been widely accepted by the scientific community and used in such fields as mathematical biology, game theory, and engineering; see, for example, Harley (1981), Roth and Erev (1995, 1998), Fudenberg and Levine (1998), Sutton and Barto (1998).

There seems to be a unanimous agreement about the general principles of reinforcement learning and, naturally, one can turn to it as a tool for constructing models of individual choice. In contrast with axiomatic approaches, one starts with a model of learning.
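The linear law (2) is an exponential smoothing of past responses; a minimal sketch, where the value of μ and the stimulus stream are illustrative:

```python
def update_response(U_prev, u_new, mu=0.9):
    """Linear learning law (2): U_n(i) = mu*U_{n-1}(i) + (1-mu)*u_n(i)."""
    return mu * U_prev + (1.0 - mu) * u_new

U = 0.0
for stimulus in (1.0, 1.0, 1.0):     # three identical unit reinforcements
    U = update_response(U, stimulus)
# U equals 1 - 0.9**3: the response approaches the stimulus level geometrically
```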
Its parameters should be found experimentally, but once fixed, the model serves as a "decision automaton" that generates a series of choices, or in mathematical language, a stochastic process on a suitable state space. The process is analyzed for convergence to some stationary process, as the number of learning periods increases. In the stationary process the choice probabilities might settle at certain values that change little as more and more choices are being made, but we will not force this assumption. In particular, the stationary process can be consistent with the case when a repeated series of positive experiences with a particular alternative increases the probability to select this alternative in a non-negligible way. What is needed for the theory is the existence of well-defined statistical averages for the choice probabilities. The latter are computed and recorded as the probabilities of a discrete choice model. The model can be effectively described as an "expected probability" choice model. However, except for some special cases that we will mention later, it is not an LCA-type model (1), nor any expected utility-type model.

The properties of such models are the focus of this work. To that end we proceed, first, with the mathematical framework. In section 2 we derive exact formulas for the choice probabilities for a finite set of alternatives, showing by this that our approach is computationally feasible. We introduce a class of k-term learning models, where in the process of learning a subject accounts only for the responses to the last k stimuli obtained for his/her actions. For example, with k = 1, the response strength to alternative i at period n is given by

\[ U_n(i) = U_0(i) + u_n(i), \quad i \in S, \tag{3} \]

where U_0(i) is a learning prior (bias) for alternative i, and u_n(i) is either zero, if alternative i was not selected at period n − 1, or, if it was, the response to the received (random) stimulus for that alternative.
Notice that the contribution of the learning prior U_0(i) does not diminish in the course of learning (as n increases), and it will enter into the formulas for the asymptotic probabilities. Another important assumption is implicitly included in (3): the initial priors U_0(i) do not depend on the subset S of alternatives offered to the subject. That is, at the beginning of the experiment the subject has choice probabilities verifying Luce's choice axiom. Modified by learning, they enter into an asymptotic, expected probability model, which, in general, loses that property.

In section 3 we turn to applications, and our choice here falls on deterministic binary preferences, since they take a prominent place in economics. We consider an individual presented with a set T of q alternatives. For every pair of alternatives S = {x, y} ⊆ T, the individual constructs choice probabilities for x and y from set S according to the learning process described above, where for simplicity we assume that the individual has a very short, one-period memory span (k = 1). A binary preference can be derived from a probabilistic choice model in many different ways, the most obvious being the trace relation, which defines x ≻ y iff the probability to choose x from the set {x, y} is greater than 1/2. Depending on the parameters of the model, we observe a wide range of behaviors. There are some extreme cases when the binary preferences are according to EUT, while, generically, the preferences are not transitive and violate the independence axiom. Intransitivity, in particular, implies that the choice probabilities from which the binary preferences were derived violate Luce's choice axiom.
There is more to it, however, as we show that the binary preferences are characterized by a "framing effect", with risk-averse preferences for gains and risk-seeking for losses, similar to the preferences in the prospect theory of Kahneman and Tversky (1979, 1984). The binary preferences are also shown to detect persistently better alternatives, by adhering to first order stochastic dominance.

Intransitivity of preferences and violations of the independence axiom are two phenomena that typically enter any set of empirical data. The fact that they are revealed in our choice model, combined with the fact that the model is founded on behavioral principles, warrants interest in the experimental verification of the model. This work, however, is limited only to the presentation of the model and its properties, not to establishing its empirical validity. In section 4 we introduce a parametric variant of the expected probability model, with the partial motivation of facilitating the task of performing statistical tests.

The material of sections 2 and 3 serves as a motivation for a model of "deviations from the EU principle", which we describe in section 4. The model is best described as a mediator between two expected utility principles. One is based on the maximization of the expected response E[u(X)]. The other is based on the minimization of

\[ \mathbb{E}\left[ \frac{1}{1 + e^{u(X)}} \right], \]

the quantity related to the expected probability. One can notice the dependence of this type of preferences on a "frame" through the shift of scales from u to u + u_0, which can change the ordering of preferences. Interestingly enough, it also establishes higher risk aversion for gains, and risk seeking for losses, even if u itself is risk averse (concave), the phenomenon that we have mentioned earlier. A generic case, described by formulas (8)–(10), combines the two types of EU principles into one. This case, however, is no longer an EU-type principle. At the end of section 4 we apply the model of "deviations from EUT" to the classical problem of determining the demand for insurance.
2. k-period reinforcement learning choice model
All reinforcement learning models have three ingredients in common: the reinforcement schedule, the response as a function of the stimulus, and the choice probabilities depending on the response strength. We will follow the approach of Luce (1959), according to which a subject has a mental record of responses to each choice alternative and updates them according to the realized reinforcement (stimulus) for the corresponding alternative. The subject then implements the choice through a subject-specific function that selects an alternative with probability proportional to the response strength for this choice.

Consider a succession of experiments in which a subject is offered a stimulus for an alternative he chooses. Let r_i^n be the reinforcement given to the subject at the n-th trial, when alternative i was chosen last. We assume that each r_i^n is sampled from a random variable R_i, independently from other alternatives and independently from one period to another. Reinforcement is measured in experiment-dependent units such as dollars, carrots, intensities of light signals, etc.

Let U_i^n be the total response strength for alternative i that the subject has at the end of trial n. It expresses the cumulative effect of the past reinforcements for alternative i on the subject's attitude toward this alternative on some internal scale.

Reinforcement-to-Response law: we will assume that the total response strength is additive over the incremental response strengths from each reinforcement. That is, there is a subject-specific function u = u(r) that maps the reinforcement value r into the subject's response scale, such that

\[ U_i^n = U_i^0 + \sum_{j \in [0..k-1]} u(r_i^{n-j}) \big/ N(n,i), \tag{4} \]

where we agree that if alternative i is not selected during period j, then u(r_i^j) = 0, and N(n,i) is the count of the number of times alternative i was selected during the last k trials.
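Formula (4) says the total strength is the prior plus the average response over those trials, within the k-step window, in which the alternative was actually chosen. A sketch with hypothetical values:

```python
from collections import deque

def response_strength(prior, window, alternative):
    """Total response strength (4) for one alternative: the learning prior
    plus the average response over the last k trials in which this
    alternative was selected.  `window` holds (choice, response) pairs."""
    hits = [resp for choice, resp in window if choice == alternative]
    if not hits:
        return prior
    return prior + sum(hits) / len(hits)

# memory span k = 3: alternative "a" was chosen twice, "b" once
window = deque([("a", 1.0), ("b", 0.5), ("a", 0.0)], maxlen=3)
Ua = response_strength(0.2, window, "a")   # 0.2 + (1.0 + 0.0)/2 = 0.7
Ub = response_strength(0.2, window, "b")   # 0.2 + 0.5/1 = 0.7
```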
A partial justification for the short-memory model is found in the results of Kahneman and Tversky (1979) in the context of utility theory. They report that choices are governed by the increments in the subject's wealth rather than by the total accumulated wealth. The law (4) expresses the balance of the contributions of the default response U_i^0 and of new experience to the total response strength, where the former has a non-diminishing contribution. (4) can be thought of as a balance between instinct and experience, or between things learned in childhood (the bias) and the present reinforcement.

Response-to-Probability law: the choice probability P_i^n for alternative i is determined through the relation

\[ P_i^n = \frac{\Phi(U_i^n)}{\sum_{j} \Phi(U_j^n)}, \tag{5} \]

where Φ = Φ(U) is a re-scaling of the subject's response values U into the range of strictly positive numbers. The function is selected to be non-decreasing to preserve the ordering of responses.

Consider now an experiment in which a subject responds according to the rule just described. An outside observer with a capacity for statistical computations will notice that the proportion of times each alternative i is selected becomes fixed with time and settles at a certain positive number \bar P_T(i). The appendix gives a formal mathematical account of this type of asymptotic behavior, together with the formulas for the probabilities {\bar P_T(i)}. When the experiment is repeated with T replaced by any subset S ⊆ T, it leads to probabilities {\bar P_S(i)} for alternatives to be selected from S. In this way we obtain a family of probability measures

\[ \bar P_S(R) = \frac{\sum_{j \in R} \bar P_S(j)}{\sum_{j \in S} \bar P_S(j)}, \quad R \subseteq S, \tag{6} \]

on subsets of T. We will refer to this set of probabilities as the RL(k) choice model. The properties of (6) are the main interest of the paper.
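The statistics of the RL(k) model can be estimated by direct simulation. The sketch below runs the k = 1 process with the logit scale Φ(U) = e^{U/β} used later in section 3; the priors, rewards and parameter values are illustrative:

```python
import math, random

def simulate_rl1(priors, reward_draw, beta=1.0, n_steps=200_000, seed=0):
    """Simulate the k = 1 learning process: the response strength is the
    prior plus the last reinforcement for the alternative chosen last;
    choices are made with probability proportional to exp(U/beta).
    Returns the empirical choice frequencies (the observer's P_T bar)."""
    rng = random.Random(seed)
    q = len(priors)
    counts = [0] * q
    last, last_reward = None, 0.0
    for _ in range(n_steps):
        strengths = list(priors)
        if last is not None:
            strengths[last] += last_reward        # law (3)/(4) with k = 1
        weights = [math.exp(s / beta) for s in strengths]
        total = sum(weights)
        r = rng.random() * total
        i, acc = 0, weights[0]
        while r > acc:                            # sample by relation (5)
            i += 1
            acc += weights[i]
        counts[i] += 1
        last, last_reward = i, reward_draw(i, rng)
    return [c / n_steps for c in counts]

# two alternatives with equal priors; alternative 0 always pays +1
freqs = simulate_rl1([0.0, 0.0], lambda i, rng: 1.0 if i == 0 else 0.0)
```

With these values the stationary frequency of the rewarded alternative is about 0.65 rather than the LCA value e/(1+e) ≈ 0.73 one would get from the post-learning strengths alone: the equilibrium mixes the prior and the experience.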
Their significance, hypothetically, arises from the fact that the subject may learn the statistics of his own behavior and use them the next time the choice is to be made. Or, he/she might run a quick mental "simulation" of the learning process and come up with the same choice probabilities even when presented with the choice once. Better yet, with the memory span k = 1, when the response is based only on the last reinforcement, the subject can "generalize" other people's one-time experiences with the alternatives as his/her own, in arriving at the probabilities \bar P_T.

These considerations warrant interest in the theoretical analysis of the family of probabilities (6). We start with a few simple observations. Consider the special case when the experiment provides no reinforcement whatsoever. Then the choice probabilities

\[ \bar P_S(R) = \frac{\sum_{j \in R} \Phi(U_j^0)}{\sum_{j \in S} \Phi(U_j^0)} \]

verify Luce's Choice Axiom. Another extreme case occurs when the memory span becomes increasingly large.

Consider the model RL(k) with increasingly large values of k. We proceed informally by noticing that the averages of the responses to reinforcement in the last k trials in (4) converge, by the law of large numbers, to the mean:

\[ \sum_{j \in [0..k-1]} u(r_i^{n-j}) \big/ N(n,i) \to \mathbb{E}[u(R_i)], \quad i = 1..q. \]

Thus for large values of k and n, the total response strength for alternative i is U_i^n ≈ \bar U_i = U_i^0 + E[u(R_i)], i.e., it changes little from trial to trial. In the limit we obtain constant response levels, and the corresponding choice probabilities become LCA probabilities with the value scales Φ(\bar U_i), i = 1..q. This is almost the expected utility principle. It in fact agrees with it if the priors U_i^0 are all equal, or if they are proportional to the corresponding expected response strengths E[u(R_i)].

3. Binary preference for lotteries

The probabilistic description of choice is more general than the algebraic (deterministic) one, and so there are many different ways in which the latter can be derived from the former.
Given a set of choice probabilities one can, for example, introduce a preference relation on alternatives by postulating some relation between the corresponding probabilities P_{i,j}(i) and P_{i,j}(j). One such preference is called the trace relation, which defines i ⪰ j if and only if P_{i,j}(i) ≥ P_{i,j}(j). It was shown by Luce (1959) that if the family of probabilities P_T verifies LCA, then the trace relation is a weak order.

In this section we consider the trace relation for alternatives that are monetary payoffs contingent on random events with objectively known probabilities. We will assume that the positive scale function is Φ(u) = exp(u/β), for some positive parameter β, that the response strength function u(s) equals the payoff s, and we denote by R_i the random payoff of alternative i and set the initial bias U_i = E[u(R_i)].

The rationale for making these selections is the following. Φ is the function from the logit probability model, which originated in random utility theory, see Marschak (1960), and is customarily used as a scale function in models of learning, see for example Fudenberg and Levine (1998). The response u is a linear function. We could have selected u = ar + b, a > 0, as a generic approximation of an arbitrary non-decreasing function, but in order to simplify the exposition we restrict the analysis to the case a = 1, b = 0. It is reasonable to assume that the initial bias U_i is some stochastic characteristic of the random payoff R_i, known a priori, and having the units of "utility" u(s); hence the choice of U_i as the expected "utility". In section 4 we consider a slightly more general model. Lastly, we restrict the analysis to the learning models with the shortest memory span, k = 1. Let
X, Y be two lotteries, and let \bar P_1, \bar P_2 be the equilibrium probabilities constructed from the model RL(1) with the set T consisting of these two alternatives. We say that X ≻ Y if and only if \bar P_1 > \bar P_2. Equivalently (see appendix), the preference is defined by the inequality

\[ \mathbb{E}\left[ \frac{e^{\mathbb{E}[Y]/\beta}}{e^{\mathbb{E}[Y]/\beta} + e^{(\mathbb{E}[X]+X)/\beta}} \right] < \mathbb{E}\left[ \frac{e^{\mathbb{E}[X]/\beta}}{e^{\mathbb{E}[X]/\beta} + e^{(\mathbb{E}[Y]+Y)/\beta}} \right]. \tag{7} \]

We define the relation X ⪰ Y when the inequality in (7) is not strict, and X ∼ Y when the expectations are equal. Notice that if the alternatives are restricted so that they have the same expected payoffs E[X] = const., then the binary preference ≻ corresponds to an EU principle with "utility" ũ(s) = (1 + e^{s/β})^{−1}, taken in the sense of minimizing its expectation. We will come back to this property in section 4.

Our first finding is that the relation ≻ is not transitive. The proof is postponed to the appendix, as it is based on some technical manipulations with integrals. The intransitivity of the trace relation in its turn implies that the choice probabilities \bar P_S violate Luce's choice axiom, see Luce (1959). Intransitivity of preferences is a severe obstacle to constructing any choice theory, and it limits their usefulness. However, the RL(1) model provides much more information than just the binary preferences. In fact, any finite number of lotteries can be evaluated and unambiguous rankings can be constructed. We will exploit this property in full in section 4.

The independence axiom for preferences states that if lottery L_1 ≻ L_2 and L_3 is any other lottery, then a compound lottery in which L_1 is mixed with L_3 in some proportion dominates L_2 mixed with L_3 in the same proportion. This is the axiom of expected utility that drew the earliest critique of the Neumann–Morgenstern theory, in particular by Allais (1953). The axiom requires the preference to be linear in the probability distribution. A quick look at formula (7) reveals that the binary preference relation depends in a non-linear way on the distributions of the random variables.
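Inequality (7) is straightforward to evaluate exactly for discrete lotteries. A sketch, with illustrative payoffs and β = 1:

```python
import math

def pref_score(X, Y, beta=1.0):
    """Both sides of inequality (7) for discrete lotteries given as
    (payoff, probability) lists; X is preferred to Y iff lhs < rhs."""
    EX = sum(p * s for s, p in X)
    EY = sum(p * s for s, p in Y)
    lhs = sum(p * math.exp(EY / beta) /
              (math.exp(EY / beta) + math.exp((EX + s) / beta)) for s, p in X)
    rhs = sum(p * math.exp(EX / beta) /
              (math.exp(EX / beta) + math.exp((EY + s) / beta)) for s, p in Y)
    return lhs, rhs

certain = [(0.5, 1.0)]                  # $0.5 for sure
risky = [(1.0, 0.5), (0.0, 0.5)]        # 50/50 bet with the same mean
lhs, rhs = pref_score(certain, risky)
# lhs < rhs: the certain payoff is preferred over the equal-mean bet,
# consistent with the risk aversion for gains established in Lemma 1
```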
It is not surprising that such preferences violate the independence axiom of expected utility theory. To illustrate this property we consider an Allais-type experiment. In what follows we say that the lottery L(a, b : p) pays $a with probability p and $b with probability 1 − p. We shorten the notation to L(c : 1) to describe the lottery with a 100% chance of a $c payoff.

We compare the lotteries L(c : 1) and L(1, 0 : x) according to (7). The result is represented graphically in figure 1. The red line divides the unit square into two parts: below the line, the certain bet L(c : 1) ≻ L(1, 0 : x); above it, L(1, 0 : x) ≻ L(c : 1); and the line itself is the indifference curve (certainty equivalent curve). Similar to the Allais experiment, we mix each lottery with an 80% chance of the lottery L(0 : 1), which pays $0 for sure. The new lotteries to compare are L(c, 0 : 0.2) and L(1, 0 : 0.2x), and the indifference curve is marked blue in figure 1. The independence axiom requires the same preference between the new lotteries as between the original ones. For the binary preferences in question, however, for all pairs (c, x) between the two curves in figure 1 the preferences are reversed.

Consider now the situation when the wealth increments are framed as losses (negative) or gains (positive). In this setting we are going to look at the problem of determining the certainty equivalent of probabilistic lotteries involving only positive or only negative payoffs. First, we consider the lottery L(1, 0 : x) that pays $0 with probability 1 − x and $1 with probability x. For such a lottery we find its certainty equivalent $c according to (7). Figure 2 shows (blue line) the corresponding indifference curve, in the first quadrant. Then, we consider the lotteries L(−1, 0 : x) that pay $0 with probability 1 − x and −$1 with probability x. The red line in the third quadrant is the indifference curve for losses. For comparison, we draw on the same figure the expected payoff curve E[L(1, 0 : x)], which is the same as the lottery weight x.

As seen from the figure, the indifference curve for the binary preferences lies above the expected payoff in the region of gains, and below it for losses. Thus, an agent is more risk averse, compared to the expected utility preferences, in gains, and more risk tolerant in losses. This example is a generic fact, as proved in the following
Lemma 1.
Let X be a random payoff of a lottery. Then, if X ≥ 0 (X ≤ 0), the certainty equivalent of X is less (greater) than the expected utility E[X].

Proof.
The certainty equivalent c of the lottery X in the binary preferences is defined by the equation

\[ \mathbb{E}\left[ \frac{e^{c/\beta}}{e^{c/\beta} + e^{(\mathbb{E}[X]+X)/\beta}} \right] = \frac{e^{\mathbb{E}[X]/\beta}}{e^{\mathbb{E}[X]/\beta} + e^{2c/\beta}}. \]

Consider first the gains: X ≥ 0. Let p(c) denote the left-hand side of the equation, and q(c) the right-hand side. We have p′(c) > 0 and q′(c) < 0. Moreover,

\[ p(\mathbb{E}[X]) = \mathbb{E}\left[ \frac{1}{1 + e^{X/\beta}} \right] \ge \frac{1}{1 + e^{\mathbb{E}[X]/\beta}} = q(\mathbb{E}[X]), \]

because the function (1 + e^{x/β})^{−1} is convex for x ≥ 0. Thus the point of intersection, c, of the graphs of p and q is not greater than E[X]. The case of losses is treated in the same way, using the fact that (1 + e^{x/β})^{−1} is concave for x ≤ 0.

It should be noted that the situation in figure 2 applies only to the two-valued lotteries described there. Since (7) is not an expected utility, no conclusions about other types of lotteries can be drawn from the graph of the certainty equivalents of such lotteries. In particular, the type of convexity of the certainty equivalent curve cannot be used to characterize the attitude toward risk under the preferences (7).
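Since p is increasing and q is decreasing in c, the defining equation has a unique root, which can be found by bisection. A sketch with illustrative two-valued lotteries and β = 1; the outputs land on the side of E[X] predicted by Lemma 1:

```python
import math

def certainty_equivalent(X, beta=1.0, tol=1e-10):
    """Solve p(c) = q(c) for the certainty equivalent of a discrete
    lottery X (list of (payoff, probability) pairs) by bisection;
    p increases and q decreases in c, so the root is unique."""
    EX = sum(p * s for s, p in X)
    def gap(c):
        p_c = sum(p * math.exp(c / beta) /
                  (math.exp(c / beta) + math.exp((EX + s) / beta)) for s, p in X)
        q_c = math.exp(EX / beta) / (math.exp(EX / beta) + math.exp(2 * c / beta))
        return p_c - q_c                    # increasing in c
    lo = min(s for s, _ in X) - 1.0
    hi = max(s for s, _ in X) + 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

gains = [(1.0, 0.5), (0.0, 0.5)]      # X >= 0: expect c < E[X] = 0.5
losses = [(-1.0, 0.5), (0.0, 0.5)]    # X <= 0: expect c > E[X] = -0.5
c_gain = certainty_equivalent(gains)
c_loss = certainty_equivalent(losses)
```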
We will show in this section that an individual using the binary preferences (7) can detect the prospect with persistently better payoffs than the other. This concept is formalized as first order stochastic dominance. Consider two lotteries X and Y. It is said that X stochastically dominates Y if, for any non-decreasing function u and any non-increasing function v,

\[ \mathbb{E}[u(Y)] \le \mathbb{E}[u(X)], \qquad \mathbb{E}[v(X)] \le \mathbb{E}[v(Y)]. \]

Lemma 2. If X stochastically dominates Y, then X ⪰ Y in the binary preferences (7).

Proof. Recall that X is better than Y if

\[ \mathbb{E}\left[ \frac{e^{\mathbb{E}[Y]/\beta}}{e^{\mathbb{E}[Y]/\beta} + e^{(\mathbb{E}[X]+X)/\beta}} \right] \le \mathbb{E}\left[ \frac{e^{\mathbb{E}[X]/\beta}}{e^{\mathbb{E}[X]/\beta} + e^{(\mathbb{E}[Y]+Y)/\beta}} \right]. \]

This inequality can be proved by using the definition of stochastic dominance (which implies, in particular, that E[X] ≥ E[Y]):

\[ \mathbb{E}\left[ \frac{e^{\mathbb{E}[Y]/\beta}}{e^{\mathbb{E}[Y]/\beta} + e^{(\mathbb{E}[X]+X)/\beta}} \right] = \mathbb{E}\left[ \frac{1}{1 + e^{(\mathbb{E}[X]-\mathbb{E}[Y])/\beta}\, e^{X/\beta}} \right] \le \mathbb{E}\left[ \frac{1}{1 + e^{(-\mathbb{E}[X]+\mathbb{E}[Y])/\beta}\, e^{X/\beta}} \right] \le \mathbb{E}\left[ \frac{1}{1 + e^{(-\mathbb{E}[X]+\mathbb{E}[Y])/\beta}\, e^{Y/\beta}} \right] = \mathbb{E}\left[ \frac{e^{\mathbb{E}[X]/\beta}}{e^{\mathbb{E}[X]/\beta} + e^{(\mathbb{E}[Y]+Y)/\beta}} \right]. \]

4. A model for deviations from the EU principle

In this section we slightly generalize the arguments presented above to introduce a parametric model for deviations of choices from EUT. Suppose that we are to choose among lotteries X_1, .., X_q offering random monetary payoffs. Let u be the subject's utility function, and let the subject form the choice probabilities based on the RL(1) model with learning priors U_i = E[u(X_i)], using u(X_i) to measure his/her responses. We introduce a parameter α > 0 and set

\[ U_i = \mathbb{E}[u(X_i)] + \alpha u(X_i), \quad \alpha > 0, \tag{8} \]

if i was selected last, and U_i = E[u(X_i)] otherwise. With the positive scale function Φ(u) = e^{u/β}, the equilibrium choice probabilities {\bar P(i)}_{i=1}^q are determined by formula (15) from the appendix:

\[ \bar P(i) = K_0\, e^{U_i/\beta} \left( \mathbb{E}\left[ \frac{K}{K + e^{(U_i + \alpha u(R_i))/\beta} - e^{U_i/\beta}} \right] \right)^{-1}, \quad i = 1..q, \tag{9} \]

where K = \sum_i e^{U_i/\beta} and K_0 is a positive constant. The choice is the lottery X_{i_0} which maximizes the probability:

\[ \bar P(i_0) = \max\{ \bar P(1), .., \bar P(q) \}. \tag{10} \]

Formula (9) has two parts. First, take α = 0. This corresponds to choices made on the basis of the EU principle according to E[u(R_i)]. Now let α, β → ∞ while keeping the ratio α/β fixed at 1. In this case the subject is not using priors, but only one-period experience. This is also described by another EU principle, based on the minimization of

\[ \mathbb{E}\left[ \frac{1}{1 + e^{u(R_i)}} \right]. \tag{11} \]

To illustrate the properties of such choices, consider, for example, a subject who uses the logarithmic function log(1 + s) for the response increments (reinforcements), but shifted so that the response is psychologically "framed" at some reference value u_0, with u < u_0 considered a loss and u > u_0 a gain. Thus,

\[ u(s) = \log(1+s) - u_0, \tag{12} \]

and without loss of generality we assume that u_0 = log(1 + s_0), for some positive s_0. We will consider simple lotteries with payoffs in the interval [0, 2s_0] and determine how the subject values them by the certainty equivalent. First we consider a lottery in the "gains" region [s_0, 2s_0]. The lottery pays $2s_0 with probability p and $s_0 with probability 1 − p. We parametrize p by a variable x expressed in the units of utility:

\[ p = \frac{x - u(s_0)}{u(2s_0) - u(s_0)}, \quad x \in [u(s_0), u(2s_0)]. \]

For such a lottery we determine its certainty equivalent c using (11) and (12), and mark the point (c, x) on the graph in figure 3. For comparison we plot the certainty equivalents of the same lotteries but using the expected utility E[u(X)], where u is from (12). In the latter case the certainty equivalent curve is simply the graph of u(s). For the region of losses, we repeat the construction using the lottery that pays $s_0 with probability p, and $0 with probability 1 − p, where

\[ p = \frac{x - u(0)}{u(s_0) - u(0)}, \quad x \in [u(0), u(s_0)]. \]

Figure 3 shows the certainty equivalents in this case as well. The certainty equivalent curve is below the graph of the utility for losses, and above it for gains. In between these two extremes, decision making based on (9)–(10) shows deviations from the EU principle of a kind similar to what was discussed in the previous sections for the example of linear utility u(s) = s. Now we apply the choice model (8)–(10) to the following classical problem.
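Formula (9) can be evaluated exactly for discrete lotteries; the normalizing constant K_0 drops out once the probabilities are normalized. A sketch with illustrative lotteries, linear utility and α = β = 1:

```python
import math

def rl1_choice_probs(lotteries, u=lambda s: s, alpha=1.0, beta=1.0):
    """Equilibrium probabilities of formula (9) for discrete lotteries
    given as lists of (payoff, probability) pairs; the constant K_0 is
    absorbed by the final normalization."""
    priors = [sum(p * u(s) for s, p in L) for L in lotteries]   # U_i = E[u(X_i)]
    K = sum(math.exp(U / beta) for U in priors)
    raw = []
    for U, L in zip(priors, lotteries):
        # E[ K / (K + e^{(U_i + alpha*u(R_i))/beta} - e^{U_i/beta}) ]
        denom = sum(p * K / (K + math.exp((U + alpha * u(s)) / beta)
                             - math.exp(U / beta)) for s, p in L)
        raw.append(math.exp(U / beta) / denom)
    total = sum(raw)
    return [r / total for r in raw]

# three lotteries: certain 0.5, a 50/50 bet on 0 or 1, certain 0.6
probs = rl1_choice_probs([[(0.5, 1.0)], [(0.0, 0.5), (1.0, 0.5)], [(0.6, 1.0)]])
best = max(range(3), key=lambda i: probs[i])   # probability maximization (10)
```

Note that the probabilities depend on the whole menu through K, so adding an alternative can change the relative standing of the others — the non-LCA behavior discussed above.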
Consider a person contemplating the purchase of insurance against a loss of $Δ that might occur next year with objectively known probability p. The next year's earnings will be $y. An insurance company offers protection against the loss at the actuarially fair premium δ = pΔ. We assume that the person can purchase any level of insurance $aΔ for the price of $aδ, with a ∈ [−1, 2]. We assume here that the person can overprotect (a > 1), or can actually borrow cash on the promise to return a part of the loss if it occurs (a < 0). The decision is thus a choice among the lotteries Y_a described in table 1.

    value of Y_a            probability
    y − apΔ                 1 − p   (no loss)
    y − Δ + (1 − p)aΔ       p       (loss)

    Table 1: Values of lottery Y_a.

Consider first the model (8) with α = 0 and the logarithmic utility u(s) = log(4 + s) − log 8. This is the EU principle of maximization of E[u(Y_a)]. For a concave utility u(s) the solution is always a = 1, for any level of income y, probability of loss p, and loss Δ > 0.

Consider next the model (8) with the ratio α/β fixed and α, β → +∞, and the same u(s). Notice that the income Y_a may now be framed as a loss or as a gain relative to the reference point. We take incomes y in a range around the reference point, and a loss Δ = 2, comparable to the income. The selection is now based on the minimization of a functional similar to (11). The solution is shown in figure 4. The figure shows the amount of insurance a one buys as a function of the level of income y and the loss probability p. Due to the different risk attitudes for losses and gains, the decision depends on the values of y and p. Notice that among all admissible values of a ∈ [−1, 2] only three are ever selected. That is, the choice undergoes a phase transition in the (y, p) values.

To illustrate the selection by the non-EU choice model, take β = 1 and a small positive α. The model applies only to a finite number of alternatives; in fact, it depends non-trivially on the number of alternatives through the parameter K in (9). We give the person the choice between levels of insurance from a = −1 to a = 2 with an increment of 0.5, totaling q = 7 choices. Figure 5 shows which one is selected depending on y and p. Notice that again the choice is mostly between the same three values of a as in the second case, while near the point of framing the solution is a = 1, as in the risk averse case. Thus the model shows deviations from EUT (a = 1) in the middle section of the figure, around the point of framing, with the width of this region depending on the parameters of the model.
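The limiting criterion can be sketched numerically. In the code below the parameter `kappa` stands for the fixed ratio α/β; its value 5, the incomes y = 1 and y = 6, and p = 0.1 are illustrative assumptions rather than the figures' parameters. The low income is framed as a loss (below the reference point s_0 = 4) and produces a risk-seeking choice; the high income produces the risk-averse full insurance a = 1:

```python
import math

def u(s):
    """Framed logarithmic utility (12) with reference u0 = log 8 (s0 = 4)."""
    return math.log(4.0 + s) - math.log(8.0)

def insurance_choice(y, p, kappa, delta=2.0,
                     levels=(-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0)):
    """Pick the insurance level a in [-1, 2] minimizing
    E[1/(1 + e^{kappa*u(Y_a)})], a criterion of the form (11) with
    kappa playing the role of alpha/beta; Y_a is the lottery of
    table 1 with actuarially fair premium p*delta."""
    def score(a):
        no_loss = y - a * p * delta                   # outcome with prob. 1 - p
        loss = y - delta + (1.0 - p) * a * delta      # outcome with prob. p
        return ((1.0 - p) / (1.0 + math.exp(kappa * u(no_loss))) +
                p / (1.0 + math.exp(kappa * u(loss))))
    return min(levels, key=score)

a_low = insurance_choice(y=1.0, p=0.1, kappa=5.0)    # income framed as a loss
a_high = insurance_choice(y=6.0, p=0.1, kappa=5.0)   # income framed as a gain
```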
5. Appendix
In this section we give a proof that, in the course of learning according to the process described in Section 2, the choice probabilities settle at some equilibrium values, and we provide the formulas for them. To simplify the presentation, but not the generality, we assume that choices are made from the set of all n alternatives.

The learning process can be described as a Markov chain on the finite state space S_k of the k most recent alternatives that a subject has selected, i.e.,

    S_k = {(i_1, .., i_k) : i_j ∈ [1..n]}.

We denote the stochastic process by X_n = (i^n_1, .., i^n_k). From formula (4) we see that the state X_{n+1} is completely determined by the current state X_n, i.e., {X_n}_{n=0}^∞ is a Markov chain.

We proceed with the computation of the transition probabilities from the state ī = (i_1, .., i_k) to the state m̄ = (m_1, .., m_k), which we denote by p(m̄ : ī). This probability is zero unless m_2 = i_1, m_3 = i_2, .., m_k = i_{k−1}. In the remaining cases, according to (4) and (5), for j = 1..n,

    p(j, i_1, i_2, .., i_{k−1} : i_1, .., i_k) = E[ Φ(Ũ_j) / Σ_{l=1}^n Φ(Ũ_l) ] > 0,    (13)

where

    Ũ_l = U_l + Σ_{m ∈ [1..k] : i_m = l} u(R_{i_m}) / N(l, i_1, .., i_k, R_{i_1}, .., R_{i_k}),

and N(l, i_1, .., i_k, r_{i_1}, .., r_{i_k}) is a random variable that counts the number of times alternative l has been selected, given the last k selections i_1, .., i_k and the last k reinforcements r_{i_1}, .., r_{i_k}. The expectation in (13) is with respect to the joint distribution of the independent random variables (R_{i_1}, .., R_{i_k}).

The Markov chain {X_n}_{n=0}^∞ is irreducible and all of its states are ergodic; see the monograph of Feller (1957) for the theory of Markov chains. This implies that the distribution of X_n converges to an invariant measure on S_k that we denote by µ. The probability that alternative i has been selected last is computed from µ by the formula

    P̄_T(i) = Σ_{i_2, .., i_k ∈ [1..n]} µ(i, i_2, .., i_k).
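As a sanity check on this construction, the following sketch computes the invariant measure for a small instance with one-step memory (k = 1), where the state space is just the set of alternatives. All specific ingredients here (n = 3, the scale Φ(u) = e^u, linear reinforcement utility u(r) = r, the initial biases, and the two-point reward distributions) are illustrative assumptions, not values from the paper. The expectation in (13) is computed exactly as a finite sum over the reward values of the last-chosen alternative, and the invariant measure is obtained by power iteration.

```python
import math

# Assumed ingredients (illustrative only): n = 3 alternatives, k = 1 memory,
# scale Phi(u) = exp(u), utility u(r) = r, and a two-point reward distribution
# for each alternative, given as (value, probability) pairs.
n = 3
U = [0.2, 0.0, -0.1]                     # initial learning biases U_l
rewards = [((1.0, 0.6), (0.0, 0.4)),
           ((0.8, 0.5), (0.1, 0.5)),
           ((1.2, 0.3), (0.2, 0.7))]
phi = math.exp

def transition(j, i):
    """p(j : i): the expectation (13) over the reward of the last-chosen
    alternative i, with the k = 1 update applied to U_i only."""
    total = 0.0
    for r, q in rewards[i]:
        Ut = [U[l] + (r if l == i else 0.0) for l in range(n)]
        total += q * phi(Ut[j]) / sum(phi(v) for v in Ut)
    return total

M = [[transition(j, i) for i in range(n)] for j in range(n)]

# Power iteration for the invariant measure: mu = M mu, normalized.
mu = [1.0 / n] * n
for _ in range(500):
    mu = [sum(M[j][i] * mu[i] for i in range(n)) for j in range(n)]
    s = sum(mu)
    mu = [x / s for x in mu]

# Fixed-point residual of mu = M mu.
residual = max(abs(sum(M[j][i] * mu[i] for i in range(n)) - mu[j])
               for j in range(n))
print(mu, residual)  # equilibrium choice probabilities and residual
```

For k > 1 the same recipe applies with states in S_k and the marginalization P̄_T(i) = Σ µ(i, i_2, .., i_k); the power iteration converges because the chain has strictly positive transition probabilities.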
For the learning model with k = 1 the computation of {P̄_T} is somewhat simplified. The details are presented below, as this is the case of principal interest in the paper.

For k = 1, S_1 = {1, .., n} is simply the set of alternatives, and the transition probabilities equal

    p(j : i) = E[ Φ(Ũ_j) / Σ_{l=1}^n Φ(Ũ_l) ] > 0,

where

    Ũ_l = U_i + u(R_i), if l = i;    Ũ_l = U_l, if l ≠ i.

The vector of the equilibrium measure µ = (µ(1), .., µ(n))^t is determined from the linear system

    µ = Mµ,    (14)

where the elements of the transition matrix M equal M_{j,i} = p(j : i). The i-th equation in this system reads:

    µ(i) = Σ_{k ≠ i} µ(k) E[ Φ(U_i) / (Σ_{l ≠ k} Φ(U_l) + Φ(U_k + u(R_k))) ]
         + µ(i) E[ Φ(U_i + u(R_i)) / (Σ_{l ≠ i} Φ(U_l) + Φ(U_i + u(R_i))) ].

It can be rearranged as

    µ(i) = Σ_k µ(k) E[ Φ(U_i) / (Σ_{l ≠ k} Φ(U_l) + Φ(U_k + u(R_k))) ]
         + µ(i) E[ (Φ(U_i + u(R_i)) − Φ(U_i)) / (Σ_{l ≠ i} Φ(U_l) + Φ(U_i + u(R_i))) ],

or as

    µ(i) (Φ(U_i))^(−1) E[ Σ_l Φ(U_l) / (Σ_{l ≠ i} Φ(U_l) + Φ(U_i + u(R_i))) ] = Σ_k µ(k) E[ 1 / (Σ_{l ≠ k} Φ(U_l) + Φ(U_k + u(R_k))) ].

The right-hand side does not depend on i; we denote it by K_1. We also set K_2 = Σ_l Φ(U_l). Then the last equation provides the formula

    µ(i) = K_1 Φ(U_i) ( E[ K_2 / (K_2 + Φ(U_i + u(R_i)) − Φ(U_i)) ] )^(−1), i = 1, .., n.    (15)

The set of equilibrium choice probabilities {P̄_T} is the same as the measure µ. If only two alternatives are present, T = {1, 2}, system (14) can be reduced to a single equation, since µ(1) + µ(2) = 1:

    µ(1) E[ Φ(U_2) / (Φ(U_2) + Φ(U_1 + u(R_1))) ] = µ(2) E[ Φ(U_1) / (Φ(U_1) + Φ(U_2 + u(R_2))) ],    (16)

which we use in applications to compute the ratio µ(1)/µ(2).

Lemma 3.
For the binary preferences defined in (7), there are lotteries X, Y, Z such that X ≻ Y, Z ≻ X, but Y ≻ Z.

Proof. We will show that there are X, Y, Z such that X ⪰ Y, Z ⪰ X, but Y ≻ Z, i.e., that ⪰ is not transitive. The lotteries X, Y, Z can then be suitably perturbed to show that ≻ is not transitive as well.

Let X, Y be two lotteries such that E[X] = E[Y] = 0,

    E[ 1/(1 + e^X) ] = E[ 1/(1 + e^Y) ],    (17)

and

    E[ e^X/(1 + e^X)^2 ] > E[ e^Y/(1 + e^Y)^2 ].    (18)

Let Z̄ be a lottery with E[Z̄] = 0 and

    E[ 1/(1 + e^Z̄) ] > E[ 1/(1 + e^X) ].    (19)

First we establish the following

Claim 1. For any number z_0 > 0 there is z ∈ (0, z_0) and a lottery Z such that E[Z] = z, and

    E[ 1/(1 + e^{z + Z̄}) ] = E[ 1/(1 + e^{−z + X}) ].    (20)

Proof. Consider the functions f(z) = E[1/(1 + e^{z + Z̄})] and g(z) = E[1/(1 + e^{−z + X})]. From (19) we have f(0) > g(0). Moreover, f(z) is a monotone function with f′(0) < 0, and g(z) is monotone with g′(0) > 0. If Z̄ is chosen sufficiently close to X, the graphs of f(z) and g(z) will intersect at some point in the interval (0, z_0). For such a point, call it z, and the lottery Z = z + Z̄, equation (20) holds.

Condition (17) means that X ⪰ Y. Consider now the functions f(z) = E[1/(1 + e^{−z + X})] and g(z) = E[1/(1 + e^{−z + Y})]. Conditions (17), (18) imply that there is z_0 > 0 such that for all s ∈ (0, z_0),

    E[ 1/(1 + e^{−s + X}) ] > E[ 1/(1 + e^{−s + Y}) ].

Given this z_0, and using the claim, we find z ∈ (0, z_0) and a lottery Z such that (20) holds, which implies that Z ⪰ X. On the other hand,

    E[ 1/(1 + e^{−z + X}) ] > E[ 1/(1 + e^{−z + Y}) ].

Thus,

    E[ 1/(1 + e^{z + Z̄}) ] > E[ 1/(1 + e^{−z + Y}) ].

By formula (7) this implies that Y ≻ Z (recall that E[Z] = z, E[Y] = 0), while (17) establishes that X ⪰ Y.

References

[1] Allais, M. (1953). Le comportement de l'homme rationnel devant le risque: critique des postulats et axiomes de l'école américaine. Econometrica 21.
[2] Arrow, K. (1951). Social Choice and Individual Values. Wiley & Sons, New York, NY.
[3] Bush, R.R., and Mosteller, F. (1951). A mathematical model for simple learning. Psychological Review 58, 313–323.
[4] Bush, R.R., and Mosteller, F. (1955). Stochastic Models for Learning. Wiley & Sons, New York, NY.
[5] Erev, I., and Roth, A.E. (1996). On the need of low rationality cognitive game theory: reinforcement learning in experimental games with unique mixed equilibria. Mimeo, University of Pittsburgh.
[6] Erev, I., and Roth, A.E. (1998). Predicting how people play games: reinforcement learning in experimental games with unique, mixed strategy equilibrium. American Economic Review 88, 848–881.
[7] Feller, W. (1957). An Introduction to Probability Theory and Its Applications, Vol. II. John Wiley & Sons, New York, NY.
[8] Fudenberg, D., and Levine, D. (1998). The Theory of Learning in Games. MIT Press, Cambridge, MA; London, England.
[9] Harley, C.B. (1981). Learning the evolutionary stable strategy. J. Theor. Biol. 89, 611–633.
[10] Kahneman, D., and Tversky, A. (1979). Prospect theory: an analysis of decision under risk. Econometrica 47, 263–291.
[11] Kahneman, D., and Tversky, A. (1984). Choices, values and frames. American Psychologist 39, 341–350.
[12] Marschak, J. (1960). Binary choice constraints on random utility indicators. In K. Arrow (ed.), Stanford Symposium on Mathematical Models in the Social Sciences, Stanford University Press, Stanford, CA.
[13] Luce, R.D. (1959). Individual Choice Behavior. Wiley & Sons, New York, NY.
[14] Luce, R.D. (1977). The choice axiom after twenty years. Journal of Mathematical Psychology 15, 215–233.
[15] Machina, M.J. (1982). "Expected utility" analysis without the independence axiom. Econometrica 50(2), 277–323.
[16] Quiggin, J. (1982). A theory of anticipated utility. Journal of Economic Behavior and Organization 3(4), 323–343.
[17] Quiggin, J. (1993). Generalized Expected Utility Theory: The Rank-Dependent Model. Kluwer Academic Publishers, Boston, MA.
[18] Pleskac, T. (2015). Decision and choice: Luce's choice axiom. In P. Bona (ed.), International Encyclopedia of the Social & Behavioral Sciences.
[19] Roth, A.E., and Erev, I. (1995). Learning in extensive-form games: experimental data and simple dynamics models in the intermediate term. Games and Economic Behavior 8, 164–212.
[20] Sutton, R.S., and Barto, A.G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.
[21] Tversky, A., and Kahneman, D. (1992). Advances in prospect theory: cumulative representation of uncertainty. Journal of Risk and Uncertainty 5, 297–323.
[22] Von Neumann, J., and Morgenstern, O. (1947). Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ.
[23] Yaari, M. (1987). The dual theory of choice under risk. Econometrica 55, 95–115.

Figure 1: Allais paradox (axes: lottery weight x vs. certainty equivalent c). The red line is the certainty equivalent for lottery L(1, x); the blue line represents the points (c, x) at which the corresponding scaled-down lotteries are equivalent. The region between the two curves is the set of points for which the Allais paradox holds. RL(1) model with positive scale function Φ(u) of exponential form and learning priors U = E[L(1, x)] = x.

Figure 2: Certainty equivalent curves for simple lotteries over losses and gains according to EUT (blue) and the binary preferences from the RL(1) model (red), plotted as utility against payoff in $. Positive scale function Φ(u) of exponential form.
Figure 3: Certainty equivalent curves for simple lotteries over losses and gains according to EU (blue) and model (8)–(10) (red), plotted as utility against payoff in $. Utility u(c) = log(1 + c) − log 3; the transition from losses to gains occurs at c = 2. Positive scale function Φ(u) of exponential form.

Figure 4: Demand for insurance I. The figure shows the fraction a of purchased insurance (in color) for every pair (y, p) of income level and loss probability, according to model (8)–(10) with α, β → +∞ at a fixed ratio α/β.

Figure 5: Demand for insurance II. The figure shows the fraction a of purchased insurance (in color) for every pair (y, p) of income level and loss probability, according to model (8)–(10) with finite parameters, α ∈ (0, 1) and β = 1.
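Returning to the intransitivity established in Lemma 3, the cycle can also be exhibited with explicit discrete lotteries. The sketch below is a hypothetical illustration: the lottery values were chosen by hand, and the comparison rule is a reconstruction of (7) from the two-alternative equilibrium (16) with Φ(u) = e^u, priors U_i equal to the lottery means, and linear reinforcement utility, so that A is preferred to B when E[1/(1 + e^{B − E[A]})] > E[1/(1 + e^{A − E[B]})]. That exact form is an assumption, not a quotation of the paper's formula.

```python
import math

def W(A, other_mean):
    """E[ 1/(1 + e^{A - other_mean}) ] for a discrete lottery A = [(value, prob), ...]."""
    return sum(q / (1.0 + math.exp(v - other_mean)) for v, q in A)

def mean(A):
    return sum(v * q for v, q in A)

def prefers(A, B):
    """A preferred to B: mu(A) > mu(B) in the two-alternative equilibrium (16),
    assuming Phi(u) = e^u, priors equal to the lottery means, linear u."""
    return W(B, mean(A)) > W(A, mean(B))

# A hand-picked triple exhibiting the cycle (values are illustrative):
X = [(0.0, 1.0)]                        # the certain outcome 0
Y = [(1.99, 0.5), (-2.01, 0.5)]         # near-symmetric coin flip, mean -0.01
Z = [(3.065, 0.1), (-0.268333, 0.9)]    # skewed lottery, mean 0.065

print(prefers(X, Y), prefers(Z, X), prefers(Y, Z))  # True True True: a cycle
```

The construction mirrors the proof: Y is a slightly unfavorable symmetric spread of X (so X ≻ Y by (17)–(18)), while Z trades a small positive mean for a skewed shape, beating X pointwise in the comparison functional yet losing to Y.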