Efficiency of learning vs. processing: Towards a normative theory of multitasking
Yotam Sagiv ([email protected]), Sebastian Musslick ([email protected]), Yael Niv ([email protected]), Jonathan D. Cohen ([email protected])
Princeton Neuroscience Institute, Princeton University
Abstract
A striking limitation of human cognition is our inability to execute some tasks simultaneously. Recent work suggests that such limitations can arise from a fundamental tradeoff in network architectures that is driven by the sharing of representations between tasks: sharing promotes quicker learning, at the expense of interference while multitasking. From this perspective, multitasking failures might reflect a preference for learning efficiency over multitasking capability. We explore this hypothesis by formulating an ideal Bayesian agent that maximizes expected reward by learning either shared or separate representations for a task set. We investigate the agent's behavior and show that over a large space of parameters the agent sacrifices long-run optimality (higher multitasking capacity) for short-term reward (faster learning). Furthermore, we construct a general mathematical framework in which rational choices between learning speed and processing efficiency can be examined for a variety of different task environments.
Keywords: multitasking; cognitive control; Bayesian inference; capacity constraints
Introduction
The human brain's ability to simultaneously perform distinct tasks contains a curious tension. On one hand, we are able to concurrently carry out a large number of actions (e.g. breathe, speak, chew gum, etc.) seemingly without exerting any effort. In contrast, some behaviors defy parallel execution (e.g. solving calculus problems and constructing shopping lists) and require serialization to successfully execute.

The distinction between sets of tasks that can be executed concurrently and those that cannot is often referred to in terms of a fundamental distinction between controlled and automatic processing (Posner & Snyder, 1975; Shiffrin & Schneider, 1977). Early theories attributed the inability to carry out multiple control-demanding tasks in parallel to reliance on a single, limited-capacity, serial processing mechanism – a hypothesis that has continued to dominate major theories of cognition (e.g., Anderson, 2013). The "multiple-resource hypothesis" presents a challenge to this view, arguing that multitasking limitations may reflect competition for the use of local resources (e.g., shared task-specific representations) by sets of tasks, rather than common reliance on a central control mechanism (Allport, Antonis, & Reynolds, 1972; Feng, Schwemmer, Gershman, & Cohen, 2014; Navon & Gopher, 1979; Meyer & Kieras, 1997; Musslick et al., 2016; Salvucci & Taatgen, 2008; Wickens, 1991). Under this view, the role of cognitive control is to resolve such conflicts when they arise by limiting processing to only a single task at a time (Cohen, Dunbar, & McClelland, 1990; Botvinick, Braver, Barch, Carter, & Cohen, 2001). That is, limiting processing is the purpose of control, rather than a reflection of a constraint on the control system itself. Recent computational work has provided a formal grounding for this argument, showing that even modest amounts of overlap between task representations can drastically limit the number of tasks a network can engage at the same time without invoking interference among them (Feng et al., 2014; Musslick et al., 2016; Petri et al., 2020). Critically, this number appears to be relatively insensitive to the size of the network.

The findings above raise an important question: insofar as shared representations between tasks impose limitations on multitasking, why would a neural system prefer shared representations over separate ones? Insights into this question can be gained from the machine learning literature, where the learning of shared representations between tasks is considered a desirable outcome (Baxter, 1995; Caruana, 1998; Bengio, Courville, & Vincent, 2013). For instance, work on multi-task learning suggests that shared representations between tasks promote faster learning, as well as better generalization performance across tasks (Caruana, 1997; Collobert & Weston, 2008). (Note that the term "multi-task" differs from the term "multitasking": the former refers to the paradigm of training the same network on multiple tasks, whereas the latter refers to the process of carrying out multiple tasks concurrently.) Moreover, learning dynamics in neural networks themselves promote the learning of shared representations based on shared structure in the task environment (Hinton, 1986; Saxe, McClelland, & Ganguli, 2013; Musslick et al., 2017). Thus, there appears to be a fundamental tradeoff in neural networks between the efficiency of learning (and generalization) on the one hand, and the efficiency of processing (i.e., multitasking capability) on the other (Musslick et al., 2017).

The tradeoff between learning and processing efficiency constitutes an optimization problem that is dependent on the demands of the task environment.
The work described here examines this optimization problem as a function of critical parameters, such as the differences in rate of learning for shared vs. separated representations, and the benefits gained by parallel over serial task performance. Analysis of this problem may help provide a formally rigorous, and even normative, account of longstanding, well-characterized psychological phenomena, such as the common trajectory in skill acquisition from controlled to automatic processing (Shiffrin & Schneider, 1977; Logan, 1980).

Ideally, our analysis would build on a formal characterization of the learning rate for different types of representations, given a specified learning algorithm (e.g. backpropagation). However, since this is not immediately available, to construct a probabilistic generative model we begin by assuming simple functional forms for the learning trajectory associated with shared vs. separated task representations in a multitasking environment, and then use the generative model to define an ideal Bayesian agent that behaves optimally inside that environment. Taken together, the environment and agent models provide a simple, normative framework in which questions about the learning-processing tradeoff can be explored.

A rational model of multitasking
We begin our analysis of the optimal balance between learning and processing efficiency by formalizing the task environment. We then describe how the agent model chooses between the use of shared vs. separate representations in that environment to optimize performance, which we define as maximizing reward over the entire horizon of performance.
Task Environment
We consider an environment in which a task can be defined as a process (e.g. naming the color of a stimulus) that maps the dimension of a stimulus (e.g. color) to a particular response dimension (e.g. verbal response). Here we assume that stimuli consist of $N$ dimensions (e.g. color, shape, and texture) and that responses are carried out over $K$ response dimensions (e.g. naming, pointing, or looking), resulting in $NK$ possible tasks in any environment. We adopt a formal definition of multitasking from earlier work (Musslick et al., 2016; Alon et al., 2017; Lesnick, Musslick, Dey, & Cohen, 2020), in which a multitasking condition is defined as the requirement to execute multiple tasks at the same time, none of which share a stimulus or response dimension. Consequently, at most $\min\{N, K\}$ tasks can be carried out concurrently.

The agent is asked to optimize performance over a series of $\tau$ multitasking trials. On each trial, the agent is asked to perform $\alpha$ tasks, where $\alpha$ is drawn from a latent multinomial distribution. We introduce multitasking pressure by specifying a reward schedule that favors concurrent performance of tasks. For every task answered correctly, the agent receives 1 unit of reward, resulting in $\alpha$ rewards if the agent is able to perform all tasks with maximal accuracy at the same time. However, if the agent chooses instead to perform all tasks sequentially, it loses $jC$ reward units on task $j$, where $j$ indexes the tasks from 0 to $\alpha - 1$ (resulting in $\sum_{j=0}^{\alpha-1} (1 - jC)$ rewards given maximal accuracy). $C$ is termed the "serialization cost" or "time cost". We note that this reward schedule is chosen largely for analytical convenience, and is not itself based on a particular normative principle or property of the environment. One alternative could be to set a penalty for serialized execution based on the opportunity cost per time-step. We will extend our results to arbitrary reward schedules in a later section.

Optimization is defined as the choice, on each trial, of a performance strategy that maximizes total future reward; that is, reward summed over the current trial and the potentially discounted reward anticipated for each future trial. This requires estimating and convolving the expected multitasking requirements over trials, the performance achieved by executing the tasks concurrently vs. individually as a function of the estimated learning rate for each strategy (see below), and the serialization costs associated with performing tasks sequentially.
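To make the reward schedule concrete, here is a minimal sketch (in Python; not from the paper's codebase) of the payoff for serial vs. concurrent execution of $\alpha$ tasks, assuming maximal accuracy on every task:

```python
import numpy as np

def serial_reward(alpha, C):
    """Total reward for performing alpha tasks one at a time.

    Task j (j = 0, ..., alpha - 1) earns 1 - j*C: the later a task is
    executed, the more the serialization cost C erodes its reward.
    """
    j = np.arange(alpha)
    return float(np.sum(1.0 - j * C))

def parallel_reward(alpha):
    """Total reward for performing all alpha tasks concurrently."""
    return float(alpha)

# With alpha = 3 and C = 0.2, serial execution earns
# (1 - 0) + (1 - 0.2) + (1 - 0.4) = 2.4, versus 3.0 in parallel.
print(serial_reward(3, 0.2), parallel_reward(3))
```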
Agent

The agent is considered to be a rational decision-maker that chooses between two independent, trainable processing strategies that result from two extremes of how multiple tasks can be represented in a single network (see Figure 1). The first representational strategy is a minimal basis set, in which all tasks relying on the same stimulus dimension encode the stimuli using the same (shared) set of hidden representations (i.e. $N$ sets of hidden representations) that are then mapped to the output dimensions for each of the tasks. The second strategy uses tensor product representations, in which each task encodes its stimuli using its own set of (separated) hidden representations (resulting in $NK$ sets of hidden representations) that are mapped to the output dimension for the task. While the minimal basis set provides a more efficient encoding of the stimuli, it does not permit multitasking, since the use of shared representations introduces crosstalk between any pair of simultaneously activated tasks (Feng et al., 2014; Musslick et al., 2016; Alon et al., 2017). Thus, use of the minimal basis set forces a serialization cost of $jC$ reward units for task $j = 0, 1, \ldots, \alpha - 1$. Conversely, the tensor product representation permits multitasking without interference, since each task is assigned its own set of hidden representations that comprise independent processing pathways in the network. We assume that the agent has the potential to develop both forms of representation, but these must be learned.

Figure 1: Schematic of network schemes that maximize representation overlap (a) vs. multitasking capability (b). C, S, T designate the stimulus dimensions ("color", "shape", and "texture"), while W, K, P designate the response dimensions ("word", "keyboard", "point"). The hidden-layer representation of the stimulus in (a) is shared for all three tasks involving the same input dimension (minimal basis set representation), whereas in (b) a separate hidden-layer representation is dedicated to each task (tensor product representation).

Previous work has shown that, for a set of tasks that are in principle multitaskable, training using shared representations (such as a minimal basis set) leads to faster acquisition than learning separate representations for each task (such as a tensor product), as the former enables the sharing of learning signals across tasks (Musslick et al., 2017). We implement these effects by assuming that 1) the agent learns these two types of representations (i.e. processing strategies) by selecting and executing one or the other on each trial; 2) performance for each strategy improves as a function of the number of trials on which it is selected; and 3) learning is faster for the minimal basis set strategy than for the tensor product strategy, as described below.

To model the learning of tasks, we define a probability of success function (aka "training function") for each of the two processing strategies. Let $f_B, f_T : \mathbb{N}_{\geq 0} \to [0, 1]$ denote these training functions for the minimal basis set and tensor product strategies, respectively. These serve as explicit characterizations of the agent's learning dynamics; $f_X(t)$ implements the learning curve by evaluating the probability of success on a given task after representation $X$ has been selected $t$ times. That is, every time the agent chooses to process the tasks in the trial using strategy $X$, the success probability for the task under strategy $X$ increases for the next time-step. More formally, let $x_1, x_2, \ldots, x_{t-1}$ be the sequence of strategy selections on the $t - 1$ preceding trials. Then the probability of success under strategy $X$ on a task on trial $t$ is:

$$P_X(\text{success on a task in trial } t) = f_X\left(\sum_{i=1}^{t-1} \mathbb{1}[x_i = X]\right) \qquad (1)$$

For convenience, we use the logistic function $f_X(t \mid k_X, t_X) = \frac{1}{1 + e^{-k_X(t - t_X)}}$. However, our analysis applies to any learning function that is monotonically increasing and bounded, $0 \leq f_X(t) \leq 1$ for all $t$.
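The training functions are easy to state in code. The sketch below (with hypothetical slope and midpoint values; the analysis only requires monotonicity and boundedness) implements the logistic form and the selection-count bookkeeping of Equation (1):

```python
import numpy as np

def make_logistic_training_fn(k, t_mid):
    """f_X(t | k, t_mid) = 1 / (1 + exp(-k * (t - t_mid))): probability
    of succeeding at a task after the strategy has been selected t times."""
    return lambda t: 1.0 / (1.0 + np.exp(-k * (t - t_mid)))

# Hypothetical parameters: the basis set reaches its midpoint sooner
# (t_B < t_T), implementing faster learning for shared representations.
f_B = make_logistic_training_fn(k=0.25, t_mid=10.0)  # minimal basis set
f_T = make_logistic_training_fn(k=0.25, t_mid=25.0)  # tensor product

# Per Equation (1), success probability depends only on how many times
# the strategy has been selected on the preceding trials.
history = ["B", "B", "T", "B"]        # strategy chosen on past trials
n_B = sum(x == "B" for x in history)  # selection count for the basis set
print(f_B(n_B))                       # P_B(success) on the next trial
```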
As noted above, we assume that learning occurs at a faster rate for the minimal basis set strategy as compared to the tensor product strategy, and examine the influence of this difference by exploring a range of values for $k_X, t_X$ that together determine the rate of learning.

The agent uses standard Bayesian machinery to infer the expected reward under each representation, and then selects the representation that maximizes total discounted future reward. Specifically, let $E_X[R]$ denote the expected reward for strategy $X$, $E_X[R \mid t]$ denote the expected reward on trial $t$, and $\mu(t)$ be the temporal discounting function. Then we have that $E_X[R] = \sum_{t=1}^{\tau} \mu(t)\, E_X[R \mid t]$. Though temporal discounting can be irrational in many contexts, we note that a fully rational agent can be achieved with $\mu(t) = 1$. Recall that $\alpha$ is the randomly assigned number of tasks required to be performed on a given trial. By marginalizing over $\alpha$, we get that the expected reward on each individual trial is $E_X[R \mid t] = \sum_{i=1}^{\min\{N,K\}} P(\alpha = i)\, E_X[R \mid t, \alpha = i]$. Thus, the expected rewards for the minimal basis set and tensor product strategies correspond to:

$$E_B[R \mid t] = \sum_{i=1}^{\min\{N,K\}} P(\alpha = i) \sum_{j=0}^{i-1} P_B(\text{success} \mid t)\,(1 - jC)$$
$$E_T[R \mid t] = \sum_{i=1}^{\min\{N,K\}} P(\alpha = i) \sum_{j=0}^{i-1} P_T(\text{success} \mid t)\,(1) \qquad (2)$$

In order to compute the expected reward terms in Equation (2), the agent must be able to evaluate $P(\alpha = i)$ and $P_X(\text{success} \mid t)$ by inferring the multinomial task distribution, as well as the parameters of each training function $f_X$. The first can be inferred using Bayes' theorem, by keeping track of the number of times each particular $\alpha$ value was seen, in conjunction with a Dirichlet prior (we start from a uniform prior, implying absence of a strong a priori belief about the distribution).

Inferring the parameters for the two training functions $f_B, f_T$ can similarly be done by tracking the history of successes and failures and then performing a Bayesian logistic regression (intuitively, this can be understood as the agent inferring how fast it will learn). In this model, $k_X$ and $t_X$ have independent normal priors centered on their true values with high variance. Finally, we assume that the agent already knows $\tau$, the sequential processing cost $C$, and the temporal discounting function $\mu(t)$.

Once the expected values are computed, the agent must select an action. We assume this is done using a standard explore-exploit algorithm, the $\varepsilon$-greedy rule, in which the agent picks the action associated with the greatest value with probability $1 - \varepsilon$, and uniformly otherwise.
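As a concrete illustration of Equation (2) and the $\varepsilon$-greedy rule, here is a minimal sketch (the success probabilities and task distribution are illustrative placeholders, not values from the paper):

```python
import numpy as np

def expected_reward_per_trial(p_success, p_alpha, C, serial):
    """Equation (2): expected reward on one trial for a strategy whose
    current per-task success probability is p_success.

    p_alpha[i - 1] = P(alpha = i). Serial execution pays 1 - j*C on
    task j; concurrent execution earns 1 per task.
    """
    total = 0.0
    for i, p in enumerate(p_alpha, start=1):
        per_task = (1.0 - C * np.arange(i)) if serial else np.ones(i)
        total += p * p_success * per_task.sum()
    return total

def epsilon_greedy(values, eps, rng):
    """Pick the highest-valued action with probability 1 - eps,
    otherwise pick uniformly at random."""
    if rng.random() < eps:
        return int(rng.integers(len(values)))
    return int(np.argmax(values))

# Illustrative values only: a well-trained basis set vs. a half-trained
# tensor product, under a uniform task distribution over alpha = 1..4.
rng = np.random.default_rng(0)
p_alpha = [0.25, 0.25, 0.25, 0.25]
v_B = expected_reward_per_trial(0.9, p_alpha, C=0.3, serial=True)
v_T = expected_reward_per_trial(0.6, p_alpha, C=0.3, serial=False)
print(epsilon_greedy([v_B, v_T], eps=0.1, rng=rng))  # 0 = basis, 1 = tensor
```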
Formal analysis of equilibrium

We begin by analyzing an agent that has perfect knowledge about the task environment and learning rate, in order to assess performance independently of noise that might be generated by an inference process over these factors. This allows us to analytically derive equilibrium conditions under which the agent should be indifferent between the minimal basis set and the tensor product strategies. For this section, we let $N < K$, so that $N = \min\{N, K\}$, without loss of generality.

Observe that the expressions in Equation (2) reduce to:

$$E_B[R \mid t] = f_B(t)\, E[g(\alpha, C)], \qquad E_T[R \mid t] = f_T(t)\, E[\alpha] \qquad (3)$$

where $g(i, C) = \sum_{j=0}^{i-1}(1 - jC)$. Note that $g(i, C)$ encodes the amount of reward accrued by the agent for completing $i$ tasks in a serial fashion with time cost $C$. Plugging Equation (3) into the expression for the total expected reward of both strategies, we can express the condition under which the agent should be indifferent between them:

$$\frac{E[\alpha]}{E[g(\alpha, C)]} = \frac{\sum_{t=1}^{\tau} \mu(t) f_B(t)}{\sum_{t=1}^{\tau} \mu(t) f_T(t)} \qquad (4)$$

An interesting property of this result is that agent-related and environmental parameters are analytically separable. Observe that the expectation terms on the left correspond to the agent's expected reward at asymptotic performance levels, and that the sum terms on the right denote the number of expected successes in a critical time period specified by the conjunction of the temporal discounting function and the training function. The indifference point can be understood intuitively as a surface over which the ratio of expected eventual rewards is equal to the ratio of times at which they are likely to be accrued (discounted by time). That is, the left side contains the ratio of the rewards the agent expects to earn if it is always correct, whereas the right side is a ratio of functions that weight when the agent prefers to receive the rewards.

Recall that $E[g(\alpha, C)]$ corresponds to $E\left[\sum_{j=0}^{\alpha-1}(1 - jC)\right] = E\left[\frac{\alpha}{2}\left(1 + [1 - (\alpha - 1)C]\right)\right]$. Since $C$ is a constant, it can be isolated from the expectation in Equation (4) to get an expression for the precise value of the serialization cost that characterizes the indifference surface. That is:

$$C_{eq} = \frac{2\, E[\alpha] \left(1 - \frac{\sum_{t=1}^{\tau} \mu(t) f_T(t)}{\sum_{t=1}^{\tau} \mu(t) f_B(t)}\right)}{E[\alpha(\alpha - 1)]} \qquad (5)$$

Equation (5) provides a rigorous characterization of the tradeoff between basis set and tensor product learning in multitasking environments described in the Introduction:

1. As the average number of parallel tasks increases, the cost of serialization must vanish for minimal basis set representations to remain preferable: $E[\alpha] \to \infty \implies C_{eq} \to 0$.

2. $\frac{\sum_{t=1}^{\tau} \mu(t) f_T(t)}{\sum_{t=1}^{\tau} \mu(t) f_B(t)} \to 0 \implies C_{eq} \to \frac{2\, E[\alpha]}{E[\alpha(\alpha - 1)]}$: As the ratio of the discounted training functions for the tensor product and minimal basis set representations approaches 0, the equilibrium-defining serialization cost becomes a function only of the number of tasks required to be performed. In particular, $C_{eq}$ is then the serialization cost that sets the expected reward for the minimal basis set representation to 0. This implication is not immediately obvious. Consider the task distribution $P[\alpha = 1] = P[\alpha = 2] = 1/2$. In this environment, the limit gives $C_{eq} = 3$: the agent wins 1 reward unit when $\alpha = 1$, and $1 + (1 - 3) = -1$ reward units when $\alpha = 2$, so its expected earnings are $\frac{1}{2}(1) + \frac{1}{2}(-1) = 0$. This makes sense; if learning tensor product representations is so much slower than learning minimal basis set representations that the ratio of the sums goes to 0, the agent is indifferent only if the expected earnings are 0.
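The limit in implication 2 can be checked numerically. Below is a minimal sketch of Equation (5) (the training-function parameters are illustrative, chosen so that the tensor product strategy barely trains within the horizon):

```python
import numpy as np

def c_eq(p_alpha, mu, f_B, f_T, tau):
    """Equation (5): the serialization cost at which a perfectly
    informed agent is indifferent between the two strategies."""
    alpha = np.arange(1, len(p_alpha) + 1)
    E_alpha = np.sum(p_alpha * alpha)
    E_alpha_pairs = np.sum(p_alpha * alpha * (alpha - 1))  # E[alpha(alpha-1)]
    t = np.arange(1, tau + 1)
    ratio = np.sum(mu(t) * f_T(t)) / np.sum(mu(t) * f_B(t))
    return 2.0 * E_alpha * (1.0 - ratio) / E_alpha_pairs

# Implication 2: with P(alpha = 1) = P(alpha = 2) = 1/2 and tensor
# product learning far too slow to matter within the horizon
# (ratio -> 0), C_eq should approach 2 * 1.5 / 1 = 3.
p_alpha = np.array([0.5, 0.5])
mu = lambda t: np.ones_like(t, dtype=float)               # no discounting
f_B = lambda t: 1.0 / (1.0 + np.exp(-0.5 * (t - 5.0)))    # fast learner
f_T = lambda t: 1.0 / (1.0 + np.exp(-0.1 * (t - 300.0)))  # barely trains
print(c_eq(p_alpha, mu, f_B, f_T, tau=100))               # approx 3.0
```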
Finally, we note that the analyses above used one particular reward schedule. However, it is possible to generalize the equilibrium condition in Equation (4) to any stationary reward function (i.e. one that does not change over the course of the experiment). Let $g_B(j, \theta_B)$ denote any reward function applied independently to each task, with arbitrary dependence on the task's index $j$ and other fixed parameters $\theta_B$. Furthermore, let $G_B(i, \theta_B) = \sum_{j=0}^{i-1} g_B(j, \theta_B)$ be the accumulated reward across a task set consisting of $i$ tasks. Note that the previous analysis corresponds to the case $g_B(j, \theta_B) = 1 - jC$. Specifically, $g_B$ and $G_B$ are the per-task and cumulative reward functions when the agent executes tasks serially. Finally, define $g_T, G_T$ analogously for the case where the tasks are being processed concurrently. Then a generalized equilibrium condition is:

$$\frac{E[G_T(\alpha, \theta_T)]}{E[G_B(\alpha, \theta_B)]} = \frac{\sum_{t=1}^{\tau} \mu(t) f_B(t)}{\sum_{t=1}^{\tau} \mu(t) f_T(t)} \qquad (6)$$

Observe that for $g_B = 1 - jC$ and $g_T = 1$, this reduces to the expression in Equation (4).
The existence of this generalized equilibrium condition allows a large set of questions to be phrased within this framework. For example, it is easy to include an explicit cost of cognitive control (e.g., Shenhav, Botvinick, & Cohen, 2013; Shenhav et al., 2017; Manohar et al., 2015) by adding a term to the basis set reward function that implements a cost that increases with the number of tasks executed, or to use a per-task penalty consisting of the asymptotic-performance opportunity cost (a function exclusively of $\alpha$).
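As a sketch of how alternative reward schedules slot into Equation (6), the following snippet (all parameter values illustrative) evaluates the gap between the two sides of the condition; plugging in $g_B(j) = 1 - jC$ and $g_T(j) = 1$ recovers Equation (4):

```python
import numpy as np

def equilibrium_gap(g_B, g_T, p_alpha, mu, f_B, f_T, tau):
    """Difference between the two sides of Equation (6); zero means
    indifference. g_B(j) and g_T(j) are per-task reward functions for
    serial and concurrent execution; G accumulates them over a set."""
    G = lambda g, i: sum(g(j) for j in range(i))  # cumulative reward
    alpha = range(1, len(p_alpha) + 1)
    E_GT = sum(p * G(g_T, i) for p, i in zip(p_alpha, alpha))
    E_GB = sum(p * G(g_B, i) for p, i in zip(p_alpha, alpha))
    t = np.arange(1, tau + 1)
    return E_GT / E_GB - np.sum(mu(t) * f_B(t)) / np.sum(mu(t) * f_T(t))

# The original schedule, g_B(j) = 1 - j*C and g_T(j) = 1, recovers
# Equation (4); a control cost could be added to g_B without changing
# the machinery. All parameter values below are illustrative.
C = 0.4
gap = equilibrium_gap(
    g_B=lambda j: 1.0 - j * C,
    g_T=lambda j: 1.0,
    p_alpha=[0.5, 0.5],
    mu=lambda t: 0.95 ** t,
    f_B=lambda t: 1.0 / (1.0 + np.exp(-0.2 * (t - 20.0))),
    f_T=lambda t: 1.0 / (1.0 + np.exp(-0.2 * (t - 40.0))),
    tau=200,
)
print(gap)  # > 0 favors the tensor product; < 0 favors the basis set
```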
Numerical analysis with parameter inference

The analysis above characterized the behavior of an agent with perfect knowledge of the task environment and its learning functions. Here we relax these assumptions, and use numerical simulations to evaluate the behavior of an agent that must infer these parameters. We assess the agent's performance across a series of task environments and learning specifications by crossing a set of reasonable parameter ranges. We fix the horizon $\tau$ and let the serialization cost $C$ vary in $[0, 1]$, from no punishment to receiving no reward for a correct answer. We use an exponential discounting scheme $\mu(t) = \gamma^{0.1t}$, with $\gamma$ ranging from extreme discounting to no discounting at all ($\gamma = 1$). We characterize the training functions as logistic, $f_X(t) = \frac{1}{1 + e^{-k(t - t_X)}}$ with a fixed slope $k$, which allows us to characterize the difference in learning rates precisely through the midpoint ratio $t_T / t_B$; to that end, we fix $t_B$ and vary $t_T$. We let $N = K = 4$, with $P(\alpha = 1)$ set separately and the remaining probability split evenly across $P(\alpha = 2) = P(\alpha = 3) = P(\alpha = 4)$, and use a fixed exploration rate $\varepsilon$. We record $P(\text{pick } X) = \frac{\text{number of times } X \text{ was picked}}{\tau}$, and track how $P(\text{pick basis set})$ varies with the parameters (code available at https://github.com/yotamSagiv/thesis).

The results (see Figure 2) show that there is a broad range of parameters under which the agent will opt for selecting the minimal basis set strategy over the tensor product strategy ($P(\text{pick basis set}) > 0.5$). A regression of strategy preference, $P(\text{select basis set}) \sim b_1 \frac{t_T}{t_B} + b_2\, \text{timeCost} + b_3\, \gamma$, confirmed this pattern: $b_1$ was positive while $b_2$ and $b_3$ were negative, all reliably different from zero.
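The full inference-based simulation is in the repository linked above. As a greatly simplified stand-in, the sketch below (with illustrative values for the parameters not restated here, such as the slope $k$, the midpoint $t_B$, and the task distribution) uses the closed-form indifference cost of Equation (5) to reproduce the qualitative pattern in Figure 2:

```python
import itertools
import numpy as np

def prefers_basis(C, ratio, gamma, tau=200, t_B=20.0, k=0.2):
    """Whether a fully informed agent prefers the minimal basis set:
    true whenever the offered time cost C lies below the indifference
    cost C_eq of Equation (5)."""
    t = np.arange(1, tau + 1)
    mu = gamma ** (0.1 * t)                   # discounting form assumed above
    f = lambda mid: 1.0 / (1.0 + np.exp(-k * (t - mid)))
    p_alpha = np.array([0.7, 0.1, 0.1, 0.1])  # illustrative P(alpha = 1..4)
    alpha = np.arange(1, 5)
    c_eq = (2.0 * np.sum(p_alpha * alpha)
            * (1.0 - np.sum(mu * f(ratio * t_B)) / np.sum(mu * f(t_B)))
            / np.sum(p_alpha * alpha * (alpha - 1)))
    return C < c_eq

# Sweep the three factors varied in the simulations.
for C, ratio, gamma in itertools.product([0.1, 0.5, 0.9], [1.5, 3.0], [0.5, 1.0]):
    print(f"C={C} t_T/t_B={ratio} gamma={gamma} -> basis: {prefers_basis(C, ratio, gamma)}")
```

Because this stand-in assumes perfect knowledge, it omits the inference noise the $\varepsilon$-greedy agent must overcome; it captures the direction of the effects in Figure 2, not their magnitude.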
Using Equation (5), we can show that even with weak discounting ($\gamma = 0.90$) and a modest learning rate ratio $t_T / t_B = 2$, the importance of fast training is such that the time cost must nearly equal the reward value ($C_{eq} \approx 0.75$) for indifference in this environment.

Figure 2: Simulation results for the inference model. $t_T / t_B$ refers to the midpoint ratio of the tensor product and minimal basis set training functions. Time cost denotes the value of $C$. Note that the agent increases its preference for the minimal basis set representation when the time cost is decreased, the learning rate ratio is increased, or gamma is decreased.

Discussion
The constraints on human multitasking abilities present an interesting puzzle given the enormous processing capability of the brain. Here, we explored the hypothesis that this reflects a fundamental tradeoff between learning and processing efficiency (Musslick et al., 2017), in which a preference for learning to perform a set of tasks faster, which relies on the use of shared representations (Caruana, 1998; Baxter, 1995), comes at the expense of multitasking efficiency (Allport et al., 1972; Feng et al., 2014; Navon & Gopher, 1979; Meyer & Kieras, 1997; Musslick et al., 2016; Salvucci & Taatgen, 2008; Wickens, 1991). This tradeoff between the value of shared vs. separated representations is reminiscent of the complementary learning systems hypothesis (McClelland, McNaughton, & O'Reilly, 1995), which proposes the existence of two independent learning mechanisms: the first relies on shared representations to support inference, and the second uses separate representations to avoid the cost of catastrophic interference in memory encoding and retrieval. Thus, the tradeoff between shared and separated representations appears to be a fundamental one that has different consequences in different processing contexts. Here, we have provided a normative analysis of this tradeoff in the context of task performance that, under various assumptions, defines the conditions under which limitations in multitasking ability can be viewed as a result of optimal decision-making.

Agent behavior in our model was governed by several factors: the distribution of multitasking opportunities within the environment, the cost of serial vs. parallel performance, the rate at which each strategy is learned, and the discount rate for future rewards. The broad range of these factors over which the minimal basis set strategy was optimal suggests that the theory provides a plausible account of why so many skills (e.g. driving a car, playing an instrument) seem to rely on cognitive control and serial execution during acquisition.

Theories of bounded rationality (Simon, 1955, 1982; Gigerenzer, 2008) assume that suboptimalities in human behavior arise from the use of heuristics rather than full deliberation, given the bounds of limited multitasking capacity and limited available information. Research in artificial intelligence has suggested that such behavior is normative; that is, it may reflect bounded optimality, in which an agent maximizes reward per unit time given intrinsic limitations in its computational architecture (Russell & Subramanian, 1995). The principles of bounded optimality are reflected in psychological models of cognition, in which humans perform optimally within the constraints of the cognitive system (Griffiths, Lieder, & Goodman, 2015; Gershman, Horvitz, & Tenenbaum, 2015). Yet, these accounts do not explain why computational limitations exist in the first place, other than the assumption of limited processing power/speed. The work here suggests that the bounds may arise from a normative response to constraints imposed by tradeoffs intrinsic to any network architecture, whether neural or artificial – specifically, the tradeoff between the advantages of faster learning and generalization provided by shared representations, and the advantages of concurrent parallelism and processing efficiency provided by separated representations (Musslick et al., 2017).
Under this framework, the source of the limitation is not in the brain/computing device, but rather in the fact that time in life is finite (i.e., the benefits of learning a task quickly far outweigh the value of learning it "optimally").

Of course, the model we described is relatively simple, and can be extended in a number of ways. Rather than using a logistic function to characterize learning, it may be more reasonable to scale the benefit of shared representations by the number of tasks (e.g. as in Musslick et al., 2017), or to implement the learning dynamics of actual neural networks on similar task spaces. Additionally, a cost of control parameter could be incorporated that scales with the number of tasks being executed and/or the complexity of the task environment (Shenhav et al., 2013). It is also plausible to consider the transfer of learning between the two strategies (i.e. generalization). This may be an important factor in shaping how representations evolve from the minimal basis set to tensor product forms over the course of training, as suggested by some neural evidence (Garner & Dux, 2015).

One might also consider meta-learning. The simulated agents learned about their task environment and learning functions, but always began with the same predetermined, static priors. It is possible that repeated experience over different task domains could inform these priors, improving the initial estimates of the learning functions. This would induce a higher rate of convergence to the optimal decision in cases where the agent's prior experiences are relevant, and might also explain any reluctance to switch away from suboptimal decision-making in contexts where its experience is misleading. Such effects could be informative to similar lines of inquiry regarding separate mechanisms for goal-directed and habitual responding in mammals undergoing instrumental conditioning (Yin & Knowlton, 2006).

In sum, the results presented here strongly support the proposal that constraints on multitasking observed in human performance may arise from a normative approach to an inescapable tradeoff between the value of rapidly acquiring a set of novel skills, and optimizing the efficiency with which these skills can be exercised. Such a normative theory of multitasking may have value not only for understanding human performance, but also for the design of artificial systems. Having a formal language with which to consider the tradeoff between learning efficiency and multitasking capability (and the closely related constructs of controlled vs. automatic processing) will facilitate precise analysis of the design of autonomous agents that are capable not only of guiding their own actions, but also of learning the best ways of doing so.
References
Allport, A., Antonis, B., & Reynolds, P. (1972). On the division of attention: A disproof of the single channel hypothesis. Quarterly Journal of Experimental Psychology, 24(2), 225-235. doi: 10.1080/00335557243000102

Alon, N., Reichman, D., Shinkar, I., Wagner, T., Musslick, S., Cohen, J. D., ... Ozcimder, K. (2017). A graph-theoretic approach to multitasking. In Advances in Neural Information Processing Systems (pp. 2097-2106).

Anderson, J. R. (2013). The architecture of cognition. Psychology Press.

Baxter, J. (1995). Learning internal representations. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 311-320).

Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798-1828.

Botvinick, M. M., Braver, T. S., Barch, D. M., Carter, C. S., & Cohen, J. D. (2001). Conflict monitoring and cognitive control. Psychological Review, 108(3), 624-652.

Caruana, R. (1997). Multitask learning. Machine Learning, 28(1), 41-75.

Caruana, R. (1998). Multitask learning. In S. Thrun & L. Pratt (Eds.), Learning to learn (pp. 95-133). Boston, MA: Springer US. doi: 10.1007/978-1-4615-5529-2_5

Cohen, J. D., Dunbar, K., & McClelland, J. L. (1990). On the control of automatic processes: A parallel distributed processing account of the Stroop effect. Psychological Review, 97(3), 332-361.

Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning (pp. 160-167).

Feng, S. F., Schwemmer, M., Gershman, S. J., & Cohen, J. D. (2014). Multitasking versus multiplexing: Toward a normative account of limitations in the simultaneous execution of control-demanding behaviors. Cognitive, Affective, & Behavioral Neuroscience, 14(1), 129-146.

Garner, K., & Dux, P. E. (2015). Training conquers multitasking costs by dividing task representations in the frontoparietal-subcortical system. Proceedings of the National Academy of Sciences, 112(46), 14372-14377.

Gershman, S. J., Horvitz, E. J., & Tenenbaum, J. B. (2015). Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science, 349(6245), 273-278.

Gigerenzer, G. (2008). Why heuristics work. Perspectives on Psychological Science, 3(1), 20-29.

Griffiths, T. L., Lieder, F., & Goodman, N. D. (2015). Rational use of cognitive resources: Levels of analysis between the computational and the algorithmic. Topics in Cognitive Science, 7(2), 217-229.

Hinton, G. E. (1986). Learning distributed representations of concepts. In Proceedings of the 8th Conference of the Cognitive Science Society (pp. 1-12). Hillsdale, NJ: Lawrence Erlbaum Associates.

Lesnick, M., Musslick, S., Dey, B., & Cohen, J. D. (2020). A formal framework for cognitive models of multitasking. doi: 10.31234/osf.io/7yzdn

Logan, G. D. (1980). Attention and automaticity in Stroop and priming tasks: Theory and data. Cognitive Psychology, 12, 523-553.

Manohar, S. G., Chong, T. T.-J., Apps, M. A., Batla, A., Stamelou, M., Jarman, P. R., ... Husain, M. (2015). Reward pays the cost of noise reduction in motor and cognitive control. Current Biology, 25(13), 1707-1716.

McClelland, J., McNaughton, B., & O'Reilly, R. (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3), 419-457.

Meyer, D., & Kieras, D. (1997). A computational theory of executive cognitive processes and multiple-task performance: Part 1. Basic mechanisms. Psychological Review, 104(1), 3-65.

Musslick, S., Dey, B., Özcimder, K., Patwary, M. M. A., Willke, T., & Cohen, J. D. (2016). Controlled vs. automatic processing: A graph-theoretic approach to the analysis of serial vs. parallel processing in neural network architectures. In Proceedings of the 38th Annual Conference of the Cognitive Science Society (pp. 1547-1552).

Musslick, S., Saxe, A., Özcimder, K., Dey, B., Henselman, G., & Cohen, J. D. (2017). Multitasking capability versus learning efficiency in neural network architectures. In Proceedings of the 39th Annual Meeting of the Cognitive Science Society (pp. 829-834).

Navon, D., & Gopher, D. (1979). On the economy of the human-processing system. Psychological Review, 86(3), 214-255.

Petri, G., Musslick, S., Özcimder, K., Dey, B., Ahmed, N., Willke, T., & Cohen, J. D. (2020). Universal limits to parallel processing capability of network architectures. Retrieved from https://arxiv.org/abs/1708.03263

Posner, M., & Snyder, C. (1975). Attention and cognitive control. In Information processing and cognition: The Loyola Symposium (pp. 55-85).

Russell, S. J., & Subramanian, D. (1995). Provably bounded-optimal agents. Journal of Artificial Intelligence Research, 2, 575-609.

Salvucci, D. D., & Taatgen, N. A. (2008). Threaded cognition: An integrated theory of concurrent multitasking. Psychological Review, 115(1), 101-130.

Saxe, A. M., McClelland, J. L., & Ganguli, S. (2013). Learning hierarchical category structure in deep neural networks. In Proceedings of the 35th Annual Meeting of the Cognitive Science Society (pp. 1271-1276).

Shenhav, A., Botvinick, M., & Cohen, J. D. (2013). The expected value of control: An integrative theory of anterior cingulate cortex function. Neuron, 79, 217-240.

Shenhav, A., Musslick, S., Lieder, F., Kool, W., Griffiths, T. L., Cohen, J. D., & Botvinick, M. (2017). Toward a rational and mechanistic account of mental effort. Annual Review of Neuroscience, 40, 99-124.

Shiffrin, R., & Schneider, W. (1977). Controlled and automatic human information processing: II. Perceptual learning, automatic attending and a general theory. Psychological Review, 84(2), 127-190.

Simon, H. A. (1955). A behavioral model of rational choice. The Quarterly Journal of Economics, 69(1), 99-118.

Simon, H. A. (1982). Models of bounded rationality. Cambridge, MA: MIT Press.

Wickens, C. D. (1991). Processing resources and attention. In Multiple-task performance (pp. 3-34).

Yin, H., & Knowlton, B. (2006). The role of the basal ganglia in habit formation. Nature Reviews Neuroscience, 7(6), 464-476.