Evaluating Models of Robust Word Recognition with Serial Reproduction
Stephan C. Meylan, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology
Sathvik Nair, Department of Psychology, University of California, Berkeley
Thomas L. Griffiths, Department of Psychology, Princeton University

Author Note
We thank the Computational Cognitive Science Lab at U.C. Berkeley for valuable feedback. This material is based on work supported by the National Science Foundation (Graduate Research Fellowship to S.C.M. under Grant No. DGE-1106400), the U.S. Air Force Office of Scientific Research (FA9550-13-1-0170 to T.L.G.), and the Defense Advanced Research Projects Agency (Next Generation Social Sciences to T.L.G.). Address all correspondence to Stephan C. Meylan. E-mail: [email protected]

Abstract
Spoken communication occurs in a "noisy channel" characterized by high levels of environmental noise, variability within and between speakers, and lexical and syntactic ambiguity. Given these properties of the received linguistic input, robust spoken word recognition—and language processing more generally—relies heavily on listeners' prior knowledge to evaluate whether candidate interpretations of that input are more or less likely. Here we compare several broad-coverage probabilistic generative language models in their ability to capture human linguistic expectations. Serial reproduction, an experimental paradigm in which spoken utterances are reproduced by successive participants, similar to the children's game of "Telephone," is used to elicit a sample that reflects the linguistic expectations of English-speaking adults. When we evaluate a suite of probabilistic generative language models against the yielded chains of utterances, we find that those models that make use of abstract representations of preceding linguistic context (i.e., phrase structure) best predict the changes made by people in the course of serial reproduction. A logistic regression model predicting which words in an utterance are most likely to be lost or changed in the course of spoken transmission corroborates this result. We interpret these findings in light of research highlighting the interaction of memory-based constraints and representations in language processing.
Keywords: spoken word recognition; sentence processing; generative language models; noisy-channel communication; iterated learning; serial reproduction

Spoken communication occurs in a "noisy channel" (Shannon, 1948, 1951): To overcome high levels of environmental noise, variation within and between speakers, and lexical and syntactic ambiguity, listeners must make extensive use of linguistic expectations, or knowledge of what speakers are likely to say (Gibson, Bergen, & Piantadosi, 2013). This requires the integration of many sources of information, including knowledge of word frequencies (Howes, 1957; Hudson & Bergman, 1985; Yap, Balota, Sibley, & Ratcliff, 2012), probable word sequences (Miller, Heise, & Lichten, 1951; Smith & Levy, 2013), sub-word phonotactic regularities (Storkel, Armbrüster, & Hogan, 2006; Vitevitch & Luce, 1999), the plausibility of syntactic relationships (Altmann & Kamide, 1999; Kamide, Altmann, & Haywood, 2003), cues from the visual scene (Tanenhaus, Spivey-Knowlton, Eberhard, & Sedivy, 1995), and pragmatic expectations (Rohde & Ettlinger, 2012). More broadly, people are able to adjudicate between various candidate interpretations of auditory input in light of the specific discourse context, and bring considerable "general world knowledge" (e.g., knowledge of intuitive physics, properties of people and objects) to the task of natural language understanding (Levesque, Davis, & Morgenstern, 2011).

One particularly promising avenue of inquiry has been the development of statistical models that encode regularities in language structure in terms of probability distributions over sequences of words, or probabilistic generative language models (PGLMs). These models are distinguished in their ability to provide precise quantitative predictions for a variety of behavioral observables such as reading times and word naming latencies (Goodkind & Bicknell, 2018; Roark, Bachrach, Cardenas, & Pallier, 2009; Smith & Levy, 2013). These models are further distinguished in their ability to reflect psycholinguistic hypotheses in their architecture, for example whether syntactic structure is symbolically represented or implicit (Fossum & Levy, 2012; Frank & Bod, 2011). However, as we argue below, there are a number of substantive challenges and limitations in the way these models are currently evaluated, especially in the auditory modality. This leaves important open questions regarding which of these models best capture human expectations in spoken word recognition in sentential contexts.

The behavioral experiment presented here specifically targets a gap in methods for estimating people's linguistic expectations in the audio modality, i.e., what people expect other people to say. In the current work, we develop a method to sample from the linguistic expectations used by people in a naturalistic language task using the technique of serial reproduction (Bartlett, 1932; Xu & Griffiths, 2010), in this case a large-scale web-based game of "Telephone".
A succession of participants reproduces each initial sentence, such that the utterance produced by each participant is used as input to the next (Figure 1). Similar to the related technique of iterated learning (Kirby, 2001), serial reproduction can be shown to converge to participants' prior expectations given a sufficient number of iterations (repetitions, with changes introduced by participants), as long as certain conditions are met (Griffiths & Kalish, 2007). Even with fewer iterations than required for convergence to the stationary distribution, the yielded samples are still useful in that they reflect an approach to the distribution of interest.

To preview our results: the changes people make to sentences in the Telephone game are best explained by probabilistic generative language models that make use of abstract symbolic representations of preceding context. A token-level analysis of which individual words are successfully transmitted from one participant to another provides converging evidence for this result. More generally, we introduce a method to elicit samples from people's expectations for receptive language tasks in the audio modality, and show how that method can be used to evaluate language models in their ability to explain key human linguistic behaviors.
Theoretical Background
Figure 1. A: Serial reproduction. Each participant chooses a hypothesis regarding what they heard, and on the basis of that hypothesis produces data for the next listener. B: Under the Bayesian model of the Telephone task presented here, participants assign posterior probability p(h | d) to hypotheses regarding what they heard, proportional to the product of the likelihood, p(d | h), which depends on the current auditory input, and the prior, p(h), which does not.

In a landmark study, Luce and Pisoni (1998) showed that the recognition of isolated words embedded in noise can be modeled as a competitive process, combining evidence from the received auditory signal with each word's probability (i.e., relative frequency) in a corpus. Subsequent work has formalized this process of competition among words within an explicitly Bayesian framework, and extended it to the recognition of words in sentential contexts (Norris & McQueen, 2008). It is now widely accepted that linguistic expectations—probabilistic knowledge regarding what is more or less likely to be said—make the process of word recognition substantially more robust than it would be as a purely data-driven, bottom-up process (Norris, McQueen, & Cutler, 2016). Figure 2 provides a toy example of how a listener might overcome perceptual noise to arrive at a speaker's intended message.
Figure 2. Example of the contributions of the likelihood and prior in spoken word recognition in a sentential context. The phonemes /b/ and /p/ are largely distinguished by a single dimension of difference, their voice onset time (VOT), and are relatively easily confused. Using broader language knowledge, a listener can recognize a speaker's likely intended meaning when the data is ambiguous or noisy. In this toy example, the prior probability of pear in this context overwhelms the evidence for bear provided by the received auditory signal.

While a listener may receive auditory input more consistent with bear than pear after the preceding context "I bought a", the probabilities of the two candidates as continuations given the preceding context would suggest that the latter is a much more likely interpretation. In this way, a listener can overcome noise and ambiguity and successfully recover a speaker's message using their knowledge of language.

A critical question at Marr's computational level of analysis (Marr, 1982) is what sort of information sources might be combined to accomplish the task of word recognition, independent of the precise timecourse or computational tractability. Luce and Pisoni (1998) use word probability—normalized frequency—as an example of basic word-level knowledge that listeners have access to before encountering auditory input. This simple model would, however, often lead to incorrect predictions for the bear/pear example in Figure 2, because bear is more frequent than pear in many linguistic corpora. Rather, the plausibility of the two candidates reflects more detailed world knowledge, for example what sort of things might be bought and sold at a farmer's market. In the absence of models that encode this sort of world information directly, increasingly sophisticated models of linguistic structure, or probabilistic generative language models (PGLMs), have been used to approximate people's expectations.

Following on the work of Hale (2001), Levy (2008) tested how the probabilistic linguistic expectations encoded by such a model predict sentence processing difficulty, as revealed by measures of reading time. This work specifically advanced the concept of a "causal bottleneck": the expectedness of a word determines its processing difficulty, and this expectedness may reflect a broad range of linguistic and non-linguistic knowledge. The negative log probability of that upcoming word under a listener's expectations, or surprisal (see also Itti & Baldi, 2009, for a related formulation in the visual domain), provides a succinct measure of a word's expectedness. Levy (2008) further demonstrated that surprisal estimates derived from a PGLM can be used to approximate people's rich implicit knowledge of linguistic regularities, and can provide significant explanatory power above and beyond theories of processing difficulty that focus on memory constraints alone.

Though Levy (2008) used a specific probabilistic generative model (the Earley parser of Hale, 2001), the theory of a causal bottleneck posits a broader relationship between surprisal and processing difficulty.
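To make the toy example in Figure 2 concrete, the following minimal sketch combines an invented likelihood and prior for the two candidate words and also computes their surprisal; the specific numbers are illustrative assumptions, not estimates from any of the models discussed here.

```python
import math

# Hypotheses for the word following "I bought a ..." at a farmer's market.
# Likelihood p(d | h): how consistent the noisy acoustic input is with each word.
# Prior p(h): how expected each word is as a continuation of the context.
# All numbers are invented for illustration.
likelihood = {"bear": 0.7, "pear": 0.3}    # acoustics slightly favor /b/
prior      = {"bear": 0.001, "pear": 0.2}  # context strongly favors "pear"

# Posterior p(h | d) is proportional to p(d | h) p(h), renormalized over the candidates.
unnormalized = {w: likelihood[w] * prior[w] for w in likelihood}
z = sum(unnormalized.values())
posterior = {w: p / z for w, p in unnormalized.items()}

# Surprisal of each candidate under the prior alone: -log2 p(h), in bits.
surprisal = {w: -math.log2(prior[w]) for w in prior}

print(posterior)  # "pear" wins despite weaker acoustic support
print(surprisal)  # "bear" is far more surprising in this context
```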
Specifically, more sophisticated models—those capable of capturing additional information sources available to people—should be expected to produce surprisal estimates more strongly correlated with observed processing difficulty. Thus, while these models entail strong simplifying assumptions regarding the knowledge of language structure available to people (let alone the knowledge of language structure available to linguists), they can nonetheless be used to derive expectations from large corpora or other large-scale datasets that are strongly predictive of human linguistic behavior, such as response times in word naming tasks. Subsequent work in psycholinguistics and cognitive science has demonstrated the utility of a variety of PGLMs in understanding human sentence processing, especially for reading (Demberg & Keller, 2008, 2009; Fine, Jaeger, Farmer, & Qian, 2013; Fossum & Levy, 2012; Frank & Bod, 2011; Futrell & Levy, 2017; Goodkind & Bicknell, 2018; Smith & Levy, 2011, 2013). However, which models best capture human linguistic expectations—using which information sources and via which representations—remains a central open question.

In addition to the utility of increasingly accurate statistical models of language structure for psychology and cognitive science, these same models are of critical importance for a broad range of speech-related engineering applications, where noise, ambiguity, and speaker variation have been well-known challenges since the 1940s (e.g., Shannon, 1948). Generative models of utterance structure have received extensive treatment in the fields of Computational Linguistics, Natural Language Processing, and Automatic Speech Recognition, where such models are collectively known as "language models" (Jurafsky & Martin, 2009). Their probabilistic form allows them to be combined in principled ways with auditory information using Bayesian techniques, or approximations thereof (e.g.,
Graves, Fernández, Gomez, & Schmidhuber, 2006). A growing interest in commercial applications in recent years has resulted in a profusion of model architectures, especially ones taking advantage of large-scale deep artificial neural networks (e.g., Devlin, Chang, Lee, & Toutanova, 2018; Dyer, Kuncoro, Ballesteros, & Smith, 2016; Gulordava, Bojanowski, Grave, Linzen, & Baroni, 2018; Hannun et al., 2014; Jozefowicz, Zaremba, & Sutskever, 2015; Peters et al., 2018; Zaremba, Sutskever, & Vinyals, 2014). These neural network models have received significant attention for their ability to capture human linguistic phenomena above and beyond simpler structural models, though the exact nature of these abilities remains an active area of investigation (Futrell & Levy, 2019; Futrell et al., 2019; Linzen, Dupoux, & Goldberg, 2016; Linzen & Leonard, 2018).
Language Models
The primary objective of this contribution is to evaluate a variety of broad-coverage probabilistic generative language models (PGLMs) in their ability to capture the linguistic expectations implicit in participants' behavior in a game of Telephone. By "language models," we follow the convention of the natural language processing literature and refer to probabilistic models that encode information regarding the distribution over possible continuations (the next word in the utterance) before encountering the relevant auditory input. We thus exclude many "language models" in the broader sense that are concerned with the online dynamics of word recognition, like the logogen model (Morton, 1969), the cohort model (Marslen-Wilson, 1987), or the TRACE model of word recognition (McClelland & Elman, 1986). Because modeling human performance in the Telephone game requires large vocabularies and a means of handling newly encountered words, all language models here track expectations over 10,000 or more word types and have a method for handling out-of-vocabulary items.

The language models evaluated here are outlined in Table 1. They include n-gram models of varying orders, parsers that use probabilistic context-free grammars, and recurrent neural networks, including LSTMs. Where possible, two versions of each model are evaluated: one fit/trained on the Penn TreeBank, and one trained on a large dataset known to yield competitive prediction results. For additional details, refer to Technical Appendix 1: Language Models and Training Procedure. Insofar as each model is the subject of many dissertations, conference papers, and journal articles in its own right, the overviews provided here are neither exhaustive nor formally rigorous; we refer readers to the appropriate publications for more details.

Table 1
Overview of probabilistic language models used in the current work.
Model | Type | Training Data | Citation
OBWB / BIG LSTM | LSTM | One Billion Word Benchmark | Jozefowicz et al. (2015)
BNC / Kneser-Ney Trigram | Smoothed N-gram | British National Corpus | Chen and Goodman (1998)
BNC / Unigram | N-gram | British National Corpus |
DS / Kneser-Ney 5-gram | Smoothed N-gram | Librivox, Fisher, Switchboard | Chen and Goodman (1998)
PTB + GTB / BLLIP | PCFG | Penn + Google TreeBanks | Charniak and Johnson (2005)
PTB / BLLIP | PCFG | Penn TreeBank | Charniak and Johnson (2005)
PTB / Roark Parser | PCFG | Penn TreeBank | Roark et al. (2009)
PTB / RNNLM | RNN | Penn TreeBank | Mikolov et al. (2010)
PTB / Unigram | N-gram | Penn TreeBank |
PTB / Good-Turing Trigram | Smoothed N-gram | Penn TreeBank | Gale and Sampson (1995)
PTB / Good-Turing 5-gram | Smoothed N-gram | Penn TreeBank | Gale and Sampson (1995)

The PGLMs under study here vary in structural complexity from none at all (a unigram model, which may be considered a null model positing no structure) to distributions over parse trees, which posit hierarchical syntactic relations, such as noun phrases and verb phrases, for all words in a sentence. Language models with higher structural complexity bring the task of in-context spoken word recognition into contact with the task of sentence processing: if people think that the utterances they hear will adhere to grammatical rules, their expectations about upcoming words will be quite different. While people certainly use detailed representations of relationships between words in sentence understanding more broadly, it remains an open question whether this same information is used online in the process of identifying words in noisy linguistic input (C. M. Connine & Clifton, 1987; Lash, Rogers, Zoller, & Wingfield, 2013; Levy, Bicknell, Slattery, & Rayner, 2009; McQueen & Huettig, 2012).
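As a deliberately simplified illustration of the n-gram family in Table 1, the sketch below trains a bigram model with add-one smoothing and an explicit out-of-vocabulary token, then computes per-word surprisal for a new utterance. The models actually evaluated here (e.g., with modified Kneser-Ney smoothing) are substantially more sophisticated; this only shows the shape of the computation.

```python
import math
from collections import Counter, defaultdict

def train_bigram(corpus_sentences, min_count=1):
    """Count unigrams and bigrams, mapping rare words to <unk>."""
    raw_counts = Counter(w for s in corpus_sentences for w in s)
    vocab = {w for w, c in raw_counts.items() if c > min_count} | {"<unk>", "<s>", "</s>"}
    unigrams, bigrams = Counter(), defaultdict(Counter)
    for s in corpus_sentences:
        toks = ["<s>"] + [w if w in vocab else "<unk>" for w in s] + ["</s>"]
        for prev, cur in zip(toks, toks[1:]):
            unigrams[prev] += 1
            bigrams[prev][cur] += 1
    return vocab, unigrams, bigrams

def per_word_surprisal(sentence, vocab, unigrams, bigrams):
    """Add-one smoothed bigram surprisal, in bits, for each word."""
    toks = ["<s>"] + [w if w in vocab else "<unk>" for w in sentence] + ["</s>"]
    v = len(vocab)
    return [-math.log2((bigrams[prev][cur] + 1) / (unigrams[prev] + v))
            for prev, cur in zip(toks, toks[1:])]

corpus = [["your", "teeth", "begin", "breaking", "up", "the", "food"],
          ["the", "food", "is", "on", "the", "table"]]
vocab, uni, bi = train_bigram(corpus)
surprisals = per_word_surprisal(["the", "food", "begin", "chewing"], vocab, uni, bi)
print(sum(surprisals) / len(surprisals))  # average per-word surprisal (bits)
```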
Abstract Structure in Preceding Context
One of the central distinctions that we address here is whether there is evidence that people use categorical, hierarchical syntactic representations in the service of prediction. Contrary to previous work finding evidence of extensive use of abstract representations in sentence processing (e.g., Gibson, Desmet, Grodner, & Watson, 2005), Frank and Bod (2011) found no additional predictive power for models that use explicit hierarchically-structured representations of preceding sentence context when predicting reading times in the Dundee corpus. The models under comparison in their study included n-gram models, the Roark parser (Roark, 2001), and echo state networks (Jaeger, 2001), a kind of recurrent neural network architecture. They further expand on this position and describe the prospects and implications of a non-hierarchical theory of language processing in Frank, Bod, and Christiansen (2012). By contrast, a replication and extension of Frank and Bod (2011) by Fossum and Levy (2012) found that the use of improved lexical n-gram controls, derived from a larger model and using a more sophisticated smoothing technique, eliminated the performance differences between unlexicalized sequential and hierarchical models. Further, they showed that a state-of-the-art lexicalized hierarchical model from Petrov and Klein (2007)—a model that tracks more granular relationships between words and grammatical categories—predicts reading times better yet. We expand on the contrast between lexicalized and unlexicalized models below. Van Schijndel and Schuler (2015) also found evidence that models with hierarchical representations of syntactic context better predict reading times in the same corpus.

In the current work, we investigate the utility of abstract representations in the auditory domain, using a larger sample of language models. We adopt a different typology of models than that used in the above works: we group recurrent neural network models—of which we include two more recent architectures—with models that infer parse trees. While Fossum and Levy (2012) and Frank and Bod (2011) separate models that build full hierarchical representations of phrase structure ("latent context in the form of syntactic trees") from all others, we note that recurrent neural networks, such as the echo state network used by Frank and Bod (2011), capture significant higher-order linguistic regularities of a qualitatively similar type to the nonterminals in a phrase structure grammar. An examination of the word embeddings from Elman (1990), a simple recurrent neural network, shows that the average activation pattern for a word reflects a combination of semantic and syntactic similarity, in that both are reflected in the lexical distributions in the surrounding context. While this is admittedly far from a full hierarchical parse tree, the use of any sort of abstract representation of context may have a stronger effect on the sort of expectations that are encoded in a model than whether that representation is explicitly symbolic and hierarchical. We evaluate support for this partition of models, as well as the performance of the language models in predicting the pattern of changes, using our collected serial reproduction data. Following Frank and Christiansen (2018), we do not entertain the hypothesis that language processing, and especially sentence processing, is devoid of hierarchical phenomena (e.g., Altmann & Kamide, 1999).
Rather, we focus our investigation on whether hierarchical or non-hierarchical models better account for the data collected from our task, specifically investigating the role of prediction in in-context word recognition.

As noted in the introduction, PGLMs are typically evaluated according to the probability that they assign to a test sample of language (in natural language processing) or in their ability to predict experimental observables pertaining to processing difficulty (in psycholinguistics). Each of these evaluation methods has notable limitations.

Engineering-centric research in natural language processing typically evaluates generative language models in terms of the probability that they assign to a held-out dataset: the model that assigns a higher probability to the held-out dataset—equivalently, a lower perplexity, the inverse probability normalized by the dataset's length in words—is the better model in the absence of confounding factors (Jurafsky & Martin, 2009). This metric assigns the highest scores to those models that best capture the statistics of the corpora they are tested on. Fitting language models to large naturalistic corpora produces excellent first-approximation, broad-coverage estimates of people's general-purpose language expectations. However, people's expectations may deviate significantly from these corpus statistics: people do not expect newly encountered speakers to produce utterances following the statistical properties of corpora derived from books or news (e.g., the Penn TreeBank). Even the use of naturalistic corpora like Switchboard (Godfrey, Holliman, & McDaniel, 1992) or Santa Barbara (Du Bois, Chafe, Meyer, Thompson, & Martey, 2000) fails to address this problem, in that participants' expectations for the purposes of comprehension may deviate significantly from those of any known corpus, a point on which we elaborate below. Likewise, eliciting a sample of unconditioned speech may not reflect a person's expectations of what others are likely to say.

Regarding maximizing test sample probability, there is no guarantee that a corpus-derived test set has a high probability under human linguistic expectations. As rational agents, people should be expected to develop expectations that match the task of interpreting linguistic material under noisy conditions, but in the case of language it is very hard to know what will constitute future "typical" linguistic experience. Composition in terms of written vs. auditory sources, topics, and speech registers (i.e., levels of formality) is undoubtedly subject to significant individual variation. Furthermore, even with a perfect estimate of a person's experience with language thus far, there is no guarantee that linguistic material encountered in the future will have the same composition as that encountered in the past. In such a case, a rational agent might be expected to allocate probability mass to as-yet unobserved linguistic events to avoid overfitting on the basis of previous experience. As such, we argue that no corpus collected from natural sources should be expected to reflect people's linguistic expectations for the task of in-context word recognition.

Psycholinguistic studies, by contrast, have typically evaluated how well surprisal estimates derived from probabilistic language models predict measures of processing difficulty, such as reaction times in word naming, looking time or regressions in reading as measured by eye tracking, and neural measures such as the magnitude of event-related potentials.
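Both evaluation traditions operate on per-word probabilities; as a minimal illustration with made-up numbers, the held-out metric used in natural language processing and the psycholinguistic predictor differ mainly in how those probabilities are aggregated.

```python
import math

# Hypothetical per-word probabilities assigned by some model to a held-out utterance.
word_probs = [0.05, 0.2, 0.001, 0.1]

# Psycholinguistics: per-word surprisal in bits, one value per word,
# used to predict word-level measures such as reading time.
surprisals = [-math.log2(p) for p in word_probs]

# NLP: perplexity of the whole test sample, the inverse probability
# normalized (geometrically) by its length in words; lower is better.
log_prob = sum(math.log2(p) for p in word_probs)
perplexity = 2 ** (-log_prob / len(word_probs))

print(surprisals, perplexity)
```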
We note a challenge in interpreting the relationship between surprisal and processing difficulty. Unpredictable linguistic events should cause increased processing difficulty if and only if they are correctly interpreted: they may also simply be missed by listeners or readers and result in transmission errors (i.e., a listener could incorrectly interpret an unexpected word as a more probable alternative, without immediately incurring any increase in processing difficulty). As such, behavioral observables constitute an imperfect record of human linguistic expectations. Models should thus be evaluated with respect to how surprisal estimates predict measures of processing difficulty, as well as the kinds of transmission errors they predict.

We note, however, that language models fit on large corpora of naturalistic text are an excellent first approximation to people's linguistic expectations, and that they provide broad coverage for relatively little effort by comparison to the iterated learning method introduced here. We return to this point, as well as prospects for combining the two approaches, in the Discussion.
Characterizing Transmission Errors
A substantial body of work in signal processing has evaluated the performance of automatic speech recognition (ASR) systems on conversational speech (Lippmann, 1997). While most of this work has focused on engineering applications, some work has treated these systems as computational-level cognitive models of word recognition. Goldwater, Jurafsky, and Manning (2010), for example, evaluated two contemporary ASR systems in their ability to recognize words from a standard dataset. Consistent with the human results reviewed below, they found higher error rates for low-probability words, with strong independent effects of unigram and trigram probabilities. They also found that longer words have lower error rates, that words are more likely to be mis-recognized at the beginnings of utterances, and that the model was particularly prone to transmission failures among "doubly confusable pairs": pairs of words with similar acoustic profiles that appear in similar lexical contexts. More recently, Xiong et al. (2018) found that neural network-based ASR systems make similar errors to people, though many of the remaining differences can be traced to how these models handle backchanneling (e.g., "mm-hmm") and disfluencies ("uh", "hmmm"). This work evaluating ASR systems has not, however, systematically evaluated how transmission errors vary as a function of the language model.

Word recognition errors have also been thoroughly explored in psycholinguistics and audiology. Early work focused on establishing psychometric functions relating the signal-to-noise ratio of word presentation in noise to recognition accuracy (Egan, 1948; French & Steinberg, 1947). Subsequent work tested the effects of word frequency (Broadbent, 1967; Owens, 1961; Savin, 1963), the interaction of word frequency and acoustic similarity (Luce & Pisoni, 1998), and the importance of lexical neighborhoods (the number of phonetically similar words to a target word; Treisman, 1978; Luce, Pisoni, & Goldinger, 1990; J. Ziegler, Muneaux, & Grainger, 2003) as determinants of word recognition accuracy. While much of this work focuses on the recognition of individual words in isolation, a major thread explores the facilitatory effects of utterance context (Benichov, Cox, Tun, & Wingfield, 2012; C. Connine, Blasko, & Hall, 1991; Kalikow, Stevens, & Elliott, 1977; Miller et al., 1951). We note, however, that these studies have largely focused on recognition accuracy in specific sentence positions, usually with small sets of carefully experimentally-controlled stimuli. Further, the characterization of predictability has relied on small corpora, and predictability estimates have often been coded as categorical rather than continuous variables. As noted by Theunissen and Swanepoel (2009), experiments investigating the transmission of entire utterances have rarely characterized the word-level changes. In the current work, we adopt a methodology that allows us to track transmission errors among all words in a large collection of utterances, and we use predictability estimates from sophisticated language models trained on large datasets.

One challenge for word recognition research is that there is a rich correlational structure among word features: shorter words are more frequent, have more phonologically similar competitors, are learned earlier, and are more central in lexical networks (Vincent-Lamarre et al., 2016). Unless these variables are controlled for, many of the above results are difficult to interpret.
More recent work addresses this challenge by including a larger set of predictors, explicitly characterizing the correlational structure among those predictors, and using linear and logistic mixed-effects models to account for participant-level and item-level variability. Looking at this expanded set of predictors and using appropriate regression models, Gahl and Strand (2016) reproduced the longstanding effect of lexical frequency (more frequent words are easier to recognize) and found that words with many phonological neighbors are harder to recognize.

More broadly, there are several reasons to question whether the effects observed in isolated word recognition—which has received the overwhelming balance of attention in the literature thus far—will extend to the recognition of words in utterance contexts. First, many candidate interpretations may be much less competitive when contextual cues are available to guide processing. In isolated word recognition, participants may adapt to the circumstances of the task (e.g., expect that low-frequency and high-frequency words are equally likely in an experimental context), and thus diverge from the expectations they hold for in-context conversational speech. Second, many of these studies have investigated only a limited set of words, for example only monosyllabic words or words without complex consonant clusters. The current work thus provides an opportunity to test whether patterns observed in isolated word recognition experiments extend to the task of in-context speech recognition.
Serial Reproduction
In the present study, we use the technique of serial reproduction to investigate human in-context word recognition. Intuitively, we start with a small test corpus and use serial reproduction to gradually change the properties of that corpus so that it better reflects people's linguistic expectations. This transition in corpus properties emerges in the course of serial reproduction because participants' expectations are reflected in the changes they make between the utterance they hear and the utterance they produce at each iteration.

Information transmission by serial reproduction was first studied by Sir Frederic Bartlett, who tracked the evolution of stories and pictures recreated from memory after rapid presentation (Bartlett, 1932). More recent work has identified that the technique, like iterated learning (Kirby, 2001), can be used to experimentally reveal inductive biases, or reasons that people would favor one hypothesis over another independent of observed data in inferential tasks (Mitchell, 1997). If participants (reproducers for serial reproduction, learners for iterated learning) use Bayesian inference to infer the posterior distribution over hypotheses, and then draw from that distribution in production, then their output at each sequential generation reflects a combination of the observed data and the participants' inductive biases. Over time, the distribution implicit in the output data comes to reflect participants' inductive biases. For a trial in the Telephone game, participants' hypotheses are the set of possible interpretations that they might provide for an utterance, in that the message they infer reflects both the auditory input and their expectations regarding language.

Under certain assumptions, both kinds of transmission chains (iterated learning and serial reproduction) can be interpreted as a form of Gibbs sampling, a common technique in Markov chain Monte Carlo-based parameter estimation (Geman & Geman, 1984). The transmission of a linguistic utterance across a succession of participants can be interpreted as a Markov process; given certain assumptions, the resulting Markov chain is guaranteed to converge to its stationary distribution if run for a sufficient number of iterations. A participant estimates the probability of an interpretation h of an utterance given some observed data d (i.e., an audio recording) following Bayesian inference (Figure 1). In particular, if each participant samples a hypothesis h from the posterior, p(h | d) ∝ p(d | h) p(h), and then samples data from p(d | h), the probability that a participant selects a hypothesis h converges to their prior distribution p(h) (Griffiths & Kalish, 2007). This approach has been profitably used to reveal inductive biases in memory (Xu & Griffiths, 2010), category structure (Griffiths, Christian, & Kalish, 2008; Sanborn, Griffiths, & Shiffrin, 2010), and function learning (Griffiths, Kalish, & Lewandowsky, 2008). In the domain of language specifically, iterated learning has been used to reveal biases relevant to language evolution (Griffiths & Kalish, 2007).

We motivate the use of serial reproduction with language in the auditory modality by analogy to the function learning experiments of Griffiths, Kalish, and Lewandowsky (2008), which can be thought of as a game of "Telephone" in the space of mathematical functions. In these experiments, participants saw a small number of points in a two-dimensional space drawn from an unknown—and not necessarily linear—function.
They were then asked to provide their best guess regarding the function which generated the observed points. This function was then used to generate points for the next participant in the chain. While different transmission chains started with radically different functions, including some of considerable complexity, all are reduced to simple, mostly positive linear functions in the course of reproduction. In the face of limited, noisy data, people tend towards their expectation for a simple linear function. The case of language implicates a more complex and nuanced set of expectations—what people expect other people to say—but the basic experimental logic remains the same.

The process of serial reproduction can consequently be thought of as gradually changing a corpus so that it better reflects what people expect others to say. To the degree that the initial set of utterances in that corpus is not representative of what people expect to hear, the process of serial reproduction will introduce edits that change them to increasingly reflect the broader linguistic expectations of participants. If, for example, the initial corpus contained an excessive number of low-frequency words pertaining to finance, participants should be expected to misinterpret these and replace them with words prototypical of normal conversational registers. To our knowledge, no previous work has used this technique of serial reproduction to investigate language processing. However, we note that the task of serial reproduction of isolated speech sounds was previously used to investigate whether repeated imitation of environmental noises can give rise to word-like forms (Edmiston, Perlman, & Lupyan, 2018).

An intuitive interpretation of this process is that repetition entails reversion to the mean. In the case of linguistic expectations, this "mean" is of unknown character and the subject of broad theoretical interest.
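The convergence argument can be illustrated with a minimal simulation of transmission chains over a toy three-word hypothesis space. The prior, the confusion-based likelihood, and the chain lengths below are invented for illustration; the point is only that, when listeners sample interpretations from the posterior and produce noisy data from them, the distribution of selected hypotheses drifts toward the prior rather than the starting state (cf. Griffiths & Kalish, 2007).

```python
import random
from collections import Counter

random.seed(0)

words = ["bear", "pear", "chair"]
prior = {"bear": 0.1, "pear": 0.6, "chair": 0.3}           # what listeners expect
confusion = {                                              # p(heard data d | intended h)
    "bear":  {"bear": 0.6, "pear": 0.3, "chair": 0.1},
    "pear":  {"bear": 0.3, "pear": 0.6, "chair": 0.1},
    "chair": {"bear": 0.1, "pear": 0.1, "chair": 0.8},
}

def sample(dist):
    return random.choices(list(dist), weights=dist.values())[0]

def posterior(d):
    """p(h | d) proportional to p(d | h) p(h) over the toy vocabulary."""
    unnorm = {h: confusion[h][d] * prior[h] for h in words}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

def run_chain(initial, generations=50):
    h = initial
    for _ in range(generations):
        d = sample(confusion[h])      # speaker produces noisy data
        h = sample(posterior(d))      # next listener samples an interpretation
    return h

# Start every chain from the a priori unlikely word "bear".
final = Counter(run_chain("bear") for _ in range(2000))
print({w: final[w] / 2000 for w in words})  # approaches the prior, not the start state
```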
Methods
We use a web experiment to gather chains of audio recordings and transcriptions appropriate for answering our primary research questions. This web experiment lets participants listen to and record audio, coordinates the flow of stimuli such that recordings from one participant can be used as stimuli for later participants, and ensures that the succession of recordings remains interpretable linguistic material, while avoiding explicit judgments of appropriateness by the experimenters. For simplicity of exposition, we first describe how a participant would experience the task; we then describe the flow of data in the experiment, focusing on how the successive contributions of participants are used as stimuli for future participants.
Experimental Interface
Upon accepting the experiment on Amazon Mechanical Turk, a participant is provided with a link to a web application that allows them to listen to and record responses to audio recordings of utterances. Participants proceed through two practice trials that test their speakers and microphone. A participant is then introduced to the full trial format, in which 1) they listen to a recording, 2) they choose whether or not to flag the recording they heard as inappropriate or uninterpretable, 3) they record a response (i.e., their best guess of the content of the utterance they heard), 4) they decide whether to flag their own recording (e.g., in case of speech errors or excessive ambient noise), and 5) they provide a written transcription of the utterance they recorded. After the participant submits a trial, the new recording and transcription are programmatically evaluated with a number of filters confirming that the audio recording is not blank, that it is of similar length to the previous utterance, and that it passes basic tests of consistency between recording and transcription using an automated speech recognition system. The participant does two practice trials in this full trial format to confirm their understanding of the directions of the task; if they fail to pass the basic checks, they continue through additional practice trials. For additional details regarding the web interface and the automated checks, refer to
Technical Appendix 2: Experiment Interface.
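The appendix documents the actual filters; purely as an illustration of the kind of automated screening involved, the sketch below checks that a submission is of similar length to its input and that the typed transcription roughly agrees with an ASR hypothesis. The function names, the overlap measure, and the thresholds are assumptions for illustration, not the experiment's implementation.

```python
def similar_length(prev_text, new_text, tolerance=0.5):
    """Reject responses whose transcription is much shorter or longer than the input."""
    prev_n, new_n = len(prev_text.split()), len(new_text.split())
    return abs(new_n - prev_n) <= tolerance * max(prev_n, 1)

def consistent_with_asr(typed, asr_hypothesis, min_overlap=0.5):
    """Crude agreement check: fraction of typed words that also appear in the ASR output."""
    typed_words = typed.lower().split()
    asr_words = set(asr_hypothesis.lower().split())
    return bool(typed_words) and sum(w in asr_words for w in typed_words) / len(typed_words) >= min_overlap

def accept_trial(prev_transcription, typed_transcription, asr_hypothesis):
    return (similar_length(prev_transcription, typed_transcription)
            and consistent_with_asr(typed_transcription, asr_hypothesis))

print(accept_trial("your teeth begin breaking up the food by chewing it",
                   "your teeth breaking up the food by chewing it",
                   "your teeth breaking up the food by chewing it"))
```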
Figure 3. The browser-based audio recording and transcription interface used by participants. This screenshot depicts the state of the application after the participant has listened to a recording by a previous participant, recorded a response, and is in the process of submitting a transcription for their contributed recording. In this case the participant has been flagged for misspelling a word.

Stimuli and Experimental Design

We select a set of 40 initial target sentences and 9 filler sentences from the TASA corpus (Zeno, Ivens, Millard, & Duvvuri, 1995) and the Brown Corpus (Kucera & Francis, 1967). These sentences are chosen to provide maximal variation in probability under different language models for sentences of the same length. First we determined which sentence length (in terms of words and characters) was the most common in the corpus (10 words, 42 non-space characters + 9 spaces). All were evaluated as grammatical by the experimenters. For this cohort of length-matched sentences, we then obtained their probabilities under unigram and trigram language models trained on the British National Corpus.
For the trigram model, we used modified Kneser-Ney smoothing (Chen & Goodman, 1998) on transitions of order three (for further details regarding this smoothing scheme, see Language Models below). For both unigram and trigram probabilities, we used the empirical distribution of probabilities for the yielded sentences to generate 20 5-percentile tranches. Then, for each tranche, we iterated through sentences, rejecting all sentences phrased as questions, interpretable as inappropriate, or containing numerical characters, hyphenated words, or contractions. This yielded a single sentence for each of the 20 unigram and 20 trigram tranches. Two additional sentences were chosen from the TIMIT corpus (Garofolo, Lamel, Fisher, Fiscus, & Pallett, 1993). Initial audio stimuli were read by a male speaker at a normal conversational pace in a soundproof environment.

All chains are initialized with a grammatical, semantically interpretable sentence, but we make no explicit effort to maintain either property over the course of serial reproduction besides the requirement for intelligibility by downstream participants. Because later recordings may not be grammatical sentences, we refer to them as utterances, though they are sentences in a high proportion of cases (see Discussion).
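A sketch of the stratification step, assuming candidate sentences that have already been scored by a language model; the helper names and the rejection rule are stand-ins for the criteria described above.

```python
import numpy as np

def select_by_tranche(sentences, log_probs, n_tranches=20, reject=lambda s: False):
    """Pick one acceptable sentence from each probability tranche."""
    log_probs = np.asarray(log_probs)
    # Percentile edges defining n_tranches equal-width probability bands.
    edges = np.percentile(log_probs, np.linspace(0, 100, n_tranches + 1))
    selected = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_band = [s for s, lp in zip(sentences, log_probs) if lo <= lp <= hi]
        for s in in_band:
            if not reject(s):          # e.g., questions, numerals, contractions
                selected.append(s)
                break
    return selected

# Example rejection rule standing in for the paper's criteria.
reject = lambda s: s.endswith("?") or any(ch.isdigit() for ch in s)
sentences = ["the dog ran home", "is it raining ?", "he bought 3 apples", "she read the book"]
log_probs = [-12.0, -9.5, -15.2, -10.8]
print(select_by_tranche(sentences, log_probs, n_tranches=2, reject=reject))
```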
Serial Transmission Structure.
The web application included logic for routing newly contributed recordings to other, later participants in the experiment. Iterated learning and serial transmission experiments typically use a strict generation-based design in which each participant contributes data in response to all relevant experiment stimuli, after which those data are given to another participant. We implemented a system that tracked the succession of inputs for each sentence separately, for two reasons. First, this increases the throughput of the experiment: under a strict generation-based design, a new participant cannot start until the previous participant has finished. Second, it allows the experiment to balance the contributions of participants across chains in response to real-time flagging of audio files (e.g., for a blank audio file). The sequence of utterances in a chain satisfies the Markov condition because participants sequentially contribute one utterance each. However, unlike in a strict generation-based design, two participants can provide inputs (i.e., be the immediately preceding participant) for one another on different sentences. An example history of a sentence in the experiment, including contributions of new recordings and flagging, is presented in Figure 4. For more details on the transmission structure, see
Technical Appendix 3: Experiment Transmission Structure.
Participants and Protection of Human Subjects.

A total of 266 participants were recruited using Amazon Mechanical Turk. Participants were limited to those living in the United States with a fast internet connection, a microphone, and speakers. Each IP address and worker identifier (temporarily stored locally in hashed form) was allowed to participate only once.
Figure 4. Schematic of the transmission structure for a single sentence in the experiment. Participants contribute recordings according to the current state of each utterance, unlike generation-based designs. The web app maintains a record of which recordings are flagged by participants or automated methods (e.g., as too short or as bad recordings), and offers the last non-flagged recording as input to a new participant. An example history for one sentence:
S₀ [initial sentence]: "your teeth begin breaking up the food by chewing it"
S₁: "your teeth begin breaking up the food by chew-ing"
S₂: [rejected because too short] "your teeth begin breaking"
S₃: "your teeth breaking up the food by chewing it"
S₄: "her teeth ended up breaking as the food got hard"
S₅: [self-flagged, e.g., if interrupted during recording]
S₆: "her key ended up breaking off into her car"

The quality of each participant's internet connection was screened at the beginning of the experiment to avoid possible issues with downloading and uploading relatively bandwidth-intensive audio. Data collection methods, including the audio data retention and distribution policy, were reviewed and approved by the U.C. Berkeley Committee for Protection of Human Subjects. In addition to providing informed consent, participants also completed a media release allowing their submitted audio recordings (which constitute personally-identifiable information) to be used in publicly-available corpora. With the exception of the audio recordings, Mechanical Turk worker identifiers, and IP addresses (stored in a hashed format and discarded after the completion of the experiment), no other personally-identifiable information was collected. Participants were not explicitly told that the recordings that they heard might come from other participants, though the media release states that their recordings could be used as experimental stimuli. The introduction to the experiment stated that the task was designed to gather data on how people recognize words in conditions with high levels of background noise.
Results
The final recording chains consist of 2,999 utterances. An analysis of participants' flagging behavior is presented in
Technical Appendix 4: Analysis of Flagged Recordings. In this section, we evaluate a suite of probabilistic generative language models in their ability to predict the changes made by participants in a large web-based game of "Telephone." We begin by confirming that the parallel sequences of responses elicited in the Telephone game demonstrate movement towards a consistent set of linguistic expectations. We then evaluate predictive language models with two different techniques. First, we evaluate which language models demonstrate the highest-magnitude increase in probability over the course of the experiment. Second, we test whether surprisal under each model is predictive of which words are transmitted successfully from speaker to listener.

Figure 5. A representative recording chain yielded by the serial reproduction experiment. The first transcription is the initial stimulus. Each subsequent transcription is that of a participant who heard the preceding sentence presented in naturalistic background noise. All responses are collected as audio recordings first; participants are then prompted to provide a written transcription.
Evaluating Movement Towards Convergence
We first evaluate whether the recording chains change in a way that suggests that they are headed towards convergence. Convergence among sampling chains in Markov chain Monte Carlo is often evaluated in terms of a potential scale reduction factor, or PSRF (Gelman & Rubin, 1992). The PSRF measures the degree to which between-chain variance in parameter estimates reduces with respect to within-chain variance over the course of sampling. In this case, we cannot directly access quantitative estimates of the parameters of the "true" latent model—the linguistic expectations of human participants. Instead we
use a proxy measure: whether the probability estimates for independent chains become more similar within a model over the course of the serial reproduction experiment. While the probability measures could potentially become progressively more similar for other reasons, this pattern is a necessary condition for the chains to be approaching the distribution of interest. Because the model-based proxy measures above are noisy, we aggregate utterance chains into groups. Chains are stratified into four groups on the basis of the probability quartile of the initial sentence as computed by each language model. Initial conditions of the recording chains are known to vary significantly on this dimension, in that initial stimuli reflect a stratified sample based on unigram and trigram probability measures. We take the mean probability within each quartile at each generation (sequential position within the recording chain), and from that compute the inter-quartile variance.
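A sketch of this diagnostic, assuming a long-format table with one row per utterance holding the chain identifier, the generation index, the initial-sentence probability quartile, and a model's utterance log probability; the column names are illustrative, not the analysis code used here.

```python
import pandas as pd

def interquartile_variance(df):
    """Variance across initial-probability quartiles of mean utterance log probability,
    computed separately at each generation and scaled by the generation-0 value."""
    quartile_means = (df.groupby(["generation", "initial_quartile"])["log_prob"]
                        .mean()
                        .unstack("initial_quartile"))
    variance = quartile_means.var(axis=1)   # between-quartile variance per generation
    return variance / variance.iloc[0]      # relative to initial variance

# df = pd.read_csv("utterance_log_probs.csv")   # hypothetical input file
# print(interquartile_variance(df))
```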
Figure 6. Variance in language model probability estimates at each generation of the serial reproduction experiment, relative to initial variance. Variance estimates are computed with respect to means for four groups of chains, defined by initial sentence probability quartile. The reduction in variance suggests that the utterances in chains starting with high-probability sentences and low-probability sentences become much more similar over the course of the experiment.

The asymptotic decrease in inter-quartile variance seen in Figure 6 suggests that the recording chains become increasingly similar over the course of the experiment under all language models. Figure 6 shows that variance drops to less than a quarter of the initial value for all models (with one exception for Big LM at the 25th generation), and to around a tenth of the initial value for several of the models trained on large datasets (n-gram models trained on the BNC and the DeepSpeech datasets). However, we caution against the stronger assertion that the chains are sampling directly from people's expectations, because utterances continue to increase in probability at the end of the experiment (next section), suggesting that the utterances need to undergo further changes to converge to human expectations. We leave the possibility of sampling directly from the participants' revealed expectations after convergence to future research, and for now use samples generated as participants approach the distribution of interest.

Evaluating Language Models
Ideally, the language models here could be evaluated using the probability each assigns to a large set of utterances sampled from participants after strong evidence that the sampling chains have fully converged to participants' expectations—both that inter-quartile variance approaches zero and that the probability of utterances is no longer increasing. Because utterance probabilities continue to increase, these chains are likely not directly sampling from participants' expectations by the end of the experiment.

Instead, we use the fact that serial reproduction yields sentences that are increasingly representative of people's linguistic expectations in the task, even if they have not yet converged. As such, the pattern of changes towards the target distribution can be used to evaluate models: more representative models should show a greater-magnitude increase in probability (manifested as a decrease in average per-word surprisal) over the course of the experiment. A model reflecting the true expectations of participants should exhibit the largest possible decrease in surprisal over the course of serial transmission. Though we lack direct access to people's expectations, the magnitude of the increase in utterance probability for each language model—as an approximation of those human expectations—is sufficient to rank it with respect to others.

To eliminate the effect of utterance length on probability (shorter utterances would trivially be assigned higher probabilities under all language models), we divide each utterance's negative log probability by the number of words in that utterance. The resulting measure, "average per-word surprisal," has an intuitive interpretation as the average surprisal, measurable in bits, for each utterance under each model. The average of this measure across chains over the course of the experiment is shown in Figure 7.
Figure 7. Average per-word surprisal in bits (−log p(w | c)) under each language model over the course of serial reproduction. Error bars indicate the standard error of the mean.

Models vary in their surprisal estimates for reasons outside of the scope of the current analysis, especially the choice of smoothing scheme. For example, consider two unigram models that withhold different amounts of probability mass for word tokens not encountered during training, e.g., .05 and .01. If these two models were used to produce probability estimates for a set of sentences comprised exclusively of known tokens, the first model would assign a lower probability estimate than the second model, even though the probability estimates would be perfectly correlated. We thus focus not on the intercept for each model but on the slope of the change in surprisal, taking advantage of the fact that a change in the number of bits (−log p(w)) corresponds to a constant multiplicative factor in probabilities: an increase of one bit, whether it is the third or the seventh, means that the words in the set of utterances are, on average, half as probable.

While average per-word surprisal under each model for recording chains through 25 sequential transmissions is plotted in Figure 7, the change in average per-word surprisal (in bits) is plotted in Figure 8.
Figure 8. Change in the average per-word surprisal for utterances over the course of the Telephone game under 11 language models. A decrease in average per-word surprisal of one bit means that words in the utterances yielded by serial reproduction are on average twice as probable under that model.

This latter graph reveals that higher-order n-gram models trained on large datasets and the BLLIP PCFG parser trained on the Penn Treebank best capture the changes observed in the course of serial reproduction. The observed pattern of increases in probability corresponds with the absolute model probabilities, in that these models are the same ones that assign the highest probability to the final generations of the sampling chains. n-gram models and the Roark parser (trained on the same TreeBank dataset) show statistically significant—though more modest—increases in probability for later sentences. Big LM exhibits low initial average per-word surprisal, but exhibits a relatively minor increase in probability for later sentences.

We next turn to the question of whether there are differences in performance between broad classes of models as a result of theoretically-relevant architectural distinctions, in particular whether models that represent hierarchical syntactic relationships more accurately reflect the changes made in the course of the serial reproduction experiment. To confirm that the identified theoretical contrast (usage of abstract higher-order representations) is indeed reflected in the probability estimates produced by the models, we first measure the similarity between language models on the basis of the probability estimates they provide for all n = 3,193 utterance transcriptions yielded by the experiment. If model performance is sensitive to this distinction, correlations should be strongest within the two partitions (models that make use of hierarchical representations vs. those that do not). Because the form of the relationship may not be linear, we use Spearman's rank correlation coefficient to evaluate the pairwise similarity. We limit this analysis to variants of models trained on the Penn TreeBank to eliminate the effect of training dataset on model performance (see Technical Appendix 5: Similarity of Language Models for a comparison of all models).

The results of this analysis yield two major clusters: the n-gram models, which do not use higher-order representations of lexical context for prediction, and the RNNLM and the two PCFG models, which do (Figure 9). Among the n-gram models, the unigram model is distinguished from the higher-order n-gram models; these longer-context n-gram models are more similar than the unigram model to the models with higher-order abstract structure. The yielded similarities show that the predictions derived from these models reflect the key architectural distinction of theoretical interest regarding the representation of linguistic context.
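A sketch of this similarity analysis, assuming a matrix with one column of per-utterance probability estimates per model; the model list and the random placeholder scores are illustrative, not the actual estimates.

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

model_names = ["Unigram", "Trigram", "5-gram", "Roark", "BLLIP", "RNNLM"]
# scores: shape (n_utterances, n_models); random numbers stand in for real estimates.
scores = np.random.rand(3193, len(model_names))

rho, _ = spearmanr(scores)            # pairwise rank correlations between models
distance = 1.0 - rho                  # convert similarity to a dissimilarity
np.fill_diagonal(distance, 0.0)
link = linkage(squareform(distance, checks=False), method="ward")
tree = dendrogram(link, labels=model_names, no_plot=True)
print(tree["ivl"])                    # leaf order groups similar models together
```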
Further, these results substantiate our claim that recurrent neural network language models are better grouped with the PCFG models than with the n-gram models given their representational capacities, in contrast to the classification used in Frank and Bod (2011) and Fossum and Levy (2012).

Figure 9. A. Spearman's rank correlation for pairwise combinations of sentence probabilities across the 3,193 sentences from the experiment (models: Unigram, Smoothed Trigram, Smoothed 5-gram, Roark Parser, BLLIP Parser, RNNLM). B. The resulting dendrogram, derived from the rank correlations using Ward's method, shows that the model-based probability estimates reflect the architectural difference of interest, namely whether models represent hierarchical syntactic relations.

The central architectural question we address is whether models that represent higher-order regularities, such as phrase structure or abstract semantic representations of the preceding context, better reflect the revealed linguistic expectations of participants than models that track only specific preceding words. The n-gram models (where n > […] recording chain × generation random slope. The regression model is fit with 11 model estimates of average per-word surprisal for each of 2,864 utterances produced between the 1st and 26th participant (thereby omitting the initial sentences).

We first test whether accounting for each language model's type of representation (abstract hierarchical vs. lexical) improves the overall fit of the mixed-effects model by comparing the above full model specification with a nested model lacking type of context representation as a predictor. This reveals that the model that includes the type of context representation as well as its interaction with generation exhibits significantly better fit (χ² = 1401.1, p < .0001). For this full model (Tab. 2), there is a statistically significant negative coefficient for the generation × abstract structure: used interaction (β = -0.0044, t value = -6, p < .0001). This greater-magnitude decrease in surprisal estimates suggests that the changes made by people are better reflected by the language models that use abstract representations of preceding context. As with the previous analysis, we find that models that are trained on the Penn Treebank are less representative of people's linguistic expectations than models trained on larger datasets (generation × dataset: PTB, β = 0.0042, t value = 6.03, p < .0001).
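The nested-model comparison described above might be sketched as follows; the file name and column names (avg_surprisal, generation, abstract, ptb, chain) are hypothetical placeholders for the per-utterance, per-model table, and the formulas only approximate the specification summarized in Table 2.

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

df = pd.read_csv("per_word_surprisal_by_model.csv")  # hypothetical file

def fit(formula):
    # Random intercept and generation slope by recording chain; fit by ML
    # (reml=False) so models differing in fixed effects can be compared.
    return smf.mixedlm(formula, df, groups=df["chain"],
                       re_formula="~generation").fit(reml=False)

full = fit("avg_surprisal ~ generation * abstract + generation * ptb")
nested = fit("avg_surprisal ~ generation * ptb")  # drops the abstract-structure terms

lr = 2 * (full.llf - nested.llf)                  # likelihood-ratio statistic
df_diff = len(full.fe_params) - len(nested.fe_params)
print(lr, chi2.sf(lr, df_diff))
```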
Predicting Word-Level Errors. A second way to compare the fit of probabilistic generative language models to human behavior is to evaluate how well they predict which words are misheard, that is, altered or deleted in transmission. We specifically test whether models that represent abstract hierarchical structure provide predictive power, in the form of their conditional probability estimates, beyond lexical models regarding which words are altered or deleted. Instances of deletions and substitutions are identified using dynamic programming, using the edit operations corresponding to the Levenshtein distance, or minimum edit distance, between utterances. As is commonly done to estimate Word Error Rate (WER) in natural language processing (Popović & Ney, 2007), this computation is applied to word sequences rather than character strings (Tab. 3). (While we also collect data regarding insertions in the course of serial reproduction, it is harder to identify the relevant properties of the preceding recording that prompt the insertion of material.)
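A minimal sketch of this word-level alignment: a standard dynamic-programming Levenshtein alignment applied to word sequences rather than characters. It is not the exact implementation used here, and the recovered operation string can differ from Table 3 depending on tie-breaking and the relative cost assigned to substitutions.

```python
def align(src, tgt):
    """Word-level Levenshtein alignment; returns operations as M/S/D/I."""
    n, m = len(src), len(tgt)
    # dp[i][j] = minimum edit distance between src[:i] and tgt[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (src[i - 1] != tgt[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover one minimum-cost edit script.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (src[i - 1] != tgt[j - 1]):
            ops.append("M" if src[i - 1] == tgt[j - 1] else "S")
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append("D")
            i -= 1
        else:
            ops.append("I")
            j -= 1
    return ops[::-1]

src = "you may not notice yourself grow from day to day".split()
tgt = "you may not notice as you grow day by day".split()
print(" ".join(align(src, tgt)))  # one minimum-cost alignment of the Table 3 sentence pair
```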
Table 2
A mixed-effects linear regression examining average per-word surprisal estimates (negative log probability under a model) across 9 language models as a function of 1) whether the model uses an abstract representation of the preceding context to inform expectations and 2) the dataset used to fit the model.
Fixed Effects                              Coef (β)   SE(β)    t value   p
(Intercept)                                 2.3362    0.0272    85.81    <.0001
generation                                  0.001     0.0025     0.38    0.7
abstract structure: used                    0.2382    0.0092    25.8     <.0001
dataset: PTB                                0.2923    0.0088    33.38    <.0001
generation × abstract structure: used      -0.0044    0.0007    -6       <.0001
generation × dataset: PTB                   0.0042    0.0007     6.03    <.0001

Random Effects                             Std. Dev.
(Intercept) | Recording Chain              0.35
generation | Recording Chain               0.03
Note: Significance of fixed effects is computed following Satterthwaite (1946).

We present a separate analysis of participants' edit rates in relation to their flagging behavior in
Technical Appendix 4: Analysis of Flagged Recordings.

We then construct a mixed-effects logistic regression model including model-based estimates of surprisal, as well as the control variables of position in sentence, age-of-acquisition ratings ([Dataset] Kuperman, Stadthagen-Gonzalez, & Brysbaert, 2012), concreteness ([Dataset] Brysbaert, Warriner, & Kuperman, 2014), number of phonemes, number of syllables ([Dataset] Brysbaert & New, 2009), and phonological neighborhood density ([Dataset] Yarkoni, Balota, & Yap, 2008) as fixed effects to predict whether the word changes (1 = change, 0 = no change). We restrict model-based estimates to those from larger datasets (i.e., omitting models trained on the Penn Treebank if another model with the same architecture but trained on a larger dataset is available).

Table 3
Example edit string from input sentence to output sentence. M, D, I, and S indicate Match, Deletion, Insertion, and Substitution, respectively.
M M M M D I I M D M S M
you may not notice yourself grow from day to day
you may not notice as you grow day by day

To deal with the strong correlations between language model predictions, we use a residualization scheme to isolate their respective contributions (a schematic sketch of this residualization appears after Table 5). Unigram surprisal (negative log probability) is used directly as a predictor. We then predict trigram surprisal from unigram surprisal, and use the residuals as a predictor representative of the contribution of a language model that considers the two words preceding the word in question as the preceding context. For each of the remaining models with word-level surprisal estimates (the 5-gram model trained on the DeepSpeech dataset, Big LM trained on the One Billion Word Benchmark, and the Roark Parser trained on the Penn Treebank), we take the residuals after predicting its surprisal estimates from both the unigram and trigram models.

The identity of the listener and the identity of the speaker are treated as random intercepts to account for variability in comprehension performance and speaker intelligibility. The model is fit with 27,290 instances where a participant heard a word and either 1) reproduced it faithfully in their own recording (22,482 cases), 2) produced an identifiable substitute (substitution), or 3) did not produce an identifiable substitute (deletion). Cases 2) and 3) were collapsed into a single category of transmission failure, coded as 0. We fit the full model and conduct no pruning of predictors.

We report the contributions of lexical properties and the word's position in the sentence before reporting on the contribution of model-based estimates of predictability. Words are more likely to change if they appear later in the utterance (β = 0.0989, z value = 15.57, p < .0001) or if they are acquired later in development (β = 0.0544, z value = 3.96, p < .0001). By contrast, words with more syllables (β = -0.249, z value = -4.91, p < .0001) or more phonemes (β = -0.0609, z value = -2.64, p < .01) are less likely to change, as are words rated as highly concrete (β = -0.0856, z value = -4.46, p < .0001). In line with recent regression models investigating recognition accuracy for isolated words (Gahl & Strand, 2016), this analysis suggests that words in sparse phonological neighborhoods, that is, words with a high average phonological Levenshtein distance to the 20 most similar wordforms (PLD-20, per [Dataset] Yarkoni et al., 2008), are less likely to change (β = -0.0948, z value = -2.22, p < .05).

The mixed-effects logistic regression reveals that words with higher unigram surprisal are more likely to change (β = 0.3827, z value = 17.25, p < .0001), as are words with higher trigram surprisal with unigram surprisal partialed out (β = 0.3729, z value = 19.37, p < .0001). For all models with abstract structure (with trigram surprisal partialed out), higher residualized surprisal estimates are predictive of a transmission failure. This corroborates the effect of word frequency on recognition seen in Gahl and Strand (2016), and is consistent with the independent contributions of word frequency and word predictability seen in automatic speech recognition systems (Goldwater et al., 2010). An example transmission chain with each word colorized by the probability of successful recovery by the listener under the full model is shown in Fig. 10. This figure shows higher rates of transmission failures for low-frequency "content" words vs.
generally high-frequency "function" words, though function words in certain constructions have higher probabilities of transmission failure (e.g., "in to the door").

Given that the fit model can be used to estimate the probability of change for each word in the dataset, it can then be evaluated against the ground truth of whether the word changed or not. The area under the ROC curve corresponds to the probability that, in a large random sample of (transmission failure, transmission success) pairs, the model assigned a higher probability of change to the word that was actually changed than to the word that was not changed. The full ensemble model (AUC = .650, AIC = 23,504) outperforms simpler ones, for example one based only on word predictability (surprisal) under language models (AUC = .632, AIC = 23,900) and one based on invariant word properties (AUC = .583, AIC = 24,445). The difference between the areas under the respective ROC curves and the difference in AICs suggest that while both sets of predictors contribute to predictions, predictability under language models is a better predictor of transmission failure than invariant properties of words.

Table 4
Mixed-effects logistic regression predicting whether a word will be changed in transmission on the basis of its surprisal under various language models as well as other word properties.
Fixed Effects                                   Coef (β)   SE(β)    z value   Pr(>|z|)
(Intercept)                                     -2.6554    0.0933   -28.45    <.0001
BNC unigram surprisal                            0.3827    0.0222    17.25    <.0001
Residualized BNC trigram surprisal               0.3729    0.0193    19.37    <.0001
Residualized Roark PCFG syntactic surprisal      0.1159    0.0266     4.36    <.0001
Residualized Big LM surprisal                    0.1959    0.0181    10.82    <.0001
Residualized DS 5-gram surprisal                 0.1593    0.0262     6.07    <.0001
Position in sentence                             0.0989    0.0064    15.57    <.0001
Age of acquisition                               0.0544    0.0137     3.96    <.0001
Number of phonemes                              -0.0609    0.0231    -2.64    <.01
Number of syllables                             -0.249     0.0507    -4.91    <.0001
Concreteness                                    -0.0856    0.0192    -4.46    <.0001
Phonological Neighborhood Density (PLD20)       -0.0948    0.0427    -2.22    <.05

Random Effects                                  Std. Dev.
Listener ID                                     0.62
Speaker ID                                      0.3
Note: Significance of fixed effects is computed following Satterthwaite (1946).

The previous ensemble model tests whether each model architecture makes an independent contribution in predicting transmission failures. Another question is how the models directly compare to one another in their ability to predict transmission failures. To this end, we fit a series of models with the same set of non-model-based predictors (e.g., position in sentence, age of acquisition, concreteness, and phonological neighborhood density) and use the per-word surprisal estimates from a single model. We then compare the logistic models in terms of AIC and area under the ROC curve (AUC). This analysis (Table 5) reveals that large LSTMs and smoothed 5-grams trained on a large collection of conversational text best predict transmission failures. The Roark Parser has the least predictive power, though this may reflect that its training corpus (the Penn Treebank) was at least a hundred times smaller than those used to fit the other language models.
Table 5
Comparison of individual language models for predicting transmission failures. Each model contains the same set of word properties, plus surprisal estimates from one language model.
Model                  AUC     AIC
OBWB / Big LM          0.639   23677
DS / 5-gram            0.641   23687
BNC / Trigram          0.638   23732
BNC / Unigram          0.604   24095
PTB / Roark Parser     0.592   24370
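A compact sketch of the residualization and evaluation steps described in this section. The column names and input file are hypothetical, and an ordinary logistic regression stands in for the mixed-effects model with listener and speaker random intercepts, so the coefficients and AUC will not match those reported in Tables 4 and 5.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import roc_auc_score

df = pd.read_csv("word_level_outcomes.csv")  # hypothetical: one row per heard word
# assumed columns: changed (0/1), unigram, trigram, biglm, position, aoa, n_phon

def residualize(target, predictors):
    """Return target minus its least-squares prediction from the predictors."""
    X = np.column_stack(predictors)
    return target - LinearRegression().fit(X, target).predict(X)

df["trigram_resid"] = residualize(df["trigram"].to_numpy(), [df["unigram"].to_numpy()])
df["biglm_resid"] = residualize(df["biglm"].to_numpy(),
                                [df["unigram"].to_numpy(), df["trigram"].to_numpy()])

features = ["unigram", "trigram_resid", "biglm_resid", "position", "aoa", "n_phon"]
X, y = df[features].to_numpy(), df["changed"].to_numpy()
clf = LogisticRegression(max_iter=1000).fit(X, y)

# AUC: probability that a randomly chosen changed word receives a higher predicted
# probability of change than a randomly chosen unchanged word.
print(roc_auc_score(y, clf.predict_proba(X)[:, 1]))
```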
Discussion
In this work, we investigate several probabilistic generative language models (PGLMs) in their ability to capture people's linguistic expectations. A serial reproduction task (the game of Telephone, in which participants reproduce recordings made by other participants in sequence) reveals participants' prior expectations in a naturalistic spoken word recognition task. We find evidence that people use abstract representations of the preceding linguistic context to inform their expectations, in line with the results of Fossum and Levy (2012) for reading.
The Role of Abstract Structure
Before further interpretation of the results regarding the utility of abstract structure, we clarify the exact nature of our claims and note several methodological challenges that temper the strength and generality of these results.

First, a top-level theoretical clarification is in order: higher-level abstractions are thoroughly implicated in the processes of sentence comprehension and production (e.g., verb argument structure). Our objective in this study, following Frank and Bod (2011), is to investigate whether such abstractions may inform the lower-level process of word recognition.

Second, we highlight an inherent brittleness in using a small cohort of probabilistic generative models to characterize human knowledge more generally: the encoded expectations reflect a complex interaction of architecture, training data, and fitting procedure. Limited explanatory power for human performance may reflect any one of these aspects, or a complex interaction between them. Further, each language model is drawn from a larger space of possible language models, such that the generalizations we make about model architectures on the basis of a few examples may not be robust.

On the matter of fitting, the encoded expectations may reflect local minima or maxima for some models. While this is not a concern with n-gram models fit with count-based methods, this problem increases in severity for language models that have large numbers of randomly initialized parameters, especially neural network language models. For these models, continued research focuses on how to improve the speed and robustness of the training procedure. All models may suffer from the problem of overfitting (Geman, Bienenstock, & Doursat, 1992), in which learned distributions reflect properties of the training data at the expense of their generality in extending to new data. The relationship between corpus size and performance is modulated by architecture: large models trained on small datasets may be more prone to overfitting than small models trained on those same datasets. More abstract structure may allow models to perform better on smaller datasets because they can fall back on expectations for higher-level abstractions in the absence of experience with specific words or sequences. But smaller training corpora may also make it harder for models to arrive at useful abstractions in the course of fitting.

We took several steps to address these pitfalls regarding model performance. First, we confirmed that all models exhibited characteristic levels of performance on standard test datasets, suggesting that the fitting procedures used here are consistent with previous benchmarks. Though we note above the broader shortcomings of evaluating language models in terms of perplexity on a held-out dataset, perplexity is nonetheless useful insofar as it allows us to check whether the fitting procedure yields models comparable or equivalent to those used in other studies. Second, we treat the dataset on which each model is trained as a separate predictor in our analyses (binarized into those trained on the Penn Treebank and those trained on large datasets known to yield highly performant models). This helps isolate the contribution of the model architecture from the dataset on which the model was trained.
Third, we note that deficiencies in model fitting would most likely affect the more sophisticated neural network models or PCFG parsers, which would yield a null result rather than the positive one in favor of structure obtained here.

Another caveat is that the set of models investigated here represents only a small sample of possible models. Sampling a larger set of models may reveal that the differences in fit to human behavior that we find for model architectures are not robust, or that other distinctions in model architecture are predictive of larger differences in performance. We acknowledge this limitation of the current study, and note the possibility of evaluating a larger set of models, reflecting a larger space of architectures, in future work.

The regression analysis presented above yielded a statistically significant negative coefficient for generation × abstract structure: used, suggesting that models using abstract hierarchical structure better predict the changes made by people in the course of the transmission experiment. However, we note that this is a relatively minor quantitative effect. Several of the n-gram models, such as the DeepSpeech Kneser-Ney 5-gram and the BNC trigram model, exhibit a pattern of decrease in surprisal estimates that is very similar to that of models with considerably more sophisticated representations of abstract structure (BLLIP and Big LM). This pattern of near-parity between certain large, higher-order n-gram models and more sophisticated generative models may be explained by a recent theoretical proposal regarding processing difficulty that posits graded use of detailed structural representations in people's linguistic expectations. Noisy-context surprisal suggests that while people may use structured, abstract representations of the preceding linguistic context, such representations are imperfect and in particular tend to degrade the longer they are kept active in memory (Futrell & Levy, 2017).

Futrell and Levy (2017) highlight the case of structural forgetting effects, where people do not effectively use preceding grammatical structures to predict the remainder of the sentence. For example, of the two utterances,

1. *The apartment that the maid who the cleaning service had sent over was well-decorated.
2. The apartment that the maid who the cleaning service had sent over was cleaning every week was well-decorated.

the first is grammatically ill-formed in that no verb corresponds to the noun phrase with the head "maid," yet people give it consistently higher grammaticality ratings than the second sentence (originally presented in Gibson & Thomas, 1999). This effect, they argue, arises from an "information locality" effect, whereby structural expectations, which should give infinitely more probability mass to the grammatically correct sentence than to the incorrect one, are attenuated for material with longer intervening intervals (in terms of words or time). Under this account, people have fleeting access to the representations (approximated with parse trees) that they form in sentence processing.
n-gram models can be thought of as an approximation of an abstract, structured generative model, but one with a particularly sharp memory decay function, such that an extremely limited sample of the preceding context is used to predict the identity of the next word.

The memory-based effect identified by Futrell and Levy (2017) may be further exacerbated in the current task by high levels of background noise, such that we observe relatively small advantages of abstract structural representations. While imperfect memory imposes noise on the representation of context even under optimal acoustic conditions, this decay may be yet stronger if participants lack peaked estimates regarding what actually constitutes the preceding context.

An Error Model for In-context Spoken Word Recognition
Data regarding which words are successfully transmitted from speaker to listener provide an alternative method for evaluating how well these language models capture people's expectations. Transmission errors are more likely when a word is unpredictable given the preceding context. These models assign divergent probability distributions over continuations in many cases. For example, models with an abstract representation of syntax would assign different probabilities to the continuations "is" and "are" given the preceding context "the quick brown fox and the lazy dog". A hierarchical model generally picks out a subject consisting of the coordinated NPs with heads "fox" and "dog," such that the plural copula "are" is a more likely continuation (less surprising) than "is". An n-gram model tracking a limited preceding lexical context, e.g., a trigram model that uses only the two preceding words, would suggest that "is" is the more probable continuation given the two preceding words "lazy dog"; under this model "are" is more surprising (*"the lazy dog are").

The regression analysis above shows predictive utility, in the form of statistically significant beta coefficients, for trigram, 5-gram, PCFG (Roark), and LSTM (Big LM) estimates of in-context predictability, suggesting that even the most abstract representations of context are predictive of the collected transmission errors. Nonetheless, we find a continued role of word frequency, consistent with the predictions of the modeling work by Goldwater et al. (2010). A comparison of the individual models reveals that the LSTM (Big LM) and the DeepSpeech 5-gram model best predict the edits; we leave it to future work in the same format as Goodkind and Bicknell (2018) to test whether word predictability estimates from these models also excel at predicting reading times and other behavioral measures of processing difficulty.

The finding that the age of acquisition of a word is predictive of successful transmission (earlier-acquired words are more likely to be recognized than later-acquired ones) extends previous results from isolated visual word recognition (Morrison & Ellis, 2000) into the auditory realm. This is consistent with the hypothesis that such words enjoy a privileged status above and beyond their frequency, perhaps relating to the problem of retrieving semantic representations (Austerweil, Abbott, & Griffiths, 2012) given the compositional structure of the lexicon (Vincent-Lamarre et al., 2016), in which lower-frequency words are defined in reference to higher-frequency ones.

A second notable finding is that once the relationship between word length and unigram surprisal is accounted for, longer words are more likely to be recognized. This can be interpreted as evidence that people are better able to recognize words that are more perceptually distinctive, in that words with more syllables have fewer perceptually similar competitors, above and beyond edit-distance-based measures of neighborhood density. These words may also be easier to recover because there is more redundant information about their identity, such that listeners are more likely to be able to recover the intended word if some portion is corrupted in transmission.

Limitations of Serial Reproduction
A potential caveat to the generality of the results is that the expectations implicit in the utterances obtained from the Telephone game may be task-dependent, and of limited utility for characterizing linguistic expectations more generally. For example, participants could infer that the task they are performing is qualitatively unlike "normal" language use, and make use of a different set of expectations, such that the collected data are not representative of the distributions of interest for language processing. It would be particularly concerning, for example, if the collected utterances demonstrated marked decreases in grammatical acceptability or semantic interpretability over the course of serial reproduction. To evaluate whether participants produced grammatical and semantically interpretable recordings for the duration of the experiment, authors S.M. and S.N. conducted a follow-up analysis in which they independently coded all utterances from generations 23-25 with binary judgments of grammaticality and semantic interpretability. The latter category was used to distinguish sentences that are structurally well-formed but not interpretable, e.g., "the bus and bus driver were opening the door." Utterances in this set were judged as 87% and 88% grammatical (Cohen's κ = 0.94) and 78% and 79% semantically interpretable (Cohen's κ = 0.89). The results of this analysis suggest that participants largely maintain the grammaticality and semantic well-formedness of their responses through the course of the serial reproduction experiment, and that participants are tapping into a set of expectations similar to those of "normal" language use.

Another important possibility is that participants could be modulating how they use preceding context in word recognition based on the level of noise in the experiment. Because they may be unsure of the preceding context for a particular word, they may prefer shorter, less structured representations of context in the current experiment, whereas they might rely on that context more heavily in a noise-free environment. This basic logic of noise-modulated expectations is substantiated by the finding of Luce and Pisoni (1998) that participants' reliance on word frequency in isolated spoken word recognition increases as a function of the level of background noise. Audio stimuli here are embedded in relatively high levels of noise, qualitatively dissimilar to the reading tasks of Frank and Bod (2011) and Fossum and Levy (2012). We highlight characterizing variation in the use of linguistic expectations as a function of environmental noise as an important next step to explore with this paradigm.

One possibility is that participants are providing responses according to the maximum a posteriori (MAP) estimate, i.e., choosing the response with the highest posterior probability rather than sampling from possible responses according to their posterior probabilities. If participants respond according to their MAP estimate of what was said by the previous speaker, the equivalence between the stationary distribution obtained in the limit and the participants' prior is no longer guaranteed. Under certain conditions, the implications of participants choosing according to MAP are known, and the resulting stationary distribution is highly informative regarding the characteristics of the prior.
Kirby, Dowman, and Griffiths (2007) introduce a general-purpose method for interpolating between MAP and posterior sampling when hypotheses have a constant likelihood, and demonstrate that the ordering of hypotheses under the prior is preserved under MAP. Griffiths and Kalish (2007) demonstrate the equivalence of MAP to expectation-maximization (Dempster, Laird, & Rubin, 1977) in continuous hypothesis spaces, with a guarantee that the stationary distribution will be centered on the maximum of the prior, with variance increasing according to the rate at which hypotheses change over sequential participants. If the asymptotic normality assumption holds, this same equivalence between the stationary distribution and the prior should be expected here.

While the experimental paradigm presented here can potentially reveal participants' linguistic expectations, it is relatively costly from a sampling perspective. Coverage in the dataset shared here is limited by the small number of initial sentences (reflecting a limited number of semantic contexts), relatively short chains, and relatively strong autocorrelation within those sampling chains. While this method could in principle be used to collect a very large collection of utterances after convergence, from which a new probabilistic model of linguistic expectations could be estimated, a more data-efficient method may be to use the word-by-word responses to fine-tune the parameters of an existing model (D. Ziegler et al., 2019). In future work we will investigate whether this fine-tuning approach can effectively combine the broad coverage of language models learned from large text corpora with the specific behaviors of human participants in the task of in-context spoken word recognition.

Finally, we emphasize the limitations in the generality of these results with respect to individual variation among speakers of English, as well as variation across speakers of different languages. This experiment makes the strong simplifying assumption that the process of serial reproduction yields samples from a single, unified set of linguistic expectations that are shared across all English-speaking participants for the purposes of word recognition. Of course, linguistic expectations should vary significantly as a function of linguistic experience, and may vary between speakers for other reasons, such as population-level variability in working memory. At a higher level, the expectations of English speakers are certainly not representative of the expectations of speakers of other languages. Given the pronounced typological diversity of languages, expectations may take qualitatively different forms. For example, listeners may rely less on sequential word order in languages with more flexible word order. Future work will be needed to characterize the ways in which, and the extent to which, expectations vary across natural languages.

Conclusion
In this study, we collected data on how utterances change in the course of a web-based game of "Telephone." We use this data, which provides an alternative to corpus-based evaluation methods, to evaluate a range of broad-coverage probabilistic generative language models in their ability to characterize people's linguistic expectations for in-context spoken word recognition. Addressing an outstanding empirical question in speech processing, we find that models that use an abstract (i.e., hierarchical) representation of preceding context to inform expectations regarding upcoming words more strongly reflect the changes made by people. This paradigm offers promise in helping to better understand humans' remarkable language processing abilities.
Supplementary Material
All audio recordings, metadata, and analyses can be accessed through a project-specific repository on the Open Science Framework, osf.io/3hws2.

your teeth begin breaking up the food by chewing it
your teeth begin breaking up the food by chewing it
your teeth end up breaking up the food by chewing
your teeth end up breaking up the food by chewing
your teeth end up breaking up the food by chewing
your teeth end up breaking up the food by chewing it
her teeth ended up breaking to the food back to you
her teeth ended up breaking as the food got hard
her key ended up breaking off into her car
her key ended up breaking in to her car
our key ended up breaking into the car
our key ended up breaking into the door
berkie ended up breaking in to the door
our key ended up breaking into the door
our key ended up breaking in to the door
her key ended up breaking into the door
her key ended up breaking into the door
her key ended up breaking in the door
her key ended up breaking in the lock
the key ended up breaking in the lock
the key ended up breaking in the lock
the key ended up opening the lock
the key ended up opening the lock
the key ended up opening the lock
the key ended up opening the lock
the key to it is upholding the law
Figure 10. Probability that each word from an example recording chain is changed in transmission as predicted by a mixed-effects logistic regression model. Red indicates that a word is likely to be deleted or replaced with a substitution, while green suggests that a listener is likely to reproduce the word successfully. Colors are mapped from high to low for each sentence.

References

Altmann, G. T., & Kamide, Y. (1999). Incremental interpretation at verbs: restricting the domain of subsequent reference.
Cognition , , 247–264.Austerweil, J. L., Abbott, J. T., & Griffiths, T. L. (2012). Human memory search as arandom walk in a semantic network. In Advances in neural information processingsystems (pp. 3041–3049).Bartlett, F. (1932).
Remembering: A study in experimental and social psychology .Cambridge University Press.Benichov, J., Cox, L., Tun, P., & Wingfield, A. (2012). Word recognition within alinguistic context: Effects of age, hearing acuity, verbal ability and cognitive function.
Ear and Hearing , (2), 250.Bies, A., Mott, J., Warner, C., & Kulick, S. (2012). English Web Treebank. LinguisticData Consortium, Philadelphia, PA .Broadbent, D. (1967). Word-frequency effect and response bias.
Psychological Review , (1), 1.Charniak, E., & Johnson, M. (2005). Coarse-to-fine n-Best Parsing and MaxEntDiscriminative Reranking. In Proceedings of the 43rd Annual Meeting pf theAssociation for Computational Linguistics (pp. 173–180).Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P., & Robinson, T.(2013). One Billion Word Benchmark for Measuring Progress in Statistical LanguageModeling. arXiv preprint arXiv:1312.3005 .Chen, S., & Goodman, J. (1998). An empirical study of smoothing techniques for languagemodeling.
Computer Speech and Language , (4), 359-393.Connine, C., Blasko, D., & Hall, M. (1991). Effects of subsequent sentence context inauditory word recognition: Temporal and linguistic constrainst. Journal of Memoryand Language , (2), 234–250.Connine, C. M., & Clifton, C. (1987, May). Interactive use of lexical information in speechOBUST WORD RECOGNITION 47perception. J Exp Psychol Hum Percept Perform , (2), 291–299.[Dataset]. (2007). The British National Corpus. (Version 3, XML Edition. Distributed byOxford University Computing Services on behalf of the BNC Consortium)[Dataset] Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A criticalevaluation of current word frequency norms and the introduction of a new andimproved word frequency measure for American English.
Behavior Research Methods , (4), 977–990.[Dataset] Brysbaert, M., Warriner, A., & Kuperman, V. (2014). Concreteness ratings for40 thousand generally known English word lemmas. Behavior Research Methods , (3), 904–911.[Dataset] Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012).Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods , (4), 978–990.[Dataset] Yarkoni, T., Balota, D., & Yap, M. (2008). Moving beyond Coltheart’s N: a newmeasure of orthographic similarity. Psychonomic Bulletin and Review , (5),971–979.De Marneffe, M., MacCartney, B., & Manning, C. (2006). Generating typed dependencyparses from phrase structure parses. In Proceedings of LREC (Vol. 6, pp. 449–454).Demberg, V., & Keller, F. (2008). Data from eye-tracking corpora as evidence for theoriesof syntactic processing complexity.
Cognition , (2), 193–210.Demberg, V., & Keller, F. (2009). A computational model of prediction in human parsing:Unifying locality and surprisal effects. In Proceedings of the Annual Meeting of theCognitive Science Society (Vol. 31).Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete datavia the EM algorithm.
Journal of the Royal Statistical Society: Series B(Methodological) , (1), 1–22.Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deepOBUST WORD RECOGNITION 48bidirectional transformers for language understanding. arXiv preprintarXiv:1810.04805 .Du Bois, J., Chafe, W., Meyer, C., Thompson, S., & Martey, N. (2000). Santa Barbaracorpus of spoken American English. CD-ROM. Philadelphia: Linguistic DataConsortium .Dyer, C., Kuncoro, A., Ballesteros, M., & Smith, N. (2016). Recurrent neural networkgrammars. In
Proceedings of NAACL.
Edmiston, P., Perlman, M., & Lupyan, G. (2018). Repeated imitation makes humanvocalizations more word-like.
Proceedings of the Royal Society B: Biological Sciences , (1874), 20172709.Egan, J. (1948). Articulation testing methods. Laryngoscope .Elman, J. (1990). Finding Structure in Time.
Cognitive Science , , 179–211.Fine, A., Jaeger, T., Farmer, T., & Qian, T. (2013). Rapid expectation adaptation duringsyntactic comprehension. PloS one , (10), e77661.Fossum, V., & Levy, R. (2012). Sequential vs. hierarchical syntactic models of humanincremental sentence processing. In Proceedings of the 3rd Workshop on CognitiveModeling and Computational Linguistics (pp. 61–69).Frank, S., & Bod, R. (2011). Insensitivity of the human sentence-processing system tohierarchical structure.
Psychological Science , (6), 829–834.Frank, S., Bod, R., & Christiansen, M. (2012). How hierarchical is language use? Proceedings of the Royal Society B: Biological Sciences , (1747), 4522–4531.Frank, S., & Christiansen, M. (2018). Hierarchical and sequential processing of language:A response to: Ding, Melloni, Tian, and Poeppel (2017). rule-based and word-levelstatistics-based processing of language: insights from neuroscience. language,cognition and neuroscience. Language, Cognition and Neuroscience , (9),1213–1218.French, N., & Steinberg, J. (1947). Factors governing the intelligibility of speech sounds.OBUST WORD RECOGNITION 49 The Journal of the Acoustical Society of America , (1), 90–119.Futrell, R., & Levy, R. (2017). Noisy-context surprisal as a human sentence processing costmodel. In Proceedings of the 15th Conference of the European Chapter of theAssociation for Computational Linguistics: Volume 1 (Vol. 1, pp. 688–698).Futrell, R., & Levy, R. (2019). Do RNNs learn human-like abstract word orderpreferences? In
Proceedings of the Society for Computation in Linguistics (SCiL) (Vol. 2, pp. 50–59).Futrell, R., Wilcox, E., Morita, T., Qian, P., Ballesteros, M., & Levy, R. (2019). Neurallanguage models as psycholinguistic subjects: Representations of syntactic state. In
Proceedings of the 18th Annual Conference of the North American Chapter of theAssociation for Computational Linguistics: Human Language Technologies.
Gahl, S., & Strand, J. (2016). Many neighborhoods: Phonological and perceptualneighborhood density in lexical production and perception.
Journal of Memory andLanguage , , 162–178.Gale, W., & Sampson, G. (1995). Good-turing frequency estimation without tears. Journalof Quantitative Linguistics , (3), 217–237.Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., & Pallett, D. (1993). DARPA TIMITacoustic-phonetic continous speech corpus CD-ROM. NIST speech disc 1-1.1. NASASTI/Recon technical report , .Gelman, A., & Rubin, D. (1992). Inference from iterative simulation using multiplesequences. Statistical Science , (4), 457–472.Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variancedilemma. Neural Computation , (1), 1–58.Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and theBayesian restoration of images. IEEE Transactions on pattern analysis and machineintelligence (6), 721–741.Gers, F., & Schmidhuber, J. (2001). LSTM recurrent networks learn simple context-freeOBUST WORD RECOGNITION 50and context-sensitive languages.
IEEE Transactions on Neural Networks , (6),1333–1340.Gibson, E., Bergen, L., & Piantadosi, S. (2013). Rational integration of noisy evidence andprior semantic expectations in sentence interpretation. Proceedings Natl. Acad. Sci.U.S.A. , (20), 8051–8056.Gibson, E., Desmet, T., Grodner, D., & Watson, K., D. Ko. (2005). Reading relativeclauses in english. Cognitive Linguistics , (2).Gibson, E., & Thomas, J. (1999). Memory limitations and structural forgetting: Theperception of complex ungrammatical sentences as grammatical. Language andCognitive Processes , (3), 225–248.Godfrey, J., Holliman, E., & McDaniel, J. (1992). SWITCHBOARD: Telephone speechcorpus for research and development. In IEEE International Conference on Speech,and Signal Processing, ICASSP-92 (Vol. 1, p. 517-520).Goldwater, S., Jurafsky, D., & Manning, C. (2010). Which words are hard to recognize?prosodic, lexical, and disfluency factors that increase speech recognition error rates.
Speech Communication , (3), 181–200.Goodkind, A., & Bicknell, K. (2018). Predictive power of word surprisal for reading timesis a linear function of language model quality. In Proceedings of the 8th workshop onCognitive Modeling and Computational Linguistics (cmcl 2018) (pp. 10–18).Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist temporalclassification: labelling unsegmented sequence data with recurrent neural networks.In
Proceedings of the 23rd International Conference on Machine Learning (pp.369–376).Griffiths, T., Christian, B., & Kalish, M. (2008). Using category structures to test iteratedlearning as a method for revealing inductive biases.
Cognitive Science , , 68-107.Griffiths, T., & Kalish, M. (2007). Language evolution by iterated learning with Bayesianagents. Cognitive Science , (3), 441-480.OBUST WORD RECOGNITION 51Griffiths, T., Kalish, M., & Lewandowsky, S. (2008). Theoretical and empirical evidencefor the impact of inductive biases on cultural evolution. Philosophical Transactions ofthe Royal Society of London B: Biological Sciences , (1509), 3503–3514.Gulordava, K., Bojanowski, P., Grave, E., Linzen, T., & Baroni, M. (2018). Colorless greenrecurrent networks dream hierarchically. In Proceedings of the 2018 Conference of theNorth American Chapter of the Association for Computational Linguistics: HumanLanguage Technologies, Volume 1 (Vol. 1, pp. 1195–1205).Hale, J. (2001). A probabilistic Earley parser as a psycholinguistic model.
NAACL ’01:Second meeting of the North American Chapter of the Association for ComputationalLinguistics on Language technologies 2001 , 1–8.Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., . . . others (2014).Deep speech: Scaling up end-to-end speech recognition. arXiv preprintarXiv:1412.5567 .Heafield, K. (2011). KenLM: Faster and smaller language model queries. In
Proceedings ofthe sixth workshop on statistical machine translation (pp. 187–197).Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory.
Neural Computation , (8), 1735–1780.Howes, D. (1957). On the relation between the intelligibility and frequency of occurrence ofenglish words. The Journal of the Acoustical Society of America , (2), 296–305.Hudson, P., & Bergman, M. (1985). Lexical knowledge in word recognition: Word lengthand word frequency in naming and lexical decision tasks. Journal of Memory andLanguage , (1), 46–58.Itti, L., & Baldi, P. (2009). Bayesian surprise attracts human attention. Vision research , (10), 1295–1306.Jaeger, H. (2001). The echo state approach to analysing and training recurrent neuralnetworks-with an erratum note. Bonn, Germany: German National Research Centerfor Information Technology GMD Technical Report , (34), 13.OBUST WORD RECOGNITION 52Jozefowicz, R., Zaremba, W., & Sutskever, I. (2015). An empirical exploration of recurrentnetwork architectures. In International Conference on Machine Learning (pp.2342–2350).Jurafsky, D., & Martin, J. (2009).
Speech & language processing: An introduction to naturallanguage processing, speech recognition, and computational linguistics . Prentice-Hall.Kalikow, D., Stevens, K., & Elliott, L. (1977). Development of a test of speechintelligibility in noise using sentence materials with controlled word predictability.
The Journal of the Acoustical Society of America , (5), 1337–1351.Kamide, Y., Altmann, G., & Haywood, S. (2003). The time-course of prediction inincremental sentence processing: Evidence from anticipatory eye movements. Journalof Memory and language , (1), 133–156.Kirby, S. (2001). Spontaneous evolution of linguistic structure: an iterated learning modelof the emergence of regularity and irregularity. IEEE Transactions on EvolutionaryComputation , (2), 102-110.Kirby, S., Dowman, M., & Griffiths, T. (2007). Innateness and culture in the evolution oflanguage. Proceedings of the National Academy of Sciences , (12), 5241–5245.Kucera, H., & Francis, W. N. (1967). Computational analysis of present-day AmericanEnglish . Providence, RI: Brown University Press.Lash, A., Rogers, C., Zoller, A., & Wingfield, A. (2013). Expectation and entropy inspoken word recognition: Effects of age and hearing acuity.
Experimental AgingResearch , (3), 235–253.Levesque, H., Davis, E., & Morgenstern, L. (2011). The Winograd schema challenge. In AAAI spring symposium: Logical formalizations of commonsense reasoning (Vol. 46,p. 47).Levy, R. (2008). Expectation-based syntactic comprehension.
Cognition , (3), 1126–77.Levy, R., Bicknell, K., Slattery, T., & Rayner, K. (2009, Dec). Eye movement evidencethat readers maintain and act on uncertainty about past linguistic input. Proc. Natl.
OBUST WORD RECOGNITION 53
Acad. Sci. U.S.A. , (50), 21086–21090.Ling, W., Luís, T., Marujo, L., Astudillo, R., Amir, S., Dyer, C., . . . Trancoso, I. (2015).Finding function in form: Compositional character models for open vocabulary wordrepresentation. arXiv preprint arXiv:1508.02096 .Linzen, T., Dupoux, E., & Goldberg, Y. (2016). Assessing the ability of LSTMs to learnsyntax-sensitive dependencies. Transactions of the Association for ComputationalLinguistics , , 521–535.Linzen, T., & Leonard, B. (2018). Distinct patterns of syntactic agreement errors inrecurrent networks and humans. In Proceedings of the 40th Annual Conference of theCognitive Science Society (pp. 692–697). Austin, TX: Cognitive Science Society.Lippmann, R. (1997). Speech recognition by machines and humans.
SpeechCommunication , (1), 1–15.Luce, P., & Pisoni, D. (1998). Recognizing spoken words: The neighborhood activationmodel. Ear Hear , (1), 1–36.Luce, P., Pisoni, D., & Goldinger, S. (1990). Similarity neighborhoods of spoken words.
TheMIT Press.Marcus, M., Marcinkiewicz, M., & Santorini, B. (1993). Building a large annotated corpusof English: The Penn Treebank.
Computational Linguistics , (2), 313–330.Marr, D. (1982). Vision: A computational investigation into the human representation andprocessing of visual information . New York, NY, USA: Henry Holt and Co., Inc.Marslen-Wilson, W. (1987). Functional parallelism in spoken word-recognition.
Cognition , (1-2), 71–102.McClelland, J., & Elman, J. (1986). The TRACE model of speech perception. CognitivePsychology , (1), 1–86.McQueen, J. M., & Huettig, F. (2012, Jan). Changing only the probability that spokenwords will be distorted changes how they are recognized. J. Acoust. Soc. Am. , (1), 509–517.OBUST WORD RECOGNITION 54Mikolov, T., Karafiát, M., Burget, L., Černock`y, J., & Khudanpur, S. (2010). Recurrentneural network based language model. In Eleventh Annual Conference of theInternational Speech Communication Association.
Mikolov, T., Kombrink, S., Burget, L., Černock`y, J., & Khudanpur, S. (2011). Extensionsof recurrent neural network language model. In
Acoustics, speech and signalprocessing (ICASSP), 2011 IEEE international conference on (pp. 5528–5531).Miller, G., Heise, G., & Lichten, W. (1951). The intelligibility of speech as a function ofthe context of the test materials.
Journal of Experimental Psychology , (5), 329.Mitchell, T. (1997). Machine Learning . New York, NY, USA: McGraw-Hill, Inc.Morrison, C., & Ellis, A. (2000). Real age of acquisition effects in word naming and lexicaldecision.
British Journal of Psychology , (2), 167–180.Morton, J. (1969). Interaction of information in word recognition. Psychological Review , (2), 165-178.Navarro, G. (2001). A guided tour to approximate string matching. ACM computingsurveys (CSUR) , (1), 31–88.Norris, D., & McQueen, J. (2008, Apr). Shortlist B: a Bayesian model of continuous speechrecognition. Psychological Review , (2), 357–395.Norris, D., McQueen, J. M., & Cutler, A. (2016). Prediction, Bayesian inference andfeedback in speech recognition. Language, cognition and neuroscience , (1), 4–18.Owens, E. (1961). Intelligibility of words varying in familiarity. Journal of Speech andHearing Research , (2), 113–129.Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). Librispeech: an ASR corpusbased on public domain audio books. In Acoustics, speech and signal processing(ICASSP), 2015 ieee international conference on (pp. 5206–5210).Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L.(2018). Deep contextualized word representations. In
Proceedings of the NorthAmerican Chapter of the Association for Computational Linguistics.
OBUST WORD RECOGNITION 55Petrov, S., & Klein, D. (2007). Learning and inference for hierarchically split PCFGs. In
Proceedings of the National Conference on Artificial Intelligence (Vol. 22, p. 1663).Popović, M., & Ney, H. (2007). Word error rates: Decomposition over POS classes andapplications for error analysis. In
Proceedings of the Second Workshop on StatisticalMachine Translation (pp. 48–55).Roark, B. (2001). Probabilistic top-down parsing and language modeling.
Computationallinguistics , (2), 249–276.Roark, B., Bachrach, A., Cardenas, C., & Pallier, C. (2009). Deriving lexical and syntacticexpectation-based measures for psycholinguistic modeling via incremental top-downparsing. In Proceedings of the 2009 Conference on Empirical Methods in NaturalLanguage Processing (pp. 324–333).Rohde, H., & Ettlinger, M. (2012). Integration of pragmatic and phonetic cues in spokenword recognition.
Journal of Experimental Psychology: Learning, Memory, andCognition , (4), 967–983.Sanborn, A., Griffiths, T., & Shiffrin, R. (2010). Uncovering mental representations withMarkov chain Monte Carlo. Cognitive Psychology , (2), 63–106.Savin, H. (1963). Word-frequency effect and errors in the perception of speech. TheJournal of the Acoustical Society of America , (2), 200–206.Shannon, C. (1948). A mathematical theory of communication. The Bell System TechnicalJournal , (July 1928), 379–423.Shannon, C. (1951). Prediction and Entropy of Printed English. Bell Systems TechnicalJournal , , 50–64.Smith, N., & Levy, R. (2011). Cloze but no cigar: The complex relationship between cloze,corpus, and subjective probabilities in language processing. In Proceedings of theAnnual Meeting of the Cognitive Science Society (Vol. 33).Smith, N., & Levy, R. (2013). The effect of word predictability on reading time islogarithmic.
Cognition , (3), 302–19.OBUST WORD RECOGNITION 56Stabler, E. (2004). Varieties of crossing dependencies: structure dependence and mildcontext sensitivity. Cognitive Science , (5), 699–720.Stolcke, A. (2002). SRILM – An Extensible Language Modeling Toolkit. In ProceedingsInternational Conference on Spoken Language Processing (Vol. 2, p. 901-904).Storkel, H., Armbrüster, J., & Hogan, T. (2006). Differentiating phonotactic probabilityand neighborhood density in adult word learning.
Journal of Speech, Language, andHearing Research .Sundermeyer, M., Schlüter, R., & Ney, H. (2012). LSTM neural networks for languagemodeling. In
Thirteenth Annual Conference of the International SpeechCommunication Association.
Tanenhaus, M., Spivey-Knowlton, M., Eberhard, K., & Sedivy, J. (1995). Integration ofvisual and linguistic information in spoken language comprehension.
Science , (5217), 1632–1634.Theunissen, M., & Swanepoel, J., D. W. Hanekom. (2009). Sentence recognition in noise:Variables in compilation and interpretation of tests. International Journal ofAdiology , (11), 743–757.Treisman, M. (1978). A theory of the identification of complex stimuli with an applicationto word recognition. Psychological Review , (6), 525.Van Schijndel, M., & Schuler, W. (2015). Hierarchic syntax improves reading timeprediction. In Proceedings of the 2015 Conference of the North American Chapter ofthe Association for Computational Linguistics: Human Language Technologies (pp.1597–1605).Vincent-Lamarre, P., Massé, A. B., Lopes, M., Lord, M., Marcotte, O., & Harnad, S.(2016). The latent structure of dictionaries.
Topics in Cognitive Science , (3),625–659.Vitevitch, M., & Luce, P. (1999). Probabilistic Phonotactics and Neighborhood Activationin Spoken Word Recognition. Journal of Memory and Language , (3), 374–408.OBUST WORD RECOGNITION 57Werbos, P. J. (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE , (10), 1550–1560.Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., & Stolcke, A. (2018). The microsoft2017 conversational speech recognition system. In (pp. 5934–5938).Xu, J., & Griffiths, T. (2010). A rational analysis of the effects of memory biases on serialreproduction. Cognitive Psychology , (2), 107–126.Yap, M., Balota, D., Sibley, D., & Ratcliff, R. (2012). Individual differences in visual wordrecognition: Insights from the english lexicon project. Journal of ExperimentalPsychology: Human Perception and Performance , (1), 53-79.Zaremba, W., Sutskever, I., & Vinyals, O. (2014). Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 .Zeno, S., Ivens, S., Millard, R., & Duvvuri, R. (1995). The educator’s word frequencyguide . Touchstone Applied Science Associates (TASA), Inc.Ziegler, D., Stiennon, N., Wu, J., Brown, T., Radford, A., Amodei, D., . . . Irving, G.(2019). Fine-tuning language models from human preferences. arXiv preprintarXiv:1909.08593 .Ziegler, J., Muneaux, M., & Grainger, J. (2003). Neighborhood effects in auditory wordrecognition: Phonological competition and orthographic facilitation.
Journal of Memory and Language, (4), 779–793.

Technical Appendix 1: Language Models and Training Procedure

Language Models

N-gram models.
The simplest PGLMs that we evaluate here are n-gram models. First used to model natural language by Shannon (1948), n-gram models typically track forward transitional probabilities for words by conditioning on preceding words. These models make the strong simplifying assumption that the sequence of words in an utterance or sentence is generated following a Markov chain, in which the probability of each event (i.e., word) depends only on immediately preceding events. These models can condition on a larger or smaller preceding context: an n-gram model tracks conditional probabilities given the n-1 preceding words; bigram models condition on just the preceding word, while trigram models condition on the two previous words. The probability of an utterance is the product of the conditional probabilities of the constituent words,

P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}),   (1)

where w_1, ..., w_m is an utterance comprised of m words and n is the order of the n-gram model. By convention, sentences (or utterances, more generally) are augmented with a start symbol (of probability 1) and an end symbol (the probability of which is tracked by the model, the same as any other word type). These models track statistics over sequences of words, and do not include any explicit encoding of statistical expectations for higher-order abstractions like parts of speech (e.g., nouns and verbs), super-ordinate grammatical categories (e.g., NPs or VPs), or semantic roles (e.g., agents or patients). n-gram models may appear to encode expectations related to these abstractions because lexical statistics capture these regularities implicitly: a trigram model will assign almost all of the probability mass following "near a" to nouns and adjectives. However, without support for abstract representations, a trigram model cannot assign higher probabilities to nouns as a class (i.e., nouns not observed following "near a") in predicting the next words in that context. As such, comparing the probability of the serial transmission chains under the n-gram models against the probability under the other models, which all posit and track regularities at a level higher than words, allows us to evaluate evidence for the degree to which people may use representations of higher levels of linguistic structure to inform their expectations.

A special case of the n-gram model of particular theoretical interest is the unigram, or 1-gram, model, where the probability of a word is not conditioned on preceding context. Unigram probability estimates thus largely reflect normalized word frequencies, but may assign nonzero probability mass to out-of-vocabulary words depending on the choice of smoothing technique, described below. Comparing the predictions of the unigram model with those of higher-order n-gram models allows us to evaluate the degree to which participants' linguistic expectations reflect any amount of preceding context, separate from the treatment of abstraction.

Besides the length of the context on which they condition, a second dimension of variation in the architecture of n-gram models is the choice of smoothing technique, or how probability mass is re-allocated to unobserved word sequences during model fitting. Using probabilities directly derived from counts, i.e., the maximum likelihood estimate, of an n-gram model is generally ill-advised because of sparsity: many word sequences observed in a new dataset may not have been observed in the dataset used to fit the model.
Unless unobserved sequences receive nonzero probability, sentences with zero-probability continuations will be evaluated as also having a probability of 0. This reallocation of probability mass among sequences often reflects simple assumptions about the statistical structure of languages. Smoothing over possible sequences complements the practice of assigning some proportion of the unigram probability mass to newly-encountered tokens.

We use two smoothing schemes when fitting higher-order n-gram models. For larger datasets, we use modified Kneser-Ney smoothing (Chen & Goodman, 1998). This smoothing scheme shifts probability mass to unobserved n-grams that are expected on the basis of the prevalence of constituent lower-order n-grams, e.g., assigning "near San Antonio" a small non-zero probability because "near San" (as in "near San Francisco") and "San Antonio" are both relatively probable bigrams. For smaller datasets, we use Good-Turing smoothing (Gale & Sampson, 1995), which shifts probability mass from n-grams seen once to ones that are not encountered in fitting. We build several new n-gram models using the SRI Language Modeling Toolkit (SRILM; Stolcke, 2002), and use a proprietary 5-gram model from the DeepSpeech project (Hannun et al., 2014) which was estimated using KenLM (Heafield, 2011). We continue here with the enumeration of language models; the linguistic datasets used to fit the models are treated in greater detail below.
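As a concrete illustration of Equation 1, the following minimal sketch scores an utterance under a bigram model. It is a toy stand-in for the SRILM and KenLM models described above: it uses add-one smoothing rather than modified Kneser-Ney or Good-Turing, and the corpus, token ids, and function names are ours.

```python
import math
from collections import Counter

def train_bigram(corpus_sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def sentence_log_prob(sent, unigrams, bigrams, vocab_size):
    """Sum of log conditional bigram probabilities (Equation 1 with n = 2),
    with add-one smoothing so unseen bigrams receive nonzero probability."""
    tokens = ["<s>"] + sent + ["</s>"]
    log_p = 0.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_p += math.log(p)
    return log_p

# Toy corpus; the models reported here are fit to the BNC or Penn TreeBank with SRILM.
corpus = [["the", "dog", "barked"], ["the", "cat", "slept"]]
uni, bi = train_bigram(corpus)
print(sentence_log_prob(["the", "dog", "slept"], uni, bi, vocab_size=len(uni)))
```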
Probabilistic Context-Free Grammars (PCFGs).
The remainder of the probabilistic generative language models under consideration here use abstract representations above the level of the word to derive word-level expectations. The first class of model, probabilistic context-free grammars (PCFGs), posits that utterances reflect a latent structure composed of abstract, hierarchical categories like nouns, verbs, noun phrases, and verb phrases. The task of predicting the next word thus depends not just on the previously-observed words, as in the n-gram models, but also on a listener's beliefs about the grammatical structure consistent with the previously-encountered words, and how that grammatical structure narrows the set of possible continuations. For example, having heard words that likely form a verb phrase for a transitive verb (e.g., "bought a"), a listener might expect a noun phrase containing the object of that verb.

Under a PCFG, each hypothesis about the hierarchical structure of an utterance is captured in terms of a parse tree, which describes how an utterance could be generated from a context-free grammar. A context-free grammar is a linguistic formalism that describes sentences as a hierarchy of constituents, starting with a root "sentence" node. This sentence node is composed of a tree of nonterminals: grammatical categories such as nouns, verbs, noun phrases, and verb phrases, each of which in turn consists of other grammatical categories or words. All terminals, or leaf nodes in the hierarchy, are (observable) words. A CFG captures the intuition that the same (observable) linear sequence of words may reflect several possible (unobservable) interpretations of the relationship between constituents. For example, "the girl saw the boy with the telescope" has two high-probability interpretations depending on the attachment point of the prepositional phrase: whether "with the telescope" modifies the boy or how the girl saw.

While CFGs cannot capture certain human linguistic phenomena (Stabler, 2004), probabilistic CFGs, or PCFGs, are commonly used as generative probabilistic models of language given their principled account of higher-order structure. Unlike typed dependency parsers (e.g., De Marneffe, MacCartney, & Manning, 2006), which track typed pairwise relationships between words, PCFGs provide a probability distribution over hierarchical parses for an utterance.

The parameters of PCFGs can be learned in an unsupervised fashion from linear word representations alone, or fit using corpora that have been annotated by linguists with gold-standard most-probable parses. Because of their stronger performance, we focus here on the predictions of PCFGs derived from supervised training. Given the size of the hypothesis space of possible grammars for a language, as well as of parses for a sentence, PCFGs vary in the techniques that they use to find the highest probability parses. Here we use two PCFGs, the Roark (2001) parser and the BLLIP parser (Charniak & Johnson, 2005), which approach the problem of inference in this large hypothesis space in very different ways.
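To make the role of parse trees concrete, the sketch below defines a toy PCFG for the ambiguous example above and sums the probabilities of its parses to obtain an utterance probability. It uses NLTK with a hand-written grammar whose rule probabilities are arbitrary; it is not the Roark or BLLIP parser and is intended only to illustrate marginalization over parses.

```python
from nltk import PCFG
from nltk.parse.pchart import InsideChartParser

# Toy grammar covering the prepositional-phrase attachment ambiguity.
grammar = PCFG.fromstring("""
    S   -> NP VP        [1.0]
    NP  -> Det N        [0.6]
    NP  -> Det N PP     [0.4]
    VP  -> V NP         [0.7]
    VP  -> V NP PP      [0.3]
    PP  -> P NP         [1.0]
    Det -> 'the'        [1.0]
    N   -> 'girl'       [0.4]
    N   -> 'boy'        [0.4]
    N   -> 'telescope'  [0.2]
    V   -> 'saw'        [1.0]
    P   -> 'with'       [1.0]
""")

tokens = "the girl saw the boy with the telescope".split()

# Enumerate parses with their probabilities; summing over them gives the
# (marginal) probability of the word string under the grammar.
parser = InsideChartParser(grammar)
parses = list(parser.parse(tokens))
for tree in parses:
    print(tree.prob(), tree)
print("approximate sentence probability:", sum(t.prob() for t in parses))
```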
Roark Parser.
The Roark parser (Roark, 2001) is a widely used top-down parser, meaning that it maintains and updates a set of candidate parse trees as it moves from left to right in an utterance. In combination with other architectural decisions, this allows the Roark parser to calculate probabilities for each sequential word conditioned on the preceding words. This incremental property makes it especially well-suited for modeling online sentence processing, where people update their interpretation of the utterance as they receive new acoustic data (Roark et al., 2009).
BLLIP Parser.
Unlike the top-down approach of the Roark parser, the BLLIP (or Johnson-Charniak) parser uses bottom-up inference in combination with a secondary scoring function on the yielded highest-probability trees (Charniak & Johnson, 2005). The bottom-up approach means that the parser starts by positing grammatical categories directly above the level of words for the entire sentence, then iteratively identifies possible higher-level grammatical categories up to the sentence root. As such, it cannot be used to produce probability estimates incrementally after each word. The BLLIP parser uses bottom-up parsing to generate a set of high-probability parses, which are then re-ranked using a separate discriminative model with a researcher-specified set of ad-hoc features. Though bottom-up parsing violates the constraints of online auditory language processing in that it requires that the complete signal be received before parsing, we include the parser in analyses tracking utterance-level changes.

For both PCFG language models, the probability of an utterance reflects a marginalization over the possible parse trees that generate that utterance. Because of computational constraints, these parsers typically track a relatively small number of the highest-probability parse trees (in this case, 50 for each model). While this represents a truncation of the true set of possible parse trees, the first few (i.e., the first one to five) are overwhelmingly more probable than the remainder, and thus provide a reliable estimate of utterance probability. In the case of the Roark parser, surprisal estimates for individual words can be derived by querying the probability of the next word given the restricted set of highest-probability provisional parse trees (Roark et al., 2009).
Recurrent Neural Network Language Model (RNN LM).
The expectations derived from PCFGs reflect hierarchical parses of sentences, but listeners could also develop expectations by noting abstract commonalities in the usage of words, for example that the words "five" and "six" tend to appear in very similar lexical contexts. We include here two recurrent neural network models (RNNs) that can infer partially syntactic, partially semantic higher-level regularities in word usage, and can use these regularities in the service of prediction. As with PCFGs, at least some RNN architectures can track long-distance dependencies (Linzen et al., 2016). While models lacking abstract representations of context could in principle capture long-distance dependencies (e.g., a 9-gram model), the combinatorial richness of language means that evidence is far too sparse to infer these regularities when sequences of words are tracked as discrete symbols.

Recurrent neural networks are able to capture and use these higher-level regularities because they use the state of the hidden layer of the network at the previous timestep, often referred to as a "memory," in addition to newly-received data to make predictions. This hidden layer from the previous timestep is a lossy, and thus generalizable, representation of the preceding context. By projecting words in the preceding context into a lower-dimensional space, the observation of a specific word sponsors the activation of words with similar lexical distributions in that context, imbuing the model with robustness in prediction even if a particular sequence of words has not been seen. Since early demonstrations of their utility for language prediction with small-scale, synthesized linguistic corpora (Elman, 1990), an extensive literature has developed architectures to deal with web-scale natural language prediction tasks, particularly developing techniques to prevent overfitting given the massive number of parameters, and to manage their computational complexity (Mikolov, Kombrink, Burget, Černocký, & Khudanpur, 2011).

The first RNN we use here, "RNN LM," was first described in Mikolov, Karafiát, Burget, Černocký, and Khudanpur (2010), while additional performance optimizations for training can be found in Mikolov et al. (2011). We train a network with 40 hidden nodes that uses backpropagation through time (BPTT; Werbos, 1990) for the two preceding timesteps. The vocabulary is limited to the 9,999 highest-frequency word types, with the remainder of types assigned to an <unk> type.
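The sketch below shows the core of such a model in PyTorch: an embedding layer, a recurrent layer whose hidden state carries a lossy summary of the preceding context, and a softmax over the vocabulary. It is an illustrative sketch, not the Mikolov et al. RNNLM toolkit used in our analyses; the token ids and embedding size are ours, with the hidden size and vocabulary size matching the description above.

```python
import torch
import torch.nn as nn

class SimpleRNNLM(nn.Module):
    """Minimal Elman-style RNN language model: the hidden state summarizes the
    preceding context and conditions the prediction of the next word."""
    def __init__(self, vocab_size=10000, embed_dim=64, hidden_dim=40):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        emb = self.embed(tokens)                # (batch, seq, embed_dim)
        states, hidden = self.rnn(emb, hidden)  # hidden state is the "memory"
        logits = self.out(states)               # (batch, seq, vocab_size)
        return logits, hidden

# Next-word probabilities after a toy three-token context (ids are hypothetical).
model = SimpleRNNLM()
tokens = torch.tensor([[5, 17, 203]])
logits, _ = model(tokens)
next_word_probs = torch.softmax(logits[0, -1], dim=-1)
print(next_word_probs.shape)  # torch.Size([10000])
```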
Big Language Model (Big LM).
One of the major challenges with recurrent neural networks is the problem of vanishing gradients, where the error signal necessary to update connection weights in the network becomes too small to use standard gradient descent techniques. This problem becomes especially prevalent when recurrent neural networks track longer histories of preceding events. The RNN language model described above, for example, can only update weights using the model state at the two preceding timesteps. One solution that has received significant attention is to use a special kind of node in the neural network's hidden layer, the long short-term memory (LSTM) cell (Hochreiter & Schmidhuber, 1997). While non-LSTM RNNs directly copy the previous state of the hidden layer at each timestep, the LSTM uses these special cells to regulate how information is propagated from one timestep to the next. These cells have input, output, and forgetting "gates" that regulate how the cell's state changes, such that it can maintain state over longer intervals than a typical RNN. Such networks have been widely useful in sequence prediction tasks (Gers & Schmidhuber, 2001), with particular successes in language modeling (Sundermeyer, Schlüter, & Ney, 2012; Zaremba et al., 2014).

One challenge for RNNs, left unaddressed by the use of LSTMs, is the difficulty of scaling to larger vocabularies, a necessity for modeling large naturalistic language corpora. Evaluating the network's loss function, which requires generating a probability distribution over all words, becomes prohibitively computationally expensive owing to the normalizing term in the softmax function. This has led researchers to limit vocabularies to sizes much smaller than those typical of n-gram models or PCFGs, often in the range of 5,000 to 30,000 word types.

Jozefowicz et al. (2015) address this limitation in their "Big LM" architecture by representing words as points in a lower-dimensional continuous space using a convolutional neural network (CNN). These models can represent words as embeddings (real-valued vectors) that capture perceptual similarity between words on the basis of shared structure, such as word lemmas or part-of-speech markings like -ing (Ling et al., 2015). The LSTM is then used to make predictions in this lower-dimensional space, reducing the number of computations in the softmax function from the size of the vocabulary to the dimensionality of the CNN-derived embeddings.

Given the significant technical resources necessary to train such a model (24-36 graphics processing units) and the fact that the model hyperparameters have not been made publicly available, we use the model distributed by Jozefowicz et al. (2015). This model uses 4096-dimensional character embeddings to represent words, makes use of backpropagation through time for the previous 20 timesteps, and has two layers with 8192-dimensional recurrent state in each layer.

Training Data and Model Fitting
All of the above language models are fit or trained using corpus data represented as written text. Ideally, we would evaluate all combinations of language models and corpora to identify how model performance reflects an interaction of model architecture and training data. However, computational limitations, licensing restrictions, and the limited public availability of codebases make this goal infeasible at present. Instead, wherever possible we evaluate each language model trained on two datasets: one a large corpus for which it is known to produce competitive or state-of-the-art results, and the other the (publicly available) contents of the Penn TreeBank (Marcus, Marcinkiewicz, & Santorini, 1993).

The Penn TreeBank is a standard training, validation, and test dataset in the public domain, consisting of about 930,000 word tokens in the training set. Those models that take advantage of the validation dataset for setting hyperparameters have additional access to 74,000 tokens. All sentences are annotated with gold-standard parse trees, making this dataset appropriate for training the two supervised PCFG models. While the thematic content and register of the Wall Street Journal are not representative of conversational English, all language models should be equally disadvantaged by this shortcoming.

For the instance of each model architecture trained on a larger dataset, we use a variety of datasets with known strong performance in the psycholinguistics or NLP literature. For n-gram models we use the British National Corpus (BNC; [Dataset], 2007), an approximately 200 million word dataset commonly used in psycholinguistics. The BNC consists of material from newspapers, academic and popular books, college essays, and transcriptions of informal conversations. The BLLIP parser is trained on the Google TreeBank (Bies, Mott, Warner, & Kulick, 2012) in addition to the Penn TreeBank. The Google TreeBank contains approximately 255,000 word tokens in 16,600 sentences with gold-standard parse trees, taken from blogs, newsgroups, emails, and other internet-based sources. Big LM is trained on the One Billion Word Benchmark of Chelba et al. (2013), a large collection of news articles. The DeepSpeech project (Hannun et al., 2014) provides a smoothed 5-gram model trained on a proprietary dataset including Librispeech (Panayotov, Chen, Povey, & Khudanpur, 2015) and Switchboard (Godfrey et al., 1992). Only the Penn TreeBank was used in the cases of the Roark parser and the RNN LM.

Computation of Lexical Surprisal
Estimating surprisal, the unexpectedness of a word expressed as its negative log probability under a predictive model, for each individual word token requires that a model produce conditional word probabilities. These probabilities are straightforwardly accessible for n-gram models, which encode continuation probabilities directly.

The Roark parser produces an estimate of lexical continuation probability that marginalizes over the set of highest-probability candidate parses given the words seen so far. The bottom-up scoring procedure used by the BLLIP parser means that it assigns probabilities to parse trees that generate the whole sentence, rather than to successive words. For this reason, we analyze changes in sentence probability under the BLLIP parser, but do not use it as a predictor in analyses requiring word-level surprisal estimates.

The RNN LM yields a vector of activations over the vocabulary, which is translated with a softmax function into a probability distribution. Big LM produces a probability distribution over continuations, though the softmax function is computed over the lower-dimensional representations of words. In both cases the prediction by the neural network is conditioned on the state of the model at preceding timesteps. We treat the probabilities obtained from these models in the same way as the conditional probabilities from the n-gram models.

Computation of Sentence Probabilities
For all models we omit the end-of-sentence marker in the computation of utterance probability. We adopt this strategy because utterance-final punctuation was not collected in the Telephone game, and those models trained on datasets with punctuation assign very high surprisal to end-of-sentence markers without preceding punctuation. In the case of n-gram models, sentence probability is simply the product of conditional word probabilities.

Evaluating the probability of a sentence for the two PCFG models, the Roark and BLLIP parsers, requires marginalizing over the set of possible parse trees. For these two models, we sum over the probabilities of the top 50 parse trees yielded by each model. While there may be many more trees that would yield the same string of words, the first few (1-5) highest-ranking parses typically account for nearly all of the aggregate probability mass for that utterance.

As first noted by Bartlett (1932), messages decrease in length over the course of serial reproduction. Shorter utterances are necessarily more probable, in that they are comprised of a smaller number of events. To investigate effects on utterance probability separate from utterance length, we normalize the negative log probability by the number of words in each utterance. This measure, which we call "average per-word surprisal," has an intuitive interpretation in terms of bits of information. However, this measure may disguise an important difference between models in the interpretation of utterance probabilities among the PCFGs. For the non-incremental models (e.g., BLLIP), the probability reflects the set of parse trees consistent with the complete utterance. By contrast, under the incremental models the set of parse trees changes after observing each word. Garden path sentences provide a case where the latter class of models might assign high probability to the initial sequence of words "the horse raced past the barn," then a low probability to the continuation "fell," which is unlikely under the set of parses in the beam. By contrast, a bottom-up, non-incremental model would assign a low probability to all expansions necessary to generate this sentence, in that it only considers the subset of parses compatible with the full utterance.
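A minimal sketch of the average per-word surprisal measure described above, assuming a model has already supplied the conditional probability of each word (with the end-of-sentence marker omitted, as in our analyses); the probabilities in the example are hypothetical.

```python
import math

def average_per_word_surprisal(word_probs):
    """Average surprisal, in bits per word, for an utterance given the conditional
    probability of each word under a model (end-of-sentence marker omitted)."""
    surprisals = [-math.log2(p) for p in word_probs]
    return sum(surprisals) / len(surprisals)

# Hypothetical conditional probabilities for a four-word utterance.
print(average_per_word_surprisal([0.05, 0.2, 0.01, 0.5]))  # ~3.57 bits per word
```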
Technical Appendix 2: Experiment Interface
Participants are directed from Mechanical Turk to an externally-hosted version of the Telephone game. The app guides the participant through adjusting their speakers and microphone, after which participants proceed through a sequence of four practice trials. The first practice trial tests whether the participant's speakers or headphones are working properly by having them transcribe a ten-word sentence. The second practice trial tests whether a participant's microphone is working by having them listen to a recording of a sentence and repeat its content. At the beginning of this trial, the participant grants permission in the web browser for the use of their microphone. Participants are prompted to repeat the second practice trial until their recording is similar to the gold-standard transcription of that sentence, as evaluated by a normalized Levenshtein-Damerau distance of .2 between the gold-standard transcription and the output of an automatic speech recognition system run on their recording (described in greater detail below).

After completing the practice trials, the participant begins a series of 47 test trials, including 40 target trials and 7 randomly interspersed fillers. The 40 target trials consist of either sentences from a pool of initial recordings (described below) or the most recently-contributed recording of another participant. The seven filler utterances are the complement of the four filler utterances used for the practice trials, and are drawn from an inventory of nine audio recordings with similar properties to the initial stimuli recordings, plus two standard test sentences from the TIMIT corpus meant to elicit variation in American English dialects (Garofolo et al., 1993). If the participant flags the utterance that they heard as inappropriate in Step 2, or flags their own recording as compromised in some way in Step 4, the trial ends early, and they progress to the next trial. Otherwise, if the recording passes the set of automated tests, the state of the experiment is updated after each target trial to point to that newest recording as the appropriate stimulus for the next participant.

An example screenshot is presented in Fig. 3. We now present in greater detail the five steps of an experimental trial and the automated filters outlined above.
1. Listening to a recording.
The participant clicks a button to start the trial, which triggers a three-minute audio timer that remains visible to the user through Step 5, below. This three-minute timer ensures that the web experiment (given the dependencies between participants) is not blocked by an inactive participant; if time expires, another participant receives this stimulus. Participants receive friendly feedback encouraging them to complete trials more quickly if this timer expires. Besides the button to start playing the audio recording, there are no controls to pause, repeat, or move location in the recording. The audio is presented embedded in noise recorded from a coffee shop (average SNR across recordings = -6.8 dB). Pilot experiments established that this naturalistic source of noise introduced a high rate of edits without introducing significant participant attrition, as was observed with similar-amplitude white noise. Each utterance has between 500 and 1000 ms of noise-only padding preceding and following the speech content.
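The exact mixing procedure used by the experiment is not detailed here, but the sketch below illustrates one standard way to embed a speech signal in background noise at a target SNR. The function name and the assumption of equal-length NumPy arrays at the same sampling rate are ours.

```python
import numpy as np

def embed_in_noise(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio matches the target SNR
    (in dB), then mix. Assumes 1-D arrays of equal length and sampling rate."""
    speech_power = np.mean(speech.astype(float) ** 2)
    noise_power = np.mean(noise.astype(float) ** 2)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + scaled_noise

# A negative SNR (e.g., -6.8 dB) means the noise is more powerful than the speech.
```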
2. Flagging the upstream recording.
Though the participant cannot pause or move within the recording that they hear, they may flag the recording at any time. If the participant chooses to flag a recording, they are then asked to choose one of a set of provided reasons: Contains speech errors, Speech starts or stops abruptly, Contains obscenities, and Other. Upon choosing Other, the participant can provide a free-form text response. The trial ends early if any of these options are chosen.
3. Recording best guess of what was heard.
If the participant does not flag the audio recording, they are immediately prompted to record their best estimate of what was said, with the specific prompt of "Repeat the sentence you just heard as best you can." After clicking the Record button, the waveform for the recording is drawn in real time. The participant may listen to and re-record their response as many times as desired. If a recording is more than eight seconds long, the participant is prompted to record again. When finished, the participant clicks "Submit." Because the HTML5 audio recording specification leaves the choice of sampling rate and bit depth to the client, all recordings are normalized upon receipt by the server to 16 kHz, 16-bit PCM WAV files. The acoustic properties of the recording environments vary between participants in that each participant records their responses in an uncontrolled environment.
4. Self-flagging the new recording.
After submitting a recording, a participant is given the opportunity to "self-flag" it in case of a speech error or other problems such as unexpected background noise. If the participant chooses to self-flag the recording, they are prompted to provide a reason, with the same set of candidate reasons as the upstream flagging procedure in Step 2, above. If a participant self-flags, the trial ends early.
5. Transcribing the new recording.
Finally, if the participant has submitted a recording and has attested that it is of good quality, they are then prompted to provide a written transcription thereof. The inclusion of punctuation or of a misspelled word (as determined with the Linux utility Aspell) returns an error message to the participant, who may then edit the transcription and resubmit. After submitting, the trial ends. The participant is then provided with a button entitled "Click to play next audio recording"; because there is no timer on this screen, they may pause between each trial.
Automated filters.
After the participant submits the written transcription, they advance to the start button for the next trial. The web application asynchronously applies a set of automated filters to the audio file and the transcription on the server. We employ these filters to determine whether a recording is of sufficient quality to be used as input for downstream participants. In principle, these filters could be implemented with human participants, but using automated tools allows us to direct participants' effort (and commensurate compensation) towards data collection. Of note from a design and data analysis perspective, these filters must be applied at the time of collection: because future participants hear responses from earlier participants, responses must be filtered in real time to maintain the continuity and integrity of the recording chains. These filters include:

• Is the file silent?

• Is the transcription provided by the new participant no more than 20% longer or 20% shorter (in terms of the number of non-space characters) than the transcription of the input sentence they received? If utterances become too short, they become difficult to characterize with language models.

• Is the transcription provided by the new participant no more than 2 words longer or 2 words shorter than the transcription of the input sentence? This follows the same logic as above.

A second set of tests pertains to the audio quality and the intelligibility of the recording. For this we use an automatic speech recognition system to generate a transcription of the newly-recorded audio file. Specifically, we use one of the most advanced publicly-available automatic speech recognition systems, DeepSpeech (Hannun et al., 2014), to check:

• Is the DeepSpeech-generated transcription of the participant's audio file similar to the transcription they provided? This guards against the possibility that a user would provide an acceptable transcription, but an unrelated audio file (e.g., one filled with obscenities).

• Is the DeepSpeech-generated transcription of the participant's audio file similar to the transcription provided by the upstream participant? This prevents the introduction of material like "I didn't hear the last sentence."

The above checks are operationalized by testing whether the normalized Levenshtein-Damerau distance (Navarro, 2001) between the strings in question is less than a precomputed threshold, as sketched below. We use a threshold of .58, arrived at by computing the normalized Levenshtein distance between each sentence in a large corpus and a number of candidate transcriptions, including the correct one and several unrelated foils. On this test corpus with highly dissimilar sentences, this threshold yields a negligible false positive rate; in practice, this threshold is extremely permissive (allows changes) and flags only highly deviant recordings. The length filter was motivated by the finding in pilot work that utterances decreased in length very rapidly without such constraints, yielding many 1-3 word utterances of questionable linguistic status.

If a recording fails any of the above tests, the participant receives feedback at the end of the following trial. This asynchronous evaluation allows for efficient speech recognition (which is computationally costly and requires the use of a graphics processing unit on the server) and prevents participants from having to wait for the web application to recognize and validate their responses.
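A minimal sketch of the edit-distance check described above. It uses a plain Levenshtein distance (omitting the transposition operation of the Damerau variant) normalized by the length of the longer string; the exact normalization and string pre-processing used in the experiment may differ, and the example strings are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance via dynamic programming (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def passes_similarity_filter(asr_transcript, reference, threshold=0.58):
    """Normalize the edit distance by the longer string and compare to the threshold."""
    distance = levenshtein(asr_transcript, reference)
    return distance / max(len(asr_transcript), len(reference), 1) < threshold

print(passes_similarity_filter("the dog barked loudly", "the dog barked loud"))  # True
```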
Utterances that are rejected by any of these filters are retained, in that they are potentially relevant to a number of research questions outside of the scope of the current study. If a recording passes the above tests, the recording is combined with a randomly selected interval of cafe background noise (with the same acoustic properties as above) so that it may be used as a stimulus for later participants. The implications of flagging the upstream stimulus, self-flagging, and automated filtering for the stimuli heard by later participants are described below in relation to the process of serial reproduction.

After submitting a response for the final stimulus, participants are prompted to take an optional demographic survey detailing age, gender, level of education, current geographical location, proficiency with English, and information regarding previous residence for the purpose of dialectal analyses (outside of the scope of this work).
Technical Appendix 3: Experiment Transmission Structure
The succession of stimuli and responses (the latter constituting stimuli for later participants) can be conceived of in terms of a directed acyclic graph, or DAG. Considering the succession of recordings for each sentence as a graph, a collection of nodes representing recordings and edges representing the history among participants, allows us to concisely represent the data collected in the course of the experiment and to operationalize the logic for both automated and participant-based flagging of recordings.

Per the specification of the interface above, a recording takes one of five possible states: accepted by fiat because it is an initial stimulus ("protected"), provisionally accepted ("accepted"), flagged by a downstream participant ("downstream-flagged"), flagged by the participant who recorded the sentence ("self-flagged"), or flagged by one of several automated methods ("auto-flagged"). When a participant starts a test trial indexed by a particular initial sentence, they are provided with either the most recent accepted recording or the initial recording itself (if no previous recordings have been accepted). In other words, if a participant flags the input recording they heard, the following participant will then hear the previously accepted recording in the graph for that sentence. The sequence of recordings appropriate for analysis, or recording chain, is then the initial sentence recording and the subsequent succession of accepted recordings. Flagged and self-flagged recordings comprise the complement of the nodes in the graph. We follow the convention established in the iterated learning literature of referring to the sequential position of a recording within a chain as its generation, though note that a participant in this setup may contribute recordings for different stimuli at different generations, unlike most iterated learning experiments.

The process by which participants are assigned to stimuli can then be considered in terms of threading. Each participant must provide recordings for each of the 40 test sentences. Because of the need for strictly successive recordings, only one participant at a time may listen to and record a response to the same recording. We enforce this constraint with a mutex, a data structure that limits concurrent access to a resource, in this case the recording chains. In combination with multiple independent chains for each stimulus, this setup allows participants to listen to and record sentences continuously while maintaining a strictly sequential relationship between the contributed recordings. The three-minute timer on each trial means that the controller (the web app) will re-assign a stimulus to another participant if a response is not recorded within three minutes. This setup also means that the order in which a participant contributes to each of the recording chains is randomized across participants.
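As a schematic illustration of the chain logic described above, the sketch below represents recordings with their states and selects the stimulus for the next participant as the most recent accepted recording, falling back to the protected initial recording. The class, field, and state names are ours; the actual web application's data model and mutex handling are not shown.

```python
from dataclasses import dataclass
from typing import List, Optional

PROTECTED, ACCEPTED = "protected", "accepted"

@dataclass
class Recording:
    audio_path: str
    state: str        # protected, accepted, downstream-flagged, self-flagged, auto-flagged
    generation: int

def next_stimulus(chain: List[Recording]) -> Optional[Recording]:
    """Return the most recent accepted recording in the chain; if every downstream
    recording has been flagged, fall back to the protected initial recording."""
    for recording in reversed(chain):
        if recording.state in (ACCEPTED, PROTECTED):
            return recording
    return None  # should not occur: every chain starts with a protected recording

# Example: the generation-2 recording was flagged, so the next participant
# hears the generation-1 recording instead.
chain = [Recording("s0.wav", PROTECTED, 0),
         Recording("s1.wav", ACCEPTED, 1),
         Recording("s2.wav", "downstream-flagged", 2)]
print(next_stimulus(chain).audio_path)  # s1.wav
```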
These dynamics mean that any recording may be flagged and removed for the duration of the experiment, even if another participant records a downstream utterance. For example, it is possible that a recording s1 contributed by participant p1 is provisionally accepted, and that a succeeding participant p2 records a downstream repetition, s2. If, however, a later participant flags s2, then the following participant will hear s1, and may in principle flag that recording. We find that such cases of retroactive flagging (and loss of analyzable data) are relatively rare, but that this mechanism provides an automated and scalable method to produce chains of interpretable utterances appropriate for analysis. This architecture also means that if two participants p1 and p2 are progressing through the experiment at the same time, then p1 may provide the stimulus recording heard by p2 for some sentences, and p2 may provide the stimulus recording heard by p1 in others.

Technical Appendix 4: Analysis of Flagged Recordings

Participants' flag rates were not reliably related to their edit rates (R = .006, p = .157). This suggests that participants who flag many recordings do not behave differently than those who flag few recordings.

Technical Appendix 5: Similarity of Language Models
In the main text we present the pairwise similarity of all models trained on the Penn TreeBank in order to isolate the contribution of model architecture (vs. training data) to surprisal estimates. Here we present the same measure of pairwise similarity, but for all 11 models. This reveals that n-gram models trained on the PTB constitute one well-defined cluster separate from all other models. Another cluster emerges among the Kneser-Ney trigrams and 5-grams trained on large datasets and the Big LSTM trained on the One Billion Word Benchmark. The BLLIP PCFGs constitute a two-model cluster. The Roark parser, the RNN LM, and unigram probabilities from the BNC constitute a separate set from the BLLIP PCFGs, but pattern more closely with them than with the large models or the PTB-trained n-gram models.

Figure 11. Relationship between participant flag rate (i.e., number of recordings rejected) and participant edit rate (proportion of words changed among accepted recordings). Most participants change around 1 in 5 words that they hear, and flag none of the recordings that they hear. The non-significant correlation suggests that there is no relationship between flag rate and edit rate by participant.
The 11 models compared in Figure 12 are: PTB / Unigram, PTB / Good-Turing Trigram, PTB / Good-Turing 5-gram, OBWB / Big LSTM, BNC / Kneser-Ney Trigram, DS / Kneser-Ney 5-gram, PTB / Roark Parser (PCFG), PTB+GTB / BLLIP parser (PCFG), PTB / BLLIP parser (PCFG), BNC / Unigram, and PTB / RNNLM (RNN).
Figure 12. Model similarity across 11 language models. A. Spearman's rank correlation for pairwise combinations of sentence probabilities across the 3,193 sentences from the experiment across all models. B. The resulting dendrogram, derived from the rank correlations using Ward's method.

Table 6