Child-directed Listening: How Caregiver Inference Enables Children's Early Verbal Communication
Stephan C. Meylan (1, 3), Ruthe Foushee (2), Elika Bergelson (3), and Roger P. Levy (1)

1 Department of Brain and Cognitive Sciences, MIT ({smeylan, rplevy}@mit.edu)
2 Department of Psychology, University of Chicago ([email protected])
3 Department of Psychology and Neuroscience, Duke University ([email protected])
February 10, 2021
Abstract
How do adults understand children's speech? Children's productions over the course of language development often bear little resemblance to typical adult pronunciations, yet caregivers nonetheless reliably recover meaning from them. Here, we employ a suite of Bayesian models of spoken word recognition to understand how adults overcome the noisiness of child language, showing that communicative success between children and adults relies heavily on adult inferential processes. By evaluating competing models on phonetically-annotated corpora, we show that adults' recovered meanings are best predicted by prior expectations fitted specifically to the child language environment, rather than to typical adult-adult language. After quantifying the contribution of this "child-directed listening" over developmental time, we discuss the consequences for theories of language acquisition, as well as the implications for commonly-used methods for assessing children's linguistic proficiency.
Keywords: language development, child-directed speech, noisy channel communication, spoken word recognition, Bayesian inference
The past five decades have seen extensive research dedicated to characterizing how adults speak to infants and young children (Snow & Ferguson, 1977; Soderstrom, 2007), and to investigating the degree to which adults' child-directed speech directly supports language learning (Golinkoff et al., 2015). By contrast, how caregivers understand the communicative acts of young children, or child-directed listening (CDL), has received far less attention. In this paper, we investigate how English-speaking adults interpret English-learning children's verbal productions, making meaning out of vocalizations that are often perceptually distant from targets in the adult language (e.g., /wid/ for read; see Table 1A).

This characterization of adults' role in conversations with young learners dovetails with "noisy-channel" accounts of spoken language interpretation, which provide a framework for describing how listeners overcome imperfect acoustic information, verbal ambiguity, distractions, and speaker variability present in everyday conversation (Levy, 2008; Shannon, 1951; Gibson et al., 2013). To recover meanings from highly noisy input, adult listeners rely on their expectations about what speakers are likely to say, combined with the perceptual similarity between what the listener heard and guesses as to what the speaker might intend. We argue that child language represents a "noisier-than-usual" channel, where adults must use expectations fitted to the child language environment to recover meaning from child productions. That is, while hearing /wid/ might typically suggest weed or wheat as a speaker's intended word (based solely on acoustic information), an adult caregiver might instead recover read as the intended word from a child speaker.

In what follows, we seek evidence for the role of child-directed listening in language development.
We present a computational framework to predict what adults are likely to recover from children's imperfect speech, and compare it to what adults actually recovered. As a proxy for caregivers' real-time interpretations, we use the orthographic annotations made by trained in-lab transcribers of spontaneous at-home child language recordings. This approach allows us to characterize the utility of adult listeners' expectations, versus the acoustic/phonetic signal produced by the child. To capture the degree to which listening is truly child-directed (i.e., distinct from adult-directed listening), we compare the utility of expectations tuned on large-scale adult corpora, versus expectations tailored to reflect the child language environment.
We focus here on the adult listener's task of recovering meaning from noisy child productions. Specifically, we look at a large set of phonetically-transcribed productions (e.g., /ɑə wɑndə wid/ in Table 1A) from the Providence corpus (Demuth et al., 2006), and treat the challenge of inferring a word identity in context (here, an orthographic word like read) as a masked word prediction task (Devlin et al., 2019). To combine the contributions of caregiver expectations given the linguistic context with the specific sequence of phonemes produced by the child, we employ a Bayesian model of spoken word recognition in the vein of Norris & McQueen (2008), which assigns a probability to a candidate word identity w given corresponding perceptual input d in context c:

P(w | d, c) = P(d | w, c) P(w | c) / Σ_{w′ ∈ V} P(d | w′, c) P(w′ | c)    (1)

This cashes out the intuition that the probability assigned to a candidate word w in spoken word recognition reflects the combination of (a) fit to perceptual data and (b) linguistic expectations. Fit to perceptual data is evaluated via a likelihood function, P(d | w, c), which reflects the probability that the word w would generate the observed data d in context c. Linguistic expectations are captured in the prior, P(w | c), or the anticipated probability of the word in context c, absent any perceptual data. The denominator in Equation 1 reflects the summed strength of all competitor words w′ in the candidate vocabulary V. (One intriguing deviation from the classic noisy-channel setup is that adults may "recover" messages when children do not intend to communicate anything at all, i.e., drawing from a noise distribution.)
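To make Equation 1 concrete, here is a minimal sketch in Python. The toy vocabulary, the prior values, and the exponentiated-edit-distance likelihood are illustrative stand-ins (the actual candidate set, priors, and likelihood are detailed in the Methods), but the structure mirrors the equation: likelihood times prior, normalized over the candidate vocabulary V.

```python
from math import exp

def levenshtein(a, b):
    """Minimal number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def posterior(produced, citation_forms, prior, beta=2.0):
    """Equation 1: P(w | d, c) ∝ P(d | w, c) * P(w | c), normalized over V."""
    likelihood = {w: exp(-beta * levenshtein(produced, ipa))
                  for w, ipa in citation_forms.items()}
    unnorm = {w: likelihood[w] * prior[w] for w in citation_forms}
    z = sum(unnorm.values())  # denominator of Equation 1
    return {w: p / z for w, p in unnorm.items()}

# Toy example: the child says /wid/; citation forms and prior are invented.
citation = {"weed": "wid", "read": "ɹid", "wheat": "wit"}
prior_ctx = {"weed": 0.01, "read": 0.90, "wheat": 0.09}  # context favors "read"
post = posterior("wid", citation, prior_ctx)
```

Even though /wid/ matches the citation form of weed exactly, a context prior that strongly favors read can dominate the posterior, mirroring the /wid/ → read example above.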
Thus, the predictions derived from the model (a posterior) constitute a probability distribution over candidate words, with highly favored interpretations receiving more of the probability mass than disfavored ones.

Our principal goal is to find a model that best simulates how adults understand children. We discuss the likelihood and prior of the set of models under consideration in turn. All models used the same likelihood, derived from measures of pairwise string similarity between a phonemic transcription of the child's production and the phonemic forms of all candidate words (translated into the International Phonetic Alphabet, IPA, via a dictionary of conventional English pronunciations). To illustrate, given the transcribed production /wid/, the likelihood term for the candidate word weed (citation phonetic form /wid/) will be higher than the likelihood term for the candidate word read (where the citation phonetic form /ɹid/ differs by one phoneme).

However, the inferential process sketched in Equation 1 foreshadows the inadequacy of the acoustic signal alone: if children often produce noisy, idiosyncratic phoneme sequences, the prior must do more "work." The priors we evaluate take the form of probabilistic language models: computational models that return a probability distribution over word guesses, based on the surrounding linguistic context (Table 1C). When priors from each model are combined with the likelihood, they yield posterior distributions (Table 1D).

Here, we take advantage of a distinction within the transcripts of caregiver-child speech in the PhonBank database (Rose & MacWhinney, 2014), which allows us to evaluate competing models on two different dimensions. First, we evaluate models in their ability to reproduce the specific words recovered by annotators.
This analysis focuses specifically on what we term communicative successes (Table 1A): instances where a phoneme sequence was not only phonemically transcribed (PhonBank %phon tier), but also received a gloss, or orthographic transcription. (We cannot know whether the word recovered by an annotator was the word intended by the child speaker.) This allows us to assess the probability that each model assigns to the annotator-recovered word, with the best model being the one that assigns the highest average probability (alternatively, the lowest surprisal, or negative log probability) to the glosses. (For statistics-oriented readers, this is the per-instance log-likelihood of the data under the model.)

Second, we test whether models can predict when a child's production will not receive a gloss (reflecting the annotator's uncertainty as to the child's intended word). This analysis relies on the communicative successes described above, as well as so-called communicative failures: instances where phoneme sequences are transcribed, but lack a gloss, due to difficulty in identifying the child's intended word (Table 1B). In the absence of an annotator-recovered word, surprisal cannot be calculated. Instead, we measure the "peakedness" of the guesses regarding word identity by calculating the information entropy of the posterior distribution:

H(X) = − Σ_{i=1}^{n} P(x_i) log P(x_i)    (2)
The best model under this analysiswill be the one most able to discriminate failures from successes on the basis of entropy.We measure this with the receiver operating characteristic , or ROC, which measuresthe diagnostic ability of a classifier over the range of possible thresholds.In the third analysis, we quantify how much the estimate of word identity changesas a function of 1) conditioning on context (using a fitted prior) 2) conditioning ondata (the posterior when using a uniform prior), or 3) conditioning on both context and data (the posteriors reflecting the fitted priors). As a baseline for comparison, westart with a uniform prior, where all words in the vocabulary are equiprobable. Wethen measure the per-word average information gain , or Kullback-Liebler divergence,between that uniform prior distribution and each of the distributions identified above.Information gain can be interpreted as a measure of entropy reduction , correspondingto the difference between the uniform prior and the somewhat more peaked estimatesof word identity under the fitted priors, and the (usually) yet more peaked estimatesunder the posteriors. If the models are using the perceptual signal to identify words,then the prior information gain will be small in comparison to the posterior informationgain. If, by contrast, caregivers are relying heavily on their prior expectations, then theprior information gain will be larger with respect to the posterior information gain.A further question is how these measures of information gain will track with devel-opmental time. We expect prior information gain to increase over developmental time:as the child says more words in the surrounding context, the priors can better constrainguesses for the masked words (placing more mass on a smaller set of words, reflectedin lower entropy). At the same time, as children’s productions approximate conven-tional pronunciations, we expect to see an increase in posterior information gain. 
It remains to be seen how these two quantities will interact.

Methods

We test several language models in their ability to predict adult caregivers' interpretations of children's linguistic and proto-linguistic vocalizations in the Providence corpus (Demuth et al., 2006). Utterances and phonological transcripts with both phonemic and orthographic transcription were retrieved through childes-db 2020.1 (Sanchez et al., 2019).
We selected as communicative successes all tokens produced by children in the intersection of four criteria: (1) possessing monosyllabic IPA forms (motivated below), (2) possessing no unintelligible (CHILDES code xxx) or phonology-only (yyy) tokens in the same utterance, (3) whose gloss is extant as a token in BERT (motivated below), and (4) whose gloss is included in the Carnegie Mellon Pronunciation Dictionary (henceforth CMU dictionary). Communicative failures had to meet the first criterion, but must have received the gloss of yyy (with no other yyy or xxx in the same utterance). Under these definitions, an utterance could contain several communicative successes, but at most one failure.

Table 1: Examples of communicative success and failure, with samples from highest-ranked prior and posterior candidates.

PhonBank transcript:

  A. Communicative Success†                   B. Communicative Failure†
  MOT: this is                                MOT: do you want ta put some beans in your eggs?
  MOT: you want mamma                         MOT: let's see
  CHI: no
  %phon CHI: /ɑə wɑndə wid*/                  %phon CHI: /ju meɪk joəɹ fɛt*/
  gloss: I want to <read>                     gloss: you make your <unintelligible> → yyy
  MOT: okay that's fine                       MOT: can I make one?
  MOT: okay mommy's gonna pick out a book     MOT: no

C. Best Prior guesses for wid / fɛt:
  CDL+Context‡: see (.86), look (.03), go (.02), play (.01) / own (.74), house (.01), shapes (.01), friends (.01)
  BERT+Context‡: read (.49), see (.28), play (.04), know (.04) / own (.25), choice (.24), point (.04), bed (.03), call (.03)
  CHILDES 1gram: I (.04), a (.03), the (.03), yeah (.03) / I (.04), a (.03), the (.03), yeah (.03), it (.02)

D. Best Posterior guesses for wid / fɛt:
  CDL+Context‡: see (.967), watch (.012), read (.005), look (.001) / own (.59), feet (.27), foot (.02), food (.01), hat (.01)
  BERT+Context‡: read (.61), see (.35), watch (.01), hear (.01) / bet (.31), own (.24), cut (.06), shot (.04), bed (.03)
  CHILDES 1gram: we (.34), need (.11), and (.06), would (.04) / it (.15), that (.11), fit (.06), what (.06), feet (.05)

* masked phoneme sequence   † MOT = Mother, CHI = Child   ‡ Model considers +/− 20 utterances of surrounding context.

The inventory of candidate words considered by each model was the intersection of (1) words in the CMU dictionary with one or two syllables, (2) tokens present in BERT (motivated below), and (3) tokens that appeared 3 or more times in CHILDES (to limit to words that might reasonably be said in this context). This means that while only one-syllable phoneme sequences were analyzed, two-syllable words were also considered as possible candidate interpretations. The final inventory of candidates, V, included 7,904 words. We reconcile IPA formats following a procedure detailed in our code.

For each communicative success and failure, we retrieve prior probabilities over candidate words using a suite of probabilistic language models. As a "best" prior architecture, we use BERT (Devlin et al., 2019), which has demonstrated extremely competitive performance for single-word completion tasks, including spoken word recognition (Salazar et al., 2020). By virtue of its attentional mechanisms, BERT is able to effectively model long-distance dependencies (Jawahar et al., 2019), and capture speech register and discourse-level information. We compute the probabilities for the masked word P(w) from BERT, using a language modeling head with the transformers library (Wolf et al., 2020). For each masked phoneme sequence, we take the real-valued vector of predictions corresponding to the model's vocabulary, extract the activations corresponding to the candidate words, and compute the softmax to yield a vector of probabilities over the candidate words (Table 1).

We test an "off-the-shelf" model of BERT trained on large quantities of (principally adult-directed) language scraped from the internet, predicting the word from the immediate utterance only (BERT+OneUtt).
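The candidate-extraction step can be sketched as follows. In the actual pipeline the activation vector comes from the language modeling head of BERT (via the transformers library) at the masked position; here the logits and the candidate set are invented stand-ins, and the sketch shows only the renormalization over the candidate vocabulary.

```python
from math import exp

def candidate_prior(logits, candidates):
    """Renormalize masked-LM activations over the candidate vocabulary only:
    extract each candidate's logit and apply a softmax across candidates."""
    m = max(logits[w] for w in candidates)  # subtract max for numerical stability
    exps = {w: exp(logits[w] - m) for w in candidates}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

# Hypothetical activations for the masked position (in practice, the row of
# BERT's LM-head output at the [MASK] token); the values here are invented.
logits = {"read": 4.1, "see": 3.5, "play": 1.2, "the": 5.0, "##ing": 2.0}
prior = candidate_prior(logits, ["read", "see", "play"])  # candidates only
```

Note that words outside the candidate inventory (here "the" and the word-piece continuation "##ing") simply receive no probability, since the softmax is taken over the candidate set rather than the model's full vocabulary.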
We additionally test the predictions of a BERT model meant to best capture adult expectations about children's utterances. To do this we "fine-tune" the above model on adult and child CHILDES utterance glosses, excluding PhonBank. In fine-tuning, a new model is initialized with an "off-the-shelf" model, then the weights in the model are updated to best predict masks inserted into a new training set (in this case, the lines of 80% of CHILDES transcripts; 20% were held out for model validation). This fine-tuned model (CDL+OneUtt) should be expected to be more representative of adult linguistic expectations in understanding child speech than the off-the-shelf model for three reasons. First, it should assign higher probability to words that are common in speech to and from children. Second, it should assign higher probability to non-sentence fragments, which are ubiquitous in conversational speech but somewhat less prevalent in adult-directed written language. Third, it may prove capable of developing an expectation for the dyadic, back-and-forth structure of scenes typically captured in transcripts.

In addition to fine-tuning the model, we manipulate whether prior estimates reflect access to the larger discourse context as captured by the transcript before and after a phoneme sequence. In that these models are meant to be representative of caregiver expectations, they condition the prediction of the masked token on what the caregiver and child both say, both before and after the masked token. We create priors parallel to those above by feeding the models 20 utterances preceding and following each mask (CDL+Context and BERT+Context).

BERT has its own vocabulary, which imposes limitations on the vocabulary in the analysis. Standard implementations of BERT split longer words into "word pieces," or most common repeated sub-sequences. In English, this often yields morphological segmentation (e.g., fishing → fish + ##ing), but the process is highly noisy. For the purposes of predicting a masked word, BERT predicts only one word or word piece. We limit the vocabulary to word-initial word pieces like fish, and exclude continuations like ##ing from consideration. This also motivates the choice to predict monosyllabic phoneme sequences, in that the model does not allow us to predict multiple words (which might be contained in yyy).

In addition to the BERT models, we also test two simpler priors. The first is a simple smoothed unigram model estimated from counts in CHILDES. This model, CHILDES 1gram, assigns probability to all word types proportional to their counts in the same CHILDES dataset used in the CDL models, above. To account for unseen data, we add a small pseudocount (.001) smoothing to all counts before computing probabilities. The second is the UniformPrior model, which assigns equal probability to all words (1/|V|, where |V| is the number of candidates). This provides the comparison case of a maximally uninformative prior.

For the likelihood, P(d | w), we use a transformation of string edit distance between the phoneme sequence produced by the child and all candidate words. Specifically, we use exponentiated negative edit distance (Levy, 2008):

P(d | w) ∝ e^(−β × dist(d′:w′, d))    (3)

where dist is the Levenshtein distance (minimal number of deletions, insertions, and substitutions) between the citation form d′ for candidate word w′, designated here (d′:w′), and the observed transcription d. For the results presented here, we grid sample β values between 1 and 6. A limitation of simple edit distance is that it does not capture the fact that certain phonemes are much more perceptually similar than others; we propose another, more sophisticated likelihood function that captures this in the Discussion.

All model training and analysis code, as well as the fine-tuned model, can be accessed at https://osf.io/v7c3e/?view_only=176bb0f538af424da59007c53eff7e05.

Results
A comparison of Bayesian speech recognition models reflecting different priors reveals that the CDL+Context prior assigns the lowest average surprisal (highest average probability) to the recovered word gloss in the transcript. As Table 2 reveals, BERT models making use of context perform better than those that do not. CHILDES-tuned BERT models outperform the respective off-the-shelf BERT models. All BERT models outperform the CHILDES 1gram model, and all models with fitted priors assign significantly higher probability to the recovered glosses than the UniformPrior model. These results mean that the model that is (1) fine-tuned to the child environment and (2) uses the surrounding utterance context is best able to predict the recoveries made by adults.

We next investigate how the prior probabilities in the previous analysis combine with likelihoods to predict word identity. That is, how do the adults' prior expectations support inference when children's productions are more or less adult-like? Comparing average surprisal across edit distances (Figure 1) reveals that models using BERT-based priors assign massively higher probability to word identities posited by annotators. For child productions that are 2 phonemes away from the citation form (x = 2), CDL+Context assigns on average a probability of .24 (2^−surprisal) to the correct gloss. This compares favorably to .12 under BERT+Context, .08 under CDL+OneUtt, .03 under BERT+OneUtt, .006 under the CHILDES 1gram, and .002 under UniformPrior. CDL+Context assigns uniformly higher probability (lower surprisal) to the correct word identity, particularly when the phonetic form is more dissimilar (3 or more edits). This means that priors support recognition more when the perceptual input is noisier.
Table 2: Average prior surprisal in bits on communicative successes from the Providence corpus (lower is better), for the CDL+Context, CDL+OneUtt, BERT+Context, BERT+OneUtt, CHILDES 1gram, and UniformPrior models. The difference in average probability assigned to the actual gloss is 2^diff, where diff is the difference between two model scores. Paired t-tests confirm significant differences between models.

Figure 1: Posterior surprisal (negative log probability, in bits) of the recovered meaning for communicative successes, by edit distance from the actual to the citation form for the word (lower is better), for the CDL+Context, CDL+OneUtt, BERT+Context, BERT+OneUtt, CHILDES 1gram, and UniformPrior models. Error bars indicate standard error of the mean.

A separate question is which model best predicts whether a particular phoneme sequence will be a communicative success or failure. We address this by testing how well posterior entropy under the models can predict communicative failures. As with the first analysis, the CDL+Context model provides the best trade-off between the prevalence of true positives and false positives (Figure 2). As both UniformPrior and CHILDES 1gram models assign constant entropy to phoneme sequences (prior probabilities of candidates do not change as a function of context), their posterior entropy only reflects the contribution of the perceptual data. This analysis provides converging evidence that a model that is tuned specifically to child language and uses the surrounding utterance context (the one that best instantiates child-directed listening) is best able to replicate adult inferences.
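The ROC analysis above can be sketched as a threshold sweep over posterior entropies. The entropy values below are invented; a communicative failure counts as a "positive," predicted whenever entropy exceeds the threshold.

```python
def roc_points(success_entropies, failure_entropies):
    """Sweep an entropy threshold over all observed values; at each threshold,
    predict 'failure' for items whose posterior entropy exceeds it, and record
    the resulting (false positive rate, true positive rate) pair."""
    points = []
    for t in sorted(set(success_entropies) | set(failure_entropies)):
        tpr = sum(e > t for e in failure_entropies) / len(failure_entropies)
        fpr = sum(e > t for e in success_entropies) / len(success_entropies)
        points.append((fpr, tpr))
    return points

# Hypothetical posterior entropies (bits): failures tend to be higher-entropy.
successes = [0.2, 0.5, 0.8, 1.1]
failures = [0.9, 1.8, 2.4]
curve = roc_points(successes, failures)
```

A model whose curve rises toward high true-positive rates at low false-positive rates discriminates failures from successes well; comparing such curves across models reproduces the comparison in Figure 2.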
Finally, we quantify the information gain over time in conditioning on context (the fitted priors), conditioning on data (the posterior under the UniformPrior model), and conditioning on both (the posteriors corresponding to the fitted priors). This analysis shows a larger shift in the probability distribution over candidates (greater information gain) going from the uniform prior to the CDL+Context prior compared to going from the uniform prior to its corresponding posterior (red line vs. green line in panel 1 of Figure 3). That is, the prior under the CDL+Context model contributes more information (better constrains guesses to word identity) than perceptual information alone. Contrary to our predictions, we find that the information gain for the prior is relatively constant over time for the CHILDES-fitted models. This suggests that child-directed listening can helpfully constrain adult listeners' interpretations of children's earliest verbal productions. As expected, children's improving articulatory abilities result in an increase of all models' posteriors over developmental time, as the likelihood function shared across models is able to contribute more and more to the task of interpretation.

Figure 2: Classification performance in predicting communicative failures, as measured by the ROC of posterior entropy (proportion of true positives vs. proportion of false positives; higher is better), for the CDL+Context, CDL+OneUtt, BERT+Context, BERT+OneUtt, CHILDES 1gram, and UniformPrior models. The solid line with slope = 1 indicates chance; the area above this line indicates better classification performance.

Figure 3: Average information gain (in bits), by child age in months, from conditioning word prediction on context only (red, corresponding to the prior), perceptual data only (green), and context and perceptual data (blue, corresponding to the posterior), relative to a uniform prior; one panel per model (CDL+Context, CDL+OneUtt, BERT+Context, BERT+OneUtt, CHILDES 1gram).
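For a uniform reference prior, the information gain reduces to a simple expression: D_KL(P ‖ U) = Σ p log2(p·|V|) = log2|V| − H(P), i.e., exactly the entropy reduction described in the third analysis. A sketch, with invented distributions and a toy vocabulary size:

```python
from math import log2

def information_gain(dist, vocab_size):
    """KL divergence D(P || U) from a uniform prior over vocab_size words,
    in bits; equivalently the entropy reduction log2(vocab_size) - H(P).
    Words absent from `dist` are treated as having probability zero."""
    return sum(p * log2(p * vocab_size) for p in dist.values() if p > 0)

V = 8  # toy vocabulary size (the actual |V| in this study is 7,904)
fitted_prior = {"read": 0.5, "see": 0.5}  # context narrows guesses to 2 words
post = {"read": 0.9, "see": 0.1}          # perceptual data sharpens further
```

Comparing the prior's gain to the posterior's gain, per the third analysis, indicates how much of the interpretive work is done by expectations versus by the perceptual signal.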
Discussion
Language development is often characterized in terms of an increasing facility with processes on the side of the learner: developing motor planning, recognizing regularities of linguistic structure at different levels, and relating structure to entities and communicative contexts in the world. The current work suggests that early verbal communication depends not only on these well-studied developmental processes, but also on cognitive processes in the minds of adult caregivers.

We note two limitations with the current work before discussing its implications. First, the simple measure of edit distance does not capture the perceptual confusability of phonemes: bug and rug are equally good candidates for pug. One potential elaboration would be to use a weighted edit distance measure that takes into account the perceptual confusability of the phonemes. For example, using a probabilistic finite-state string transducer would allow assigning edits different "costs" according to experimentally-obtained confusion probabilities (e.g., Cutler et al., 2004).

Second, we make the simplifying assumption that inferences made by adult annotators in the lab are representative of the inferences made by adult caregivers in the moment, communicating in real time with children. While the inferential capacities of annotators are likely substantially less than those of adult caregivers (who have access to the non-linguistic context, as well as significantly more shared history with the child), research assistants may well be a decent proxy for adult listeners, due to their training as transcribers and exposure to child language. Potential differences in the inferential capacities of caregivers relative to other adult "listeners" should be tested experimentally.

These results additionally call attention to the interpretation of common methods in child language research.
For example, vocabulary production measures on the Communicative Development Inventories (Fenson et al., 2007) have historically been interpreted as an index of children's vocabulary and articulatory maturity. However, the current work suggests that successful communication (adult recognition of a word as a conventional form) relies additionally on adult inferential processes. Indeed, the measure of a word's "babiness," a significant predictor of the order of children's reported vocabulary production, may reflect the degree to which a word is more likely in child-directed speech compared to adult-directed speech.

Furthermore, our data invite a reconstrual of the nature of feedback in early language development. For example, if we assume that successful communication is itself reinforcing, child-directed listening might provide feedback to the child learner even in the absence of child-directed speech: a caregiver who interprets a child's production of "uh" to mean "up" may not say anything in response to the child's production, but provides feedback by effecting change on the part of the child when they pick the child up. This, in turn, leads to new puzzles: if adult caregivers can help many deficient communicative acts succeed, what presses children to get better?

Finally, we speculate regarding the role that child-directed listening might play in the emergence of language, both on evolutionary timescales and in cases of rapid language emergence like Nicaraguan Sign Language. The current work suggests that successful recovery of meaning from child speech acts reflects not only the inductive biases, linguistic knowledge, and articulatory maturity of speakers, but also the inferential biases of listeners.

Conclusion

We present a suite of Bayesian models of spoken word recognition to characterize the process of child-directed listening: how adult caregivers find meaning in the noisy and often non-conventional speech productions of young children. We find that priors capitalizing on recent neural architectures, when trained specifically on child speech samples and taking advantage of the greater linguistic context to make predictions, are best able to simulate adult inferential processes when interpreting noisy child speech. This research paves the way for understanding how children learn to employ language as goal-seeking agents in the presence of others.

References
Cutler, A., Weber, A., Smits, R., & Cooper, N. (2004, Dec). Patterns of English phoneme confusions by native and non-native listeners. J Acoust Soc Am, (6), 3668–3678.

Demuth, K., Culbertson, J., & Alter, J. (2006). Word-minimality, epenthesis and coda licensing in the early acquisition of English. Lang Speech, (2), 137–174.

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (pp. 4171–4186). Association for Computational Linguistics.

Fenson, L., et al. (2007). MacArthur-Bates Communicative Development Inventories. Paul H. Brookes Publishing Company, Baltimore, MD.

Gibson, E., Bergen, L., & Piantadosi, S. (2013). Rational integration of noisy evidence and prior semantic expectations in sentence interpretation. Proceedings of the National Academy of Sciences, (20), 8051–8056.

Golinkoff, R. M., Can, D. D., Soderstrom, M., & Hirsh-Pasek, K. (2015). (Baby)Talk to me: The social context of infant-directed speech and its effects on early language acquisition. Current Directions in Psychological Science, (5), 339–344.

Jawahar, G., Sagot, B., & Seddah, D. (2019, July). What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3651–3657). Association for Computational Linguistics.

Levy, R. (2008). A noisy-channel model of human sentence comprehension under uncertain input. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (pp. 234–243).

Norris, D., & McQueen, J. M. (2008, Apr). Shortlist B: A Bayesian model of continuous speech recognition. Psychol Rev, (2), 357–395.

Rose, Y., & MacWhinney, B. (2014). The PhonBank project.

Salazar, J., Liang, D., Nguyen, T. Q., & Kirchhoff, K. (2020, July). Masked language model scoring. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 2699–2712). Online: Association for Computational Linguistics.

Sanchez, A., Meylan, S., Braginsky, M., MacDonald, K., Yurovsky, D., & Frank, M. (2019). childes-db: A flexible and reproducible interface to the child language data exchange system. Behavior Research Methods, (4), 1928–1941.

Shannon, C. (1951). Prediction and entropy of printed English. Bell System Technical Journal, 50–64.

Snow, C. E., & Ferguson, C. A. (1977). Talking to children. Cambridge University Press.

Soderstrom, M. (2007). Beyond babytalk: Re-evaluating the nature and content of speech input to preverbal infants. Developmental Review, (4), 501–532.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., ... Rush, A. M. (2020, October). Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 38–45). Online: Association for Computational Linguistics.