A phonetic model of non-native spoken word processing
Yevgen Matusevych (School of Informatics, University of Edinburgh)
Herman Kamper (E&E Engineering, Stellenbosch University)
Thomas Schatz (Department of Linguistics & UMIACS, University of Maryland)
Naomi H. Feldman (Department of Linguistics & UMIACS, University of Maryland)
Sharon Goldwater (School of Informatics, University of Edinburgh)
Abstract
Non-native speakers show difficulties with spoken word processing. Many studies attribute these difficulties to imprecise phonological encoding of words in the lexical memory. We test an alternative hypothesis: that some of these difficulties can arise from the non-native speakers' phonetic perception. We train a computational model of phonetic learning, which has no access to phonology, on either one or two languages. We first show that the model exhibits predictable behaviors on phone-level and word-level discrimination tasks. We then test the model on a spoken word processing task, showing that phonology may not be necessary to explain some of the word processing effects observed in non-native speakers. We run an additional analysis of the model's lexical representation space, showing that the two training languages are not fully separated in that space, similarly to the languages of a bilingual human speaker.
Introduction

Compared to native speakers, non-native speakers perform differently in a variety of tasks related to auditory language processing, both at the phone and at the word level. At the phone level, these tasks usually require speakers to compare individual phones (e.g., phone discrimination or identification), while spoken word processing tasks usually test the implicit activation of a certain word in the memory (e.g., lexical priming, word translation). In some cases, non-native speakers' behavior is consistent across the tasks: lower performance in spoken word processing tasks is directly associated with difficult phone contrasts. For example, upon hearing the word rock, Japanese speakers activate both rock and lock in their lexical memory (Cutler and Otake, 2004), probably because they find it difficult to discriminate the English [ɹ]–[l] phone contrast (Miyawaki et al., 1975). In other cases, however, non-native speakers' behavior in spoken word processing tasks cannot be explained by difficult phone contrasts (Cook et al., 2016; Amengual, 2016; Darcy et al., 2012). For example, in a translation task native English speakers may confuse the Russian words moloko [məɫʌˈko] ('milk') and molotok [məɫʌˈtok] ('hammer'), even though this pair of words does not contain a difficult phone contrast (Cook et al., 2016).

This dissociation between the behavior in phone discrimination vs. spoken word processing tasks has been attributed to the different kinds of representations involved. On the one hand, thanks to phonetic knowledge, speakers recognize individual phones in a given language. On the other hand, speakers store phonological representations of the words they know in their mental lexicon and use those representations to recognize spoken words (e.g., Pallier et al., 2001). It is often implicitly assumed that any lexical processing effect should be attributed to stored phonological representations of the words (Gor and Cook, 2020; Cook et al., 2016; Cook and Gor, 2015; Darcy et al., 2013, 2012; McQueen et al., 2006). At the same time, phonetic effects are not limited to the perception of individual phones: some existing theories argue that phonetic details are encoded in the lexical memory (e.g., Pierrehumbert, 2002; Hawkins, 2003; Port, 2007). This raises the question: can some of the spoken word processing effects normally attributed to phonology be explained instead in terms of phonetic perception?

In this study, we use computational modeling to test the hypothesis that some spoken word processing effects observed in non-native speakers can be explained without invoking phonolexical representations, namely by phonetic perception, which results from phonetic learning, or speakers' attunement to the sounds of their native language (Werker and Tees, 1984). We use a model developed for speech technology applications (Kamper, 2019). It was previously used to simulate early phonetic learning and successfully predicted some infant phone discrimination data (Matusevych et al., 2020b). This model learns from natural speech data that is not segmented at the phone level. It never receives information about individual phones in isolation or about phone-level differences between words. Therefore, the model is not equipped with an explicit mechanism to learn abstract phonolexical representations, making it a good candidate to test our hypothesis: if our model, without knowledge of phonology, can correctly predict a particular effect, that effect can at least partially be attributed to phonetic learning.
Because our goal is to see whether phonetic perception can explain some of the existing data in principle, even a single phonetic model making correct predictions about the data would be a positive result.

By design, lexical processing tasks require at least minimal knowledge of the target language. That is why they are normally carried out with bilingual speakers (or second language learners: Gor and Cook, 2020; Amengual, 2016; Cook et al., 2016, etc.). In bilingual speakers, the two languages interact at various levels, including the lexical level (e.g., Weber and Cutler, 2004; Sunderman and Kroll, 2006). To take this into account, we simulate bilingual speakers by training the model on two languages simultaneously.

We present three simulations. The first two show that the model exhibits predictable behaviors in discrimination tasks. In the third, we present a case study to test whether a lexical processing effect commonly attributed to phonological representations can be explained in terms of phonetic learning alone, without the influence of phonology. In addition, we examine whether the representations in our bilingual model match the pattern observed in bilingual lexical access. Existing studies (Weber and Cutler, 2004; Lagrou et al., 2011; Shook and Marian, 2012, etc.) show that upon the presentation of a word, competitor words in both languages may get activated (non-selective lexical access). We carry out a language classification task with our model, showing that the two languages are not fully separated in its representation space.

We train a computational model, a correspondence autoencoder recurrent neural network (CAE-RNN; Kamper, 2019), on speech data from one or two languages in order to simulate monolingual and bilingual speakers. We then test these different versions of the model on discrimination tasks and compare the observed patterns to those found in human speakers. This general methodological framework is adopted from Schatz et al. (2019). We run three simulations, described in more detail in the respective sections below. Although it would be ideal to use the same set of languages in each simulation, the choice of languages is limited by the available results from studies with human participants. We use language pairs for which human data is available and for which our model has previously been tested on the target languages in a monolingual context. In Simulation 1, we look at phone discrimination by infants exposed to two languages (English and Mandarin), showing that the model correctly predicts a discrimination pattern observed in such infants (Kuhl et al., 2003). In Simulation 2, we show that the model can predict discrimination effects at the word level observed in adult native English speakers and Japanese learners of English (MacKain et al., 1981). In Simulation 3, we show that a result obtained in a translation judgment task with native Russian speakers and English learners of Russian (Cook et al., 2016) can be at least partially explained in terms of phonetic learning, without the effects of phonology.
The CAE-RNN (Kamper, 2019) is an extension of a recurrent autoencoder (Chung et al., 2016) in which both the encoder and the decoder are recurrent neural networks. Unlike an autoencoder, the CAE-RNN is trained on pairs of word tokens of the same type (e.g., two acoustic instances of the word apple). It receives one instance of a word (represented as a speech sequence), encodes it into a vector of a fixed dimensionality (an acoustic embedding), and then tries to reconstruct the other instance in the pair, as shown in Figure 1.

Figure 1: The model learns to reconstruct an acoustic instance of a word, X', from another acoustic instance of the same word, X.

Formally, each training item is a pair of acoustic words $(X, X')$. Each word is represented as a sequence of vectors: $X = (\vec{x}_1, \dots, \vec{x}_T)$ and $X' = (\vec{x}'_1, \dots, \vec{x}'_{T'})$. The loss for a single training item is:

$\ell(X, X') = \sum_{t=1}^{T'} \lVert \vec{x}'_t - \vec{f}_t(X) \rVert^2$    (1)

where $X$ is the input and $X'$ the target output sequence, and $\vec{f}_t(X)$ is the $t$-th decoder output conditioned on the embedding $z$. At inference time, we can encode a sequence of arbitrary duration (e.g., a phone or a word) into a fixed-dimensional acoustic embedding in the model's representation space.

We choose this model because it has shown promise for the study of human cognition: it correctly predicted some patterns of infant phonetic learning (Matusevych et al., 2020b), and some of its basic properties are compatible with human auditory cognition and lexical access (Matusevych et al., 2020a). The advantage of this model compared to others (e.g., Schatz et al., 2019) is its ability to represent speech sequences of any duration in a common representation space (the embeddings have a fixed number of dimensions), in which perceptual similarity between sequences can be computed using a simple distance function. The model handles individual phones and acoustic words in exactly the same way, allowing us to easily generalize from phone-level to word-level representations.
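As a concrete illustration of this training objective, the sketch below shows one possible implementation of the CAE-RNN and the loss in Equation (1). It is not the authors' implementation: the single-layer GRUs, the layer sizes, and the way the decoder is conditioned on the embedding at every step are our own simplifying assumptions.

```python
# Minimal PyTorch sketch of a correspondence autoencoder RNN (CAE-RNN).
# Hyperparameters and architectural details are illustrative assumptions.
import torch
import torch.nn as nn


class CAERNN(nn.Module):
    def __init__(self, n_feats=13, hidden=256, embed_dim=130):
        super().__init__()
        self.encoder = nn.GRU(n_feats, hidden, batch_first=True)
        self.to_embedding = nn.Linear(hidden, embed_dim)  # acoustic word embedding z
        self.decoder = nn.GRU(embed_dim, hidden, batch_first=True)
        self.to_frame = nn.Linear(hidden, n_feats)        # predicted MFCC frames

    def embed(self, x):
        # x: (batch, T, n_feats) MFCC frames; the final encoder state is
        # mapped to a fixed-dimensional acoustic embedding.
        _, h = self.encoder(x)
        return self.to_embedding(h[-1])                   # (batch, embed_dim)

    def forward(self, x, target_len):
        # Condition every decoder step on the embedding of the input instance
        # and predict the frames of the other instance of the same word type.
        z = self.embed(x)
        z_rep = z.unsqueeze(1).expand(-1, target_len, -1)
        out, _ = self.decoder(z_rep)
        return self.to_frame(out)                         # (batch, T', n_feats)


def correspondence_loss(model, x, x_prime):
    # Eq. (1) for each pair (X, X'): sum over target frames of the squared
    # distance between predicted and target frames, averaged over the batch.
    pred = model(x, target_len=x_prime.size(1))
    return ((pred - x_prime) ** 2).sum(dim=-1).sum(dim=-1).mean()


# Toy usage: a batch of two acoustic instances per word type (padding omitted).
x = torch.randn(8, 60, 13)         # input instances X
x_prime = torch.randn(8, 55, 13)   # target instances X' of a different duration
model = CAERNN()
loss = correspondence_loss(model, x, x_prime)
loss.backward()
```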
In addition, the model has been successfully trained on multiple languages for a speech technology application (Kamper et al., 2020a,b), potentially making it a good candidate for simulating bilingual speakers.

Following earlier studies (Kamper, 2019; Matusevych et al., 2020b), we first pretrain the model as an autoencoder RNN without early stopping, using the Adam optimizer (Kingma and Ba, 2015), and then train it on ground-truth word pairs from either one or two languages, as described next. Both the encoder and the decoder consist of several layers of gated recurrent units, and the acoustic embeddings have a fixed dimensionality.

The model is trained on isolated words and tested on either phones or words extracted from corpora of natural speech, based on existing forced alignments (Matusevych et al., 2020b; Kamper et al., 2020b). All speech data is encoded using a common approach in speech processing: each speech sequence is divided into short, regularly sampled frames, from which Mel-frequency cepstral coefficients (MFCCs) are extracted using Kaldi (Povey et al., 2011).

The subsets of the corpora that we use are listed in Table 1. Within each pair in part A of the table, the subsets are matched on the number of speakers, their gender, and the amount of data per speaker. This ensures that we only compare models trained on the same amount and type of data. In Simulations 1 and 2, we follow the setup from a previous study and use two subsets from different corpora per language. In Simulation 3, we could not obtain two different corpora of Russian speech, and instead train the model five times with different random initializations on the same data.

Table 1: Corpus samples used in the simulations. A. Training data: Wall Street Journal CSR corpus (Paul and Baker, 1992); multilingual text and speech database (Schultz, 2002); Buckeye corpus of conversational speech (Pitt et al., 2005); Corpus of Spontaneous Japanese (Maekawa, 2003); open-source Mandarin speech corpus (Bu et al., 2017).

In the case of bilingual models, we train them simultaneously on both languages, using mixed input. We use the relative amount of training data in each language as a simple proxy variable for language proficiency: the higher the model's relative exposure to a language, the higher its proficiency in that language. In bilingual training, we use the same total amount of data as for the corresponding monolingual models, in terms of both the number of tokens (used for pretraining the model) and the number of training pairs. For example, consider a monolingual English model and a monolingual Mandarin model, each trained on a fixed number of tokens and pairs. To train a 'balanced' bilingual model, we take the most frequent tokens from English and Mandarin in equal proportions, generate pairs in each language, and use the combined English–Mandarin token data for pretraining and the combined pairs for training the CAE-RNN.
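As a rough sketch of how such ratio-controlled bilingual training sets could be assembled, consider the code below. The function names are ours, and the random sampling of tokens is a simplification of the frequency-based selection described above.

```python
# Illustrative sketch (not the authors' code) of mixing two languages into one
# training set whose total size matches the corresponding monolingual setups.
import itertools
import random
from collections import defaultdict


def same_word_pairs(tokens):
    """tokens: list of (word_type, features); return all same-type pairs."""
    by_type = defaultdict(list)
    for word_type, feats in tokens:
        by_type[word_type].append(feats)
    pairs = []
    for instances in by_type.values():
        pairs.extend(itertools.combinations(instances, 2))
    return pairs


def bilingual_training_set(tokens_by_lang, ratios, n_tokens, n_pairs, seed=0):
    """
    tokens_by_lang: e.g. {"eng": [...], "cmn": [...]}, lists of word tokens.
    ratios: e.g. {"eng": 0.5, "cmn": 0.5}, the relative exposure to each
        language (a proxy for proficiency); values should sum to 1.
    Returns tokens for pretraining the AE-RNN and pairs for training the
    CAE-RNN, matching a monolingual setup with n_tokens tokens and n_pairs pairs.
    """
    rng = random.Random(seed)
    tokens, pairs = [], []
    for lang, ratio in ratios.items():
        lang_tokens = rng.sample(tokens_by_lang[lang], int(round(n_tokens * ratio)))
        lang_pairs = same_word_pairs(lang_tokens)
        n_lang_pairs = min(len(lang_pairs), int(round(n_pairs * ratio)))
        pairs.extend(rng.sample(lang_pairs, n_lang_pairs))
        tokens.extend(lang_tokens)
    rng.shuffle(pairs)
    return tokens, pairs


# Toy usage with dummy one-dimensional "features" standing in for MFCC arrays.
eng = [("apple", [0.1]), ("apple", [0.2]), ("water", [0.3]), ("water", [0.4])]
cmn = [("shui", [0.5]), ("shui", [0.6]), ("pingguo", [0.7]), ("pingguo", [0.8])]
tokens, pairs = bilingual_training_set(
    {"eng": eng, "cmn": cmn}, {"eng": 0.5, "cmn": 0.5}, n_tokens=8, n_pairs=4)
```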
To test a model's ability to discriminate a particular phonetic or lexical contrast, we use the machine ABX task (Schatz et al., 2013), which is standard in research on zero-resource speech technology for evaluating the discriminability of speech units (Versteegh et al., 2015; Dunbar et al., 2017, 2020, etc.) and is commonly used for simulating human speech discrimination tasks (e.g., Martin et al., 2015; Schatz et al., 2019; Millet et al., 2019). The machine ABX task allows us to easily design precise comparisons (e.g., compare words that differ in only one phone) and is not sensitive to the absolute distances in the embedding spaces, which may vary across simulations.

In the ABX task, A and X are two instances of the same word type (e.g., right), while B is a different word type (e.g., light). If A and X are closer to each other in a model's representation space than B and X, the model's prediction is correct; otherwise it is not. Irrespective of the test units (phones or words), an acoustic segment in our model is represented by a single vector. We perform the ABX task directly on these vectors, which allows us to compare segments of different durations (without doing any type of alignment). Following earlier studies, we use the angular cosine distance to measure the distance between the stimuli in the embedding space. The model is evaluated on the proportion of ABX triplets for which it makes incorrect predictions: an error rate of 0% corresponds to perfect discrimination, and 50% to chance performance.

To test whether the difference between the ABX error rates of several models is significant, we fit mixed-effects regressions to the error rates of these models. Significance for the effect of interest is then determined using two-tailed ANOVA tests (with the Satterthwaite approximation for degrees of freedom) on the predicted values of the regressions.
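A minimal sketch of the machine ABX evaluation described above is given below, assuming the acoustic embeddings have already been extracted with a trained model. Triplet construction is simplified, and plain cosine distance is used instead of the angular version; the two are monotonically related, so the ABX decisions are identical.

```python
# Sketch of the machine ABX evaluation (Schatz et al., 2013) over acoustic
# embeddings; a toy usage with random vectors is included at the bottom.
import numpy as np
from scipy.spatial.distance import cosine


def abx_error(triplets):
    """
    triplets: iterable of (a, b, x) embedding vectors, where A and X are two
    instances of the same category (e.g., the word "rock") and B belongs to
    the contrasting category (e.g., "lock"). Returns the proportion of
    triplets for which X is not closer to A than to B: 0.0 is perfect
    discrimination, 0.5 is chance.
    """
    errors = []
    for a, b, x in triplets:
        errors.append(float(cosine(a, x) >= cosine(b, x)))
    return float(np.mean(errors))


# With random vectors standing in for embeddings, the error is close to chance.
rng = np.random.default_rng(0)
toy_triplets = [tuple(rng.normal(size=130) for _ in range(3)) for _ in range(100)]
print(abx_error(toy_triplets))   # approximately 0.5
```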
Simulation 1

Previously, the monolingual CAE-RNN was shown to correctly predict the crosslinguistic difference in the discrimination of the Mandarin [ɕ]–[tɕʰ] contrast (Matusevych et al., 2020b) observed in Mandarin-learning vs. English-learning infants (Tsao et al., 2006). Considering this, it may seem trivial to show that a model with some exposure to Mandarin data (bilingual) would also achieve lower error than a model with no such exposure (English monolingual). However, potentially complex interactions between the training languages may result in an unpredictable phonetic space. This is why we need to ensure that the bilingual model behaves as expected before we move on to the word-level tasks. To do this, we run a simple sanity check: whether a model trained on two languages behaves in a predictable way on the same Mandarin contrast.

We focus on the experiment of Kuhl et al. (2003), who showed that exposing English-learning infants to a small amount of Mandarin Chinese improves their ability to discriminate [ɕ]–[tɕʰ]. Our goal is to test whether the model also correctly predicts the pattern for English-learning infants with vs. without exposure to Mandarin.

It is difficult to estimate how the infants' amount of Mandarin exposure in the experiment maps onto the English–Mandarin ratio of training data in our model. We therefore try two ratios with higher exposure to English than to Mandarin, as well as a balanced English–Mandarin ratio (a control condition). As a baseline to compare our bilingual models to, we train the model on English speech alone. For reference, we also train a model on Mandarin speech alone. Using each model, we embed a set of [ɕ] and [tɕʰ] phones from the test corpus and run a [ɕ]–[tɕʰ] discrimination task. We expect each bilingual model to show lower error than the English monolingual model.

Figure 2: Models' ABX error rates in the Mandarin [ɕ]–[tɕʰ] phone discrimination task. Error bars show the standard error of the mean over two training corpus samples × two test samples.

The ABX error rates of the models (see Figure 2) show the expected pattern: the higher the exposure to Mandarin, the lower the error rate in the target discrimination task. Even the smallest share of Mandarin training data results, on average, in a reduction in absolute error compared to the monolingual English model. A mixed-effects regression fitted to the models' error rates shows that this particular difference is not significant, but the other two bilingual models do show significantly lower error rates than the monolingual English baseline. This suggests that even a relatively small amount of training data in a given language can improve the model's ability to discriminate some contrasts in that language, consistent with the empirical findings of Kuhl et al. (2003) with infants. To summarize, the bilingual CAE-RNN behaves as we expected: it can correctly predict infant-like behavior in phone discrimination. In the next simulation, we test the model on a word discrimination task.

Simulation 2

Simulation 2 tests whether our bilingual model behaves in predictable ways at the word level. Recall that the model represents a sequence of any duration as a fixed-dimensional vector. The compression of a dynamic speech sequence that unfolds in time into a 'static' vector results in information loss. Since words are normally longer speech sequences than phones, it is not obvious whether the model's behavior in word discrimination and phone discrimination will be consistent. To examine this, we test the model on minimal pairs of words with [ɹ]–[l], a phone contrast on which the monolingual CAE-RNN previously showed an infant-like crosslinguistic discrimination pattern (Matusevych et al., 2020b). As before, we train the model on one or two languages and see whether it behaves in a predictable way, this time in a word discrimination task.

MacKain et al. (1981) tested adult native speakers of American English and native Japanese learners of English on the discrimination of the English words rock–lock (i.e., the [ɹ]–[l] contrast). Learners with low English proficiency scored nearly at chance in this task, while highly proficient learners showed discrimination scores close to those of native English speakers. This result is also in line with studies showing that native Japanese speakers' discrimination of [ɹ]–[l] can improve after relevant phonetic training in English under certain conditions (e.g., Strange and Dittmann, 1984; Logan et al., 1991; Bradlow et al., 1997; Iverson et al., 2005). Our goal is to test whether our model can correctly predict this word-level discrimination pattern.

Similarly to Simulation 1, we train the model on Japanese speech alone (for reference) or English speech alone (as the baseline to compare to), or on a combination of the two in three Japanese–English proportions, to simulate native Japanese learners of English with variable proficiency. For each model, we embed a set of acoustic words ([ɹ]–[l] minimal pairs) from the test corpus. As in the original experiment, we compare the bilingual models to the native English model: to make correct predictions, a bilingual model needs to show a higher ABX discrimination error than the English model. We also expect the error rate to decrease with higher exposure to English.
Figure 3: Models' ABX error rates in the word discrimination task with [ɹ]–[l] minimal pairs (e.g., rock–lock). Error bars show the standard error of the mean over two training × two test corpus samples.

It is clear from Figure 3 that the monolingual Japanese model shows a higher ABX error rate on the word discrimination task than the monolingual English model. This extends the previous result for phone-level [ɹ]–[l] discrimination (Matusevych et al., 2020b) to the word level. Comparing all the models, we observe that the error rate decreases as the relative amount of English exposure increases. Even the smallest share of English in the training data improves discrimination in absolute error rate compared to the monolingual Japanese model, and the models with higher English exposure improve further. A mixed-effects regression fitted to the error rates shows that the pairwise differences between most models are statistically significant, except that the middle bilingual model shows error rates too close to those of both of its neighbors. Despite that, the expected trend is still present. In other words, our model successfully replicates the direction of the main effect in MacKain et al. (1981): the discrimination error rate decreases with higher English exposure.

This result shows that our bilingual model behaves in a predictable way at the word level, ruling out a potentially damaging effect that crosslinguistic interactions could have on its lexical representation space. With this knowledge, in the next simulation we proceed to apply our model to a spoken word processing task.

Simulation 3

In this section, we present a case study to show how the model can be used to gain a better understanding of spoken word processing. Specifically, we are interested in whether some of the effects reported in the literature can be explained in terms of phonetic learning alone. Difficulties with spoken word processing have been attributed to imprecise phonological encoding of non-native lexical representations (Gor and Cook, 2020; Cook et al., 2016; Cook and Gor, 2015; Darcy et al., 2013, 2012), which results in spurious activation of similar-sounding competitor words. To give an example from Cook et al. (2016), if the word parent is encoded as [pɛɹə(n)t], with an optional [n], it may often be confused with parrot [pɛɹət].

We focus on one of the experiments in Cook et al. In a translation judgment task, native Russian speakers (proficient in English) and native English speakers (learning Russian) heard a Russian word (e.g., moloko [məɫʌˈko] 'milk') and then saw an English word (e.g., hammer, which translates into Russian as molotok [məɫʌˈtok]). The participants had to decide whether the English word was a good translation of the Russian one. Cook et al. manipulated the phone edit distance between the true translation and the competitor word, for example between [məɫʌˈko] and [məɫʌˈtok]. They found that non-native speakers made more mistakes than native speakers, and that increasing the phone edit distance between the target words decreased the size of this effect. They explain the effect by ambiguous ('fuzzy') non-native phonolexical representations. It is unclear, however, whether lexical phonology is necessary to explain the observed effect.
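The phone edit distance manipulated in this design is a standard Levenshtein distance over phone sequences. The sketch below computes it for simplified ASCII stand-ins for the transcriptions above; the segmentations are our own illustrative choices, not the coding used for the original stimuli.

```python
# Phone-level Levenshtein distance: the minimum number of phone insertions,
# deletions and substitutions needed to turn one transcription into another.
def phone_edit_distance(p, q):
    d = [[0] * (len(q) + 1) for _ in range(len(p) + 1)]
    for i in range(len(p) + 1):
        d[i][0] = i
    for j in range(len(q) + 1):
        d[0][j] = j
    for i in range(1, len(p) + 1):
        for j in range(1, len(q) + 1):
            cost = 0 if p[i - 1] == q[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(p)][len(q)]


moloko = ["m", "e", "l", "a", "k", "o"]        # stand-in for [məɫʌˈko]
molotok = ["m", "e", "l", "a", "t", "o", "k"]  # stand-in for [məɫʌˈtok]
print(phone_edit_distance(moloko, molotok))    # 2 with these segmentations
```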
To answer this question, we test whether our model, which has no access to phonology, can correctly predict the described effect. Clearly, our model does not know anything about word meanings and cannot be tested on a translation task. Instead, we use a series of ABX discrimination tasks to test whether the lower performance of non-native speakers can be explained in terms of the acoustic embeddings of individual word tokens in the model's representation space. Recall that the model has no access to phonology, and its acoustic embeddings result from phonetic learning alone.

Figure 4: Average error rate of participants in the Russian translation judgment task of Cook et al. (2016), depending on their amount of exposure to Russian (native, high, or low) and the edit distance between the target and the competitor word, shown separately for high- and low-frequency competitor words. Error bars show the mean standard error over participants. Based on Table 3 in Cook et al. (2016).

Figure 4 shows the error rates of the human participants in Cook et al. (2016). The patterns that we focus on are: (1) a lower error rate is associated with higher proficiency (exposure to Russian) and higher edit distance; and (2) the difference between the proficiency groups is largest when the phone edit distance between the target and the competitor word is low. Cook et al. also looked at the effect of competitor word frequency (hence the two panels in Figure 4), but we do not consider this effect here.
We train the model on only Russian or only English data (for reference), and also on combinations of the two: one with a relatively high proportion of Russian (to simulate native Russian speakers with some knowledge of English) and three with progressively lower proportions of Russian (to simulate native English speakers with variable proficiency in Russian). Each model is trained five times with different random initializations.

For testing, we prepare four ABX discrimination tasks: in each task, the words A and X are of the same type, while the words B and X differ in a fixed number of phones (the phone edit distance), with one edit distance per task. We could not obtain the list of the original stimuli from Cook et al. (2016), and therefore sample ABX triplets from our test corpus subset. Following the original experiment, we only consider words within a restricted range of lengths in phones. Furthermore, we only consider triplets in which all pairwise ratios of the absolute durations of the words stay within a fixed factor (we know from previous work that this model is sensitive to the absolute duration of the test stimuli) and in which B and X are not morphological forms of the same word (such stimuli could not have been used in the original translation judgment task by design). We then sampled the same number of triplets per task, except for one edit distance, for which fewer triplets were available.

As in the original experiment, we first look at the error rates within each model: the error rates are expected to decrease with greater edit distance. Second, we compare the bilingual models to each other: to match the findings from the human study, the models with less exposure to Russian must show a higher ABX discrimination error than the model with more exposure. As a sanity check, we also consider the monolingual Russian and English models.

Figure 5: Average ABX discrimination error rate depending on the amount of the model's exposure to Russian and the edit distance between the words in the ABX triplets. Error bars show the mean standard error over random model initializations.

Figure 5 shows the models' ABX error rates across tasks and training conditions. We first observe that all lines have a negative slope: all six models show lower error in the tasks with greater edit distance between the words, which is the expected pattern. Note that comparing the absolute values to the results of the human participants (Figure 4) is not necessarily meaningful because of the task difference: the participants of Cook et al. (2016) had to compare an acoustic word in Russian to the translation of an English word they saw, whereas our model directly compares two Russian acoustic words embedded in its representation space.

Second, we see in Figure 5 that the models with less exposure to Russian have higher error rates. This is especially evident for the data points with the smallest edit distance, whereas the difference across the models gets smaller with greater edit distances. Again, this is the expected pattern. A mixed-effects regression fitted to the models' error rates suggests that there are significant effects of (1) the amount of Russian language exposure (higher exposure is associated with lower error) and (2) the edit distance (higher edit distance is associated with lower error). A similar pattern is observed when we consider only the bilingual models, for a better analogy with the original experiment: the model with the most exposure to Russian shows significantly lower error rates than two of the three models with less exposure (but not the third), and the error rates again generally decrease as the edit distance increases.

Recall that in each discrimination task, we considered pairs of words with a certain edit distance.
Most edit operations involve Russian phone contrasts that are also linguistically meaningful in English (e.g., [v]–[s]). However, a small number of Russian phone contrasts involve sounds that are allophones of the same phoneme in English (e.g., [d]–[dʲ]). If our data included a substantial number of contrasts of this type, the models' higher error rates could be attributed to these difficult phone contrasts. To ensure that this was not the case, we examined the contrasts in our pairs with the smallest edit distance and found that only one of the contrasts present in that data ([ɫ]–[lʲ]) was not phonemic in English; excluding the corresponding test pairs from the analysis had only a minor impact on the absolute error rates and did not change the reported patterns.

To summarize, our model correctly predicts the direction of the two main effects found in the translation judgment task of Cook et al. (2016). This suggests that their result can at least partially be explained in terms of comparing two acoustic instances: the word a participant hears and the translation of the word that they see. This presents an alternative explanation of non-native speakers' difficulties with spoken word processing in terms of phonetic perception, which does not involve phonology.

Most studies on bilingual lexical access advocate its non-selective nature: speakers activate words in both languages in parallel, including in spoken word processing (e.g., Weber and Cutler, 2004; Lagrou et al., 2011; Shook and Marian, 2012). Ideally, our model should show a similar pattern and not completely separate the two languages in its representation space.
To examine this, we run a language classification task, similar to Kamper et al. (2020a). We are interested in whether the model can identify the language of a given word based on its acoustic embedding.

Using the bilingual models from Simulation 3, we embed a fixed number of words per language. We then train a logistic regression classifier on most of this data to predict the language of a given word from its acoustic embedding, and test the classifier on the remaining words. The higher the accuracy of the classifier, the more linearly separable the two languages (specifically, their lexicons) are in our model's representations.

Table 2: Accuracy (in %) of logistic regression classifiers predicting the language identity of a given word from its acoustic embedding, for each Russian–English training ratio, averaged over five random initializations of each model.

The results (Table 2) show that all models reach an accuracy much higher than chance, although no model reaches perfect accuracy. This means that the lexical representations of words in the two languages (acoustic word embeddings) in our bilingual models are not fully linearly separable, indicating a substantial overlap between the two languages. Because some of the representations from the two languages are close to each other in the embedding space, the model may confuse them, similar to the non-selective lexical access in bilingual speakers.
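This analysis can be sketched as follows, assuming the acoustic word embeddings and their language labels are available as arrays; the 80/20 train/test split and the scikit-learn classifier settings are our assumptions rather than the exact setup.

```python
# Sketch of the language classification analysis: a linear classifier over
# acoustic word embeddings; high held-out accuracy would indicate that the
# two languages are (nearly) linearly separable in the embedding space.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def language_separability(embeddings, languages, test_size=0.2, seed=0):
    """Held-out accuracy of predicting the language of each embedded word."""
    x_train, x_test, y_train, y_test = train_test_split(
        embeddings, languages, test_size=test_size, random_state=seed,
        stratify=languages)
    clf = LogisticRegression(max_iter=1000).fit(x_train, y_train)
    return clf.score(x_test, y_test)


# Toy usage with random vectors standing in for Russian and English embeddings.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0.0, 1.0, size=(500, 130)),
                 rng.normal(0.5, 1.0, size=(500, 130))])
lang = np.array(["rus"] * 500 + ["eng"] * 500)
print(language_separability(emb, lang))
```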
Discussion

We started by asking whether some of the difficulties in non-native spoken word processing can be explained at the level of phonetic perception, without involving phonolexical representations. To address this question, we presented a case study (Simulation 3) with a computational model that learns from unsegmented speech data and has no access to phonology. Our model showed patterns similar to those found by Cook et al. (2016) in human speakers. This suggests that their results can be at least partly explained by phonetic learning. While we cannot estimate the relative contribution of the two factors (non-native phonetic perception vs. imprecise phonolexical representations) to the behavior of non-native speakers in the experiment of Cook et al., we argue that both factors need to be considered as possible explanations of the spoken word processing difficulties in non-native speakers. Note, however, that this result does not tell us whether the phonetic or the phonolexical explanation is more parsimonious; this question should be addressed in the future.

One could interpret our main result differently: that our model has, in fact, succeeded in learning phonological systems from speech data and cannot be considered a purely phonetic model. Indeed, we know that deep neural networks can learn to encode various types of linguistic structure without explicit supervision (e.g., Manning et al., 2020; Linzen and Baroni, 2021). In particular, speech models can achieve high accuracy in phone discrimination (Alishahi et al., 2017) and classification (Chung et al., 2019), a finding sometimes interpreted as successful acquisition of phonetic/phonological categories. While our model can also discriminate at least some phone contrasts (Simulation 1), this does not necessarily mean that it learns phonetic categories (see Schatz et al., 2019, for a relevant discussion). More importantly, what our model does not do is store explicit phonolexical representations in its memory, whereas the (imprecise) storage of word forms is one of the key premises of the phonolexical account explaining non-native speakers' difficulties in spoken word processing (Cook et al., 2016). Therefore, we conclude that our results highlight the effects of phonetic perception on non-native word processing.

In Simulations 1 and 2, we showed that our model trained simultaneously on two languages could correctly predict some phone- and word-level discrimination effects in infants and adults (Kuhl et al., 2003; MacKain et al., 1981). This extends previous results on phone discrimination with a monolingual model (Matusevych et al., 2020b) to word discrimination and to bilingual speakers. In addition, our analysis of the model's representations indicates a substantial overlap between the lexicons of the two languages, mimicking non-selective lexical access in bilingual speakers (e.g., Lagrou et al., 2011). Altogether, this suggests that the CAE-RNN can be used as a tool to study not only native and non-native phonetic learning, but also native and non-native spoken word processing, including in bilingual speakers.

Our model helps to tease apart the potential impact of phonetic learning from other effects on spoken word processing. At the same time, it is not a cognitive model of the human mental lexicon, for example because it is devoid of semantics. A method to learn acoustic and semantic embeddings in parallel has been proposed in speech engineering (Chen et al., 2018), and future research could shed light on whether this method can be used for studying the human mental lexicon.
Acknowledgments
This work is based on research supported in part by an ESRC-SBE award ES/R006660/1, a JSMF Scholar Award 220020374, and an NSF award BCS-1734245. We thank the anonymous reviewers, as well as Kate McCurdy, Seraphina Goldfarb-Tarrant and other members of the AGORA reading group at the University of Edinburgh for their helpful feedback.
References
Afra Alishahi, Marie Barking, and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In Proceedings of CoNLL, pages 368–378.
Mark Amengual. 2016. The perception of language-specific phonetic categories does not guarantee accurate phonological representations in the lexicon of early bilinguals. Applied Psycholinguistics, 37:1221–1251.
Ann R. Bradlow, David B. Pisoni, Reiko Akahane-Yamada, and Yoh'ichi Tohkura. 1997. Training Japanese listeners to identify English /r/ and /l/: IV. Some effects of perceptual learning on speech production. The Journal of the Acoustical Society of America, 101:2299–2310.
Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. 2017. AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline. In Proceedings of O-COCOSDA, pages 58–62.
Yi-Chen Chen, Sung-Feng Huang, Chia-Hao Shen, Hung-Yi Lee, and Lin-Shan Lee. 2018. Phonetic-and-semantic embedding of spoken words with applications in spoken content retrieval. In Proceedings of IEEE SLT Workshop, pages 941–948.
Yu-An Chung, Wei-Ning Hsu, Hao Tang, and James R. Glass. 2019. An unsupervised autoregressive model for speech representation learning. In Proceedings of Interspeech, pages 146–150.
Yu-An Chung, Chao-Chung Wu, Chia-Hao Shen, Hung-Yi Lee, and Lin-Shan Lee. 2016. Unsupervised learning of audio segment representations using sequence-to-sequence recurrent neural networks. In Proceedings of Interspeech, pages 765–769.
Svetlana V. Cook and Kira Gor. 2015. Lexical access in L2: Representational deficit or processing constraint? The Mental Lexicon, 10:247–270.
Svetlana V. Cook, Nick B. Pandža, Alia K. Lancaster, and Kira Gor. 2016. Fuzzy nonnative phonolexical representations lead to fuzzy form-to-meaning mappings. Frontiers in Psychology, 7:1345.
Anne Cutler and Takashi Otake. 2004. Pseudo-homophony in non-native listening. Paper presented at the 174th meeting of the Acoustical Society of America.
Isabelle Darcy, Danielle Daidone, and Chisato Kojima. 2013. Asymmetric lexical access and fuzzy lexical representations in second language learners. The Mental Lexicon, 8:372–420.
Isabelle Darcy, Laurent Dekydtspotter, Rex A. Sprouse, Justin Glover, Christiane Kaden, Michael McGuire, and John H. G. Scott. 2012. Direct mapping of acoustics to phonology: On the lexical encoding of front rounded vowels in L1 English–L2 French acquisition. Second Language Research, 28:5–40.
Ewan Dunbar, Xuan-Nga Cao, Juan Benjumea, Julien Karadayi, Mathieu Bernard, Laurent Besacier, Xavier Anguera, and Emmanuel Dupoux. 2017. The Zero Resource Speech Challenge 2017. In Proceedings of ASRU Workshop, pages 323–330.
Ewan Dunbar, Julien Karadayi, Mathieu Bernard, Xuan-Nga Cao, Robin Algayres, Lucas Ondel, Laurent Besacier, Sakriani Sakti, and Emmanuel Dupoux. 2020. The Zero Resource Speech Challenge 2020: Discovering discrete subword and word units. In Proceedings of Interspeech, pages 4831–4835.
Kira Gor and Svetlana V. Cook. 2020. A mare in a pub? Nonnative facilitation in phonological priming. Second Language Research, 36:123–140.
Sarah Hawkins. 2003. Roles and representations of systematic fine phonetic detail in speech understanding. Journal of Phonetics, 31:373–405.
Paul Iverson, Valerie Hazan, and Kerry Bannister. 2005. Phonetic training with acoustic cue manipulations: A comparison of methods for teaching English /r/-/l/ to Japanese adults. The Journal of the Acoustical Society of America, 118:3267–3278.
Herman Kamper. 2019. Truly unsupervised acoustic word embeddings using weak top-down constraints in encoder-decoder models. In Proceedings of ICASSP, pages 6535–6539.
Herman Kamper, Yevgen Matusevych, and Sharon Goldwater. 2020a. Improved acoustic word embeddings for zero-resource languages using multilingual transfer. arXiv:2006.02295.
Herman Kamper, Yevgen Matusevych, and Sharon Goldwater. 2020b. Multilingual acoustic word embedding models for processing zero-resource languages. In Proceedings of ICASSP, pages 6414–6418.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR.
Patricia K. Kuhl, Feng-Ming Tsao, and Huei-Mei Liu. 2003. Foreign-language experience in infancy: Effects of short-term exposure and social interaction on phonetic learning. Proceedings of the National Academy of Sciences, 100:9096–9101.
Evelyne Lagrou, Robert J. Hartsuiker, and Wouter Duyck. 2011. Knowledge of a second language influences auditory word recognition in the native language. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37:952–965.
Tal Linzen and Marco Baroni. 2021. Syntactic structure from deep learning. Annual Review of Linguistics, 7:195–212.
John S. Logan, Scott E. Lively, and David B. Pisoni. 1991. Training Japanese listeners to identify English /r/ and /l/: A first report. The Journal of the Acoustical Society of America, 89:874–886.
Kristine S. MacKain, Catherine T. Best, and Winifred Strange. 1981. Categorical perception of English /r/ and /l/ by Japanese bilinguals. Applied Psycholinguistics, 2:369–390.
Kikuo Maekawa. 2003. Corpus of Spontaneous Japanese: Its design and evaluation. In Proceedings of SSPR Workshop, pages 7–12.
Christopher D. Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy. 2020. Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences, pages 30046–30054.
Andrew Martin, Thomas Schatz, Maarten Versteegh, Kouki Miyazawa, Reiko Mazuka, Emmanuel Dupoux, and Alejandrina Cristia. 2015. Mothers speak less clearly to infants than to adults: A comprehensive test of the hyperarticulation hypothesis. Psychological Science, 26:341–347.
Yevgen Matusevych, Herman Kamper, and Sharon Goldwater. 2020a. Analyzing autoencoder-based acoustic word embeddings. Paper presented at Bridging AI and Cognitive Science (ICLR 2020).
Yevgen Matusevych, Thomas Schatz, Herman Kamper, Naomi H. Feldman, and Sharon Goldwater. 2020b. Evaluating computational models of infant phonetic learning across languages. In Proceedings of CogSci, pages 571–577.
James M. McQueen, Anne Cutler, and Dennis Norris. 2006. Phonological abstraction in the mental lexicon. Cognitive Science, 30:1113–1126.
Juliette Millet, Nika Jurov, and Ewan Dunbar. 2019. Comparing unsupervised speech learning directly to human performance in speech perception. In Proceedings of CogSci, pages 2358–2364.
Kuniko Miyawaki, James J. Jenkins, Winifred Strange, Alvin M. Liberman, Robert Verbrugge, and Osamu Fujimura. 1975. An effect of linguistic experience: The discrimination of [r] and [l] by native speakers of Japanese and English. Perception & Psychophysics, 18:331–340.
Christophe Pallier, Angels Colomé, and Núria Sebastián-Gallés. 2001. The influence of native-language phonology on lexical access: Exemplar-based versus abstract lexical entries. Psychological Science, 12:445–449.
Douglas B. Paul and Janet M. Baker. 1992. The design for the Wall Street Journal-based CSR corpus. In Proceedings of the Workshop on Speech and Natural Language, pages 357–362.
Janet B. Pierrehumbert. 2002. Word-specific phonetics. Laboratory Phonology, 7:101–140.
Mark A. Pitt, Keith Johnson, Elizabeth Hume, Scott Kiesling, and William Raymond. 2005. The Buckeye corpus of conversational speech: Labeling conventions and a test of transcriber reliability. Speech Communication, 45:89–95.
Robert Port. 2007. How are words stored in memory? Beyond phones and phonemes. New Ideas in Psychology, 25:143–170.
Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovský, Georg Stemmer, and Karel Veselý. 2011. The Kaldi speech recognition toolkit. In Proceedings of ASRU Workshop.
Thomas Schatz, Naomi H. Feldman, Sharon Goldwater, Xuan-Nga Cao, and Emmanuel Dupoux. 2019. Early phonetic learning without phonetic categories: Insights from large-scale simulations on realistic input. PsyArXiv.
Thomas Schatz, Vijayaditya Peddinti, Francis Bach, Aren Jansen, Hynek Hermansky, and Emmanuel Dupoux. 2013. Evaluating speech features with the minimal-pair ABX task: Analysis of the classical MFC/PLP pipeline. In Proceedings of Interspeech, pages 1781–1785.
Tanja Schultz. 2002. GlobalPhone: A multilingual speech and text database developed at Karlsruhe University. In Proceedings of ICSLP, pages 345–348.
Anthony Shook and Viorica Marian. 2012. Bimodal bilinguals co-activate both languages during spoken comprehension. Cognition, 124:314–324.
Winifred Strange and Sibylla Dittmann. 1984. Effects of discrimination training on the perception of /r-l/ by Japanese adults learning English. Perception & Psychophysics, 36:131–145.
Gretchen Sunderman and Judith F. Kroll. 2006. First language activation during second language lexical processing: An investigation of lexical form, meaning, and grammatical class. Studies in Second Language Acquisition, pages 387–422.
Feng-Ming Tsao, Huei-Mei Liu, and Patricia K. Kuhl. 2006. Perception of native and non-native affricate-fricative contrasts: Cross-language tests on adults and infants. The Journal of the Acoustical Society of America, 120:2285–2294.
Maarten Versteegh, Roland Thiolliere, Thomas Schatz, Xuan-Nga Cao, Xavier Anguera, Aren Jansen, and Emmanuel Dupoux. 2015. The Zero Resource Speech Challenge 2015. In Proceedings of Interspeech, pages 3169–3173.
Andrea Weber and Anne Cutler. 2004. Lexical competition in non-native spoken-word recognition. Journal of Memory and Language, 50:1–25.
Janet F. Werker and Richard C. Tees. 1984. Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7:49–63.