CiwGAN and fiwGAN: Encoding information in acoustic data to model lexical learning with Generative Adversarial Networks
Gašper Beguš
Department of Linguistics, University of Washington, Guggenheim Hall 415H, Box 352425, Seattle, WA 98195
Abstract
How can deep neural networks encode information that corresponds to words in human speech into raw acoustic data? This paper proposes two neural network architectures for modeling unsupervised lexical learning from raw acoustic inputs, ciwGAN (Categorical InfoWaveGAN) and fiwGAN (Featural InfoWaveGAN), that combine a Deep Convolutional GAN architecture for audio data (WaveGAN; Donahue et al. 2019) with an information-theoretic extension of GAN – InfoGAN (Chen et al., 2016), and propose a new latent space structure that can model featural learning simultaneously with a higher-level classification. In addition to the Generator and the Discriminator networks, the architectures introduce a network that learns to retrieve latent codes from generated audio outputs. Lexical learning is thus modeled as emergent from an architecture that forces a deep neural network to output data such that unique information is retrievable from its acoustic outputs. The networks trained on lexical items from TIMIT learn to encode unique information corresponding to lexical items in the form of categorical variables in their latent space. By manipulating these variables, the network outputs specific lexical items. The network occasionally outputs innovative lexical items that violate training data, but are linguistically interpretable and highly informative for cognitive modeling and neural network interpretability. Innovative outputs suggest that phonetic and phonological representations learned by the network can be productively recombined and directly paralleled to productivity in human speech: a fiwGAN network trained on suit and dark outputs innovative start, even though it never saw start or even a [st] sequence in the training data. We also argue that setting latent featural codes to values well beyond the training range results in almost categorical generation of prototypical lexical items and reveals underlying values of each latent code. Probing deep neural networks trained on well-understood dependencies in speech bears implications for latent space interpretability, for understanding how deep neural networks learn meaningful representations, as well as for a potential for unsupervised text-to-speech generation in the GAN framework.
Keywords: artificial intelligence, generative adversarial networks, speech, lexical learning, neural network interpretability, text-to-speech
1. Introduction
How human language learners encode information in their speech is among the core questions in linguistics and computational cognitive science. Acoustic speech data is the primary source of linguistic input for hearing infants, and first language learners must learn to retrieve information from raw acoustic data. By the time language acquisition is complete, learners are able to not only
Email address: [email protected] (Gašper Beguš)
analyze, but also produce speech consisting of words (or lexical items henceforth) that carry meaning (Saffran et al., 1996, 2007; Kuhl, 2010). In other words, speakers learn to encode information in their acoustic output, and they do so by associating meaning-bearing units of speech (lexical items) with unique information. Lexical items in turn consist of units called phonemes that represent individual sounds. In fact, speakers not only produce lexical items that exist in their primary linguistic data, but are also able to generate new lexical items that consist of novel combinations of phonemes that conform to the phonotactic rules of their language. This points to one of the core properties of language: productivity (Hockett, 1959; Piantadosi and Fedorenko, 2017; Baroni, 2020).

Computational approaches to lexical learning have a long history. Modeling lexical learning can take many forms (for a comprehensive overview, see Räsänen 2012), but the shift towards modeling lexical learning from acoustic data, especially from raw unreduced acoustic data, has occurred relatively recently (Lee et al. 2015; Shafaei-Bajestan and Baayen 2018; Baayen et al. 2019, i.a.). Previously, the majority of models operated on either fully abstracted or already simplified features extracted from raw acoustic data. A variety of models have been proposed for this task including, among others, Bayesian and connectionist approaches (see, among others, Goldwater et al. 2009; Feldman et al. 2009; Räsänen 2012; Heymann et al. 2013; Lee and Glass 2012; Elsner et al. 2013; Feldman et al. 2013; Lee et al. 2015; Arnold et al. 2017; Kamper et al. 2017; Shafaei-Bajestan and Baayen 2018; Baayen et al. 2019; Chuang et al. 2020).

As summarized in Lee et al. (2015), existing models of lexical learning that take some form of acoustic data as input can be divided into “spoken term discovery” models and “models of word segmentation” (Lee et al., 2015, 390). Proposals of the first approach most commonly involve clustering of similarities in acoustic data to establish a set of phonetic units from which lexical items are then established, again based on clustering. The word segmentation models, on the other hand, “start from unsegmented strings of symbols and attempt to identify subsequences corresponding to lexical items” (Lee et al., 2015, 390).

Deep neural network models operating on acoustic data have recently been used to model phonetic, but not phonological or lexical acquisition. Several prominent autoencoder models have recently been proposed which are trained to represent data in a lower-dimensionality space (Räsänen et al., 2016; Eloff et al., 2019; Shain and Elsner, 2019). Clustering analyses of the reduced space in these autoencoder models suggest that the networks learn approximates to phonetic features. The disadvantage of the autoencoder architecture is that outputs simply reproduce inputs as closely as possible: the network's outputs are directly connected to its inputs, which is not an ideal setting for language acquisition. Current proposals in the autoencoder framework do not model phonological processes, and there is only an indirect relationship between phonetic properties and latent space.

Language acquisition has, to the author's knowledge, not been modeled with the GAN architecture prior to Beguš (2020), despite several aspects of the architecture that can be paralleled to language acquisition.
Beguš (2020) proposes that phonetic and phonological learning can simultaneously be modeled as a dependency between latent space and output data in Deep Convolutional Generative Adversarial Networks (Goodfellow et al., 2014; Radford et al., 2015; Donahue et al., 2019). Unlike in the autoencoder architectures, the outputs of the GAN models are innovative, not directly connected to the inputs, and violate training data distributions in highly informative ways.

Despite their several advantages, to our knowledge, lexical learning has not yet been modeled with unsupervised generative deep convolutional neural network models. In this paper, we follow the proposal in Beguš (2020) that phonetic and phonological acquisition can be modeled as a dependency between latent space and generated data in the GAN architecture and add a lexical learning component to the model. We modify the WaveGAN architecture and add the InfoGAN's Q-network (based partially on the implementation in Rodionov 2018) to computationally simulate lexical learning from raw acoustic data. We introduce a deep convolutional network that learns to retrieve the Generator's latent code and propose a new latent space structure that can model featural learning (fiwGAN). We train the networks on highly variable training data: lexical items from the TIMIT database (Garofolo et al., 1993), which includes over 600 speakers from different dialectal backgrounds in American English. We present three computational experiments: on five lexical items in the ciwGAN architecture (Section 4.1), on ten lexical items in the ciwGAN architecture (Section 4.2), and on eight lexical items in the fiwGAN architecture (Section 4.3). Evidence for lexical learning emerges in all three experiments. The paper also features a section describing how to directly follow learning strategies of the Generator network (Section 4.1.2), a section on featural learning that discusses innovative outputs and productivity of the model (Section 4.4), and a section that proposes a technique for retrieving underlying representations of the latent variables in GANs (Section 4.5). We argue that exploration of innovative outputs and the latent space of deep neural networks trained on dependencies in speech data that are well understood due to extensive study of human phonetics and phonology in the past decades provides unique insights both for cognitive modeling and for neural network interpretability.

The basic principle in our proposal crucially differs from the existing computational treatments of lexical learning that employ clustering or classification of phonetic similarities from which a lexical inventory is established. The model proposed here does not primarily classify or cluster acoustic similarities into lexical items. Instead, the model is fully generative, in that a deep convolutional network generates innovative outputs and encodes unique information in its outputs. Lexical learning is modeled in the following way: a deep convolutional network learns to retrieve information from innovative outputs generated by a separate Generator network. The Generator network thus learns to generate data such that unique lexical information is retrievable from its acoustic outputs. Lexical learning is not per se incorporated in the model: instead, lexical learning emerges because the most informative way to generate outputs given speech data as input is to encode unique information into lexical items.
The end result of the model is a Generator network that generates innovative data — raw acoustic outputs — such that each lexical item is represented by a unique code. Because the model diverges substantially from existing proposals of lexical learning, we will not compare its performance to the existing models — it would be difficult to find a metric to compare a fully generative model that outputs raw acoustic data. Instead, we propose to evaluate the model's performance in lexical learning with an inferential statistical technique — multinomial logistic regression (Section 4.1).

The model of lexical learning proposed here features some desired properties. First, the network is trained exclusively on raw unannotated acoustic data. Second, lexical learning emerges from the requirement on a deep convolutional network to output informative data. Only because associating a unique code in the latent space with lexical items is the optimal way to encode information such that another network will be able to retrieve it, does lexical learning emerge. Third, the model is fully generative: a deep convolutional network (the Generator) generates raw acoustic outputs that correspond to lexical items in the training data. Crucially, the Generator network in the model does not simply replicate training data, but generates innovative outputs, because its main task is to increase the error rate of the network that distinguishes real from generated data (the Discriminator) and its outputs are not directly connected to the training data. Occasionally, the Generator outputs innovative data that violate distributions of the training data, but are linguistically interpretable and highly informative. The model thus features one of the basic properties of language: productivity. This allows us to compare lexical and phonological acquisition in language-acquiring children to the innovative generated data in the proposed computational model. The fiwGAN architecture has an additional advantage: it can model featural learning in addition to a higher-level classification. This means that featural representations in phonology and phonetics can be modeled simultaneously with lexical learning. To be sure, there are also undesired aspects of the model: one of the main undesired aspects of the current model is that the number of lexical classes that the network learns needs to be predetermined: the model requires a prior number of unique categories into which to encode lexical information. Also, while the model learns from raw acoustic inputs, the individual lexical items in the training data are sliced from the corpus (sliced at the lexical level rather than at the phone level) instead of inferred by the model. These disadvantages are not insurmountable, but are left to be addressed in future work.

The proposed architectures and results of the computational experiments have implications for deep neural network interpretability as well as some basic implications for NLP applications. Besides modeling lexical learning, the novel latent space structure in the fiwGAN architecture can be employed as a general-purpose unsupervised simultaneous feature extractor and classifier for audio data. We also propose a technique for exploring latent space representations: we argue that manipulating latent codes to marginal values that substantially exceed the training range reveals underlying values for each latent code.
Outputs generated with the proposed technique feature little variability and have the potential to reveal learning representations of the Generator network. The proposed model also allows a first step towards unsupervised text-to-speech synthesis at the lexical level using GANs: the Generator outputs specific lexical items when latent codes are set to different values.
2. Background
The main characteristics of the GAN architecture (Goodfellow et al., 2014) are two networks: the Generator and the Discriminator. The Generator generates data from latent space that is reduced in dimensionality (e.g. from a set of uniformly distributed variables z). The Discriminator network learns to distinguish between “real” training data and generated outputs from the Generator network. The Generator is trained to maximize the Discriminator's error rate; the Discriminator is trained to minimize its own error rate. In the DCGAN proposal (Radford et al., 2015), the two networks are deep convolutional networks. Recently, the DCGAN proposal was transformed to model audio data in WaveGAN (Donahue et al., 2019). The main architecture of WaveGAN is identical to that of DCGAN (Radford et al., 2015), with the main difference being that the Generator outputs a one-dimensional vector corresponding to time-series data (raw acoustic output) and the Discriminator takes one-dimensional acoustic data as its input (as opposed to two-dimensional visual data in DCGAN). WaveGAN also adopts the Wasserstein GAN proposal for a cost function in GANs that improves training (Arjovsky et al., 2017). Instead of estimating the probability of whether the output is generated or real, WGAN estimates the Wasserstein distance between generated data and real data.

Learning is unsupervised in the GAN framework and the model results in a Generator that generates innovative outputs based on the principle of imitation, enforced by the Discriminator. Beguš (2020) models speech acquisition as a dependency between latent space and generated outputs in the GAN architecture. The paper proposes a technique for identifying latent variables that correspond to meaningful phonetic/phonological features in the output. The Generator network learns to encode phonetically and phonologically meaningful representations, such as the presence of a segment in the output, with a subset of variables, i.e. with a reduced representation. Using the technique proposed in Beguš (2020), we can identify individual variables that correspond to, for example, a sound [s] in the output. By manipulating these identified variables to values that are beyond the training range, we can force [s] in the output. Interpolating the values has an almost linear effect on the amplitude of frication noise of [s] in the output.

One of the advantages of the proposal in Beguš (2020) is that the model learns phonological alternations, i.e. context-dependent changes in the realization of speech sounds, simultaneously with learning of acoustic properties of human speech. The WaveGAN model is trained on a simple phonological process: aspiration of stops /p, t, k/ conditioned on the presence of [s] in the input. English voiceless stops /p, t, k/ are aspirated (produced with a puff of air [pʰ, tʰ, kʰ]) word-initially before a stressed vowel (e.g. in pit [ˈpʰɪt]) except if an [s] precedes the stop (e.g. spit [ˈspɪt]). A computational experiment suggests that the network learns this distribution, but imperfectly so. The network learns to output shorter aspiration duration when [s] is present, in line with distributions in the training data. Outputs, however, also violate the data in a manner that can be directly paralleled to language acquisition. Occasionally, the Generator network outputs aspiration durations that are longer in the [s] condition than in any example in the training data: the Generator outputs [spʰɪt], which violates the phonological rule in English.
In other words, the network violates the distributions in the training data, and these violations correspond directly to phonological acquisition stages: children acquiring English start with significantly longer aspiration durations in the [s]-condition, e.g. [spʰɪt] (Bond and Wilson, 1980).

In sum, GANs have been shown to represent phonetically or phonologically meaningful information in the latent space that has an approximate equivalent in phonetic/phonological representations and language acquisition. The latent variables that correspond to features can be actively manipulated to generate data with or without some phonetic/phonological properties. These representations, however, are limited to the phonetic/phonological level exclusively in Beguš (2020) and contain no lexical information.

We propose a GAN architecture that combines WaveGAN with the InfoGAN proposal (Chen et al., 2016). The InfoGAN proposal introduces the latent code and a Q-network to the GAN architecture. The Q-network usually shares convolutions with the Discriminator, but instead of estimating the “realness” of the generated and real inputs, it estimates the latent code that the Generator takes as an input. The weights of both the Q-network and the Generator network are updated based on the Q-network's loss function. This forces the Generator network to output data such that the Q-network is successful in retrieving its latent code.

Representing semantic information can take many forms in computational models. The current proposal is a model of lexical learning, which is why unique lexical items are represented with either a one-hot vector in the ciwGAN architecture or a binary code in the fiwGAN architecture. In other words, the objective of the model is to associate each unique lexical item in the training data with a unique representation. For example, in a corpus with four words, word1 can be associated with a representation [1, 0, 0, 0], word2 with [0, 1, 0, 0], and word3 with [0, 0, 1, 0] in the ciwGAN architecture. In the fiwGAN architecture, word1 can be associated with [0, 0], word2 with [0, 1], word3 with [1, 0], et cetera.
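The two coding schemes can be made concrete with a short sketch. The snippet below is illustrative only (it is not the authors' released code, and the word list is hypothetical); it simply shows how a one-hot vector (as in ciwGAN) and a binary featural code (as in fiwGAN) can be assigned to a small set of lexical items.

```python
import numpy as np

words = ["oily", "rag", "suit", "water"]

# ciwGAN-style coding: one categorical variable per word (one-hot vector c).
one_hot = {w: np.eye(len(words), dtype=np.float32)[i] for i, w in enumerate(words)}

# fiwGAN-style coding: ceil(log2(n)) binary features (phi) suffice for n words.
n_features = int(np.ceil(np.log2(len(words))))
binary = {
    w: np.array([(i >> b) & 1 for b in range(n_features)], dtype=np.float32)
    for i, w in enumerate(words)
}

print(one_hot["water"])  # e.g. [0. 0. 0. 1.]
print(binary["water"])   # e.g. [1. 1.]
```

The point of the featural coding is that the same number of words can be represented with far fewer latent variables, which is what allows fiwGAN to treat each variable as a feature.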
3. Model
The proposed ciwGAN and fiwGAN architectures involve three deep convolutional networks: the Generator, the Discriminator, and the Q-network (or the lexical learner). The models are based on WaveGAN (Donahue et al., 2019), an implementation of the DCGAN architecture (Radford et al., 2015) for audio data, and the InfoGAN proposal (Chen et al., 2016). Unlike in most InfoGAN implementations, the Q-network is a separate deep convolutional network. (Barry and Kim (2019) model piano audio with InfoWaveGAN in a recent presentation. Their proposal, however, focuses on continuous variables and features only one categorical latent variable with no apparent function. It is unclear from the poster what the architecture of their proposal is. The InfoGAN model based on DCGAN in Rodionov (2018) also proposes the Q-network to be a separate network.)
Figure 1: Architecture of fiwGAN: green trapezoids represent deep convolutional neural networks; purple squares illustrate inputs to each of the three networks. The Generator network takes 3 latent features φ (constituting a binary code) and 97 latent variables z, uniformly distributed (z ∼ U(−1, 1)), and outputs 16,384 data points (x̂) that constitute approximately 1 s of audio (sampled at 16,000 Hz). The Discriminator takes generated data (x̂) and real data and estimates the Wasserstein distance between them. The Q-network (lexical learner) takes generated data as its input and outputs estimates of the unique feature values that the Generator uses for generation of each data point.

In the GAN architecture, the Generator network usually takes as its input a number of uniformly distributed latent variables (z ∼ U(−1, 1)). In the ciwGAN architecture, the latent space additionally includes categorical variables in the form of a one-hot vector (labeled c); but in the fiwGAN architecture, we introduce binary code as the categorical input (labeled φ). This new structure in the fiwGAN latent space allows the network to treat the binary variables as features, where each variable corresponds to one feature (φₙ). As a consequence, the two networks differ in how the Q-network is trained. In ciwGAN, the Q-network is trained on retrieving information from the Generator's output with a softmax function in its final layer. In fiwGAN, the categorical variables or features are binomially distributed and the Q-network is trained to retrieve information with a sigmoid function in the final layer accordingly. In sum, the Generator in our proposal takes two sets of variables as its input (latent space): (i) categorical variables (c or φ), which constitute a one-hot vector (ciwGAN) or a binary code (fiwGAN), and (ii) random variables z that are uniformly distributed (z ∼ U(−1, 1)). The code is available at github.com/gbegus/ciwgan-fiwgan.

The Generator network is a five-layer deep convolutional network (from WaveGAN; Donahue et al. 2019) that takes the input variables (referred to as the latent variables or the latent space) and outputs a 1D vector of 16,384 data points that constitute just over 1 second of acoustic audio output with a 16 kHz sampling rate. These generated outputs are fed to the Discriminator network and the Q-network. The Discriminator network takes raw audio as its input: both generated data and real data sliced at the lexical level from the TIMIT database (Garofolo et al., 1993). It is trained on estimating the Wasserstein distance between generated and real data distributions, according to Arjovsky et al. (2017). It outputs “realness” scores which estimate how far from the real data distribution an input is (Brownlee, 2019). The Generator's objective is to increase the error rate of the Discriminator: such that the Discriminator assigns a high “realness” score to its generated outputs.

To model lexical learning, we add the Q-network to the architecture (InfoGAN; Chen et al. 2016). Unlike in most InfoGAN implementations, the Q-network is in the proposed architecture independent of the Discriminator network. The Q-network is thus a separate network, but in its architecture identical to the Discriminator. It takes only generated outputs (G(z)) as its input in the form of 16,384 data points (approximately 1 s of audio data sampled at 16 kHz). The Q-network has 5 convolutional layers.
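To make the structure of the latent space concrete, the following minimal sketch (illustrative only, not the released implementation) assembles the Generator's 100-dimensional input exactly as described above: a categorical part (a one-hot c for ciwGAN or a binary φ for fiwGAN) concatenated with uniformly distributed variables z ∼ U(−1, 1).

```python
import numpy as np

def make_latent(code, total_dim=100, rng=np.random.default_rng(0)):
    """code: one-hot (ciwGAN) or binary (fiwGAN) vector; the rest is z ~ U(-1, 1)."""
    z = rng.uniform(-1.0, 1.0, size=total_dim - len(code)).astype(np.float32)
    return np.concatenate([np.asarray(code, dtype=np.float32), z])

latent_fiw = make_latent([0.0, 1.0, 1.0])   # 3 phi features + 97 z (fiwGAN)
latent_ciw = make_latent([0, 0, 1, 0, 0])   # 5-level one-hot c + 95 z (ciwGAN)
print(latent_fiw.shape, latent_ciw.shape)   # (100,) (100,)
```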
The only difference between the Discriminator and the Q-network is that the final layer in the Q-network includes n nodes, where n corresponds to the number of categorical variables (c in ciwGAN) or features (φ in fiwGAN) in the latent space. The Q-network is trained on estimating the categorical part of the latent space (c- or φ-values). Its output is thus a unique code that approximates the latent code in the Generator's latent space — either a one-hot vector or a binary code. The training objective of the Q-network is to approximate the unique latent code in the Generator's hidden input. At each evaluation, the weights of the Q-network as well as the Generator network are updated with cross-entropy according to the loss function of the Q-network. This forces the Generator to generate data such that the latent code or latent features (c or φ) will be retrievable: the Generator's objective is to maximize the success rate of the Q-network. The Generator and the Discriminator networks are trained with the Adam optimizer, whereas the Q-network is trained with the RMSProp algorithm (with the learning rate set at .0001 for all optimizers). The Generator and the Q-network are updated once per five updates of the Discriminator network.

To summarize the architecture, the Discriminator network learns to distinguish “realness” of generated speech samples. The Generator is trained to maximize the loss function of the Discriminator. The Q-network (or the lexical learner network) is trained on retrieving the categorical part of the latent code in the Generator's output based on only the Generator's acoustic outputs. Because the weights of the Q-network as well as the Generator are updated based on the Q-network's loss function, the Generator learns to associate lexical items with a unique latent code (one-hot vector or binary code), so that the Q-network can retrieve the code from the acoustic signal only. This learning that resembles lexical learning is unsupervised: the association between the code in the latent space and individual lexical items arises from training and is not pre-determined. In principle, the Generator could associate any acoustic property with the latent code, but it would be harder for the Q-network to retrieve the information if the Generator encoded some other distribution with its latent code. The association between a unique code value and the individual lexical item that the Generator outputs thus emerges from the training.

The result of the training in the architecture outlined in Figure 1 is a Generator network that outputs raw acoustic data that resemble real data from the TIMIT database, such that the Discriminator becomes unsuccessful in assigning “realness” scores (Brownlee, 2019). Crucially, unlike in other architectures, the Generator's outputs are never a full replication of the input: the Generator outputs innovative data that resemble input data, but also violate many of the distributions in a linguistically interpretable manner (Beguš, 2020). In addition to outputting innovative data that resemble speech in the input, the Generator also learns to associate each lexical item with a unique code in its latent space. This means that by setting the code to a certain value, the network should output a particular lexical item to the exclusion of other lexical items.

word     IPA        data points
oily     [ˈɔɪli]
rag      [ˈɹæɡ]
suit     [ˈsut]
water    [ˈwɔɾɚ]
year     [ˈjɪɹ]
Total               3205
Table 1: Five lexical items used for training in the five-word ciwGAN model with their corresponding IPA transcription (based on general American English) and counts of data points for each item.
There are two supervised aspects of the model. First, the number of classes needs to be predetermined and the number of lexical items in the training data needs to match the number of classes. For example, a one-hot vector in the ciwGAN architecture with 5 variables can categorize 5 lexical items. We feed the network with 5 different lexical items from the TIMIT database. In the fiwGAN architecture, n features (φ) can categorize 2ⁿ classes. For example, 3 features φ allow 2³ = 8 classes and we feed the network 8 different lexical items. Second, the network is trained on sliced lexical items and does not perform slicing in an unsupervised manner. Addressing these two disadvantages is left for future work.
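The training regime described in this section can be summarized in a schematic sketch. The code below is not the released implementation (which builds on WaveGAN in a deep learning framework); the three networks are replaced by stubs, and only the update schedule and the Q-network's cross-entropy objective are illustrated under the assumptions stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(latent):            # stub: would return ~1 s of generated audio
    return rng.normal(size=16384)

def discriminator(x):             # stub: would return a "realness" estimate
    return float(np.mean(x))

def q_network(x):                 # stub: would return estimated phi values
    return np.full(3, 0.5)

def sigmoid_xent(target, estimate, eps=1e-7):
    estimate = np.clip(estimate, eps, 1 - eps)
    return float(-np.mean(target * np.log(estimate) + (1 - target) * np.log(1 - estimate)))

for step in range(10):
    for _ in range(5):                               # 5 Discriminator updates...
        x_fake = generator(rng.uniform(-1, 1, 100))
        d_score = discriminator(x_fake)              # placeholder; the paper uses a WGAN objective
        # ...update Discriminator weights here (Adam in the paper)
    phi = rng.integers(0, 2, 3).astype(float)        # ...per one joint Generator/Q update
    latent = np.concatenate([phi, rng.uniform(-1, 1, 97)])
    q_loss = sigmoid_xent(phi, q_network(generator(latent)))
    # ...update BOTH the Generator and the Q-network from q_loss (RMSProp for Q in the paper)
```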
4. Experiments
The first model is trained on the ciwGAN architecture with 5 lexical items from TIMIT: oily, rag, suit, water, and year. The latent space of this network includes 5 categorical variables (c) constituting a five-level one-hot vector and 95 random variables z. The five lexical items were chosen based on frequency: they are chosen from the most frequent content words with at least 600 data points in TIMIT. A total of 3205 data points were used in training and each of the five items has > 600 data points in the training data. The input data are 16-bit .wav slices of lexical items (as annotated in TIMIT) sampled at a 16 kHz rate. Input lexical items with counts are given in Table 1.

Since we are primarily interested in a generative model of lexical learning, we test the model's performance on generated outputs. To test whether the Generator network learns to associate each lexical item with a unique code, outputs are generated from the ciwGAN model trained after 8011 steps (∼800 epochs) and after 19244 steps: the categorical variables (c) in the generated samples are manipulated not to 1 (as in the training stage), but rather to 2. The rest of the latent space (all z-variables) is sampled randomly, but kept constant across the five categorical variables. One hundred outputs are thus generated for each unique code (e.g. [2, 0, 0, 0, 0], [0, 2, 0, 0, 0], ...). We analyze outcomes at two points during the training: after 8011 steps (∼800 epochs) and after 19244 steps. All acoustic analyses are performed in Praat (Boersma and Weenink, 2015).

For the latent code [0, 0, 0, 0, 2], the Generator outputs rag in 98/100 cases. (Occasionally, a short vocalic element precedes the [ˈɹæɡ]. In the remaining two cases, the outputs include [ɹ] in the initial position, which is followed by a diphthong [aɪ] and a period of consonantal closure; one output was transcribed as right.) In other words, the Generator learns to associate [0, 0, 0, 0, 2] with rag. The Generator thus not only learns to generate speech-like outputs, it also represents distinct lexical items with a unique representation: information that can be retrieved from its outputs by the lexical learning network. We can argue that [0, 0, 0, 0, 2] is the underlying representation of rag.

To determine the underlying code for each lexical item, we use success rates (or estimates from the multinomial logistic regression model in Table 2 and Figure 4): the lexical item that is the most frequent output for a given latent code is assumed to be associated with that latent code (e.g. rag with [0, 0, 0, 0, 2]). Occasionally, a single lexical item is the most frequent output for two latent codes. As will be shown below, it is likely the case that this reflects imperfect learning where the underlying lexical item for a latent code is obscured by a more frequent output (perhaps the one that is easier to distinguish from the data). In this case, we associate such codes to the lexical item for which the given code outputs the highest proportion of that lexical item with respect to other latent codes. For example, [0, 0, 0, 2, 0] outputs water most frequently, with oily accounting for approximately a quarter of outputs. The assumed lexical item for [0, 0, 0, 2, 0] is oily, because the code that outputs water most frequently is [0, 0, 2, 0, 0], while the highest proportion of oily relative to other latent codes appears for [0, 0, 0, 2, 0]. Observing the progress of lexical learning provides additional evidence that oily is the underlying representation of [0, 0, 0, 2, 0]: as the training progresses, the network increases accuracy (see Section 4.1.2).

The success rate for the other four lexical items is lower than for rag, but the outputs that deviate from the expected values are highly informative. For c = [2, 0, 0, 0, 0], 72 outputs can be transcribed as suit. In seven additional cases, the network outputs data points that can be transcribed with a sibilant [s] (79 total). In these outputs, [s] is followed by a sequence that either cannot reliably be transcribed as suit or does not correspond to suit, but rather to year (transcribed as sear [sɪɹ]). The remaining 21 outputs do not include the word suit or a sibilant [s]. However, they are not randomly distributed across the other four lexical items either — they include the lexical item year or its close approximation.

An acoustic analysis of the training data reveals motivations for the innovative deviating outputs. As already mentioned, the network occasionally generates an innovative output, sear. The sources of this innovation are likely four cases in the training data in which [j] in year ([jɪɹ]) is realized as a post-alveolar fricative [ʃ], probably due to contextual influence (something that could be transcribed as shear [ʃɪɹ]). Figure 2 illustrates all four examples.
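The generation procedure used for these analyses (one-hot values set to 2, z held constant, 100 outputs per code) can be sketched as follows. The sketch is illustrative only: `generator` is a stub standing in for the trained Generator network.

```python
import numpy as np

def generator(latent):                    # stub for the trained Generator network
    return np.zeros(16384, dtype=np.float32)

rng = np.random.default_rng(0)
n_codes, n_z, n_samples = 5, 95, 100
z_fixed = rng.uniform(-1.0, 1.0, size=(n_samples, n_z))   # same z across all codes

outputs = {}
for k in range(n_codes):
    code = np.zeros(n_codes); code[k] = 2.0               # e.g. [2, 0, 0, 0, 0]
    latents = np.hstack([np.tile(code, (n_samples, 1)), z_fixed])
    outputs[k] = [generator(l) for l in latents]          # 100 audio outputs per code
```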
Figure 2: Waveforms (top), spectrograms (mid, from 0–8000 Hz), and 25 ms spectra (slices indicated in the spectrograms) (bottom) of four data points of the lexical item year with clear frication noise in the training data (from TIMIT) and the generated innovative output sear.

The innovative generated output sear differs from the four examples in the training data in one crucial aspect: the frication noise in the generated output is that of an alveolar [s] rather than that of a post-alveolar [ʃ]. Spectral analysis in Figure 2 clearly shows that the center of gravity in the generated output is substantially higher than in the training data (which is characteristic of the alveolar fricative [s]). The innovative sear output likely results from the fact that the training data contains four data points that pose a learning problem: shear, which features elements of suit and year. The innovative generated sear [sɪɹ] consequently features (i) frication noise that is approximately consistent with suit [sut] and (ii) formant structure consistent with year [jɪɹ]. It appears that the network treats sear as a combination of the two lexical items. The network generates innovative outputs that combine the two elements (sear [sɪɹ]). Additionally, the sear output seems to be equally distributed among the two latent codes, [2, 0, 0, 0, 0] representing suit and [0, 2, 0, 0, 0] representing year. In other words, the error rate distribution of the two latent codes suggests that the network classifies the output sear as a combination of elements consistent with [2, 0, 0, 0, 0] and [0, 2, 0, 0, 0].
22 outputs feature a sibilant [s]. In 22cases, 16 can reliably be transcribed as suit , while the others are mostly variants of the innovative sear . The remaining cases (approximately 10) are difficult to categorize based on acoustic analysis.For [0, 0, 2, 0, 0], the Generator outputs 84 data points that are transcribed as containing water [ wORÄ ]. In approximately 15 of the 84 cases, the output involves an innovative combinationtranscribed as watery [ "wOR@ôi ]. Figure 3 illustrates one such case. Watery is an innovative outputthat combines segment [i] from oily ( ["OIli] ) with [ "wOR@ô ] from water into a linguistically interpretableinnovation. This suggests that the Generator outputs a novel combination of segments, based onanalogy to oily . Unlike for sear , the training data contained no direct motivations based on which watery could be formed. Finally, for [0, 0, 0, 2, 0], the Generator outputs only 27 outputs that can reliably be transcribed The initial consonant is sometimes absent from transcriptions, but this is primarily because the glide interval isacoustically not prominent, especially before [ I ]. In 10 further cases of [0, 0, 2, 0, 0], the Generator outputs data points that contain a sequence oil ["OIl] . Tran-scription of the remaining 6 outputs is uncertain. igure 3: A waveform and spectrogram (0–4000 Hz) of an innovative output watery [ wOR@ôi ] (top right). Thatinnovative watery is a combination of water ["wORÄ] and oily ["OIli] is illustrated by two examples from the trainingdata (top and bottom right). The innovative output watery features a clear formant structure of water with a highfront vowel [i], characteristic of the lexical item oily (see marked areas of the spectrograms). At 19244 steps, thevocalic structure of [i] is not present in the output, given the exact same latent code and random latent space. Thenetwork thus corrects the formant structure from an innovative watery into water [ wOR@ô(@) ] as the training progresses.In some other cases, the network at 19244 steps outputs oily for what was watery at 8011 steps. c Most frequent % 2nd most freq. % Elsesuit [2, 0, 0, 0, 0] suit ["sut] year ["jIô]
21% 7%year [0, 2, 0, 0, 0] year ["jIô] suit ["sut]
12% 18%water [0, 0, 2, 0, 0] water ["wORÄ] oily ["OIli]
10% 6%oily [0, 0, 0, 2, 0] water ["wORÄ] oily ["OIli]
26% 13%rag [0, 0, 0, 0, 2] rag ["ôæg]
98% — — 2%
Table 2: Generated outputs and their percentages across the five one-hot vectors in the latent code. Transcriptions of the outputs were coded as detailed in footnote 8.

Oily is the less frequent output for [0, 0, 0, 2, 0] compared to water, but water is assigned to [0, 0, 2, 0, 0] because it is its most frequent output, while [0, 0, 0, 2, 0] is the code that outputs the highest proportion of oily. This is why we analyze oily as the underlying lexical item for the [0, 0, 0, 2, 0] code. Additional evidence that oily might underlie the [0, 0, 0, 2, 0] code is that as the training progresses, the Generator increases the number of outputs transcribed with oily for this code and decreases the number of outputs water for the same code (see Section 4.1.2 and Figure 4). For a confirmation that the proposed method for assigning underlying assumed words for a given code based on annotated outputs yields valid results, see Section 4.5.

To evaluate lexical learning in the ciwGAN model statistically, we analyze the results with a multinomial logistic regression model. To test the significance of the latent code as the predictor of the lexical item, annotations of the generated data were coded and fit to a multinomial logistic regression model using the nnet package (Venables and Ripley, 2002) in R Core Team (2018). The dependent variable is the transcriptions of the generated outputs for the five lexical items and the else condition. The independent variable is a single predictor: the latent code with the five levels that correspond to the five unique one-hot values in the latent code. The difference in AIC between the model with the latent code as a predictor (AIC = 674.7) and the empty model (AIC ≈ 1707) suggests that the latent code is a significant predictor of the outputted lexical items.

The proposed model of lexical learning allows not only the ability to test learning of lexical items, but also to probe learning representations as training progresses. We propose that the progress of lexical learning can be directly observed by keeping the random variables constant across training steps. In other words, we train the Generator at various training steps and generate outputs for models trained after different numbers of steps with the same latent code (c) and the same latent variables (z). This reveals how the encoding of lexical items with unique latent codes changes with training.

To probe lexical learning as training progresses, we train the 5-word model at 8011 steps for an additional 11233 steps (total 19244) and generate outputs. The generation is performed as described in Section 4.1.1: for each unique latent code (one-hot vector), we generate 100 outputs with latent variables identical to the ones used on the model trained after 8011 steps. (The following conditions were used for coding the transcribed output: if the annotator transcribed an output as containing “suit”, the coded lexical item was suit; if “ear” or “eer” (and no “s” immediately preceding), then year; if involving “water”, “oily”, or “rag”, then water, oily, or rag, respectively. In all other cases, the output was coded as else.)
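The statistical evaluation above uses the nnet package in R. A roughly equivalent setup in Python (a sketch with made-up annotations, using statsmodels; this is not the analysis code used in the paper) fits a multinomial logistic regression to the coded transcriptions and compares the AIC of the full model against an intercept-only ("empty") model.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Illustrative data frame: one row per generated output (annotations are random here).
df = pd.DataFrame({
    "code": np.repeat(["c1", "c2", "c3", "c4", "c5"], 100),          # one-hot level
    "word": np.random.default_rng(0).choice(
        ["suit", "year", "water", "oily", "rag", "else"], size=500),  # coded transcription
})

y = pd.Categorical(df["word"]).codes                   # outcome as integer categories
X_full = sm.add_constant(pd.get_dummies(df["code"], drop_first=True, dtype=float))
X_null = np.ones((len(df), 1))                         # intercept-only model

aic_full = sm.MNLogit(y, X_full).fit(disp=0).aic
aic_null = sm.MNLogit(y, X_null).fit(disp=0).aic
print(aic_full, aic_null)                              # lower AIC indicates better fit
```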
Figure 4: Estimates of two multinomial logistic regression models (for 5-word models trained after 8011 and 19244 steps) with coded transcribed outputs as the dependent variable and the latent code with five levels that correspond to the five unique one-hot vectors in the model.

The most noticeable change concerns oily: approximately +10% in raw counts. The overall success rate is lowest in the 8011-step model precisely for the lexical item oily. In fact, the success rate for [0, 0, 0, 2, 0] with assumed lexical representation oily is only 26%. At 19244 steps, the success rate (given the exact same latent variables) increases to 37%.

Generating data with identical latent variables allows us to observe how the network transforms an output that violates the underlying lexical representation to an output that conforms to it. Figure 5 illustrates how an output water at 8011 steps for latent code [0, 0, 0, 2, 0] changes to oily at 19244 steps. Both outputs have the same latent code and latent variables (z). Spectrograms in Figure 5 clearly show how the formant structure of water and its characteristic period of reduced amplitude for a flap [ɾ] change to a formant structure characteristic of oily with a consonantal period that corresponds to [l]. The figure also features spectrograms of two training data points, water and oily, which illustrate a degree of acoustic similarity between the two lexical items. Similarly, Figure 3 illustrates how an output watery that violates the training data in a linguistically interpretable manner at 8011 steps changes to water consistent with the training data. (Occasionally, a change in the opposite direction is also present.)

In sum, the results of the first model suggest that the Generator in the ciwGAN architecture trained on 5 lexical items learns to generate innovative data such that each unique latent code corresponds to a lexical item. In other words, the network encodes unique lexical information in its acoustic outputs based solely on its training objective: to generate data such that a unique code is retrievable from its outputs. Figure 4 illustrates that each lexical item is associated with a unique code. Modeling of lexical learning is thus fully generative: when the latent code is manipulated outside of the training range to value 2, the network mostly outputs one lexical item per unique code, with success rates ranging from approximately 98% to 27%. The errors are not randomly distributed: the pattern of errors as well as the innovative outputs suggests that (i) suit and year and (ii) water and oily are the items that the Generator associates more closely together. Output errors fall almost exclusively within these groups. The Generator also outputs innovative data that violate training data distributions. Acoustic analysis of the training data reveals motivations for the innovative outputs. When we follow learning across different training steps, we observe the Generator's repair of innovative outputs or outputs that deviate from the expected values. The highest improvement is observed in the lexical item with the overall highest error rate.

To evaluate how the Generator performs on a higher number of lexical classes, another model was trained on 10 content lexical items from the TIMIT database, each of which is attested at least 600 times in the database. All 10 lexical items with exact counts and IPA transcriptions are listed in Table 3.

To evaluate lexical learning in a generative fashion, we use the same technique as on the 5-item Generator in Section 4.1. The Generator is trained for 27103 steps. (The model does not seem to improve further after 46002 steps.)

Figure 5: Waveforms and spectrograms (0–4000 Hz) of a generated output at 8011 steps (trained on five lexical items) for latent code [0, 0, 0, 2, 0] that can be transcribed as water (top left); and of a generated output for the exact same latent code as well as the other 95 latent variables, but generated by a model trained after 19244 steps (top right), transcribed as oily. Circled areas point to three major changes on the spectrogram that occur from the output at 8011 steps to the output at 19244 steps: vocalic formants change from [wɔ] to [ɔɪ] (area 1), periods characteristic of a flap [ɾ] change to [l] (area 2), and the formant structure for [ɚ] turns into an [i]. Examples for water and oily from the TIMIT database (bottom left and right) illustrate close similarity of the generated outputs to the training data. While the opposite change (from oily to water) also occurs, it appears less common.

word     IPA         data points
ask      [ˈæsk]      633
carry    [ˈkʰæɹi]    632
dark     [ˈdɑɹk]     644
greasy   [ˈɡɹisi]    630
like     [ˈlaɪk]     697
oily     [ˈɔɪli]
rag      [ˈɹæɡ]
suit     [ˈsut]
water    [ˈwɔɾɚ]
year     [ˈjɪɹ]
Total

Table 3: Ten content lexical items from the TIMIT database used for training.

As in Section 4.1.1, the latent code is manipulated outside the training range to 2 (e.g. [2, 0, 0, 0, 0, 0, 0, 0, 0, 0]), while keeping the uniform latent variables (z) constant across the 10 groups. 1000 outputs were thus annotated.

Similarly to the 5-word model in Section 4.1.1, the generated data suggest that the Generator learns to associate each lexical item with a unique representation. To test the significance of the latent code as a predictor, the coded annotated data were fit to a multinomial logistic regression model (as described in Section 4.1). The AIC test suggests that the latent code is a significant predictor (AIC = 4555 with the latent code as a predictor). Estimates of the regression model in Figure 6 suggest that nine of the ten lexical items receive a unique representation, leaving rag without a clear representation. The highest proportion of rag appears for a single latent code, at approximately 20%. However, this particular latent code already outputs a substantially higher proportion of dark. It thus appears that the Generator fails to generate outputs such that the difference between the two outputs would be substantial. There is a high degree of phonetic similarity precisely between these two lexical items: the vowels [æ] and [ɑ] are acoustically similar and both lexical items contain a rhotic [ɹ] and a voiced stop. Success rates for the other nine lexical items range from 39% to 99% in raw counts.

To illustrate that the network learns to associate lexical items with unique values in the latent code (one-hot vector), we generate outputs by manipulating the one-hot vector for each value and by keeping the rest of the latent space (z) constant. Such a manipulation can result in generated samples where each latent code outputs a distinct lexical item associated with that value ([2, 0, 0, 0, 0, 0, 0, 0, 0, 0] outputs dark, [0, 2, 0, 0, 0, 0, 0, 0, 0, 0] water, etc.). (The outputs were coded according to the following criteria: if the transcription included “su[ie][td]”, then suit; if “[ˆs]e[ae]r”, then year; if “water”, then water; if “dar”, then dark; if “greas”, then greasy; if “[kc].*r”, then carry; if “[ao][wia]ly”, then oily; if “rag”, then rag; if “as”, then ask; if “li”, then like. Often each series outputs one or two divergences from the ideal output.) Note that the acoustic contents of the generated outputs that correspond to each lexical item are substantially different (as illustrated by the spectrograms in Figure 7), which means that the latent code (c) needs to be strongly associated with the individual lexical items, given that all the other 90 variables in the latent space (the z-variables, which constitute 90% of the latent space) are kept constant and that the entire change of the output occurs only due to the change of the latent code c. In other words, by only changing the latent code and setting the variables to desired values while keeping the rest of the latent space constant, we can generate desired lexical items with the Generator network.

Figure 6: Estimates of a multinomial logistic regression model with coded transcribed outputs as the dependent variable and the latent code with ten levels that correspond to the ten unique one-hot vectors in the model trained on 10 lexical items from TIMIT after 27103 steps.

Figure 7: Waveforms and spectrograms (0–8000 Hz) of generated outputs (of a model trained on 10 items after 27103 steps) when only the latent code is manipulated and the remaining 90 latent random variables are kept constant across all 10 outputs. Transcriptions (by the author) suggest that each lexical item is associated with a unique representation.

word     IPA         data points
ask      [ˈæsk]      633
carry    [ˈkʰæɹi]    632
dark     [ˈdɑɹk]     644
greasy   [ˈɡɹisi]    630
like     [ˈlaɪk]     697
suit     [ˈsut]
water    [ˈwɔɾɚ]
year     [ˈjɪɹ]
Total

Table 4: Eight content lexical items from the TIMIT database used for training in the fiwGAN architecture.
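Because each lexical item ends up associated with one latent code, the trained Generator can be used as a rudimentary lexical-level text-to-speech system, as noted in the Introduction. The sketch below is purely illustrative: `generator` is a stub for the trained ciwGAN Generator, and `word_to_code` is a hypothetical mapping that would have to be read off the annotation results (e.g. Table 2 or Figure 6).

```python
import numpy as np
from scipy.io import wavfile

def generator(latent):                      # stub for the trained Generator network
    return np.zeros(16384, dtype=np.float32)

word_to_code = {"dark": 0, "water": 1, "greasy": 2}    # assumed code-to-word mapping

def say(word, rng=np.random.default_rng(0)):
    code = np.zeros(10, dtype=np.float32)
    code[word_to_code[word]] = 2.0                     # marginal value, as in the paper
    latent = np.concatenate([code, rng.uniform(-1, 1, 90).astype(np.float32)])
    return generator(latent)                           # ~1 s of audio at 16 kHz

wavfile.write("dark.wav", 16000, say("dark"))
```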
To evaluate lexical learning in a fiwGAN architecture, we train the fiwGAN model with three featural variables (φ). Because the latent code in fiwGAN is binomially distributed, three featural variables correspond to 2³ = 8 categories. The model was trained on 8 content lexical items with more than 600 attestations in the TIMIT database (listed in Table 4). The model used for the analysis was trained for 20026 steps, which corresponds to a similar number of epochs as the 10-word ciwGAN model in Section 4.2. Only log₂(n) featural variables are needed to encode n lexical items in fiwGAN (compared to n variables for n classes in ciwGAN). Despite the latent space for lexical learning being reduced, an analysis of generated data in the fiwGAN architecture suggests that the Generator learns to associate each binary code with a distinct lexical item (for an additional test, see Section 4.5).

To test the significance of the featural code (φ) as a predictor, the annotated data were fit to a multinomial logistic regression model as in Sections 4.1 and 4.2. The dependent variables are again coded transcriptions and the independent variable is the featural code (φ) with the eight unique levels as predictors: one for each unique binary code. The difference in AIC between the model that includes the unique featural codes as predictors (φ) and the empty model (2038.5 vs. 3409.7) suggests that featural values are significant predictors.

Estimates of the regression model in Figure 8 illustrate that most lexical items receive a unique featural representation. Six out of eight lexical items (dark, ask, suit, greasy, year, and carry) all have distinct latent featural representations that can be associated with these lexical items. Success rates for the six items have a mean of 50.8% (in raw counts) with a range of 46% to 61%. Crucially, there appears to be a single peak in regression estimates per lexical item for these six words, although the peaks are less prominent compared to the ciwGAN architecture (expectedly so, since learning is significantly more challenging in the featural condition). Water and like are more problematic: [0, 2, 0] outputs like and water at approximately the same rate. It is possible that learning of the two lexical items is unsuccessful. Another possibility is that [0, 2, 0] is the underlying representation of water because it is water's most frequent code that is not already taken by another lexical item. According to the guidelines in Section 4.1, like would have to be represented by [0, 0, 0], because it outputs the highest proportion of like that is not already taken by another lexical item. That this assignment of underlying values of each featural representation is valid is additionally suggested by another test in Section 4.5.

In the fiwGAN architecture, we can also test the significance of each of the three unique features (φ₁, φ₂, and φ₃). The annotated data were fit to the same multinomial logistic regression model as above, but with three independent variables: the three features, each with two levels (0 and 2). AIC is lowest when all three variables are present in the model (2135.5) compared to when φ₁, φ₂, or φ₃ is removed from the model (2527.2, 2413.0, and 2773.3, respectively).

An advantage of the fiwGAN architecture is that it can model classification (i.e. lexical learning) and featural learning of phonetic and phonological representations simultaneously. We can assume that lexical learning is represented by the unique binary code for each lexical item. Phonetic and phonological information can be simultaneously encoded with each unique feature (φ). That phonetic and phonological information is learned with binary features has been the prevalent assumption in linguistics for decades (Clements, 1985; Hayes, 2009). Recently, neuroimaging evidence suggesting that phonetic and phonological information is stored as signals that approximate phonological features has been presented in Mesgarani et al. (2014).

An analysis of featural learning in fiwGAN — how featural codes simultaneously represent unique lexical items and phonetic/phonological representations — can be performed by using logistic regression as proposed in the following paragraphs, as well as with a number of exploratory techniques described in this section.

Three out of eight lexical items used in training of the fiwGAN model include the segment [s]: a voiceless alveolar fricative with a distinct phonetic marker — a period of frication noise. The assumed binary codes for the three items containing [s], ask, greasy, and suit, are [2, 2, 0], [2, 0, 0], and [2, 0, 2] (see Figure 8). We observe that value 2 for feature φ₁ is common to all three of the lexical items containing an [s]. (The outputs are coded as described in fn. 11 for the 10-word ciwGAN model, except that if “[ae].*[sf]”, then ask, because outputs contain a large proportion of s-like frication noise that can also be transcribed with f.)

Figure 8: Estimates of a multinomial logistic regression model with coded transcribed outputs as the dependent variable and the latent code with eight levels that correspond to the eight unique binary codes in the model trained on 8 lexical items from TIMIT after 20026 steps.

Figure 9: Waveforms and spectrograms (0–8000 Hz) of two lexical items dark from TIMIT with clear s-like frication noise during the aspiration after the closure of [k] (highlighted).

To test the effects of φ on the presence of [s] in the output, 800 annotated outputs (100 for each of the eight unique binary codes) were fit to a logistic regression model. The dependent variable is the presence of s-like frication noise: if a transcribed output contains an s, z, or f, the output is coded as a success. The independent predictors in the model are the three features without interactions: φ₁, φ₂, and φ₃, each with two levels (0 and 2). Figure 10 features estimates of the regression model. While all three features are significant predictors, the effect appears to be most prominent for φ₁. It is possible that the Generator network in the fiwGAN architecture uses feature φ₁ to encode the presence of the segment [s] in the output. This distribution could also be due to chance. Further work is needed to test whether the presence of phonetic/phonological elements in the output can be encoded with individual features. Two facts from the generated data, however, suggest that the Generator in the fiwGAN architecture associates φ₁ with the presence of [s].

First, while ask, greasy, and suit all have φ₁ = 2 in common, the fourth unique featural code with φ₁ = 2 ([2, 2, 2]) is associated with dark.
Spectral analysis of lexical item dark in thetraining data reveals that aspiration of [k] in dark is in the training data from TIMIT frequentlyrealized precisely as an alveolar fricative [s] (likely due to contextual influences). Approximately27% data points for dark in the training data from TIMIT contain a [s]-like frication noise duringthe aspiration period of [k]. Figure 9 gives two such examples from TIMIT of dark with a clearfrication noise characteristic of an [s] sound after three aspiration noise of [k]. In other words,3 lexical items in the training data contain an [s] as part of their phonemic representation andtherefore feature it consistently. The Generator outputs data such that a single feature ( φ = 2)is common to all three items. An additional item often involves a s -like element and the networkuses the same value ( φ = 2) for its unique code ([2, 2, 2]). There is approximately a 8.6% chancethis distribution is random (of 70 possible featural code assignment for eight items, four of whichcontain some phonetic feature such as [s], six or 8.6% combinations contain the same value in onefeature).As already mentioned, the network outputs mostly dark for the featural code [2, 2, 2], buta substantial portion of outputs also deviate from dark . A closer look at the structure of theinnovative outputs for the [2, 2, 2] code reveals that a substantial proportion of them (35) containan [s]. As a comparison, other unique codes with φ set at the opposite value 0 ([0, 0, 0], [0, 0, 2],[0, 2, 0], [0, 2, 2]) output 43, 41, 1, and 0 outputs containing an s , z , or f . In other words, for two This estimate is based on acoustic analysis of the first 100 training data points from the TIMIT database. φ φ Value of φ P r obab ili t y o f c on t a i n i ng [ s ] Figure 10: Fitted values with 95% CIs of a logistic regression model with presence of [s] in the transcribed outputsas the dependent variable and the three features φ , φ , and φ as predictors. unique codes given φ = 0, the network generates 0 or 1 outputs containing an s -like segment. Forthe two other codes, the network generates outputs with a similar rate of s -containing sequencesas [2, 2, 2] ( dark ). However, the motivation for an s- containing output in [0, 2, 2] is clear: year isin three training data points actually realized as [ SIô ] ( shear ). The [0, 0, 0] does not have a distinctunderlying lexical item, so the high proportion of outputs with [s] is not unexpected.The second piece of evidence suggesting that ( φ = 2) represents presence of [s] in the output areinnovative outputs that violate the training data distribution. The majority of s -containing outputswhen φ = 0 are non-innovative sequences that correspond to lexical items from the training data.The most notable feature of the s -containing outputs for [2, 2, 2] ( dark ), on the other hand, istheir innovative nature. Sometimes, these outputs can indeed be transcribed as suit , but in somecases the Generator outputs an innovative sequence that violates training data, but is linguisticallyinterpretable. In fact, some of the outputs with [2, 2, 2] are directly interpretable as adding an [s] tothe underlying form dark . Two innovative sequences that can be reliably transcribed as start [ "stAôt ]are given in Figure 11 and additional one transcribed as sart [ "sAôt ] in Figure 11. 
As already mentioned, the network outputs mostly dark for the featural code [2, 2, 2], but a substantial portion of the outputs also deviate from dark. A closer look at the structure of the innovative outputs for the [2, 2, 2] code reveals that a substantial proportion of them (35) contain an [s]. As a comparison, the other unique codes with φ1 set at the opposite value 0 ([0, 0, 0], [0, 0, 2], [0, 2, 0], [0, 2, 2]) yield 43, 41, 1, and 0 outputs containing an s, z, or f. In other words, for two unique codes given φ1 = 0, the network generates 0 or 1 outputs containing an s-like segment. For the two other codes, the network generates outputs with a similar rate of s-containing sequences as [2, 2, 2] (dark). However, the motivation for an s-containing output in [0, 2, 2] is clear: year is in three training data points actually realized as [ʃɪɹ] (shear). The code [0, 0, 0] does not have a distinct underlying lexical item, so the high proportion of outputs with [s] is not unexpected.

Figure 10: Fitted values with 95% CIs of a logistic regression model with presence of [s] in the transcribed outputs as the dependent variable and the three features φ1, φ2, and φ3 as predictors.

The second piece of evidence suggesting that φ1 = 2 represents the presence of [s] in the output comes from innovative outputs that violate the training data distribution. The majority of s-containing outputs when φ1 = 0 are non-innovative sequences that correspond to lexical items from the training data. The most notable feature of the s-containing outputs for [2, 2, 2] (dark), on the other hand, is their innovative nature. Sometimes these outputs can indeed be transcribed as suit, but in some cases the Generator outputs an innovative sequence that violates the training data yet is linguistically interpretable. In fact, some of the outputs with [2, 2, 2] are directly interpretable as adding an [s] to the underlying form dark. Two innovative sequences that can be reliably transcribed as start [ˈstɑɹt] are given in Figure 12 and additional outputs transcribed as sart [ˈsɑɹt] in Figure 11. The network is never trained on a [st] sequence, let alone on the lexical item start, yet the innovative output is linguistically interpretable and remarkably similar to the [st] sequence in human outputs that the network never "sees". Spectral analysis illustrates a clear period of frication noise characteristic of [s] followed by a period of silence and a release burst characteristic of a stop [t]. Figure 12 provides two examples from the TIMIT database with the [st] sequence that were never part of the training data, yet illustrate how acoustically similar the innovative generated outputs in the fiwGAN architecture are to real speech data. This example constitutes one of the prime cases of high productivity in deep neural networks (for a recent survey on productivity in deep learning, see Baroni 2020).

Beguš (2020) argues that the underlying value of a feature can be uncovered by manipulating a given feature well beyond the training range. For example, Beguš (2020) proposes a technique for identifying variables that correspond to phonetic/phonological features in the outputs. By setting the values of the identified features well beyond the training range, the network almost exclusively outputs the desired feature (e.g. a segment [s] in the output).
Figure 11: Waveforms and spectrograms (0–8000 Hz) of two innovative outputs for φ = [2, 2, 2] transcribed as [ˈsɑɹt].

This effect of the underlying value of a variable is even more prominent in the fiwGAN architecture. When the values of the latent features (φ) are set at 0 and 2, success rates appear at approximately 50% (Figure 8). Value 2 was chosen for the analysis in Section 4.3 because non-categorical outcomes yield more insights into learning. However, we can reach almost categorical accuracy when the values are set substantially higher than 2. For example, when we generate outputs with the values of the featural code set at 5 ([5, 5, 5]), the network generates 93/100 outputs that can be reliably transcribed as dark and another 7 that closely resemble dark but have a period of frication instead of the initial stop (sark). (Counts in this section are performed by the author only.) With even higher values such as 15, the Generator outputs 100/100 samples transcribed as dark for [15, 15, 15]. Similarly, [5, 5, 0] yields ask in 97/100 cases; it yields an innovative output with a final [i] in three cases. At [15, 15, 0], the Generator outputs 100/100 ask. The success rates differ across featural codes, but value 15 triggers almost categorical outputs for most of them. [15, 0, 0] yields 93/100 greasy (1 unclear and 6 ask). For [15, 0, 15], the network outputs an [sVt] sequence (where V = vowel) for suit 87/100 times. In 13 examples, the frication noise does not have a pronounced s-like quality, but is more distributed and closer to the aspiration noise of [k]. The identity of the vowel is intriguing: the formant values are not characteristic of [u] (as in suit), but rather of a lower, more central vowel (F1 = 663 Hz, F2 = 1618 Hz, F3 = 2515 Hz for one listing). Since formant variability is high in the training data, the underlying prototypical representation likely defaults to a more central vowel.

[0, 0, 15] yields carry in 100/100 outputs, but in 13 of these outputs the aspiration noise of [k] is distributed with a peak in higher frequencies for an acoustic effect of [ts]. [0, 15, 0] yields 100/100 water. The acoustic output is reduced to include only the main acoustic properties of water: formant structure for [wɔ] followed by a consonantal period for [ɾ] and a very brief (sometimes missing) vocalic period (Figure 13).
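The generation test just described (holding the featural code at values well above the training range while sampling the remaining latent variables) can be sketched as follows. The sketch assumes, as in WaveGAN, that the random latent variables z are sampled uniformly from [-1, 1]; the generator call at the end is hypothetical and stands in for a trained fiwGAN Generator, which is not specified here.

    import numpy as np

    NUM_FEATURES, NUM_RANDOM = 3, 97   # three featural codes plus 97 random latent variables z

    def make_latents(phi_code, n=100, seed=0):
        # Hold the featural code fixed (e.g. [15, 15, 15], well beyond the training range)
        # and sample the remaining latent variables z uniformly, as during training.
        rng = np.random.default_rng(seed)
        phi = np.tile(np.asarray(phi_code, dtype=np.float32), (n, 1))
        z = rng.uniform(-1.0, 1.0, size=(n, NUM_RANDOM)).astype(np.float32)
        return np.concatenate([phi, z], axis=1)

    # latents = make_latents([15, 15, 15])
    # audio = generator(latents)   # hypothetical call to a trained fiwGAN Generator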
Figure 12: (left) Waveforms and spectrograms (0–8000 Hz) of two innovative outputs for φ = [2, 2, 2] transcribed as [ˈstɑɹt]. The fiwGAN network thus outputs, after 20026 training steps, the innovative sequence [st] that is absent from the training data, but is a linguistically interpretable output that can be analyzed as adding [s] to dark. (right) Waveforms and spectrograms (0–8000 Hz) of two lexical items from TIMIT that are not part of the training data, but illustrate that the innovative [st] sequence in the generated data is acoustically very similar to the [st] sequence in human outputs that the network has no access to.

Figure 13: Waveforms of the first five generated outputs for each featural code when the values are set at 15. The waveforms clearly show that the outputs feature minimal variability. Below each waveform is a spectrogram (0–8000 Hz) of the first output (the topmost waveform). All seven outputs have the exact same values for the 97 random latent variables (z); they only differ in the three featural codes φ.

The only two codes that do not yield straightforward underlying representations are [0, 0, 0] and [0, 2, 2]. It appears that the Generator is unable to strongly associate [0, 0, 0] with any lexical representation, likely due to the lack of positive values in this particular code. This means that the network needs to learn underlying representations for the two remaining lexical items with a single code: like and year, both likely associated with [0, 2, 2]. When set to [0, 5, 5], the Generator outputs both like and year (occasionally, [0, 5, 5] also yields an output that can be characterized as water), but at [0, 10, 10] and [0, 15, 15] the underlying representation is an acoustic output that is difficult to characterize and is likely a blend of the two representations (acoustically closer to like; see Figure 13). Future analyses should thus include log(n) featural variables for n − 1 lexical items, with z being randomly sampled for each output, as illustrated by Figure 13. It appears that the network associates unique featural codes with prototypical underlying representations of lexical items. When values are lower, the other random variables (z) cause variation in the outputs, but high values (such as 5 or 15) override this variation and reveal the underlying lexical representation for each featural code. The generative test with values set well above the training range thus strongly suggests that the Generator learns to associate each unique featural code with an underlying lexical representation.
5. Discussion and future directions
This paper proposes two architectures for unsupervised modeling of lexical learning from raw acoustic data with deep neural networks. The ability to retrieve information from acoustic data in human speech is modeled with a Generator network that learns to output data that resembles speech and, simultaneously, learns to encode unique information in its outputs. We also propose techniques for probing how deep neural networks trained on speech data learn meaningful representations.

The proposed fiwGAN and ciwGAN models are based on the Generative Adversarial Network architecture and its implementations in WaveGAN (Donahue et al., 2019), DCGAN (Radford et al., 2015), and InfoGAN (Chen et al., 2016; Rodionov, 2018). Following Beguš (2020), we model language acquisition as learning of a dependency between the latent space and generated outputs. We introduce a network that forces the Generator to output data such that information is retrievable from its acoustic outputs and propose a new structure of the latent variables that allows featural learning. Lexical learning thus emerges from the architecture: the most efficient way for the Generator network to output acoustic data such that unique information is retrievable from its data is to encode unique information in its acoustic outputs such that latent codes coincide with lexical items in the training data. The result is a deep convolutional neural network that takes latent codes and variables and outputs innovative data that resembles the training data distribution, while learning to associate lexical items with unique representations.

Three experiments tested lexical learning in the ciwGAN and fiwGAN architectures trained on tokens of five, ten, and eight sliced lexical items in raw audio format from a highly variable database, TIMIT. The paper proposes that lexical learning can be evaluated with multinomial logistic regression on generated data. Evidence of lexical learning is present in all three experiments. It appears that the Generator learns to associate lexical items with unique latent codes, categorical (as in ciwGAN) or featural (as in fiwGAN). By setting the values of the latent codes to 2, the networks output unique lexical items for each unique code and reach accuracy that ranges from 27% to 98% in the five-word model. To replicate the results and test learning on a higher number of lexical items, the paper presents evidence that the model learns to associate unique latent codes with lexical items in the 10-word model as well, with only one exception. The paper also proposes a technique for following how the network learns representations as training progresses. By keeping the latent space constant as training progresses, we can directly observe how the network transforms an output that violates the training data into an output that conforms to it.

The fiwGAN architecture features, to our knowledge, a new proposal within the InfoGAN framework: to model classification with featural codes instead of one-hot vectors. This shift yields the potential to model featural learning and higher-level classification (i.e. phonetic/phonological features and unique lexical representations) simultaneously. The paper presents evidence suggesting that the network might use some feature values to encode phonetic/phonological properties such as the presence of [s]. Regression models suggest that φ1 is associated with the presence of [s] in the output (and a simple probabilistic calculation reveals an approximately 8.6% probability that the distribution is due to chance).
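To make the featural latent structure concrete, here is a minimal sketch of how the latent space and the auxiliary (Q-) network objective can be set up. It assumes a TensorFlow-style implementation, a three-bit featural code with 97 additional uniform latent variables (as in the eight-word experiment), and a sigmoid cross-entropy loss over the featural bits; the function names and the particular loss are illustrative choices, not taken from the released code.

    import tensorflow as tf

    NUM_FEATURES = 3    # binary featural code phi: 3 variables encode up to 8 lexical items
    NUM_RANDOM = 97     # remaining latent variables z; 3 + 97 = 100-dimensional latent space

    def sample_latent(batch_size):
        # During training, each featural variable is independently 0 or 1;
        # the remaining latent variables are sampled uniformly from [-1, 1].
        phi = tf.cast(tf.random.uniform([batch_size, NUM_FEATURES]) < 0.5, tf.float32)
        z = tf.random.uniform([batch_size, NUM_RANDOM], minval=-1.0, maxval=1.0)
        return tf.concat([phi, z], axis=1), phi

    def q_loss(phi_true, phi_logits):
        # The Q-network predicts the featural code from the Generator's audio output;
        # both networks are trained to minimize this retrieval loss, which forces the
        # Generator to encode the code (and hence lexical information) in its outputs.
        return tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(labels=phi_true, logits=phi_logits))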
The strongest evidence for simultaneous lexical and featural learning comes from innovative outputs in the fiwGAN architecture. The network, trained on lexical items that altogether lack a sequence of a fricative and a stop ([st]), outputs an innovative sequence start or sart. These innovative outputs can be analyzed as adding a segment [s] (from suit) to dark, likely under the influence of the fact that φ1 represents the presence of [s].

Innovative outputs that violate the training data are informative both for computational models of language acquisition and for our understanding of what types of dependencies the networks are able to learn. We discuss several cases of innovative outputs. Some innovations are motivated by training data distributions (e.g. sear) and reveal how the networks treat acoustically similar lexical items. For other innovative outputs, such as watery, the training data contains no apparent motivation. We also track changes from innovative to conforming outputs as training progresses. We argue that innovative outputs are linguistically interpretable and acoustically very similar to actual speech data that is absent from the training data. For example, an innovative [st] sequence in start corresponds directly to human outputs with this sequence that were never part of the training data. Further comparisons of this type should yield a better understanding of how the combinatorial principle in human language can arise without language-specific parameters in a model.

The paper also discusses how internal representations in the GAN architecture can be identified and explored. We argue that by setting the latent values substantially beyond the training range (as suggested for phonological learning in Beguš 2020), the Generator almost exclusively outputs one unique lexical item per unique featural code (with only one exception) in the fiwGAN architecture. In other words, for very high values of the featural code (φ), lexical learning appears to be categorical. The variability of the outputs is minimal at such high values (e.g. 15). It appears that setting the featural code to such extreme values reveals the underlying representation of each featural code. This property is highly desirable in a model of language acquisition and has the potential to reveal the underlying learned representations in the GAN architecture.

Several improvements to the model are left to future work. For example, future directions should include developing a model that does not require a pre-determined number of classes corresponding to the number of lexical items and that could parse lexical items from a continuous acoustic stream.

The proposed model of lexical learning has several further implications. Dependencies in speech data are significantly better understood than dependencies in visual data. A long scientific tradition of studying dependencies in phonetic and phonological data in human languages yields an opportunity to use linguistic data to probe the types of dependencies deep neural networks can or cannot learn. The proposed architectures allow us to probe what types of dependencies the networks can learn, how they encode unique information in the latent space, and how self-organization of retrievable information emerges in the GAN architecture. The models also have basic implications for NLP tasks such as unsupervised text-to-speech generation: manipulating the latent variables to specific values results in the Generator outputting the desired lexical items.

Acknowledgements
This research was funded by a grant to new faculty at the University of Washington. I would like to thank Sameer Arshad for slicing data from the TIMIT database and Ella Deaton for annotating data. All mistakes are my own.
Declaration of interests
The author declares no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References
Arjovsky, M., Chintala, S., Bottou, L., 2017. Wasserstein generative adversarial networks. In: International Conference on Machine Learning. pp. 214–223.

Arnold, D., Tomaschek, F., Sering, K., Lopez, F., Baayen, R. H., 2017. Words from spontaneous conversational speech can be recognized with human-like accuracy by an error-driven learning algorithm that discriminates between meanings straight from smart acoustic features, bypassing the phoneme as recognition unit. PLOS ONE 12 (4), 1–16.
URL https://doi.org/10.1371/journal.pone.0174623
Baayen, R. H., Chuang, Y.-Y., Shafaei-Bajestan, E., Blevins, J. P., 2019. The discriminative lexicon: A unified computational model for the lexicon and lexical processing in comprehension and production grounded not in (de)composition but in linear discriminative learning. Complexity 2019.

Baroni, M., 2020. Linguistic generalization and compositionality in modern artificial neural networks. Philosophical Transactions of the Royal Society B: Biological Sciences 375 (1791), 20190307.
URL https://royalsocietypublishing.org/doi/abs/10.1098/rstb.2019.0307
Barry, S. M., Kim, Y. E., 2019. InfoWaveGAN: Informative latent spaces for waveform generation. Poster at the North East Music Information Special Interest Group.
URL http://nemisig2019.nemisig.org/images/kimSlides.pdf
Beguš, G., 2020. Generative adversarial phonology: Modeling unsupervised phonetic and phonological learning with neural networks. Accepted at Frontiers in Artificial Intelligence.
Brownlee, J., 2019. Generative Adversarial Networks with Python: Deep Learning Generative Models for Image Synthesis and Image Translation. Machine Learning Mastery.

Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P., 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In: Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 29. Curran Associates, Inc., pp. 2172–2180.
URL http://papers.nips.cc/paper/6399-infogan-interpretable-representation-learning-by-information-maximizing-generative-adversarial-nets.pdf
Chuang, Y.-Y., Vollmer, M. L., Shafaei-Bajestan, E., Gahl, S., Hendrix, P., Baayen, R. H., 2020. The processing of pseudoword form and meaning in production and comprehension: A computational modeling approach using linear discriminative learning. Behavior Research Methods.
URL https://doi.org/10.3758/s13428-020-01356-w
Clements, G. N., 1985. The geometry of phonological features. Phonology Yearbook 2 (1), 225–252.

Donahue, C., McAuley, J., Puckette, M., 2019. Adversarial audio synthesis. In: ICLR. github.com/chrisdonahue/wavegan.

Eloff, R., Nortje, A., van Niekerk, B., Govender, A., Nortje, L., Pretorius, A., Biljon, E., van der Westhuizen, E., Staden, L., Kamper, H., 2019. Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks. In: Proc. Interspeech 2019. pp. 1103–1107.

Elsner, M., Goldwater, S., Feldman, N., Wood, F., 2013. A joint learning model of word segmentation, lexical acquisition, and phonetic variability. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, USA, pp. 42–54.

Feldman, N. H., Griffiths, T. L., Goldwater, S., Morgan, J. L., 2013. A role for the developing lexicon in phonetic category acquisition. Psychological Review 120 (4), 751.

Feldman, N. H., Griffiths, T. L., Morgan, J. L., 2009. Learning phonetic categories by learning a lexicon. In: Scott, J., Waugtal, D. (Eds.), Proceedings of the 31st Annual Conference of the Cognitive Science Society. pp. 2208–2213.

Garofolo, J. S., Lamel, L., Fisher, W. M., Fiscus, J., Pallett, D. S., Dahlgren, N. L., Zue, V., 1993. TIMIT acoustic-phonetic continuous speech corpus. Linguistic Data Consortium.

Goldwater, S., Griffiths, T. L., Johnson, M., 2009. A Bayesian framework for word segmentation: Exploring the effects of context. Cognition 112 (1), 21–54.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., Weinberger, K. Q. (Eds.), Advances in Neural Information Processing Systems 27. Curran Associates, Inc., pp. 2672–2680.
URL http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
Hayes, B., 2009. Introductory Phonology. Wiley-Blackwell, Malden, MA.

Heymann, J., Walter, O., Haeb-Umbach, R., Raj, B., 2013. Unsupervised word segmentation from noisy input. In: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. pp. 458–463.

Hockett, C. F., 1959. Animal "languages" and human language. Human Biology 31 (1), 32–39.
Kamper, H., Jansen, A., Goldwater, S., 2017. A segmental framework for fully-unsupervised large-vocabulary speech recognition. Computer Speech & Language 46, 154–174.
Kuhl, P. K., 2010. Brain mechanisms in early language acquisition. Neuron 67 (5), 713–727.
URL https://doi.org/10.1016/j.neuron.2010.08.038
Lee, C.-y., Glass, J., 2012. A nonparametric Bayesian approach to acoustic model discovery. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Jeju Island, Korea, pp. 40–49.
Lee, C.-y., O'Donnell, T. J., Glass, J., 2015. Unsupervised lexicon discovery from acoustic input. Transactions of the Association for Computational Linguistics 3, 389–403.
Mesgarani, N., Cheung, C., Johnson, K., Chang, E. F., 2014. Phonetic feature encoding in human superior temporal gyrus. Science 343 (6174), 1006–1010.
URL https://science.sciencemag.org/content/343/6174/1006
Piantadosi, S. T., Fedorenko, E., 2017. Infinitely productive language can arise from chance under communicative pressure. Journal of Language Evolution 2 (2), 141–147.
URL https://doi.org/10.1093/jole/lzw013

R Core Team, 2018. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Radford, A., Metz, L., Chintala, S., 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

Räsänen, O., 2012. Computational modeling of phonetic and lexical learning in early language acquisition: Existing models and future directions. Speech Communication 54 (9), 975–997.
Räsänen, O., Nagamine, T., Mesgarani, N., 2016. Analyzing distributional learning of phonemic categories in unsupervised deep neural networks. In: Proceedings of the Annual Conference of the Cognitive Science Society 2016, 1757–1762.
URL https://pubmed.ncbi.nlm.nih.gov/29359204
Rodionov, S., 2018. info-wgan-gp. https://github.com/singnet/semantic-vision/tree/master/experiments/concept_learning/gans/info-wgan-gp.

Saffran, J. R., Aslin, R. N., Newport, E. L., 1996. Statistical learning by 8-month-old infants. Science 274 (5294), 1926–1928.
URL https://science.sciencemag.org/content/274/5294/1926
Saffran, J. R., Werker, J. F., Werner, L. A., 2007. The Infant's Auditory World: Hearing, Speech, and the Beginnings of Language. American Cancer Society, Ch. 2.
URL https://onlinelibrary.wiley.com/doi/abs/10.1002/9780470147658.chpsy0202
Shafaei-Bajestan, E., Baayen, R. H., 2018. Wide learning for auditory comprehension. In: Proc. Interspeech 2018. pp. 966–970.
URL http://dx.doi.org/10.21437/Interspeech.2018-2420
Shain, C., Elsner, M., 2019. Measuring the perceptual availability of phonological features during language acquisition using unsupervised binary stochastic autoencoders. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pp. 69–85.
Venables, W. N., Ripley, B. D., 2002. Modern Applied Statistics with S, 4th Edition. Springer, New York. ISBN 0-387-95457-0.