Local and non-local dependency learning and emergence of rule-like representations in speech data by Deep Convolutional Generative Adversarial Networks
Gašper Beguš
Department of Linguistics, University of California, Berkeley, 1203 Dwinelle Hall
Abstract
This paper argues that training Generative Adversarial Networks (GANs) on local and non-local dependencies in speech data offers insights into how deep neural networks discretize continuous data and how symbolic-like rule-based morphophonological processes emerge in a deep convolutional architecture. Acquisition of speech has recently been modeled as a dependency between latent space and data generated by GANs in Beguš (2020c), who models learning of a simple local allophonic distribution. We extend this approach to test learning of local and non-local phonological processes that include approximations of morphological processes. We further parallel outputs of the model to results of a behavioral experiment in which human subjects are trained on the data used for training the GAN network. Four main conclusions emerge: (i) the networks provide useful information for computational models of language acquisition even if trained on a comparatively small dataset of an artificial grammar learning experiment; (ii) local processes are easier to learn than non-local processes, which matches both behavioral data in human subjects and typology in the world's languages. This paper also proposes (iii) how we can actively observe the network's progress in learning and explore the effect of training steps on learning representations by keeping latent space constant across different training steps. Finally, this paper shows that (iv) the network learns to encode the presence of a prefix with a single latent variable; by interpolating this variable, we can actively observe the operation of a non-local phonological process. The proposed technique for retrieving learning representations has general implications for our understanding of how GANs discretize continuous speech data and suggests that rule-like generalizations in the training data are represented as an interaction between variables in the network's latent space.
Keywords: neural networks, behavioral experiments, machine learning, learning biases, speech, morphology
1. Introduction
The discussion between connectionist and symbolic approaches to language and human cognition in general has long been in the focus of computational cognitive science (Rumelhart et al. 1986; McClelland et al. 1986; Marcus 2001, i.a.). Phonetic and phonological data are uniquely appropriate for addressing this problem. An over-a-century-long tradition of scientific study of acoustic and perceptual phonetics (for an overview, see MacMahon 2013), which deals with the physical properties of speech sounds, provides a solid understanding of the continuous data that hearing infants acquire language from: raw acoustic speech. Phonology is the study of how humans analyze,
Email address: [email protected] (Gašper Beguš)
Preprint submitted to tbd, September 29, 2020 (arXiv preprint, cs.CL).

discretize, self-organize, and manipulate continuous speech data into discretized mental representations called phonemes. The scientific study of phonology, too, has an over-a-century-long history (for an overview, see van der Hulst 2013), which has resulted in a solid understanding of local and non-local discrete dependencies in human speech. Phonetic and phonological data and analysis are thus uniquely appropriate for probing what deep convolutional networks can and cannot learn, how discrete representations can emerge in deep neural networks, and how their performance can be paralleled to human behavior. Despite these advantages, the majority of neural network interpretability studies focus on non-linguistic visual data or on the syntactic/semantic levels, the latter of which lack a continuous component.

Computational models of speech acquisition have a long history. The majority of models, however, operate with abstract and already discretized data rather than raw acoustic inputs (McClelland and Elman, 1986; Gaskell et al., 1995; Plaut and Kello, 1999). Deep neural network models of phonetic and phonological data operating with raw acoustic inputs emerged only recently. Several proposals model phonetic learning with deep autoencoder models (Räsänen et al., 2016; Alishahi et al., 2017; Eloff et al., 2019; Shain and Elsner, 2019). Autoencoders learn to reduce data and encode data distributions in latent representations: they are trained on reproducing inputs by generating outputs from a reduced latent space. Inputs are thus directly connected to the outputs, with an intermediate latent space that is reduced in dimensionality.
Clustering analyses on the latent space show that networks trained on phonetic data learn approximations of phonetic features from phonetic similarity (Räsänen et al., 2016; Alishahi et al., 2017; Eloff et al., 2019; Shain and Elsner, 2019).

While the reduced dimensionality in the autoencoder architecture approximates phonetic features and phonetic similarity, these proposals do not model phonological processes. The human language learner has to acquire not only the identity of individual sounds based on acoustic similarity (as approximately modeled by the proposals using the autoencoder architecture), but also the ability to manipulate those sounds in a given phonetic context. For example, a voiceless bilabial stop /p/ in English can surface as aspirated [pʰ] (produced with aspiration, a puff of air) before stressed vowels, or as unaspirated [p] (without aspiration) if a fricative [s] precedes it. A minimal pair illustrating this distribution is [ˈpʰɪt] 'pit' and [ˈspɪt] 'spit'. The learner needs to learn not only to output a voiceless bilabial stop, but also to shorten the voice onset time (VOT) when an [s] precedes it. Autoencoders are also trained on replicating output data as closely as possible to the input data, which is not desirable in models of language acquisition: while dimensionality reduction in autoencoders is unsupervised, input-output pairing is not.

To model phonetic learning simultaneously with the learning of simple allophonic processes, Beguš (2020c) proposes that speech acquisition can be modeled as a dependency between the latent space and generated data in Generative Adversarial Networks. Generative Adversarial Networks (GANs), first proposed by Goodfellow et al. (2014), have not previously been used for modeling language acquisition, despite several advantages that this architecture offers for computational models of language learning.
GAN models are unsupervised and fully generative, which means that a deep convolutional network outputs innovative data that have no direct link to the training data (unlike, for example, in the autoencoder architecture). In other words, deep convolutional networks in the GAN architecture need to learn to output data from some random distribution.

Beguš (2020c) argues that deep convolutional networks in the GAN architecture encode discretized phonetic and phonological representations in the latent space. A computational experiment is conducted on a GAN implementation for audio (as proposed in Donahue et al. 2019, based on Radford et al. 2015) by training the networks on a phonologically local allophonic distribution in English, where voiceless stops surface as aspirated word-initially before a stressed vowel (e.g. in [ˈpʰɪt] 'pit'), except if a sibilant [s] precedes the stop (e.g. in [ˈspɪt] 'spit'). The network learns the allophonic distribution and encodes phonetically and phonologically meaningful features in its latent space.

Based on this local allophonic distribution, Beguš (2020c) proposes a technique for identifying and manipulating variables in the latent space of the GAN architecture that correspond to desired phonetic and phonological representations. Beguš (2020c) argues that the network uses a subset of latent variables to encode the presence of a sound in the output (e.g. [s]). By manipulating the identified variables, especially well beyond the training range (as proposed in Beguš 2020c), we can actively force the sound in and out of the generated outputs.
Moreover, a linear interpolation of the chosen latent variables from marginal values results in an almost linear reduction of the amplitude of the frication noise of [s], a linguistically meaningful unit (Beguš, 2020c).

The goal of this paper is to argue that, using the technique proposed in Beguš (2020c), we can model not only simple allophonic processes, such as English deaspiration, but also local and non-local phonological processes that are conditioned by what would be approximated as morphology (morphophonological alternations) and that resemble rule-like behavior. We also argue that we can parallel human behavioral experiments with the performance of deep convolutional networks trained on the same data as used in the behavioral experiments. In general, natural languages strongly prefer local over non-local processes, both in phonology and on other levels such as morphology and syntax (Finley, 2011, 2012; McMullin and Hansson, 2019; White et al., 2018). In fact, the vast majority of phonological processes in the world's languages are local (Finley, 2011), with only a few processes, such as harmony, operating on non-adjacent sounds. Behavioral experiments show that local processes are easier to learn than non-local processes (Finley, 2011, 2012; McMullin and Hansson, 2019; White et al., 2018). In this paper, we test the learning of local and non-local phonological dependencies and show that local processes (such as postnasal or intervocalic devoicing) are easier for the networks to learn than non-local vowel harmony. We parallel success rates in the computational model to behavioral data: an artificial grammar learning experiment in which human subjects are trained on the same data (Section 4).
This type of combination of artificial grammar learning experiments and computational models has the potential to reveal similarities in learning biases between human subjects and deep convolutional networks, and to shed light on how domain-general learning biases that require no language-specific mechanisms can result in the typological prevalence of local processes and the rarity of non-local processes.

Specifically, we test the learning of non-local vowel harmony and several devoicing patterns. Vowel harmony is a phonological process, usually non-local, in which a vowel becomes more similar to another vowel in a word. For example, the plural morpheme in Turkish surfaces as [lɑr] after root vowels that are back and as [lɛr] if the root vowel is front (Kabak, 2011): [dɑl-lɑr] 'branches' and [jer-ler] 'places' (Kabak, 2011).

In formal phonological analysis, phonological computation is formalized with rewrite rules that operate as symbolic feature manipulation (Chomsky and Halle, 1968). As argued by Marcus et al. (1999) and several other works (Chomsky and Halle 1968; Heinz 2010; Berent 2013, i.a.), "algebraic rules" are required to derive a set of surface outputs such as Turkish [dɑl-lɑr] and [jer-ler] from stored inputs. The stored mental representation of the suffix can be posited as /lar/. The role of phonological grammar is to derive the two surface forms (outputs) from the stored mental representation (input).

Sounds are represented with matrices of binary features that distinguish meaning (e.g. [+syllabic, +front] means a front vowel). Vowel harmony can be formalized with a simple rewrite rule (in 1) that identifies vowels ([+syllabic]) and assigns them the same value (α) of the feature [±front] as in the vowel that follows (with any number of intervening consonants, C₀). The formalism is illustrated in (1).

[+syllabic] → [α front] / __ C₀ [α front]     (1)

The discussion of symbolic representation vs.
connectionism has a long tradition in phonology. An influential proposal, Optimality Theory, models phonology as an input-output pairing rather than as a rule-based symbolic representation (Prince and Smolensky, 1993/2004; Legendre et al., 1990). Optimality Theory was directly influenced by earlier work on connectionism. Vowel harmony within this framework is modeled with the Agreement-by-Correspondence proposal (Hansson, 2010; Rose and Walker, 2004): two sounds (such as the two vowels [ɑ] in Turkish [dɑl-lɑr]) are in correspondence and share features, which, through surface optimization in the grammar, results in a harmony process. Several independent facts support the approach of input-output optimization in phonology. However, both Optimality Theory and other proposals in phonology using neural networks (McClelland and Elman, 1986; Gaskell et al., 1995; Plaut and Kello, 1999) model local and non-local phonology with pre-assumed levels of abstraction, meaning that learning is not modeled from raw acoustic data but from already pre-discretized inputs, or requires language-specific mechanisms.

We argue that approximations of rule-based behavior emerge in deep convolutional networks even without any pre-assumed levels of abstraction (the networks are trained on raw acoustic inputs) and when the models contain no language-specific parameters. The network discretizes the representation of a prefix in the output and uses only one latent variable (out of 100) to encode the presence of the prefix. Equivalents of non-local phonological rules emerge from an interaction between the variable that represents the prefix and a variable that generates some desired phonological process. We also argue that the same data used for training in the GAN architecture can be used to test phonological learning in artificial grammar learning experiments with human subjects.
In fact, the paper argues that training GANs on relatively few data points yields, somewhat surprisingly, highly informative results (Section 3.1). This observation should open numerous opportunities for paralleling the performance of deep neural networks with the behavioral outcomes of artificial grammar learning experiments with human subjects. Finally, we outline a procedure for observing how the network learns dependencies as training progresses, and we claim that the Generator's searches through the space of phone-level combinations are linguistically interpretable (Section 3.2).
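To make the notion of symbolic feature manipulation concrete, the rewrite rule in (1) can be sketched as a small program. This is a toy illustration under an assumed feature encoding (the dictionaries, `apply_harmony`, and the `front: None` convention for underspecification are ours, not part of the paper's model): an underspecified prefix vowel copies the [±front] value of the first vowel that follows it.

```python
# Toy sketch of rule (1): [+syllabic] -> [alpha front] / __ C0 [alpha front].
# Hypothetical feature encoding; "front": None marks an underspecified vowel.
def apply_harmony(segments):
    out = [dict(s) for s in segments]
    next_front = None
    # Scan right to left so each vowel sees the frontness of the vowel
    # that follows it (any number of consonants may intervene).
    for seg in reversed(out):
        if seg["syllabic"]:
            if seg["front"] is None and next_front is not None:
                seg["front"] = next_front  # copy the alpha value
            next_front = seg["front"]
    return out

# A prefix vowel underspecified for [front] followed by a root whose first
# vowel [i] is [+front]: the prefix should surface as front.
word = [
    {"sym": "V", "syllabic": True,  "front": None},
    {"sym": "n", "syllabic": False, "front": False},
    {"sym": "l", "syllabic": False, "front": False},
    {"sym": "i", "syllabic": True,  "front": True},
    {"sym": "n", "syllabic": False, "front": False},
    {"sym": "O", "syllabic": True,  "front": False},
]
result = apply_harmony(word)
print(result[0]["front"])  # True: the prefix vowel harmonizes to [+front]
```

The point of the sketch is only that the rule is algebraic: it mentions variables over feature values (α), not specific sounds.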
2. Materials
The main characteristic of the Generative Adversarial Network architecture (Goodfellow et al., 2014), and more specifically of the DCGAN proposal by Radford et al. (2015), is two deep convolutional neural networks trained in a minimax setting. The Discriminator network learns to estimate the realness of the data and to minimize its own error rate. The Generator network learns to output data from a set of latent variables and to maximize the Discriminator network's error. Initially, the Generator produces noise, but as training progresses it becomes increasingly successful at outputting data, such that the Discriminator becomes less successful at distinguishing actual from generated data.

The majority of GANs are trained on two-dimensional visual data; a shift toward applying the architecture to the audio domain has occurred only recently with the work of Donahue et al. (2019) (WaveGAN). The model in Donahue et al. (2019), used for training here, is based on the DCGAN architecture (Radford et al., 2015) and retains most of the same hyperparameters. The two main differences are that the Generator involves an additional layer and takes a one-dimensional input that corresponds to approximately 1 second of audio. The cost function is taken from the Wasserstein GAN proposal with gradient penalty (WGAN-GP) (as proposed in Arjovsky et al. 2017 and Gulrajani et al. 2017). For all specifications of the model, see Donahue et al. (2019).

[Figure 1: The GAN architecture used in this paper, schematized from Goodfellow et al. (2014), Radford et al. (2015), and Donahue et al. (2019), with training data as described in Section 2.2. Latent input variables (n = 100, z ∼ U(−1, 1)) feed the Generator network G(z); the training data consist of prefixed and unprefixed items (e.g. [ˈbunɛ] ∼ [ɔmˈpunɛ], [ˈlinɔ] ∼ [ɛˈlinɔ]); the Discriminator network D(x) estimates whether a sample is generated or real, and the error is backpropagated to both networks.]

Beguš (2020c) proposes a technique for exploring learning representations in deep convolutional networks. First, variables that correspond to meaningful phonetic and phonological representations are identified. For example, the network is trained on sequences with and without [s] (e.g. [pʰæ] and [spæ]) and learns the conditional distribution: it mostly outputs a short VOT (no aspiration) if an [s] precedes the stop and a long VOT (aspiration) if no [s] precedes it. However, the Generator's outputs are not simply replications of its input: in about 12% of outputs, the stop after an [s] is aspirated ([spʰæ]) and the VOT duration is longer than in any training item. The technique then identifies individual latent variables (z; see Figure 1) that correspond to the presence of [s] in the output. Moreover, it is shown that the relationships between the individual latent variables (e.g. those identified as representing [s]) and the generated data are often linear, even when non-linear regression is used for testing.

Given this linear relationship, we can identify variables that correspond to a desired phonetic property and determine whether the property correlates with positive or negative values of the variable. Individual z-variables are uniformly distributed during training on the interval (−1, 1). Interpolating an identified variable to values well beyond this training range, e.g. toward −5 or 5, results in an increased amplitude of the desired phonetic representation. In other words, as we interpolate a variable identified as representing an [s] in the output, the amplitude of [s] increases or decreases. We can thus actively force a phonetic or phonological feature into the output. That the proposed technique indeed identifies variables corresponding to the presence of [s] is suggested by an independent generative test in Beguš (2020c). While explorations of latent space and representation learning in GANs have been conducted before on visual data (Radford et al., 2015), those proposals, to the author's knowledge, do not use single variables to explore their meaningful equivalents and do not utilize interpolation to extreme values beyond the training range.

Beguš (2020c) thus argues that the Generator network learns a local allophonic distribution as well as learns to encode phonetic and phonological representations with a subset of variables in the latent space. While the Generator network represents [s] in the latent space with a subset of variables in Beguš (2020c), the cutoff between variables associated with the presence of [s] and the rest of the latent space is not completely categorical. The Generator network does not associate the presence of [s] with a single variable: seven z-variables are associated with the representation of [s]. There is a notable cutoff between the regression estimates of the seven highest variables and the rest of the latent space, but the difference is not substantial or categorical. The training data in Beguš (2020c) are sliced from TIMIT (Garofolo et al., 1993) and are considerably more variable than the training data in this experiment. As is argued in Section 4, discretization of some phonological representations (e.g. the presence of the prefix) is substantial in the current experiment. It appears that less variable data result in more rapid discretization.
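The identify-then-interpolate technique can be sketched with a toy stand-in for the Generator. Everything here is an illustrative assumption (the real model is WaveGAN producing 1 s of audio, and the property of interest would be measured acoustically): we simulate a generator whose "[s] amplitude" depends mostly on one latent dimension, recover that dimension by per-variable linear regression, and then interpolate it beyond the training range.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained Generator: 100 latent variables -> a scalar
# "amplitude of [s]" that in fact depends mostly on latent dimension 17.
TARGET_DIM = 17

def generator_s_amplitude(z):
    return (3.0 * z[..., TARGET_DIM] + 0.1 * z[..., 3]
            + 0.01 * rng.standard_normal(z.shape[0]))

# 1) Sample latent vectors as in training: z ~ U(-1, 1), n = 100 dimensions.
z = rng.uniform(-1, 1, size=(2000, 100))
amp = generator_s_amplitude(z)

# 2) Regress the annotated property on each latent variable and rank
#    dimensions by the magnitude of the (linear) regression estimate.
slopes = np.array([np.polyfit(z[:, d], amp, 1)[0] for d in range(100)])
best = int(np.argmax(np.abs(slopes)))

# 3) Interpolate the identified variable well beyond the training range
#    (toward -5 and 5) while holding the rest of z constant.
z_probe = np.tile(rng.uniform(-1, 1, size=(1, 100)), (9, 1))
z_probe[:, best] = np.linspace(-5, 5, 9)
amps = generator_s_amplitude(z_probe)

print(best)               # dimension identified as encoding the property
print(amps[-1] > amps[0]) # amplitude grows along the interpolation
```

In the real experiment, step (2) runs over annotated generated outputs and step (3) feeds the manipulated z back through the Generator, so that the sound can be forced in and out of the output.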
The training data contain evidence for one non-local phonological process, vowel harmony, and four local processes: (i) post-nasal devoicing ([ˈbɑlu] ∼ [ɔmˈpʰɑlu]), (ii) post-nasal occlusion with devoicing of voiced fricatives ([ˈviɹə] ∼ [ɛmˈpʰiɹə]), (iii) intervocalic devoicing ([ˈbulɔ] ∼ [ɔˈpʰulɔ]), and (iv) intervocalic fricativization with devoicing ([ˈbɔɹə] ∼ [ɔˈfɔɹə]). These processes are triggered by prefixes; the training data thus contain bare (unprefixed) and prefixed forms of lexical items of the shape (prefix-)CVCV and (prefix-)CVC (C = consonant, V = vowel), e.g. [ˈɹinu] ∼ [ɛnˈɹinu]. The items are all nonce words in English, so that the same dataset can be used in the behavioral experiment with human subjects (Section 4).
2.2.1. Non-local vowel harmony

Non-local vowel harmony is triggered by the first vowel of the base (unprefixed) form and results in two different vowel qualities of the prefix, [ɛ] and [ɔ]. The descriptive generalization is the following: the vowel of the prefix is [ɛ] if the first vowel of the lexical item is [ɛ, i], and [ɔ] if the vowel is [ɑ, ɔ, u]. For example, a lexical item such as [ˈlinɔ] has the prefixed form [ɛnˈlinɔ], with a front vowel in the prefix [ɛn-], because the first vowel in the lexical item, [i], is front. A lexical item such as [ˈluru] has a prefixed form with [ɔn-], [ɔnˈluru], because the first vowel of the lexical item, [u], is not front. The experiment thus features a case of vowel harmony similar to the Turkish example (see Section 1).

The computational experiment presented here tests the learning of non-local vowel harmony. That the process tested here is phonologically non-local is clear from Table 1: the sounds in correspondence (the vowel of the prefix and the first vowel of the lexical item) are always separated by one or two consonants.

(Footnote: Radford et al. (2015) use averaging over z-variables in some cases and perform logistic regression on the second-to-last convolutional layer.)

2.2.2. Local processes

In addition to non-local vowel harmony, the training data contain evidence for four local processes that are triggered by the prefix. Two processes are triggered by the nasal in the VN- prefix. Sixteen unprefixed-prefixed pairs (32 items total) contain evidence for post-nasal devoicing (D → T / N __), where a voiced stop devoices if a nasal precedes it: [ˈbɑlu] ∼ [ɔmˈpʰɑlu]. In another 16 pairs (32 items total), a voiced fricative gets devoiced and occluded when a nasal precedes it (Z → T / N __): [ˈviɹə] ∼ [ɛmˈpʰiɹə]. The other two processes are triggered by the V- prefix. Evidence for intervocalic devoicing, where voiced stops devoice intervocalically (D → T / V __ V), is present in 16 unprefixed-prefixed pairs (32 items total), e.g. [ˈbulɔ] ∼ [ɔˈpʰulɔ]. Another 16 pairs (32 items total) contain evidence for intervocalic fricativization with devoicing, where voiced stops fricativize and devoice between vowels (D → S / V __ V; triggered by the prefix), e.g. [ˈbɔɹə] ∼ [ɔˈfɔɹə]. In the 54 remaining pairs (108 items total), no consonantal changes are present, e.g. [ˈjɑlu] ∼ [ɔˈjɑlu] and [ˈɹinu] ∼ [ɛnˈɹinu].

Because the learning of non-local processes is predicted to be more difficult than that of local processes, the training data contain substantially more evidence for the non-local process.
All items in which C₁ is constant, as well as those in which it changes, contain evidence for the non-local vowel harmony process. Of the 270 training items, there are 117 unprefixed items with 117 corresponding prefixed forms, all of which contain evidence for vowel harmony (234 total). The remaining 36 items include only unprefixed forms (for testing learning). There is thus a substantial difference in the amount of training data containing evidence for the non-local process (117 pairs, 234 items altogether) and for the four local processes (16 pairs each). Even if all four local processes are pooled together, the data still contain only 64 pairs with evidence for the local processes (128 items altogether). Table 1 illustrates the training data: each slot is filled with a transcribed example from the training data. The entire training data in IPA transcription are given in Appendix Tables A.3, A.4, A.5, A.6, A.7, A.8, and A.9.

In addition to the local and non-local processes described above, the data contain evidence for a local assimilation process which is somewhat less relevant to our experiment: if the prefix contains a nasal stop (VN-), the place of articulation of the nasal depends on the first consonant of the root (C₁). The nasal surfaces as labial [m] before the labials ([p] and [f]) and as alveolar [n] elsewhere. Spectral differences between the two conditions are minimal, which is why a detailed analysis of this process is not possible in the computational experiment; the main purpose of including this assimilation in the data is for the behavioral experiment to include an English-like process (so as not to raise the attention of the subjects) and to facilitate the reading task for the speaker who recorded the stimuli.

The computational experiment tests the learning of the local devoicing processes and the non-local vowel harmony that target the prefix (VN- or V-).
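The generalizations above can be summarized in a toy derivation of prefixed surface forms from bare roots. This is a sketch, not the dataset: ASCII stands in for IPA (e ≈ [ɛ], o ≈ [ɔ], a ≈ [ɑ]), stress and aspiration are omitted, and the process labels and helper names are ours.

```python
FRONT = {"e", "i"}

# Local changes to the root-initial consonant, keyed by process
# (toy labels; which process applies is a property of the lexical item).
CHANGES = {
    "postnasal-devoicing":          {"b": "p", "d": "t"},  # D -> T / N __
    "postnasal-occlusion":          {"v": "p", "z": "t"},  # Z -> T / N __
    "intervocalic-devoicing":       {"b": "p", "d": "t"},  # D -> T / V __ V
    "intervocalic-fricativization": {"b": "f", "d": "s"},  # D -> S / V __ V
}

def prefixed_form(root, nasal_prefix, process=None):
    c1 = root[0]
    out = CHANGES.get(process, {}).get(c1, c1) + root[1:]
    # Non-local harmony: prefix vowel copies frontness of the first root vowel.
    v1 = next(ch for ch in root if ch in "aeiou")
    pv = "e" if v1 in FRONT else "o"
    if not nasal_prefix:                      # V- prefix
        return pv + out
    # Nasal place assimilation: [m] before labials [p, f], [n] elsewhere.
    n = "m" if out[0] in "pf" else "n"
    return pv + n + out                       # VN- prefix

print(prefixed_form("balu", True, "postnasal-devoicing"))            # ompalu
print(prefixed_form("vire", True, "postnasal-occlusion"))            # empire
print(prefixed_form("bulo", False, "intervocalic-devoicing"))        # opulo
print(prefixed_form("bore", False, "intervocalic-fricativization"))  # ofore
```

The sketch makes the locality contrast explicit: the consonant changes depend only on the immediately adjacent prefix segment, while the prefix-vowel choice depends on a vowel one or two segments away.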
In order to control for potential effects of other segments on the learning of the targeted processes, we balance the experimental design as much as possible. The number of lexical items with a front vowel in V₁ is, in all but three pairs, equivalent for every C₁ condition. In other words, if there are four [d]-initial items that devoice and have frontness harmony (V₁ is front), there are also four items with backness harmony (V₁ is not front) for this condition. We also aim to balance the identity of C₂ and V₂ as much as possible, but balancing these positions is limited by the requirement that the items not be real words of English or too similar to real words (due to the artificial grammar learning experiment). Only [m, n, l, ɹ, s] can be members of C₂, and these, along with V₂, are relatively well balanced across the groups with changing C₁ (e.g. approximately equal numbers of the same consonants across voiced-initial items that devoice and those that undergo devoicing with fricativization or occlusion), but not across other groups. A fully balanced design is difficult to achieve given the different groups and the nonce-word requirement, but given the relatively well balanced design, we do not expect undesired dependencies to affect the learned distributions of interest.

(Footnote: There are two missing frontness harmony pairs in the non-changing [pʰ]- and [tʰ]-initial conditions and one missing backness harmony pair in the non-changing [l]-initial condition for the VN- prefix.)

Table 1: Examples of words used in training, in IPA transcription (unprefixed ∼ prefixed). Columns in the original layout correspond to the root-initial consonant: labial, coronal, [j], [l], [ɹ].

VN-, C₁ constant
  ɛ-harmony: ˈpʰimi ∼ ɛmˈpʰimi; ˈfimə ∼ ɛmˈfimə; ˈtʰɛlɔ ∼ ɛnˈtʰɛlɔ; ˈsɛnɔ ∼ ɛnˈsɛnɔ; ˈjim ∼ ɛnˈjim; ˈlɛn ∼ ɛnˈlɛn; ˈɹinu ∼ ɛnˈɹinu
  ɔ-harmony: ˈpʰɔɹɔ ∼ ɔmˈpʰɔɹɔ; ˈfuɹə ∼ ɔmˈfuɹə; ˈtʰɑɹu ∼ ɔnˈtʰɑɹu; ˈsɑnu ∼ ɔnˈsɑnu; ˈjɑlu ∼ ɔnˈjɑlu; ˈlɔɹ ∼ ɔnˈlɔɹ; ˈɹɔlɔ ∼ ɔnˈɹɔlɔ
VN-, C₁ changes
  ɛ-harmony: ˈbeɹə ∼ ɛmˈpʰeɹə; ˈviɹə ∼ ɛmˈpʰiɹə; ˈdɛlɔ ∼ ɛnˈtʰɛlɔ; ˈziɹə ∼ ɛnˈtʰiɹə
  ɔ-harmony: ˈbɑlu ∼ ɔmˈpʰɑlu; ˈvɔnə ∼ ɔmˈpʰɔnə; ˈdunə ∼ ɔnˈtʰunə; ˈzɔlɛ ∼ ɔnˈtʰɔlɛ
V-, C₁ constant
  ɛ-harmony: ˈpʰinə ∼ ɛˈpʰinə; ˈfini ∼ ɛˈfini; ˈtʰɛlɔ ∼ ɛˈtʰɛlɔ; ˈsɛnɔ ∼ ɛˈsɛnɔ; ˈjim ∼ ɛˈjim; ˈlinɔ ∼ ɛˈlinɔ; ˈɹɛlɛ ∼ ɛˈɹɛlɛ
  ɔ-harmony: ˈpʰɔmɔ ∼ ɔˈpʰɔmɔ; ˈfuɹə ∼ ɔˈfuɹə; ˈtʰɔmɔ ∼ ɔˈtʰɔmɔ; ˈsɑnu ∼ ɔˈsɑnu; ˈjɑm ∼ ɔˈjɑm; ˈluɹu ∼ ɔˈluɹu; ˈɹɑsɔ ∼ ɔˈɹɑsɔ
V-, C₁ changes
  ɛ-harmony: ˈbɛlə ∼ ɛˈpʰɛlə; ˈbɛmə ∼ ɛˈfɛmə; ˈdɛni ∼ ɛˈtʰɛni; ˈdɛmɛ ∼ ɛˈsɛmɛ
  ɔ-harmony: ˈbulɔ ∼ ɔˈpʰulɔ; ˈbɔɹə ∼ ɔˈfɔɹə; ˈdɑɹu ∼ ɔˈtʰɑɹu; ˈdɑlə ∼ ɔˈsɑlə

The 270 items described above were presented in a simplified transcription (see Appendix Tables A.3, A.4, A.5, A.6, A.7, A.8) and read by a single female speaker of American English. The words were of the shape CVC, CVCV, prefix-CVC, and prefix-CVCV. The prefixes were of the shape VN- and V-: [ɛn-], [ɔn-], [ɛm-], [ɔm-], [ɛ-], and [ɔ-] (see Appendix A Tables A.3, A.5, and A.7). The speaker was unaware of the exact objectives and details of the study and was compensated for her work. Recordings of the training data were made in a sound-attenuated booth using a USBPre 2 (Sound Devices) pre-amp and a Shure Beta 53 omnidirectional condenser head-mounted microphone in Audacity (originally sampled at 44.1 kHz and then downsampled to 16 kHz).

The data, in the form of sliced audio files for each item (approximately 1 s long, padded with silence), are fed to the model randomly in mini-batches of 64. The bare unprefixed and prefixed forms are not paired in any way during training.
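The data preparation just described can be sketched as follows. This is a minimal approximation with synthetic audio standing in for the sliced recordings; the constants and helper names (`pad_clip`, `minibatches`) are ours, not the paper's code.

```python
import numpy as np

SAMPLE_RATE = 16000          # training audio is downsampled to 16 kHz
CLIP_SAMPLES = SAMPLE_RATE   # ~1 s per item, padded with silence
BATCH = 64

def pad_clip(audio):
    """Zero-pad (or trim) a mono recording to exactly 1 s."""
    out = np.zeros(CLIP_SAMPLES, dtype=np.float32)
    n = min(len(audio), CLIP_SAMPLES)
    out[:n] = audio[:n]
    return out

def minibatches(items, rng):
    """Shuffle all items and yield random mini-batches of 64."""
    order = rng.permutation(len(items))
    for start in range(0, len(items) - BATCH + 1, BATCH):
        idx = order[start:start + BATCH]
        yield np.stack([items[i] for i in idx])

# 270 synthetic items of varying length stand in for the sliced recordings.
rng = np.random.default_rng(1)
items = [pad_clip(rng.standard_normal(rng.integers(8000, 16000)))
         for _ in range(270)]
batches = list(minibatches(items, rng))
print(len(batches), batches[0].shape)  # 4 (64, 16000)
```

Note that nothing in the batching pairs an unprefixed form with its prefixed counterpart: the model only ever sees an unordered stream of items.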
3. Results
One advantage of the GAN architecture is that the Generator network outputs innovative data that are linguistically interpretable (Beguš, 2020c). Innovative outputs are often sporadic and do not allow for a full quantitative analysis, which nonetheless does not make them less informative. It is important to describe innovative outputs and how they can inform us about the learning of speech data in deep convolutional networks. In Sections 3.1 and 3.2 we present results from an exploratory study of the network's innovative outputs based on an acoustic analysis of spectra. In Sections 3.3, 3.4, and 3.5 we present a quantitative analysis of the generated outputs.
The network is trained on a total of 270 unique data points (audio recordings of the words with the structure described in Section 2.2). Despite the small amount of training data, the model generates outputs that closely resemble human speech and are interpretable, analyzable, and highly informative. This stands in contrast to some recent studies of neural network models on the syntactic level that require very large training datasets and do not improve substantially with more data (van Schijndel et al., 2019). As is argued below, the GANs do not overfit, but produce innovative data that are linguistically interpretable despite the small training dataset. This finding should open up numerous possibilities for further exploration of learning representations in deep convolutional networks: it is generally assumed that GANs and deep convolutional networks require large amounts of data, which could be prohibitive for research questions that require smaller training datasets.

We analyze outputs of the Generator network at four training steps: after 7453, 9740, 14900, and 20990 steps. For example, the network generates an innovative output [ˈdinɔ], yet the training data lack this sequence altogether. The closest neighbor to the innovative [ˈdinɔ] in the training data is [ˈdɛnɔ] (see Figure A.11). There are numerous other such generated outputs that violate the training data but are linguistically valid and interpretable. For example, 23.2% of outputs violate the training data with respect to vowel harmony (see Section 3.4). Innovative outputs that violate training-data distributions in linguistically interpretable ways constitute strong evidence against overfitting in the GAN architecture: even with very small datasets and a relatively high number of epochs, the Generator does not overfit. This is in line with previous evidence that GANs generally do not overfit (Adlam et al., 2019; Donahue et al., 2019), but here we additionally argue that GANs do not overfit even with small training datasets (N = 270).
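A transcription-level check of the harmony generalization can be sketched as follows. This is a toy checker over hypothetical transcribed outputs (the paper's actual annotation procedure was acoustic): it flags a prefixed form whose prefix vowel disagrees in frontness with the first root vowel.

```python
FRONT = {"ɛ", "i"}
BACK = {"ɑ", "ɔ", "u"}

def violates_harmony(transcription):
    """True if the prefix vowel (first vowel) disagrees in frontness with
    the first root vowel (second vowel). Toy sketch, stress marks omitted."""
    vowels = [ch for ch in transcription if ch in FRONT | BACK]
    if len(vowels) < 2:
        return False
    prefix_v, first_root_v = vowels[0], vowels[1]
    return (prefix_v in FRONT) != (first_root_v in FRONT)

# Hypothetical transcriptions of generated outputs; the last one violates
# harmony (back prefix vowel [ɔ] before front root vowel [i]).
outputs = ["ɛnlinɔ", "ɔnluru", "ɔnlinɔ"]
rate = sum(map(violates_harmony, outputs)) / len(outputs)
print(rate)
```

A rate computed this way over all annotated prefixed outputs is what a figure like the reported 23.2% harmony-violation rate summarizes.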
One advantage of the exploratory study of GAN outputs is that we can follow how dependencies in speech are learned by the network at different training steps. We propose that the progression of learning can be observed by keeping the latent space constant and generating data at different training stages of the Generator network. This provides crucial information on how the number of training steps influences the Generator's outputs and learning representations, an area that is relatively understudied. Testing the effect of training steps on learning representations using speech data should reveal further insights into neural network interpretability, as is argued below.

We propose that by analyzing generated outputs at different training steps with the latent space kept constant, we can actively follow how the network corrects outputs that violate distributions in the data. For example, at 7453 steps, the network generates an innovative output that violates the training data: [ˈbɛnɔ]. At 9740 training steps, the network outputs [ˈbɛmɔ] for the same latent space variables. This output still violates the data: none of the words in the training data is of the exact shape [ˈbɛmɔ]. At 14900 steps, the network outputs [ˈbɛɹɔ] (for the same latent space), which corresponds to [ˈbɛɹɔ] in the training data (Figure 2).

Figure 2: (left) Three generated samples ([ˈbɛnɔ], [ˈbɛmɔ], [ˈbɛɹɔ]) with the same values of latent space variables. (right) Four generated samples ([ˈzilɔ], [ˈsulɔ], [ˈfulɔ], [ˈtʰulɔ]) with the same values of latent space variables at four training steps, showing devoicing, change of place of articulation, and occlusion.

In a related example, the proposed method allows us to follow how the network searches through the space of possible segment combinations using linguistically valid strategies. Figure 2 shows an output [ˈzilɔ] for which there is no direct equivalent in the training data. The spectrogram shows a clear voicing bar and frication noise in the high frequencies, characteristic of a [z]. At 9740 steps, the network devoices the initial consonant, but keeps its frication noise, and also changes the high front vowel [i] to a back vowel [u] for an output [ˈsulɔ]. This output is likewise not attested in the training data. Finally, at 14900 steps, the network transforms the frication noise from a higher to a lower kurtosis that corresponds to a labial fricative [f] in the training data ([ˈfulɔ]). At 20990 steps, it appears as if the network is introducing a period of aspiration noise and turning the fricative into a stop with the same following sequence: [ˈtʰulɔ]. None of these outputs is attested in the training data, but the example illustrates that the Generator searches for segment combinations with valid phonological processes in human language, such as devoicing or a change in place of articulation.

Using this technique, we can observe not only how the network repairs distributional violations, but also how it searches through the space of possible segment combinations to repair violations of phonological rules in the data. Because the error rate of local phonological processes is relatively low in the output data (1.8% at 20990 steps), the study of how the network repairs outputs that violate phonological processes can only be exploratory at this point. An example that illustrates how learning progress can be directly observed with this method is given in Figure 3. At 7453 training steps, the Generator outputs [ɛˈzɑɹɔ], which violates both the local process of devoicing after a prefix and the non-local vowel harmony process. At 9740 steps, the second formant of the prefix vowel ([ɛ]) substantially weakens and the formant structure of a back [ɔ] emerges, which means the network repairs the harmony violation. At 14900 steps, voicing in the fricative disappears from the output, which means the output now conforms to the devoicing rule in the training data.
In other words, [z], which violates the phonological rule of devoicing after a prefix, devoices to [s], which conforms to the training data. At 14900 steps, the output thus fully conforms to the distributions in the training data with respect to both harmony and devoicing: [ɔˈsɔlɔ] (Figure 3). The output, while conforming to the rules of the training data, is still innovative: none of the training inputs contains exactly this sequence. The spectrograms in Figure 3 illustrate how, at different training steps, the network applies learning representations in its continuous outputs that correspond to phonological processes in natural language: devoicing and vowel lowering.

Figure 3: Changes in outputs with the same latent space values across training steps: [ɛˈzaɹɔ], [ɔˈzaɹɔ], [ɔˈsɔlɔ].

To test how the network encodes prefixation in its latent space, we used the technique described in Beguš (2020c) and Section 4 to identify dependencies between the latent space and generated data. 500 outputs of the Generator network trained after 20990 steps were transcribed and annotated for the presence of the prefixes V- and VN-. (All acoustic analyses are performed in Praat; Boersma and Weenink, 2015.) The number of steps for this analysis was chosen based on the analysis of the progression of learning in Section 3.2: it appears that a number of disharmonic outputs are repaired by 20990 steps, and further training ceases to repair disharmonic outputs. That the network is successful in outputting data that approximates the human speech in the training data is suggested by the fact that the author was unable to reliably transcribe only 25 of the 500 outputs (5%). The data were fit to a Lasso logistic regression model with the presence of the prefix as the dependent variable and the 100 latent variables of the Generator network as predictors (with the glmnet package; Simon et al. 2011). Lambda values were estimated with 10-fold cross-validation.
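Once the Lasso estimates are in hand, the "single variable" claim amounts to sorting the absolute coefficients and locating the largest drop between consecutive estimates. A minimal stdlib sketch of that selection step follows; the synthetic coefficients and the index 42 are hypothetical, and the actual estimates come from the glmnet fit in R, not from this code:

```python
import random

def variables_above_drop(coefs):
    """Return the indices ranked above the largest gap between
    consecutive absolute coefficients sorted in decreasing order."""
    ranked = sorted(range(len(coefs)), key=lambda i: -abs(coefs[i]))
    gaps = [abs(coefs[ranked[k]]) - abs(coefs[ranked[k + 1]])
            for k in range(len(ranked) - 1)]
    k = max(range(len(gaps)), key=gaps.__getitem__)
    # Variables ranked 0..k sit above the drop; a single surviving
    # index mirrors the paper's "single latent variable" criterion.
    return ranked[:k + 1]

# 100 latent variables; one dominant coefficient (index 42, hypothetical).
random.seed(0)
coefs = [random.gauss(0, 0.05) for _ in range(100)]
coefs[42] = 2.3
selected = variables_above_drop(coefs)
print(selected)  # a single index survives the drop
```

The same drop criterion is applied again in Section 3.5 when identifying the variable associated with vowel frontness.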
Estimates in Figure 4 suggest that the network uses a single latent variable to encode the presence of the prefix in the output: there is a clear and substantial drop in regression estimates between this variable and the rest of the latent space (the other 99 z-variables). Such a substantial drop in regression estimates suggests that the network discretizes the representation of the prefix into a single latent variable.

To test the effect of this variable on generated data, we generate 100 outputs with its value set at −4.5. When it is set to the opposite value (4.5), only 1 out of 100 generated samples (1%) contains a prefix. This generative test suggests that the network encodes the presence of the prefix in the output as a single variable in its latent space. By manipulating this feature, we can actively control the presence of the prefix in the output. (For a generative test showing that regression estimates indeed identify variables that correspond to a given phonetic/phonological representation, see Beguš 2020c.)

Figure 4: Absolute Lasso logistic regression estimates of a model with the presence of the prefix as the dependent variable and the values of the 100 z-variables as independent predictors. The estimates are sorted in decreasing order.

                   VN-             V-         Total
                front  back    front  back
Harmonious        53    31       47    31      162
Non-harmonious    21     6       15     6       48
% Harmonious    71.6  83.8     75.8  83.8     77.1

Table 2: Raw counts of harmonious and disharmonious outputs of the Generator network across the two prefixes and vowel quality levels (front vs. back).
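The raw counts in Table 2 support some back-of-envelope checks: a normal-approximation test of the harmonious proportion against chance, and the odds ratio comparing the non-local and local error rates. The paper's formal tests are regression models fit in R, so the numbers below only approximate the reported statistics (in particular, the profile-likelihood CI behind the reported OR is not reproduced here):

```python
import math

# Counts from Table 2: 162 harmonious vs. 48 disharmonious prefixed outputs.
harm, disharm = 162, 48
n = harm + disharm
p_hat = harm / n

# Normal-approximation z-test against chance (50% harmonious).
z = (p_hat - 0.5) / math.sqrt(0.25 / n)

# Odds ratio comparing the non-local error rate with the local error
# rate (3 violations out of 168 prefixed outputs with a stop/fricative).
local_err, local_ok = 3, 165
or_ = (disharm / harm) / (local_err / local_ok)

print(round(p_hat, 3), round(z, 2), round(or_, 1))
```

The resulting proportion (77.1% harmonious) and the order of magnitude of the z statistic and odds ratio are consistent with the regression-based values reported in the text.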
The training data contains evidence for local and non-local phenomena. Devoicing and occlusion after the prefixes V- and VN- are local; vowel harmony is non-local, as one or two segments intervene between the target and the corresponding vowel.

To test error rates of the output data, 500 outputs from the Generator network trained after 20990 steps were analyzed. 211 outputs (42.2%) were analyzed as involving a prefix VN- or V-. (In one output excluded from the analysis, the prefix vowel is analyzed as [ɑ].) Of the 211 prefixed outputs, 162 (or 76.8%) were analyzed as harmonious. Harmonious responses are consistently more frequent than non-harmonious ones, both for front and back V2 and across the two prefixes, V- and VN-. The distribution of the harmonious and disharmonious outputs across front and back triggering vowels and across the two prefixes is given in Table 2.

To test whether the Generator's higher rates of harmonious responses are significantly above chance, we fit the data to a linear logistic regression model with harmonious responses as the dependent variable (coded as successes) and vowel frontness (with two sum-coded levels, front and back) and prefix identity (with two sum-coded levels, V- and VN-) as the independent variables with their interaction. Harmonious responses are significantly more frequent than disharmonious responses at the means of all predictors (β = 1. , z = 7. , p < . ).

Figure 5: Waveforms and spectrograms (0–8000 Hz) of three outputs of the Generator network trained after 20990 steps, [ɛˈzɛnə], [ɛˈbɑjə], and [ɔˈvɑlu], that violate the training data distributions with respect to the local processes of fricative and stop devoicing.
The violations are linguistically interpretable: the prefix vowel in the non-harmonious condition is not of random formant structure, but consists of formants characteristic of [ɔ].

Local processes are substantially more frequent in natural languages and easier to learn than non-local processes. To test whether such a distribution also emerges in deep convolutional networks, we can compare the error rate of the non-local process and the error rate of the local processes in the generated outputs. Out of 168 prefixed outputs containing a stop or a fricative, only three (1.8%) violate the devoicing rule in the training data, by which stops and fricatives are always voiceless in prefixed forms: e.g. [ɛˈzɛnə], [ɛˈbɑjə], and [ɔˈvɑlu] (spectrograms in Figure 5). This error rate is significantly lower than the error rate of the non-local process (OR = 16.5 [5.2, 84.6], p < . ).

In the framework of symbolic representations, vowel harmony can be derived with an algebraic rule (as in 1). The harmony of the prefix vowel ([ɛ]/[ɔ]) is triggered by the following vowel V2: a rule sets the feature [±front] in the vowel of the prefix according to the value of the same feature in the following vowel (see the formalism in 1). Alternatively, the grammar can also operate on a morphophonological level: a prefix, as a morphological unit, can be chosen based on the value of the following vowel.

We propose here that, using the technique in Beguš (2020c), we can elicit such rule-like behavior in deep convolutional neural networks.
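The algebraic rule sketched in (1) can be stated in a few lines of symbolic code. The sketch below assumes the vowel classes used in the paper (front [ɛ, i] vs. back [ɑ, ɔ, u]) and the prefix alternants ɛ(n)-/ɔ(n)-; the function names are illustrative, not the paper's implementation:

```python
# Toy symbolic statement of the [±front] harmony rule in (1): the
# prefix vowel agrees in frontness with the first stem vowel (V2).
FRONT = set("ɛi")
BACK = set("ɑɔu")

def harmonize_prefix(prefix_shape, stem):
    """prefix_shape: 'V' or 'VN'; stem: segment string whose first
    vowel is the harmony trigger V2."""
    trigger = next(seg for seg in stem if seg in FRONT | BACK)
    prefix_vowel = "ɛ" if trigger in FRONT else "ɔ"
    nasal = "n" if prefix_shape == "VN" else ""
    return prefix_vowel + nasal + stem

print(harmonize_prefix("V", "lirɔ"))   # front trigger [i] -> ɛ-prefix
print(harmonize_prefix("VN", "tɔlɔ"))  # back trigger [ɔ] -> ɔn-prefix
```

On the stimulus [ˈlirɔ] discussed in Section 4, this rule yields the harmonious plural eliro; the question pursued below is whether an analogue of this computation emerges in the Generator's latent space.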
Figure 6: (a) Absolute Lasso logistic regression estimates of a model with the presence of front triggering vowels V2 as the dependent variable and the values of the 100 z-variables as independent predictors. The estimates are sorted in decreasing order. (b) Absolute Lasso logistic regression estimates of a model with the presence of back triggering vowels V2 as the dependent variable and the values of the 100 z-variables as independent predictors. The estimates are sorted in decreasing order.

The analysis in Section 3.3 suggests that the Generator learns to associate a single latent variable with the presence of a prefix. There is a substantial drop in regression estimates after the estimate for this variable, which suggests that the network discretizes the continuous phonetic input and uses a single variable to encode the presence of some phonetic/phonological material that corresponds to a morphological unit: a prefix. To elicit rule-like behavior, we can identify another variable in the latent space: the variable that corresponds to the frontness/backness of the vowel V2. To identify such a variable, we generated 500 outputs and annotated them for the frontness of the vowel V2. We fit the data to two linear logistic regression models: one in which outputs with a front vowel (V2) [ɛ, i] are coded as successes and another in which [
ɑ, ɔ, u] are coded as successes. The independent variables are the values of the 100 latent variables z randomly sampled for each of the 500 annotated generated outputs. The models are fit using the glmnet package (Simon et al., 2011) in R (R Core Team, 2018). Lambda values are estimated with 10-fold cross-validation. Estimates of the two models are given in Figure 6.

Both models uniformly point to the same latent variable as the one most strongly associated with the frontness of the triggering vowel V2. The regression estimates again suggest that the Generator network learns to encode vowel frontness with a single latent variable: there is a substantial drop in estimates after this single variable. Its negative values correspond to the presence of front [ɛ, i] in V2, while positive values correspond to the presence of back [ɑ, ɔ, u] (the estimates in Figure 6 are in absolute values).

To elicit rule-like behavior, we force the prefix in the input and simultaneously force the vowel V2 to turn from a front vowel [ɛ, i] into a back vowel [
ɑ, ɔ, u]. To achieve this effect, we simultaneously manipulate the latent variable that encodes the presence of the prefix (setting it to a fixed value that forces prefixation) and the variable that encodes the frontness of the vowel (interpolating it from −6 to 6). (That the two variables are consecutive in the latent space is a coincidence.) If the Generator network learned vowel harmony, then the vowel of the prefix should change together with the forced change of vowel quality. Such behavior would parallel rule-based computation: setting a single variable to a value that forces prefixation in the output and manipulating the variable that changes the conditioning environment (V2) results in a process that changes the target vowel according to the condition. That the first variable indeed causes the prefix in the output is suggested by the count of prefixed forms in the output: 634 out of 780 generated samples (or 81.3%) were analyzed as featuring a prefix (for an independent test of the effect of this variable on the presence of the prefix, see Section 3.3). That the second variable indeed changes the triggering vowel V2 from a front [ɛ, i] to a back [
ɑ, ɔ, u] is strongly suggested by the generated outputs. We annotated the 634 prefixed forms from the 60 sets of generated interpolated outputs for the frontness or backness of the triggering vowel V2. We fit the annotated data to a generalized additive mixed logistic regression model (GAMM; Wood 2011) with an intercept and thin-plate smooths that estimate how the presence of a front or back vowel in the output changes with the interpolated values. A random smooth for each trajectory (each of the 60 generated sets) is added to the model. Figure 7 suggests that the interpolated variable causes the triggering vowel to change from a front vowel at values in the negative range to a back one at positive values. The relationship appears to be linear even when the model does not carry an assumption of linearity (GAMM). If we refit the data with a linear logistic mixed effects regression (with a random intercept for trajectory and by-trajectory random slopes), we get a significant negative correlation between the values of the interpolated variable and vowel frontness (β = − . , z = − . , p < . ): fitted proportions of front vowels in the output change from almost 100% at one end of the spectrum to 0% (or 100% back vowels) at the other end.

To test whether the prefix vowel is harmonious even when the variable changing the triggering vowel is interpolated, we annotated the 634 prefixed forms from the 60 sets for the frontness of the triggering vowel V2 and for vowel harmony. The data are annotated for harmony (successes vs. failures) and fit to a generalized additive mixed effects logistic regression model. The independent variables are the frontness of the vowel (treatment-coded with back as the reference level) and thin-plate smooths for the values of the interpolated variable, as well as by-trajectory random smooths. The estimates of the parametric term suggest that the prefix vowel is harmonious both for front and back triggering vowels V2. Harmonious outputs with a back triggering vowel V2 ([ɑ, ɔ, u]) are significantly more frequent than non-harmonious outputs: β = 1. , z = 4. , p < .
Estimates for harmonious responses with front triggering vowels V2 are not different from the estimates for back vowels. This is confirmed if we refit the model with a sum-coded frontness factor (β = 1. , z = 6. , p < . ). The smooths suggest a slight negative trend for harmony in the front vowel condition at higher values of the interpolated variable and a slight positive trend for harmony in the back vowel condition, although the estimates for the smooths are not significant. This likely results from the trend that we observe in the data: as we force the triggering vowel away from front (by interpolating the variable from its negative values towards the middle of the scale), we get a higher proportion of disharmonious outputs, because the underlying value of the triggering vowel is apparently not "strongly" front or back. As the value of the variable increases towards 6 and the back vowel is forced more strongly in the output, we get a higher proportion of harmonious outputs again (now, of course, with back vowel harmony). Figure 8 illustrates the gradual change of the forced prefix from front (containing an [ɛ]) to back (containing an [ɔ]) as the interpolated variable changes the vowel V2 from front to back. While the estimates of the effects are significant, the trends are not categorical: occasionally, the vowel does not change from front to back and, more rarely, the trends are reversed. In other words, as we force a change of the triggering vowel quality from front to back with a single latent variable, the prefix (also forced with a single variable) automatically changes in order to remain harmonious.

Figure 7: (a) Fitted values and 95% CIs of a generalized additive mixed effects logistic regression model with the front vs. back triggering vowel (V2) value as the dependent variable and thin-plate smooths for the values of the interpolated variable as the independent variable (with random smooths for each of the 60 generated sets). The estimates show that the variable causes a change from a front to a back vowel as its values are interpolated from −6 to 6 and that the relationship between its values and the frontness/backness of the vowel is linear. The regression estimates are in Appendix Table A.12. (b) Fitted values and 95% CIs of a linear mixed effects logistic regression model with the front vs. back triggering vowel (V2) value as the dependent variable and a random intercept for each of the 60 trajectories. The plot illustrates how the percentage of front vowels decreases as the value of the variable increases (and vice versa for back vowels). (c) Fitted values and 95% CIs of a generalized additive mixed effects logistic regression model with harmonious (success) vs. disharmonious (failure) outcome as the dependent variable, vowel frontness as a parametric predictor, thin-plate smooths for the two levels of frontness (front vs. back) across the values of the interpolated variable, and random smooths for each of the 60 sets of generated outputs. Estimates of the model are given in Appendix Table A.13.

The deep convolutional network thus appears to represent what approximates rule-like computation in phonology: as we force the prefix in the output and change the quality of the triggering vowel from front to back by manipulating only two latent variables, vowel harmony emerges automatically. The appearance of rule-based computation is not categorical but, as is always the case in connectionist models, probabilistic, and other features can change alongside the observed changes.
This is, to the author's knowledge, the closest approximation of rule-based phenomena, especially considering that the models contain no language-specific mechanism and are trained in an unsupervised manner on raw acoustic data.
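The latent-space manipulation of Section 3.5 can be sketched procedurally: hold the prefix variable at a fixed value, sweep the frontness variable across a range, and generate one output per interpolated vector. In the sketch below the latent indices (16, 17) and the fixed value (−4.5) are hypothetical placeholders, and the Generator itself is only stubbed; the 60 sets of 13 interpolated vectors do, however, match the 780 generated samples reported above:

```python
import random

LATENT_DIM = 100
PREFIX_DIM, FRONTNESS_DIM = 16, 17   # hypothetical indices
PREFIX_VALUE = -4.5                  # hypothetical value forcing a prefix

def interpolation_grid(n_sets=60, lo=-6, hi=6, steps=13, seed=1):
    """Build n_sets x steps interpolated latent vectors: each set shares
    one random base vector; the frontness variable sweeps lo..hi while
    the prefix variable stays fixed."""
    rng = random.Random(seed)
    grid = []
    for _ in range(n_sets):
        base = [rng.uniform(-1, 1) for _ in range(LATENT_DIM)]
        base[PREFIX_DIM] = PREFIX_VALUE
        for k in range(steps):
            z = list(base)
            z[FRONTNESS_DIM] = lo + k * (hi - lo) / (steps - 1)
            grid.append(z)
    return grid

grid = interpolation_grid()
print(len(grid))  # 780 latent vectors, as in Section 3.5
# Each vector would then be passed to the trained Generator, e.g.:
# audio = generator(z)   # placeholder; the Generator is not defined here
```

Each of the 60 trajectories in the GAMM analysis corresponds to one such set of 13 vectors that differ only in the interpolated dimension.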
4. Paralleling neural networks and artificial grammar learning experiments
To parallel the performance of the computational experiment with results from a behavioral experiment, we combine novel data presented here for the first time with the results of an experiment in Beguš (2020b). The subjects were trained on the same data as used in the computational experiment, but divided into two separate experiments: one in which subjects were trained on data with the VN- prefix and another in which they were trained on data with the V- prefix. Subjects were recruited via Amazon MTurk, completed informed consent before participating, and were presented with experimental stimuli in Experigen (Becker and Levine, 2013). In the behavioral experiment, the unprefixed and prefixed forms are presented to subjects in pairs, where the prefixed form carries the function of a plural. Subjects were presented with a picture of a Martian creature: a single creature is associated with the unprefixed form; four creatures are associated with the prefixed form. The experimental interface is illustrated in Figure 9.

Subjects whose first language was not English or who had self-reported linguistic education were removed from the analysis. Altogether 333 subjects who provided 1987 responses on the vowel harmony test are analyzed.

The training phase in the VN- experiment consisted of 58 pairs of bare and prefixed forms. All examples were harmonious, and some included evidence for the local processes of post-nasal devoicing and of post-nasal devoicing and occlusion (as described in detail in Section 2.2 on the data used in the computational experiment).

Figure 8: Outputs with interpolated values of the variable that changes the triggering vowel V2 from a front [ɛ, i] to a back [ɑ, ɔ, u], with the prefix variable set at a value that forces the prefix in the output. (left) Three outputs at increasing values of the interpolated variable. The spectrogram shows how the formant structure of a front [ɛ] in the prefix changes to the formant structure of a back [ɔ] as the triggering vowel changes from a front [i] to a back [ɔ]. (right) Five outputs at increasing values from 0 to 4. The spectrogram again shows an automatic change of the prefix vowel consistent with the vowel harmony in the training data. Areas in squares indicate formant structures of interest.
In the V- experiment, the training phase consisted of 60 pairs of bare and prefixed forms, all of which contained evidence for harmony and some of which contained evidence for the local processes of devoicing and of devoicing and fricativization (see Section 2.2). All items used in the behavioral experiment are listed in Appendix Tables A.3, A.4, A.5, A.6, A.7, A.8, and A.9. For detailed exclusion criteria in the VN- condition, see Beguš (2020b); in the V- condition, we additionally excluded participants with non-unique Amazon MTurk IDs as well as those IDs that had already taken the VN- experiment.

After the training phase, the subjects were tested on six bare forms with C either [r] or [l] (three with a front V and three with a back one) and had to choose between harmonious and non-harmonious responses in a forced choice task (see Test – non-local in Figure 9), as well as between various local processes. For example, subjects were presented with a stimulus [ˈlirɔ], presented auditorily and orthographically, and had to choose between the plural forms eliro (harmonious) and oliro (non-harmonious), presented only orthographically. (That the results of the experiment are not heavily influenced by recruiting participants via Amazon MTurk is suggested by the fact that the vowel harmony outcomes are very similar to those of a related experiment with very similar training data that was performed in person under the supervision of a research assistant, with subjects recruited from the general public; Beguš 2020b.)

Figure 9: Experimental design (from Beguš 2020b) in the Experigen interface (Becker and Levine, 2013). The order of the training and the test phases is randomized, but the training precedes the test block. For the exact procedure of the experiment, see Beguš (2020b).
While the behavioral experiments do not directly test whether non-local processes are more difficult to learn than local processes (this has already been confirmed experimentally in several studies; see Finley 2011, 2012; McMullin and Hansson 2019; White et al. 2018), the local process is made more difficult to learn in the experiment: subjects were explicitly instructed to learn the (non-local) distribution of the prefixes (vowel harmony), but never told about the local processes. Moreover, the learning of local processes is tested exclusively with auditory stimuli. (In the test phase on local processes involving the prefix VN-, the subjects were presented with a plural form exclusively auditorily and had to choose between two possible singular forms: one consistent with devoicing and another consistent with devoicing and occlusion. In the V- condition, the subjects similarly chose between singular forms consistent with intervocalic devoicing or with intervocalic devoicing and fricativization.)

To test the learning of the non-local process in the behavioral experiment, the responses were fit to a linear mixed effects logistic regression model (lme4 package; Bates et al. 2015). First, we fit the full model with harmonic vs. non-harmonic responses (successes vs. failures) as the dependent variable, frontness of the vowel (front vs. back, sum-coded) and the shape of the prefix (VN- vs. V-, sum-coded) as the independent variables (with their interaction), and random intercepts for subject and item with by-subject and by-item random slopes for harmony. The final model was chosen based on the Akaike Information Criterion (AIC) by removing random slopes first and then interactions. The final model includes the frontness × prefix interaction and random intercepts for subject and item.

The results show that subjects learn the vowel harmony pattern from the training data (β = 0. , z = 5. , p < . ).
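The AIC-based backward selection described above amounts to comparing AIC = 2k − 2 ln L across nested fits, removing random slopes first and interactions second. A minimal illustration with hypothetical log-likelihoods follows (the actual models are lme4 fits in R; none of these numbers come from the paper):

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical nested models, ordered as in the text: drop random
# slopes first, then the interaction. Log-likelihoods are made up
# purely to illustrate the comparison.
candidates = {
    "full (slopes + interaction)":      aic(-1245.0, 9),
    "no random slopes, + interaction":  aic(-1246.5, 6),
    "no random slopes, no interaction": aic(-1250.2, 5),
}
best = min(candidates, key=candidates.get)
print(best)
```

With these illustrative values the middle model wins, mirroring the paper's final model, which keeps the frontness × prefix interaction but drops the random slopes.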
Figure 10: (a) Estimates and 95% CIs of the linear logistic regression model with harmonious responses of the Generator network as successes and vowel frontness and prefix identity as the independent variables with their interaction. (b) Linear mixed effects logistic regression estimates with harmonious responses of human subjects in the behavioral experiment as successes and vowel frontness and prefix identity as the independent variables with their interaction.

Harmonious responses are significantly more frequent than disharmonious responses at the means of all predictors, which suggests that subjects do learn the harmonious pattern. However, the error rate is quite high. The 95% profile CIs for the preference for harmonious responses are quite low, [57.6%, 69.2%], especially given that 234 of the 270 items are bare-prefixed pairs, each of which contains evidence for vowel harmony. All regression estimates are in Table A.11.

We can directly compare subjects' responses in the behavioral experiments with the outputs of the computational experiment. The Generator network violates local distributions in the data in only three out of 168 generated outputs with a prefix and a stop or a fricative (1.8%). On the non-local task, however, the Generator's error rate is substantially higher and similar to the error rate in the artificial grammar learning experiment conducted on human subjects. Figure 10 illustrates the similarity.

To be sure, there are substantial differences between the computational and behavioral experiments. First, the comparison is necessarily superficial, because this paper does not claim that humans learn phonological patterns in the same way as deep convolutional networks; this, however, does not preclude us from comparing their performance. Second, the number of epochs in the computational experiment is substantially higher than the number of exposures human subjects receive.

5. Discussion

This paper tests the learning of local and non-local processes in human speech with deep convolutional networks in the GAN architecture.
More specifically, we test the learning of non-local vowel harmony and local devoicing processes in a setting that approximates morphological and phonological processes in language: the model is trained on data with bare and prefixed forms in random order.

First, we argue that deep convolutional GANs output highly informative data despite being trained on extremely small datasets (N = 270) with a high number of epochs. The outputs are acoustically analyzable and linguistically interpretable. As has been shown before (Beguš, 2020c,a), the Generator outputs innovative data that violate the training data. These violations, however, are not random, but linguistically interpretable. 23.2% of outputs are disharmonious, which means they violate the training data distributions. But these violations are not arbitrary: the network outputs either a front [ɛ] or a back vowel [ɔ] in the prefix, consistent with the shape of the prefix in the training data. In only 5% of the annotated outputs are the data not linguistically interpretable. The innovative outputs also suggest that the Generator does not overfit despite the high number of epochs, in line with previous work on overfitting in GANs. The finding that GANs can be trained on very small datasets should open up several new possibilities for research on deep convolutional networks, speech, and internal representations in deep convolutional networks.

An exploratory study of the innovative outputs suggests that, in order to repair its violations of the data, the network uses strategies that approximate processes in human phonology: devoicing, occlusion, and redistribution of frication noise. We propose that these repairs can be directly followed as learning progresses by keeping the random variables constant while generating data from the network trained at different training steps.
Acoustic analysis of the outputs at different training steps in Sections 3.1 and 3.3 identifies strategies that the network uses to repair violations of data distributions.

One of the objectives of this paper is to explore how deep convolutional networks trained in the GAN framework on raw speech discretize linguistically meaningful representations in the latent space, especially with respect to non-local morphophonological processes. The raw acoustic data hearing human infants are faced with are continuous. Phonological computation discretizes the continuous space into discrete representations and manipulates these representations, which results in phonological processes such as vowel harmony. Using the technique in Beguš (2020c), we identify variables in the Generator's latent space that correspond to linguistically meaningful units, such as the presence of a prefix or the frontness of a vowel. Lasso regression estimates suggest that the network uses a minimal number of variables to represent the presence of a prefix in the output. In other words, the steep drop in the regression estimates after the variable with the highest estimate suggests that the network discretizes some continuous phonetic content in its internal space. The same is true for a phonetic feature such as the frontness of the first vowel in bare forms: the network appears to primarily use a single variable to encode this phonetic property of the outputs. An independent generative test suggests that manipulating this one variable on a linear scale, well outside the training range, controls the presence of the prefix in the output. We can thus force the prefix in the output and force the change of the triggering vowel from front to back with a linear interpolation of a single variable. The statistical tests in Section 3.5 suggest that the generated outputs remain harmonious in the majority of cases despite the change of the triggering vowel. In other words, rule-like vowel harmony emerges automatically in a deep convolutional network from an interaction of the variable that forces some morphophonological entity in the output (the prefix) and the variable that changes the triggering vowel. While harmonic outputs are significantly more frequent than non-harmonic outputs, the distribution is probabilistic rather than categorical. Another trend emerges from the statistical tests: the outputs are more likely to be non-harmonic in the transition period when the triggering vowel changes from front to back. It is likely that the relative strength of frontness and backness affects the rates of harmonic vs. non-harmonic outcomes. In other words, it appears that the prefix harmony is not triggered until the frontness/backness feature of the triggering vowel is strong enough, i.e. has a high enough latent variable value. That phonological features bear inherent weights (which can be conceptualized as strength or latent feature values in our model) has been argued before in the Optimality Theoretic framework (Smolensky and Goldrick, 2016; Smolensky et al., 2019).

Phonological computation has been shown to favor local processes over non-local processes. Many studies show experimentally that the learning of non-local processes is more difficult (Finley, 2011, 2012; McMullin and Hansson, 2019; White et al., 2018). This learning bias is also reflected in typology: the majority of phonological processes in the world's languages are local (Finley, 2011).
A clear preference for locality emerges in our computational experiment as well: despite substantially more evidence for the non-local process in the training data, the error rate is significantly higher for the non-local process in the Generator network. Whether the prevalence of some patterns in human speech results from articulatory factors (e.g. the articulation of sounds is most strongly affected by the immediately preceding or following sounds) or from learnability (e.g. the learning of non-local processes is more difficult) has been a focal topic of discussion in phonology, linguistics, and cognitive science in general. While this result does not offer an answer as to whether the preference for locality in typology results from learning or from a language's cultural transmission (Beguš, 2020b), it does provide evidence that locality preferences can be explained with domain-general cognitive mechanisms.

Because GANs trained on small datasets produce informative results, we can use the same stimuli for training deep convolutional networks and for artificial grammar learning experiments on human subjects. We compare data from a behavioral experiment that tested the learning of vowel harmony. The results show a similar error rate across the computational and artificial grammar learning experiments. It is true that the Generator network does not output vowel harmony categorically (as opposed to the local processes, which are near-categorical), but neither do the human subjects tested in the behavioral experiment perform at a categorical level. The training data in the two behavioral experiments contain evidence for vowel harmony in almost every training data point. Harmony in the training data is categorical: no training data points violate it, yet the error rate in human subjects is even higher than in the computational experiment.
This suggests that non-local processes are, from a learnability viewpoint, similarly costly for the deep convolutional network and for human subjects.

Training deep convolutional networks on well-understood dependencies in speech also offers insights into how the network searches through the space of possible segment combinations as training progresses. We propose that keeping the latent space constant while generating data from the Generator trained after different numbers of steps reveals that the networks use linguistically interpretable strategies for repairing outputs that violate the training data. Phonological processes such as devoicing, occlusion, and distribution of frication noise are observed, among others.

The results of the present experiment provide new information on internal representations in deep convolutional networks trained on raw speech, and bear on the long-standing discussion on symbolism vs. connectionism in cognitive science. The networks not only represent morphophonological units with discretized representations (resembling the morphological level), but also learn to encode morphophonological processes (resembling rule-like computation). An approximation of rule-like non-local generalizations in the data emerges from training a deep convolutional GAN. We provide evidence that human behavioral data superficially matches the outcomes of the computational model. Applying such an experiment to further data should yield a clearer picture of how rule-like generalizations emerge as interactions between variables in deep convolutional neural networks trained on raw speech data, and how performance and biases of deep neural networks correspond to human performance in behavioral experiments.

Acknowledgements
This research was funded by a grant to new faculty at the University of Washington and UC Berkeley, as well as by Harvard Mind Brain Behavior and the Department of Linguistics.
References
Adlam, B., Weill, C., Kapoor, A., 2019. Investigating under and overfitting in Wasserstein generative adversarial networks. arXiv:1910.14137.
Alishahi, A., Barking, M., Chrupała, G., 2017. Encoding of phonology in a recurrent neural model of grounded speech, in: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Association for Computational Linguistics, Vancouver, Canada, pp. 368–378.
Arjovsky, M., Chintala, S., Bottou, L., 2017. Wasserstein generative adversarial networks, in: Precup, D., Teh, Y.W. (Eds.), Proceedings of the 34th International Conference on Machine Learning, PMLR, International Convention Centre, Sydney, Australia, pp. 214–223. URL: http://proceedings.mlr.press/v70/arjovsky17a.html.
Bates, D., Mächler, M., Bolker, B., Walker, S., 2015. Fitting linear mixed-effects models using lme4. Journal of Statistical Software 67, 1–48.
Becker, M., Levine, J., 2013. Experigen – an online experiment platform. URL: http://becker.phonologist.org/experigen.
Beguš, G., 2020a. CiwGAN and fiwGAN: Encoding information in acoustic data to model lexical learning with generative adversarial networks. arXiv:2006.02951. URL: https://arxiv.org/abs/2006.02951.
Beguš, G., 2020b. Distinguishing cognitive from historical influences in phonology. Submitted ms., UC Berkeley.
Beguš, G., 2020c. Generative adversarial phonology: Modeling unsupervised phonetic and phonological learning with neural networks. Frontiers in Artificial Intelligence 3, 44.
Berent, I., 2013. The phonological mind. Trends in Cognitive Sciences 17, 319–327. doi:10.1016/j.tics.2013.05.004.
Chomsky, N., Halle, M., 1968. The Sound Pattern of English. Harper & Row, New York.
Donahue, C., McAuley, J.J., Puckette, M.S., 2019. Adversarial audio synthesis, in: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019, OpenReview.net.
URL: https://openreview.net/forum?id=ByMVTsR5KQ.
Eloff, R., Nortje, A., van Niekerk, B., Govender, A., Nortje, L., Pretorius, A., van Biljon, E., van der Westhuizen, E., van Staden, L., Kamper, H., 2019. Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks, in: Proc. Interspeech 2019, pp. 1103–1107.
Finley, S., 2011. The privileged status of locality in consonant harmony. Journal of Memory and Language 65, 74–83. doi:10.1016/j.jml.2011.02.006.
Finley, S., 2012. Testing the limits of long-distance learning: Learning beyond a three-segment window. Cognitive Science 36, 740–756. URL: https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1551-6709.2011.01227.x.
Garofolo, J.S., Lamel, L., Fisher, W.M., Fiscus, J., Pallett, D.S., Dahlgren, N.L., Zue, V., 1993. TIMIT acoustic-phonetic continuous speech corpus. Linguistic Data Consortium.
Gaskell, M., Hare, M., Marslen-Wilson, W.D., 1995. A connectionist model of phonological representation in speech perception. Cognitive Science 19, 407–439. doi:10.1016/0364-0213(95)90007-1.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (Eds.), Advances in Neural Information Processing Systems 27. Curran Associates, Inc., pp. 2672–2680. URL: http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C., 2017. Improved training of Wasserstein GANs, in: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 30. Curran Associates, Inc., pp. 5767–5777.
URL: http://papers.nips.cc/paper/7159-improved-training-of-wasserstein-gans.pdf.
Hansson, G.Ó., 2010. Consonant harmony: Long-distance interactions in phonology. University of California Press.
Heinz, J., 2010. Learning long-distance phonotactics. Linguistic Inquiry 41, 623–661. URL: https://doi.org/10.1162/LING_a_00015.
van der Hulst, H., 2013. Discoverers of the phoneme, in: Allan, K. (Ed.), The Oxford Handbook of the History of Linguistics. Oxford University Press, pp. 167–191.
Kabak, B., 2011. Turkish vowel harmony, in: van Oostendorp, M., Ewen, C.J., Hume, E., Rice, K. (Eds.), The Blackwell Companion to Phonology. Wiley Blackwell, chapter 118, pp. 1–24. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/9781444335262.wbctp0118.
Legendre, G., Miyata, Y., Smolensky, P., 1990. Harmonic grammar: A formal multi-level connectionist theory of linguistic well-formedness: Theoretical foundations. ICS Technical Report, University of Colorado, Boulder.
Marcus, G.F., 2001. The algebraic mind: Integrating connectionism and cognitive science. MIT Press.
Marcus, G.F., Vijayan, S., Bandi Rao, S., Vishton, P.M., 1999. Rule learning by seven-month-old infants. Science 283, 77–80. URL: https://science.sciencemag.org/content/283/5398/77.
McClelland, J.L., Elman, J.L., 1986. The TRACE model of speech perception. Cognitive Psychology 18, 1–86. doi:10.1016/0010-0285(86)90015-0.
McClelland, J.L., Rumelhart, D.E., and the PDP Research Group, 1986. Parallel distributed processing: Explorations in the microstructure of cognition, volume 2. MIT Press, Cambridge, MA.
McMullin, K., Hansson, G., 2019. Inductive learning of locality relations in segmental phonology. Laboratory Phonology: Journal of the Association for Laboratory Phonology 10, 14.
Plaut, D.C., Kello, C.T., 1999.
The emergence of phonology from the interplay of speech comprehension and production: A distributed connectionist approach. Lawrence Erlbaum Associates Publishers, Mahwah, NJ, pp. 381–415.
Prince, A., Smolensky, P., 1993/2004. Optimality Theory: Constraint Interaction in Generative Grammar. Blackwell, Malden, MA. First published in 1993 as Tech. Rep. 2, Rutgers University Center for Cognitive Science.
R Core Team, 2018. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Radford, A., Metz, L., Chintala, S., 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
Räsänen, O., Nagamine, T., Mesgarani, N., 2016. Analyzing distributional learning of phonemic categories in unsupervised deep neural networks, in: Proceedings of the Annual Conference of the Cognitive Science Society 2016, pp. 1757–1762. URL: https://pubmed.ncbi.nlm.nih.gov/29359204.
Rose, S., Walker, R., 2004. A typology of consonant agreement as correspondence. Language 80, 475–531.
Rumelhart, D.E., McClelland, J.L., and the PDP Research Group, 1986. Parallel distributed processing: Explorations in the microstructure of cognition, volume 1. MIT Press, Cambridge, MA.
van Schijndel, M., Mueller, A., Linzen, T., 2019. Quantity doesn't buy quality syntax with neural language models, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, pp. 5831–5837.
Shain, C., Elsner, M., 2019.
Measuring the perceptual availability of phonological features during language acquisition using unsupervised binary stochastic autoencoders, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, pp. 69–85.
Simon, N., Friedman, J., Hastie, T., Tibshirani, R., 2011. Regularization paths for Cox's proportional hazards model via coordinate descent. Journal of Statistical Software 39, 1–13.
Smolensky, P., Goldrick, M., 2016. Gradient symbolic representations in grammar: The case of French liaison. Rutgers Optimality Archive 1552, Rutgers University.
Smolensky, P., Rosen, E., Goldrick, M., 2019. Learning a gradient grammar of French liaison, in: Proceedings of the 2019 Annual Meeting on Phonology.
White, J., Nevins, A., Polgárdi, K., Martin, A., Kager, R., Linzen, T., Peperkamp, S., Topintzi, I., Markopoulos, G., van de Vijver, R., 2018. Preference for locality is affected by the prefix/suffix asymmetry, in: Hucklebridge, S., Nelson, M. (Eds.), NELS 48: Proceedings of the Forty-Eighth Annual Meeting of the North East Linguistic Society, GLSA, pp. 207–220.
Wood, S.N., 2011. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society (B) 73, 3–36.

Appendix A. Appendix
Appendix A.1. Training data
The recordings of training data were made in a sound-attenuated booth at the Department of Linguistics at Harvard University using a USBPre 2 (Sound Devices) pre-amp and a Shure Beta 53 omnidirectional condenser head-mounted microphone in Audacity (originally sampled at 44.1 kHz and then downsampled to 16 kHz).

Figure A.11: Waveforms and spectrograms (0–4,000 Hz) of (top) an input sample (left) and a generated sample (right) of [O"pʰORO]; and (bottom) an input sample ["dEnO] (left) and a generated sample ["dinO] (right).

Table A.3: IPA transcriptions of training data without consonantal changes; C is a sonorant. ["luôu] and [On"luôu] are missing from the computational experiment.
Fillers:
[l] [+fr]: "lEn, En"lEn (len, enlen); "linO, En"linO (lino, enlino)
[l] [−fr]: "lOô, On"lOô (lor, onlor); "luôu, On"luôu (luru, onluru)
[r] [+fr]: "ôEl, En"ôEl (rel, enrel); "ôinu, En"ôinu (rinu, enrinu)
[r] [−fr]: "ôAs, On"ôAs (ras, onras); "ôOlO, On"ôOlO (rolo, onrolo)
[j] [+fr]: "jim, En"jim (yim, enyim); "jeni, En"jEni (yeni, enyeni)
[j] [−fr]: "jAm, On"jAm (yam, onyam); "jAlu, On"jAlu (yalu, onyalu)

Table A.4: IPA transcriptions of training data without consonantal changes; C is a sonorant.
Fillers:
[l] [+fr]: "lEm, E"lEm (lem, elem); "linO, E"linO (lino, elino)
[l] [−fr]: "lOô, O"lOô (lor, olor); "luôu, O"luôu (luru, oluru)
[r] [+fr]: "ôEl, E"ôEl (rel, erel); "ôinu, E"ôinu (rinu, erinu)
[r] [−fr]: "ôAs, O"ôAs (ras, oras); "ôOlO, O"ôOlO (rolo, orolo)
[j] [+fr]: "jim, E"jim (yim, eyim); "jeni, E"jEni (yeni, eyeni)
[j] [−fr]: "jAm, O"jAm (yam, oyam); "jAlu, O"jAlu (yalu, oyalu)

Table A.5: IPA transcriptions of training data without consonantal changes; C is a voiceless obstruent.
Place: Labial
[−cont] [+fr]: "pʰin@, Em"pʰin@ (pina, empina); "pʰimi, Em"pʰimi (pimi, empimi)
[−cont] [−fr]: "pʰOôO, Om"pʰOôO (poro, omporo)
[+cont] [+fr]: "fini, Em"fini (fini, emfini); "fim@, Em"fim@ (fima, emfima)
[+cont] [−fr]: "fuô@, Om"fuô@ (fura, omfura); "fOlO, Om"fOlO (folo, omfolo)
Place: Coronal
[−cont] [+fr]: "tʰElO, En"tʰElO (telo, entelo); "tʰin@, En"tʰin@ (tina, entina)
[−cont] [−fr]: "tʰAôu, On"tʰAôu (taru, ontaru)
[+cont] [+fr]: "sEnO, En"sEnO (seno, enseno); "sil@, En"sil@ (sila, ensila)
[+cont] [−fr]: "sOôO, On"sOôO (soro, onsoro); "sAnu, On"sAnu (sanu, onsanu)

Table A.6: IPA transcriptions of training data without consonantal changes; C is a voiceless obstruent.
Place: Labial
[−cont] [+fr]: "pʰin@, E"pʰin@ (pina, epina); "pʰimi, E"pʰimi (pimi, epimi)
[−cont] [−fr]: "pʰOôO, O"pʰOôO (poro, oporo); "pʰOmO, O"pʰOmO (pomo, opomo)
[+cont] [+fr]: "fini, E"fini (fini, efini); "fim@, E"fim@ (fima, efima)
[+cont] [−fr]: "fuô@, O"fuô@ (fura, ofura); "fOlO, O"fOlO (folo, ofolo)
Place: Coronal
[−cont] [+fr]: "tʰElO, E"tʰElO (telo, etelo); "tʰin@, E"tʰin@ (tina, etina)
[−cont] [−fr]: "tʰAôu, O"tʰAôu (taru, otaru); "tʰOmO, O"tʰOmO (tomo, otomo)
[+cont] [+fr]: "sEnO, E"sEnO (seno, eseno); "sil@, E"sil@ (sila, esila)
[+cont] [−fr]: "sOôO, O"sOôO (soro, osoro); "sAnu, O"sAnu (sanu, osanu)

Table A.7: IPA transcriptions of training data with consonantal changes.
Place: Labial
[−cont] [+fr]: "bil@, Em"pʰil@ (bila, empila); "beô@, Em"pʰeô@ (bera, empera); "bilO, Em"pʰilO (bilo, empilo); "bEm@, Em"pʰEm@ (bema, empema)
[−cont] [−fr]: "bul@, Om"pʰul@ (bula, ompula); "bAlu, Om"pʰAlu (balu, ompalu); "bOô@, Om"pOô@ (bora, ompora); "bunE, Om"punE (bune, ompune)
[+cont] [+fr]: "vil@, Em"pʰil@ (vila, empila); "vEmO, Em"pʰEmO (vemo, empemo); "viô@, Em"pʰiô@ (vira, empira); "vEl@, Em"pʰEl@ (vela, empela)
[+cont] [−fr]: "vulO, Om"pʰulO (vulo, ompulo); "vAôu, Om"pʰAôu (varu, omparu); "vOn@, Om"pʰOn@ (vona, ompona); "vulE, Om"pʰulE (vule, ompule)
Place: Coronal
[−cont] [+fr]: "dilO, En"tʰilO (dilo, entilo); "diôi, En"tʰiôi (diri, entiri); "dElO, En"tʰElO (delo, entelo); "dEm@, En"tʰEm@ (dema, entema)
[−cont] [−fr]: "dulE, On"tʰulE (dule, ontule); "dOôu, On"tʰOôu (doru, ontoru); "dAlE, On"tʰAlE (dale, ontale); "dun@, On"tʰun@ (duna, ontuna)
[+cont] [+fr]: "zil@, En"tʰil@ (zila, entila); "ziô@, En"tʰiô@ (zira, entira); "zEmO, En"tʰEmO (zemo, entemo); "zEni, En"tʰEni (zeni, enteni)
[+cont] [−fr]: "zulO, On"tʰulO (zulo, ontulo); "zAôu, On"tʰAôu (zaru, ontaru); "zOlE, On"tʰOlE (zole, ontole); "zunE, On"tʰunE (zune, ontune)

Table A.8: IPA transcriptions of training data with consonantal changes.
Place: Labial
[−cont] [+fr]: "bElO, E"pʰElO (belo, epelo); "bel@, E"pʰel@ (bela, epela); "biô@, E"pʰiô@ (bira, epira); "bim@, E"pʰim@ (bima, epima)
[−cont] [−fr]: "bulE, O"pʰulE (bule, opule); "bAôu, O"pʰAôu (baru, oparu); "bulO, O"pulO (bulo, opulo); "bOn@, O"pOn@ (bona, opona)
[+cont] [+fr]: "bilO, E"filO (bilo, efilo); "bEm@, E"fEm@ (bema, efema); "bil@, E"fil@ (bila, efila); "bEôO, E"fEôO (bero, efero)
[+cont] [−fr]: "bul@, O"ful@ (bula, ofula); "bAlu, O"fAlu (balu, ofalu); "bOô@, O"fOô@ (bora, ofora); "bunE, O"funE (bune, ofune)
Place: Coronal
[−cont] [+fr]: "dil@, E"tʰil@ (dila, etila); "diôu, E"tʰiôu (diru, etiru); "dEni, E"tʰEni (deni, eteni); "dEm@, E"tʰEm@ (dema, etema)
[−cont] [−fr]: "dulO, O"tʰulO (dulo, otulo); "dAôu, O"tʰAôu (daru, otaru); "dOlE, O"tʰOlE (dole, otole); "dunE, O"tʰunE (dune, otune)
[+cont] [+fr]: "dilu, E"silu (dilu, esilu); "diôi, E"siôi (diri, esiri); "dEmE, E"sEmE (deme, eseme); "dEnO, E"sEnO (deno, eseno)
[+cont] [−fr]: "dulE, O"sulE (dule, osule); "dOôu, O"sOôu (doru, osoru); "dAl@, O"sAl@ (dala, osala); "dun@, O"sun@ (duna, osuna)

Table A.9: IPA transcriptions of training data without the prefixed forms.
"bAô@ (bara), "vAô@ (vara), "dAmi (dami), "zAmi (zami), "lEni (2×) (leni), "ôEm@ (2×) (rema), "bAj@ (2×) (baja), "vAj@ (vaya), "dAwE (dawe), "zAwO (zawo), "liôO (2×) (liro), "ôuôO (2×) (ruro), "bEnE (bene), "vEnE (vene), "dAwO (dawo), "zElE (zele), "lOna (2×) (lona), "bEjO (2×) (beyo), "vEjo (vejo), "dElE (dele), "ziwO (ziwo), "lOnu (2×) (lonu), "bijE (biye), "dEwE (dewe), "bujE (buye), "diwO (2×) (diwo), "dOw@ (dowa)

Estimate  Std. Error  z value  Pr(>|z|)
(Intercept) = mean: 1.34  0.19  7.20  0.0000
mean vs. back: 0.30  0.19  1.64  0.1016
mean vs. V-: 0.05  0.19  0.29  0.7710
Frontness:Prefix: −0.05  0.19  −0.29  0.7710

Table A.10: Linear logistic regression estimates with harmonious responses of the Generator network as successes and vowel frontness (with two sum-coded levels, front and back) and prefix identity as the independent variables with their interaction.
Estimate  Std. Error  z value  Pr(>|z|)
(Intercept): 0.56  0.11  5.01  0.0000
harmvow1: 0.08  0.11  0.75  0.4549
alt1: 0.04  0.06  0.72  0.4738
harmvow1:alt1: 0.09  0.05  1.86  0.0623

Table A.11: Linear mixed effects logistic regression estimates with harmonious responses of human subjects in the behavioral experiment as successes and vowel frontness (with two sum-coded levels, front and back) and prefix identity as the independent variables with their interaction.
A. parametric coefficients
(Intercept): Estimate 1.3535, Std. Error 0.2693, t-value 5.0257

Table A.12: Estimates of a generalized additive mixed effects logistic regression model with the front vs. back triggering vowel (V) value (front = success; back = failure) as the dependent variable and a thin-plate smooth for values of z as the independent variable (with random smooths for each of the 60 generated sets).

A. parametric coefficients
(Intercept) = back: Estimate 1.3488, Std. Error 0.3277, t-value 4.1160

Table A.13: Estimates of a generalized additive mixed effects logistic regression model with harmonious (success) and disharmonious (failure) outcome as the dependent variable, vowel frontness as a parametric predictor, thin-plate smooths for the two levels of frontness (front vs. back, treatment-coded with back as the reference level) across the values of z, and random smooths for each of the 60 sets of generated outputs.