“This item is a glaxefw, and this is a glaxuzb”: Compositionality Through Language Transmission, using Artificial Neural Networks
Hugh Perkins ([email protected])
ASAPP (https://asapp.com), 1 World Trade Center, NY 10007 USA
Abstract
We propose an architecture and process for using the Iterated Learning Model ("ILM") with artificial neural networks. We show that ILM does not lead to the same clear compositionality as observed using DCGs, but does lead to a modest improvement in compositionality, as measured by holdout accuracy and topographic similarity. We show that ILM can lead to an anti-correlation between holdout accuracy and topographic rho. We demonstrate that ILM can increase compositionality when using non-symbolic high-dimensional images as input.
Human languages are compositional. For example, if we wish to communicate the idea of a 'red box', we use one word to represent the color 'red', and one to represent the shape 'box'. We can use the same set of colors with other shapes, such as 'sphere'. This contrasts with a non-compositional language, where each combination of color and shape would have its own unique word, such as 'aefabg'. That we use words at all is a characteristic of compositionality. We could alternatively use a unique sequence of letters or phonemes for each possible thought or utterance.

Compositionality provides advantages over non-compositional language. Compositional language allows us to generalize concepts such as colors across different situations and scenarios. However, it is unclear what concrete mechanism led to human languages being compositional. In laboratory experiments using artificial neural networks, languages emerging between multiple communicating agents show some small signs of compositionality, but do not show the clear compositional behavior that human languages show. (Kottur et al., 2017) show that agents do not learn compositionality unless they have to. In the context of referential games, (Lazaridou et al., 2018) showed that agent utterances had a topographic rho of 0.16-0.26, on a scale of 0 to 1, even whilst showing a task accuracy in excess of 98%.

In this work, following the ideas of (Kirby, 2001), we hypothesize that human languages are compositional because compositional languages are highly compressible, and can be transmitted across generations most easily. We extend the ideas of (Kirby, 2001) to artificial neural networks, and experiment with using non-symbolic inputs to generate each utterance.

We find that transmitting languages across generations using artificial neural networks does not lead to such clearly visible compositionality as was apparent in (Kirby, 2001). However, we were unable to prove a null hypothesis that ILM using artificial neural networks does not increase compositionality across generations. We find that objective measures of compositionality do increase over several generations. We find that the measures of compositionality reach a relatively modest plateau after several generations.

Our key contributions are:

• propose an architecture for using ILM with artificial neural networks, including with non-symbolic input
• show that ILM with artificial neural networks does not lead to the same clear compositionality as observed using DCGs
• show that ILM does lead to a modest increase in compositionality for neural models
• show that two measures of compositionality, i.e. holdout accuracy and topographic similarity, can correlate negatively, in the presence of ILM
• demonstrate an effect of ILM on compositionality for non-symbolic high-dimensional inputs

(Kirby, 2001) hypothesized that compositionality in language emerges because languages need to be easy to learn, in order to be transmitted between generations. (Kirby, 2001) showed that, using simulated teachers and students equipped with a context-free grammar, the transmission of a randomly initialized language across generations caused the emergence of an increasingly compositional grammar. (Kirby et al., 2008) showed evidence for the same process in humans, who were each tasked with transmitting a language to another participant, in a chain.

(Kirby, 2001) termed this approach the "Iterated Learning Method" (ILM). Learning proceeds in a sequence of generations.
In each generation, a teacher agent transmits a language to a student agent. The student agent then becomes the teacher agent for the next generation, and a new student agent is created. A language G is defined as a mapping G : M → U from a space of meanings M to a space of utterances U. G can be represented as a set of pairs of meanings and utterances G = {(m_1, u_1), (m_2, u_2), ..., (m_n, u_n)}. Transmission from teacher to student is imperfect, in that only a subset G_train of the full language space G is presented to the student. Thus the student agent must generalize from the seen meaning/utterance pairs {(m_i, u_i) | m_i ∈ M_train,t ⊂ M} to the unseen meanings {m_i | m_i ∈ (M \ M_train,t)}. We represent the mapping from meaning m_i to utterance u_i by the teacher as f_T(·). Similarly, we represent the student agent as f_S(·).

In ILM, each generation proceeds as follows:

• draw a subset of meanings M_train,t from the full set of meanings M
• invention: use the teacher agent to generate utterances U_train,t = {u_i,t = f_T(m_i) | m_i ∈ M_train,t}
• incorporation: the student memorizes the teacher's mapping from M_train,t to U_train,t
• generalization: the student generalizes from the seen meaning/utterance pairs G_train,t to determine utterances for the unseen meanings M \ M_train,t

Figure 1: Two example sets of DCG rules. Each set will produce the utterance 'abc' when presented with meaning (a, b).

Table 1: Example language generated by Kirby's ILM.

        a1      a2      a3      a4      a5
b1      qda     bguda   lda     kda     ixcda
b2      qr      bgur    lr      kr      ixcr
b3      qa      bgua    la      ka      ixca
b4      qu      bguu    lu      ku      ixcu
b5      qp      bgup    lp      kp      ixcp
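The generational loop just described can be made concrete with a short sketch. This is a minimal illustration only, using hypothetical Teacher/Student agent objects with speak and incorporate methods; it is not the DCG implementation of (Kirby, 2001):

```python
import random

def ilm_generation(teacher, student, meanings, n_train):
    """One ILM generation: invention, incorporation, generalization."""
    # draw a subset of meanings to expose to the student
    m_train = random.sample(meanings, n_train)
    # invention: the teacher produces utterances for the sampled meanings
    g_train = [(m, teacher.speak(m)) for m in m_train]
    # incorporation: the student memorizes the observed pairs
    student.incorporate(g_train)
    # generalization: the student produces utterances for every meaning,
    # including meanings it has never observed
    return {m: student.speak(m) for m in meanings}
```

The student of one generation then serves as the teacher of the next, and a fresh student agent is created.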
In (Kirby, 2001), the agents are deterministic sets of DCG rules, e.g. see Figure 1. For each pair of meaning and utterance (m_i, u_i) ∈ G_train,t, if (m_i, u_i) is defined by the existing grammar rules, then no learning takes place. Otherwise, a new grammar rule is added that maps from m_i to u_i. Then, in the generalization phase, rules are merged, where possible, to form a smaller set of rules, consistent with the set of meaning/utterance pairs seen during training, G_train,t. The generalization phase uses a complex set of hand-crafted merging rules.

The initial language, at the first generation, is randomly initialized, such that each utterance u_0,i is a random sequence of letters. The meaning space comprised two attributes, each having 5 or 10 possible values, giving a total meaning space of 5² = 25 or 10² = 100 possible meanings.

(Kirby, 2001) examined the compositionality of the language after each generation, by looking for common substrings in the utterances for each attribute. An example language is shown in Table 1. In this language, there are two meaning attributes, a and b, taking values {a_1, ..., a_5} and {b_1, ..., b_5}. For example, attribute a could be color, and a_1 could represent 'red'; whilst b could be shape, and b_4 could represent 'square'. Then the word for 'red square', in the example language shown, would be 'qu'. We can see that in the example, the attribute value a_1 is associated with a prefix 'q', whilst the attribute value b_4 tends to be associated with a suffix 'u'. The example language thus shows compositionality.

(Kirby et al., 2008) extended ILM to humans. They observed that ILM with humans could lead to degenerate grammars, where multiple meanings mapped to identical utterances. However, they showed that pruning duplicate utterances from the results of the generation phase, prior to presentation to the student, was sufficient to prevent the formation of such degenerate grammars.

Figure 2: Naive ILM using Artificial Neural Networks
We seek to extend ILM to artificial neural networks, for example using RNNs. Different from the DCG in (Kirby, 2001), artificial neural networks generalize over their entire support, for each training example. Learning is in general lossy and imperfect.

In the case of using ANNs, we need to first consider how to represent a single 'meaning'. Considering the example language depicted in Table 1 above, we can represent each attribute as a one-hot vector, and represent the set of two attributes as the concatenation of two one-hot vectors. More generally, we can represent a meaning as a single real-valued vector, m. In this work, we will use 'thought vector' and 'meaning vector' as synonyms for 'meaning', in the context of ANNs.

We partition the meaning space M into M_train and M_holdout, such that M = M_train ∪ M_holdout. We will denote a subset of M_train at generation t by M_train,t.

A naive attempt to extend ILM to artificial neural networks (ANNs) is to simply replace the DCG in ILM with an RNN, see Figure 2. In practice, we observed that this formulation leads to a degenerate grammar, where all meanings map to a single identical utterance.
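For concreteness, the meaning encoding described above — one one-hot vector per attribute, concatenated into a single many-hot vector — can be sketched as follows (a minimal illustration; the attribute-to-index convention is arbitrary):

```python
import numpy as np

def encode_meaning(values, k):
    """Encode a meaning with len(values) attributes, each taking one of k
    values, as a concatenation of one-hot vectors (a vector of size k * a)."""
    a = len(values)
    m = np.zeros(k * a, dtype=np.float32)
    for i, v in enumerate(values):
        m[i * k + v] = 1.0
    return m

# e.g. two attributes with 5 values each: meaning (a_1, b_4) -> indices (0, 3)
print(encode_meaning([0, 3], k=5))
# [1. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
```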
Table 2: Results using the naive ANN ILM architecture, for the 33² and 10⁵ meaning spaces, with and without removal of duplicate utterances ('Nodups'). 'Uniq' is uniqueness; ρ is topographic similarity (see later). The termination criterion for teacher-student training is 98% accuracy.

ANNs generalize naturally, but learning is lossy and imperfect. This contrasts with a DCG, which does not generalize. In the case of a DCG, generalization is implemented by applying certain hand-crafted rules. With careful crafting of the generalization rules, the DCG will learn a training set perfectly, and degenerate grammars are rare. In the case of using an ANN, the lossy teacher-student training progressively smooths the outputs. In the limit of training over multiple generations, an ANN produces the same output, independent of the input: a degenerate grammar.

The first two rows of Table 2 show results for two meaning spaces: 2 attributes each with 33 possible values (denoted 33²), and 5 attributes each with 10 possible values (denoted 10⁵). The column 'Uniq' is a measure of the uniqueness of utterances over the meaning space, where 0 means all utterances are identical, and 1 means all utterances are distinct. We can see that the uniqueness values are near zero for both meaning spaces.

We tried the approach of (Kirby et al., 2008) of removing duplicate utterances prior to presentation to the student. Results for 'nodups' are shown in the last two rows of Table 2. The uniqueness improved slightly, but was still near zero. Thus the approach of (Kirby et al., 2008) did not prevent the formation of a degenerate grammar, in our experiments, when using ANNs.

To prevent the formation of degenerate grammars, we propose to enforce uniqueness of utterances by mapping the generated utterances back into meaning space, and using a reconstruction loss on the reconstructed meanings. Using a meaning space reconstruction loss requires a way to map from generated utterances back to meaning space. One way to achieve this could be to back-propagate from a generated utterance back onto a randomly initialized meaning vector. However, this requires multiple back-propagation iterations in general, and we found this approach to be slow. We choose instead to introduce a second ANN, which will learn to map from discrete utterances back to meaning vectors. Our architecture is thus an auto-encoder. We call the decoder the 'sender', which maps from a thought vector into discrete language. The encoder is termed the 'receiver'. We equip each agent with both a sender and a receiver network, Figure 3.

Figure 3: Agent sender-receiver architecture
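One possible realization of the sender and receiver networks is sketched below, using GRUs (the appendix notes that GRUs were used for the RNNs). This is an illustrative, simplified sketch: the sender here is non-autoregressive and conditions only through the initial hidden state, which may differ from the exact architecture used.

```python
import torch
from torch import nn

class Sender(nn.Module):
    """Decoder: thought (meaning) vector -> per-position token logits."""
    def __init__(self, meaning_size, embedding_size, vocab_size, utt_len):
        super().__init__()
        self.utt_len = utt_len
        self.project = nn.Linear(meaning_size, embedding_size)
        self.rnn = nn.GRU(embedding_size, embedding_size, batch_first=True)
        self.to_logits = nn.Linear(embedding_size, vocab_size)

    def forward(self, meanings):                        # [N, meaning_size]
        h0 = self.project(meanings).unsqueeze(0)        # [1, N, E]
        steps = torch.zeros(meanings.size(0), self.utt_len, h0.size(-1),
                            device=meanings.device)
        out, _ = self.rnn(steps, h0)                    # [N, L, E]
        return self.to_logits(out)                      # [N, L, V]

class Receiver(nn.Module):
    """Encoder: discrete utterance (token ids) -> thought (meaning) vector."""
    def __init__(self, meaning_size, embedding_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_size)
        self.rnn = nn.GRU(embedding_size, embedding_size, batch_first=True)
        self.to_meaning = nn.Linear(embedding_size, meaning_size)

    def forward(self, utterances):                      # [N, L] token ids
        _, h = self.rnn(self.embed(utterances))         # h: [1, N, E]
        return self.to_meaning(h.squeeze(0))            # [N, meaning_size]
```

Each agent then holds one such sender and one such receiver.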
We will denote the teacher sender network as f_T,send(·), the student receiver network as f_S,recv(·), and the student sender network as f_S,send(·). The output of f_·,send(·) is non-normalized logits, representing a sequence of distributions over discrete tokens. These logits can be converted into discrete tokens by applying an argmax.

For teacher-student training, we use the sender network of the teacher to generate a set of meaning-utterance pairs, which represent a subset of the teacher's language. We present this language to the student, and train both the sender and the receiver network of the student on this new language.

The ILM training procedure is depicted in Figure 4. A single generation proceeds as follows. For each generation t, we do:

• meaning sampling: we sample a subset of meanings M_train,t = {m_t,1, ..., m_t,N} ⊂ M_train, where M_train is a subset of the space of all meanings, i.e. M_train = M \ M_holdout
• teacher generation: use the teacher sender network to generate the set of utterances U_t = {u_t,1, ..., u_t,N}
• student supervised training: train the student sender and receiver networks supervised, using M_train,t and U_t
• student end-to-end training: train the student sender and receiver networks end-to-end, as an auto-encoder

For the teacher generation, each utterance u_t,n is generated as f_T,send(m_t,n).

For the student supervised training, we train the student sender network f_S,send(·) to generate U_t given M_train,t, and we train the student receiver network f_S,recv(·) to recover M_train,t given U_t. Supervised training for each network terminates after N_sup epochs, or once training accuracy reaches acc_sup.

The student supervised training serves to transmit the language from the teacher to the student. The student end-to-end training enforces uniqueness of utterances, so that the language does not become degenerate. In the end-to-end step, we iterate over multiple batches, where for each batch j we do:

• sample a set of meanings M_train,t,j = {m_t,j,1, ..., m_t,j,N_batch} ⊂ M_train
• train, using an end-to-end loss function L_e2e, as an auto-encoder, using the meanings M_train,t,j as both the input and the target ground truth

End-to-end training is run for either N_e2e batches, or until the end-to-end training accuracy reaches a threshold acc_e2e.

In the general case, the meanings m can be presented as raw non-symbolic stimuli x. Each raw stimulus x can be encoded by some network into a thought-vector m. We denote such an encoding network as a 'perception' network. As an example of a perception network, an image could be encoded using a convolutional neural network.

This then presents a challenge when training a receiver network. One possible architecture would be for the receiver network to generate the original input x. We choose instead to share the perception network between the sender and receiver networks in each agent. During supervised training of the sender, using the language generated by the teacher, we train the perception and sender networks jointly. To train the receiver network, we hold the perception network weights constant, and train the receiver network to predict the output of the perception network, given input utterance u and target stimulus x. See Figure 5.
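Putting these steps together, one generation of the neural ILM procedure might be sketched as follows. This is a hedged illustration rather than the exact training code: the loss choices, optimizer settings, and the from_probs helper (which would let the receiver consume soft token distributions in the SOFTMAX variant) are assumptions made for the sketch.

```python
import torch
import torch.nn.functional as F

def neural_ilm_generation(teacher, student, meanings_train, n_sample,
                          n_sup_epochs, n_e2e_batches, batch_size):
    """One neural ILM generation: teacher generation, student supervised
    training, then student end-to-end (auto-encoder) training."""
    opt = torch.optim.Adam(list(student.sender.parameters()) +
                           list(student.receiver.parameters()))

    # 1. meaning sampling
    m_t = meanings_train[torch.randperm(len(meanings_train))[:n_sample]]

    # 2. teacher generation: argmax over the teacher sender's logits
    with torch.no_grad():
        u_t = teacher.sender(m_t).argmax(dim=-1)        # [N, L] token ids

    # 3. student supervised training: the sender imitates the teacher's
    #    utterances, the receiver maps those utterances back to meanings
    for _ in range(n_sup_epochs):
        logits = student.sender(m_t)                    # [N, L, V]
        sender_loss = F.cross_entropy(logits.flatten(0, 1), u_t.flatten())
        recv_loss = F.mse_loss(student.receiver(u_t), m_t)
        opt.zero_grad()
        (sender_loss + recv_loss).backward()
        opt.step()

    # 4. student end-to-end training (SOFTMAX variant): softmax distributions
    #    from the sender feed the receiver, with a reconstruction loss
    for _ in range(n_e2e_batches):
        b = meanings_train[torch.randperm(len(meanings_train))[:batch_size]]
        probs = F.softmax(student.sender(b), dim=-1)    # [B, L, V]
        recon = student.receiver.from_probs(probs)      # hypothetical helper
        loss = F.mse_loss(recon, b)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student
```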
Figure 4: Neural ILM Training Procedure

Figure 5: Generalized Neural ILM Supervised Training

Note that by setting the perception network to the identity operator, we recover the earlier supervised training steps.

For end-to-end training with non-symbolic input, we use a referential task, e.g. as described in (Lazaridou et al., 2018). The sender network is presented the output of the perception network, m, and generates an utterance u. The receiver network chooses, from among distractors, a target image which matches the image presented to the sender. The target image that the receiver network perceives could be the original stimulus presented to the sender, or it could be a stimulus which matches the original image in concept, but is not the same stimulus. For example, two images could contain the same shapes, having the same colors, but in different positions. Figure 6 depicts the architecture, with a single distractor. In practice, multiple distractors are typically used.

Figure 6: End-to-end referential task for non-symbolic inputs, where x_s is the input stimulus presented to the sender, x_tgt is the target input stimulus, and x_distr1 is a distractor stimulus.

When we train a sender and receiver network end-to-end, we can put a softmax on the output of the sender network f_·,send, to produce a probability distribution over the vocabulary, for each token. We can feed these probability distributions directly into the receiver network f_·,recv, and train using a cross-entropy loss. We denote this scenario SOFTMAX. Alternatively, we can sample discrete tokens from categorical distributions parameterized by the softmax output. We train the resulting end-to-end network using REINFORCE, with a moving average baseline and entropy regularization. This scenario is denoted RL.

We wish to use objective measures of compositionality. This is necessary because the compositional signal is empirically relatively weak. We assume access to the ground truth for the meanings, and use two approaches: topographic similarity, ρ, as defined in (Brighton and Kirby, 2006) and (Lazaridou et al., 2018); and holdout accuracy, acc_H.

ρ is the correlation between distance in meaning space and distance in utterance space, taken across multiple examples. For the distance metric, we use the L0 distance, for both meanings and utterances. That is, in meaning space, the distance between 'red square' and 'yellow square' is 1, and the distance between 'red square' and 'yellow circle' is 2. In utterance space, the distance between 'glaxefw' and 'glaxuzg' is 3. Considered as an edit distance, we allow substitutions, but neither insertions nor deletions. For the correlation measure, we use Spearman's Rank Correlation.

acc_H shows the ability of the agents to generalize to combinations of shapes and colors not seen in the training set. For example, the training set might contain examples of 'red square', 'yellow square', and 'yellow circle', but not 'red circle'. If the utterances were perfectly compositional, both as generated by the sender and as interpreted by the receiver, then we would expect performance on 'red circle' to be similar to the performance on 'yellow circle'. The performance on the holdout set, relative to the performance on the training set, can thus be interpreted as a measure of compositionality.
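A minimal sketch of the ρ computation described above, assuming meanings are given as tuples of attribute values and utterances as equal-length token sequences:

```python
from scipy.stats import spearmanr

def meaning_distance(m1, m2):
    # number of attributes whose values differ,
    # e.g. d('red square', 'yellow square') = 1
    return sum(int(a != b) for a, b in zip(m1, m2))

def utterance_distance(u1, u2):
    # number of positions whose tokens differ (substitutions only,
    # no insertions or deletions), e.g. d('glaxefw', 'glaxuzg') = 3
    return sum(int(a != b) for a, b in zip(u1, u2))

def topographic_similarity(meanings, utterances):
    """Spearman rank correlation between pairwise meaning distances
    and pairwise utterance distances."""
    md, ud = [], []
    for i in range(len(meanings)):
        for j in range(i + 1, len(meanings)):
            md.append(meaning_distance(meanings[i], meanings[j]))
            ud.append(utterance_distance(utterances[i], utterances[j]))
    rho, _ = spearmanr(md, ud)
    return rho
```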
Note that when there is just a single attribute, it is not possible to exclude any values from training, otherwise the model would never have been exposed to the value at all. Therefore acc_H is only a useful measure of compositionality when there are at least 2 attributes.

We observe that one key difference between ρ and acc_H is that ρ depends only on the compositional behavior of the sender, whereas acc_H depends also on the compositional behavior of the receiver. As noted in (Lowe et al., 2019), it is possible for utterances generated by a sender to exhibit a particular behavior or characteristic without the receiver making use of this behavior or characteristic.

Work on emergent communications was revived recently, for example by (Lazaridou et al., 2016) and (Foerster et al., 2016). CITE and CITE showed emergent communications in a 2d world. Several works investigate the compositionality of the emergent language, such as CITE, CITE, CITE. (Kottur et al., 2017) showed that agents do not generate compositional languages unless they have to. (Lazaridou et al., 2018) used a referential game with high-dimensional non-symbolic input, and showed the resulting languages contained elements of compositionality, as measured by topographic similarity. (Bouchacourt and Baroni, 2018) caution that agents may not be communicating what we think they are communicating, by using randomized images, and by investigating the effect of swapping the target image. (Andreas et al., 2017) proposed an approach to learn to translate from an emergent language into a natural language.

Obtaining compositional emergent language can be viewed as disentanglement of the agent communications. (Locatello et al., 2019) prove that unsupervised learning of disentangled representations is fundamentally impossible without inductive biases both on the considered learning approaches and on the data sets.

Kirby pioneered ILM in (Kirby, 2001), extending it to humans in (Kirby et al., 2008). (Griffiths and Kalish, 2007) proved that, for Bayesian agents, the iterated learning method converges to a distribution over languages that is determined entirely by the prior, which is somewhat aligned with the result in (Locatello et al., 2019) for disentangled representations. (Li and Bowling, 2019), (Cogswell et al., 2020), and (Ren et al., 2020) extend ILM to artificial neural networks, using symbolic inputs. Symbolic input vectors are typically by nature themselves compositional: the concatenation of one-hot vectors of attribute values, or of per-attribute embeddings (e.g. (Kottur et al., 2017)). Thus, these works show that given compositional input, agents can generate compositional output. In our work, we extend ILM to high-dimensional, non-symbolic inputs. However, a concurrent work (Dagan et al., 2020) also extends ILM to image inputs, and also takes an additional step in examining the effect of genetic evolution of the network architecture, in addition to the cultural evolution of the language that we consider in our own work.

(Andreas, 2019) provides a very general framework, TRE, for evaluating compositionality, along with a specific implementation that relates closely to the language representations used in the current work. It uses a learned linear projection to rearrange tokens within each utterance, and a relaxation to enable the use of gradient descent to learn the projection. Due to time pressure, we did not use TRE in our own work.

Our work on neural ILM relates to distillation (Ba and Caruana, 2014; Hinton et al., 2015), in which a large teacher network distills knowledge into a smaller student network.
More recently, (Furlanello et al., 2018) showed that when the student network has identical size and architecture to the teacher network, distillation can still give an improvement in validation accuracy, on a vision and a language model. Our work relates also to self-training (He et al., 2019), in which learning proceeds in iterations, similar to ILM generations.
We conduct experiments first on a synthetic concept dataset, built to resemble that of (Kirby, 2001). We experiment with meanings with a attributes, where each attribute can take one of k values. The set of all possible meanings M comprises k^a unique meanings. We use the notation k^a to describe such a meaning space. We reserve a holdout set H of 128 meanings, which will not be presented during training. This leaves (k^a − 128) meanings for training and validation. In addition, we remove from the training set any meanings having 3 or more attributes in common with any meanings in the holdout set.

We choose two meaning spaces: 33² and 10⁵. The 33² space is constructed to be similar in nature to (Kirby, 2001), whilst being large enough to train an RNN without immediately over-fitting. With 33 possible values per attribute, the number of possible meanings increases from 10² = 100 to 33² = 1,089. In addition to not over-fitting, this allows us to set aside a reasonable holdout set of 128 examples. We experiment in addition with a meaning space of 10⁵, which has a total of 100,000 possible meanings. We hypothesized that the much larger number of meanings prevents the network from simply memorizing each meaning, and thus forces the network to naturally adopt a more compositional representation.

The model architecture for the symbolic concept task is that depicted in Figure 4. The sender model converts each meaning into a many-hot representation, of dimension k · a, then projects the many-hot representation into an embedding space.

Table 3: Results using the auto-encoder architecture on the synthetic concepts dataset, for the 33² and 10⁵ meaning spaces, with SOFTMAX and RL end-to-end training, with and without ILM. 'E2E Tgt' is the termination criterion ('target') for end-to-end training; ρ is topographic similarity. Where ILM is used, it is run for 5 generations.

Table 3 shows the results for the symbolic concept task. We can see that ILM improves the topographic similarity measure, for both the 33² and 10⁵ meaning spaces, and for both SOFTMAX and RL. Interestingly, in one of the two meaning spaces, the increase in compositionality as measured by ρ is associated with a decrease in acc_H, for both SOFTMAX and RL. This could indicate that ILM is inducing the sender to generate more compositional output, but that the receiver's understanding of the utterances becomes less compositional, in this scenario. It is interesting that ρ and acc_H can be inversely correlated, in certain scenarios. This aligns somewhat with the findings in (Lowe et al., 2019). Interestingly, it is not clear that using a 10⁵ meaning space leads to more compositional utterances than the much smaller 33² meaning space.

In Experiment One, we conserved the type of stimuli used in prior work on ILM, e.g. (Kirby, 2001), using highly structured input. In Experiment Two, we investigate the extent to which ILM shows a benefit using unstructured high-dimensional input. We used OpenGL to create scenes containing colored objects, of various shapes, in different positions. In the previous task, using symbolic meanings, we required the listener to reconstruct the symbolic meaning. In the case of images, we use a referential task, as discussed in Section 3.4. The advantage of using a referential task is that we do not require the agents to communicate the exact position and color of each object, just which shapes and colors are present.
If the agents agree on an ordering over shapes, then the number of attributes to be communicated is exactly equal to the number of objects in the images. The positions of the objects are randomized, to add noise to the images. We also varied the color of the ground plane over each image.

Example images are shown in Figure 7. Each example comprises 6 images: one sender image, the target receiver image, and 4 distractor images. Each object in a scene was a different shape, and we varied the colors and the positions of each object. Each shape was unique within each image. Two images were considered to match if the sets of shapes were identical, and if the objects with the same shapes were identically colored. The positions of the objects were irrelevant for the purposes of judging whether the images matched. We change only a single color in each distractor, so that we force the sender and receiver to communicate all object colors, not just one or two.

We create three datasets, for sets of 1, 2 or 3 objects respectively. Each dataset comprises 4096 training examples, and 512 holdout examples. In the case of two shapes and three shapes, we create the holdout set by setting aside combinations of shapes and colors which are never seen in the training set. That is, the color 'red' might have been seen for a cube, but not for a cylinder. In the case of just one shape, this would mean that the color had never been seen at all, so for a single shape, we relax this requirement, and just use unseen geometrical configurations in the holdout set. The dataset is constructed using OpenGL and python. The code will be made available at https://github.com/asappresearch/neural-ilm.

The supervised learning of the student sender and receiver from the teacher-generated language is illustrated in Figure 5. The referential task architecture is depicted in Figure 6. Owing to time pressure, we experimented only with using RL. We chose RL over SOFTMAX because we felt that RL is more representative of the discrete nature of natural languages.
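The matching rule described above can be stated compactly. In the sketch below, a scene is represented as a hypothetical mapping from shape name to color name (shapes are unique within an image), which is sufficient because positions are ignored:

```python
def images_match(scene_a, scene_b):
    """Two scenes match iff they contain the same set of shapes and each
    shape has the same color in both scenes; positions are ignored."""
    return scene_a == scene_b

# same shapes, same colors, different positions -> match
assert images_match({'cube': 'red', 'cylinder': 'blue'},
                    {'cylinder': 'blue', 'cube': 'red'})
# a distractor changes exactly one color -> no match
assert not images_match({'cube': 'red', 'cylinder': 'blue'},
                        {'cube': 'red', 'cylinder': 'green'})
```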
Table 4: Results for the OpenGL datasets. 'Shapes' is the number of shapes, 'Gens' is the number of ILM generations, and 'Batches' is the total number of batches. For ILM, batches per generation is the total number of batches divided by the number of ILM generations. For ILM, three generations are used.

Table 4 shows the results using the OpenGL datasets. We can see that, when training using the RL scenario, ILM shows an improvement across the datasets. The increase in topographic similarity is associated with an improvement in holdout accuracy, across all scenarios, similar to the symbolic concepts scenario.

Figure 7: Example referential task images, one example per row. The sender image and the correct receiver image are the first two images in each row.

Figure 8: Examples of individual runs up to 10 generations. '1 ilm', '2 ilm', and '3 ilm' denote ILM over the one, two and three shape datasets respectively. 'e2e acc' denotes end-to-end training accuracy, 'e2e holdout acc' denotes end-to-end accuracy on the holdout set (acc_H), and 'e2e rho' denotes the topographic similarity of the generated utterances (ρ).

Figure 8 shows examples of individual runs. The plots within each row are for the same dataset, i.e. one shape, two shapes, or three shapes. The first column shows the end-to-end accuracy, the second column shows holdout accuracy, acc_H, and the third column shows topographic similarity, ρ. We note firstly that the variance across runs is high, which makes evaluating trends challenging. Results in the table above were reported using five runs per scenario, pre-selecting which runs to use prior to running them.

We can see that end-to-end training accuracy is good for the one and two shapes scenarios, but that the model struggles to achieve high training accuracy on the more challenging three shapes dataset. The holdout accuracy similarly falls dramatically, relative to the training accuracy, as the number of shapes in the dataset increases. Our original hypothesis was that the more challenging dataset, i.e. three shapes, would be harder to memorize, and would thus lead to better compositionality. That the holdout accuracy actually gets worse, compared to the training accuracy, with more shapes was surprising to us. Similarly, the topographic similarity actually becomes worse as we add more shapes to the dataset. This seems unlikely to be simply because the receiver struggles to learn anything at all, since the end-to-end training accuracy stays relatively high across all three datasets. We note that the ILM effect is only apparent over the first few generations, reaching a plateau after around 2-3 generations.

In this paper, we proposed an architecture to use the iterated learning method ("ILM") for neural networks, including for non-symbolic high-dimensional input. We showed that using ILM with neural networks does not lead to the same clear compositionality as observed for DCGs. However, we showed that ILM does lead to a modest increase in compositionality, as measured by both holdout accuracy and topographic similarity. We showed that holdout accuracy and topographic rho can be anti-correlated with each other, in the presence of ILM. Thus caution should be exercised when using only a single one of these measures. We showed that ILM leads to an increase in compositionality for non-symbolic high-dimensional input images.
Acknowledgements
Thank you to Angeliki Lazaridou for many interesting discussions and ideas that I've tried to use in this paper.
References
Jacob Andreas. 2019. Measuring compositionality in representation learning. arXiv preprint arXiv:1902.07181.

Jacob Andreas, Anca Dragan, and Dan Klein. 2017. Translating neuralese. arXiv preprint arXiv:1704.06960.

Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, Curran Associates, Inc., pages 2654–2662. http://papers.nips.cc/paper/5484-do-deep-nets-really-need-to-be-deep.pdf.

Diane Bouchacourt and Marco Baroni. 2018. How agents see things: On visual representations in an emergent language game. arXiv preprint arXiv:1808.10696.

Henry Brighton and Simon Kirby. 2006. Understanding linguistic evolution by visualizing the emergence of topographic mappings. Artificial Life.
Gautier Dagan, Dieuwke Hupkes, and Elia Bruni. 2020. Co-evolution of language and agents in referential games. arXiv preprint arXiv:2001.03361.

Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. 2016. Learning to communicate with deep multi-agent reinforcement learning. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29, Curran Associates, Inc., pages 2137–2145.
Tommaso Furlanello, Zachary C Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. 2018. Born again neural networks. arXiv preprint arXiv:1805.04770.

Thomas L Griffiths and Michael L Kalish. 2007. Language evolution by iterated learning with bayesian agents. Cognitive Science.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Simon Kirby. 2001. Spontaneous evolution of linguistic structure - an iterated learning model of the emergence of regularity and irregularity. IEEE Transactions on Evolutionary Computation.

Simon Kirby, Hannah Cornish, and Kenny Smith. 2008. Cumulative cultural evolution in the laboratory: An experimental approach to the origins of structure in human language. Proceedings of the National Academy of Sciences.

Satwik Kottur, José M.F. Moura, Stefan Lee, and Dhruv Batra. 2017. Natural language does not emerge 'naturally' in multi-agent dialog. arXiv preprint arXiv:1706.08502.

Angeliki Lazaridou, Karl Moritz Hermann, Karl Tuyls, and Stephen Clark. 2018. Emergence of linguistic communication from referential games with symbolic and pixel input. arXiv preprint arXiv:1804.03984.

Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. 2016. Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182.

Fushan Li and Michael Bowling. 2019. Ease-of-teaching and language structure from emergent communication. In Advances in Neural Information Processing Systems, pages 15851–15861.
Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. 2019. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning. PMLR, pages 4114–4124.

Ryan Lowe, Jakob Foerster, Y-Lan Boureau, Joelle Pineau, and Yann Dauphin. 2019. On the pitfalls of measuring emergent communication. arXiv preprint arXiv:1903.05168.

Yi Ren, Shangmin Guo, Matthieu Labeau, Shay B Cohen, and Simon Kirby. 2020. Compositional languages emerge in a neural iterated learning model. arXiv preprint arXiv:2002.01365.

Appendix: hyper-parameters

For all experiments, results and error bars are reported using five runs per scenario. We pre-select which runs to use for reporting before running them.
6.1 Experiment 1

For experiment 1, we use a batch size of 100 and an embedding size of 50. RNNs are chosen to be GRUs. We query the teacher for utterances for 40% of the training meaning space each generation. We use an utterance length of 6, and a vocabulary size of 4.

6.2 Experiment 2