Anti-efficient encoding in emergent communication
Rahma Chaabouni, Eugene Kharitonov, Emmanuel Dupoux, Marco Baroni
Facebook AI Research
Cognitive Machine Learning (ENS - EHESS - PSL Research University - CNRS - INRIA)
ICREA
{rchaabouni,kharitonov,dpx,mbaroni}@fb.com
Abstract
Despite renewed interest in emergent language simulations with neural networks, little is known about the basic properties of the induced code, and how they compare to human language. One fundamental characteristic of the latter, known as Zipf's Law of Abbreviation (ZLA), is that more frequent words are efficiently associated to shorter strings. We study whether the same pattern emerges when two neural networks, a "speaker" and a "listener", are trained to play a signaling game. Surprisingly, we find that networks develop an anti-efficient encoding scheme, in which the most frequent inputs are associated to the longest messages, and messages in general are skewed towards the maximum length threshold. This anti-efficient code appears easier to discriminate for the listener, and, unlike in human communication, the speaker does not impose a contrasting least-effort pressure towards brevity. Indeed, when the cost function includes a penalty for longer messages, the resulting message distribution starts respecting ZLA. Our analysis stresses the importance of studying the basic features of emergent communication in a highly controlled setup, to ensure the latter will not depart too far from human language. Moreover, we present a concrete illustration of how different functional pressures can lead to successful communication codes that lack basic properties of human language, thus highlighting the role such pressures play in the latter.
There is renewed interest in simulating language emergence among neural networks that interact to solve a task, motivated by the desire to develop automated agents that can communicate with humans [e.g., Havrylov and Titov, 2017, Lazaridou et al., 2017, 2018, Lee et al., 2018]. As part of this trend, several recent studies analyze the properties of the emergent codes [e.g., Kottur et al., 2017, Bouchacourt and Baroni, 2018, Evtimova et al., 2018, Lowe et al., 2019, Graesser et al., 2019]. However, these analyses generally consider relatively complex setups, when very basic characteristics of the emergent codes have yet to be understood. We focus here on one such characteristic, namely the length distribution of the messages that two neural networks playing a simple signaling game come to associate to their inputs, as a function of input frequency.

In his pioneering studies of lexical statistics, George Kingsley Zipf noticed a robust trend in human language that came to be known as Zipf's Law of Abbreviation (ZLA): There is an inverse (non-linear) correlation between word frequency and length [Zipf, 1949, Teahan et al., 2000, Sigurd et al., 2004, Strauss et al., 2007]. Assuming that shorter words are easier to produce, this is an efficient encoding strategy, particularly effective given Zipf's other important discovery that word distributions are highly skewed, following a power-law distribution. Indeed, in this way language approaches an optimal code in information-theoretic terms [Cover and Thomas, 2006]. Zipf, and many after him, have thus used ZLA as evidence that language is shaped by functional pressures toward effort minimization [e.g., Piantadosi et al., 2011, Mahowald et al., 2018, Gibson et al., 2019]. However, others [e.g., Mandelbrot, 1954, Miller et al., 1957, Ferrer i Cancho and del Prado Martín, 2011, del Prado Martín, 2013] noted that some random-typing distributions also respect ZLA, casting doubts on functional explanations of the observed pattern.

We study a Speaker network that gets one out of K distinct one-hot vectors as input, randomly drawn from a power-law distribution (so that frequencies are extremely skewed, like in natural language). Speaker transmits a variable-length message to a Listener network. Listener outputs a one-hot vector, and the networks are rewarded if the latter is identical to the input. There is no direct supervision on the message, so that the networks are free to create their own "language". The networks develop a successful communication system that does not exhibit ZLA, and is indeed anti-efficient, in the sense that all messages are long, and the most frequent inputs are associated to the longest messages. Interestingly, a similar effect is observed in artificial human communication experiments, in conditions in which longer messages do not demand extra effort to speakers, so that they are preferred as they ease the listener discrimination task [Kanwal et al., 2017]. Our Speaker network, unlike humans, has no physiological pressure towards brevity [Chaabouni et al., 2019], and our Listener network displays an a priori preference for longer messages. Indeed, when we penalize Speaker for producing longer strings, the emergent code starts obeying ZLA. We examine the implications of our findings in the Discussion.
We designed a variant of the Lewis signaling game [Lewis, 1969] in which the input distribution follows a power-law distribution. We think of these inputs as a vocabulary of distinct abstract word types, to which the agents will assign specific word forms while learning to play the game. We leave it to further research to explore setups in which word type and form distributions co-evolve [Ferrer i Cancho and Díaz-Guilera, 2007]. Importantly, our basic inefficient encoding result also holds when the inputs are uniformly distributed (Appendix A.1.5). Formally, the game proceeds as follows:

1. The Speaker network receives one of K distinct one-hot vectors as input i. Inputs are not drawn uniformly, but, like in natural language, from a power-law distribution. That is, the r-th most frequent input i_r has probability $\frac{1}{r \times \sum_{k=1}^{K} \frac{1}{k}}$ of being sampled, with r ∈ {1, ..., K}. Consequently, with K = 1000 inputs, the probability of sampling the 1st input is about 0.13, while the probability of sampling the 1000th one is 1000 times lower.

2. Speaker chooses a sequence of symbols from its alphabet A = {s_1, s_2, ..., s_{a-1}, eos} of size |A| = a to construct a message m, terminated as soon as Speaker produces the 'end-of-sequence' token eos. If Speaker has not yet emitted eos at position max_len − 1, it is stopped and eos is appended at the end of its message (so that all messages are suffixed with eos and no message is longer than max_len).

3. The Listener network consumes m and outputs î.

4. The agents are successful if i = î, that is, Listener reconstructed Speaker's input.

The game is implemented using the EGG toolkit [Kharitonov et al., 2019], and the code can be found at https://github.com/facebookresearch/EGG/tree/master/egg/zoo/channel.

As standard in current emergent-language simulations [e.g., Lazaridou et al., 2018], both agents are implemented as single-layer LSTMs [Hochreiter and Schmidhuber, 1997]. Speaker's input is a K-dimensional one-hot vector i, and the output is a sequence of symbols, defining message m. This sequence is generated as follows. A linear layer maps the input vector into the initial hidden state of Speaker's LSTM cell. Next, a special start-of-sequence symbol is fed to the cell. At each step of the sequence, the output layer defines a Categorical distribution over the alphabet. At training time, we sample from this distribution. During evaluation, we select the symbol greedily. Each selected symbol is fed back to the LSTM cell. The dimensionalities of the hidden state vectors are part of the hyper-parameters we explore (Appendix A.1.1). Finally, we initialize the weight matrices of our agents with a uniform distribution with support in $[-\frac{1}{\sqrt{\mathrm{input\_size}}}, \frac{1}{\sqrt{\mathrm{input\_size}}}]$, where input_size is the dimensionality of the matrix input (PyTorch default initialization).

Listener consumes the entire message m, including eos. After eos is received, Listener's hidden state is passed through a fully-connected layer with softmax activation, determining a Categorical distribution over K indices. This distribution is used to calculate the cross-entropy loss w.r.t. the ground-truth input i.
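To make the generation procedure concrete, here is a minimal, non-batched sketch of a Speaker along the lines described above (symbol ids, class and variable names are ours; the actual EGG implementation differs in its details):

```python
import torch
import torch.nn as nn

class SpeakerSketch(nn.Module):
    """Sketch of the Speaker agent: one-hot input -> variable-length discrete message."""

    EOS = 0  # we reserve symbol id 0 for eos in this sketch

    def __init__(self, n_inputs, hidden_size, vocab_size, max_len):
        super().__init__()
        self.init_h = nn.Linear(n_inputs, hidden_size)           # input -> initial hidden state
        self.cell = nn.LSTMCell(hidden_size, hidden_size)
        self.embed = nn.Embedding(vocab_size + 1, hidden_size)   # extra slot for start-of-sequence
        self.out = nn.Linear(hidden_size, vocab_size)             # logits over the alphabet
        self.sos = vocab_size
        self.max_len = max_len

    def forward(self, x):                                          # x: (1, n_inputs) one-hot float tensor
        h = self.init_h(x)
        c = torch.zeros_like(h)
        prev = self.embed(torch.tensor([self.sos]))
        message, log_prob = [], 0.0
        for _ in range(self.max_len - 1):                          # last slot is reserved for eos
            h, c = self.cell(prev, (h, c))
            dist = torch.distributions.Categorical(logits=self.out(h))
            s = dist.sample()                                      # at evaluation time, argmax instead
            log_prob = log_prob + dist.log_prob(s)
            if s.item() == self.EOS:                               # message ends as soon as eos appears
                break
            message.append(s.item())
            prev = self.embed(s)
        message.append(self.EOS)                                   # every message is suffixed with eos
        return message, log_prob
```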
The joint Speaker-Listener architecture can be seen as a discrete auto-encoder [Liou et al., 2014]. The architecture is not directly differentiable, as messages are discrete-valued. In language emergence, two approaches are dominantly used: Gumbel-Softmax relaxation [Maddison et al., 2016, Jang et al., 2016] and REINFORCE [Williams, 1992]. We also experimented with the approach of Schulman et al. [2015], combining REINFORCE and stochastic backpropagation to estimate gradients. Preliminary experiments showed that the latter algorithm (reviewed next) results in the fastest and most stable convergence, and we used it in all the following experiments. However, the main results we report were also observed with the other algorithms, when successful.

We denote by θ_s and θ_l the Speaker and Listener parameters, respectively. L is the cross-entropy loss, which takes the ground-truth one-hot vector i and Listener's output distribution L(m) as inputs. We want to minimize the expectation of the cross-entropy loss E L(i, L(m)), where the expectation is calculated w.r.t. the joint distribution of inputs and message sequences. The gradient of the following surrogate function is an unbiased estimate of the gradient ∇_{θ_s ∪ θ_l} E L(i, L(m)):

$$\mathbb{E}\big[\,\mathcal{L}(i, L(m; \theta_l)) + \big(\{\mathcal{L}(i, L(m; \theta_l))\} - b\big)\,\log P_s(m \mid \theta_s)\,\big] \qquad (1)$$

where {·} is the stop-gradient operation, P_s(m | θ_s) is the probability of producing the sequence m when Speaker is parameterized with vector θ_s, and b is a running-mean baseline used to reduce the estimate variance without introducing a bias. To encourage exploration, we also apply an entropy regularization term [Williams and Peng, 1991] to the output distribution of the speaker agent. Effectively, under Eq. 1, the gradient of the loss w.r.t. the Listener parameters is found via conventional backpropagation (the first term in Eq. 1), while Speaker's gradient is found with a REINFORCE-like procedure (the second term). Once the gradient estimate is obtained, we feed it into the Adam optimizer [Kingma and Ba, 2014]. We explore different learning rate and entropy regularization coefficient values (Appendix A.1.1).

We train agents for a fixed number of episodes, each consisting of several mini-batches of inputs sampled from the power-law distribution with replacement. After training, we present each input to the system once, to compute accuracy by giving equal weight to all inputs, independently of the amount of training exposure.
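For concreteness, the surrogate objective of Eq. 1 can be sketched as follows (function and argument names are illustrative, not the EGG API; msg_log_prob is the summed log-probability of the sampled message symbols):

```python
import torch
import torch.nn.functional as F

def surrogate_loss(listener_logits, target, msg_log_prob, baseline, entropy, ent_coef=0.1):
    """listener_logits: (batch, K); target: (batch,) ground-truth input indices;
    msg_log_prob: (batch,) log P_s(m | theta_s); baseline: scalar running mean of past losses;
    entropy: (batch,) entropy of Speaker's per-step output distributions."""
    ce = F.cross_entropy(listener_logits, target, reduction='none')       # per-example loss
    listener_term = ce.mean()                                             # differentiable w.r.t. Listener
    # Speaker part: REINFORCE with a baseline; the loss value is treated as a constant (stop-gradient).
    speaker_term = ((ce.detach() - baseline) * msg_log_prob).mean()
    # Entropy regularization encourages exploration of the message space.
    return listener_term + speaker_term - ent_coef * entropy.mean()
```

In training, the baseline b would be maintained as a running mean of the observed cross-entropy values.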
As ZLA is typically only informally defined, we introduce three reference distributions that display efficient encoding and arguably respect ZLA.

Based on standard coding theory [Cover and Thomas, 2006], we design an optimal code (OC) guaranteeing the shortest average message length given a certain alphabet size and the constraint that all messages must end with eos. The shortest messages are deterministically associated to the most frequent inputs, leaving longer ones for less frequent ones. The length of the message associated to an input is determined as follows. Let A = {s_1, s_2, ..., s_{a-1}, eos} be the alphabet of size a and i_r be the r-th input when ranked by frequency. Then i_r is mapped to a message of length

$$l_{i_r} = \min\Big\{ n : \sum_{k=1}^{n} (a-1)^{k-1} \ge r \Big\} \qquad (2)$$

For instance, if a = 3, then there is only one message of length 1 (associated to the most frequent referent), 2 of length 2, 4 of length 3, etc. Section 2 of Ferrer i Cancho et al. [2013] presents a proof of how this encoding is the maximally efficient one.
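A direct transcription of Eq. 2 (a sketch; we assume 1-indexed frequency ranks and an alphabet size a that includes eos):

```python
def optimal_length(r, a):
    """Length (eos included) of the message that optimal coding assigns to the
    r-th most frequent input, for an alphabet of size a (eos included)."""
    n, total = 1, 0
    while True:
        total += (a - 1) ** (n - 1)   # number of distinct messages of length exactly n
        if total >= r:
            return n
        n += 1
```

For a = 3 this returns lengths 1, 2, 2, 3, 3, 3, 3, 4, ... for r = 1, 2, 3, ..., matching the example above.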
Natural languages respect ZLA without being as efficient as OC. It has been observed that
Monkey typing (MT) processes, whereby a monkey hits random typewriter keys including a space character, produce word length distributions remarkably similar to those attested in natural languages [Simon, 1955, Miller et al., 1957]. We thus adapt a MT process to our setup, as a less strict benchmark for network efficiency. (No actual monkey was harmed in the definition of the process.)

We first sample an input without replacement according to the power-law distribution, then generate the message to be associated with it. We repeat the process until all inputs are assigned a unique message. The message is constructed by letting a monkey hit the a keys of a typewriter uniformly at random (p = 1/a), subject to these constraints: (i) The message ends when the monkey hits eos. (ii) A message cannot be longer than a specified length max_len. If the monkey has not yet emitted eos at position max_len − 1, it is stopped and eos is appended at the end of the message. (iii) If a generated message is identical to one already used, it is rejected and another is generated.

For a given length l, there are only (a − 1)^{l−1} different messages (in particular, there is always only one message of length 1, namely eos itself, irrespective of alphabet size). Moreover, for a random generator with the max_len constraint, the probability of generating a message of length l is:

$$P_l = p \times (1-p)^{l-1} \;\; \text{if } l < max\_len, \qquad P_{max\_len} = (1-p)^{max\_len - 1} \qquad (3)$$

(Note that we did not use the uniqueness-of-messages constraint in deriving P_l.) From these calculations, we derive two qualitative observations about MT. First, as we fix max_len and increase a (decrease p = 1/a), more generated messages will reach max_len. Second, when a is small and max_len is large (as in early MT studies where max_len was infinite), a ZLA-like distribution emerges, due to the finite number of different messages of length l. Indeed, for any l less than max_len, P_l strictly decreases as l grows. Then, for given inputs, the monkey is likely to start by generating messages of the most probable length (that is, 1). As we exhaust all unique messages of this length, the process starts generating messages of the next most probable length (i.e., 2), and so on. Figure A1 in Appendix A.1.2 confirms experimentally that our MT distribution respects ZLA for small enough a and various max_len.

We finally consider word length distributions in natural language corpora. We used pre-compiled English, Arabic, Russian and Spanish frequency lists from http://corpus.leeds.ac.uk/serge/, extracted from large corpora of Internet text. For direct comparability with input set cardinality in our simulations, we only looked at the distribution of the top 1000 most frequent words, after merging lower- and upper-cased forms, and removing words containing non-alphabetical characters. The resulting word frequency distributions obeyed power laws with exponents close to −1 (we used −1 to generate our inputs). Alphabet sizes vary across languages, with English having the smallest alphabet and Arabic the largest. These are larger than normative sizes, as unfiltered Internet text will occasionally include foreign characters (e.g., accented letters in English text). Contrary to the previous reference distributions, we cannot control max_len and alphabet size. We hence compare human and network distributions only in the adequate settings. In the main text, we present results for the languages with the smallest (English) and largest (Arabic) alphabets. The distributions of the other languages are comparable, and presented in Appendix A.1.3.
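A sketch of the MT generator with the three constraints above (we let symbol 0 play the role of eos; names are ours, and we assume the message space is large enough for the uniqueness constraint to be satisfiable):

```python
import random

def monkey_message(a, max_len):
    """Sample one message: uniform symbols until eos (0) or max_len is reached."""
    msg = []
    while len(msg) < max_len - 1:
        s = random.randrange(a)          # each of the a symbols has probability 1/a
        if s == 0:                       # constraint (i): stop when eos is hit
            break
        msg.append(s)
    return tuple(msg + [0])              # constraint (ii): eos appended, length <= max_len

def monkey_code(n_inputs, a, max_len):
    """Assign a distinct MT message to each input, processed in frequency order."""
    used, code = set(), []
    for _ in range(n_inputs):
        m = monkey_message(a, max_len)
        while m in used:                 # constraint (iii): reject duplicates and re-sample
            m = monkey_message(a, max_len)
        used.add(m)
        code.append(m)
    return code
```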
We experiment with alphabet sizes a ∈ {3, 5, 10, 40, 1000}. We chose mainly small alphabet sizes to minimize a potential bias in favor of long messages: For high a, randomly generating long messages becomes more likely, as the probability of outputting eos at random becomes lower. At the other extreme, we also consider a = 1000, where the Speaker could in principle successfully communicate using at most 2-symbol messages (as Speaker needs to produce eos). Finally, a = 40 was chosen to be close to the mean alphabet size of the natural languages we study.

After fixing a, we choose max_len so that agents have enough capacity to describe the whole input space (|I| = 1000). For a given a and max_len, Speaker cannot encode more inputs than the message space size $M_a^{max\_len} = \sum_{j=1}^{max\_len} (a-1)^{j-1}$. We experiment with max_len ∈ {2, 6, 11, 30}. We couldn't use higher values because of memory limitations. Furthermore, we studied the effect of the ratio $D = M_a^{max\_len} / |I|$. While making sure that this ratio is at least 1, we experiment with low values, where Speaker would have to use nearly the whole message space to successfully denote all inputs. We also considered settings with significantly larger D, where constructing K distinct messages might be an easier task. A concrete computation of M and D is sketched below.

We train models for each (max_len, a) setting and agent hyperparameter choice (4 seeds per choice). We consider runs successful if, after training, they achieve near-perfect accuracy on the full input set (i.e., only a few mis-classified inputs). As predicted, the higher D is, the more accurate the agents become. Indeed, agents need much larger D than strictly necessary in order to converge. We select for further analysis only those (max_len, a) choices that resulted in a sufficiently large number of successful runs. Moreover, we focus here on configurations with max_len = 30, as the most comparable to natural language (natural languages have no rigid upper bound on length, and 30 is the highest max_len we were able to train models for; qualitative inspection of the respective corpora suggests that 30 is anyway a reasonable "soft" upper bound on word length in the languages we studied, longer strings being mostly typographic detritus). We present results for all selected configurations (confirming the same trends) in Appendix A.1.4.
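The message-space size M and the ratio D mentioned above reduce to a one-line computation (sketch; the example values are arithmetic illustrations, not results from the paper):

```python
def message_space_size(a, max_len):
    """Number of distinct eos-terminated messages of length <= max_len over an
    alphabet of size a (eos included): sum_{j=1}^{max_len} (a-1)^(j-1)."""
    return sum((a - 1) ** (j - 1) for j in range(1, max_len + 1))

# With a = 3 and max_len = 11 the space holds only 2047 messages, barely above the
# |I| = 1000 inputs (D ~ 2); with a = 40 and max_len = 30 the space is astronomically
# larger than |I| (D >> 1).
print(message_space_size(3, 11) / 1000)   # ~2.05
```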
Figure 1 shows the message length distribution (averaged across all successful runs) as a function of input frequency rank, compared to our reference distributions. The MT results are averaged across 50 different runs. We show the Arabic and English distributions in the plot containing the most comparable simulation settings (max_len = 30, a = 40).

Across configurations, we observe that Speaker messages greatly depart from ZLA. There is a clear general preference for longer messages, which is strongest for the most frequent inputs, where Speaker outputs messages of length max_len. That is, in the emergent encoding, more frequent words are longer, making the system obey a sort of "anti-ZLA" (see Appendix A.1.6 for confirmation that this anti-efficient pattern is statistically significant). Consequently, the emergent language distributions are well above all reference distributions, except for MT with a = 1000, where the large alphabet size leads to uniformly long words, for reasons discussed in Section 2.4.2. Finally, the lack of efficiency in emergent language encodings is also observed when inputs are uniformly distributed (see Appendix A.1.5). Although some animal signing systems disobey ZLA, due to specific environmental constraints [e.g., Heesen et al., 2019], a large survey of human and animal communication did not find any case of significantly anti-efficient systems [Ferrer i Cancho et al., 2013], making our finding particularly intriguing.

[Figure 1 here: four panels, message length vs. inputs sorted by frequency, for max_len = 30 and different alphabet sizes a; legend: emergent messages, monkey typing, optimal coding, English, Arabic.]

Figure 1: Mean message length across successful runs as a function of input frequency rank, with reference distributions. For readability, we smooth natural language distributions by reporting a sliding average over consecutive lengths.

3.2 Causes of anti-efficient encoding

We explore the roots of anti-efficiency by looking at the behavior of untrained Speakers and Listeners. Earlier work conjectured that ZLA emerges from the competing pressures to communicate in a perceptually distinct and articulatorily efficient manner [Zipf, 1949, Kanwal et al., 2017]. For our networks, there is a clear pressure from Listener in favour of ease of message discriminability, but Speaker has no obvious reason to save on "articulatory" effort. We thus predict that the observed pattern is driven by a Listener-side bias.
For each i drawn from the power-law distribution without replacement, we get a message m from distinct untrained Speakers (several Speakers for each hidden size in {100, 250, 500}). We experiment with two different association processes. In the first, we associate the first generated m to i, irrespective of whether it was already associated to another input. In the second, we keep generating a m for i until we get a message that was not already associated to a distinct input. The second version is closer to the MT process (see Section 2.4.2). Moreover, message uniqueness is a reasonable constraint, since, in order to succeed, Speakers need first of all to keep messages denoting different inputs apart.

Figure 2 shows that untrained Speakers have no prior toward outputting long sequences of symbols. Precisely, from Figure 2 we see that the untrained Speakers' average message length coincides with the one produced by the random process defined in Eq. 3 with p = 1/a. In other words, untrained Speakers are equivalent to a random generator with uniform probability over symbols. (We verified that untrained Speakers indeed have uniform probability over the different symbols.) Consequently, when imposing message uniqueness, non-trained Speakers become identical to MT. Hence, Speakers faced with the task of producing distinct messages for the inputs, if vocabulary size is not too large, would naturally produce a ZLA-obeying distribution, which is radically altered in joint Speaker-Listener training.

[Figure 2 here: three panels, message length vs. inputs sorted by frequency, for (a) max_len = 30, a = 3; (b) max_len = 30, a = 5; (c) max_len = 30, a = 40; legend: untrained Speaker with uniqueness constraint, untrained Speaker, monkey typing.]

Figure 2: Average length of messages by input frequency rank for untrained Speakers, compared to MT. See Appendix A.1.7 for more settings.
Having shown that untrained Speakers do not favor long messages, we ask next if the emergent anti-efficient language is easier to discriminate by untrained Listeners than other encodings. To this end, we compute the average pairwise L2 distance of the hidden representations produced by untrained Listeners in response to the messages associated to all inputs. (Results are similar if looking at the softmax layer instead.) Messages that are further apart in the representational space of the untrained Listener should be easier to discriminate. Thus, if Speaker associates such messages to the inputs, it will be easier for Listener to distinguish them.

We use a set of distinct untrained Listeners with 100-dimensional hidden size. (We fix this value because, unlike for Speaker, Listener's hidden size has considerable impact on performance, with 100 being the preferred setting.) We test different encodings: (1) emergent messages (produced by trained Speakers), (2) MT messages (averaged over multiple runs), (3) OC messages and (4) human languages. Note that MT is equivalent to the untrained Speaker, as their messages share the same length and alphabet distribution (see Section 3.2.1). We study Listeners' biases with max_len = 30 while varying a, as messages are more distinct from the reference distributions in that case (see Figure A3 in Appendix A.1.4). Results are reported in Figure 3. Representations produced in response to the emergent messages have the highest average distance. MT only approximates the emergent language for a = 1000, where, as seen in Figure 1 above, MT is anti-efficient. The trained Speaker messages are hence a priori easier for non-trained Listeners. The length of these messages could thus be explained by an intrinsic Listener bias, as conjectured above. Also, interestingly, natural languages are not easy to process by Listeners. This suggests that the emergence of "natural" languages in LSTM agents is unlikely, without imposing ad-hoc pressures.

[Figure 3 here: average pairwise L2 distance vs. alphabet size (log scale), for emergent messages, monkey typing, optimal coding, English and Arabic.]

Figure 3: Average pairwise distance between message representations in Listener's hidden space, across all considered non-trained Listeners. Vertical lines mark standard deviations across Listeners.
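A sketch of this discriminability probe, assuming messages are eos-padded to a common length and read to the end by a single randomly initialized Listener (the paper averages over many untrained Listeners; names are ours):

```python
import torch
import torch.nn as nn

def mean_pairwise_distance(messages, vocab_size, hidden_size=100, seed=0):
    """messages: LongTensor of shape (n_messages, length), eos-padded to a common length.
    Returns the average pairwise L2 distance between an untrained Listener's
    final hidden representations of the messages."""
    torch.manual_seed(seed)                        # one randomly initialized (untrained) Listener
    embed = nn.Embedding(vocab_size, hidden_size)
    lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
    with torch.no_grad():
        _, (h_n, _) = lstm(embed(messages))        # h_n: (1, n_messages, hidden_size)
        reps = h_n.squeeze(0)                      # one representation per message
        return torch.pdist(reps, p=2).mean().item()
```

In practice, this statistic would be computed for each candidate code (emergent, MT, OC, natural language) and averaged over several untrained Listeners.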
We next impose an artificial pressure on Speaker to produce short messages, to counterbalance Listener's preference for longer ones. Specifically, we add a regularizer disfavoring longer messages to the original loss:

$$\mathcal{L}'(i, L(m), m) = \mathcal{L}(i, L(m)) + \alpha \times |m| \qquad (4)$$

where L(i, L(m)) is the cross-entropy loss used before, |·| denotes length, and α is a hyperparameter. The non-differentiable term α × |m| is handled seamlessly, as it only depends on Speaker's parameters θ_s (which specify the distribution of the messages m), and the gradient of the loss w.r.t. θ_s is estimated via a REINFORCE-like term (Eq. 1). Figure 4 shows the emergent message length distribution under this objective, comparing it to the other reference distributions in the most human-language-like setting (max_len = 30, a = 40). The same pattern is observed elsewhere (see Appendix A.1.8, which also evaluates the impact of the α hyperparameter). The emergent messages clearly follow ZLA. Speaker now assigns messages of ascending length to the most frequent inputs. For the remaining ones, it chooses messages with relatively similar, but notably shorter, lengths (always much shorter than MT messages). Still, the encoding is not as efficient as the one observed in natural language (and OC). Also, when adding length regularization, we noted a slower convergence, with a smaller number of successful runs, which further diminishes when α increases.
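Building on the surrogate loss sketched earlier, the length penalty of Eq. 4 only enters Speaker's REINFORCE term (a sketch with illustrative names; α is left as a free argument):

```python
import torch.nn.functional as F

def length_penalized_loss(listener_logits, target, msg_log_prob, msg_len, baseline, alpha):
    """Eq. 4: add alpha * |m| to the loss; only Speaker's REINFORCE term sees the penalty."""
    ce = F.cross_entropy(listener_logits, target, reduction='none')
    penalized = ce + alpha * msg_len.float()               # L'(i, L(m), m) = L + alpha * |m|
    listener_term = ce.mean()                               # Listener gradient: plain cross-entropy
    speaker_term = ((penalized.detach() - baseline) * msg_log_prob).mean()
    return listener_term + speaker_term
```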
[Figure 4 here: message length vs. inputs sorted by frequency (1 to 1000), comparing emergent messages, emergent messages under the length pressure, monkey typing, optimal coding, English and Arabic.]

Figure 4: Mean length of messages across successful runs as a function of input frequency rank for max_len = 30, a = 40, with the length penalty of Eq. 4 applied. Natural language distributions are smoothed as in Fig. 1.

We conclude with a high-level look at what the long emergent messages are made of. Specifically, we inspect symbol unigram and bigram frequency distributions in the messages produced by trained Speaker in response to the K inputs (the eos symbol is excluded from counts). For direct comparability with natural language, we report results in the (max_len = 30, a = 40) setting, but the patterns are general. We observe in Figure 5(a) that, even if at initialization Speaker starts with a uniform distribution over its alphabet (not shown here), by the end of training it has converged to a very skewed one. Natural languages follow a similar trend, but their distributions are not nearly as skewed (see Figure A8(a) in Appendix A.2.1 for entropy analysis). We then investigate message structure by looking at the symbol bigram distribution. To this end, we build randomly generated control codes, constrained to have the same mean length and unigram symbol distribution as the emergent code. Intriguingly, we observe in Figure 5(b) a significantly more skewed emergent bigram distribution, compared to the controls. This suggests that, despite the lack of phonetic pressures, Speaker is respecting "phonotactic" constraints that are even sharper than those reflected in the natural language bigram distributions (see Figure A8(b) in Appendix A.2.1 for entropy analysis). In other words, the emergent messages are clearly not built out of random unigram combinations. Looking at the pattern more closely, we find the skewed bigram distribution to be due to a strong tendency to repeat the same character over and over, well beyond what is expected given the unigram symbol skew (see typical message examples in Appendix A.2). More quantitatively, across all runs with max_len = 30, if we denote the most probable symbols by s_1, ..., s_n, we observe P(s_r, s_r) > P(s_r)^2 (that is, the repeated bigram is more frequent than expected under independence) for the top-ranked symbols r, in the large majority of runs. We leave a better understanding of the causes and implications of these distributions to future work.

[Figure 5 here: panel (a) shows the frequency (in %) of the top 30 symbol unigrams for emergent messages, English and Arabic; panel (b) shows the frequency (in %) of the top 50 symbol bigrams for emergent messages, control messages, English and Arabic.]
Figure 5: Distribution of top symbol unigrams and bigrams (ordered by frequency) in different codes. Emergent and control messages are averaged across successful runs and different simulations, respectively, in the (max_len = 30, a = 40) setting.

We found that two neural networks faced with a simple communication task, in which they have to learn to generate messages to refer to a set of distinct inputs that are sampled according to a power-law distribution, produce an anti-efficient code where more frequent inputs are significantly associated to longer messages, and all messages are close to the allowed maximum length threshold. The results are stable across network and task hyperparameters (although we leave it to further work to replicate the finding with different network architectures, such as Transformers or CNNs). Follow-up experiments suggest that the emergent pattern stems from an a priori preference of the listener network for longer, more discriminable messages, which is not counterbalanced by a need to minimize articulatory effort on the side of the speaker. Indeed, when an artificial penalty against longer messages is imposed on the latter, we see a ZLA distribution emerging in the networks' communication code.

From the point of view of AI, our results stress the importance of controlled analyses of language emergence. Specifically, if we want to develop artificial agents that naturally communicate with humans, we want to ensure that we are aware of, and counteract, their unnatural biases, such as the one we uncovered here in favor of anti-efficient encoding. We presented a proof-of-concept example of how to get rid of this specific bias by directly penalizing long messages in the cost function, but future work should look into less ad hoc ways to condition the networks' language. Getting the encoding right seems particularly important, as efficient encoding has been observed to interact in subtle ways with other important properties of human language, such as regularity and compositionality [Kirby, 2001]. We also emphasize the importance of using power-law input distributions when studying language emergence, as the latter are a universal property of human language [Zipf, 1949, Baayen, 2001] largely ignored in previous simulations, which assume uniform input distributions.

ZLA is observed in all studied human languages. As mentioned above, some animal communication systems violate it [Heesen et al., 2019], but such systems are 1) limited in their expressivity and 2) do not display a significantly anti-efficient pattern. We complemented this earlier comparative research with an investigation of emergent language among artificial agents that need to signal a large number of different inputs. We found that the agents develop a successful communication system that does not exhibit ZLA, and is actually significantly anti-efficient. We connected this to an asymmetry in speaker vs. listener biases. This in turn suggests that ZLA in communication in general does not emerge from trivial statistical properties, but from a delicate balance of speaker and listener pressures. Future work should investigate emergent distributions in a wider range of artificial agents and environments, trying to understand which factors are determining them.
Acknowledgments

We would like to thank Fermín Moscoso del Prado Martín, Ramon Ferrer i Cancho, Serge Sharoff, the audience at REPL4NLP 2019 and the anonymous reviewers for helpful comments and suggestions.
References
Serhii Havrylov and Ivan Titov. Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. In Proceedings of NIPS, pages 2149–2159, Long Beach, CA, 2017.

Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. Multi-agent cooperation and the emergence of (natural) language. In Proceedings of ICLR Conference Track, Toulon, France, 2017. Published online: https://openreview.net/group?id=ICLR.cc/2017/conference.

Angeliki Lazaridou, Karl Moritz Hermann, Karl Tuyls, and Stephen Clark. Emergence of linguistic communication from referential games with symbolic and pixel input. In Proceedings of ICLR Conference Track, Vancouver, Canada, 2018. Published online: https://openreview.net/group?id=ICLR.cc/2018/Conference.

Jason Lee, Kyunghyun Cho, Jason Weston, and Douwe Kiela. Emergent translation in multi-agent communication. In Proceedings of ICLR Conference Track, Vancouver, Canada, 2018. Published online: https://openreview.net/group?id=ICLR.cc/2018/Conference.

Satwik Kottur, José Moura, Stefan Lee, and Dhruv Batra. Natural language does not emerge 'naturally' in multi-agent dialog. In Proceedings of EMNLP, pages 2962–2967, Copenhagen, Denmark, 2017.

Diane Bouchacourt and Marco Baroni. How agents see things: On visual representations in an emergent language game. In Proceedings of EMNLP, pages 981–985, Brussels, Belgium, 2018.

Katrina Evtimova, Andrew Drozdov, Douwe Kiela, and Kyunghyun Cho. Emergent communication in a multi-modal, multi-step referential game. In Proceedings of ICLR Conference Track, Vancouver, Canada, 2018. Published online: https://openreview.net/group?id=ICLR.cc/2018/Conference.

Ryan Lowe, Jakob Foerster, Y-Lan Boureau, Joelle Pineau, and Yann Dauphin. On the pitfalls of measuring emergent communication. In Proceedings of AAMAS, pages 693–701, Montreal, Canada, 2019.

Laura Graesser, Kyunghyun Cho, and Douwe Kiela. Emergent linguistic phenomena in multi-agent communication games. https://arxiv.org/abs/1901.08706, 2019.

George Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, Boston, MA, 1949.

William J Teahan, Yingying Wen, Rodger McNab, and Ian H Witten. A compression-based algorithm for Chinese word segmentation. Computational Linguistics, 26(3):375–393, 2000.

Bengt Sigurd, Mats Eeg-Olofsson, and Joost Van Weijer. Word length, sentence length and frequency – Zipf revisited. Studia Linguistica, 58(1):37–52, 2004.

Udo Strauss, Peter Grzybek, and Gabriel Altmann. Word length and word frequency. In Peter Grzybek, editor, Contributions to the Science of Text and Language, pages 277–294. Springer, Dordrecht, the Netherlands, 2007.

Thomas Cover and Joy Thomas. Elements of Information Theory, 2nd ed. Wiley, Hoboken, NJ, 2006.

Steven T Piantadosi, Harry Tily, and Edward Gibson. Word lengths are optimized for efficient communication. Proceedings of the National Academy of Sciences, 108(9):3526–3529, 2011.

Kyle Mahowald, Isabelle Dautriche, Edward Gibson, and Steven Piantadosi. Word forms are structured for efficient use. Cognitive Science, 42:3116–3134, 2018.

Edward Gibson, Richard Futrell, Steven Piantadosi, Isabelle Dautriche, Kyle Mahowald, Leon Bergen, and Roger Levy. How efficiency shapes human language. Trends in Cognitive Science, 2019. In press.

Benoit Mandelbrot. Simple games of strategy occurring in communication through natural languages. Transactions of the IRE Professional Group on Information Theory, 3(3):124–137, 1954.

George A Miller, E Newman, and E Friedman. Some effects of intermittent silence. American Journal of Psychology, 70(2):311–314, 1957.

Ramon Ferrer i Cancho and Fermín Moscoso del Prado Martín. Information content versus word length in random typing. Journal of Statistical Mechanics: Theory and Experiment, 2011(12):L12002, 2011.

Fermín Moscoso del Prado Martín. The missing baselines in arguments for the optimal efficiency of languages. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 35, 2013.

Jasmeen Kanwal, Kenny Smith, Jennifer Culbertson, and Simon Kirby. Zipf's law of abbreviation and the principle of least effort: Language users optimise a miniature lexicon for efficient communication. Cognition, 165:45–52, 2017.

Rahma Chaabouni, Eugene Kharitonov, Alessandro Lazaric, Emmanuel Dupoux, and Marco Baroni. Word-order biases in deep-agent emergent communication. In Proceedings of ACL, pages 5166–5175, Florence, Italy, 2019.

David Lewis. Convention: A philosophical study, 1969.

Ramon Ferrer i Cancho and Albert Díaz-Guilera. The global minima of the communicative energy of natural communication systems. Journal of Statistical Mechanics: Theory and Experiment, 2007(06):P06009, 2007.

Eugene Kharitonov, Rahma Chaabouni, Diane Bouchacourt, and Marco Baroni. EGG: a toolkit for research on emergence of language in games. arXiv preprint arXiv:1907.00852, 2019.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Cheng-Yuan Liou, Wei-Chen Cheng, Jiun-Wei Liou, and Daw-Ran Liou. Autoencoder for words. Neurocomputing, 139:84–96, 2014.

Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pages 3528–3536, 2015.

Ronald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Ramon Ferrer i Cancho, Antoni Hernández-Fernández, David Lusseau, Govindasamy Agoramoorthy, Minna J Hsu, and Stuart Semple. Compression as a universal principle of animal behavior. Cognitive Science, 37(8):1565–1578, 2013.

Herbert A Simon. On a class of skew distribution functions. Biometrika, 42(3/4):425–440, 1955.

Raphaela Heesen, Catherine Hobaiter, Ramon Ferrer-i Cancho, and Stuart Semple. Linguistic laws in chimpanzee gestural communication. Proceedings of the Royal Society B, 286(1896):20182900, 2019.

Simon Kirby. Spontaneous evolution of linguistic structure – an iterated learning model of the emergence of regularity and irregularity. IEEE Transactions on Evolutionary Computation, 5(2):102–110, 2001.

Harald Baayen. Word Frequency Distributions. Kluwer, Dordrecht, The Netherlands, 2001.
A.1 Supplementary
A.1.1 Hyperparameters
Both speaker and listener agents are single-layer LSTMs [Hochreiter and Schmidhuber, 1997]. We experiment with several combinations of (Speaker's hidden size, Listener's hidden size), with Speaker hidden sizes in {100, 250, 500}. We only experiment with combinations where Speaker's hidden size is larger than or equal to Listener's, because of the asymmetry in their tasks: as discussed in Section 3.1 of the main paper, the Speaker's search space $M_a^{max\_len}$ is generally larger than that of the Listener.

We use the Adam optimizer, exploring different learning rate values. We apply entropy regularization to Speaker's optimization, exploring several values of the regularization coefficient. We run the simulation with each hyperparameter setting 4 times with different random seeds.

A.1.2 Monkey typing
We adapt the Monkey typing (MT) process by adding the max_len constraint. The resulting distribution is ZLA-like only when vocabulary size a is small. Figure A1 illustrates this behavior: the higher a is, the further the MT distribution departs from a ZLA pattern.
[Figure A1 here: eleven panels, message length vs. inputs sorted by frequency (1 to 1000), for different (max_len, a) settings.]

Figure A1: Monkey typing encoding: Mean message length across 50 simulations as a function of input frequency rank.
A.1.3 Natural language distributions
We report in Figure A2 word length distributions for all the natural languages we considered, and compare them with (1) the optimal encoding (OC) and (2) the emergent language in the most comparable simulation setting (max_len = 30, a = 40). Despite their different alphabet sizes, natural languages pattern similarly: They follow ZLA, and approximate OC.

[Figure A2 here: word/message length vs. inputs sorted by frequency, for emergent messages, Spanish, English, Russian, Arabic and optimal coding.]

Figure A2: Word length in natural languages as a function of word frequency rank, compared to the average emergent code and OC in the (max_len = 30, a = 40) setting. For readability, we smooth natural language distributions by reporting a sliding average over consecutive lengths.

A.1.4 Anti-efficient emergent language

Figure A3 shows the message length distribution (averaged across all successful runs) as a function of input frequency rank, and compares it with the reference distributions. The results are in line with our finding in Section 3.1 of the main paper.

[Figure A3 here: nine panels, message length vs. inputs sorted by frequency, for different (max_len, a) settings; legend: emergent messages, monkey typing, optimal coding, English, Arabic.]

Figure A3: Mean message length across successful runs as a function of input frequency rank, with reference distributions. Natural language distributions are smoothed as in Fig. A2.
A.1.5 Emergent language with uniform input distribution
Agents' messages are very long also when the input distribution is uniform, see Figure A4. Their average length is significantly larger than that of MT messages with uniform inputs (the difference is highly significant according to a t-test).

[Figure A4 here: four panels, message length vs. inputs (uniformly distributed), for different a; legend: emergent messages, monkey typing, optimal coding.]

Figure A4: Mean message length per input across successful runs for max_len = 30 and different a. Inputs are uniformly distributed.

A.1.6 Randomization test
In the main paper, we observe a tendency for Speaker to use longer messages for frequent inputs, making its code obey a sort of "anti-ZLA". In this section, we provide quantitative support for this observation. We run the randomization test of Ferrer i Cancho et al. [2013]. We note $E = \sum_{i=1}^{K} p_i \times l_i$ the mean length of messages, where p_i is the probability of type i and l_i is the length of the corresponding message. A language that respects ZLA is characterized by a small E (optimal coding, OC, is associated with min(E)). Under H_0, the mean length of the encoding coincides with the mean length of a random permutation of messages across types. To be comparable with Ferrer i Cancho et al. [2013], we use the same number of permutations. Also, we adopt their definition of "left p-value" and "right p-value". If the left p-value is below the significance threshold, the studied encoding is significantly small (characterized by significantly smaller E than random permutations); if the right p-value is below the threshold, it is significantly large, corresponding to our notion of anti-efficiency.

Table A.1: Results of the randomization test (E, left and right p-values) for different codes when max_len = 30 and with different alphabet sizes a; significant left/right p-values are suffixed by an asterisk. See Table 1 of Ferrer i Cancho et al. [2013] for more codes to be compared with our results.

We observe in Table A.1 that H_0 is retained only for MT with large a, which, as we mentioned in the main paper, approaches a random length distribution in those cases, and for emergent messages with a = 1000. OC, natural languages, and the emergent language with Speaker-side length regularization are, in all the considered settings, significantly more efficient than chance. Importantly, the Emergent language results confirm LSTMs' natural preference for long messages (E approaching max_len) and significant anti-efficiency for a ≤ 40 (right p-value ≈ 0). When a = 1000, there is no frequency rank/length relation and all lengths ≈ max_len.
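A sketch of the randomization test described above (the permutation count and seed are illustrative; probs and lengths are the per-type probabilities and message lengths):

```python
import random

def randomization_test(probs, lengths, n_perm=100000, seed=0):
    """Returns (E, left_p, right_p): a small left_p marks a significantly efficient code,
    a small right_p marks a significantly anti-efficient one."""
    rng = random.Random(seed)
    E = sum(p * l for p, l in zip(probs, lengths))
    left = right = 0
    for _ in range(n_perm):
        perm = lengths[:]
        rng.shuffle(perm)                     # permute message lengths across types
        E_perm = sum(p * l for p, l in zip(probs, perm))
        left += E_perm <= E                   # permutations at least as efficient as the code
        right += E_perm >= E                  # permutations at least as inefficient as the code
    return E, left / n_perm, right / n_perm
```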
A.1.7 Speaker initial length distribution

Figure A5 plots message length as a function of input frequency rank for several settings. In particular, we report all (max_len, a) settings that succeeded when training the Speaker-Listener system. Here, however, no training is performed, so that we can observe Speaker's initial biases. The results are in line with our finding in Section 3.2.1 of the main paper.

[Figure A5 here: fourteen panels, message length vs. inputs sorted by frequency, for (max_len, a) settings with max_len ∈ {6, 11, 30}; legend: untrained Speaker with uniqueness constraint, untrained Speaker, monkey typing.]

Figure A5: Average length of messages as a function of input frequency rank for untrained Speakers, compared to MT. In each panel we report the results for a specific setting (max_len, a).
A.1.8 The effect of length regularization

We look here at the effect of the regularization coefficient α on the nature of the emergent encoding. To this end, we consider the setting that is least efficient when no length regularization is applied: (max_len = 30, a = 1000). The same pattern is also observed with different choices of max_len and a. Figure A6 shows, for α = 1, that emergent messages approximate optimal coding. For even larger values, we were not able to successfully train the system to communicate. This is in line with Zipf's view of competing pressures for accurate communication vs. efficiency. The emergent messages follow ZLA only when both pressures are at work. If the efficiency pressure is not present, agents come up with a communicatively effective but non-efficient encoding, as shown in Section A.1.4 and Section 3.1 of the main paper. However, if the efficiency pressure is too high, agents cannot converge on a protocol that is successful from the point of view of communication.

[Figure A6 here: three panels, message length vs. inputs sorted by frequency, for increasing values of α; legend: emergent messages, emergent messages with regularization, monkey typing, optimal coding.]

Figure A6: Length of messages as a function of input frequency for max_len = 30 and a = 1000, when varying α in the length regularization case.

A.2 Repetition in emergent messages
We report in Listings 1, 2, 3 and 4 examples of emergent messages in different settings. We notice that the agents extensively use repetition, even when a (vocabulary size) is large. This repetition, which results in the very skewed bigram distributions presented in Section 3.3 of the main paper, increases with higher max_len, as shown in Figure A7. Moreover, from Figure A7, we see that, unlike in emergent codes, this sort of repetition does not appear in natural language.

Listing 1: Emergent messages for the most frequent inputs (max_len: 11 and a: 40).
m1: 18, 5, 36, 36, 5, 5, 10, 5, 32, 8, eos
m2: 1, 36, 2, 36, 10, 13, 9, 29, 33, eos
m3: 29, 1, 8, 1, 39, 39, 9, 15, 10, 19, eos
m4: 29, 1, 36, 36, 36, 36, 5, 8, 13, 9, eos

Listing 2: Emergent messages for the most frequent inputs (max_len: 11 and a: 1000).
m1: 431, 431, 305, 305, 70, 70, 331, 391, 134, 581, eos
m2: 867, 288, 466, 466, 466, 737, 113, 77, 615, 615, eos
m3: 288, 466, 466, 466, 418, 144, 113, 615, 638, 615, eos
m4: 4, 4, 152, 152, 152, 468, 642, 615, 422, 134, eos

Listing 3: Emergent messages for the most frequent inputs (max_len: 30 and a: 5).
m1: 3, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 3, 4, eos
m2: 3, 1, 3, 3, 1, 1, 1, 1, 1, 1, 4, 4, 4, 4, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 3, 2, eos
m3: 1, 4, 4, 1, 1, 1, 1, 1, 1, 1, 4, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 4, 3, 1, eos
m4: 1, 4, 4, 1, 1, 1, 1, 1, 1, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 4, 2, 2, 4, 1, 4, eos

Listing 4: Emergent messages for the most frequent inputs (max_len: 30 and a: 40).
m1: 11, 11, 12, 24, 8, 8, 12, 24, 12, 12, 12, 12, 12, 12, 36, 24, 24, 35, 35, 35, 36, 36, 20, 15, 36, 19, 11, 31, 13, eos
m2: 13, 31, 31, 24, 8, 8, 8, 8, 8, 8, 8, 8, 8, 19, 24, 3, 3, 36, 36, 19, 29, 15, 31, 30, 31, 15, 19, 11, 13, eos
m3: 39, 8, 12, 8, 8, 8, 8, 25, 25, 25, 25, 25, 25, 25, 36, 24, 12, 12, 35, 35, 35, 18, 18, 11, 3, 7, 11, 7, 11, eos
m4: 14, 31, 8, 8, 8, 8, 8, 8, 24, 25, 25, 25, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 3, 2, 35, 30, 31, 21, 29, eos

[Figure A7 here: bar plot of mean message length for emergent messages indexed by their max_len and for two human languages, before and after removing repetitions.]

Figure A7: Mean message length (weighted by input probability, and averaged across successful runs) for various max_len and fixed a = 40, before and after removing all repetitions. A repetition here refers to a sequence of 2 or more consecutive identical symbols. Emergent messages are indexed by their max_len, and we add the same statistics for two human languages for comparison.

A.2.1 Entropy of symbol distributions in different codes

We report the entropy of symbol unigram and bigram distributions for different codes in Figures A8(a) and A8(b), respectively. We observe that, in both cases, the emergent code symbol distribution is more skewed than in any considered reference code.
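The unigram and bigram entropies of Figure A8 amount to the following computation (sketch; eos is assumed to be stripped from the messages, as in the main text, and the helper names are ours):

```python
import math
from collections import Counter

def entropy(counter):
    """Shannon entropy (natural log) of a distribution given by raw counts."""
    total = sum(counter.values())
    return -sum((c / total) * math.log(c / total) for c in counter.values())

def unigram_bigram_entropy(messages):
    """messages: list of symbol sequences (eos already removed)."""
    unigrams = Counter(s for m in messages for s in m)
    bigrams = Counter((m[i], m[i + 1]) for m in messages for i in range(len(m) - 1))
    return entropy(unigrams), entropy(bigrams)
```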
[Figure A8 here: two panels, (a) entropy of unigram distributions and (b) entropy of bigram distributions, for the codes Uniform, MT, Control (bigrams only), Emergent, English and Arabic.]

Figure A8: Entropy of symbol unigram and bigram distributions for different codes (in natural log). The higher the entropy, the more uniform the corresponding distribution is. The entropy of the uniform code is computed by assuming a uniform distribution over symbols (unigram) and over pairs of symbols (bigram). MT and control messages (see Section 3.3 of the main text) are averaged across different simulations in the (max_len = 30, a = 40) setting.