Creolizing the Web
Abhinav Tamaskar
Courant Institute of Mathematical Sciences, NYU, [email protected]
Roy Rinberg
Courant Institute of Mathematical Sciences, NYU, [email protected]
Sunandan Chakraborty
Indiana University, NYU, [email protected]
Bud Mishra
Courant Institute of Mathematical Sciences, NYU, [email protected]
The evolution of language has been a hotly debated subject with contradicting hypotheses and unreliable claims. Drawing from signalling games, dynamic population mechanics, machine learning and algebraic topology, we present a method for detecting evolutionary patterns in a sociological model of language evolution. We develop a minimalistic model that provides a rigorous base for any generalized evolutionary model for language based on communication between individuals. We also discuss theoretical guarantees of this model, ranging from stability of language representations to fast convergence of language by temporal communication and language drift in an interactive setting. Further, we present empirical results and their interpretations on a real-world dataset from Reddit to identify communities and echo chambers for opinions, which place obstructions to reliable communication among communities.
Key words: Language Evolution, Topological Data Analysis, Population Dynamics, Signalling Game, Machine Learning
Introduction
The mystery of language evolution and its (co-)evolution with learning continues to arouse intense debates. There are only a handful of conceptual frameworks for human languages that have found common acceptance: (i) Human language is a biological artefact, as opposed to a cultural artefact Lenneberg (1967). (ii) Human language builds on a hierarchical structure, whose depth is not upper-bounded Christiansen and Chater (1999). (iii) Human language acquisition occurs over a surprisingly short period aided primarily by positive examples Briscoe (2002), Li (1996). However, there are many other corollaries that seem to have found neither acceptance in theory nor utilization in tool-boxes that aim to automate natural language processing.

arXiv [cs.CL] Feb. Author: Article Short Title. Article submitted to Interfaces; manuscript no. (Please, provide the manuscript number!)

There are other similar questions in the biology of evolution: e.g., codon evolution and evolution of intercellular signaling, which are important in the emergence of cellularization and multi-cellular organisms, respectively Sharp and Matassi (1994). The theoretical framework for them can be built on information-asymmetric games and their conventional Nash equilibria, and can be tested experimentally in artificial cells with unnatural bases (and the resulting codons), and in modified cells with chimeric receptors, for instance. There are a few natural experiments that shed light on these processes, e.g., mitochondria and tumor cells, and they have also played an important role in our understanding of the evolution of these systems Acevedo et al. (2014), Korolev et al. (2014).

These systems, like human language, can also be thought of as encoding some form of inter-agent coordination (not necessarily faithfully) Traulsen et al. (2009). They also share a few other traits: e.g., (i) Universality, (ii) Stability and (iii) Near Optimality (with respect to a suitably selected utility); we will call them USNO-theories. A rigorous theory for human languages may seek to build on similar traits: (i) A universal grammar (with some flexibility for parametrization) Cook and Newson (2014), (ii) Stability (with faithful acquisition using a meager amount of positive stimuli) Eisner and McQueen (2006), Nichols (1992) and (iii) Near Optimality (as a solution to minimal design specifications) Escudero (2005), Case and Moelius (2008). However, hypotheses related to the physiology of a language organ or the genetics of linguistic phenotypes are not readily testable experimentally, as human language is unique to humans, thus imposing stringent ethical barriers against their experimental manipulation. Some analyses of bird-songs have been useful, but not very conclusive (for obvious reasons).
In silico models that work reasonably in the context of machine learning and artificial intelligence have focused on large text corpora and semi-supervised learning (with a massive number of counter-examples) that do not capture the human context and remain orthogonal to the biology of languages Collobert and Weston (2008).

Interesting natural experiments that are thought to have lent support to USNO-theories are in the creolization process, where a group of individuals from the Old World are assembled with no common human language to use for coordination, but who give rise to a second generation of New World speech community that invents a human language (Creole) with a new parametrization of the universal grammar, but also enjoying the stability and near-optimality that is common to already-existing human languages Siegel (1997). However, while Creole languages can be studied, their evolution remains poorly understood as there exists no data recording their historical dynamics Hymes (1971). As Crick’s Frozen Accident hypothesis and the Cambrian explosion have been used to explain codon evolution or multi-cellularity, there has been human language evolution’s Pop hypothesis that suggests creolization would happen suddenly and freeze quickly, not thawing ever again Koonin and Novozhilov (2009). The alternative experimentally-supported hypothesis suggesting emergence of a human language as a stable separating Nash equilibrium of an information-asymmetric game would be more explanatory and hence appealing Huang et al. (2001), Chomsky (1972).

[Figure 1 graphic: a Language Evolution Timeline running from an initial representation through June 2015, July 2015, Aug 2015 and so on, with the similarities between intuits tracked over time. Example intuits shown: Hillary Clinton, Donald Trump, George Walker Bush and Barack Hussein Obama II, each with a short biographical description.]

Figure 1 A linguistic object intuit consists of an image, a hashtag and a short description of the intuit. A language starts with a small number of such intuits in a core germinal population and accrues additional users who add additional intuits, and use them to communicate. External to this dynamics, we can examine the time-stamped representations of the vocabulary of intuits and observe the evolution of the representation through time. This analysis will help us in understanding the social vs. inherent evolution of the representations and of language, based on the changes of the similarities of the intuits over time. Analyzing the data on a per user basis will give us hitherto unknown knowledge of the socio-cultural effects of interactions and community effects on dialects of the language. But, motivated to study the language as a whole, as opposed to just a pair of intuits and their similarities, we were led to novel mathematics to analyze topological differences in representations over time.

Motivated thus, we have proposed using crowd-sourcing to create a super-sized speech community with a massively scalable socio-technological version of creolization. The elements of these systems would be intuits (with more details in later sections), and eventually a grammar that linearizes (or even planarizes) intuits in a stable manner. We call this idea “Creolization of the Web” and here, we study various algorithmic issues related to machine learning, natural language processing and evolutionary processes to study the feasibility of such creolization experiments. In particular, we focus on (i) the definition of intuits, the building blocks of the creolization, combining images, hashtags, and short tweet-like (140 characters) descriptions, (ii) their dynamic geometric representation and (iii) evolution of the representation via a Bayesian echo chamber. We illustrate the process with Reddit data involving political subreddits to identify evolutionary patterns that emerge in a dynamic population interaction model (fig. 1, Guo et al. (2015)).

We realize that the ultimate system that combines elements of wiki, Twitter, emoticons, and Facebook could provide enormous utility in web-search, social networking, and shared economy, possibly displacing English as the de facto intermediate language of the web. Creation of a suitable infrastructure for intuits remains a secondary but critical goal.
Problem Description
We aim to build a database of a pictorial language called intuits, which will help in the process of learning language evolution. The building block of this language, called an intuit, is a token for any word in the vocabulary, where the token contains richer information than just the word, by storing (1) a title (a hashtag serving as a unique identifier), (2) a brief (140 character) description of the title and (3) an image of the title. The presence of this database to track the change of meanings of the intuits over time will give important insights to the theory of language evolution.

In this paper we give a baseline minimal model, based on the Bayesian Echo Chamber Guo et al. (2015), which is applicable to any evolutionary method and also has the flexibility to be individualized to any language using concrete grammars and objective semantics specific to that language. To experimentally verify the plausibility of such a model, we analyze real world data from Reddit, which is an online community of users, sufficiently active and engaged to model communication interactions in a population. Reddit is structured as a collection of “subreddits”, which are communities dedicated to a particular topic, such as gaming, sports, technology, etc. Each user of Reddit is generally subscribed to a few of the subreddits, focusing on the content that the user generally browses and is exposed to. The Reddit community has been frequently divided on many topics, the most recent of which has been on the political spectrum. This discordance provides a very rich environment to measure the effects of social stratification of language due to dissenting views between communities. We started with a synthetic model of intuits for a large population interacting panmictically (or as determined by an expander population graph) as it provides a baseline for an idealized theoretical model and a null model for hypothesis testing.

[Figure 2 graphic: The Bayesian Echo Chamber, with panels labeled Internal Representations (word2vec), Randomized Conversations, Population graph and Sequential Updates.]

Figure 2 The Bayesian Echo Chamber. In the model of the Bayesian Echo Chamber, the population is represented as a graph of individuals, called (language) learners, where the edges denote interactions (conversations) between learners. Each learner has their own internal representation of the language, which they use in their conversations with their neighbors. Each conversation is based on a particular topic, and the words in the conversation are chosen based on the similarities of the words with the topic in the internal representation of the learner.

The change in language is measured using computational tools (originally developed for Natural Language Processing, NLP), specifically word2vec, to get a feature-rich, high-dimensional embedding of the elements of a language associated with individual speakers. These embeddings can be thought of as the representation of the language for the individual, and the difference in the representations gives us a measure of the dissimilarity between the interpretations of the language in the population. Each representation being a corpus of high-dimensional points (“point clouds”), there is no standard notion of a distance between two such comparable representations. We propose to apply a topological metric using persistent homology Edelsbrunner and Harer (2008), Carlsson (2009), which is an emerging field of computational mathematics, quantifying a sense of difference between two representations. The advantage of using the topological metric is the rich information content, which provides insights into the local features of a space as well as measuring the global differences between two representations Arora et al. (2006), Singh et al. (2007).

Figure 3 Simulation of a population with various parameters of connectivity and size. The simulations show a fast convergence in the representation of individuals and a small drift over time after convergence. These results agree with the accepted theories of language evolution which predict fast stabilization and small drifts in language representations Greenberg (1960), Fischer (1958).
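The intuit record defined at the start of this section (a unique hashtag title, a 140-character description and an image) could be sketched as a small data structure. This is only an illustration under stated assumptions: the class name, the field names, the `image_uri` pointer and the `add_intuit` helper are hypothetical and do not come from the intuit project itself.

```python
from dataclasses import dataclass

MAX_DESCRIPTION_CHARS = 140  # tweet-like limit stated in the text


@dataclass(frozen=True)
class Intuit:
    """One token of the pictorial language (field names are illustrative)."""
    hashtag: str      # unique title, e.g. "#obama"
    description: str  # brief description, at most 140 characters
    image_uri: str    # hypothetical pointer to the image of the title

    def __post_init__(self):
        if not self.hashtag.startswith("#"):
            raise ValueError("hashtag must start with '#'")
        if len(self.description) > MAX_DESCRIPTION_CHARS:
            raise ValueError("description exceeds 140 characters")


def add_intuit(vocabulary: dict, intuit: Intuit) -> None:
    """Store an intuit, keeping hashtags unique across the vocabulary."""
    if intuit.hashtag in vocabulary:
        raise KeyError(f"duplicate hashtag: {intuit.hashtag}")
    vocabulary[intuit.hashtag] = intuit
```

Keeping the record immutable means that a change in meaning is recorded as a new time-stamped entry rather than an overwrite, which fits the goal of tracking meanings over time.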
Results
Confirmation of Echo Chambers in Reddit
The existence of echo chambers in any society can be manifested in many forms, such as the presence of dialects across the physical distribution of a population or the prevalence of accepted norms and ideologies in a community. The frequent divide in the political spectrum within a population, popularly described as the “left and right-wing extremisms”, is an interesting part of language that can be harnessed to understand political ideologies in subreddits.
[Figure 4 graphic: panels labeled Persistence Diagram, H Homology Barcodes, and Bottleneck Distance.]

Figure 4 Topological distance of embeddings. To define the distance between representations we start with a topological metric, as it gives information about the representation as well as the differences between two representations. We examine features in the word2vec embeddings of the vocabulary of a learner and calculate the distances based on the geometric embeddings. Two words will be close to each other in the word2vec space iff they are semantically nearly synonymous in the vocabulary of the learner. Now we can calculate the persistent homologies of the embedding and obtain the persistence diagrams of the space. Comparing the diagrams of two embeddings gives us the bottleneck distance between them, and equips us with a sense of how dissimilar two embeddings are.
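To make the first step of this pipeline concrete, the following sketch computes only the 0-dimensional part of a persistence barcode for a small point cloud, using single-linkage merging with a union-find structure. It is a simplified illustration, not the toolchain used for the experiments: the function name `h0_barcode` is hypothetical, the pairwise Euclidean distance is taken directly as the filtration value, and higher-dimensional homology (cycles and voids) would need a dedicated topological data analysis library.

```python
import math
from itertools import combinations


def h0_barcode(points):
    """Return the sorted death values of the finite 0-dimensional bars.

    Every point is born at filtration value 0; a bar dies when its
    cluster merges into another one (single-linkage clustering).
    The one bar that never dies is omitted.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All pairwise Euclidean distances, processed from shortest to longest.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # two clusters merge: one bar dies here
            parent[root_i] = root_j
            deaths.append(length)
    return deaths
```

For word embeddings, each point would be the vector of one word, so a long-lived bar corresponds to a group of words that stays separated from the rest of the vocabulary over a wide range of similarity thresholds.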
To examine this hypothesis explaining a spectrum in the communities, we proceeded to analyze the three most popular political subreddits, which are widely believed to cater to different groups, namely /r/politics, /r/worldnews and /r/The_Donald. /r/politics is the subreddit focused on US politics; the user base of /r/politics has been thought to be largely liberal. /r/worldnews focuses more on international news and has frequent discussions on international relations between countries. /r/The_Donald is another US politics focused group, which was founded in June 2015, and has a more Republican user base.

We collected the top fifty most frequent and popular users from each subreddit, to infer a model of the user base of the subreddit. We took the Reddit data for each user over a period of more than two years, from June 2015 to November 2017. Using this as a data corpus for the word2vec model, we created word embeddings for each user to get a point cloud of the vocabulary of the user. Persistent homology was then used to calculate the barcodes of the word2vec embeddings of each user. Based on the barcodes of each user, the bottleneck distance metric provided a similarity score to every pair of users, which was used by t-SNE to get a low-dimensional clustering embedding of the population (fig. 5). The advantage of the t-SNE clustering is the ability to find highly probable clusters (i.e., with a large likelihood), while low probability clusters are ignored.

Figure 5 t-SNE of users in different subreddits. The embeddings of each user were generated using state-of-the-art word2vec models and their Reddit data from June 2015 till November 2017; using these embeddings, we calculated the distances between each pair of users using the persistent homology metric. To visualize and quantify the clusters formed using this metric, we performed t-SNE in the 2-D plane, as t-SNE gives higher probabilities to cluster pairs which have small distance while not clustering larger distance pairs. The resulting clusters show a stark similarity between the users of /r/politics and /r/worldnews, while those of /r/The_Donald are clustered separately. This behavior is mimicked in all dimensions, showing that there is little communication happening between these communities.

Based on the t-SNE clusterings, we see a stark similarity between the users of /r/politics and /r/worldnews. This structure not only supports the hypothesis postulating existence of largely liberal user bases in the two subreddits, but also gives a clear method to find echo chambers across the whole Reddit community. The users of /r/The_Donald are shown to be hugely dissimilar to those of /r/politics and /r/worldnews, as the political ideologies of Republicans have many accepted notions that contrast with those of Democrats.

The idea of using these embeddings and the topological similarity can also be extended to any other spatial model, such as the embeddings computed by GloVe, fastText, sense, etc. Sense embeddings have the additional characteristic of being able to identify polysemy. Thus Topological Data Analysis (TDA) can take advantage of this feature to characterize measures of polysemy between different languages. Nonetheless, one needs to be careful when considering the potential effects of prevalent topics in the subreddits and to ensure that secondary structures do not dominate the embedding criterion. This goal can be ensured by restricting the topic base to a particular subset so that the vocabulary of the topics remains largely consistent through the subreddits.
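The bottleneck distance used above to score pairs of users can be sketched for small persistence diagrams as follows. This is a brute-force illustration, assuming each diagram is a list of finite (birth, death) pairs; the function name `bottleneck_distance` is our own, and a production analysis would rely on an optimized library routine. The sketch searches for the smallest threshold at which every point can be matched either to a point of the other diagram (at L-infinity cost) or to the diagonal (at cost half its persistence).

```python
def bottleneck_distance(diag_a, diag_b):
    """Bottleneck distance between two small persistence diagrams.

    Each diagram is a list of (birth, death) pairs with finite death.
    Any point may also be matched to the diagonal at cost (death - birth) / 2.
    """
    n, m = len(diag_a), len(diag_b)
    size = n + m  # rows: A points + m diagonal slots; cols: B points + n slots
    if size == 0:
        return 0.0

    def cost(i, j):
        if i < n and j < m:  # point to point, L-infinity distance
            (b1, d1), (b2, d2) = diag_a[i], diag_b[j]
            return max(abs(b1 - b2), abs(d1 - d2))
        if i < n:            # point of A matched to the diagonal
            return (diag_a[i][1] - diag_a[i][0]) / 2
        if j < m:            # point of B matched to the diagonal
            return (diag_b[j][1] - diag_b[j][0]) / 2
        return 0.0           # diagonal to diagonal

    costs = [[cost(i, j) for j in range(size)] for i in range(size)]

    def has_perfect_matching(limit):
        match_of_col = [-1] * size

        def augment(row, seen):
            for col in range(size):
                if costs[row][col] <= limit and col not in seen:
                    seen.add(col)
                    if match_of_col[col] < 0 or augment(match_of_col[col], seen):
                        match_of_col[col] = row
                        return True
            return False

        return all(augment(row, set()) for row in range(size))

    # The answer is one of the matrix entries: binary search the smallest feasible one.
    candidates = sorted({c for row in costs for c in row})
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if has_perfect_matching(candidates[mid]):
            hi = mid
        else:
            lo = mid + 1
    return candidates[lo]
```

Applying such a function to every pair of users yields a symmetric distance matrix, which is the kind of input a clustering or visualization method such as t-SNE can consume.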
Comparison of subreddits gives details of divergence over time
One of the main reasons for performing temporal analysis of language in Reddit is to be able to identify the effects of communications (or lack thereof) within the population on the language of each community. To analyze this effect, we took the most popular topics from each month, from June 2015 till November 2017, in each subreddit and made an incremental word2vec model. This incremental model presented to us a highly dynamic picture of each subreddit through time, which we used as an input to the persistent homology toolbox to rigorously quantify the changing similarities over time (fig. 6).

We observe that there is a consistent increase in the relative pair-wise distances of the subreddits. This dispersion corresponds to the formation of communities and how the nascent communities differ in interpreting the semantic nature and sentiments of words in the subreddits. The increase in the bottleneck distances can be seen as one effect of the widening division in the population based on political creeds and affiliations.
Non-isotropy of language embeddings
Language isotropy has been thought of as a reason for the robustness of the word2vec models and any embedding tool in general. Isotropy in a geometric sense is the measure of uniformity of the word embeddings across the inherent embedding space. The core idea that is assumed to support the word embeddings (and approaches based on them) is as follows: All natural languages must be able to describe all concepts in the language model using minimal combinations of words. This property is facilitated as the words become uniformly distributed across the space Arora et al. (2016).
Figure 6 Bottleneck distance between subreddits. We collect the most popular posts from every month in each subreddit to build a temporal model for language representation. Using the bottleneck distance of persistence diagrams we can calculate the distance between the language representations over time and see the effects of the community structure. The consistent increase in the distance between the representations supports the hypothesis of echo chambers in subreddits, leading to a divergence in representations and topic focus between the subreddits.
Persistent homology offers an easy way to measure the isotropy of any word embedding model by looking at the point cloud of the embeddings. The presence of holes in the embedding space can be thought of as parts of the space which are poorly described using the current geometry and for which new words should either be introduced or existing words can be remapped to new meanings, reminiscent of Moran processes in evolution and linguistics Tiefelsdorf (2006).

We took the subreddit data from each of the three political subreddits and calculated the embeddings of the word corpus to get a representation of the words at the end of 2017. We observed the presence of multiple large homology groups, suggesting inconsistencies with the hypothesis of isotropy of word2vec embeddings. Our observation, albeit in a limited context, provokes additional analysis of word2vec models and their effectiveness. Another potential investigation is the location of the homologies and identifying the regions of space contributing to the homologies. This strategy may lead to a tool for analyzing a text corpus and identifying topics which can be misrepresented. Such a tool can point to potential pitfalls of the embeddings and also new approaches to avoid them.

Using user data to find similarities of subreddits
One of the reasons for conducting the experiment on a per user basis is to be able to identify the communities from population data and minimal structural information. This new individualized data prompted us to repeat the previous analysis of subreddit distance based on only the user data. We took the word corpus for each user and made an incremental word2vec model to get temporal embeddings of each user from June 2015 till November 2017. Using these embeddings, we calculated the average distance between each pair of users in the subreddits to observe changes in the language representations.

The average user distance between the subreddits remained largely unchanged throughout the time period of analysis, painting a different picture than the more robust analysis from the overall subreddit data. This discrepancy prompts a more detailed analysis of using personalized data to gather succinct information to compare communities. This approach also faces a problem in identifying communities based on individualized data, where no proper means of learning the underlying population graph exists. In a setting where conversations take place with multiple users, inferring the communication hypergraph is a harder problem Kim et al. (2017).
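The averaging step described above can be sketched as follows. The sketch assumes that a per-user representation is available for each month and that some distance between two representations is supplied by the caller (in the paper this role is played by the bottleneck distance between persistence diagrams); the function name, the `snapshots` layout and the scalar toy data are hypothetical.

```python
from itertools import product
from statistics import mean


def cross_distance_over_time(snapshots, users_a, users_b, distance):
    """Mean distance between two communities, month by month.

    snapshots: {month: {user: representation}}
    users_a, users_b: user names belonging to each community
    distance: callable taking two representations and returning a float
    """
    return {
        month: mean(
            distance(reps[u], reps[v]) for u, v in product(users_a, users_b)
        )
        for month, reps in snapshots.items()
    }


# Toy usage with scalar "representations" and absolute difference as distance.
toy = {
    "2015-06": {"a": 0.0, "b": 1.0, "c": 3.0},
    "2015-07": {"a": 0.0, "b": 1.0, "c": 5.0},
}
series = cross_distance_over_time(toy, ["a", "b"], ["c"], lambda x, y: abs(x - y))
```

A flat series would correspond to the largely unchanged average user distance reported here, while a rising series would mirror the divergence seen in the subreddit-level analysis.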
Intra-subreddit language drift using users
To observe the drift in language over time we examine the distance between the representations of each user over time (fig. 7). The user data has many limitations, namely, the initialization process is slow; vocabulary remains limited; the length of conversations is typically short; and most importantly, the best existing data corpus is inadequately small. Due to these limitations, any kind of user based analysis of subreddits has proven difficult. We notice a small pattern of increasing distance, reminiscent of the subreddit distance metric. But the fluctuations in the first two homologies show the effect of lack of data on the bottleneck distance.

One way of getting around this limitation is to have robust user data to construct good individual representations of the language. The design of intuits is such that the crowd-sourced natural experiments can yield better individual representations, each of which can be tracked over time to get the drift of the language and observe the community effects on the representation. Collecting more focused data, such as the data to be gathered by the intuit project, will help reveal much more about various linguistic hypotheses, ranging from origins of the language to its universality and stability.

Figure 7 Intra-subreddit distance. The individualized embeddings inside each subreddit can help us understand the convergence of language over time and the stability of the language after convergence. The current user distances remain stable over a period of more than two years, suggesting a stable distribution of language representations, where the divergence observed earlier is an effect of the drift in language due to shifts in the topic focus over time.
Discussion
We conclude that the design and launch of intuit’s large-scale crowd-sourced creolization experiment constitutes a feasible project, with the proviso that serious attention is given to language’s convergence properties (and subsequent stability). Our computational simulation of the Bayesian Echo Chamber and the mathematical analysis of convergence to equilibria within it appear promising for the following reasons: (i) by providing the right tools to a crowd-sourced wiki-like public effort, it seems conceivable to creolize a natural language more suitable for the world-wide web and (ii) furthermore, by not ignoring the effects of naturally occurring population (graph) structures (e.g., Reddit), it seems possible to avoid certain natural limitations, usually exhibited as disparate Echo Chambers, coexisting, but in fundamental disagreement with one another. Thus there must be significant efforts to bridge the differences between the idealized theoretical model and extant empirical models, which may be achieved by simply prompting conversations among key individuals, who could facilitate rapid mixing in the population graph. Theories of random graphs, expander graphs and algebraic analysis of graphs provide powerful mathematical tools to achieve these goals algorithmically.

We hypothesize further that a properly designed intuit experiment will parametrize the universal grammar (assuming and validating its existence) common to natural languages; it will quickly converge to a highly stable Nash equilibrium; and it will optimize certain information-theoretic utility functions for the utterer-hearer pairs. These hypotheses are, separately and together, refutable. The data collected from this natural experiment will shed important light on the biological mechanisms responsible for the emergence of human languages, while spurring the emergence of a new wave of language creation.

The experiment also raises additional questions:

How will the intuit language relate to the ongoing research in Artificial Intelligence?
Currently there is much interest in using deep learning for natural language processing, especially for language translation, text-tagging, captioning images, etc., all relying on some form of word2vec embeddings based on large corpora from multiple languages. There is a lack of a proper theory in deep learning explaining the spectacular successes and intriguing failures (e.g., adversarial perturbations) that this version of AI (sub-symbolic, black-boxes) exhibits. Our work on the signalling-game-theoretic models, as initiated here, could be useful in injecting robustness into future AI research. A particularly colorful example of a confusing experiment in AI involves Microsoft’s Tay, which was effortlessly hijacked by a millennials’ echo chamber.
How will the intuit language relate to the current thinking in Mathematical Data Science?
We have shown here that topological analysis of point-cloud data provides a powerful tool that could be widely applicable. Some applied works on evolutionary studies in virology and oncology have been influential, but wider applications remain unexplored, especially in the context of the evolution of languages, social norms, social contracts, social institutions, etc., all topics of immense importance as intelligence/information technologies have begun to disrupt long-standing, hitherto stable institutions in unpredictable manners. Creolization’s deeper relations to topological data analysis (TDA), Manifold Learning, Information Geometry, Game Theory etc. are thus important topics of future research.
How will the intuit language relate to the current thinking in Biology?
Our experiments anticipate support for the usefulness of distributional methods of representing semantics in a language. Our approach is supported by the analysis by Arora et al. (2016), who were able to identify a semantically-relevant low-dimensional shared representation of fMRI responses. Their experiments and analysis were conducted in an unsupervised fashion and involved views of multiple subjects watching the same natural movie stimulus. These studies point to some fundamental questions about the biology of languages and how language evolved in a relatively short period. Our analysis using intuits, with its multimodal emoji-like structures, is hoped to raise more challenges and resolve ancient mysteries.

Last but not least, how will the intuit language relate to the current thinking in Linguistics?

Noam Chomsky and his followers have played a dominant role in shaping the current theories of language, but in isolation from other evolutionary researchers and their theories, such as cellularization (codons), endosymbiosis, multi-cellularity, speciation, etc. However, human spoken language is hypothesized to be a biological artefact (postulating a yet-to-be identified language organ; related to the so-called I-language; and supporting distributional semantics), but leads to theories that are unexperimentable (“not-even-wrong”). The existence of the WWW and crowd-sourcing drastically changes the situation by enabling scalable and experimental inventions of new artificial natural languages using a large number of communicating human learners.

However, our biggest challenges will remain in the engineering of the intuit Linguistic System, focusing on how the data should be collected and how it should be analyzed. We can use existing efforts developed in cloud computing (e.g., BigTable, BigQuery, etc.), enabling construction of such a system with relatively small man-power. But given that the internet is already affecting how younger generations communicate (with hashtags, emojis, acronyms, etc.), the window of opportunity for the natural experiments based on intuit may be closing soon, particularly as the field gets crowded by powerful monolithic corporations, namely, the so-called unicorns, e.g., Twitter (tweets), Facebook (identity systems) and Google (Language Translations).
Table 1 Translation between persistent homology and linguistic terminology
Persistent homology | Linguistic interpretation
Filtration value ε | Clustering of words up to similarity of ε
Persistence diagrams | Representation of difference in similarities of a word to its semantic neighbors
0-dimensional Betti numbers at distance ε | Number of word clusters based on semantic dissimilarity up to ε
Generators of 0-dimensional persistence diagram | Representatives of word clusters
Birth-death timings of 0-dimensional generators | Hierarchical clustering of words
1-dimensional persistence diagram | Small cycles help in detection of polysemous words
Distance between persistence diagrams | Measure of difference of two word corpora, in terms of semantic meanings associated to words
Higher dimensional persistence diagrams | Non-isotropy of word embeddings
Generators of higher dimensional persistence diagrams | Bounding regions of space with no embeddings, contradicting isotropy

Methodology
Here, we show the guarantees of the interactive model for language evolution. We apply the model to real world data and show how the properties of the model give us insights into the data using persistent homology. For more details, see the supplementary materials.
Modeling Change in Language
Many different representation schemes can be used to model a language Mnih and Hinton (2007), Collobert and Weston (2008), White (2003). We represent the vocabulary of a language in a high-dimensional space with the properties that (a) similar words/intuits (based on their contextual usage or annotations) must be placed in nearby regions of this space; (b) any change due to evolution of the language can be captured through the relative movements of the words in this space.
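Properties (a) and (b) can be illustrated with a minimal sketch. It assumes cosine similarity as the notion of "nearby" and measures the relative movement of a word as the change in its similarity profile to the shared vocabulary between two snapshots; the function names, the drift measure and the toy vectors are hypothetical choices made for illustration, not the construction used in the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def neighbour_profile(word, embedding):
    """Similarity of `word` to every other word, in a fixed (sorted) order."""
    return [cosine(embedding[word], embedding[w])
            for w in sorted(embedding) if w != word]

def drift(word, old, new):
    """Relative movement of `word`: Euclidean change of its similarity
    profile over the vocabulary shared by the two snapshots."""
    shared = set(old) & set(new)
    old_s = {w: old[w] for w in shared}
    new_s = {w: new[w] for w in shared}
    p = neighbour_profile(word, old_s)
    q = neighbour_profile(word, new_s)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Toy 2-d embeddings for two time slices (made-up numbers, assumption).
old = {"mouse": [1.0, 0.1], "rat": [0.9, 0.2], "keyboard": [0.1, 1.0]}
new = {"mouse": [0.3, 0.9], "rat": [0.9, 0.2], "keyboard": [0.1, 1.0]}

mouse_drift = drift("mouse", old, new)  # "mouse" moved towards "keyboard"
rat_drift = drift("rat", old, new)      # "rat" only moved relative to "mouse"
```

Because the profile is built from pairwise similarities rather than raw coordinates, a rigid rotation of the whole space leaves the drift unchanged, which is one way to read "relative movements".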
Bayesian Echo Chamber
This is a new Bayesian generative model for social interaction data, for uncovering influence-relations among individuals from their time-stamped conversation data Guo et al. (2015). The forcing function of a social process is based on the mutual influence among the participants, which must be inferred from the flow of the interaction Reali and Griffiths (2009). Evolution in a language can be modeled to explain the temporal stability of word meanings, despite their occasional misuse, but also the sporadic changes in meaning that do occur within subcommunities (Echo Chambers) and propagate further Pelikan et al. (2000). These changes may be assumed to be accompanied by concomitant changes in usage and in the grammatical structure that connects these words. Social influence forces these changes, and the flow of changes runs from the “most influencing” participant to the “weaker participant”. We use the Bayesian Echo Chamber to understand how external influences and the mixing of different social and linguistic cultures in a speech community initiate language evolution, and how the language finally converges to a stabilized form, after which it is less susceptible to external stimulants.
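The idea that influence flows from one participant to another can be sketched with a toy Dirichlet-multinomial model. This is only loosely inspired by the description above: the symmetric prior `beta`, the influence weights `rho`, the absence of any time decay and all names are assumptions made for illustration, and the sketch should not be read as the model of Guo et al. (2015).

```python
from collections import Counter

def predictive(speaker, history, rho, vocab, beta=1.0):
    """Posterior-predictive word distribution for `speaker`.

    Dirichlet pseudo-counts are `beta` plus the words already used by the
    other participants, each weighted by rho[(other, speaker)], the assumed
    influence of `other` on `speaker`.
    """
    pseudo = {v: beta for v in vocab}
    for other, words in history.items():
        if other == speaker:
            continue
        weight = rho.get((other, speaker), 0.0)
        for word, count in Counter(words).items():
            pseudo[word] += weight * count
    total = sum(pseudo.values())
    return {v: pseudo[v] / total for v in vocab}

# Hypothetical conversation: alice influences carol strongly, bob weakly.
vocab = ["sick", "ill", "cool"]
history = {"alice": ["sick", "sick", "cool"], "bob": ["ill"]}
rho = {("alice", "carol"): 2.0, ("bob", "carol"): 0.1}

p_carol = predictive("carol", history, rho, vocab)
p_uninfluenced = predictive("carol", history, {}, vocab)
```

With these numbers carol's predicted usage leans towards alice's vocabulary, while with no influence weights it stays uniform. Inferring `rho` from time-stamped data, rather than fixing it by hand, is the part that the real model addresses.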
Persistent Homology
We base our analysis on the field of algebraic topology, which can capture the global differences between two high dimensional embeddings as well as give local information about a representation, yielding insights into the distribution of the words (or intuits) in the space. We measure “features” of the space that remain invariant under continuous deformations such as stretching, bending and rotating, but not tearing or gluing parts of the space. These features correspond to the “holes” in the space, which range across all dimensions of the space and can capture higher order structures, more so than the simple combinations of elementary structures commonly used in machine learning.
We create a continuous topological space from a corpus of points by putting balls of size $\epsilon$ around each point. The intersection pattern of these balls gives the Čech complex, which has the same topology (up to homotopy) as the union of the balls but is hard to compute. We instead focus on the Vietoris-Rips (VR) complex, a simplicial complex that is cheaper to build because it depends only on pairwise distances, and that can be shown to be a good approximation of the Čech complex Carlsson (2009), Edelsbrunner (2014). The VR complex of size $\epsilon$ for a corpus $C$, $VR_\epsilon(C)$, is constructed by joining two points if they have a distance smaller than $\epsilon$, which corresponds to the threshold of word similarities, and by adding a simplex whenever all of its vertices are pairwise joined. By building the $VR_\epsilon(C)$ complex at different scales of $\epsilon$, we can see when the holes are generated and when they are filled up, called the birth and death “times” of the holes (homologies). These are encoded in the barcode representation (fig. 4) and also as points on the plane, known as a persistence diagram.
The presence of holes across different dimensions gives us an estimate of how different two spaces are. Topologically, we can think of these as the obstructions to changing one space into another using continuous methods. Persistent homology strengthens these notions and enables us to capture more information about these holes, such as their sizes and boundary words.
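For dimension 0, the birth and death times described above can be computed directly from pairwise distances. The following is a minimal sketch, assuming Euclidean distance between toy points and using a union-find structure; it covers only connected components (word clusters), not higher-dimensional holes, and the function names and data are hypothetical.

```python
import math
from itertools import combinations

def h0_barcode(points):
    """0-dimensional persistence of the Vietoris-Rips filtration.

    Every point is born at scale 0. A bar dies at the length of the edge
    that first merges its component into another one. One component never
    dies, so its bar ends at math.inf.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    bars = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge joins two previously separate clusters
            parent[ri] = rj
            bars.append((0.0, length))
    bars.append((0.0, math.inf))
    return bars

def betti0(bars, eps):
    """Number of clusters at scale eps (points joined if distance < eps)."""
    return sum(1 for birth, death in bars if birth <= eps <= death)

# Two well-separated toy clusters standing in for word embeddings.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
bars = h0_barcode(points)
clusters_small = betti0(bars, 0.5)   # every point is its own cluster
clusters_mid = betti0(bars, 2.0)     # the two clusters are resolved
clusters_large = betti0(bars, 20.0)  # everything has merged
```

The single long finite bar reflects the gap between the two clusters, and the order in which bars die is the merge order of single-linkage hierarchical clustering. Higher-dimensional barcodes need a full boundary-matrix reduction, for which a dedicated topological data analysis library is the practical choice.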
The longer a bar in the barcode is, the higher the probability that the underlying feature is a characteristic of the manifold and not a by-product of noise in the data Edelsbrunner et al. (2000), Guskov and Wood (2001). We use a topological metric called the bottleneck distance between persistence diagrams, which is a special case of the Wasserstein metric on this space. The bottleneck distance can be thought of as a correspondence between the homologies chosen to minimize the disparity between spaces. Hence a large bottleneck distance is a proof of dissimilarity between the underlying spaces. Moreover, the generators of the corresponding homology groups help identify the local structures generating the deformations Chazal et al. (2015). Overall, we connect ideas from algebraic topology, language evolution and computational linguistics, and provide a map to translate notions between these fields in table 1.
Acknowledgments
We would like to thank Sylvain Cappel and Misha Gromov of Courant Institute of Mathematical Sciences, Raul Rabadan of Columbia University, Rohit Parikh of CUNY, and Larry Rudolph of TwoSigma for their keen insight and enlightening discussions in the making of this project. We would also like to thank Halley Young of CMU, who kickstarted our project in its infancy and helped us better understand the available datasets. B.M. was supported by an Army Research Office Grant
References
Acevedo A, Brodsky L, Andino R (2014) Mutational and fitness landscapes of an RNA virus revealed through population sequencing. Nature.
Transactions of the Association for Computational Linguistics.
Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, 41–50 (Society for Industrial and Applied Mathematics).
Briscoe T (2002) Linguistic evolution through language acquisition (Cambridge University Press).
Carlsson G (2009) Topology and data. Bulletin of the American Mathematical Society.
International Conference on Algorithmic Learning Theory, 419–433 (Springer).
Chazal F, Glisse M, Labruère C, Michel B (2015) Convergence rates for persistence diagram estimation in topological data analysis. The Journal of Machine Learning Research.
Harvard Educational Review.
Cognitive Science.
Collobert R, Weston J (2008) A unified architecture for natural language processing: Deep neural networks with multitask learning. Proceedings of the 25th international conference on Machine learning, 160–167 (ACM).
Cook V, Newson M (2014) Chomsky’s universal grammar (John Wiley & Sons).
Edelsbrunner H (2014) A short course in computational geometry and topology. Number Mathematical methods (Springer).
Edelsbrunner H, Harer J (2008) Persistent homology-a survey. Contemporary mathematics.
Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, 454–463 (IEEE).
Eisner F, McQueen JM (2006) Perceptual learning in speech: Stability over time. The Journal of the Acoustical Society of America.
Linguistic perception and second language acquisition: Explaining the attainment of optimal phonological categorization (Netherlands Graduate School of Linguistics).
Fischer JL (1958) Social influences on the choice of a linguistic variant. Word.
International journal of American linguistics.
Artificial Intelligence and Statistics, 315–323.
Guskov I, Wood ZJ (2001) Topological noise removal.
Spoken language processing: A guide to theory, algorithm, and system development, volume 1 (Prentice Hall PTR Upper Saddle River).
Hymes DH (1971) Pidginization and creolization of languages (CUP Archive).
Kim C, Bandeira AS, Goemans MX (2017) Community detection in hypergraphs, spiked tensor models, and sum-of-squares. Sampling Theory and Applications (SampTA), 2017 International Conference on, 124–128 (IEEE).
Koonin EV, Novozhilov AS (2009) Origin and evolution of the genetic code: the universal enigma. IUBMB life.
Nature Reviews Cancer.
Hospital Practice.
Language in Society.
Proceedings of the 24th international conference on Machine learning, 641–648 (ACM).
Nichols J (1992) Linguistic diversity in space and time (University of Chicago Press).
Pelikan M, Goldberg DE, Cantú-Paz E (2000) Bayesian optimization algorithm, population sizing, and time to convergence. Proceedings of the 2nd Annual Conference on Genetic and Evolutionary Computation, 275–282 (Morgan Kaufmann Publishers Inc.).
Reali F, Griffiths TL (2009) Words as alleles: connecting language evolution with bayesian learners to models of genetic drift. Proceedings of the Royal Society of London B: Biological Sciences rspb20091513.
Sharp PM, Matassi G (1994) Codon usage and genome evolution. Current opinion in genetics & development.
The structure and status of pidgins and creoles.
SPBG, 91–100.
Tiefelsdorf M (2006) Modelling spatial processes: the identification and analysis of spatial relationships in regression residuals by means of Moran’s I, volume 87 (Springer).
Traulsen A, Hauert C, De Silva H, Nowak MA, Sigmund K (2009) Exploration dynamics in evolutionary games. Proceedings of the National Academy of Sciences.