Coloring the Black Box: What Synesthesia Tells Us about Character Embeddings
Katharina Kann ∗ University of Colorado Boulder [email protected]
Mauro M. Monsalve-Mercado ∗ Columbia University [email protected]
Abstract
In contrast to their word- or sentence-level counterparts, character embeddings are still poorly understood. We aim at closing this gap with an in-depth study of English character embeddings. For this, we use resources from research on grapheme–color synesthesia – a neuropsychological phenomenon where letters are associated with colors – which give us insight into which characters are similar for synesthetes and how characters are organized in color space. Comparing 10 different character embeddings, we ask: How similar are character embeddings to a synesthete's perception of characters? And how similar are character embeddings extracted from different models? We find that LSTMs agree with humans more than transformers. Comparing across tasks, grapheme-to-phoneme conversion results in the most human-like character embeddings. Finally, ELMo embeddings differ from both humans and other models.
Neural network models have become crucial tools in natural language processing (NLP) and define the state of the art on a large variety of tasks (Wang et al., 2018). However, they are difficult to understand and are often considered "black boxes" (a fact that has led to the establishment of a workshop with the same name: https://blackboxnlp.github.io). This can make their use difficult to defend in many settings, for instance in a legal context, and constitutes a barrier for model improvement. Therefore, a lot of research has been dedicated to investigating the information encoded by neural networks. Especially word embeddings, contextualized word representations, and language representation models like BERT (Devlin et al., 2019) have been exhaustively studied (Rogers et al., 2020).

∗ Equal contribution.
A B C D E F ...
Figure 1: Characters as they might be seen by a grapheme–color synesthete. The colors in this example are randomly chosen.
Character embeddings are used for a large set of tasks, either as a supplement to word-level input, e.g., for part-of-speech tagging by Plank et al. (2016), or on their own, e.g., for character-level sequence-to-sequence (seq2seq) tasks by Kann and Schütze (2016). Despite this, they have not yet been explicitly analysed. One reason for this might be that identifying relevant properties to study is more challenging than for their word-level counterparts. However, we argue that, in order to truly shine light into black-box NLP models, it is necessary to understand each and every component of them.

In this paper, we perform a detailed study of English character embeddings. Our first contribution is a character similarity task (§3) in analogy to the word-based version:
Do embeddings agree with humans on which pairs of characters are similar?
While, e.g., cat and tiger are generally considered more similar than cat and chair, this is not trivial for characters. For our annotations, we exploit a phenomenon called synesthesia. People with synesthesia, synesthetes, share perceptual experiences between two or more senses (§2). For grapheme–color synesthetes, each letter is associated with a color, which is consistent over a person's lifetime (Eagleman et al., 2007), cf. Figure 1. Using a dataset of letters from the English alphabet and the associated colors from 4269 synesthetes (Witthoft et al., 2015), we compute differences in color space (§3.3) as a proxy for character similarity.

As our second contribution, we propose a set of methods to characterize the structure of character embedding matrices (§4). The methods we propose are (i) a clustering analysis, (ii) computing the clustering coefficient, (iii) measuring betweenness centrality, and (iv) computing cut-distances.

Our final contribution is a detailed character embedding analysis (§6). We explore 6 types of embeddings, obtained from 4 different architectures trained on 3 different tasks, as well as 4 embedding matrices from pretrained ELMo models (Peters et al., 2018). Comparing across models trained on the same task, character embeddings from LSTMs, which, similar to humans, process input sequentially, correlate more with human similarity scores than embeddings from transformers. Comparing across tasks, character embeddings from language modeling show a surprisingly low correlation with human judgments. In contrast, the correlation is highest for grapheme-to-phoneme conversion (G2P). This is in line with findings that colors perceived by synesthetes are influenced by the sound of each letter (Asano and Yokosawa, 2013).

Synesthesia is the perceptual phenomenon that two or more sensory or cognitive pathways are co-activated in the brain by stimulating only one of them.
For example, for a person with chromesthesia, a common form of synesthesia, hearing a particular sound evokes the visual perception of colors (Cytowic, 2018; Simner, 2012).

By far the most common form of synesthesia is grapheme–color synesthesia, for which subjects report the perception of colors when seeing numerals or letters (cf. Figure 1). The neural basis behind the phenomenon is still a much debated topic, but evidence suggests it might be the result of increased effective connectivity between the brain areas involved (Ramachandran and Hubbard, 2001a). For example, visual area V4a in the ventral pathway, often associated with contextual processing and perception of color, has been shown to be part of a complex in the fusiform gyrus exhibiting higher cortical thickness, volume, and surface area in synesthetes (Hubbard et al., 2005). Belonging to this complex and adjacent to the color processing region is the area dedicated to the visual recognition and processing of graphemes, suggesting that the prominence of this type of synesthesia is due to higher region interconnectivity being more probable.

Additional evidence supports the idea that synesthetic associations often involve the extraction of meaning from a stimulus, suggesting that abstract constructions play a major role in the associations formed (Mroczko-Wasowicz and Nikolic, 2014). For cases where language is involved, semantics may in part underlie color representations of words and graphemes in V4.

The study of synesthesia presents a unique opportunity to understand the neural basis of cognitive models of language (Ramachandran and Hubbard, 2001b; Simner, 2007). Here, grapheme–color synesthesia serves as a window to look at how the brain represents individual characters.
Our first contribution is a character-level analogue of the word similarity task, an intrinsic evaluation method for word embeddings. It consists of judging the similarity of pairs of words, e.g., of cat and tiger. To obtain a gold standard for this task, human annotators assign similarity scores to a list of word pairs. This is not always trivial: are cat and bird more or less similar than cat and fish? However, people tend to agree on general tendencies.

Word embeddings are evaluated on similarity datasets as follows: The similarity – usually cosine similarity – of all pairs of words is computed. The agreement between models and human annotations is then quantified as the correlation between the two vectors of scores. Word embeddings with a higher performance on word similarity tasks are expected to perform better on downstream tasks, since they encode valuable information about words and the relationships between them.
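This evaluation protocol is easy to state in code. The sketch below uses toy vectors and invented human scores; all names and numbers are ours, purely for illustration, not data from the paper:

```python
import numpy as np
from scipy.stats import pearsonr

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy embeddings and hypothetical human similarity judgments for three pairs.
emb = {
    "cat":   np.array([0.9, 0.1, 0.2]),
    "tiger": np.array([0.8, 0.2, 0.3]),
    "chair": np.array([0.1, 0.9, 0.5]),
}
pairs = [("cat", "tiger"), ("cat", "chair"), ("tiger", "chair")]
human_scores = [0.9, 0.1, 0.15]  # invented gold annotations

# Model scores: cosine similarity for each pair, then correlation with humans.
model_scores = [cosine(emb[a], emb[b]) for a, b in pairs]
r, _ = pearsonr(human_scores, model_scores)
```

The single correlation coefficient `r` is the agreement measure reported in evaluations of this kind.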
In order to design a character similarity task, we require a gold standard. However, we are not likely to get a meaningful answer when asking people if B and J are more similar than C and Q.

We solve this problem by looking at how different characters are represented by grapheme–color synesthetes in color space. This tells us how similar characters are perceived to be without having to ask explicitly. We leverage a dataset collected by Witthoft et al. (2015), which consists of letter-to-color mappings collected from 4269 synesthetes, and compute pair-wise perceptually uniform distances between characters. Analogously to the word-level version of the task, we then take as our gold standard the average over all annotators. For embeddings, we compute cosine similarities between character vectors.

We differ from the word similarity task in an important detail: we do not evaluate the quality of embeddings. Instead, we aim to understand character embeddings by assessing how similar to the human perception of characters they are. In particular, we do not necessarily expect embeddings which score higher on our task to perform better on downstream applications.

To motivate the distance metric we use, we first summarize human perception of color. The visual process of color perception starts in the retina, where three types of light-sensitive receptors are tuned to respond broadly around three distinct frequencies in the visible spectrum. However, color perception is not simply reduced to combinations of these physiological responses.
The human brain can perceive the same physical light frequency as a different color depending on a plethora of contextual markers, mostly due to further processing upstream of the retina into high-level visual cortical areas, where associations are formed between environmental cues and the object being visualized.

Important contextual properties of color affecting perception, often discussed in color theory as the Munsell (or HSL) color system, are its hue, saturation, and lightness. Combinations of these properties are interpreted in a highly non-linear manner by cortical areas tasked with color perception. Not surprisingly, simple metrics for color comparison, such as the Euclidean distance in RGB space, perform poorly in situations involving a color discrimination task. Constructing perceptually uniform metrics that deal with these perceived non-linearities is an active field of research. One standard metric for perceptual color comparison is the CIEDE2000 color difference (Sharma et al., 2005), which includes several correction factors for a modified Euclidean metric on HSL space. We employ CIEDE2000 in our analysis.
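As an illustration of the pipeline behind such metrics, the sketch below converts sRGB colors to the perceptual CIELAB space and computes the plain Euclidean difference there (CIE76). This is our simplified stand-in, not the paper's implementation: CIEDE2000, which the paper actually uses, adds several correction factors on top of this Lab-space comparison. The conversion constants (sRGB matrix, D65 white point) are the standard ones:

```python
import numpy as np

def srgb_to_lab(rgb):
    """Convert an sRGB triple (components in [0, 1]) to CIELAB (D65 white)."""
    rgb = np.asarray(rgb, dtype=float)
    # Undo the sRGB gamma: companded sRGB -> linear RGB.
    linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    # Linear RGB -> XYZ via the standard sRGB/D65 matrix.
    M = np.array([[0.4124, 0.3576, 0.1805],
                  [0.2126, 0.7152, 0.0722],
                  [0.0193, 0.1192, 0.9505]])
    xyz = M @ linear
    # XYZ -> Lab, normalized by the D65 reference white.
    xyz /= np.array([0.95047, 1.0, 1.08883])
    delta = 6 / 29
    f = np.where(xyz > delta ** 3, np.cbrt(xyz), xyz / (3 * delta ** 2) + 4 / 29)
    L = 116 * f[1] - 16
    a = 500 * (f[0] - f[1])
    b = 200 * (f[1] - f[2])
    return np.array([L, a, b])

def delta_e_cie76(rgb1, rgb2):
    """Simple Euclidean color difference in Lab space (CIE76)."""
    return float(np.linalg.norm(srgb_to_lab(rgb1) - srgb_to_lab(rgb2)))
```

Because Lab is approximately perceptually uniform, even this plain Euclidean distance tracks perceived color differences far better than a distance in raw RGB space.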
CIEDE2000 allows us to obtain pair-wise distances between all letters in the alphabet by computing color differences for the associated colors as perceived by synesthetes. The character similarity task then compares vectors of pair-wise distances. However, we can gain additional insight from the distance matrices, which represent fully connected networks whose nodes are the letters in the alphabet and whose edges are weighted by the pair-wise differences. Thus, we further propose four well-established methods from network theory (Newman, 2018) for the analysis of character embeddings and human difference matrices.
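Viewing a distance matrix as a weighted graph, several of the network measures discussed in this section can be computed directly with networkx. The 4-letter distance matrix below is a toy example of ours, not the synesthesia data:

```python
import numpy as np
import networkx as nx

# Toy symmetric distance matrix over four "letters" (hypothetical values).
letters = ["A", "B", "C", "D"]
D = np.array([
    [0.0,  0.2,  0.8,  0.7],
    [0.2,  0.0,  0.85, 0.6],
    [0.8,  0.85, 0.0,  0.3],
    [0.7,  0.6,  0.3,  0.0],
])

# Fully connected network: nodes are letters, edges weighted by distance.
G = nx.Graph()
for i in range(len(letters)):
    for j in range(i + 1, len(letters)):
        G.add_edge(letters[i], letters[j], weight=float(D[i, j]))

# Weighted clustering coefficients (geometric-mean generalization) and
# betweenness centrality with weights interpreted as path lengths.
clust = nx.clustering(G, weight="weight")
avg_clust = nx.average_clustering(G, weight="weight")
bc = nx.betweenness_centrality(G, weight="weight", normalized=True)
```

In this tiny fully connected example every direct edge is already the shortest path between its endpoints, so all betweenness values are zero; on the 26-letter matrices the measures become informative.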
Clustering analysis.
To characterize the global structure of the network, we first propose Ward's variance minimization algorithm to identify clusters, i.e., groups of letters that are similar to each other, but far away from other letters. Ward's algorithm is part of a family of hierarchical clustering algorithms whose objective function aims at minimizing the variance within clusters (Ward, 1963). Starting from a forest of clusters (initially single nodes), the algorithm evaluates the Ward distance d between a new cluster u made up of clusters s and t, and a third cluster v not used yet, as

d(u, v) = \sqrt{a + b - c},    (1)

with

a = \frac{|v| + |s|}{T} d(v, s)^2,    (2)

b = \frac{|v| + |t|}{T} d(v, t)^2,    (3)

c = \frac{|v|}{T} d(s, t)^2,    (4)

where T := |s| + |t| + |v|, and |·| is the size of the cluster. If u is a good cluster, s and t are removed from the forest and the algorithm continues until only one cluster is left. Finally, the number of clusters, their sizes, and the Ward distances between clusters characterize the global structure of the network.
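In practice, this procedure is available as Ward linkage in scipy; a minimal sketch on a toy 5-letter distance matrix (the values are ours, chosen to contain two tight groups):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy symmetric distance matrix over five "letters":
# two tight groups, {0, 1} and {2, 3, 4}, far from each other.
D = np.array([
    [0.0, 0.1, 0.9, 0.8, 0.9],
    [0.1, 0.0, 0.8, 0.9, 0.8],
    [0.9, 0.8, 0.0, 0.2, 0.1],
    [0.8, 0.9, 0.2, 0.0, 0.2],
    [0.9, 0.8, 0.1, 0.2, 0.0],
])

# scipy expects the condensed (upper-triangular) form of the matrix.
Z = linkage(squareform(D), method="ward")

# Cut the dendrogram into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```

Each row of the linkage matrix `Z` records one merge together with its Ward distance, which is exactly the information summarized by the dendrograms later in the paper.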
Clustering coefficient.
The clustering coefficient provides a way to measure the degree to which nodes in a network cluster together. For a binary network, the local version represents the fraction of the number of pairs of neighbors of a node that are connected, over the total number of pairs of neighbors of said node, and it measures the influence of a node on its immediate neighbors. Several generalizations for weighted networks have been proposed (Saramäki et al., 2007). Here, we use the geometric mean of the weights in the subgraph around a node u:

c_u = \frac{1}{\deg(u)(\deg(u) - 1)} \sum_{v,w} \left( \hat{w}_{uv} \hat{w}_{uw} \hat{w}_{vw} \right)^{1/3},

where \hat{w}_{ij} denotes weight w_{ij} normalized by the maximum weight in the network, and \deg(u) is the degree of node u, i.e., its number of neighbors. The average over all nodes is used as a proxy for the overall level of clustering within the network.

Betweenness centrality.
Different concepts of centrality attempt to capture the relative importance of particular nodes in the network. One such concept, betweenness, measures the extent to which a node lies on the shortest path between pairs of nodes (Brandes, 2008). In a sense, it generalizes the clustering coefficient from a measure of the local influence of a node on its immediate neighbors to a measure over the whole network. In particular, it accounts for nodes that connect two different clusters while not being a part of either. Betweenness centrality is computed as the fraction of all-pairs shortest paths that pass through a particular node u:

B(u) = \sum_{s,t} \frac{n(s, t \mid u)}{n(s, t)},

where the sum is over all nodes s and t in the network, and n counts the number of shortest paths between two nodes, optionally restricted to those passing through u.

Cut-distance.
The fourth and last approach we propose to characterize our matrices of character distances is to employ a matrix norm called the cut-norm. It is widely used in graph and network theory and has been shown to capture global features such as clustering and sparseness (Frieze and Kannan, 1999). The cut-norm of a matrix A = (a_{ij})_{i \in M, j \in N} is defined as

\|A\|_c := \max_{I \subset M,\, J \subset N} \frac{\left| \sum_{i \in I} \sum_{j \in J} a_{ij} \right|}{|I|\,|J|},

i.e., the maximum over all possible sub-matrix arrangements is taken as the norm. In practice, we compute it using an efficient implementation (https://pypi.org/project/cutnorm/) that relies on Grothendieck's inequality for an approximation (Alon and Naor, 2004; Wen et al., 2013). In addition, the norm naturally gives rise to a distance metric d_c(A, B) := \|A - B\|_c that allows us to compare pairs of distance matrices directly.

We now describe the tasks and model architectures we employ to train different character embeddings.

Language modeling.

The task of language modeling consists of computing a probability distribution over all elements in a predefined vocabulary, given a sequence of past elements. Language models can either be used to assign a probability to an input sequence or to generate text by sampling from the probability distributions. We train language models on the character level, i.e., the vocabulary consists of the English alphabet.

All our language models are trained on wikitext-103 (available at https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip). We use the provided training, development, and test splits. The training set consists of roughly 1 million tokens.
Morphological inflection.
In languages that exhibit rich inflectional morphology, words inflect: grammatical information like number, case, and tense is incorporated into the word itself. For instance, wrote is the inflected form of the English lemma write, expressing past tense.

The task of morphological inflection consists of mapping a lemma to an inflected form which is defined by a set of morphological tags. Morphological inflection is typically cast as a character-level seq2seq task, where the characters of the lemma together with the morphological tags are the input, and the characters of the inflection are the output (Kann and Schütze, 2016):
PST w a l k → w a l k e d

We train our inflection models on the English training examples provided by Cotterell et al. (2017) and use the corresponding development and test sets.

Grapheme-to-phoneme conversion.
Given a word's spelling, G2P consists of generating an (IPA-like) representation of its pronunciation:

p r e t t i e r → P R IH T IY ER
It has been shown that similar-sounding letters tend to be associated with similar synesthetic colors (Asano and Yokosawa, 2013). Hence, we assume that the embedding space induced by this task could be similar to human perception of characters.

We train all G2P models on examples extracted from the CMU Pronouncing Dictionary, using the splits provided at https://github.com/microsoft/CNTK/tree/master/Examples/SequenceToSequence/CMUDict/Data. Our training, development, and test sets consist of 114,399, 5447, and 12,855 examples, respectively.

Architectures.
To isolate the effects of task and model architecture, we train different architectures for each task. All test set performances are shown in Table 1. We train 50 models with different random seeds for seq2seq tasks, and 10 instances for language models. For our analysis, we look at the input embeddings and average pair-wise distances over models for each group. All models have been trained on an NVidia Tesla K80 GPU.
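Averaging pair-wise distances over model instances, as described above, can be sketched as follows; random matrices stand in for trained embeddings, and all names and shapes are ours:

```python
import numpy as np

def cosine_distance_matrix(E):
    """Pair-wise cosine distances between the rows of an embedding matrix E."""
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T

rng = np.random.default_rng(0)
n_chars, dim, n_seeds = 26, 100, 5

# One embedding matrix per random seed (here: random stand-ins).
runs = [rng.normal(size=(n_chars, dim)) for _ in range(n_seeds)]

# Average the distance matrices, not the embeddings themselves, since
# embedding spaces from different random seeds are not aligned.
avg_D = np.mean([cosine_distance_matrix(E) for E in runs], axis=0)
```

Averaging in distance space sidesteps the fact that independently trained models can place the same characters in arbitrarily rotated coordinate systems.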
LSTM seq2seq architecture.
Our first architecture is a seq2seq model similar to that by Bahdanau et al. (2015). It consists of a bi-directional long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) encoder and an LSTM decoder, which are connected via an attention mechanism. We apply it on the character level.

We train this architecture on morphological inflection (Infl_LSTM) and G2P (G2P_LSTM), using the fairseq sequence modeling toolkit for our implementation. All embeddings and hidden states are 100-dimensional, and both encoder and decoder have 1 hidden layer. For training, we use an Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.001, dropout, and a batch size of 20. To account for different training set sizes, we train our model for G2P for 15 epochs and our model for inflection for 100 epochs.

Transformer seq2seq architecture.
We further experiment with a transformer seq2seq architecture (Vaswani et al., 2017). Similar to the LSTM seq2seq model, this architecture consists of an encoder and a decoder which are connected by an attention mechanism. However, the encoder and the decoder consist of combinations of feed-forward and attention layers instead of LSTMs.

We apply this architecture to morphological inflection (Infl_T) and G2P (G2P_T), and implement the models using the fairseq toolkit (https://github.com/pytorch/fairseq). All embeddings have 256 dimensions, and hidden layers are 1024-dimensional. Both encoder and decoder have 4 layers, and use 4 attention heads. We employ an Adam optimizer with an initial learning rate of 0.001 for training, together with dropout, and a batch size of 400. We train our models for G2P for 30 epochs and our models for morphological inflection for 100 epochs.

Table 1: Top: accuracy for inflection and G2P and character-level perplexity for language modeling. Bottom: number of model instances. Results are averaged over all runs; models are described in the text. (Columns: Infl_LSTM, Infl_T, G2P_LSTM, G2P_T, LM_LSTM, LM_T.)

LSTM language model architecture.

We also experiment with an LSTM language model (LM_LSTM). This architecture consists of a unidirectional LSTM, and it receives the last generated character as input at each time step.

Our implementation is based on the official pytorch LSTM language model example (https://github.com/pytorch/examples/tree/master/word_language_model). We use the default hyperparameters except for the following: our embeddings and LSTM hidden states are 512-dimensional, we use 2 hidden layers, and we train for 2 epochs with a batch size of 64.
Transformer language model architecture.
Our last architecture is a transformer language model (LM_T). Like the LSTM language model, it receives previously generated characters as input and computes a probability distribution over the character vocabulary.

Again, we use the fairseq toolkit for our implementation, and employ the default hyperparameters for the transformer language model. We use an Adam optimizer with an initial learning rate of 0.0005 for training, and dropout with a coefficient of 0.1. This model is trained for 3 epochs.

ELMo.

We further analyze the character embeddings of ELMo models (Peters et al., 2018). ELMo models are pretrained networks, aimed at producing contextualized word embeddings for use in downstream NLP tasks. The model architecture consists of a convolutional layer over character embeddings, whose output is then fed into a 2-layer bidirectional LSTM. ELMo models are trained with a bidirectional language modeling objective.

We experiment with 4 English models which are available online (https://allennlp.org/elmo): small (ELMo_s), medium (ELMo_m), original (ELMo_o), and original-5.5B (ELMo_l). These models differ in the number of their parameters and the amount of text they have been trained on: ELMo_s and ELMo_m have 13.6 million and 28 million parameters, respectively. Both ELMo_o and ELMo_l have 93.6 million parameters. All models except for ELMo_l have been trained on the 1 Billion Word Benchmark. ELMo_l has been trained on a combination of Wikipedia and news crawl data, which together result in a dataset of 5.5 billion tokens.

Figure 2: Pearson correlation between the vector of averaged human character differences and distance vectors according to character embeddings.

The Pearson correlation of all models with human judgements as well as with each other is shown in Figure 2. G2P_LSTM shows the highest correlation with 0.30, while LM_T is not correlated at all, and ELMo_m obtains the strongest negative correlation.

More generally, we see that most models are correlated with each other, with the exception of the ELMos: the predictions of ELMo_l are the only ELMo predictions with a substantial Pearson correlation with those of some other models.

Comparing to human scores, we find the following patterns: embeddings from seq2seq tasks show a higher correlation than embeddings from language models. Even ELMo models, which are trained on large amounts of text, obtain only a low maximum correlation. Thus, we conclude that language modeling does not result in embeddings which perform well on our character similarity task. Embeddings for G2P correlate more strongly with human character perception than embeddings from inflection, when comparing identical architectures.

Figure 3: Cut-distance between the average human distance matrix and all character embeddings.
This is noteworthy, since colors perceived by synesthetes are supposed to be influenced by the sound of each letter (Asano and Yokosawa, 2013), similar to the embeddings for G2P. Finally, comparing across architectures, we see that LSTM models correlate more strongly with human judgments than transformer models, which is in line with the common understanding that recurrent neural networks might be better models of human cognition.
First, we look at the global structure of all character embeddings (cf. Figure 5). All but the ELMo models exhibit a marked separation between a tight cluster of vowels (AEIOU) or extended vowels (+Y), which are highly similar, and the rest of the alphabet. In contrast, this distinction is not found for humans, neither for individuals nor for the average distance matrix. Despite this, the human average does present a clear global structure (cf. appendix for details). One clear cluster is BDGJKMNPQRVWXZ, with the particularly close pairs MN and XZ, perhaps due to the letters' shape, sound, or proximity in the alphabet. This cluster is far away from the letters in AES, which themselves do not form a cluster. Another cluster, on average far from the first, is formed by the letters CILOUY.

Apart from the clear separation between vowels and consonants, character embeddings exhibit rich additional structure (cf. Figure 5). For Infl_LSTM, the cluster BCFHJMPQW contains the tighter sub-cluster BJQW, which is similar to the letters GKVX. In contrast, Infl_T has the two small clusters HJKQ and LNR far from each other, and a less clear-cut structure among the rest of the consonants.

Figure 4: Comparison measures for models of character embeddings. Top: Betweenness centrality. Bottom: Cut-distances and cluster coefficients for the average human distance matrix and all character embeddings.

Embeddings from G2P_LSTM exhibit a structure clearly connected to the G2P task. The distance network has many small clusters of similar-sounding letters, e.g., UW, IJ, JY, GJ, CKQ, SXZ, and BFPV. G2P_LSTM's counterpart, G2P_T, produces similar groupings, but displays a more homogeneous structure overall.

LM_LSTM has the tight cluster KSXYZ within the larger BCDFGKMPSTVWXYZ cluster, and weaker connections between other letters. LM_T has a clearer cluster structure, showing the strongest separation between vowels and consonants, especially from the non-cluster JKQXZ, which is also well-separated from other consonants. ELMo models are outliers in that they do not present a clear global structure: only loose clusters can be identified.
Cluster coefficient.
Looking at the cluster coefficients (cf. Figure 4, last row), we also see a difference between ELMo embeddings and the other models: all models except for the ELMos have cluster coefficients between 0.72 and 0.88 and are, thus, close to the human average of 0.81. In contrast, the cluster coefficients for ELMo embeddings fall well outside this range.

Betweenness centrality.
The local values of betweenness centrality (cf. Figure 4) show the rich structure of the similarity matrices for most embeddings. For all but the ELMo models, the majority of letters have either extremely high or low levels of betweenness. In particular, vowels tend to occupy prominent places in the network structure. Humans, in contrast, are more similar to ELMo embeddings.
Cut-distance.
Looking at cut-distances (cf. Figure 4), we find that the structure of ELMo embeddings is significantly more similar to a random matrix than that of the other embeddings. The cut-distances (cf. Figure 3) between humans and embeddings largely agree with the conclusions from Section 6.1 – G2P_LSTM and the ELMo models are respectively the most similar and most dissimilar – even though correlation for node-to-node similarities does not necessarily imply a similar global structure.
Neural network analysis.
A lot of ink has been spilled on what neural network models learn and how. For instance, Zhang and Bowman (2018) investigated different pretraining objectives on their ability to induce syntactic and part-of-speech information. Pruksachatkun et al. (2020) studied model performance on probing tasks to investigate what models learn from intermediate-task training. Belinkov et al. (2017) explored what neural machine translation models learn about morphology.

Other work created test sets to evaluate specific linguistic model abilities. Linzen et al. (2016) made a dataset to investigate the ability of neural networks to detect mismatches in subject–verb agreement in the presence of distractor nouns. Warstadt et al. (2019) created a benchmark called BLiMP to assess the ability of language models to handle specific syntactic phenomena in English. Mueller et al. (2020) introduced a similar suite of test sets in English, French, German, Hebrew, and Russian, also focusing on syntactic phenomena. Similarly, Xiang et al. (2021) presented CLiMP, a benchmark for the evaluation of Chinese language models.

Besides that, attention mechanisms (Bahdanau et al., 2015) in neural models have been common subjects of investigation. Jain and Wallace (2019) claimed that "attention is not explanation", to be later challenged by Wiegreffe and Pinter (2019), who argued that "attention is not not explanation". However, the relationship between inputs, attention weights, and outputs is still poorly understood.

Furthermore, our work is related to research on which information is learned and how information is encoded by so-called language representation models, e.g., BERT (Devlin et al., 2019) or
Figure 5: Distance matrices and corresponding dendrograms reveal the cluster structure of character embeddings. Darker colors depict small distances (high similarity) between pairs. Dendrograms summarize the cluster structure, with the height of horizontal lines depicting the Ward distance between the corresponding clusters being joined. (Panels: ELMo, ELMo Small, ELMo Medium, ELMo Large, G2P LSTM, G2P Transformer, Inflection LSTM, Inflection Transformer, LM Transformer, LM LSTM.)
RoBERTa (Liu et al., 2019). Similar to attention in other models, attention in BERT has been investigated exhaustively. Clark et al. (2019) found that it captures substantial syntactic information, and Vig (2019) built a visualization tool for the attention mechanism. Hewitt and Manning (2019) evaluated whether syntax trees could be recovered from BERT or ELMo's word representation space. An overview of over 40 different studies of BERT can be found in Rogers et al. (2020).
Embedding analysis.
The research which is closest to our work investigates which information is captured by different types of embeddings, often by training a classifier to predict certain features of interest. For instance, Kann et al. (2019) used a classifier-based approach to examine whether word and sentence embeddings encode information about the frame-selectional properties of verbs. Ettinger et al. (2016) investigated the grammatical information contained in sentence embeddings regarding multiple linguistic phenomena. Qian et al. (2016) mapped a dense embedding to a sparse linguistic property space to explore the contained information. Bjerva and Augenstein (2018) studied language embeddings.

Different word similarity datasets have been used for word embedding evaluation, for instance RG-65 (Rubenstein and Goodenough, 1965), WordSim-353 (Finkelstein et al., 2002), or SimLex-999 (Hill et al., 2015).

In contrast to the work in this paragraph, which was concerned with word or sentence embeddings, we aim at understanding character embeddings.
In this paper, we performed an in-depth analysis of character embeddings extracted from various character-level models for NLP. We leveraged resources from research on grapheme–color synesthesia – a neuropsychological phenomenon where letters are associated with colors – to construct a dataset for a character similarity task. We further performed an analysis of networks representing characters as nodes and similarities as edge weights to understand how characters are organized by human synesthetes in comparison to character embeddings. Analysing 10 different character embeddings, we found that LSTMs agreed with humans more than transformer models. Comparing different tasks, G2P resulted in embeddings more similar to human character representations than inflection and, by a wide margin, language modeling. ELMo embeddings differed from humans and other models in that they exhibited no clear structure.
Acknowledgments
We would like to thank Manuel Mager and the anonymous reviewers for their feedback on this work. Mauro M. Monsalve-Mercado acknowledges financial support from the Center for Theoretical Neuroscience at Columbia University through the NSF NeuroNex Award DBI-1707398 and The Gatsby Charitable Foundation.
References
Noga Alon and Assaf Naor. 2004. Approximating the cut-norm via Grothendieck's inequality. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, STOC '04, pages 72–80, New York, NY, USA. Association for Computing Machinery.

Michiko Asano and Kazuhiko Yokosawa. 2013. Grapheme learning and grapheme-color synesthesia: Toward a comprehensive model of grapheme-color association. Frontiers in Human Neuroscience, 7:757.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872, Vancouver, Canada. Association for Computational Linguistics.

Johannes Bjerva and Isabelle Augenstein. 2018. From phonology to syntax: Unsupervised linguistic typology at different levels with language embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 907–916, New Orleans, Louisiana. Association for Computational Linguistics.

Ulrik Brandes. 2008. On variants of shortest-path betweenness centrality and their generic computation. Social Networks, 30(2):136–145.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. arXiv preprint arXiv:1906.04341.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2017. CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 1–30, Vancouver. Association for Computational Linguistics.

R. E. Cytowic. 2018. Synesthesia. MIT Press Essential Knowledge series. MIT Press.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

David M. Eagleman, Arielle D. Kagan, Stephanie S. Nelson, Deepak Sagaram, and Anand K. Sarma. 2007. A standardized test battery for the study of synesthesia. Journal of Neuroscience Methods, 159(1):139–145.

Allyson Ettinger, Ahmed Elgohary, and Philip Resnik. 2016. Probing for semantic evidence of composition by means of simple classification tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 134–139, Berlin, Germany. Association for Computational Linguistics.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1):116–131.

Alan Frieze and Ravindran Kannan. 1999. Quick approximation to matrices and applications. Combinatorica, 19:175–220.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Edward M. Hubbard, A. Cyrus Arman, Vilayanur S. Ramachandran, and Geoffrey M. Boynton. 2005. Individual differences among grapheme-color synesthetes: Brain-behavior correlations. Neuron, 45(6):975–985.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not Explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556, Minneapolis, Minnesota. Association for Computational Linguistics.

Katharina Kann and Hinrich Schütze. 2016. Single-model encoder-decoder with explicit morphological representation for reinflection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 555–560, Berlin, Germany. Association for Computational Linguistics.

Katharina Kann, Alex Warstadt, Adina Williams, and Samuel R. Bowman. 2019. Verb argument structure alternations in word and sentence embeddings. In Proceedings of the Society for Computation in Linguistics (SCiL) 2019, pages 287–297.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Aleksandra Mroczko-Wasowicz and Danko Nikolic. 2014. Semantic mechanisms may be responsible for developing synesthesia. Frontiers in Human Neuroscience, 8:509.

Aaron Mueller, Garrett Nicolai, Panayiota Petrou-Zeniou, Natalia Talmina, and Tal Linzen. 2020. Cross-linguistic syntactic evaluation of word prediction models. arXiv preprint arXiv:2005.00187.

M. Newman. 2018. Networks. OUP Oxford.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 412–418, Berlin, Germany. Association for Computational Linguistics.

Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe Pang, Clara Vania, Katharina Kann, and Samuel R. Bowman. 2020. Intermediate-task transfer learning with pretrained models for natural language understanding: When and why does it work? arXiv preprint arXiv:2005.00628.

Peng Qian, Xipeng Qiu, and Xuanjing Huang. 2016. Investigating language universal and specific properties in word embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1478–1488, Berlin, Germany. Association for Computational Linguistics.

V. S. Ramachandran and E. M. Hubbard. 2001a. Psychophysical investigations into the neural basis of synaesthesia. Proceedings of the Royal Society of London. Series B: Biological Sciences, 268(1470):979–983.

V. S. Ramachandran and E. M. Hubbard. 2001b. Synaesthesia – a window into perception, thought and language. Journal of Consciousness Studies, 8(12):3–34.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. arXiv preprint arXiv:2002.12327.

Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.

Jari Saramäki, Mikko Kivelä, Jukka-Pekka Onnela, Kimmo Kaski, and János Kertész. 2007. Generalizations of the clustering coefficient to weighted complex networks. Physical Review E, 75:027105.

Gaurav Sharma, Wencheng Wu, and Edul N. Dalal. 2005. The CIEDE2000 color-difference formula: Implementation notes, supplementary test data, and mathematical observations. Color Research & Application, 30(1):21–30.

Julia Simner. 2007. Beyond perception: Synaesthesia as a psycholinguistic phenomenon. Trends in Cognitive Sciences, 11(1):23–29.

Julia Simner. 2012. Defining synaesthesia. British Journal of Psychology, 103(1):1–15.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS.

Jesse Vig. 2019. Visualizing attention in transformer-based language representation models. arXiv preprint arXiv:1904.02679.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Joe Ward. 1963. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244.

Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. 2019. BLiMP: A benchmark of linguistic minimal pairs for English. arXiv preprint arXiv:1912.00582.

Miaomiao Wen, Zeyu Zheng, Hyeju Jang, Guang Xiang, and Carolyn Penstein Rosé. 2013. Extracting events with informal temporal references in personal histories in online communities. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 836–842, Sofia, Bulgaria. Association for Computational Linguistics.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20, Hong Kong, China. Association for Computational Linguistics.

Nathan Witthoft, Jonathan Winawer, and David M. Eagleman. 2015. Prevalence of learned grapheme-color pairings in a large online sample of synesthetes. PLoS One, 10(3).

Beilei Xiang, Changbing Yang, Yu Li, Alex Warstadt, and Katharina Kann. 2021. CLiMP: A benchmark for Chinese language model evaluation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.

Kelly Zhang and Samuel Bowman. 2018. Language modeling teaches you more than translation does: Lessons learned through auxiliary syntactic task analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 359–361, Brussels, Belgium. Association for Computational Linguistics.

Appendix A: Statistics
Most similar: X–Z, M–N, R–X, K–X, H–N, M–R, O–U, M–W, V–Z, O–Q
Most dissimilar: B–Y, B–C, C–R, C–M, M–Y, R–Y, C–X, I–R, P–Y, C–Z
Table 2: The most similar character pairs in descending order (top) and the most dissimilar character pairs in ascending order (bottom) for human synesthetes.
Appendix B: Analysis of Color Differences of Characters as Perceived by Synesthetes
In addition to the most important findings mentioned in the main part of this paper, in this section we further present an in-depth study of the character similarities according to human synesthetes.
Clustering Analysis.
For each individual, we compute a character difference matrix using CIEDE2000 (normalized to values between 0 and 1). We then use Ward's hierarchical clustering algorithm to explore hidden structural features. Several examples suggest that individual synesthetes tend to represent certain groups of letters with closely matching colors, since their perceptual color differences tend to form tight clusters (Figure 6A), as opposed to the node-to-node average over the entire population of synesthetes (Figure 6B). To make this finding concrete, we compute several measures aimed at quantifying the degree of clustering of each network.

First, we compute the distances between identified clusters in the distance matrix for each individual (Figure 6C). Pooled over the whole population, the distribution of cluster distances reveals an over-representation of small distances when compared to shuffled data. Over-represented small cluster distances, together with large jumps to big distances, imply the existence of just a handful of tight clusters with small within-cluster distances and larger inter-cluster distances, as suggested by the dendrogram examples in Figure 6A.

Imposing a cut-off cluster distance for the dendrograms, we effectively select the cluster structure of the distance matrix. Although several methods for selecting the cut-off have been studied, they often depend strongly on the nature of the dataset in question. Using three largely different cut-offs, we robustly find that each distance matrix encodes only a handful of tight clusters (cf. Figures 6D, E), typically around 3 to 5 clusters with an average size of 6 letters per cluster.
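Concretely, this analysis can be sketched with SciPy. The random matrix below is only a stand-in for one synesthete's normalized CIEDE2000 letter-difference matrix (the actual data is not reproduced here), and the cut-off value of 1.0 is an arbitrary illustration rather than one of the three cut-offs used in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Stand-in for a 26x26 letter distance matrix: symmetric, zero diagonal,
# entries in [0, 1] (normalized CIEDE2000 differences in the real data).
rng = np.random.default_rng(0)
d = rng.random((26, 26))
d = (d + d.T) / 2
np.fill_diagonal(d, 0.0)

# Ward's hierarchical clustering on the condensed distance matrix.
Z = linkage(squareform(d), method="ward")

# The linkage heights are the cluster distances whose distribution is
# pooled over individuals in the analysis above.
cluster_distances = Z[:, 2]

# Imposing a cut-off distance selects a flat cluster structure.
labels = fcluster(Z, t=1.0, criterion="distance")
n_clusters = len(set(labels))
```

`fcluster` with `criterion="distance"` cuts the dendrogram at the given height, so varying `t` reproduces the robustness check over different cut-offs.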
Clustering coefficient.
Next, we compute the local clustering coefficient for each node in each distance matrix and observe no strong differences among nodes belonging to the same matrix. For each matrix, we average the clustering coefficient over all nodes and look at the resulting distribution by pooling over all individuals (cf. Figure 6F). This reveals a narrow distribution peaked at a low clustering coefficient, implying a small average distance of any node to its neighbours. For comparison, we also compute the clustering coefficient for a random uniform distance matrix (symmetric, with zeroes on its diagonal), averaged over 100 iterations, and for a homogeneous distance matrix (all entries equal, with a zero diagonal). Moreover, we repeat the analysis for the node-to-node average distance matrix and find a relatively higher coefficient, implying larger distances on average between any node and its neighbors.

Betweenness centrality.
Next, we examine the betweenness centrality of nodes (cf. Figure 6H). We compute this measure on the auxiliary similarity matrix 1 − d, where d is a distance matrix, in order to interpret high betweenness values as marking the most important nodes. The distribution over all individuals reveals C, E, I, L, O, S, and Y as the top-ranking nodes. These are the nodes found on the most shortest paths between pairs of nodes, and they possibly lie at the intersections between otherwise separate clusters.
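The betweenness computation above can be sketched with Brandes' algorithm, the standard method for this measure (Brandes 2008 discusses its variants); the paper does not specify its implementation, so this is a minimal self-contained version. The 26×26 similarity matrix 1 − d would be passed in as `weights`.

```python
import heapq

def betweenness(weights):
    """Betweenness centrality for an undirected graph given as a dense
    matrix; weights[i][j] > 0 is the length of edge (i, j)."""
    n = len(weights)
    bc = [0.0] * n
    for s in range(n):
        # Single-source shortest paths (Dijkstra), tracking the number of
        # shortest paths (sigma) and the predecessors along them.
        dist = [float("inf")] * n
        sigma = [0.0] * n
        preds = [[] for _ in range(n)]
        dist[s], sigma[s] = 0.0, 1.0
        done = [False] * n
        heap = [(0.0, s)]
        order = []
        while heap:
            d, v = heapq.heappop(heap)
            if done[v]:
                continue
            done[v] = True
            order.append(v)
            for w in range(n):
                if done[w] or weights[v][w] <= 0:
                    continue
                nd = d + weights[v][w]
                if nd < dist[w] - 1e-12:          # strictly shorter path
                    dist[w], sigma[w], preds[w] = nd, sigma[v], [v]
                    heapq.heappush(heap, (nd, w))
                elif abs(nd - dist[w]) <= 1e-12:  # equally short path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies (Brandes' accumulation step).
        delta = [0.0] * n
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return [b / 2.0 for b in bc]  # each undirected pair is counted twice
```

On a three-node path graph, only the middle node lies on a shortest path between other nodes, so it receives all the centrality.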
Cut-distance.
Finally, cut-distances between the individual distance matrices and a reference matrix offer an additional characterization of their global structure. We compute cut-distances with respect to a zero matrix (the cut-norm), a random and a homogeneous matrix (defined as in the last paragraph), and with respect to shuffled versions of themselves, averaged over 100 iterations (cf. Figure 6G). We find narrow distributions, suggesting that all individuals share a similar global structure.

Figure 6: Network analysis of human synesthetes. (A)
Five randomly chosen examples illustrate the typical distance matrices for individual synesthetes (N=4269). The dendrogram summarizing the clustering structure is presented on top of its corresponding distance matrix for one example (left). The dendrogram is a tree-level representation of the identified clusters, where the height of each horizontal line represents the Ward distance between the selected pair of clusters. (B) The clustered distance matrix and corresponding dendrogram of the node-to-node average over all human synesthetes. (C) Distribution of all the distances between clusters for each distance matrix, pooled over all individuals. Cluster distances correspond to the heights of all horizontal lines in each dendrogram. For comparison, the distribution of cluster distances corresponding to all shuffled distance matrices is presented, together with their cumulative distributions. (D) Histograms of the number of clusters in each distance matrix, pooled over all individuals, for three different cutoff cluster distances. (E) Histograms of the sizes of the clusters found by the procedure in (D). (F) Distribution of the average clustering coefficient for each distance matrix, pooled over all individuals. For comparison, the average clustering coefficients for a random distance matrix, for the human average of (B), and for a homogeneous distance matrix are marked by dashed lines. (G) Distribution of cut-distances between all the human distance matrices and the zero matrix (cut-norm), a random distance matrix (averaged over 100 iterations), a homogeneous distance matrix, and shuffled versions of themselves (averaged over 100 iterations). (H) Distribution of betweenness centrality values over all individuals.
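The cut-norm underlying the cut-distance analysis, max over subsets S, T of |Σ_{i∈S, j∈T} a[i, j]|, is NP-hard to compute exactly; Alon and Naor (2004) give a constant-factor approximation via semidefinite programming. As an illustrative stand-in only, the sketch below uses randomized local search over indicator vectors, which yields a lower bound rather than the true value.

```python
import numpy as np

def cut_norm_lower_bound(a, restarts=20, seed=0):
    """Greedy local-search lower bound on the cut-norm of matrix a."""
    a = np.asarray(a, dtype=float)
    n, m = a.shape
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(restarts):
        # Indicator vectors for the row subset S and column subset T.
        x = rng.integers(0, 2, n).astype(float)
        y = rng.integers(0, 2, m).astype(float)
        improved = True
        while improved:
            improved = False
            for vec, size in ((x, n), (y, m)):
                for i in range(size):
                    current = abs(x @ a @ y)
                    vec[i] = 1.0 - vec[i]      # tentatively flip entry i
                    if abs(x @ a @ y) > current + 1e-12:
                        improved = True        # keep the improving flip
                    else:
                        vec[i] = 1.0 - vec[i]  # revert
        best = max(best, abs(x @ a @ y))
    return best
```

Each kept flip strictly increases the objective, so every restart terminates; multiple restarts reduce the chance of stopping in a poor local optimum.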