Learning Character-level Compositionality with Visual Features
Frederick Liu, Han Lu, Chieh Lo, Graham Neubig
Language Technologies Institute / Electrical and Computer Engineering
Carnegie Mellon University, Pittsburgh, PA 15213
{fliu1,hlu2,gneubig}@…, [email protected]

Abstract
Previous work has modeled the compositionality of words by creating character-level models of meaning, reducing problems of sparsity for rare words. However, in many writing systems compositionality has an effect even on the character-level: the meaning of a character is derived from the sum of its parts. In this paper, we model this effect by creating embeddings for characters based on their visual characteristics, creating an image for the character and running it through a convolutional neural network to produce a visual character embedding. Experiments on a text classification task demonstrate that such a model allows for better processing of instances with rare characters in languages such as Chinese, Japanese, and Korean. Additionally, qualitative analyses demonstrate that our proposed model learns to focus on the parts of characters that carry semantic content, resulting in embeddings that are coherent in visual space.
Compositionality—the fact that the meaning of a complex expression is determined by its structure and the meanings of its constituents—is a hallmark of every natural language (Frege and Austin, 1980; Szabó, 2010). Recently, neural models have provided a powerful tool for learning how to compose words together into a meaning representation of whole sentences for many downstream tasks. This is done using models of various levels of sophistication, from simpler bag-of-words (Iyyer et al., 2015) and linear recurrent neural network (RNN) models (Sutskever et al., 2014; Kiros et al., 2015), to more sophisticated models using tree-structured (Socher et al., 2013) or convolutional networks (Kalchbrenner et al., 2014).
Figure 1: Examples of character-level compositionality in (a, b) Chinese, (c) Korean, and (d) German. The red part of the characters is shared, and affects the pronunciation (top) or meaning (bottom).

In fact, a growing body of evidence shows that it is essential to look below the word-level and consider compositionality within words themselves. For example, several works have proposed models that represent words by composing together the characters into a representation of the word itself (Ling et al., 2015; Zhang et al., 2015; Dhingra et al., 2016). Additionally, for languages with productive word formation (such as agglutination and compounding), models calculating morphology-sensitive word representations have been found effective (Luong et al., 2013; Botha and Blunsom, 2014). These models help to learn more robust representations for rare words by exploiting morphological patterns, as opposed to models that operate purely on the lexical level with words as the atomic units.

For many languages, compositionality stops at the character-level: characters are atomic units of meaning or pronunciation in the language, and no further decomposition can be done (in English, for example, this is largely the case). However, for other languages, character-level compositionality, where a character's meaning or pronunciation can be derived from the sum of its parts, is very much a reality. Perhaps the most compelling example of compositionality of sub-character units can be found in logographic writing systems such as the Han and Kanji characters used in Chinese and Japanese, respectively (other prominent examples are largely for extinct languages: Egyptian hieroglyphics, Mayan glyphs, and Sumerian cuneiform scripts (Daniels and Bright, 1996)). As shown on the left side of Fig. 1, each part of a Chinese character (called a "radical") potentially contributes to the meaning (i.e., Fig. 1(a)) or pronunciation (i.e., Fig. 1(b)) of the overall character. This is similar to how English characters combine into the meaning or pronunciation of an English word. Even in languages with phonemic orthographies, where each character corresponds to a pronunciation instead of a meaning, there are cases where composition occurs. Fig. 1(c) and (d) show examples from Korean and German, respectively, where morphological inflection can change single characters while some but not all of the component parts are shared.

In this paper, we investigate the feasibility of modeling the compositionality of characters in a way similar to how humans do: by visually observing the character and using the features of its shape to learn a representation encoding its meaning. Our method is relatively simple and generalizable to a wide variety of languages: we first transform each character from its Unicode representation to a rendering of its shape as an image, then calculate a representation of the image using Convolutional Neural Networks (CNNs) (Cun et al., 1990). These features then serve as inputs to a downstream processing task and are trained in an end-to-end manner: the model first calculates a loss function, then back-propagates the loss back to the CNN.

Lang      Geography  Sports  Arts   Military  Economics  Transportation
Chinese   32.4k      49.8k   50.4k  3.6k      82.5k      40.4k
Japanese  18.6k      82.7k   84.1k  81.6k     80.9k      91.8k
Korean    6k         580     5.74k  840       5.78k      1.68k

Lang      Medical    Education  Food   Religion  Agriculture  Electronics
Chinese   30.3k      66.2k      554    66.9k     89.5k        80.5k
Japanese  66.5k      86.7k      20.2k  98.1k     97.4k        1.08k
Korean    16.1k      4.71k      33     2.60k     1.51k        1.03k

Table 1: By-category statistics for the Wikipedia dataset. Note that Food is the abbreviation for "Food and Culture" and Religion is the abbreviation for "Religion and Belief".
As demonstrated by our motivating examples in Fig. 1, in logographic languages character-level semantic or phonetic similarity is often indicated by visual cues; we conjecture that CNNs can appropriately model these visual patterns. Consequently, characters with similar visual appearances will be biased to have similar embeddings, allowing our model to handle rare characters effectively, just as character-level models have been effective for rare words.

To evaluate our model's ability to learn representations, particularly for rare characters, we perform experiments on a downstream task of classifying Wikipedia titles for three Asian languages: Chinese, Japanese, and Korean. We show that our proposed framework outperforms a baseline model that uses standard character embeddings for instances containing rare characters. A qualitative analysis of the characteristics of the learned embeddings of our model demonstrates that visually similar characters share similar embeddings. We also show that the learned representations are particularly effective under low-resource scenarios and complementary with standard character embeddings; combining the two representations through three different fusion methods (Snoek et al., 2005; Karpathy et al., 2014) leads to consistent improvements over the strongest baseline without visual features.
Before delving into the details of our model, we first describe a dataset we constructed to examine the ability of our model to capture the compositional characteristics of characters. Specifically, the dataset must satisfy two desiderata: (1) it must be necessary to fully utilize each character in the input in order to achieve high accuracy, and (2) there must be enough regularity and compositionality in the characters of the language.
Figure 2: The character rank-frequency distribution of the corpora we considered in this paper. All three languages have a long-tail distribution.

To satisfy these desiderata, we create a text classification dataset where the input is a Wikipedia article title in Chinese, Japanese, or Korean, and the output is the category to which the article belongs. This satisfies (1), because Wikipedia titles are short and thus each character in the title will be important to our decision about its category. It also satisfies (2), because Chinese, Japanese, and Korean have writing systems with large numbers of characters that decompose regularly, as shown in Fig. 1. While this task in itself is novel, it is similar to previous work in named entity type inference using Wikipedia (Toral and Munoz, 2006; Kazama and Torisawa, 2007; Ratinov and Roth, 2009), which has proven useful for downstream named entity recognition systems.
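As an aside, the long-tail curves in Fig. 2 can be reproduced with a few lines of Python; this is our own sketch, and the `titles` iterable of title strings is an assumed input rather than part of the released dataset code.

```python
from collections import Counter

def char_rank_frequency(titles):
    """Return character frequencies sorted from most to least frequent (rank 1, 2, ...)."""
    counts = Counter(ch for title in titles for ch in title)
    return sorted(counts.values(), reverse=True)
```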
As the labels we would like to predict, we use 12 different main categories from the Wikipedia web page: Geography, Sports, Arts, Military, Economics, Transportation, Health Science, Education, Food Culture, Religion and Belief, Agriculture, and Electronics. Wikipedia has a hierarchical structure, where each of these main categories has a number of subcategories, each subcategory has its own subcategories, and so on. We traverse this hierarchical structure, adding each main category tag to all of its descendants in this subcategory tree. In the case that a particular article is the descendant of multiple main categories, we favor the main category that minimizes the depth of the article in the tree (e.g., if an article is two steps away from Sports and three steps away from Arts, it will receive the "Sports" label). We also perform some rudimentary filtering, removing pages that match the regular expression ".*:.*", which catches special pages such as "title:agriculture". (The dataset and the crawling scripts are available at https://github.com/frederick0329/Wikipedia_title_dataset.)
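To make the labeling procedure concrete, below is a minimal Python sketch of how such a minimum-depth assignment could be implemented. The function and its `subcats`/`pages` inputs are our own illustration and not the authors' released crawling scripts.

```python
from collections import deque

def assign_main_categories(main_categories, subcats, pages):
    """Give each article the main category with the shallowest path to it.

    main_categories: the 12 top-level category names.
    subcats: dict mapping a category to its immediate subcategories.
    pages:   dict mapping a category to the article titles directly in it.
    Returns a dict mapping article title -> main category label.
    """
    label, best_depth = {}, {}
    for main in main_categories:
        # Breadth-first traversal visits each category at its minimum depth
        # below this main category.
        queue, seen = deque([(main, 0)]), {main}
        while queue:
            cat, depth = queue.popleft()
            for title in pages.get(cat, []):
                if depth < best_depth.get(title, float("inf")):
                    best_depth[title], label[title] = depth, main
            for sub in subcats.get(cat, []):
                if sub not in seen:
                    seen.add(sub)
                    queue.append((sub, depth + 1))
    # Rudimentary filtering: drop special pages such as "title:agriculture".
    return {t: c for t, c in label.items() if ":" not in t}
```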
Figure 3: An illustration of the two models: our proposed VISUAL model at the top and the baseline LOOKUP model at the bottom, using the same RNN architecture. A string of characters (e.g., "温病学"), each converted into a 36x36 image, serves as input to our VISUAL model. d_c is the dimension of the character embedding for the LOOKUP model.
For Chinese, Japanese, and Korean, respectively, the number of articles is 593k/810k/46.6k, and the average length and standard deviation of the title is 6.25 ± …, … ± …, and … ± … characters.

Our overall model for the classification task follows the encoder model by Sutskever et al. (2014). The baseline, the LOOKUP model, calculates the representation for each character by looking it up in a character embedding matrix. Our proposed VISUAL model instead learns the representation of each character from its visual appearance via a CNN.

Layer
1  Spatial Convolution (3, 3) → 32
2  ReLU
3  MaxPool (2, 2)
4  Spatial Convolution (3, 3) → 32
5  ReLU
6  MaxPool (2, 2)
7  Spatial Convolution (3, 3) → 32
8  ReLU
9  Linear (800, 128)
10 ReLU
11 Linear (128, 128)
12 ReLU

Table 2: Architecture of the CNN used in the experiments. All the convolutional layers have 32 3 × 3 filters.

LOOKUP Model
Given a character vocabulary C, for the LOOKUP model, shown in the bottom part of Fig. 3, the input to the network is a stream of characters c_1, c_2, ..., c_N, where c_n ∈ C. Each character is represented by a 1-of-|C| (one-hot) encoding. This one-hot vector is then multiplied by the lookup matrix T_C ∈ R^{|C| × d_c}, where d_c is the dimension of the character embedding. The randomly initialized character embeddings are optimized with the classification loss.
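Multiplying a one-hot vector by T_C is exactly an embedding lookup. As a hedged illustration (the paper's implementation used Torch/Lua, so this PyTorch class and its names are our own):

```python
import torch.nn as nn

class LookupEmbedder(nn.Module):
    """Equivalent to multiplying a 1-of-|C| vector by T_C in R^{|C| x d_c}."""
    def __init__(self, vocab_size, d_c=128):
        super().__init__()
        # Randomly initialized; optimized end-to-end with the classification loss.
        self.embed = nn.Embedding(vocab_size, d_c)

    def forward(self, char_ids):        # (batch, N) integer character ids
        return self.embed(char_ids)     # (batch, N, d_c) character embeddings
```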
VISUAL Model

The proposed method aims to learn a representation that includes image information, allowing for better parameter sharing among characters, particularly characters that are less common. Differently from the LOOKUP model, each character is first transformed into a 36-by-36 image based on its Unicode encoding, as shown in the upper part of Fig. 3. We then pass the image through a CNN to get the embedding for the image. The parameters of the CNN are learned through backpropagation from the classification loss. Because we are training embeddings based on this classification loss, we expect that the CNN will focus on the parts of the image that contain semantic information useful for category classification, a hypothesis that we examine in the experiments (see Section 5.5).

In more detail, the specific structure of the CNN that we utilize consists of three convolution layers, where each convolution layer is followed by max pooling and a ReLU nonlinear activation layer. The configuration of each layer is listed in Tab. 2. The output vector for the image embedding also has size d_c, the same as in the LOOKUP model.
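The following is a sketch of the CNN in Tab. 2, again in PyTorch rather than the original Torch/Lua. The rendering helper is purely illustrative: the paper renders characters from their Unicode codepoints but does not name a rendering library or font, so the use of Pillow and the font path below are our assumptions.

```python
import torch
import torch.nn as nn
from PIL import Image, ImageDraw, ImageFont

def render_char(ch, size=36, font_path="NotoSansCJK-Regular.ttc"):
    """Render one character as a (1, 1, size, size) grayscale tensor, white background."""
    img = Image.new("L", (size, size), color=255)
    font = ImageFont.truetype(font_path, size - 4)      # hypothetical font file
    ImageDraw.Draw(img).text((2, 2), ch, fill=0, font=font)
    pixels = torch.tensor(list(img.getdata()), dtype=torch.float32) / 255.0
    return pixels.view(1, 1, size, size)

class VisualEmbedder(nn.Module):
    """CNN of Tab. 2: three 3x3 convolutions with 32 filters, two 2x2 max-pools,
    then Linear(800, 128) and Linear(128, 128), each followed by ReLU."""
    def __init__(self, d_c=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3), nn.ReLU(), nn.MaxPool2d(2),   # 36 -> 34 -> 17
            nn.Conv2d(32, 32, 3), nn.ReLU(), nn.MaxPool2d(2),  # 17 -> 15 -> 7
            nn.Conv2d(32, 32, 3), nn.ReLU(),                   # 7 -> 5, so 32*5*5 = 800
        )
        self.fc = nn.Sequential(
            nn.Linear(32 * 5 * 5, d_c), nn.ReLU(),
            nn.Linear(d_c, d_c), nn.ReLU(),
        )

    def forward(self, images):                              # (num_chars, 1, 36, 36)
        return self.fc(self.features(images).flatten(1))    # (num_chars, d_c)
```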
Encoder and Classifier
For both the LOOKUP and the VISUAL models, we adopt an RNN encoder using Gated Recurrent Units (GRUs) (Chung et al., 2014). Each of the GRU units processes the character embeddings sequentially. At the end of the sequence, the incremental GRU computation results in a hidden state e embedding the sentence. The encoded sentence embedding is passed through a linear layer whose output is the same size as the number of classes. We use a softmax layer to compute the posterior class probabilities:

P(y = j \mid e) = \frac{\exp(w_j^T e + b_j)}{\sum_{i=1}^{L} \exp(w_i^T e + b_i)}    (1)

To train the model, we use the cross-entropy loss between predicted and true targets:

J = \frac{1}{B} \sum_{i=1}^{B} \sum_{j=1}^{L} -t_{i,j} \log(p_{i,j})    (2)

where t_{i,j} ∈ {0, 1} represents the ground truth label of the j-th class for the i-th Wikipedia page title, B is the batch size, and L is the number of categories.
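The following PyTorch sketch (class names are ours; the original code used Torch/Lua) ties an embedder, the GRU encoder, and the classifier of Eqs. (1) and (2) together:

```python
import torch.nn as nn

class TitleClassifier(nn.Module):
    """GRU encoder over character embeddings, followed by a linear layer;
    softmax and cross-entropy (Eqs. 1 and 2) are applied by the loss below."""
    def __init__(self, embedder, d_c=128, num_classes=12):
        super().__init__()
        self.embedder = embedder                  # LookupEmbedder or VisualEmbedder
        self.gru = nn.GRU(d_c, d_c, batch_first=True)
        self.out = nn.Linear(d_c, num_classes)    # computes w_j^T e + b_j per class

    def forward(self, chars):
        # chars: (batch, N) ids for the LOOKUP model; for the VISUAL model the
        # embedder would instead consume rendered images, one per character.
        embs = self.embedder(chars)               # (batch, N, d_c)
        _, e = self.gru(embs)                     # final hidden state: (1, batch, d_c)
        return self.out(e.squeeze(0))             # class scores (logits)

# Cross-entropy over the softmax posteriors, averaged over the batch (Eq. 2).
criterion = nn.CrossEntropyLoss()
```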
One thing to note is that the LOOKUP and the VISUAL models have their own advantages. The LOOKUP model learns embeddings that capture the semantics of each character symbol without sharing information between characters. In contrast, the proposed VISUAL model directly learns embeddings from visual information, which naturally shares information between visually similar characters. This characteristic gives the VISUAL model the ability to generalize better to rare characters, but also the potential disadvantage of introducing noise for characters with similar appearances but different meanings.

With the complementary nature of these two models in mind, we further combine the two embeddings to achieve better performance. We adopt three fusion schemes: early fusion and late fusion (described by Snoek et al. (2005) and Karpathy et al. (2014)), and fallback fusion, a method specific to this paper.

Lookup/Visual  100%       50%        12.5%
zh trad        0.55/0.54  0.53/0.50  0.48/0.47
zh simp        0.55/0.54  0.53/0.52  0.48/0.46
ja             0.42/0.39  0.47/0.45  0.44/0.41
ko             0.47/0.42  0.44/0.39  0.37/0.36

Table 3: The classification results of the LOOKUP/VISUAL models for different percentages of the full training size.
Early Fusion

Early fusion works by concatenating the two varieties of embeddings before feeding them into the RNN. In order to ensure that the dimensions of the RNN input are the same after concatenation, the concatenated vector is fed through a hidden layer to reduce the size from 2 × d_c to d_c. The whole model is then fine-tuned with the training data.

Late Fusion
Instead of learning a joint representation like early fusion, late fusion averages the model predictions. Specifically, it takes the output of the softmax layers from both models and averages the probabilities to create a final distribution used to make the prediction.
Fallback Fusion
Our final fallback fusion method hypothesizes that our VISUAL model does better on instances which contain more rare characters. First, in order to quantify the overall rareness of an instance consisting of multiple characters, we calculate the average training set frequency of the characters therein. The fallback fusion method uses the VISUAL model to predict testing instances with average character frequency below or equal to a threshold (here we use a frequency of 0.0 as the cutoff, which means all characters in the instance do not appear in the training set), and uses the LOOKUP model to predict the remaining instances.
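A sketch of the three fusion schemes follows (illustrative code of ours; `char_freq` is an assumed dictionary of training-set character frequencies, and the logits are assumed to come from two separately trained classifiers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def late_fusion(lookup_logits, visual_logits):
    """Average the two models' softmax distributions and predict the argmax."""
    probs = (F.softmax(lookup_logits, dim=-1) + F.softmax(visual_logits, dim=-1)) / 2
    return probs.argmax(dim=-1)

def fallback_fusion(title, lookup_pred, visual_pred, char_freq, threshold=0.0):
    """Use the VISUAL prediction when the average training-set character frequency
    of the title is at or below the threshold (0.0 = every character is unseen)."""
    avg_freq = sum(char_freq.get(c, 0) for c in title) / max(len(title), 1)
    return visual_pred if avg_freq <= threshold else lookup_pred

class EarlyFusionEmbedder(nn.Module):
    """Concatenate the two character embeddings and project 2*d_c back to d_c
    before feeding the RNN encoder."""
    def __init__(self, lookup_embedder, visual_embedder, d_c=128):
        super().__init__()
        self.lookup, self.visual = lookup_embedder, visual_embedder
        self.proj = nn.Linear(2 * d_c, d_c)

    def forward(self, char_ids, char_images):
        e_lookup = self.lookup(char_ids)                          # (batch, N, d_c)
        b, n = char_ids.shape
        e_visual = self.visual(char_images.view(b * n, 1, 36, 36)).view(b, n, -1)
        return self.proj(torch.cat([e_lookup, e_visual], dim=-1))
```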
In this section, we compare our proposed VISUAL model with the baseline LOOKUP model through three different sets of experiments. First, we examine whether our model is capable of classifying text and achieving performance similar to the baseline model. Next, we examine the hypothesis that our model will outperform the baseline model when dealing with low frequency characters. Finally, we examine the fusion methods described in Section 4.
The dimension of the embeddings and the batch size for both models are set to d_c = 128 and B = 400, respectively. We build our proposed model using Torch (Collobert et al., 2002), and use Adam (Kingma and Ba, 2014) with a learning rate of η = 0.… for stochastic optimization. The length of each instance is cut off or padded to 10 characters for batch training.

In this experiment, we examine whether our VISUAL model achieves performance similar to the baseline LOOKUP model in classification accuracy. The results in Tab. 3 show that the baseline model performs 1-2% better across the four datasets; this is due to the fact that the LOOKUP model can directly learn character embeddings that capture the semantics of each character symbol for frequent characters. In contrast, the VISUAL model learns embeddings from visual information, which constrains characters with similar appearances to have similar embeddings. This is an advantage for rare characters, but a disadvantage for high frequency characters, because being similar in appearance does not always lead to similar semantics.

To demonstrate that this is in fact the case, besides looking at the overall classification accuracy, we also examine the performance on classifying low frequency instances, which are sorted according to the average training set frequency of the characters therein. Tab. 4 and Fig. 4 both show that our model performs better on the 100 lowest frequency instances (the intersection point of the two models). More specifically, taking Fig. 4(a) as an example, the solid (proposed) line is higher than the dashed (baseline) line up to rank 100, indicating that the proposed model outperforms the baseline for the first 100 instances. Lines depart the x-axis when the model classifies its first instance correctly, and the LOOKUP model did not correctly classify any of the first 80 rarest instances, resulting in it crossing later than the proposed model. This confirms that the VISUAL model can share visual information among characters and help to classify low frequency instances.

Figure 4: Experiments on different training sizes for four different datasets. More specifically, we consider three different training data size percentages (TPs) (100%, 50%, and 12.5%) and four datasets: (a) traditional Chinese, (b) simplified Chinese, (c) Japanese, and (d) Korean. We calculate the accumulated number of correctly predicted instances for the VISUAL model (solid lines) and the LOOKUP model (dashed lines). This figure is a log-log plot, where the x-axis shows rarity (rarest to the left) and the y-axis shows cumulative correctly classified instances up to this rank; a perfect classifier will result in a diagonal line.

For training time, visual features take significantly more time, as expected: the VISUAL model is about 30x slower to train than the LOOKUP model, although the two are equivalent at test time. For space, images of Chinese characters took 36MB to store for 8985 characters.
In our second experiment, we consider two smaller training sizes (i.e., 50% and 12.5% of the full training size), indicated by the green and red lines in Fig. 4. We performed this experiment under the hypothesis that, because the proposed method is more robust to infrequent characters, the proposed model may perform better in low-resource scenarios. If this is the case, the intersection point of the two models will shift right because of the increase in the number of instances with low average character frequency.
Lookup/Visual  100      1000     10000
zh trad        0.22/…   …        …/0.39
zh simp        0.25/…   …/0.37   …/0.40
ja             0.30/…   …/0.41   …/0.41
ko             …/0.33   …/0.33   …/0.42

Table 4: Classification results for the LOOKUP/VISUAL models on the k lowest frequency instances across the four datasets. The differences on the 100 lowest frequency instances for traditional and simplified Chinese and Korean were significant (p-value < …).

        zh trad  zh simp  ja      ko
Lookup  0.5503   0.5543   0.4914  0.4765
Visual  0.5434   0.5403   0.4775  0.4207
early   0.5520   0.5546   0.4896  0.4796
late    …        …        …       …
fall    0.5507   0.5547   0.4914  0.4766

Table 5: Experiment results for the three different fusion methods across the 4 datasets. The late fusion model was better (p-value < …).

Results of the different fusion methods can be found in Tab. 5. The results show that late fusion gives the best performance among all the fusion schemes combining the LOOKUP model and the proposed VISUAL model. Early fusion achieves small improvements for all languages except Japanese, where it displays a slight drop. Unsurprisingly, fallback fusion performs better than the LOOKUP model and the VISUAL model alone, since it directly targets the weakness of the LOOKUP model (i.e., rare characters) and replaces those results with the VISUAL model. These results show that simple integration, no matter which scheme we use, is beneficial, demonstrating that both methods are capturing complementary information.
Finally, we qualitatively examine what is learned by our proposed model in two ways. First, we visualize which parts of the image are most important to the VISUAL model's embedding calculation. Second, we show the 6-nearest neighbor results for characters using both the LOOKUP and the VISUAL embeddings.
Figure 5: Examples of how much each part of the character contributes to its embedding (the darker, the more). Two characters are shown per radical to emphasize that characters with the same radical have similar patterns. (The characters shown are glossed as: Iron, Bronze, Salmon, Serranidae; Silk, Coil, Rhyme, Pleased; Wave, Put on, Cypress, Pillar; Cuckoo, Eagle, Mosquito, Ant.)
Emphasis of the VISUAL Model

In order to delve deeper into what the VISUAL model has learned, we measure a modified version of the occlusion sensitivity proposed by Zeiler and Fergus (2014) by masking the original character image in four ways, and examine the importance of each part of the character to the model's calculated representations. Specifically, we leave only the upper half, bottom half, left half, or right half of the image, and mask the remainder with white pixels, since Chinese characters are usually formed by combining two radicals vertically or horizontally. We run these four images forward through the CNN part of the model and calculate the L2 distance between each masked image embedding and the full image embedding. The larger the distance, the more the masked part of the character contributes to the original embedding. The contribution of each part (i.e., the L2 distance) is represented as a heat map, which is then normalized to adjust the opacity of the character strokes for better visualization. The value of each corner of the heat map is calculated by adding the two L2 distances that contribute to that corner.
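A sketch of this masking procedure (our own code; it assumes a trained VisualEmbedder-style network and a 36x36 grayscale image in which white background pixels are 1.0):

```python
import torch

def half_occlusion_distances(model, image):
    """image: (1, 1, 36, 36) character image. For each half, white it out (keeping the
    other half) and measure the L2 distance to the full-image embedding; a larger
    distance means the masked half contributed more to the original embedding."""
    masked_halves = {
        "upper":  (slice(0, 18),  slice(0, 36)),
        "bottom": (slice(18, 36), slice(0, 36)),
        "left":   (slice(0, 36),  slice(0, 18)),
        "right":  (slice(0, 36),  slice(18, 36)),
    }
    model.eval()
    dists = {}
    with torch.no_grad():
        full_emb = model(image)
        for name, (rows, cols) in masked_halves.items():
            masked = image.clone()
            masked[:, :, rows, cols] = 1.0       # replace this half with white pixels
            dists[name] = torch.dist(model(masked), full_emb).item()
    # Each corner of the heat map adds the two distances that touch it,
    # e.g. upper-left = dists["upper"] + dists["left"].
    return dists
```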
The visualization is shown in Fig. 5. The meaning of each Chinese character in English is shown below the Chinese character. The opacity of the character strokes represents how much the corresponding parts contribute to the original embedding (the darker, the more). In general, the darker part of the character is related to its semantics. For example, "金" means gold in Chinese, and it is highlighted in both "鐵" (Iron) and "銅" (Bronze). We can find similar results for the other examples shown in Fig. 5. Fig. 5 also demonstrates that our model captures the compositionality of Chinese characters, both the meaning of sub-character units and their structure (e.g., the semantic content tends to be structurally localized on one side of a Chinese character).

Figure 6: Visualization of traditional Chinese characters by finding the 6-nearest neighbors of the query (i.e., center) characters. The highlighted red indicates the radical, along with the meaning of the characters.

K-nearest neighbors
Finally, to illustrate the difference between the embeddings learned by the two models, we display the 6-nearest neighbors (by L2 distance) for selected characters in Fig. 6. As can be seen, the VISUAL embeddings of characters with similar appearances are close to each other. In addition, similarity in the radical part indicates semantic similarity between the characters; for example, the characters with the radical "鳥" all refer to different types of birds. The LOOKUP embeddings do not show such behavior, as the model learns the embedding individually for each symbol and relies heavily on the training set and the task. In fact, the characters shown in Fig. 6 for the LOOKUP model do not exhibit semantic similarity either. There are two potential explanations for this: First, the category classification task that we utilized does not rely heavily on the fine-grained semantics of each character, and thus the LOOKUP model was able to perform well without capturing the semantics of each character precisely. Second, the Wikipedia dataset contains a large number of names and locations, and the characters therein might not have the same semantic meaning as in daily vocabulary.
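The neighbor queries themselves are straightforward; a small sketch of ours, assuming an `emb` matrix of learned character embeddings aligned with a `chars` list:

```python
import torch

def k_nearest_characters(emb, chars, query_idx, k=6):
    """emb: (|C|, d_c) embedding matrix; chars: list of characters in the same order.
    Returns the k characters whose embeddings are closest (L2) to the query's."""
    dists = torch.cdist(emb[query_idx:query_idx + 1], emb).squeeze(0)  # (|C|,)
    dists[query_idx] = float("inf")                # exclude the query itself
    nearest = torch.topk(dists, k, largest=False).indices
    return [chars[i] for i in nearest.tolist()]
```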
Methods that utilize neural networks to learn distributed representations of words or characters have been widely developed. However, word2vec (Mikolov et al., 2013), for example, requires storing an extremely large table of vectors for all word types. Because of the number of word types in Twitter tweets, for instance, work has been done to generate vector representations of tweets at the character-level (Dhingra et al., 2016).

There is also work on understanding mathematical expressions with a convolutional network for text and layout recognition, using an attention-based neural machine translation system (Deng et al., 2016). They tested on real-world rendered mathematical expressions paired with LaTeX markup and showed the system is effective at generating accurate markup. Beyond that, there are several works that combine visual information with text to improve machine translation (Sutskever et al., 2014), visual question answering, caption generation (Xu et al., 2015), etc. These works extract image representations from a pre-trained CNN (Zhu et al., 2016; Wang et al., 2016).

Unrelated to images, CNNs have also been used for text classification (Kim, 2014; Zhang et al., 2015). These models look at sequential dependencies at the word or character-level and achieve state-of-the-art results. These works inspire us to use a CNN to extract features from images that serve as the input to the RNN. Our model is able to directly back-propagate the gradient all the way through the CNN that generates the visual embeddings, in a way such that the embedding can contain both semantic and visual information.

Several techniques for reducing the effect of rare words have been introduced in the literature, including spelling expansion (Habash, 2008), dictionary term expansion (Habash, 2008), proper name transliteration (Daumé and Jagarlamudi, 2011), treating words as sequences of characters (Luong and Manning, 2016), subword units (Sennrich et al., 2015), and reading text as bytes (Gillick et al., 2015). However, most of these techniques still have no mechanism for handling low frequency characters, which are the target of this work.

Finally, there is work on improving embeddings with radicals, which explicitly splits Chinese characters into radicals based on a dictionary of which radicals are included in which characters (Li et al., 2015; Shi et al., 2015; Yin et al., 2016). The motivation of this method is similar to ours, but it is only applicable to Chinese, in contrast to the method in this paper, which works on any language for which we can render text.
In this paper, we proposed a new framework that utilizes the appearance of characters, convolutional neural networks, and recurrent neural networks to learn embeddings that are compositional in the component parts of the characters. More specifically, we collected a Wikipedia dataset, which consists of short titles in three different languages and satisfies the compositionality of the characters of the language. Next, we proposed an end-to-end model that learns visual embeddings for characters using a CNN, and showed that the features extracted from the CNN include both visual and semantic information. Furthermore, we showed that our VISUAL model outperforms the LOOKUP baseline model on low frequency instances. Additionally, by examining the character embeddings visually, we found that our VISUAL model is able to learn visually related embeddings.

In summary, we tackled the problem of rare characters by using embeddings learned from images. In the future, we hope to further generalize this method to other tasks such as pronunciation estimation, which can take advantage of the fact that pronunciation information is encoded in parts of the characters, as demonstrated in Fig. 1, or machine translation, which could benefit from a holistic view that considers both semantics and pronunciation. We also hope to apply the model to other languages with complicated compositional writing systems, potentially including historical texts such as hieroglyphics or cuneiform.
Acknowledgments
We thank Taylor Berg-Kirkpatrick, Adhiguna Kuncoro, Chen-Hsuan Lin, Wei-Cheng Chang, Wei-Ning Hsu, and the anonymous reviewers for their enlightening comments and feedback.
References
Jan A. Botha and Phil Blunsom. 2014. Compositional morphology for word representations and language modelling. In ICML, pages 1899–1907.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Ronan Collobert, Samy Bengio, and Johnny Marithoz. 2002. Torch: A modular machine learning software library.

Y. Le Cun, B. Boser, J. S. Denker, R. E. Howard, W. Habbard, L. D. Jackel, and D. Henderson. 1990. Advances in Neural Information Processing Systems 2, pages 396–404.

Peter T. Daniels and William Bright. 1996. The World's Writing Systems. Oxford University Press.

Hal Daumé and Jagadeesh Jagarlamudi. 2011. Domain adaptation for machine translation by mining unseen words. In ACL-HLT, pages 407–412.

Yuntian Deng, Anssi Kanervisto, and Alexander M. Rush. 2016. What you get is what you see: A visual markup decompiler. arXiv preprint arXiv:1609.04938.

Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, and William W. Cohen. 2016. Tweet2Vec: Character-based distributed representations for social media. In ACL.

Gottlob Frege and John Langshaw Austin. 1980. The Foundations of Arithmetic: A Logico-Mathematical Enquiry into the Concept of Number. Northwestern University Press.

Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2015. Multilingual language processing from bytes. arXiv preprint arXiv:1512.00103.

Nizar Habash. 2008. Four techniques for online handling of out-of-vocabulary words in Arabic-English statistical machine translation. In HLT-Short, pages 57–60.

Mohit Iyyer, Varun Manjunatha, and Jordan L. Boyd-Graber. 2015. Deep unordered composition rivals syntactic methods for text classification. In ACL.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In ACL, pages 655–665.

Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732.

Jun'ichi Kazama and Kentaro Torisawa. 2007. Exploiting Wikipedia as external knowledge for named entity recognition. In EMNLP-CoNLL, pages 698–707.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP, pages 1746–1751.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In NIPS, pages 3294–3302.

Yanran Li, Wenjie Li, Fei Sun, and Sujian Li. 2015. Component-enhanced Chinese character embeddings. In EMNLP, pages 829–834.

Wang Ling, Chris Dyer, Alan W. Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luis Marujo, and Tiago Luis. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In EMNLP, pages 1520–1530.

Minh-Thang Luong and Christopher D. Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. In ACL, pages 1054–1063.

Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In CoNLL, pages 104–113.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119.

M. E. J. Newman. 2005. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, pages 323–351.

Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In CoNLL, pages 147–155.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. In ACL, pages 1715–1725.

Xinlei Shi, Junjie Zhai, Xudong Yang, Zehua Xie, and Chao Liu. 2015. Radical embedding: Delving deeper to Chinese radicals. In ACL, pages 594–598.

Cees G. M. Snoek, Marcel Worring, and Arnold W. M. Smeulders. 2005. Early versus late fusion in semantic video analysis. In ACM MM, pages 399–402.

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pages 1631–1642.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112.

Zoltán Gendler Szabó. 2010. Compositionality. Stanford Encyclopedia of Philosophy.

Antonio Toral and Rafael Munoz. 2006. A proposal to automatically build and maintain gazetteers for named entity recognition by using Wikipedia. In EACL, pages 56–61.

Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, and Wei Xu. 2016. CNN-RNN: A unified framework for multi-label image classification. In CVPR, pages 2285–2294.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML.

Rongchao Yin, Quan Wang, Rui Li, Peng Li, and Bin Wang. 2016. Multi-granularity Chinese word embedding. In EMNLP, pages 981–986.

Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In ECCV, Springer, pages 818–833.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In NIPS, pages 649–657.

Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. 2016. Visual7W: Grounded question answering in images. In CVPR.