RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER
Lin Sun, Jiquan Wang*, Kai Zhang, Yindu Su, Fangsheng Weng
Department of Computer Science, Zhejiang University City College, Hangzhou, China
College of Computer Science and Technology, Zhejiang University, Hangzhou, China
Department of Computer Science and Technology, Tsinghua University, Beijing, China
[email protected], [email protected], [email protected], [email protected]
Abstract
Recently, multimodal named entity recognition (MNER) has utilized images to improve the accuracy of NER in tweets. However, most multimodal methods use attention mechanisms to extract visual clues regardless of whether the text and image are relevant. In practice, irrelevant text-image pairs account for a large proportion of tweets. Visual clues that are unrelated to the text exert uncertain or even negative effects on multimodal model learning. In this paper, we introduce a method of text-image relation propagation into the multimodal BERT model. We integrate soft or hard gates to select visual clues and propose a multitask algorithm to train on the MNER datasets. In the experiments, we deeply analyze the changes in visual attention before and after the use of text-image relation propagation. Our model achieves state-of-the-art performance on the MNER datasets. The source code is available online.*

Introduction
Social media platforms such as Twitter have become part of the everyday lives of many people. They are important sources for various information extraction applications, such as open event extraction (Wang, Deyu, and He 2019) and social knowledge graph construction (Hosseini 2019). As a key component of these applications, named entity recognition (NER) aims to detect named entities (NEs) and classify them into predefined types, such as person (PER), location (LOC), and organization (ORG). Recent works on tweets based on multimodal learning have been increasing (Moon, Neves, and Carvalho 2018; Lu et al. 2018; Zhang et al. 2018; Arshad et al. 2019; Yu et al. 2020). These researchers investigated enhancing linguistic representations with the aid of visual clues in tweets. Most MNER methods used attention weights to extract visual clues related to the NEs (Lu et al. 2018; Zhang et al. 2018; Arshad et al. 2019). For example, Fig. 1(a) shows a successful visual attention example from (Lu et al. 2018). In fact, texts and images in tweets can also be irrelevant. Vempala and Preoţiuc-Pietro (2019) categorized text-image relations according to whether the "Image adds to the tweet meaning". The "Image does not add to the tweet meaning" type accounts for approximately 56% of instances in Vempala's text-image relation classification (TRC) dataset. In addition, we trained a classifier of whether the "Image adds to the tweet meaning" on a large randomly collected corpus, Twitter100k (Hu et al. 2017), and the proportion of classified negatives was approximately 60%.

*https://github.com/Multimodal-NER/RpBERT

Figure 1: Visual attention examples of MNER from (Lu et al. 2018). The left column is a tweet's image and the right column is its corresponding attention visualization. (a) Successful case: "[PER Radiohead] offers old and new at first concert in four years." (b) Failure case: "Nice image of [PER Kevin Love] and [PER Kyle Korver] during 1st half [LOC Cleveland]."
Attention-based models produce visual attention even when the text and image are irrelevant, and such visual attention might exert negative effects on the text inference. Fig. 1(b) shows a failure visual attention example: the visual attention focuses on the wall and ground, resulting in tagging "[ORG] Cleveland" with the wrong label "LOC". In this paper, we consider inferring the text-image relation to address the problem of inappropriate visual attention clues in multimodal models. The contributions of this paper can be summarized as follows:

• We propose a novel text-image relation propagation-based multimodal BERT model. We investigate soft and hard ways of propagating text-image relations through the model by training. A training procedure for the multiple tasks of text-image relation classification and downstream NER is also presented.

• We provide insights into the visual attention by numerical distributions and heat maps. Text-image relation propagation can not only reduce the interference from irrelevant images but also leverage more visual information for relevant text-image pairs.

• The experimental results show that the failure cases in the related works are correctly recognized by our model, and state-of-the-art performance is achieved.

Related Work
Multimodal NER
Moon et al. (2018) proposed a modality-attention module at the input of an NER network. The module computed a weighted modal combination of word embeddings, character embeddings, and visual features. Lu et al. (2018) presented a visual attention model to find the image regions related to the content of the text. The attention weights of the image regions were computed by a linear projection of the sum of the text query vector and regional visual representations. The extracted visual context features were incorporated into the word-level outputs of the biLSTM model. Zhang et al. (2018) designed an adaptive co-attention network (ACN) layer between the LSTM and CRF layers. The ACN contained a gated multimodal fusion module to learn a fusion vector of the visual and linguistic features. The authors designed a filtration gate to determine whether the fusion feature was helpful in improving the tagging accuracy of each token; the output score of the filtration gate was computed by a sigmoid activation function. Arshad et al. (2019) also presented a gated multimodal fusion representation for each token. The gated fusion was a weighted sum of the visual attention feature and the token alignment feature. The visual attention feature was calculated as a weighted sum of VGG-19 (Simonyan and Zisserman 2014) visual features, where the weights were additive attention scores between a word query and the image features. Overall, the problem of attention-guided models is that the extracted visual contextual cues do not match the text for irrelevant text-image pairs. The authors of (Lu et al. 2018; Arshad et al. 2019) showed failed examples in which unrelated images provided misleading visual attention and yielded prediction errors.
Pretrained multimodal BERT
The pretrained model BERT has achieved great success in natural language processing (NLP). The latest visual-linguistic models based on the BERT architecture include VL-BERT (Su et al. 2019), ViLBERT (Lu et al. 2019), VisualBERT (Li et al. 2019), UNITER (Chen et al. 2020), LXMERT (Tan and Bansal 2019), and Unicoder-VL (Li et al. 2020). We summarize and compare the existing visual-linguistic BERT models in three aspects:

1) Architecture. The structures of Unicoder-VL, VisualBERT, VL-BERT, and UNITER were the same as that of vanilla BERT: the image and text tokens were combined into a sequence and fed into BERT to learn contextual embeddings. LXMERT and ViLBERT separated visual and language processing into two streams that interacted through cross-modality or co-attentional transformer layers, respectively.

2) Visual representations. The image features could be represented as regions of interest (RoIs) or block regions. All the above pretrained models used Fast R-CNN (Girshick 2015) to detect objects and pool RoI features. The purpose of RoI detection is to reduce the complexity of visual information and perform the task of masked region classification with linguistic clues (Su et al. 2019; Li et al. 2020). However, for irrelevant text-image pairs, the non-useful but salient visual features can increase the interference with the linguistic features. Moreover, object recognition categories are limited, and many NEs have no corresponding object class, such as company trademarks and scenic locations.

3) Pretraining tasks. The models were trained on image caption datasets such as the COCO caption dataset (Chen et al. 2015) or Conceptual Captions (Sharma et al. 2018). The pretraining tasks mainly include masked language modeling (MLM), masked region classification (MRC) (Chen et al. 2020; Tan and Bansal 2019; Li et al. 2020; Su et al. 2019), and image-text matching (ITM) (Chen et al. 2020; Li et al. 2020; Lu et al. 2019). The ITM task is a binary classification that defines the pairs in the caption dataset as positives and the pairs generated by replacing the image or text of a paired example with another randomly selected sample as negatives. It assumes that the text-image pairs in the caption datasets are highly related; however, this assumption does not hold for the text-image pairs of tweets.

Visual features are always directly concatenated with linguistic features (Yu and Jiang 2019) or extracted by attention weights in the latest multimodal models, regardless of whether the images contribute to the semantics of the texts, resulting in the failed MNER examples shown in Table 7. Therefore, in this work, we explore a multimodal variant of BERT to perform MNER for tweets with different text-image relations.
The Proposed Approach
In this section, we introduce the text-image Relation propagation-based BERT model (RpBERT) for multimodal NER, which is shown in Fig. 2. We illustrate the RpBERT architecture and then describe its training procedure in detail.
Model Design
Our RpBERT extends vanilla BERT to a multitask framework of text-image relation classification and visual-linguistic learning for MNER. First, similar to most visual-linguistic BERTs, we adapt vanilla BERT to multimodal inputs. The input sequence of RpBERT is designed as follows:

[CLS] w_1 ... w_n [SEP] v_1 ... v_m,    (1)

where [CLS] stands for text-image relation classification, [SEP] stands for the separation between text and image features, T = {w_1, ..., w_n} denotes the sequence of linguistic features, and V = {v_1, ..., v_m} denotes the sequence of visual features.

Figure 2: The RpBERT architecture overview. Two RpBERTs share the same structure and parameters. T = word token embedding + segment embedding + position embedding; V = image block embedding + segment embedding + position embedding. The example input is the tweet "This kitty's face is cuter and more symmetrical than mine will ever be" with its image, which is encoded by ResNet-152; the relevant score from Task 1 (relation classification) gates the visual features used by Task 2 (multimodal NER).

The word token sequence is generated by the BERT tokenizer, which breaks an unknown word into multiple word-piece tokens. Unlike the latest visual-linguistic BERT models (Su et al. 2019; Lu et al. 2019; Li et al. 2020), we represent visual features as block regions instead of RoIs. The visual features are extracted from the image by ResNet (He et al. 2016). The output size of the last convolutional layer in ResNet is 7 × 7 × d_v, where 7 × 7 denotes 49 block regions in an image. The extracted features of the block regions {f_{i,j}}_{i,j=1}^7 are arranged into an image block embedding sequence {b_1 = f_{1,1} W_v, ..., b_49 = f_{7,7} W_v}, where f_{i,j} ∈ R^{1×d_v} and W_v ∈ R^{d_v×d_BERT} matches the embedding size of BERT; d_v = 2048 when working with ResNet-152. Following the practice in BERT, the input embeddings of tokens are the sum of word token embeddings (or image block embeddings), segment embeddings, and position embeddings. The segment embeddings are learned from two types, where A denotes text tokens and B denotes image blocks. The position embeddings of word tokens are learned from the word order in the sentence, but all positions are the same for visual tokens.

The output of the token [CLS] is fed to a fully connected (FC) layer as a binary classifier for Task 1 in Fig. 2, yielding probabilities [π_0, π_1]. The text-image relevant score r is defined as the probability of being positive:

r = π_1.    (2)

We use the relevant score r to construct a visual mask matrix R in Fig. 2:

R = (x_{i,j} = r)_{m × d_BERT}.    (3)

The text-image relation is propagated to RpBERT via R ⊙ V, where ⊙ is element-wise multiplication. For example, if π_1 = 0, then all visual features are discarded. Finally, e_k^RpBERT, the outputs of the tokens T with visual clues, are fed to the NER model for Task 2.
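The block-embedding construction and the relation mask above can be sketched in a few lines. This is a minimal NumPy sketch: random values stand in for the real ResNet-152 feature map, and the variable names (`fmap`, `W_v`, `V_gated`) are illustrative, not the authors' code.

```python
import numpy as np

d_bert, d_v, m = 768, 2048, 49  # BERT-Base size, ResNet-152 channels, 7x7 blocks

rng = np.random.default_rng(0)
# Stand-in for the 7 x 7 x 2048 output of ResNet-152's last conv layer
# (a real implementation would run the image through the network).
fmap = rng.normal(size=(7, 7, d_v))

W_v = rng.normal(scale=0.02, size=(d_v, d_bert))  # projection to BERT's embedding size
V = fmap.reshape(m, d_v) @ W_v                    # image block embeddings b_1..b_49

# Relevant score r from the [CLS] relation classifier (Eq. 2);
# R broadcasts r over an m x d_bert mask (Eq. 3).
r = 0.9
R = np.full((m, d_bert), r)
V_gated = R * V                                   # R ⊙ V: relation propagation
```

Because every entry of R equals r, the mask simply scales the whole visual token sequence by the relevant score before it re-enters RpBERT.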
Relation Propagation

We investigate two kinds of relation propagation, soft and hard, through different probability gates G:

• Soft relation propagation: the output of G can be viewed as a continuous distribution. The visual features are filtered according to the strength of the text-image relation. The gate G is defined as a softmax function:

G_s = softmax(x).    (4)

• Hard relation propagation: the output of G can be viewed as a categorical distribution. The visual features are either selected or discarded based on 0 or 1. The gate G is defined as follows:

G_h1 = [softmax(x) > 0.5],    (5)

where [·] is the Iverson bracket indicator function, which takes a value of 1 when its argument is true and 0 otherwise. As G_h1 is not differentiable, an empirical way is to use a straight-through estimator (Bengio, Léonard, and Courville 2013) to propagate gradients back through the network. Alternatively, Jang et al. (2017) proposed Gumbel-Softmax to create a continuous approximation to the categorical distribution. Inspired by this, we define the gate G as Gumbel-Softmax in Eq. (6) for hard relation propagation:

G_h2 = softmax((x + g) / τ),    (6)

where g is noise sampled from the Gumbel distribution and τ is a temperature parameter. As the temperature approaches 0, samples from the Gumbel-Softmax distribution become one-hot, and the Gumbel-Softmax distribution becomes identical to the categorical distribution. In the training stage, the temperature τ is annealed from 1 to 0.1.

In the experimental results, we compare the performances of G_s, G_h1, and G_h2 in Table 4.
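The three gates of Eqs. (4)-(6) can be sketched numerically. This is a forward-only NumPy sketch with illustrative logits; the straight-through backward pass for the Iverson-bracket gate is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([0.3, 1.1])  # illustrative logits from the FC head on [CLS]

# Soft gate G_s (Eq. 4): r is the continuous probability of relevance.
r_soft = softmax(x)[1]

# Hard gate G_h1 (Eq. 5): Iverson bracket, r in {0, 1}; training would need
# a straight-through estimator, omitted in this forward-only sketch.
r_hard = float(softmax(x)[1] > 0.5)

# Hard gate G_h2 (Eq. 6): Gumbel-Softmax; samples approach one-hot as the
# temperature tau anneals from 1 toward 0.1.
def gumbel_softmax(logits, tau):
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    return softmax((logits + g) / tau)

r_warm = gumbel_softmax(x, tau=1.0)[1]  # smooth early in training
r_cold = gumbel_softmax(x, tau=0.1)[1]  # near 0/1 late in training
```

With a warm temperature the Gumbel-Softmax gate keeps gradients flowing through both classes; as τ shrinks, the sampled gate behaves like the discrete G_h1 while remaining differentiable.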
Multitask Training for MNER

In this section, we present how to train RpBERT for MNER. The training procedure involves multitask learning of text-image relation classification and MNER, represented by the solid red arrows in Fig. 2. The two tasks are described in detail as follows:

Task 1: We employ the "Image Task" splits of the TRC dataset (Vempala and Preoţiuc-Pietro 2019) for text-image relation classification. This classification attempts to identify whether the image's content contributes additional information beyond the text. The types of text-image relations and the statistics of the TRC dataset are shown in Table 1.

Let D = {a^(i)}_{i=1}^N = {<text^(i), image^(i)>}_{i=1}^N be a set of text-image pairs for TRC training. The loss L_1 of binary relation classification is calculated by cross entropy:

L_1 = − Σ_{i=1}^N log(p(a^(i))),    (7)

where p(a^(i)) is the probability of correct classification, calculated by softmax.

Task 2: In this stage, we use the mask matrix R to control the additive visual clues. The input sequence of RpBERT is [CLS] T [SEP] R ⊙ V. We denote the output of T as e_k^RpBERT. To perform NER, we use biLSTM-CRF (Lample et al. 2016) as the baseline NER model. The biLSTM-CRF model consists of a bidirectional LSTM and conditional random fields (CRF) (Lafferty, McCallum, and Pereira 2001). The input e_k of biLSTM-CRF is a concatenation of word and character embeddings (Lample et al. 2016). The CRF uses the biLSTM hidden vectors of each token to tag the sequence with entity labels. To evaluate the RpBERT model, we concatenate e_k^RpBERT into the input of biLSTM, i.e., [e_k; e_k^RpBERT]. For out-of-vocabulary (OOV) words, we average the outputs of BERT-tokenized subwords, which not only generates an approximate vector but also aligns the broken words with the input embeddings of biLSTM-CRF.

In biLSTM-CRF, named entity tagging is trained on a standard CRF model. We feed the hidden vectors H = {h_t = [h_t^→LSTM; h_t^←LSTM]}_{t=1}^n of biLSTM to the CRF model. For a sequence of tags y = {y_1, ..., y_n}, the probability of the label sequence y is defined as follows (Lample et al. 2016):

p(y | x) = e^{s(x,y)} / Σ_{y' ∈ Y} e^{s(x,y')},    (8)

where Y is the set of all possible tag sequences for the sentence x and s(x, y) are feature functions modeling transitions and emissions; details can be found in (Lample et al. 2016). The objective of Task 2 is to minimize the negative log-likelihood over the MNER training data D = {(x^(i), y^(i))}_{i=1}^M:

L_2 = − Σ_{i=1}^M log(p(y^(i) | x^(i))).    (9)

Combining Task 1 and Task 2, the overall training procedure is given in Algorithm 1.
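As a concrete check of Eqs. (8)-(9), the CRF sequence probability can be computed by brute force on a toy example. This sketch uses random scores in place of real biLSTM features and enumerates all tag sequences, which is only feasible at toy scale; a real CRF computes the partition function with the forward algorithm.

```python
import numpy as np
from itertools import product

# Toy CRF for Eq. (8): K tags, sequence length T. s(x, y) sums emission
# scores (from the biLSTM hidden vectors) and tag-transition scores.
rng = np.random.default_rng(0)
K, T = 3, 4
emit = rng.normal(size=(T, K))    # emission score of tag k at position t
trans = rng.normal(size=(K, K))   # transition score between adjacent tags

def score(y):
    return (sum(emit[t, y[t]] for t in range(T))
            + sum(trans[y[t], y[t + 1]] for t in range(T - 1)))

# Brute-force partition function over all K**T tag sequences
# (a real CRF uses the forward algorithm instead).
Z = sum(np.exp(score(y)) for y in product(range(K), repeat=T))

y = (0, 2, 1, 1)                  # one candidate tag sequence
p = np.exp(score(y)) / Z          # p(y | x), Eq. (8)
nll = -np.log(p)                  # per-example Task 2 loss term, Eq. (9)
```

Summing p over all 3^4 = 81 sequences gives exactly 1, which is the normalization that Eq. (8) encodes.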
Algorithm 1: Multitask training procedure of RpBERT for MNER.
Input:
The TRC dataset and MNER dataset.
Output: θ_RpBERT, θ_ResNet, θ_FCs, θ_biLSTM, and θ_CRF.
for all epochs do
  for all batches in the TRC dataset do
    Forward text-image pairs through RpBERT;
    Compute loss L_1 by Eq. (7);
    Update θ_FCs and finetune θ_RpBERT and θ_ResNet using ∇L_1;
  end for
  for all batches in the MNER dataset do
    Forward text-image pairs through RpBERT;
    Compute the visual mask matrix R;
    Forward text-image pairs with relation propagation through RpBERT and biLSTM-CRF;
    Compute loss L_2 by Eq. (9);
    Update θ_biLSTM and θ_CRF and finetune θ_RpBERT and θ_ResNet using ∇L_2;
  end for
end for

θ_RpBERT, θ_ResNet, θ_FCs, θ_biLSTM, and θ_CRF represent the parameters of RpBERT, ResNet, FCs, biLSTM, and CRF, respectively. In each epoch, the procedure first performs Task 1 and then performs Task 2.
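Algorithm 1 can be condensed into Python-style pseudocode. This is an outline, not runnable code: `rpbert`, `resnet`, `fc`, `bilstm_crf`, the data loaders, and the helpers `cross_entropy`, `crf_nll`, `update`, and `softmax` are all placeholders for the components named above.

```python
def train_epoch(rpbert, resnet, fc, bilstm_crf, trc_loader, mner_loader):
    # Task 1: text-image relation classification on the TRC dataset.
    for text, image, label in trc_loader:
        logits = fc(rpbert(text, resnet(image)).cls)
        loss1 = cross_entropy(logits, label)              # Eq. (7)
        update(loss1, params=[fc, rpbert, resnet])        # FCs updated; backbones finetuned

    # Task 2: MNER with relation propagation.
    for text, image, tags in mner_loader:
        r = softmax(fc(rpbert(text, resnet(image)).cls))[1]  # relevant score, Eq. (2)
        e = rpbert(text, r * resnet(image))               # R ⊙ V, Eq. (3)
        loss2 = crf_nll(bilstm_crf(e), tags)              # Eq. (9)
        update(loss2, params=[bilstm_crf, rpbert, resnet])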
Experiments
Datasets
In the experiments, we use three datasets to evaluate performance: the TRC dataset and two MNER datasets, from Fudan University and Snap Research, respectively. The detailed descriptions are as follows:

• TRC dataset of Bloomberg LP (Vempala and Preoţiuc-Pietro 2019):
In this dataset, the authors annotated tweets into four types of text-image relation, as shown in Table 1. "Image adds to the tweet meaning" is centered on the role of the image in the semantics of the tweet, while "Text is presented in image" focuses on the text's role. In the RpBERT model, we treat the text-image relation for the image's role as a binary classification task between R_1 ∪ R_2 and R_3 ∪ R_4. We follow the same 8:2 train/test split as in (Vempala and Preoţiuc-Pietro 2019). We use this dataset for learning Task 1.

                                   R_1    R_2    R_3    R_4
Image adds to the tweet meaning     √      √      ×      ×
Text is presented in image          √      ×      √      ×
Percentage (%)                    18.5   25.6   21.9   33.8

Table 1: Four relation types in the TRC dataset.
• MNER dataset of Fudan University (Zhang et al. 2018):
The authors sampled tweets with images collected through Twitter's API. In this dataset, the NE types are Person, Location, Organization, and Misc. The authors labeled 8,257 tweet texts using the BIO2 tagging scheme and used a 4,000/1,000/3,257 train/dev/test split.

• MNER dataset of Snap Research (Lu et al. 2018):
The authors collected the data from Twitter and Snapchat, but the Snapchat data are not available for public use. The NE types are Person, Location, Organization, and Misc. Each data instance contains one sentence and one image. The authors labeled 6,882 tweet texts using the BIO tagging scheme and used a 4,817/1,032/1,033 train/dev/test split.
Settings
We use the 300-dimensional fastText Crawl (Mikolov et al. 2018) word vectors in biLSTM-CRF. All images are reshaped to a size of 224 × 224 to match the input size of ResNet. We use ResNet-152 to extract visual features and finetune it with a learning rate of 1e-6. The FC layers in our model are a linear neural network followed by ReLU activation. The architecture of RpBERT is the same as that of BERT-Base, and we load the pretrained weights of the BERT-base-uncased model to initialize our RpBERT model. We train the model using the Adam (Kingma and Ba 2014) optimizer with default settings. Table 2 shows the hyperparameter values of the RpBERT and biLSTM-CRF models. We use the F1 score as the evaluation metric for TRC and MNER.

Hyperparameter                                    Value
LSTM hidden state size                            256
  +RpBERT                                         1024
LSTM layers                                       2
mini-batch size                                   8
char embedding dimension                          25
optimizer                                         Adam
learning rate                                     1e-4
learning rate for finetuning RpBERT and ResNet    1e-6
dropout rate                                      0.5
Table 2: Hyperparameters of the RpBERT and biLSTM-CRF models.
Result of TRC
Table 3 shows the performance of RpBERT on text-image relation classification on the test set of the TRC data. In terms of network structure, Lu et al. (2018) represented the multimodal feature as a concatenation of linguistic features from an LSTM and visual features from InceptionNet (Szegedy et al. 2015). The results show that the BERT-based visual-linguistic model significantly outperforms that of Lu et al. (2018): the F1 score of RpBERT on the TRC test set increases by 7.1% compared to Lu et al. (2018).
Results of MNER
            Lu et al. (2018)    RpBERT
F1 score         81.0          88.1 (+7.1)

Table 3: Results of the text-image relation classification in F1 score (%).

                                 Fudan Univ.   Snap Res.
biLSTM-CRF                         (+0.0)        (+0.0)
biLSTM-CRF w/ image at t = 0       (+0.1)        (+0.5)
                                   (+0.8)        (+0.6)
biLSTM-CRF + BERT                  (+0.0)        (+0.0)
biLSTM-CRF + RpBERT G_h1           (+2.2)        (+1.3)
biLSTM-CRF + RpBERT G_h2           (+2.6)        (+1.5)
biLSTM-CRF + RpBERT G_s            (+2.8)        (+2.3)

Table 4: Comparison of the performance improvements from visual clues in F1 score (%).

Table 4 illustrates the performance improvements from visual clues, such as biLSTM-CRF vs. biLSTM-CRF with image, and BERT vs. RpBERT. The inputs of "biLSTM-CRF" and "biLSTM-CRF + BERT" are text only, while those of the other models are text-image pairs. "biLSTM-CRF w/ image at t = 0" means that the image feature is placed at the beginning of the LSTM before the word sequence, similar to the model in (Vinyals et al. 2015). "biLSTM-CRF + RpBERT" means that the contextual embeddings e_k^RpBERT with visual clues are concatenated into the input of biLSTM-CRF, as clarified in the section "Multitask Training for MNER". The results show that the best "+ RpBERT G_s" achieves increases of 4.5% and 7.3% compared to "biLSTM-CRF" on the Fudan Univ. and Snap Res. datasets, respectively. In terms of the role of visual features, "+ RpBERT G_s" achieves an increase of approximately 2.5% compared to "+ BERT", which is larger than the gains of the biLSTM-CRF-based multimodal models, such as Zhang et al. (2018) and Lu et al. (2018), over biLSTM-CRF. This indicates that the RpBERT model can better leverage visual features to enhance the context of tweets.

In Table 5, we compare performance with the state-of-the-art method (Yu et al. 2020) and visual-linguistic pretrained models whose code is available, such as VL-BERT (Su et al. 2019), ViLBERT (Lu et al. 2019), and UNITER (Chen et al. 2020). Similar to e_k^RpBERT in RpBERT, we take the contextual embeddings of the word sequence from the visual-linguistic models and concatenate them with the token embeddings e_k as the input embedding of biLSTM-CRF. For example, "biLSTM-CRF + VL-BERT" means that the output of the word sequence in VL-BERT is concatenated into the input of biLSTM-CRF, i.e., [e_k; e_k^VL-BERT]. The results show that RpBERT G_s outperforms all pretrained models.
Additionally, we test RpBERT using the structure of BERT-Large, which has 24 layers and 16 attention heads. "biLSTM-CRF + RpBERT-Large G_s" achieves state-of-the-art results on the MNER datasets and outperforms the current best results (Yu et al. 2020) by 1.5% on the Fudan Univ. dataset and 2.5% on the Snap Res. dataset.

                               Fudan Univ.                              Snap Res.
                     Image adds  Image doesn't add  Overall  Image adds  Image doesn't add  Overall
biLSTM-CRF + RpBERT G_s  (-0.5)       (-3.1)        (-1.8)     (-0.7)        (-2.3)         (-1.2)

Table 6: Performance comparison in F1 score (%) when the relation propagation (Rp) is ablated.
                                 Fudan Univ.   Snap Res.
Arshad et al. (2019)                 72.9          -
Yu et al. (2020)                     73.4         85.3
biLSTM-CRF + VL-BERT                 72.4         86.0
biLSTM-CRF + ViLBERT                 72.0         85.7
biLSTM-CRF + UNITER                  72.7         86.1
biLSTM-CRF + RpBERT G_s
biLSTM-CRF + RpBERT-Large G_s        74.9         87.8

Table 5: Performance comparison with other models in F1 score (%).
Ablation Study
In this section, we report the results of ablating the relation propagation in RpBERT, or equivalently performing only Task 2; the results are shown in Table 6. Fig. 3 shows the numerical distribution between r and S_TV, where S_TV is the average sum of visual attention and is defined as follows:

S_TV = (1 / LH) Σ_{l=1}^L Σ_{h=1}^H Σ_{i=1}^n Σ_{j=1}^m Att^(l,h)(w_i, v_j),    (10)

where Att^(l,h)(w_i, v_j) is the attention between the i-th word and the j-th image block on the h-th head of the l-th layer in RpBERT. The samples are from the test set of the Snap Res. dataset. In Fig. 3(a), we find that the distribution of S_TV of RpBERT w/o Rp is close to a horizontal line and is unrelated to the relevant score r. In Fig. 3(b), most S_TV values of RpBERT decrease on irrelevant text-image pairs (r < 0.5) and increase on relevant text-image pairs (r > 0.5) compared to those of RpBERT w/o Rp. Quantitatively, the mean of S_TV decreases by 20% from 0.041 to 0.034 on irrelevant text-image pairs, while it increases from 0.042 to 0.102 on relevant text-image pairs. In general, after using relation propagation, the trend is towards leveraging more visual cues for stronger text-image relations.

(a) RpBERT w/o Rp  (b) RpBERT G_s

Figure 3: The numerical distribution between r and S_TV.
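The statistic of Eq. (10) is a plain reduction over the stacked attention maps. This is a toy NumPy sketch with random attention values; it also computes the per-block aggregate (one value per 7 × 7 block) that the case study visualizes as heat maps.

```python
import numpy as np

# Toy attention tensor att[l, h, i, j]: attention from word i to image block j
# on head h of layer l (L = H = 12 for BERT-Base; n words, m = 49 blocks).
rng = np.random.default_rng(0)
L, H, n, m = 12, 12, 6, 49
att = rng.uniform(size=(L, H, n, m))

# Eq. (10): S_TV averages the total word-to-image attention over layers/heads.
S_TV = att.sum(axis=(2, 3)).mean()

# Per-block aggregate, one value per 7x7 block: sum over words,
# average over layers and heads (the heat-map quantity).
a_v = att.sum(axis=2).mean(axis=(0, 1))
```

By construction, the 49 per-block values sum back to S_TV, so the heat maps decompose the same attention mass that Eq. (10) averages.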
We illustrate five failure examples mentioned in (Lu et al. 2018; Arshad et al. 2019; Yu et al. 2020) in Table 7. The common reason for these failed examples is inappropriate visual attention features. The table shows the relevant score r and the overall image attentions of RpBERT and RpBERT w/o Rp. The visual attention of image block j across all words, heads, and layers is defined as follows:

a_j^v = (1 / LH) Σ_{l=1}^L Σ_{h=1}^H Σ_{i=1}^n Att^(l,h)(w_i, v_j).    (11)

We visualize the overall image attentions {a_j^v}_{j=1}^49 by heat maps. The NER results of "+ RpBERT w/o Rp", "+ RpBERT G_s", and the previous works are also presented for comparison.

Examples 1 and 2 are from the Snap Res. dataset, and Examples 3, 4, and 5 are from the Fudan Univ. dataset. The NER results of all examples obtained by RpBERT are correct. In Example 1, RpBERT performs correctly, and the visual attentions have no negative effects on the NER results. In Example 2, the visual attentions focus on the ground and result in tagging "Cleveland" with the wrong label "LOC". In Example 3, "Reddit" is misidentified as "ORG" by the visual attentions. In Example 5, "Siri" is wrongly identified as "PER" because of the visual attention to the human face. In Examples 2, 3, and 5, the text-image pairs are recognized as irrelevant since the values of r are small. With text-image relation propagation, much less visual feature is weighted into the linguistic features in RpBERT, and the NER results are correct. In Example 4, the text and image are related, i.e., the relevant score r is high. The persons draw significant attention in (Arshad et al. 2019), resulting in the wrong label "PER" for "Mount Sherman". RpBERT w/o Rp extends some visual attention to the mountain scene, while RpBERT pays much more visual attention to the scenery, such as the sky and mountain, and thus strengthens the understanding of the whole picture and yields the correct labels for "PSD Lesher" and "Mount Sherman".

Table 7 (heat maps of {a_j^v}_{j=1}^49 for RpBERT w/o Rp and RpBERT G_s omitted here) lists the predictions:

+ RpBERT w/o Rp:
1. [ORG SBU] baseball shots from Saturday.
2. Nice image of [PER Kevin Love] and [PER Kyle Korver] during 1st half [LOC Cleveland].
3. [ORG Reddit] needs to stop pretending racism is valuable debate.
4. [MISC PSD] Lesher teachers take school spirit to top of 14ner [LOC Mount Sherman].
5. Ask [PER Siri] what 0 divided by 0 is and watch her put you in your place.

+ RpBERT G_s:
1. Looking forward to editing some [ORG SBU] baseball shots from Saturday.
2. Nice image of [PER Kevin Love] and [PER Kyle Korver] during 1st half [ORG Cleveland].
3. [MISC Reddit] needs to stop pretending racism is valuable debate.
4. [ORG PSD Lesher] teachers take school spirit to top of 14ner [LOC Mount Sherman].
5. Ask [MISC Siri] what 0 divided by 0 is and watch her put you in your place.

Previous work:
1. Looking forward to editing some SBU baseball shots from Saturday. (Lu et al. 2018)
2. Nice image of [PER Kevin Love] and [PER Kyle Korver] during 1st half [LOC Cleveland]. (Lu et al. 2018)
3. [ORG Reddit] needs to stop pretending racism is valuable debate. (Arshad et al. 2019)
4. [ORG PSD Lesher] teachers take school spirit to top of 14ner [PER Mount Sherman]. (Arshad et al. 2019)
5. Ask [PER Siri] what 0 divided by 0 is and watch her put you in your place. (Yu et al. 2020)

Table 7: Five failed examples from the previous works, tested by "+ RpBERT G_s" and "+ RpBERT w/o Rp". Blue and black labels are correct; red ones are wrong.

Conclusion
This paper concerns the visual attention problem raised by text-unrelated images in tweets for multimodal learning. We propose a relation propagation-based multimodal model built on text-image relation inference. The model is trained by the multiple tasks of text-image relation classification and downstream NER. In the experiments, the ablation study quantitatively evaluates the role of text-image relation propagation. The heat map visualization and numerical distribution of visual attention show that RpBERT can adaptively leverage visual information according to the relation between text and image. The failed cases mentioned in other papers are effectively resolved by the RpBERT model. Our model achieves the best F1 scores in both TRC and MNER, i.e., 88.1% on the TRC dataset, 74.9% on the Fudan Univ. dataset, and 87.8% on the Snap Res. dataset.

Acknowledgements
This work was supported by the National Innovation and Entrepreneurship Training Program for College Students under Grant 202013021005 and in part by the National Natural Science Foundation of China (NSFC) under Grant 62072402.
References
Arshad, O.; Gallo, I.; Nawaz, S.; and Calefati, A. 2019. Aiding Intra-Text Representations with Visual Context for Multimodal Named Entity Recognition. In International Conference on Document Analysis and Recognition (ICDAR), 337-342.

Bengio, Y.; Léonard, N.; and Courville, A. C. 2013. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. arXiv preprint arXiv:1308.3432.

Chen, X.; Fang, H.; Lin, T.-Y.; Vedantam, R.; Gupta, S.; Dollár, P.; and Zitnick, C. L. 2015. Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv preprint arXiv:1504.00325.

Chen, Y.-C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; and Liu, J. 2020. UNITER: UNiversal Image-TExt Representation Learning. In Computer Vision - ECCV 2020, 104-120.

Girshick, R. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, 1440-1448.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778.

Hosseini, H. 2019. Implicit Entity Recognition, Classification and Linking in Tweets. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 1448-1448.

Hu, Y.; Zheng, L.; Yang, Y.; and Huang, Y. 2017. Twitter100k: A Real-World Dataset for Weakly Supervised Cross-Media Retrieval. IEEE Transactions on Multimedia.

Jang, E.; Gu, S.; and Poole, B. 2017. Categorical Reparameterization with Gumbel-Softmax. In International Conference on Learning Representations.

Kingma, D. P.; and Ba, J. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.

Lafferty, J. D.; McCallum, A.; and Pereira, F. C. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning, 282-289.

Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; and Dyer, C. 2016. Neural Architectures for Named Entity Recognition. In Proceedings of the 2016 Conference of NAACL-HLT, 260-270. Association for Computational Linguistics.

Li, G.; Duan, N.; Fang, Y.; Gong, M.; Jiang, D.; and Zhou, M. 2020. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training. In AAAI, 11336-11344.

Li, L. H.; Yatskar, M.; Yin, D.; Hsieh, C.-J.; and Chang, K.-W. 2019. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv preprint arXiv:1908.03557.

Lu, D.; Neves, L.; Carvalho, V.; Zhang, N.; and Ji, H. 2018. Visual Attention Model for Name Tagging in Multimodal Social Media. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1990-1999.

Lu, J.; Batra, D.; Parikh, D.; and Lee, S. 2019. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Advances in Neural Information Processing Systems, 13-23.

Mikolov, T.; Grave, E.; Bojanowski, P.; Puhrsch, C.; and Joulin, A. 2018. Advances in Pre-Training Distributed Word Representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Moon, S.; Neves, L.; and Carvalho, V. 2018. Multimodal Named Entity Recognition for Short Social Media Posts. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 852-860.

Sharma, P.; Ding, N.; Goodman, S.; and Soricut, R. 2018. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2556-2565. Melbourne, Australia: Association for Computational Linguistics.

Simonyan, K.; and Zisserman, A. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556.

Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; and Dai, J. 2019. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In International Conference on Learning Representations.

Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1-9.

Tan, H.; and Bansal, M. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 5103-5114.

Vempala, A.; and Preoţiuc-Pietro, D. 2019. Categorizing and Inferring the Relationship between the Text and Image of Twitter Posts. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2830-2840.

Vinyals, O.; Toshev, A.; Bengio, S.; and Erhan, D. 2015. Show and Tell: A Neural Image Caption Generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3156-3164.

Wang, R.; Deyu, Z.; and He, Y. 2019. Open Event Extraction from Online Text using a Generative Adversarial Network. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 282-291.

Yu, J.; and Jiang, J. 2019. Adapting BERT for Target-Oriented Multimodal Sentiment Classification. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, 5408-5414. AAAI Press.

Yu, J.; Jiang, J.; Yang, L.; and Xia, R. 2020. Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 3342-3352.

Zhang, Q.; Fu, J.; Liu, X.; and Huang, X. 2018. Adaptive Co-attention Network for Named Entity Recognition in Tweets. In AAAI.