Cross-media Structured Common Space for Multimedia Event Extraction
Manling Li*, Alireza Zareian*, Qi Zeng, Spencer Whitehead, Di Lu, Heng Ji, Shih-Fu Chang

University of Illinois at Urbana-Champaign, Columbia University, Dataminr
{manling2,hengji}@illinois.edu, {az2407,sc250}@columbia.edu
http://blender.cs.illinois.edu/software/m2e2

Abstract
We introduce a new task, MultiMedia Event Extraction (M2E2), which aims to extract events and their arguments from multimedia documents. We develop the first benchmark and collect a dataset of 245 multimedia news articles with extensively annotated events and arguments. We propose a novel method, Weakly Aligned Structured Embedding (WASE), that encodes structured representations of semantic information from textual and visual data into a common embedding space. The structures are aligned across modalities by employing a weakly supervised training strategy, which enables exploiting available resources without explicit cross-media annotation. Compared to uni-modal state-of-the-art methods, our approach achieves 4.0% and 9.8% absolute F-score gains on text event argument role labeling and visual event extraction. Compared to state-of-the-art multimedia unstructured representations, we achieve 8.3% and 5.0% absolute F-score gains on multimedia event extraction and argument role labeling, respectively. By utilizing images, we extract 21.4% more event mentions than traditional text-only methods.
1 Introduction

Traditional event extraction methods target a single modality, such as text (Wadden et al., 2019), images (Yatskar et al., 2016) or videos (Ye et al., 2015; Caba Heilbron et al., 2015; Soomro et al., 2012). However, the practice of contemporary journalism (Stephens, 1998) distributes news via multimedia. By randomly sampling 100 multimedia news articles from the Voice of America (VOA), we find that 33% of images in the articles contain visual objects that serve as event arguments and are not mentioned in the text.

* These authors contributed equally to this work. Our data and code are available at http://blender.cs.illinois.edu/software/m2e2
Figure 1: An example of Multimedia Event Extraction. An event mention and some event arguments (Agent and Person) are extracted from text, while the Vehicle arguments can only be extracted from the image.
Take Figure 1 as an example: we can extract the Agent and Person arguments of the Movement.Transport event from text, but can extract the Vehicle argument only from the image. Nevertheless, event extraction is independently studied in Computer Vision (CV) and Natural Language Processing (NLP), with major differences in task definition, data domain, methodology, and terminology. Motivated by the complementary and holistic nature of multimedia data, we propose MultiMedia Event Extraction (M2E2), a new task that aims to jointly extract events and arguments from multiple modalities. We construct the first benchmark and evaluation dataset for this task, which consists of 245 fully annotated news articles.

We propose the first method, Weakly Aligned Structured Embedding (WASE), for extracting events and arguments from multiple modalities. Complex event structures have not been covered by existing multimedia representation methods (Wu et al., 2019b; Faghri et al., 2018; Karpathy and Fei-Fei, 2015), so we propose to learn a structured multimedia embedding space. More specifically, given a multimedia document, we represent each image or sentence as a graph, where each node represents an event or entity and each edge represents an argument role. The node and edge embeddings are represented in a multimedia common semantic space, as they are trained to resolve event co-reference across modalities and to match images with relevant sentences. This enables us to jointly classify events and argument roles from both modalities. A major challenge is the lack of multimedia event argument annotations, which are costly to obtain due to the annotation complexity. Therefore, we propose a weakly supervised framework, which takes advantage of annotated uni-modal corpora to separately learn visual and textual event extraction, and uses an image-caption dataset to align the modalities.

We evaluate WASE on the new task of M2E2. Compared to state-of-the-art uni-modal methods and multimedia flat representations, our method achieves significantly better results on both event extraction and argument role labeling in all settings. Moreover, it extracts 21.4% more event mentions than text-only baselines. The training and evaluation are done on heterogeneous data sets from multiple sources, domains and data modalities, demonstrating the scalability and transferability of the proposed model. In summary, this paper makes the following contributions:

• We propose a new task, MultiMedia Event Extraction, and construct the first annotated news dataset as a benchmark to support deep analysis of cross-media events.
• We develop a weakly supervised training framework, which utilizes existing single-modal annotated corpora, and enables joint inference without cross-modal annotation.
• Our proposed method, WASE, is the first to leverage structured representations and graph-based neural networks for multimedia common space embedding.
2.1 Task Definition

Each input document consists of a set of images M = {m_1, m_2, ...} and a set of sentences S = {s_1, s_2, ...}. Each sentence s can be represented as a sequence of tokens s = (w_1, w_2, ...), where w_i is a token from the document vocabulary W. The input also includes a set of entities T = {t_1, t_2, ...} extracted from the document text. An entity is an individually unique object in the real world, such as a person, an organization, a facility, a location, a geopolitical entity, a weapon, or a vehicle. The objective of M2E2 is twofold:

Event Extraction: Given a multimedia document, extract a set of event mentions, where each event mention e has a type y_e and is grounded on a text trigger word w or an image m or both, i.e., e = (y_e, {w, m}). Note that for an event, w and m can both exist, which means the visual event mention and the textual event mention refer to the same event. For example, in Figure 1, deploy indicates the same Movement.Transport event as the image. We consider the event e a text-only event if it only has a textual mention w, an image-only event if it only has a visual mention m, and a multimedia event if both w and m exist.

Argument Extraction: The second task is to extract a set of arguments of event mention e. Each argument a has an argument role type y_a, and is grounded on a text entity t or an image object o (represented as a bounding box), or both: a = (y_a, {t, o}). The arguments of visual and textual event mentions are merged if they refer to the same real-world event, as shown in Figure 1.
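To make the two grounding options concrete, the sketch below renders the task output as Python data structures; the class and field names (and the entity strings in the Figure 1 example) are our own illustration, not anything released with the paper.

```python
# A minimal sketch of the M2E2 output structures defined above.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Argument:
    role: str                                        # argument role type y_a, e.g. "Agent"
    entity: Optional[str] = None                     # grounding text entity t, if any
    box: Optional[Tuple[int, int, int, int]] = None  # grounding image object o as (x1, y1, x2, y2)

@dataclass
class EventMention:
    event_type: str                                  # event type y_e, e.g. "Movement.Transport"
    trigger: Optional[str] = None                    # text trigger word w, if any
    image_id: Optional[str] = None                   # grounding image m, if any
    arguments: List[Argument] = field(default_factory=list)

    @property
    def modality(self) -> str:
        # text-only, image-only, or multimedia, exactly as categorized above
        if self.trigger and self.image_id:
            return "multimedia"
        return "text-only" if self.trigger else "image-only"

# The Figure 1 example: trigger "deploy" and the image denote the same event.
# Entity strings and box coordinates below are invented for illustration.
e = EventMention("Movement.Transport", trigger="deploy", image_id="voa_001.jpg",
                 arguments=[Argument("Agent", entity="government forces"),
                            Argument("Vehicle", box=(40, 120, 610, 420))])
assert e.modality == "multimedia"
```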
2.2 M2E2 Dataset

We define multimedia newsworthy event types by exhaustively mapping between the event ontology in the NLP community for the news domain (ACE; https://catalog.ldc.upenn.edu/ldc2006T06) and the event ontology in the CV community for the general domain (imSitu (Yatskar et al., 2016)). They cover the largest event training resources in each community. Table 1 shows the selected complete intersection, which contains 8 ACE types (i.e., 24% of all ACE types), mapped to 98 imSitu types (i.e., 20% of all imSitu types). We expand the ACE event role set by adding visual arguments from imSitu, such as instrument, bolded in Table 1. This set encompasses 52% of ACE events in a news corpus, which indicates that the selected eight types are salient in the news domain. We reuse these existing ontologies because they enable us to train event and argument classifiers for both modalities without requiring joint multimedia event annotation as training data.

Table 1: Event types and argument roles in M2E2, with expanded ones in bold. Numbers in parentheses represent the counts of textual and visual events/arguments.

We collect 108,693 multimedia news articles from the Voice of America (VOA) website [...]; (3) Diversity: articles that balance the event type distribution regardless of true frequency. The data statistics are shown in Table 2. Among all of these events, 192 textual event mentions and 203 visual event mentions can be aligned as 309 cross-media event mention pairs. The dataset can be divided into 1,105 text-only event mentions, 188 image-only event mentions, and 395 multimedia event mentions.

Source             Event Mention      Argument Role
sentence   image   textual   visual   textual   visual
6,167      1,014   1,297     391      1,965     1,429
Table 2: M2E2 data statistics.

We follow the ACE event annotation guidelines (Walker et al., 2006) for textual event and argument annotation, and design an annotation guideline for multimedia event annotation (http://blender.cs.illinois.edu/software/m2e2/ACL2020_M2E2_annotation.pdf). One unique challenge in multimedia event annotation is to localize visual arguments in complex scenarios, where images include a crowd of people or a group of objects, and it is hard to delineate each of them using a bounding box. To solve this problem, we define two types of bounding boxes: (1) union bounding box: for each role, we annotate the smallest bounding box covering all constituents; and (2) instance bounding box: for each role, we annotate a set of bounding boxes, where each box is the smallest region that covers an individual participant (e.g., one person in the crowd), following the VOC2011 Annotation Guidelines (http://host.robots.ox.ac.uk/pascal/VOC/voc2011/guidelines.html). Figure 2 shows an example.

Figure 2: Example of bounding boxes.

Eight NLP and CV researchers complete the annotation work with two independent passes and reach an Inter-Annotator Agreement (IAA) of 81.2%. Two expert annotators perform adjudication.
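The union box is fully determined by the instance boxes, as the short sketch below illustrates; the (x1, y1, x2, y2) box convention is our assumption, not part of the annotation guideline.

```python
# Union bounding box: the smallest box covering every instance box of a role.
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed (x1, y1, x2, y2) convention

def union_bounding_box(instance_boxes: List[Box]) -> Box:
    """Smallest bounding box covering all instance boxes annotated for one role."""
    x1 = min(b[0] for b in instance_boxes)
    y1 = min(b[1] for b in instance_boxes)
    x2 = max(b[2] for b in instance_boxes)
    y2 = max(b[3] for b in instance_boxes)
    return (x1, y1, x2, y2)

# e.g., three people in a crowd annotated individually for the same role
people = [(10, 40, 60, 200), (55, 35, 120, 210), (115, 50, 170, 205)]
print(union_bounding_box(people))  # -> (10, 35, 170, 210)
```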
3 Method

As shown in Figure 3, the training phase contains three tasks: text event extraction (Section 3.2), visual situation recognition (Section 3.3), and cross-media alignment (Section 3.4). We learn a cross-media shared encoder, a shared event classifier, and a shared argument classifier. In the testing phase (Section 3.5), given a multimedia news article, we encode the sentences and images into the structured common space, and jointly extract textual and visual events and arguments, followed by cross-modal coreference resolution.

Figure 3: Approach overview. During training (left), we jointly train three tasks to establish a cross-media structured embedding space. During test (right), we jointly extract events and arguments from multimedia articles.

3.2 Text Event Extraction

As shown in Figure 4, we choose Abstract Meaning Representation (AMR) (Banarescu et al., 2013) to represent text because it includes a rich set of 150 fine-grained semantic roles. To encode each text sentence, we run the CAMR parser (Wang et al., 2015b,a, 2016) to generate an AMR graph, based on the named entity recognition and part-of-speech (POS) tagging results from Stanford CoreNLP (Manning et al., 2014). To represent each word w in a sentence s, we concatenate its pre-trained GloVe word embedding (Pennington et al., 2014), POS embedding, entity type embedding and position embedding. We then input the word sequence to a bi-directional long short-term memory (Bi-LSTM) (Graves et al., 2013) network to encode the word order and get the representation of each word w. Given the AMR graph, we apply a Graph Convolutional Network (GCN) (Kipf and Welling, 2016) to encode the graph contextual information following (Liu et al., 2018a):

    w_i^{(k+1)} = f( Σ_{j ∈ N(i)} g_ij^{(k)} ( W_{E(i,j)} w_j^{(k)} + b_{E(i,j)}^{(k)} ) ),    (1)

where N(i) is the set of neighbor nodes of w_i in the AMR graph, E(i,j) is the edge type between w_i and w_j, g_ij is the gate following (Liu et al., 2018a), k represents the GCN layer number, and f is the Sigmoid function. W and b denote parameters of neural layers in this paper. We take the hidden states of the last GCN layer for each word as the common-space representation w^C, where C stands for the common (multimedia) embedding space. For each entity t, we obtain its representation t^C by averaging the embeddings of its tokens.
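For concreteness, here is a minimal PyTorch sketch of one GCN layer in the spirit of Equation 1, with one weight matrix per AMR edge type; the scalar edge gate is a simplification of the gating of Liu et al. (2018a), and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class TypedGCNLayer(nn.Module):
    def __init__(self, dim: int, num_edge_types: int):
        super().__init__()
        # one W_{E(i,j)} and b_{E(i,j)} per AMR edge type
        self.W = nn.Parameter(torch.randn(num_edge_types, dim, dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(num_edge_types, dim))
        self.gate = nn.Linear(dim, 1)  # simplified scalar edge gate g_ij

    def forward(self, w: torch.Tensor, edges) -> torch.Tensor:
        # w: (num_nodes, dim) word representations; edges: list of (i, j, edge_type)
        out = torch.zeros_like(w)
        for i, j, e in edges:
            msg = self.W[e] @ w[j] + self.b[e]      # W_{E(i,j)} w_j + b_{E(i,j)}
            g = torch.sigmoid(self.gate(w[j]))      # gate for this edge
            out[i] = out[i] + g.squeeze() * msg     # sum over neighbors N(i)
        return torch.sigmoid(out)                   # f = sigmoid, giving w^{(k+1)}

layer = TypedGCNLayer(dim=16, num_edge_types=2)     # toy sizes
w_next = layer(torch.randn(3, 16), edges=[(0, 1, 0), (0, 2, 1), (1, 0, 0)])
```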
Event and Argument Classifier: We classify each word w into event types y_e and classify each entity t into argument role y_a:

    P(y_e | w) = exp(W_e w^C + b_e) / Σ_{e'} exp(W_{e'} w^C + b_{e'}),
    P(y_a | t) = exp(W_a [t^C; w^C] + b_a) / Σ_{a'} exp(W_{a'} [t^C; w^C] + b_{a'}).    (2)

We use the BIO tag schema to decide trigger word boundaries, i.e., adding the prefix B- to the type label to mark the beginning of a trigger, I- for inside, and O for none. We take ground truth text entity mentions as input following (Ji and Grishman, 2008) during training, and obtain testing entity mentions using a named entity extractor (Lin et al., 2019).

3.3 Visual Situation Recognition

To obtain image structures similar to AMR graphs, and inspired by situation recognition (Yatskar et al., 2016), we represent each image with a situation graph, that is, a star-shaped graph as shown in Figure 4, where the central node is labeled as a verb v (e.g., destroying), and the neighbor nodes are arguments labeled as {(n, r)}, where n is a noun (e.g., ship) derived from WordNet synsets (Miller, 1995) to indicate the entity type, and r indicates the role (e.g., item) played by the entity in the event, based on FrameNet (Fillmore et al., 2003). We develop two methods to construct situation graphs from images and train them using the imSitu dataset (Yatskar et al., 2016), as follows.

Figure 4: Multimedia structured common space construction. Red pixels stand for the attention heatmap.

(1) Object-based Graph: Similar to extracting entities to get candidate arguments, we employ the most similar task in CV, object detection, and obtain the object bounding boxes detected by a Faster R-CNN (Ren et al., 2015) model trained on Open Images (Kuznetsova et al., 2018) with 600 object types (classes). We employ a VGG-16 CNN (Simonyan and Zisserman, 2014) to extract visual features of an image m and another VGG-16 to encode the bounding boxes {o_i}. Then we apply a Multi-Layer Perceptron (MLP) to predict a verb embedding from m and another MLP to predict a noun embedding for each o_i:

    m̂ = MLP_m(m),   ô_i = MLP_o(o_i).

We compare the predicted verb embedding to all verbs v in the imSitu taxonomy in order to classify the verb, and similarly compare each predicted noun embedding to all imSitu nouns n, which results in probability distributions:

    P(v | m) = exp(m̂ v) / Σ_{v'} exp(m̂ v'),
    P(n | o_i) = exp(ô_i n) / Σ_{n'} exp(ô_i n'),

where v and n are word embeddings initialized with GloVe (Pennington et al., 2014). We use another MLP with one hidden layer followed by Softmax (σ) to classify the role r_i for each object o_i:

    P(r_i | o_i) = σ(MLP_r(ô_i)).

Given verb v* and role-noun (r*_i, n*_i) annotations for an image (from the imSitu corpus), we define the situation loss functions:

    L_v = −log P(v* | m),
    L_r = −log( P(r*_i | o_i) + P(n*_i | o_i) ).

(2) Attention-based Graph: State-of-the-art object detection methods only cover a limited set of object types, such as the 600 types defined in Open Images. Many salient objects such as bomb, stone and stretcher are not covered in these ontologies. Hence, we propose an open-vocabulary alternative to the object-based graph construction model. To this end, we construct a role-driven attention graph, where each argument node is derived by a spatially distributed attention (heatmap) conditioned on a role r. More specifically, we use a VGG-16 CNN to extract a 7×7 convolutional feature map for each image m, which can be regarded as attention keys k_i for the 7×7 local regions. Next, for each role r defined in the situation recognition ontology (e.g., agent), we build an attention query vector q_r by concatenating the role embedding r with the image feature m as context and applying a fully connected layer:

    q_r = W_q [r; m] + b_q.

Then, we compute the dot product of each query with all keys, followed by Softmax, which forms a heatmap h on the image, i.e.,

    h_i = exp(q_r k_i) / Σ_{j ∈ 7×7} exp(q_r k_j).

We use the heatmap to obtain a weighted average of the feature map to represent the argument o_r of each role r in the visual space:

    o_r = Σ_i h_i m_i.

Similar to the object-based model, we embed o_r to ô_r, compare it to the imSitu noun embeddings to define a distribution, and define a classification loss function. The verb embedding m̂, the verb prediction probability P(v | m) and the loss are defined in the same way as in the object-based method.
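A compact sketch of the role-driven attention just described: the query built from the role embedding and the global image feature attends over the convolutional grid, yielding a heatmap and the pooled argument representation o_r. The 7×7 grid and 512-dimensional features are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 512                                   # assumed feature size
W_q = nn.Linear(2 * dim, dim)               # q_r = W_q [r ; m] + b_q

def role_attention(keys: torch.Tensor, m: torch.Tensor, r: torch.Tensor):
    # keys: (49, dim) feature-map cells k_i; m: (dim,) image feature; r: (dim,) role embedding
    q = W_q(torch.cat([r, m]))              # role-conditioned attention query
    h = F.softmax(keys @ q, dim=0)          # heatmap h_i over the 7x7 regions
    o_r = h @ keys                          # o_r = sum_i h_i m_i (weighted average)
    return h.view(7, 7), o_r

heat, o_agent = role_attention(torch.randn(49, dim), torch.randn(dim), torch.randn(dim))
```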
Event and Argument Classifier: We use either the object-based or attention-based formulation and pre-train it on the imSitu dataset (Yatskar et al., 2016). Then we apply a GCN to obtain the structured embedding of each node in the common space, similar to Equation 1. This yields m^C and o^C_i. We use the same classifiers as defined in Equation 2 to classify each visual event and argument using the common space embedding:

    P(y_e | m) = exp(W_e m^C + b_e) / Σ_{e'} exp(W_{e'} m^C + b_{e'}),
    P(y_a | o) = exp(W_a [o^C; m^C] + b_a) / Σ_{a'} exp(W_{a'} [o^C; m^C] + b_{a'}).    (3)

3.4 Cross-Media Alignment

In order to make the event and argument classifiers shared across modalities, the image and text graphs should be encoded into the same space. However, it is extremely costly to obtain parallel text and image event annotation. Hence, we use event and argument annotations in separate modalities (i.e., the ACE and imSitu datasets) to train the classifiers, and simultaneously use VOA news image and caption pairs to align the two modalities. To this end, we learn to embed the nodes of each image graph close to the nodes of the corresponding caption graph, and far from those in irrelevant caption graphs. Since there is no ground truth alignment between the image nodes and caption nodes, we use image and caption pairs for weakly supervised training, to learn a soft alignment from each word to image objects and vice versa:

    α_ij = exp(w^C_i · o^C_j) / Σ_{j'} exp(w^C_i · o^C_{j'}),
    β_ji = exp(w^C_i · o^C_j) / Σ_{i'} exp(w^C_{i'} · o^C_j),

where w_i indicates the i-th word in caption sentence s and o_j represents the j-th object of image m. Then, we compute a weighted average of softly aligned nodes for each node in the other modality, i.e.,

    w'_i = Σ_j α_ij o^C_j,   o'_j = Σ_i β_ji w^C_i.    (4)

We define the alignment cost of the image-caption pair as the Euclidean distance between each node and its aligned representation:

    ⟨s, m⟩ = Σ_i ||w_i − w'_i|| + Σ_j ||o_j − o'_j||.

We use a triplet loss to pull relevant image-caption pairs close while pushing irrelevant ones apart:

    L_c = max(0, ⟨s, m⟩ − ⟨s, m⁻⟩),

where m⁻ is a randomly sampled negative image that does not match s. Note that in order to learn the alignment between the image and the trigger word, we treat the image as a special object when learning cross-media alignment. The common space enables the event and argument classifiers to share weights across modalities and be trained jointly on the ACE and imSitu datasets, by minimizing the following objective functions:

    L_e = − Σ_w log P(y_e | w) − Σ_m log P(y_e | m),
    L_a = − Σ_t log P(y_a | t) − Σ_o log P(y_a | o).

All tasks are jointly optimized:

    L = L_v + L_r + L_e + L_a + L_c.

3.5 Cross-Media Joint Inference

In the test phase, our method takes a multimedia document with sentences S = {s_1, s_2, ...} and images M = {m_1, m_2, ...} as input. We first generate the structured common embedding for each sentence and each image, and then compute pairwise similarities ⟨s, m⟩. We pair each sentence s with the closest image m, and aggregate the features of each word of s with the aligned representation from m by weighted averaging:

    w''_i = (1 − γ) w_i + γ w'_i,    (5)

where γ = exp(−⟨s, m⟩) and w'_i is derived from m using Equation 4.
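The soft alignment and the test-time fusion of Equations 4 and 5 reduce to a few matrix operations, sketched below in PyTorch; tensor shapes are illustrative, and the triplet loss from the text would be applied on top of the returned distance.

```python
import torch
import torch.nn.functional as F

def align_and_fuse(w: torch.Tensor, o: torch.Tensor):
    # w: (num_words, d) common-space word embeddings; o: (num_objects, d) object embeddings
    scores = w @ o.t()                           # w_i^C . o_j^C for every pair
    alpha = F.softmax(scores, dim=1)             # alpha_ij: each word attends over objects
    beta = F.softmax(scores, dim=0).t()          # beta_ji: each object attends over words
    w_aligned = alpha @ o                        # w'_i (Equation 4)
    o_aligned = beta @ w                         # o'_j (Equation 4)
    # alignment cost <s, m>: distance of every node to its aligned counterpart
    dist = (w - w_aligned).norm(dim=1).sum() + (o - o_aligned).norm(dim=1).sum()
    gamma = torch.exp(-dist)                     # test-time fusion weight
    w_fused = (1 - gamma) * w + gamma * w_aligned    # w''_i (Equation 5)
    return w_fused, dist

w_fused, dist = align_and_fuse(torch.randn(12, 256), torch.randn(5, 256))
```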
We use w''_i to classify each word into an event type and to classify each entity into a role with the multimedia classifiers in Equation 2. To this end, we define t''_i similar to w''_i but using t_i and t'_i. Similarly, for each image m we find the closest sentence s, compute the aggregated multimedia features m'' and o''_i, and feed them into the shared classifiers (Equation 3) to predict visual event and argument roles. Finally, we corefer the cross-media events of the same event type if the similarity ⟨s, m⟩ is higher than a threshold.

                 Text-Only Evaluation              Image-Only Evaluation             Multimedia Evaluation
                 Event Mention    Argument Role    Event Mention    Argument Role    Event Mention    Argument Role
Model            P    R    F1     P    R    F1     P    R    F1     P    R    F1     P    R    F1     P    R    F1
JMEE             42.5 58.2 48.7   22.9 28.3 25.3   -    -    -      -    -    -      42.1 34.6 38.1   21.1 12.6 15.8
GAIL             43.4 53.5 47.9   23.6 29.2 26.1   -    -    -      -    -    -      44.0 32.4 37.3   22.7 12.8 16.4
WASE_T           [...]
WASE_I (att)     -    -    -      -    -    -      29.7 61.9 40.1   9.1  10.2 9.6    28.3 23.0 25.4   2.9  6.1  3.8
WASE_I (obj)     -    -    -      -    -    -      28.6 59.2 38.7   13.3 9.8  11.2   26.1 22.4 24.1   4.7  5.0  4.9
VSE-C            33.5 47.8 39.4   16.6 24.7 19.8   30.3 48.9 26.4   5.6  6.1  5.7    33.3 48.2 39.3   11.1 14.9 12.8
Flat (att/obj)   [...]
WASE (att/obj)   [...]

Table 3: Event and argument extraction results (%). We compare three categories of baselines in three evaluation settings. The main contribution of the paper is joint training and joint inference on multimedia data (bottom right).

4 Experiments

4.1 Evaluation Setting

We conduct evaluation on text-only, image-only, and multimedia event mentions in the M2E2 dataset of Section 2.2. We adopt the traditional event extraction measures, i.e., Precision, Recall and F1. For text-only event mentions, we follow (Ji and Grishman, 2008; Li et al., 2013): a textual event mention is correct if its event type and trigger offsets match a reference trigger; and a textual event argument is correct if its event type, offsets, and role label match a reference argument. We make a similar definition for image-only event mentions: a visual event mention is correct if its event type and image match a reference visual event mention; and a visual event argument is correct if its event type, localization, and role label match a reference argument. A visual argument is correctly localized if the Intersection over Union (IoU) of the predicted bounding box with the ground truth bounding box is over 0.5. Finally, we define a multimedia event mention to be correct if its event type and trigger offsets (or the image) match the reference trigger (or the reference image). The arguments of multimedia events are either textual or visual arguments, and are evaluated accordingly. To generate bounding boxes for the attention-based model, we threshold the heatmap at a fixed fraction of its peak value p, then compute the tightest bounding box that encloses all of the thresholded region. Examples are shown in Figure 7 and Figure 8.
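A small sketch of these two evaluation utilities, assuming numpy arrays; since the exact peak-fraction coefficient is elided above, the 0.5 used here is a placeholder.

```python
import numpy as np

def heatmap_to_box(h: np.ndarray, frac: float = 0.5):
    """Tightest box enclosing the region above a fraction of the heatmap's peak."""
    ys, xs = np.where(h >= frac * h.max())       # region above the adaptive threshold
    return xs.min(), ys.min(), xs.max(), ys.max()

def iou(a, b) -> float:
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

pred = heatmap_to_box(np.random.rand(224, 224))
print(iou(pred, (20, 30, 180, 200)) > 0.5)       # localization correctness check
```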
Baselines: The baselines include: (1) Text-only models: We use the state-of-the-art models JMEE (Liu et al., 2018a) and GAIL (Zhang et al., 2019) for comparison. We also evaluate the effectiveness of cross-media joint training by including a version of our model trained only on ACE, denoted as WASE_T. (2) Image-only models: Since we are the first to extract newsworthy events, and the most similar work, situation recognition, cannot localize arguments in images, we use our model trained only on the image corpus as baselines. Our visual branch has two versions, object-based and attention-based, denoted as WASE_I obj and WASE_I att. (3) Multimedia models: To show the effectiveness of structured embedding, we include a baseline that removes the text and image GCNs from our model, denoted as Flat. The Flat baseline ignores edges and treats images and sentences as sets of vectors. We also compare to the state-of-the-art cross-media common representation model, Contrastive Visual Semantic Embedding VSE-C (Shi et al., 2018), by training it the same way as WASE.
Parameter Settings: The common space dimension is [...]. The dimension is [...] for the image position embedding and feature map, and [...] for the word position embedding, entity type embedding, and POS tag embedding. The number of GCN layers is [...].

4.2 Quantitative Performance

As shown in Table 3, our complete methods (WASE_att and WASE_obj) outperform all baselines in the three evaluation settings in terms of F1. The comparison with other multimedia models demonstrates the effectiveness of our model architecture and training strategy. The advantage of structured embedding is shown by the better performance over the Flat baseline. Our model outperforms its text-only and image-only variants on multimedia events, showing the inadequacy of single-modal information for complex news understanding. Furthermore, our model achieves better performance on text-only and image-only events, which demonstrates the effectiveness of the multimedia training framework in transferring knowledge between modalities.

WASE_obj and WASE_att are both superior to the state of the art, and each has its own advantages. WASE_obj predicts more accurate bounding boxes since it is based on a Faster R-CNN pretrained on bounding box annotations, resulting in a higher argument precision, while WASE_att achieves a higher argument recall as it is not limited by the predefined object classes of the Faster R-CNN.

Model            P (%)   R (%)   F1 (%)
rule-based       10.1    100     18.2
VSE              31.2    74.5    44.0
Flat (att/obj)   [...]
WASE (att/obj)   [...]

Table 4: Cross-media event coreference performance.
Furthermore, to evaluate the cross-media event coreference performance, we pair textual and visual event mentions in the same document, and calculate Precision, Recall and F1 against ground truth event mention pairs. (We do not use coreference clustering metrics because we only focus on mention-level cross-media event coreference instead of the full coreference in all documents.) As shown in Table 4, WASE_obj outperforms all multimedia embedding models, as well as the rule-based baseline using event type matching. This demonstrates the effectiveness of our cross-media soft alignment.

4.3 Qualitative Analysis

Our cross-media joint training approach successfully boosts both event extraction and argument role labeling performance. For example, in Figure 5 (a), the text-only model cannot extract the Justice.Arrest event, but the joint model can use the image as background to detect the event type. In Figure 5 (b), the image-only model labels the image as Conflict.Demonstration, but the sentences in the same document help our model not to label it as Conflict.Demonstration. Compared with multimedia flat embedding in Figure 6, WASE can learn structures such as the Artifact being on top of the Vehicle, and that the person in the middle of the Justice.Arrest event is the Entity instead of the Agent.
Figure 5: Image helps textual event extraction, and surrounding sentence helps visual event extraction. (a) "Iraqi security forces search [Justice.Arrest] a civilian in the city of Mosul." (b) "People celebrate Supreme Court ruling on Same Sex Marriage in front of the Supreme Court in Washington."
Figure 6: Comparison with multimedia flat embedding.
Remaining Challenges: One of the biggest challenges in M2E2 is localizing arguments in images. Object-based models suffer from the limited object types. The attention-based method is not able to precisely localize the objects for each argument, since there is no supervision on attention extraction during training. For example, in Figure 7, the Entity argument in the Conflict.Demonstrate event is correctly predicted as troops, but its localization is incorrect because the Place argument shares a similar attention map. When one argument covers too many instances, attention heatmaps tend to lose focus and cover the whole image, as shown in Figure 8.
Figure 7: Argument labeling error examples: correct entity name but wrong localization.

Figure 8: Attention heatmaps lose focus due to a large number of instance candidates.

5 Related Work

Text Event Extraction: Text event extraction has been extensively studied for the general news domain (Ji and Grishman, 2008; Liao and Grishman, 2011; Huang and Riloff, 2012; Li et al., 2013; Chen et al., 2015; Nguyen et al., 2016; Hong et al., 2018; Liu et al., 2018b; Chen et al., 2018; Zhang et al., 2019; Liu et al., 2018a; Wang et al., 2019; Yang et al., 2019; Wadden et al., 2019). Multimedia features have been proven to effectively improve text event extraction (Zhang et al., 2017).
Visual Event Extraction: "Events" in NLP usually refer to complex events that involve multiple entities over a large span of time (e.g., a protest), while in CV (Chang et al., 2016; Zhang et al., 2007; Ma et al., 2017) events are less complex single-entity activities (e.g., washing dishes) or actions (e.g., jumping). Visual event ontologies focus on daily-life domains, such as dog shows and wedding ceremonies (Perera et al., 2012). Moreover, most efforts ignore the structure of events including arguments. There are a few methods that aim to localize the agent (Gu et al., 2018; Li et al., 2018; Duarte et al., 2018), or classify the recipient (Sigurdsson et al., 2016; Kato et al., 2018; Wu et al., 2019a) of events, but neither detects the complete set of arguments for an event. The most similar to our work is Situation Recognition (SR) (Yatskar et al., 2016; Mallya and Lazebnik, 2017), which predicts an event and multiple arguments from an input image, but does not localize the arguments. We use SR as an auxiliary task for training our visual branch, but exploit object detection and attention to enable localization of arguments. Silberer and Pinkal (2018) redefine the problem of visual argument role labeling with event types and bounding boxes as input. Different from their work, we extend the problem scope to include event identification and coreference, and further advance argument localization by proposing an attention framework which does not require bounding boxes for training or testing.
Multimedia Representation: Multimedia common representation has attracted much attention recently (Toselli et al., 2007; Weegar et al., 2015; Hewitt et al., 2018; Chen et al., 2019; Liu et al., 2019; Su et al., 2019a; Sarafianos et al., 2019; Sun et al., 2019b; Tan and Bansal, 2019; Li et al., 2019a,b; Lu et al., 2019; Sun et al., 2019a; Rahman et al., 2019; Su et al., 2019b). However, previous methods focus on aligning images with their captions, or regions with words and entities, but ignore structure and semantic roles. UniVSE (Wu et al., 2019b) incorporates entity attributes and relations into cross-media alignment, but does not capture graph-level structures of images or text.
6 Conclusions and Future Work

In this paper we propose a new task of multimedia event extraction and set up a new benchmark. We also develop a novel multimedia structured common space construction method that takes advantage of existing image-caption pairs and single-modal annotated data for weakly supervised training. Experiments demonstrate its effectiveness as a new step towards semantic understanding of events in multimedia data. In the future, we aim to extend our framework to extract events from videos, and to make it scalable to new event types. We plan to expand our annotations by including event types from other text event ontologies, as well as new event types not in existing text ontologies. We will also apply our extraction results to downstream applications including cross-media event inference, timeline generation, etc.
Acknowledgement
This research is based upon work supported in part by U.S. DARPA AIDA Program No. FA8750-18-2-0014 and U.S. DARPA KAIROS Program No. FA8750-19-2-1004. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

References
Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178-186.

Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. 2015. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961-970.

Xiaojun Chang, Zhigang Ma, Yi Yang, Zhiqiang Zeng, and Alexander G Hauptmann. 2016. Bi-level semantic representation analysis for multimedia event detection. IEEE Transactions on Cybernetics, 47(5):1180-1197.

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. UNITER: Learning universal image-text representations. arXiv preprint arXiv:1909.11740.

Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. 2015. Event extraction via dynamic multi-pooling convolutional neural networks. In Proc. ACL-IJCNLP 2015.

Yubo Chen, Hang Yang, Kang Liu, Jun Zhao, and Yantao Jia. 2018. Collective event detection via a hierarchical and bias tagging networks with gated multi-level attention mechanisms. In Proc. EMNLP 2018.

Kevin Duarte, Yogesh Rawat, and Mubarak Shah. 2018. VideoCapsuleNet: A simplified network for action detection. In Advances in Neural Information Processing Systems, pages 7610-7619.

Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE++: Improving visual-semantic embeddings with hard negatives. In Proceedings of the British Machine Vision Conference (BMVC).

Charles J Fillmore, Christopher R Johnson, and Miriam RL Petruck. 2003. Background to FrameNet. International Journal of Lexicography, 16(3):235-250.

Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645-6649. IEEE.

Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. 2018. AVA: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6047-6056.

John Hewitt, Daphne Ippolito, Brendan Callahan, Reno Kriz, Derry Tanti Wijaya, and Chris Callison-Burch. 2018. Learning translations via images with a massively multilingual image dataset. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2566-2576.

Yu Hong, Wenxuan Zhou, Jingli Zhang, Guodong Zhou, and Qiaoming Zhu. 2018. Self-regulation: Employing a generative adversarial network to improve event detection. In Proc. ACL 2018.

Ruihong Huang and Ellen Riloff. 2012. Bootstrapped training of event extraction classifiers. In Proc. EACL 2012.

Heng Ji and Ralph Grishman. 2008. Refining event extraction through cross-document inference. In Proceedings of ACL-08: HLT, pages 254-262.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128-3137.

Keizo Kato, Yin Li, and Abhinav Gupta. 2018. Compositional learning for human object interaction. In Proceedings of the European Conference on Computer Vision (ECCV), pages 234-251.

Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, et al. 2018. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982.

Dong Li, Zhaofan Qiu, Qi Dai, Ting Yao, and Tao Mei. 2018. Recurrent tubelet proposal and recognition networks for action detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 303-318.

Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. 2019a. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019b. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.

Qi Li, Heng Ji, and Liang Huang. 2013. Joint event extraction via structured prediction with global features. In Proc. ACL 2013.

Shasha Liao and Ralph Grishman. 2011. Acquiring topic features to improve event extraction: in pre-selected and balanced collections. In Proc. RANLP 2011.

Ying Lin, Liyuan Liu, Heng Ji, Dong Yu, and Jiawei Han. 2019. Reliability-aware dynamic feature composition for name tagging. In Proc. The 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019).

Chunxiao Liu, Zhendong Mao, An-An Liu, Tianzhu Zhang, Bin Wang, and Yongdong Zhang. 2019. Focus your attention: A bidirectional focal attention network for image-text matching. In Proceedings of the 27th ACM International Conference on Multimedia, pages 3-11. ACM.

Xiao Liu, Zhunchen Luo, and Heyan Huang. 2018a. Jointly multiple events extraction via attention-based graph information aggregation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1247-1256.

Xiao Liu, Zhunchen Luo, and Heyan Huang. 2018b. Jointly multiple events extraction via attention-based graph information aggregation. In Proc. EMNLP 2018.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13-23.

Zhigang Ma, Xiaojun Chang, Zhongwen Xu, Nicu Sebe, and Alexander G Hauptmann. 2017. Joint attributes and event analysis for multimedia event detection. IEEE Transactions on Neural Networks and Learning Systems, 29(7):2921-2930.

Arun Mallya and Svetlana Lazebnik. 2017. Recurrent models for situation recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 455-463.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55-60.

George A Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39-41.

Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016. Joint event extraction via recurrent neural networks. In Proc. NAACL-HLT 2016.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

AG Amitha Perera, Sangmin Oh, P Megha, Tianyang Ma, Anthony Hoogs, Arash Vahdat, Kevin Cannons, Greg Mori, Scott McCloskey, Ben Miller, et al. 2012. TRECVID 2012 GENIE: Multimedia event detection and recounting. In TRECVID Workshop. Citeseer.

Wasifur Rahman, Md Kamrul Hasan, Amir Zadeh, Louis-Philippe Morency, and Mohammed Ehsan Hoque. 2019. M-BERT: Injecting multimodal information in the BERT structure. arXiv preprint arXiv:1908.05787.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99.

Nikolaos Sarafianos, Xiang Xu, and Ioannis A. Kakadiaris. 2019. Adversarial representation learning for text-to-image matching. In The IEEE International Conference on Computer Vision (ICCV).

Haoyue Shi, Jiayuan Mao, Tete Xiao, Yuning Jiang, and Jian Sun. 2018. Learning visually-grounded semantics from contrastive adversarial samples. arXiv preprint arXiv:1806.10348.

Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. 2016. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, pages 510-526. Springer.

Carina Silberer and Manfred Pinkal. 2018. Grounding semantic roles in images. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2616-2626.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.

Mitchell Stephens. 1998. The Rise of the Image, The Fall of the Word. New York: Oxford University Press.

Shupeng Su, Zhisheng Zhong, and Chao Zhang. 2019a. Deep joint-semantics reconstructing hashing for large-scale unsupervised cross-modal retrieval. In The IEEE International Conference on Computer Vision (ICCV).

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2019b. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530.

Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. 2019a. Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743.

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019b. VideoBERT: A joint model for video and language representation learning. In The IEEE International Conference on Computer Vision (ICCV).

Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.

Alejandro H Toselli, Verónica Romero, and Enrique Vidal. 2007. Viterbi based alignment between text images and their transcripts. In Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007), pages 9-16.

David Wadden, Ulme Wennberg, Yi Luan, and Hannaneh Hajishirzi. 2019. Entity, relation, and event extraction with contextualized span representations. arXiv preprint arXiv:1909.03546.

Christopher Walker, Stephanie Strassel, Julie Medero, and Kazuaki Maeda. 2006. ACE 2005 multilingual training corpus. Linguistic Data Consortium, Philadelphia, 57.

Chuan Wang, Sameer Pradhan, Xiaoman Pan, Heng Ji, and Nianwen Xue. 2016. CAMR at SemEval-2016 Task 8: An extended transition-based AMR parser. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1173-1178, San Diego, California. Association for Computational Linguistics.

Chuan Wang, Nianwen Xue, and Sameer Pradhan. 2015a. Boosting transition-based AMR parsing with refined actions and auxiliary analyzers. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 857-862, Beijing, China. Association for Computational Linguistics.

Chuan Wang, Nianwen Xue, and Sameer Pradhan. 2015b. A transition-based algorithm for AMR parsing. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 366-375, Denver, Colorado. Association for Computational Linguistics.

Rui Wang, Deyu Zhou, and Yulan He. 2019. Open event extraction from online text using a generative adversarial network. arXiv preprint arXiv:1908.09246.

Rebecka Weegar, Kalle Åström, and Pierre Nugues. 2015. Linking entities across images and text. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 185-193.

Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, and Ross Girshick. 2019a. Long-term feature banks for detailed video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 284-293.

Hank Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, and Wei-Ying Ma. 2019b. UniVSE: Robust visual semantic embeddings via structured semantic representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Sen Yang, Dawei Feng, Linbo Qiao, Zhigang Kan, and Dongsheng Li. 2019. Exploring pre-trained language models for event extraction and generation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 5284-5294.

Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. 2016. Situation recognition: Visual semantic role labeling for image understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5534-5542.

Guangnan Ye, Yitong Li, Hongliang Xu, Dong Liu, and Shih-Fu Chang. 2015. EventNet: A large scale structured concept library for complex event detection in video. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 471-480. ACM.

Tongtao Zhang, Heng Ji, and Avirup Sil. 2019. Joint entity and event extraction with generative adversarial imitation learning. Data Intelligence, 1(2):99-120.

Tongtao Zhang, Spencer Whitehead, Hanwang Zhang, Hongzhi Li, Joseph Ellis, Lifu Huang, Wei Liu, Heng Ji, and Shih-Fu Chang. 2017. Improving event extraction via multimodal integration. In Proceedings of the 25th ACM International Conference on Multimedia, pages 270-278. ACM.

Yifan Zhang, Changsheng Xu, Yong Rui, Jinqiao Wang, and Hanqing Lu. 2007. Semantic event extraction from basketball games using multi-modal analysis. In 2007 IEEE International Conference on Multimedia and Expo