Cross-media Structured Common Space for Multimedia Event Extraction
Manling Li*, Alireza Zareian*, Qi Zeng, Spencer Whitehead, Di Lu, Heng Ji, Shih-Fu Chang

University of Illinois at Urbana-Champaign, Columbia University, Dataminr
{manling2,hengji}@illinois.edu, {az2407,sc250}@columbia.edu
http://blender.cs.illinois.edu/software/m2e2

Abstract
We introduce a new task, MultiMedia Event Extraction (M2E2), which aims to extract events and their arguments from multimedia documents. We develop the first benchmark and collect a dataset of 245 multimedia news articles with extensively annotated events and arguments. We propose a novel method, Weakly Aligned Structured Embedding (WASE), that encodes structured representations of semantic information from textual and visual data into a common embedding space. The structures are aligned across modalities by employing a weakly supervised training strategy, which enables exploiting available resources without explicit cross-media annotation. Compared to uni-modal state-of-the-art methods, our approach achieves 4.0% and 9.8% absolute F-score gains on text event argument role labeling and visual event extraction. Compared to state-of-the-art multimedia unstructured representations, we achieve 8.3% and 5.0% absolute F-score gains on multimedia event extraction and argument role labeling, respectively. By utilizing images, we extract 21.4% more event mentions than traditional text-only methods.
1 Introduction

Traditional event extraction methods target a single modality, such as text (Wadden et al., 2019), images (Yatskar et al., 2016) or videos (Ye et al., 2015; Caba Heilbron et al., 2015; Soomro et al., 2012). However, the practice of contemporary journalism (Stephens, 1998) distributes news via multimedia. By randomly sampling 100 multimedia news articles from the Voice of America (VOA), we find that 33% of images in the articles contain visual objects that serve as event arguments and are not mentioned in the text.

* These authors contributed equally to this work. Our data and code are available at http://blender.cs.illinois.edu/software/m2e2
Figure 1: An example of Multimedia Event Extraction. An event mention and some event arguments (Agent and Person) are extracted from text, while the Vehicle arguments can only be extracted from the image.
Take Figure 1 as an example: we can extract the Agent and Person arguments of the Movement.Transport event from text, but can extract the Vehicle argument only from the image. Nevertheless, event extraction is independently studied in Computer Vision (CV) and Natural Language Processing (NLP), with major differences in task definition, data domain, methodology, and terminology. Motivated by the complementary and holistic nature of multimedia data, we propose MultiMedia Event Extraction (M2E2), a new task that aims to jointly extract events and arguments from multiple modalities. We construct the first benchmark and evaluation dataset for this task, which consists of 245 fully annotated news articles.

We propose the first method, Weakly Aligned Structured Embedding (WASE), for extracting events and arguments from multiple modalities. Complex event structures have not been covered by existing multimedia representation methods (Wu et al., 2019b; Faghri et al., 2018; Karpathy and Fei-Fei, 2015), so we propose to learn a structured multimedia embedding space. More specifically, given a multimedia document, we represent each image or sentence as a graph, where each node represents an event or entity and each edge represents an argument role. The node and edge embeddings are represented in a multimedia common semantic space, as they are trained to resolve event co-reference across modalities and to match images with relevant sentences. This enables us to jointly classify events and argument roles from both modalities. A major challenge is the lack of multimedia event argument annotations, which are costly to obtain due to the annotation complexity. Therefore, we propose a weakly supervised framework, which takes advantage of annotated uni-modal corpora to separately learn visual and textual event extraction, and uses an image-caption dataset to align the modalities.

We evaluate WASE on the new task of M2E2. Compared to state-of-the-art uni-modal methods and multimedia flat representations, our method achieves significantly better results on both event extraction and argument role labeling in all settings. Moreover, it extracts 21.4% more event mentions than text-only baselines. The training and evaluation are done on heterogeneous data sets from multiple sources, domains and data modalities, demonstrating the scalability and transferability of the proposed model. In summary, this paper makes the following contributions:

• We propose a new task, MultiMedia Event Extraction, and construct the first annotated news dataset as a benchmark to support deep analysis of cross-media events.
• We develop a weakly supervised training framework, which utilizes existing single-modal annotated corpora, and enables joint inference without cross-modal annotation.
• Our proposed method, WASE, is the first to leverage structured representations and graph-based neural networks for multimedia common space embedding.
2.1 Task Definition

Each input document consists of a set of images M = {m_1, m_2, ...} and a set of sentences S = {s_1, s_2, ...}. Each sentence s can be represented as a sequence of tokens s = (w_1, w_2, ...), where w_i is a token from the document vocabulary W. The input also includes a set of entities T = {t_1, t_2, ...} extracted from the document text. An entity is an individually unique object in the real world, such as a person, an organization, a facility, a location, a geopolitical entity, a weapon, or a vehicle. The objective of M2E2 is twofold:

Event Extraction: Given a multimedia document, extract a set of event mentions, where each event mention e has a type y_e and is grounded on a text trigger word w or an image m or both, i.e., e = (y_e, {w, m}). Note that for an event, w and m can both exist, which means the visual event mention and the textual event mention refer to the same event. For example, in Figure 1, deploy indicates the same Movement.Transport event as the image. We consider the event e a text-only event if it only has a textual mention w, an image-only event if it only has a visual mention m, and a multimedia event if both w and m exist.

Argument Extraction: The second task is to extract a set of arguments of event mention e. Each argument a has an argument role type y_a, and is grounded on a text entity t or an image object o (represented as a bounding box), or both: a = (y_a, {t, o}). The arguments of visual and textual event mentions are merged if they refer to the same real-world event, as shown in Figure 1.
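To make the two grounding options concrete, the sketch below renders the task output as Python data structures; the class and field names (and the entity strings in the Figure 1 example) are our own illustration, not anything released with the paper.

```python
# A minimal sketch of the M2E2 output structures defined above.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Argument:
    role: str                                        # argument role type y_a, e.g. "Agent"
    entity: Optional[str] = None                     # grounding text entity t, if any
    box: Optional[Tuple[int, int, int, int]] = None  # grounding image object o as (x1, y1, x2, y2)

@dataclass
class EventMention:
    event_type: str                                  # event type y_e, e.g. "Movement.Transport"
    trigger: Optional[str] = None                    # text trigger word w, if any
    image_id: Optional[str] = None                   # grounding image m, if any
    arguments: List[Argument] = field(default_factory=list)

    @property
    def modality(self) -> str:
        # text-only, image-only, or multimedia, exactly as categorized above
        if self.trigger and self.image_id:
            return "multimedia"
        return "text-only" if self.trigger else "image-only"

# The Figure 1 example: trigger "deploy" and the image denote the same event.
# Entity strings and box coordinates below are invented for illustration.
e = EventMention("Movement.Transport", trigger="deploy", image_id="voa_001.jpg",
                 arguments=[Argument("Agent", entity="government forces"),
                            Argument("Vehicle", box=(40, 120, 610, 420))])
assert e.modality == "multimedia"
```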
2.2 M2E2 Dataset

We define multimedia newsworthy event types by exhaustively mapping between the event ontology in the NLP community for the news domain (ACE; https://catalog.ldc.upenn.edu/ldc2006T06) and the event ontology in the CV community for the general domain (imSitu (Yatskar et al., 2016)). They cover the largest event training resources in each community. Table 1 shows the selected complete intersection, which contains 8 ACE types (i.e., 24% of all ACE types), mapped to 98 imSitu types (i.e., 20% of all imSitu types). We expand the ACE event role set by adding visual arguments from imSitu, such as instrument, bolded in Table 1. This set encompasses 52% of ACE events in a news corpus, which indicates that the selected eight types are salient in the news domain. We reuse these existing ontologies because they enable us to train event and argument classifiers for both modalities without requiring joint multimedia event annotation as training data.

Table 1: Event types and argument roles in M2E2, with expanded ones in bold. Numbers in parentheses represent the counts of textual and visual events/arguments.

We collect 108,693 multimedia news articles from the Voice of America (VOA) website [...]; (3) Diversity: articles that balance the event type distribution regardless of true frequency. The data statistics are shown in Table 2. Among all of these events, 192 textual event mentions and 203 visual event mentions can be aligned as 309 cross-media event mention pairs. The dataset can be divided into 1,105 text-only event mentions, 188 image-only event mentions, and 395 multimedia event mentions.

Source             Event Mention      Argument Role
sentence   image   textual   visual   textual   visual
6,167      1,014   1,297     391      1,965     1,429
Table 2: M2E2 data statistics.

We follow the ACE event annotation guidelines (Walker et al., 2006) for textual event and argument annotation, and design an annotation guideline for multimedia event annotation (http://blender.cs.illinois.edu/software/m2e2/ACL2020_M2E2_annotation.pdf). One unique challenge in multimedia event annotation is to localize visual arguments in complex scenarios, where images include a crowd of people or a group of objects, and it is hard to delineate each of them using a bounding box. To solve this problem, we define two types of bounding boxes: (1) union bounding box: for each role, we annotate the smallest bounding box covering all constituents; and (2) instance bounding box: for each role, we annotate a set of bounding boxes, where each box is the smallest region that covers an individual participant (e.g., one person in the crowd), following the VOC2011 Annotation Guidelines (http://host.robots.ox.ac.uk/pascal/VOC/voc2011/guidelines.html). Figure 2 shows an example.

Figure 2: Example of bounding boxes.

Eight NLP and CV researchers complete the annotation work with two independent passes and reach an Inter-Annotator Agreement (IAA) of 81.2%. Two expert annotators perform adjudication.
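The union box is fully determined by the instance boxes, as the short sketch below illustrates; the (x1, y1, x2, y2) box convention is our assumption, not part of the annotation guideline.

```python
# Union bounding box: the smallest box covering every instance box of a role.
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed (x1, y1, x2, y2) convention

def union_bounding_box(instance_boxes: List[Box]) -> Box:
    """Smallest bounding box covering all instance boxes annotated for one role."""
    x1 = min(b[0] for b in instance_boxes)
    y1 = min(b[1] for b in instance_boxes)
    x2 = max(b[2] for b in instance_boxes)
    y2 = max(b[3] for b in instance_boxes)
    return (x1, y1, x2, y2)

# e.g., three people in a crowd annotated individually for the same role
people = [(10, 40, 60, 200), (55, 35, 120, 210), (115, 50, 170, 205)]
print(union_bounding_box(people))  # -> (10, 35, 170, 210)
```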
3 Method

As shown in Figure 3, the training phase contains three tasks: text event extraction (Section 3.2), visual situation recognition (Section 3.3), and cross-media alignment (Section 3.4). We learn a cross-media shared encoder, a shared event classifier, and a shared argument classifier. In the testing phase (Section 3.5), given a multimedia news article, we encode the sentences and images into the structured common space, and jointly extract textual and visual events and arguments, followed by cross-modal coreference resolution.

Figure 3: Approach overview. During training (left), we jointly train three tasks to establish a cross-media structured embedding space. During test (right), we jointly extract events and arguments from multimedia articles.

3.2 Text Event Extraction

As shown in Figure 4, we choose Abstract Meaning Representation (AMR) (Banarescu et al., 2013) to represent text because it includes a rich set of 150 fine-grained semantic roles. To encode each text sentence, we run the CAMR parser (Wang et al., 2015b,a, 2016) to generate an AMR graph, based on the named entity recognition and part-of-speech (POS) tagging results from Stanford CoreNLP (Manning et al., 2014). To represent each word w in a sentence s, we concatenate its pre-trained GloVe word embedding (Pennington et al., 2014), POS embedding, entity type embedding and position embedding. We then input the word sequence to a bi-directional long short-term memory (Bi-LSTM) (Graves et al., 2013) network to encode the word order and get the representation of each word w. Given the AMR graph, we apply a Graph Convolutional Network (GCN) (Kipf and Welling, 2016) to encode the graph contextual information following (Liu et al., 2018a):

    w_i^{(k+1)} = f( Σ_{j ∈ N(i)} g_ij^{(k)} ( W_{E(i,j)} w_j^{(k)} + b_{E(i,j)}^{(k)} ) ),    (1)

where N(i) is the set of neighbor nodes of w_i in the AMR graph, E(i,j) is the edge type between w_i and w_j, g_ij is the gate following (Liu et al., 2018a), k represents the GCN layer number, and f is the Sigmoid function. W and b denote parameters of neural layers in this paper. We take the hidden states of the last GCN layer for each word as the common-space representation w^C, where C stands for the common (multimedia) embedding space. For each entity t, we obtain its representation t^C by averaging the embeddings of its tokens.
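For concreteness, here is a minimal PyTorch sketch of one GCN layer in the spirit of Equation 1, with one weight matrix per AMR edge type; the scalar edge gate is a simplification of the gating of Liu et al. (2018a), and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class TypedGCNLayer(nn.Module):
    def __init__(self, dim: int, num_edge_types: int):
        super().__init__()
        # one W_{E(i,j)} and b_{E(i,j)} per AMR edge type
        self.W = nn.Parameter(torch.randn(num_edge_types, dim, dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(num_edge_types, dim))
        self.gate = nn.Linear(dim, 1)  # simplified scalar edge gate g_ij

    def forward(self, w: torch.Tensor, edges) -> torch.Tensor:
        # w: (num_nodes, dim) word representations; edges: list of (i, j, edge_type)
        out = torch.zeros_like(w)
        for i, j, e in edges:
            msg = self.W[e] @ w[j] + self.b[e]      # W_{E(i,j)} w_j + b_{E(i,j)}
            g = torch.sigmoid(self.gate(w[j]))      # gate for this edge
            out[i] = out[i] + g.squeeze() * msg     # sum over neighbors N(i)
        return torch.sigmoid(out)                   # f = sigmoid, giving w^{(k+1)}

layer = TypedGCNLayer(dim=16, num_edge_types=2)     # toy sizes
w_next = layer(torch.randn(3, 16), edges=[(0, 1, 0), (0, 2, 1), (1, 0, 0)])
```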
Event and Argument Classifier: We classify each word w into event types y_e and classify each entity t into argument role y_a:

    P(y_e | w) = exp(W_e w^C + b_e) / Σ_{e'} exp(W_{e'} w^C + b_{e'}),
    P(y_a | t) = exp(W_a [t^C; w^C] + b_a) / Σ_{a'} exp(W_{a'} [t^C; w^C] + b_{a'}).    (2)

We use the BIO tag schema to decide trigger word boundaries, i.e., adding the prefix B- to the type label to mark the beginning of a trigger, I- for inside, and O for none. We take ground truth text entity mentions as input following (Ji and Grishman, 2008) during training, and obtain testing entity mentions using a named entity extractor (Lin et al., 2019).

3.3 Visual Situation Recognition

To obtain image structures similar to AMR graphs, and inspired by situation recognition (Yatskar et al., 2016), we represent each image with a situation graph, that is, a star-shaped graph as shown in Figure 4, where the central node is labeled as a verb v (e.g., destroying), and the neighbor nodes are arguments labeled as {(n, r)}, where n is a noun (e.g., ship) derived from WordNet synsets (Miller, 1995) to indicate the entity type, and r indicates the role (e.g., item) played by the entity in the event, based on FrameNet (Fillmore et al., 2003). We develop two methods to construct situation graphs from images and train them using the imSitu dataset (Yatskar et al., 2016), as follows.

Figure 4: Multimedia structured common space construction. Red pixels stand for the attention heatmap.

(1) Object-based Graph: Similar to extracting entities to get candidate arguments, we employ the most similar task in CV, object detection, and obtain the object bounding boxes detected by a Faster R-CNN (Ren et al., 2015) model trained on Open Images (Kuznetsova et al., 2018) with 600 object types (classes). We employ a VGG-16 CNN (Simonyan and Zisserman, 2014) to extract visual features of an image m and another VGG-16 to encode the bounding boxes {o_i}. Then we apply a Multi-Layer Perceptron (MLP) to predict a verb embedding from m and another MLP to predict a noun embedding for each o_i:

    m̂ = MLP_m(m),   ô_i = MLP_o(o_i).

We compare the predicted verb embedding to all verbs v in the imSitu taxonomy in order to classify the verb, and similarly compare each predicted noun embedding to all imSitu nouns n, which results in probability distributions:

    P(v | m) = exp(m̂ v) / Σ_{v'} exp(m̂ v'),
    P(n | o_i) = exp(ô_i n) / Σ_{n'} exp(ô_i n'),

where v and n are word embeddings initialized with GloVe (Pennington et al., 2014). We use another MLP with one hidden layer followed by Softmax (σ) to classify the role r_i for each object o_i:

    P(r_i | o_i) = σ(MLP_r(ô_i)).

Given verb v* and role-noun (r*_i, n*_i) annotations for an image (from the imSitu corpus), we define the situation loss functions:

    L_v = −log P(v* | m),
    L_r = −log( P(r*_i | o_i) + P(n*_i | o_i) ).

(2) Attention-based Graph: State-of-the-art object detection methods only cover a limited set of object types, such as the 600 types defined in Open Images. Many salient objects such as bomb, stone and stretcher are not covered in these ontologies. Hence, we propose an open-vocabulary alternative to the object-based graph construction model. To this end, we construct a role-driven attention graph, where each argument node is derived by a spatially distributed attention (heatmap) conditioned on a role r. More specifically, we use a VGG-16 CNN to extract a 7×7 convolutional feature map for each image m, which can be regarded as attention keys k_i for the 7×7 local regions. Next, for each role r defined in the situation recognition ontology (e.g., agent), we build an attention query vector q_r by concatenating the role embedding r with the image feature m as context and applying a fully connected layer:

    q_r = W_q [r; m] + b_q.

Then, we compute the dot product of each query with all keys, followed by Softmax, which forms a heatmap h on the image, i.e.,

    h_i = exp(q_r k_i) / Σ_{j ∈ 7×7} exp(q_r k_j).

We use the heatmap to obtain a weighted average of the feature map to represent the argument o_r of each role r in the visual space:

    o_r = Σ_i h_i m_i.

Similar to the object-based model, we embed o_r to ô_r, compare it to the imSitu noun embeddings to define a distribution, and define a classification loss function. The verb embedding m̂, the verb prediction probability P(v | m) and the loss are defined in the same way as in the object-based method.
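A compact sketch of the role-driven attention just described: the query built from the role embedding and the global image feature attends over the convolutional grid, yielding a heatmap and the pooled argument representation o_r. The 7×7 grid and 512-dimensional features are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 512                                   # assumed feature size
W_q = nn.Linear(2 * dim, dim)               # q_r = W_q [r ; m] + b_q

def role_attention(keys: torch.Tensor, m: torch.Tensor, r: torch.Tensor):
    # keys: (49, dim) feature-map cells k_i; m: (dim,) image feature; r: (dim,) role embedding
    q = W_q(torch.cat([r, m]))              # role-conditioned attention query
    h = F.softmax(keys @ q, dim=0)          # heatmap h_i over the 7x7 regions
    o_r = h @ keys                          # o_r = sum_i h_i m_i (weighted average)
    return h.view(7, 7), o_r

heat, o_agent = role_attention(torch.randn(49, dim), torch.randn(dim), torch.randn(dim))
```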
Event and Argument Classifier: We use either the object-based or attention-based formulation and pre-train it on the imSitu dataset (Yatskar et al., 2016). Then we apply a GCN to obtain the structured embedding of each node in the common space, similar to Equation 1. This yields m^C and o^C_i. We use the same classifiers as defined in Equation 2 to classify each visual event and argument using the common space embedding:

    P(y_e | m) = exp(W_e m^C + b_e) / Σ_{e'} exp(W_{e'} m^C + b_{e'}),
    P(y_a | o) = exp(W_a [o^C; m^C] + b_a) / Σ_{a'} exp(W_{a'} [o^C; m^C] + b_{a'}).    (3)

3.4 Cross-Media Alignment

In order to make the event and argument classifiers shared across modalities, the image and text graphs should be encoded into the same space. However, it is extremely costly to obtain parallel text and image event annotation. Hence, we use event and argument annotations in separate modalities (i.e., the ACE and imSitu datasets) to train the classifiers, and simultaneously use VOA news image and caption pairs to align the two modalities. To this end, we learn to embed the nodes of each image graph close to the nodes of the corresponding caption graph, and far from those in irrelevant caption graphs. Since there is no ground truth alignment between the image nodes and caption nodes, we use image and caption pairs for weakly supervised training, to learn a soft alignment from each word to image objects and vice versa:

    α_ij = exp(w^C_i · o^C_j) / Σ_{j'} exp(w^C_i · o^C_{j'}),
    β_ji = exp(w^C_i · o^C_j) / Σ_{i'} exp(w^C_{i'} · o^C_j),

where w_i indicates the i-th word in caption sentence s and o_j represents the j-th object of image m. Then, we compute a weighted average of softly aligned nodes for each node in the other modality, i.e.,

    w'_i = Σ_j α_ij o^C_j,   o'_j = Σ_i β_ji w^C_i.    (4)

We define the alignment cost of the image-caption pair as the Euclidean distance between each node and its aligned representation:

    ⟨s, m⟩ = Σ_i ||w_i − w'_i|| + Σ_j ||o_j − o'_j||.

We use a triplet loss to pull relevant image-caption pairs close while pushing irrelevant ones apart:

    L_c = max(0, ⟨s, m⟩ − ⟨s, m⁻⟩),

where m⁻ is a randomly sampled negative image that does not match s. Note that in order to learn the alignment between the image and the trigger word, we treat the image as a special object when learning cross-media alignment. The common space enables the event and argument classifiers to share weights across modalities and be trained jointly on the ACE and imSitu datasets, by minimizing the following objective functions:

    L_e = − Σ_w log P(y_e | w) − Σ_m log P(y_e | m),
    L_a = − Σ_t log P(y_a | t) − Σ_o log P(y_a | o).

All tasks are jointly optimized:

    L = L_v + L_r + L_e + L_a + L_c.

3.5 Cross-Media Joint Inference

In the test phase, our method takes a multimedia document with sentences S = {s_1, s_2, ...} and images M = {m_1, m_2, ...} as input. We first generate the structured common embedding for each sentence and each image, and then compute pairwise similarities ⟨s, m⟩. We pair each sentence s with the closest image m, and aggregate the features of each word of s with the aligned representation from m by weighted averaging:

    w''_i = (1 − γ) w_i + γ w'_i,    (5)

where γ = exp(−⟨s, m⟩) and w'_i is derived from m using Equation 4.
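The soft alignment and the test-time fusion of Equations 4 and 5 reduce to a few matrix operations, sketched below in PyTorch; tensor shapes are illustrative, and the triplet loss from the text would be applied on top of the returned distance.

```python
import torch
import torch.nn.functional as F

def align_and_fuse(w: torch.Tensor, o: torch.Tensor):
    # w: (num_words, d) common-space word embeddings; o: (num_objects, d) object embeddings
    scores = w @ o.t()                           # w_i^C . o_j^C for every pair
    alpha = F.softmax(scores, dim=1)             # alpha_ij: each word attends over objects
    beta = F.softmax(scores, dim=0).t()          # beta_ji: each object attends over words
    w_aligned = alpha @ o                        # w'_i (Equation 4)
    o_aligned = beta @ w                         # o'_j (Equation 4)
    # alignment cost <s, m>: distance of every node to its aligned counterpart
    dist = (w - w_aligned).norm(dim=1).sum() + (o - o_aligned).norm(dim=1).sum()
    gamma = torch.exp(-dist)                     # test-time fusion weight
    w_fused = (1 - gamma) * w + gamma * w_aligned    # w''_i (Equation 5)
    return w_fused, dist

w_fused, dist = align_and_fuse(torch.randn(12, 256), torch.randn(5, 256))
```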
We use w''_i to classify each word into an event type and to classify each entity into a role with the multimedia classifiers in Equation 2. To this end, we define t''_i similar to w''_i but using t_i and t'_i. Similarly, for each image m we find the closest sentence s, compute the aggregated multimedia features m'' and o''_i, and feed them into the shared classifiers (Equation 3) to predict visual event and argument roles. Finally, we corefer the cross-media events of the same event type if the similarity ⟨s, m⟩ is higher than a threshold.

                 Text-Only Evaluation              Image-Only Evaluation             Multimedia Evaluation
                 Event Mention    Argument Role    Event Mention    Argument Role    Event Mention    Argument Role
Model            P    R    F1     P    R    F1     P    R    F1     P    R    F1     P    R    F1     P    R    F1
JMEE             42.5 58.2 48.7   22.9 28.3 25.3   -    -    -      -    -    -      42.1 34.6 38.1   21.1 12.6 15.8
GAIL             43.4 53.5 47.9   23.6 29.2 26.1   -    -    -      -    -    -      44.0 32.4 37.3   22.7 12.8 16.4
WASE_T           [...]
WASE_I (att)     -    -    -      -    -    -      29.7 61.9 40.1   9.1  10.2 9.6    28.3 23.0 25.4   2.9  6.1  3.8
WASE_I (obj)     -    -    -      -    -    -      28.6 59.2 38.7   13.3 9.8  11.2   26.1 22.4 24.1   4.7  5.0  4.9
VSE-C            33.5 47.8 39.4   16.6 24.7 19.8   30.3 48.9 26.4   5.6  6.1  5.7    33.3 48.2 39.3   11.1 14.9 12.8
Flat (att/obj)   [...]
WASE (att/obj)   [...]

Table 3: Event and argument extraction results (%). We compare three categories of baselines in three evaluation settings. The main contribution of the paper is joint training and joint inference on multimedia data (bottom right).

4 Experiments

4.1 Evaluation Setting

We conduct evaluation on text-only, image-only, and multimedia event mentions in the M2E2 dataset of Section 2.2. We adopt the traditional event extraction measures, i.e., Precision, Recall and F1. For text-only event mentions, we follow (Ji and Grishman, 2008; Li et al., 2013): a textual event mention is correct if its event type and trigger offsets match a reference trigger; and a textual event argument is correct if its event type, offsets, and role label match a reference argument. We make a similar definition for image-only event mentions: a visual event mention is correct if its event type and image match a reference visual event mention; and a visual event argument is correct if its event type, localization, and role label match a reference argument. A visual argument is correctly localized if the Intersection over Union (IoU) of the predicted bounding box with the ground truth bounding box is over 0.5. Finally, we define a multimedia event mention to be correct if its event type and trigger offsets (or the image) match the reference trigger (or the reference image). The arguments of multimedia events are either textual or visual arguments, and are evaluated accordingly. To generate bounding boxes for the attention-based model, we threshold the heatmap at a fixed fraction of its peak value p, then compute the tightest bounding box that encloses all of the thresholded region. Examples are shown in Figure 7 and Figure 8.
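A small sketch of these two evaluation utilities, assuming numpy arrays; since the exact peak-fraction coefficient is elided above, the 0.5 used here is a placeholder.

```python
import numpy as np

def heatmap_to_box(h: np.ndarray, frac: float = 0.5):
    """Tightest box enclosing the region above a fraction of the heatmap's peak."""
    ys, xs = np.where(h >= frac * h.max())       # region above the adaptive threshold
    return xs.min(), ys.min(), xs.max(), ys.max()

def iou(a, b) -> float:
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

pred = heatmap_to_box(np.random.rand(224, 224))
print(iou(pred, (20, 30, 180, 200)) > 0.5)       # localization correctness check
```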
Baselines: The baselines include: (1) Text-only models: We use the state-of-the-art models JMEE (Liu et al., 2018a) and GAIL (Zhang et al., 2019) for comparison. We also evaluate the effectiveness of cross-media joint training by including a version of our model trained only on ACE, denoted as WASE_T. (2) Image-only models: Since we are the first to extract newsworthy events, and the most similar work, situation recognition, cannot localize arguments in images, we use our model trained only on the image corpus as baselines. Our visual branch has two versions, object-based and attention-based, denoted as WASE_I obj and WASE_I att. (3) Multimedia models: To show the effectiveness of structured embedding, we include a baseline that removes the text and image GCNs from our model, denoted as Flat. The Flat baseline ignores edges and treats images and sentences as sets of vectors. We also compare to the state-of-the-art cross-media common representation model, Contrastive Visual Semantic Embedding VSE-C (Shi et al., 2018), by training it the same way as WASE.
Parameter Settings: The common space dimension is [...]. The dimension is [...] for the image position embedding and feature map, and [...] for the word position embedding, entity type embedding, and POS tag embedding. The number of GCN layers is [...].

4.2 Quantitative Performance

As shown in Table 3, our complete methods (WASE_att and WASE_obj) outperform all baselines in the three evaluation settings in terms of F1. The comparison with other multimedia models demonstrates the effectiveness of our model architecture and training strategy. The advantage of structured embedding is shown by the better performance over the Flat baseline. Our model outperforms its text-only and image-only variants on multimedia events, showing the inadequacy of single-modal information for complex news understanding. Furthermore, our model achieves better performance on text-only and image-only events, which demonstrates the effectiveness of the multimedia training framework in transferring knowledge between modalities.

WASE_obj and WASE_att are both superior to the state of the art, and each has its own advantages. WASE_obj predicts more accurate bounding boxes since it is based on a Faster R-CNN pretrained on bounding box annotations, resulting in a higher argument precision, while WASE_att achieves a higher argument recall as it is not limited by the predefined object classes of the Faster R-CNN.

Model            P (%)   R (%)   F1 (%)
rule-based       10.1    100     18.2
VSE              31.2    74.5    44.0
Flat (att/obj)   [...]
WASE (att/obj)   [...]

Table 4: Cross-media event coreference performance.
Furthermore, to evaluate the cross-media event coreference performance, we pair textual and visual event mentions in the same document, and calculate Precision, Recall and F1 against ground truth event mention pairs. (We do not use coreference clustering metrics because we only focus on mention-level cross-media event coreference instead of the full coreference in all documents.) As shown in Table 4, WASE_obj outperforms all multimedia embedding models, as well as the rule-based baseline using event type matching. This demonstrates the effectiveness of our cross-media soft alignment.

4.3 Qualitative Analysis

Our cross-media joint training approach successfully boosts both event extraction and argument role labeling performance. For example, in Figure 5 (a), the text-only model cannot extract the Justice.Arrest event, but the joint model can use the image as background to detect the event type. In Figure 5 (b), the image-only model labels the image as Conflict.Demonstration, but the sentences in the same document help our model not to label it as Conflict.Demonstration. Compared with multimedia flat embedding in Figure 6, WASE can learn structures such as the Artifact being on top of the Vehicle, and that the person in the middle of the Justice.Arrest event is the Entity instead of the Agent.
Figure 5: Image helps textual event extraction, and surrounding sentence helps visual event extraction. (a) "Iraqi security forces search [Justice.Arrest] a civilian in the city of Mosul." (b) "People celebrate Supreme Court ruling on Same Sex Marriage in front of the Supreme Court in Washington."
Figure 6: Comparison with multimedia flat embedding.
Remaining Challenges: One of the biggest challenges in M2E2 is localizing arguments in images. Object-based models suffer from the limited object types. The attention-based method is not able to precisely localize the objects for each argument, since there is no supervision on attention extraction during training. For example, in Figure 7, the Entity argument in the Conflict.Demonstrate event is correctly predicted as troops, but its localization is incorrect because the Place argument shares a similar attention map. When one argument covers too many instances, attention heatmaps tend to lose focus and cover the whole image, as shown in Figure 8.
Figure 7: Argument labeling error examples: correct entity name but wrong localization.

Figure 8: Attention heatmaps lose focus due to a large number of instance candidates.

5 Related Work

Text Event Extraction: Text event extraction has been extensively studied for the general news domain (Ji and Grishman, 2008; Liao and Grishman, 2011; Huang and Riloff, 2012; Li et al., 2013; Chen et al., 2015; Nguyen et al., 2016; Hong et al., 2018; Liu et al., 2018b; Chen et al., 2018; Zhang et al., 2019; Liu et al., 2018a; Wang et al., 2019; Yang et al., 2019; Wadden et al., 2019). Multimedia features have been proven to effectively improve text event extraction (Zhang et al., 2017).
Visual Event Extraction: "Events" in NLP usually refer to complex events that involve multiple entities over a large span of time (e.g., a protest), while in CV (Chang et al., 2016; Zhang et al., 2007; Ma et al., 2017) events are less complex single-entity activities (e.g., washing dishes) or actions (e.g., jumping). Visual event ontologies focus on daily-life domains, such as dog shows and wedding ceremonies (Perera et al., 2012). Moreover, most efforts ignore the structure of events including arguments. There are a few methods that aim to localize the agent (Gu et al., 2018; Li et al., 2018; Duarte et al., 2018), or classify the recipient (Sigurdsson et al., 2016; Kato et al., 2018; Wu et al., 2019a) of events, but neither detects the complete set of arguments for an event. The most similar to our work is Situation Recognition (SR) (Yatskar et al., 2016; Mallya and Lazebnik, 2017), which predicts an event and multiple arguments from an input image, but does not localize the arguments. We use SR as an auxiliary task for training our visual branch, but exploit object detection and attention to enable localization of arguments. Silberer and Pinkal (2018) redefine the problem of visual argument role labeling with event types and bounding boxes as input. Different from their work, we extend the problem scope to include event identification and coreference, and further advance argument localization by proposing an attention framework which does not require bounding boxes for training or testing.
Multimedia Representation: Multimedia common representation has attracted much attention recently (Toselli et al., 2007; Weegar et al., 2015; Hewitt et al., 2018; Chen et al., 2019; Liu et al., 2019; Su et al., 2019a; Sarafianos et al., 2019; Sun et al., 2019b; Tan and Bansal, 2019; Li et al., 2019a,b; Lu et al., 2019; Sun et al., 2019a; Rahman et al., 2019; Su et al., 2019b). However, previous methods focus on aligning images with their captions, or regions with words and entities, but ignore structure and semantic roles. UniVSE (Wu et al., 2019b) incorporates entity attributes and relations into cross-media alignment, but does not capture graph-level structures of images or text.
6 Conclusions and Future Work

In this paper we propose a new task of multimedia event extraction and set up a new benchmark. We also develop a novel multimedia structured common space construction method that takes advantage of existing image-caption pairs and single-modal annotated data for weakly supervised training. Experiments demonstrate its effectiveness as a new step towards semantic understanding of events in multimedia data. In the future, we aim to extend our framework to extract events from videos, and to make it scalable to new event types. We plan to expand our annotations by including event types from other text event ontologies, as well as new event types not in existing text ontologies. We will also apply our extraction results to downstream applications including cross-media event inference, timeline generation, etc.
Acknowledgement
This research is based upon work supported in part by U.S. DARPA AIDA Program No. FA8750-18-2-0014 and U.S. DARPA KAIROS Program No. FA8750-19-2-1004. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

References
Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178-186.

Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. 2015. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961-970.

Xiaojun Chang, Zhigang Ma, Yi Yang, Zhiqiang Zeng, and Alexander G Hauptmann. 2016. Bi-level semantic representation analysis for multimedia event detection. IEEE Transactions on Cybernetics, 47(5):1180-1197.

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. UNITER: Learning universal image-text representations. arXiv preprint arXiv:1909.11740.

Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. 2015. Event extraction via dynamic multi-pooling convolutional neural networks. In Proc. ACL-IJCNLP 2015.

Yubo Chen, Hang Yang, Kang Liu, Jun Zhao, and Yantao Jia. 2018. Collective event detection via a hierarchical and bias tagging networks with gated multi-level attention mechanisms. In Proc. EMNLP 2018.

Kevin Duarte, Yogesh Rawat, and Mubarak Shah. 2018. VideoCapsuleNet: A simplified network for action detection. In Advances in Neural Information Processing Systems, pages 7610-7619.

Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE++: Improving visual-semantic embeddings with hard negatives. In Proceedings of the British Machine Vision Conference (BMVC).

Charles J Fillmore, Christopher R Johnson, and Miriam RL Petruck. 2003. Background to FrameNet. International Journal of Lexicography, 16(3):235-250.

Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645-6649. IEEE.

Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. 2018. AVA: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6047-6056.

John Hewitt, Daphne Ippolito, Brendan Callahan, Reno Kriz, Derry Tanti Wijaya, and Chris Callison-Burch. 2018. Learning translations via images with a massively multilingual image dataset. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2566-2576.

Yu Hong, Wenxuan Zhou, Jingli Zhang, Guodong Zhou, and Qiaoming Zhu. 2018. Self-regulation: Employing a generative adversarial network to improve event detection. In Proc. ACL 2018.

Ruihong Huang and Ellen Riloff. 2012. Bootstrapped training of event extraction classifiers. In Proc. EACL 2012.

Heng Ji and Ralph Grishman. 2008. Refining event extraction through cross-document inference. In Proceedings of ACL-08: HLT, pages 254-262.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128-3137.

Keizo Kato, Yin Li, and Abhinav Gupta. 2018. Compositional learning for human object interaction. In Proceedings of the European Conference on Computer Vision (ECCV), pages 234-251.

Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, et al. 2018. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982.

Dong Li, Zhaofan Qiu, Qi Dai, Ting Yao, and Tao Mei. 2018. Recurrent tubelet proposal and recognition networks for action detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 303-318.

Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. 2019a. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019b. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.

Qi Li, Heng Ji, and Liang Huang. 2013. Joint event extraction via structured prediction with global features. In Proc. ACL 2013.

Shasha Liao and Ralph Grishman. 2011. Acquiring topic features to improve event extraction: in pre-selected and balanced collections. In Proc. RANLP 2011.

Ying Lin, Liyuan Liu, Heng Ji, Dong Yu, and Jiawei Han. 2019. Reliability-aware dynamic feature composition for name tagging. In Proc. The 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019).

Chunxiao Liu, Zhendong Mao, An-An Liu, Tianzhu Zhang, Bin Wang, and Yongdong Zhang. 2019. Focus your attention: A bidirectional focal attention network for image-text matching. In Proceedings of the 27th ACM International Conference on Multimedia, pages 3-11. ACM.

Xiao Liu, Zhunchen Luo, and Heyan Huang. 2018a. Jointly multiple events extraction via attention-based graph information aggregation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1247-1256.

Xiao Liu, Zhunchen Luo, and Heyan Huang. 2018b. Jointly multiple events extraction via attention-based graph information aggregation. In Proc. EMNLP 2018.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13-23.

Zhigang Ma, Xiaojun Chang, Zhongwen Xu, Nicu Sebe, and Alexander G Hauptmann. 2017. Joint attributes and event analysis for multimedia event detection. IEEE Transactions on Neural Networks and Learning Systems, 29(7):2921-2930.

Arun Mallya and Svetlana Lazebnik. 2017. Recurrent models for situation recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 455-463.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55-60.

George A Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39-41.

Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016. Joint event extraction via recurrent neural networks. In Proc. NAACL-HLT 2016.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

AG Amitha Perera, Sangmin Oh, P Megha, Tianyang Ma, Anthony Hoogs, Arash Vahdat, Kevin Cannons, Greg Mori, Scott McCloskey, Ben Miller, et al. 2012. TRECVID 2012 GENIE: Multimedia event detection and recounting. In TRECVID Workshop. Citeseer.

Wasifur Rahman, Md Kamrul Hasan, Amir Zadeh, Louis-Philippe Morency, and Mohammed Ehsan Hoque. 2019. M-BERT: Injecting multimodal information in the BERT structure. arXiv preprint arXiv:1908.05787.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99.

Nikolaos Sarafianos, Xiang Xu, and Ioannis A. Kakadiaris. 2019. Adversarial representation learning for text-to-image matching. In The IEEE International Conference on Computer Vision (ICCV).

Haoyue Shi, Jiayuan Mao, Tete Xiao, Yuning Jiang, and Jian Sun. 2018. Learning visually-grounded semantics from contrastive adversarial samples. arXiv preprint arXiv:1806.10348.

Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. 2016. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, pages 510-526. Springer.

Carina Silberer and Manfred Pinkal. 2018. Grounding semantic roles in images. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2616-2626.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.

Mitchell Stephens. 1998. The Rise of the Image, The Fall of the Word. New York: Oxford University Press.

Shupeng Su, Zhisheng Zhong, and Chao Zhang. 2019a. Deep joint-semantics reconstructing hashing for large-scale unsupervised cross-modal retrieval. In The IEEE International Conference on Computer Vision (ICCV).

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2019b. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530.

Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. 2019a. Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743.

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019b. VideoBERT: A joint model for video and language representation learning. In The IEEE International Conference on Computer Vision (ICCV).

Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.

Alejandro H Toselli, Verónica Romero, and Enrique Vidal. 2007. Viterbi based alignment between text images and their transcripts. In Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007), pages 9-16.

David Wadden, Ulme Wennberg, Yi Luan, and Hannaneh Hajishirzi. 2019. Entity, relation, and event extraction with contextualized span representations. arXiv preprint arXiv:1909.03546.

Christopher Walker, Stephanie Strassel, Julie Medero, and Kazuaki Maeda. 2006. ACE 2005 multilingual training corpus. Linguistic Data Consortium, Philadelphia, 57.

Chuan Wang, Sameer Pradhan, Xiaoman Pan, Heng Ji, and Nianwen Xue. 2016. CAMR at SemEval-2016 Task 8: An extended transition-based AMR parser. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1173-1178, San Diego, California. Association for Computational Linguistics.

Chuan Wang, Nianwen Xue, and Sameer Pradhan. 2015a. Boosting transition-based AMR parsing with refined actions and auxiliary analyzers. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 857-862, Beijing, China. Association for Computational Linguistics.

Chuan Wang, Nianwen Xue, and Sameer Pradhan. 2015b. A transition-based algorithm for AMR parsing. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 366-375, Denver, Colorado. Association for Computational Linguistics.

Rui Wang, Deyu Zhou, and Yulan He. 2019. Open event extraction from online text using a generative adversarial network. arXiv preprint arXiv:1908.09246.

Rebecka Weegar, Kalle Åström, and Pierre Nugues. 2015. Linking entities across images and text. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 185-193.

Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, and Ross Girshick. 2019a. Long-term feature banks for detailed video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 284-293.

Hank Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, and Wei-Ying Ma. 2019b. UniVSE: Robust visual semantic embeddings via structured semantic representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Sen Yang, Dawei Feng, Linbo Qiao, Zhigang Kan, and Dongsheng Li. 2019. Exploring pre-trained language models for event extraction and generation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 5284-5294.

Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. 2016. Situation recognition: Visual semantic role labeling for image understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5534-5542.

Guangnan Ye, Yitong Li, Hongliang Xu, Dong Liu, and Shih-Fu Chang. 2015. EventNet: A large scale structured concept library for complex event detection in video. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 471-480. ACM.

Tongtao Zhang, Heng Ji, and Avirup Sil. 2019. Joint entity and event extraction with generative adversarial imitation learning. Data Intelligence, 1(2):99-120.

Tongtao Zhang, Spencer Whitehead, Hanwang Zhang, Hongzhi Li, Joseph Ellis, Lifu Huang, Wei Liu, Heng Ji, and Shih-Fu Chang. 2017. Improving event extraction via multimodal integration. In Proceedings of the 25th ACM International Conference on Multimedia, pages 270-278. ACM.

Yifan Zhang, Changsheng Xu, Yong Rui, Jinqiao Wang, and Hanqing Lu. 2007. Semantic event extraction from basketball games using multi-modal analysis. In 2007 IEEE International Conference on Multimedia and Expo