Multi²OIE: Multilingual Open Information Extraction Based on Multi-Head Attention with BERT
Youngbin Ro, Yukyung Lee, Pilsung Kang†
Korea University, Seoul, Republic of Korea
{youngbin ro, yukyung lee, pilsung kang}@korea.ac.kr
† Corresponding author

Abstract
In this paper, we propose Multi²OIE, which performs open information extraction (open IE) by combining BERT (Devlin et al., 2019) with multi-head attention blocks (Vaswani et al., 2017). Our model is a sequence-labeling system with an efficient and effective argument extraction method. We use a query, key, and value setting inspired by the Multimodal Transformer (Tsai et al., 2019) to replace the previously used bidirectional long short-term memory architecture with multi-head attention. Multi²OIE outperforms existing sequence-labeling systems with high computational efficiency on two benchmark evaluation datasets, Re-OIE2016 and CaRB. Additionally, we apply the proposed method to multilingual open IE using multilingual BERT. Experimental results on new benchmark datasets introduced for two languages (Spanish and Portuguese) demonstrate that our model outperforms other multilingual systems without training data for the target languages.
Open information extraction (open IE) (Banko et al., 2007) aims to extract a set of arguments and their corresponding relationship phrases from natural language text. For example, an open IE system could derive the relational tuple (was elected; The Republican candidate; President) from the given sentence
"The Republican candidate was elected President." Because the extractions generated by open IE are considered useful intermediate representations of the source text (Mausam, 2016), this method has been applied to various downstream tasks (Christensen et al., 2013; Ding et al., 2016; Khot et al., 2017; Wu et al., 2018).
Figure 1: Comparison between existing extractors and the proposed method. We use BERT for the feature embedding layers and as a predicate extractor. Predicate information is reflected through multi-head attention instead of simple concatenation.

Although early open IE systems were largely based on handcrafted features or fine-grained rules (Fader et al., 2011; Mausam et al., 2012; Del Corro and Gemulla, 2013), most recent open IE research has focused on deep-neural-network-based supervised learning models. Such systems are typically based on bidirectional long short-term memory (BiLSTM) and are formulated in two categories: sequence labeling (Stanovsky et al., 2018; Sarhan and Spruit, 2019; Jia and Xiang, 2019) and sequence generation (Cui et al., 2018; Sun et al., 2018; Bhutani et al., 2019). The latter enables flexible extraction; however, it is more computationally expensive than the former. Additionally, generation methods are not suitable for non-English text owing to a lack of training data, because they are heavily dependent on in-language supervision (Ponti et al., 2019). Therefore, we adopted the sequence labeling method to maximize scalability by using (multilingual) BERT (Devlin et al., 2019) and multi-head attention (Vaswani et al., 2017). The main advantages of our approach can be summarized as follows:

• Our model can consider rich semantic and contextual relationships between a predicate and other individual tokens in the same text during sequence labeling by adopting a multi-head attention structure. Specifically, we apply multi-head attention with the final hidden states from BERT as a query and the hidden states of the predicate positions as key-value pairs. This method repeatedly reinforces sentence features by learning attention weights across the predicate and each token (Tsai et al., 2019). Figure 1 presents the difference between the existing sequence labeling methods and the proposed method.

• Multi²OIE can operate on multilingual text without non-English training datasets by using BERT's multilingual version. By contrast, for sequence generation systems, performing zero-shot multilingual extraction is much more difficult (Rönnqvist et al., 2019).

• Our model is more computationally efficient than sequence generation systems, because the autoregressive properties of sequence generation create a bottleneck for real-world systems. This is an important issue for downstream tasks that require processing of large corpora.

Experimental results on two English benchmark datasets, Re-OIE2016 (Zhan and Zhao, 2020) and CaRB (Bhardwaj et al., 2019), show that our model yields the best performance among the available sequence-labeling systems. Additionally, it is demonstrated that the computational efficiency of Multi²OIE is far greater than that of sequence generation systems. For a multilingual experiment, we introduce multilingual open IE benchmarks (Spanish and Portuguese) constructed by translating and re-annotating the Re-OIE2016 dataset. Experimental results demonstrate that the proposed Multi²OIE outperforms other multilingual systems without additional training data for non-English languages. To the best of our knowledge, ours is the first approach using BERT for multilingual open IE (although CrossOIE (Cabral et al., 2020) considered multilingual BERT, it was used only to validate extracted results, not to extract the tuples). The code and related resources can be found at https://github.com/youngbin-ro/Multi2OIE.
In sequence labeling open IE systems, when extracting arguments for a specific predicate, predicate-related features are used as input variables (Stanovsky et al., 2018; Zhan and Zhao, 2020; Jia and Xiang, 2019).
We analyzed this extraction process from the perspective of multimodal learning (Mangai et al., 2010; Ngiam et al., 2011; Baltrusaitis et al., 2019), treating the entire sequence and the corresponding predicate information as separate modalities. The most frequently used method in open IE is simple concatenation (Figure 1, left), which can be interpreted as an early fusion approach. Simple concatenation has low computational complexity but requires intensive feature engineering, and it is highly reliant on the choice of classifier (Ergun et al., 2016; Liu et al., 2018). Instead, we propose the use of a multimodal attention mechanism (Tsai et al., 2019) to capture the complicated relationships between predicates and other tokens. In this method, multi-head attention is computed using the target modality as a query and the source modality as key-value pairs to adapt the latent information from source to target, which allows our model to assign greater weights to meaningful interactions between modalities. Accordingly, Multi²OIE uses multi-head attention to reflect predicate information (source modality) throughout a sequence (target modality). We expect this module to transform a general sentence embedding into a feature suitable for extracting the arguments associated with a specific predicate.
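To make the query/key/value setting concrete, the following minimal PyTorch sketch (ours, not the authors' released code; all sizes are placeholders) lets the whole hidden sequence attend to the predicate-position states:

```python
import torch
import torch.nn as nn

# Illustrative sizes only: BERT-base hidden size and 8 heads.
d_model, n_heads, seq_len = 768, 8, 12
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

hidden = torch.randn(1, seq_len, d_model)          # BERT hidden states of the sentence
pred_mask = torch.zeros(seq_len, dtype=torch.bool)
pred_mask[3:5] = True                               # tokens belonging to the predicate

query = hidden                                      # target modality: the whole sequence
key = value = hidden[:, pred_mask, :]               # source modality: predicate-position states
fused, _ = attn(query, key, value)                  # fused.shape == (1, seq_len, d_model)
```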
Despite the increasing amount of available web text in languages other than English, most open IE approaches have focused on the English language. For non-English languages, most systems are heavily reliant on handcrafted features and rules, resulting in limited performance (Zhila and Gelbukh, 2014; de Oliveira and Claro, 2019; Wang et al., 2019; Guarasci et al., 2020). Although some studies have demonstrated the potential of multilingual open IE (Faruqui and Kumar, 2015; Gamallo and Garcia, 2015; White et al., 2016), most approaches are based on shallow patterns, resulting in low precision (Claro et al., 2019). Therefore, we introduce a multilingual-BERT-based open IE system. BERT provides language-agnostic embeddings through its multilingual version and provides excellent zero-shot performance on many classification and labeling tasks (Pires et al., 2019; Wu and Dredze, 2019; Karthikeyan et al., 2020). In Section 5, we demonstrate that our multilingual system yields acceptable performance when it is trained using only an English dataset.
Figure 2: Architecture of Multi²OIE, illustrated with the sentence "The man was born in 1960" (predicate: was born; Argument0: The man; Argument1: in 1960). After predicates are extracted using the hidden states of BERT, the hidden sequence, the average vector of the predicate hidden states, and a position embedding are concatenated and used as inputs to the multi-head attention blocks for argument extraction.
Multi²OIE extracts relational tuples from a given sentence in two steps. The first step is to find all predicates in the sentence. The second step is to extract the arguments associated with each identified predicate. The architecture of the proposed model is presented in Figure 2.
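The two-step flow can be sketched as follows; predicate_tagger and argument_tagger are hypothetical stand-ins for the two models described in the remainder of this section, and the BIO-span decoding is our own illustration:

```python
def bio_spans(tags, begin, inside):
    """Collect (start, end) token spans from a list of BIO labels."""
    spans, start = [], None
    for i, tag in enumerate(list(tags) + [None]):      # sentinel closes a trailing span
        if tag == begin:
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag != inside and start is not None:
            spans.append((start, i))
            start = None
    return spans

def extract_tuples(tokens, predicate_tagger, argument_tagger):
    """Step 1: tag predicates; step 2: tag arguments once per predicate found."""
    pred_tags = predicate_tagger(tokens)               # e.g. ["P-O", "P-B", "P-I", "P-O", ...]
    tuples = []
    for pred_span in bio_spans(pred_tags, "P-B", "P-I"):
        arg_tags = argument_tagger(tokens, pred_span)
        args = {f"ARG{k}": bio_spans(arg_tags, f"A{k}-B", f"A{k}-I") for k in range(4)}
        tuples.append({"predicate": pred_span, "arguments": args})
    return tuples
```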
Let S = (w_1, w_2, ..., w_l) be an input sentence, where w_i is the i-th token and l is the sequence length. The objective of the proposed model f is to find a set of tags T = (t_1, t_2, ..., t_l), where each element of T is one of the "beginning, inside, outside" (BIO) tags (Ramshaw and Marcus, 1995). However, unlike the method proposed by Stanovsky et al. (2018), which uses a predicate head as an input and predicts all tags simultaneously, we first predict a predicate tagset T^pred = (t^p_1, t^p_2, ..., t^p_l) using a predicate model f_pred. An argument tagset T^arg = (t^a_1, t^a_2, ..., t^a_l) is then predicted by f_arg based on S and the predicted tagset T̂^pred. Therefore, our model maximizes the following log-likelihood:

\sum_{i=1}^{l} \left( \log p(t_i^{p} \mid S; \theta_{pred}) + \log p(t_i^{a} \mid \hat{T}^{pred}, S; \theta_{pred}, \theta_{arg}) \right),   (1)

where θ_pred and θ_arg are the trainable parameters of f_pred and f_arg, respectively. In this formulation, f_pred contributes to extracting not only the predicates but also the arguments, because the loss and gradients derived from argument extraction are propagated not only to θ_arg but also to θ_pred.

Additionally, we treat open IE as an n-ary extraction task and consider BIO tags for arguments up to ARG3. We refer readers to Stanovsky et al. (2018) for a more detailed explanation of the BIO sequence-labeling policy.

We assume that a given sentence S is tokenized by SentencePiece (Kudo and Richardson, 2018). BERT embeds and encodes S through multiple layers, and the final hidden states are defined as H ∈ R^{l×d}, where d is the hidden state size of BERT. H is then fed into a feed-forward network and a softmax layer to calculate the probability of each token being classified into each predicate tag. The predicted tagset T̂^pred is obtained by applying the argmax operation to the softmax outputs. Finally, the loss for predicate extraction, denoted L_pred, is calculated as the per-token cross-entropy loss.
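A rough sketch of the predicate extractor described above (our illustration, assuming BERT-base hidden size 768 and three predicate tags P-B / P-I / P-O):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredicateTagger(nn.Module):
    """Linear head over BERT hidden states producing per-token predicate-tag logits."""
    def __init__(self, hidden_size=768, n_tags=3):     # P-B, P-I, P-O
        super().__init__()
        self.classifier = nn.Linear(hidden_size, n_tags)

    def forward(self, bert_hidden):                    # (batch, l, d)
        return self.classifier(bert_hidden)            # per-token logits

tagger = PredicateTagger()
H = torch.randn(2, 16, 768)                            # stand-in for BERT's final hidden states
gold = torch.randint(0, 3, (2, 16))                    # gold predicate BIO tags
logits = tagger(H)
loss_pred = F.cross_entropy(logits.view(-1, 3), gold.view(-1))   # per-token cross-entropy, L_pred
pred_tags = logits.argmax(dim=-1)                      # predicted tagset via argmax over softmax
```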
Figure 3: Multi-head attention blocks for argument extraction. The architecture consists of N blocks, and the output of the final block, Y^[N], is used as the input for the argument classifier.

A sentence contains one or more predicates. The argument extraction method described in this section targets only one predicate; the process is simply repeated for multiple predicates.

Input representation
The inputs for argument extraction are concatenations of the following three features: H, H̄_pred, and E_pos. The first feature is the final hidden-state sequence of BERT, as discussed in Section 3.2. The second feature is the arithmetic mean vector of the hidden states at the predicate positions; we duplicate this vector to match the sequence length l and define it as H̄_pred ∈ R^{l×d}. We refer to the true tagset T^pred to find the indices of the predicates instead of using the predicted tagset T̂^pred, to achieve more stable training (Williams and Zipser, 1989). The final feature, E_pos, is a position embedding of binary values indicating whether each token is included in the predicate span. We then concatenate these three features to obtain the input X ∈ R^{l×d_mh}, where d_mh = 2·d + d_pos is the dimension of the multi-head attention and d_pos is the dimension of the position embedding E_pos.

Following concatenation, X is divided into a query and key-value pairs. We use X itself as the query, denoted X_q (target sequence). The key-value pairs, denoted X_k and X_v (source sequence), are subsets of X taken from the predicate positions.
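The input construction can be sketched as follows (our illustration; d = 768 assumes BERT-base, and d_pos = 64 follows the hyperparameters reported in the experiments):

```python
import torch
import torch.nn as nn

l, d, d_pos = 16, 768, 64                              # sequence length, BERT size, position dim
pos_embedding = nn.Embedding(2, d_pos)                 # binary: predicate vs. non-predicate

H = torch.randn(1, l, d)                               # BERT hidden sequence
pred_pos = torch.zeros(1, l, dtype=torch.long)
pred_pos[0, 3:5] = 1                                   # gold predicate positions (stable training)

pred_mean = H[0, pred_pos[0].bool()].mean(dim=0)       # average of predicate hidden states
H_pred = pred_mean.expand(1, l, d)                     # duplicated to the sequence length
E_pos = pos_embedding(pred_pos)                        # (1, l, d_pos)

X = torch.cat([H, H_pred, E_pos], dim=-1)              # (1, l, d_mh), d_mh = 2*d + d_pos = 1600
X_q = X                                                # query: the full sequence
X_k = X_v = X[:, pred_pos[0].bool(), :]                # key/value: predicate positions only
```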
The argument ex-tractor consists of N multi-head attention blocks,each of which has a multi-head attention layer fol-lowed by a position-wise feed-forward layer, as shown in Figure 3.The attention layer is the same as the encoder-decoder attention layer in the original transformer(Vaswani et al., 2017). It first transforms X q , X k ,and X v into Q = X q W q , K = X k W k , and V = X v W v , respectively, where W q , W k , and W v areweight matrices with dimensions of ( d mh × d mh ).Following transformation, the computation of at-tention is performed for each head as follows: Z h = Softmax ( Q h K Th √ d h ) V h . (2)Each head is indexed by h and has dimensionsof d h = d mh n h , where n h denotes the number ofheads. The attention outputs for each head are thenconcatenated and linearly transformed. In addition,we apply residual connections (He et al., 2016) andlayer normalization (Ba et al., 2016) based on theresults of prior works on transformers.The position-wise feed-forward layer consistsof two linear transformations surrounding a ReLUactivation function. Residual connections and layernormalization are also applied in this layer. Finally,the output of the final multi-head attention blockis fed into the argument classifier. The process forobtaining a predicted argument tagset ˆ T arg and cor-responding argument loss L arg is the same as thatdescribed in Section 3.2. The final loss for parame-ter updating is the summation of L pred and L arg . In open IE, confidence scores can help control theprecision-recall tradeoff of a system. Multi OIEprovides a confidence score for every extraction byadding the predicate score and all argument scores,as suggested in Zhan and Zhao (2020). The score ofthe predicate and each argument is obtained fromthe probability value of the
Beginning tag. CS = p ( P-B ) + (cid:88) i =0 p ( A i -B ) , (3)where the probability values are given by the soft-max layer in each extraction step. For fair comparisons with other sys-tems, we trained our model using the same dataset plit Dataset
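A sketch of one attention block and the stack of N = 4 blocks with 8 heads (matching the hyperparameters reported later in the experiments); the feed-forward width and the nine-tag argument classifier are our assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """One block: multi-head attention (Eq. 2) + position-wise FFN,
    each followed by a residual connection and layer normalization."""
    def __init__(self, d_mh=1600, n_heads=8, d_ff=2048, dropout=0.2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_mh, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_mh, d_ff), nn.ReLU(), nn.Linear(d_ff, d_mh))
        self.norm1 = nn.LayerNorm(d_mh)
        self.norm2 = nn.LayerNorm(d_mh)

    def forward(self, x_q, x_k, x_v):
        attn_out, _ = self.attn(x_q, x_k, x_v)         # scaled dot-product attention per head
        y = self.norm1(x_q + attn_out)                 # residual + layer norm
        return self.norm2(y + self.ff(y))              # position-wise FFN, residual + layer norm

# The query of block i is the output of block i-1; key/value stay fixed at the predicate positions.
blocks = nn.ModuleList([AttentionBlock() for _ in range(4)])
x_q, x_kv = torch.randn(1, 16, 1600), torch.randn(1, 2, 1600)
y = x_q
for block in blocks:
    y = block(y, x_kv, x_kv)
arg_logits = nn.Linear(1600, 9)(y)                     # argument classifier (9 BIO tags assumed)
```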
For fair comparisons with other systems, we trained our model using the same dataset used by Zhan and Zhao (2020). This dataset was bootstrapped from extractions of OpenIE4 (Mausam, 2016). For testing data, we used Re-OIE2016 (Zhan and Zhao, 2020) and CaRB (Bhardwaj et al., 2019), which were generated via human annotation based on the sentences in the OIE2016 dataset (Stanovsky and Dagan, 2016). Table 1 lists the details of the datasets used in this study. (Dataset resources: https://github.com/zhanjunlang/Span_OIE, https://github.com/gabrielStanovsky/oie-benchmark, https://github.com/dair-iitd/CaRB.)

Split | Dataset     | Sentences | Tuples
Train | OpenIE4     | 1,109,411 | 2,175,294
Dev   | OIE2016-dev | 582       | 1,671
Dev   | CaRB-dev    | 641       | 2,548
Test  | Re-OIE2016  | 595       | 1,508
Test  | CaRB-test   | 641       | 2,715

Table 1: Numbers of sentences and tuples in each dataset used in this study.

Evaluation metrics
We evaluated each system using the area under the curve (AUC) and F1-score (F1). AUC is calculated from a plot of the precision and recall values over all potential cutoffs. The F1-score is the maximum value among the precision-recall pairs. We used the evaluation code provided with each test set, which contains the following matching functions: lexical match for Re-OIE2016 and tuple match for CaRB. Although the former only considers the existence of words within extractions, the latter is stricter in that it penalizes long extractions (Bhardwaj et al., 2019).
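As a rough illustration of the two metrics (the benchmarks ship their own scorers, which also implement the matching functions above), assuming precision/recall arrays obtained by sweeping the confidence cutoff:

```python
import numpy as np

def pr_auc_and_best_f1(precisions, recalls):
    """AUC of the precision-recall curve over all confidence cutoffs, and the
    maximum F1 among the precision-recall pairs."""
    order = np.argsort(recalls)
    p, r = np.asarray(precisions, float)[order], np.asarray(recalls, float)[order]
    auc = np.trapz(p, r)                               # area under the P-R curve
    f1 = np.max(2 * p * r / np.clip(p + r, 1e-12, None))
    return auc, f1

auc, f1 = pr_auc_and_best_f1([0.9, 0.8, 0.6], [0.2, 0.5, 0.7])
```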
Hyperparameters

Model hyperparameters were tuned by performing a grid search. We trained the model for one epoch with an initial learning rate of 3e-5. The model contains four multi-head attention blocks with eight attention heads and a 64-dimensional position-embedding layer. The batch size was set to 128, and the dropout rates for the argument classifier and the attention blocks were both set to 0.2. AdamW (Loshchilov and Hutter, 2019) was used as the optimizer, in combination with training heuristics such as learning-rate warmup (Goyal et al., 2017) and gradient clipping (Pascanu et al., 2013).

Method      | Labeling       | f_pred | f_arg
BIO         | BIO tagging    | BiLSTM | BiLSTM
BIO+MH      | BIO tagging    | BiLSTM | MH
SpanOIE     | Span selection | BiLSTM | BiLSTM
SpanOIE+MH  | Span selection | BiLSTM | MH
BERT+BiLSTM | BIO tagging    | BERT   | BiLSTM
Multi²OIE   | BIO tagging    | BERT   | MH

Table 2: Baseline models with different settings.
As baseline models, we selected RnnOIE (Stanovsky et al., 2018), SpanOIE (Zhan and Zhao, 2020), and a few custom systems designed to evaluate the validity of the multi-head attention blocks (MH). Although these are all sequence-labeling systems, note that SpanOIE uses the span selection method rather than BIO tagging. Table 2 presents a summary of the main baselines used in this study. We also report the results of the following systems developed prior to the use of neural networks: Stanford (Angeli et al., 2015), OLLIE (Mausam et al., 2012), PropS (Stanovsky et al., 2016), ClausIE (Del Corro and Gemulla, 2013), and OpenIE4. For these systems, the results are taken from previous studies (Zhan and Zhao, 2020; Bhardwaj et al., 2019).
The performance results for each system on the Re-OIE2016 and CaRB test data are presented in Table 3, and the precision-recall curves are presented in Figure 4. We also present extraction examples from Multi²OIE and SpanOIE in Table 4.
Overall performance
Our model outperforms the other systems on all datasets and metrics. It yields average improvements of approximately 6.9%p and 2.9%p in terms of F1 on the Re-OIE2016 and CaRB datasets, respectively, compared to the state-of-the-art system (SpanOIE). Similar to previous studies (Stanovsky et al., 2018; Zhan and Zhao, 2020), the excellent performance of Multi²OIE is attributed to improved recall. As shown in Table 3, our method achieves the highest recall rate on both datasets. The examples in Table 4 also demonstrate that our model can extract more tuples from the same sentence: an additional tuple (debut; the newly solvent airline; its new image) is found by Multi²OIE but not by SpanOIE. Additionally, Multi²OIE extracts the place information "At a ... hangar" for the first tuple, which is omitted by SpanOIE.

Figure 4: Precision-recall curves for each open IE system on the two test datasets: (a) Re-OIE2016, (b) CaRB.

System           | Re-OIE2016 (AUC / F1 / PREC. / REC.) | CaRB (AUC / F1 / PREC. / REC.)
Stanford         | 11.5 / 16.7 / -    / -               | 13.4 / 23.0 / -    / -
OLLIE            | 31.3 / 49.5 / -    / -               | 22.4 / 41.1 / -    / -
PropS            | 43.3 / 64.2 / -    / -               | 12.6 / 31.9 / -    / -
ClausIE          | 46.4 / 64.2 / -    / -               | 22.4 / 44.9 / -    / -
OpenIE4          | 50.9 / 68.3 / -    / -               | 27.2 / 48.8 / -    / -
RnnOIE           | 68.3 / 78.7 / 84.2 / 73.9            | 26.8 / 46.7 / 55.6 / 40.2
BIO              | 71.9 / 80.3 / 84.1 / 76.8            | 27.7 / 46.6 / 55.1 / 40.4
BIO+MH           | 71.3 / 81.5                          |
Multi²OIE (ours) |                                      |

Table 3: Performance of Multi²OIE and baseline systems on the Re-OIE2016 and CaRB datasets.
Effects of multi-head attention
We compared three pairs of methods to determine the validity of the multi-head attention blocks: (BIO and BIO+MH), (SpanOIE and SpanOIE+MH), and (BERT+BiLSTM and Multi²OIE). As a result, except for BIO+MH yielding a lower AUC than BIO, the models with multi-head attention achieve higher performance than the BiLSTM-based models. This performance improvement is consistent regardless of the choice of classification method (BIO tagging or span selection). These results suggest that the use of multi-head attention is superior to simple concatenation in terms of utilizing predicate information.

Additionally, the performance improvement from using MH is greater with BERT than with BiLSTM. The average performance improvements from BIO to BIO+MH are -0.5%p (AUC) and 1.1%p (F1), whereas the improvements from BERT+BiLSTM to Multi²OIE are 2.3%p (AUC) and 2.2%p (F1). This indicates that Multi²OIE has a model architecture that can create synergies between the predicate and argument extractors.
Computational cost
We measured the training and inference times of each system to evaluate computational efficiency. As an additional baseline model, we considered a recently published sequence generation system called IMoJIE (Kolluru et al., 2020), which achieved state-of-the-art performance on the CaRB dataset using sequential decoding of tuples conditioned on previous extractions. For calculating inference times, we selected 641 sentences from the CaRB test set and executed the models on a single TITAN RTX GPU.

Table 5 reveals that Multi²OIE has much greater efficiency than IMoJIE. Our model requires only 15.5 s to process the 641 sentences, whereas IMoJIE requires more than 3 min, a difference of approximately 14 times. This bottleneck of IMoJIE could be a drawback for downstream tasks, such as knowledge base construction, which must work with large amounts of text. Considering that the performance difference between the two models is only approximately 1%p (IMoJIE achieved an AUC of 33.3 and an F1 of 53.5 on the CaRB dataset), it may be reasonable to use Multi²OIE to process large-scale corpora. Multi²OIE also exhibits competitive computational costs compared to the other sequence-labeling systems: our model has training times similar to BERT+BiLSTM but is faster at inference, demonstrating that MH has a positive effect on both efficiency and performance. In the case of SpanOIE, its span selection method creates bottlenecks for both training and inference.

Sentence  | At a presentation in the Toronto Pearson International Airport hangar, Celine Dion helped the newly solvent airline debut its new image.
SpanOIE   | (helped; Celine Dion; the newly solvent airline debut its new image)
Multi²OIE | (helped; Celine Dion; the newly solvent airline debut its new image; At a presentation in the Toronto Pearson International Airport hangar)
          | (debut; the newly solvent airline; its new image)

Table 4: Extraction examples from Multi²OIE and SpanOIE. The sentences are from the CaRB testing set.

System      | Training | Inference | Sec./Sent.
BERT+BiLSTM |          |           |
SpanOIE     |          |           |
IMoJIE      |          |           |
Multi²OIE   |          |           |

Table 5: Training and inference times of each system.
As mentioned in Section 2.2, we trained a multilingual version of Multi²OIE using multilingual BERT and the same training dataset as the English version. We assumed that data for non-English languages were not available and tested the model's zero-shot performance. Evaluations were conducted using a dataset generated based on the Re-OIE2016 dataset; we chose Re-OIE2016 because the CaRB dataset was originally created not to label sequences but to generate them.

Version    | AUC | F1 | PREC. | REC.
EN version |     |    |       |
MT version |     |    |       |

Table 6: Comparison between the English (EN) and multilingual (MT) versions of our model on the CaRB dataset.
Considering the availability of baseline systems, we selected Spanish and Portuguese as the evaluation dataset languages. First, all sentences, predicates, and arguments from the Re-OIE2016 dataset were translated into the target languages using Google's translation service (https://cloud.google.com/translate/). To prevent adverse effects from translation errors, we modified the translated sentences to make sure that the back-translated sentences have the same meaning as the original sentences. After the translation and modification, we manually re-annotated all tuples of the target languages based on the English annotation of Re-OIE2016.

Evaluation metrics
Because the baseline systems are binary extractors and do not provide confidence scores, we report binary extraction performance without AUC values. Additionally, although the introduced dataset was generated based on Re-OIE2016, each system was tested using CaRB's evaluation code for a more rigorous evaluation.
Baselines
Our baseline models were two rule-based multilingual systems: ArgOE (Gamallo and Garcia, 2015) and PredPatt (White et al., 2016). The former takes dependency parses in the CoNLL-X format as inputs; similarly, the latter uses language-agnostic patterns of UD structures (https://universaldependencies.org/).

Sentence   | When the explosion tore through the hut, Stauffenberg was convinced that no one in the room could have survived.
English    | (tore; the explosion; through the hut), (was convinced; Stauffenberg; that no one in the room could have survived), (could have survived; no one in the room)
Spanish    | (desgarró; la explosión; a través de la cabaña), (estaba convencido; Stauffenberg; de que nadie en la habitación podría haber sobrevivido), (podría haber sobrevivido; nadie en la habitación)
Portuguese | (rasgou; a explosão; através da cabana), (estava convencido; Stauffenberg; de que ninguém na sala poderia ter sobrevivido), (poderia ter sobrevivido; ninguém na sala)

Table 7: Extraction examples from Multi²OIE for each language.

Lang. | System    | F1 | PREC. | REC.
EN    | ArgOE     |    |       |
EN    | PredPatt  |    |       |
EN    | Multi²OIE |    |       |
ES    | ArgOE     |    |       |
ES    | PredPatt  |    |       |
ES    | Multi²OIE |    |       |
PT    | ArgOE     |    |       |
PT    | PredPatt  |    |       |
PT    | Multi²OIE |    |       |

Table 8: Binary extraction performance without confidence scores on the multilingual Re-OIE2016 dataset.

Prior to comparing the multilingual systems, we evaluated whether Multi²OIE's multilingual version exhibited satisfactory performance for English compared with the English-only version. Table 6 lists the performance metrics for the English and multilingual versions of our model on the CaRB dataset; the performance of the English version is copied from Table 3. Although the multilingual version yields lower performance on both metrics compared with the English version, the F1 score is comparable and the recall is higher. Furthermore, the multilingual version still outperforms the other sequence-labeling systems, indicating that multilingual BERT can successfully be used to construct a Multi²OIE model with favorable performance.
Multilingual performance
Table 8 lists the performance metrics for each system on the multilingual dataset, and Table 7 contains examples of Multi²OIE's extraction results for each language. One can see that Multi²OIE outperforms the other systems in all languages. Similar to the results in Section 4.3, the superiority of our multilingual model is attributed to its high recall: Multi²OIE yields the highest recall for all languages, by a margin of approximately 20%p. In contrast, ArgOE has relatively high precision, but its low recall negatively impacts its F1 score. PredPatt provides the best balance of precision and recall, but its overall performance is lower than that of our model.

The performance differences between languages are similar for all models: all models exhibit the best performance for English, followed by Spanish and Portuguese. Multi²OIE also exhibits performance degradation for non-English languages. However, considering that our model was never trained to perform open IE on Spanish or Portuguese, its performance is remarkable. For some non-English sentences, our model extracts the same results as in the English extraction, as shown in Table 7. This agrees with the results of previous studies (Pires et al., 2019; Wu and Dredze, 2019; Karthikeyan et al., 2020), which have demonstrated the excellent cross-lingual abilities of multilingual BERT. Based on these results, we expect that Multi²OIE will also work well on languages other than those considered in this study.
In this paper, we propose Multi²OIE, which exploits BERT and multi-head attention for the open IE task. Multi-head attention has the advantage of fusing sentence and predicate features, which adequately reflects predicate information throughout a sentence. Our model achieved the best performance among sequence labeling models. Multi²OIE also exhibited superior computational efficiency with competitive performance compared to the state-of-the-art sequence generation systems. Additionally, a Multi²OIE model trained using multilingual BERT outperformed the baseline models without training on any non-English languages.

However, some types of extractions, such as nominal relations, conjunctions in arguments, and contextual information, are not considered in Multi²OIE. Future work could investigate how to apply Multi²OIE to these cases. For multilingual open IE, performance evaluations and further study on non-alphabetic languages that were not considered in this study can be conducted.
References
Gabor Angeli, Melvin Jose Johnson Premkumar, and Christopher D. Manning. 2015. Leveraging linguistic structure for open domain information extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 344–354, Beijing, China. Association for Computational Linguistics.

Jimmy Ba, Jamie Ryan Kiros, and Geoffrey Hinton. 2016. Layer normalization. arXiv:1607.06450.

Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2019. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):423–443.

Michele Banko, Michael John Cafarella, Stephen Soderland, Matt Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pages 2670–2676, San Francisco, CA, USA.

Sangnie Bhardwaj, Samarth Aggarwal, and Mausam Mausam. 2019. CaRB: A crowdsourced benchmark for open IE. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6262–6267, Hong Kong, China. Association for Computational Linguistics.

Nikita Bhutani, Yoshihiko Suhara, Wang-Chiew Tan, Alon Halevy, and Hosagrahar Visvesvaraya Jagadish. 2019. Open information extraction from question-answer pairs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2294–2305, Minneapolis, Minnesota. Association for Computational Linguistics.

Bruno Cabral, Rafael Glauber, Marlo Souza, and Daniela Claro. 2020. CrossOIE: Cross-lingual classifier for open information extraction. In Computational Processing of the Portuguese Language, pages 368–378.

Janara Christensen, Mausam, Stephen Soderland, and Oren Etzioni. 2013. Towards coherent multi-document summarization. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1163–1173, Atlanta, Georgia. Association for Computational Linguistics.

Daniela Barreiro Claro, Marlo Souza, Clarissa Castellã Xavier, and Leandro Oliveira. 2019. Multilingual open information extraction: Challenges and opportunities. Information, 10(7):228.

Lei Cui, Furu Wei, and Ming Zhou. 2018. Neural open information extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 407–413, Melbourne, Australia. Association for Computational Linguistics.

Luciano Del Corro and Rainer Gemulla. 2013. ClausIE: Clause-based open information extraction. In Proceedings of the 22nd International Conference on World Wide Web, WWW '13, pages 355–366, New York, NY, USA. Association for Computing Machinery.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. 2016. Knowledge-driven event embedding for stock prediction. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2133–2142.

Hilal Ergun, Yusuf Caglar Akyuz, Mustafa Sert, and Jianquan Liu. 2016. Early and late level fusion of deep convolutional neural networks for visual concept recognition. International Journal of Semantic Computing, 10(03):379–397.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1535–1545, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Manaal Faruqui and Shankar Kumar. 2015. Multilingual open relation extraction using cross-lingual projection. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1351–1356, Denver, Colorado. Association for Computational Linguistics.

Pablo Gamallo and Marcos Garcia. 2015. Multilingual open information extraction. In Progress in Artificial Intelligence (EPIA 2015), pages 711–722.

Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677.

Raffaele Guarasci, Emanuele Damiano, Aniello Minutolo, Massimo Esposito, and Giuseppe De Pietro. 2020. Lexicon-grammar based open information extraction from natural language sentences in Italian. Expert Systems with Applications, 143:112954.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.

Shengbin Jia and Yang Xiang. 2019. Hybrid neural tagging model for open relation extraction. arXiv:1908.01761.

Kaliyaperumal Karthikeyan, Zihan Wang, Stephen Mayhew, and Dan Roth. 2020. Cross-lingual ability of multilingual BERT: An empirical study. In Proceedings of the International Conference on Learning Representations (ICLR).

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2017. Answering complex questions using open information extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 311–316, Vancouver, Canada. Association for Computational Linguistics.

Keshav Kolluru, Samarth Aggarwal, Vipul Rathore, Mausam Mausam, and Soumen Chakrabarti. 2020. IMoJIE: Iterative memory-based joint open information extraction. In The 58th Annual Meeting of the Association for Computational Linguistics (ACL), Seattle, U.S.A. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Kuan Liu, Yanen Li, Ning Xu, and Premkumar Natarajan. 2018. Learn to combine modalities in multimodal deep learning. arXiv:1805.11730.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations (ICLR).

Utthara Gosa Mangai, Suranjana Samanta, Sukhendu Das, and Pinaki Roy Chowdhury. 2010. A survey of decision fusion and feature fusion strategies for pattern classification. IETE Technical Review, 27:293–307.

Mausam. 2016. Open information extraction systems and downstream applications. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, pages 4074–4077.

Mausam, Michael Schmitz, Stephen Soderland, Robert Bart, and Oren Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 523–534, Jeju Island, Korea. Association for Computational Linguistics.

Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Ng. 2011. Multimodal deep learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML'11, pages 689–696, Madison, WI, USA.

Leandro Souza de Oliveira and Daniela Barreiro Claro. 2019. DptOIE: A Portuguese open information extraction system based on dependency analysis. Computer Speech and Language, under review.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML'13, pages III–1310–III–1318.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Edoardo Maria Ponti, Ivan Vulić, Ryan Cotterell, Roi Reichart, and Anna Korhonen. 2019. Towards zero-shot language modeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2900–2910, Hong Kong, China. Association for Computational Linguistics.

Lance Ramshaw and Mitchell Marcus. 1995. Text chunking using transformation-based learning. In Proceedings of the Third ACL Workshop on Very Large Corpora, pages 82–94.

Samuel Rönnqvist, Jenna Kanerva, Tapio Salakoski, and Filip Ginter. 2019. Is multilingual BERT fluent in language generation? In Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing, pages 29–36, Turku, Finland.

Injy Sarhan and Marco Spruit. 2019. Contextualized word embeddings in a neural open information extraction model. In Natural Language Processing and Information Systems, pages 359–367.

Gabriel Stanovsky and Ido Dagan. 2016. Creating a large benchmark for open information extraction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2300–2305, Austin, Texas. Association for Computational Linguistics.

Gabriel Stanovsky, Jessica Ficler, Ido Dagan, and Yoav Goldberg. 2016. Getting more out of syntax with PropS. arXiv:1603.01648.

Gabriel Stanovsky, Julian Michael, Luke Zettlemoyer, and Ido Dagan. 2018. Supervised open information extraction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 885–895, New Orleans, Louisiana. Association for Computational Linguistics.

Mingming Sun, Xu Li, Xin Wang, Miao Fan, Yue Feng, and Ping Li. 2018. Logician: A unified end-to-end neural approach for open-domain information extraction. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM '18, pages 556–564, New York, NY, USA. Association for Computing Machinery.

Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6558–6569, Florence, Italy. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Chengyu Wang, Xiaofeng He, and Aoying Zhou. 2019. Open relation extraction for Chinese noun phrases. IEEE Transactions on Knowledge and Data Engineering, PP:1–1.

Aaron Steven White, Drew Reisinger, Keisuke Sakaguchi, Tim Vieira, Sheng Zhang, Rachel Rudinger, Kyle Rawlins, and Benjamin Van Durme. 2016. Universal Decompositional Semantics on Universal Dependencies. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1713–1723, Austin, Texas. Association for Computational Linguistics.

Ronald Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280.

Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833–844, Hong Kong, China. Association for Computational Linguistics.

Tien-Hsuan Wu, Zhiyong Wu, Ben Kao, and Pengcheng Yin. 2018. Towards practical open knowledge base canonicalization. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM '18, pages 883–892, New York, NY, USA. Association for Computing Machinery.

Junlang Zhan and Hai Zhao. 2020. Span model for open information extraction on accurate corpus. Proceedings of the AAAI Conference on Artificial Intelligence, 34:9523–9530.

Alisa Zhila and Alexander Gelbukh. 2014. Open information extraction for Spanish language based on syntactic constraints. In