Tag and Correct: Question aware Open Information Extraction with Two-stage Decoding
Martin Kuo, Yaobo Liang, Lei Ji, Nan Duan, Linjun Shou, Ming Gong, Peng Chen
Microsoft, Beijing, China
{v-machuo, yalia, leiji, nanduan, lisho, migon, peche}@microsoft.com

ABSTRACT
Question Aware Open Information Extraction (Question aware Open IE) takes a question and a passage as inputs and outputs an answer tuple which contains a subject, a predicate, and one or more arguments. Each field of the answer is a natural language word sequence extracted from the passage. Compared with a span answer, the semi-structured answer has two advantages: it is more readable and more easily falsifiable. There are two existing approaches to this problem. One is the extractive method, which extracts candidate answers from the passage with an Open IE model and ranks them by matching against the question. It fully uses the passage information at the extraction step, but the extraction is independent of the question. The other is the generative method, which uses a sequence-to-sequence model to generate answers directly. It combines the question and passage as input at the same time, but it generates the answer from scratch and does not exploit the fact that most of the answer words come from the passage. To let the passage guide generation, we present a two-stage decoding model which contains a tagging decoder and a correction decoder. At the first stage, the tagging decoder tags keywords in the passage. At the second stage, the correction decoder generates the answer based on the tagged keywords. Although it has two stages, our model can be trained end-to-end. Compared to previous generative models, we generate better answers by generating coarse to fine. We evaluate our model on WebAssertions [13], a Question aware Open IE dataset. Our model achieves a BLEU score of 59.32, which is better than previous generative methods.

Keywords: natural language processing, question answering, open information extraction

Introduction

A Question aware Open Information Extraction (Question aware Open IE) system takes a question and a passage as inputs and extracts from the passage a semi-structured answer in tuple format which can answer the question. Question aware Open IE is both an Open IE task and a question answering task. From an Open IE view, an Open IE system extracts all possible tuples; for example, in Table 1, the Open IE system aims to extract four answer tuples from the passage, independent of any question. A Question aware Open IE system only extracts the one answer tuple which can answer the question. From a question answering view, the answer of a search engine is a passage; the answer for Machine Reading Comprehension tasks like SQuAD [20], TriviaQA [21] and NewsQA [22] is a span from the passage; the answer of MS MARCO [24] is a generated sentence. Different from these, the answer for Question aware Open IE is a semi-structured tuple which is shorter than a passage and longer than a span, and each part of the tuple has a semantic role, which makes it easier for downstream tasks to use.

The current solutions for Question aware Open IE follow two approaches, the extractive method and the generative method. The extractive method extracts all possible answer tuples as candidates from the passage, independent of the question, using Open IE models.
Question: how many albums has the doors sold
Passage: although the doors' active career ended in 1973, their popularity has persisted. according to the riaa, they have sold over 100 million records worldwide, making them one of the best-selling bands of all time.
Open IE Result:
(the doors active career; ended; in 1973)
(their popularity; has persisted; although the doors active career ended in 1973)
(they; making; them one of the best-selling bands of all time)
(they; have sold; 100 million records worldwide)
Answer (Question aware Open IE): (they; have sold; 100 million records worldwide)

Table 1: Example of Open IE and Question aware Open IE. The Open IE Result and the Answer have the same format, (subject; predicate; arguments), and there can be more than one argument. The "Open IE Result" rows are tuples extracted from the passage by Open IE tools, independent of the question. The "Answer" is extracted from the passage and can answer the question.
Question: where is smallville filmed
Passage: smallville was primarily filmed in and around vancouver, british columbia, with local businesses and buildings substituting for smallville locations.
Answer: smallville; was filmed; in british columbia; with local businesses
Tagging Label: smallville/S-B was/P-B primarily/O filmed/P-B in/A1-B and/O around/O vancouver/O ,/O british/A1-B columbia/A1-I ,/O with/A2-B local/A2-I businesses/A2-I and/O buildings/O substituting/O for/O smallville/O locations/O ./O
Tagging Result: smallville, was primarily filmed, vancouver, british columbia, with local businesses
Correction Result: smallville, was filmed, in vancouver, british columbia with local businesses

Table 2: Example of the tagging label and the outputs of the tagging decoder and the correction decoder. The tagging label is created from the answer, since the original dataset only contains answers in tuple format.

It then ranks all the candidates by a matching model between each candidate and the question. The coverage of the extraction step is crucial for the final performance, because the method has two steps and the extraction step is independent of the question. Since the first step is extraction, most of the words in the result come from the passage.

The generative method concatenates the question and passage as input, and then generates the answer tuple as one concatenated sequence or generates each field one by one. The generative method uses the question and passage at the same stage and does not rely on an extraction model, so it allows better interaction between question and passage. However, because it removes the extraction step, it no longer exploits the fact that most of the answer words come from the passage.

To make better use of passage information during generation, we propose a two-stage decoder model which allows more interaction between question and passage by combining a tagging decoder and a correction decoder. At the first stage, the tagging decoder tags the words which may be useful for answer generation; the output of this step forms a coarse answer. At the second stage, the correction decoder generates a new answer with a step-by-step decoder; it can reorder words and add new words to output a fluent answer. The two decoders are trained jointly.

We evaluate our model on the WebAssertions dataset [13]. Our model achieves a 59.32 BLEU score, which is better than previous generative methods.
Related Work

Open Information Extraction (Open IE) [1, 2] aims to extract all (subject, predicate, arguments) tuples from a sentence. To solve this challenge, TextRunner [1] and WOE [8] use a self-supervised approach. Many of the later
methods use a rule-based approach, such as ReVerb [3], OLLIE [4], KrakeN [5], ClausIE [6], and PropS [9]. Open IE4 (https://github.com/dair-iitd/OpenIE-standalone) extracts tuples from Semantic Role Labeling structures. Stanford Open Information Extraction [10] uses natural logic inference to extract shorter arguments. Recently, Stanovsky et al. [7] proposed a supervised method for Open IE by formulating it as a sequential labeling task. Compared to Open IE, our task has an additional question, so our tagging decoder needs an interaction layer between question and passage. Our tagging decoder is similar to [7], since it has the same output and is trained with supervised learning; however, we add a correction decoder to improve answer quality, and our model can handle answer fields that are not spans.

Current Machine Reading Comprehension (MRC) tasks like SQuAD [20], TriviaQA [21] and NewsQA [22] focus on selecting a span from the passage as the answer, and most MRC models [17, 18, 11] generate answers by predicting the start and end points of a span. The MS MARCO [24] dataset requires generating a sequence which is not a span of the passage. Tan et al. [25] solve it by first selecting a span from the passage and then generating an answer based on the question, the passage, and the selected span. Similar to Tan et al. [25], we use the idea of coarse-to-fine generation, but the answer of our task is not a span or a sentence: our answer has structure, each field has a semantic role, and the arguments have dynamic length. Each field does not have to be a span, although most of its words come from the passage. Because of this, we use a sequential labeling method to tag each word in the passage instead of predicting the start and end points of a span. The two stages of our model can be jointly trained.

For
Question aware Open IE, Yan et al. [13] propose two methods, an extractive method and a generative method. The extractive method first extracts all answer tuples from the passage, and then ranks them with a matching model between answer candidates and the question. The generative model takes the concatenation of question and passage as input, first generates a representation of each answer field, and then generates each field based on its representation.
Approach

In this section, we formulate the Question aware Open IE problem and briefly introduce our model. We then separately introduce each part of the model: the encoder, the tagging decoder, and the correction decoder.
Problem Formulation

Question aware Open IE is the following task: given a question containing n words Q = {q_1, q_2, ..., q_n} and a passage containing m words P = {p_1, p_2, ..., p_m}, output a semi-structured answer which can answer the question based on the passage. The answer consists of a subject, a predicate, and one or more arguments; we represent it as (subject, predicate, argument_1, ..., argument_k), k ≥ 1. Each answer field is a natural language word sequence.

Model Overview

Our model consists of three parts: an encoder, a tagging decoder, and a correction decoder, as shown in Figure 1. We use two encoders with the same structure to encode the question and the passage separately. The tagging decoder then lets the encoded question and passage interact and tags each word in the passage with its semantic role in the answer; it tags all passage words at the same time. The correction decoder then generates an answer based on the tagging result, producing answer words one by one. Our intuition is to use the tagging decoder to highlight the words in the passage and the correction decoder to generate a fluent answer based on the tagging result. Beyond being effective on this dataset, our approach shows that the correction decoder can recover missing words that are not tagged in the ground truth.

We use the example in Table 2 to illustrate the idea. The argument is not a span of the passage: "in" is far from "british columbia", yet most of the words in the answer come from the passage. The first stage of our model tags keywords in the passage; here the tagging decoder tags all location words, such as "vancouver" and "british columbia", as arguments, but misses "in". The second stage generates a fluent answer based on the tagging result; here the correction decoder adds "in" compared with the tagging result. Notice that the model is also able to remove the adverb "primarily". Based on our case study, the correction decoder is good at word reordering and small post-editing guided by its language model.

Encoder

The encoder of our model contains a question encoder and a passage encoder, which encode the question and the passage separately. These two encoders have the same structure but different weights in our implementation.
The encoder is composed of two basic building blocks, the Multi-Head Attention Block and the Feed-Forward Block [12]. We introduce these two building blocks and then how to build an encoder with them.
Multi-Head Attention Block

The core layer of the Multi-Head Attention Block is the Multi-Head Attention Layer [12]. The input of the Multi-Head Attention Layer contains a query (Q), a key (K), and a value (V), all matrices: Q ∈ R^{n_q × d_k}, K ∈ R^{n_k × d_k}, V ∈ R^{n_k × d_v}. The output O of the Multi-Head Attention Layer is also a matrix, O ∈ R^{n_q × d_v}. We write this layer as a function MultiHeadAttention(Q, K, V).

Intuitively, this layer is a soft dictionary lookup in vector space whose operation unit is the vector. A dictionary in computer science is a set of key-value pairs; a lookup finds the key which equals the query and returns the corresponding value. In Multi-Head Attention, there are n_k key-value pairs; each key is a vector of dimension d_k and each value is a vector of dimension d_v. The n_q queries have n_q corresponding outputs: for each query, we compute an attention score against each key and use the scores as weights to compute a weighted sum of the values, which is the output. More details can be found in Vaswani et al. [12].

The Multi-Head Attention Block has the same input as the Multi-Head Attention Layer, but requires d_k = d_v. The inputs go through a Multi-Head Attention Layer wrapped with a residual connection, and the result is passed through a layer normalization layer [19] to get the final output:

MHBlock(Q, K, V) = LayerNorm(Q + MultiHeadAttention(Q, K, V))
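For reference, the scaled dot-product attention inside each head and its multi-head combination, as defined in Vaswani et al. [12], are:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)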
Feed-Forward Block

The core layer of the Feed-Forward Block is the Feed-Forward Network [12], a two-layer projection applied to each row of the matrix:

FFN(x) = max(0, xW_1 + b_1)W_2 + b_2

The Feed-Forward Block has the same input and output shape as the Feed-Forward Network. We add the input and the output of the Feed-Forward Network, then pass the sum through a layer normalization layer to get the final output:

FFNBlock(x) = LayerNorm(x + FFN(x))
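As a minimal sketch of these two building blocks (we assume a PyTorch implementation; the paper does not name a framework, and the class names are ours):

```python
import torch.nn as nn

class MHBlock(nn.Module):
    """Multi-Head Attention Block: attention with a residual connection
    followed by layer normalization, i.e. LayerNorm(Q + MHA(Q, K, V))."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, q, k, v):
        out, _ = self.attn(q, k, v)   # MultiHeadAttention(Q, K, V)
        return self.norm(q + out)     # residual + layer norm

class FFNBlock(nn.Module):
    """Feed-Forward Block: LayerNorm(x + FFN(x)) with the two-layer
    network FFN(x) = max(0, x W1 + b1) W2 + b2."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.ffn(x))
```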
Encoder Structure

An encoder maps a sequence of words into a sequence of hidden states. The question encoder and the passage encoder have the same structure. The input of the encoder is the word embedding of each word plus the position embedding; we use the sine and cosine position embedding of [12]. The encoder is composed of a stack of N_e identical layers. The outputs of the question encoder and the passage encoder are h_q and h_p respectively. For the question encoder:

h_{q,0} = Embedding(Q) + W_pos
h^m_{q,i} = MHBlock(h_{q,i-1}, h_{q,i-1}, h_{q,i-1})  ∀ i ∈ [1, N_e]
h_{q,i} = FFNBlock(h^m_{q,i})  ∀ i ∈ [1, N_e]
h_q = h_{q,N_e}

Embedding(x) is an embedding lookup function which takes word ids and outputs the corresponding word embedding vectors, W_pos is the position embedding, and h^m_{q,i} is an intermediate result. The passage encoder has the same structure, so we do not formulate it again.
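Continuing the sketch, an encoder stack built from these blocks could look as follows; d_ff = 2048 is our assumption (the paper only specifies the hidden size 512, 8 heads, and N_e = 2):

```python
import math
import torch
import torch.nn as nn

def sinusoid_positions(max_len: int, d_model: int) -> torch.Tensor:
    """Sine/cosine position embedding from Vaswani et al. [12]."""
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # (max_len, d_model)

class Encoder(nn.Module):
    """Stack of N_e identical (MHBlock, FFNBlock) layers. The question
    encoder and the passage encoder are two instances of this class:
    same structure, separate weights."""
    def __init__(self, vocab_size, d_model=512, n_heads=8, d_ff=2048,
                 n_layers=2, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.register_buffer("pos", sinusoid_positions(max_len, d_model))
        self.layers = nn.ModuleList(
            [nn.ModuleList([MHBlock(d_model, n_heads),
                            FFNBlock(d_model, d_ff)])
             for _ in range(n_layers)])

    def forward(self, token_ids):
        # h_0 = Embedding(tokens) + position embedding
        h = self.embed(token_ids) + self.pos[: token_ids.size(1)]
        for mh, ffn in self.layers:
            h = mh(h, h, h)   # self-attention: query = key = value
            h = ffn(h)
        return h              # h_q or h_p
```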
Figure 1: Overview of the two-stage model. The Multi-Head Attention Block has three inputs, query (Q), key (K) and value (V); in the figure they are drawn in the order K, V, Q for clarity. For the answer embedding, we use the entire ground-truth answer during training. During decoding, the correction decoder generates answer words one by one, so only the already-generated answer words are used as input to generate the next word.

Tagging Decoder

The tagging decoder generates a tagging probability distribution for each word in the passage, given the question encoding h_q and the passage encoding h_p. In this subsection, we introduce the tag format, Semantic BIO Tags, and then the tagging decoder structure. The output of the tagging decoder is a distribution over tags T. Formally, for each word p_i in the passage, the tagging decoder outputs a distribution p(t_i | P, Q), where t_i is the tag of the i-th passage word. We denote the result as T = {p(t_1 | P, Q), p(t_2 | P, Q), ..., p(t_m | P, Q)}. In our model we keep T as a continuous probability distribution so that the loss can be back-propagated; to give each word an explicit tag, we output the tag with maximum probability.

Semantic BIO Tags

We use semantic BIO tags, as in Stanovsky et al. [7], to tag passage words. Each tag combines two parts: a semantic tag for the semantic role in the answer and a BIO tag for the position in a field. The semantic role tags are subject (S), predicate (P) and argument (A); since there can be more than one argument, arguments are further distinguished by position, with Ai for the i-th argument. The BIO tags are Begin (B), Inside (I) and Outside (O). For each continuous subsequence belonging to the same semantic role, we tag the first word B and the remaining words I; after tagging all continuous subsequences, we tag all other words O. We then prepend the semantic role to the BIO tags; if the semantic role is predicate, for example, the tags become P-B and P-I. We show an example in Table 2. The predicate has two words, "was filmed", but they are not consecutive; for each sub-span we tag the first word P-B, so both "was" and "filmed" are tagged P-B. Hence more than one word may be tagged P-B even though there is only one predicate in the answer. For the same reason, there are two A1-B tags, for "in" and "british", in the example.
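To make the tag scheme concrete, here is a small sketch (the function name is ours) that collapses a tagged passage back into answer fields by collecting the words of each semantic role in passage order, as the post-processing in our evaluation does:

```python
def tags_to_fields(words, tags):
    """Collect the words of each semantic role, in passage order.
    Tags look like 'S-B', 'P-B', 'A1-I', or 'O'."""
    fields = {}
    for word, tag in zip(words, tags):
        if tag == "O":
            continue
        role = tag.split("-")[0]              # 'S', 'P', 'A1', 'A2', ...
        fields.setdefault(role, []).append(word)
    return {role: " ".join(ws) for role, ws in fields.items()}

# The tagging label from Table 2:
words = ("smallville was primarily filmed in and around vancouver , "
         "british columbia , with local businesses and buildings "
         "substituting for smallville locations .").split()
tags = ["S-B", "P-B", "O", "P-B", "A1-B", "O", "O", "O", "O", "A1-B",
        "A1-I", "O", "A2-B", "A2-I", "A2-I", "O", "O", "O", "O", "O",
        "O", "O"]
print(tags_to_fields(words, tags))
# -> {'S': 'smallville', 'P': 'was filmed',
#     'A1': 'in british columbia', 'A2': 'with local businesses'}
```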
Ground Truth Creation

We need to create the ground truth for tagging decoder training ourselves, because the answers in the dataset are in tuple format. Intuitively, when an answer tuple was created, some words were selected from the passage and copied into the answer. Ideally, we want to tag exactly these words and let our model generate the answer based on them. Formally, we need to select continuous subsequences of the passage based on the answer and tag them by the tagging rules above. Each subsequence must belong to a single semantic role, but each semantic role may correspond to several subsequences. The key challenge is that one word may occur multiple times in the passage. We propose a rule-based solution. Intuitively, for adjacent words in the answer, we prefer to match adjacent words in the passage, and for each answer field, we prefer to keep all matches as close together as possible. In detail, we match the fields in the order arguments, subject, predicate, because the arguments are longest and the predicate is shortest. For each field, we first try to match all of its bigrams in the passage, keeping multiple occurrences if they exist. We then match as many as possible of the single words not yet covered by matched bigrams; in this step, we minimize the distance between the rightmost and the leftmost matched words.
In the Open IE task, the predicate is the shortest field and is often a unigram, so we match it last, and we prefer a predicate occurrence that lies between the subject and the arguments.
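A simplified sketch of this matching heuristic under our reading of the rules above (tie-breaking and the predicate-position preference are omitted; the helper name is ours): bigrams of a field are matched first, keeping all occurrences, and leftover single words are then matched at the occurrence closest to the existing matches.

```python
def match_field(field_words, passage_words):
    """Return the passage positions matched for one answer field."""
    matched = set()
    covered = [False] * len(field_words)
    # Step 1: match every bigram of the field in the passage,
    # keeping multiple occurrences if they exist.
    for i in range(len(field_words) - 1):
        bigram = (field_words[i], field_words[i + 1])
        for j in range(len(passage_words) - 1):
            if (passage_words[j], passage_words[j + 1]) == bigram:
                matched.update((j, j + 1))
                covered[i] = covered[i + 1] = True
    # Step 2: match remaining single words, preferring occurrences
    # close to the matches found so far (keeping the span small).
    anchor = min(matched) if matched else 0
    for i, w in enumerate(field_words):
        if covered[i]:
            continue
        positions = [j for j, p in enumerate(passage_words) if p == w]
        if positions:
            matched.add(min(positions, key=lambda j: abs(j - anchor)))
    return sorted(matched)

passage = ("smallville was primarily filmed in and around vancouver , "
           "british columbia , with local businesses and buildings "
           "substituting for smallville locations .").split()
print(match_field("was filmed".split(), passage))   # -> [1, 3]
```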
Tagging Decoder Structure

Like the passage encoder, the tagging decoder needs to encode the passage; the difference is that it also needs to interact with the question. We achieve this by adding an additional attention layer from the passage to the question. The tagging decoder is composed of a stack of N_t identical layers. Each layer consists of three sub-layers: a self-attention layer, a passage-to-question layer, and a feed-forward layer. The self-attention layer is a Multi-Head Attention Block used to encode the passage; its query, key, and value are identical and equal to the output of the previous layer. The passage-to-question layer is a Multi-Head Attention Block used to interact with the question; its query is the output of the preceding self-attention layer, and its key and value are identical and equal to the question encoder output h_q. We also tried an interaction layer like BiDAF [23], but it gave no improvement over our model. Formally:

h_{t,0} = h_p
h'_{t,i} = MHBlock(h_{t,i-1}, h_{t,i-1}, h_{t,i-1})  ∀ i ∈ [1, N_t]
h''_{t,i} = MHBlock(h'_{t,i}, h_q, h_q)  ∀ i ∈ [1, N_t]
h_{t,i} = FFNBlock(h''_{t,i})  ∀ i ∈ [1, N_t]
h_t = h_{t,N_t}

h_{t,i} is the output of the i-th layer, h'_{t,i} and h''_{t,i} are two intermediate results within the same layer, and h_t is the final output of the N_t layers. We then apply a linear projection and a softmax to each row of h_t to compute the tag probability distribution p(t_i | P, Q) of each passage word p_i:

p(t_i | P, Q) = softmax(h_{t,i} W_t)

where h_{t,i} is the i-th row of h_t and W_t is a linear projection matrix.
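Reusing the blocks from the encoder sketch, one tagging-decoder layer and the tag projection head could look like this (again a sketch, with masking and batching details omitted):

```python
import torch
import torch.nn as nn

class TaggingDecoderLayer(nn.Module):
    """One of the N_t layers: self-attention over the passage,
    passage-to-question attention, then a feed-forward block."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.self_attn = MHBlock(d_model, n_heads)
        self.cross_attn = MHBlock(d_model, n_heads)
        self.ffn = FFNBlock(d_model, d_ff)

    def forward(self, h, h_q):
        h = self.self_attn(h, h, h)        # h'_{t,i}: encode the passage
        h = self.cross_attn(h, h_q, h_q)   # h''_{t,i}: attend to the question
        return self.ffn(h)                 # h_{t,i}

class TagHead(nn.Module):
    """Linear projection + softmax: p(t_i | P, Q) = softmax(h_{t,i} W_t)."""
    def __init__(self, d_model: int, n_tags: int):
        super().__init__()
        self.proj = nn.Linear(d_model, n_tags, bias=False)

    def forward(self, h_t):
        return torch.softmax(self.proj(h_t), dim=-1)
```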
Model                              Answer (BLEU-4)   Subject (BLEU-1)   Predicate (BLEU-1)   Arguments (BLEU-1)
Seq2Seq + Attention [13]           31.85             -                  -                    -
Seq2Ast [13]                       35.76             -                  -                    -
Tagging                            55.60             51.61              57.19                46.62
Tagging + Correction               59.32             63.40              67.50                61.01
w/o question                       56.71             63.02              64.03                56.89
w/o semantic tag (only BIO tag)    58.78             62.61              66.36                59.72

Table 3: Test results on WebAssertions.

Correction Decoder

The correction decoder takes the output of the tagging decoder, h_t and T, as input and generates a new answer. Like a machine translation model, the correction decoder generates answer words one by one. We concatenate the answer tuple into one string as the output of the correction decoder: formally, we concatenate the tuple into a sequence of l words A = {a_1, a_2, ..., a_l}, in which the subject, the predicate and each argument are joined by a special separator word. In decoding, we must generate answer words one by one. Suppose we have generated the first j-1 words; the input to the correction decoder is then

h_{c,0} = Embedding(concat(BOS, a_1, ..., a_{j-1}))

and after N_c decoder layers the model outputs the distribution p(a_j | a_{<j}, P, Q) of the next word. Training maximizes the log-likelihood of the ground-truth answer:

∑_{i=1}^{l} log p(a_i | a_{<i}, P, Q)
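Since the two decoders are trained jointly with a loss weight λ (set to 3 in our experiments), a plausible sketch of the combined objective is below; which term λ multiplies is our assumption:

```python
import torch.nn.functional as F

def two_stage_loss(tag_logits, tag_labels, ans_logits, ans_labels, lam=3.0):
    """Joint objective: cross-entropy over per-word tags (tagging decoder)
    plus cross-entropy over answer tokens (correction decoder).
    Attaching lam to the tagging term is an assumption for this sketch."""
    # tag_logits: (batch, passage_len, n_tags); tag_labels: (batch, passage_len)
    tagging_loss = F.cross_entropy(tag_logits.transpose(1, 2), tag_labels)
    # ans_logits: (batch, answer_len, vocab); ans_labels: (batch, answer_len)
    correction_loss = F.cross_entropy(ans_logits.transpose(1, 2), ans_labels)
    return lam * tagging_loss + correction_loss
```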
Experiments

Dataset. We evaluate on the WebAssertions dataset [13]. Yan et al. [13] construct it by retrieving and filtering, with a search engine, related passages which can directly answer a question. They then extract answer tuples from each passage with the Open IE model ClausIE. A labeler judges whether an answer tuple has complete meaning and can answer the question; the answer tuples with a positive label are the final answers. About 40% of answers contain a field which is not a span of the passage: for example, sometimes the answer deletes words that appear in the passage, and some words in the correction ground truth do not appear in the passage at all, so they cannot appear in any tagged span. The dataset contains 358,427 (question, passage, answer) triples. We randomly split it into training, validation, and test sets with an 8:1:1 split, use the validation set to tune the model, and report results on the test set.

Evaluation. We evaluate the quality of the entire answer and of each semantic role with BLEU [14]. For the entire answer, we concatenate the answer tuple into a string, separating the different roles with the special separator word. We post-process the output to obtain the subject, predicate, and arguments separately. For tagging results, we collect all the words with the same semantic tag to produce the corresponding answer field; the selected words are concatenated in their passage order, and if no word is tagged with a semantic role, the corresponding field is an empty string. For the generated answer, we split it into a list of phrases by the special separator word; the first phrase is the subject, the second is the predicate, and all remaining phrases are arguments. We use byte-pair encoding (BPE) [15] to handle the out-of-vocabulary problem in the correction decoder. BPE splits each word into several subwords, and we control the number of distinct subwords in the corpus. When creating the ground truth of semantic BIO tags, we match the continuous subsequences at the word level, then map the semantic tags to the subword level and label the BIO tags at the subword level. After the model outputs a tagging result, we collect the subwords belonging to the same semantic role and undo BPE; we ignore the possibility of incomplete words and let the model learn. For generation, we first split the output into a phrase list and then undo BPE on each phrase.

Implementation details. We tune the hyper-parameters on the validation set. The hidden size of our model is 512. We use a shared vocabulary between question, passage, and answer, with a BPE vocabulary size of 37,000. We share the embedding weights of the question encoder, the passage encoder, the correction decoder, and the pre-softmax linear transformation of the correction decoder. We use 8 heads for Multi-Head Attention. The question and passage encoder layer number N_e is 2, the tagging decoder layer number N_t is 4, and the correction decoder layer number N_c is 6. The loss weight λ is set to 3. We use the Adam optimizer [16] with a learning rate of 0.001.

Baselines. Our proposed model is called Tagging + Correction. We compare it with three baselines:

• Seq2Seq + Attention [13]: This model formulates the task as a sequence-to-sequence problem. The question and passage are concatenated into one string as input, and the tuple is concatenated into one string as output, with a special tag inserted between fields.
• Seq2Ast [13]: This sequence-to-assertion model has the same input processing and encoder as Seq2Seq + Attention.
The difference is that it uses a hierarchical decoder, which first generates a representation for each field with a tuple-level decoder and then generates the words of each field with a word-level decoder.
• Tagging: We remove the correction decoder and train only the tagging decoder.

Yan et al. [13] also propose an extractive method, but it is not comparable with the generative methods. The extractive method first extracts all possible answer tuples from the passage, then uses a ranking model to select the best answer as output. Since the dataset itself was constructed by extracting tuples and their extraction model uses the same extractor, the right answer is always in the ranking list; the key challenge for extractive methods is therefore the design of the matching model, and they are evaluated by ranking metrics such as MAP and MRR. Evaluated with BLEU, the extractive method reaches 72.27. This is higher than our result, but it is also expected, because the method exploits the way the dataset was constructed. Our method does not rely on any Open IE model, so it cannot exploit datasets constructed in this way.

Results. The results are shown in Table 3. The results for the entire answer and for each semantic role show the same trend. We see the following: (i) the Tagging + Correction model achieves the best results, which shows the effectiveness of our model; (ii) the Tagging + Correction model is better than the Seq2Seq models, which means that tagging the keywords first improves generation quality; we attribute this to the tagging decoder providing a guide for the correction decoder; (iii) the Tagging + Correction model is also better than the tagging decoder alone, which means the second-stage correction is necessary.

For the subject, predicate, and arguments columns, we find that the results for the predicate are better than for the subject, and the subject results are better than for the arguments. This may be due to the different properties of the semantic roles. The subject is often a noun phrase; the predicate is a verb with an average length of 1.4; the arguments are modifying phrases, which are the longest and most complicated. Intuitively, the properties of a single word are enough to determine whether it is a predicate, and the properties of two adjacent words are enough to determine the boundary of a noun phrase, but extracting arguments may require more sophisticated sentence information such as a syntax tree. We leave this improvement as future work.

We remove the question, so the task becomes an Open IE problem; we denote this setting as w/o question. The BLEU of the entire answer decreases by 2.6 compared with Tagging + Correction, and the results for all semantic roles also decrease. This shows that Question aware Open IE cannot be solved as a plain Open IE task. We also remove the semantic roles from the tags and keep only the BIO tags. The w/o semantic tag results show that the BLEU of the entire answer decreases by 0.56 and the BLEU of each semantic role decreases by more than 1. This shows that the correction decoder benefits from the semantic tags.

Case study. We also perform a case study to analyze our results. We randomly sample 50 examples from the test set and predict with the Tagging + Correction model. The results are summarized in Table 4.

Label                        Ratio
correct / exactly match      30%
correct / comparable         10%
correct / better             18%
correct / incomplete label   18%
wrong / wrong focus          12%
wrong / grammar problem      6%
wrong / lost key words       6%

Table 4: Case study of Tagging + Correction.

We find that about 76% of cases are correct.
Comparable means the model output is comparable with the ground truth and it is hard to tell which one is better. About 18% of cases are better than the ground truth, because the generated answer is shorter and clearer than the ground-truth answer, especially in the arguments. Another 18% of cases have an incomplete label, which means the passage contains more than one answer to the question. Based on these results, we see that Question aware Open IE is hard to evaluate because of the open definition of the information extraction problem: there may be more than one answer in the passage, and each answer may have multiple paraphrases. A better dataset could help to solve the "better" and "incomplete label" problems.

Among the wrong outputs, about 12% of cases have a wrong focus, meaning the answer is not related to the question. 6% of cases have a grammar problem, meaning the answer is not fluent; this is because the language model of the correction decoder is still not good enough. Another 6% of cases lose key words. In the future, we may try to improve the interaction between question and passage to address the wrong focus and lost key words problems, and try transfer learning to improve the language model.

Conclusion

In this paper, we introduce a two-stage decoder model to solve the Question aware Open IE task. Because most of the answer words come from the passage, we first use a tagging decoder to tag the key words in the passage, and then generate a refined answer with a correction decoder based on the output of the tagging decoder. The experiments on WebAssertions show that our method outperforms pure generation models and pure tagging models. Our model does not rely on any Open IE tools, which gives it good generalization ability. In the future, we will try more methods to improve our results, such as incorporating syntax information or richer interaction methods. We will also consider creating a better dataset to accelerate research in this area.

References

[1] Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. Open information extraction from the web. In IJCAI, volume 7, pages 2670–2676, 2007.
[2] Oren Etzioni, Michele Banko, Stephen Soderland, and Daniel S. Weld. Open information extraction from the web. Communications of the ACM, 51:68–74, 2008.
[3] Anthony Fader, Stephen Soderland, and Oren Etzioni. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1535–1545. Association for Computational Linguistics, 2011.
[4] Michael Schmitz, Robert Bart, Stephen Soderland, Oren Etzioni, et al. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 523–534. Association for Computational Linguistics, 2012.
[5] Alan Akbik and Alexander Löser. KrakeN: N-ary facts in open information extraction. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 52–56. Association for Computational Linguistics, 2012.
[6] Luciano Del Corro and Rainer Gemulla. ClausIE: clause-based open information extraction.
In Proceedings of the 22nd International Conference on World Wide Web, pages 355–366. ACM, 2013.
[7] Gabriel Stanovsky, Julian Michael, Luke Zettlemoyer, and Ido Dagan. Supervised open information extraction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 885–895, 2018.
[8] Fei Wu and Daniel S. Weld. Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 118–127. Association for Computational Linguistics, 2010.
[9] Gabriel Stanovsky, Jessica Ficler, Ido Dagan, and Yoav Goldberg. Getting more out of syntax with PropS. arXiv preprint arXiv:1603.01648, 2016.
[10] Gabor Angeli, Melvin Jose Johnson Premkumar, and Christopher D. Manning. Leveraging linguistic structure for open domain information extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 344–354, 2015.
[11] Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. QANet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541, 2018.
[12] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[13] Zhao Yan, Duyu Tang, Nan Duan, Shujie Liu, Wendi Wang, Daxin Jiang, Ming Zhou, and Zhoujun Li. Assertion-based QA with question-aware open information extraction. In AAAI, 2018.
[14] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, 2002.
[15] Philip Gage. A new algorithm for data compression. C Users Journal, 1994.
[16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[17] Shuohang Wang and Jing Jiang. Machine comprehension using match-LSTM and answer pointer. arXiv preprint arXiv:1608.07905, 2016.
[18] Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 189–198, 2017.
[19] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[20] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, 2016.
[21] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension.
arXiv preprint arXiv:1705.03551, 2017.
[22] Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. NewsQA: A machine comprehension dataset. arXiv preprint arXiv:1611.09830, 2016.
[23] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603, 2016.
[24] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2016.
[25] Chuanqi Tan, Furu Wei, Nan Yang, Bowen Du, Weifeng Lv, and Ming Zhou. S-Net: From answer extraction to answer generation for machine reading comprehension. arXiv preprint arXiv:1706.04815, 2017.