Exploring the Role of Logically Related Non-Question Phrases for Answering Why-Questions
Niraj Kumar, Rashmi Gangadharaiah, Kannan Srinathan, Vasudeva Varma
{IIIT-Hyderabad, Hyderabad-500032}, {IBM Research}, INDIA
Abstract.
In this paper, we show that certain phrases, although not present in a given question/query, play a very important role in answering the question. Exploring the role of such phrases not only reduces the dependency on matching question phrases for extracting answers, but also improves the quality of the extracted answers. Here, matching question phrases are phrases that co-occur in the given question and the candidate answers. To achieve this goal, we introduce a bigram-based word graph model populated with the semantic and topical relatedness of terms in the given document. Next, we apply an improved version of the ranking-with-prior approach, which ranks all words in the candidate document with respect to a set of root words (i.e., non-stopwords present in both the question and the candidate document). As a result, terms logically related to the root words are scored higher than terms that are not related to the root words. Experimental results show that our devised system performs better than the state of the art for the task of answering why-questions.
Keywords:
Ranking with Prior, Non-Factoid question answering, semantic relatedness, topical relatedness.
1 Introduction
According to [5], about 5% of all questions asked in QA systems are why-questions. Why-questions have been less explored than factoid question answering tasks. Answers to these questions contain a reason or a cause and, in the majority of cases, do not have a high lexical similarity with the question. Even linguistic features fail to identify the answers correctly. Answering why-questions has been treated as a difficult task because of the low performance of various systems suggested in the past [14]. Additionally, due to the variability in answer length, the higher difficulty level and the pragmatic nature of the answers, why-question answering is a challenging task of significant interest.
To get an effective solution, the task is devised as a two-step process. The first step identifies a few candidate/relevant documents that most likely contain the answer. The second step generates a ranked list of answers extracted from these relevant documents (named Answer Extraction). It is important to note that the technique suggested in this paper can be easily extended to community-based answer ranking with some additional external features and clues.
The contribution of this paper can be summarized as follows:
We introduce a bigram-based word graph model of text, which captures the semantic and topical information of the text. It also captures the overlapping terms of the given question (i.e., terms that co-occur in the given question and the candidate document). The main aim of this step is to utilize this information in answering the given question.
Finally, we introduce a modified scheme for ranking with priors, which makes use of the information captured in the text graph. This arrangement gives high scores to both types of terms, i.e., (1) terms that exist in the given question and (2) terms that are not present in the given question but are logically important for answering it.
Brief Description of System:
First, we convert the given question into a query, which contains a sequence of non-stopword terms connected by appropriate “AND” and “OR” relational operators. We pass this query to Solr and extract the top 100 documents. We also note their relevance scores (see Section 4). Next, we rank the candidate answers in each extracted document (see Section 5). Finally, to get the final ranked list of answers, we combine the document’s relevance score (w.r.t. the given question) and the rank score of each candidate answer (see Section 6).
2 Problem Definition and Motivation
In the majority of cases, matching and/or overlapping phrases/terms that are common to the given question and the candidate answers may not lead to the accurate answer. To understand this problem, we go through a simple example, which contains a sample question, the answer selected by human annotators and other candidate answers (see Table 1). From Table 1, it is clear that the most relevant answer, as selected by the human annotators, and the paragraphs “PID-3” and “PID-4” contain the same number of matching non-stopword question terms, like “white”, “flag”, “symbol” and “surrender”. This does not mean that we neglect the importance of non-stopword terms that are common to both the given question and the candidate answers when extracting answers for a given question. But, in addition to such terms, we explore the role of other terms that are logically related to the terms in the question but not present in the question.
If we go through the given/selected answer, we find that some other terms, like “negotiation”, “internationally”, “fired”, “signifies”, “unarmed”, “waving”, “person”, “geneva”, “convention”, etc. (see Table 1, bold and underlined words in the given answer), play a very important role in the answer but are not present in the question. So the main problem is how to identify such terms and give them a relatively higher rank. It is important to note that identifying such logically related terms by using only semantic relatedness or linguistic features is very tough.
Footnote: Solr, http://lucene.apache.org/solr/
Table 1: Sample question with selected answer and other candidate answers (source: [11], [12])
Question: Why is a white flag a symbol of surrender?

Given/selected answer by annotator (Source Document: White_flag.htm):
The white flag is an internationally recognized protective sign of truce or ceasefire, and request for negotiation. It is also used to symbolize surrender, since it is often the weaker military party which requests negotiation. A white flag signifies to all that an approaching negotiator is unarmed, with an intent to surrender or a desire to communicate. Persons carrying or waving a white flag are not to be fired upon, nor are they allowed to open fire. The use of the flag to surrender is included in the Geneva Conventions.

Other candidate answers:
PID-4: The first mention of the usage of white flags to surrender is made during from the Eastern Han dynasty (A.D 25-220). In the Roman Empire, the historian Cornelius Tacitus mentions a white flag of surrender in A.D. 109. Before that time, Roman armies would surrender by holding their shields above their heads. The usage of the white flag has since spread worldwide.
PID-3: Many times since the weaker party is in a decrepit state, a white flag would be fashioned out of anything readily available like a T Shirt, handkerchief, anything white. The most common way of making a white flag is to obtain a pole and tie two corners of a sheet of cloth to the top of the pole and somewhere in the middle.
To achieve the above-discussed goal, we propose a bigram-based word graph model. To increase the importance of word pairs that co-occur in the given question and the candidate answers, we apply simple link boosting. To properly capture the semantic and topical information in the document, we populate the graph with semantic and topical relatedness information. Finally, we modify the “ranking with prior” scheme to give a higher rank to (1) terms that exist in the given question and (2) terms that are not present in the given question but are logically important for answering it.
Ranking with prior:
For any graph $G = (v, e)$, where $v = \{v_1, v_2, v_3, \ldots, v_N\}$ represents the set of vertices and $E(v_i, v_j) \in e$ represents an edge whenever there exists a link between $v_i$ and $v_j$, the ranking with prior score [15] of any node ‘v’ of the graph can be given as:
$$PPR(v)_{i+1} = (1 - \beta) \left( \sum_{u \in adj(v)} P(v \mid u) \, PPR(u)_i \right) + \beta P_v \tag{1}$$
Here, $PPR(v)_{i+1}$ represents the page rank with prior of node ‘v’ at the (i+1)th iteration, $adj(v)$ represents the set of nodes adjacent to node ‘v’, and $P(v \mid u)$ is the probability of stepping from node ‘u’ to node ‘v’. If ‘R’ represents the set of root nodes, then the prior or bias $P_v$ can be given as:
$$P_v = \begin{cases} 1/|R| & \text{for } v \in R \\ 0 & \text{otherwise} \end{cases}$$
$\beta$ represents the back probability, which determines how often we jump back to the root nodes.
Description:
According to [15], in ranking with prior, the root set is considered as the data analyst’s prior knowledge or bias in terms of which nodes are considered important in a graph. If we select a root set that encompasses the entire graph, the relative importance converges to the graph’s importance. Figure 1 presents an example, using a toy graph, of the differences between the traditional page rank approach (“Page Rank”, [7]) and ranking with priors (obtained by considering nodes ‘A’ and ‘F’ as root nodes).
Figure 1: Effect of ranking with prior w.r.t. Root node ‘A’ and ‘F’, see score “PRankP”, (ref: [15]).
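As a rough illustration of Eq-1, the following minimal sketch iterates ranking with prior on a small undirected graph. The graph, the value beta = 0.3, the iteration count and the uniform transition probability P(v|u) = 1/degree(u) are assumptions made for this example; the graph is not the one shown in Figure 1.

```python
def pagerank_with_priors(adj, roots, beta=0.3, iterations=100):
    """Iterate Eq-1 on an undirected graph given as {node: set of neighbours}."""
    nodes = sorted(adj)
    # Prior P_v: 1/|R| for root nodes, 0 otherwise.
    prior = {v: (1.0 / len(roots) if v in roots else 0.0) for v in nodes}
    score = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        # Assumption: P(v|u) = 1/degree(u), i.e. a uniform step to a neighbour.
        score = {
            v: (1 - beta) * sum(score[u] / len(adj[u]) for u in adj[v])
            + beta * prior[v]
            for v in nodes
        }
    return score


# Hypothetical toy graph (not the graph of Figure 1).
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F"), ("B", "D")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

biased = pagerank_with_priors(adj, roots={"A", "F"})   # bias towards 'A' and 'F'
uniform = pagerank_with_priors(adj, roots=set(adj))    # root set = entire graph
for node in sorted(adj):
    print(node, round(uniform[node], 3), round(biased[node], 3))
```

With the whole graph as root set the prior is uniform, which corresponds to the remark above that the relative importance then converges to the graph's overall importance.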
Why modification in ranking with prior?:
The main aim behind this modification is to avoid the effect of noisy words and to exploit the information useful for answering the question. For this, we use a bigram-based weighted word graph, which contains (1) additional boosting links, to increase the importance of overlapping phrases from the question, (2) semantic importance, to capture the semantic strength of word pairs in the text, and (3) topical relatedness, to join topically related terms that are not directly related. We use these weights to modify the equation of page rank with prior. We use all non-stopword terms of the given question as root words. To rank the answers, we use the top-scored terms obtained from the modified form of ranking with prior.
3 Related Work
The task of Question Answering has mainly focussed on answering factoid questions, where answers are usually short phrases such as named entities. Focus has recently moved towards the task of non-factoid question answering, such as “why-questions”, “how-to questions”, etc. [3] ranks candidate answer paragraphs for answering why-questions in Japanese. It uses Support Vector Machines with features such as content similarity, causal expressions and causal relations, extracted from two annotated corpora and a dictionary. [6] also presented a supervised technique, which used sentiment analysis and word classes to answer why-questions in Japanese. [14] proposed a three-step model for why-QA: (1) a question-processing module that transforms the input question to a query; (2) an off-the-shelf retrieval module that retrieves and ranks passages of text that share content with the input query; and (3) a re-ranking module that adapts the scores of the retrieved passages using structural information from the input question and the retrieved passages.
Although significant work has not been done towards answering why-questions, a large number of approaches have been suggested in the past for the task of open domain question answering. [1], [4] and [9] bridge the lexical chasm between the question and the answer using a machine translation model. These techniques require large amounts of training data to reach a reasonable performance, and such data may not always be available. [10] presents a BOW-based model, which uses statistical weights based on term frequency, document frequency, passage length and term density. [8] considers the problem of answering definition questions and uses predicate-argument structures (PAS) for improved answer ranking. [13] compared the performance of a number of machine learning techniques for the task of ranking answers that are described by TF-IDF, a set of 36 linguistically motivated overlap features and a binary label representing their correctness.
4 Identifying Candidate Documents
We use Solr to retrieve documents from the INEX corpus for every given why-question. For this, we build a query using a combination of “OR” and “AND” relational operators. We use the “AND” and “OR” operators for the corresponding “and” and “or” words in the question. Next, we replace all other stopwords and punctuation marks of the given question by the “OR” operator, and we put the “AND” operator between every pair of adjacent words that lies between two “OR” operators. This scheme is similar to the one used in [6] to identify predefined phrase boundaries for keyphrases. Next, we separate all highly frequent verbs from the sequences of words through the “OR” operator. Here, the main aim is to separate these verbs from the surrounding word sequences. For this, we use a collection of 1534 frequent verbs from [6]. Finally, we retrieve the top-100 documents for each given question and use the extracted documents to generate the ranked list of answers. We also note the relevance score of each such candidate document.
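The query construction described above can be sketched as follows. This is only one possible reading of the scheme: the stopword list and the frequent-verb list are tiny hypothetical stand-ins (the paper uses a collection of 1534 frequent verbs), and the use of parentheses around each AND group is an assumption about how the query string is laid out.

```python
import re

# Hypothetical, heavily reduced word lists used only for this illustration.
STOPWORDS = {"why", "is", "a", "an", "the", "of", "do", "does", "it", "we", "to"}
FREQUENT_VERBS = {"fly", "cut", "put"}


def question_to_query(question, stopwords=STOPWORDS, frequent_verbs=FREQUENT_VERBS):
    """Turn a question into an AND/OR query string."""
    tokens = re.findall(r"[a-z0-9]+|[^\sa-z0-9]", question.lower())
    groups = [[]]
    for tok in tokens:
        if tok == "and":
            # An explicit "and" keeps its neighbours in the same AND group.
            continue
        if tok in frequent_verbs:
            # A frequent verb is split off from the surrounding word sequence.
            groups.append([tok])
            groups.append([])
        elif tok == "or" or tok in stopwords or not tok.isalnum():
            # "or", other stopwords and punctuation act as OR boundaries.
            groups.append([])
        else:
            groups[-1].append(tok)
    clauses = [
        "(" + " AND ".join(g) + ")" if len(g) > 1 else g[0]
        for g in groups
        if g
    ]
    return " OR ".join(clauses)


print(question_to_query("Why is a white flag a symbol of surrender?"))
print(question_to_query("Why do birds fly south?"))
```

For the sample question of Table 1 this yields one AND group for the adjacent words "white flag" and OR-separated single terms for the rest.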
5 Extracting Ranked List of Answers
To extract the ranked list of answers from the candidate document(s), we apply a modified version of page rank with prior to an undirected word graph of sentences.
We treat every word of a given document as a node in the graph. We add a link between two words if they co-occur within a window of two words (i.e., a bigram) in the sentences of the candidate document. Formally, we define an undirected graph as $G = (V, E)$, where $V = \{V_1, V_2, V_3, \ldots, V_N\}$ represents the set of vertices and $E(V_i, V_j)$ represents an edge, if there exists a link between $V_i$ and $V_j$.
Figure 2: Undirected word graph of sentences. Here S1, S2 and S3 represent the sentences of the document and ‘A’, ‘B’, ‘C’, ‘D’, ‘E’, ‘F’, ‘G’ and ‘H’ represent the distinct words.
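A minimal sketch of the word graph construction is given below. The tokenization (lowercasing and splitting on non-alphanumeric characters), the function name `build_word_graph` and the decision to skip a link from a word to itself are assumptions made for this example.

```python
import re
from collections import Counter


def build_word_graph(sentences):
    """Return {unordered word pair: number of links} for an undirected word graph."""
    links = Counter()
    for sentence in sentences:
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        # Every pair of adjacent words (a bigram) contributes one undirected link.
        for left, right in zip(words, words[1:]):
            if left != right:  # assumption: no self-loops
                links[frozenset((left, right))] += 1
    return links


graph = build_word_graph(["A B C", "B C D"])
for pair, count in sorted(graph.items(), key=lambda item: sorted(item[0])):
    print(sorted(pair), count)
```

Storing each edge as an unordered pair keeps the graph undirected, so "b c" and "c b" count towards the same link.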
We boost the multi-word overlapping phrases in the word graph of sentences, i.e., phrases which appear both in the given question and in the source document. For this, we add new links to the word graph of sentences based on the number of times the matching bigram appears in the question.
For example, suppose a document contains the bigrams “a b”, “a d”, “b c” and “b e” (Figure 3), and the question contains the matching bigram “a b”. For this, we add a new link between “a” and “b”. Now, the normalized random walk weight from “a” to “b” is 2/3 (effect of boosting), whereas the weight from “a” to “d” is 1/3.
Figure 3: Giving additional boosts to matching phrases
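The boosting example above can be reproduced with a few lines of code. The helper names `boost` and `transition_prob` are hypothetical; the sketch adds one extra link per occurrence of a matching bigram in the question and then normalizes the links around a node.

```python
from collections import Counter
from fractions import Fraction


def boost(links, question_bigrams):
    """Add one extra link for every question bigram that also occurs in the document."""
    boosted = Counter(links)
    for pair, times in Counter(frozenset(b) for b in question_bigrams).items():
        if pair in links:
            boosted[pair] += times
    return boosted


def transition_prob(links, src, dst):
    """Normalized random walk weight of stepping from src to dst."""
    total = sum(n for pair, n in links.items() if src in pair)
    return Fraction(links[frozenset((src, dst))], total)


# Document bigrams of the example: "a b", "a d", "b c", "b e".
doc_links = Counter(
    {frozenset(p): 1 for p in [("a", "b"), ("a", "d"), ("b", "c"), ("b", "e")]}
)
boosted = boost(doc_links, [("a", "b")])  # the question contains "a b" once

print(transition_prob(boosted, "a", "b"))  # 2/3
print(transition_prob(boosted, "a", "d"))  # 1/3
```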
We use the semantic relatedness score of the words of each bigram to calculate the edge weights of the word graph of sentences. Next, we use topical relatedness to identify the topical relation between bigrams that are not adjacent/directly linked to each other. Finally, we add a link between such bigrams and assign it a weight calculated on the basis of the topical relatedness of the bigrams.
Calculating edge weight in word graph by using semantic relatedness score:
For this, we use Wikipedia’s extended abstracts (http://wiki.dbpedia.org/Downloads38) and consider only frequent bigrams. This step reduces the participation of noisy/less-important bigrams in calculating the semantic relatedness score. Finally, we use Pointwise Mutual Information to calculate the semantic relatedness strength of bigrams:
$$PMI(t_i, t_j) = \log \frac{N \times CW(t_i, t_j)}{CW(t_i) \times CW(t_j)} \tag{2}$$
Where,
$CW(t_i, t_j)$ = number of Wikipedia extended abstracts which contain the terms $t_i$ and $t_j$ in adjacency, with a co-occurrence frequency of at least two.
$CW(t_i)$ = number of Wikipedia extended abstracts in which $t_i$ occurs at least twice.
$N$ = total number of Wikipedia extended abstracts.
Calculating Edge Weight:
The weight of the edge between $V_i$ and $V_j$, ‘$EdgeWt(V_i, V_j)$’, can be given as:
$$EdgeWt(V_i, V_j) = links \times PMI(V_i, V_j) \tag{3}$$
Where, $links$ = number of links between $V_i$ and $V_j$ (including the additional links added by boosting, if any such link exists).
Utilizing topical relatedness score of bigrams:
Co-occurrence of words within a relatively large window in the text suggests that both words are related to the general topic discussed in the text. [16] used such a scheme for word-sense disambiguation. We use this concept to identify the topical relatedness of non-adjacent bigrams. For this, we use Wikipedia extended abstracts, which contain long abstracts of Wikipedia articles with minor topical deviation from the corresponding document title. However, to reduce the participation of noisy bigrams, we consider only frequent bigrams. Now, we use Pointwise Mutual Information to calculate the topical relatedness between two bigrams:
$$PMI(B_1, B_2) = \log_2 \frac{N \times CW(B_1, B_2)}{CW(B_1) \times CW(B_2)} \tag{4}$$
Where,
$B_1$, $B_2$ represent two bigrams.
$CW(B_1, B_2)$ = number of Wikipedia extended abstracts in which both $B_1$ and $B_2$ occur at least twice.
$CW(B_1)$ = number of Wikipedia extended abstracts in which $B_1$ occurs at least twice.
$CW(B_2)$ = number of Wikipedia extended abstracts in which $B_2$ occurs at least twice.
Figure 4: Adding link for topical similarity of bigrams (i.e., “C D” and “G H”)
Adding additional edge:
For bigrams having a topical relatedness score (calculated by using Eq-4) above the average, we add an additional edge. The edge weight for such an edge can be calculated as:
$$EdgeWt(B_1, B_2) = PMI(B_1, B_2) \times Min\big(f(B_1), f(B_2)\big) \tag{5}$$
Where,
$f(B_1)$ = occurrence frequency of $B_1$ in the candidate document.
$f(B_2)$ = occurrence frequency of $B_2$ in the candidate document.
For example (in Figure 4), we add an additional edge between the two topically related bigrams “C D” and “G H” (i.e., between the second word of bigram “C D” and the first word of bigram “G H”). Here, bigram “C D” comes earlier than “G H” in the parent document. This scheme adds links between some highly topically related bigrams in the word graph.
For ranking, we use the word graph of sentences populated with semantic and topical information. Next, we convert it into a row-stochastic matrix by normalizing the row sums of the corresponding transition matrix to one. Finally, we calculate the prior probability of every root word and apply it with the modification proposed in ranking with prior.
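The edge weights of Eq-2 to Eq-5 can be sketched as below. All counts are hypothetical and chosen only to keep the arithmetic easy to follow; in the described system they would come from the Wikipedia extended abstracts and the candidate document. Reading Eq-2 with a natural logarithm and Eq-4 with a base-2 logarithm is an assumption of this sketch.

```python
import math


def pmi(n_abstracts, cw_joint, cw_first, cw_second, log=math.log):
    """Eq-2 / Eq-4: log of N * CW(x, y) / (CW(x) * CW(y))."""
    return log(n_abstracts * cw_joint / (cw_first * cw_second))


def semantic_edge_weight(links, pmi_score):
    """Eq-3: number of links (including boosted ones) times the PMI of the word pair."""
    return links * pmi_score


def topical_edge_weight(pmi_score, freq_b1, freq_b2):
    """Eq-5: bigram PMI times the smaller in-document frequency of the two bigrams."""
    return pmi_score * min(freq_b1, freq_b2)


# Hypothetical counts.
word_pmi = pmi(1000, 20, 50, 40)                  # log(1000 * 20 / (50 * 40)) = log(10)
bigram_pmi = pmi(1024, 8, 32, 16, log=math.log2)  # log2(1024 * 8 / (32 * 16)) = log2(16)

print(semantic_edge_weight(2, word_pmi))       # two links between the words
print(topical_edge_weight(bigram_pmi, 3, 5))   # bigram frequencies 3 and 5
```

Note that PMI can be negative for weakly associated pairs; how such pairs are handled is not specified here, so the sketch leaves the value as computed.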
Calculating prior probabilities of root words:
Let ‘R’ represent the set of root words and let $P_U$ denote the relative importance (“prior bias”) that we attach to node ‘U’:
$$P_U = \begin{cases} 1/|R| & \text{for } U \in R \\ 0 & \text{otherwise} \end{cases} \tag{6}$$
Ranking with prior:
The modified scheme for ranking with prior is given below:
$$RR(U) = (1 - \beta) \sum_{V \in adj(U)} \frac{EdgeWt(U, V)}{\sum_{W \in adj(V)} EdgeWt(V, W)} \, RR(V) + \beta P_U \tag{7}$$
Where,
$RR(U)$ = ranking with prior of node ‘U’.
$EdgeWt(U, V)$ represents the edge weight between ‘U’ and ‘V’ (see Eq-3, 5).
$adj(U)$ represents the set of nodes which are adjacent to node ‘U’.
$adj(V)$ represents the set of nodes which are adjacent to node ‘V’.
$\beta$ represents the back probability. It determines how often we jump back to the set of root nodes in ‘R’. We use $\beta$ = 0.70 (best performance setup, used in all experiments).
Eq-7 estimates the relative probability of landing on any particular node. By using this equation, we calculate the rank of all words in the given document.
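A minimal sketch of Eq-6 and Eq-7 follows, assuming strictly positive edge weights, a fixed number of iterations and a uniform starting vector (none of which is stated in the text). The function name `rank_with_prior` and the toy weights are hypothetical.

```python
def rank_with_prior(weights, roots, beta=0.70, iterations=100):
    """Iterate Eq-7 on a weighted undirected word graph.

    weights: {frozenset({u, v}): EdgeWt(u, v)} with positive weights (assumption).
    roots:   root words, i.e. non-stopword question words.
    """
    adj = {}
    for pair, w in weights.items():
        u, v = tuple(pair)
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    present_roots = {r for r in roots if r in adj}
    if not present_roots:
        raise ValueError("no root word occurs in the word graph")
    # Eq-6: prior 1/|R| for root words, 0 otherwise.
    prior = {u: (1.0 / len(present_roots) if u in present_roots else 0.0) for u in adj}
    # Row normalization: total edge weight around each node.
    total = {v: sum(adj[v].values()) for v in adj}
    rr = {u: 1.0 / len(adj) for u in adj}
    for _ in range(iterations):
        rr = {
            u: (1 - beta) * sum(w / total[v] * rr[v] for v, w in adj[u].items())
            + beta * prior[u]
            for u in adj
        }
    return rr


# Toy weights: the boosted pair "a b" carries twice the weight of the other edges.
toy = {
    frozenset(("a", "b")): 2.0,
    frozenset(("a", "d")): 1.0,
    frozenset(("b", "c")): 1.0,
    frozenset(("b", "e")): 1.0,
}
scores = rank_with_prior(toy, roots={"a"})
for word in sorted(scores, key=scores.get, reverse=True):
    print(word, round(scores[word], 4))
```

In this toy graph the root word "a" ranks first, and "b", which is tied to "a" by the heavier edge, ranks above "d".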
6 Ranking Candidate answers
Ranking candidate answers in each candidate document:
To calculate the score of a candidate answer, we add the page rank with prior scores of all the words in the answer. However, to reduce the chances of lengthy candidate answers getting higher ranks, we only use the words whose rank score is greater than the average rank score of all the words. Generally, this contains 20-25% of the top-ranked words. If it contains more than 25% of the total number of words, then we select the top 25% of the ranked words. Now, the local score can be calculated as:
$$Score(P_i)_{D_j} = \left( \sum_{W \in P_i} Score(W) \right) \times Relv\_score(D_j) \tag{8}$$
Where,
$P_i$ represents the candidate answer.
$Score(P_i)_{D_j}$ = score of candidate answer $P_i$ in document $D_j$.
$Score(W)$ = rank (score) of word ‘W’ in document $D_j$, calculated by using Eq-7, whose score is more than the average score of all the words.
$Relv\_score(D_j)$ represents the relevance score of the given document $D_j$ for the given question (see Section 4 for details).
In the same way, we calculate the rank score of all candidate answers in each document.
Final Ranking of Candidate answers:
For this we sort all candidate answers in descending order according to their rank scores.
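The answer scoring of Eq-8 can be sketched as follows. The word scores and the relevance score are hypothetical values; counting each distinct word of an answer once and keeping at least one word under the 25% cap are assumptions of this sketch.

```python
import math


def select_top_words(word_scores, cap=0.25):
    """Keep words scoring above the average, limited to the top 25% of all words."""
    average = sum(word_scores.values()) / len(word_scores)
    above = sorted(
        (w for w, s in word_scores.items() if s > average),
        key=word_scores.get,
        reverse=True,
    )
    limit = max(1, int(cap * len(word_scores)))  # assumption: keep at least one word
    return set(above[:limit])


def answer_score(answer_words, word_scores, top_words, doc_relevance):
    """Eq-8: sum of the selected word scores times the document relevance score."""
    local = sum(word_scores[w] for w in set(answer_words) if w in top_words)
    return local * doc_relevance


# Hypothetical word scores for one candidate document.
word_scores = {
    "flag": 0.40, "white": 0.30, "truce": 0.15, "pole": 0.05,
    "cloth": 0.04, "the": 0.03, "of": 0.02, "a": 0.01,
}
top_words = select_top_words(word_scores)

candidates = {
    "P1": ["white", "flag", "truce"],
    "P2": ["pole", "cloth", "white"],
}
ranked = sorted(
    candidates,
    key=lambda p: answer_score(candidates[p], word_scores, top_words, 2.0),
    reverse=True,
)
print(top_words, ranked)
```

Because only the selected top words contribute, adding many low-scoring words to an answer does not raise its score.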
7 Pseudo-code
Input: (1) Question, (2) Text collection (here, Wikipedia INEX corpus is used) and (3) Wikipedia extended abstracts.
Output:
Ranked list of answers for the given question.
Algorithm:
St1. We identify the top-100 candidate documents for the given question (Section 4).
St2. Next, we apply the following procedures to rank the candidate answers in every candidate document (St3 to St7).
St3. Prepare the word graph of sentences (Sub-section 5.1).
St4. Boost the word pairs in the word graph which also co-occur in the given question (Sub-section 5.2).
St5. Add (1) the semantic information of the words/nodes of every bigram and (2) the topical relatedness score of every bigram to the word graph of sentences (Sub-section 5.3).
St6. Apply ranking with prior to calculate the ranks of all words in the candidate document (Sub-section 5.4).
St7. Use the ranking with prior scores of the words in the given document to calculate the score of every candidate answer (Section 6).
St8. Take the product of the relevance score of the candidate document and the score of the related candidate answer to calculate the final score of every candidate answer, and then rank all candidate answers in descending order of their scores (Section 6).
8 Evaluation
To evaluate our devised system, we used the dataset prepared by [11], [12] (http://lands.let.ru.nl/~sverbern/) and the Wikipedia INEX corpus. We compare the results of our devised system with the published results of [14] and another baseline system, evaluated on the same set. The details of the dataset and the evaluation metrics are given below:
Dataset:
We used the Wikipedia INEX corpus [2] as our dataset (it contains more than 600,000 XML documents in English). To evaluate the accuracy of the system, we used the gold standard dataset, manually prepared by [11], [12]. This dataset contains 216 questions for which there exists an answer in the Wikipedia collection. These answers are in 206 different documents. The dataset contains 210 answer fragments and manually annotated RST structures (Carlson et al., 2003), in the rs3 file format. It contains a wide variety of why-questions, such as: (1) questions having no direct match with the title or the main theme of the documents, (2) questions for which the correct answer passage does not contain phrases that match the phrases in the question, (3) long questions like: “Why do Americans cut their meat then put the knife down and change hands with their fork in contrast to most Europeans who work their fork with the opposite hand from their knife?”, and (4) very short questions like: “Why does it snow?”, “Why do we laugh?”.
Evaluation Metrics:
We use the following two evaluation metrics (same as in [14]).
Success@n: the percentage of questions that have at least one correct answer in the top ‘n’ results.
MRR (Mean Reciprocal Rank): the average of the reciprocal ranks of the results for a sample of queries:
$$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i} \tag{9}$$
Where $|Q|$ = number of queries and $rank_i$ is the rank of the answer of the i-th query.
Note: If the system does not retrieve any answer in the list of the given top ‘n’ answers, the system gets RR = 0 for the given question (this approach is the same as applied in [14]).
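The two metrics can be sketched as below, with Success@n reported as a percentage as in Table 2. The list of ranks is hypothetical: each entry is the rank of the first correct answer for one question, or None when no correct answer was retrieved.

```python
import math


def success_at_n(ranks, n):
    """Percentage of questions with a correct answer within the top n results."""
    hits = sum(1 for r in ranks if r is not None and r <= n)
    return 100.0 * hits / len(ranks)


def mrr_at_n(ranks, n):
    """Eq-9 with a cut-off: a question without an answer in the top n gets RR = 0."""
    total = sum(1.0 / r for r in ranks if r is not None and r <= n)
    return total / len(ranks)


# Hypothetical ranks of the first correct answer for four questions.
ranks = [1, 3, None, 12]

print(success_at_n(ranks, 10), success_at_n(ranks, 150))
print(mrr_at_n(ranks, 10), mrr_at_n(ranks, 150))
```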
To properly evaluate the techniques applied in our devised system, we use four different setups:
SETUP-1: In this setup, we consider paragraphs as candidate answers (same as applied by [14]) and prepare a ranked list of paragraphs as answers. See the pseudo-code (Section 7) for a description of the system.
SETUP-2: Here we consider 5 consecutive sentences as a candidate answer. This is similar to the approach adopted by [6]. The rest of the system is the same as described in the pseudo-code (Section 7).
SETUP-3: This setup is similar to “SETUP-1”, except that we do not use the semantic and topical relatedness measures in the preparation of the graph (i.e., absence of the scheme applied in Sub-section 5.3). The rest of the system is the same as described in the pseudo-code (Section 7).
SETUP-4: This setup is similar to “SETUP-1”, except that we do not use the topical relatedness measure in the preparation of the graph (see Sub-section 5.3 for the related description). The rest of the system is the same as described in the pseudo-code (Section 7).
Note: As page rank [7] performs worse than the baselines, we did not include it in the evaluation results.
Table 2: Evaluation results
System                     Success@10   Success@150   MRR@10   MRR@150
SETUP-1
SETUP-2                    66.603       82.627        40.791   41.119
SETUP-3                    62.47        77.5          38.26    38.568
SETUP-4                    66.052       81.944        40.454   40.778
Lemur / Tf_IDF-sliding     45.00%       78.50%        N/A      25.00%
[14]                       57.00%       78.50%        N/A      34.00%
Analysis of results (based on the results given in Table 2):
i. In both cases, i.e., in SETUP-1 and SETUP-2, the performance of the system is nearly the same. This shows the effectiveness of our devised system in generating answers of different format and length, without affecting the quality of answers.
ii. SETUP-3 performs worse than SETUP-1 and SETUP-2. This is due to the absence of the semantic and topical relatedness scores. However, its Success@10 and MRR@150 are still better than the state-of-the-art (last two rows in Table 2), while its Success@150 (77.5) is slightly below the 78.50% of both baselines.
iii. The slightly lower performance of SETUP-4 w.r.t. SETUP-1 shows the impact of the absence of the topical relatedness score.
iv. Comparison with [14]: We also compared our results with the approach adopted in [14] and the “Lemur/tf-idf” based baseline used by [14] (Table 2). The results in Table 2 show that our devised system performs better than both systems.
Some issues with questions containing only one root word:
For some questions, such as “Why does it snow?” (Source: Snow.htm) and “Why do we laugh?” (Source: Laughter.htm), our devised system shows poor performance, i.e., average MRR@10 = 20%. The poor performance could be attributed to the fact that the ranking with prior based system is required to rank the words in the word graph with respect to a single root word. There are a total of 4 such questions in the dataset. However, as the number of non-stopword question words or root words increases (>= 2), the performance of the system becomes stable (i.e., in the quality of results).
9 Conclusion and Scope
In this paper, we used a bigram-based word graph of sentences, populated with semantic and topical information and with some boosted links. Finally, we presented an improved version of “page rank with prior” to rank words in the graph w.r.t. the root words in the given question. Improvements in the quality of the results show the effectiveness of this scheme. This scheme can be extended to some other tasks like: (1) the guided summarization task (where prior information is supplied to extract the most suitable summary sentences) and (2) community-based question answering systems (by incorporating community-based features and clues), etc.
References
1. A. L. Berger, R. Caruana, D. Cohn, D. Freitag, and V. O. Mittal. Bridging the lexical chasm: statistical approaches to answer-finding. In Proc. of SIGIR, 2000.
2. Denoyer, L. and Gallinari, P., (2006). The Wikipedia XML corpus. ACM SIGIR Forum, 40(1):64-69.
3. Higashinaka, R. and H. Isozaki. (2008). Corpus-based question answering for why-questions. In Proceedings of IJCNLP, pages 418–425, Hyderabad.
4. J. Jeon, W. B. Croft, and J. H. Lee. Finding similar questions in large question and answer archives. In CIKM, 2005.
5. Hovy, E. H., U. Hermjakob, and D. Ravichandran. (2002). A question/answer typology with surface text patterns. In Proceedings of the Human Language Technology conference (HLT), pages 247–251, San Diego, CA.
6. Oh, J., Torisawa, K., Hashimoto, C., Kawada, T., De Saeger, S., Kazama, J., Wang, Y., (2012). Why Question Answering using Sentiment Analysis and Word Classes. EMNLP-CoNLL 2012: 368-378.
7. Page, L., S. Brin, R. Motwani and T. Winograd., (1998). The pagerank citation ranking: bringing order to the web. Technical report, Stanford digital library technologies project, 1998.
8. Quarteroni, S., A. Moschitti, S. Manandhar, and R. Basili. (2007). Advanced structural representations for question classification and answer re-ranking. In Proceedings of ECIR 2007, pages 234–245, Rome.
9. S. Riezler, A. Vasserman, I. Tsochantaridis, V. Mittal, and Y. Liu. Statistical machine translation for query expansion in answer retrieval. In ACL, 2007.
10. Tellex, S., Katz, B., Lin, J., Fernandes, A., and Marton, G., (2003). Quantitative evaluation of passage retrieval algorithms for question answering. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 41–47, Toronto.
11. Verberne, S., Boves, L., Oostdijk, N., and Coppen, P. A., (2007a). Discourse-based answering of why-questions. Traitement Automatique des Langues (TAL), special issue on “Discours et document: traitements automatiques”, 47(2):21–41.
12. Verberne, S., Boves, L., Oostdijk, N., and Coppen, P. A., (2007b). Evaluating discourse-based answer extraction for why-question answering. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 735–736, Amsterdam.
13. Verberne, S., Halteren, H. Van., Theijssen, D., Raaijmakers, S. and Boves, L., (2009). Learning to rank QA data. In Proceedings of the Workshop on Learning to Rank for Information Retrieval (LR4IR) at SIGIR 2009, pages 41–48, Boston, MA.
14. Verberne, S., Boves, L., Oostdijk, N., and Coppen, P. A., (2010). What Is Not in the Bag of Words for Why-QA? Association for Computational Linguistics.
15. White, S., Smyth, P., (2003). Algorithms for estimating relative importance in networks. KDD 2003: 266-275.
16.