
Publications


Featured research published by Elena Filatova.


Archive | 2004

Event-Based Extractive Summarization

Elena Filatova; Vasileios Hatzivassiloglou

Most approaches to extractive summarization define a set of features upon which the selection of sentences is based, using algorithms independent of the features themselves. We propose a new set of features based on low-level, atomic events that describe relationships between important actors in a document or set of documents. We investigate the effect this new feature has on extractive summarization, compared with a baseline feature set consisting of the words in the input documents, and with state-of-the-art summarization systems. Our experimental results indicate not only that the event-based features offer an improvement in summary quality over words as features, but also that this effect is more pronounced for more sophisticated summarization methods that avoid redundancy in the output.
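
To make the feature concrete, here is a minimal sketch of event-based sentence scoring in Python: sentences are credited for the document-frequent atomic events they mention, and the top-scoring ones form the extract. The triple representation, weighting scheme, and function names are illustrative assumptions, not the paper's exact formulation; the upstream entity tagging and relation extraction are elided.

```python
from collections import Counter

def score_sentences(sentences, sentence_events):
    """sentence_events[i] lists the atomic-event triples found in sentences[i]."""
    # Weight each event by how often it occurs across the input documents.
    event_freq = Counter(ev for evs in sentence_events for ev in evs)
    total = sum(event_freq.values())
    # A sentence scores the summed normalized weight of the events it covers.
    return [sum(event_freq[ev] / total for ev in set(evs))
            for evs in sentence_events]

def summarize(sentences, sentence_events, max_sentences=2):
    scored = sorted(zip(score_sentences(sentences, sentence_events), sentences),
                    reverse=True)
    return [sent for _, sent in scored[:max_sentences]]

sents = ["A met B.", "A met B again.", "C slept."]
events = [[("A", "met", "B")], [("A", "met", "B")], [("C", "slept", None)]]
print(summarize(sents, events))  # the two event-heavy sentences win
```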


TASIP '01 Proceedings of the workshop on Temporal and spatial information processing - Volume 13 | 2001

Assigning time-stamps to event-clauses

Elena Filatova; Eduard H. Hovy

We describe a procedure for arranging the contents of news stories describing the development of some situation into a time-line. We describe the parts of the system that deal with (1) breaking sentences into event-clauses and (2) resolving both explicit and implicit temporal references. Evaluations show a performance of 52%, compared to human performance on the same task.
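
As a toy illustration of the time-stamping step, the sketch below resolves explicit dates and a few relative expressions against an article's publication date. The pattern set and names are assumptions; implicit references, which the paper also handles, are left unresolved here.

```python
import re
from datetime import date, timedelta

# Hypothetical, tiny inventory of relative temporal expressions.
RELATIVE = {"today": 0, "yesterday": -1, "tomorrow": 1}

def assign_timestamp(clause, pub_date):
    """Return a date for one event clause, or None if unresolvable."""
    # Explicit ISO-style dates, e.g. "2001-06-15".
    m = re.search(r"(\d{4})-(\d{2})-(\d{2})", clause)
    if m:
        return date(*map(int, m.groups()))
    # Relative references resolved against the publication date.
    for word, offset in RELATIVE.items():
        if word in clause.lower():
            return pub_date + timedelta(days=offset)
    return None  # implicit reference; would need discourse-level inference

print(assign_timestamp("Rebels seized the town yesterday", date(2001, 6, 16)))
# -> 2001-06-15
```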


international conference on computational linguistics | 2004

A formal model for information selection in multi-sentence text extraction

Elena Filatova; Vasileios Hatzivassiloglou

Selecting important information while accounting for repetitions is a hard task for both summarization and question answering. We propose a formal model that represents a collection of documents in a two-dimensional space of textual and conceptual units with an associated mapping between these two dimensions. This representation is then used to describe the task of selecting textual units for a summary or answer as a formal optimization task. We provide approximation algorithms and empirically validate the performance of the proposed model when used with two very different sets of features, words and atomic events.
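
The optimization view can be sketched compactly: treat each textual unit as covering a set of weighted conceptual units, and greedily pick the unit with the largest marginal gain under a length budget, a standard approximation for this kind of coverage objective. The data shapes and names below are assumptions, not the paper's exact model.

```python
def greedy_select(units, concept_weight, budget):
    """units: list of (text, concepts, length); concepts is a set of IDs."""
    covered, chosen, used = set(), [], 0
    pool = list(units)
    while pool:
        # Marginal gain: weight of the concepts a unit would newly cover.
        def gain(u):
            return sum(concept_weight[c] for c in u[1] - covered)
        pool = [u for u in pool if used + u[2] <= budget]  # drop units that no longer fit
        if not pool:
            break
        best = max(pool, key=gain)
        if gain(best) <= 0:
            break  # nothing new left to cover
        chosen.append(best[0])
        covered |= best[1]
        used += best[2]
        pool.remove(best)
    return chosen

units = [("s1", {"a", "b"}, 5), ("s2", {"b"}, 2), ("s3", {"c"}, 4)]
print(greedy_select(units, {"a": 2, "b": 1, "c": 1}, budget=9))  # ['s1', 's3']
```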


meeting of the association for computational linguistics | 2008

An Unsupervised Approach to Biography Production Using Wikipedia

Fadi Biadsy; Julia Hirschberg; Elena Filatova

We describe an unsupervised approach to multi-document sentence-extraction based summarization for the task of producing biographies. We utilize Wikipedia to automatically construct a corpus of biographical sentences and TDT4 to construct a corpus of non-biographical sentences. We build a biographical-sentence classifier from these corpora and an SVM regression model for sentence ordering from the Wikipedia corpus. We evaluate our work on the DUC2004 evaluation data and with human judges. Overall, our system significantly outperforms all systems that participated in DUC2004, according to the ROUGE-L metric, and is preferred by human subjects.
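
A hedged sketch of the classification step: a linear SVM over TF-IDF features separating biographical from non-biographical sentences, with a tiny inline corpus standing in for the Wikipedia and TDT4 corpora. It uses scikit-learn for illustration (the paper does not specify this toolkit), and the sentence-ordering regression model is not shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Stand-ins for the Wikipedia (biographical) and TDT4 (non-biographical) corpora.
bio = ["He was born in 1922 in Vienna.",
       "She graduated from Oxford in 1970."]
non_bio = ["The storm caused widespread flooding.",
           "Oil prices rose sharply on Monday."]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(bio + non_bio, [1] * len(bio) + [0] * len(non_bio))
print(clf.predict(["He was educated at Cambridge."]))  # likely 1 (biographical)
```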


Archive | 2003

Domain-independent detection, extraction, and labeling of Atomic Events

Vasileios Hatzivassiloglou; Elena Filatova

The notion of an “event” has been widely used in the computational linguistics literature as well as in information retrieval and various NLP applications, although with significant variance in what exactly an event is. We describe an empirical study aimed at developing an operational definition of an event at the atomic (sentence or predicate) level, and use our observations to create a system for detecting and prioritizing the atomic events described in a collection of texts. We report results from testing our system on several sets of related texts, including human assessments of the system’s output and a comparison with information extraction techniques. We discuss how event detection at this level can be used for indexing, summarization, and question-answering.
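
As a rough illustration of the atomic-event representation, the sketch below pairs co-occurring actors in a sentence with a connector word between them and ranks the resulting triples by frequency. The pre-supplied entity and verb sets are assumptions standing in for the named-entity tagging and parsing a real system would perform.

```python
from collections import Counter
from itertools import combinations

def extract_events(sentences, entities, verbs):
    """sentences: token lists; entities/verbs: known actor and verb tokens."""
    events = Counter()
    for tokens in sentences:
        actors = [(i, t) for i, t in enumerate(tokens) if t in entities]
        for (i, a1), (j, a2) in combinations(actors, 2):
            # The connector is any verb appearing between the two actors.
            for conn in (t for t in tokens[i + 1:j] if t in verbs):
                events[(a1, conn, a2)] += 1
    return events.most_common()  # most frequent events rank as most important

sents = [["Milosevic", "ordered", "the", "army", "into", "Kosovo"]]
print(extract_events(sents, {"Milosevic", "Kosovo", "army"}, {"ordered"}))
```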


meeting of the association for computational linguistics | 2006

Automatic Creation of Domain Templates

Elena Filatova; Vasileios Hatzivassiloglou; Kathleen R. McKeown

Recently, many Natural Language Processing (NLP) applications have improved the quality of their output by using various machine learning techniques to mine Information Extraction (IE) patterns for capturing information from the input text. Currently, to mine IE patterns one must know in advance the type of information these patterns should capture. In this work, we propose a novel methodology for corpus analysis based on cross-examination of several document collections representing different instances of the same domain. We show that this methodology can be used for automatic domain template creation. As the problem of automatic domain template creation is rather new, there is no well-defined procedure for evaluating domain template quality. Thus, we propose a methodology for identifying what information should be present in the template. Using this information, we evaluate the automatically created domain templates through the text snippets retrieved according to the created templates.
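
A minimal sketch of the cross-examination idea: terms that recur across several collections describing different instances of the same domain (e.g., several airplane-crash corpora) are promoted to template candidates. The scoring and thresholds are illustrative assumptions, not the paper's method.

```python
from collections import Counter

def template_candidates(collections, min_collections=2, top_k=5):
    """collections: one token list per document collection (domain instance)."""
    per_collection = [Counter(tokens) for tokens in collections]
    vocab = set().union(*per_collection)
    scores = {}
    for term in vocab:
        hits = [c[term] for c in per_collection if term in c]
        # Keep only terms that generalize across several domain instances.
        if len(hits) >= min_collections:
            scores[term] = sum(hits)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

crash1 = "plane crashed killed investigators crashed".split()
crash2 = "jet crashed killed rescuers searched".split()
print(template_candidates([crash1, crash2]))  # ['crashed', 'killed']
```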


empirical methods in natural language processing | 2005

Tell Me What You Do and I'll Tell You What You Are: Learning Occupation-Related Activities for Biographies

Elena Filatova; John M. Prager

Biography creation requires the identification of important events in the life of the individual in question. While there are events such as birth and death that apply to everyone, most of the other activities tend to be occupation-specific. Hence, occupation gives important clues as to which activities should be included in the biography. We present techniques for automatically identifying which important events apply to the general population, which ones are occupation-specific, and which ones are person-specific. We use the extracted information as features for a multi-class SVM classifier, which is then used to automatically identify the occupation of a previously unseen individual. We present experiments involving 189 individuals from ten occupations, and we show that our approach accurately identifies general and occupation-specific activities and assigns unseen individuals to the correct occupations. Finally, we present evidence that our technique can lead to efficient and effective biography generation relying only on statistical techniques.
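
The three-way split of activities can be sketched with simple co-occurrence counts: an activity tied to one individual is person-specific, one confined to a single occupation is occupation-specific, and one spread across many occupations is general. The thresholds and toy data below are illustrative, not the paper's.

```python
from collections import defaultdict

def partition_activities(person_activities, person_occupation):
    occs_with = defaultdict(set)    # activity -> occupations it appears with
    people_with = defaultdict(set)  # activity -> people it appears with
    for person, acts in person_activities.items():
        for act in acts:
            occs_with[act].add(person_occupation[person])
            people_with[act].add(person)
    n_occupations = len(set(person_occupation.values()))
    general, occ_specific, person_specific = set(), set(), set()
    for act in occs_with:
        if len(people_with[act]) == 1:
            person_specific.add(act)   # seen for a single individual
        elif len(occs_with[act]) == 1:
            occ_specific.add(act)      # shared within one occupation
        elif len(occs_with[act]) >= max(2, n_occupations // 2):
            general.add(act)           # spread across occupations
    return general, occ_specific, person_specific

people = {"einstein": ["was born", "published theory"],
          "bohr": ["was born", "published theory"],
          "pele": ["was born", "scored goal"],
          "maradona": ["was born", "scored goal", "coached argentina"]}
occ = {"einstein": "physicist", "bohr": "physicist",
       "pele": "athlete", "maradona": "athlete"}
print(partition_activities(people, occ))
```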


Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies (CLIAWS3) | 2009

Directions for Exploiting Asymmetries in Multilingual Wikipedia

Elena Filatova

Multilingual Wikipedia has been used extensively for a variety of Natural Language Processing (NLP) tasks. Many Wikipedia entries (people, locations, events, etc.) have descriptions in several languages. These descriptions, however, are not identical. On the contrary, descriptions in different languages created for the same Wikipedia entry can vary greatly in terms of description length and information choice. Keeping these peculiarities in mind is necessary when using multilingual Wikipedia as a corpus for training and testing NLP applications. In this paper we present preliminary results on quantifying Wikipedia multilinguality. Our results support the observation that there is substantial variation in the descriptions of Wikipedia entries created in different languages. However, we believe that these asymmetries do not make multilingual Wikipedia an undesirable corpus for training NLP applications. On the contrary, we outline research directions that can utilize multilingual Wikipedia asymmetries to bridge the communication gaps in multilingual societies.
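
One simple way to quantify the asymmetry discussed above is the spread of description lengths for the same entry across languages. The sketch below computes a coefficient of variation; the length figures are made up for illustration, not drawn from the paper.

```python
from statistics import mean, pstdev

def length_asymmetry(lengths_by_lang):
    """Coefficient of variation of per-language description lengths."""
    lengths = list(lengths_by_lang.values())
    return pstdev(lengths) / mean(lengths)

# Hypothetical character counts for one entry's descriptions by language.
entry = {"en": 5200, "ru": 1400, "de": 2300, "fr": 800}
print(f"length asymmetry: {length_asymmetry(entry):.2f}")
```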


text retrieval conference | 2005

Building on Redundancy: Factoid Question Answering, Robust Retrieval and the "Other"

Dmitri Roussinov; Elena Filatova; Michael Chau; Jose Antonio Robles-Flores

We have explored how redundancy-based techniques can be used in improving factoid question answering, definitional questions ("other"), and robust retrieval. For the factoids, we explored the meta approach: we submitted the questions to several open-domain question answering systems available on the Web and applied our redundancy-based triangulation algorithm to analyze their outputs in order to identify the most promising answers. Our results support the added value of the meta approach: the performance of the combined system surpassed the underlying performances of its components. To answer definitional ("other") questions, we looked for sentences containing re-occurring pairs of noun entities containing the elements of the target. For robust retrieval, we applied our redundancy-based Internet mining technique to identify the concepts (single-word terms or phrases) that were highly related to the topic (query) and expanded the queries with them. All our results are above the mean performance in the categories in which we participated, with one of our robust runs being the best in its category among all 24 participants. Overall, our findings support the hypothesis that using as much textual data as possible, specifically data mined from the World Wide Web, is extremely promising.

FACTOID QUESTION ANSWERING

The Natural Language Processing (NLP) task behind Question Answering (QA) technology is known to be Artificial Intelligence (AI) complete: it requires computers to be as intelligent as people, to understand the deep semantics of human communication, and to be capable of common-sense reasoning. As a result, different systems have different capabilities. They vary in the range of tasks that they support, the types of questions they can handle, and the ways in which they present the answers. Following the example of meta search engines on the Web (Selberg & Etzioni, 1995), we advocate combining several fact seeking engines into a single "meta" approach. Meta search engines (sometimes called metacrawlers) can take a query consisting of keywords (e.g., "rotary engines"), send it to several portals (e.g., Google, MSN, etc.), and then combine the results. This allows them to provide better coverage and specialization. Examples are MetaCrawler (Selberg & Etzioni, 1995), 37.com (www.37.com), and Dogpile (www.dogpile.com). Although keyword-based meta search engines have been suggested and explored in the past, we are not aware of a similar approach being tried for the task of open domain/corpus question answering (fact seeking).

The practical benefits of the meta approach are justified by a general consideration: eliminating the "weakest link" dependency. It does not rely on a single system, which may fail or may simply not be designed for a specific type of task (question). The meta approach promises higher coverage and recall of correct answers, since different QA engines may cover different databases or different parts of the Web. In addition, the meta approach can reduce subjectivity by querying several engines; as in the real world, one can gather views from several people in order to make the answers more accurate and objective. The speed provided by several systems queried in parallel can also significantly exceed that obtained by working with only one system, since responsiveness may vary with the task and network traffic conditions. In addition, the meta approach fits nicely into the increasingly popular Web services model, where each service (QA engine) is independently developed and maintained and the meta engine integrates them while remaining organizationally independent from them. Since each engine may be provided by a commercial company interested in increasing its advertising revenue or by a research group showcasing its cutting-edge technology, the competition mechanism will also ensure quality and diversity among the services. Finally, a meta engine can be customized for a particular portal, such as those supporting business intelligence or education, or serving visually impaired or mobile phone users.

[Figure 1. Example of START output. Figure 2. Example of BrainBoost output.]

Meta Approach Defined

We define a fact seeking meta engine as a system that can combine, analyze, and represent the answers obtained from several underlying systems (called answer services throughout our paper). At least some of these underlying services (systems) have to be capable of providing candidate answers to some types of questions asked in natural language form; otherwise the overall architecture would not be any different from a single fact seeking engine, which is typically based on a commercial keyword search engine, e.g., Google. The technology behind each of the answer services can be as complex as deep semantic NLP or as simple as shallow pattern matching.

Fact Seeking Service | Web Address                | Output Format              | Organization/System | Performance in Our Evaluation (MRR)
START                | start.csail.mit.edu        | Single answer sentence     | Research prototype  | 0.049**
AskJeeves            | www.ask.com                | Up to 200 ordered snippets | Commercial          | 0.397**
BrainBoost           | www.brainboost.com         | Up to 4 snippets           | Commercial          | 0.409*
ASU QA on the Web    | qa.wpcarey.asu.edu         | Up to 20 ordered sentences | Research prototype  | 0.337**
Wikipedia            | en.wikipedia.org           | Narrative                  | Non-profit          | 0.194**
ASU Meta QA          | http://qa.wpcarey.asu.edu/ | Precise answer             | Research prototype  | 0.435

Table 1. The fact seeking services involved, their characteristics, and their performance in the evaluation on the 2004 questions. * and ** indicate 0.1 and 0.05 levels of statistical significance of the difference from the best, respectively.

Challenges Faced and Addressed

Combining multiple fact seeking engines also faces several challenges. First, their output formats may differ: some engines produce an exact answer (e.g., START), while others present one sentence or an entire snippet (several sentences), similar to web search engines, as shown in Figures 1-4. Table 1 summarizes those differences and other capabilities of the popular fact seeking engines. Second, the accuracy of responses may differ overall, and may vary even more depending on the specific type of question. Finally, we have to deal with multiple answers, so removing duplicates and resolving answer variations is necessary. The issues with merging search results from multiple engines have already been explored by MetaCrawler (Selberg & Etzioni, 1995) and by fusion studies in information retrieval (e.g., Vogt & Cottrell, 1999), but only in the context of merging lists of retrieved text documents. We argue that the task of fusing multiple short answers, which may potentially conflict with or confirm each other, is fundamentally different and poses a new challenge for researchers. For example, some answer services (components) may be very precise (e.g., START) but cover only a small proportion of questions. They need to be backed up by less precise services that have higher coverage (e.g., AskJeeves). However, backing up may easily result in diluting the answer set with spurious (wrong) answers. Thus, there is a need for some kind of triangulation of the candidate answers provided by the different services, or of multiple candidate answers provided by the same service.

[Figure 3. Example of Ask Jeeves output. Figure 4. Example of ASU QA output.]

Triangulation, a term widely used in intelligence and journalism, stands for confirming or disconfirming facts by using multiple sources. Roussinov et al. (2004) went one step further than the frequency counts explored earlier by Dumais et al. (2002) and groups involved in TREC competitions: they explored a more fine-grained triangulation process, which we also used in our prototype. Their algorithm can be demonstrated by the following intuitive example. Imagine that we have two candidate answers for the question "What was the purpose of the Manhattan Project?": 1) "To develop a nuclear bomb" and 2) "To create an atomic weapon". These two answers support (triangulate) each other since they are semantically similar. However, a straightforward frequency count approach would not pick up this similarity. The advantage of triangulation over simple frequency counting is that it is more powerful for less "factual" questions, such as those that may allow variation in the correct answers. In order to enjoy the full power of triangulation with factoid questions (e.g., "Who is the CEO of IBM?"), the candidate answers have to be extracted from their sentences (e.g., Samuel Palmisano) so that they can be more accurately compared with the other candidate answers (e.g., Sam Palmisano). That is why the meta engine needs to possess answer understanding capabilities as well, including such crucial capabilities as question interpretation and semantic verification that candidate answers belong to the desired category (person, in the example above).

[Figure 5. The Meta approach to fact seeking.]

Fact Seeking Engine Meta Prototype: Underlying Technologies and Architecture

In the first version of our prototype, we included several freely available demonstration prototypes and popular commercial engines on the Web that have some QA (fact seeking) capabilities, specifically START, AskJeeves, BrainBoost, and ASU QA (Table 1, Figures 1-4). We also added Wikipedia to the list. Although it does not have QA capabilities, it provides good-quality factual information on a variety of topics, which adds power to our triangulation mechanism. Google was not used directly as a service, but BrainBoost and ASU QA already use it among the other major keyword search engines. The meta-search part of our system was based on the MetaSpider architecture (Chau et al., 2001; Chen et al., 2001). Multiple threads are launched to submit the query and fetch the candidate answers from each service. After these results are obtained, the system performs answer extraction, triangulation, and semantic verification.
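
The triangulation step lends itself to a compact illustration. Below is a minimal sketch of the idea described above, using plain token overlap as a crude stand-in for the semantic similarity the authors describe: it would match "Samuel Palmisano" with "Sam Palmisano", but not "nuclear bomb" with "atomic weapon". The candidate list, confidences, and function names are hypothetical.

```python
def jaccard(a, b):
    """Token-overlap similarity: a crude proxy for semantic similarity."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def triangulate(candidates):
    """candidates: list of (answer, service_confidence)."""
    scored = []
    for ans, conf in candidates:
        # Each other candidate adds support proportional to its similarity,
        # so mutually confirming answers reinforce each other.
        support = sum(jaccard(ans, other) for other, _ in candidates
                      if other is not ans)
        scored.append((conf * (1 + support), ans))
    return max(scored)[1]

print(triangulate([("Samuel Palmisano", 0.6), ("Sam Palmisano", 0.5),
                   ("Lou Gerstner", 0.4)]))  # -> Samuel Palmisano
```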


data and knowledge engineering | 2012

Editorial: Occupation inference through detection and classification of biographical activities

Elena Filatova; John M. Prager

Dealing with biographical information (e.g., biography generation, answering biography-related questions, etc.) requires the identification of important activities in the life of the individual in question. While there are activities that can be used in any biography (e.g., the person was born on a particular date, the person lived in a particular location, etc.), many activities used in biographies tend to be occupation-related, while others are person-specific. Hence, occupation gives important clues as to which activities should be included in the biography. In this paper, we present a methodology for identifying a three-level hierarchy of biographical activities: those that apply to the general population, those that are occupation-related, and those that are person-specific. We use the obtained occupation-related activities as features for a multi-class SVM classifier to identify the occupation of a previously unseen individual. We also show that the activities automatically obtained from text can be used as features not only for a classification task but for a clustering task as well. We show that, given the correct number of clusters, people belonging to the same occupation are clustered together. At the same time, clustering people into a smaller number of classes allows the grouping of practitioners of occupations that share a considerable number of occupation-related activities. Thus, by analyzing descriptions of people belonging to various occupations, we can build a hierarchy of occupations.
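
A hedged sketch of the clustering experiment: people represented by activity features and grouped with k-means. The toy feature matrix and the use of scikit-learn are assumptions for illustration, not the paper's setup.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer

# Hypothetical binary activity features for four individuals.
people = {
    "physicist_1": {"won prize": 1, "published theory": 1},
    "physicist_2": {"published theory": 1, "taught": 1},
    "athlete_1":   {"won medal": 1, "set record": 1},
    "athlete_2":   {"won medal": 1, "retired": 1},
}
X = DictVectorizer(sparse=False).fit_transform(people.values())
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(people, labels)))  # expect the two occupations in separate clusters
```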

Collaboration


Dive into Elena Filatova's collaborations.

Top Co-Authors

Eduard H. Hovy

Carnegie Mellon University


Hong Yu

University of Massachusetts Medical School


James H. Martin

University of Colorado Boulder
