Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Arunav Mishra is active.

Publication


Featured research published by Arunav Mishra.


Web Search and Data Mining | 2017

Modeling Event Importance for Ranking Daily News Events

Vinay Setty; Abhijit Anand; Arunav Mishra; Avishek Anand

We deal with the problem of ranking news events on a daily basis for large news corpora, an essential building block for news aggregation. News ranking has been addressed in the literature before, but with individual news articles as the unit of ranking. However, estimating event importance accurately requires models that quantify an event's current-day importance as well as its significance in historical context. Consequently, in this paper we show that a cluster of news articles representing an event is a better unit of ranking, as it provides improved estimates of popularity, source diversity, and authority cues. In addition, events facilitate quantifying their historical significance by linking them with long-running topics and recent chains of events. Our main contribution is a set of effective models for improved news event ranking. To this end, we propose novel event mining and feature generation approaches for improving estimates of event importance. Finally, we conduct an extensive evaluation of our approaches on two large real-world news corpora, each spanning more than a year with volumes of up to tens of thousands of daily news articles. Our evaluations are large-scale and based on a clean, human-curated ground truth from the Wikipedia Current Events Portal. Experimental comparison with a state-of-the-art news ranking technique based on language models demonstrates the effectiveness of our approach.
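The abstract's core argument is that a cluster of articles exposes signals a single article cannot: popularity (cluster size), source diversity, and authority. A minimal sketch of this idea (the specific features, weights, and data are illustrative assumptions, not the paper's actual model):

```python
from collections import namedtuple

Article = namedtuple("Article", ["source", "authority"])

def event_importance(cluster, w_pop=1.0, w_div=1.0, w_auth=1.0):
    """Illustrative event score: a cluster of articles exposes popularity
    (article count), source diversity, and authority cues directly."""
    popularity = len(cluster)                         # how widely reported
    diversity = len({a.source for a in cluster})      # distinct outlets
    authority = max((a.authority for a in cluster), default=0.0)
    return w_pop * popularity + w_div * diversity + w_auth * authority

# Toy daily corpus: two events, each a cluster of articles.
day_events = {
    "election": [Article("bbc", 0.9), Article("cnn", 0.8), Article("bbc", 0.7)],
    "local-fair": [Article("blog", 0.2)],
}
ranked = sorted(day_events, key=lambda e: event_importance(day_events[e]), reverse=True)
print(ranked)
```

The widely covered event outranks the single-source one, which is exactly the cue that is lost when ranking articles individually.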


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2016

Event Digest: A Holistic View on Past Events

Arunav Mishra; Klaus Berberich

For a general user, the vast amount of online information available on past events has made retrospection much harder. We propose the problem of automatic event digest generation to aid effective and efficient retrospection. For this, in addition to text, a digest should maximize the coverage of time, geolocations, and entities to present a holistic view of the past event of interest. We propose a novel divergence-based framework that selects excerpts from an initial set of pseudo-relevant documents such that overall relevance is maximized while redundancy in text, time, geolocations, and named entities is avoided, by treating them as independent dimensions of an event. Our method formulates the problem as an Integer Linear Program (ILP) for global inference to diversify across the event dimensions. Relevance and redundancy measures are defined based on the JS-divergence between independent query and excerpt models estimated for each event dimension. Extensive experiments on three real-world datasets compare our methods against the state of the art from the literature. Using Wikipedia articles as gold-standard summaries in our evaluation, we find that the most holistic digest of an event is generated by our method that integrates all event dimensions. We compare all methods using standard Rouge-1, -2, and -SU4, along with Rouge-NP and a novel weighted variant of Rouge.
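The relevance and redundancy measures above rest on Jensen-Shannon divergence between query and excerpt models. A self-contained sketch of the JS computation (the toy unigram models below are assumptions for illustration; the paper estimates one such model per event dimension):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) over a shared support (log base 2)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded in [0, 1] with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy unigram models over a shared three-word vocabulary.
query_model   = [0.5, 0.3, 0.2]
excerpt_model = [0.4, 0.4, 0.2]
print(js_divergence(query_model, excerpt_model))
```

Because JS-divergence is symmetric and bounded, it can serve both as a relevance score (query vs. excerpt) and a redundancy score (excerpt vs. excerpt) inside the ILP objective.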


Cross Language Evaluation Forum | 2013

Overview of INEX 2013

Patrice Bellot; Antoine Doucet; Shlomo Geva; Sairam Gurajada; Jaap Kamps; Gabriella Kazai; Marijn Koolen; Arunav Mishra; Véronique Moriceau; Josiane Mothe; Michael Preminger; Eric SanJuan; Ralf Schenkel; Xavier Tannier; Martin Theobald; Matthew Trappett; Qiuyue Wang

INEX investigates focused retrieval from structured documents by providing large test collections of structured documents, uniform evaluation measures, and a forum for organizations to compare their results. This paper reports on the INEX 2013 evaluation campaign, which consisted of four activities addressing three themes: searching professional and user-generated data (Social Book Search track); searching structured or semantic data (Linked Data track); and focused retrieval (Snippet Retrieval and Tweet Contextualization tracks). INEX 2013 was an exciting year in which we consolidated the collaboration with other activities in CLEF and, for the second time, ran our workshop as part of the CLEF labs in order to facilitate knowledge transfer between the evaluation forums. This paper gives an overview of all the INEX 2013 tracks, their aims and tasks, and the test collections built, and provides an initial analysis of the results.


European Conference on Information Retrieval | 2016

Leveraging Semantic Annotations to Link Wikipedia and News Archives

Arunav Mishra; Klaus Berberich

The overwhelming amount of information available online has made it difficult to retrospect on past events. We propose a novel linking problem that connects excerpts from Wikipedia summarizing events to online news articles elaborating on them. To address this linking problem, we cast it into an information retrieval task by treating a given excerpt as a user query with the goal of retrieving a ranked list of relevant news articles. We find that Wikipedia excerpts often come with additional semantics in their textual descriptions, representing the time, geolocations, and named entities involved in the event. Our retrieval model leverages text and semantic annotations as different dimensions of an event by estimating independent query models to rank documents. In our experiments on two datasets, we compare methods that consider different combinations of dimensions and find that the approach leveraging all dimensions suits our problem best.
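Scoring with independent query models per dimension can be sketched as a weighted sum of per-dimension log-likelihoods. All model values, smoothing, and the example documents below are illustrative assumptions, not the paper's estimates:

```python
import math

def dim_loglik(query_terms, doc_model, smoothing=1e-6):
    """Log-likelihood of one dimension's query terms under a document's unigram model."""
    return sum(math.log(doc_model.get(t, smoothing)) for t in query_terms)

def score(query, doc, weights):
    """Combine independent event dimensions (text, time, entities) by weighted log-likelihood."""
    return sum(w * dim_loglik(query[d], doc[d]) for d, w in weights.items())

query = {"text": ["earthquake", "relief"], "time": ["2010-01"], "entity": ["Haiti"]}
doc_a = {"text": {"earthquake": 0.1, "relief": 0.05}, "time": {"2010-01": 0.6}, "entity": {"Haiti": 0.3}}
doc_b = {"text": {"earthquake": 0.02}, "time": {"2011-03": 0.5}, "entity": {"Japan": 0.4}}
weights = {"text": 1.0, "time": 1.0, "entity": 1.0}
print(score(query, doc_a, weights) > score(query, doc_b, weights))
```

A document matching the excerpt in time and entities, not just text, ranks higher, which is the effect of combining all dimensions rather than text alone.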


International World Wide Web Conferences | 2015

EXPOSÉ: EXploring Past news fOr Seminal Events

Arunav Mishra; Klaus Berberich

Recent increases in digitization and archiving efforts on news data have led to overwhelming amounts of online information for general users, thus making it difficult for them to retrospect on past events. One dimension along which past events can be effectively organized is time. Motivated by this idea, we introduce EXPOSÉ, an exploratory search system that explicitly uses temporal information associated with events to link different kinds of information sources for effective exploration of past events. In this demonstration, we use Wikipedia and news articles as two orthogonal sources. Wikipedia is viewed as an event directory that systematically lists seminal events in a year; news articles are viewed as a source of detailed information on each of these events. To this end, our demo includes several time-aware retrieval approaches that a user can employ for retrieving relevant news articles, as well as a timeline tool for temporal analysis and entity-based facets for filtering results.


Exploiting Semantic Annotations in Information Retrieval | 2012

Design and Evaluation of an IR Benchmark for SPARQL Queries with Fulltext Conditions

Arunav Mishra; Sairam Gurajada; Martin Theobald

In this paper, we describe our goals in introducing a new, annotated benchmark collection with which we aim to bridge the gap between the fundamentally different aspects involved in querying both structured and unstructured data. This semantically rich collection, captured in a unified XML format, combines components (unstructured text, semistructured infoboxes, and category structure) from 3.1 million Wikipedia articles with highly structured RDF properties from both DBpedia and YAGO2. The new collection serves as the basis of the INEX 2012 Ad-hoc, Faceted Search, and Jeopardy retrieval tasks. With a focus on the new Jeopardy task, we particularly motivate using the collection for question-answering (QA) style retrieval settings, which we also exemplify by introducing a set of 90 QA-style benchmark queries shipped in a SPARQL-based query format that has been extended with fulltext filter conditions.


Ph.D. Workshop on Information and Knowledge Management | 2014

Linking Today's Wikipedia and News from the Past

Arunav Mishra

In this paper we propose a novel task of automatically linking Wikipedia excerpts describing events to past news articles. Constantly evolving Wikipedia articles tend to summarize past events, abstracting away fine-grained details that mattered when the event happened. Contemporary news articles, on the other hand, provide details of events as they happened. With connections between these two orthogonal information sources in place, a user could jump between them to acquire a holistic view of past events. We cast the linking problem into two retrieval tasks, delineate the challenges involved in both, and propose a single framework for addressing them. To build a better understanding of the problem, we initially consider the simpler task of linking Wikipedia events that are systematically curated into years, decades, and centuries to relevant news articles from the past. These events come with a short textual description and a date indicating when the event happened. We present a two-stage cascade approach that leverages the temporal information associated with a given event to improve linking effectiveness. We additionally design several baselines and show that our approach outperforms all of them. Through the results of studying this simplified task, we come a step closer to solving the larger problem proposed in this paper. As future work, we plan to build an automatic linking system that addresses the challenges identified here.
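A two-stage cascade that exploits the event's date can be sketched as: first filter articles by temporal proximity, then rank the survivors by textual match. The window size, overlap scoring, and toy articles below are illustrative assumptions, not the paper's actual method:

```python
from datetime import date, timedelta

def cascade_link(event_text, event_date, articles, window_days=3):
    """Two-stage sketch: (1) keep articles published near the event date,
    (2) rank survivors by word overlap with the event description."""
    event_words = set(event_text.lower().split())
    window = timedelta(days=window_days)
    candidates = [a for a in articles if abs(a["date"] - event_date) <= window]
    return sorted(candidates,
                  key=lambda a: len(event_words & set(a["text"].lower().split())),
                  reverse=True)

articles = [
    {"id": 1, "date": date(2004, 12, 26), "text": "tsunami strikes Indian Ocean coast"},
    {"id": 2, "date": date(2004, 12, 27), "text": "aid arrives after tsunami"},
    {"id": 3, "date": date(2005, 6, 1), "text": "tsunami anniversary coverage"},
]
ranked = cascade_link("Indian Ocean tsunami", date(2004, 12, 26), articles)
print([a["id"] for a in ranked])
```

The temporal stage discards the anniversary article outright, so the text stage only has to distinguish among contemporaneous reports.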


European Conference on Information Retrieval | 2018

Long-Span Language Models for Query-Focused Unsupervised Extractive Text Summarization

Mittul Singh; Arunav Mishra; Youssef Oualil; Klaus Berberich; Dietrich Klakow

Effective unsupervised query-focused extractive summarization systems use query-specific features along with short-range language models (LMs) in the sentence ranking and selection subtasks of summarization. We hypothesize that applying long-span n-gram-based and neural LMs, which better capture larger context, can help improve these subtasks. Hence, we present the first attempt to apply long-span models to a query-focused summarization task in an unsupervised setting. We also propose Across-Sentence-Boundary LSTM-based LMs, ASBLSTM and biASBLSTM, that are geared towards the query-focused summarization subtasks. Intrinsic and extrinsic experiments on a real-world corpus with 100 Wikipedia event descriptions as queries show that the long-span models, applied in an integer linear programming (ILP) formulation of the MMR criterion, are the most effective against several state-of-the-art baseline methods from the literature.
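The MMR criterion mentioned above trades query relevance against redundancy with already-selected sentences. The paper solves it globally as an ILP; the greedy loop below is a deliberate simplification to convey the trade-off, with made-up relevance and similarity scores:

```python
def mmr_select(sentences, relevance, similarity, k=2, lam=0.7):
    """Greedy Maximal Marginal Relevance: at each step pick the sentence
    balancing query relevance against similarity to selected sentences.
    (A simplification of the global ILP formulation in the abstract.)"""
    selected = []
    remaining = list(sentences)
    while remaining and len(selected) < k:
        best = max(remaining,
                   key=lambda s: lam * relevance[s]
                   - (1 - lam) * max((similarity[(s, t)] for t in selected), default=0.0))
        selected.append(best)
        remaining.remove(best)
    return selected

sentences = ["s1", "s2", "s3"]
relevance = {"s1": 0.9, "s2": 0.8, "s3": 0.5}
# s1 and s2 are near-duplicates; everything else is dissimilar.
similarity = {(a, b): (0.95 if {a, b} == {"s1", "s2"} else 0.1)
              for a in sentences for b in sentences if a != b}
print(mmr_select(sentences, relevance, similarity))
```

Although s2 is more relevant than s3, its redundancy with the already-chosen s1 pushes the selection to s3, which is the behavior MMR exists to produce.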


European Conference on Information Retrieval | 2017

How Do Order and Proximity Impact the Readability of Event Summaries?

Arunav Mishra; Klaus Berberich

Organizing the structure of fixed-length text summaries of events is important for their coherence and readability. However, typical measures used for evaluation in text summarization tasks often ignore this structure. In this paper, we conduct an empirical study on a crowdsourcing platform to gain insights into the regularities that make a text summary coherent and readable. For this, we generate four variants of human-written 10-sentence text summaries for 100 seminal events and conduct three experiments. Experiments 1 and 2 analyze the impact of sentence ordering and of proximity between originally adjacent sentences, respectively. Experiment 3 analyzes the feasibility of conducting such a study on a crowdsourcing platform. We release our data to facilitate future work, such as designing dedicated measures to evaluate summary structure.


Conference on Information and Knowledge Management | 2016

Estimating Time Models for News Article Excerpts

Arunav Mishra; Klaus Berberich

It is often difficult to ground text to precise time intervals due to the inherent uncertainty arising from missing or multiple temporal expressions at year, month, and day granularities. We address the problem of estimating an excerpt-time model that captures the temporal scope of a given news article excerpt as a probability distribution over chronons. For this, we propose a semi-supervised distribution propagation framework that leverages redundancy in the data to improve the quality of the estimated time models. Our method generates an event graph with excerpts as nodes and various inter-excerpt relations as edges. It then propagates empirical excerpt-time models estimated for temporally annotated excerpts to those that are strongly related but lack annotations. In our experiments, we first generate a test query set by randomly sampling 100 Wikipedia events as queries. For each query, using a standard text retrieval model, we obtain the top-10 documents with an average of 150 excerpts, among which each temporally annotated excerpt is considered gold standard. The evaluation measures are first computed per gold-standard excerpt for a single query, by comparing the model estimated with our method to the empirical model from the original expressions; final scores are reported by averaging over all test queries. Experiments on the English Gigaword corpus show that our method estimates significantly better time models than several baselines from the literature.
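The propagation idea can be sketched on a tiny event graph: an unannotated excerpt adopts the average of its neighbours' time models, while annotated excerpts blend their own empirical model with the neighbourhood average. The blending weight, graph, and yearly chronons below are illustrative assumptions, not the paper's estimation procedure:

```python
def normalize(d):
    total = sum(d.values())
    return {k: v / total for k, v in d.items()}

def propagate(graph, models, alpha=0.5, iterations=1):
    """Sketch of distribution propagation over an event graph of excerpts.
    Unannotated nodes inherit the average of their neighbours' time models;
    annotated nodes blend their own model with that average (weight alpha)."""
    for _ in range(iterations):
        updated = {}
        for node, neighbours in graph.items():
            neigh = [models[n] for n in neighbours if n in models]
            if not neigh:
                continue  # no annotated neighbours yet; leave node untouched
            avg = {}
            for m in neigh:
                for chronon, p in m.items():
                    avg[chronon] = avg.get(chronon, 0.0) + p / len(neigh)
            if node in models:
                own = models[node]
                blended = {c: alpha * own.get(c, 0.0) + (1 - alpha) * avg.get(c, 0.0)
                           for c in set(own) | set(avg)}
                updated[node] = normalize(blended)
            else:
                updated[node] = normalize(avg)
        models.update(updated)
    return models

# Excerpt "e3" has no temporal annotation; it inherits a model from e1 and e2.
graph = {"e1": ["e2"], "e2": ["e1", "e3"], "e3": ["e1", "e2"]}
models = {"e1": {"2004": 0.8, "2005": 0.2}, "e2": {"2004": 0.6, "2005": 0.4}}
result = propagate(graph, dict(models))
print(result["e3"])
```

After one round, e3 carries a proper probability distribution over chronons, which is exactly what the framework supplies to excerpts that miss annotations.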

Collaboration


Dive into Arunav Mishra's collaborations.

Top Co-Authors

Jaap Kamps

University of Amsterdam
