
Publication


Featured research published by Dipasree Pal.


Journal of the Association for Information Science and Technology | 2014

Improving query expansion using WordNet

Dipasree Pal; Mandar Mitra; Kalyankumar Datta

This study proposes a new way of using WordNet for query expansion (QE). We choose candidate expansion terms from a set of pseudo-relevant documents; however, the usefulness of these terms is measured based on their definitions provided in a hand-crafted lexical resource such as WordNet. Experiments with a number of standard TREC collections show that this method outperforms existing WordNet-based methods. It also compares favorably with established QE methods such as KLD and RM3. Leveraging earlier work in which a combination of QE methods was found to outperform each individual method (as well as other well-known QE methods), we next propose a combination-based QE method that takes into account three different aspects of a candidate expansion term's usefulness: (a) its distribution in the pseudo-relevant documents and in the target corpus, (b) its statistical association with query terms, and (c) its semantic relation with the query, as determined by the overlap between the WordNet definitions of the term and the query terms. This combination of diverse sources of information appears to work well on a number of test collections, viz., the TREC123, TREC5, TREC678, TREC Robust (new), and TREC910 collections, and yields significant improvements over competing methods on most of these collections.
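
As an illustration of the third signal, semantic relation via gloss overlap, here is a minimal Python sketch using NLTK's WordNet interface (assuming the WordNet data is installed). The overlap measure is a simple illustrative choice, not necessarily the exact one used in the paper.

    from nltk.corpus import wordnet as wn

    def gloss_words(word):
        # Collect all words appearing in the WordNet glosses (definitions) of `word`.
        words = set()
        for synset in wn.synsets(word):
            words.update(synset.definition().lower().split())
        return words

    def semantic_overlap(candidate, query_terms):
        # Fraction of the candidate's gloss vocabulary shared with the query glosses.
        cand = gloss_words(candidate)
        if not cand:
            return 0.0
        query = set()
        for q in query_terms:
            query.update(gloss_words(q))
        return len(cand & query) / len(cand)

    # Toy usage: rank candidate expansion terms for the query "ocean pollution".
    for term in ("marine", "contamination", "guitar"):
        print(term, round(semantic_overlap(term, ["ocean", "pollution"]), 3))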


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2011

A novel corpus-based stemming algorithm using co-occurrence statistics

Jiaul H. Paik; Dipasree Pal; Swapan K. Parui

We present a stemming algorithm for text retrieval. The algorithm uses co-occurrence statistics for pairs of word variants, collected through corpus analysis. We use a very simple co-occurrence measure that reflects how often a pair of word variants occurs in a document as well as in the whole corpus. A graph is formed in which the word variants are the nodes, and two word variants form an edge if they co-occur. Based on the co-occurrence measure, an edge strength is defined for each edge. Finally, based on the edge strengths, we propose a partition algorithm that groups the word variants around their strongest neighbors, that is, the neighbors with the largest strengths. Our stemming algorithm has two static parameters and uses no information other than the co-occurrence statistics from the corpus. Experiments on TREC, CLEF and FIRE data covering four European and two Asian languages show a significant improvement over the no-stemming strategy on all the languages. The proposed algorithm also significantly outperforms a number of strong stemmers, including rule-based ones, on several languages. For highly inflectional languages, a relative improvement of about 50% is obtained over unnormalized words, and a relative improvement ranging from 5% to 16% is obtained over the rule-based stemmer for the language concerned.
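
A minimal Python sketch of the grouping step (illustrative; the edge-strength definition and the 3-character-prefix notion of "variant" are crude stand-ins for the paper's measures): count co-occurrences, then merge each word with its strongest neighbour via union-find.

    from collections import defaultdict
    from itertools import combinations

    def cooccurrence_counts(docs):
        # For each pair of would-be variants (crudely: words sharing a
        # 3-character prefix), count the documents containing both.
        counts = defaultdict(int)
        for doc in docs:
            for u, v in combinations(sorted(set(doc)), 2):
                if u[:3] == v[:3]:
                    counts[(u, v)] += 1
        return counts

    def cluster_variants(vocab, counts):
        # Merge every word with its strongest neighbour using union-find.
        parent = {w: w for w in vocab}

        def find(w):
            while parent[w] != w:
                parent[w] = parent[parent[w]]  # path halving
                w = parent[w]
            return w

        strongest = {}
        for (u, v), c in counts.items():
            for a, b in ((u, v), (v, u)):
                if c > strongest.get(a, (0, None))[0]:
                    strongest[a] = (c, b)
        for a, (_, b) in strongest.items():
            parent[find(a)] = find(b)

        groups = defaultdict(set)
        for w in vocab:
            groups[find(w)].add(w)
        return list(groups.values())

    docs = [["connect", "connected"], ["connected", "connection"],
            ["connect", "connection"], ["orange"]]
    vocab = {w for d in docs for w in d}
    # Groups the three "connect*" variants together; "orange" stays alone.
    print(cluster_variants(vocab, cooccurrence_counts(docs)))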


ACM Transactions on Asian Language Information Processing | 2010

The FIRE 2008 Evaluation Exercise

Prasenjit Majumder; Mandar Mitra; Dipasree Pal; Ayan Bandyopadhyay; Samaresh Maiti; Sukomal Pal; Deboshree Modak; Sucharita Sanyal

The aim of the Forum for Information Retrieval Evaluation (FIRE) is to create an evaluation framework in the spirit of TREC (Text REtrieval Conference), CLEF (Cross-Language Evaluation Forum), and NTCIR (NII Test Collection for IR Systems), for Indian language Information Retrieval. The first evaluation exercise conducted by FIRE was completed in 2008. This article describes the test collections used at FIRE 2008, summarizes the approaches adopted by various participants, discusses the limitations of the datasets, and outlines the tasks planned for the next iteration of FIRE.


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2008

Text collections for FIRE

Prasenjit Majumder; Mandar Mitra; Dipasree Pal; Ayan Bandyopadhyay; Samaresh Maiti; Sukanya Mitra; Aparajita Sen; Sukomal Pal

The aim of the Forum for Information Retrieval Evaluation (FIRE) is to create a Cranfield-like evaluation framework in the spirit of TREC, CLEF and NTCIR, for Indian Language Information Retrieval. For the first year, six Indian languages have been selected: Bengali, Hindi, Marathi, Punjabi, Tamil, and Telugu. This poster describes the tasks as well as the document and topic collections that are to be used at the FIRE workshop.


Cross-Language Evaluation Forum | 2008

Bulgarian, Hungarian and Czech Stemming Using YASS

Prasenjit Majumder; Mandar Mitra; Dipasree Pal

This is the second year in a row that we have participated in CLEF. Our aim is to test the performance of a statistical stemmer on various languages. For CLEF 2006, we tried the stemmer on French [1]; for CLEF 2007, we carried out experiments on the Hungarian, Bulgarian and Czech monolingual tasks. We find that, for all languages, YASS produces significant improvements over the baseline (unstemmed) runs. The performance of YASS is also comparable to that of other available stemmers for all three East European languages.
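
For a flavour of what a statistical stemmer like YASS clusters on, here is a minimal sketch of a prefix-sensitive string distance in the same spirit (not the paper's exact D-measures): a mismatch near the beginning of a word costs exponentially more than one near the end, so words sharing a long prefix end up close together.

    def prefix_distance(x, y):
        # A mismatch at position i costs 1/2**i, so early mismatches dominate.
        n = max(len(x), len(y))
        x, y = x.ljust(n), y.ljust(n)
        return sum(1.0 / 2 ** i for i in range(n) if x[i] != y[i])

    print(prefix_distance("astronomer", "astronomy"))   # tiny: long shared prefix
    print(prefix_distance("astronomer", "gastronomy"))  # large: mismatch at the start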


arXiv: Information Retrieval | 2018

Query Expansion Using Term Distribution and Term Association

Dipasree Pal; Mandar Mitra; Samar Bhattacharya

Good term selection is an important issue for any automatic query expansion (AQE) technique. AQE techniques that select expansion terms from the target corpus usually do so in one of two ways. Distribution-based term selection compares the distribution of a term in the (pseudo-)relevant documents with its distribution in the whole corpus, or with a random distribution. Two well-known distribution-based methods are based on Kullback-Leibler Divergence (KLD) and Bose-Einstein statistics (Bo1). Association-based term selection, on the other hand, uses information about how a candidate term co-occurs with the original query terms. Local Context Analysis (LCA) and the Relevance-based Language Model (RM3) are examples of association-based methods. Our goal in this study is to investigate how these two classes of methods may be combined to improve retrieval effectiveness. We propose the following combination-based approach: candidate expansion terms are first obtained using a distribution-based method, and this set is then refined based on the strength of the terms' association with the original query terms. We test our methods on 11 TREC collections. The proposed combinations generally yield better results than each individual method, as well as other state-of-the-art AQE approaches. En route to our primary goal, we also propose some modifications to LCA and Bo1 which lead to improved performance.
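
A minimal sketch of the two-stage combination idea: score candidates with a KLD-style distribution criterion, then refine the set by co-occurrence with the query. The formulas are standard textbook forms, not necessarily the exact variants used in the paper.

    import math
    from collections import Counter

    def kld_scores(feedback_docs, corpus_counts, corpus_len):
        # score(t) = p(t|F) * log(p(t|F) / p(t|C)) over the feedback set F.
        # Terms seen in feedback also occur in the corpus, so p(t|C) > 0.
        fb = Counter(w for doc in feedback_docs for w in doc)
        fb_len = sum(fb.values())
        scores = {}
        for t, tf in fb.items():
            p_f = tf / fb_len
            p_c = corpus_counts[t] / corpus_len
            scores[t] = p_f * math.log(p_f / p_c)
        return scores

    def association_filter(scores, feedback_docs, query_terms, keep=10):
        # Refine: prefer candidates that co-occur with the query terms
        # (in practice one would also drop the query terms themselves).
        assoc = Counter()
        for doc in feedback_docs:
            words = set(doc)
            if words & set(query_terms):
                for t in words:
                    assoc[t] += 1
        ranked = sorted(scores, key=lambda t: (assoc[t], scores[t]), reverse=True)
        return ranked[:keep]

    # Toy usage with hypothetical collection statistics.
    feedback = [["solar", "energy", "panel"], ["solar", "power"]]
    corpus_counts = Counter({"solar": 4, "energy": 3, "panel": 2,
                             "power": 10, "the": 500})
    scores = kld_scores(feedback, corpus_counts, corpus_len=519)
    print(association_filter(scores, feedback, ["solar"], keep=3))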


ACM Transactions on Information Systems | 2013

Effective and Robust Query-Based Stemming

Jiaul H. Paik; Swapan K. Parui; Dipasree Pal; Stephen E. Robertson

Stemming is a widely used technique in information retrieval systems for addressing the vocabulary mismatch problem arising out of morphological phenomena. The major shortcoming of commonly used stemmers is that they accept the morphological variants of the query words without considering their thematic coherence with the given query, which leads to poor performance. Moreover, for many queries, such approaches produce retrieval performance that is poorer than no stemming at all, degrading robustness. The main goal of this article is to present corpus-based, fully automatic stemming algorithms which address these issues. A set of experiments on six TREC collections and three other non-English collections containing news and web documents shows that the proposed query-based stemming algorithms consistently and significantly outperform four state-of-the-art stemmers built on completely different principles. Our experiments also confirm that the robustness of the proposed query-based stemming algorithms is remarkably better than that of the existing strong baselines.
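
A minimal sketch of the query-based selection step (an illustrative simplification, not the paper's algorithm): accept a morphological variant of a query word only if it tends to co-occur with the rest of the query.

    def coherent_variants(variants, other_query_terms, docs, min_cooccur=1):
        # Keep variants that appear in documents also containing other query terms.
        others = set(other_query_terms)
        kept = []
        for v in variants:
            support = sum(1 for d in docs if v in d and others & set(d))
            if support >= min_cooccur:
                kept.append(v)
        return kept

    # Toy usage: "boxing" is a surface variant of "box" that rarely shares
    # context with a query about cardboard boxes, so it is rejected.
    docs = [["box", "cardboard"], ["boxes", "cardboard", "packing"],
            ["boxing", "match", "ring"]]
    print(coherent_variants(["box", "boxes", "boxing"], ["cardboard"], docs))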


Improving Non-English Web Searching | 2008

Issues in searching for Indian language web content

Dipasree Pal; Prasenjit Majumder; Mandar Mitra; Sukanya Mitra; Aparajita Sen

This paper looks at the problem of searching for Indian language (IL) content on the Web. Even though the amount of IL content available on the Web is growing rapidly, searching through this content using the most popular web search engines poses certain problems. Since the popular search engines do not use any stemming or orthographic normalization for Indian languages, recall levels for IL searches can be low. We provide some examples to indicate the extent of this problem, and suggest a simple and efficient solution.
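
A fix of this general kind can be approximated outside the engine. The following minimal sketch (an assumed illustration, not the paper's exact solution) Unicode-normalizes a query term and ORs together its surface variants, so that an engine doing exact matching still finds them.

    import unicodedata

    def normalize(term):
        # NFC normalization collapses encoding-level variants of the same string,
        # a common source of mismatches in Indic scripts.
        return unicodedata.normalize("NFC", term)

    def expand_query(term, variant_generator):
        # OR together the normalized term and its generated surface variants.
        base = normalize(term)
        variants = {base} | {normalize(v) for v in variant_generator(base)}
        return " OR ".join(sorted(variants))

    # Toy English-suffix generator standing in for a real IL morphology component.
    print(expand_query("box", lambda w: [w + "es", w + "ed"]))  # box OR boxed OR boxes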


International Conference on the Theory of Information Retrieval | 2015

Improving Pseudo Relevance Feedback in the Divergence from Randomness Model

Dipasree Pal; Mandar Mitra; Samar Bhattacharya

In an earlier analysis of Pseudo Relevance Feedback (PRF) models by Clinchant and Gaussier (2013), five desirable properties that PRF models should satisfy were formalised. Also, modifications to two PRF models were proposed in order to improve compliance with the desirable properties. These resulted in improved retrieval effectiveness. In this study, we introduce a sixth property that we believe PRF models should satisfy. We also extend the earlier exercise to Bo1, a standard PRF model. Experimental results on the TREC Robust, WT10G and GOV2 datasets show that the proposed modifications yield improvements in effectiveness.
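
For context, the standard Bo1 weight (the model extended here), as commonly implemented in DFR-based systems such as Terrier, scores a candidate term t as

\[
  w(t) \;=\; \mathit{tf}_x \,\log_2\!\frac{1 + P_n}{P_n} \;+\; \log_2(1 + P_n),
  \qquad P_n = \frac{F_t}{N},
\]

where tf_x is the frequency of t in the pseudo-relevant set, F_t its frequency in the whole collection, and N the number of documents. The paper's proposed modifications to this weight are not reproduced here.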


FIRE | 2013

Frequent Case Generation in Ad Hoc Retrieval of Three Indian Languages – Bengali, Gujarati and Marathi

Jiaul H. Paik; Kimmo Kettunen; Dipasree Pal; Kalervo Järvelin

This paper presents the results of a generative method for managing morphological variation of query keywords in Bengali, Gujarati and Marathi. The method, called Frequent Case Generation (FCG), is based on the skewed distribution of word forms in natural languages and is suitable for languages that either have a fair amount of morphological variation or are morphologically very rich. We participated in the ad hoc task at FIRE 2011 and applied the FCG method to the monolingual Bengali, Gujarati and Marathi test collections. Our evaluation was carried out with the title and description fields of the test topics, using the Lemur search engine. We used a plain, unprocessed word index as the baseline, and n-gramming and stemming as competing methods. The evaluation results show relative mean average precision improvements of 30%, 16% and 70% for Bengali, Gujarati and Marathi, respectively, when comparing the FCG method to plain words. The method is competitive with n-gramming and stemming.
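
A minimal sketch of the FCG idea (illustrative; the toy suffix list stands in for the real frequent-case paradigms of Bengali, Gujarati or Marathi): rather than stemming the index, expand each keyword into its most frequent surface forms.

    from collections import Counter

    def frequent_forms(lemma, corpus_freq, suffixes, keep=3):
        # Rank the lemma's candidate surface forms by corpus frequency and
        # keep the most frequent ones for query expansion.
        variants = [lemma + s for s in suffixes]
        ranked = sorted(variants, key=lambda v: corpus_freq.get(v, 0), reverse=True)
        return [v for v in ranked[:keep] if corpus_freq.get(v, 0) > 0]

    # Toy usage with hypothetical corpus frequencies and English suffixes.
    freq = Counter({"book": 120, "books": 80, "booking": 5})
    print(frequent_forms("book", freq, ["", "s", "ing", "ed"]))  # ['book', 'books', 'booking']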

Collaboration

Dipasree Pal's top co-authors and their affiliations:

Mandar Mitra | Indian Statistical Institute
Prasenjit Majumder | Dhirubhai Ambani Institute of Information and Communication Technology
Ayan Bandyopadhyay | Indian Statistical Institute
Jiaul H. Paik | Indian Statistical Institute
Swapan K. Parui | Indian Statistical Institute
Aparajita Sen | Indian Statistical Institute
Samaresh Maiti | Indian Statistical Institute
Sukanya Mitra | Indian Statistical Institute
Sukomal Pal | Indian School of Mines