Network


Latest external collaborations at the country level.

Hotspot


Research topics in which Prasenjit Majumder is active.

Publication


Featured research published by Prasenjit Majumder.


ACM Transactions on Information Systems | 2007

YASS: Yet another suffix stripper

Prasenjit Majumder; Mandar Mitra; Swapan K. Parui; Gobinda Kole; Pabitra Mitra; Kalyankumar Datta

Stemmers attempt to reduce a word to its stem or root form and are used widely in information retrieval tasks to increase the recall rate. Most popular stemmers encode a large number of language-specific rules built over a long period of time, and such stemmers with comprehensive rules are available only for a few languages. In the absence of extensive linguistic resources for certain languages, statistical language processing tools have been successfully used to improve the performance of IR systems. In this article, we describe a clustering-based approach to discover equivalence classes of root words and their morphological variants. A set of string distance measures is defined, and the lexicon for a given text collection is clustered using these measures to identify the equivalence classes. The proposed approach is compared with the Porter and Lovins stemmers on the AP and WSJ subcollections of the Tipster dataset using 200 queries. Its performance is comparable to that of the Porter and Lovins stemmers, both in terms of average precision and the total number of relevant documents retrieved. The proposed stemming algorithm also provides consistent improvements in retrieval performance for French and Bengali, which are currently resource-poor.
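To make the clustering idea concrete, here is a minimal, illustrative Python sketch assuming a simplified prefix-based distance and a greedy threshold clustering; the actual YASS distance measures and the hierarchical clustering used in the paper are more refined.

```python
# Illustrative sketch only: a simplified, YASS-inspired string distance and a
# naive threshold-based clustering of a toy lexicon. The exact distance
# measures and clustering procedure in the paper differ.

def first_mismatch(a: str, b: str) -> int:
    """Position of the first mismatching character (or the shorter length)."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return i

def distance(a: str, b: str) -> float:
    """Small when the words share a long common prefix and short tails."""
    m = first_mismatch(a, b)
    if m == 0:                      # no common prefix: effectively unrelated
        return float("inf")
    tail = max(len(a), len(b)) - m  # length of the unmatched suffix
    return tail / m                 # penalise long tails, reward long prefixes

def cluster(lexicon, threshold=1.5):
    """Greedy single-pass clustering: add each word to the first compatible cluster."""
    clusters = []
    for word in sorted(lexicon):
        for c in clusters:
            if all(distance(word, w) <= threshold for w in c):
                c.append(word)
                break
        else:
            clusters.append([word])
    return clusters

if __name__ == "__main__":
    lexicon = ["astronomer", "astronomy", "astronomical", "astute", "aster"]
    for c in cluster(lexicon):
        print(c)  # each cluster approximates one equivalence class of variants
```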


Web Science | 2012

Query expansion for microblog retrieval

Ayan Bandyopadhyay; Kripabandhu Ghosh; Prasenjit Majumder; Mandar Mitra

The extreme brevity of Microblog posts (such as ‘tweets’) exacerbates the well-known vocabulary mismatch problem when retrieving tweets in response to user queries. In this study, we explore various query expansion approaches as a way to address this problem. We use the Web as a source of query expansion terms. We also tried a variation of a standard pseudo-relevance feedback method. Results on the TREC 2011 Microblog test data (TWEETS11 corpus) are very promising – significant improvements are obtained over a baseline retrieval strategy that uses no query expansion. Since many of the TREC queries were oriented towards the news genre, we also tried using only news sites (BBC and NYTIMES) in the hope that these would be a cleaner, less noisy source for expansion terms. This turned out to be counter-productive.
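As a rough illustration of web-based expansion, the following Python sketch assumes the web result snippets have already been fetched and simply adds the most frequent non-query, non-stopword snippet terms to the query; it is not the exact method evaluated in the paper.

```python
# Illustrative sketch only: pick expansion terms from (already fetched) web
# result snippets by simple term frequency, then append them to the query.

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "on", "for", "to",
             "and", "is", "are", "at", "after"}

def expansion_terms(query: str, snippets, k: int = 5):
    """Return the k most frequent snippet terms not already in the query."""
    query_terms = set(query.lower().split())
    counts = Counter()
    for snippet in snippets:
        for term in re.findall(r"[a-z]+", snippet.lower()):
            if term not in STOPWORDS and term not in query_terms:
                counts[term] += 1
    return [t for t, _ in counts.most_common(k)]

def expand(query: str, snippets, k: int = 5) -> str:
    return query + " " + " ".join(expansion_terms(query, snippets, k))

if __name__ == "__main__":
    q = "bbc world service cuts"
    snippets = [
        "BBC World Service to cut five language services amid budget reductions",
        "Hundreds of jobs at the BBC World Service are at risk after funding cuts",
    ]
    print(expand(q, snippets, k=3))
```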


ACM Transactions on Asian Language Information Processing | 2010

The FIRE 2008 Evaluation Exercise

Prasenjit Majumder; Mandar Mitra; Dipasree Pal; Ayan Bandyopadhyay; Samaresh Maiti; Sukomal Pal; Deboshree Modak; Sucharita Sanyal

The aim of the Forum for Information Retrieval Evaluation (FIRE) is to create an evaluation framework in the spirit of TREC (Text REtrieval Conference), CLEF (Cross-Language Evaluation Forum), and NTCIR (NII Test Collection for IR Systems), for Indian language Information Retrieval. The first evaluation exercise conducted by FIRE was completed in 2008. This article describes the test collections used at FIRE 2008, summarizes the approaches adopted by various participants, discusses the limitations of the datasets, and outlines the tasks planned for the next iteration of FIRE.


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2008

Text collections for FIRE

Prasenjit Majumder; Mandar Mitra; Dipasree Pal; Ayan Bandyopadhyay; Samaresh Maiti; Sukanya Mitra; Aparajita Sen; Sukomal Pal

The aim of the Forum for Information Retrieval Evaluation (FIRE) is to create a Cranfield-like evaluation framework in the spirit of TREC, CLEF and NTCIR, for Indian Language Information Retrieval. For the first year, six Indian languages have been selected: Bengali, Hindi, Marathi, Punjabi, Tamil, and Telugu. This poster describes the tasks as well as the document and topic collections that are to be used at the FIRE workshop.


Cross-Language Evaluation Forum | 2008

Bulgarian, Hungarian and Czech Stemming Using YASS

Prasenjit Majumder; Mandar Mitra; Dipasree Pal

This is the second year in a row that we are participating in CLEF. Our aim is to test the performance of a statistical stemmer on various languages. For CLEF 2006, we tried the stemmer on French [1], while for CLEF 2007 we did experiments for the Hungarian, Bulgarian and Czech monolingual tasks. We find that, for all languages, YASS produces significant improvements over the baseline (unstemmed) runs. The performance of YASS is also found to be comparable to that of other available stemmers for all three East European languages.


Forum for Information Retrieval Evaluation | 2013

Overview of the FIRE 2013 Track on Transliterated Search

Rishiraj Saha Roy; Monojit Choudhury; Prasenjit Majumder; Komal Agarwal

In this paper, we provide an overview of the FIRE 2013 track on transliterated search and describe the datasets released as part of the track. This was the first year that the track was organized. We proposed two subtasks as part of the challenge. In the first subtask, offered for Hindi, Bangla, and Gujarati, participants had to devise an algorithm to label the true language of each word in a sentence. Additionally, if a non-English word was identified, the algorithm was also supposed to provide the transliteration of the word in the native script. The second subtask was retrieval-based, where mixed-script documents had to be retrieved and ranked by relevance in response to ad hoc queries. The queries in our dataset were Bollywood Hindi song lyrics in Roman script. We received a total of 25 run submissions from five different teams across the world (three from India and two from abroad). Conducting this track helped us generate awareness about the importance of transliteration in the context of Indian languages. Results show that there is considerable scope for improvement in transliteration accuracy for the studied languages.
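For intuition only, the sketch below shows a trivial dictionary-lookup baseline for the word-level language labelling subtask; it is a hypothetical toy, not any participant's system, and real systems also produced back-transliterations into the native script.

```python
# Illustrative sketch only: a trivial word-level language labelling baseline
# for Roman-script mixed Hindi/English text, based on a toy English wordlist.

ENGLISH_WORDS = {"the", "is", "my", "favourite", "song", "love", "you"}  # toy lexicon

def label_tokens(sentence: str):
    """Label each Roman-script token as English ('E') or Hindi ('H')."""
    labels = []
    for token in sentence.lower().split():
        labels.append((token, "E" if token in ENGLISH_WORDS else "H"))
    return labels

if __name__ == "__main__":
    print(label_tokens("tujhe dekha to yeh jaana sanam is my favourite song"))
```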


Improving Non-English Web Searching | 2008

Issues in searching for Indian language web content

Dipasree Pal; Prasenjit Majumder; Mandar Mitra; Sukanya Mitra; Aparajita Sen

This paper looks at the problem of searching for Indian language (IL) content on the Web. Even though the amount of IL content available on the Web is growing rapidly, searching through this content using the most popular web search engines poses certain problems. Since the popular search engines do not use any stemming or orthographic normalization for Indian languages, recall levels for IL searches can be low. We provide some examples to indicate the extent of this problem, and suggest a simple and efficient solution to the problem.
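One way to work around the lack of stemming on the engine side, sketched below purely for illustration, is to OR together a few surface variants of each query term; the suffix list here is a made-up stand-in, not the normalization actually proposed in the paper.

```python
# Illustrative sketch only: since popular engines do not stem Indian-language
# words, a query can be expanded into an OR of surface variants of each term.
# The suffix list is a toy stand-in for real IL inflection rules.

TOY_SUFFIXES = ["er", "ke", "ra", "ri", "o"]  # hypothetical inflectional endings

def variants(term: str):
    """Return the term plus a few naively generated inflected variants."""
    forms = {term}
    for suf in TOY_SUFFIXES:
        if term.endswith(suf) and len(term) > len(suf) + 2:
            forms.add(term[: -len(suf)])   # stripped form
        forms.add(term + suf)              # attached form
    return sorted(forms)

def expanded_query(query: str) -> str:
    """Build an OR-query of variants for each term."""
    parts = ["(" + " OR ".join(variants(t)) + ")" for t in query.split()]
    return " ".join(parts)

if __name__ == "__main__":
    print(expanded_query("rabindranath kobita"))
```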


Information Processing and Management | 2015

Learning combination weights in data fusion using Genetic Algorithms

Kripabandhu Ghosh; Swapan K. Parui; Prasenjit Majumder

Highlights: the Genetic Algorithm is an effective scheme for determining data fusion weights; tuning the Genetic Algorithm increases time efficiency; weight learning from only the top-ranked documents is useful; redundant runs can be removed based on the correlation between scores.

Researchers have shown that a weighted linear combination in data fusion can produce better results than an unweighted combination. Many techniques have been used to determine the linear combination weights. In this work, we have used the Genetic Algorithm (GA) for the same purpose. The GA is not new and has been used earlier in several other applications but, to the best of our knowledge, it has not been used for the fusion of runs in information retrieval. First, we use the GA to learn the optimum fusion weights using the entire set of relevance assessments. Next, we learn the weights from the relevance assessments of the top retrieved documents only. Finally, we also learn the weights by a twofold training and testing on the queries. We test our method on runs submitted to TREC. We see that our weight learning scheme, using both full and partial sets of relevance assessments, produces significant improvements over the best candidate run, CombSUM, CombMNZ, Z-Score, the linear combination method with performance level, the performance level square weighting scheme, the multiple linear regression-based weight learning scheme, the mixture model result merging scheme, LambdaMerge, ClustFuseCombSUM and ClustFuseCombMNZ. Furthermore, we study how the correlation among the scores in the runs can be used to eliminate redundant runs from a set of runs to be fused. We observe that similar runs make similar contributions to fusion, so eliminating the redundant runs in a group of similar runs does not hurt fusion performance in any significant way.
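The following Python sketch illustrates the general idea of learning fusion weights with a genetic algorithm on toy data, using precision at k as a stand-in fitness function; the paper's experiments use TREC runs, MAP-style evaluation and a tuned GA, so this is only a minimal approximation.

```python
# Illustrative sketch only: a tiny genetic algorithm that learns linear fusion
# weights for two toy runs so that the fused ranking places known relevant
# documents near the top.

import random

RUNS = [  # document -> retrieval score, for two hypothetical systems
    {"d1": 0.9, "d2": 0.8, "d3": 0.1, "d4": 0.4},
    {"d1": 0.2, "d2": 0.9, "d3": 0.8, "d4": 0.3},
]
RELEVANT = {"d2", "d3"}  # toy relevance judgements

def fuse(weights):
    """Rank all documents by the weighted sum of their run scores."""
    docs = set().union(*RUNS)
    return sorted(docs, key=lambda d: -sum(w * run.get(d, 0.0)
                                           for w, run in zip(weights, RUNS)))

def fitness(weights, k=2):
    """Precision of the top-k fused documents (stand-in for MAP)."""
    top = fuse(weights)[:k]
    return sum(1 for d in top if d in RELEVANT) / k

def genetic_search(pop_size=20, generations=30, mutation=0.2):
    pop = [[random.random() for _ in RUNS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # selection: keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]   # crossover: averaging
            if random.random() < mutation:                # mutation: reset one weight
                i = random.randrange(len(child))
                child[i] = random.random()
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

if __name__ == "__main__":
    best = genetic_search()
    print("weights:", [round(w, 2) for w in best], "fused:", fuse(best))
```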


Cross-Language Evaluation Forum | 2006

Statistical vs. rule-based stemming for monolingual French retrieval

Prasenjit Majumder; Mandar Mitra; Kalyankumar Datta

This paper describes our approach to the 2006 Ad Hoc Monolingual Information Retrieval run for French. The goal of our experiment was to compare the performance of a proposed statistical stemmer with that of a rule-based stemmer, specifically the French version of Porter's stemmer. The statistical stemming approach is based on lexicon clustering, using a novel string distance measure. We submitted three official runs, besides a baseline run that uses no stemming. The results show that stemming significantly improves retrieval performance (as expected) by about 9-10%, and that the performance of the statistical stemmer is comparable with that of the rule-based stemmer.


Information Processing and Management | 2016

Improving Information Retrieval Performance on OCRed Text in the Absence of Clean Text Ground Truth

Kripabandhu Ghosh; Anirban Chakraborty; Swapan K. Parui; Prasenjit Majumder

Highlights: the proposed algorithm uses context information to segregate semantically related error variants from unrelated ones; string similarity measures are used to join error variants to the correct query word; the algorithm is tested on Bangla, Hindi and English datasets to show that the approach is language-independent; the Bangla and Hindi datasets have clean, error-free versions, whose retrieval performance is used as an upper bound, and the method is also compared with an error-modelling approach that, unlike ours, uses the clean version; the English dataset, which has no error-free version, is a genuine use case for the algorithm; the proposed method produces significant improvements over most of the baselines, including on the TREC 5 Confusion track dataset.

OCR errors in text harm information retrieval performance. Much research has been reported on modelling and correcting Optical Character Recognition (OCR) errors. Most of the prior work employs language-dependent resources or training texts to study the nature of the errors. However, not much research has been reported that focuses on improving retrieval performance from erroneous text in the absence of training data. We propose a novel approach for detecting OCR errors and improving retrieval performance from an erroneous corpus when training samples are not available to model errors. In this paper we propose a method that automatically identifies erroneous term variants in the noisy corpus, which are then used for query expansion, in the absence of clean text. We employ an effective combination of contextual information and string matching techniques. The approach automatically identifies the erroneous variants of query terms and consequently leads to improvements in retrieval performance through query expansion. It does not use any training data or any language-specific resources such as a thesaurus to identify error variants, and it does not assume any knowledge about the language except that the word delimiter is blank space. We have tested the approach on erroneous Bangla (Bengali) and Hindi FIRE collections, as well as on the TREC Legal IIT CDIP and TREC 5 Confusion track English corpora, and it achieves statistically significant improvements over state-of-the-art baselines on most of the datasets.
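As a rough illustration, the sketch below combines edit-distance similarity with a crude context-overlap test to pick plausible OCR error variants of a query term for expansion; the toy corpus, thresholds and measures are assumptions, not the paper's actual components.

```python
# Illustrative sketch only: combine string similarity with a context-overlap
# check to pick plausible OCR error variants of a query term, then expand
# the query with them.

from difflib import SequenceMatcher

# toy noisy-corpus vocabulary: term -> set of co-occurring terms (its "context")
CONTEXTS = {
    "government": {"election", "policy", "minister"},
    "govermnent": {"election", "minister", "budget"},   # OCR error variant
    "garment":    {"cotton", "factory", "textile"},
}

def string_sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def context_sim(a: str, b: str) -> float:
    ca, cb = CONTEXTS.get(a, set()), CONTEXTS.get(b, set())
    return len(ca & cb) / len(ca | cb) if ca | cb else 0.0

def error_variants(term, str_thresh=0.8, ctx_thresh=0.3):
    """Terms that look like the query term AND occur in similar contexts."""
    return [t for t in CONTEXTS
            if t != term
            and string_sim(term, t) >= str_thresh
            and context_sim(term, t) >= ctx_thresh]

def expand(query_terms):
    expanded = list(query_terms)
    for t in query_terms:
        expanded.extend(error_variants(t))
    return expanded

if __name__ == "__main__":
    print(expand(["government", "policy"]))  # picks up the OCR variant "govermnent"
```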

Collaboration


An overview of Prasenjit Majumder's collaborations.

Top Co-Authors

Mandar Mitra (Indian Statistical Institute)
Parth Mehta (Dhirubhai Ambani Institute of Information and Communication Technology)
Swapan K. Parui (Indian Statistical Institute)
Ayan Bandyopadhyay (Indian Statistical Institute)
Dipasree Pal (Indian Statistical Institute)
Kripabandhu Ghosh (Indian Institute of Technology Kanpur)
Sukomal Pal (Indian School of Mines)
Gaurav Arora (Indian Institute of Chemical Technology)
Harsh Trivedi (Dhirubhai Ambani Institute of Information and Communication Technology)
Khushboo Singhal (Polytechnic University of Valencia)