Kripabandhu Ghosh

Indian Institute of Technology Kanpur

Publication

Featured research published by Kripabandhu Ghosh.

Web Science | 2012

Query expansion for microblog retrieval

Ayan Bandyopadhyay; Kripabandhu Ghosh; Prasenjit Majumder; Mandar Mitra

The extreme brevity of Microblog posts (such as ‘tweets’) exacerbates the well-known vocabulary mismatch problem when retrieving tweets in response to user queries. In this study, we explore various query expansion approaches as a way to address this problem. We use the Web as a source of query expansion terms. We also tried a variation of a standard pseudo-relevance feedback method. Results on the TREC 2011 Microblog test data (TWEETS11 corpus) are very promising – significant improvements are obtained over a baseline retrieval strategy that uses no query expansion. Since many of the TREC queries were oriented towards the news genre, we also tried using only news sites (BBC and NYTIMES) in the hope that these would be a cleaner, less noisy source for expansion terms. This turned out to be counter-productive.
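
As a rough illustration of the pseudo-relevance feedback idea mentioned in this abstract, the sketch below adds the most frequent non-query terms from the top-ranked documents to the query. It is not the authors' variation, which the abstract does not specify; the function name `expand_query`, the parameters `k` and `m`, and the toy documents are all assumptions made for the example.

```python
from collections import Counter

def expand_query(query, ranked_docs, k=2, m=2, stopwords=frozenset()):
    """Append the m most frequent new terms from the top-k retrieved documents.

    Hypothetical helper: plain term counting stands in for whatever
    term-scoring scheme a real feedback method would use.
    """
    q_terms = query.lower().split()
    counts = Counter()
    for doc in ranked_docs[:k]:          # assume the top-k documents are relevant
        for tok in doc.lower().split():
            if tok not in q_terms and tok not in stopwords:
                counts[tok] += 1
    # Sort by descending count, then alphabetically, so ties are deterministic.
    expansion = sorted(counts, key=lambda t: (-counts[t], t))[:m]
    return q_terms + expansion

ranked = [
    "egypt protest tahrir square cairo",
    "protest in tahrir square today",
    "football results",
]
print(expand_query("egypt protest", ranked, stopwords=frozenset({"in", "the"})))
# ['egypt', 'protest', 'square', 'tahrir']
```

Short posts give each feedback document very few terms to count, which is one reason an external source such as the Web can be attractive for expansion.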

Information Processing and Management | 2015

Learning combination weights in data fusion using Genetic Algorithms

Kripabandhu Ghosh; Swapan K. Parui; Prasenjit Majumder

Highlights: Genetic Algorithm is an effective scheme in determining data fusion weights. Tuning Genetic Algorithm increases time efficiency. Weight learning from only top ranked documents is useful. Redundant runs can be removed based on correlation between scores.

Researchers have shown that a weighted linear combination in data fusion can produce better results than an unweighted combination. Many techniques have been used to determine the linear combination weights. In this work, we have used the Genetic Algorithm (GA) for the same purpose. The GA is not new and it has been used earlier in several other applications. But, to the best of our knowledge, the GA has not been used for fusion of runs in information retrieval. First, we use the GA to learn the optimum fusion weights using the entire set of relevance assessments. Next, we learn the weights from the relevance assessments of the top retrieved documents only. Finally, we also learn the weights by a twofold training and testing on the queries. We test our method on the runs submitted in TREC. We see that our weight learning scheme, using both full and partial sets of relevance assessments, produces significant improvements over the best candidate run, CombSUM, CombMNZ, Z-Score, linear combination method with performance level, performance level square weighting scheme, multiple linear regression-based weight learning scheme, mixture model result merging scheme, LambdaMerge, ClustFuseCombSUM and ClustFuseCombMNZ. Furthermore, we study how the correlation among the scores in the runs can be used to eliminate redundant runs in a set of runs to be fused. We observe that similar runs have similar contributions in fusion. So, eliminating the redundant runs in a group of similar runs does not hurt fusion performance in any significant way.
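
To make the difference between an unweighted and a weighted linear combination concrete, here is a minimal sketch of score fusion over two toy runs. The weights are set by hand; the Genetic Algorithm search described in the abstract is not implemented. The min-max normalisation step, the function names, and the example scores are assumptions for illustration only.

```python
from collections import defaultdict

def min_max_normalise(run):
    """Scale one run's scores to [0, 1]; a constant run maps to 0.0."""
    lo, hi = min(run.values()), max(run.values())
    span = hi - lo
    return {doc: (score - lo) / span if span else 0.0 for doc, score in run.items()}

def fuse(runs, weights=None):
    """Weighted linear combination of normalised runs.

    With weights=None every run gets weight 1.0, which is CombSUM.
    Returns document ids ordered by fused score (ties broken by id).
    """
    if weights is None:
        weights = [1.0] * len(runs)
    fused = defaultdict(float)
    for run, w in zip(runs, weights):
        for doc, score in min_max_normalise(run).items():
            fused[doc] += w * score
    return sorted(fused, key=lambda d: (-fused[d], d))

# Hypothetical retrieval scores from two systems for the same query.
run_a = {"d1": 4.0, "d2": 1.0, "d3": 0.0}
run_b = {"d3": 4.0, "d2": 2.0, "d1": 1.0, "d4": 0.0}

print(fuse([run_a, run_b]))                      # ['d1', 'd3', 'd2', 'd4']
print(fuse([run_a, run_b], weights=[0.2, 0.8]))  # ['d3', 'd2', 'd1', 'd4']
```

A weight-learning scheme would search over the `weights` vector, scoring each candidate by a retrieval metric computed against relevance assessments.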

Information Processing and Management | 2016

Improving Information Retrieval Performance on OCRed Text in the Absence of Clean Text Ground Truth

Kripabandhu Ghosh; Anirban Chakraborty; Swapan K. Parui; Prasenjit Majumder

Highlights: The proposed algorithm uses context information to segregate semantically related error variants from the unrelated ones. String similarity measures are used to join error variants with the correct query word. The algorithm is tested on Bangla, Hindi and English datasets to show that the proposed approach is language-independent. The Bangla and Hindi datasets have clean, error-free versions for comparison, so we have used the performances on the clean text versions as the performance upper bounds. In addition, we have compared our method with an error modelling approach which, unlike our method, uses the clean version. The English dataset is a genuine use case scenario for our algorithm as this dataset does not have an error-free version. Our proposed method produces significant improvements over most of the baselines. We have also tested our proposed algorithm on the TREC 5 Confusion track dataset and showed that our proposed method is significantly better than the baselines.

OCR errors in text harm information retrieval performance. Much research has been reported on modelling and correction of Optical Character Recognition (OCR) errors. Most of the prior work employs language dependent resources or training texts in studying the nature of errors. However, not much research has been reported that focuses on improving retrieval performance from erroneous text in the absence of training data. We propose a novel approach for detecting OCR errors and improving retrieval performance from the erroneous corpus in a situation where training samples are not available to model errors. In this paper we propose a method that automatically identifies erroneous term variants in the noisy corpus, which are used for query expansion, in the absence of clean text. We employ an effective combination of contextual information and string matching techniques. Our proposed approach automatically identifies the erroneous variants of query terms and consequently leads to improvement in retrieval performance through query expansion. Our proposed approach does not use any training data or any language specific resources like thesaurus for identification of error variants. It also does not use any knowledge about the language except that the word delimiter is blank space. We have tested our approach on erroneous Bangla (Bengali in English) and Hindi FIRE collections, and also on TREC Legal IIT CDIP and TREC 5 Confusion track English corpora. Our proposed approach has achieved statistically significant improvements over the state-of-the-art baselines on most of the datasets.
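
The combination of string matching and contextual information described above can be sketched roughly as follows. This is not the paper's algorithm: the similarity measure (`difflib.SequenceMatcher`), the threshold `min_sim`, the context window, and the toy corpus are all assumptions chosen to keep the example small.

```python
from difflib import SequenceMatcher

def context_words(term, docs, window=2):
    """Collect words seen within `window` positions of `term` in any document."""
    ctx = set()
    for doc in docs:
        tokens = doc.split()  # only language assumption: blank-space delimiter
        for i, tok in enumerate(tokens):
            if tok == term:
                ctx.update(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    return ctx

def error_variants(query_term, docs, min_sim=0.7, window=2):
    """Return corpus terms that resemble the query term AND share context with it."""
    vocab = {tok for doc in docs for tok in doc.split()}
    q_ctx = context_words(query_term, docs, window)
    variants = []
    for term in sorted(vocab - {query_term}):
        similar = SequenceMatcher(None, query_term, term).ratio() >= min_sim
        shares_context = bool(q_ctx & context_words(term, docs, window))
        if similar and shares_context:
            variants.append(term)
    return variants

# Toy noisy corpus: "rainfa11" imitates an OCR confusion of 'l' with '1'.
docs = [
    "heavy rainfall flooded the river bank",
    "heavy rainfa11 flooded the village",
    "the railway station was closed",
]
print(error_variants("rainfall", docs))  # ['rainfa11']
```

In this toy corpus "railway" shares a context word with "rainfall" but fails the similarity threshold, so only the likely OCR variant would be added to the query.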

Forum for Information Retrieval Evaluation | 2013

Improving IR Performance from OCRed Text using Cooccurrence

Kripabandhu Ghosh; Anirban Chakraborty; Swapan K. Parui

Information Retrieval performance is hurt to a great extent by OCR errors. Much research has been reported on modelling and correction of OCR errors. However, all the existing systems make use of language dependent resources or training texts to study the nature of errors. No research has been reported on improving retrieval performance from erroneous text when no training data is available. We propose a novel algorithm for automatic detection of OCR errors and improvement of retrieval performance from the erroneous corpus. Our algorithm does not use any training data or any language specific resources like thesaurus. It also does not use any knowledge about the language except that the word delimiter is blank space. We have tested our algorithm on the erroneous OCRed Bangla FIRE collection offered in the RISOT 2012 track and obtained about 9% improvement over the OCRed baseline. However, the improvement is not statistically significant.

FIRE | 2013

Retrieval from OCR Text: RISOT Track

Kripabandhu Ghosh; Swapan K. Parui

In this paper, we present our work in the RISOT track of FIRE 2011. Here, we describe an error modeling technique for OCR errors in an Indic script. Based on the error model, we apply a two-fold error correction method on the OCRed corpus. First, we correct the corpus using two approaches: correction with full confidence and correction without full confidence. Finally, we use query expansion for error correction. We have achieved retrieval results which are significantly better than the baseline, and the difference between our best result and the original text run is not significant.

International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management | 2014

A Word Association Based Approach for Improving Retrieval Performance from Noisy OCRed Text

Anirban Chakraborty; Kripabandhu Ghosh; Utpal Roy

OCR errors hurt retrieval performance to a great extent. Research has been done on modelling and correction of OCR errors. However, most of the existing systems use language dependent resources or training texts for studying the nature of errors. Not much research has been reported on improving retrieval performance from erroneous text when no training data is available. We propose an algorithm for detecting OCR errors and improving retrieval performance from the erroneous corpus. We present two versions of the algorithm: one based on word cooccurrence and the other based on Pointwise Mutual Information. Our algorithm does not use any training data or any language specific resources like thesaurus. It also does not use any knowledge about the language except that the word delimiter is a blank space. We have tested our algorithm on the erroneous Bangla FIRE collection and obtained significant improvements.
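
For readers unfamiliar with Pointwise Mutual Information, the sketch below computes it from document-level cooccurrence counts using the standard definition PMI(x, y) = log2(P(x, y) / (P(x) P(y))). How the paper estimates these probabilities is not stated in the abstract, so counting at the document level, the function name `pmi_table`, and the toy corpus are assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(docs):
    """Document-level PMI for every pair of terms that cooccur at least once.

    Probabilities are estimated as document frequencies divided by the
    number of documents; terms are split on blank space only.
    """
    n = len(docs)
    term_df = Counter()
    pair_df = Counter()
    for doc in docs:
        terms = sorted(set(doc.split()))          # each term counted once per document
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))    # pairs are alphabetically ordered tuples
    return {
        (x, y): math.log2((count / n) / ((term_df[x] / n) * (term_df[y] / n)))
        for (x, y), count in pair_df.items()
    }

# Toy corpus: the noisy form "f1ood" cooccurs with the same word as "flood".
docs = ["flood relief", "f1ood relief", "cricket score", "cricket match"]
scores = pmi_table(docs)
print(scores[("flood", "relief")], scores[("f1ood", "relief")])
```

Pairs that never cooccur are absent from the table, which avoids taking the logarithm of zero.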

WWW '18 Companion Proceedings of the The Web Conference 2018 | 2018

Retrieving Information from Multiple Sources

Anurag Roy; Kripabandhu Ghosh; Moumita Basu; Parth Gupta; Saptarshi Ghosh

The Web has several information sources on which an ongoing event is discussed. To get a complete picture of the event, it is important to retrieve information from multiple sources. We propose a novel neural network based model which integrates the embeddings from multiple sources, and thus retrieves information from them jointly, as opposed to combining multiple retrieval results. The importance of the proposed model is that no document-aligned comparable data is needed. Experiments on posts related to a particular event from three different sources - Facebook, Twitter and WhatsApp - exhibit the efficacy of the proposed model.

Companion of the The Web Conference 2018 on The Web Conference 2018 - WWW '18 | 2018

Automatic Matching of Resource Needs and Availabilities in Microblogs for Post-Disaster Relief

Moumita Basu; Anurag Shandilya; Kripabandhu Ghosh; Saptarshi Ghosh

During a disaster event, it is essential to know about needs and availabilities of different types of resources, for coordinating relief operations. Microblogging sites are frequently used for aiding post-disaster relief operations, and there have been prior attempts to identify tweets that inform about resource needs and availabilities (termed as need-tweets and availability-tweets respectively). However, there has not been much attempt to effectively utilise such tweets. We introduce the problem of automatically matching need-tweets with appropriate availability-tweets, which is practically important for coordination of post-disaster relief operations. We also experiment with several methodologies for automatically matching need-tweets and availability-tweets.

Conference on Information and Knowledge Management | 2017

Combining Local and Global Word Embeddings for Microblog Stemming

Anurag Roy; Trishnendu Ghorai; Kripabandhu Ghosh; Saptarshi Ghosh

Stemming is a vital step employed to improve retrieval performance through efficient unification of morphological variants of a word. We propose an unsupervised, context-specific stemming algorithm for microblogs, based on both local and global word embeddings, which is capable of handling the informal, noisy vocabulary of microblogs. Experiments on two standard microblog data collections (TREC 2016 and FIRE 2016) show that the proposed stemmer enables significantly better retrieval performance than several state-of-the-art stemming algorithms, for the same queries.

Advances in Social Networks Analysis and Mining | 2017

Identifying Post-Disaster Resource Needs and Availabilities from Microblogs

Moumita Basu; Kripabandhu Ghosh; Somenath Das; Ratnadeep Dey; Somprakash Bandyopadhyay; Saptarshi Ghosh

Microblogging sites like Twitter are increasingly being used for aiding post-disaster relief operations. In such situations, identifying needs and availabilities of various types of resources is critical for effective coordination of the relief operations. We focus on the problem of automatically identifying tweets that inform about needs and availabilities of resources, termed as need-tweets and availability-tweets respectively. Traditionally, pattern matching techniques are adopted to identify such tweets. In this work, we present novel retrieval methodologies, based on word embeddings, for automatically identifying need-tweets and availability-tweets. Experiments over tweets posted during two recent disaster events show that the proposed methodologies outperform prior pattern-matching techniques.
