Anurag Bhardwaj | Researchain

Publication

Featured researches published by Anurag Bhardwaj.

Document Analysis Systems | 2010

Gabor features for offline Arabic handwriting recognition

Jin Chen; Huaigu Cao; Rohit Prasad; Anurag Bhardwaj; Premkumar Natarajan

Many feature extraction approaches for off-line handwriting recognition (OHR) rely on accurate binarization of gray-level images. However, high-quality binarization of most real-world documents is extremely difficult because of the varying characteristics of the noise artifacts common in such documents. Unlike most of these features, Gabor features do not require binarization of the document images, and thus are likely to be more robust to noise in document images. To demonstrate the efficacy of our proposed Gabor features, we perform subword recognition for off-line Arabic handwritten images using Support Vector Machines (SVM). We also compare the recognition performance with other binarization-based features that have been proven effective in capturing shape characteristics of handwritten Arabic subwords, such as GSC (a set of gradient, structure, and concavity features) and skeleton-based Graph features. Our preliminary experimental results show that Gabor features outperform Graph features and are slightly better than GSC features for Arabic subword recognition. In addition, by combining Gabor and GSC features, we obtain a significant reduction in classification error rate over using GSC or Gabor features alone.
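
As a rough illustration of why Gabor features need no binarization, the sketch below builds a small bank of real-valued Gabor kernels and applies them directly to gray-level pixel values. The kernel size, wavelength, number of orientations, and the mean-absolute-response summary are assumptions chosen for illustration; the abstract does not state the parameters or the pooling used in the paper.

```python
import math

def gabor_kernel(size, sigma, theta, wavelength, gamma=0.5):
    """Real part of a 2-D Gabor filter as a size x size list of lists (size must be odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the sinusoid varies along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel

def filter_response(image, kernel):
    """Mean absolute filter response over all valid positions of a gray-level image."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    total, count = 0.0, 0
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            acc = sum(kernel[a][b] * image[i + a][j + b]
                      for a in range(k) for b in range(k))
            total += abs(acc)
            count += 1
    return total / count

def gabor_features(image, orientations=4, size=5, sigma=2.0, wavelength=4.0):
    """One feature per orientation; parameter values are illustrative assumptions."""
    return [
        filter_response(image, gabor_kernel(size, sigma, n * math.pi / orientations, wavelength))
        for n in range(orientations)
    ]

# Toy gray-level image with vertical stripes (intensity changes along x only).
stripes = [[1.0 if (j % 4) < 2 else 0.0 for j in range(8)] for _ in range(8)]
features = gabor_features(stripes)
```

On this toy image the kernel whose sinusoid runs along x responds more strongly than the one running along y, which is the kind of orientation cue a classifier such as an SVM could consume.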

International Journal on Document Analysis and Recognition | 2009

Automatic recognition of handwritten medical forms for search engines

Robert Milewski; Venu Govindaraju; Anurag Bhardwaj

A new paradigm, which models the relationships between handwriting and topic categories in the context of medical forms, is presented. The ultimate goals are: (1) a robust method which categorizes medical forms into specified categories, and (2) the use of such information for practical applications such as improved recognition of medical handwriting or retrieval of medical forms as in a search engine. Medical forms have diverse, complex and large lexicons consisting of English, medical and pharmacology corpora. Our technique shows that a few recognized characters, returned by handwriting recognition, can be used to construct a linguistic model capable of representing a medical topic category. This allows (1) a reduced lexicon to be constructed, thereby improving handwriting recognition performance, and (2) PCR (Pre-Hospital Care Report) forms to be tagged with a topic category and subsequently searched by information retrieval systems. We present an improvement of over 7% in raw recognition rate and a mean average precision of 0.28 over a set of 1,175 queries on a data set of unconstrained handwritten medical forms filled in emergency environments.
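
For readers unfamiliar with the retrieval metric quoted above, the following sketch computes mean average precision in its standard form. The document identifiers and relevance sets are made-up toy data, not taken from the paper.

```python
def average_precision(ranked, relevant):
    """Average of precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of per-query average precision; runs is a list of (ranked, relevant) pairs."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)

# Two hypothetical queries with their ranked results and relevant sets.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
map_score = mean_average_precision(runs)
```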

Analytics for Noisy Unstructured Text Data | 2008

Topic based language models for OCR correction

Anurag Bhardwaj; Huaigu Cao; Venu Govindaraju

Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered a challenging task. Previous research in this area has shown that word recognizers produce reasonably clean output when used with a restricted lexicon. In the absence of such a restricted lexicon, however, the output of an unconstrained handwritten word recognizer is noisy. The objective of this research is to process noisy recognizer output and eliminate spurious recognition choices using a topic-based language model. We construct a topic-based language model for every document using manually categorized training data. A topic categorization sub-system based on a Maximum Entropy model is also trained and used to generate the topic distribution of a test document. A given test word image is processed by the recognizer, and its word recognition likelihood is refined by incorporating the topic distribution of the document and the topic-based language model probability. The proposed method is evaluated on the publicly available IAM dataset, and experimental results show a significant improvement in word recognition accuracy, from 32% to 40%, over a test set consisting of 4033 word images extracted from 70 handwritten document images.
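
The refinement step can be pictured with a small sketch. It assumes one plausible combination rule, multiplying each candidate's recognizer likelihood by a topic-weighted unigram probability and renormalizing; the abstract does not give the exact formula, and the topics, words, and probabilities below are invented for illustration.

```python
def rescore(candidates, topic_dist, topic_lm, floor=1e-6):
    """Refine recognizer likelihoods with a topic-weighted language model.

    candidates: word -> recognizer likelihood
    topic_dist: topic -> P(topic | document)
    topic_lm:   topic -> {word -> P(word | topic)}
    floor:      assumed probability for words unseen under a topic
    """
    scored = {}
    for word, likelihood in candidates.items():
        lm_prob = sum(
            p_topic * topic_lm[topic].get(word, floor)
            for topic, p_topic in topic_dist.items()
        )
        scored[word] = likelihood * lm_prob
    total = sum(scored.values())
    return {word: score / total for word, score in scored.items()}

# Hypothetical example: the recognizer slightly prefers "count",
# but the document is mostly about a legal topic.
candidates = {"court": 0.40, "count": 0.45}
topic_dist = {"legal": 0.9, "finance": 0.1}
topic_lm = {
    "legal": {"court": 0.05, "count": 0.005},
    "finance": {"court": 0.001, "count": 0.02},
}
refined = rescore(candidates, topic_dist, topic_lm)
best_word = max(refined, key=refined.get)
```

In this toy case the topic evidence overturns the recognizer's first choice, which is the effect the abstract describes as eliminating spurious recognition choices.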

International Conference on Frontiers in Handwriting Recognition | 2010

Retrieving Handwriting Styles: A Content Based Approach to Handwritten Document Retrieval

Anurag Bhardwaj; Achint Oommen Thomas; Yun Fu; Venu Govindaraju

Large-scale retrieval of handwritten documents has primarily focused on searching for a query text in the OCR’ed transcription of the document images, which provides a limited view of the complete search process. Recent research advances have led to a number of content-based retrieval techniques which expand the search scope to the document content level (i.e. image features, meta-information). Based on similar motivations, we propose a new approach to content-based retrieval of handwritten document images by retrieving similar handwriting styles corresponding to a handwritten query image. At the core, we formulate this problem as the task of unsupervised writer style classification without the need for any style definitions or grammar. We build upon our previous work in writer style modeling and apply it to learn a style distribution for every handwriting sample in the corpus. Given a query image, all documents are ranked in order of their style distribution similarity. Experiments conducted on the publicly available IAM dataset demonstrate the efficacy of our proposed method over baseline feature-based systems.
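
The ranking step can be sketched as follows, assuming each document has already been reduced to a style distribution. Cosine similarity is used here purely as a stand-in, since the abstract does not name the similarity measure, and the distributions and document names are hypothetical.

```python
import math

def cosine_similarity(p, q):
    """Cosine similarity between two equal-length style distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def rank_by_style(query_dist, corpus_dists):
    """Return document ids sorted from most to least similar to the query."""
    return sorted(
        corpus_dists,
        key=lambda doc_id: cosine_similarity(query_dist, corpus_dists[doc_id]),
        reverse=True,
    )

# Hypothetical three-style distributions for a query and three documents.
query = [0.7, 0.2, 0.1]
corpus = {
    "doc_a": [0.6, 0.3, 0.1],
    "doc_b": [0.1, 0.1, 0.8],
    "doc_c": [0.3, 0.4, 0.3],
}
ranking = rank_by_style(query, corpus)
```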

International Conference on Document Analysis and Recognition | 2009

Stochastic Segment Modeling for Offline Handwriting Recognition

Premkumar Natarajan; Krishna Subramanian; Anurag Bhardwaj; Rohit Prasad

In this paper, we present a novel approach for incorporating structural information into the hidden Markov model (HMM) framework for offline handwriting recognition. Traditionally, structural features have been used in recognition approaches that rely on accurate segmentation of words into smaller units (sub-words or characters). However, such segmentation-based approaches do not perform well on real-world handwritten images, because breaks and merges in glyphs typically create new connected components that are not observed in the training data. To mitigate the problem of having to derive accurate segmentation from connected components, we present a novel framework where the HMM-based recognition system trained on shorter-span features is used to generate the 2-D character images (the “Stochastic Segments”), and then another classifier that uses structural features extracted from the stochastic character segments generates a new set of scores. Finally, the scores from the HMM system and from structural matching are used in combination to generate a hypothesis that is better than the results from either the HMM or from structural matching alone. We demonstrate the efficacy of our approach by reporting experimental results on a large corpus of handwritten Arabic documents.
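
The final combination step might look like the sketch below, which assumes a simple weighted interpolation of per-hypothesis log scores. The abstract does not specify the combination rule or any weight, so both, along with the example hypotheses and scores, are illustrative assumptions.

```python
def combine_scores(hmm_scores, structural_scores, weight=0.5):
    """Interpolate HMM and structural log scores and pick the best hypothesis.

    weight is the assumed share given to the structural classifier.
    """
    combined = {
        hyp: (1.0 - weight) * hmm_scores[hyp] + weight * structural_scores[hyp]
        for hyp in hmm_scores
    }
    best = max(combined, key=combined.get)
    return best, combined

# Hypothetical log scores for two competing hypotheses of one word image.
hmm_scores = {"hyp_1": -1.0, "hyp_2": -1.2}
structural_scores = {"hyp_1": -3.0, "hyp_2": -0.5}
best, combined = combine_scores(hmm_scores, structural_scores)
```

Here the HMM alone would choose hyp_1, while the combined score favors hyp_2 because the structural evidence for it is much stronger.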

Document Analysis Systems | 2010

Latent Dirichlet allocation based writer identification in offline handwriting

Anurag Bhardwaj; Manavender Reddy; Srirangaraj Setlur; Venu Govindaraju; Sitaram Ramachandrula

In this paper, we describe a novel approach to writer identification in offline handwriting using Latent Dirichlet Allocation (LDA). State-of-the-art methods for writer identification employ the traditional feature-classification paradigm, which does not provide enough information about handwriting attributes such as writing style, even though these are key components in any forensic analysis of handwriting. This problem is compounded by the lack of efficient rules for defining a particular writing style that can capture writer-specific characteristics over a large dataset. We propose to address this issue by using a generative model in the form of LDA that automatically infers writing styles from a handwritten document collection without any pre-defined set of rules. This information is then used to represent each writer as a distribution over multiple writing styles for classifying any unknown writer sample. We apply our approach to two different feature sets consisting of contour angle features as well as structural and concavity features. Our experimental results show comparable performance with baseline systems and also demonstrate the efficacy of LDA for learning multiple handwriting styles.

International Journal on Document Analysis and Recognition | 2009

Using topic models for OCR correction

Anurag Bhardwaj; Venu Govindaraju

Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered a challenging task. Previous research in this area has shown that word recognizers perform adequately on constrained handwritten documents which typically use a restricted vocabulary (lexicon). But in the case of unconstrained handwritten documents, state-of-the-art word recognition accuracy is still below the acceptable limits. The objective of this research is to improve word recognition accuracy on unconstrained handwritten documents by applying a post-processing or OCR correction technique to the word recognition output. In this paper, we present two different methods for this purpose. First, we describe a lexicon reduction-based method by topic categorization of handwritten documents which is used to generate smaller topic-specific lexicons for improving the recognition accuracy. Second, we describe a method which uses topic-specific language models and a maximum-entropy based topic categorization model to refine the recognition output. We present the relative merits of each of these methods and report results on the publicly available IAM database.

International Journal on Document Analysis and Recognition | 2011

Unconstrained handwritten document retrieval

Huaigu Cao; Venu Govindaraju; Anurag Bhardwaj

With the ever-increasing growth of the World Wide Web, there is an urgent need for an efficient information retrieval system that can search and retrieve handwritten documents when presented with user queries. However, unconstrained handwriting recognition remains a challenging task with inadequate performance, which is a major hurdle in providing a robust search experience over handwritten documents. In this paper, we describe our recent research with a focus on information retrieval from noisy text derived from imperfect handwriting recognizers. First, we describe a novel term frequency estimation technique incorporating the word segmentation information inside the retrieval framework to improve the overall system performance. Second, we outline a taxonomy of different techniques used for addressing the noisy text retrieval task. The first method uses a novel bootstrapping mechanism to refine the OCR’ed text and uses the cleaned text for retrieval. The second method uses the uncorrected or raw OCR’ed text but modifies the standard vector space model for handling noisy text issues. The third method employs robust image features to index the documents instead of using noisy OCR’ed text. We describe these techniques in detail and also discuss their performance measures using standard IR evaluation metrics.

Sanskrit Computational Linguistics | 2009

Keyword Spotting Techniques for Sanskrit Documents

Anurag Bhardwaj; Srirangaraj Setlur; Venu Govindaraju

With advances in the field of digitization of printed documents and several mass digitization projects underway, information retrieval and document search have emerged as key research areas. However, most of the current work in these areas is limited to English and a few oriental languages. The lack of efficient solutions for Indic scripts and languages such as Sanskrit has hampered information extraction from a large body of documents of cultural and historical importance. This chapter presents two relevant topics in this area. First, we describe the use of a script specific Keyword Spotting for Sanskrit documents that makes use of domain knowledge of the script. Second, we address the needs of a digital library to provide access to a collection of documents from multiple scripts. This requires intelligent solutions which scale across different scripts. We present a script independent Keyword Spotting approach for this purpose. Experimental results illustrate the efficacy of our methods.

Document Recognition and Retrieval | 2008

An OCR Based Approach for Word Spotting in Devanagari Documents

Anurag Bhardwaj; Suryaprakash Kompalli; Srirangaraj Setlur; Venu Govindaraju

This paper describes an OCR-based technique for word spotting in Devanagari printed documents. The system accepts a Devanagari word as input and returns a sequence of word images that are ranked according to their similarity with the input query. The methodology involves line and word separation, pre-processing document words, word recognition using OCR and similarity matching. We demonstrate a Block Adjacency Graph (BAG) based document cleanup in the pre-processing phase. During word recognition, multiple recognition hypotheses are generated for each document word using a font-independent Devanagari OCR. The similarity matching phase uses a cost based model to match the word input by a user and the OCR results. Experiments are conducted on document images from the publicly available ILT and Million Book Project dataset. The technique achieves an average precision of 80% for 10 queries and 67% for 20 queries for a set of 64 documents containing 5780 word images. The paper also presents a comparison of our method with template-based word spotting techniques.
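
The similarity matching phase can be illustrated with a sketch that scores each word image by the cheapest match between the query and any of its OCR hypotheses. A plain edit distance with uniform costs stands in for the paper's cost-based model, whose actual costs are not given in the abstract; the word-image ids and Latin-script strings are placeholders for Devanagari data.

```python
def edit_cost(a, b, sub=1.0, indel=1.0):
    """Edit distance between two strings with configurable substitution and insert/delete costs."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [i * indel]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + indel,                           # delete ca
                cur[j - 1] + indel,                        # insert cb
                prev[j - 1] + (0.0 if ca == cb else sub),  # match or substitute
            ))
        prev = cur
    return prev[-1]

def spot(query, hypotheses_by_image):
    """Rank word images by the lowest cost between the query and any OCR hypothesis."""
    costs = {
        image_id: min(edit_cost(query, hyp) for hyp in hyps)
        for image_id, hyps in hypotheses_by_image.items()
    }
    return sorted(costs, key=costs.get)

# Hypothetical OCR output: several recognition hypotheses per word image.
hypotheses = {
    "w1": ["nagar", "nagal"],
    "w2": ["sagar"],
    "w3": ["pustak"],
}
ranked_images = spot("nagar", hypotheses)
```

Keeping several hypotheses per word image lets a correct second-best OCR reading still produce a low-cost match for the query.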
