
Publications


Featured research published by Anoop Sarkar.


North American Chapter of the Association for Computational Linguistics | 2003

Example selection for bootstrapping statistical parsers

Mark Steedman; Rebecca Hwa; Stephen Clark; Miles Osborne; Anoop Sarkar; Julia Hockenmaier; Paul Ruhlen; Steven Baker; Jeremiah Crim

This paper investigates bootstrapping for statistical parsers to reduce their reliance on manually annotated training data. We consider both a mostly-unsupervised approach, co-training, in which two parsers are iteratively re-trained on each other's output; and a semi-supervised approach, corrected co-training, in which a human corrects each parser's output before adding it to the training data. The selection of labeled training examples is an integral part of both frameworks. We propose several selection methods based on the criteria of minimizing errors in the data and maximizing training utility. We show that incorporating the utility criterion into the selection method results in better parsers for both frameworks.
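
For illustration, here is a minimal sketch of utility-based example selection in the spirit of the criteria above; the confidence functions are hypothetical stand-ins, not the paper's actual selection methods.

```python
# A minimal sketch of utility-based example selection; `teacher_conf`
# and `student_conf` are hypothetical confidence functions, not the
# paper's actual scoring methods.

def select_examples(candidates, teacher_conf, student_conf, k=50):
    """Pick the k (sentence, parse) pairs the teacher is confident
    about (low expected error) but the student finds hard (high
    training utility)."""
    def score(item):
        sent, parse = item
        return teacher_conf(sent, parse) * (1.0 - student_conf(sent, parse))
    return sorted(candidates, key=score, reverse=True)[:k]
```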


North American Chapter of the Association for Computational Linguistics | 2001

Applying co-training methods to statistical parsing

Anoop Sarkar

We propose a novel co-training method for statistical parsing. The algorithm takes as input a small corpus (9695 sentences) annotated with parse trees, a dictionary of possible lexicalized structures for each word in the training set, and a large pool of unlabeled text. The algorithm iteratively labels the entire data set with parse trees. Using empirical results based on parsing the Wall Street Journal corpus, we show that training a statistical parser on the combined labeled and unlabeled data strongly outperforms training only on the labeled data.
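
A minimal sketch of the co-training loop described above follows; the parser interface and the cache and batch sizes are assumptions, and the paper's actual scoring functions and stopping criteria are omitted.

```python
# A minimal sketch of co-training for parsing; the Parser interface
# (train/parse/confidence) and the cache/batch sizes are assumptions,
# not the paper's actual settings.

def co_train(parser_a, parser_b, labeled, unlabeled, rounds=10,
             cache=500, batch=30):
    for _ in range(rounds):
        parser_a.train(labeled)
        parser_b.train(labeled)
        pool, unlabeled = unlabeled[:cache], unlabeled[cache:]
        # each parser labels the cache; its most confident parses are
        # added to the shared labeled set for the next round
        for parser in (parser_a, parser_b):
            confident = sorted(pool, key=parser.confidence, reverse=True)
            labeled += [(s, parser.parse(s)) for s in confident[:batch]]
    return parser_a, parser_b
```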


Conference of the European Chapter of the Association for Computational Linguistics | 2003

Bootstrapping statistical parsers from small datasets

Mark Steedman; Miles Osborne; Anoop Sarkar; Stephen Clark; Rebecca Hwa; Julia Hockenmaier; Paul Ruhlen; Steven Baker; Jeremiah Crim

We present a practical co-training method for bootstrapping statistical parsers using a small amount of manually parsed training material and a much larger pool of raw sentences. Experimental results show that unlabelled sentences can be used to improve the performance of statistical parsers. In addition, we consider the problem of bootstrapping parsers when the manually parsed training material is in a different domain to either the raw sentences or the testing material. We show that bootstrapping continues to be useful, even though no manually produced parses from the target domain are used.


International Conference on Computational Linguistics | 2000

Automatic extraction of subcategorization frames for Czech

Anoop Sarkar; Daniel Zeman

We present some novel machine learning techniques for the identification of subcategorization information for verbs in Czech. We compare three different statistical techniques applied to this problem. We show how the learning algorithm can be used to discover previously unknown subcategorization frames from the Czech Prague Dependency Treebank. The algorithm can then be used to label dependents of a verb in the Czech treebank as either arguments or adjuncts. Using our techniques, we are able to achieve 88% precision on unseen parsed text.
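
The abstract does not name the three techniques; a standard filter in the subcategorization literature is a binomial hypothesis test, sketched below purely to illustrate the general idea.

```python
# A sketch of a binomial hypothesis test, a standard filter in the
# subcategorization-frame literature; the paper compares three
# statistical techniques, and this only illustrates the general idea.
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def keep_frame(frame_count, verb_count, noise_rate=0.05, alpha=0.05):
    """Keep a candidate frame for a verb if observing it frame_count
    times in verb_count occurrences is unlikely to be noise alone."""
    return binomial_tail(verb_count, frame_count, noise_rate) < alpha
```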


North American Chapter of the Association for Computational Linguistics | 2009

Active Learning for Statistical Phrase-based Machine Translation

Gholamreza Haffari; Maxim Roy; Anoop Sarkar

Statistical machine translation (SMT) models need large bilingual corpora for training, which are unavailable for some language pairs. This paper provides the first serious experimental study of active learning for SMT. We use active learning to improve the quality of a phrase-based SMT system, and show significant improvements in translation compared to a random sentence selection baseline, when test and training data are taken from the same or different domains. Experimental results are shown in a simulated setting using three language pairs, and in a realistic situation for Bangla-English, a language pair with limited translation resources.
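
A minimal sketch of the active-learning loop follows; the SMT interface and the informativeness measure are assumptions, and the paper's actual sentence-selection strategies are not reproduced here.

```python
# A minimal sketch of active learning for SMT; the system interface
# (train/informativeness) and the oracle are assumptions.

def active_learning(smt, bitext, monolingual, translate_oracle,
                    rounds=5, k=100):
    for _ in range(rounds):
        smt.train(bitext)
        # rank unlabeled source sentences by how much a human translation
        # would help (vs. the random-selection baseline in the paper)
        ranked = sorted(monolingual, key=smt.informativeness, reverse=True)
        chosen, monolingual = ranked[:k], ranked[k:]
        # the oracle (a human translator in practice) labels the batch
        bitext += [(src, translate_oracle(src)) for src in chosen]
    return smt
```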


Journal of Logic, Language and Information | 2003

D-LTAG System: Discourse Parsing with a Lexicalized Tree-Adjoining Grammar

Katherine Forbes; Eleni Miltsakaki; Rashmi Prasad; Anoop Sarkar; Aravind K. Joshi; Bonnie Webber

We present an implementation of a discourse parsing system for a lexicalized Tree-Adjoining Grammar for discourse, specifying the integration of sentence and discourse level processing. Our system is based on the assumption that the compositional aspects of semantics at the discourse level parallel those at the sentence level. This coupling is achieved by factoring away inferential semantics and anaphoric features of discourse connectives. Computationally, this parallelism is achieved because both the sentence and discourse grammar are LTAG-based and the same parser works at both levels. The approach to an LTAG for discourse has been developed by Webber and colleagues in some recent papers. Our system takes a discourse as input, parses the sentences individually, extracts the basic discourse constituent units from the sentence derivations, and reparses the discourse with reference to the discourse grammar while using the same parser used at the sentence level.
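
A schematic of the pipeline described in the last sentence, with hypothetical helper functions standing in for the system's actual components:

```python
# A schematic of the D-LTAG pipeline; the helpers passed in are
# hypothetical stand-ins for the system's actual components.
from itertools import chain

def parse_discourse(text, parser, sentence_grammar, discourse_grammar,
                    split_sentences, extract_units):
    # 1. parse each sentence with the sentence-level LTAG
    derivations = [parser.parse(s, sentence_grammar)
                   for s in split_sentences(text)]
    # 2. extract the basic discourse constituent units from the derivations
    units = list(chain.from_iterable(extract_units(d) for d in derivations))
    # 3. re-parse the unit sequence with the discourse grammar, reusing
    #    the same parser as at the sentence level
    return parser.parse(units, discourse_grammar)
```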


Empirical Methods in Natural Language Processing | 2003

Using LTAG based features in parse reranking

Libin Shen; Anoop Sarkar; Aravind K. Joshi

We propose the use of Lexicalized Tree Adjoining Grammar (LTAG) as a source of features that are useful for reranking the output of a statistical parser. In this paper, we extend the notion of a tree kernel over arbitrary sub-trees of the parse to the derivation trees and derived trees provided by the LTAG formalism, and in addition, we extend the original definition of the tree kernel, making it more lexicalized and more compact. We use LTAG-based features for the parse reranking task and obtain labeled recall and precision of 89.7%/90.0% on WSJ section 23 of the Penn Treebank for sentences of length ≤ 100 words. Our results show that the use of the LTAG-based tree kernel gives rise to a 17% relative difference in f-score improvement over the use of a linear kernel without LTAG-based features.
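
For orientation, here is a minimal sketch of an all-subtrees tree kernel in the Collins and Duffy style, which the paper extends to LTAG derivation and derived trees; the tree interface (production, children) is an assumption.

```python
# A minimal sketch of an all-subtrees tree kernel; the Tree structure
# (production, children) is an assumption, and the paper's lexicalized,
# more compact LTAG extension is not reproduced here.

def all_nodes(tree):
    yield tree
    for child in tree.children:
        yield from all_nodes(child)

def common_subtrees(n1, n2, decay):
    """Decayed count of common subtrees rooted at n1 and n2."""
    if n1.production != n2.production:
        return 0.0
    score = decay
    for c1, c2 in zip(n1.children, n2.children):
        score *= 1.0 + common_subtrees(c1, c2, decay)
    return score

def tree_kernel(t1, t2, decay=0.4):
    return sum(common_subtrees(a, b, decay)
               for a in all_nodes(t1) for b in all_nodes(t2))
```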


International Conference on Computational Linguistics | 1996

Coordination in Tree Adjoining Grammars: formalization and implementation

Anoop Sarkar; Aravind K. Joshi

In this paper we show that an account for coordination can be constructed using the derivation structures in a lexicalized Tree Adjoining Grammar (LTAG). We present a notion of derivation in LTAGs that preserves the notion of fixed constituency in the LTAG lexicon while providing the flexibility needed for coordination phenomena. We also discuss the construction of a practical parser for LTAGs that can handle coordination including cases of non-constituent coordination.


Machine Translation | 2007

Semi-supervised model adaptation for statistical machine translation

Nicola Ueffing; Gholamreza Haffari; Anoop Sarkar

Statistical machine translation systems are usually trained on large amounts of bilingual text (used to learn a translation model), and also large amounts of monolingual text in the target language (used to train a language model). In this article we explore the use of semi-supervised model adaptation methods for the effective use of monolingual data from the source language in order to improve translation quality. We propose several algorithms with this aim, and present the strengths and weaknesses of each one. We present detailed experimental evaluations on the French–English EuroParl data set and on data from the NIST Chinese–English large-data track. We show a significant improvement in translation quality on both tasks.
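
A minimal sketch of one such adaptation scheme (iterative self-training on source-language text) follows; the SMT interface and the filtering threshold are assumptions, and the article proposes and compares several algorithms of this kind.

```python
# A minimal sketch of semi-supervised adaptation via self-training on
# source-language monolingual text; the system interface and the keep
# ratio are assumptions, not the article's actual algorithms.

def adapt(smt, bitext, source_mono, rounds=3, keep=0.5):
    smt.train(bitext)
    for _ in range(rounds):
        # translate the in-domain source text with the current system
        synthetic = [(src, smt.translate(src)) for src in source_mono]
        # keep only the sentence pairs the system itself scores highly
        synthetic.sort(key=lambda pair: smt.score(*pair), reverse=True)
        trusted = synthetic[:int(len(synthetic) * keep)]
        smt.train(bitext + trusted)   # retrain with the trusted pairs
    return smt
```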


Canadian Conference on Artificial Intelligence | 2005

Voting between multiple data representations for text chunking

Hong Shen; Anoop Sarkar

This paper considers the hypothesis that voting between multiple data representations can be more accurate than voting between multiple learning models. This hypothesis has been considered before (cf. [San00]), but the focus was on voting methods rather than the data representations. In this paper, we focus on choosing specific data representations combined with simple majority voting. On the community-standard CoNLL-2000 data set, using no additional knowledge sources apart from the training data, we achieved a 94.01 Fβ=1 score for arbitrary phrase identification, compared to the previous best of 93.90. We also obtained a 95.23 Fβ=1 score for base NP identification. Significance tests show that our base NP identification score is significantly better than the previous comparable best Fβ=1 score of 94.22. Our main contribution is that our model is a fast, linear-time approach, while the previous best approach is significantly slower than our system.
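
A minimal sketch of majority voting across chunk representations (e.g. IOB1/IOB2/IOE1/IOE2) follows; the tagger interface and the representation-mapping function are assumptions.

```python
# A minimal sketch of majority voting over multiple data representations
# for chunking; the tagger interface and `to_common` mapping are
# assumptions, not the paper's actual components.
from collections import Counter

def vote(sentence, taggers, to_common):
    """taggers: (tagger, representation) pairs; to_common converts a tag
    sequence from its representation into one shared chunk format."""
    predictions = [to_common(tagger.tag(sentence), rep)
                   for tagger, rep in taggers]
    # token-by-token simple majority vote over the converted predictions
    return [Counter(tags).most_common(1)[0][0]
            for tags in zip(*predictions)]
```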

Collaboration


Dive into Anoop Sarkar's collaborations.

Top Co-Authors

Aravind K. Joshi

University of Pennsylvania

Rebecca Hwa

University of Pittsburgh

Yudong Liu

Simon Fraser University

Zhongmin Shi

Simon Fraser University
