Kishore Papineni
IBM
Publications
Featured research published by Kishore Papineni.
meeting of the association for computational linguistics | 2002
Kishore Papineni; Salim Roukos; Todd Ward; Wei-Jing Zhu
Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that cannot be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.
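The automatic evaluation method described in this abstract is the BLEU metric. A minimal single-sentence sketch, assuming clipped (modified) n-gram precision combined by a geometric mean and multiplied by a brevity penalty; the full metric is computed over a corpus against multiple references:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU-style score.

    For each n, counts candidate n-grams clipped by their reference
    counts (modified precision), combines precisions by a geometric
    mean, and applies a brevity penalty for short candidates.
    """
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        log_precisions.append(math.log(clipped / total) if clipped else float("-inf"))
    # Brevity penalty: no penalty if the candidate is at least as long as the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A perfect match scores 1.0; any missing 4-gram drives the geometric mean (and the sketch's score) to 0, which is why corpus-level aggregation or smoothing is used in practice.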
international conference on acoustics speech and signal processing | 1996
Mark E. Epstein; Kishore Papineni; Salim Roukos; Todd Ward; S. Della Pietra
We present a new approach to natural language understanding (NLU) based on the source-channel paradigm, and apply it to ARPA's Air Travel Information Service (ATIS) domain. The model uses techniques similar to those used by IBM in statistical machine translation. The parameters are trained using the exact match algorithm; a hierarchy of models is used to facilitate the bootstrapping of more complex models from simpler models.
meeting of the association for computational linguistics | 2006
Yaser Al-Onaizan; Kishore Papineni
In this paper, we argue that n-gram language models are not sufficient to address word reordering required for Machine Translation. We propose a new distortion model that can be used with existing phrase-based SMT decoders to address those n-gram language model limitations. We present empirical results in Arabic to English Machine Translation that show statistically significant improvements when our proposed model is used. We also propose a novel metric to measure word order similarity (or difference) between any pair of languages based on word alignments.
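The abstract proposes a word-order similarity metric derived from word alignments; the exact definition is not given here. As a hedged stand-in, a Kendall-tau-style measure over an alignment-induced permutation captures the same idea of counting how many word pairs keep their relative order (`perm` and this formulation are illustrative assumptions, not the paper's metric):

```python
from itertools import combinations

def order_similarity(perm):
    """Fraction of word pairs whose relative order is preserved.

    perm[i] is the target-side position aligned to source word i
    (a toy proxy for a word alignment).  1.0 means identical order,
    0.0 means fully reversed order.
    """
    pairs = list(combinations(range(len(perm)), 2))
    concordant = sum(1 for i, j in pairs if perm[i] < perm[j])
    return concordant / len(pairs)
```

For example, monotone alignments score 1.0 and fully reversed alignments score 0.0, so languages requiring heavy reordering (such as Arabic-English) sit lower on the scale.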
meeting of the association for computational linguistics | 2003
Young-Suk Lee; Kishore Papineni; Salim Roukos; Ossama Emam; Hany Hassan
We approximate Arabic's rich morphology with a model in which a word consists of a sequence of morphemes in the pattern prefix*-stem-suffix* (* denotes zero or more occurrences of a morpheme). Our method is seeded by a small manually segmented Arabic corpus and uses it to bootstrap an unsupervised algorithm to build the Arabic word segmenter from a large unsegmented Arabic corpus. The algorithm uses a trigram language model to determine the most probable morpheme sequence for a given input. The language model is initially estimated from a small manually segmented corpus of about 110,000 words. To improve the segmentation accuracy, we use an unsupervised algorithm for automatically acquiring new stems from a 155 million word unsegmented corpus, and re-estimate the model parameters with the expanded vocabulary and training corpus. The resulting Arabic word segmentation system achieves around 97% exact match accuracy on a test corpus containing 28,449 word tokens. We believe this is state-of-the-art performance, and the algorithm can be used for many highly inflected languages provided that one can create a small manually segmented corpus of the language of interest.
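Finding the most probable morpheme sequence for a word is a dynamic program over split points. A minimal sketch, assuming a unigram morpheme model and a hypothetical log-probability lexicon (the paper uses a trigram model and a bootstrapped vocabulary):

```python
import math

def segment(word, morpheme_logprob):
    """Most probable morpheme sequence under a unigram model.

    best[i] holds (best log-probability of segmenting word[:i],
    backpointer to the previous split point).  Viterbi-style DP.
    """
    n = len(word)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in morpheme_logprob and best[j][0] > -math.inf:
                score = best[j][0] + morpheme_logprob[piece]
                if score > best[i][0]:
                    best[i] = (score, j)
    if best[n][0] == -math.inf:
        return [word]          # no segmentation found; keep the word whole
    out, i = [], n
    while i > 0:               # backtrack through the split points
        j = best[i][1]
        out.append(word[j:i])
        i = j
    return out[::-1]
```

With a toy transliterated lexicon, `segment("alkitab", {"al": math.log(0.5), "kitab": math.log(0.5), "alkitab": math.log(0.01)})` prefers the prefix-stem split because its combined log-probability is higher than the whole-word entry.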
international conference on acoustics speech and signal processing | 1998
Kishore Papineni; Salim Roukos; Robert Todd Ward
We consider translating natural language sentences into a formal language using direct translation models built automatically from training data. Direct translation models have three components: an arbitrary prior conditional probability distribution, features that capture correlations between automatically determined key phrases or sets of words in both languages, and weights associated with these features. The features and the weights are selected using a training corpus of matched pairs of source and target language sentences to maximize the entropy or a new discrimination measure of the resulting conditional probability model. We report results in the air travel information system domain and compare the two methods of training.
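The direct translation model described above is a conditional exponential (maximum-entropy) model. A minimal sketch of the model form only, with hypothetical feature functions and weights; the paper's features, training corpus, and discrimination measure are not reproduced here:

```python
import math

def cond_prob(target, source, targets, features, weights):
    """Conditional exponential model: p(t|s) ∝ exp(Σ_i λ_i f_i(s, t)).

    `features` is a list of feature functions f_i(source, target) and
    `weights` the corresponding λ_i; the sum over `targets` normalizes.
    """
    def score(t):
        return math.exp(sum(w * f(source, t) for f, w in zip(features, weights)))
    z = sum(score(t) for t in targets)
    return score(target) / z
```

Training selects the weights λ_i to maximize entropy (or the paper's alternative discrimination measure) over matched source-target sentence pairs; the model form above stays the same under either criterion.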
north american chapter of the association for computational linguistics | 2001
Kishore Papineni
Inverse Document Frequency (IDF) is a popular measure of a word's importance. The IDF invariably appears in a host of heuristic measures used in information retrieval. However, so far the IDF has itself been a heuristic. In this paper, we show IDF to be optimal in a principled sense. We show that IDF is the optimal weight of a word with respect to minimization of a Kullback-Leibler distance suitably generalized to nonnegative functions which need not be probability distributions. This optimization problem is closely related to the maximum entropy problem. We show that the IDF is the optimal weight associated with a word-feature in an information retrieval setting where we treat each document as the query that retrieves itself. That is, IDF is optimal for document self-retrieval.
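The standard IDF weight whose optimality the paper establishes can be stated in a few lines (this is the textbook definition, log of corpus size over document frequency, not the paper's derivation):

```python
import math

def idf(term, documents):
    """IDF(term) = log(N / df), where N is the number of documents
    and df the number of documents containing the term."""
    df = sum(1 for doc in documents if term in doc.split())
    return math.log(len(documents) / df) if df else 0.0
```

A term appearing in every document gets weight log(1) = 0, while rarer terms get larger weights, matching the intuition that rare words are more discriminative for retrieval.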
international conference on acoustics speech and signal processing | 1999
Kishore Papineni
This paper presents a linear programming approach to discriminative training. We first define a measure of discrimination of an arbitrary conditional probability model on a set of labeled training data. We consider maximizing discrimination on a parametric family of exponential models that arises naturally in the maximum entropy framework. We show that this optimization problem is globally convex in R^n, and is moreover piecewise linear on R^n. We propose a solution that involves solving a series of linear programming problems. We provide a characterization of global optimizers. We compare this framework with those of minimum classification error and maximum entropy.
north american chapter of the association for computational linguistics | 2003
Yaser Al-Onaizan; Radu Florian; Martin Franz; Hany Hassan; Young-Suk Lee; J. Scott McCarley; Kishore Papineni; Salim Roukos; Jeffrey S. Sorensen; Christoph Tillmann; Todd Ward; Fei Xia
Searching online information is increasingly a daily activity for many people. The multilinguality of online content is also increasing (e.g. the proportion of English web users, which has been decreasing as a fraction of the growing population of web users, dipped below 50% in the summer of 2001). To improve the ability of an English speaker to search multilingual content, we built a system that supports cross-lingual search of an Arabic newswire collection and provides on-demand translation of Arabic web pages into English. The cross-lingual search engine supports a fast search capability (sub-second response for typical queries) and achieves state-of-the-art performance in the high precision region of the result list. The on-demand statistical machine translation uses the Direct Translation model along with a novel statistical Arabic Morphological Analyzer to yield state-of-the-art translation quality. The on-demand SMT uses an efficient dynamic programming decoder that achieves reasonable speed for translating web documents.
Archive | 1998
Kishore Papineni; Salim Roukos; Robert Todd Ward
Archive | 1997
Kishore Papineni; Salim Roukos; Robert Todd Ward