Nizar Habash
New York University Abu Dhabi
Publication
Featured research published by Nizar Habash.
meeting of the association for computational linguistics | 2005
Nizar Habash; Owen Rambow
We present an approach to using a morphological analyzer for tokenizing and morphologically tagging (including part-of-speech tagging) Arabic words in one process. We learn classifiers for individual morphological features, as well as ways of using these classifiers to choose among entries from the output of the analyzer. We obtain accuracy rates on all tasks in the high nineties.
north american chapter of the association for computational linguistics | 2006
Nizar Habash; Fatiha Sadat
In this paper, we study the effect of different word-level preprocessing decisions for Arabic on SMT quality. Our results show that given large amounts of training data, splitting off only proclitics performs best. However, for small amounts of training data, it is best to apply English-like tokenization using part-of-speech tags, and sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing produces a significant increase in BLEU score if there is a change in genre between training and test data.
Archive | 2007
Nizar Habash; Abdelhadi Soudi; Timothy Buckwalter
This chapter introduces the transliteration scheme used to represent Arabic characters in this book. The scheme is a one-to-one transliteration of the Arabic script that is complete, easy to read, and consistent with Arabic computer encodings. We present guidelines for Arabic pronunciation using this transliteration scheme and discuss various idiosyncrasies of Arabic orthography.
meeting of the association for computational linguistics | 2008
Ryan M. Roth; Owen Rambow; Nizar Habash; Mona T. Diab; Cynthia Rudin
We investigate the tasks of general morphological tagging, diacritization, and lemmatization for Arabic. We show that for all tasks we consider, both modeling the lexeme explicitly, and retuning the weights of individual classifiers for the specific task, improve the performance.
meeting of the association for computational linguistics | 2006
Nizar Habash; Owen Rambow
We present MAGEAD, a morphological analyzer and generator for the Arabic language family. Our work is novel in that it explicitly addresses the need for processing the morphology of the dialects. MAGEAD performs an on-line analysis to or generation from a root+pattern+features representation; it has separate phonological and orthographic representations, and it allows for combining morphemes from different dialects. We present a detailed evaluation of MAGEAD.
north american chapter of the association for computational linguistics | 2007
Nizar Habash; Owen Rambow
We present a diacritization system for written Arabic which is based on a lexical resource. It combines a tagger and a lexeme language model. It improves on the best results reported in the literature.
conference of the european chapter of the association for computational linguistics | 2006
David Chiang; R.M. Diab; Nizar Habash; R. Hwa; Roger Levy; Owen Rambow; Khalil Sima'an
The Arabic language is a collection of spoken dialects with important phonological, morphological, lexical, and syntactic differences, along with a standard written language, Modern Standard Arabic (MSA). Since the spoken dialects are not officially written, it is very costly to obtain adequate corpora to use for training dialect NLP tools such as parsers. In this paper, we address the problem of parsing transcribed spoken Levantine Arabic (LA). We do not assume the existence of any annotated LA corpus (except for development and testing), nor of a parallel LA-MSA corpus. Instead, we use explicit knowledge about the relation between LA and MSA.
meeting of the association for computational linguistics | 2009
Nizar Habash; Ryan M. Roth
The Columbia Arabic Treebank (CATiB) is a database of syntactic analyses of Arabic sentences. CATiB contrasts with previous approaches to Arabic treebanking in its emphasis on speed with some constraints on linguistic richness. Two basic ideas inspire the CATiB approach: no annotation of redundant information and using representations and terminology inspired by traditional Arabic syntax. We describe CATiB's representation and annotation procedure, and report on inter-annotator agreement and speed.
meeting of the association for computational linguistics | 2008
Nizar Habash
We present four techniques for online handling of Out-of-Vocabulary words in Phrase-based Statistical Machine Translation. The techniques use spelling expansion, morphological expansion, dictionary term expansion and proper name transliteration to reuse or extend a phrase table. We compare the performance of these techniques and combine them. Our results show a consistent improvement over a state-of-the-art baseline in terms of BLEU and a manual error analysis.
Archive | 2006
Fadi Biadsy; Jihad El-Sana; Nizar Habash
Online handwriting recognition of Arabic script is a difficult problem since it is naturally both cursive and unconstrained. The analysis of Arabic script is further complicated in comparison to Latin script due to obligatory dots/strokes that are placed above or below most letters. This paper introduces a Hidden Markov Model (HMM) based system to provide solutions for most of the difficulties inherent in recognizing Arabic script, including letter connectivity, position-dependent letter shaping, and delayed strokes. This is the first HMM-based solution to online Arabic handwriting recognition. We report successful results for writer-dependent and writer-independent word recognition.