Khalil Sima'an
University of Amsterdam
Publications
Featured research published by Khalil Sima'an.
International Conference on Computational Linguistics | 1996
Khalil Sima'an
This paper studies the computational complexity of disambiguation under probabilistic tree-grammars as in (Bod, 1992; Schabes and Waters, 1993). It presents a proof that the following problems are NP-hard: computing the Most Probable Parse from a sentence or from a word-graph, and computing the Most Probable Sentence (MPS) from a word-graph. The NP-hardness of computing the MPS from a word-graph also holds for Stochastic Context-Free Grammars (SCFGs).
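For contrast with the hardness results in this paper: computing the Most Probable Parse of a single sentence under an SCFG is tractable with Viterbi-CKY in cubic time. A minimal sketch of that tractable case, assuming a grammar in Chomsky normal form; the toy lexicon and rules are invented for illustration:

```python
from collections import defaultdict

def most_probable_parse(words, lexicon, rules):
    """Viterbi-CKY: most probable parse of a sentence under a PCFG
    in Chomsky normal form. `lexicon` maps (tag, word) -> prob,
    `rules` maps (A, B, C) -> prob for binary rules A -> B C."""
    n = len(words)
    best = defaultdict(float)   # (i, j, A) -> best inside probability
    back = {}                   # backpointers for recovering the tree
    for i, w in enumerate(words):
        for (tag, word), p in lexicon.items():
            if word == w:
                best[(i, i + 1, tag)] = p
                back[(i, i + 1, tag)] = w
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):           # try every split point
                for (a, b, c), p in rules.items():
                    q = p * best[(i, k, b)] * best[(k, j, c)]
                    if q > best[(i, j, a)]:
                        best[(i, j, a)] = q
                        back[(i, j, a)] = (k, b, c)
    return best[(0, n, "S")], back

# Toy grammar: "time flies" as N V under S -> N V.
lexicon = {("N", "time"): 0.5, ("V", "time"): 0.5,
           ("N", "flies"): 0.5, ("V", "flies"): 0.5}
rules = {("S", "N", "V"): 1.0}
prob, _ = most_probable_parse(["time", "flies"], lexicon, rules)
```

The NP-hardness shown in the paper arises precisely where this dynamic program no longer applies: tree-grammar derivations do not decompose over single chart items, and maximizing sentence probability over a word-graph requires summing over derivations.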
Conference of the European Chapter of the Association for Computational Linguistics | 2006
David Chiang; Mona Diab; Nizar Habash; Rebecca Hwa; Roger Levy; Owen Rambow; Khalil Sima'an
The Arabic language is a collection of spoken dialects with important phonological, morphological, lexical, and syntactic differences, along with a standard written language, Modern Standard Arabic (MSA). Since the spoken dialects are not officially written, it is very costly to obtain adequate corpora to use for training dialect NLP tools such as parsers. In this paper, we address the problem of parsing transcribed spoken Levantine Arabic (LA). We do not assume the existence of any annotated LA corpus (except for development and testing), nor of an LA-MSA parallel corpus. Instead, we use explicit knowledge about the relation between LA and MSA.
Meeting of the Association for Computational Linguistics | 2000
Khalil Sima'an
This paper explores the kinds of probabilistic relations that are important in syntactic disambiguation. It proposes that two widely used kinds of relations, lexical dependencies and structural relations, have complementary disambiguation capabilities. It presents a new model based on structural relations, the Tree-gram model, and reports experiments showing that structural relations should benefit from enrichment by lexical dependencies.
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers | 2016
Lucia Specia; Stella Frank; Khalil Sima'an; Desmond Elliott
This paper introduces and summarises the findings of a new shared task at the intersection of Natural Language Processing and Computer Vision: the generation of image descriptions in a target language, given an image and/or one or more descriptions in a different (source) language. This challenge was organised along with the Conference on Machine Translation (WMT16), and called for system submissions for two task variants: (i) a translation task, in which a source language image description needs to be translated to a target language, (optionally) with additional cues from the corresponding image, and (ii) a description generation task, in which a target language description needs to be generated for an image, (optionally) with additional cues from source language descriptions of the same image. In this first edition of the shared task, 16 systems were submitted for the translation task and seven for the image description task, from a total of 10 teams.
Meeting of the Association for Computational Linguistics | 2005
Roy Bar-Haim; Khalil Sima'an; Yoad Winter
A major architectural decision in designing a disambiguation model for segmentation and Part-of-Speech (POS) tagging in Semitic languages concerns the choice of the input-output terminal symbols over which the probability distributions are defined. In this paper we develop a segmenter and a tagger for Hebrew based on Hidden Markov Models (HMMs). We start out from a morphological analyzer and a very small morphologically annotated corpus. We show that a model whose terminal symbols are word segments (=morphemes), is advantageous over a word-level model for the task of POS tagging. However, for segmentation alone, the morpheme-level model has no significant advantage over the word-level model. Error analysis shows that both models are not adequate for resolving a common type of segmentation ambiguity in Hebrew -- whether or not a word in a written text is prefixed by a definiteness marker. Hence, we propose a morpheme-level model where the definiteness morpheme is treated as a possible feature of morpheme terminals. This model exhibits the best overall performance, both in POS tagging and in segmentation. Despite the small size of the annotated corpus available for Hebrew, the results achieved using our best model are on par with recent results on Modern Standard Arabic.
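The morpheme-level HMM described above can be sketched with a standard bigram Viterbi decoder over word segments. Everything below (tag names, transliterations, probabilities) is illustrative, not the paper's actual model:

```python
import math

def viterbi_tag(segments, tags, trans, emit):
    """Bigram-HMM Viterbi over a sequence of morpheme segments.
    `trans[(t1, t2)]` and `emit[(t, seg)]` hold log-probabilities;
    "<s>" starts the tag chain. Missing entries count as log(0)."""
    NEG_INF = float("-inf")
    scores = [{t: trans.get(("<s>", t), NEG_INF)
                  + emit.get((t, segments[0]), NEG_INF) for t in tags}]
    back = []
    for seg in segments[1:]:
        row, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: scores[-1][p]
                                           + trans.get((p, t), NEG_INF))
            row[t] = (scores[-1][prev] + trans.get((prev, t), NEG_INF)
                      + emit.get((t, seg), NEG_INF))
            ptr[t] = prev
        scores.append(row)
        back.append(ptr)
    last = max(tags, key=lambda t: scores[-1][t])
    path = [last]
    for ptr in reversed(back):           # follow backpointers
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy example: a definiteness marker followed by a noun,
# in invented transliteration with invented probabilities.
tags = ["DEF", "NN"]
lg = math.log
trans = {("<s>", "DEF"): lg(0.5), ("<s>", "NN"): lg(0.5),
         ("DEF", "NN"): lg(1.0)}
emit = {("DEF", "ha"): lg(1.0), ("NN", "bayit"): lg(1.0)}
path = viterbi_tag(["ha", "bayit"], tags, trans, emit)
```

Treating the definiteness morpheme as a feature of morpheme terminals, as the paper proposes, amounts to changing what counts as a terminal symbol in `emit`, not the decoding algorithm itself.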
Meeting of the Association for Computational Linguistics | 2016
Desmond Elliott; Stella Frank; Khalil Sima'an; Lucia Specia
We introduce the Multi30K dataset to stimulate multilingual multimodal research. Recent advances in image description have been demonstrated on English-language datasets almost exclusively, but image description should not be limited to English. This dataset extends the Flickr30K dataset with i) German translations created by professional translators over a subset of the English descriptions, and ii) descriptions crowdsourced independently of the original English descriptions. We outline how the data can be used for multilingual image description and multimodal machine translation, but we anticipate the data will be useful for a broader range of tasks.
Workshop on Statistical Machine Translation | 2014
Milos Stanojevic; Khalil Sima'an
We present the UvA-ILLC submission of the BEER metric to the WMT14 metrics task. BEER is a sentence-level metric that can incorporate a large number of features combined in a linear model. Novel contributions are (1) efficient tuning of a large number of features for maximizing correlation with human system ranking, and (2) novel features that give smoother sentence-level scores.
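At its core, a sentence-level metric built as a linear model is a weighted sum of feature values. A minimal sketch; the feature names and weights below are invented for illustration and are not BEER's actual features:

```python
def linear_metric_score(features, weights):
    """Sentence-level score as a weighted linear combination of
    features, in the spirit of BEER's linear model. In the real
    metric the weights are tuned to maximize correlation with
    human system rankings; here they are hand-picked toys."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical features for one candidate/reference sentence pair.
features = {"char_ngram_precision": 0.62,
            "char_ngram_recall": 0.58,
            "reordering_penalty": -0.10}
weights = {"char_ngram_precision": 0.5,
           "char_ngram_recall": 0.5,
           "reordering_penalty": 1.0}
score = linear_metric_score(features, weights)
```

Because the model is linear, adding a new feature only requires adding one entry to each dictionary, which is what makes tuning a large feature set practical.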
Natural Language Engineering | 2008
Roy Bar-Haim; Khalil Sima'an; Yoad Winter
Words in Semitic texts often consist of a concatenation of word segments, each corresponding to a part-of-speech (POS) category. Semitic words may be ambiguous with regard to their segmentation as well as to the POS tags assigned to each segment. When designing POS taggers for Semitic languages, a major architectural decision concerns the choice of the atomic input tokens (terminal symbols). If the tokenization is at the word level, the output tags must be complex, and represent both the segmentation of the word and the POS tag assigned to each word segment. If the tokenization is at the segment level, the input itself must encode the different alternative segmentations of the words, while the output consists of standard POS tags. Comparing these two alternatives is not trivial, as the choice between them may have global effects on the grammatical model. Moreover, intermediate levels of tokenization between these two extremes are conceivable, and, as we aim to show, beneficial. To the best of our knowledge, the problem of tokenization for POS tagging of Semitic languages has not been addressed before in full generality. In this paper, we study this problem for the purpose of POS tagging of Modern Hebrew texts. After extensive error analysis of the two simple tokenization models, we propose a novel, linguistically motivated, intermediate tokenization model that gives better performance for Hebrew over the two initial architectures. Our study is based on the well-known hidden Markov models (HMMs). We start out from a manually devised morphological analyzer and a very small annotated corpus, and describe how to adapt an HMM-based POS tagger for both tokenization architectures. We present an effective technique for smoothing the lexical probabilities using an untagged corpus, and a novel transformation for casting the segment-level tagger in terms of a standard, word-level HMM implementation. 
The results obtained using our model are on par with the best published results on Modern Standard Arabic, despite the much smaller annotated corpus available for Modern Hebrew.
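The two tokenization extremes discussed above can be made concrete with a toy ambiguous word. The transliteration and tag names here are illustrative only, not drawn from the paper's data:

```python
# One ambiguous Hebrew word, in invented transliteration: read as a
# single segment (a noun) or as preposition + noun ("b" + "cl").
word = "bcl"

# Word-level tokenization: one atomic input token; the output tags
# are complex, encoding segmentation and POS together.
word_level = {
    "input": ["bcl"],
    "candidate_outputs": [["NN"], ["IN+NN"]],
}

# Segment-level tokenization: the input itself enumerates the
# alternative segmentations; the outputs are plain POS tags.
segment_level = {
    "candidate_inputs": [["bcl"], ["b", "cl"]],
    "candidate_outputs": [["NN"], ["IN", "NN"]],
}
```

The intermediate architectures the paper studies sit between these two encodings, choosing which distinctions live in the terminal symbols and which in the tag set.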
Meeting of the Association for Computational Linguistics | 2007
Saib Mansour; Khalil Sima'an; Yoad Winter
We propose an enhanced Part-of-Speech (POS) tagger of Semitic languages that treats Modern Standard Arabic (henceforth Arabic) and Modern Hebrew (henceforth Hebrew) using the same probabilistic model and architectural setting. We start out by porting an existing Hidden Markov Model POS tagger for Hebrew to Arabic by exchanging a morphological analyzer for Hebrew with Buckwalter's (2002) morphological analyzer for Arabic. This gives state-of-the-art accuracy (96.12%), comparable to Habash and Rambow's (2005) analyzer-based POS tagger on the same Arabic datasets. However, further improvement of such analyzer-based tagging methods is hindered by the incomplete coverage of standard morphological analyzers (Bar-Haim et al., 2005). To overcome this coverage problem we supplement the output of Buckwalter's analyzer with synthetically constructed analyses that are proposed by a model which uses character information (Diab et al., 2004) in a way that is similar to Nakagawa's (2004) system for Chinese and Japanese. A version of this extended model that (unlike Nakagawa) incorporates synthetically constructed analyses also for known words achieves 96.28% accuracy on the standard Arabic test set.
IEEE Transactions on Audio, Speech, and Language Processing | 2008
Hany Hassan; Khalil Sima'an; Andy Way
Until quite recently, extending phrase-based statistical machine translation (PBSMT) with syntactic knowledge caused system performance to deteriorate. The most recent successful enrichments of PBSMT with hierarchical structure either employ nonlinguistically motivated syntax for capturing hierarchical reordering phenomena, or extend the phrase translation table with redundantly ambiguous syntactic structures over phrase pairs. In this paper, we present an extended, harmonized account of our previous work which showed that incorporating linguistically motivated lexical syntactic descriptions, called supertags, can yield significantly better PBSMT systems at insignificant extra computational cost. We describe a novel PBSMT model that integrates supertags into the target language model and the target side of the translation model. Two kinds of supertags are employed: those from lexicalized tree-adjoining grammar and combinatory categorial grammar. Despite the differences between the two sets of supertags, they give similar improvements. In addition to integrating the Markov supertagging approach in PBSMT, we explore the utility of a new surface grammaticality measure based on combinatory operators. We perform various experiments on the Arabic-to-English NIST 2005 test set addressing the issues of sparseness, scalability, and the utility of system subcomponents. We show that even when the parallel training data grows very large, the supertagged system retains a relatively stable absolute performance advantage over the unadorned PBSMT system. Arguably, this hints at a performance gap that cannot be bridged by acquiring more phrase pairs. Our best result shows a relative improvement of 6.1% over a state-of-the-art PBSMT model, which compares favorably with the leading systems on the NIST 2005 task. We also demonstrate that the advantages of a supertag-based system carry over to German-English, where improvements of up to 8.9% relative to the baseline system are observed.
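Integrating a supertag language model into PBSMT fits the standard log-linear scoring scheme: the supertag LM is simply one more weighted feature of a hypothesis. A sketch with invented feature values and weights, not the paper's actual model scores:

```python
import math

def hypothesis_score(feats, lam):
    """Log-linear scoring of a translation hypothesis, extended
    with a supertag language-model feature alongside the usual
    PBSMT features. All values here are illustrative."""
    return sum(lam[k] * v for k, v in feats.items())

feats = {
    "word_lm": math.log(1e-4),      # target word n-gram LM
    "supertag_lm": math.log(1e-2),  # n-gram LM over supertag sequences
    "phrase_trans": math.log(1e-3), # phrase translation probability
    "length": 5.0,                  # word-count feature
}
lam = {"word_lm": 1.0, "supertag_lm": 0.5,
       "phrase_trans": 1.0, "length": -0.1}
score = hypothesis_score(feats, lam)
```

Because the combination is log-linear, swapping LTAG supertags for CCG supertags only changes how the `supertag_lm` feature is computed, which is consistent with the paper's finding that both supertag sets give similar improvements.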