Yassine Benajiba
Columbia University
Publications
Featured research published by Yassine Benajiba.
IEEE Transactions on Audio, Speech, and Language Processing | 2009
Yassine Benajiba; Mona T. Diab; Paolo Rosso
The named entity recognition task aims at identifying and classifying named entities within an open-domain text. This task has been garnering significant attention recently as it has been shown to help improve the performance of many natural language processing applications. In this paper, we investigate the impact of using different sets of features in three discriminative machine learning frameworks, namely, support vector machines, maximum entropy, and conditional random fields, for the task of named entity recognition. Our language of interest is Arabic. We explore lexical, contextual and morphological features and nine data-sets of different genres and annotations. We measure the impact of the different features in isolation and incrementally combine them in order to evaluate the robustness to noise of each approach. We achieve the highest performance using a combination of 15 features in conditional random fields using broadcast news data (Fβ=1 = 83.34).
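To make the feature categories concrete, here is a minimal sketch (not the authors' code) of the kind of lexical, contextual, and character-level morphological features such a discriminative NER tagger extracts per token. The feature names, the window of one token on each side, and the affix length are assumptions for illustration only.

```python
def token_features(tokens, i):
    """Extract simple lexical and contextual features for token i (illustrative)."""
    word = tokens[i]
    return {
        "word": word,                                                    # lexical identity
        "prefix3": word[:3],                                             # leading characters, a crude morphological cue
        "suffix3": word[-3:],                                            # trailing characters
        "prev_word": tokens[i - 1] if i > 0 else "<S>",                  # left context
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "</S>",   # right context
    }

# Toy Arabic sentence; a real system would feed these feature dicts,
# token by token, to an SVM, MaxEnt, or CRF learner.
sentence = ["توفي", "محمد", "في", "القاهرة"]
features = token_features(sentence, 1)
```

In the incremental setup the paper describes, feature groups like these would be switched on one at a time to measure each group's contribution in isolation and in combination.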
International Conference on Natural Language Processing | 2011
Mohamed Outahajala; Yassine Benajiba; Paolo Rosso; Lahbib Zenkouar
The aim of this paper is to present the first Amazighe POS tagger. Very few linguistic resources have been developed so far for Amazighe, and we believe that the development of a POS tagger is the first step needed for automatic text processing. The data used were manually collected and annotated. We used state-of-the-art supervised machine learning approaches to build our POS-tagging models. The accuracy obtained reached 92.58%, and we used the 10-fold cross-validation technique to further validate our results.
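The 10-fold protocol mentioned above can be sketched as follows; this is a generic illustration, with a toy scoring function standing in for the actual SVM/CRF tagger the paper trains on each split.

```python
def ten_fold_accuracy(examples, evaluate):
    """Split examples into 10 contiguous folds; evaluate(train, held_out)
    returns the accuracy on one held-out fold. Returns the 10-fold average."""
    k = 10
    fold_size = len(examples) // k
    scores = []
    for f in range(k):
        held_out = examples[f * fold_size:(f + 1) * fold_size]
        train = examples[:f * fold_size] + examples[(f + 1) * fold_size:]
        scores.append(evaluate(train, held_out))
    return sum(scores) / k

# Toy data and scorer: accuracy is the fraction of held-out items whose
# type was already seen in training (a stand-in for a real tagger).
data = ["a", "b", "c", "d", "e"] * 20
acc = ten_fold_accuracy(data, lambda tr, te: sum(x in tr for x in te) / len(te))
```

Averaging over all ten held-out folds, rather than reporting a single train/test split, is what gives the validation the paper refers to.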
ACM Transactions on Asian Language Information Processing | 2009
Yassine Benajiba; Imed Zitouni
The Arabic language has a very rich, complex morphology. Each Arabic word is composed of zero or more prefixes, one stem and zero or more suffixes. Consequently, Arabic data are sparse compared to other languages such as English, and it is necessary to conduct word segmentation before any natural language processing task. The word-segmentation step is therefore worth a deeper study, since it is a preprocessing step that has a significant impact on all subsequent steps. In this article, we present an Arabic mention detection system that achieved very competitive results in the recent Automatic Content Extraction (ACE) evaluation campaign. We investigate the impact of different segmentation schemes on Arabic mention detection systems and we show how these systems may benefit from more than one segmentation scheme. We report the performance of several mention detection models using different kinds of possible and known segmentation schemes for Arabic text: punctuation separation, Arabic Treebank, and morphological and character-level segmentations. We show that the combination of competitive segmentation styles leads to a better performance. Results indicate a statistically significant improvement when Arabic Treebank and morphological segmentations are combined.
IEEE Transactions on Audio, Speech, and Language Processing | 2014
Imed Zitouni; Yassine Benajiba
In the last two decades, significant effort has been put into annotating linguistic resources in several languages. Despite this valiant effort, there are still many languages that have only small amounts of such resources. The goal of this article is to present and investigate a method of propagating information (specifically mentions) from a resource-rich language such as English into a relatively resource-poor language such as Arabic. We also compare this approach to its equivalent counterpart using monolingual resources. Part of the investigation is to quantify the contribution of propagating information under different conditions, based on the availability of resources in the target language. Experiments on the Arabic-English language pair show that one can achieve relatively decent performance by propagating information from a language with richer resources, such as English, into Arabic alone (no resources or models in the target language Arabic). Furthermore, results show that features propagated from English help improve the Arabic system performance even when used in conjunction with all feature types built from the target language. Experiments also show that using propagated features in conjunction with lexically-derived features only (as can be obtained directly from a mention-annotated corpus) brings the system performance to the level obtained in the target language using features derived from many linguistic resources, thereby improving the system when such resources are not available.
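The propagation idea above can be sketched as projecting mention spans from an English sentence onto its Arabic translation through word alignments. The alignment pairs below are hand-made for illustration; a real system would derive them from parallel data, and the span-merging rule here is a simplifying assumption.

```python
def project_mentions(mentions, alignment):
    """Map each (start, end) source-token span to a span over the target
    tokens its words align to. alignment is a list of (src_idx, tgt_idx)."""
    projected = []
    for start, end in mentions:
        targets = sorted({t for s, t in alignment if start <= s <= end})
        if targets:
            # Simplification: take the contiguous span covering all targets.
            projected.append((min(targets), max(targets)))
    return projected

# English: "Barack Obama visited Cairo"; mention (0, 1) = "Barack Obama",
# mention (3, 3) = "Cairo". Hand-made English-to-Arabic alignment:
alignment = [(0, 1), (1, 2), (2, 0), (3, 3)]
spans = project_mentions([(0, 1), (3, 3)], alignment)
```

Spans projected this way are noisy, which is why the article treats them as additional features for the target-language system rather than as gold annotations.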
2013 ACS International Conference on Computer Systems and Applications (AICCSA) | 2013
Mohamed Outahajala; Yassine Benajiba; Lahbib Zenkouar; Paolo Rosso
Like most of the languages which have only recently started being investigated for Natural Language Processing (NLP) tasks, Amazigh lacks annotated corpora and still suffers from the scarcity of linguistic tools and resources. The main aim of this paper is to present a tokenizer tool and a new part-of-speech (POS) tagger based on a new Amazigh tag set (AMTS) composed of 28 tags. In line with our goal, we have trained two sequence classification models using Support Vector Machines (SVMs) and Conditional Random Fields (CRFs) to build a tokenizer and a POS tagger for the Amazigh language. We have used the 10-fold technique to evaluate and validate our approach. The POS tagging results of SVMs and CRFs are very comparable: CRFs outperformed SVMs both at the fold level (91.18% vs. 90.75%) and on the 10-fold average (87.95% vs. 87.11%). For the tokenization task, SVMs outperformed CRFs at the fold level (99.97% vs. 99.85%) and on the 10-fold average (99.95% vs. 99.89%).
CLEF (Online Working Notes/Labs/Workshop) | 2011
Anselmo Peñas; Eduard H. Hovy; Pamela Forner; Álvaro Rodrigo; Richard F. E. Sutcliffe; Caroline Sporleder; Corina Forascu; Yassine Benajiba; Petya Osenova
Meeting of the Association for Computational Linguistics | 2010
Yassine Benajiba; Imed Zitouni; Mona T. Diab; Paolo Rosso
The International Arab Journal of Information Technology | 2009
Yassine Benajiba; Mona T. Diab; Paolo Rosso
CLEF (Working Notes) | 2013
Richard F. E. Sutcliffe; Anselmo Peñas; Eduard H. Hovy; Pamela Forner; Álvaro Rodrigo; Corina Forascu; Yassine Benajiba; Petya Osenova
Empirical Methods in Natural Language Processing | 2010
Yassine Benajiba; Imed Zitouni