Kamel Smaïli
French Institute for Research in Computer Science and Automation
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Kamel Smaïli.
Machine Translation | 2011
Sylvain Raybaud; David Langlois; Kamel Smaïli
Machine translation systems are not reliable enough to be used “as is”: except for the most simple tasks, they can only be used to grasp the general meaning of a text or assist human translators. The purpose of confidence measures is to detect erroneous words or sentences produced by a machine translation system. In this article, after reviewing the mathematical foundations of confidence estimation, we propose a comparison of several state-of-the-art confidence measures, predictive parameters and classifiers. We also propose two original confidence measures based on Mutual Information and a method for automatically generating data for training and testing classifiers. We applied these techniques to data from the WMT campaign 2008 and found that the best confidence measures yielded an Equal Error Rate of 36.3% at word level and 34.2% at sentence level, but combining different measures reduced these rates to 35.0% and 29.0%, respectively. We also present the results of an experiment aimed at determining how helpful confidence measures are in a post-editing task. Preliminary results suggest that our system is not yet ready to efficiently help post-editors, but we now have both software and a protocol that we can apply to further experiments, and user feedback has indicated aspects which must be improved in order to increase the level of helpfulness of confidence measures.
string processing and information retrieval | 2001
Brigitte Bigi; Armelle Brun; Jean Paul Haton; Kamel Smaïli; Imed Zitouni
This work presents several statistical methods for topic identification on two kinds of textual data: newspaper articles and e-mails. Five methods are tested on these two corpora: topic unigrams, cache model, TFIDF classijier, topic peqdexity, and weighted model. Our work aims to study these methods by confronting them to very diferent data. This study is very fruitful for our research. Statistical topic identiJication methods depend not only on a corpus, but also on its type. One of the methods achieves a topic identiJcation of 80% on a general newspaper corpus but does not exceed 30% on e-mail corpus. Another method gives the best result on e-mails, but has not the same behavior on a newspaper corpus. We also show in this paper that almost all our methods achieve good results in retrieving the first two manually annotated labels.
string processing and information retrieval | 2000
Armelle Brun; Kamel Smaïli; Jean Paul Haton
We present several methods for topic detection on newspaper articles, using either a general vocabulary or topic-specific vocabularies. Specific vocabularies are determined manually or statistically. In both cases, we aim at finding the most representative words of a topic. Several methods have been experimented, the first one is based on perplexity, this method achieves a 100% topic identification rate, on large test corpora, when the two first propositions are taken into account. Other methods are based on statistical counts and achieve 94% of identification on smaller test corpora. The major challenge of this work is to identify topics with only few words in order to be able, during speech recognition, to determine the best adequate language model.
conference on intelligent text processing and computational linguistics | 2015
Salima Harrat; Karima Meftouh; Mourad Abbas; Salma Jamoussi; Motaz Saad; Kamel Smaïli
We present, in this paper an Arabic multi-dialect study including dialects from both the Maghreb and the Middle-east that we compare to the Modern Standard Arabic (MSA). Three dialects from Maghreb are concerned by this study: two from Algeria and one from Tunisia and two dialects from Middle-east (Syria and Palestine). The resources which have been built from scratch have lead to a collection of a multi-dialect parallel resource. Furthermore, this collection has been aligned by hand with a MSA corpus. We conducted several analytical studies in order to understand the relationship between these vernacular languages. For this, we studied the closeness between all the pairs of dialects and MSA in terms of Hellinger distance. We also performed an experiment of dialect identification. This experiment showed that neighbouring dialects as expected tend to be confused, making difficult their identification. Because the Arabic dialects are different from one region to another which make the communication between people difficult, we conducted cross-lingual machine translation between all the pairs of dialects and also with MSA. Several interesting conclusions have been carried out from this experiment.
International Journal of Advanced Computer Science and Applications | 2016
Salima Harrat; Karima Meftouh; Mourad Abbas; Walid-Khaled Hidouci; Kamel Smaïli
Arabic is the official language overall Arab coun-tries, it is used for official speech, news-papers, public adminis-tration and school. In Parallel, for everyday communication, non-official talks, songs and movies, Arab people use their dialects which are inspired from Standard Arabic and differ from one Arabic country to another. These linguistic phenomenon is called disglossia, a situation in which two distinct varieties of a language are spoken within the same speech community. It is observed Throughout all Arab countries, standard Arabic widely written but not used in everyday conversation, dialect widely spoken in everyday life but almost never written. Thus, in NLP area, a lot of works have been dedicated for written Arabic. In contrast, Arabic dialects at a near time were not studied enough. Interest for them is recent. First work for these dialects began in the last decade for middle-east ones. Dialects of the Maghreb are just beginning to be studied. Compared to written Arabic, dialects are under-resourced languages which suffer from lack of NLP resources despite their large use. We deal in this paper with Arabic Algerian dialect a non-resourced language for which no known resource is available to date. We present a first linguistic study introducing its most important features and we describe the resources that we created from scratch for this dialect.
information sciences, signal processing and their applications | 2007
Armelle Brun; David Langlois; Kamel Smaïli
This study examines how to take originally advantage from distant information in statistical language models. We show that it is possible to use n-gram models considering histories different from those used during training. These models are called crossing context models. Our study deals with classical and distant n-gram models. A mixture of four models is proposed and evaluated. A bigram linear mixture achieves an improvement of 14% in terms of perplexity. Moreover the trigram mixture outperforms the standard trigram by 5.6%. These improvements have been obtained without complexifying standard n-gram models. The resulting mixture language model has been integrated into a speech recognition system. Its evaluation achieves a slight improvement in terms of word error rate on the data used for the francophone evaluation campaign ESTER [1]. Finally, the impact of the proposed crossing context language models on performance is presented according to various speakers.
Computer Speech & Language | 2001
Frédéric Bimbot; Marc El-Beze; Stéphane Igounet; Michèle Jardino; Kamel Smaïli; Imed Zitouni
Language models are usually evaluated on test texts using the perplexity derived from the model likelihood function computed on these texts (test set perplexity). In order to use this measure in the framework of a comparative evaluation campaign, we have developed an alternative scheme for estimating the test set perplexity. The method is derived from the Shannon game and based on a gambling approach on the next word to come in a truncated sentence. We also study the entropy bounds proposed by Shannon and based on the rank of the correct answer, in order to estimate a perplexity interval for non-probabilistic language models. The relevance of the approach is validated on an example. We then report the results of a preliminary comparative evaluation using the proposed scheme.
intelligent data analysis | 2010
Cherif Chiraz Latiri; Kamel Smaïli; Caroline Lavecchia; David Langlois
In this paper, we describe two new methods of mining monolingual and bilingual text corpora that heavily rely on the use of association rules and triggers. The association rules based method is firstly applied in query expansion. The conducted experiments on French newspapers and on a set of scientific documents show that the proposed approach outperforms the baseline model. The second method focuses on the machine translation and is motivated by the results of triggers on statistical language modeling. In order to build up a translation table, association rules and triggers are then generalized to mine bilingual corpora. In this respect, we propose respectively the concepts of inter-lingual association rules and inter-lingual triggers. Both methods have been integrated in a real statistical machine translation. Carried out experiments highlight the practical feasibility of the introduced approaches in the context of machine translation and show that inter-lingual triggers achieve better results than those obtained using the third IBM model.
Proceedings of the Third Arabic Natural Language Processing Workshop | 2017
Mohamed Amine Menacer; Odile Mella; Dominique Fohr; Denis Jouvet; David Langlois; Kamel Smaïli
Automatic speech recognition for Arabic is a very challenging task. Despite all the classical techniques for Automatic Speech Recognition (ASR), which can be efficiently applied to Arabic speech recognition , it is essential to take into consideration the language specificities to improve the system performance. In this article, we focus on Modern Standard Arabic (MSA) speech recognition. We introduce the challenges related to Arabic language, namely the complex morphology nature of the language and the absence of the short vowels in written text, which leads to several potential vowelization for each graphemes, which is often conflicting. We develop an ASR system for MSA by using Kaldi toolkit. Several acoustic and language models are trained. We obtain a Word Error Rate (WER) of 14.42 for the baseline system and 12.2 relative improvement by rescoring the lattice and by rewriting the output with the right hamoza above or below Alif.
Procedia Computer Science | 2017
Mohamed Amine Menacer; Odile Mella; Dominique Fohr; Denis Jouvet; David Langlois; Kamel Smaïli
Abstract This paper addresses the development of an Automatic Speech Recognition system for Modern Standard Arabic (MSA) and its extension to Algerian dialect. Algerian dialect is very different from Arabic dialects of the Middle-East, since it is highly influenced by the French language. In this article, we start by presenting the new automatic speech recognition named ALASR (Arabic Loria Automatic Speech Recognition) system. The acoustic model of ALASR is based on a DNN approach and the language model is a classical n-gram. Several options are investigated in this paper to find the best combination of models and parameters. ALASR achieves good results for MSA in terms of WER (14.02%), but it completely collapses on an Algerian dialect data set of 70 minutes (a WER of 89%). In order to take into account the impact of the French language, on the Algerian dialect, we combine in ALASR two acoustic models, the original one (MSA) and a French one trained on ESTER corpus. This solution has been adopted because no transcribed speech data for Algerian dialect are available. This combination leads to a substantial absolute reduction of the word error of 24%.
Collaboration
Dive into the Kamel Smaïli's collaboration.
French Institute for Research in Computer Science and Automation
View shared research outputs