Sherif M. Abdou
Cairo University
Publication
Featured research published by Sherif M. Abdou.
IEEE Transactions on Audio, Speech, and Language Processing | 2011
Mohsen A. Rashwan; Mohamed Al-Badrashiny; Mohamed Attia; Sherif M. Abdou; Ahmed Rafea
This paper introduces a large-scale dual-mode stochastic system to automatically diacritize raw Arabic text. The first mode determines the most likely diacritics by choosing the sequence of full-form Arabic word diacritizations with maximum marginal probability via A* lattice search and long-horizon n-gram probability estimation. When full-form words are out of vocabulary (OOV), the system switches to the second mode, which factorizes each Arabic word into all its possible morphological constituents and then applies the same techniques as the first mode to find the most likely sequence of morphemes, and hence the most likely diacritization. While the second mode achieves far better coverage of the highly derivative and inflective Arabic language, the first mode is faster to learn, i.e., it yields better disambiguation results for the same size of training corpora, especially for inferring syntactic (case-ending) diacritics. Our hybrid system, which benefits from the advantages of both modes, has been found experimentally superior to the best-performing reported systems of Habash and Rambow, and of Zitouni, using the same training and test corpus for the sake of fair comparison. The word error rates (morphological diacritization, overall diacritization including case endings) for the three systems are, respectively, (3.1%, 12.5%), (5.5%, 14.9%), and (7.9%, 18%). The hybrid architecture of language factorizing and unfactorizing components may inspire solutions to other NLP/HLT problems in analogous situations.
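The idea of picking the most likely sequence of candidate diacritizations from a lattice can be illustrated with a minimal sketch. Note the assumptions: this uses a plain Viterbi-style dynamic program over bigram scores rather than the A* search and long-horizon n-grams named in the abstract, and the function name `best_sequence`, the toy candidates, and the probability table are all hypothetical.

```python
import math

def best_sequence(lattice, bigram_logprob):
    """Return the highest-scoring path through a non-empty lattice.

    lattice: list of positions, each a list of candidate strings.
    bigram_logprob: function (previous, current) -> log probability.
    """
    # best[c] = (score, path) for the best path ending in candidate c
    best = {c: (bigram_logprob("<s>", c), [c]) for c in lattice[0]}
    for candidates in lattice[1:]:
        new_best = {}
        for c in candidates:
            # Extend every surviving path by c and keep the best one
            score, path = max(
                ((s + bigram_logprob(prev, c), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[c] = (score, path + [c])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Hypothetical toy model: two candidate forms per word position
TOY_PROBS = {
    ("<s>", "a1"): 0.6, ("<s>", "a2"): 0.4,
    ("a1", "b1"): 0.1, ("a1", "b2"): 0.2,
    ("a2", "b1"): 0.9, ("a2", "b2"): 0.1,
}

def toy_logprob(prev, cur):
    # Unseen bigrams get a small floor probability (an assumption)
    return math.log(TOY_PROBS.get((prev, cur), 1e-6))

result = best_sequence([["a1", "a2"], ["b1", "b2"]], toy_logprob)
```

In this toy case the locally best first candidate (`a1`) is not on the globally best path, which is the reason for searching over whole sequences instead of choosing each word independently.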
arXiv: Computation and Language | 2015
Hossam Samir Ibrahim; Sherif M. Abdou; Mervat Gheith
The rise of social media such as blogs and social networks has fueled interest in sentiment analysis. With the proliferation of reviews, ratings, recommendations, and other forms of online expression, online opinion has turned into a kind of virtual currency for businesses looking to market their products, identify new opportunities, and manage their reputations; many are therefore now turning to the field of sentiment analysis. In this paper, we present a feature-based, sentence-level approach for Arabic sentiment analysis. Our approach uses a lexicon of Arabic idioms and sayings as a key resource for improving the detection of sentiment polarity in Arabic sentences, together with a novel and rich set of linguistically motivated features (contextual intensifiers, contextual shifters, and negation handling) and syntactic features for conflicting phrases, which enhance sentiment classification accuracy. Furthermore, we introduce an automatically expandable, wide-coverage polarity lexicon of Arabic sentiment words. The lexicon is built from a seed of gold-standard sentiment words that is manually collected and annotated, and it expands by automatically detecting the sentiment orientation of new sentiment words using a synset aggregation technique and free online Arabic lexicons and thesauri. Our data focus on modern standard Arabic (MSA) and Egyptian dialectal Arabic tweets and microblogs (hotel reservations, product reviews, etc.). The experimental results using our resources and techniques with an SVM classifier indicate high performance levels, with accuracies of over 95%.
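As a rough illustration of how contextual intensifiers and negation can modify the polarity contributed by a lexicon word, consider the sketch below. It is not the authors' feature extractor: the lexicons, the weights, the English placeholder tokens (standing in for Arabic entries), and the function name `sentence_score` are all hypothetical, and the paper feeds such features to an SVM rather than summing a score.

```python
# Hypothetical toy lexicons; English tokens stand in for Arabic entries
POLARITY = {"good": 1.0, "bad": -1.0, "clean": 1.0, "noisy": -1.0}
INTENSIFIERS = {"very": 1.5, "extremely": 2.0}
NEGATORS = {"not", "never"}

def sentence_score(tokens):
    """Sum lexicon polarities, applying pending intensifiers and negation."""
    score = 0.0
    weight = 1.0      # product of intensifiers seen since the last sentiment word
    negate = False    # True if an odd number of negators is pending
    for tok in tokens:
        if tok in INTENSIFIERS:
            weight *= INTENSIFIERS[tok]
        elif tok in NEGATORS:
            negate = not negate
        elif tok in POLARITY:
            value = POLARITY[tok] * weight
            score += -value if negate else value
            # Modifiers apply only to the next sentiment word
            weight, negate = 1.0, False
    return score

positive = sentence_score("the room was very clean".split())
negative = sentence_score("not very good but extremely noisy".split())
```

Here `positive` is scaled up by the intensifier, while in the second sentence the negator flips an intensified positive word before a second, independently intensified negative word is added.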
international conference natural language processing | 2008
Mohamed Attia; Mohsen A. Rashwan; Ahmed Ragheb; Mohamed Al-Badrashiny; Husein Al-Basoumy; Sherif M. Abdou
Applications of statistical Arabic NLP in general, and text mining in particular, along with the tools underneath, perform much better when the statistical processing operates on deeper language factorizations than on raw text. Lexical semantic factorization is very important in this regard due to its feasibility, high level of abstraction, and the language independence of its output. At the core of such a factorization lies an Arabic lexical semantic DB. While building this language resource (LR), we had to go beyond the conventional exclusive collection of words from dictionaries and thesauri, which alone cannot produce satisfactory coverage of this highly inflective and derivative language. This paper is hence devoted to the design and implementation of an Arabic lexical semantics LR that enables the retrieval of the possible senses of any given Arabic word at high coverage. Instead of tying full Arabic words to their possible senses, our LR flexibly relates morphologically and PoS-tag-constrained Arabic lexical compounds to a predefined limited set of semantic fields across which the standard semantic relations are defined. With the aid of the same large-scale Arabic morphological analyzer and PoS tagger at runtime, the possible senses of virtually any given Arabic word are retrievable.
ieee international conference on recent trends in information systems | 2015
Hossam Samir Ibrahim; Sherif M. Abdou; Mervat Gheith
Sentiment analysis (SA) and opinion mining (OM) have become a field of interest that has attracted research attention during the last decade, owing to the growing number of internet documents (especially online reviews and comments) on social media such as blogs and social networks. Many attempts have been made to build corpora for SA, since such resources are considered a key factor in SA and OM systems. However, the need to build these resources remains, especially for morphologically rich languages (MRLs) such as Arabic. In this paper, we present MIKA, a multi-genre tagged corpus of modern standard Arabic (MSA) and colloquial Arabic. MIKA is manually collected and annotated at the sentence level with semantic orientation (positive, negative, or neutral). A rich set of linguistically motivated features (contextual intensifiers, contextual shifters, and negation handling), syntactic features for conflicting phrases, and others are used in the annotation process. Our data focus on MSA and Egyptian dialectal Arabic. We report our efforts in manually building and annotating the sentiment corpus using different types of data, such as tweets and Arabic microblogs (hotel reservations, product reviews, and TV program comments).
asian conference on pattern recognition | 2011
Ibrahim Hosny; Sherif M. Abdou; Aly A. Fahmy
Online handwriting recognition of Arabic script is a difficult problem since the script is naturally both cursive and unconstrained. The analysis of Arabic script is further complicated by the obligatory dots/strokes that are placed above or below most letters and are usually written delayed in order. This paper introduces a Hidden Markov Model (HMM) based system that provides solutions for most of the difficulties inherent in recognizing Arabic script. A preprocessing step is introduced that handles the delayed strokes so that they match the structure of the HMM. The HMMs are trained with Writer Adaptive Training (WAT) to minimize the variance between writers in the training data. In addition, the discriminative power of the models is enhanced with discriminative training. The system performance is evaluated using an international test set from the ADAB competition and shows promising performance compared with state-of-the-art systems.
International Journal of Computer Applications | 2015
Hossam Samir Ibrahim; Sherif M. Abdou; Mervat Gheith
Despite the fair amount of work on sentiment analysis (SA) and opinion mining (OM) systems in the last decade, the performance of these systems still falls short of what is desired, especially for morphologically rich languages (MRLs) such as Arabic, owing to the complexities and challenges inherent in the nature of the language itself. One of these challenges is the detection of idioms or proverbs within the writer's text or comment. An idiom or proverb is a form of speech or an expression that is peculiar to itself. Grammatically, it cannot be understood from the individual meanings of its elements and can yield a different sentiment when treated as separate words. Consequently, to facilitate the detection and classification of lexical phrases for automated SA systems, this paper presents AIPSeLEX, a novel idioms/proverbs sentiment lexicon for modern standard Arabic (MSA) and colloquial Arabic. AIPSeLEX is manually collected and annotated at the sentence level with semantic orientation (positive or negative). The efforts of manually building and annotating the lexicon are reported. Moreover, we build a classifier that extracts idiom and proverb phrases from text using n-gram and similarity measure methods. Finally, several experiments were carried out on various data, including Arabic tweets and Arabic microblogs (hotel reservations, product reviews, and TV program comments) from publicly available Arabic online review websites (social media, blogs, forums, e-commerce web sites), to evaluate the coverage and accuracy of AIPSeLEX.
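One simple way to combine n-grams with a similarity measure for spotting lexicon phrases in text is sketched below. This is only an assumption about how such matching could work, not the classifier described in the abstract: the function names, the 0.8 threshold, the English placeholder idiom (standing in for an Arabic lexicon entry), and the choice of `difflib.SequenceMatcher` as the similarity measure are all illustrative.

```python
from difflib import SequenceMatcher

def ngrams(tokens, n):
    """Return all contiguous n-token windows of the token list."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

def find_idioms(tokens, lexicon, threshold=0.8):
    """Match text n-grams against lexicon idioms by string similarity.

    lexicon maps an idiom string to its polarity label.
    Returns (matched text, lexicon idiom, polarity) triples.
    """
    matches = []
    for idiom, polarity in lexicon.items():
        n = len(idiom.split())  # compare windows of the idiom's own length
        for gram in ngrams(tokens, n):
            candidate = " ".join(gram)
            similarity = SequenceMatcher(None, candidate, idiom).ratio()
            if similarity >= threshold:
                matches.append((candidate, idiom, polarity))
    return matches

# Hypothetical one-entry lexicon with an English placeholder idiom
LEXICON = {"piece of cake": "positive"}
exact = find_idioms("the exam was a piece of cake".split(), LEXICON)
fuzzy = find_idioms("a piece of cakes".split(), LEXICON)
```

Matching on similarity rather than exact equality lets slightly varied surface forms (as in `fuzzy`) still resolve to the lexicon entry, at the cost of a threshold that has to be tuned.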
arXiv: Computation and Language | 2015
AbdelRahim A. Elmadany; Sherif M. Abdou; Mervat Gheith
Building dialogue systems has recently gained considerable attention, but most of the resources and systems built so far are tailored to English and other Indo-European languages. The need to design systems for other languages, such as Arabic, is increasing. For this reason, there is growing interest in the Arabic dialogue act classification task, because it plays a key role in the Arabic language understanding needed to build such systems. This paper surveys different techniques for dialogue act classification for Arabic. We describe the main existing techniques for utterance segmentation and classification, annotation schemas, and test corpora for Arabic dialogue understanding that have been introduced in the literature.
International Journal of Computer Applications | 2013
Waleed M. Azmy; Sherif M. Abdou; Mahmoud Shoman
This paper introduces the work done to build an Arabic unit selection voice that can carry emotional information. Three emotional states were covered: normal, sad, and questions. An emotional speech classifier was used to enhance the intelligibility of the recorded speech database. The classification information was employed in the proposed target cost to produce more natural and emotive synthetic speech. The system is evaluated according to the naturalness and emotiveness of the produced speech. The evaluations show a significant increase in the naturalness and emotiveness scores.
International Journal of Computer Applications | 2012
Abdulrahman Alshameri; Sherif M. Abdou; Khaled Mostafa
Text and non-text segmentation and text line extraction from document images are among the most challenging problems in information indexing of Arabic document images such as books, technical articles, business letters, and faxes, and must be solved in order to process such images successfully in systems such as OCR. Research on Arabic document digitization has focused on word and handwriting recognition. Few approaches have been proposed for layout analysis of scanned or captured Arabic documents. In this paper we present a page segmentation method that deals with the complexity of Arabic language characteristics and fonts by combining two algorithms. The first is the Run Length Smoothing algorithm. The second is the Connected Component Labeling algorithm for text and non-text classification using SVM. The two methods are combined through ANDing and ORing operations between their outputs, based on certain conditions. Then, a dynamic horizontal projection is applied, with the threshold updated dynamically to be commensurate with the noise associated with different documents and between text lines. The performance evaluation uses manually generated ground truth representations from a dataset of Arabic document images captured using cameras and hardware built for this purpose. Evaluation and experimental results demonstrate that the proposed text extraction method is independent of document size, text size, font, and shape, and is robust for Arabic document segmentation and text line extraction. General Terms: Image processing, Pattern Recognition.
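The run-length smoothing step mentioned above can be sketched in a few lines. This follows the classic formulation (smooth horizontally, smooth vertically, then AND the two results) and is not the paper's exact pipeline; the function names, the convention that 1 means ink and 0 means background, the thresholds, and the choice to fill only gaps bounded by ink on both sides are assumptions made for illustration.

```python
def rlsa_row(row, threshold):
    """Fill background gaps (0s) of length <= threshold lying between ink pixels (1s)."""
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen
    for i, pixel in enumerate(row):
        if pixel == 1:
            gap = i - last_ink - 1 if last_ink is not None else 0
            if 0 < gap <= threshold:
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out

def rlsa_horizontal(image, threshold):
    """Smooth every row of a binary image given as a list of lists."""
    return [rlsa_row(row, threshold) for row in image]

def rlsa_vertical(image, threshold):
    """Smooth every column by transposing, smoothing rows, and transposing back."""
    columns = [rlsa_row(list(col), threshold) for col in zip(*image)]
    return [list(row) for row in zip(*columns)]

def rlsa(image, h_threshold, v_threshold):
    """Combine horizontal and vertical smoothing with a pixel-wise AND."""
    horizontal = rlsa_horizontal(image, h_threshold)
    vertical = rlsa_vertical(image, v_threshold)
    return [[a & b for a, b in zip(hr, vr)] for hr, vr in zip(horizontal, vertical)]

smoothed_row = rlsa_row([1, 0, 0, 1, 0, 0, 0, 0, 1], 2)
smoothed_image = rlsa([[1, 0, 1], [0, 0, 0], [1, 0, 1]], 1, 1)
```

Short gaps inside a word or text line are closed while long gaps (column gutters, margins) stay open, so the smoothed image breaks into blocks that a later stage can classify as text or non-text.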
Pattern Analysis and Applications | 2016
Ibrahim Abdelaziz; Sherif M. Abdou; Hassanin M. Al-Barhamtoshy
The success of Hidden Markov Models (HMMs) in speech recognition has motivated their adoption for handwriting recognition, especially online handwriting, which as a sequential process closely resembles the speech signal. Some languages, such as Arabic, Farsi, and Urdu, include a large number of delayed strokes that are placed above or below most letters and are usually written delayed in time. These delayed strokes represent a modeling challenge for the conventional left-right HMM that is commonly used in Automatic Speech Recognition (ASR) systems. In this paper, we introduce a new approach for handling delayed strokes in Arabic online handwriting recognition using HMMs. We also show that several modeling approaches currently used in most state-of-the-art ASR systems, such as context-based tri-grapheme models, speaker adaptive training, and discriminative training, can provide similar performance improvements for Handwriting Recognition (HWR) systems. Finally, we show that a multi-pass decoder that uses the computationally less expensive models in the early passes can provide an Arabic large-vocabulary HWR system with practical decoding time. We evaluated the performance of our proposed Arabic HWR system using two databases with small and large lexicons. For the small-lexicon data set, our system achieved results competitive with the best reported state-of-the-art Arabic HWR systems. For the large lexicon, our system achieved promising results (accuracy and time) for a vocabulary size of 64k words, with the possibility of adapting the models to specific writers to get even better results.