Mohamed Al-Badrashiny

Publication

Featured research published by Mohamed Al-Badrashiny.

IEEE Transactions on Audio, Speech, and Language Processing | 2011

A Stochastic Arabic Diacritizer Based on a Hybrid of Factorized and Unfactorized Textual Features

Mohsen A. Rashwan; Mohamed Al-Badrashiny; Mohamed Attia; Sherif M. Abdou; Ahmed Rafea

This paper introduces a large-scale dual-mode stochastic system to automatically diacritize raw Arabic text. The first mode determines the most likely diacritics by choosing the sequence of full-form Arabic word diacritizations with maximum marginal probability via A* lattice search and long-horizon n-gram probability estimation. When full-form words are out of vocabulary (OOV), the system switches to the second mode, which factorizes each Arabic word into all its possible morphological constituents and then applies the same techniques as the first mode to get the most likely sequence of morphemes, hence the most likely diacritization. While the second mode achieves far better coverage of the highly derivative and inflective Arabic language, the first mode is faster to learn, i.e., it yields better disambiguation results for the same size of training corpora, especially for inferring syntactic (case-ending) diacritics. The hybrid system, which benefits from the advantages of both modes, has experimentally been found superior to the best-performing reported systems of Habash and Rambow, and of Zitouni, using the same training and test corpus for the sake of fair comparison. The word error rates of (morphological diacritization, overall diacritization including the case endings) for the three systems are, respectively, as follows: (3.1%, 12.5%), (5.5%, 14.9%), and (7.9%, 18%). The hybrid architecture of language factorizing and unfactorizing components may inspire solutions to other NLP/HLT problems in analogous situations.
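
To put the reported word error rates in perspective, the short sketch below computes the relative error reduction of the hybrid system over each baseline. It assumes that "respectively" lists the hybrid system first, then Habash and Rambow, then Zitouni; the function name `relative_reduction` is illustrative and does not come from the paper.

```python
def relative_reduction(baseline: float, system: float) -> float:
    """Relative error reduction in percent, rounded to one decimal place."""
    return round(100.0 * (baseline - system) / baseline, 1)

# (morphological WER, overall WER incl. case endings), in percent, from the abstract.
# Assumption: the abstract lists the hybrid system first.
HYBRID = (3.1, 12.5)
BASELINES = {
    "Habash and Rambow": (5.5, 14.9),
    "Zitouni": (7.9, 18.0),
}

for name, (morph, overall) in BASELINES.items():
    print(
        f"vs {name}: "
        f"morphological {relative_reduction(morph, HYBRID[0])}%, "
        f"overall {relative_reduction(overall, HYBRID[1])}%"
    )
```

Under that reading, the gain is larger for morphological diacritization than for overall diacritization, which also has to resolve case endings.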

Applications of Natural Language to Data Bases | 2013

Code Switch Point Detection in Arabic

Heba Elfardy; Mohamed Al-Badrashiny; Mona T. Diab

This paper introduces a dual-mode stochastic system to automatically identify linguistic code switch points in Arabic. The first mode determines the most likely tag for each word (i.e., dialectal or Modern Standard Arabic) by choosing the sequence of Arabic word tags with maximum marginal probability via lattice search and 5-gram probability estimation. When words are out of vocabulary (OOV), the system switches to the second mode, which uses dialectal Arabic (DA) and Modern Standard Arabic (MSA) morphological analyzers. If the OOV word is analyzable by the DA morphological analyzer only, it is tagged as “DA”; if it is analyzable by the MSA morphological analyzer only, it is tagged as “MSA”; if it is analyzable by both, it is tagged as “both”. The system yields an Fβ=1 score of 76.9% on the development dataset and 76.5% on the held-out test dataset, both judged against human-annotated Egyptian forum data.
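
The fallback rule for OOV words described in the abstract above can be sketched as follows. The two analyzers are passed in as plain predicates, which is a simplification for illustration. The `"unknown"` tag for words that neither analyzer accepts is an assumption, since the abstract does not cover that case.

```python
from typing import Callable

def tag_oov_word(
    word: str,
    da_analyzable: Callable[[str], bool],
    msa_analyzable: Callable[[str], bool],
) -> str:
    """Tag an out-of-vocabulary word by which analyzer(s) accept it."""
    in_da = da_analyzable(word)
    in_msa = msa_analyzable(word)
    if in_da and in_msa:
        return "both"
    if in_da:
        return "DA"
    if in_msa:
        return "MSA"
    # Assumption: the abstract does not specify this case.
    return "unknown"

# Toy stand-ins for the analyzers: simple set membership (hypothetical data).
da_words = {"w1", "w2"}
msa_words = {"w2", "w3"}
for w in ["w1", "w2", "w3", "w4"]:
    print(w, tag_oov_word(w, da_words.__contains__, msa_words.__contains__))
```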

International Conference Natural Language Processing | 2008

A Compact Arabic Lexical Semantics Language Resource Based on the Theory of Semantic Fields

Mohamed Attia; Mohsen A. Rashwan; Ahmed Ragheb; Mohamed Al-Badrashiny; Husein Al-Basoumy; Sherif M. Abdou

Applications of statistical Arabic NLP in general, and text mining in particular, along with the tools underneath, perform much better when the statistical processing operates on deeper language factorizations rather than on raw text. Lexical semantic factorization is very important in this regard due to its feasibility, high level of abstraction, and the language independence of its output. At the core of such a factorization lies an Arabic lexical semantic database. While building this language resource (LR), we had to go beyond the conventional exclusive collection of words from dictionaries and thesauri, which cannot alone produce satisfactory coverage of this highly inflective and derivative language. This paper is hence devoted to the design and implementation of an Arabic lexical semantics LR that enables the retrieval of the possible senses of any given Arabic word at high coverage. Instead of tying full Arabic words to their possible senses, our LR flexibly relates Arabic lexical compounds, constrained by morphology and PoS tags, to a predefined limited set of semantic fields across which the standard semantic relations are defined. With the aid of the same large-scale Arabic morphological analyzer and PoS tagger at runtime, the possible senses of virtually any given Arabic word are retrievable.

Workshop on Computational Approaches to Code Switching | 2014

AIDA: Identifying Code Switching in Informal Arabic Text

Heba Elfardy; Mohamed Al-Badrashiny; Mona T. Diab

In this paper, we present the latest version of our system for identifying linguistic code switching in Arabic text. The system relies on Language Models and a tool for morphological analysis and disambiguation for Arabic to identify the class of each word in a given sentence. We evaluate the performance of our system on the test datasets of the shared task at the EMNLP workshop on Computational Approaches to Code Switching (Solorio et al., 2014). The system yields an average token-level Fβ=1 score of 93.6%, 77.7% and 80.1% on the first, second, and surprise-genre test sets, respectively, and a tweet-level Fβ=1 score of 4.4%, 36% and 27.7% on the same test sets.
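
For reference, the F score reported in the abstract above is the standard F-measure with β = 1, i.e. the harmonic mean of precision and recall. A minimal sketch of the general formula follows; the function name `f_beta` is illustrative, and the shared task's own scoring script may average over classes differently.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-measure; beta = 1 weights precision and recall equally."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is correct
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Toy values, not taken from the paper.
print(f_beta(1.0, 0.5))
```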

Conference on Computational Natural Language Learning | 2014

Automatic Transliteration of Romanized Dialectal Arabic

Mohamed Al-Badrashiny; Ramy Eskander; Nizar Habash; Owen Rambow

In this paper, we address the problem of converting Dialectal Arabic (DA) text that is written in the Latin script (called Arabizi) into Arabic script following the CODA convention for DA orthography. The presented system uses a finite state transducer trained at the character level to generate all possible transliterations for the input Arabizi words. We then filter the generated list using a DA morphological analyzer. After that we pick the best choice for each input word using a language model. We achieve an accuracy of 69.4% on an unseen test set compared to 63.1% using a system which represents a previously proposed approach.
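
The generate, filter, and rank pipeline described above can be illustrated with a toy sketch. Everything here is a hypothetical stand-in: a small character map replaces the trained finite state transducer, a word set replaces the DA morphological analyzer, and a score table replaces the language model, which in the real system would presumably use context. Falling back to the unfiltered candidates when the filter rejects everything is also an assumption.

```python
from itertools import product

# Hypothetical character map standing in for the trained character-level FST.
CHAR_MAP = {
    "k": ["\u0643"],            # kaf
    "t": ["\u062A", "\u0637"],  # ta or emphatic ta
    "a": ["\u0627", ""],        # alif, or a short vowel left unwritten
    "b": ["\u0628"],            # ba
}
# Stand-in for the morphological analyzer: words it would accept.
LEXICON = {"\u0643\u062A\u0627\u0628", "\u0643\u062A\u0628"}
# Stand-in for the language model: context-free scores.
LM_SCORE = {"\u0643\u062A\u0627\u0628": 0.7, "\u0643\u062A\u0628": 0.3}

def generate(word: str) -> set:
    """Expand every character into its options and join all combinations."""
    options = [CHAR_MAP.get(ch, [ch]) for ch in word]
    return {"".join(parts) for parts in product(*options)}

def transliterate(word: str) -> str:
    candidates = generate(word)
    valid = {c for c in candidates if c in LEXICON}  # filter step
    pool = valid or candidates  # assumption: fall back if nothing passes
    # Rank step: highest score wins; sorting makes ties deterministic.
    return max(sorted(pool), key=lambda c: LM_SCORE.get(c, 0.0))

print(transliterate("ktab"))
```

For the input `ktab`, the map yields four candidates, the lexicon keeps two, and the score table selects the form with the written alif.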

Workshop on Computational Approaches to Code Switching | 2014

Foreign Words and the Automatic Processing of Arabic Social Media Text Written in Roman Script

Ramy Eskander; Mohamed Al-Badrashiny; Nizar Habash; Owen Rambow

Arabic on social media has all the properties of any language on social media that make it tough for natural language processing, plus some specific problems. These include diglossia, the use of an alternative alphabet (Roman), and code switching with foreign languages. In this paper, we present a system which can process Arabic written in the Roman alphabet (“Arabizi”). It identifies whether each word is a foreign word or belongs to one of four other categories (Arabic, name, punctuation, sound), and transliterates Arabic words and names into the Arabic alphabet. We obtain an overall system performance of 83.8% on an unseen test set.

IEEE Transactions on Audio, Speech, and Language Processing | 2009

Fassieh®, a Semi-Automatic Visual Interactive Tool for Morphological, PoS-Tags, Phonetic, and Semantic Annotation of Arabic Text Corpora

Mohamed Attia; Mohsen A. Rashwan; Mohamed Al-Badrashiny

This paper introduces an Arabic text annotation tool called Fassieh®. Via a sophisticated interactive GUI application, Fassieh® makes it easy to build structured large standard written Arabic corpora, then allows the production of fundamental linguistic analyses; i.e., language factorizations, at high coverage and accuracy rates over such corpora. Arabic morphological analysis, part-of-speech (PoS)-tagging, full phonetic transcription (diacritization), and lexical semantics analysis are the most significant Arabic language factorizations currently supported by Fassieh®. The high inherent ambiguity of these analyses is statistically resolved in Fassieh® which also affords a multitude of auxiliary features enabling a guided, normalized, and efficient proofreading of any part of the factorized corpus. The paper first reviews the highly inflective and derivative nature of Arabic language, our Arabic language factorization models, and the associated statistical disambiguation methodology. Afterwards, we present Fassieh® which is not only a text annotation tool, but is also an evaluation, demonstrative, and tutorial means of Arabic natural language processing (NLP).

Empirical Methods in Natural Language Processing | 2014

GWU-HASP: Hybrid Arabic Spelling and Punctuation Corrector

Mohammed Attia; Mohamed Al-Badrashiny; Mona T. Diab

In this paper, we describe our Hybrid Arabic Spelling and Punctuation Corrector (HASP). HASP was one of the systems participating in the QALB-2014 Shared Task on Arabic Error Correction. The system uses a CRF (Conditional Random Fields) classifier for correcting punctuation errors, an open-source dictionary (or word list) for detecting errors and generating and filtering candidates, an n-gram language model for selecting the best candidates, and a set of deterministic rules for text normalization (such as removing diacritics and kashida and converting Hindi numbers into Arabic numerals). We also experiment with word alignment for spelling correction at the character level and report some preliminary results.
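
The deterministic normalization rules mentioned above (removing diacritics and kashida, and converting Hindi digits to Arabic numerals) can be sketched as below. The exact character ranges are assumptions: the abstract does not list them, so this sketch uses the common Arabic diacritics U+064B to U+0652, the tatweel U+0640, and the Arabic-Indic digits U+0660 to U+0669.

```python
import re

# Assumed ranges; the paper's actual rule set may differ.
DIACRITICS = re.compile("[\u064B-\u0652]")  # fathatan through sukun
KASHIDA = "\u0640"                          # tatweel (elongation mark)
ARABIC_INDIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))
DIGIT_TABLE = str.maketrans(ARABIC_INDIC_DIGITS, "0123456789")

def normalize(text: str) -> str:
    """Strip diacritics and kashida, then map Arabic-Indic digits to 0-9."""
    text = DIACRITICS.sub("", text)
    text = text.replace(KASHIDA, "")
    return text.translate(DIGIT_TABLE)

# A diacritized, elongated word followed by Arabic-Indic digits 1, 2, 3.
sample = "\u0643\u0650\u062A\u064E\u0640\u0627\u0628 \u0661\u0662\u0663"
print(normalize(sample))
```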

Workshop on Computational Approaches to Code Switching | 2016

The George Washington University System for the Code-Switching Workshop Shared Task 2016

Mohamed Al-Badrashiny; Mona T. Diab

We describe our work in the EMNLP 2016 second code-switching shared task: a generic, language-independent framework for linguistic code switch point detection (LCSPD). The system uses character-level 5-gram and word-level unigram language models to train a conditional random fields (CRF) model for classifying input words into various languages. We participated in the Modern Standard Arabic (MSA)-dialectal Arabic (DA) and Spanish-English tracks, obtaining weighted average F-scores of 0.83 and 0.91 on MSA-DA and EN-SP, respectively.
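
As a rough illustration of how character-level 5-gram language models could supply per-word features to a classifier such as a CRF, the sketch below trains one tiny model per language and scores a word under each. The padding symbols, the add-one smoothing, the feature names, and the toy word lists are all assumptions made for this example. The CRF itself is not shown.

```python
import math
from collections import Counter

def char_ngrams(word: str, n: int = 5) -> list:
    """Character n-grams with start padding and an end marker."""
    padded = "^" * (n - 1) + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class CharNgramModel:
    """Character n-gram model with add-one smoothing (an assumption)."""

    def __init__(self, words, n: int = 5):
        self.n = n
        self.ngrams = Counter()
        self.contexts = Counter()
        self.vocab = set()
        for w in words:
            for g in char_ngrams(w, n):
                self.ngrams[g] += 1
                self.contexts[g[:-1]] += 1
                self.vocab.add(g[-1])

    def logprob(self, word: str) -> float:
        v = len(self.vocab) + 1  # one extra slot for unseen characters
        return sum(
            math.log((self.ngrams[g] + 1) / (self.contexts[g[:-1]] + v))
            for g in char_ngrams(word, self.n)
        )

def word_features(word: str, models: dict) -> dict:
    """One log-probability feature per language model."""
    return {f"char5_lp_{lang}": m.logprob(word) for lang, m in models.items()}

# Toy training lists, purely illustrative.
models = {
    "en": CharNgramModel(["the", "this", "that"]),
    "es": CharNgramModel(["el", "la", "los"]),
}
print(word_features("the", models))
```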

Meeting of the Association for Computational Linguistics | 2015

GWU-HASP-2015

Mohammed Attia; Mohamed Al-Badrashiny; Mona T. Diab

In this paper, we describe our system HASP-2015 (Hybrid Arabic Spelling and Punctuation Corrector), in which we introduce significant improvements over our previous version, HASP-2014, and with which we participated in the QALB-2015 Second Shared Task on Arabic Error Correction. Our system utilizes probabilistic information on errors and their possible corrections in the training data and combines that with an open-source reference dictionary (or word list) for detecting errors and generating and filtering candidates. We enhance our system further by allowing it to generate candidates for common semantic and grammatical errors. Finally, an n-gram language model is used for selecting the best candidates. We use a CRF (Conditional Random Fields) classifier for correcting punctuation errors in a two-pass process where the system first learns punctuation placement and then learns to identify punctuation types.