Kareem Darwish
Qatar Computing Research Institute
Publication
Featured research published by Kareem Darwish.
meeting of the association for computational linguistics | 2002
Kareem Darwish
The paper presents a rapid method of developing a shallow Arabic morphological analyzer. The analyzer will only be concerned with generating the possible roots of any given Arabic word. The analyzer is based on automatically derived rules and statistics. For evaluation, the analyzer is compared to a commercially available Arabic Morphological Analyzer.
international acm sigir conference on research and development in information retrieval | 2003
Kareem Darwish; Douglas W. Oard
Structured methods for query term replacement rely on separate estimates of term frequency and document frequency to compute a weight for each query term. This paper reviews prior work on structured query techniques and introduces three new variants that leverage estimates of replacement probabilities. Statistically significant improvements in retrieval effectiveness are demonstrated for cross-language retrieval and for retrieval based on optical character recognition when replacement probabilities are used to estimate both term frequency and document frequency.
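One way to read "replacement probabilities are used to estimate both term frequency and document frequency" is as a probability-weighted sum over the replacements of a query term. The sketch below illustrates that reading only; the function names, the example terms, and the weighted-sum form are assumptions, not the paper's exact formulation.

```python
from typing import Dict


def weighted_tf(replacements: Dict[str, float], tf_in_doc: Dict[str, int]) -> float:
    """Estimate a query term's frequency in one document as the
    probability-weighted sum of its replacements' frequencies there."""
    return sum(p * tf_in_doc.get(term, 0) for term, p in replacements.items())


def weighted_df(replacements: Dict[str, float], df_in_collection: Dict[str, int]) -> float:
    """Estimate a query term's document frequency as the
    probability-weighted sum of its replacements' document frequencies."""
    return sum(p * df_in_collection.get(term, 0) for term, p in replacements.items())


# Hypothetical replacements (e.g. translations) of one query term,
# with assumed replacement probabilities that sum to 1.
replacements = {"bank": 0.75, "shore": 0.25}

tf = weighted_tf(replacements, {"bank": 2, "shore": 4})
df = weighted_df(replacements, {"bank": 100, "shore": 20})
```

The two estimates can then be fed into any term-weighting function that expects a term frequency and a document frequency.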
conference on information and knowledge management | 2012
Wei Gao; Peng Li; Kareem Darwish
Social media streams such as Twitter are regarded as faster first-hand sources of information generated by massive numbers of users. The content diffused through this channel, although noisy, provides an important complement to, and sometimes even a substitute for, traditional news media reporting. In this paper, we propose a novel unsupervised approach based on topic modeling to summarize trending subjects by jointly discovering the representative and complementary information from news and tweets. Our method captures the content that enriches the subject matter by reinforcing the identification of complementary sentence-tweet pairs. To evaluate the complementarity of a pair, we leverage the topic modeling formalism by combining a two-dimensional topic-aspect model and a cross-collection approach from the multi-document summarization literature. The final summaries are generated by co-ranking the news sentences and tweets on both sides simultaneously. Experiments give promising results compared to state-of-the-art baselines.
conference on information and knowledge management | 2012
Kareem Darwish; Walid Magdy; Ahmed Mourad
The use of social media has profoundly affected social and political dynamics in the Arab world. In this paper, we explore Arabic microblog retrieval. We illustrate some of the challenges associated with Arabic microblog retrieval, which mainly stem from the use of different Arabic dialects that vary in lexical selection, morphology, and phonetics and lack orthographic and spelling conventions. We present some of the required processing for effective retrieval, such as improved letter normalization, elongated word handling, stopword removal, and stemming.
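Letter normalization and elongated word handling can be sketched as below. This is a minimal illustration using normalizations that are common in Arabic text processing; the specific rules, the three-repeat threshold, and the function name are assumptions, not the processing described in the paper. Code points are written as escapes so that the mapping is unambiguous.

```python
import re

# Arabic short-vowel marks (U+064B..U+0652) and tatweel, the elongation stroke (U+0640).
_DIACRITICS_AND_TATWEEL = re.compile("[\u064B-\u0652\u0640]")
# Alef with madda, hamza above, or hamza below.
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
# Any character repeated three or more times in a row.
_REPEATS = re.compile(r"(.)\1{2,}")


def normalize_arabic(text: str) -> str:
    """Apply a few common letter normalizations and collapse elongations."""
    text = _DIACRITICS_AND_TATWEEL.sub("", text)
    text = _ALEF_VARIANTS.sub("\u0627", text)   # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")     # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")     # ta marbuta -> ha
    # Runs of three or more are collapsed to one; doubled letters are kept,
    # since some words legitimately contain them.
    text = _REPEATS.sub(r"\1", text)
    return text
```

For example, a word written with its ya repeated four times for emphasis is reduced to the single-ya form, and a word stretched with tatweel loses the stretch characters.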
international acm sigir conference on research and development in information retrieval | 2002
Kareem Darwish; Douglas W. Oard
Since many Arabic documents are available only in print, automating retrieval from collections of scanned Arabic document images using Optical Character Recognition (OCR) is an interesting problem. Arabic combines rich morphology with a writing system that presents unique challenges to OCR systems. These factors must be considered when selecting terms for automatic indexing. In this paper, alternative choices of indexing terms are explored using both an existing electronic text collection and a newly developed collection built from images of actual printed Arabic documents. Character n-grams or lightly stemmed words were found to typically yield near-optimal retrieval effectiveness, and combining both types of terms resulted in robust performance across a broad range of conditions.
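The two kinds of indexing terms mentioned above, character n-grams and (lightly stemmed) words, can be illustrated with a short sketch. The function names are hypothetical, and the `stem` argument is a placeholder that defaults to leaving the word unchanged; no particular light stemmer or n-gram length from the paper is implied.

```python
from typing import Callable, Iterable, List


def char_ngrams(word: str, n: int) -> List[str]:
    """Return the overlapping character n-grams of a word.
    Words no longer than n are returned whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(
    tokens: Iterable[str],
    n: int = 3,
    stem: Callable[[str], str] = lambda w: w,  # placeholder for a light stemmer
) -> List[str]:
    """Combine both term types: each (stemmed) word plus its character n-grams."""
    terms: List[str] = []
    for token in tokens:
        terms.append(stem(token))
        terms.extend(char_ngrams(token, n))
    return terms
```

Because n-grams overlap, a single misrecognized character from OCR damages only the few n-grams that contain it, while the remaining n-grams of the word can still match.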
empirical methods in natural language processing | 2006
Walid Magdy; Kareem Darwish
This paper explores the use of a character segment based character correction model, language modeling, and shallow morphology for Arabic OCR error correction. Experimentation shows that character segment based correction is superior to single character correction and that language modeling boosts correction by improving the ranking of candidate corrections, while shallow morphology had a small adverse effect. Further, given a sufficiently large corpus to extract a dictionary and to train a language model, word based correction works well for a morphologically rich language such as Arabic.
ACM Transactions on Asian Language Information Processing | 2003
Daqing He; Douglas W. Oard; Jianqiang Wang; Jun Luo; Dina Demner-Fushman; Kareem Darwish; Philip Resnik; Sanjeev Khudanpur; Michael Nossal; Michael Subotin; Anton Leuski
Searching is inherently a user-centered process; people pose the questions for which machines seek answers, and ultimately people judge the degree to which retrieved documents meet their needs. Rapid development of interactive systems that use queries expressed in one language to search documents written in another poses five key challenges: (1) interaction design, (2) query formulation, (3) cross-language search, (4) construction of translated summaries, and (5) machine translation. This article describes the design of MIRACLE, an easily extensible system based on English queries that has previously been used to search French, German, and Spanish documents, and explains how the capabilities of MIRACLE were rapidly extended to accommodate Cebuano and Hindi. Evaluation results for the cross-language search component are presented for both languages, along with results from a brief full-system interactive experiment with Hindi. The article concludes with some observations on directions for further research on interactive cross-language information retrieval.
empirical methods in natural language processing | 2014
Hamdy Mubarak; Kareem Darwish
This paper describes the collection and classification of a multi-dialectal corpus of Arabic based on the geographical information of tweets. We mapped information of user locations to one of the Arab countries, and extracted tweets that have dialectal word(s). Manual evaluation of the extracted corpus shows that the accuracy of assignment of tweets to some countries (like Saudi Arabia and Egypt) is above 93%, while the accuracy for other countries, such as Algeria and Syria, is below 70%.
empirical methods in natural language processing | 2014
Kareem Darwish
Arabizi is Arabic text that is written using Latin characters. Arabizi is used to present both Modern Standard Arabic (MSA) and Arabic dialects. It is commonly used in informal settings such as social networking sites and is often mixed with English. In this paper we address two problems: identifying Arabizi in text and converting it to Arabic characters. We used word- and sequence-level features to identify Arabizi that is mixed with English. We achieved an identification accuracy of 98.5%. As for conversion, we used transliteration mining with language modeling to generate equivalent Arabic text. We achieved 88.7% conversion accuracy, with roughly a third of errors being spelling and morphological variants of the forms in the ground truth.
meeting of the association for computational linguistics | 2007
Walid Magdy; Kareem Darwish; Ossama Emam; Hany Hassan
This paper presents a machine learning approach based on an SVM classifier coupled with preprocessing rules for cross-document named entity normalization. The classifier uses lexical, orthographic, phonetic, and morphological features. The process involves disambiguating different entities with shared name mentions and normalizing identical entities with different name mentions. In evaluating the quality of the clusters, the reported approach achieves a cluster F-measure of 0.93. The approach is significantly better than the two baseline approaches in which none of the entities are normalized or entities with exact name mentions are normalized. The two baseline approaches achieve cluster F-measures of 0.62 and 0.74 respectively. The classifier properly normalizes the vast majority of entities that are misnormalized by the baseline system.