
Publication


Featured research published by Hany Hassan.


Meeting of the Association for Computational Linguistics | 2003

Language Model Based Arabic Word Segmentation

Young-Suk Lee; Kishore Papineni; Salim Roukos; Ossama Emam; Hany Hassan

We approximate Arabic's rich morphology with a model in which a word consists of a sequence of morphemes in the pattern prefix*-stem-suffix* (* denotes zero or more occurrences of a morpheme). Our method is seeded by a small manually segmented Arabic corpus and uses it to bootstrap an unsupervised algorithm that builds the Arabic word segmenter from a large unsegmented Arabic corpus. The algorithm uses a trigram language model to determine the most probable morpheme sequence for a given input. The language model is initially estimated from a small manually segmented corpus of about 110,000 words. To improve the segmentation accuracy, we use an unsupervised algorithm for automatically acquiring new stems from a 155 million word unsegmented corpus, and re-estimate the model parameters with the expanded vocabulary and training corpus. The resulting Arabic word segmentation system achieves around 97% exact match accuracy on a test corpus containing 28,449 word tokens. We believe this is state-of-the-art performance, and the algorithm can be used for many highly inflected languages provided that one can create a small manually segmented corpus of the language of interest.
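
To make the segmentation procedure concrete, the sketch below scores candidate prefix-stem-suffix splits of a word with a trigram morpheme language model and keeps the most probable one. The morpheme inventories, the toy language model, and the single-prefix/single-suffix restriction are illustrative assumptions, not the paper's actual lexicon or model.

```python
# Hedged sketch: pick the most probable prefix-stem-suffix split under a trigram
# morpheme LM. Inventories and probabilities are toy stand-ins for illustration.
import math
from itertools import product

PREFIXES = ["", "Al", "w", "wAl"]   # hypothetical prefixes (article "Al", conjunction "w")
SUFFIXES = ["", "p", "At", "hm"]    # hypothetical suffixes

def candidate_segmentations(word):
    """Yield morpheme sequences matching prefix*-stem-suffix*.
    For brevity, at most one prefix and one suffix are stripped here."""
    for pre, suf in product(PREFIXES, SUFFIXES):
        if word.startswith(pre) and word.endswith(suf):
            stem = word[len(pre):len(word) - len(suf)] if suf else word[len(pre):]
            if stem:
                yield [m for m in (pre, stem, suf) if m]

def trigram_logprob(morphemes, lm):
    """Score a morpheme sequence with a trigram model; unseen trigrams get a small floor."""
    seq = ["<s>", "<s>"] + morphemes + ["</s>"]
    return sum(math.log(lm.get((seq[i - 2], seq[i - 1], seq[i]), 1e-6))
               for i in range(2, len(seq)))

def segment(word, lm):
    """Return the most probable segmentation of a single word."""
    return max(candidate_segmentations(word), key=lambda m: trigram_logprob(m, lm))

# Toy model; in the paper the LM is estimated from ~110,000 manually segmented words.
toy_lm = {("<s>", "<s>", "Al"): 0.2, ("<s>", "Al", "ktAb"): 0.1, ("Al", "ktAb", "</s>"): 0.3}
print(segment("AlktAb", toy_lm))    # -> ['Al', 'ktAb']
```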


Meeting of the Association for Computational Linguistics | 2007

Arabic Cross-Document Person Name Normalization

Walid Magdy; Kareem Darwish; Ossama Emam; Hany Hassan

This paper presents a machine learning approach based on an SVM classifier coupled with preprocessing rules for cross-document named entity normalization. The classifier uses lexical, orthographic, phonetic, and morphological features. The process involves disambiguating different entities with shared name mentions and normalizing identical entities with different name mentions. In evaluating the quality of the clusters, the reported approach achieves a cluster F-measure of 0.93. The approach is significantly better than two baseline approaches, in which either no entities are normalized or only entities with exact name mentions are normalized. The two baseline approaches achieve cluster F-measures of 0.62 and 0.74 respectively. The classifier properly normalizes the vast majority of entities that are misnormalized by the baseline system.
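
As a rough illustration of the pairwise-classification idea, the sketch below trains an SVM on toy mention-pair features and links mentions the classifier judges co-referent into clusters with union-find. The feature set, toy data, and clustering step are simplifying assumptions standing in for the paper's lexical, orthographic, phonetic, and morphological features and its evaluation setup; it assumes scikit-learn is available.

```python
# Minimal sketch: pairwise SVM classification of mention pairs, then clustering.
from sklearn.svm import SVC

def pair_features(m1, m2):
    """Toy lexical/orthographic/contextual features for a pair of name mentions."""
    exact = float(m1["name"] == m2["name"])
    shared = len(set(m1["name"].split()) & set(m2["name"].split()))
    ctx = len(set(m1["context"]) & set(m2["context"]))
    return [exact, shared, ctx]

# Toy mentions: 0 and 1 are the same person with different name mentions;
# 0 and 2 share a name mention but refer to different entities.
mentions = [
    {"name": "Hany Hassan", "context": {"IBM", "MT"}},
    {"name": "H. Hassan",   "context": {"IBM", "MT"}},
    {"name": "Hany Hassan", "context": {"football", "club"}},
]
train_X = [pair_features(mentions[0], mentions[1]), pair_features(mentions[0], mentions[2])]
train_y = [1, 0]                      # 1 = same entity, 0 = different entities

clf = SVC(kernel="linear").fit(train_X, train_y)

# Cluster mentions: link every pair the classifier calls "same entity" (union-find).
parent = list(range(len(mentions)))
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

for i in range(len(mentions)):
    for j in range(i + 1, len(mentions)):
        if clf.predict([pair_features(mentions[i], mentions[j])])[0] == 1:
            parent[find(i)] = find(j)

clusters = {}
for i in range(len(mentions)):
    clusters.setdefault(find(i), []).append(mentions[i]["name"])
print(list(clusters.values()))
```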


Meeting of the Association for Computational Linguistics | 2005

An Integrated Approach for Arabic-English Named Entity Translation

Hany Hassan; Jeffrey S. Sorensen

Translation of named entities (NEs), such as person names, organization names, and location names, is crucial for cross-lingual information retrieval, machine translation, and many other natural language processing applications. New named entities are introduced in newswire on a daily basis, and this greatly complicates the translation task. Also, while some names can be translated, others must be transliterated, and still others are mixed. In this paper we introduce an integrated approach for named entity translation that deploys phrase-based translation, word-based translation, and transliteration modules in a single framework. While developed for Arabic, the approach introduced here is unified and can be applied to NE translation for any language pair.
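
One way such an integrated system can be structured is as a fallback chain, sketched below as an assumption about the control flow rather than the paper's actual models: try a whole-phrase translation first, then word-by-word translation, then character transliteration for out-of-vocabulary name parts. The tables and character map are toy stand-ins.

```python
# Hedged sketch of a phrase -> word -> transliteration fallback for NE translation.
PHRASE_TABLE = {"الامم المتحدة": "United Nations"}            # whole-NE matches
WORD_TABLE   = {"جامعة": "University", "القاهرة": "Cairo"}     # word-level matches
CHAR_MAP     = {"ه": "h", "ا": "a", "ن": "n", "ي": "y"}        # toy transliteration map

def transliterate(word):
    """Character-level fallback for names with no known translation."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in word).capitalize()

def translate_ne(ne):
    if ne in PHRASE_TABLE:                                     # 1) whole-phrase translation
        return PHRASE_TABLE[ne]
    # 2) word translation, 3) transliteration; note this toy version does no reordering.
    return " ".join(WORD_TABLE.get(w, transliterate(w)) for w in ne.split())

print(translate_ne("الامم المتحدة"))   # -> United Nations (phrase match)
print(translate_ne("جامعة القاهرة"))    # -> University Cairo (word-by-word)
print(translate_ne("هاني"))             # -> Hany (transliterated fallback)
```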


Meeting of the Association for Computational Linguistics | 2005

Examining the Effect of Improved Context Sensitive Morphology on Arabic Information Retrieval

Kareem Darwish; Hany Hassan; Ossama Emam

This paper explores the effect of improved morphological analysis, particularly context-sensitive morphology, on monolingual Arabic Information Retrieval (IR). It also compares the effect of context-sensitive morphology to non-context-sensitive morphology. The results show that better coverage and improved correctness have a dramatic effect on IR effectiveness and that context-sensitive morphology further improves retrieval effectiveness, but the improvement is not statistically significant. Furthermore, the improvement obtained by the use of context-sensitive morphology over the use of light stemming was also not statistically significant.


North American Chapter of the Association for Computational Linguistics | 2015

Learning Translation Models from Monolingual Continuous Representations

Kai Zhao; Hany Hassan; Michael Auli

Translation models often fail to generate good translations for infrequent words or phrases. Previous work attacked this problem by inducing new translation rules from monolingual data with a semi-supervised algorithm. However, this approach does not scale very well, since it is computationally expensive to generate new translation rules for only a few thousand sentences. We propose a much faster and simpler method that directly hallucinates translation rules for infrequent phrases based on phrases with similar continuous representations for which a translation is known. To speed up the retrieval of similar phrases, we investigate approximate nearest neighbor search with redundant bit vectors, which we find to be three times faster and significantly more accurate than locality-sensitive hashing. Our approach of learning new translation rules improves a phrase-based baseline by up to 1.6 BLEU on Arabic-English translation, is three orders of magnitude faster than existing semi-supervised methods, and is 0.5 BLEU more accurate.
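
The retrieval idea can be sketched as follows: borrow translations from the nearest known phrase in embedding space. The paper speeds this up with approximate search using redundant bit vectors; the toy version below just does exact cosine similarity, and the embeddings and phrase table are random or hand-made stand-ins.

```python
# Brute-force sketch of hallucinating rules for a rare phrase from its nearest
# neighbors in continuous-representation space (toy data, exact search only).
import numpy as np

rng = np.random.default_rng(0)
known_phrases = ["great victory", "big win", "small loss"]
phrase_vecs = {p: rng.normal(size=50) for p in known_phrases}      # stand-in embeddings
phrase_table = {"great victory": ["نصر عظيم"],
                "big win": ["فوز كبير"],
                "small loss": ["خسارة صغيرة"]}

def hallucinate_rules(rare_vec, k=1):
    """Return candidate translations taken from the k nearest known phrases."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(known_phrases, key=lambda p: cosine(rare_vec, phrase_vecs[p]), reverse=True)
    return [t for p in ranked[:k] for t in phrase_table[p]]

# In practice the rare phrase's vector comes from monolingual continuous representations.
print(hallucinate_rules(rng.normal(size=50)))
```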


IEEE Transactions on Audio, Speech, and Language Processing | 2008

Syntactically Lexicalized Phrase-Based SMT

Hany Hassan; Khalil Sima'an; Andy Way

Until quite recently, extending phrase-based statistical machine translation (PBSMT) with syntactic knowledge caused system performance to deteriorate. The most recent successful enrichments of PBSMT with hierarchical structure either employ nonlinguistically motivated syntax for capturing hierarchical reordering phenomena, or extend the phrase translation table with redundantly ambiguous syntactic structures over phrase pairs. In this paper, we present an extended, harmonized account of our previous work which showed that incorporating linguistically motivated lexical syntactic descriptions, called supertags, can yield significantly better PBSMT systems at insignificant extra computational cost. We describe a novel PBSMT model that integrates supertags into the target language model and the target side of the translation model. Two kinds of supertags are employed: those from lexicalized tree-adjoining grammar and combinatory categorial grammar. Despite the differences between the two sets of supertags, they give similar improvements. In addition to integrating the Markov supertagging approach in PBSMT, we explore the utility of a new surface grammaticality measure based on combinatory operators. We perform various experiments on the Arabic-to-English NIST 2005 test set addressing the issues of sparseness, scalability, and the utility of system subcomponents. We show that even when the parallel training data grows very large, the supertagged system retains a relatively stable absolute performance advantage over the unadorned PBSMT system. Arguably, this hints at a performance gap that cannot be bridged by acquiring more phrase pairs. Our best result shows a relative improvement of 6.1% over a state-of-the-art PBSMT model, which compares favorably with the leading systems on the NIST 2005 task. We also demonstrate that the advantages of a supertag-based system carry over to German-English, where improvements of up to 8.9% relative to the baseline system are observed.
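
To illustrate where the supertag information enters, the sketch below adds a supertag n-gram score as one more feature in a log-linear hypothesis score, next to the usual translation-model and word language-model scores. The tags, weights, and scoring function are toy assumptions, not the paper's trained models or decoder.

```python
# Hedged sketch: a supertag n-gram LM as an extra log-linear feature.
import math

def ngram_logprob(sequence, model, order=3):
    """Trigram-style score with a small floor probability for unseen n-grams."""
    seq = ["<s>"] * (order - 1) + sequence + ["</s>"]
    return sum(math.log(model.get(tuple(seq[i - order + 1:i + 1]), 1e-6))
               for i in range(order - 1, len(seq)))

def score_hypothesis(words, supertags, weights, word_lm, supertag_lm, tm_logprob):
    """Log-linear combination: translation model + word LM + supertag LM."""
    return (weights["tm"] * tm_logprob
            + weights["lm"] * ngram_logprob(words, word_lm)
            + weights["stlm"] * ngram_logprob(supertags, supertag_lm))

# Toy CCG-style supertags for a short hypothesis; all model tables are left empty here.
words = ["the", "treaty", "was", "signed"]
tags  = ["NP/N", "N", "(S\\NP)/(S\\NP)", "S\\NP"]
weights = {"tm": 1.0, "lm": 0.5, "stlm": 0.3}
print(score_hypothesis(words, tags, weights, word_lm={}, supertag_lm={}, tm_logprob=-4.2))
```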


Meeting of the Association for Computational Linguistics | 2014

Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data

Avneesh Saluja; Hany Hassan; Kristina Toutanova; Chris Quirk

Statistical phrase-based translation learns translation rules from bilingual corpora, and has traditionally only used monolingual evidence to construct features that rescore existing translation candidates. In this work, we present a semi-supervised graph-based approach for generating new translation rules that leverages bilingual and monolingual data. The proposed technique first constructs phrase graphs using both source and target language monolingual corpora. Next, graph propagation identifies translations of phrases that were not observed in the bilingual corpus, assuming that similar phrases have similar translations. We report results on a large Arabic-English system and a medium-sized Urdu-English system. Our proposed approach significantly improves the performance of competitive phrase-based systems, leading to consistent improvements between 1 and 4 BLEU points on standard evaluation sets.
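
The core propagation step can be sketched as follows: an unseen source phrase receives a translation distribution from its labeled neighbors in the source phrase graph, weighted by edge similarity. The graph, similarities, and distributions below are toy values; the actual systems build large graphs from monolingual corpora and iterate the propagation.

```python
# One-step sketch of graph propagation of translation distributions (toy data).
def propagate(node, graph, translation_dist):
    """Return a normalized translation distribution for `node` from its labeled neighbors."""
    scores = {}
    for neighbor, sim in graph.get(node, []):
        for target, prob in translation_dist.get(neighbor, {}).items():
            scores[target] = scores.get(target, 0.0) + sim * prob
    total = sum(scores.values()) or 1.0
    return {t: s / total for t, s in scores.items()}

# Toy source phrase graph: edges carry similarity weights to phrases seen in bitext.
graph = {"huge triumph": [("great victory", 0.8), ("big win", 0.6)]}
translation_dist = {
    "great victory": {"نصر عظيم": 1.0},
    "big win": {"فوز كبير": 1.0},
}
print(propagate("huge triumph", graph, translation_dist))
```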


Spoken Language Technology Workshop | 2008

A syntactic language model based on incremental CCG parsing

Hany Hassan; Khalil Sima'an; Andy Way

Syntactically-enriched language models (parsers) constitute a promising component in applications such as machine translation and speech recognition. To maintain a useful level of accuracy, existing parsers are non-incremental and must span a combinatorially growing space of possible structures as every input word is processed. This prohibits their incorporation into standard linear-time decoders. In this paper, we present an incremental, linear-time dependency parser based on Combinatory Categorial Grammar (CCG) and classification techniques. We devise a deterministic transform of CCGbank canonical derivations into incremental ones, and train our parser on this data. We discover that a cascaded, incremental version provides an appealing balance between efficiency and accuracy.
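
A toy rendering of the incremental idea is shown below: words are processed strictly left to right, a stand-in for the classifier (here, a lookup table) assigns each word a supertag, and the current partial category is combined with it by forward or backward application. The grammar fragment and the single-category state are simplifying assumptions; the actual parser is classifier-driven and handles a much richer set of CCG combinators.

```python
# Hedged sketch of incremental CCG-style parsing with table-based "classification".
SUPERTAGS = {"the": "NP/N", "dog": "N", "barked": "S\\NP"}

def combine(left, right):
    """Apply forward (X/Y Y -> X) or backward (Y X\\Y -> X) application, else None."""
    if left.endswith("/" + right):
        return left[:-(len(right) + 1)]
    if right.endswith("\\" + left):
        return right[:-(len(left) + 1)]
    return None

def parse_incrementally(words):
    state = None
    for w in words:
        tag = SUPERTAGS[w]                                   # step 1: predict a supertag
        # Step 2: combine with the partial category, or start a new constituent if impossible.
        state = tag if state is None else (combine(state, tag) or tag)
        print(f"after '{w}': {state}")
    return state

parse_incrementally(["the", "dog", "barked"])   # NP/N -> NP -> S
```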


Meeting of the Association for Computational Linguistics | 2007

BioNoculars: Extracting Protein-Protein Interactions from Biomedical Text

Amgad Madkour; Kareem Darwish; Hany Hassan; Ahmed Hassan; Ossama Emam

The vast number of published medical documents is considered a vital source for relationship discovery. This paper presents a statistical unsupervised system, called BioNoculars, for extracting protein-protein interactions from biomedical text. BioNoculars uses graph-based mutual reinforcement to exploit redundancy in the data and construct extraction patterns in a domain-independent fashion. The system was tested using MEDLINE abstracts for which the protein-protein interactions they contain are listed in the Database of Interacting Proteins and Protein-Protein Interactions (DIPPPI). The system reports an F-measure of 0.55 on test MEDLINE abstracts.
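
The mutual-reinforcement step can be sketched as a HITS-style iteration over a bipartite graph of extraction patterns and protein-pair tuples: a pattern scores well if it extracts good tuples, and a tuple scores well if it is matched by good patterns. The patterns, tuples, and edges below are toy stand-ins, not BioNoculars' actual data or scoring details.

```python
# HITS-style mutual reinforcement between extraction patterns and extracted tuples.
def mutual_reinforce(edges, iterations=10):
    """edges: list of (pattern, protein_pair) co-occurrence links."""
    patterns = {p for p, _ in edges}
    tuples_ = {t for _, t in edges}
    p_score = {p: 1.0 for p in patterns}
    t_score = {t: 1.0 for t in tuples_}
    for _ in range(iterations):
        # Tuples inherit scores from the patterns that match them, and vice versa.
        t_score = {t: sum(p_score[p] for p, t2 in edges if t2 == t) for t in tuples_}
        p_score = {p: sum(t_score[t] for p2, t in edges if p2 == p) for p in patterns}
        for d in (t_score, p_score):                 # normalize to keep scores bounded
            norm = sum(d.values()) or 1.0
            for k in d:
                d[k] /= norm
    return p_score, t_score

edges = [("X interacts with Y", ("RAD51", "BRCA2")),
         ("X interacts with Y", ("TP53", "MDM2")),
         ("X near Y", ("TP53", "MDM2"))]
print(mutual_reinforce(edges))
```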


Workshop on Graph-Based Methods for Natural Language Processing | 2006

Graph Based Semi-Supervised Approach for Information Extraction

Hany Hassan; Ahmed Hassan; Sara Noeman

Classification techniques deploy supervised labeled instances to train classifiers for various classification problems. However, labeled instances are limited, expensive, and time-consuming to obtain because they require experienced human annotators, while large amounts of unlabeled data are usually easy to obtain. Semi-supervised learning addresses the problem of utilizing unlabeled data along with supervised labeled data to build better classifiers. In this paper we introduce a semi-supervised approach based on mutual reinforcement in graphs to obtain more labeled data and enhance classifier accuracy. The approach has been used to supplement a maximum entropy model for semi-supervised training of the ACE Relation Detection and Characterization (RDC) task. ACE RDC is considered a hard task in information extraction due to the lack of large amounts of training data and inconsistencies in the available data. The proposed approach provides a 10% relative improvement over the state-of-the-art supervised baseline system.
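
The augmentation step can be sketched as follows: instances confidently labeled by the graph-based step are appended to the seed training data before the maximum entropy model is retrained. In the sketch, scikit-learn's LogisticRegression stands in for the maxent classifier, and the feature vectors, relation labels, and induced labels are toy assumptions.

```python
# Minimal sketch: enlarge the labeled set with graph-induced labels, then retrain.
from sklearn.linear_model import LogisticRegression

seed_X, seed_y = [[1.0, 0.0], [0.0, 1.0]], ["EMP-ORG", "NONE"]   # small hand-labeled seed set

# Hypothetical output of the graph-based step: confidently auto-labeled instances.
induced = [([0.9, 0.2], "EMP-ORG"), ([0.1, 0.8], "NONE")]

train_X = seed_X + [x for x, _ in induced]
train_y = seed_y + [y for _, y in induced]

maxent = LogisticRegression().fit(train_X, train_y)              # retrain on the enlarged set
print(maxent.predict([[0.7, 0.3]]))                              # classify a new relation mention
```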

Collaboration


Dive into Hany Hassan's collaborations.

Top Co-Authors


Andy Way

Dublin City University


Kareem Darwish

Qatar Computing Research Institute
