Publication


Featured research published by Haithem Afli.


International Conference on NLP | 2012

Parallel Texts Extraction from Multimodal Comparable Corpora

Haithem Afli; Loïc Barrault; Holger Schwenk

Statistical machine translation (SMT) systems depend on the availability of domain-specific bilingual parallel text. However, parallel corpora are a limited resource and are often not available for some domains or language pairs. We analyze the feasibility of extracting parallel sentences from multimodal comparable corpora. This work extends the use of comparable corpora by using audio sources instead of texts on the source side. The audio is transcribed by an automatic speech recognition system and translated with a baseline SMT system. We then use information retrieval in a large text corpus in the target language to extract parallel sentences. We have performed a series of experiments on data from the IWSLT’11 speech translation task that shows the feasibility of our approach.


The Prague Bulletin of Mathematical Linguistics | 2016

FaDA: Fast Document Aligner using Word Embedding

Pintu Lohar; Debasis Ganguly; Haithem Afli; Andy Way; Gareth J. F. Jones

FaDA is a free/open-source tool for aligning multilingual documents. It employs a novel cross-lingual information retrieval (CLIR)-based document-alignment algorithm involving the distances between embedded word vectors in combination with the word overlap between the source-language and the target-language documents. In this approach, we initially construct a pseudo-query from a source-language document. We then represent the target-language documents and the pseudo-query as word vectors to find the average similarity measure between them. This word vector-based similarity measure is then combined with the term overlap-based similarity. Our initial experiments show that a standard statistical machine translation (SMT)-based approach is outperformed by our CLIR-based approach in finding the correct alignment pairs. In addition, subsequent experiments with the word vector-based method show further improvements in the performance of the system.
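
The scoring idea in this abstract (an average word-vector similarity combined with a term-overlap similarity) can be illustrated with a minimal sketch. This is not FaDA's actual code: the linear interpolation weight `alpha`, the all-pairs averaging, the function names, and the assumption that pseudo-query and document terms already share one embedding space are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def average_vector_similarity(query_terms, doc_terms, embeddings):
    """Mean cosine over all (query term, document term) pairs that have vectors.

    Assumption: `embeddings` maps terms of both languages into one shared space.
    """
    sims = [
        cosine(embeddings[q], embeddings[d])
        for q in query_terms
        for d in doc_terms
        if q in embeddings and d in embeddings
    ]
    return sum(sims) / len(sims) if sims else 0.0


def term_overlap(query_terms, doc_terms):
    """Fraction of distinct pseudo-query terms that also occur in the document."""
    q, d = set(query_terms), set(doc_terms)
    return len(q & d) / len(q) if q else 0.0


def combined_score(query_terms, doc_terms, embeddings, alpha=0.5):
    """Hypothetical linear mix of the vector-based and overlap-based similarities."""
    vec = average_vector_similarity(query_terms, doc_terms, embeddings)
    return alpha * vec + (1.0 - alpha) * term_overlap(query_terms, doc_terms)


def best_alignment(query_terms, candidates, embeddings, alpha=0.5):
    """Return the name of the candidate document with the highest combined score."""
    return max(
        candidates,
        key=lambda name: combined_score(query_terms, candidates[name], embeddings, alpha),
    )
```

Here `candidates` is a dictionary from a target-document name to its list of terms, and `query_terms` is the pseudo-query built from one source-language document. How the pseudo-query is constructed and how the two similarities are weighted in the real tool is not stated in the abstract.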


Natural Language Engineering | 2016

Building and Using Multimodal Comparable Corpora for Machine Translation

Haithem Afli; Loïc Barrault; Holger Schwenk

In recent decades, statistical approaches have significantly advanced the development of machine translation systems. However, the applicability of these methods directly depends on the availability of very large quantities of parallel data. Recent work has demonstrated that a comparable corpus can compensate for the shortage of parallel corpora. In this paper, we propose an alternative to comparable corpora containing text documents as resources for extracting parallel data: a multimodal comparable corpus with audio documents in the source language and text documents in the target language, built from the Euronews and TED web sites. The audio is transcribed by an automatic speech recognition system and translated with a baseline statistical machine translation system. We then use information retrieval in a large text corpus in the target language to extract parallel sentences/phrases. We evaluate the quality of the extracted data on an English-to-French translation task and show significant improvements over a state-of-the-art baseline.


Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers | 2016

The ADAPT Bilingual Document Alignment system at WMT16

Pintu Lohar; Haithem Afli; Chao-Hong Liu; Andy Way

Comparable corpora have been shown to be useful in several multilingual natural language processing (NLP) tasks. Many previous papers have focused on how to improve the extraction of parallel data from this kind of corpus on different levels. In this paper, we are interested in improving the quality of bilingual comparable corpora, as measured by an increased document alignment score. We describe our participation in the bilingual document alignment shared task of the First Conference on Machine Translation (WMT16). We propose a technique based on source-to-target sentence- and word-based scores and the fraction of matched source named entities. We performed our experiments on English-to-French document alignments for this bilingual task.


The Prague Bulletin of Mathematical Linguistics | 2017

Maintaining Sentiment Polarity in Translation of User-Generated Content

Pintu Lohar; Haithem Afli; Andy Way

The advent of social media has shaken the very foundations of how we share information, with Twitter, Facebook, and LinkedIn among many well-known social networking platforms that facilitate information generation and distribution. However, the maximum 140-character restriction in Twitter encourages users to (sometimes deliberately) write somewhat informally in most cases. As a result, machine translation (MT) of user-generated content (UGC) becomes much more difficult for such noisy texts. In addition to translation quality being affected, this phenomenon may also negatively impact sentiment preservation in the translation process. That is, a sentence with positive sentiment in the source language may be translated into a sentence with negative or neutral sentiment in the target language. In this paper, we analyse both sentiment preservation and MT quality per se in the context of UGC, focusing especially on whether sentiment classification helps improve sentiment preservation in MT of UGC. We build four different experimental setups for tweet translation: (i) using a single MT model trained on the whole Twitter parallel corpus, (ii) using multiple MT models based on sentiment classification, (iii) using MT models including additional out-of-domain data, and (iv) adding MT models based on the phrase-table fill-up method to accompany the sentiment translation models, with the aim of improving MT quality while maintaining sentiment polarity. Our empirical evaluation shows that despite a slight deterioration in MT quality, our system significantly outperforms the baseline MT system (without sentiment classification) in terms of sentiment preservation. We also demonstrate that using an MT engine that conveys a sentiment different from that of the UGC can even worsen both the translation quality and sentiment preservation.
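
Setup (ii) and the notion of sentiment preservation can be made concrete with a small sketch. Everything here is an assumption made for illustration: the abstract does not say how the classifier, the per-sentiment MT models, or the preservation measure are implemented, so `classify`, `models`, `fallback`, and the simple label-agreement rate below are hypothetical stand-ins.

```python
from typing import Callable, Dict, Sequence


def translate_with_sentiment_model(
    text: str,
    classify: Callable[[str], str],
    models: Dict[str, Callable[[str], str]],
    fallback: Callable[[str], str],
) -> str:
    """Route a tweet to the MT model that matches its predicted sentiment.

    `classify` returns a label such as "pos", "neg" or "neu" (hypothetical).
    If no model exists for that label, the single baseline model is used.
    """
    label = classify(text)
    translate = models.get(label, fallback)
    return translate(text)


def sentiment_preservation_rate(
    source_labels: Sequence[str], target_labels: Sequence[str]
) -> float:
    """Fraction of sentence pairs whose source and translation share a label.

    Assumption: preservation is measured as plain label agreement.
    """
    if len(source_labels) != len(target_labels):
        raise ValueError("source and target label lists must have equal length")
    if not source_labels:
        return 0.0
    matches = sum(s == t for s, t in zip(source_labels, target_labels))
    return matches / len(source_labels)
```

Under this reading, comparing the rate obtained with sentiment-routed models against the rate obtained with the single baseline model would indicate whether routing helps preserve polarity.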


Proceedings of the Third Arabic Natural Language Processing Workshop | 2017

Identifying Effective Translations for Cross-lingual Arabic-to-English User-generated Speech Search

Ahmad Khwileh; Haithem Afli; Gareth J. F. Jones; Andy Way

Cross Language Information Retrieval (CLIR) systems are a valuable tool to enable speakers of one language to search for content of interest expressed in a different language. A group for whom this is of particular interest is bilingual Arabic speakers who wish to search for English language content using information needs expressed in Arabic queries. A key challenge in CLIR is crossing the language barrier between the query and the documents. The most common approach to bridging this gap is automated query translation, which can be unreliable for vague or short queries. In this work, we examine the potential for improving CLIR effectiveness by predicting the translation effectiveness using Query Performance Prediction (QPP) techniques. We propose a novel QPP method to estimate the quality of translation for an Arabic-English Cross-lingual User-generated Speech Search (CLUGS) task. We present an empirical evaluation that demonstrates the quality of our method on alternative translation outputs extracted from an Arabic-to-English Machine Translation system developed for this task. Finally, we show how this framework can be integrated in CLUGS to find relevant translations for improved retrieval performance.


Workshop on Statistical Machine Translation | 2011

LIUM’s SMT Machine Translation Systems for WMT 2012

Holger Schwenk; Patrik Lambert; Loïc Barrault; Christophe Servan; Sadaf Abdul-Rauf; Haithem Afli; Kashif Shah


International Joint Conference on Natural Language Processing | 2013

Multimodal Comparable Corpora as Resources for Extracting Parallel Data: Parallel Phrases Extraction

Haithem Afli; Loïc Barrault; Holger Schwenk


Language Resources and Evaluation | 2016

Using SMT for OCR Error Correction of Historical Texts

Haithem Afli; Zhengwei Qiu; Andy Way; Páraic Sheridan


Int. J. Comput. Linguistics Appl. | 2016

OCR Error Correction Using Statistical Machine Translation

Haithem Afli; Loïc Barrault; Holger Schwenk

Collaboration


Dive into Haithem Afli's collaborations.

Top Co-Authors

Andy Way

Dublin City University

Holger Schwenk

Centre national de la recherche scientifique

Jiang Zhou

Dublin City University

Jinhua Du

Dublin City University
