Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Antonio Toral is active.

Publication


Featured research published by Antonio Toral.


North American Chapter of the Association for Computational Linguistics | 2009

SemEval-2010 Task 17: All-words Word Sense Disambiguation on a Specific Domain

Eneko Agirre; Oier Lopez de Lacalle; Christiane Fellbaum; Andrea Marchetti; Antonio Toral; Piek Vossen

Domain portability and adaptation of NLP components and Word Sense Disambiguation (WSD) systems present new challenges. The difficulties supervised systems face when adapting might change the way we assess the strengths and weaknesses of supervised and knowledge-based WSD systems. Unfortunately, all existing evaluation datasets for specific domains are lexical-sample corpora. This task presented all-words datasets on the environment domain for WSD in four languages (Chinese, Dutch, English, Italian). 11 teams participated, with supervised and knowledge-based systems, mainly in the English dataset. The results show that in all languages the participants were able to beat the most frequent sense heuristic as estimated from general corpora. The most successful approaches used some sort of supervision in the form of hand-tagged examples from the domain.
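The most-frequent-sense (MFS) baseline the participants had to beat can be sketched in a few lines; the sense-tagged toy corpus and sense labels below are hypothetical, not data from the task:

```python
from collections import Counter, defaultdict

def train_mfs(tagged_corpus):
    """Estimate the most frequent sense per lemma from a sense-tagged corpus."""
    counts = defaultdict(Counter)
    for lemma, sense in tagged_corpus:
        counts[lemma][sense] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}

def mfs_tag(tokens, mfs):
    """Tag every token with its most frequent sense (None if unseen)."""
    return [(t, mfs.get(t)) for t in tokens]

# Hypothetical sense-tagged general-domain data
corpus = [("bank", "bank%finance"), ("bank", "bank%finance"), ("bank", "bank%river")]
mfs = train_mfs(corpus)
```

Domain-specific test data is exactly where this heuristic degrades (a domain may favour the globally rarer sense), which is why beating it there is a meaningful result.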


Information Sciences | 2009

Exploiting Wikipedia and EuroWordNet to solve Cross-Lingual Question Answering

Sergio Ferrández; Antonio Toral; Óscar Ferrández; Antonio Ferrández; Rafael Muñoz

This paper describes a new advance in solving Cross-Lingual Question Answering (CL-QA) tasks. It is built on three main pillars: (i) the use of several multilingual knowledge resources to reference words between languages (the Inter Lingual Index (ILI) module of EuroWordNet and the multilingual knowledge encoded in Wikipedia); (ii) the consideration of more than only one translation per word in order to search candidate answers; and (iii) the analysis of the question in the original language without any translation process. This novel approach overcomes the errors caused by the common use of Machine Translation (MT) services by CL-QA systems. We also expose some studies and experiments that justify the importance of analyzing whether a Named Entity should be translated or not. Experimental results in bilingual scenarios show that our approach performs better than an MT based CL-QA approach achieving an average improvement of 36.7%.


The Prague Bulletin of Mathematical Linguistics | 2012

DELiC4MT: A Tool for Diagnostic MT Evaluation over User-defined Linguistic Phenomena

Antonio Toral; Sudip Kumar Naskar; Federico Gaspari; Declan Groves

This paper demonstrates DELiC4MT, a piece of software that allows the user to perform diagnostic evaluation of machine translation systems over linguistic checkpoints, i.e., source-language lexical elements and grammatical constructions specified by the user. Our integrated tool builds upon best practices, software components and formats developed under different projects and initiatives, focusing on enabling easy adaptation to any language pair and linguistic phenomenon. We provide a description of the different modules that make up the tool, introduce a web demo and present a step-by-step case study of how it can be applied to a specific language pair and linguistic phenomenon.
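The checkpoint idea can be illustrated with a toy recall computation. This is a simplification: `checkpoint_recall` and its token-level matching stand in for what DELiC4MT extracts from word-aligned, linguistically annotated test data:

```python
def checkpoint_recall(checkpoints, hypotheses):
    """Fraction of checkpoint tokens recovered in the MT output.

    checkpoints: per-sentence lists of reference target tokens covering
    the linguistic phenomenon under study (e.g. plural nouns).
    hypotheses:  per-sentence MT output strings.
    """
    matched = total = 0
    for tokens, hyp in zip(checkpoints, hypotheses):
        hyp_tokens = hyp.split()
        for tok in tokens:
            total += 1
            if tok in hyp_tokens:
                matched += 1
    return matched / total if total else 0.0

score = checkpoint_recall([["cats", "dogs"], ["houses"]],
                          ["the cats sleep", "a house stands"])
```

Scoring only the tokens covered by a phenomenon, rather than whole sentences, is what makes the evaluation diagnostic.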


Workshop on Statistical Machine Translation | 2015

Abu-MaTran at WMT 2015 Translation Task: Morphological Segmentation and Web Crawling

Raphael Rubino; Tommi A. Pirinen; Miquel Esplà-Gomis; Nikola Ljubešić; Sergio Ortiz Rojas; Vassilis Papavassiliou; Prokopis Prokopidis; Antonio Toral

This paper presents the machine translation systems submitted by the Abu-MaTran project for the Finnish‐English language pair at the WMT 2015 translation task. We tackle the lack of resources and complex morphology of the Finnish language by (i) crawling parallel and monolingual data from the Web and (ii) applying rule-based and unsupervised methods for morphological segmentation. Several statistical machine translation approaches are evaluated and then combined to obtain our final submissions, which are the top performing English-to-Finnish unconstrained (all automatic metrics) and constrained (BLEU), and Finnish-to-English constrained (TER) systems.


The Prague Bulletin of Mathematical Linguistics | 2017

Fine-Grained Human Evaluation of Neural Versus Phrase-Based Machine Translation

Filip Klubička; Antonio Toral; Víctor M. Sánchez-Cartagena

We compare three approaches to statistical machine translation (pure phrase-based, factored phrase-based and neural) by performing a fine-grained manual evaluation via error annotation of the systems’ outputs. The error types in our annotation are compliant with the multidimensional quality metrics (MQM), and the annotation is performed by two annotators. Inter-annotator agreement is high for such a task, and results show that the best performing system (neural) reduces the errors produced by the worst system (phrase-based) by 54%.
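Inter-annotator agreement of the kind reported above is commonly measured with Cohen's kappa; a minimal implementation for two annotators (not the paper's own tooling) looks like this:

```python
def cohens_kappa(ann1, ann2):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(ann1) == len(ann2) and ann1
    n = len(ann1)
    labels = set(ann1) | set(ann2)
    # Observed agreement
    po = sum(a == b for a, b in zip(ann1, ann2)) / n
    # Agreement expected by chance, from each annotator's label distribution
    pe = sum((ann1.count(l) / n) * (ann2.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe) if pe < 1 else 1.0
```

Kappa corrects raw agreement for chance, which matters in error annotation where a few labels (or "no error") dominate.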


Computer Speech & Language | 2015

Linguistically-augmented perplexity-based data selection for language models

Antonio Toral; Pavel Pecina; Longyue Wang; Josef van Genabith

Highlights: word-level linguistic information for perplexity-based data selection; evaluation and analysis for four languages (English, Spanish, Czech and Chinese); combining models leads to lower perplexity than the state-of-the-art baseline.

This paper explores the use of linguistic information for the selection of data to train language models. We depart from the state-of-the-art method in perplexity-based data selection and extend it in order to use word-level linguistic units (i.e. lemmas, named-entity categories and part-of-speech tags) instead of surface forms. We then present two methods that combine the different types of linguistic knowledge as well as the surface forms (1, naive selection of the top-ranked sentences selected by each method; 2, linear interpolation of the datasets selected by the different methods). The paper presents detailed results and analysis for four languages with different levels of morphological complexity (English, Spanish, Czech and Chinese). The interpolation-based combination outperforms the purely statistical baseline in all the scenarios, resulting in language models with lower perplexity. In relative terms the improvements are similar regardless of the language, with perplexity reductions in the range 7.72-13.02%. In absolute terms the reduction is higher for languages with a high type-token ratio (Chinese, 202.16) or rich morphology (Czech, 81.53) and lower for the remaining languages, Spanish (55.2) and English (34.43 on the English side of the same parallel dataset as for Czech and 61.90 on the same parallel dataset as for Spanish).
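The state-of-the-art method the paper departs from, cross-entropy-difference (Moore-Lewis style) data selection, can be sketched with unigram models; real implementations use n-gram models, and the unigram estimate here is a deliberate simplification. Replacing the surface tokens with lemmas, named-entity categories or POS tags before scoring gives the paper's word-level linguistic variant:

```python
import math
from collections import Counter

def unigram_lm(corpus_tokens, vocab):
    """Add-one-smoothed unigram LM as a token -> probability dict."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def cross_entropy(sentence, lm):
    return -sum(math.log2(lm[w]) for w in sentence) / len(sentence)

def select(sentences, in_domain, out_domain, k):
    """Rank sentences by H_in(s) - H_out(s); keep the k lowest-scoring
    (i.e. most in-domain-like) sentences."""
    vocab = set(in_domain) | set(out_domain) | {w for s in sentences for w in s}
    lm_in = unigram_lm(in_domain, vocab)
    lm_out = unigram_lm(out_domain, vocab)
    ranked = sorted(sentences,
                    key=lambda s: cross_entropy(s, lm_in) - cross_entropy(s, lm_out))
    return ranked[:k]
```

Subtracting the out-of-domain cross-entropy stops the selection from simply preferring short, high-frequency sentences that any model scores well.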


Systems and Frameworks for Computational Morphology | 2011

A Lexical Database for Modern Standard Arabic Interoperable with a Finite State Morphological Transducer

Mohammed Attia; Pavel Pecina; Antonio Toral; Lamia Tounsi; Josef van Genabith

Current Arabic lexicons, whether computational or otherwise, make no distinction between entries from Modern Standard Arabic (MSA) and Classical Arabic (CA), and tend to include obsolete words that are not attested in current usage. We address this problem by building a large-scale, corpus-based lexical database that is representative of MSA. We use an MSA corpus of 1,089,111,204 words, a pre-annotation tool, machine learning techniques, and knowledge-based templatic matching to automatically acquire and filter lexical knowledge about morpho-syntactic attributes and inflection paradigms. Our lexical database is scalable, interoperable and suitable for constructing a morphological analyser, regardless of the design approach and programming language used. The database is formatted according to the international ISO standard in lexical resource representation, the Lexical Markup Framework (LMF). This lexical database is used in developing an open-source finite-state morphological processing toolkit. We build a web application, AraComLex (Arabic Computer Lexicon), for managing and curating the lexical database.
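The root-and-pattern morphology behind the templatic matching can be illustrated with a toy template filler; the `C` slot notation and the example templates are a hypothetical simplification (the paper's pipeline applies machine-learned filtering over a billion-word corpus):

```python
def apply_template(root, template):
    """Fill an Arabic-style consonantal root into a vowel template.
    Each 'C' slot takes the next root consonant in order."""
    it = iter(root)
    return "".join(next(it) if ch == "C" else ch for ch in template)

# Root k-t-b ("writing") in a verb template, root d-r-s in a noun-of-place template
verb = apply_template("ktb", "CaCaC")      # katab
noun = apply_template("drs", "maCCaCa")    # madrasa
```

Matching attested surface forms against such templates is what lets the pipeline infer inflection paradigms without hand-coding each entry.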


Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers | 2016

Abu-MaTran at WMT 2016 Translation Task: Deep Learning, Morphological Segmentation and Tuning on Character Sequences

Víctor M. Sánchez-Cartagena; Antonio Toral

This paper presents the systems submitted by the Abu-MaTran project to the Englishto-Finnish language pair at the WMT 2016 news translation task. We applied morphological segmentation and deep learning in order to address (i) the data scarcity problem caused by the lack of in-domain parallel data in the constrained task and (ii) the complex morphology of Finnish. We submitted a neural machine translation system, a statistical machine translation system reranked with a neural language model and the combination of their outputs tuned on character sequences. The combination and the neural system were ranked first and second respectively according to automatic evaluation metrics and tied for the first place in the human evaluation.


Language Resources and Evaluation | 2015

Domain adaptation of statistical machine translation with domain-focused web crawling

Pavel Pecina; Antonio Toral; Vassilis Papavassiliou; Prokopis Prokopidis; Aleš Tamchyna; Andy Way; Josef van Genabith

In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for automatic acquisition of monolingual and parallel text and their exploitation for system training, tuning, and testing in a phrase-based SMT framework. We present a strategy for using such resources depending on their availability and quantity supported by results of a large-scale evaluation carried out for the domains of environment and labour legislation, two language pairs (English–French and English–Greek) and in both directions: into and from English. In general, machine translation systems trained and tuned on a general domain perform poorly on specific domains and we show that such systems can be adapted successfully by retuning model parameters using small amounts of parallel in-domain data, and may be further improved by using additional monolingual and parallel training data for adaptation of language and translation models. The average observed improvement in BLEU achieved is substantial at 15.30 points absolute.


Workshop on Statistical Machine Translation | 2014

Abu-MaTran at WMT 2014 Translation Task: Two-step Data Selection and RBMT-Style Synthetic Rules

Raphael Rubino; Antonio Toral; Víctor M. Sánchez-Cartagena; Jorge Ferrández-Tordera; Sergio Ortiz Rojas; Gema Ramírez-Sánchez; Felipe Sánchez-Martínez; Andy Way

This paper presents the machine translation systems submitted by the Abu-MaTran project to the WMT 2014 translation task. The language pair concerned is English‐French with a focus on French as the target language. The French to English translation direction is also considered, based on the word alignment computed in the other direction. Large language and translation models are built using all the datasets provided by the shared task organisers, as well as the monolingual data from LDC. To build the translation models, we apply a two-step data selection method based on bilingual cross-entropy difference and vocabulary saturation, considering each parallel corpus individually. Synthetic translation rules are extracted from the development sets and used to train another translation model. We then interpolate the translation models, minimising the perplexity on the development sets, to obtain our final SMT system. Our submission for the English to French translation task was ranked second amongst nine teams and a total of twenty submissions.
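The second step of the data selection, vocabulary saturation, can be sketched as a greedy filter over the sentences already ranked by cross-entropy difference; the `threshold` parameter and the word-level (rather than n-gram) counting are assumptions of this sketch:

```python
from collections import Counter

def vocabulary_saturation(ranked_sentences, threshold=2):
    """Keep a ranked sentence only if it still contributes a word that
    has been seen fewer than `threshold` times; drop it otherwise."""
    seen = Counter()
    kept = []
    for sent in ranked_sentences:
        if any(seen[w] < threshold for w in sent):
            kept.append(sent)
            seen.update(sent)
    return kept
```

The point of the filter is to shrink the selected corpus without shrinking its vocabulary coverage, discarding sentences that only repeat already-saturated material.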

Collaboration


Dive into Antonio Toral's collaborations.

Top Co-Authors

Andy Way, Dublin City University
Pavel Pecina, Charles University in Prague
Marc Poch, Pompeu Fabra University
Filip Klubička, Dublin Institute of Technology