Ioannis Korkontzelos
University of Manchester
Publications
Featured research published by Ioannis Korkontzelos.
International Conference on Natural Language Processing | 2008
Ioannis Korkontzelos; Ioannis P. Klapaftis; Suresh Manandhar
Automatic Term Recognition (ATR) is defined as the task of identifying domain-specific terms from technical corpora. Termhood-based approaches measure the degree to which a candidate term refers to a domain-specific concept. Unithood-based approaches measure the attachment strength of a candidate term's constituents. These methods have been evaluated using different, often incompatible evaluation schemes and datasets. This paper provides an overview and a thorough evaluation of state-of-the-art ATR methods under a common evaluation framework, i.e. the same corpora and evaluation method. Our contributions are two-fold: (1) we compare a number of different ATR methods, showing that termhood-based methods achieve superior performance in general; (2) we show that the number of independent occurrences of a candidate term is the most effective source for estimating term nestedness, improving ATR performance.
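To make the termhood/nestedness distinction concrete, below is a minimal Python sketch of a C-value-style termhood score, with nestedness estimated from the candidate list itself. The scoring details are illustrative, not the exact measures evaluated in the paper.

```python
import math
from collections import Counter

def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous token subsequence of `longer`."""
    n, m = len(longer), len(shorter)
    return any(longer[i:i + m] == shorter for i in range(n - m + 1))

def c_value(candidates):
    """C-value-style termhood score for each candidate term.

    `candidates` maps a candidate term (tuple of tokens) to its corpus
    frequency.  A term counts as nested if it appears inside a longer
    candidate; nested terms are penalised by the average frequency of
    their containers.
    """
    scores = {}
    for term, freq in candidates.items():
        containers = [f for other, f in candidates.items()
                      if len(other) > len(term) and contains(other, term)]
        weight = math.log2(len(term)) if len(term) > 1 else 0.1
        if containers:
            scores[term] = weight * (freq - sum(containers) / len(containers))
        else:
            scores[term] = weight * freq
    return scores

# Toy candidate list with frequencies.
candidates = Counter({
    ("natural", "language", "processing"): 12,
    ("language", "processing"): 20,   # partly nested in the trigram above
    ("machine", "translation"): 15,
})
for term, score in sorted(c_value(candidates).items(), key=lambda x: -x[1]):
    print(" ".join(term), round(score, 2))
```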
Journal of Biomedical Informatics | 2016
Ioannis Korkontzelos; Azadeh Nikfarjam; Matthew Shardlow; Abeed Sarker; Sophia Ananiadou; Graciela Gonzalez
Meeting of the Association for Computational Linguistics | 2009
Ioannis Korkontzelos; Suresh Manandhar
Identifying whether a multi-word expression (MWE) is compositional or not is important for numerous NLP applications. Sense induction can partition the context of MWEs into semantic uses and therefore aid in deciding compositionality. We propose an unsupervised system to explore this hypothesis on compound nominals, proper names and adjective-noun constructions, and evaluate the contribution of sense induction. The evaluation set is derived from WordNet in a semi-supervised way. Graph connectivity measures are employed for unsupervised parameter tuning.
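The underlying intuition can be sketched with off-the-shelf components: cluster contexts of the MWE together with contexts of one of its constituents and check whether they share sense clusters. This is a toy illustration of the hypothesis, not the paper's system.

```python
# If the contexts of an MWE cluster together with contexts of its
# constituent head word, the expression is plausibly compositional;
# if they form a separate sense cluster, plausibly non-compositional.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def compositionality_hint(mwe_contexts, head_contexts, n_senses=2):
    texts = mwe_contexts + head_contexts
    X = TfidfVectorizer().fit_transform(texts)
    labels = KMeans(n_clusters=n_senses, n_init=10, random_state=0).fit_predict(X)
    mwe_labels = set(labels[:len(mwe_contexts)])
    head_labels = set(labels[len(mwe_contexts):])
    # Jaccard overlap of sense clusters: shared clusters suggest
    # compositional use.
    return len(mwe_labels & head_labels) / len(mwe_labels | head_labels)

mwe = ["the red tape delayed the permit for months",
       "endless red tape in the ministry"]
head = ["the tape was stuck to the box",
        "she sealed the parcel with tape"]
print(compositionality_hint(mwe, head))  # low overlap -> likely non-compositional
```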
Artificial Intelligence in Medicine | 2015
Ioannis Korkontzelos; Dimitrios Piliouras; Andrew W. Dowsey; Sophia Ananiadou
Objective: Drug named entity recognition (NER) is a critical step for complex biomedical NLP tasks such as the extraction of pharmacogenomic, pharmacodynamic and pharmacokinetic parameters. Large quantities of high-quality training data are almost always a prerequisite for employing supervised machine-learning techniques to achieve high classification performance. However, the human labour needed to produce and maintain such resources is a significant limitation. In this study, we improve the performance of drug NER without relying exclusively on manual annotations.

Methods: We perform drug NER using either a small gold-standard corpus (120 abstracts) or no corpus at all. In our approach, we develop a voting system that combines a number of heterogeneous models, based on dictionary knowledge, gold-standard corpora and silver annotations, to enhance performance. To improve recall, we employed genetic programming to evolve 11 regular-expression patterns that capture common drug suffixes and used them as an extra means for recognition.

Materials: Our approach uses a dictionary of drug names, i.e. DrugBank, a small manually annotated corpus, i.e. the pharmacokinetic corpus, and a part of the UKPMC database as raw biomedical text. Gold-standard and silver annotated data are used to train maximum entropy and multinomial logistic regression classifiers.

Results: Aggregating drug NER methods based on gold-standard annotations, dictionary knowledge and patterns improved the performance of models trained on gold-standard annotations only, achieving a maximum F-score of 95%. In addition, combining models trained on silver annotations, dictionary knowledge and patterns is shown to achieve performance comparable to models trained exclusively on gold-standard data. The main reason appears to be the morphological similarities shared among drug names.

Conclusion: We conclude that gold-standard data are not a hard requirement for drug NER. Combining heterogeneous models built on dictionary knowledge can achieve classification performance similar or comparable to that of the best-performing model trained on gold-standard annotations.
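A minimal sketch of the voting idea: several weak recognisers cast token-level votes, with a few hand-written regex suffix patterns standing in for the 11 patterns the paper evolves with genetic programming. The patterns and the dictionary model below are illustrative stand-ins, not the paper's trained models.

```python
import re
from collections import Counter

# Hand-written stand-ins for the evolved drug-suffix patterns.
SUFFIX_PATTERNS = [re.compile(p, re.I) for p in
                   (r"\w+mab$", r"\w+vir$", r"\w+cillin$", r"\w+azole$")]

def suffix_voter(tokens):
    return ["DRUG" if any(p.match(t) for p in SUFFIX_PATTERNS) else "O"
            for t in tokens]

def majority_vote(tokens, models):
    """Each model maps a token list to DRUG/O labels; the ensemble
    label for a token is the most common vote."""
    all_votes = [model(tokens) for model in models]
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*all_votes)]

# Toy stand-in for a dictionary-based model (e.g. DrugBank lookup).
dictionary = {"amoxicillin", "rituximab"}
dict_model = lambda toks: ["DRUG" if t.lower() in dictionary else "O"
                           for t in toks]

tokens = "Patients received amoxicillin and oseltamivir daily".split()
print(majority_vote(tokens, [dict_model, suffix_voter, dict_model]))
```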
Conference of the European Chapter of the Association for Computational Linguistics | 2014
Georgios Kontonatsios; Ioannis Korkontzelos; Jun’ichi Tsujii; Sophia Ananiadou
We describe a machine learning approach, a Random Forest (RF) classifier, that is used to automatically compile bilingual dictionaries of technical terms from comparable corpora. We evaluate the RF classifier against a popular term alignment method, namely context vectors, and report an improvement in translation accuracy. As an application, we use the automatically extracted dictionary in combination with a trained Statistical Machine Translation (SMT) system to translate unknown terms more accurately. The dictionary extraction method described in this paper is freely available.
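The pair-classification setup can be sketched as follows: represent each (source, target) candidate pair by character n-grams from both sides and train a Random Forest to accept or reject it as a translation. The feature encoding and toy data here are simplifications, not the authors' feature set.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

def pair_text(src, tgt):
    # Prefix character trigrams with their side so the model can learn
    # source/target suffix correspondences (e.g. -itis / -ite).
    return (" ".join(f"s:{src[i:i+3]}" for i in range(len(src) - 2)) + " " +
            " ".join(f"t:{tgt[i:i+3]}" for i in range(len(tgt) - 2)))

# Toy English/French-like pairs; label 1 = correct translation.
pairs = [("hepatitis", "hepatite", 1), ("gastritis", "gastrite", 1),
         ("arthritis", "arthrite", 1), ("meningitis", "meningite", 1),
         ("hepatitis", "glucose", 0), ("gastritis", "proteine", 0),
         ("arthritis", "glucose", 0), ("meningitis", "proteine", 0)]

vec = CountVectorizer(token_pattern=r"\S+")
X = vec.fit_transform([pair_text(s, t) for s, t, _ in pairs])
y = [label for _, _, label in pairs]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
test = vec.transform([pair_text("bronchitis", "bronchite")])
print(clf.predict(test))  # the shared -itis/-ite suffix should favour class 1
```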
Conference on Information and Knowledge Management | 2011
Ioannis Korkontzelos; Tingting Mu; Angelo Restificar; Sophia Ananiadou
Clinical trials are mandatory protocols describing medical research on humans and are among the most valuable sources of medical practice evidence. Searching for trials relevant to some query is laborious due to the immense number of existing protocols. Apart from search, writing new trials includes composing detailed eligibility criteria, which can be time-consuming, especially for new researchers. In this paper we present ASCOT, an efficient search application customised for clinical trials. ASCOT uses text mining and data mining methods to enrich clinical trials with metadata, which in turn serve as effective tools to narrow down search. In addition, ASCOT integrates a component for recommending eligibility criteria based on a set of selected protocols.
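A toy sketch of the recommendation idea: given protocols selected as similar to a new trial, rank the eligibility criteria that recur across them. The protocol identifiers and criteria below are made up, and the abstract does not describe ASCOT's method at this level of detail.

```python
from collections import Counter

def recommend_criteria(selected_protocols, top_n=3):
    """Rank eligibility criteria by how often they recur across the
    protocols the user selected as similar to the new trial."""
    counts = Counter(c.lower().strip()
                     for protocol in selected_protocols
                     for c in protocol["criteria"])
    return [c for c, _ in counts.most_common(top_n)]

protocols = [  # hypothetical protocols with made-up identifiers
    {"id": "NCT001", "criteria": ["age >= 18", "no prior chemotherapy",
                                  "ECOG performance status 0-1"]},
    {"id": "NCT002", "criteria": ["age >= 18", "ECOG performance status 0-1"]},
    {"id": "NCT003", "criteria": ["age >= 18", "no prior chemotherapy"]},
]
print(recommend_criteria(protocols))
# ['age >= 18', 'no prior chemotherapy', 'ecog performance status 0-1']
```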
Empirical Methods in Natural Language Processing | 2014
Georgios Kontonatsios; Ioannis Korkontzelos; Jun’ichi Tsujii; Sophia Ananiadou
Automatically compiling bilingual dictionaries of technical terms from comparable corpora is a challenging problem, yet one with many potential applications. In this paper, we exploit two independent observations about term translations: (a) terms are often formed by corresponding sub-lexical units across languages, and (b) a term and its translation tend to appear in similar lexical contexts. Based on the first observation, we develop a new character n-gram compositional method, a logistic regression classifier, for learning a string similarity measure of term translations. Following the second observation, we use an existing context-based approach. For evaluation, we investigate the performance of the compositional and context-based methods on: (a) similar and unrelated languages, (b) corpora of different degrees of comparability, and (c) the translation of frequent and rare terms. Finally, we combine the two translation clues, namely string and contextual similarity, in a linear model and show substantial improvements over either of the two translation signals alone.
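The final combination step can be illustrated directly: the string-similarity score and the context-similarity score become two features of a linear model. The scores below are toy stand-ins for the outputs of the paper's two methods.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: (string_similarity, context_similarity); label 1 = translation.
X = np.array([[0.9, 0.8], [0.8, 0.4], [0.2, 0.9],
              [0.1, 0.2], [0.3, 0.1], [0.2, 0.3]])
y = np.array([1, 1, 1, 0, 0, 0])

model = LogisticRegression().fit(X, y)

# A rare term: little context evidence, but strong string similarity.
# The linear model lets the stronger clue carry the decision.
print(model.predict_proba([[0.85, 0.15]])[0, 1])
```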
Journal of Biomedical Semantics | 2013
Georgios Kontonatsios; Ioannis Korkontzelos; BalaKrishna Kolluru; Paul Thompson; Sophia Ananiadou
Background: U-Compare is a text mining platform that allows the construction, evaluation and comparison of text mining workflows. U-Compare contains a large library of components that are tuned to the biomedical domain. Users can rapidly develop biomedical text mining workflows by mixing and matching U-Compare's components. Workflows developed using U-Compare can be exported and sent to other users who, in turn, can import and re-use them. However, the resulting workflows are standalone applications, i.e., software tools that run and are accessible only via a local machine, and that can only be run with the U-Compare platform.

Results: We address the above issues by extending U-Compare to convert standalone workflows into web services automatically, via a two-click process. The resulting web services can be registered on a central server and made publicly available. Alternatively, users can make web services available on their own servers, after installing the web application framework, which is part of the extension to U-Compare. We have performed a user-oriented evaluation of the proposed extension, by asking users who have tested the enhanced functionality of U-Compare to complete questionnaires that assess its functionality, reliability, usability, efficiency and maintainability. The results obtained reveal that the new functionality is well received by users.

Conclusions: The web services produced by U-Compare are built on top of open standards, i.e., REST and SOAP protocols, and therefore, they are decoupled from the underlying platform. Exported workflows can be integrated with any application that supports these open standards. We demonstrate how the newly extended U-Compare enhances the cross-platform interoperability of workflows, by seamlessly importing a number of text mining workflow web services exported from U-Compare into Taverna, i.e., a generic scientific workflow construction platform.
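The general pattern of a REST-wrapped workflow looks roughly like the Flask sketch below. It shows the shape of such a service, not the code that U-Compare generates; the endpoint name and payload format are assumptions.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def run_workflow(text):
    # Stand-in for an exported text mining workflow: here, a trivial
    # whitespace tokeniser.
    return {"tokens": text.split()}

@app.route("/workflow", methods=["POST"])
def workflow_endpoint():
    # Any REST-aware client (e.g. Taverna) can POST {"text": "..."}.
    payload = request.get_json(force=True)
    return jsonify(run_workflow(payload.get("text", "")))

if __name__ == "__main__":
    app.run(port=8080)
```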
Journal of the Association for Information Science and Technology | 2016
Tingting Mu; John Yannis Goulermas; Ioannis Korkontzelos; Sophia Ananiadou
Descriptive document clustering aims at discovering clusters of semantically interrelated documents together with meaningful labels that summarize the content of each document cluster. In this work, we propose a novel descriptive clustering framework, referred to as CEDL. It relies on the formulation and generation of two types of heterogeneous objects, corresponding to documents and candidate phrases, using multilevel similarity information. CEDL is composed of five main processing stages. First, it simultaneously maps the documents and candidate phrases into a common co-embedded space that preserves higher-order, neighbor-based proximities between the combined sets of documents and phrases. Then, it discovers an approximate cluster structure of documents in the common space. The third stage extracts promising topic phrases by constructing a discriminant model in which documents, along with their cluster memberships, are used as training instances. Subsequently, the final cluster labels are selected from the topic phrases using a ranking scheme that combines multiple scores based on the extracted co-embedding information and the discriminant output. The final stage polishes the initial clusters to reduce noise and accommodate the multi-topic nature of documents. The effectiveness and competitiveness of CEDL are demonstrated qualitatively and quantitatively in experiments on document databases from different application fields.
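A drastically simplified sketch of the co-embedding idea: project documents and candidate phrases into one LSA space, cluster the documents, and label each cluster with its nearest phrase. CEDL's actual stages (higher-order proximities, discriminant model, cluster polishing) are not reproduced here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = ["stock markets fell on interest rate fears",
        "the central bank raised interest rates again",
        "the team won the league title last night",
        "a late goal decided the championship match"]
phrases = ["interest rates", "championship match", "league title"]

# Documents and phrases share one vectoriser, so both can be projected
# into the same reduced (LSA) space.
vec = TfidfVectorizer().fit(docs)
svd = TruncatedSVD(n_components=2, random_state=0).fit(vec.transform(docs))
D = svd.transform(vec.transform(docs))      # documents in the shared space
P = svd.transform(vec.transform(phrases))   # phrases in the same space

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(D)
for c in range(2):
    centroid = km.cluster_centers_[c]
    # Label each cluster with the candidate phrase nearest its centroid.
    label = phrases[int(np.argmin(np.linalg.norm(P - centroid, axis=1)))]
    print(f"cluster {c}: {label}")
```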
Proceedings of the Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics | 2009
Ioannis Korkontzelos; Ioannis P. Klapaftis; Suresh Manandhar
Word Sense Induction (WSI) is the task of identifying the different senses (uses) of a target word in a given text. This paper focuses on the unsupervised estimation of the free parameters of a graph-based WSI method, and explores the use of eight Graph Connectivity Measures (GCM) that assess the degree of connectivity in a graph. Given a target word and a set of parameters, GCM evaluate the connectivity of the produced clusters, which correspond to subgraphs of the initial (unclustered) graph. Each parameter setting is assigned a score according to one of the GCM, and the highest-scoring setting is then selected. Our evaluation on the nouns of the SemEval-2007 WSI task (SWSI) shows that: (1) all GCM estimate a set of parameters that significantly outperforms the worst-performing parameter setting in both SWSI evaluation schemes; (2) all GCM estimate a set of parameters that outperforms the Most Frequent Sense (MFS) baseline by a statistically significant margin in the supervised evaluation scheme; and (3) two of the measures estimate a set of parameters that performs close to a set of parameters estimated in a supervised manner.
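The selection scheme is easy to sketch: score each parameter setting by averaging a connectivity measure over the sense clusters it induces on the target word's co-occurrence graph, then keep the best-scoring setting. The graph, the clusterings and the chosen measure (average clustering coefficient, one of many possible GCM) are all illustrative.

```python
import networkx as nx

def score_setting(graph, clusters):
    # Average a graph connectivity measure over the induced subgraphs.
    scores = [nx.average_clustering(graph.subgraph(c)) for c in clusters]
    return sum(scores) / len(scores)

# Toy co-occurrence graph for the target word "bank".
G = nx.Graph([("bank", "money"), ("money", "loan"), ("bank", "loan"),
              ("bank", "river"), ("river", "shore")])

# Each parameter setting produces a different clustering of the graph.
settings = {
    "A": [{"bank", "money", "loan"}, {"river", "shore"}],  # two senses
    "B": [{"bank", "money", "loan", "river", "shore"}],    # one sense
}
best = max(settings, key=lambda s: score_setting(G, settings[s]))
print("selected parameter setting:", best)  # coherent clusters win
```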