
Illés Solt

Budapest University of Technology and Economics

Publication

Featured research published by Illés Solt.

BMC Bioinformatics | 2011

The gene normalization task in BioCreative III

Zhiyong Lu; Hung Yu Kao; Chih-Hsuan Wei; Minlie Huang; Jingchen Liu; Cheng-Ju Kuo; Chun-Nan Hsu; Richard Tzong-Han Tsai; Hong-Jie Dai; Naoaki Okazaki; Han-Cheol Cho; Martin Gerner; Illés Solt; Shashank Agarwal; Feifan Liu; Dina Vishnyakova; Patrick Ruch; Martin Romacker; Fabio Rinaldi; Sanmitra Bhattacharya; Padmini Srinivasan; Hongfang Liu; Manabu Torii; Sérgio Matos; David Campos; Karin Verspoor; Kevin Livingston; W. John Wilbur

Background: We report the Gene Normalization (GN) challenge in BioCreative III, where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k).

Results: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively.

Conclusions: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth, we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
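The relative improvements quoted for the composite system can be reproduced from the reported TAP-k scores. The following is a minimal sketch of that arithmetic; the function name `relative_improvement` and the dictionary layout are illustrative assumptions, and only the scores themselves come from the abstract.

```python
# TAP-k scores on the gold-standard articles, as reported in the abstract.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}


def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


for k in sorted(best_team):
    gain = relative_improvement(best_team[k], composite[k])
    print(f"k={k}: {gain:.1f}% improvement")
```

Rounded to one decimal place, the three gains match the 12.4%, 21.8%, and 26.6% stated above.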

Bioinformatics | 2011

The GNAT library for local and remote gene mention normalization

Jörg Hakenberg; Martin Gerner; Maximilian Haeussler; Illés Solt; Conrad Plake; Michael Schroeder; Graciela Gonzalez; Goran Nenadic; Casey M. Bergman

Summary: Identifying mentions of named entities, such as genes or diseases, and normalizing them to database identifiers have become an important step in many text and data mining pipelines. Despite this need, very few entity normalization systems are publicly available as source code or web services for biomedical text mining. Here we present the GNAT Java library for text retrieval, named entity recognition, and normalization of gene and protein mentions in biomedical text. The library can be used as a component to be integrated with other text-mining systems, as a framework to add user-specific extensions, and as an efficient stand-alone application for the identification of gene and protein names for data analysis. On the BioCreative III test data, the current version of GNAT achieves a TAP-20 score of 0.1987. Availability: The library and web services are implemented in Java and the sources are available from http://gnat.sourceforge.net.

Journal of the American Medical Informatics Association | 2009

Semantic Classification of Diseases in Discharge Summaries Using a Context-aware Rule-based Classifier

Illés Solt; Domonkos Tikk; Viktor Gál; Zsolt Tivadar Kardkovács

OBJECTIVE: Automated and disease-specific classification of textual clinical discharge summaries is of great importance in human life science, as it helps physicians conduct medical studies by providing statistically relevant data for analysis. This can be further facilitated if, when labeling discharge summaries, semantic labels are also extracted from the text, such as whether a given disease is present, absent, or questionable in a patient, or is unmentioned in the document. The authors present a classification technique that successfully solves the semantic classification task.

DESIGN: The authors introduce a context-aware rule-based semantic classification technique for use on clinical discharge summaries. The classification is performed in successive steps. First, some misleading parts are removed from the text; then the text is partitioned into positive, negative, and uncertain context segments; finally, a sequence of binary classifiers is applied to assign the appropriate semantic labels.

MEASUREMENTS: For evaluation the authors used the documents of the i2b2 Obesity Challenge and adopted its evaluation measures: F1-macro and F1-micro.

RESULTS: On the two subtasks of the Obesity Challenge (textual and intuitive classification) the system performed very well, achieving an F1-macro of 0.80 for the textual and 0.67 for the intuitive task, and obtained second place in the textual and first place in the intuitive subtask of the challenge.

CONCLUSIONS: The authors show in the paper that a simple rule-based classifier can tackle the semantic classification task more successfully than machine learning techniques if the training data are limited and some semantic labels are very sparse.
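The three steps named in the design (removing misleading parts, partitioning into context segments, and applying a sequence of binary decisions) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the trigger phrases, the "family history" filter, the sentence splitter, and the function names are hypothetical and are not the authors' actual rule set.

```python
import re
from typing import Iterable, List

# Hypothetical trigger phrases; the paper's real rules are not given in the abstract.
MISLEADING = re.compile(r"\bfamily history\b", re.IGNORECASE)
NEGATION = re.compile(r"\b(no|denies|without|negative for)\b", re.IGNORECASE)
UNCERTAIN = re.compile(r"\b(possible|probable|questionable|rule out)\b", re.IGNORECASE)


def split_sentences(document: str) -> List[str]:
    """Naive sentence splitter on end-of-sentence punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]


def segment_context(sentence: str) -> str:
    """Assign a sentence to a negative, uncertain, or positive context."""
    if NEGATION.search(sentence):
        return "negative"
    if UNCERTAIN.search(sentence):
        return "uncertain"
    return "positive"


def classify(document: str, disease_terms: Iterable[str]) -> str:
    """Label a disease as present, absent, questionable, or unmentioned."""
    pattern = re.compile("|".join(re.escape(t) for t in disease_terms), re.IGNORECASE)
    # Step 1: drop sentences that are likely to mislead (assumed filter).
    sentences = [s for s in split_sentences(document) if not MISLEADING.search(s)]
    # Step 2: collect the contexts in which the disease is mentioned.
    contexts = {segment_context(s) for s in sentences if pattern.search(s)}
    # Step 3: a fixed sequence of binary decisions.
    if not contexts:
        return "unmentioned"
    if "positive" in contexts:
        return "present"
    if "uncertain" in contexts:
        return "questionable"
    return "absent"
```

For example, `classify("Patient denies asthma. Possible diabetes.", ["diabetes"])` yields `"questionable"`, because the only sentence mentioning the term falls in an uncertain context.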

BMC Bioinformatics | 2013

A detailed error analysis of 13 kernel methods for protein–protein interaction extraction

Domonkos Tikk; Illés Solt; Philippe Thomas; Ulf Leser

Background: Kernel-based classification is the current state-of-the-art for extracting pairs of interacting proteins (PPIs) from free text. Various proposals have been put forward, which diverge especially in the specific kernel function, the type of input representation, and the feature sets. These proposals are regularly compared to each other regarding their overall performance on different gold standard corpora, but little is known about their respective performance on the instance level.

Results: We report on a detailed analysis of the shared characteristics and the differences between 13 current methods using five PPI corpora. We identified a large number of rather difficult (misclassified by most methods) and easy (correctly classified by most methods) PPIs. We show that kernels using the same input representation perform similarly on these pairs and that building ensembles using dissimilar kernels leads to significant performance gain. However, our analysis also reveals that characteristics shared between difficult pairs are few, which lowers the hope that new methods, if built along the same line as current ones, will deliver breakthroughs in extraction performance.

Conclusions: Our experiments show that current methods do not seem to do very well in capturing the shared characteristics of positive PPI pairs, which must also be attributed to the heterogeneity of the (still very few) available corpora. Our analysis suggests that performance improvements should be sought in novel feature sets rather than in novel kernel functions.
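The instance-level view described in the results (pairs misclassified or correctly classified by most methods, and ensembles of several methods) can be sketched on toy data. The method names, predictions, and gold labels below are made up, and simple majority voting is an assumed combination scheme; the abstract does not say how the ensembles were built.

```python
from collections import Counter

# Toy data: method name -> predicted label per candidate pair (1 = interacting).
predictions = {
    "kernel_a": [1, 0, 1, 0, 1],
    "kernel_b": [1, 0, 0, 0, 1],
    "kernel_c": [1, 1, 0, 0, 0],
}
gold = [1, 0, 1, 1, 1]


def correct_counts(preds, gold_labels):
    """For each pair, count how many methods classified it correctly."""
    return [
        sum(p[i] == g for p in preds.values())
        for i, g in enumerate(gold_labels)
    ]


def difficulty(preds, gold_labels):
    """Tag a pair 'easy' if most methods are right, otherwise 'difficult'."""
    n = len(preds)
    return [
        "easy" if c > n / 2 else "difficult"
        for c in correct_counts(preds, gold_labels)
    ]


def majority_vote(preds):
    """Combine the methods by per-pair majority voting (assumed scheme)."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*preds.values())]


print(difficulty(predictions, gold))
print(majority_vote(predictions))
```

With an odd number of binary classifiers, as here, the vote never ties; an even number would need an explicit tie-breaking rule.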

North American Chapter of the Association for Computational Linguistics | 2009

Molecular event extraction from Link Grammar parse trees

Jörg Hakenberg; Illés Solt; Domonkos Tikk; Luis Tari; Astrid Rheinländer; Nguyen Quang Long; Graciela Gonzalez; Ulf Leser

We present an approach for extracting molecular events from literature based on a deep parser, using a query language for parse trees. Detected events range from gene expression to protein localization, and cover a multitude of different entity types, including genes/proteins, binding sites, and locations. Furthermore, our approach is capable of recognizing negation and the speculative character of extracted statements. We first parse documents using Link Grammar (BioLG) and store the parse trees in a database. Events are extracted using a newly developed query language which traverses the BioLG linkages between trigger terms, arguments, and events. The concrete queries are learnt from an annotated corpus. On BioNLP Shared Task data, we achieve an overall F1-measure of 29.6%.

Journal of the American Medical Informatics Association | 2010

Improving textual medication extraction using combined conditional random fields and rule-based systems.

Domonkos Tikk; Illés Solt

OBJECTIVE: In the i2b2 Medication Extraction Challenge, medication names together with details of their administration were to be extracted from medical discharge summaries.

DESIGN: The task of the challenge was decomposed into three pipelined components: named entity identification, context-aware filtering, and relation extraction. For named entity identification (NEI), first a rule-based (RB) method that was used in our overall fifth place-ranked solution at the challenge was investigated. Second, a conditional random fields (CRF) approach to NEI, developed after the completion of the challenge, is presented. The CRF models are trained on the 17 ground truth documents and on the output of the rule-based NEI component on all documents, a larger but potentially inaccurate training dataset. For both NEI approaches their effect on relation extraction performance was investigated. The filtering and relation extraction components are both rule-based.

MEASUREMENTS: In addition to the official entry level evaluation of the challenge, entity level analysis is also provided.

RESULTS: On the test data an entry level F1-score of 80% was achieved for exact matching and 81% for inexact matching with the RB-NEI component. The CRF trained only on the ground truth documents produces a significantly weaker result, but the CRF trained with the additional rule-based output outperforms the rule-based model with 81% exact and 82% inexact F1-score (p<0.02).

CONCLUSION: This study shows that a simple rule-based method is on a par with more complicated machine learners; CRF models can benefit from the addition of the potentially inaccurate training data when only very few training documents are available. Such training data could be generated using the outputs of rule-based methods.

Bioinformatics | 2015

Computer-assisted curation of a human regulatory core network from the biological literature

Philippe Thomas; Pawel Durek; Illés Solt; Bertram Klinger; Franziska Witzel; Pascal Schulthess; Yvonne Mayer; Domonkos Tikk; Nils Blüthgen; Ulf Leser

MOTIVATION: A highly interlinked network of transcription factors (TFs) orchestrates the context-dependent expression of human genes. ChIP-chip experiments that interrogate the binding of particular TFs to genomic regions are used to reconstruct gene regulatory networks at genome-scale, but are plagued by high false-positive rates. Meanwhile, a large body of knowledge on high-quality regulatory interactions remains largely unexplored, as it is available only in natural language descriptions scattered over millions of scientific publications. Such data are hard to extract, and regulatory databases together currently contain only 503 regulatory relations between human TFs.

RESULTS: We developed a text-mining-assisted workflow to systematically extract knowledge about regulatory interactions between human TFs from the biological literature. We applied this workflow to the entire Medline, which helped us to identify more than 45 000 sentences potentially describing such relationships. We ranked these sentences by a machine-learning approach. The top-2500 sentences contained ∼900 sentences that encompass relations already known in databases. By manually curating the remaining 1625 top-ranking sentences, we obtained more than 300 validated regulatory relationships that were not present in a regulatory database before. Full-text curation allowed us to obtain detailed information on the strength of experimental evidence supporting a relationship.

CONCLUSIONS: We were able to increase curated information about the human core transcriptional network by >60% compared with the current content of regulatory databases. We observed improved performance when using the network for disease gene prioritization compared with the state-of-the-art.

AVAILABILITY AND IMPLEMENTATION: Web-service is freely accessible at http://fastforward.sys-bio.net/.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

BMC Bioinformatics | 2010

Species identification for gene name normalization

Illés Solt; Domonkos Tikk; Ulf Leser

Background: Protein interaction networks are expensive to construct experimentally. Therefore, researchers usually refer to the literature or domain-specific databases to convey knowledge on currently known interactions. Yet the task of manual collection of knowledge from scientific papers is labor intensive, and therefore should be automated to the extent possible. For this, an important step is identifying gene and protein names (termed entities). After identification, gene names must be mapped to database identifiers to connect them to structured knowledge. One particular problem in this step is homonyms, i.e., identical names referring to different genes in different species.

Methods: We present different approaches that aim at assigning species labels to MEDLINE abstracts. We use (1) as a

Computational Intelligence | 2011

Molecular event extraction from Link Grammar parse trees in the BioNLP'09 Shared Task

Jörg Hakenberg; Illés Solt; Domonkos Tikk; Vãu Há Nguyên; Luis Tari; Quang Long Nguyen; Chitta Baral; Ulf Leser

The BioNLP’09 Shared Task deals with extracting information on molecular events, such as gene expression and protein localization, from natural language text. Information in this benchmark is given as tuples including protein names, trigger terms for each event, and possible other participants such as binding sites. We address all three tasks of BioNLP’09: event detection, event enrichment, and recognition of negation and speculation. Our method for the first two tasks is based on a deep parser; we store the parse tree of each sentence in a relational database schema. From the training data, we collect the dependencies connecting any two relevant terms of a known tuple, that is, the shortest paths linking these two constituents. We encode all such linkages in a query language to retrieve similar linkages from unseen text. For the third task, we rely on a hierarchy of hand‐crafted regular expressions to recognize speculation and negated events. In this paper, we added extensions regarding a post‐processing step that handles ambiguous event trigger terms, as well as an extension of the query language to relax linkage constraints. On the BioNLP Shared Task test data, we achieve an overall F1‐measure of 32%, 29%, and 30% for the successive Tasks 1, 2, and 3, respectively.
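The idea of collecting the shortest path that links two relevant terms in a parse can be sketched with a breadth-first search over a toy linkage graph. The graph below is a hand-made illustration for the phrase "expression of p53", not real BioLG output, and the function name `shortest_path` is an assumption; the abstract does not describe how the paths were actually computed or stored.

```python
from collections import deque

# Toy undirected linkage graph; edges are illustrative only.
linkage = {
    "MDM2": ["inhibits"],
    "inhibits": ["MDM2", "expression"],
    "expression": ["inhibits", "of"],
    "of": ["expression", "p53"],
    "p53": ["of"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest node path or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Path between a trigger term and its argument.
print(shortest_path(linkage, "expression", "p53"))
```

A path found this way on training data could then serve as a pattern to match against parses of unseen sentences, which is the role the query language plays in the approach described above.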

Knowledge Discovery and Data Mining | 2012

Modality classification for medical images using sparse coded affine-invariant descriptors

Viktor Gál; Illés Solt; Etienne E. Kerre; Mike Nachtegael

Modality is a key facet in medical image retrieval, as a user is likely interested in only one of e.g. radiology images, flowcharts, and pathology photos. While assessing image modality is trivial for humans, reliable automatic methods are required to deal with large un-annotated image bases, such as figures taken from the millions of scientific publications. We present a multi-disciplinary approach to tackle the classification problem by combining image features, meta-data, textual and referential information. We test our system's accuracy on the ImageCLEF 2011 medical modality classification data set. We show that using a fully affine-invariant feature descriptor and sparse coding on these descriptors in the Bag-of-Words image representation significantly increases the classification accuracy. Our best method achieves 87.89 and outperforms the state of the art.
