Kim Luyckx
University of Antwerp
Publication
Featured research published by Kim Luyckx.
International Conference on Computational Linguistics | 2008
Kim Luyckx; Walter Daelemans
Most studies in statistical or machine-learning-based authorship attribution focus on two or only a few authors. This leads to an overestimation of the importance of the features extracted from the training data and found to be discriminating for these small sets of authors. Most studies also use training data of a size that is unrealistic for situations in which stylometry is applied (e.g., forensics), and thereby overestimate the accuracy of their approach in those situations. A more realistic interpretation of the task is as an authorship verification problem, which we approximate by pooling data from many different authors as negative examples. In this paper, we show, on the basis of a new corpus with 145 authors, the effect of a large number of authors on feature selection and learning, and demonstrate the robustness of a memory-based learning approach to authorship attribution and verification with many authors and limited training data, compared to eager learning methods such as SVMs and maximum entropy learning.
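The memory-based approach described above can be sketched as a nearest-neighbour verifier over stylometric feature vectors, with texts by many other authors pooled as negative examples. The feature type (character trigrams), the cosine distance, and all example texts below are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of memory-based (k-NN) authorship verification with
# pooled negatives. Features and texts are illustrative assumptions.
import math
from collections import Counter

def features(text, n=3):
    """Character n-gram counts, a common stylometric feature type."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def verify(train, query, k=3):
    """Label the query by majority vote over its k nearest training texts.

    train: list of (text, label), where label is True for the candidate
    author and False for the pooled negative examples."""
    q = features(query)
    scored = sorted(train, key=lambda tl: cosine(features(tl[0]), q),
                    reverse=True)
    votes = [label for _, label in scored[:k]]
    return votes.count(True) > k // 2
```

Being a lazy learner, the verifier keeps all training texts in memory and defers generalisation to query time, which is the robustness property the abstract contrasts with eager learners such as SVMs.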
Journal of the American Medical Informatics Association | 2016
Elyne Scheurwegs; Kim Luyckx; Léon Luyten; Walter Daelemans; Tim Van den Bulcke
OBJECTIVE Enormous amounts of healthcare data are becoming increasingly accessible through the large-scale adoption of electronic health records. In this work, structured and unstructured (textual) data are combined to assign clinical diagnostic and procedural codes (specifically ICD-9-CM) to patient stays. We investigate whether integrating these heterogeneous data types improves prediction strength compared to using the data types in isolation. METHODS Two separate data integration approaches were evaluated. Early data integration combines features of several sources within a single model, and late data integration learns a separate model per data source and combines these predictions with a meta-learner. This is evaluated on data sources and clinical codes from a broad set of medical specialties. RESULTS When compared with the best individual prediction source, late data integration leads to improvements in predictive power (eg, overall F-measure increased from 30.6% to 38.3% for International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) diagnostic codes), while early data integration is less consistent. The predictive strength strongly differs between medical specialties, both for ICD-9-CM diagnostic and procedural codes. DISCUSSION Structured data provides complementary information to unstructured data (and vice versa) for predicting ICD-9-CM codes. This can be captured most effectively by the proposed late data integration approach. CONCLUSIONS We demonstrated that models using multiple electronic health record data sources systematically outperform models using data sources in isolation in the task of predicting ICD-9-CM codes over a broad range of medical specialties.
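The late data integration scheme described above can be sketched as one base model per data source, with a meta-level combination of their per-class probabilities. The toy centroid base model and the weighted average (standing in for the trained meta-learner) are illustrative assumptions, not the paper's models.

```python
# Sketch of late data integration: a separate model per data source,
# combined at the prediction level. Models and weights are toy stand-ins.
import numpy as np

class CentroidModel:
    """Tiny per-source base model: nearest class centroid, softmax-like scores."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.centroids = np.array([X[y == c].mean(axis=0) for c in self.classes])
        return self

    def predict_proba(self, X):
        # Distance to each class centroid, turned into normalized scores.
        d = np.linalg.norm(X[:, None, :] - self.centroids[None, :, :], axis=2)
        s = np.exp(-d)
        return s / s.sum(axis=1, keepdims=True)

def late_integrate(models, sources, weights):
    """Meta-level step: weighted average of per-source class probabilities."""
    probs = sum(w * m.predict_proba(X)
                for m, X, w in zip(models, sources, weights))
    return probs.argmax(axis=1)
```

Early integration would instead concatenate the sources' features into one matrix and fit a single model; the late variant above keeps each source's model independent, which is what lets a strong source compensate for a weak one.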
Computational Linguistics in the Netherlands | 2012
Kim Luyckx; Frederik Vaassen; Claudia Peersman; Walter Daelemans
We present a system to automatically identify emotion-carrying sentences in suicide notes and to detect the specific fine-grained emotion conveyed. With this system, we competed in Track 2 of the 2011 Medical NLP Challenge, where the task was to distinguish between fifteen emotion labels, ranging from guilt, sorrow, and hopelessness to hopefulness and happiness. Since a sentence can be annotated with multiple emotions, we designed a thresholding approach that enables assigning multiple labels to a single instance. We rely on the probability estimates returned by an SVM classifier and experimentally set thresholds on these probabilities. Emotion labels are assigned only if their probability exceeds a certain threshold and if the probability of the sentence being emotion-free is low enough. We show the advantages of this thresholding approach by comparing it to a naïve system that assigns only the most probable label to each test sentence, and to a system trained on emotion-carrying sentences only.
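The thresholding step described above fits in a few lines: a label is assigned when its estimated probability clears a tuned threshold, but only if the probability of the sentence being emotion-free is low enough. The label names, threshold values, and the `none` key below are illustrative assumptions.

```python
# Sketch of multi-label assignment via per-label probability thresholds,
# gated on a low "no emotion" probability. Labels and values are made up.
def assign_labels(probs, thresholds, no_emotion_key="none", no_emotion_max=0.5):
    """Return all labels whose probability clears their threshold,
    unless the sentence is likely emotion-free."""
    if probs.get(no_emotion_key, 0.0) > no_emotion_max:
        return []
    return sorted(label for label, p in probs.items()
                  if label != no_emotion_key
                  and p >= thresholds.get(label, 0.5))
```

Compared with the naïve argmax baseline the abstract mentions, this scheme can return zero, one, or several labels per sentence, matching the multi-label annotation.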
Journal of Biomedical Informatics | 2017
Elyne Scheurwegs; Madhumita Sushil; Stéphan Tulkens; Walter Daelemans; Kim Luyckx
The CEGS N-GRID 2016 Shared Task (Filannino et al., 2017) in Clinical Natural Language Processing introduces the assignment of a severity score to a psychiatric symptom, based on a psychiatric intake report. We present a method that employs the inherent interview-like structure of the report to extract relevant information from the report and generate a representation. The representation consists of a restricted set of psychiatric concepts (and the context they occur in), identified using medical concepts defined in UMLS that are directly related to the psychiatric diagnoses present in the Diagnostic and Statistical Manual of Mental Disorders, 4th Edition (DSM-IV) ontology. Random Forests provides a generalization of the extracted, case-specific features in our representation. The best variant presented here scored an inverse mean absolute error (MAE) of 80.64%. A concise concept-based representation, paired with identification of concept certainty and scope (family, patient), shows a robust performance on the task.
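The inverse MAE reported above can be read as mean absolute error on the ordinal severity scale, normalised by the worst possible error and inverted so that higher is better. The shared task's exact normalisation may differ; the sketch below assumes a 0–3 severity scale.

```python
# Sketch of an inverse mean absolute error on ordinal severity labels:
# MAE normalized by the maximum possible error, inverted to a percentage.
# The 0-3 scale and this exact normalization are assumptions.
def inverse_mae(gold, pred, max_label=3, min_label=0):
    """100% means perfect predictions; 0% means maximally wrong ones."""
    mae = sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)
    return 100.0 * (1.0 - mae / (max_label - min_label))
```

Unlike plain accuracy, this metric gives partial credit for near-miss severity predictions, which suits an ordinal target.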
Journal of Biomedical Informatics | 2017
Elyne Scheurwegs; Kim Luyckx; Léon Luyten; Bart Goethals; Walter Daelemans
Clinical codes are used for public reporting purposes, are fundamental to determining public financing for hospitals, and form the basis for reimbursement claims to insurance providers. They are assigned to a patient stay to reflect the diagnosis and performed procedures during that stay. This paper aims to enrich algorithms for automated clinical coding by taking a data-driven approach and by using unsupervised and semi-supervised techniques for the extraction of multi-word expressions that convey a generalisable medical meaning (referred to as concepts). Several methods for extracting concepts from text are compared, two of which are constructed from a large unannotated corpus of clinical free text. A distributional semantic model (in this case, the word2vec skip-gram model) is used to generalise over concepts and retrieve relations between them. These methods are validated on three sets of patient stay data, in the disease areas of urology, cardiology, and gastroenterology. The datasets are in Dutch, which introduces a limitation on available concept definitions from expert-based ontologies (e.g. UMLS). The results show that when expert-based knowledge in ontologies is unavailable, concepts derived from raw clinical texts are a reliable alternative. Both concepts derived from raw clinical texts and concepts derived from expert-created dictionaries outperform a bag-of-words approach in clinical code assignment. Adding features based on tokens that appear in a semantically similar context has a positive influence for predicting diagnostic codes. Furthermore, the experiments indicate that a distributional semantics model can find relations between semantically related concepts in texts but also introduces erroneous and redundant relations, which can undermine clinical coding performance.
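Retrieving relations between semantically related concepts from a distributional model, as described above, amounts to a cosine-similarity lookup over concept vectors. The toy vectors below, for a few Dutch clinical terms, are illustrative assumptions, not trained word2vec embeddings.

```python
# Sketch of nearest-neighbour retrieval over distributional concept
# vectors. The 2-d vectors are toy values, not skip-gram embeddings.
import numpy as np

def most_similar(vectors, query, topn=2):
    """Return the topn concepts closest to `query` by cosine similarity."""
    names = [n for n in vectors if n != query]
    q = vectors[query] / np.linalg.norm(vectors[query])
    sims = {n: float(vectors[n] @ q / np.linalg.norm(vectors[n]))
            for n in names}
    return sorted(sims, key=sims.get, reverse=True)[:topn]
```

As the abstract notes, such retrieved neighbours include genuinely related concepts alongside erroneous or redundant ones, so the relations need filtering before they are fed into a coding model.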
IEEE International Conference on Healthcare Informatics | 2013
David Damen; Kim Luyckx; Geert Hellebaut; Tim Van den Bulcke
Clinical trial recruitment encompasses many challenging tasks. Chief amongst those is the fast and reliable recruitment of eligible participants for the study. Much of this selection is still performed manually, despite the possibility of missing eligible patients (up to 60% according to some studies). To mitigate this issue, a Topic Maps-based semantic platform was developed at the Antwerp University Hospital to assist in the recruitment of clinical trial participants from the full patient population. The platform consists of (1) a Web-based editor for the creation of ontology-based clinical trial representations, (2) a patient evaluator that connects to structured and unstructured hospital data sources to determine eligibility for a clinical trial, and (3) a Web-based analytics module for reviewing evaluation results. The semantic nature of the clinical trial representation allows for generic formalization as well as for local adaptation of the study protocol to accommodate a specific hospital IT infrastructure.
International Conference on Natural Language Generation | 2008
Iris Hendrickx; Walter Daelemans; Kim Luyckx; Roser Morante; Vincent Van Asch
In this paper we describe our machine learning approach to the generation of referring expressions. As our algorithm we use memory-based learning. Our results show that in case of predicting the TYPE of the expression, having one general classifier gives the best results. On the contrary, when predicting the full set of properties of an expression, a combined set of specialized classifiers for each subdomain gives the best performance.
Journal of Biomedical Informatics | 2018
Madhumita Sushil; Simon Šuster; Kim Luyckx; Walter Daelemans
We have three contributions in this work:
1. We explore the utility of a stacked denoising autoencoder and a paragraph vector model to learn task-independent dense patient representations directly from clinical notes. To analyze if these representations are transferable across tasks, we evaluate them in multiple supervised setups to predict patient mortality, primary diagnostic and procedural category, and gender. We compare their performance with sparse representations obtained from a bag-of-words model. We observe that the learned generalized representations significantly outperform the sparse representations when we have few positive instances to learn from, and there is an absence of strong lexical features.
2. We compare the model performance of the feature set constructed from a bag of words to that obtained from medical concepts. In the latter case, concepts represent problems, treatments, and tests. We find that concept identification does not improve the classification performance.
3. We propose novel techniques to facilitate model interpretability. To understand and interpret the representations, we explore the best encoded features within the patient representations obtained from the autoencoder model. Further, we calculate feature sensitivity across two networks to identify the most significant input features for different classification tasks when we use these pretrained representations as the supervised input. We successfully extract the most influential features for the pipeline using this technique.
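The first contribution, learning dense patient representations with a stacked denoising autoencoder, can be illustrated with a single-layer denoising autoencoder: the input is corrupted with masking noise, the network is trained to reconstruct the clean input, and the hidden activations serve as the dense representation. The layer sizes, noise level, learning rate, tied-weight decoder, and toy data below are illustrative assumptions, not the paper's configuration.

```python
# Minimal numpy sketch of a denoising autoencoder over sparse binary
# bag-of-words vectors. One hidden layer, masking noise, tied weights;
# all hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DenoisingAutoencoder:
    def __init__(self, n_vis, n_hid, noise=0.3, lr=0.1):
        self.W = rng.normal(0.0, 0.1, (n_vis, n_hid))
        self.b_h = np.zeros(n_hid)
        self.b_v = np.zeros(n_vis)
        self.noise, self.lr = noise, lr

    def encode(self, x):
        """The dense representation: hidden-layer activations."""
        return sigmoid(x @ self.W + self.b_h)

    def train_step(self, x):
        # Corrupt the input by zeroing a random fraction of features,
        # but reconstruct the clean input x.
        mask = rng.random(x.shape) > self.noise
        h = self.encode(x * mask)
        r = sigmoid(h @ self.W.T + self.b_v)
        # Gradient of binary cross-entropy w.r.t. the output pre-activation.
        err = r - x
        dh = (err @ self.W) * h * (1 - h)
        # Tied weights: decoder and encoder gradients both update W.
        self.W -= self.lr * (np.outer(x * mask, dh) + np.outer(err, h))
        self.b_v -= self.lr * err
        self.b_h -= self.lr * dh
        return float((err ** 2).mean())  # squared-error monitor
```

Stacking repeats this step, feeding one layer's hidden activations to the next; the final hidden vector is the task-independent patient representation evaluated in the supervised setups above.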
Literary and Linguistic Computing | 2011
Kim Luyckx; Walter Daelemans
Archive | 2010
Kim Luyckx