Beáta Megyesi | Researchain

Publication

Featured research published by Beáta Megyesi.

empirical methods in natural language processing | 2007

Single Malt or Blended? A Study in Multilingual Parser Optimization

Johan Hall; Jens Nilsson; Joakim Nivre; Gülşen Eryiğit; Beáta Megyesi; Mattias Nilsson; Markus Saers

We describe a two-stage optimization of the MaltParser system for the ten languages in the multilingual track of the CoNLL 2007 shared task on dependency parsing. The first stage consists in tuning a single-parser system for each language by optimizing parameters of the parsing algorithm, the feature model, and the learning algorithm. The second stage consists in building an ensemble system that combines six different parsing strategies, extrapolating from the optimal parameter settings for each language. When evaluated on the official test sets, the ensemble system significantly outperformed the single-parser system and achieved the highest average labeled attachment score of all systems participating in the shared task.

meeting of the association for computational linguistics | 2006

A Study on Automatically Extracted Keywords in Text Categorization

Anette Hulth; Beáta Megyesi

This paper presents a study on whether and how automatically extracted keywords can be used to improve text categorization. In summary, we show that higher performance, as measured by micro-averaged F-measure on a standard text categorization collection, is achieved when the full-text representation is combined with the automatically extracted keywords. The combination is obtained by giving higher weights to words in the full texts that are also extracted as keywords. We also present results for experiments in which the keywords are the only input to the categorizer, either represented as unigrams or intact. Of these two representations, unigrams perform best, although neither performs as well as headlines only.
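The weighting idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `weighted_term_counts`, the whitespace tokenisation, and the `boost` factor of 2.0 are all assumptions made for illustration.

```python
from collections import Counter


def weighted_term_counts(text, keywords, boost=2.0):
    """Count terms in `text`, multiplying the count of every term that is
    also an extracted keyword by `boost` (a hypothetical weighting factor)."""
    counts = Counter(text.lower().split())  # naive whitespace tokenisation
    keyword_set = {k.lower() for k in keywords}
    return {
        term: freq * (boost if term in keyword_set else 1.0)
        for term, freq in counts.items()
    }


# Toy document; "parser" stands in for an automatically extracted keyword.
doc = "parser tuning improves parser accuracy"
weights = weighted_term_counts(doc, keywords=["parser"], boost=2.0)
print(weights)
```

The resulting dictionary could serve as the feature vector passed to a categorizer; the abstract does not state which weighting scheme or factor was actually used.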

sighum workshop on language technology for cultural heritage social sciences and humanities | 2014

A Multilingual Evaluation of Three Spelling Normalisation Methods for Historical Text

Eva Pettersson; Beáta Megyesi; Joakim Nivre

We present a multilingual evaluation of approaches for spelling normalisation of historical text based on data from five languages: English, German, Hungarian, Icelandic, and Swedish. Three different normalisation methods are evaluated: a simplistic filtering model, a Levenshtein-based approach, and a character-based statistical machine translation approach. The evaluation shows that the machine translation approach often gives the best results, but also that all approaches improve over the baseline and that no single method works best for all languages.
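A Levenshtein-based normaliser of the kind mentioned above can be sketched as follows. This is a simplified illustration, not the method evaluated in the paper: the `normalise` function, the `max_distance` threshold of 2, and the toy lexicon are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalise(word, lexicon, max_distance=2):
    """Map a historical spelling to the closest modern lexicon entry,
    or leave it unchanged if nothing lies within `max_distance` edits."""
    if not lexicon:
        return word
    # Ties are broken alphabetically so the result is deterministic.
    best = min(lexicon, key=lambda w: (levenshtein(word, w), w))
    return best if levenshtein(word, best) <= max_distance else word


modern_lexicon = ["old", "cold", "gold"]  # hypothetical modern word list
print(normalise("olde", modern_lexicon))
```

A practical system would likely also use word frequencies or learned edit weights to choose between equally distant candidates; the plain distance here is only the simplest variant.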

Nordic Journal of Linguistics | 2014

Professional language in Swedish clinical text: Linguistic characterization and comparative studies

Kelly Smith; Beáta Megyesi; Sumithra Velupillai; Maria Kvist

This study investigates the linguistic characteristics of Swedish clinical text in radiology reports and doctors' daily notes from electronic health records (EHRs) in comparison to general Swedish and biomedical journal text. We quantify linguistic features through a comparative register analysis to determine how the free text of EHRs differs from general and biomedical Swedish text in terms of lexical complexity, word and sentence composition, and common sentence structures. The linguistic features are extracted using state-of-the-art computational tools: a tokenizer, a part-of-speech tagger, and scripts for statistical analysis. Results show that technical terms and abbreviations are more frequent in clinical text, and lexical variance is low. Moreover, clinical text frequently omits subjects, verbs, and function words, resulting in shorter sentences. Clinical text not only differs from general Swedish, but also internally, across its sub-domains; for example, sentences lacking verbs are significantly more frequent in radiology reports. These results provide a foundation for future development of automatic methods for EHR simplification or clarification.

conference of the european chapter of the association for computational linguistics | 2014

EACL - Expansion of Abbreviations in CLinical text

Lisa Tengstrand; Beáta Megyesi; Aron Henriksson; Martin Duneld; Maria Kvist

In the medical domain, especially in clinical texts, non-standard abbreviations are prevalent, which impairs readability for patients. To ease the understanding of the physicians’ notes, abbreviations need to be identified and expanded to their original forms. We present a distributional semantic approach to find candidates for the original form of the abbreviation, and combine this with Levenshtein distance to choose the correct candidate among the semantically related words. We apply the method to radiology reports and medical journal texts, and compare the results to general Swedish. The results show that the correct expansion of the abbreviation can be found in 40% of the cases, an improvement of 24 percentage points over the baseline (0.16), and an increase of 22 percentage points over using word space models alone (0.18).

text speech and dialogue | 2000

Ensemble of Classifiers for Noise Detection in PoS Tagged Corpora

Harald Berthelsen; Beáta Megyesi

In this paper we apply the ensemble approach to the identification of incorrectly annotated items (noise) in a training set. In a controlled experiment, memory-based, decision tree-based and transformation-based classifiers are used as a filter to detect and remove noise deliberately introduced into a manually tagged corpus. The results indicate that the method can be successfully applied to automatically detect errors in a corpus.
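The filtering idea can be sketched as a consensus vote among classifiers. This is only an illustration under stated assumptions: the three rule-based taggers below are hypothetical stand-ins for the trained memory-based, decision tree-based and transformation-based classifiers, and `flag_noise` and `min_votes` are names introduced here.

```python
def flag_noise(items, labels, classifiers, min_votes=None):
    """Return indices of items whose annotated label is contradicted by at
    least `min_votes` classifiers (default: all of them, a consensus filter)."""
    if min_votes is None:
        min_votes = len(classifiers)
    flagged = []
    for i, (item, label) in enumerate(zip(items, labels)):
        disagreements = sum(1 for clf in classifiers if clf(item) != label)
        if disagreements >= min_votes:
            flagged.append(i)
    return flagged


# Hypothetical toy taggers; a real setup would train each classifier on the
# corpus, for example with cross-validation, before using it as a filter.
def by_suffix(word):
    return "VB" if word.endswith("ing") else "NN"


def by_lookup(word):
    return {"running": "VB", "jumping": "VB"}.get(word, "NN")


def by_length(word):
    return "VB" if len(word) > 5 else "NN"


words = ["running", "dog", "jumping"]
tags = ["VB", "NN", "NN"]  # the last tag is deliberately wrong
suspect = flag_noise(words, tags, [by_suffix, by_lookup, by_length])
print(suspect)
```

Lowering `min_votes` to a majority makes the filter more aggressive, at the cost of flagging more correctly annotated items; the abstract does not say which voting scheme was used.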

Archive | 2002