Network


Latest external collaboration on country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Veronika Vincze is active.

Publication


Featured researches published by Veronika Vincze.


BMC Bioinformatics | 2008

The BioScope corpus: biomedical texts annotated for uncertainty, negation and their scopes

Veronika Vincze; György Szarvas; Richárd Farkas; György Móra; János Csirik

BackgroundDetecting uncertain and negative assertions is essential in most BioMedical Text Mining tasks where, in general, the aim is to derive factual knowledge from textual data. This article reports on a corpus annotation project that has produced a freely available resource for research on handling negation and uncertainty in biomedical texts (we call this corpus the BioScope corpus).ResultsThe corpus consists of three parts, namely medical free texts, biological full papers and biological scientific abstracts. The dataset contains annotations at the token level for negative and speculative keywords and at the sentence level for their linguistic scope. The annotation process was carried out by two independent linguist annotators and a chief linguist – also responsible for setting up the annotation guidelines – who resolved cases where the annotators disagreed. The resulting corpus consists of more than 20.000 sentences that were considered for annotation and over 10% of them actually contain one (or more) linguistic annotation suggesting negation or uncertainty.ConclusionStatistics are reported on corpus size, ambiguity levels and the consistency of annotations. The corpus is accessible for academic purposes and is free of charge. Apart from the intended goal of serving as a common resource for the training, testing and comparing of biomedical Natural Language Processing systems, the corpus is also a good resource for the linguistic analysis of scientific and clinical texts.


Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing | 2008

The BioScope corpus: annotation for negation, uncertainty and their scope in biomedical texts

Gy"orgy Szarvas; Veronika Vincze; Richárd Farkas; János Csirik

This article reports on a corpus annotation project that has produced a freely available resource for research on handling negation and uncertainty in biomedical texts (we call this corpus the BioScope corpus). The corpus consists of three parts, namely medical free texts, biological full papers and biological scientific abstracts. The dataset contains annotations at the token level for negative and speculative keywords and at the sentence level for their linguistic scope. The annotation process was carried out by two independent linguist annotators and a chief annotator -- also responsible for setting up the annotation guidelines -- who resolved cases where the annotators disagreed. We will report our statistics on corpus size, ambiguity levels and the consistency of annotations.


Computational Linguistics | 2012

Cross-genre and cross-domain detection of semantic uncertainty

György Szarvas; Veronika Vincze; Richárd Farkas; György Móra; Iryna Gurevych

Uncertainty is an important linguistic phenomenon that is relevant in various Natural Language Processing applications, in diverse genres from medical to community generated, newswire or scientific discourse, and domains from science to humanities. The semantic uncertainty of a proposition can be identified in most cases by using a finite dictionary (i.e., lexical cues) and the key steps of uncertainty detection in an application include the steps of locating the (genre- and domain-specific) lexical cues, disambiguating them, and linking them with the units of interest for the particular application (e.g., identified events in information extraction). In this study, we focus on the genre and domain differences of the context-dependent semantic uncertainty cue recognition task.We introduce a unified subcategorization of semantic uncertainty as different domain applications can apply different uncertainty categories. Based on this categorization, we normalized the annotation of three corpora and present results with a state-of-the-art uncertainty cue recognition model for four fine-grained categories of semantic uncertainty.Our results reveal the domain and genre dependence of the problem; nevertheless, we also show that even a distant source domain data set can contribute to the recognition and disambiguation of uncertainty cues, efficiently reducing the annotation costs needed to cover a new domain. Thus, the unified subcategorization and domain adaptation for training the models offer an efficient solution for cross-domain and cross-genre semantic uncertainty recognition.


Frontiers in Aging Neuroscience | 2015

Speaking in Alzheimer’s Disease, is That an Early Sign? Importance of Changes in Language Abilities in Alzheimer’s Disease

Greta Szatloczki; Ildikó Hoffmann; Veronika Vincze; Janos Kalman; Magdolna Pakaski

It is known that Alzheimer’s disease (AD) influences the temporal characteristics of spontaneous speech. These phonetical changes are present even in mild AD. Based on this, the question arises whether an examination based on language analysis could help the early diagnosis of AD and if so, which language and speech characteristics can identify AD in its early stage. The purpose of this article is to summarize the relation between prodromal and manifest AD and language functions and language domains. Based on our research, we are inclined to claim that AD can be more sensitively detected with the help of a linguistic analysis than with other cognitive examinations. The temporal characteristics of spontaneous speech, such as speech tempo, number of pauses in speech, and their length are sensitive detectors of the early stage of the disease, which enables an early simple linguistic screening for AD. However, knowledge about the unique features of the language problems associated with different dementia variants still has to be improved and refined.


Journal of Biomedical Semantics | 2011

Linguistic scope-based and biological event-based speculation and negation annotations in the BioScope and Genia Event corpora

Veronika Vincze; György Szarvas; György Móra; Tomoko Ohta; Richárd Farkas

BackgroundThe treatment of negation and hedging in natural language processing has received much interest recently, especially in the biomedical domain. However, open access corpora annotated for negation and/or speculation are hardly available for training and testing applications, and even if they are, they sometimes follow different design principles. In this paper, the annotation principles of the two largest corpora containing annotation for negation and speculation – BioScope and Genia Event – are compared. BioScope marks linguistic cues and their scopes for negation and hedging while in Genia biological events are marked for uncertainty and/or negation.ResultsDifferences among the annotations of the two corpora are thematically categorized and the frequency of each category is estimated. We found that the largest amount of differences is due to the issue that scopes – which cover text spans – deal with the key events and each argument (including events within events) of these events is under the scope as well. In contrast, Genia deals with the modality of events within events independently.ConclusionsThe analysis of multiple layers of annotation (linguistic scopes and biological events) showed that the detection of negation/hedge keywords and their scopes can contribute to determining the modality of key events (denoted by the main predicate). On the other hand, for the detection of the negation and speculation status of events within events, additional syntax-based rules investigating the dependency path between the modality cue and the event cue have to be employed.


ACM Transactions on Speech and Language Processing | 2013

Learning to detect english and hungarian light verb constructions

Veronika Vincze; T István Nagy; János Zsibrita

Light verb constructions consist of a verbal and a nominal component, where the noun preserves its original meaning while the verb has lost it (to some degree). They are syntactically flexible and their meaning can only be partially computed on the basis of the meaning of their parts, thus they require special treatment in natural language processing. For this purpose, the first step is to identify light verb constructions. In this study, we present our conditional random fields-based tool—called FXTagger—for identifying light verb constructions. The flexibility of the tool is demonstrated on two, typologically different, languages, namely, English and Hungarian. As earlier studies labeled different linguistic phenomena as light verb constructions, we first present a linguistics-based classification of light verb constructions and then show that FXTagger is able to identify different classes of light verb constructions in both languages. Different types of texts may contain different types of light verb constructions; moreover, the frequency of light verb constructions may differ from domain to domain. Hence we focus on the portability of models trained on different corpora, and we also investigate the effect of simple domain adaptation techniques to reduce the gap between the domains. Our results show that in spite of domain specificities, out-domain data can also contribute to the successful LVC detection in all domains.


text speech and dialogue | 2008

Web-Based Lemmatisation of Named Entities

Richárd Farkas; Veronika Vincze; T István Nagy; Róbert Ormándi; György Szarvas; Attila Almási

Identifying the lemma of a Named Entity is important for many Natural Language Processing applications like Information Retrieval. Here we introduce a novel approach for Named Entity lemmatisation which utilises the occurrence frequencies of each possible lemma. We constructed four corpora in English and Hungarian and trained machine learning methods using them to obtain simple decision rules based on the web frequencies of the lemmas. In experiments our web-based heuristic achieved an average accuracy of nearly 91%.


conference of the international speech communication association | 2016

Detecting mild cognitive impairment from spontaneous speech by correlation-based phonetic feature selection

Gábor Gosztolya; László Tóth; Tamás Grósz; Veronika Vincze; Ildikó Hoffmann; Gréta Szatlóczki; Magdolna Pákáski; János Kálmán

Mild Cognitive Impairment (MCI), sometimes regarded as a prodromal stage of Alzheimer’s disease, is a mental disorder that is difficult to diagnose. Recent studies reported that MCI causes slight changes in the speech of the patient. Our previous studies showed that MCI can be efficiently classified by machine learning methods such as Support-Vector Machines and Random Forest, using features describing the amount of pause in the spontaneous speech of the subject. Furthermore, as hesitation is the most important indicator of MCI, we took special care when handling filled pauses, which usually correspond to hesitation. In contrast to our previous studies which employed manually constructed feature sets, we now employ (automatic) correlation-based feature selection methods to find the relevant feature subset for MCI classification. By analyzing the selected feature subsets we also show that features related to filled pauses are useful for MCI detection from speech samples.


Proceedings of the 10th Workshop on Multiword Expressions (MWE) | 2014

VPCTagger: Detecting Verb-Particle Constructions With Syntax-Based Methods

István Nagy T.; Veronika Vincze

Verb-particle combinations (VPCs) con- sist of a verbal and a preposition/particle component, which often have some addi- tional meaning compared to the meaning of their parts. If a data-driven morpholog- ical parser or a syntactic parser is trained on a dataset annotated with extra informa- tion for VPCs, they will be able to iden- tify VPCs in raw texts. In this paper, we examine how syntactic parsers perform on this task and we introduce VPCTag- ger, a machine learning-based tool that is able to identify English VPCs in context. Our method consists of two steps: it first selects VPC candidates on the basis of syntactic information and then selects gen- uine VPCs among them by exploiting new features like semantic and contextual ones. Based on our results, we see that VPC- Tagger outperforms state-of-the-art meth- ods in the VPC detection task.


Current Alzheimer Research | 2018

A speech recognition-based solution for the automatic detection of mild cognitive impairment from spontaneous speech

László Tóth; Ildikó Hoffmann; Gábor Gosztolya; Veronika Vincze; Gréta Szatlóczki; Zoltán Bánréti; Magdolna Pákáski; János Kálmán

Background: Even today the reliable diagnosis of the prodromal stages of Alzheimer’s disease (AD) remains a great challenge. Our research focuses on the earliest detectable indicators of cognitive de-cline in mild cognitive impairment (MCI). Since the presence of language impairment has been reported even in the mild stage of AD, the aim of this study is to develop a sensitive neuropsychological screening method which is based on the analysis of spontaneous speech production during performing a memory task. In the future, this can form the basis of an Internet-based interactive screening software for the recognition of MCI. Methods: Participants were 38 healthy controls and 48 clinically diagnosed MCI patients. The provoked spontaneous speech by asking the patients to recall the content of 2 short black and white films (one direct, one delayed), and by answering one question. Acoustic parameters (hesitation ratio, speech tempo, length and number of silent and filled pauses, length of utterance) were extracted from the recorded speech sig-nals, first manually (using the Praat software), and then automatically, with an automatic speech recogni-tion (ASR) based tool. First, the extracted parameters were statistically analyzed. Then we applied machine learning algorithms to see whether the MCI and the control group can be discriminated automatically based on the acoustic features. Results: The statistical analysis showed significant differences for most of the acoustic parameters (speech tempo, articulation rate, silent pause, hesitation ratio, length of utterance, pause-per-utterance ratio). The most significant differences between the two groups were found in the speech tempo in the delayed recall task, and in the number of pauses for the question-answering task. The fully automated version of the analysis process – that is, using the ASR-based features in combination with machine learning - was able to separate the two classes with an F1-score of 78.8%. Conclusion: The temporal analysis of spontaneous speech can be exploited in implementing a new, auto-matic detection-based tool for screening MCI for the community.

Collaboration


Dive into the Veronika Vincze's collaboration.

Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar

György Szarvas

Technische Universität Darmstadt

View shared research outputs
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Researchain Logo
Decentralizing Knowledge