Ekaterina Lapshinova-Koltunski
Saarland University
Publication
Featured research published by Ekaterina Lapshinova-Koltunski.
association for information science and technology | 2016
Elke Teich; Stefania Degaetano-Ortlieb; Peter Fankhauser; Hannah Kermes; Ekaterina Lapshinova-Koltunski
We analyze the linguistic evolution of selected scientific disciplines over a 30‐year time span (1970s to 2000s). Our focus is on four highly specialized disciplines at the boundaries of computer science that emerged during that time: computational linguistics, bioinformatics, digital construction, and microelectronics. Our analysis is driven by the question whether these disciplines develop a distinctive language use—both individually and collectively—over the given time period. The data set is the English Scientific Text Corpus (scitex), which includes texts from the 1970s/1980s and early 2000s. Our theoretical basis is register theory. In terms of methods, we combine corpus‐based methods of feature extraction (various aggregated features [part‐of‐speech based], n‐grams, lexico‐grammatical patterns) and automatic text classification. The results of our research are directly relevant to the study of linguistic variation and languages for specific purposes (LSP) and have implications for various natural language processing (NLP) tasks, for example, authorship attribution, text mining, or training NLP tools.
Archive | 2014
Kerstin Kunz; Ekaterina Lapshinova-Koltunski
The present paper contrasts strategies of cohesive conjunction in English and German system and text. We clarify the notion of cohesive conjunction by discussing conceptualizations in the literature and by comparing cohesive conjunctions to other cohesive strategies. Using theory-informed methodologies we contrast the resources available in the two languages for explicitly establishing conjunctive relations of cohesion. Moreover, we discuss the first findings from our analysis of an English-German corpus of translations and originals, which reveal differences in the textual realizations in terms of frequencies and functions. Our study complements insights about other types of cohesion investigated in the frame of a larger research project.
north american chapter of the association for computational linguistics | 2016
Kerstin Kunz; Ekaterina Lapshinova-Koltunski; José Manuel Martínez Martínez
This paper focuses on the interaction of chains of coreference identity with other types of relations, comparing English and German data sets in terms of language, mode (written vs. spoken) and register. We first describe the types of coreference and the chain features analysed as indicators of textual coherence and topic continuity. After sketching the feature categories under analysis and the methods used for statistical evaluation, we present the findings from our analysis and interpret them in terms of the contrasts mentioned above. We will also show that for some registers, coreference types other than identity are of great importance.
text speech and dialogue | 2015
Marcos Zampieri; Ekaterina Lapshinova-Koltunski
In this paper, we propose the use of automatic text classification methods to analyse variation in English-German translations from both a quantitative and a qualitative perspective. The experiments described in this paper are carried out in two steps. We trained classifiers to (1) discriminate between different genres (fiction, political essays, etc.); and (2) identify the translation method (machine vs. human). Using semi-delexicalized models (excluding all nouns), we report results of up to 60.5% F-measure in distinguishing human and machine translations and 45.4% in discriminating between seven different genres. More than the classification performance itself, we argue that text classification methods can level out discriminative features of different variables (genres and translation methods), thus enabling researchers to investigate in more detail the properties of each of them.
north american chapter of the association for computational linguistics | 2016
Anna Nedoluzhko; Ekaterina Lapshinova-Koltunski
This paper aims at a cross-lingual analysis of coreference to abstract entities in Czech and German, two languages that are typologically not very close, since they belong to two different language groups – Slavic and Germanic. We will specifically focus on coreference chains to abstract entities, i.e. verbal phrases, clauses, sentences or even longer text passages. To our knowledge, this type of relation is underinvestigated in the current state-of-the-art literature.
linguistic annotation workshop | 2015
Ekaterina Lapshinova-Koltunski; Anna Nedoluzhko; Kerstin Kunz
The present paper describes an attempt to create an interoperable scheme using existing annotations of textual phenomena across languages and genres, including non-canonical ones. This kind of analysis requires annotated multilingual resources, which are costly. Therefore, we make use of annotations already available in the resources for English, German and Czech. As the annotations in these corpora are based on different conceptual and methodological backgrounds, we need an interoperable scheme that covers existing categories and at the same time allows a comparison of the resources. In this paper, we describe how this interoperable scheme was created and which problematic cases we had to consider. The resulting scheme is supposed to be applied in the future to explore contrasts between the three languages under analysis, for which we expect the greatest differences in the degree of variation between non-canonical and canonical language.
Bergen Language and Linguistics Studies | 2018
Kerstin Kunz; Ekaterina Lapshinova-Koltunski
This paper presents a cross-lingual corpus-based study on the intersection of chains of coreference and lexical cohesion. The two types of cohesion are often combined and thus play an important role for the development of discourse topics. We analyse chain intersection as cases where chain elements of lexical cohesion occur inside of coreference chains. We use a corpus of English and German original texts from four written and spoken registers which is annotated for both types of cohesion. Our analyses point to contrasts between the two languages and across the four registers under analysis in the types and the number of intersections in coreference chains. This variation has an effect on the way important topics
north american chapter of the association for computational linguistics | 2016
Raphael Rubino; Ekaterina Lapshinova-Koltunski; Josef van Genabith
This paper introduces information density and machine translation quality estimation inspired features to automatically detect and classify human translated texts. We investigate two settings: discriminating between translations and comparable originally authored texts, and distinguishing two levels of translation professionalism. Our framework is based on delexicalised sentence-level dense feature vector representations combined with a supervised machine learning approach. The results show state-of-the-art performance for mixed-domain translationese detection with information density and quality estimation based features, while results on translation expertise classification are mixed.
empirical methods in natural language processing | 2015
Ekaterina Lapshinova-Koltunski; Mihaela Vela
In this paper, we apply text classification techniques to test how well translated texts obey linguistic conventions of the target language measured in terms of registers, which are characterised by particular distributions of lexico-grammatical features according to a given contextual configuration. The classifiers are trained on German original data and tested on comparable English-to-German translations. Our main goal is to see if both human and machine translations comply with the non-translated target originals. The results of the present analysis provide evidence for our assumption that the usage of parallel corpora in machine translation should be treated with caution, as human translations might be prone to errors.
empirical methods in natural language processing | 2015
Ekaterina Lapshinova-Koltunski
In this paper, we analyse cross-linguistic variation of discourse phenomena, i.e. coreference, discourse relations and modality. We will show that contrasts in the distribution of these phenomena can be observed across languages, genres, and text production types, i.e. translated and non-translated ones. Translations, regardless of the method they were produced with, are different from their source texts and from the comparable originals in the target language, as stated in studies on translationese. These differences can be automatically detected and analysed with exploratory and automatic clustering techniques. The extracted frequency-based profiles of variables under analysis (languages, genres, text production types) can be used in further studies, e.g. in the development and enhancement of MT systems, or in further NLP applications.