Publication


Featured research published by Anke Lüdeling.


Archive | 2009

Corpus linguistics : an international handbook

Anke Lüdeling; Merja Kytö

This handbook provides an up-to-date survey of corpus linguistics. Spoken, written, and multimodal corpora serve as the bases for quantitative and qualitative research on many issues of linguistic interest. The two volumes together comprise 61 articles by renowned experts from around the world. They sketch the history of corpus linguistics and its relationship with neighbouring disciplines, show its potential, discuss its problems, and describe various methods of collecting, annotating, and searching corpora, as well as processing corpus data. Key features: up-to-date and complete handbook; includes both an overview and detailed discussions; gathers together a great number of experts.


Proceedings of the Corpus Linguistics Conference 2009 (CL2009), p. 358 | 2009

ANNIS: a search tool for multi-layer annotated corpora

Christian Chiarcos; Anke Lüdeling

ANNIS (see Dipper & Götze 2005; Chiarcos et al. 2008) is a flexible web-based corpus architecture for search and visualization of multi-layer linguistic corpora. By multi-layer we mean that the same primary datum may be annotated independently with (i) annotations of different types (spans, DAGs with labelled edges and arbitrary pointing relations between terminals or non-terminals), and (ii) annotation structures that possibly overlap and/or conflict hierarchically. In this paper we present the different features of the architecture as well as actual use cases for corpus-linguistic research in areas as diverse as information structure, learner language and discourse-level phenomena.

The supported search functionalities of ANNIS2 include exact and regular-expression matching on word forms and annotations, as well as complex relations between individual elements, such as all forms of overlapping, contained or adjacent annotation spans, hierarchical dominance (children, ancestors, left- or rightmost child etc.) and more. As an alternative to the query language, data can be accessed using a graphical query builder. Query matches are visualized depending on annotation types: annotations referring to tokens (e.g. lemma, POS, morphology) are shown immediately in the match list. Spans (covering one or more tokens) are displayed in a grid view, trees/graphs in a tree/graph view, and pointing relations (such as anaphoric links) in a discourse view, with same-colour highlighting for coreferent elements. Full Unicode support is provided and a media player is embedded for rendering audio files linked to the data, allowing for a large variety of corpora.

Corpus data is annotated with automatic tools (taggers, parsers etc.) or task-specific expert tools for manual annotation, and then mapped onto the interchange format PAULA (Dipper 2005), where stand-off annotations refer to the same primary data. Importers exist for many formats, including EXMARaLDA (Schmidt 2004), TigerXML (Brants & Plaehn 2000), MMAX2 (Müller & Strube 2006), RSTTool (O'Donnell 2000), PALinkA (Orasan 2003) and Toolbox (Stuart et al. 2007). Data is compiled into a relational DB for optimal performance. Query matches and their features can also be exported in the ARFF format and processed with the data mining tool WEKA (Witten & Frank 2005), which offers implementations of clustering and classification algorithms.

ANNIS2 compares favourably with the search functionalities of the above tools as well as other corpus search engines (EXAKT, http://www.exmaralda.org/exakt.html; TIGERSearch, Lezius 2002; CWB, Christ 1994) and other frameworks/architectures (NITE, Carletta et al. 2003; GATE, Cunningham 2002).
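
To give a flavour of the query functionality described above, the sketch below lists a few query patterns in the style of the ANNIS query language (AQL) and prints them from a small Python script. The annotation names used here (pos, lemma, cat) and the pointing-relation name (coref) are corpus-dependent placeholders; the queries are an illustration, not taken from the ANNIS2 documentation.

# Illustrative AQL-style query patterns.  Annotation names such as "pos",
# "lemma", "cat" and the pointing-relation name "coref" are corpus-dependent
# placeholders, not guaranteed to exist in any given ANNIS corpus.
example_queries = {
    # exact and regular-expression matching on word forms and annotations
    "finite verbs by regular expression": 'pos=/VV.*/',
    "tokens with a given lemma": 'lemma="gehen"',
    # hierarchical dominance: an NP node dominating a noun token
    "noun dominated by an NP": 'cat="NP" & pos="NN" & #1 >* #2',
    # precedence between two tokens
    "article directly preceding a noun": 'pos="ART" & pos="NN" & #1 . #2',
    # containment of one annotation span in another, across layers
    "noun contained in an NP span": 'cat="NP" & pos="NN" & #1 _i_ #2',
    # pointing relation, e.g. an anaphoric/coreference link
    "coreference link between two tokens": 'tok & tok & #1 ->coref #2',
}

for description, aql in example_queries.items():
    print(f"{description:40s} {aql}")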


Archive | 2006

Using web data for linguistic purposes

Anke Lüdeling; Stefan Evert; Marco Baroni

The world wide web is a mine of language data of unprecedented richness and ease of access (Kilgarriff and Grefenstette 2003). A growing body of studies has shown that simple algorithms using web-based evidence are successful at many linguistic tasks, often outperforming sophisticated methods based on smaller but more controlled data sources (cf. Turney 2001; Keller and Lapata 2003). Most current internet-based linguistic studies access the web through a commercial search engine. For example, some researchers rely on frequency estimates (number of hits) reported by engines (e.g. Turney 2001). Others use a search engine to find relevant pages, and then retrieve the pages to build a corpus (e.g. Ghani and Mladenic 2001; Baroni and Bernardini 2004). In this study, we first survey the state of the art, discussing the advantages and limits of various approaches, and in particular the inherent limitations of depending on a commercial search engine as a data source. We then focus on what we believe to be some of the core issues of using the web to do linguistics. Some of these issues concern the quality and nature of data we can obtain from the internet (What languages, genres and styles are represented on the web?), others pertain to data extraction, encoding and preservation (How can we ensure data stability? How can web data be marked up and categorized? How can we identify duplicate pages and near duplicates?), and others yet concern quantitative aspects (Which statistical quantities can be reliably estimated from web data, and how much web data do we need? What are the possible pitfalls due to the massive presence of duplicates and mixed-language pages?). All points are illustrated through concrete examples from English, German and Italian web corpora.
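
As a minimal illustration of the second approach mentioned above (using a search engine only to locate pages and then retrieving the pages themselves to build a corpus), the following Python sketch downloads a fixed list of URLs and strips the HTML down to plain text. The URL list and the crude tag-stripping heuristic are assumptions for the sake of the example; a real web-corpus pipeline would add the language identification, duplicate detection and markup handling discussed above.

import re
import urllib.request

# URLs assumed to have been returned by a search engine for some seed terms;
# these are placeholders, not pages used in the original study.
urls = [
    "https://example.org/page1.html",
    "https://example.org/page2.html",
]

def html_to_text(html: str) -> str:
    """Very rough HTML-to-text conversion: drop scripts, styles and tags."""
    html = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

corpus = []
for url in urls:
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            html = response.read().decode("utf-8", errors="replace")
        corpus.append(html_to_text(html))
    except OSError:
        continue  # unreachable pages are simply skipped in this sketch

print(f"collected {len(corpus)} documents, "
      f"{sum(len(doc.split()) for doc in corpus)} whitespace-separated tokens")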


Archive | 2002

Neoclassical word formation in German

Anke Lüdeling; Tanja Schmid; Sawwas Kiokpasoglou

This paper deals with neoclassical word formation in German: complex words consisting of stems and affixes of classical (Greek or Latin) origin appear in Germanic languages alongside complex words consisting of native stems and affixes. Example (1) lists just a few:

(1) a. Morphologie “morphology”, anthropomorph “anthropomorph”, monomorphemisch “monomorphemic”, Allomorphie “allomorphy”
    b. hydrophil “hydrophile”, Hydrogeologie “hydrogeology”, Hydrographie “hydrography”, Hydrologe “hydrologist”
    c. dokumentieren “to document”, kontingentieren “to ration”, fermentieren “to ferment”, supplementieren “to supplement”, patentieren “to patent”, instrumentieren “to arrange, orchestrate”
    d. telefonieren “to telephone”, Monitor “monitor”, Photographie “photograph”, Video “video”, Automobil “car”, Kosmetik “cosmetics”

It is obvious from their meaning and usage that these are not just loanwords borrowed directly from classical languages. Rather, word formation involving classical elements is active and productive today, not just in science or technology but also in everyday language, as the words in (1d) especially show.


Archive | 2007

Syntactic annotation of non-canonical linguistic structures

Hagen Hirschmann; Seanna Doolittle; Anke Lüdeling

This paper deals with the syntactic annotation of corpora that contain both ‘canonical’ and ‘non-canonical’ sentences. Consider Examples (1) and (2) from the German learner corpus Falko, which will be introduced below. (1) represents a syntactically correct (although perhaps not very enlightening) utterance to which it is easy to assign a syntactic structure. The utterance in (2), on the other hand, would be considered incorrect (and probably be interpreted as a word order error); it is much more difficult to assign a syntactic structure to it. The question is: how can (1) and (2) be annotated in a uniform way that shows that there is a difference and makes clear exactly where that difference lies?


Archive | 2007

Das Zusammenspiel von qualitativen und quantitativen Methoden in der Korpuslinguistik

Anke Lüdeling

There are many linguistic research questions whose answers call for both qualitative and quantitative analysis of corpus data. Both kinds of analysis can refer to the corpus text itself, but also to annotation layers. Every kind of annotation, that is, categorisation, constitutes a controlled and necessary loss of information. This means that every kind of categorisation is also an interpretation of the data. In most large corpora, exactly one interpretation is offered for each annotation layer provided, such as the part-of-speech layer or the lemma layer. In recent years, alongside the large "flat" annotated corpora, corpus models have emerged with which conflicting information can be encoded: the so-called multi-layer models (multilevel standoff corpora), in which all annotation layers are stored independently of the text and only refer to specific anchors in the text. Using error annotation in a learner corpus as an example, I argue that at least those corpora in which annotation needs vary considerably and analyses may be contested benefit from being encoded in multi-layer models.
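
A minimal Python sketch of the multi-layer standoff idea described above: each annotation layer is stored separately from the primary text and only refers to character anchors, so spans from different layers may overlap or conflict without interfering with each other. The layer names, labels and the example sentence are invented for illustration and do not come from any particular corpus.

# Primary text; character offsets are the only anchors shared between layers.
text = "Gestern hat Anna das Buch gelesen"

# Independent standoff layers, each a list of (start, end, label) triples that
# point into the primary text.  The span boundaries of the two layers cross,
# which a single tree-shaped annotation could not represent.
layers = {
    # hypothetical syntactic chunk layer
    "chunks": [(0, 7, "ADVP"), (12, 16, "NP"), (17, 25, "NP")],
    # hypothetical information-structure layer with conflicting span boundaries
    "info_structure": [(0, 11, "background"), (12, 33, "focus")],
}

for name, spans in layers.items():
    rendered = ", ".join(f"'{text[s:e]}'/{label}" for s, e, label in spans)
    print(f"{name}: {rendered}")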


Archive | 2015

Error annotation systems

Anke Lüdeling; Hagen Hirschmann; Sylviane Granger; Gaëtanelle Gilquin; Fanny Meunier

… and says only that this part of the learner utterance is unidiomatic, conflating an implicit target hypothesis with an error tag (the annotator can only know that this expression is unidiomatic if he or she knows a more idiomatic expression).

(10) LU  it sleeps inside everyone from the start of being
     TH1 it sleeps inside everyone since birth
     TH2 it sleeps inside everyone from the beginning
     TH3 it sleeps inside everyone [UNIDIOMATIC]

Different target hypotheses are not equivalent; a target hypothesis directly influences the subsequent analysis. The Falko corpus consistently has two target hypotheses: the first one deals with clear grammatical errors and the second one also corrects stylistic problems. The need for such an approach becomes clear in (11). The learner utterance in (11) contains a spelling error: the two occurrences of dependance have to be replaced by dependence. From a more abstract perspective, the whole phrase dependence on gambling sounds unidiomatic if we take into account that the learner wants to refer to a specific kind of addiction. Similarly, dependence on drugs appears to be a marked expression as opposed to drug addiction. An annotation that wants to take this into consideration has to separate the description into the annotation of the spelling error and the annotation of the stylistic error in order not to lose one of the pieces of information. Example (12) illustrates this.

(11) Dependance on gambling is something like dependance on drugs (...) (ICLE-CZ-PRAG-0013.3)

(12) LU  Dependance on gambling
     TH1 Dependence on gambling
     TH2 Gambling addiction

The examples in this section show how important the step of formulating a target hypothesis is: the subsequent error classification critically depends on this first step. In order to operationalise the first step of the error annotation, one can give guidelines for the formulation of target hypotheses, in addition to the guidelines for assigning error tags, which also need to be evaluated with regard to consistency (see Section 2.6). The problem of unclear error identification has been discussed since the beginning of EA. Milton and Chowdhury (1994) already suggested that sometimes multiple analyses should be coded in a learner corpus. If the target hypothesis is left implicit or there is only one error analysis, the user is given an error annotation without knowing against which form the utterance was evaluated. In early corpora (pre-multi-layer, pre-XML) it was technically impossible to show the error exponent because errors could only be marked on one token. In corpora that use an XML format it is possible to mark spans, and target hypotheses are sometimes given in the XML mark-up. Only in standoff architectures, however, is it possible to give several competing target hypotheses. Examples of learner corpora with consistent and well-documented (multiple) target hypotheses are the Falko corpus, the trilingual MERLIN corpus (Wisniewski et al. 2013) and the Czech as a Second Language corpus (Rosen et al. 2014).
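
The multi-layer encoding of competing target hypotheses can be pictured roughly as in the Python sketch below, using example (12) above. The layer names and error tags are illustrative and do not reproduce the actual Falko annotation scheme.

# Learner utterance from example (12), with two competing target hypotheses
# stored as independent layers.  Layer names and error tags are illustrative,
# not the actual Falko annotation scheme.
learner_utterance = "Dependance on gambling"

target_hypotheses = {
    # TH1: minimal correction of clear grammatical/orthographic errors
    "TH1": {"text": "Dependence on gambling", "error_tags": ["spelling"]},
    # TH2: additionally corrects stylistic and idiomatic problems
    "TH2": {"text": "Gambling addiction", "error_tags": ["spelling", "idiomaticity"]},
}

print(f"LU : {learner_utterance}")
for name, hypothesis in target_hypotheses.items():
    tags = ", ".join(hypothesis["error_tags"])
    print(f"{name}: {hypothesis['text']}  (errors relative to LU: {tags})")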


ACM Journal on Computing and Cultural Heritage | 2012

Introduction to the special issue on corpus and computational linguistics, philology, and the linguistic heritage of humanity

Gregory R. Crane; Anke Lüdeling

The articles in this issue make two complementary assertions: first, language and linguistic sources are a key element of human cultural heritage and, second, we need to integrate the ancient goals of philology with rapidly emerging methods from fields such as Corpus and Computational Linguistics. The first 15,000,000 volumes digitized by Google contained data from more than 400 languages covering more than four thousand years of the human record. We need to develop methods to explore linguistic changes and the ideas that languages encode as these evolve and circulate over millennia and on a global scale.


Archive | 2005

Storing and Querying Historical Texts in a Relational Database

Lukas C. Faulstich; Ulf Leser; Anke Lüdeling

This paper describes an approach for storing and querying a large corpus of linguistically annotated historical texts in a relational database management system. Texts in such a corpus have a complex structure consisting of multiple text layers that are richly annotated and aligned to each other. Modeling and managing such corpora poses various challenges not present in simpler text collections. In particular, it is a difficult task to design and efficiently implement a query language for such complex annotation structures that fulfills the requirements of linguists and philologists. In this report, we describe steps towards a solution of this task. We describe a model for storing arbitrarily complex linguistic annotation schemes for text. The text itself may be present in various transliterations, transcriptions, or editions. We identify the main requirements for a query language on linguistic annotations in this scenario. From these requirements, we derive fundamental query operators and sketch their implementation in our model. Furthermore, we discuss initial ideas for improving the efficiency of an implementation based on relational databases and XML techniques.
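
To make the general idea concrete, here is a minimal Python/SQLite sketch of how stand-off annotations over multiple text layers might be laid out in a relational database. The table and column names, the sample text and the tags are invented for illustration; they are not the schema or data proposed in the paper.

import sqlite3

# Hypothetical, simplified schema for stand-off annotation of multiple text
# layers; table and column names are illustrative, not the paper's schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE text_layer (
        layer_id   INTEGER PRIMARY KEY,
        name       TEXT,            -- e.g. 'diplomatic transcription', 'normalised edition'
        content    TEXT
    );
    CREATE TABLE annotation (
        anno_id    INTEGER PRIMARY KEY,
        layer_id   INTEGER REFERENCES text_layer(layer_id),
        start_pos  INTEGER,         -- character offset into the layer's content
        end_pos    INTEGER,
        tier       TEXT,            -- e.g. 'pos', 'lemma'
        value      TEXT
    );
""")

# Invented sample text and tags, for illustration only.
conn.execute("INSERT INTO text_layer VALUES (1, 'normalised edition', 'thaz uuas filu manag')")
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 1, 0, 4, "pos", "PDS"), (2, 1, 5, 9, "pos", "VAFIN")],
)

# A simple query operator: all annotations of a given tier on a given layer.
rows = conn.execute(
    "SELECT start_pos, end_pos, value FROM annotation WHERE layer_id = 1 AND tier = 'pos'"
).fetchall()
print(rows)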


language resources and evaluation | 2015

CityU corpus of essay drafts of English language learners: a corpus of textual revision in second language writing

John Lee; Chak Yan Yeung; Amir Zeldes; Marc Reznicek; Anke Lüdeling; Jonathan J. Webster

Learner corpora consist of texts produced by non-native speakers. In addition to these texts, some learner corpora also contain error annotations, which can reveal common errors made by language learners and provide training material for automatic error correction. We present a novel type of error-annotated learner corpus containing sequences of revised essay drafts written by non-native speakers of English. Sentences in these drafts are annotated with comments by language tutors and are aligned to sentences in subsequent drafts. We describe the compilation process of our corpus, present its encoding in TEI XML, and report agreement levels on the error annotations. Further, we demonstrate the potential of the corpus to facilitate research on textual revision in L2 writing by conducting a case study on verb tenses using ANNIS, a corpus search and visualization platform.
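
A rough Python sketch of the kind of record that the draft-to-draft sentence alignment described above implies: a sentence in one draft is linked to its revised counterpart in the next draft, together with a tutor comment. The field names and the example sentences are invented for illustration and do not reflect the corpus's actual TEI XML encoding.

from dataclasses import dataclass
from typing import Optional

# Illustrative record for one revision link between two essay drafts;
# field names are invented and do not mirror the corpus's TEI XML markup.
@dataclass
class RevisionLink:
    draft_from: int                      # index of the earlier draft
    draft_to: int                        # index of the revised draft
    sentence_before: str                 # sentence as written in the earlier draft
    sentence_after: str                  # aligned sentence in the revised draft
    tutor_comment: Optional[str] = None  # comment attached by the language tutor

# Invented example sentences, not taken from the CityU corpus.
link = RevisionLink(
    draft_from=1,
    draft_to=2,
    sentence_before="He go to the library every days.",
    sentence_after="He goes to the library every day.",
    tutor_comment="Check subject-verb agreement and the plural of 'day'.",
)
print(link)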

Collaboration


Dive into Anke Lüdeling's collaboration.

Top Co-Authors

Hagen Hirschmann, Humboldt University of Berlin
Marc Reznicek, Complutense University of Madrid
Stefan Evert, University of Erlangen-Nuremberg
Thomas Krause, Humboldt University of Berlin
Florian Zipser, Humboldt University of Berlin
Ulf Leser, Humboldt University of Berlin
Burkhard Dietterle, Humboldt University of Berlin
Malte Belz, Humboldt University of Berlin