Thomas Proisl
University of Erlangen-Nuremberg
Publication
Featured research published by Thomas Proisl.
international conference on computational linguistics | 2014
Thomas Proisl; Stefan Evert; Paul Greiner; Besim Kabashi
Being able to quantify the semantic similarity between two texts is important for many practical applications. SemantiKLUE combines unsupervised and supervised techniques into a robust system for measuring semantic similarity. At the core of the system is a word-to-word alignment of two texts using a maximum weight matching algorithm. The system participated in three SemEval-2014 shared tasks, and its competitive results are evidence of its usability across this broad field of application.
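As a rough illustration of the alignment idea, here is a minimal sketch that treats word-to-word alignment as a maximum weight bipartite matching, solved with scipy's linear_sum_assignment. The character-bigram similarity is a stand-in for the lexical and distributional similarities a system like SemantiKLUE would actually combine; it is not the system's own code.

```python
# Sketch: word-to-word alignment as maximum weight bipartite matching.
# The toy similarity (Dice coefficient over character bigrams) is purely
# illustrative; a real STS system would use richer similarity features.
import numpy as np
from scipy.optimize import linear_sum_assignment

def toy_similarity(w1: str, w2: str) -> float:
    """Dice coefficient over character bigrams (illustrative only)."""
    b1 = {w1[i:i + 2] for i in range(len(w1) - 1)}
    b2 = {w2[i:i + 2] for i in range(len(w2) - 1)}
    if not b1 or not b2:                  # single-character words
        return 1.0 if w1 == w2 else 0.0
    return 2 * len(b1 & b2) / (len(b1) + len(b2))

def align(words_a, words_b):
    """Return the maximum-weight one-to-one alignment of two token lists."""
    sim = np.array([[toy_similarity(a, b) for b in words_b] for a in words_a])
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return [(words_a[r], words_b[c], sim[r, c]) for r, c in zip(rows, cols)]

for a, b, s in align("a cat sat on the mat".split(),
                     "the kitten sat on a rug".split()):
    print(f"{a:>4} -> {b:<7} sim={s:.2f}")
```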
international conference on computational linguistics | 2014
Stefan Evert; Thomas Proisl; Paul Greiner; Besim Kabashi
SentiKLUE is an update of the KLUE polarity classifier – which achieved good and robust results in SemEval-2013 with a simple feature set – and was implemented in 48 hours.
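As a rough sketch of what a polarity classifier with a simple feature set can look like (these are not KLUE's actual features or training data, just a generic bag-of-words baseline):

```python
# Generic polarity classifier sketch: unigram/bigram bag-of-words features
# with logistic regression. Training examples are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["great movie, loved it", "utterly boring and bad",
               "what a wonderful day", "worst service ever"]
train_labels = ["positive", "negative", "positive", "negative"]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_texts, train_labels)
print(clf.predict(["a truly great and wonderful film"]))  # likely 'positive'
```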
Lexicographica: International annual for lexicography | 2012
Peter Uhrig; Thomas Proisl
Collocations in dictionaries are often based on automatically extracted candidate lists from large text corpora filtered by a lexicographer. The present paper discusses the two currently most popular approaches to the extraction process, the traditional window-based and the more recent part-of-speech pattern approach. As an improvement on current practices, we suggest using a third approach to collocation candidate extraction based on dependency-annotated corpora. All three methods are evaluated against an existing collocations dictionary, revealing that the dependency-based approach can in general significantly improve the quality of the candidate lists. Finally, a tool that allows lexicographers to use dependency-annotated versions of their own corpora by means of a simple web interface will be presented.

1. Collocation and lexicography

It is probably unnecessary to stress the importance of collocation to lexicography – particularly to bilingual foreign language lexicography and to learner lexicography – in a journal that recently devoted almost an entire volume to “Collocations in European lexicography and dictionary research” (Lexicographica 24, 2008). Ever since the focus in foreign language pedagogy shifted from teaching isolated words to teaching words in their “natural environment” in the 1980s, the treatment of collocations in dictionaries has been widely discussed.[2] The publication of COBUILD1 (1987) marks the introduction of computationally extracted and manually verified collocation data into lexicography, a process that has since gained tremendous popularity and can be considered mainstream. Today, all major learner’s dictionaries of English devote considerable attention to collocation (see for instance Herbst/Mittmann 2008 or Götz-Votteler/Herbst 2009 for a survey) and there are specialised collocations dictionaries such as the BBI Combinatory Dictionary of English (first edition 1986, third edition 2010), the Oxford Collocations Dictionary for students of English (first edition 2002, second edition 2009; henceforth OCD1/OCD2) and – the most recent publication – the Macmillan Collocations Dictionary for Learners of English (2010; henceforth MCD).

The present paper sets out to discuss current practices of and potential improvements on the computational extraction of collocations, the software and methodology for which have evolved to rather advanced levels. In this introductory section, we will first have to briefly discuss the theoretical status and various notions of collocation together with its relation to lexicography. The second section will give an overview of established techniques for collocation candidate extraction from corpora; in section 3 we will present a more sophisticated approach to the problem based on full syntactic parses and the resulting dependency structures. We will compare and evaluate all approaches in section 4, showing that the dependency-based approach is superior to other approaches. Section 5 briefly presents Treebank.info, a freely available web interface implementing dependency-based collocation candidate extraction, and discusses consequences for lexicographic work.

1.1 Theoretical notions of collocation

When Hausmann stated in 2003 (published as Hausmann 2004) that there is a “terminological war” about the term collocation and claimed that many computational and corpus linguists were not even aware of it, he was certainly exaggerating. Nonetheless it is necessary to take a look at the two major uses of the term.[3] In the tradition of Firth (“you shall know a word by the company it keeps”, Firth 1957/1968, 179), Sinclair defines the term collocation as “the occurrence of two or more words within a short space of each other in a text” (Sinclair 1991, 170). In computational implementations, the “short space” often corresponds to a so-called “window” of several orthographic words to the left and right (often 5; see discussion in 2.5), so we shall use the term window-based approach for such extraction methods. In this very general sense of collocation, any of the combinations given in Kjellmer’s (1994) dictionary[4] can be regarded as a collocation, for instance hotel at, a downtown hotel, at her hotel, left the hotel. However, Sinclair further restricts the definition in order to exclude some such uses and states that collocation “in its purest sense [...] recognizes only the lexical co-occurrence of words” (Sinclair 1991, 170). He then goes on to state that the concept “is often related to measures of statistical significance” (Sinclair 1991, 170). It is this view of collocations that is most widely used by corpus linguists and computational linguists, but it is also used in lexicography (for instance in the selection of examples in the first edition of the Cobuild dictionary). In his comparison of the various uses of the term collocation, Herbst cites “sandy beaches” and “sell a house” as typical examples of this position (Herbst 1996, 384) since they are statistically significantly associated in corpora even though they are free combinations semantically.[5]

The second approach to collocation we shall cover here is the one advocated by Hausmann (1979; 1984; 1985; 2004), which is inspired by the problems foreign learners of a language face when trying to produce idiomatic text and their requirements on dictionaries to provide them with the necessary information. Hausmann’s model limits the concept of collocation to a relationship between exactly two items, one of which he calls base (“Basis”), the other collocate[6] (“Kollokator”). The base is a semantically autonomous word such as table, the collocate a word that shows a certain affinity (Hausmann 1984, 398) to occur with the base and that can often only be interpreted semantically in the context of the respective base, such as lay in the context of table.[7] According to Hausmann, the learner starts off with the base because he/she wants to make a statement about it and then needs the right collocate for the respective base. The distinction is tied to word class, so usually nouns are bases while verbs and adjectives are collocates of these nominal bases.[8] Accordingly, base and collocate are usually connected by some sort of syntactic relation. This is also echoed in Bartsch’s working definition: “Collocations are lexically and/or pragmatically constrained recurrent co-occurrences of at least two lexical items which are in a direct syntactic relation with each other” (Bartsch 2004, 76).

Hausmann argues strongly for the inclusion of collocates in the dictionary entries of bases for production purposes since the learner can find them only there when he/she does not know them already.[9] Herbst cites false teeth and artificial leg as typical examples, where the learner has to know that artificial teeth and false leg are not the conventional wordings, and indeed research on learner language shows that learners produce significantly more errors on such collocations than on free combinations (Nesselhauf 2005).[10] Many researchers actually make a terminological distinction that roughly corresponds to the two positions outlined here, so there is the semantically motivated distinction between open collocation[11] and restricted collocation (Cowie 1981), between co-occurrence and collocation (Evert 2005, 17) or between collocation candidates and collocations (Heid 1998, 301).[12] The latter distinction will be used here since, for lexicographic purposes, the idea is that collocation candidates are extracted from a corpus and then filtered by a lexicographer as to whether they are actual collocations that merit inclusion in a dictionary.[13] It has to be made clear, though, that it is very unlikely that all collocations in Hausmann’s sense can be identified by such a method given that frequency is not a necessary criterion. We shall conclude this section with a very strong claim made by Hausmann with regard to the “war” on collocation and will discuss in the next section to what extent it can be justified for lexicographic applications: Der basisbezogene Kollokationsbegriff ist der enger

Footnotes:
[1] The order of authors is arbitrary.
[2] In Germany, Hausmann’s (1984; 1985) publications can probably be seen as the starting point for the discussion, even though Hausmann himself is careful to show that the concept and the term had been widely used before, and not only by researchers in British contextualism (see Hausmann 2008, 5–6).
[3] We shall not cover here the text-linguistic use of the term as defined by Halliday/Hasan (1976, 287) due to its limited relevance to lexicography.
[4] Kjellmer’s dictionary was not listed among the collocations dictionaries above since it is not aimed at foreign language learners but at researchers and was created without manual intervention.
[5] Nonetheless, as Herbst (2011) shows, even some of these apparently free combinations must be learned since they represent conceptual units and there is no way to predict that the concept of a sandy beach is usually expressed in the form of a premodifying adjective plus a noun in English and not – as for instance in German – as a compound (Sandstrand, “sand beach”).
[6] Lea (2007) uses the English term collocator instead.
[7] Example borrowed from Hausmann (2004, 309).
[8] Of course verbs and adjectives are also bases when it comes to their modification by adverbs.
[9] While this is highly plausible from a lexicographic perspective, one may argue that cognitively many collocations are stored as conceptual units as a whole and not analysed into base and collocate in the same way (see Herbst 2011 for a brief discussion). Tarp (2008: 253), however, also challenges Hausmann’s position from a lexicographic point of view and argues that the collocation should also be given in the collocate entry even for production purposes.
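To make the window-based approach discussed in the excerpt concrete, here is a minimal sketch of co-occurrence counting with a symmetric window of five orthographic words. The sentence and node word are invented; real extraction would run over a large corpus and then rank candidates with an association measure.

```python
# Window-based collocation candidate extraction: count co-occurrences of a
# node word with all words in a symmetric +/-5 token window. This is the
# baseline approach the paper compares against dependency-based extraction.
from collections import Counter

def window_cooccurrences(tokens, node, window=5):
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for t in tokens[lo:hi] if t != node)
    return counts

tokens = "we stayed at a downtown hotel and left the hotel at noon".split()
print(window_cooccurrences(tokens, "hotel").most_common(5))
```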
meeting of the association for computational linguistics | 2016
Thomas Proisl; Peter Uhrig
In this paper we describe SoMaJo, a rule-based tokenizer for German web and social media texts that was the best-performing system in the EmpiriST 2015 shared task with an average F1-score of 99.57. We give an overview of the system and the phenomena its rules cover, as well as a detailed error analysis. The tokenizer is available as free software.
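Since the tokenizer is distributed as free software (the somajo package on PyPI), a brief usage sketch may be helpful. It follows the package's documented API, which may have evolved since the paper; the example sentence is invented.

```python
# Tokenizing German social-media text with SoMaJo (pip install somajo).
# API as documented in the package's README at the time of writing.
from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC")  # German, computer-mediated communication rules
paragraphs = ["Ey, das war sooo gut :-) #empfehlung @alle"]
for sentence in tokenizer.tokenize_text(paragraphs):
    for token in sentence:
        # token_class marks e.g. regular words, emoticons, hashtags, mentions
        print(token.text, token.token_class)
```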
systems and frameworks for computational morphology | 2009
Johannes Handl; Besim Kabashi; Thomas Proisl; Carsten Weber
JSLIM is a software system for writing grammars in accordance with the SLIM theory of language. Written in Java, it is designed to facilitate the coding of grammars for morphology as well as for syntax and semantics. This paper describes the system with a focus on morphology. We show how the system works, how it evolved from previous versions, and how the rules for word form recognition can also be used for word form generation. The first section starts with a basic description of the functionality of a Left Associative Grammar (LAG) and provides an algebraic definition of a JSLIM grammar. The second section deals with the new concepts of JSLIM in comparison with earlier implementations. The third section describes the format of the grammar files, i.e. of the lexicon, the rules, and the variables. The fourth section addresses the reversibility of grammar rules with the aim of automatic word form production without any additional rule system. We conclude with an outlook on current and future developments.
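As a toy illustration of the left-associative principle, where the grammar always combines the current sentence start with the next word, consider the following sketch. The categories and cancellation rules are invented and far simpler than JSLIM's actual declarative rule format.

```python
# Toy left-associative derivation: analysis proceeds strictly left to right,
# always combining (sentence_start, next_word) into a new sentence start.
# Lexicon categories and rules are invented for illustration only.
LEXICON = {"the": ["DET"], "dog": ["N"], "barks": ["V"]}
RULES = {("DET", "N"): ["NP"], ("NP", "V"): ["S"]}

def combine(start_cat, word):
    """One left-associative step: sentence start + next word -> new start."""
    next_cat = LEXICON[word][0]
    return RULES.get((start_cat[0], next_cat), start_cat + [next_cat])

cat = LEXICON["the"]
for word in "dog barks".split():
    cat = combine(cat, word)
    print(word, "->", cat)   # DET + N -> NP, then NP + V -> S
```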
Digital Scholarship in the Humanities | 2017
Stefan Evert; Thomas Proisl; Fotis Jannidis; Isabella Reger; Steffen Pielström; Christof Schöch; Thorsten Vitt
This article builds on a mathematical explanation of one of the most prominent stylometric measures, Burrows’s Delta (and its variants), to understand and explain how it works. Starting with the conceptual separation between feature selection, feature scaling, and distance measures, we have designed a series of controlled experiments in which we used the kind of feature scaling (various types of standardization and normalization) and the type of distance measure (notably Manhattan, Euclidean, and Cosine) as independent variables and the correct authorship attributions as the dependent variable indicative of the performance of each of the methods proposed. In this way, we are able to describe in some detail how these two variables interact with each other and how they influence the results. Thus we can show that feature vector normalization, that is, the transformation of the feature vectors to a uniform length of 1 (implicit in the cosine measure), is the decisive factor for the improvement of Delta proposed recently. We are also able to show that the information particularly relevant to the identification of the author of a text lies in the profile of deviation across the most frequent words rather than in the extent of the deviation or in the deviation of specific words only.
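The decomposition the authors work with (feature scaling followed by a distance measure) is compact enough to sketch. Below is classic Burrows's Delta as the mean Manhattan distance over standardized word frequencies, next to a Cosine Delta variant with the unit-length normalization the article identifies as decisive. The frequency matrix is invented; this is not the authors' experimental code.

```python
# Burrows's Delta decomposed as in the article: feature scaling (z-scores
# over the most frequent words) followed by a distance measure.
# Rows = texts, columns = relative frequencies of frequent words (invented).
import numpy as np

freq = np.array([[0.061, 0.029, 0.025],
                 [0.055, 0.035, 0.021],
                 [0.072, 0.022, 0.030]])

z = (freq - freq.mean(axis=0)) / freq.std(axis=0)   # standardization

def burrows_delta(a, b):
    """Classic Delta: mean Manhattan distance of the z-score vectors."""
    return np.mean(np.abs(a - b))

def cosine_delta(a, b):
    """Cosine Delta: normalize z-vectors to unit length, then 1 - cosine."""
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    return 1 - a @ b

print(burrows_delta(z[0], z[1]), cosine_delta(z[0], z[1]))
```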
Archive | 2018
Peter Uhrig; Stefan Evert; Thomas Proisl
Collocation candidate extraction from dependency-annotated corpora has become increasingly mainstream in collocation research in recent years. In most studies, however, the results of one parser are compared only to those of relatively “dumb” window-based approaches. To date, the impact of the parser used and its parsing scheme has, to the best of our knowledge, not been studied systematically. This chapter evaluates a total of 8 parsers on 2 corpora with 20 different association measures plus several frequency thresholds for 6 different types of collocations against the Oxford Collocations Dictionary for Students of English (2nd edition; 2009). We find that the parser and parsing scheme both play a role in the quality of the collocation candidate extraction. The performance of different parsers can differ substantially across different collocation types. The filters used to extract different types of collocations from the corpora also play an important role in the observed trade-off between precision and recall. Furthermore, we find that carefully sampled and balanced corpora (such as the BNC) seem to have considerable advantages in precision, but for total coverage, larger, less balanced corpora (such as the web corpus used in this study) take the lead. Overall, log-likelihood is the best association measure, but for some specific types of collocation (such as adjective-noun or verb-adverb), other measures perform even better.
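Since log-likelihood emerges as the best overall association measure, a sketch of the score may be useful: Dunning's G², computed from a 2x2 contingency table of observed pair counts against the expected counts derived from the marginals. The counts in the example are invented.

```python
# Log-likelihood (Dunning's G^2) association score for a word pair.
# o11 = pair count, r1 = count of word1, c1 = count of word2,
# n = total number of pairs; expected counts come from the marginals.
import math

def log_likelihood(o11, r1, c1, n):
    table = [(o11,                r1 * c1 / n),
             (r1 - o11,           r1 * (n - c1) / n),
             (c1 - o11,           (n - r1) * c1 / n),
             (n - r1 - c1 + o11,  (n - r1) * (n - c1) / n)]
    return 2 * sum(o * math.log(o / e) for o, e in table if o > 0)

# e.g. "sandy beaches": pair 80x, "sandy" 120x, "beaches" 300x in 1M pairs
print(round(log_likelihood(80, 120, 300, 1_000_000), 1))
```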
north american chapter of the association for computational linguistics | 2015
Nataliia Plotnikova; Gabriella Lapesa; Thomas Proisl; Stefan Evert
This paper describes the SemantiKLUE system (Proisl et al., 2014) used for the SemEval-2015 shared task on Semantic Textual Similarity (STS) for English. The system was developed for SemEval-2013 and extended for SemEval-2014, where it participated in three tasks and ranked 13th out of 38 submissions for the English STS task. While this year’s submission ranks 46th out of 73, further experiments on the selection of training data led to notable improvements, showing that the system could have achieved rank 22 out of 73. We report a detailed analysis of those training-data selection experiments, in which we tested different combinations of all the available STS datasets, as well as the results of a qualitative analysis conducted on a sample of the sentence pairs for which SemantiKLUE made wrong STS predictions.
Archive | 2012
Thomas Proisl
The present article introduces the Pareidoscope, a new web-based research tool for exploring the lexis-grammar interface. Its interface lets users explore the interactions between word forms and syntactic structures in an interactive, network-like fashion. Sample analyses highlight some of the Pareidoscope’s abilities to find associations between structures (which can be partially filled with word forms) and word forms, or between the different word forms in a structure.
joint conference on lexical and computational semantics | 2013
Thomas Proisl; Paul Greiner; Stefan Evert; Besim Kabashi