Antti Arppe
University of Helsinki
Publication
Featured research published by Antti Arppe.
Corpus Linguistics and Linguistic Theory | 2007
Antti Arppe; Juhani Järvikivi
In this study we explore the concurrent, combined use of three research methods, statistical corpus analysis and two psycholinguistic experiments (a forced-choice and an acceptability rating task), using verbal synonymy in Finnish as a case in point. In addition to supporting conclusions from earlier studies concerning the relationships between corpus-based and experimental data (e.g., Featherston 2005), we show that each method adds to our understanding of the studied phenomenon, in a way which could not be achieved through any single method by itself. Most importantly, whereas relative rareness in a corpus is associated with dispreference in selection, such infrequency does not categorically always entail substantially lower acceptability. Furthermore, we show that forced-choice and acceptability rating tasks pertain to distinct linguistic processes, with category-wise incommensurable scales of measurement, and should therefore be merged with caution, if at all.
Information & Management | 2005
Camilla Magnusson; Antti Arppe; Tomas Eklund; Barbro Back; Hannu Vanharanta; Ari Visa
This paper adopts a multi-methodological approach to information systems research in order to produce new information through data mining. This approach is particularly suitable for mining material that consists of both qualitative and quantitative information. The contents of quarterly reports from three telecommunications companies were compared. The study focused on the years 2000-2001, a period of economic decline for many IT companies. The central quantitative data, reflected by seven financial ratios, were visualised using self-organising maps. The qualitative data, consisting of the textual contents of the reports, were visualised using collocational networks; these showed the relationships between the central concepts in the texts. As the visualisations of the contents were compared, certain patterns could be found. The results seemed to suggest that changes in the networks indicated future changes in the self-organising maps. In the cases studied, a change in the textual data usually indicated a change in the financial data in the following quarter. This may be a consequence of the fact that the texts reflected the plans and future expectations of management, whereas the financial ratios reflected the current financial situation of the company.
Corpus Linguistics and Linguistic Theory | 2017
Antti Arppe; John Newman; Weifeng Han
Shanghainese is an extremely topic-prominent language with many topic markers in competition with one another, often without any obvious basis for the selection of one topic marker over another. We explore the influence of five variables on the five most frequent topic markers in a corpus of (spoken) Shanghainese: topic length, syntactic category of the topic, function of the topic, comment type, and genre. We carry out a multivariate statistical analysis of the data, relying on a polytomous logistic regression model. Our approach leads to a satisfying quantification of the role of each factor, as well as an estimate of the probabilities of combinations of factors, in influencing the choice of topic marker. This study serves simultaneously as an introduction to the polytomous package (Arppe 2013) in the statistical
Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages | 2014
Conor Snoek; Dorothy Thunder; Kaidi Lõo; Antti Arppe; Jordan Lachler; Sjur N. Moshagen; Trond Trosterud
This paper presents aspects of a computational model of the morphology of Plains Cree based on the technology of finite state transducers (FST). The paper focuses in particular on the modeling of nominal morphology. Plains Cree is a polysynthetic language whose nominal morphology relies on prefixes, suffixes and circumfixes. The model of Plains Cree morphology is capable of handling these complex affixation patterns and the morphophonological alternations that they engender. Plains Cree is an endangered Algonquian language spoken in numerous communities across Canada. The language has no agreed upon standard orthography, and exhibits widespread variation. We describe problems encountered and solutions found, while contextualizing the endeavor in the description, documentation and revitalization of First Nations Languages in Canada.
QITL-4 - Proceedings of Quantitative Investigations in Theoretical Linguistics 4 (QITL-4) | 2011
R. Harald Baayen; Antti Arppe
Statistical classification and principles of human learning
In the application of any statistical analysis method to the modeling of linguistic phenomena, a recurring question is how to understand the statistical results from a cognitive perspective. Although quantitative models may provide detailed and useful insights into which factors enhance the probability of particular linguistic phenomena, they tend to leave unanswered how actual speakers come to learn and use their language in the way they do. The present study addresses this question by introducing a new, parameter-free model for linguistic choice behavior based on naive discriminative learning that is driven fully and only by the distributional properties of its input. The learning principles on which this model is based, the so-called Rescorla-Wagner equations (see appendix), were first proposed by Wagner and Rescorla in 1972 (Wagner & Rescorla 1972), and have proved to be amazingly fruitful in psychology as a model for human and animal learning (Miller, Barnet, & Grahame 1995). A technical innovation due to Danks (2003) makes it possible to estimate the weights of the Rescorla-Wagner equations when learning has reached a state of equilibrium. Baayen, Milin, Filipovic Durdevic, Hendrix, and Marelli (submitted) incorporated the equilibrium equations of Danks (2003) into a general discriminative learning model that is naive in the sense of naive Bayes classifiers. These authors show that naive discriminative learning provides accurate predictions of response latencies in the visual lexical decision task. The model reproduces a wide range of effects in the morphological processing literature with a minimum of representational assumptions, using a learning engine that, in its simplest form, has no free parameters.
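The Rescorla-Wagner learning rule mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: it uses the iterative form of the rule rather than the equilibrium equations of Danks (2003), and the learning rate `rate`, the maximum association strength `lam`, and the cue and outcome names are all illustrative assumptions. In each learning event, every cue that is present has its weight to each outcome moved toward `lam` if the outcome occurred and toward 0 if it did not, in proportion to the current prediction error.

```python
from collections import defaultdict


def activation(weights, cues, outcome):
    """Summed association strength from the present cues to one outcome."""
    return sum(weights[(cue, outcome)] for cue in set(cues))


def rescorla_wagner_update(weights, cues, outcomes, all_outcomes,
                           rate=0.1, lam=1.0):
    """Apply one Rescorla-Wagner learning event to `weights` in place.

    `weights` maps (cue, outcome) pairs to association strengths.
    `rate` and `lam` are illustrative values, not taken from the paper.
    """
    cues = set(cues)  # each cue counts once per event
    for outcome in all_outcomes:
        # Target is lam when the outcome occurred, 0 otherwise.
        target = lam if outcome in outcomes else 0.0
        # Prediction error, scaled by the learning rate.
        delta = rate * (target - activation(weights, cues, outcome))
        # Only cues present in this event are adjusted.
        for cue in cues:
            weights[(cue, outcome)] += delta
    return weights


def train(events, rate=0.1, lam=1.0, epochs=200):
    """Repeatedly present all events and return the learned weights."""
    all_outcomes = sorted({o for _, outs in events for o in outs})
    weights = defaultdict(float)
    for _ in range(epochs):
        for cues, outcomes in events:
            rescorla_wagner_update(weights, cues, outcomes,
                                   all_outcomes, rate, lam)
    return weights


# Hypothetical toy data: contextual cues paired with the verb that was used.
events = [
    ({"cue_a", "cue_shared"}, {"verb_1"}),
    ({"cue_b", "cue_shared"}, {"verb_2"}),
]

if __name__ == "__main__":
    learned = train(events)
    for outcome in ("verb_1", "verb_2"):
        support = activation(learned, {"cue_a", "cue_shared"}, outcome)
        print(outcome, round(support, 3))
```

In this toy setting the shared cue ends up with a weaker association to either verb than the cue that distinguishes them, which is the discriminative behavior the rule is known for. Used as a classifier, such a model would pick the outcome with the highest summed activation for the cues present in a given context.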
In this paper, we pit this parameter-free statistical engine derived from human learning principles against several well-established statistical classifiers: random forests (Breiman 2001; Strobl, Malley, & Tutz 2009), support vector machines (Vapnik 1995), memory-based learning (Daelemans & Bosch 2005) and polytomous logistic regression (according to the one-vs.-rest heuristic, see e.g. Arppe (2008)). As our linguistic example case, we have selected the near-synonymous set of the four most frequent Finnish verbs denoting THINK, namely ajatella, miettiä, pohtia, harkita ‘think, reflect, ponder, consider’, which have been comprehensively studied by Arppe (2008) using newspaper and Internet newsgroup discussion corpora. Altogether 3,404 occurrences of these four THINK verbs and their sentential contexts were analyzed in terms of their morphological and lexical as well as syntactic structure (following Functional Dependency Grammar; Tapanainen & Järvinen 1997), supplemented with semantic and structural subclassifications. Of some 6,000 contextual features, 46 were selected for the present study, as these 46 emerged from previous analyses as the most predictive ones when taken together. This subset of predictors included the most common morphological properties and general semantic characteristics of the verb chain in which the THINK verb occurred, and detailed information on the syntactic structure (functional roles and various subclassifications) linked with the THINK verbs in their sentential context. Arppe (2008) observed that using polytomous logistic regression (with any of several common heuristics) as a classifier seems to reach a ceiling at a Recall rate of roughly two-thirds of the sentences in the research corpus. The
Linguistic Typology | 2016
Matti Miestamo; D. Bakker; Antti Arppe
Variety sampling aims at capturing as much of the world’s linguistic variety as possible. The article discusses and compares two sampling methods designed for variety sampling: the Diversity Value method, in which sample languages are picked according to the diversity found in family trees, and the Genus-Macroarea method, in which genealogical stratification is primarily based on genera and areal stratification pays attention to the proportional representation of the genealogical diversity of macroareas. The pros and cons of the methods are discussed, some additional features are introduced to the Genus-Macroarea method, and the ability of both methods to capture crosslinguistic variety is tested with computerized simulations drawing on data in The world atlas of language structures database.
Archive | 2009
Antti Arppe
The purpose of this paper is to present a case study elucidating how multivariate models can be interpreted to shed light on the nature of the use of, and choice among, lexical and structural alternatives in language, more specifically near-synonyms, and the underlying explanatory factors. By multivariate models I mean both the use of multiple linguistic variables from a range of categories, instead of only one or two, in order to study and explain some linguistic phenomenon, and the use of multivariate statistical methods such as polytomous logistic regression.
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages | 2017
Antti Arppe; Marie-Odile Junker; Delasie Torkornoo
In this paper we present a case study of how comprehensive, well-structured, and consistent lexical databases, one indicating the exact inflectional subtype of each word and another exhaustively listing the full paradigm for each inflectional subtype, can be quickly and reliably converted into a computational model of the finite-state transducer (FST) kind. As our example language, we will use (Northern) East Cree (Algonquian, ISO 639-3: crl), a morphologically complex Indigenous language. We will focus on modeling (Northern) East Cree verbs, as their paradigms represent the most richly inflected forms in this language.
International Conference on Computational Linguistics | 2012
Mari-Sanna Paukkeri; Jaakko J. Väyrynen; Antti Arppe
In the near-synonym lexical choice task, the best alternative out of a set of near-synonyms is selected to fill a lexical gap in a text. We experiment with an extensive set of over 650 linguistic features to represent the context of a word, and with a range of machine learning approaches to the lexical choice task. We extend previous work by experimenting with unsupervised and semi-supervised methods, and use automatic feature selection to cope with the problems arising from the rich feature set. It is natural to think that linguistic analysis of the word context would yield almost perfect performance in the task, but we show that too many features, even linguistic ones, introduce noise and make the task difficult for unsupervised and semi-supervised methods. We also show that purely syntactic features play the biggest role in performance, but that certain semantic and morphological features are also needed.
Archive | 2013
Antti Arppe; Weifeng Han; John Newman
Before carrying out the statistical analyses, we need to invoke the polytomous package to make it available within R, having installed the package earlier. As subsequent preliminary steps, we load in the shanghainese data frame, and then take a look at its composition, scrutinizing the first six lines (output length by default for the function head) and the overall content of the data frame with the summary method: