Tanja Säily
University of Helsinki
Publication
Featured research published by Tanja Säily.
Literary and Linguistic Computing | 2016
Jefrey Lijffijt; Terttu Nevalainen; Tanja Säily; Panagiotis Papapetrou; Kai Puolamäki; Heikki Mannila
Finding out whether a word occurs significantly more often in one text or corpus than in another is an important question in analysing corpora. As noted by Kilgarriff (Language is never, ever, ever, random, Corpus Linguistics and Linguistic Theory, 2005; 1(2): 263–76), the use of the χ² and log-likelihood ratio tests is problematic in this context, as they are based on the assumption that all samples are statistically independent of each other. However, words within a text are not independent. As pointed out in Kilgarriff (Comparing corpora, International Journal of Corpus Linguistics, 2001; 6(1): 1–37) and Paquot and Bestgen (Distinctive words in academic writing: a comparison of three statistical tests for keyword extraction. In Jucker, A., Schreier, D., and Hundt, M. (eds), Corpora: Pragmatics and Discourse. Amsterdam: Rodopi, 2009, pp. 247–69), it is possible to represent the data differently and employ other tests, such that we assume independence at the level of texts rather than individual words. This allows us to account for the distribution of words within a corpus. In this article we compare the significance estimates of various statistical tests in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus. We find that the choice of the test, and hence data representation, matters. We conclude that significance testing can be used to find consequential differences between corpora, but that assuming independence between all words may lead to overestimating the significance of the observed differences, especially for poorly dispersed words. We recommend the use of the t-test, Wilcoxon rank-sum test, or bootstrap test for comparing word frequencies across corpora.
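The text-level representation described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names and the toy corpora are hypothetical, and only Welch's t statistic is computed (turning it into a p-value would need a t-distribution, which the Python standard library does not provide). Each text contributes one relative frequency, so independence is assumed between texts rather than between words.

```python
import math
from statistics import mean, variance

def relative_frequencies(texts, word):
    """One value per text: occurrences of `word` divided by text length."""
    return [text.count(word) / len(text) for text in texts if text]

def welch_t(sample_a, sample_b):
    """Welch's t statistic and its approximate degrees of freedom."""
    va = variance(sample_a) / len(sample_a)
    vb = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (
        va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1)
    )
    return t, df

# Hypothetical toy corpora: each text is a list of tokens.
corpus_a = [
    ["the", "cat", "sat", "the"],
    ["the", "dog", "ran", "far"],
    ["a", "the", "the", "the"],
]
corpus_b = [
    ["a", "dog", "ran", "far"],
    ["the", "cat", "sat", "down"],
    ["a", "cat", "sat", "down"],
]

freqs_a = relative_frequencies(corpus_a, "the")
freqs_b = relative_frequencies(corpus_b, "the")
t_stat, dof = welch_t(freqs_a, freqs_b)
print(freqs_a, freqs_b, t_stat, dof)
```

Because the unit of observation is the text, a word that is frequent in only one text inflates the variance instead of the apparent significance, which is the behaviour the abstract argues for with poorly dispersed words.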
Literary and Linguistic Computing | 2011
Tanja Säily; Terttu Nevalainen; Harri Siirtola
Many corpus linguists make the tacit assumption that part-of-speech frequencies remain constant during the period of observation. In this article, we will consider two related issues: (1) the reliability of part-of-speech tagging in a diachronic corpus, and (2) shifts in tag ratios over time. The purpose is both to serve the users of the corpus by making them aware of potential problems, and to obtain linguistically interesting results. We use noun and pronoun ratios as diagnostics indicative of opposing stylistic tendencies, but we are also interested in testing whether any observed variation in the ratios could be accounted for in sociolinguistic terms. The material for our study is provided by the Parsed Corpus of Early English Correspondence (PCEEC), which consists of 2.2 million running words covering the period 1415–1681. The part-of-speech tagging of the PCEEC has its problems, which we test by reannotating the corpus according to our own principles and comparing the two annotations. While there are quite a few changes, the mean percentage of change is very small for both nouns and pronouns. As for variation over time, the mean frequency of nouns declines somewhat, while the mean frequency of pronouns fluctuates with no clear diachronic trend. However, women consistently use more pronouns than men, while men use more nouns than women. More fine-grained distinctions are needed to uncover further regularities and possible reasons for this variation.
28th Annual Conference of the International Computer Archive for Modern and Medieval English (ICAME) | 2009
Tanja Säily; Jukka Suomela
This work is a case study of applying nonparametric statistical methods to corpus data. We show how to use ideas from permutation testing to answer linguistic questions related to morphological productivity and type richness. In particular, we study the use of the suffixes -ity and -ness in the 17th-century part of the Corpus of Early English Correspondence within the framework of historical sociolinguistics. Our hypothesis is that the productivity of -ity, as measured by type counts, is significantly low in letters written by women. To test such hypotheses, and to facilitate exploratory data analysis, we take the approach of computing accumulation curves for types and hapax legomena. We have developed an open source computer program which uses Monte Carlo sampling to compute the upper and lower bounds of these curves for one or more levels of statistical significance. By comparing the type accumulation from women’s letters with the bounds, we are able to confirm our hypothesis.
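The permutation idea in this abstract can be sketched roughly as follows. This is a simplified illustration, not the open source program the authors describe: it compares a subcorpus with random samples containing the same number of texts, whereas the abstract's method works with accumulation curves and their bounds. The function names and the toy data are hypothetical.

```python
import random

def type_count(texts):
    """Number of distinct word types across a collection of texts."""
    return len(set().union(*texts)) if texts else 0

def permutation_p_low(all_texts, subcorpus_indices, n_perm=2000, seed=0):
    """Monte Carlo estimate of how often a random sample of the same
    number of texts has as few (or fewer) types as the subcorpus."""
    rng = random.Random(seed)
    k = len(subcorpus_indices)
    observed = type_count([all_texts[i] for i in subcorpus_indices])
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(all_texts, k)
        if type_count(sample) <= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value above zero.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical toy data: each text lists the -ity words found in one letter.
letters = [
    ["ability"],
    ["ability"],
    ["ability", "purity"],
    ["vanity", "rarity"],
    ["clarity"],
    ["ability"],
]
subcorpus = [0, 1, 5]  # indices of the letters in the group of interest

observed_types, p_value = permutation_p_low(letters, subcorpus)
print(observed_types, p_value)
```

Sampling whole texts rather than individual words keeps each writer's vocabulary together, so the null distribution reflects how types cluster within letters.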
2016 20th International Conference Information Visualisation (IV) | 2016
Harri Siirtola; Poika Isokoski; Tanja Säily; Terttu Nevalainen
Digitalization is changing how research is carried out in all areas of science. The humanities are no exception: materials that used to be hand-written or printed on paper are increasingly available in digital form. This development is changing how scholars interact with their material. We address the problem of interactive text visualization in the context of sociolinguistic language study. When a scholar is reading and analyzing text on a computer screen instead of on paper, we can support this by providing a dashboard for reading, and by creating visualizations of the text structure, variation, and change. We have designed and developed a software tool called Text Variation Explorer (TVE) for sociolinguistic language study. It is based on interactive visualization with a direct manipulation user interface, and is aimed at exploratory corpus linguistics. The TVE software tool has proven to be useful in supporting the study of language variation and change in its social contexts, or sociolinguistics. It is, to a certain degree, language-independent, and generic enough to be useful in other linguistic contexts as well. We are now in the process of designing and implementing the next iteration of TVE. We present the lessons learned from the first version, discuss the old and the new design, and welcome feedback from the communities involved.
2017 21st International Conference Information Visualisation (IV) | 2017
Harri Siirtola; Tanja Säily; Terttu Nevalainen
Principal Component Analysis (PCA) is an established and efficient method for finding structure in a multidimensional data set. PCA is based on orthogonal transformations that convert a set of multidimensional values into linearly uncorrelated variables called principal components. The main disadvantage to the PCA approach is that the procedure and outcome are often difficult to understand. The connection between input and output can be puzzling, a small change in input can yield a completely different output, and the user may often wonder if the PCA is doing the right thing. We introduce a user interface that makes the procedure and result easier to understand. We have implemented an interactive PCA view in our text visualization tool called Text Variation Explorer. It allows the user to interactively study the result of PCA, and provides a better understanding of the process. We believe that although we are addressing the problem of interactive principal component analysis in the context of text visualization, these ideas should be useful in other contexts as well.
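As a rough illustration of the transformation this abstract refers to, the sketch below finds only the first principal component, using power iteration on the covariance matrix. It is not how Text Variation Explorer computes PCA (the abstract does not say); the function name and the toy data are hypothetical, and a production analysis would normally use a linear algebra library.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return the leading principal direction and the projected scores."""
    n, d = len(rows), len(rows[0])
    # Center each variable on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeated multiplication converges to the leading
    # eigenvector unless the start vector is orthogonal to it.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Scores: coordinates of each observation along the component.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical toy data: two perfectly correlated variables.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(data)
print(direction, scores)
```

Even in this small form, the output is a direction and a set of scores rather than anything resembling the input table, which hints at why the abstract describes the link between input and output as puzzling.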
ICAME Journal | 2016
Terttu Nevalainen; Turo Vartiainen; Tanja Säily; Joonas Kesäniemi; Agata Dominowska; Emily Öhman
We introduce the Language Change Database (LCD), which provides access to the results of previous corpus-based research dealing with change in the English language. The LCD will be published on an open-access linked data platform that will allow users to enter information about their own publications into the database and to conduct searches based on linguistic and extralinguistic parameters. Both metadata and numerical data from the original publications will be available for download, enabling systematic reviews, meta-analyses, replication studies and statistical modelling of language change. The LCD will be of interest to scholars, teachers and students of English.
Corpus Linguistics and Linguistic Theory | 2016
Tanja Säily
This paper presents ongoing work on Säily and Suomela’s (2009) method of comparing type frequencies across subcorpora. The method is here used to study variation in the productivity of the suffixes -ness and -ity in the eighteenth-century sections of the Corpora of Early English Correspondence and of the Old Bailey Corpus (OBC). Unlike the OBC, the eighteenth-century section of the letter corpora differs from previously studied materials in that there is no significant gender difference in the productivity of -ity. The study raises methodological issues involving periodization, multiple hypothesis testing, and the need for an interactive tool. Several improvements have been implemented in a new version of our software.
Proceedings of the first international workshop on Intelligent visual interfaces for text analysis | 2010
Harri Siirtola; Kari-Jouko Räihä; Tanja Säily; Terttu Nevalainen
In this paper, linguists and researchers of visual data analysis outline the requirements and benefits of an information visualization approach for corpus linguistics. Over the years, the information visualization community has come up with a number of methods to visualize text, but the majority of these techniques do not serve the needs of the linguistic community. This is evident in the over-simplification of the linguistic problems and is generally caused by a poor understanding of the domain. We started a joint research effort with linguists, data miners, and information visualizers to design and produce better data analysis tools for corpus linguistics. This work is still in its early stages, but we have a shared vision of what needs to be done.
International Journal of Corpus Linguistics | 2014
Harri Siirtola; Tanja Säily; Terttu Nevalainen; Kari-Jouko Räihä
Studies in Variation, Contacts and Change in English | 2012
Jefrey Lijffijt; Tanja Säily; Terttu Nevalainen