Publication


Featured research published by Sanja Štajner.


ACM Transactions on Accessible Computing | 2015

Making It Simplext: Implementation and Evaluation of a Text Simplification System for Spanish

Horacio Saggion; Sanja Štajner; Stefan Bott; Simon Mille; Luz Rello

The way in which a text is written can be a barrier for many people. Automatic text simplification is a natural language processing technology that, when mature, could be used to produce texts that are adapted to the specific needs of particular users. Most research in the area of automatic text simplification has dealt with the English language. In this article, we present results from the Simplext project, which is dedicated to automatic text simplification for Spanish. We present a modular system with dedicated procedures for syntactic and lexical simplification that are grounded on the analysis of a corpus manually simplified for people with special needs. We carried out an automatic evaluation of the system’s output, taking into account the interaction between three different modules dedicated to different simplification aspects. One evaluation is based on readability metrics for Spanish and shows that the system is able to reduce the lexical and syntactic complexity of the texts. We also show, by means of a human evaluation, that sentence meaning is preserved in most cases. Our results, even if our work represents the first automatic text simplification system for Spanish that addresses different linguistic aspects, are comparable to the state of the art in English Automatic Text Simplification.


International Conference on Computational Linguistics | 2013

Automatic Text Simplification in Spanish: A Comparative Evaluation of Complementing Modules

Sanja Štajner; Stefan Bott; Susana Bautista; Horacio Saggion

In this paper we present two components of an automatic text simplification system for Spanish, aimed at making news articles more accessible to readers with cognitive disabilities. Our system in its current state consists of a rule-based lexical transformation component and a module for syntactic simplification. We evaluate the two components separately and as a whole, with a view to determining the level of simplification and the preservation of meaning and grammaticality. In order to test the readability level pre- and post-simplification, we apply seven readability measures for Spanish to four sets of randomly chosen news articles: the original texts, the output obtained after lexical transformations, the syntactic simplification output, and the output of both system components. To test whether the simplification output is grammatically correct and semantically adequate, we ask human annotators to grade pairs of original and simplified sentences according to these two criteria. Our results suggest that both components of our system produce simpler output when compared to the original, and that grammaticality and meaning preservation are positively rated by the annotators.


International Joint Conference on Natural Language Processing | 2015

Simplifying Lexical Simplification: Do We Need Simplified Corpora?

Goran Glavaš; Sanja Štajner

Simplification of lexically complex texts, by replacing complex words with their simpler synonyms, helps non-native speakers, children, and language-impaired people understand text better. Recent lexical simplification methods rely on manually simplified corpora, which are expensive and time-consuming to build. We present an unsupervised approach to lexical simplification that makes use of the most recent word vector representations and requires only regular corpora. Results of both automated and human evaluation show that our simple method is as effective as systems that rely on simplified corpora.
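The idea of picking simpler substitutes from word vectors alone can be illustrated with a minimal sketch. This is not the authors' system: the toy vectors, the frequency table, the `min_sim` threshold, and the use of corpus frequency as a proxy for simplicity are all assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings and corpus frequencies (illustrative values only).
TOY_VECTORS = {
    "purchase": [0.9, 0.1, 0.0],
    "buy": [0.8, 0.2, 0.1],
    "acquire": [0.85, 0.05, 0.2],
    "banana": [0.0, 0.1, 0.9],
}
TOY_FREQ = {"purchase": 50, "buy": 900, "acquire": 40, "banana": 300}

def simpler_substitute(word, vectors, freq, min_sim=0.8):
    """Return the most frequent word that is close to `word` in vector space
    and more frequent than it, or None if no candidate qualifies.

    Assumption: higher corpus frequency stands in for "simpler".
    """
    target = vectors[word]
    candidates = [
        w for w in vectors
        if w != word
        and cosine(target, vectors[w]) >= min_sim
        and freq[w] > freq[word]
    ]
    return max(candidates, key=lambda w: freq[w], default=None)
```

With the toy data above, `simpler_substitute("purchase", TOY_VECTORS, TOY_FREQ)` selects "buy", while a word that is already the most frequent among its neighbours is left unchanged (the function returns `None`).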


Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR) | 2014

One Step Closer to Automatic Evaluation of Text Simplification Systems

Sanja Štajner; Ruslan Mitkov; Horacio Saggion

This study explores the possibility of replacing the costly and time-consuming human evaluation of the grammaticality and meaning preservation of the output of text simplification (TS) systems with some automatic measures. The focus is on six widely used machine translation (MT) evaluation metrics and their correlation with human judgements of grammaticality and meaning preservation in text snippets. As the results show a significant correlation between them, we go further and try to classify simplified sentences into: (1) those which are acceptable; (2) those which need minimal post-editing; and (3) those which should be discarded. The preliminary results, reported in this paper, are promising.


Text, Speech and Dialogue | 2013

Stylistic Changes for Temporal Text Classification

Sanja Štajner; Marcos Zampieri

This paper investigates stylistic changes in a set of Portuguese historical texts ranging from the 17th to the early 20th century and presents a supervised method to classify them per century. Four stylistic features – average sentence length (ASL), average word length (AWL), lexical density (LD), and lexical richness (LR) – were automatically extracted for each sub-corpus. The initial analysis of diachronic changes in these four features revealed that the texts written in the 17th and 18th centuries have similar AWL, LD and LR, which differ significantly from those in the texts written in the 19th and 20th centuries. This information was later used in automatic classification of texts per century, leading to an F-Measure of 0.92.
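The four stylistic features named in this abstract can be approximated in a few lines. The sketch below rests on assumptions: the abstract does not give the exact definitions, so lexical density is taken here as the share of words outside a small hypothetical function-word list (a real study would rely on part-of-speech tagging), and lexical richness is taken as the type-token ratio.

```python
import re

# Hypothetical, tiny English function-word list used only for illustration;
# it stands in for proper part-of-speech tagging of the target language.
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "was", "on"}

def stylistic_features(text):
    """Compute rough ASL, AWL, LD and LR values for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    content_words = [w for w in words if w not in FUNCTION_WORDS]
    return {
        "ASL": len(words) / len(sentences),               # words per sentence
        "AWL": sum(len(w) for w in words) / len(words),   # characters per word
        "LD": len(content_words) / len(words),            # assumed: content-word share
        "LR": len(set(words)) / len(words),               # assumed: type-token ratio
    }
```

Computed per sub-corpus, values like these could serve as input features for a classifier that assigns texts to a century, which is the setting the abstract describes.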


International Joint Conference on Natural Language Processing | 2015

A Deeper Exploration of the Standard PB-SMT Approach to Text Simplification and its Evaluation

Sanja Štajner; Hannah Béchara; Horacio Saggion

Paper presented at the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, held from 26 to 31 July 2015 in Beijing, China.


Proceedings of the Workshop on Automatic Text Simplification - Methods and Applications in the Multilingual Society (ATS-MA 2014) | 2014

The Fewer, the Better? A Contrastive Study about Ways to Simplify

Ruslan Mitkov; Sanja Štajner

Simplified texts play an important role in providing accessible and easy-to-understand information for a whole range of users who, due to linguistic, developmental or social barriers, would have difficulty in understanding materials which are not adapted and/or simplified. However, the production of simplified texts can be a time-consuming and labour-intensive task. In this paper we show that the employment of a short list of simple simplification rules could result in texts of comparable readability to those written as a result of applying a long list of more fine-grained rules. We also prove that the simplification process based on the short list of simple rules is more time efficient and consistent.

1 Rationale

Simplified texts play an important role in providing accessible and easy-to-understand information for a whole range of users who, due to linguistic, developmental or social barriers, would have difficulty in understanding materials which are not adapted and/or simplified. Such users include but are not limited to people with insufficient knowledge of the language in which the document is written, people with specific language disorders and people with low literacy levels. However, while the production of simplified texts is certainly an indispensable activity, it often proves to be a time-consuming and labour-intensive task. Various methodologies and simplification strategies have been developed which are often employed by authors to simplify original texts. Most methods involve a high number of rules which could result not only in the simplification task being time-consuming but also in the authors getting confused as to which rules to apply. We hypothesise that it is possible to achieve a comparable simplification effect by using a small set of simple rules similar to the ones used in Controlled Languages which, in addition, enhances the productivity and reliability of the simplification process.
In order to test our hypothesis we conduct the following experiments. First, we propose six Controlled Language-inspired rules which we believe are simple and easy enough for writers of simplified texts to understand and apply. We then ask two writers to apply these rules to a selection of newswire texts and also to produce simplified versions of these texts using the 28 rules used in the Simplext project (Saggion et al., 2011). Both sets of texts are compared in terms of readability. In both simplification tasks the time efficiency is assessed and the inter-annotator agreement is evaluated. In an additional experiment, we seek to investigate the possible effect of familiarisation in simplification. In this experiment a third writer simplifies a sample of the texts used in the previous experiments by applying each set of rules in a mixed sequence pattern which does not offer any familiarisation nor the advantage of one set of rules over the other. Using these samples, three-way inter-annotator agreement is reported. The rest of the paper is structured as follows. Section 2 outlines related work on simplification rules. Section 3 introduces our proposal for a small set of easy-to-understand and easy-to-apply rules and contrasts them with the longer and more elaborate rules employed in the Simplext proposal. Section 4 details the experiments conducted in order to validate or refute our hypothesis, and outlines the data used for the experiments. Section 5 presents and discusses the results, while the last section of the paper summarises the main conclusions of this study.


Archive | 2015

Simple or Not Simple? A Readability Question

Sanja Štajner; Ruslan Mitkov; Gloria Corpas Pastor

Text Simplification (TS) has taken off as an important Natural Language Processing (NLP) application which promises to offer a significant societal impact in that it can be employed to the benefit of users with limited language comprehension skills such as children, foreigners who do not have a good command of a language, and readers struggling with a language disability. With the recent emergence of various TS systems, the question we are faced with is how to automatically evaluate their performance given that access to target users might be difficult. This chapter addresses one aspect of this issue by exploring whether existing readability formulae could be applied to assess the level of simplification offered by a TS system. It focuses on three readability indices for Spanish. The indices are first adapted in a way that allows them to be computed automatically and then applied to two corpora of original and manually simplified texts. The first corpus has been compiled as part of the Simplext project targeting people with Down syndrome, and the second corpus as part of the FIRST project, where the users are people with autism spectrum disorder. The experiments show that there is a significant correlation between each of the readability indices and eighteen linguistically motivated features which might be seen as reading obstacles for various target populations, thus indicating the possibility of using those indices as a measure of the degree of simplification achieved by TS systems. Various ways they can be used in TS are further illustrated by comparing their values when applied to four different corpora.


Recent Advances in Natural Language Processing | 2017

Multilingual and Cross-Lingual Complex Word Identification

Seid Muhie Yimam; Sanja Štajner; Martin Riedl; Chris Biemann

Complex Word Identification (CWI) is an important task in lexical simplification and text accessibility. Due to the lack of CWI datasets, previous works largely depend on Simple English Wikipedia and edit histories for obtaining ‘gold standard’ annotations, which are of doubtable quality, and limited only to English. We collect complex words/phrases (CP) for English, German and Spanish, annotated by both native and non-native speakers, and propose language independent features that can be used to train multilingual and cross-lingual CWI models. We show that the performance of cross-lingual CWI systems (using a model trained on one language and applying it on the other languages) is comparable to the performance of monolingual CWI systems.


Meeting of the Association for Computational Linguistics | 2017

Sentence Alignment Methods for Improving Text Simplification Systems

Sanja Štajner; Marc Franco-Salvador; Simone Paolo Ponzetto; Paolo Rosso; Heiner Stuckenschmidt

We provide several methods for sentence alignment of texts with different complexity levels. Using the best of them, we sentence-align the Newsela corpora, thus providing large training materials for automatic text simplification (ATS) systems. We show that using this dataset, even the standard phrase-based statistical machine translation models for ATS can outperform the state-of-the-art ATS systems.

Collaboration


Dive into Sanja Štajner's collaborations.

Top Co-Authors

Ruslan Mitkov

University of Wolverhampton

Richard Evans

University of Wolverhampton
