Tanja Samardzic | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Tanja Samardzic is active.

Explore More

Publication

Featured researches published by Tanja Samardzic.

international conference on computational linguistics | 2014

Part-of-Speech Tag Disambiguation by Cross-Linguistic Majority Vote

Noëmi Aepli; Ruprecht von Waldenfels; Tanja Samardzic

In this paper, we present an approach to developing resources for a low-resource language, taking advantage of the fact that it is closely related to languages with more resources. In particular, we test our approach on Macedonian, which lacks tools for natural language processing as well as data in order to build such tools. We improve the Macedonian training set for supervised part-ofspeech tagging by transferring available manual annotations from a number of similar languages. Our approach is based on multilingual parallel corpora, automatic word alignment, and a set of rules (majority vote). The performance of a tagger trained on the improved data set of 88% accuracy is significantly better than the baseline of 76%. It can serve as a stepping stone for further improvement of resources for Macedonian. The proposed approach is entirely automatic and it can be easily adapted to other language in similar circumstances.

conference on computational natural language learning | 2017

Neural Sequence-to-sequence Learning of Internal Word Structure.

Tatyana Ruzsics; Tanja Samardzic

Learning internal word structure has recently been recognized as an important step in various multilingual processing tasks and in theoretical language comparison. In this paper, we present a neural encoder-decoder model for learning canonical morphological segmentation. Our model combines character-level sequence-to-sequence transformation with a language model over canonical segments. We obtain up to 4% improvement over a strong character-level encoder-decoder baseline for three languages. Our model outperforms the previous state-of-the-art for two languages, while eliminating the need for external resources such as large dictionaries. Finally, by comparing the performance of encoder-decoder and classical statistical machine translation systems trained with and without corpus counts, we show that including corpus counts is beneficial to both approaches.

sighum workshop on language technology for cultural heritage social sciences and humanities | 2015

Automatic interlinear glossing as two-level sequence classification

Tanja Samardzic; Robert Schikowski; Sabine Stoll

Interlinear glossing is a type of annotation of morphosyntactic categories and crosslinguistic lexical correspondences that allows linguists to analyse sentences in languages that they do not necessarily speak. Automatising this annotation is necessary in order to provide glossed corpora big enough to be used for quantitative studies. In this paper, we present experiments on the automatic glossing of Chintang. We decompose the task of glossing into steps suitable for statistical processing. We first perform grammatical glossing as standard supervised part-of-speech tagging. We then add lexical glosses from a stand-off dictionary applying context disambiguation in a similar way to word lemmatisation. We obtain the highest accuracy score of 96% for grammatical and 94% for lexi-

conference of the european chapter of the association for computational linguistics | 2014

Likelihood of External Causation in the Structure of Events

Tanja Samardzic; Paola Merlo

This article addresses the causal structure of events described by verbs: whether an event happens spontaneously or it is caused by an external causer. We automatically estimate the likelihood of external causation of events based on the distribution of causative and anticausative uses of verbs in the causative alternation. We train a Bayesian model and test it on a monolingual and on a bilingual input. The performance is evaluated against an independent scale of likelihood of external causation based on typological data. The accuracy of a two-way classification is 85% in both monolingual and bilingual setting. On the task of a three-way classification, the score is 61% in the monolingual setting and 69% in the bilingual setting.

linguistic annotation workshop | 2010