Jan Šnajder | Researchain


Publication

Featured research published by Jan Šnajder.

Meeting of the Association for Computational Linguistics | 2014

Back up your Stance: Recognizing Arguments in Online Discussions

Filip Boltuzic; Jan Šnajder

In online discussions, users often back up their stance with arguments. Their arguments are often vague, implicit, and poorly worded, yet they provide valuable insights into reasons underpinning users’ opinions. In this paper, we make a first step towards argument-based opinion mining from online discussions and introduce a new task of argument recognition. We match user-created comments to a set of predefined topic-based arguments, which can be either attacked or supported in the comment. We present a manually annotated corpus for argument recognition in online discussions. We describe a supervised model based on comment-argument similarity and entailment features. Depending on problem formulation, model performance ranges from 70.5% to 81.8% F1-score, and decreases only marginally when applied to an unseen topic.

Information Processing and Management | 2008

Automatic acquisition of inflectional lexica for morphological normalisation

Jan Šnajder; B. Dalbelo Bašić; Marko Tadić

Due to natural language morphology, words can take on various morphological forms. Morphological normalisation - often used in information retrieval and text mining systems - conflates morphological variants of a word to a single representative form. In this paper, we describe an approach to lexicon-based inflectional normalisation. This approach is in between stemming and lemmatisation, and is suitable for morphological normalisation of inflectionally complex languages. To eliminate the immense effort required to compile the lexicon by hand, we focus on the problem of acquiring automatically an inflectional morphological lexicon from raw corpora. We propose a convenient and highly expressive morphology representation formalism on which the acquisition procedure is based. Our approach is applied to the morphologically complex Croatian language, but it should be equally applicable to other languages of similar morphological complexity. Experimental results show that our approach can be used to acquire a lexicon whose linguistic quality allows for rather good normalisation performance.

Computer Speech & Language | 2010

Extending lexical association measures for collocation extraction

Sasa Petrovic; Jan Šnajder; Bojana Dalbelo Bašić

Collocations are linguistic phenomena that occur when two or more words appear together more often than by chance and whose meaning often cannot be inferred from the meanings of their parts. As collocations have found many applications in the fields of natural language processing, information retrieval, and text mining, extracting them from large corpora has been the focus of many studies over the past few years. In this paper, we introduce the notion of an extension pattern, a formalization of the idea of extending lexical association measures (AMs) defined for bigrams. An extension pattern provides a measure-independent way of extending AMs for extracting collocations of arbitrary length. We define different extension patterns and compare them on a task of extracting collocations from a newspaper corpus. We show that the stopword-sensitive extension patterns we propose outperform other extensions, which indicates that AMs could benefit by taking into account linguistic information about an n-gram's part-of-speech pattern.
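To make the idea of a lexical association measure for bigrams concrete, here is a minimal sketch using pointwise mutual information (PMI), one commonly used AM. It is an illustration only: the abstract does not say which AMs the paper uses, the function name `bigram_pmi` is hypothetical, and the extension patterns themselves are not reproduced here.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Score every adjacent word pair with pointwise mutual information.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated by relative frequency. Illustrative sketch, not the
    paper's method.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of adjacent pairs

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


if __name__ == "__main__":
    text = "new york is big and new york is old".split()
    ranked = sorted(bigram_pmi(text).items(), key=lambda kv: -kv[1])
    for pair, score in ranked:
        print(pair, round(score, 3))
```

A positive score means the pair co-occurs more often than independence would predict. On very small samples PMI is known to favour rare pairs, so in practice it is usually combined with a frequency threshold.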

Expert Systems With Applications | 2014

Event graphs for information retrieval and multi-document summarization

Goran Glavaš; Jan Šnajder

With the number of documents describing real-world events and event-oriented information needs rapidly growing on a daily basis, the need for efficient retrieval and concise presentation of event-related information is becoming apparent. Nonetheless, the majority of information retrieval and text summarization methods rely on shallow document representations that do not account for the semantics of events. In this article, we present event graphs, a novel event-based document representation model that filters and structures the information about events described in text. To construct the event graphs, we combine machine learning and rule-based models to extract sentence-level event mentions and determine the temporal relations between them. Building on event graphs, we present novel models for information retrieval and multi-document summarization. The information retrieval model measures the similarity between queries and documents by computing graph kernels over event graphs. The extractive multi-document summarization model selects sentences based on the relevance of the individual event mentions and the temporal structure of events. Experimental evaluation shows that our retrieval model significantly outperforms well-established retrieval models on event-oriented test collections, while the summarization model outperforms competitive models from shared multi-document summarization tasks.

North American Chapter of the Association for Computational Linguistics | 2015

Identifying Prominent Arguments in Online Debates Using Semantic Textual Similarity

Filip Boltuzic; Jan Šnajder

Online debates spark argumentative discussions from which generally accepted arguments often emerge. We consider the task of unsupervised identification of prominent arguments in online debates. As a first step, in this paper we perform a cluster analysis using semantic textual similarity to detect similar arguments. We perform a preliminary cluster evaluation and error analysis based on cluster-class matching against a manually labeled dataset.

International Conference on Computational Linguistics | 2013

Exploring coreference uncertainty of generically extracted event mentions

Goran Glavaš; Jan Šnajder

Because event mentions in text may be referentially ambiguous, event coreferentiality often involves uncertainty. In this paper we consider event coreference uncertainty and explore how it is affected by the context. We develop a supervised event coreference resolution model based on the comparison of generically extracted event mentions. We analyse event coreference uncertainty in both human annotations and predictions of the model, and in both within-document and cross-document settings. We frame event coreference as a classification task when full context is available and no uncertainty is involved, and as a regression task in a limited-context setting that involves uncertainty. We show how a rich set of features based on argument comparison can be utilized in both settings. Experimental results on English data suggest that our approach is especially suitable for resolving cross-document event coreference. Results also suggest that modelling human coreference uncertainty in the case of limited context is feasible.

Information Processing and Management | 2008

Language morphology offset: Text classification on a Croatian-English parallel corpus

M. Malenica; T. Šmuc; Jan Šnajder; B. Dalbelo Bašić

We investigate how, and to what extent, morphological complexity of the language influences text classification using support vector machines (SVM). The Croatian-English parallel corpus provides the basis for direct comparison of two languages of radically different morphological complexity. We quantified, compared, and statistically tested the effects of morphological normalisation on SVM classifier performance based on a series of parallel experiments on both languages, carried out over a wide range of feature subset sizes obtained by different feature selection methods, and applying different levels of morphological normalisation. We also quantified the trade-off between feature space size and performance for different levels of morphological normalisation, and compared the results for both languages. Our experiments have shown that the improvements in SVM classifier performance are statistically significant; they are greater for small and medium numbers of features, especially for Croatian, whereas for large numbers of features the improvements are rather small and may be negligible in practice for both languages.

Text, Speech and Dialogue | 2011

Unsupervised topic-oriented keyphrase extraction and its application to Croatian

Josip Saratlija; Jan Šnajder; Bojana Dalbelo Bašić

Labeling documents with keyphrases is a tedious and expensive task. Most approaches to automatic keyphrase extraction rely on supervised learning and require manually labeled training data. In this paper we propose a fully unsupervised keyphrase extraction method, differing from the usual generic keyphrase extractor in the manner the keyphrases are formed. Our method begins by building topically related word clusters from which document keywords are selected, and then expands the selected keywords into syntactically valid keyphrases. We evaluate our approach on a Croatian document collection annotated by eight human experts, taking into account the high subjectivity of the keyphrase extraction task. The performance of the proposed method reaches up to F1 = 44.5%, which is outperformed by human annotators, but comparable to a supervised approach.

Natural Language Engineering | 2015

Construction and Evaluation of Event Graphs

Goran Glavaš; Jan Šnajder

Events play an important role in natural language processing and information retrieval due to numerous event-oriented texts and information needs. Many natural language processing and information retrieval applications could benefit from a structured event-oriented document representation. In this paper, we propose event graphs as a novel way of structuring event-based information from text. Nodes in event graphs represent the individual mentions of events, whereas edges represent the temporal and coreference relations between mentions. Contrary to previous natural language processing research, which has mainly focused on individual event extraction tasks, we describe a complete end-to-end system for event graph extraction from text. Our system is a three-stage pipeline that performs anchor extraction, argument extraction, and relation extraction (temporal relation extraction and event coreference resolution), each at a performance level comparable with the state of the art. We present EvExtra, a large newspaper corpus annotated with event mentions and event graphs, on which we train and evaluate our models. To measure the overall quality of the constructed event graphs, we propose two metrics based on the tensor product between automatically and manually constructed graphs. Finally, we evaluate the overall quality of event graphs with the proposed evaluation metrics and perform a headroom analysis of the system.
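The structure described in this abstract (event mentions as nodes, temporal and coreference relations as edges) could be represented roughly as follows. This is a minimal sketch under assumptions: the class names, field names, and the relation labels `"BEFORE"` and `"COREF"` are hypothetical and are not taken from the paper or from EvExtra.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EventMention:
    """A node: one mention of an event (hypothetical fields)."""
    mention_id: str
    anchor: str            # the word that triggers the event
    arguments: tuple = ()  # e.g. participants, time, location


@dataclass
class EventGraph:
    """Mentions as nodes; labelled relations between mentions as edges."""
    mentions: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_mention(self, mention):
        self.mentions[mention.mention_id] = mention

    def add_relation(self, source_id, target_id, relation):
        # Both endpoints must already be nodes of the graph.
        if source_id not in self.mentions or target_id not in self.mentions:
            raise KeyError("both mentions must be added before relating them")
        self.edges.append((source_id, target_id, relation))

    def relations_of(self, mention_id):
        """Return every edge that touches the given mention."""
        return [e for e in self.edges if mention_id in (e[0], e[1])]


if __name__ == "__main__":
    graph = EventGraph()
    graph.add_mention(EventMention("e1", "elected", ("the council",)))
    graph.add_mention(EventMention("e2", "resigned", ("the mayor",)))
    graph.add_mention(EventMention("e3", "resignation", ("the mayor",)))
    graph.add_relation("e2", "e1", "BEFORE")  # hypothetical temporal label
    graph.add_relation("e2", "e3", "COREF")   # hypothetical coreference label
    print(graph.relations_of("e2"))
```

Storing edges as labelled triples keeps temporal and coreference relations in one list, which is convenient when comparing an automatically built graph with a manually built one edge by edge.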

Information Technology Interfaces | 2005

Computer aided document indexing system

Mladen Kolar; Igor Vukmirović; Bojana Dalbelo Bašić; Jan Šnajder

An enormous number of documents is being produced that have to be stored, searched and accessed. Document indexing represents an efficient way to tackle this problem. Contributing to the document indexing process, we developed the Computer Aided Document Indexing System (CADIS) that applies controlled vocabulary keywords from the EUROVOC thesaurus. The main contribution of this paper is the introduction of the special CADIS internal data structure that copes with the morphological complexity of the Croatian language. The CADIS internal data structure ensures efficient statistical analysis of input documents and quick generation of visual feedback, which helps index documents more quickly, accurately and uniformly than manual indexing.
