Artur Šilić
University of Zagreb
Publication
Featured research published by Artur Šilić.
international conference on knowledge based and intelligent information and engineering systems | 2010
Artur Šilić; Bojana Dalbelo Bašić
This work presents a survey of methods that visualize text streams. Existing methods are classified and compared from the aspect of the visualization process. We introduce new aspects of method comparison: data type, text representation, and the temporal drawing approach. The subjectivity of visualization is described, and evaluation methodologies are explained. Related research areas are discussed and some future trends in the field are anticipated.
portuguese conference on artificial intelligence | 2007
Artur Šilić; Jean-Hugues Chauchat; Bojana Dalbelo Bašić; Annie Morin
In this paper we compare n-grams and morphological normalization, two inherently different text-preprocessing methods, used for text classification on a Croatian-English parallel corpus. Our approach to comparing different text preprocessing techniques is based on measuring computational performance (execution time and memory consumption), as well as classification performance. We show that although n-grams achieve classifier performance comparable to traditional word-based feature extraction and can act as a substitute for morphological normalization, they are computationally much more demanding.
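As an illustration of why n-grams can substitute for morphological normalization, here is a minimal sketch that assumes character n-grams of length 3. The function name `char_ngrams` and the example words are hypothetical and are not taken from the paper.

```python
from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))

# Two inflected forms of the same Croatian noun (illustrative choice).
base_form = char_ngrams("knjiga")
plural_form = char_ngrams("knjigama")

# N-grams that both forms have in common; no lemmatizer is needed to find them.
shared = set(base_form) & set(plural_form)
```

Both forms share most of their trigrams, so a classifier sees overlapping features without any language-specific normalization. The cost is a larger feature space, which is consistent with the higher computational demands reported in the abstract.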
information technology interfaces | 2007
Artur Šilić; Frane Šarić; Bojana Dalbelo Bašić; Jan Šnajder
The purpose of the TMT (Text Mining Tools) library is to enable the use of modern text-mining techniques for natural languages on cross-platform environments that can be applied equally well to research and development of end-user text-mining applications. The paper is structured as follows. Section 2 discusses the related work. Section 3 describes the functionalities of the library, whereas Section 4 describes its usage. Section 5 concludes the paper.
international conference on knowledge based and intelligent information and engineering systems | 2008
Artur Šilić; Marie-Francine Moens; Lovro Žmak; Bojana Dalbelo Bašić
In this work, we jointly apply several text mining methods to a corpus of legal documents in order to compare the separation quality of two inherently different document classification schemes. The classification schemes are compared with the clusters produced by the K-means algorithm. In the future, we believe that our comparison method will be coupled with semi-supervised and active learning techniques. Also, this paper presents the idea of combining K-means and Principal Component Analysis for cluster visualization. The described idea allows calculations to be performed in a reasonable amount of CPU time.
international conference on computational linguistics | 2012
Artur Šilić; Bojana Dalbelo Bašić
Concept drift has regained research interest during recent years as many applications use data sources that are changing over time. We study the classification task using logistic regression on a large news collection of 248K texts during a period of seven years. We present extrinsic methods of concept drift detection and quantification using training set formation with different windowing techniques. We characterize concept drift on a seven-year-long Le Monde news corpus and show the overestimation of classifier performance if it is neglected. We lay out paths for future work where we plan to refine extrinsic characterization methods and investigate the drifting of learning parameters when few examples are available.
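The abstract does not spell out the windowing techniques, so the following is only a sketch of two common choices for training set formation: a fixed-width sliding window and a growing window. The function names, the yearly granularity, and the year labels are assumptions made for illustration.

```python
from typing import List, Sequence

def sliding_window(periods: Sequence[str], test_index: int, width: int) -> List[str]:
    """Train only on the `width` periods immediately before the test period."""
    start = max(0, test_index - width)
    return list(periods[start:test_index])

def growing_window(periods: Sequence[str], test_index: int) -> List[str]:
    """Train on every period that precedes the test period."""
    return list(periods[:test_index])

# Hypothetical labels for a seven-year corpus split by year.
years = ["year1", "year2", "year3", "year4", "year5", "year6", "year7"]

recent_only = sliding_window(years, test_index=5, width=2)
all_history = growing_window(years, test_index=5)
```

Comparing a classifier trained on `recent_only` with one trained on `all_history`, both evaluated on the same later period, is one extrinsic way to see whether older data still helps or whether drift has made it misleading.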
international conference on knowledge based and intelligent information and engineering systems | 2010
Tomislav Reicher; Ivan Krišto; Igor Belša; Artur Šilić
In this work we investigate the use of various character, lexical, and syntactic level features and their combinations in automatic authorship attribution. Since the majority of text representation features are language specific, we examine their application on texts written in the Croatian language. Our work differs from similar work in at least three aspects. Firstly, we use a slightly different set of features than previously proposed. Secondly, we use four different data sets and compare the same features across those data sets to draw stronger conclusions. The data sets that we use consist of articles, blogs, books, and forum posts written in the Croatian language. Finally, we employ a classification method based on a strong classifier. We use Support Vector Machines to learn classifiers which achieve excellent results for longer texts: 91% accuracy and F1 measure for blogs, 93% accuracy and F1 for articles, and 99% accuracy and F1 for books. Experiments conducted on forum posts show that more complex features need to be employed for shorter texts.
Lecture Notes in Computer Science | 2010
Tomislav Reicher; Ivan Krišto; Igor Belša; Artur Šilić
In this work we investigate the use of various character, lexical, and syntactic level features and their combinations in automatic authorship attribution. Since the majority of text representation features are language specific, we examine their application on texts written in the Croatian language. Our work differs from similar work in at least three aspects. Firstly, we use a slightly different set of features than previously proposed. Secondly, we use four different data sets and compare the same features across those data sets to draw stronger conclusions. The data sets that we use consist of articles, blogs, books, and forum posts written in the Croatian language. Finally, we employ a classification method based on a strong classifier. We use Support Vector Machines to learn classifiers which achieve excellent results for longer texts: 91% accuracy and F1 measure for blogs, 93% accuracy and F1 for articles, and 99% accuracy and F1 for books. Experiments conducted on forum posts show that more complex features need to be employed for shorter texts.
Springer Lecture Notes in Computer Science | 2012
Artur Šilić; Bojana Dalbelo Bašić
Concept drift has regained research interest during recent years as many applications use data sources that are changing over time. We study the classification task using logistic regression on a large news collection of 248K texts during a period of seven years. We present extrinsic methods of concept drift detection and quantification using training set formation with different windowing techniques. On our corpus, we characterize concept drift and show the overestimation of classifier performance if it is neglected. We lay out paths for future work where we plan to refine extrinsic characterization methods and investigate the drifting of learning parameters when few examples are available.
Proceedings of the Eighth Language Technologies Conference | 2012
Goran Glavaš; Mladen Karan; Frane Šarić; Jan Šnajder; Jure Mijić; Artur Šilić; Bojana Dalbelo Bašić
Informatica (lithuanian Academy of Sciences) | 2013
Mladen Karan; Goran Glavaš; Frane Šarić; Jan Šnajder; Jure Mijić; Artur Šilić; Bojana Dalbelo Bašić