Mladen Karan
University of Zagreb
Publication
Featured research published by Mladen Karan.
north american chapter of the association for computational linguistics | 2016
Martin Tutek; Ivan Sekulic; Paula Gombar; Ivan Paljak; Filip Culinovic; Filip Boltuzic; Mladen Karan; Domagoj Alagic; Jan Šnajder
This paper describes our system for stance detection in tweets, submitted to SemEval 2016 Task 6A. The system uses an ensemble of learning algorithms, fine-tuned using a genetic algorithm. We experiment with various off-the-shelf classifiers and build our model using standard lexical features and a number of task-specific features. Our system ranked 3rd among the 19 systems submitted to this task.
text speech and dialogue | 2016
Mladen Karan; Jan Šnajder
Frequently asked question (FAQ) collections are commonly used across the web to provide information about a specific domain (e.g., services of a company). Compared with traditional information retrieval, FAQ retrieval introduces additional challenges, the main ones being (1) the brevity of FAQ texts and (2) the need for topic-specific knowledge. The primary contribution of our work is a new domain-specific FAQ collection, providing a large number of queries with manually annotated relevance judgments. On this collection, we test several unsupervised baseline models, including both count-based and semantic-embedding-based models, as well as a combined model. We evaluate the performance across different setups and identify potential avenues for improvement. The collection constitutes a solid basis for research in supervised machine-learning-based FAQ retrieval.
north american chapter of the association for computational linguistics | 2015
Mladen Karan; Goran Glavaš; Jan Šnajder; Bojana Dalbelo Bašić; Ivan Vulić; Marie-Francine Moens
When tweeting on a topic, Twitter users often post messages that convey the same or similar meaning. We describe TweetingJay, a system for detecting paraphrases and semantic similarity of tweets, with which we participated in Task 1 of SemEval 2015. TweetingJay uses a supervised model that combines semantic overlap and word alignment features, previously shown to be effective for detecting semantic textual similarity. TweetingJay reached an F1-score of 65.9% and ranked fourth among the 18 participating systems. We additionally provide an analysis of the dataset and point to some peculiarities of the evaluation setup.
sighum workshop on language technology for cultural heritage social sciences and humanities | 2016
Mladen Karan; Jan Šnajder; Daniela Sirinic; Goran Glavaš
Policy agenda research is concerned with measuring policymaker activities. Topic classification has proven a valuable tool for policy agenda research. However, manual topic coding is extremely costly and time-consuming. Supervised topic classification offers a cost-effective and reliable alternative, yet it introduces new challenges, the most significant of which are the training set coding, classifier design, and accuracy-efficiency trade-off. In this work, we address these challenges in the context of the recently launched Croatian Policy Agendas project. We describe a new policy agenda dataset, explore the many system design choices, and report on the insights gained. Our best-performing model reaches F1-scores of 77% and 68% for major topics and subtopics, respectively.
cross language evaluation forum | 2015
Mladen Karan; Jan Šnajder
Frequently asked question (FAQ) knowledge bases are a convenient way to organize domain-specific information. However, FAQ retrieval is challenging because the documents are short and the vocabulary is domain-specific, giving rise to the lexical gap problem. To address this problem, in this paper we consider rule-based query expansion (QE) for domain-specific FAQ retrieval. We build a small test collection and evaluate the potential of QE rules. While we observe some improvement for difficult queries, our results suggest that the potential of manual rule compilation is limited.
international convention on information and communication technology, electronics and microelectronics | 2014
Mladen Karan; Damir Pintar; Zoran Skočir; Mihaela Vranić; Adrian Alajkovic; Jelena Milojevic; Marina Plesa
Demand forecasting plays a very important role in retail business. Retail information systems commonly store large amounts of data, which are subsequently used by sophisticated data mining tools for building forecasting models. The quality of these models is usually measured primarily by their predictive accuracy, followed by other measures such as average underestimate and overestimate costs. Even though the choice of data mining algorithm is usually paramount, training set cleansing and preparation have a significant influence on final model performance. This article discusses and analyses the impact of training set preparation and tailoring on the performance of a final forecasting model, using a real-world example from the retail industry.
Expert Systems With Applications | 2018
Mladen Karan; Jan Šnajder
We study the potential of supervised learning to rank for FAQ retrieval. Supervised models offer performance improvements for this task. We explored low-effort paraphrase-based data labeling strategies. Paraphrase-based labeling was effective for the best models on two FAQ data collections. We make a new FAQ retrieval data set publicly available.
A frequently asked questions (FAQ) retrieval system improves the access to information by allowing users to pose natural language queries over an FAQ collection. From an information retrieval perspective, FAQ retrieval is a challenging task, mainly because of the lexical gap that exists between a query and an FAQ pair, both of which are typically very short. In this work, we explore the use of supervised learning to rank to improve the performance of domain-specific FAQ retrieval. While supervised learning-to-rank models have been shown to yield effective retrieval performance, they require costly human-labeled training data in the form of document relevance judgments or question paraphrases. We investigate how this labeling effort can be reduced using a labeling strategy geared toward the manual creation of query paraphrases rather than the more time-consuming relevance judgments. In particular, we investigate two such strategies, and test them by applying supervised ranking models to two domain-specific FAQ retrieval data sets, showcasing typical FAQ retrieval scenarios. Our experiments show that supervised ranking models can yield significant improvements in the precision-at-rank-5 measure compared to unsupervised baselines. Furthermore, we show that a supervised model trained using data labeled via a low-effort paraphrase-focused strategy has the same performance as that of the same model trained using fully labeled data, indicating that the strategy is effective at reducing the labeling effort while retaining the performance gains of the supervised approach. To encourage further research on FAQ retrieval, we make our FAQ retrieval data set publicly available.
applications of natural language to data bases | 2017
Mladen Karan; Jan Šnajder
Frequently asked questions (FAQ) collections are a popular and effective way of representing information, and FAQ retrieval systems provide a natural-language interface to such collections. An important aspect of efficient and trustworthy FAQ retrieval is to maintain a low fall-out rate by detecting non-covered questions. In this paper we address the task of detecting non-covered questions. We experiment with threshold-based methods as well as unsupervised one-class and supervised binary classifiers, considering tf-idf and word-embedding text representations. Experiments, carried out on a domain-specific FAQ collection, indicate that a cluster-based model with query paraphrases outperforms threshold-based, one-class, and binary classifiers.
joint conference on lexical and computational semantics | 2012
Frane Šarić; Goran Glavaš; Mladen Karan; Jan Šnajder; Bojana Dalbelo Bašić
Proceedings of the Eighth Language Technologies Conference | 2012
Goran Glavaš; Mladen Karan; Frane Šarić; Jan Šnajder; Jure Mijić; Artur Šilić; Bojana Dalbelo Bašić