Bojana Dalbelo Bašić

Publication

Featured research published by Bojana Dalbelo Bašić.

international conference on knowledge based and intelligent information and engineering systems | 2010

Visualization of text streams: a survey

Artur Šilić; Bojana Dalbelo Bašić

This work presents a survey of methods that visualize text streams. Existing methods are classified and compared from the perspective of the visualization process. We introduce new aspects of method comparison: data type, text representation, and the temporal drawing approach. The subjectivity of visualization is described, and evaluation methodologies are explained. Related research areas are discussed, and some future trends in the field are anticipated.

Computer Speech & Language | 2010

Extending lexical association measures for collocation extraction

Saša Petrović; Jan Šnajder; Bojana Dalbelo Bašić

Collocations are linguistic phenomena that occur when two or more words appear together more often than by chance and whose meaning often cannot be inferred from the meanings of their parts. As collocations have found many applications in the fields of natural language processing, information retrieval, and text mining, extracting them from large corpora has been the focus of many studies over the past few years. In this paper, we introduce the notion of an extension pattern, a formalization of the idea of extending lexical association measures (AMs) defined for bigrams. An extension pattern provides a measure-independent way of extending AMs for extracting collocations of arbitrary length. We define different extension patterns and compare them on a task of extracting collocations from a newspaper corpus. We show that the stopword-sensitive extension patterns we propose outperform other extensions, which indicates that AMs could benefit from taking into account linguistic information about an n-gram's part-of-speech pattern.

portuguese conference on artificial intelligence | 2007

N-grams and morphological normalization in text classification: a comparison on a Croatian-English parallel corpus

Artur Šilić; Jean-Hugues Chauchat; Bojana Dalbelo Bašić; Annie Morin

In this paper we compare n-grams and morphological normalization, two inherently different text-preprocessing methods, used for text classification on a Croatian-English parallel corpus. Our approach to comparing different text preprocessing techniques is based on measuring computational performance (execution time and memory consumption), as well as classification performance. We show that although n-grams achieve classifier performance comparable to traditional word-based feature extraction and can act as a substitute for morphological normalization, they are computationally much more demanding.
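
As a rough illustration of the n-gram side of this comparison, the sketch below extracts overlapping character n-grams from a text and counts them as classification features. It assumes character-level n-grams over lowercased text; the function name `char_ngrams` and the lowercasing step are illustrative choices, not details taken from the paper.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in `text`.

    Assumption: n-grams are taken over the lowercased raw string,
    including spaces. The paper's exact preprocessing is not given here.
    """
    text = text.lower()
    # Slide a window of width n over the string, one character at a time.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


if __name__ == "__main__":
    features = char_ngrams("Text mining", n=3)
    for gram, count in sorted(features.items()):
        print(repr(gram), count)
```

Because every position in the text produces a feature, the feature space grows much faster than with word-based features, which is consistent with the higher computational cost reported above.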

web intelligence | 2003

Concept decomposition by fuzzy k-means algorithm

Jasminka Dobša; Bojana Dalbelo Bašić

The method of latent semantic indexing (LSI) is an information retrieval technique using a low-rank singular value decomposition (SVD) of the term-document matrix. Although the LSI method has empirical success, it suffers from the lack of interpretation for the low-rank approximation and, consequently, the lack of controls for accomplishing specific tasks in information retrieval. A method introduced by Dhillon and Modha is an improvement in that direction. It uses centroids of clusters, the so-called concept decomposition, for lowering the rank of the term-document matrix. We focus on improvements of that method using the fuzzy k-means algorithm. We also compare the precision of information retrieval for the two methods above.
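
To make the fuzzy k-means step more concrete, here is a minimal sketch of one iteration of the textbook fuzzy k-means (fuzzy c-means) update: soft memberships are computed from distances to the current centroids, and centroids are then recomputed as membership-weighted means. It assumes Euclidean distance and a fuzzifier `m = 2`; the paper may use a different similarity (for example cosine on normalized document vectors), and the function names are hypothetical.

```python
import math


def fuzzy_memberships(points, centroids, m=2.0):
    """Return u[j][i], the degree to which point j belongs to cluster i."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        if any(d == 0.0 for d in dists):
            # The point coincides with a centroid: give it full membership
            # there (shared equally if several centroids coincide).
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            memberships.append([h / sum(hits) for h in hits])
            continue
        memberships.append(
            [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]
        )
    return memberships


def update_centroids(points, memberships, m=2.0):
    """Recompute each centroid as the mean of all points weighted by u^m."""
    dim = len(points[0])
    centroids = []
    for i in range(len(memberships[0])):
        weights = [u[i] ** m for u in memberships]
        total = sum(weights)
        centroids.append(
            [sum(w * p[d] for w, p in zip(weights, points)) / total for d in range(dim)]
        )
    return centroids


if __name__ == "__main__":
    docs = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.0)]  # toy 2-D "document" vectors
    u = fuzzy_memberships(docs, [(0.0, 0.0), (2.0, 0.0)])
    print(u)
    print(update_centroids(docs, u))
```

In a concept decomposition, the resulting centroids would play the role of concept vectors onto which the term-document matrix is projected; that projection step is not shown here.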

international conference on software, telecommunications and computer networks | 2014

Software defect prediction with Bug-Code analyzer - A data collection tool demo

Goran Mauša; Tihana Galinac Grbac; Bojana Dalbelo Bašić

The empirical software engineering research community aims to accumulate knowledge based on empirical studies of datasets obtained from real software projects. A limiting factor in building theory on this accumulated knowledge is often dataset bias. One solution to this problem is to develop a systematic data collection procedure, defined through standard guidelines and available to the open community, which would reduce data collection bias. In this paper we present a tool demonstration that implements a systematic data collection procedure for software defect prediction datasets from open source bug tracking and source code management repositories. The main challenge that the tool addresses is linking the information related to the same entity (e.g. a class file) from these two sources. The tool implements interfaces to bug and source code repositories, as well as to other tools for calculating software metrics. Finally, it lets users create software defect prediction datasets even if they are unaware of all the details behind this complex task.

text speech and dialogue | 2011

Unsupervised topic-oriented keyphrase extraction and its application to Croatian

Josip Saratlija; Jan Šnajder; Bojana Dalbelo Bašić

Labeling documents with keyphrases is a tedious and expensive task. Most approaches to automatic keyphrase extraction rely on supervised learning and require manually labeled training data. In this paper we propose a fully unsupervised keyphrase extraction method, differing from the usual generic keyphrase extractor in the manner the keyphrases are formed. Our method begins by building topically related word clusters from which document keywords are selected, and then expands the selected keywords into syntactically valid keyphrases. We evaluate our approach on a Croatian document collection annotated by eight human experts, taking into account the high subjectivity of the keyphrase extraction task. The performance of the proposed method reaches up to F1 = 44.5%, which is below that of human annotators but comparable to a supervised approach.

information technology interfaces | 2005

Computer aided document indexing system

Mladen Kolar; Igor Vukmirović; Bojana Dalbelo Bašić; Jan Šnajder

An enormous number of documents is being produced that have to be stored, searched and accessed. Document indexing represents an efficient way to tackle this problem. Contributing to the document indexing process, we developed the Computer Aided Document Indexing System (CADIS) that applies controlled vocabulary keywords from the EUROVOC thesaurus. The main contribution of this paper is the introduction of the special CADIS internal data structure that copes with the morphological complexity of the Croatian language. The CADIS internal data structure ensures efficient statistical analysis of input documents and quick visual feedback generation, which helps index documents more quickly, accurately and uniformly than manual indexing.

international conference on computational linguistics | 2009

TermeX: A Tool for Collocation Extraction

Davor Delač; Zoran Krleža; Jan Šnajder; Bojana Dalbelo Bašić; Frane Šarić

Collocations, word combinations occurring together more often than by chance, have a wide range of NLP applications. Many approaches for automating collocation extraction based on lexical association measures have been proposed in the literature. This paper presents TermeX, a tool for efficient extraction of collocations based on a variety of association measures. TermeX implements POS filtering and lemmatization, and is capable of extracting collocations up to length four. We address trade-offs between high memory consumption and processing speed and propose an efficient implementation. Our implementation allows for processing time linear in corpus size and memory consumption linear in the number of word types.
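
The idea of scoring word pairs with a lexical association measure can be sketched as follows. The example uses pointwise mutual information (PMI), a widely used association measure, chosen here purely for illustration; the abstract does not list which measures TermeX implements, and the function name `bigram_pmi` is hypothetical. Counts are gathered in a single pass over the tokens.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Score each adjacent word pair with PMI = log2(P(x, y) / (P(x) * P(y))).

    Assumption: probabilities are simple relative frequencies, with unigrams
    normalized by the token count and bigrams by the bigram count.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))  # one pass over adjacent pairs
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


if __name__ == "__main__":
    text = "new york is a big city and new york is busy".split()
    ranked = sorted(bigram_pmi(text).items(), key=lambda kv: -kv[1])
    for pair, score in ranked[:3]:
        print(pair, round(score, 3))
```

A real extractor would also apply the POS filtering and lemmatization mentioned above before counting, and would typically discard low-frequency pairs, since PMI tends to overrate rare combinations.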

intelligent data analysis | 2009

Textual features for corpus visualization using correspondence analysis

Saša Petrović; Bojana Dalbelo Bašić; Annie Morin; Blaž Zupan; Jean-Hugues Chauchat

Explorative data analysis in text mining essentially relies on effective visualization techniques which can expose hidden relationships among documents and reveal correspondence between documents and their features. In text mining, the documents are most often represented by feature vectors of very high dimensions, requiring dimensionality reduction to obtain visual projections in two- or three-dimensional space. Correspondence analysis is an unsupervised approach that allows for construction of a low-dimensional projection space with simultaneous placement of both documents and features, making it ideal for explorative analysis in text mining. Its present use, however, has been limited to word-based features. In this paper, we investigate how this particular document representation compares to the representation with letter n-grams and word n-grams, and find that these alternative representations yield better results in separating documents of different classes. We perform our experimental analysis on a bilingual Croatian-English parallel corpus, allowing us to additionally explore the impact of features in different languages on the quality of visualizations.

information technology interfaces | 2007

TMT: Object-Oriented Text Classification Library

Artur Šilić; Frane Šarić; Bojana Dalbelo Bašić; Jan Šnajder

The purpose of the TMT (Text Mining Tools) library is to enable the use of modern text-mining techniques for natural languages on cross-platform environments that can be applied equally well to research and development of end-user text-mining applications. The paper is structured as follows. Section 2 discusses the related work. Section 3 describes the functionalities of the library, whereas Section 4 describes its usage. Section 5 concludes the paper.
