Vlado Keselj | Researchain

Publication

Featured research published by Vlado Keselj.

computer software and applications conference | 2004

N-gram-based detection of new malicious code

Tony Abou-Assaleh; Nick Cercone; Vlado Keselj; Ray Sweidan

Current commercial anti-virus software detects a virus only after the virus has appeared and caused damage. Motivated by the standard signature-based technique for detecting viruses, and a recent successful text classification method, we explore the idea of automatically detecting new malicious code using a collected dataset of benign and malicious code. We obtained an accuracy of 100% on the training data and 98% in 3-fold cross-validation.
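To make the idea concrete, here is a minimal Python sketch of n-gram-based classification over raw bytes. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the n-gram length, the profile size, or the dissimilarity measure, so the function names, the defaults (`n=3`, `top=100`), and the relative-difference distance below are all assumed for the example.

```python
from collections import Counter


def ngram_profile(data: bytes, n: int = 3, top: int = 100) -> dict:
    """Return the `top` most frequent byte n-grams with normalized frequencies."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.most_common(top)}


def profile_distance(p: dict, q: dict) -> float:
    """Sum of squared relative frequency differences over the union of n-grams.

    Assumption: this measure is chosen for illustration only.
    """
    d = 0.0
    for gram in set(p) | set(q):
        a, b = p.get(gram, 0.0), q.get(gram, 0.0)
        d += ((a - b) / ((a + b) / 2)) ** 2
    return d


def classify(sample: bytes, class_profiles: dict, n: int = 3, top: int = 100) -> str:
    """Assign the sample to the class whose profile is least dissimilar."""
    prof = ngram_profile(sample, n, top)
    return min(class_profiles, key=lambda label: profile_distance(prof, class_profiles[label]))
```

In use, one profile would be built per class (for example, from the concatenated benign files and the concatenated malicious files), and each new file would be assigned to the nearer class.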

conference of the european chapter of the association for computational linguistics | 2003

Language independent authorship attribution using character level language models

Fuchun Peng; Dale Schuurmans; Shaojun Wang; Vlado Keselj

We present a method for computer-assisted authorship attribution based on character-level n-gram language models. Our approach is based on simple information-theoretic principles, and achieves improved performance across a variety of languages without requiring extensive pre-processing or feature selection. To demonstrate the effectiveness and language independence of our approach, we present experimental results on Greek, English, and Chinese data. We show that our approach achieves state-of-the-art performance in each of these cases. In particular, we obtain an 18% accuracy improvement over the best published results for a Greek data set, while using a far simpler technique than previous investigations.
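The general technique can be sketched as follows: train one character-level n-gram model per candidate author, then attribute a disputed text to the author whose model gives it the lowest cross-entropy. This is a simplified illustration, not the paper's method: the class name `CharNgramLM`, the trigram default, and the add-one smoothing are assumptions made to keep the example short.

```python
import math
from collections import Counter


class CharNgramLM:
    """Character n-gram language model with add-one smoothing (an assumed choice)."""

    def __init__(self, n: int = 3):
        self.n = n
        self.ngrams = Counter()    # counts of full n-grams
        self.contexts = Counter()  # counts of (n-1)-character contexts
        self.vocab = set()

    def train(self, text: str) -> None:
        padded = " " * (self.n - 1) + text
        self.vocab.update(padded)
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            self.ngrams[gram] += 1
            self.contexts[gram[:-1]] += 1

    def cross_entropy(self, text: str) -> float:
        """Average negative log2 probability per character of `text`."""
        padded = " " * (self.n - 1) + text
        v = len(self.vocab) + 1  # one extra slot for unseen characters
        total, count = 0.0, 0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            p = (self.ngrams[gram] + 1) / (self.contexts[gram[:-1]] + v)
            total -= math.log2(p)
            count += 1
        return total / count if count else float("inf")


def attribute(text: str, models: dict) -> str:
    """Return the author whose model assigns the lowest cross-entropy."""
    return min(models, key=lambda author: models[author].cross_entropy(text))
```

Because the model operates on characters rather than words, the same code applies unchanged to languages without whitespace word boundaries, which is the sense in which such an approach is language independent.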

Computer Methods and Programs in Biomedicine | 2006

n-Gram-based classification and unsupervised hierarchical clustering of genome sequences

Andrija Tomovic; Predrag Janičić; Vlado Keselj

In this paper we address the problem of automated classification of isolates, i.e., the problem of determining the family of genomes to which a given genome belongs. Additionally, we address the problem of automated unsupervised hierarchical clustering of isolates according only to their statistical substring properties. For both of these problems we present novel algorithms based on nucleotide n-grams, with no required preprocessing steps such as sequence alignment. Results obtained experimentally are very positive and suggest that the proposed techniques can be successfully used in a variety of related problems. The reported experiments demonstrate better performance than some of the state-of-the-art methods. We report on a new distance measure between n-gram profiles, which shows superior performance compared to many other measures, including commonly used Euclidean distance.

international conference on mechatronics and automation | 2005

Automatic detection and rating of dementia of Alzheimer type through lexical analysis of spontaneous speech

Calvin Thomas; Vlado Keselj; Nick Cercone; Kenneth Rockwood; Elissa Asp

Current methods of assessing dementia of Alzheimer type (DAT) in older adults involve structured interviews that attempt to capture the complex nature of deficits suffered. One of the most significant areas affected by the disease is the capacity for functional communication as linguistic skills break down. These methods often do not capture the true nature of language deficits in spontaneous speech. We address this issue by exploring novel automatic and objective methods for diagnosing patients through analysis of spontaneous speech. We detail several lexical approaches to the problem of detecting and rating DAT. The approaches explored rely on character n-gram-based techniques, shown recently to perform successfully in a different, but related task of automatic authorship attribution. We also explore the correlation of usage frequency of different parts of speech and DAT. We achieve a high 95% accuracy of detecting dementia when compared with a control group, and we achieve 70% accuracy in rating dementia in two classes, and 50% accuracy in rating dementia into four classes. Our results show that purely computational solutions offer a viable alternative to standard approaches to diagnosing the level of impairment in patients. These results are a significant step forward toward automatic and objective means of identifying early symptoms of DAT in older adults.

canadian conference on artificial intelligence | 2012

Text similarity using google tri-grams

Aminul Islam; Evangelos E. Milios; Vlado Keselj

The purpose of this paper is to propose an unsupervised approach for measuring the similarity of texts that can compete with supervised approaches. Finding the inherent properties of similarity between texts using a corpus in the form of a word n-gram data set is competitive with other text similarity techniques in terms of performance and practicality. Experimental results on a standard data set show that the proposed unsupervised method outperforms the state-of-the-art supervised method and the improvement achieved is statistically significant at 0.05 level. The approach is language-independent; it can be applied to other languages as long as n-grams are available.

PLOS ONE | 2013

Approximate Subgraph Matching-Based Literature Mining for Biomedical Events and Relations

Haibin Liu; Lawrence Hunter; Vlado Keselj; Karin Verspoor

The biomedical text mining community has focused on developing techniques to automatically extract important relations between biological components and semantic events involving genes or proteins from literature. In this paper, we propose a novel approach for mining relations and events in the biomedical literature using approximate subgraph matching. Extraction of such knowledge is performed by searching for an approximate subgraph isomorphism between key contextual dependencies and input sentence graphs. Our approach significantly increases the chance of retrieving relations or events encoded within complex dependency contexts by introducing error tolerance into the graph matching process, while maintaining the extraction precision at a high level. When evaluated on practical tasks, it achieves a 51.12% F-score in extracting nine types of biological events on the GE task of the BioNLP-ST 2011 and an 84.22% F-score in detecting protein-residue associations. The performance is comparable to the reported systems across these tasks, and thus demonstrates the generalizability of our proposed approach.

conference on information and knowledge management | 2005

Document clustering using character N-grams: a comparative evaluation with term-based and word-based clustering

Yingbo Miao; Vlado Keselj; Evangelos E. Milios

We propose a novel method for document clustering using character N-grams. In the traditional vector-space model, the documents are represented as vectors, in which each dimension corresponds to a word. We propose a document representation based on the most frequent character N-grams, with window size of up to 10 characters. We derive a new distance measure, which produces uniformly better results when compared to the word-based and term-based methods. The result becomes more significant in the light of the robustness of the N-gram method with no language-dependent preprocessing. Experiments on the performance of a clustering algorithm on a variety of test document corpora demonstrate that the N-gram representation with n=3 outperforms both word and term representations. The comparison between word and term representations depends on the data set and the selected dimensionality.

IEEE Computer | 2002

From computational intelligence to Web intelligence

Nick Cercone; Lijun Hou; Vlado Keselj; Aijun An; Kanlaya Naruedomkul; Xiaohua Hu

The authors explore three topics in computational intelligence: machine translation, machine learning, and user interface design, and speculate on their effects on Web intelligence. Systems that can communicate naturally and learn from interactions will power Web intelligence's long-term success. The large number of problems requiring Web-specific solutions demands a sustained and complementary effort to advance fundamental machine-learning research and incorporate a learning component into every Internet interaction. Traditional forms of machine translation either translate poorly, require resources that grow exponentially with the number of languages translated, or simplify language excessively. Recent success in statistical, nonlinguistic, and hybrid machine translation suggests that systems based on these technologies can achieve better results with a large annotated language corpus. Adapting existing computational intelligence solutions, when appropriate for Web intelligence applications, must incorporate a robust notion of learning that will scale to the Web, adapt to individual user requirements, and personalize interfaces.

canadian conference on artificial intelligence | 2005

Integrating web content clustering into web log association rule mining

Jiayun Guo; Vlado Keselj; Qigang Gao

One of the effects of the general Internet growth is an immense number of user accesses to WWW resources. These accesses are recorded in the web server log files, which are a rich data resource for finding useful patterns and rules of user browsing behavior, and they caused the rise of technologies for Web usage mining. Current Web usage mining applications rely exclusively on the web server log files. The main hypothesis discussed in this paper is that Web content analysis can be used to improve Web usage mining results. We propose a system that integrates Web page clustering into log file association mining and uses the cluster labels as Web page content indicators. It is demonstrated that novel and interesting association rules can be mined from the combined data source. The rules can be used further in various applications, including Web user profiling and Web site construction. We experiment with several approaches to content clustering, relying on keyword and character n-gram based clustering with different distance measures and parameter settings. Evaluation shows that character n-gram based clustering performs better than word-based clustering in terms of an internal quality measure (about 3 times better). On the other hand, word-based cluster profiles are easier to manually summarize. Furthermore, it is demonstrated that high-quality rules are extracted from the combined dataset.
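The combination of log data and cluster labels can be illustrated with a small Python sketch. It is a hypothetical example, not the system described in the paper: each session's page set is extended with the cluster labels of its pages, and simple one-to-one association rules are then mined by support and confidence. The function name `mine_pair_rules`, the `cluster:` prefix, and the thresholds are assumptions introduced for the illustration.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(sessions, page_cluster, min_support=0.5, min_confidence=0.8):
    """Mine rules x -> y over pages and cluster labels.

    sessions:     list of page lists, one list per user session (from the log)
    page_cluster: mapping from page to its content-cluster label
    Returns {(x, y): (support, confidence)}.
    """
    # Extend each session with the cluster labels of the pages it contains.
    transactions = []
    for pages in sessions:
        items = set(pages)
        items |= {"cluster:" + page_cluster[p] for p in pages if p in page_cluster}
        transactions.append(items)

    n = len(transactions)
    item_counts = Counter(i for t in transactions for i in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )

    rules = {}
    for (a, b), c in pair_counts.items():
        support = c / n
        if support < min_support:
            continue
        # Test the rule in both directions.
        for x, y in ((a, b), (b, a)):
            confidence = c / item_counts[x]
            if confidence >= min_confidence:
                rules[(x, y)] = (support, confidence)
    return rules
```

A rule whose two sides are both cluster labels, such as one linking a "news" cluster to a "sports" cluster, cannot be found from the log alone, which is the kind of additional pattern the combined data source makes available. Rules from a page to its own cluster are trivially true and would be filtered out in practice.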

canadian conference on artificial intelligence | 2009

Financial Forecasting Using Character N-Gram Analysis and Readability Scores of Annual Reports

Matthew Butler; Vlado Keselj

Two novel Natural Language Processing (NLP) classification techniques are applied to the analysis of corporate annual reports in the task of financial forecasting. The hypothesis is that the textual content of annual reports contains vital information for assessing the performance of the stock over the next year. The first method is based on character n-gram profiles, which are generated for each annual report, and then labeled based on the CNG classification. The second method draws on a more traditional approach, where readability scores are combined with performance inputs and then supplied to a support vector machine (SVM) for classification. Both methods consistently outperformed a benchmark portfolio, and their combination proved to be even more effective and efficient as the combined models yielded the highest returns with the fewest trades.
