Michael Tschuggnall
University of Innsbruck
Publications
Featured research published by Michael Tschuggnall.
ACM Conference on Hypertext | 2010
Wolfgang Gassler; Eva Zangerle; Michael Tschuggnall; Günther Specht
Knowledge is structured - until it is stored in a wiki-like information system. In this paper we present the multi-user system SnoopyDB, which preserves the structure of knowledge without restricting the type or schema of inserted information. A self-learning schema system and recommendation engine support the user during the process of inserting information. These dynamically calculated recommendations develop an implicit schema, which is used by the majority of stored information. Further recommendation measures enhance the content both semantically and syntactically and motivate users to insert more information than they originally intended.
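As a loose illustration of the recommendation idea described above (purely hypothetical; all names are illustrative and SnoopyDB's actual engine is more elaborate), suggesting attribute keys by how often they co-occur in previously stored records might look like:

```python
from collections import Counter

def recommend_keys(existing_records, current_keys, top_n=3):
    """Suggest attribute keys the user has not yet filled in, ranked by
    how often they appear in previously stored records. A minimal sketch
    of frequency-based schema recommendation, not SnoopyDB's algorithm."""
    counts = Counter(key
                     for record in existing_records
                     for key in record
                     if key not in current_keys)
    return [key for key, _ in counts.most_common(top_n)]
```

Recommendations computed this way converge toward the implicit schema used by the majority of stored records, without ever enforcing a fixed schema.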
Applications of Natural Language to Data Bases | 2013
Michael Tschuggnall; Günther Specht
Intrinsic plagiarism detection deals with the task of finding plagiarized sections in text documents without using a reference corpus. This paper describes a novel approach in this field by analyzing the grammar of authors and using sliding windows to find significant differences in writing styles. To find suspicious text passages, the algorithm splits a document into single sentences, calculates syntax grammar trees and builds profiles based on frequently used grammar patterns. The text is then traversed with a sliding window, and each window is compared to the document profile using a distance metric. Finally, all sentences whose distance is significantly higher according to a fitted Gaussian normal distribution are marked as suspicious. A preliminary evaluation of the algorithm shows very promising results.
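The final thresholding step can be sketched roughly as follows (the paper does not publish code; the threshold factor `k` is illustrative), assuming the window distances are approximately normally distributed:

```python
import statistics

def flag_suspicious(distances, k=1.5):
    """Flag indices whose distance to the document profile exceeds
    mean + k * stdev, under an assumed Gaussian distribution of
    distances. A hypothetical sketch, not the authors' implementation."""
    mu = statistics.mean(distances)
    sigma = statistics.pstdev(distances)
    return [i for i, d in enumerate(distances) if d > mu + k * sigma]
```

A window with a distance far above the mean then corresponds to a passage whose grammar deviates from the document's dominant writing style.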
Conference of the European Chapter of the Association for Computational Linguistics | 2014
Michael Tschuggnall; Günther Specht
The aim of modern authorship attribution approaches is to analyze known authors and to assign authorships to previously unseen and unlabeled text documents based on various features. In this paper we present a novel feature to enhance current attribution methods by analyzing the grammar of authors. To extract the feature, a syntax tree of each sentence of a document is calculated, which is then split up into length-independent patterns using pq-grams. The most frequently used pq-grams are then used to compose sample profiles of authors, which are compared with the profile of the unlabeled document using various distance metrics and similarity scores. An evaluation on three different and independent data sets reveals promising results and indicates that the grammar of authors is a significant feature for enhancing modern authorship attribution methods.
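The pq-gram itself is a standard construction (due to Augsten et al.): each gram joins a node's ancestor chain of length p with q consecutive children, padding missing positions with a dummy label. A minimal sketch of extracting a pq-gram profile from a syntax tree (the tuple-based tree encoding is illustrative) could look like:

```python
def pq_grams(tree, p=2, q=3):
    """Extract the pq-gram profile of a tree given as (label, children)
    tuples, following the standard definition: each pq-gram combines a
    node and its p-1 ancestors with q consecutive children, padding
    missing positions with the dummy label '*'."""
    grams = []

    def visit(node, ancestors):
        label, children = node
        anc = (["*"] * p + ancestors + [label])[-p:]
        if children:
            sib = ["*"] * (q - 1) + [c[0] for c in children] + ["*"] * (q - 1)
        else:
            sib = ["*"] * q  # a leaf yields a single all-dummy child window
        for i in range(len(sib) - q + 1):
            grams.append(tuple(anc + sib[i:i + q]))
        for child in children:
            visit(child, ancestors + [label])

    visit(tree, [])
    return grams
```

Because the windows have fixed width, the resulting patterns are independent of sentence length, which is what makes them comparable across documents.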
Applications of Natural Language to Data Bases | 2012
Michael Tschuggnall; Günther Specht
Intrinsic plagiarism detection deals with the task of finding plagiarized sections of text documents without using a reference corpus. This paper describes a novel approach to this task by processing and analyzing the grammar of a suspicious document. The main idea is to split a text into single sentences and to calculate grammar trees. To find suspicious sentences, these grammar trees are compared in a distance matrix using the pq-gram distance, an efficient alternative to the tree edit distance. Finally, sentences whose grammar differs significantly, with respect to a Gaussian normal distribution, are marked as suspicious.
European Conference on Machine Learning | 2016
Michael Tschuggnall; Günther Specht
The amount of textual data available from digitized sources such as free online libraries or social media posts has increased drastically in the last decade. In this paper, the main idea of analyzing authors by their grammatical writing style is presented. In particular, tasks like authorship attribution, plagiarism detection and author profiling are tackled using the presented algorithm, revealing promising results. All of the presented tasks are ultimately solved by machine learning algorithms.
European Conference on Information Retrieval | 2018
Eva Zangerle; Michael Tschuggnall; Stefan Wurzinger; Günther Specht
In recent years, approaches in music information retrieval have been based on multimodal analyses of music, incorporating audio as well as lyrics features. Because most of those approaches lack reusable, high-quality datasets, in this work we propose ALF-200k, a publicly available, novel dataset including 176 audio and lyrics features of more than 200,000 tracks and their attribution to more than 11,000 user-created playlists. While the dataset is of general purpose and may thus be used in experiments for diverse music information retrieval problems, we present a first multimodal study on playlist features and particularly analyze which types of features are shared within specific playlists and thus characterize them. We show that while acoustic features act as the major glue between tracks contained in a playlist, lyrics features are also a powerful means of attributing tracks to playlists.
Cross Language Evaluation Forum | 2018
Efstathios Stamatatos; Francisco Rangel; Michael Tschuggnall; Benno Stein; Mike Kestemont; Paolo Rosso; Martin Potthast
PAN 2018 explores several authorship analysis tasks enabling a systematic comparison of competitive approaches and advancing research in digital text forensics. More specifically, this edition of PAN introduces a shared task in cross-domain authorship attribution, where texts of known and unknown authorship belong to distinct domains, and another task in style change detection that distinguishes between single-author and multi-author texts. In addition, a shared task in multimodal author profiling examines, for the first time, a combination of information from both texts and images posted by social media users to estimate their gender. Finally, the author obfuscation task studies how a text by a certain author can be paraphrased so that existing author identification tools are confused and cannot recognize the similarity with other texts of the same author. New corpora have been built to support these shared tasks. A relatively large number of software submissions (41 in total) was received and evaluated. The best-performing paradigms are highlighted, while baselines indicate the pros and cons of the submitted approaches.
Cross Language Evaluation Forum | 2016
Paolo Rosso; Francisco Rangel; Martin Potthast; Efstathios Stamatatos; Michael Tschuggnall; Benno Stein
This paper presents an overview of the PAN/CLEF evaluation lab. During the last decade, PAN has been established as the main forum of digital text forensic research. PAN 2016 comprises three shared tasks: (i) author identification, addressing author clustering and diarization (or intrinsic plagiarism detection); (ii) author profiling, addressing age and gender prediction from a cross-genre perspective; and (iii) author obfuscation, addressing author masking and obfuscation evaluation. In total, 35 teams participated in all three shared tasks of PAN 2016 and, following the practice of previous editions, software submissions were required and evaluated within the TIRA experimentation framework.
European Intelligence and Security Informatics Conference | 2013
Michael Tschuggnall; Günther Specht
Unauthorized copying or stealing of the intellectual property of others is a serious problem in modern society. In the case of textual plagiarism, it becomes increasingly easy to find appropriate sources using the huge amount of data available through online databases. To counter this problem, the two main approaches are categorized as external and intrinsic plagiarism detection, respectively. While external algorithms have the possibility to compare a suspicious document with numerous sources, intrinsic algorithms may solely inspect the suspicious document in order to predict plagiarism, which is important especially if no sources are available. In this paper we present a novel approach in the field of intrinsic plagiarism detection that analyzes the syntactic information of authors and finds irregularities in sentence constructions. The main idea follows the assumption that authors have a largely unconscious set of patterns for building sentences, which can be used to distinguish them. The algorithm therefore splits a suspicious document into single sentences, tags each word with a part-of-speech (POS) classifier and creates POS sequences representing each sentence. Subsequently, the distance between every distinct pair of sentences is calculated by applying modified sequence alignment algorithms and stored in a distance matrix. After applying a Gaussian normal distribution function to the mean distances for each sentence, suspicious sentences are selected, grouped and predicted to be plagiarized. Finally, the thresholds and parameters the algorithm uses are optimized by applying genetic algorithms. The approach has been evaluated against a large test corpus of English documents, showing promising results.
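The paper's modified sequence alignment scoring is not reproduced here; as a stand-in illustration of a pairwise sentence distance over POS-tag sequences, a normalized Levenshtein distance could look like:

```python
def pos_seq_distance(seq_a, seq_b):
    """Normalized edit (Levenshtein) distance between two POS-tag
    sequences, in [0, 1]. A simple stand-in for the modified sequence
    alignment used in the paper, not the authors' scoring."""
    m, n = len(seq_a), len(seq_b)
    if max(m, n) == 0:
        return 0.0
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / max(m, n)
```

Filling the distance matrix with such a metric for every distinct sentence pair then allows the Gaussian-based outlier selection described above.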
Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings | 2016
Efstathios Stamatatos; Michael Tschuggnall; Ben Verhoeven; Walter Daelemans; Günther Specht; Benno Stein; Martin Potthast