Publication


Featured research published by Mirco Kocher.


Association for Information Science and Technology | 2017

A simple and efficient algorithm for authorship verification

Mirco Kocher; Jacques Savoy

This paper describes and evaluates an unsupervised and effective authorship verification model called Spatium‐L1. As features, we suggest using the 200 most frequent terms of the disputed text (isolated words and punctuation symbols). Applying a simple distance measure and a set of impostors, we can determine whether or not the disputed text was written by the proposed author. Moreover, based on a simple rule we can define when there is enough evidence to propose an answer or when the attribution scheme is unable to make a decision with a high degree of certainty. Evaluations based on 6 test collections (PAN CLEF 2014 evaluation campaign) indicate that Spatium‐L1 usually appears in the top 3 best verification systems, and on an aggregate measure, presents the best performance. The suggested strategy can be adapted without any problem to different Indo‐European languages (such as English, Dutch, Spanish, and Greek) or genres (essay, novel, review, and newspaper article).
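The verification idea in this abstract can be illustrated with a minimal sketch. It assumes an L1 (Manhattan) distance over relative frequencies of the 200 most frequent terms of the disputed text, and a ratio test against the closest impostor. The function names, the tokenizer, and the `margin` value are hypothetical choices made for illustration; the abstract does not specify the decision rule.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercased words plus isolated punctuation symbols (assumed tokenizer).
    return re.findall(r"\w+|[^\w\s]", text.lower())

def relative_frequencies(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}

def l1_distance(disputed, candidate, k=200):
    # Features: the k most frequent terms of the disputed text.
    disputed_tokens = tokenize(disputed)
    terms = [t for t, _ in Counter(disputed_tokens).most_common(k)]
    p = relative_frequencies(disputed_tokens)
    q = relative_frequencies(tokenize(candidate))
    return sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in terms)

def verify(disputed, author_text, impostor_texts, margin=0.025):
    # Compare the distance to the proposed author with the distance to the
    # closest impostor; `margin` (hypothetical) defines the "no decision" zone.
    # `impostor_texts` must contain at least one text.
    d_author = l1_distance(disputed, author_text)
    d_impostor = min(l1_distance(disputed, t) for t in impostor_texts)
    ratio = d_author / d_impostor if d_impostor > 0 else float("inf")
    if ratio < 1 - margin:
        return "same"
    if ratio > 1 + margin:
        return "different"
    return "undecided"
```

Returning "undecided" when the two distances are nearly equal mirrors the abstract's point that the scheme may decline to answer when the evidence is weak.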


Information Processing and Management | 2017

Distance measures in author profiling

Mirco Kocher; Jacques Savoy

Determining some demographics about the author of a document (e.g., gender, age) has attracted many studies during the last decade. To solve this author profiling task, various classification models have been proposed based on stylistic features (e.g., function word frequencies, n-grams of letters or words, POS distributions), as well as various vocabulary richness or overall stylistic measures. To determine the targeted category, different distance measures have been suggested without one approach clearly dominating all others. In this paper, 24 distance measures are studied, extracted from five general families of functions. Moreover, six theoretical properties are presented and we show that the Tanimoto or Matusita distance measures respect all proposed properties. To complement this analysis, 13 test collections extracted from the last CLEF evaluation campaigns are employed to evaluate empirically the effectiveness of these distance measures. This test set covers four languages (English, Spanish, Dutch, and Italian) and four text genres (blogs, tweets, reviews, and social media), with two genders and four or five age groups. The empirical evaluations indicate that the Canberra or Clark distance measures tend to produce better effectiveness than the rest, at least in the context of an author profiling task. Moreover, our experiments indicate that having a training set closely related to the test set (e.g., the same collection) has a clear impact on the overall performance. The gender accuracy rate is decreased by 7% (19% for the age) when using the same text genre during the training compared to using the same collection (leaving-one-out methodology). Employing a different text genre in the training and in the test phases tends to hurt the overall performance, showing a decrease of the final accuracy rate of around 11% for the gender classification to 26% for the age.
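For readers unfamiliar with the four distance measures named in this abstract, the sketch below gives their commonly cited textbook definitions for two non-negative frequency vectors of equal length. These formulas are assumed from the general literature on distance measures; the exact variants used in the paper may differ.

```python
import math

def canberra(p, q):
    # Sum of |p_i - q_i| / (p_i + q_i), skipping positions where both are zero.
    return sum(abs(a - b) / (a + b) for a, b in zip(p, q) if a + b > 0)

def clark(p, q):
    # Square root of the summed squared ratios (p_i - q_i) / (p_i + q_i).
    return math.sqrt(sum(((a - b) / (a + b)) ** 2
                         for a, b in zip(p, q) if a + b > 0))

def tanimoto(p, q):
    # Sum of |p_i - q_i| divided by the sum of max(p_i, q_i).
    denom = sum(max(a, b) for a, b in zip(p, q))
    return sum(abs(a - b) for a, b in zip(p, q)) / denom if denom > 0 else 0.0

def matusita(p, q):
    # Euclidean distance between the element-wise square roots.
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q)))
```

Canberra and Clark normalize each term by the magnitude of the two frequencies, so rare terms contribute as much as frequent ones; Tanimoto and Matusita are dominated by the larger frequencies.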


ACM/IEEE Joint Conference on Digital Libraries | 2017

Author clustering using Spatium

Mirco Kocher; Jacques Savoy

This paper presents the author clustering problem and compares it to related authorship attribution questions. The proposed model is based on a distance measure called Spatium derived from the Canberra measure (weighted version of L1 norm). The selected features consist of the 200 most frequent words and punctuation symbols. An evaluation methodology is presented and the test collections are extracted from the PAN CLEF 2016 evaluation campaign. We also consider two additional corpora that reflect the literature domain more closely. Based on four different languages, the evaluation measures demonstrate a high precision and F1 for all 20 test collections. A more detailed analysis provides reasons explaining some of the failures of the Spatium model.
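To make the author clustering problem concrete, here is a minimal sketch that groups documents whenever a pairwise distance falls below a threshold, merging linked documents transitively with a union-find structure. The single-link grouping, the fixed `threshold`, and the function name are assumptions for illustration; the abstract does not describe how Spatium forms its clusters.

```python
from itertools import combinations

def cluster_by_links(items, distance, threshold):
    # items: dict mapping a document id to its representation.
    # distance: any function taking two representations and returning a float.
    parent = {doc: doc for doc in items}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        if distance(items[a], items[b]) < threshold:
            parent[find(a)] = find(b)  # link the two clusters

    groups = {}
    for doc in items:
        groups.setdefault(find(doc), []).append(doc)
    return sorted(sorted(g) for g in groups.values())
```

Any distance function can be plugged in, for example a Canberra-style measure computed over the most frequent terms.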


Digital Scholarship in the Humanities | 2018

Evaluation of text representation schemes and distance measures for authorship linking

Mirco Kocher; Jacques Savoy

Based on n text excerpts, the authorship linking task is to determine a way to link pairs of documents written by the same person together. This problem is closely related to authorship attribution questions, and its solution can be used in the author clustering task. However, no training information is provided and the solution must be unsupervised. To achieve this, various text representation strategies can be applied, such as characters, punctuation symbols, or letter n-grams as well as words, lemmas, part-of-speech (POS) tags, and sequences of them. To estimate the stylistic distance (or similarity) between two text excerpts, different measures have been suggested based on the L1 norm (e.g. Manhattan, Tanimoto), the L2 norm (e.g. Matusita), the inner product (e.g. Cosine), or the entropy paradigm (e.g. Jeffrey divergence). From those possible implementations, it is not clear which text representation and distance functions produce the best performance, and this study provides an answer to this question. Three corpora, extracted from French and English literature, have been evaluated using standard methodology. Moreover, we suggest an additional performance measure called high precision (HPrec) capable of judging the quality of a ranked list of links to provide only correct answers. No systematic difference can be found between token- or lemma-based text representations. Simple POS tags do not provide an effective solution but short sequences of them form a good text representation. Letter n-grams (with n = 4–6) give high HPrec rates. As distance measures, this study found that the Tanimoto, Matusita, and Clark distance measures perform better than the often-used Cosine function. Finally, applying a pruning procedure (e.g. culling terms appearing once or twice or limiting the vocabulary to the 500 most frequent words) reduces the representation complexity and might even improve the effectiveness of the attribution scheme.
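The HPrec measure mentioned in this abstract can be sketched under one plausible reading: rank all candidate links by increasing distance and count how many correct links appear before the first incorrect one. This interpretation, together with the function names and the input layout, is an assumption; the paper's exact definition may differ.

```python
def rank_links(pair_distances):
    # pair_distances: dict mapping (doc_a, doc_b) to a distance.
    # Returns the pairs ordered from most to least similar.
    return [pair for pair, _ in sorted(pair_distances.items(),
                                       key=lambda kv: kv[1])]

def hprec(ranked_links, author_of):
    # Count correct links at the top of the ranking, stopping at the
    # first pair whose two documents have different true authors.
    count = 0
    for a, b in ranked_links:
        if author_of[a] != author_of[b]:
            break
        count += 1
    return count
```

Under this reading, a system scores well only if its most confident links are all correct, which matches the abstract's emphasis on providing only correct answers.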


Cross-Language Evaluation Forum | 2017

Author Clustering with an Adaptive Threshold

Mirco Kocher; Jacques Savoy

This paper describes and evaluates an unsupervised author clustering model called Spatium. The proposed strategy can be adapted without any difficulty to different natural languages (such as Dutch, English, and Greek) and it can be applied to different text genres (newspaper articles, reviews, excerpts of novels, etc.). As features, we suggest using the m most frequent terms of each text (isolated words and punctuation symbols with m set to at most 200). Applying a distance measure, we define whether there is enough evidence that two texts were written by the same author. The evaluations are based on six test collections (PAN Author Clustering task at CLEF 2016). A more detailed analysis shows the strengths of our approach but also indicates the problems and provides reasons for some of the potential failures of the Spatium model.
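The abstract does not state how the threshold is adapted. As a purely illustrative assumption, the sketch below derives it from the distribution of all pairwise distances within one collection (mean minus `k` standard deviations) and keeps only the pairs that fall below it. The rule, the default `k`, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def adaptive_threshold(distances, k=2.0):
    # Assumed rule: a pair is "close enough" if its distance lies more than
    # k standard deviations below the mean of all pairwise distances.
    return mean(distances) - k * pstdev(distances)

def confident_links(pair_distances, k=2.0):
    # pair_distances: dict mapping (doc_a, doc_b) to a distance.
    threshold = adaptive_threshold(list(pair_distances.values()), k)
    return sorted(pair for pair, d in pair_distances.items() if d < threshold)
```

Because the threshold is recomputed for each collection, the same code can be applied to different languages and genres without a hand-tuned cutoff, at the cost of proposing no links when all distances are similar.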


CLEF (Working Notes) | 2016

UniNE at CLEF 2016: Author Profiling.

Mirco Kocher; Jacques Savoy


Digital Scholarship in the Humanities | 2018

Distributed language representation for authorship attribution

Mirco Kocher; Jacques Savoy


CLEF (Working Notes) | 2018

UniNE at CLEF 2018: Author Masking: Notebook for PAN at CLEF 2018.

Mirco Kocher; Jacques Savoy


CORIA | 2017

Regroupement d’auteurs : Qui a écrit cet ensemble de romans ?

Mirco Kocher; Jacques Savoy


CLEF (Working Notes) | 2017

UniNE at CLEF 2017: Author Profiling Reasoning.

Mirco Kocher; Jacques Savoy

Collaboration


Dive into Mirco Kocher's collaborations.

Top Co-Authors

Jacques Savoy

University of Neuchâtel
