Mike Kestemont
University of Antwerp
Publications
Featured research published by Mike Kestemont.
Literary and Linguistic Computing | 2015
Mike Kestemont; Sara Moens; Jeroen Deploige
Hildegard of Bingen (1098-1179) is one of the most influential female authors of the Middle Ages. From the point of view of computational stylistics, the oeuvre attributed to Hildegard is fascinating. Hildegard dictated her texts to secretaries in Latin, a language whose grammatical subtleties she did not fully master. She therefore allowed her scribes to correct her spelling and grammar. Hildegard's last collaborator in particular, Guibert of Gembloux, seems to have considerably reworked her works during his secretaryship. Whereas her other scribes were only allowed to make superficial linguistic changes, Hildegard would have permitted Guibert to render her language stylistically more elegant. In this article, we focus on two shorter texts: the Visio ad Guibertum missa and the Visio de Sancto Martino, both of which Hildegard allegedly authored during Guibert's secretaryship. We analyze a corpus containing the letter collections of Hildegard, Guibert, and Bernard of Clairvaux using a number of common stylometric techniques. We discuss our results in the light of the Synergy Hypothesis, which suggests that texts resulting from collaboration can display a style markedly different from that of the collaborating authors. Finally, we demonstrate that Guibert must have reworked the disputed visionary texts allegedly authored by Hildegard to such an extent that style-oriented computational procedures attribute the texts to Guibert.
Association for Information Science and Technology | 2016
Justin Stover; Yaron Winter; Moshe Koppel; Mike Kestemont
We discuss a real‐world application of a recently proposed machine learning method for authorship verification. Authorship verification is considered an extremely difficult task in computational text classification, because it does not assume that the correct author of an anonymous text is included in the candidate authors available. To determine whether 2 documents have been written by the same author, the verification method discussed uses repeated feature subsampling and a pool of impostor authors. We use this technique to attribute a newly discovered Latin text from antiquity (the Compendiosa expositio) to Apuleius. This North African writer was one of the most important authors of the Roman Empire in the 2nd century and authored one of the world's first novels. This attribution has profound and wide‐reaching cultural value, because it has been over a century since a new text by a major author from antiquity was discovered. This research therefore illustrates the rapidly growing potential of computational methods for studying the global textual heritage.
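The core idea of impostor-based verification can be sketched as follows: repeatedly subsample the feature set and count how often the questioned document is closer to the candidate author than to every impostor. This is a minimal illustration of that scheme, not the authors' exact implementation; the cosine measure, the 50% subsampling rate, and the function names are assumptions made for the example.

```python
import random
import math

def cosine(a, b):
    """Cosine similarity between two equal-length frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def impostor_score(unknown, candidate, impostors, n_trials=100, frac=0.5, seed=0):
    """Fraction of feature-subsampling trials in which the unknown text is
    closer to the candidate than to every impostor (higher = same author)."""
    rng = random.Random(seed)
    n_feats = len(unknown)
    k = max(1, int(frac * n_feats))
    wins = 0
    for _ in range(n_trials):
        idx = rng.sample(range(n_feats), k)          # random feature subset
        u = [unknown[i] for i in idx]
        sim_c = cosine(u, [candidate[i] for i in idx])
        if all(sim_c > cosine(u, [imp[i] for i in idx]) for imp in impostors):
            wins += 1
    return wins / n_trials
```

A score near 1 across many trials supports same-authorship; a score near 0 suggests the resemblance to the candidate does not survive feature resampling.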
Conference of the European Chapter of the Association for Computational Linguistics | 2014
Mike Kestemont
This position paper focuses on the use of function words in computational authorship attribution. Although recently there have been multiple successful applications of authorship attribution, the field is not particularly good at the explication of methods and theoretical issues, which might eventually compromise the acceptance of new research results in the traditional humanities community. I wish to partially help remedy this lack of explication and theory, by contributing a theoretical discussion on the use of function words in stylometry. I will concisely survey the attractiveness of function words in stylometry and relate them to the use of character n-grams. At the end of this paper, I will propose to replace the term ‘function word’ by the term ‘functor’ in stylometry, due to multiple theoretical considerations.
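The basic operation behind function-word stylometry is turning a text into a profile of relative frequencies over a fixed word list. A minimal sketch, in which the tiny FUNCTORS list and the per-1,000-tokens scaling are illustrative choices (real studies use lists of several hundred items):

```python
from collections import Counter

# A tiny illustrative list; stylometric studies typically use hundreds.
FUNCTORS = ["the", "of", "and", "to", "in", "a", "that", "it", "is", "was"]

def functor_profile(text, functors=FUNCTORS):
    """Relative frequency of each function word ('functor') per 1,000 tokens."""
    tokens = [t.strip('.,;:!?"()').lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return {w: 0.0 for w in functors}
    counts = Counter(tokens)
    n = len(tokens)
    return {w: 1000.0 * counts[w] / n for w in functors}
```

Because such words are frequent and largely topic-independent, profiles like this one are comparable across texts on very different subjects, which is precisely their attraction for attribution.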
Expert Systems With Applications | 2016
Mike Kestemont; Justin Stover; Moshe Koppel; Folgert Karsdorp; Walter Daelemans
Highlights: (1) we shed new light on the authenticity of the writings of Julius Caesar; (2) Hirtius, one of Caesar's generals, must have contributed to Caesar's writings; (3) we benchmark two authorship verification systems on publicly available data sets; (4) we test on both modern data sets and Latin texts from Antiquity; (5) we show how computational methods inform traditional authentication studies. In this paper, we shed new light on the authenticity of the Corpus Caesarianum, a group of five commentaries describing the campaigns of Julius Caesar (100-44 BC), the founder of the Roman Empire. While Caesar himself authored at least part of these commentaries, the authorship of the rest of the texts remains a puzzle that has persisted for nineteen centuries. In particular, the role of Caesar's general Aulus Hirtius, who claimed a role in shaping the corpus, has remained in contention. Determining the authorship of documents is an increasingly important authentication problem in information and computer science, with valuable applications ranging from the domain of art history to counter-terrorism research. We describe two state-of-the-art authorship verification systems and benchmark them on six present-day evaluation corpora, as well as a Latin benchmark dataset. Regarding Caesar's writings, our analyses allow us to establish that Hirtius's claims to part of the corpus must be considered legitimate. We thus demonstrate how computational methods constitute a valuable methodological complement to traditional, expert-based approaches to document authentication.
SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities | 2014
Mike Kestemont; Folgert Karsdorp; Marten During
In this paper we report on an explorative study of the history of the twentieth century from a lexical point of view. As data, we use a diachronic collection of 270,000+ English-language articles harvested from the electronic archive of the well-known Time Magazine (1923–2006). We attempt to automatically identify significant shifts in the vocabulary used in this corpus using efficient, yet unsupervised computational methods, such as Parsimonious Language Models. We offer a qualitative interpretation of the outcome of our experiments in the light of momentous events in the twentieth century, such as the Second World War or the rise of the Internet. This paper follows up on a recent string of frequentist approaches to studying cultural history (‘Culturomics’), in which the evolution of human culture is studied from a quantitative perspective, on the basis of lexical statistics extracted from large, textual data sets.
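A parsimonious language model concentrates probability mass on the terms that distinguish a document from a fixed background corpus, which is what makes it useful for spotting vocabulary shifts. The sketch below follows the standard EM scheme for such models; the mixing weight λ, the iteration count, and the toy inputs are illustrative assumptions, not the paper's exact configuration.

```python
from collections import Counter

def parsimonious_lm(doc_tokens, background, lam=0.9, iters=50, eps=1e-10):
    """EM estimate of a parsimonious language model: term probabilities that
    explain the document *beyond* a fixed background distribution.
    `background` maps word -> corpus probability p(w|C)."""
    tf = Counter(doc_tokens)
    total = sum(tf.values())
    # Initialise p(w|D) with maximum-likelihood estimates.
    p = {w: c / total for w, c in tf.items()}
    for _ in range(iters):
        # E-step: expected counts attributed to the document model.
        e = {}
        for w, c in tf.items():
            doc_part = lam * p[w]
            e[w] = c * doc_part / (doc_part + (1 - lam) * background.get(w, eps))
        # M-step: renormalise expected counts into probabilities.
        norm = sum(e.values())
        p = {w: v / norm for w, v in e.items()}
    return p
```

Terms that are also frequent in the background (e.g. "the") are discounted, while burstier, document-specific terms end up with most of the probability mass.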
Digital Scholarship in the Humanities | 2016
Mike Kestemont; Guy De Pauw; Renske van Nie; Walter Daelemans
In this article, we describe a novel approach to sequence tagging for languages that are rich in (e.g. orthographic) surface variation. We focus on lemmatization, a basic step in many processing pipelines in the Digital Humanities. While this task has long been considered solved for modern languages such as English, there exist many (e.g. historic) languages for which the problem is harder to solve, due to a lack of resources and unstable orthography. Our approach is based on recent advances in the field of ‘deep’ representation learning, where neural networks have led to a dramatic increase in performance across several domains. The proposed system combines two approaches: first, we apply temporal convolutions to model the orthography of input words at the character level; second, we use distributional word embeddings to represent the lexical context surrounding the input words. We demonstrate how this system reaches state-of-the-art performance on a number of representative Middle Dutch data sets, even without corpus-specific parameter tuning.
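The character-level half of such a system can be pictured as a one-hot character matrix fed through filters that slide along the time (character) axis, followed by max-pooling so every word yields a fixed-length vector. This is a bare NumPy sketch of that input representation and operation; the alphabet, padding length, filter width, and hand-made filters are illustrative stand-ins for the learned parameters of the authors' network.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def one_hot_chars(word, max_len=12):
    """Encode a word as a (max_len, |alphabet|) one-hot matrix, padded/truncated."""
    mat = np.zeros((max_len, len(ALPHABET)))
    for i, ch in enumerate(word[:max_len]):
        j = ALPHABET.find(ch)
        if j >= 0:
            mat[i, j] = 1.0
    return mat

def temporal_convolution(x, filters, width=3):
    """Slide each filter over the character axis and max-pool over time,
    yielding one activation per filter regardless of word length."""
    steps = x.shape[0] - width + 1
    feats = []
    for f in filters:                        # f has shape (width, |alphabet|)
        acts = [np.sum(x[t:t + width] * f) for t in range(steps)]
        feats.append(max(acts))
    return np.array(feats)
```

In a trained model the filters come to respond to orthographically variable character patterns (e.g. spelling variants of the same Middle Dutch morpheme), which is why this representation is robust to unstable orthography.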
Finlayson, M.;Miller, B.;Lieto, A. (ed.), Proceedings of the 6th Workshop on Computational Models of Narrative (CMN-2015) | 2015
Folgert Karsdorp; Mike Kestemont; Christof Schöch; van den Bosch
We report on building a computational model of romantic relationships in a corpus of historical literary texts. We frame this task as a ranking problem in which, for a given character, we try to assign the highest rank to the character with whom (s)he is most likely to be romantically involved. As data we use a publicly available corpus of French 17th and 18th century plays (http://www.theatre-classique.fr/) which is well suited for this type of analysis because of the rich markup it provides (e.g. indications of characters speaking). We focus on distributional, so-called second-order features, which capture how speakers are contextually embedded in the texts. At a mean reciprocal rank (MRR) of 0.9 and MRR@1 of 0.81, our results are encouraging, suggesting that this approach might be successfully extended to other forms of social interactions in literature, such as antagonism or social power relations.
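The two evaluation figures are standard ranking metrics: MRR averages the reciprocal rank of the gold partner over all query characters, and MRR@1 is simply the fraction of queries whose top-ranked prediction is correct. A small sketch (the character names in the usage test are illustrative, borrowed from Le Cid, not the paper's data):

```python
def mean_reciprocal_rank(ranked_lists, gold):
    """MRR: average of 1/rank of the gold item per query; 0 if it is absent."""
    total = 0.0
    for query, ranking in ranked_lists.items():
        target = gold[query]
        total += 1.0 / (ranking.index(target) + 1) if target in ranking else 0.0
    return total / len(ranked_lists)

def mrr_at_1(ranked_lists, gold):
    """Fraction of queries whose top-ranked item is the gold item."""
    hits = sum(1 for q, r in ranked_lists.items() if r and r[0] == gold[q])
    return hits / len(ranked_lists)
```

So an MRR of 0.9 with MRR@1 of 0.81 means the gold partner is ranked first for roughly four queries out of five, and very near the top for the rest.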
Digital Philology: A Journal of Medieval Cultures | 2012
Mike Kestemont
In the digital humanities much research has been done concerning stylometry, the computational study of style. Literary authorship attribution, especially, has been a central topic. After a brief introduction, I will discuss the enormous potential of this paradigm for medieval philology, a field that studies so many texts of unknown or disputed origin. At the same time, it will be stressed that stylometry’s application to medieval texts is currently not without problems: many attribution techniques are still controversial and do not account for the specific nature of medieval text production. Throughout this paper, I will tentatively apply two well-established attribution techniques (principal components analysis and Burrows’s Delta) to a number of case studies in Middle Dutch studies. These analyses are restricted to rhyme words, since these words are less likely to have been altered by scribes.
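Burrows's Delta, one of the two techniques applied here, is the mean absolute difference between two texts' z-scored word frequencies, computed against the means and standard deviations of a reference corpus. A minimal sketch, where the Middle Dutch words and the precomputed corpus statistics in the test are purely illustrative:

```python
def burrows_delta(target_freqs, author_freqs, corpus_stats):
    """Burrows's Delta: mean absolute difference of z-scored word frequencies.
    `corpus_stats` maps word -> (mean, stdev) over the reference corpus;
    frequency dicts map word -> relative frequency in that text."""
    total = 0.0
    words = list(corpus_stats)
    for w in words:
        mu, sigma = corpus_stats[w]
        z_target = (target_freqs.get(w, 0.0) - mu) / sigma
        z_author = (author_freqs.get(w, 0.0) - mu) / sigma
        total += abs(z_target - z_author)
    return total / len(words)
```

The candidate author whose profile yields the smallest Delta to the disputed text is the attribution; z-scoring keeps any single very frequent word from dominating the distance.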
Cross-Language Evaluation Forum | 2018
Efstathios Stamatatos; Francisco Rangel; Michael Tschuggnall; Benno Stein; Mike Kestemont; Paolo Rosso; Martin Potthast
PAN 2018 explores several authorship analysis tasks enabling a systematic comparison of competitive approaches and advancing research in digital text forensics. More specifically, this edition of PAN introduces a shared task in cross-domain authorship attribution, where texts of known and unknown authorship belong to distinct domains, and another task in style change detection that distinguishes between single-author and multi-author texts. In addition, a shared task in multimodal author profiling examines, for the first time, a combination of information from both texts and images posted by social media users to estimate their gender. Finally, the author obfuscation task studies how a text by a certain author can be paraphrased so that existing author identification tools are confused and cannot recognize the similarity with other texts of the same author. New corpora have been built to support these shared tasks. A relatively large number of software submissions (41 in total) was received and evaluated. Best paradigms are highlighted while baselines indicate the pros and cons of submitted approaches.
Frontiers in Digital Humanities | 2018
Greta Franzini; Mike Kestemont; Gabriela Rotari; Melina Jander; Jeremi K. Ochab; Emily Franzini; Joanna Byszuk; Jan Rybicki
This article presents the results of a multidisciplinary project aimed at better understanding the impact of different digitization strategies in computational text analysis. More specifically, it describes an effort to automatically discern the authorship of Jacob and Wilhelm Grimm in a body of uncorrected correspondence processed by HTR (Handwritten Text Recognition) and OCR (Optical Character Recognition), reporting on the effect this noise has on the analyses necessary to computationally identify the different writing style of the two brothers. In summary, our findings show that OCR digitization serves as a reliable proxy for the more painstaking process of manual digitization, at least when it comes to authorship attribution. Our results suggest that attribution is viable even when using training and test sets from different digitization pipelines. With regard to HTR, this research demonstrates that even though automated transcription significantly increases the risk of text misclassification when compared to OCR, a cleanliness above ≈ 20% is already sufficient to achieve a higher-than-chance probability of correct binary attribution.