Marco Büchler | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Marco Büchler is active.

Explore More

Publication

Featured researches published by Marco Büchler.

theory and practice of digital libraries | 2012

Increasing recall for text re-use in historical documents to support research in the humanities

Marco Büchler; Gregory R. Crane; Maria Moritz; Alison Babeu

High precision text re-use detection allows humanists to discover where and how particular authors are quoted (e.g., the different sections of Platos work that come in and out of vogue). This paper reports on on-going work to provide the high recall text re-use detection that humanists often demand. Using an edition of one Greek work that marked quotations and paraphrases from the Homeric epics as our testbed, we were able to achieve a recall of at least 94% while maintaining a precision of 73%. This particular study is part of a larger effort to detect text re-use across 15 million words of Greek and 10 million words of Latin available or under development as openly licensed TEI XML.

Künstliche Intelligenz | 2015

Is it research or is it spying? Thinking-through ethics in Big Data AI and other knowledge sciences

Bettina Berendt; Marco Büchler; Geoffrey Rockwell

Abstract“How to be a knowledge scientist after the Snowden revelations?” is a question we all have to ask as it becomes clear that our work and our students could be involved in the building of an unprecedented surveillance society. In this essay, we argue that this affects all the knowledge sciences such as AI, computational linguistics and the digital humanities. Asking the question calls for dialogue within and across the disciplines. In this article, we will position ourselves with respect to typical stances towards the relationship between (computer) technology and its uses in a surveillance society, and we will look at what we can learn from other fields. We will propose ways of addressing the question in teaching and in research, and conclude with a call to action.

international conference on big data | 2014

Scaling historical text re-use

Marco Büchler; Greta Franzini; Emily Franzini; Maria Moritz

empirical methods in natural language processing | 2016

Non-Literal Text Reuse in Historical Texts: An Approach to Identify Reuse Transformations and its Application to Bible Reuse.

Maria Moritz; Andreas Wiederhold; Barbara Pavlek; Yuri Bizzoni; Marco Büchler

Text reuse refers to citing, copying or alluding text excerpts from a text resource to a new context. While detecting reuse in contemporary languages is well supported—given extensive research, techniques, and corpora— automatically detecting historical text reuse is much more difficult. Corpora of historical languages are less documented and often encompass various genres, linguistic varieties, and topics. In fact, historical text reuse detection is much less understood and empirical studies are necessary to enable and improve its automation. We present a linguistic analysis of text reuse in two ancient data sets. We contribute an automated approach to analyze how an original text was transformed into its reuse, taking linguistic resources into account to understand how they help characterizing the transformation. It is complemented by a manual analysis of a subset of the reuse. Our results show the limitations of approaches focusing on literal reuse detection. Yet, linguistic resources can effectively support understanding the non-literal text reuse transformation process. Our results support practitioners and researchers working on understanding and detecting historical reuse.

international conference on computer vision | 2014

Designing Close and Distant Reading Visualizations for Text Re-use

Stefan Jänicke; Thomas Efer; Marco Büchler; Gerik Scheuermann

We present various visualizations for the Text Re-use found among texts of a collection to support answering a broad palette of research questions in the humanities. When juxtaposing all texts of a corpus in form of tuples, we propose the Text Re-use Grid as a distant reading method that emphasizes text tuples with systematic or repetitive Text Re-use. The Text Re-use Browser provides a closer look on the Text Re-use between the two texts of a tuple. Additionally, we present Text Re-use Alignment Visualizations to improve the readability of Text Variant Graphs that are used to compare various text editions to each other. Finally, we illustrate the benefit of the proposed visualizations with four usage scenarios for various topics in literary criticism.

Archive | 2014

Towards a Historical Text Re-use Detection

Marco Büchler; Philip R. Burns; Martin Müller; Emily Franzini; Greta Franzini

Text re-use describes the spoken and written repetition of information. Historical text re-use, with its longer time span, embraces a larger set of morphological, linguistic, syntactic, semantic and copying variations, thus adding a complication to text-reuse detection. Furthermore, it increases the chances of redundancy in a Digital Library. In Natural Language Processing it is crucial to remove these redundancies before applying any kind of machine learning techniques to the text. In Humanities, these redundancies foreground textual criticism and allow scholars to identify lines of transmission. This chapter investigates two aspects of the historical text re-use detection process, based on seven English editions of the Holy Bible. First, we measure the performance of several techniques. For this purpose, when considering a verse—such as book Genesis, Chapter 1, Verse 1—that is present in two editions, one verse is always understood as a paraphrase of the other. It is worth noting that paraphrasing is considered a hyponym of text re-use. Depending on the intention with which the new version was created, verses tend to differ significantly in the wording, but not in the meaning. Secondly, this chapter explains and evaluates a way of extracting paradigmatic relations. However, as regards historical languages, there is a lack of language resources (for example, WordNet) that makes non-literal text re-use and paraphrases much more difficult to identify. These differences are present in the form of replacements, corrections, varying writing styles, etc. For this reason, we introduce both the aforementioned and other correlated steps as a method to identify text re-use, including language acquisition to detect changes that we call paradigmatic relations. The chapter concludes with the recommendation to move from a ”single run” detection to an iterative process by using the acquired relations to run a new task.

Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage | 2017

Analysis of Part-Of-Speech Tagging of Historical German Texts

Markus Paluch; Gabriela Rotari; David Steding; Maximilian Weß; Maria Moritz; Marco Büchler

The amount of data in contemporary digital corpora is too large to be processed manually, which increases the necessity for computer linguistic tools in humanities. However, the processing of natural languages is a challenge for automatic tools, because languages are used heterogeneously. To process a text, often taggers are used that are trained on a standardized language variety (e.g. recent newspaper articles). Unfortunately, these training data often differ from the target texts (i.e. the text on which a trained model later is applied) in terms of language variety and register, which is especially the case for historical texts. Therefore, additional, manual analyses are usually inevitable. Training tools on the target language variety, however, can improve the results of these tools so that the manual prost-processing could be avoided. Thus, the need to process large datasets of diachronic texts and to obtain accurate results in a short time-span requires an adaptable approach. The present paper suggests this adaptable approach, by training taggers on a target language variety, to improve the accuracy of the structure of historical German corpora at the level of part-of-speech-tagging (hereafter POS-tagging). We trained four taggers (Perceptron tagger [26], Hidden Markov Model (HMM) [1], Conditional Random Fields (CRF) [13], and Unigram [21]) each on data from three different literary periods: Baroque (1600-1700), Romanticism (1790-1840) and Modernism (1880-1930). Compared with pre-tagged data, we obtained a maximum accuracy in POS-tagging of 98.3% for a single period (Modernism with Perceptron trained on Modernism) and a maximum mean accuracy for all three periods of 94.3% (Perceptron trained on Romanticism). Compared with manually tagged data, we obtained a maximum accuracy for one period of 96.8% (Romanticism with CRF and HMM trained on Romanticism) and a maximum mean accuracy for all three periods of 92.3% (Perceptron trained on Romanticism). In spite of the heterogeneity of literary data, these results demonstrate a high performance of the POS-taggers if the models are trained on target language varieties. Therefore, this adaptable approach provides reliable data allowing the use of taggers for analysis of different historical texts.

international conference on big data | 2016

Mining and analysing one billion requests to linguistic services

Marco Büchler; Greta Franzini; Emily Franzini; Thomas Eckart

From 2004 to 2016 the Leipzig Linguistic Services (LLS) existed as a SOAP-based cyberinfrastructure of atomic micro-services for the Wortschatz project, which covered different-sized textual corpora in more than 230 languages. The LLS were developed in 2004 and went live in 2005 in order to provide a webservice-based API to these corpus databases. In 2006, the LLS infrastructure began to systematically log and store requests made to the text collection, and in August 2016 the LLS were shut down. This article summarises the experience of the past ten years of running such a cyberinfrastructure with a total of nearly one billion requests. It includes an explanation of the technical decisions and limitations but also provides an overview of how the services were used.

Leonardo | 2013

Historical Relevance Feedback Detection by Text Re-use Networks

Marco Büchler; Gregory R. Crane; Gerhard Heyer

Text re-use has been in the humanists interest for centuries. Collecting parallel texts implies giving a certain information, e.g. a moral statement or report on wars and conflicts, a kind of witness. The more independent parallel texts are collected, the more feasible the information is. The contribution reported here is on automatic detection of text re-use and the usage of a text re-use network to derive a Cultural Heritage Aware PageRank technique given ancient text re-uses like quotations, paraphrases, and allusions.

DH | 2010