Identifying Historical Travelogues in Large Text Corpora Using Machine Learning

Abstract

Travelogues represent an important and intensively studied source for scholars in the humanities, as they provide insights into people, cultures, and places of the past. However, existing studies rarely utilize more than a dozen primary sources, since the human capacities of working with a large number of historical sources are naturally limited. In this paper, we define the notion of travelogue and report upon an interdisciplinary method that, using machine learning as well as domain knowledge, can effectively identify German travelogues in the digitized inventory of the Austrian National Library with F1 scores between 0.94 and 1.00. We applied our method on a corpus of 161,522 German volumes and identified 345 travelogues that could not be identified using traditional search methods, resulting in the most extensive collection of early modern German travelogues ever created. To our knowledge, this is the first time such a method was implemented for the bibliographic indexing of a text corpus on this scale, improving and extending the traditional methods in the humanities. Overall, we consider our technique to be an important first step in a broader effort of developing a novel mixed-method approach for the large-scale serial analysis of travelogues.


Jan Rörden, Doris Gruber, Martin Krickl, and Bernhard Haslhofer
AIT Austrian Institute of Technology, Vienna, Austria; Austrian Academy of Sciences, Vienna, Austria; Austrian National Library, Vienna, Austria


Keywords:

Travelogues · Machine Learning · Digital Humanities.

arXiv [cs.DL]

Travelogues offer a wide range of information on topics closely connected to current challenges, including mass tourism, transnational migration, interculturality and globalization. By definition, documents considered to be travelogues contain perceptions of Otherness related to foreign regions, cultures, or religions. At the same time, travelogues are strongly shaped by the socio-cultural background of the people involved in their production. Comparative analysis allows us, in turn, to scrutinize how (specific) cultures handled Otherness, as well as to examine the evolution of stereotypes and prejudices. This high degree of topicality fosters the continuous growth of studies on travels and travelogues, as can be observed by the sheer flood of publications appearing every year (cf. [23]). While heuristic approaches proved to be fruitful [2,29], many fundamental questions connected to travelogues remain unanswered, among other reasons, because previous analysis of travelogues rarely exceeded a dozen primary sources.

In response, we seek to leverage the possibilities offered by large-scale digitization efforts, as well as novel automated text-mining and machine learning techniques, for the first time, on travelogues. This allows us to significantly increase the quantity of text we can analyze. The overall goal of our work is to develop a novel mixed qualitative and quantitative method for the serial analysis of large-scale text corpora and apply that method to a comprehensive corpus of German language travelogues from the period 1500–1876 (ca. 3,000–3,500 books) drawn from the Austrian Books Online project (ca. 600,000 books) of the Austrian National Library (ONB).

As a first step, and this is the focus of this paper, we seek to provide automated support for scholars in identifying travelogues in large collections of historical documents, which have been scanned and undergone an optical character recognition (OCR) process by Google. A major challenge clearly lies in finding an effective method that can be scaled for large collections, is robust enough to support documents with varying OCR quality and can deal with the evolution of the German language over almost four centuries. Previous studies have already demonstrated the potential of quantitative methods for investigating cultural trends [15] or types of discourses in the past [22], and the effectiveness of automated machine learning techniques for subject indexing [14].
However, to the best of our knowledge, no method has previously been tailored to the specific characteristics and unique challenges of identifying travelogues.

To this end, our contributions can be summarized as follows:

1. We reviewed the characteristics and commonalities of travelogues and combined our findings into a generic definition of a travelogue.

2. We provided a manually annotated dataset of documents that match our working definition of a travelogue in the range of the 16th to the 19th century.

3. We employed that dataset as a ground truth for evaluating a variety of document classification methods and found that a multilayer perceptron (MLP) model trained with a standard bag-of-words (BOW) and bag-of-n-grams (range 1, 2) feature set can effectively identify travelogues with an F1 score of 1.00 (16th c.), 0.94 (17th c.), 0.94 (18th c.) and 0.97 (19th c.).

4. We found that approximately 30 manually annotated documents are needed for training an effective classifier.

We will share the corpus here: https://github.com/Travelogues/travelogues-corpus. The code (as Jupyter notebook) that we used for the classification is available here: https://github.com/Travelogues/identifying-travelogues.

Our results show that standard machine learning approaches can effectively identify travelogues in large text corpora. When we applied our most effective model on the ONB's entire German language corpus, we unearthed 345 travelogues that could not be identified using a traditional keyword search. Thus, we were able to create the most extensive collection of early modern German travelogues to date. This will provide us a solid baseline for determining subsequent steps to develop a serial text-analysis method, which will focus on the specific phenomena of intertextuality and analysis of semantic expressions referring to Otherness.

We will present our definition of travelogue and closely related work in the next section. Afterward, in Section 3, we outline our methodology before presenting our results in Section 4. Finally, we discuss the implications and limitations in Section 5 and conclude our paper in Section 6.

For identifying travelogues we needed, first of all, a precise definition of the notion of a travelogue. In previous research, very broad and general definitions were suggested that, unfortunately, did not resolve all of our questions connected to the classification [20,33]. There was, for instance, no conclusive answer whether or not missives, letters of consuls, or texts only partly including descriptions of actual travel are to be considered travelogues. Consequently, we had to generate our own definition, which aims to apply to all historical eras, geographical regions and media types. Our considerations, however, which are based on an analysis of printed (early) modern travelogues in German, have been formed accordingly and can be characterized as follows:

A travelogue is a specific type of media [7] that reports on a journey which, if detectable, actually took place. Consequently, a travelogue is formed by two relations: the first is content-based (description of a journey) and the second biographical (factuality of the journey).

Our definition builds upon and refines the careful reflections of Almut Höfert [9], who provides a narrower characterization: Fictional narratives are excluded, but there is no binary distinction between fictionality and factuality, since a certain amount of fictionality is part of every travelogue [17,24], apparent factuality was often generated artificially, and fictional narratives influenced reports at times [27].

A journey is a movement in space and time that begins at a starting point and then moves through a variable set of further points outside of the well-known cultural environment of the traveler. In contrast to Wolfgang Treue [28] we include (forced) emigrations and relics of people who died while traveling, but exclude movements on a permanent level (e.g., nomads, vagabonds).

Travelogues can be handed down in various forms, whether through oral speech, non-verbal communication, text, an image or video.
Travelogues obtained from the (early) modern period and available for research consist of text and/or images. The available text is predominantly in prose and can be attributed to several text genres, such as reports, diaries, letters or missives. There are, however, many mixed forms and transition zones here, especially since certain guidelines for the creation of travelogues (the so-called Ars Apodemica) only emerged and were, if at all, partially applied by the authors during the course of the study period [12,26].

The only decisive element of a text to be classified as travelogue is the mention of the fact that it reflects the experiences of an actual journey that was undertaken, with all of the imaginable variations of spellings and semantic forms. A frequent, but not always included, feature is an itinerary listing different stations along the journey and the experiences associated with the stops. Images in travelogues are usually mimetic, predominantly including portraits, landscapes, and depictions of plants, animals or architecture, but may also incorporate abstract representations. The decisive element here is the inclusion of any pictorial form that is a reflection about experiences that occurred during a journey. Thus, a series of pictures that originated from an actual journey and contains no text is also understood to be a travelogue, but such series are not collected within the current project, which focuses specifically on text.

Most of the travelogues from the (early) modern period were written by the traveling persons themselves, are therefore known as ego documents [21], and, predominantly, in a narrower sense considered to be self-testimonies (Selbstzeugnisse) [11,13]. Consequently, the personal experiences and cultural background of the authors, as well as other persons involved in the production of the final document, strongly shape the content of the resulting texts.

However, (early) modern travelogues should not be considered detached from each other, since they depend on each other and/or other (types of) media intertextually [19], interpictorially [8], intermaterially or intermodally [3].
For the definition itself it is considered irrelevant whether, in the case of a publication, a travelogue was published by the traveling person or by someone else (e.g., posthumous publications, later editions, written/edited by a related person), whether they are independent publications, appear in the context of a travel collection, as part of a larger publication (e.g., autobiography, historiography) or in the form of an excerpt.

Generally, linear classifiers have demonstrated solid performance for text classification tasks. This includes support vector machines (SVM) and logistic regression, as shown in [10] and [6]. We build on these findings and evaluate both methods in our experiments.

A recent study in the digital library field, by Mai et al. [14], compared the effectiveness of classification models trained on titles only versus models trained on full-texts and found that the former outperform the latter. They used multilayer perceptron (MLP), convolutional neural network (CNN) and long short-term memory (LSTM) architectures, and found that MLP outperformed the other methods in most cases. Although their models were trained on large-scale datasets from other domains (PubMed, EconBiz) and are therefore not directly applicable, we consider MLPs for building a travelogue identification model.

Dai et al. [5] use an unsupervised method based on word embeddings to cluster social media tweets as related or unrelated to a topic, in their case influenza. Although they use much shorter texts (Twitter posts, or tweets, were limited to 140 characters until 2017), their task is similar to the one we present here in that both are binary classification tasks. The authors report an F1 score as high as 0.847, using pre-trained word embeddings from the Google News dataset. Additionally, the authors compared their approach to other methods such as keyword or related-word analysis but found their solution to perform better. In this paper, we will show that similar scores can be achieved without using pre-trained word embeddings.

In [31], Yang et al. use hierarchical attention networks for document classification, in their case sentiment estimation and topic classification (multi-class). Their model outperforms previous methods, depending on the dataset, reaching F1 scores between 0.494 and 0.758. Additionally, they are able to visualize the informative components of a given document. This might be a suitable method for identifying possible subject indexing terms (classes) in an entire corpus and a subsequent document classification task. However, since the identification of travelogues is a binary classification problem, we refrain from these methods at the moment.

Zhang et al. [32] use character-level convolutional networks for text classification, comparing them against methods such as BOW, n-grams and other neural network architectures. They test on several large-scale datasets (e.g., news, reviews, question/answers, DBPedia), showing that their methodology outperforms most of the other approaches, having up to 40% fewer errors. While the authors do not report F1 scores, they illustrate that treating text as just a sequence of characters, without syntactic or semantic information or even knowing the words, can work well for classification tasks. While we do not apply their findings directly, we take inspiration from their work and use BOW and bag-of-n-grams features.

Our overall goal is to develop a novel mixed qualitative and quantitative method for the serial analysis of large-scale text corpora. Since serial analysis typically focuses on a specific topic or type of document, in this case travelogues, we first need to define a systematic method that supports scholars with diverse backgrounds (historical science, library and information science, data science) in iteratively training a machine learning model that ultimately supports them in locating travelogues within a huge collection of digitized documents.

Figure 1 summarizes the overall workflow and involved participants from a high-level perspective: in the first step, domain experts use the keyword search feature of the Austrian National Library's catalog to search the overall corpus for documents meeting our definition of a travelogue. They manually inspect each result and annotate those matching our definition as being a travelogue. In parallel, the data scientist automatically selects a randomized sample of documents from the overall corpus, which are then manually inspected and verified by the domain experts as being non-travelogues. This process, which is described in more detail in Section 3.2, yields a balanced ground truth corpus consisting of travelogue and non-travelogue documents, which can then be used for subsequent machine learning tasks.

Fig. 1. High-level overview of our interdisciplinary approach. Creation of the ground truth is primarily the responsibility of domain experts (with data scientists contributing to identify non-travelogues). Model creation is completed by data scientists, with the results of the model deployment on the whole corpus being evaluated by the domain experts again.

Before building machine learning models, documents in the ground truth corpus need to be pre-processed, which includes cleansing, normalization and feature engineering steps. Section 3.3 explains in more detail the steps we applied to our documents. Next, we use the pre-processed documents for model building, which includes training various machine learning models, such as SVMs and MLPs. This process is described in Section 3.4. Following this, we evaluate the effectiveness of the trained models (see Section 3.5).

The top-performing model was then deployed and used to classify the remaining documents in our corpus, in an attempt to identify additional, potentially previously unknown travelogue documents. As a result, our iterative method yields a growing travelogue corpus, which can be used for refining the effectiveness of machine learning tasks and for other quantitative and qualitative analytics tasks. In the following sections we describe each step in more detail.

In our work, we are focusing on prints published between 1500 and 1876, which are part of the historical holdings of the ONB. Since 2011, more than 600,000 books (volumes) from that period have been digitized and OCR-processed in a public-private partnership with Google (Austrian Books Online, ABO). Therefore, nearly all of the library's historical books are currently accessible in a digital form. Within this corpus, we are specifically searching for travelogues.

As a first step, we identified German volumes in the overall ABO corpus, and then split the corpus by century. Then we initiated a ground truth by querying over titles and subject headings. We searched for different keywords in German, namely truncated spellings of ‘Reise’ (travel) and ‘Fahrt’ (journey) along with their known variants and with wildcard prefixes and suffixes (in alphabetical order: *faart*, *fahrt*, *fart*, *rais*, *raiß*, *raisz*, *rays*, *rayß*, *raysz*, *reis*, *reiß*, *reisz*, *reys*, *reyß*, *reysz*, *rys*, *ryß*) as well as common subject headings in the library's catalog including ‘Forschungsreise’ (expedition), ‘Reise’ (travel) and ‘Reisebericht’ (travelogue).

As these queries still generated many false positives, we cleaned up the dataset manually. Results were double-checked intellectually by two annotators fluent in German and experienced with early modern German, a historian and a librarian, who read parts of the texts and utilized external bibliographies, biographies and catalogs to confirm whether a document meets our definition of travelogue or belongs to another genre. Uncertainties were resolved unanimously and there were no disagreements on the final annotations.
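To make the wildcard search concrete, the following is a minimal sketch of how the listed truncated spellings could be matched against catalog titles. It is an illustration only, not the library catalog's actual query mechanism: the function name `matches_travel_keyword` is hypothetical, and a wildcard on both sides of a stem is assumed to be equivalent to a case-insensitive substring match.

```python
import re

# Truncated spellings of 'Reise' and 'Fahrt' as listed in the text.
STEMS = [
    "faart", "fahrt", "fart", "rais", "raiß", "raisz", "rays", "rayß",
    "raysz", "reis", "reiß", "reisz", "reys", "reyß", "reysz", "rys", "ryß",
]

# A pattern such as *reis* is treated as "the stem occurs anywhere in the title".
PATTERN = re.compile("|".join(re.escape(stem) for stem in STEMS))


def matches_travel_keyword(title: str) -> bool:
    """Return True if any of the truncated spellings occurs in the title."""
    return PATTERN.search(title.lower()) is not None


if __name__ == "__main__":
    for title in ["Reyßbuch deß heyligen Lands", "Preisverzeichnis"]:
        print(title, matches_travel_keyword(title))
```

Note that a title such as "Preisverzeichnis" (price list) also matches because it contains "reis", which illustrates why such queries produce false positives that need manual review.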
The result of this step is a manually annotated and verified sample of travelogues, which took approximately three months of full time work for both annotators.

Since training and validation of machine learning models also requires counterexamples, in this case, non-travelogues, we implemented an automated procedure for randomly selecting an equally sized sample of documents from the subset of German volumes. Via a manual investigation process conducted by the same annotators, we ensured that those documents were not travelogues. This provides a manually verified sample of non-travelogues.

In total, our travelogues ground truth dataset contains a balanced sample of 6,048 volumes, representing 3.67% of 167,570 German language volumes from the complete ONB corpus. Table 1 provides an overview of our ground truth and its distribution over centuries. One can easily observe that the number of volumes, as well as the size of each publication, increases with time. To provide insight into how likely it is to find a travelogue randomly, we also included the number of travelogues that were found while reviewing the randomized sample we used as counterexamples. This approach was replicated for the 16th, 17th, 18th and 19th centuries. Volumes that were not evaluated remain in the candidates pool, upon which we applied our classifier after identifying the best-performing model.

(Examples of the external bibliographies, biographies and catalogs consulted during annotation: https://lb-eutin.kreis-oh.de/, https://kvk.bibliothek.kit.edu/, https://viaf.org/, Wikipedia.)

Table 1.

Dataset overview. Our corpus consists of the total number of digitized German-language books available to us. The ground truth contains an equal amount of travelogues and randomly selected counterexamples; in brackets, we provide the number of travelogues we found by chance. Books not evaluated remain in the candidates pool. A token contains at least two alphanumeric characters; punctuation etc. is not counted.

Corpus

Century   No. candidate volumes   No. ground truth volumes   Total tokens      Average tokens
16th      8,526                   67/67                      362,244,353       41,829
17th      8,763                   161/161                    651,957,983       71,762
18th      55,971                  873/873                    5,041,741,840     82,274
19th      88,262                  1,897/1,897                11,464,645,150    124,539
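The automated selection of an equally sized random sample of counterexamples could look roughly like the sketch below. It assumes that volumes are referred to by string identifiers; the function name `sample_counterexamples` and the fixed seed are illustrative choices and not taken from the published notebook. The sampled volumes would still have to be verified manually as non-travelogues, as described in the text.

```python
import random


def sample_counterexamples(candidate_ids, travelogue_ids, seed=42):
    """Draw as many random non-annotated volumes as there are travelogues."""
    travelogues = set(travelogue_ids)
    # Sort the pool so that a fixed seed always yields the same sample.
    pool = sorted(set(candidate_ids) - travelogues)
    k = len(travelogues)
    if k > len(pool):
        raise ValueError("not enough candidate volumes to draw a balanced sample")
    return random.Random(seed).sample(pool, k)


if __name__ == "__main__":
    candidates = [f"vol{i}" for i in range(100)]   # hypothetical volume IDs
    travelogues = ["vol1", "vol5", "vol9"]          # hypothetical annotated travelogues
    print(sample_counterexamples(candidates, travelogues))
```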

The preprocessing phase involves several steps. First, the texts were tokenized at the word level, using blanks and punctuation as separators. The German language uses upper- and lowercase spelling, depending on the word type and its position in the sentence, but to compensate for OCR and orthographic errors we transformed all tokens to lowercase. Furthermore, we removed all tokens that do not contain at least two alphanumeric characters, as this removes OCR artifacts, which are often special characters. For the same reason, each token needs to appear at least twice in the whole corpus.
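The preprocessing steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions rather than the published implementation: the function names are hypothetical, "appears at least twice in the whole corpus" is interpreted as the total number of occurrences across all documents, and the regular expression `\w+` is used as an approximation of splitting on blanks and punctuation.

```python
import re
from collections import Counter

# Runs of word characters; blanks and punctuation act as separators.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase, split into tokens, and keep tokens with >= 2 alphanumeric characters."""
    tokens = TOKEN_RE.findall(text.lower())
    return [tok for tok in tokens if sum(ch.isalnum() for ch in tok) >= 2]


def preprocess_corpus(texts, min_count=2):
    """Tokenize all documents and drop tokens occurring fewer than min_count times overall."""
    tokenized = [tokenize(text) for text in texts]
    counts = Counter(tok for doc in tokenized for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in tokenized]


if __name__ == "__main__":
    docs = ["Die Reise nach Wien. Die Reise!", "Eine Fahrt nach Wien, a 1 x"]
    print(preprocess_corpus(docs))
```

In this toy example the single-character tokens are removed by the first filter, and "eine" and "fahrt" are removed by the frequency filter because each occurs only once.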

As shown in Table 1, the documents that we seek to classify are rather large, as they contain on average 41,000–124,000 tokens. We decided to use a combination of BOW and bag-of-n-grams, as shown by Wang and Manning [30]. With this approach, we can handle intricate problems with our data while having a computationally effective method that still provides competitive results.

Experiments were performed on the above-mentioned ground truth. We tested different classification algorithms:

– Multinomial Naive Bayes (MNB)
– Support Vector Machine (SVM)
– Logistic regression (Log)
– Multilayer perceptron (MLP)

For the MNB, SVM and Log algorithms we used the sklearn [18] implementation. We use the TensorFlow [1] and Keras [4] implementation for the MLP. The data for all algorithms was vectorized and hashed with the sklearn HashingVectorizer.
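To illustrate the idea behind hashed BOW and bag-of-n-grams features (range 1, 2), here is a dependency-free sketch of the hashing trick. It is not the sklearn HashingVectorizer itself, which uses a different hash function and additional options such as normalization; the function names, the CRC32 hash and the default of 2^20 buckets are assumptions made for this example.

```python
import zlib
from collections import Counter


def word_ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def hashed_bag(tokens, n_features=2 ** 20):
    """Map unigrams and bigrams to bucket indices and count occurrences per bucket."""
    bag = Counter()
    for gram in word_ngrams(tokens):
        # CRC32 is deterministic across runs, unlike Python's built-in hash() for strings.
        index = zlib.crc32(gram.encode("utf-8")) % n_features
        bag[index] += 1
    return dict(bag)


if __name__ == "__main__":
    tokens = ["reise", "nach", "wien"]
    print(word_ngrams(tokens))
    print(hashed_bag(tokens))
```

The appeal of hashing for documents of this size is that no vocabulary has to be kept in memory: every n-gram is mapped directly to a fixed-size index space, at the cost of occasional collisions.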

As a baseline, we applied a random classification. In all the experiments, we treat every book as a single document.

First, we split the ground truth into a training set (75%) and a validation set (25%) for every time period. We evaluated all classifiers presented here first through a five-fold cross-validation on the training split. This essentially means that the training set was split into five equally sized subsets, and for each fold one subset serves as the test set while the other four become the training data. When the results across the folds are comparable, good scores are less likely to occur by chance. Subsequently, we applied the classifiers on the held-out validation data. The classification results are discussed in Section 4.

The evaluation of our work follows a two-step approach. First, we gauge the effectiveness of a given method by precision, recall and F1 metrics on our training set. Precision is the number of correct results, divided by the number of all returned results. Recall shows how many of the documents that should have been found are actually found, dividing the number of correctly classified documents by the number of documents that actually belong to that class. F1 is the harmonic mean of precision and recall (with a range between 0 and 1, with 1 representing the perfect result).

For the second step of our evaluation, we apply the model that performs best on our training data to the remaining documents of our corpus. This results in a list of all those documents, with probability scores indicating how likely they belong to our travelogues class. Starting with the highest probability, those documents are then manually evaluated by domain experts (see Section 4), to judge how well this model identifies travelogues in a set of unseen documents.

Additionally, we wanted to understand how many ground truth documents are needed to train an effective classifier. We approached this by testing the top classification approach against different amounts of ground truth documents and evaluated the results. The same setup as described above is used, but we varied the number of ground truth documents, as well as randomizing their selection.

For each time frame, we evaluated 5, 10, 15, 20, 25, 30 and 50 examples each for the positive class (travelogue) and the negative class (anything else); for the 18th and 19th centuries, we extended to 100 examples each. The model created with those documents was then tested on the remaining ground truth documents. For each sample size, we repeated this a total of five times with a different randomized sample each time.

Table 2 shows the evaluation of our classification algorithms.

Table 2.

Classification results. We provide precision (P), recall (R) and F1 scores for multinomial naive Bayes (MNB), support vector machine (SVM), logistic regression (Log) and a multi-layer perceptron neural network (MLP).

           MNB           SVM           Log           MLP
Century    P   R   F1    P   R   F1    P   R   F1    P   R   F1
16th       0.73
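For reference, the precision, recall and F1 measures reported for each classifier can be computed from true and predicted labels as in the following sketch. The function name `precision_recall_f1` and the convention of returning 0.0 when a denominator is zero are assumptions for this example; the label 1 stands for the travelogue class.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
    print(precision_recall_f1(y_true, y_pred))
```

In the toy example, two of the three returned documents are correct (precision 2/3) and two of the four actual travelogues are found (recall 1/2), giving an F1 of 4/7.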

Our results show that it is possible to achieve good classification scores with our dataset, even without extensive feature engineering or pre-trained word embeddings. We assume that this is due to the comparably large size of our data points, as can be seen in Table 1. (Many works focus on datasets that have more, but shorter documents; cf. [31] for comparisons of multiple classification methods and datasets.) Additionally, through the randomness involved in the selection of non-travelogues, those volumes are expected to have a high variance in genres, which matches the whole corpus as well. Comparing numerous examples of one genre against an equal number of volumes covering many more genres certainly benefits the classification, especially when taking into account the length of the documents.

Taking the results from this evaluation, we were confident in approaching our main task, which was the identification of travelogues from a much larger dataset: the other digitized books in our corpus not yet evaluated by us.

Following the training of models suitable for classification, one specifically designed for each century, we applied them to our pool of candidates. We used this process to create a list of books that are potentially travelogues, ranked from highest to lowest probability, and evaluated the first 200 items. The results of this are shown in Table 3, and have been subject to the same scrutiny as our initial ground truth.

We can show that our methodology proves a clear improvement over a less guided evaluation: within our evaluated findings, true positives made up 12.5% (16th c.), 30% (17th c.), 41.5% (18th c.) and 89.5% (19th c.) respectively. Due to time constraints, our evaluation was discontinued after the first 200 items, but this already means that we discovered 345 books of the travelogue genre that were not found by traditional search queries on metadata, as we explained in Section 3.2. Discovery by chance only resulted in 3% (16th c.) and 0.8% (18th c.) positives, or none at all (17th c., 19th c.).

It also has to be noted that the increase in the percentage of true positives in a diachronic perspective is connected to the equally increasing number of travelogues. There simply are many more travelogues that were printed in the 19th century than in the previous time periods.

Table 3.

Applying the classifier on the candidates pool. By chance shows how many travelogues were found when randomly selecting examples for the ground truth.

Century   No. candidates   Confirmed (top 200)   By chance
16th
17th
18th
19th

During this evaluation, it became apparent that the majority of potential findings with a high probability belong to the group we attribute as historiography. Due to their nature, they have a strong overlap with travelogues but lack the required criterion of describing a journey actually experienced by the author. This crucial information can in many cases only be gathered from external sources (cf. Section 2.1). From a purely technical perspective, there is no difference in the content of the books between travelogues and historiographies. This means that while, for the purposes of this project, they are very different, right now we cannot further differentiate between them. Additionally, in the 18th century, many false positives include publications on geography; a possible explanation here is that they often describe locations, which naturally overlap with travelogues as well.

The result of our efficiency evaluation is depicted in Figure 2. We provide the average F1 score for each time frame, as well as the variance between the samples.

For every time frame, our general observation is that the performance of the MLP classification fluctuates heavily when only a small dataset is available. With at least 20 documents, or better 30, each for the positive and negative class, a stable performance above an F1 score of 0.8 can be reached. After that it slowly increases up to an F1 score of 0.9 at 50 examples each, with only very minor changes for 100 examples each.

This experiment shows that it is possible to create a working classification methodology, which reaches acceptable results with a modest time investment up front, as shown in Table 2.

Fig. 2.

Classifier efficiency evaluation for MLP. Every step has a balanced number of travelogues and non-travelogues (5/5, 10/10, 15/15, 20/20, 25/25, 30/30, 50/50 and 100/100). The experiment was repeated five times for each step.
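The repeated subsampling protocol behind this efficiency evaluation can be sketched as follows. The sketch only covers the sampling and aggregation logic; the actual model training is abstracted into a caller-supplied function `train_and_score`, which is a hypothetical placeholder expected to return an F1 score. Documents are assumed to be referred to by hashable identifiers, and steps for which not enough ground truth is available are skipped.

```python
import random
import statistics


def learning_curve(positives, negatives, sizes, train_and_score, repeats=5, seed=0):
    """For each sample size, train on n + n random documents and test on the rest."""
    rng = random.Random(seed)
    results = {}
    for n in sizes:
        if n >= len(positives) or n >= len(negatives):
            continue  # not enough ground truth left over for testing at this step
        scores = []
        for _ in range(repeats):
            train_pos = rng.sample(positives, n)
            train_neg = rng.sample(negatives, n)
            chosen = set(train_pos) | set(train_neg)
            test_pos = [doc for doc in positives if doc not in chosen]
            test_neg = [doc for doc in negatives if doc not in chosen]
            scores.append(train_and_score(train_pos, train_neg, test_pos, test_neg))
        # Mean score and variance between the repeated samples.
        results[n] = (statistics.mean(scores), statistics.pvariance(scores))
    return results


def dummy_score(train_pos, train_neg, test_pos, test_neg):
    """Placeholder scorer used only to exercise the loop; not a real classifier."""
    return len(train_pos) / (len(train_pos) + len(test_pos))


if __name__ == "__main__":
    pos = [f"t{i}" for i in range(67)]   # hypothetical travelogue IDs
    neg = [f"n{i}" for i in range(67)]   # hypothetical non-travelogue IDs
    print(learning_curve(pos, neg, [5, 10, 15, 20, 25, 30, 50, 100], dummy_score))
```

With 67 documents per class, as in the 16th century ground truth, the 100/100 step is skipped automatically, which mirrors the fact that the largest step was only evaluated for the 18th and 19th centuries.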

(Depending on the sources, and if additional definitions etc. are needed, the time investment up front is between several hours and up to a few weeks of full time work.)

Our results show that standard machine learning techniques combined with relatively easily computable features (BOW and bag-of-n-grams) can effectively support scholars in identifying travelogues in a large-scale document corpus. Using the same features for an MLP neural network generates even better results. This is an important first step in the development of a broader mixed-method approach for the large-scale serial analysis of travelogues.

Specifically, we discovered a total of 345 travelogues in the evaluated time periods using the top 200 findings with the highest confidence scores each (800 in total). We previously were not able to find any of these files through search queries based on metadata or manual search in our catalog, hence this directly translates into a re-discovery of sources for scholars in the humanities.

Additionally, a large fraction of false positives is, at the level of words and their semantics, extremely similar to the true positives. However, going by the definition provided earlier we are required to use external information that could not be included as a feature, as it is dependent on domain knowledge. This severely limits the efficacy of unsupervised machine learning and deep learning approaches.

A clear limitation of our effort lies in the time and effort required to create a high-quality ground truth. While this effort could possibly be reduced by applying unsupervised clustering techniques beforehand, annotations provided by domain experts will always be key for effective learning techniques. Applying active learning techniques (cf. [25]) for iteratively developing ground truths could be a possible strategy for reducing this manual annotation effort.
Another limitation of our approach lies in its focus on entire volumes, which currently neglects the fact that a volume may include both travelogues and non-travelogues. Using a wider spectrum of semantically richer features (cf. [16]), such as named entities, could support classification at the paragraph or page level.

Nonetheless, our experiments on the efficiency of the classification method presented here show that it is possible to achieve robust results above an F1 score of 0.8 with a relatively small ground truth. Knowing this, future research in different domains should require substantially less time investment to get started.

A remaining challenge lies in the distinction between highly similar genres, in our case between travelogues and historiographies or geographic books. We hope that this can be tackled by further refining the ground truth to fit the given genres, taking into account a wide range of external sources in order to include domain knowledge in a structured way.
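The candidate selection reported in this discussion, keeping the 200 highest-confidence findings from each period, can be sketched as a simple per-group ranking. The function name, the tuple layout and the period labels below are hypothetical and serve only to illustrate the step.

```python
import heapq
from collections import defaultdict


def top_k_per_period(predictions, k=200):
    """Keep the k highest-confidence candidates within each period.

    predictions: iterable of (volume_id, period, confidence) tuples.
    Returns {period: [(volume_id, confidence), ...]} sorted by
    descending confidence.
    """
    by_period = defaultdict(list)
    for volume_id, period, confidence in predictions:
        by_period[period].append((volume_id, confidence))
    return {
        period: heapq.nlargest(k, candidates, key=lambda c: c[1])
        for period, candidates in by_period.items()
    }


# Hypothetical classifier output for two periods.
example = [
    ("v1", "16th c.", 0.91), ("v2", "16th c.", 0.99), ("v3", "16th c.", 0.40),
    ("v4", "17th c.", 0.75), ("v5", "17th c.", 0.88),
]
shortlist = top_k_per_period(example, k=2)
```

Ranking within each period, rather than across the whole corpus, prevents a period with many confident predictions from crowding out the others; the resulting shortlist is what domain experts would then review by hand.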

In this paper, we have described a methodology to identify historical travelogues in a large dataset. Our approach combines the knowledge of both domain experts (historical science, library and information science) and data scientists to create a ground truth and subsequently build an MLP model, successfully identifying 345 previously unknown travelogues. Furthermore, we have shown that a ground truth for this kind of data can be as small as 30 examples each for the positive and negative class and still perform well.

In the upcoming weeks and months, we will begin looking at the discovered travelogues in more detail. In a first step, we will identify intertextual relations in our corpus to find out in what way the travelogues depended on each other, and why and how (certain) stereotypes and prejudices were handed down, evolved, or disappeared over the centuries.

Ultimately, we want to know how foreign cultures, places and people were perceived, and whether the perceptions differed depending on the socio-cultural background of the people involved. This will allow us to come to an understanding of how and why something was perceived as Other, and if and how this changed across the centuries. For this, we have done the groundwork here, as it is crucial to rely on as much data as possible; concrete next steps in this direction include creating a formal description of intertextuality and Otherness, and translating it into a set of machine-readable text features.

Acknowledgments

The work in the Travelogues project is funded through an international project grant by the Austrian Science Fund (FWF, Austria: I 3795) and the German Research Foundation (DFG, Germany: 398697847).

References

1. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems (2015), software available from tensorflow.org
2. Agai, B., Conermann, S. (eds.): Wenn einer eine Reise tut, hat er was zu erzählen. Präfiguration – Konfiguration – Refiguration in muslimischen Reiseberichten. ebv, Berlin (2013)
3. Bellingradt, D., Salman, J.: Books and Book History in Motion: Materiality, Sociality and Spatiality. In: Bellingradt, D., Nelles, P., Salman, J. (eds.) Books in Motion in Early Modern Europe. Beyond Production, Circulation and Consumption, pp. 1–11. Springer International Publishing, Cham (2017)
4. Chollet, F., et al.: Keras. https://keras.io (2015)
5. Dai, X., Bikdash, M., Meyer, B.: From social media to public health surveillance: Word embedding based clustering method for twitter classification. In: SoutheastCon 2017. pp. 1–7. IEEE (2017)
6. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: Liblinear: A library for large linear classification. Journal of Machine Learning Research (Aug), 1871–1874 (2008)
7. Genz, J., Gévaudan, P.: Medialität, Materialität, Kodierung: Grundzüge einer allgemeinen Theorie der Medien, pp. 201–209. transcript, Bielefeld (2016)
8. Greve, A.: Die Konstruktion Amerikas: Bilderpolitik in den "Grands Voyages" aus der Werkstatt de Bry, Europäische Kulturstudien, vol. 14. Böhlau, Köln, Weimar und Wien (2004)
9. Höfert, A.: Den Feind beschreiben. »Türkengefahr« und europäisches Wissen über das Osmanische Reich 1450–1600, pp. 120–122. Ferdinand Schöningh, Paderborn (2014)
10. Joachims, T.: Text categorization with support vector machines: Learning with many relevant features. In: European conference on machine learning. pp. 137–142. Springer (1998)
11. von Krusenstjern, B.: Was sind Selbstzeugnisse? Begriffskritische und quellenkundliche Überlegungen anhand von Beispielen aus dem 17. Jahrhundert. Forum Historische Anthropologie, 462–471 (1994)
12. Kürbis, H.: Hispania descripta: Von der Reise zum Bericht. Deutschsprachige Reiseberichte des 16. und 17. Jahrhunderts über Spanien. Ein Beitrag zur Struktur und Funktion der frühneuzeitlichen Reiseliteratur, pp. 345–356. Peter Lang, Frankfurt am Main (2004)
13. Lüdtke, A., et al. (eds.): Selbstzeugnisse der Neuzeit (1993–)
14. Mai, F., Galke, L., Scherp, A.: Using deep learning for title-based semantic subject indexing to reach competitive performance to full-text. In: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries. pp. 169–178. ACM (2018)
15. Michel, J.B., Shen, Y.K., Aiden, A.P., Veres, A., Gray, M.K., Pickett, J.P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., et al.: Quantitative analysis of culture using millions of digitized books. Science (6014), 176–182 (2011)
16. Momeni, E., Tao, K., Haslhofer, B., Houben, G.J.: Identification of useful user comments in social media: a case study on flickr commons. In: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital libraries. pp. 1–10. ACM (2013)
17. Nünning, A.: Zur mehrfachen Präfiguration/Prämediation der Wirklichkeitsdarstellung im Reisebericht: Grundzüge einer narratologischen Theorie, Typologie und Poetik der Reiseliteratur. In: Gymnich, M., et al. (eds.) Points of Arrival: Travels in Time, Space, and Self/Zielpunkte: Unterwegs in Zeit, Raum und Selbst, pp. 11–32. Francke, Tübingen (2008)
18. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2825–2830 (2011)
19. Pfister, M.: Intertextuelles Reisen, oder: Der Reisebericht als Intertext. In: Wetzel, H.H. (ed.) Reisen in den Mittelmeerraum, pp. 55–101. Passavia Universitätsverlag, Passau (1993)
20. Piera, M.: Travel as episteme–an Introductory Journey. In: Piera, M. (ed.) Remapping Travel Narratives (1000–1700), pp. 1–22. Arc Humanities Press, Leeds (2018)
21. Presser, J.: Memoires als geschiedbron. In: Presser, J. (ed.) Uit het werk van Jacob Presser, pp. 277–282. Athenaeum-Polak & Van Gennep, Amsterdam (1969)
22. Purschwitz, A.: Netzwerke des Wissens – Thematische und personelle Relationen innerhalb der halleschen Zeitungen und Zeitschriften der Aufklärungsepoche (1688–1818). Journal of Historical Network Research (1), 109–142 (2018)
23. Salzani, C., Tötösy de Zepetnek, S.: Bibliography for work in travel studies. CLCWeb Library p. travelstudiesbibliography (2010), https://docs.lib.purdue.edu/clcweblibrary/travelstudiesbibliography
24. Sandrock, K.: Truth and Lying in Early Modern Travel Narratives: Coryat's Crudities, Lithgow's Totall Discourse and Generic Change. European Journal of English Studies (2), 189–203 (2015)
25. Settles, B.: Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning (1), 1–114 (2012)
26. Stagl, J.: Apodemiken. Eine räsonnierte Bibliographie der reisetheoretischen Literatur des 16., 17. und 18. Jahrhunderts, Quellen und Abhandlungen zur Geschichte der Staatsbeschreibung und Statistik (QASS), vol. 2. Ferdinand Schöningh, Paderborn et al. (1983)
27. Stagl, J.: Eine Geschichte der Neugier. Die Kunst des Reisens 1550–1800, pp. 77–122, 242–251. Böhlau, Köln, Weimar und Wien (2002)
28. Treue, W.: Abenteuer und Anerkennung. Reisende und Gereiste in Spätmittelalter und Frühneuzeit (1400–1700), p. 8. Ferdinand Schöningh, Paderborn (2014)
29. Van Groesen, M.: The representations of the overseas world in the De Bry collection of voyages (1590–1634). Brill, Boston and Leiden (2008)
30. Wang, S., Manning, C.D.: Baselines and bigrams: Simple, good sentiment and topic classification. In: Proceedings of the 50th annual meeting of the association for computational linguistics: Short papers-volume 2. pp. 90–94. Association for Computational Linguistics (2012)
31. Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., Hovy, E.: Hierarchical attention networks for document classification. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 1480–1489 (2016)
32. Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. In: Advances in neural information processing systems. pp. 649–657 (2015)
33. von Zimmermann, C.: Texttypologische Überlegungen zum frühneuzeitlichen Reisebericht: Annäherung an eine Gattung. Archiv für das Studium der neueren Sprachen und Literaturen 154