Identifying Historical Travelogues in Large Text Corpora Using Machine Learning
Jan Rörden, Doris Gruber, Martin Krickl, and Bernhard Haslhofer
AIT Austrian Institute of Technology, Vienna, Austria; Austrian Academy of Sciences, Vienna, Austria; Austrian National Library, Vienna, Austria
Abstract.
Travelogues represent an important and intensively studied source for scholars in the humanities, as they provide insights into people, cultures, and places of the past. However, existing studies rarely utilize more than a dozen primary sources, since the human capacities of working with a large number of historical sources are naturally limited. In this paper, we define the notion of travelogue and report upon an interdisciplinary method that, using machine learning as well as domain knowledge, can effectively identify German travelogues in the digitized inventory of the Austrian National Library with F1 scores between 0.94 and 1.00. We applied our method on a corpus of 161,522 German volumes and identified 345 travelogues that could not be identified using traditional search methods, resulting in the most extensive collection of early modern German travelogues ever created. To our knowledge, this is the first time such a method was implemented for the bibliographic indexing of a text corpus on this scale, improving and extending the traditional methods in the humanities. Overall, we consider our technique to be an important first step in a broader effort of developing a novel mixed-method approach for the large-scale serial analysis of travelogues.
Keywords:
Travelogues · Machine Learning · Digital Humanities.
Travelogues offer a wide range of information on topics closely connected to current challenges, including mass tourism, transnational migration, interculturality and globalization. By definition, documents considered to be travelogues contain perceptions of Otherness related to foreign regions, cultures, or religions. At the same time, travelogues are strongly shaped by the socio-cultural background of the people involved in their production. Comparative analysis allows us, in turn, to scrutinize how (specific) cultures handled Otherness, as well as to examine the evolution of stereotypes and prejudices. This high degree of topicality fosters the continuous growth of studies on travels and travelogues, as can be observed by the sheer flood of publications appearing every year (cf. [23]). While heuristic approaches proved to be fruitful [2,29], many fundamental questions connected to travelogues remain unanswered, among other reasons, because previous analyses of travelogues rarely exceeded a dozen primary sources.
In response, we seek to leverage the possibilities offered by large-scale digitization efforts, as well as novel automated text-mining and machine learning techniques, for the first time, on travelogues. This allows us to significantly increase the quantity of text we can analyze. The overall goal of our work is to develop a novel mixed qualitative and quantitative method for the serial analysis of large-scale text corpora and apply that method to a comprehensive corpus of German language travelogues from the period 1500–1876 (ca. 3,000–3,500 books) drawn from the Austrian Books Online project (ca. 600,000 books) of the Austrian National Library (ONB).
As a first step, and this is the focus of this paper, we seek to provide automated support for scholars in identifying travelogues in large collections of historical documents, which have been scanned and undergone an optical character recognition (OCR) process by Google. A major challenge clearly lies in finding an effective method that can be scaled for large collections, is robust enough to support documents with varying OCR quality and can deal with the evolution of the German language over almost four centuries. Previous studies have already demonstrated the potential of quantitative methods for investigating cultural trends [15] or types of discourses in the past [22], and the effectiveness of automated machine learning techniques for subject indexing [14].
However, to the best of our knowledge, no method has previously been tailored to the specific characteristics and unique challenges of identifying travelogues.
To this end, our contributions can be summarized as follows:
1. We reviewed the characteristics and commonalities of travelogues and combined our findings into a generic definition of a travelogue.
2. We provided a manually annotated dataset of documents that match our working definition of a travelogue in the range of the 16th to the 19th century.
3. We employed that dataset as a ground truth for evaluating a variety of document classification methods and found that a multilayer perceptron (MLP) model trained with a standard bag-of-words (BOW) and bag-of-n-grams (range 1, 2) feature set can effectively identify travelogues with an F1 score of 1 (16th c.), 0.94 (17th c.), 0.94 (18th c.) and 0.97 (19th c.).
4. We found that approximately 30 manually annotated documents are needed for training an effective classifier.
We will share the corpus here: https://github.com/Travelogues/travelogues-corpus. The code (as Jupyter notebook) that we used for the classification is available here: https://github.com/Travelogues/identifying-travelogues.
Our results show that standard machine learning approaches can effectively identify travelogues in large text corpora. When we applied our most effective model on the ONB's entire German language corpus, we unearthed 345 travelogues that could not be identified using a traditional keyword search. Thus, we were able to create the most extensive collection of early modern German travelogues to date. This will provide us with a solid baseline for determining subsequent steps to develop a serial text-analysis method, which will focus on the specific phenomena of intertextuality and the analysis of semantic expressions referring to Otherness.
We will present our definition of travelogue and closely related work in the next section. Afterward, in Section 3, we outline our methodology before presenting our results in Section 4. Finally, we discuss the implications and limitations in Section 5 and conclude our paper in Section 6.
For identifying travelogues we needed, first of all, a precise definition of the notion of a travelogue. In previous research, very broad and general definitions were suggested that, unfortunately, did not resolve all of our questions connected to the classification [20,33]. There was, for instance, no conclusive answer as to whether or not missives, letters of consuls, or texts only partly including descriptions of actual travel are to be considered travelogues. Consequently, we had to generate our own definition, which aims to apply to all historical eras, geographical regions and media types. Our considerations, however, which are based on an analysis of printed (early) modern travelogues in German, have been formed accordingly and can be characterized as follows:
A travelogue is a specific type of media [7] that reports on a journey which, if detectable, actually took place. Consequently, a travelogue is formed by two relations: the first is content-based (description of a journey) and the second biographical (factuality of the journey).
Our definition builds upon and refines the careful reflections of Almut Höfert [9], who provides a narrower characterization: Fictional narratives are excluded, but there is no binary distinction between fictionality and factuality, since a certain amount of fictionality is part of every travelogue [17,24], apparent factuality was often generated artificially, and fictional narratives influenced reports at times [27].
A journey is a movement in space and time that begins at a starting point and then moves through a variable set of further points outside of the well-known cultural environment of the traveler. In contrast to Wolfgang Treue [28] we include (forced) emigrations and relics of people who died while traveling, but exclude movements on a permanent level (e.g., nomads, vagabonds).
Travelogues can be handed down in various forms, whether through oral speech, non-verbal communication, text, an image or video. Travelogues obtained from the (early) modern period and available for research consist of text and/or images. The available text is predominantly in prose and can be attributed to several text genres, such as reports, diaries, letters or missives. That said, there are many mixed forms and transition zones here, especially since certain guidelines for the creation of travelogues (the so-called Ars Apodemica) only emerged and were, if at all, partially applied by the authors during the course of the study period [12,26].
The only decisive element for a text to be classified as a travelogue is the mention of the fact that it reflects the experiences of an actual journey that was undertaken, with all of the imaginable variations of spellings and semantic forms. A frequent, but not always included, feature is an itinerary listing different stations along the journey and the experiences associated with the stops. Images in travelogues are usually mimetic, predominantly including portraits, landscapes, and depictions of plants, animals or architecture, but may also incorporate abstract representations. The decisive element here is the inclusion of any pictorial form that reflects experiences that occurred during a journey. Thus, a series of pictures that originated from an actual journey and contains no text is also understood to be a travelogue, but such series are not collected within the current project, which focuses specifically on text.
Most of the travelogues from the (early) modern period were written by the traveling persons themselves, are therefore known as ego documents [21], and are predominantly, in a narrower sense, considered to be self-testimonies (Selbstzeugnisse) [11,13]. Consequently, the personal experiences and cultural background of the authors, as well as of other persons involved in the production of the final document, strongly shape the content of the resulting texts.
However, (early) modern travelogues should not be considered detached from each other, since they depend on each other and/or other (types of) media intertextually [19], interpictorially [8], intermaterially or intermodally [3]. For the definition itself it is considered irrelevant whether, in the case of a publication, a travelogue was published by the traveling person or by someone else (e.g., posthumous publications, later editions, written/edited by a related person), and whether it is an independent publication or appears in the context of a travel collection, as part of a larger publication (e.g., autobiography, historiography) or in the form of an excerpt.
Generally, linear classifiers have demonstrated solid performance for text classification tasks. This includes support vector machines (SVM) and logistic regression, as shown in [10] and [6]. We build on these findings and evaluate both methods in our experiments.
A recent study in the digital library field, by Mai et al. [14], compared the effectiveness of classification models trained on titles only versus models trained on full-texts and found that the former outperform the latter. They used multilayer perceptron (MLP), convolutional neural network (CNN) and long short-term memory (LSTM) architectures, and found that MLP outperformed the other methods in most cases. Although their models were trained on large-scale datasets from other domains (PubMed, EconBiz) and are therefore not directly applicable, we consider MLPs for building a travelogue identification model.
Dai et al. [5] use an unsupervised method based on word embeddings to cluster social media tweets as related or unrelated to a topic, in their case influenza. Although they use much shorter texts (Twitter posts, or tweets, were limited to 140 characters until 2018), their task is similar to the one we present here in that both are binary classification tasks. The authors report an F1 score as high as 0.847, using pre-trained word embeddings from the Google News dataset. Additionally, the authors compared their approach to other methods such as keyword or related-word analysis but found their solution to perform better. In this paper, we will show that similar scores can be achieved without using pre-trained word embeddings.
In [31], Yang et al. use hierarchical attention networks for document classification, in their case sentiment estimation and topic classification (multi-class). Their model outperforms previous methods, reaching F1 scores between 0.494 and 0.758 depending on the dataset. Additionally, they are able to visualize the informative components of a given document. This might be a suitable method for identifying possible subject indexing terms (classes) in an entire corpus and for a subsequent document classification task. However, since the identification of travelogues is a binary classification problem, we refrain from these methods at the moment.
Zhang et al. [32] use character-level convolutional networks for text classification, comparing them against methods such as BOW, n-grams and other neural network architectures. They test on several large-scale datasets (e.g., news, reviews, question/answers, DBPedia), showing that their methodology outperforms most of the other approaches, having up to 40% fewer errors. While the authors do not report F1 scores, they illustrate that treating text as just a sequence of characters, without syntactic or semantic information or even knowing the words, can work well for classification tasks. While we do not apply their findings directly, we take inspiration from their work and use BOW and bag-of-n-grams features.
Our overall goal is to develop a novel mixed qualitative and quantitative method for the serial analysis of large-scale text corpora. Since serial analysis typically focuses on a specific topic or type of document, in this case travelogues, we first need to define a systematic method that supports scholars with diverse backgrounds (historical science, library and information science, data science) in iteratively training a machine learning model that ultimately supports them in locating travelogues within a huge collection of digitized documents.
Figure 1 summarizes the overall workflow and involved participants from a high-level perspective: in the first step, domain experts use the keyword search feature of the Austrian National Library's catalog to search the overall corpus for documents meeting our definition of a travelogue. They manually inspect each result and annotate those matching our definition as being a travelogue. In parallel, the data scientist automatically selects a randomized sample of documents from the overall corpus, which are then manually inspected and verified by the domain experts as being non-travelogues. This process, which is described in more detail in Section 3.2, yields a balanced ground truth corpus consisting of travelogue and non-travelogue documents, which can then be used for subsequent machine learning tasks.
Fig. 1. High-level overview of our interdisciplinary approach. Creation of the ground truth is primarily the responsibility of domain experts (with data scientists contributing to identify non-travelogues). Model creation is completed by data scientists, with the results of the model deployment on the whole corpus being evaluated by the domain experts again.
Before building machine learning models, documents in the ground truth corpus need to be pre-processed, which includes cleansing, normalization and feature engineering steps. Section 3.3 explains in more detail the steps we applied to our documents. Next, we use the pre-processed documents for model building, which includes training various machine learning models, such as SVMs and MLPs. This process is described in Section 3.4. Following this, we evaluate the effectiveness of the trained models (see Section 3.5).
The top-performing model was then deployed and used to classify the remaining documents in our corpus, in an attempt to identify additional, potentially previously unknown travelogue documents. As a result, our iterative method yields a growing travelogue corpus, which can be used for refining the effectiveness of machine learning tasks and for other quantitative and qualitative analytics tasks. In the following sections we describe each step in more detail.
In our work, we are focusing on prints published between 1500 and 1876, which are part of the historical holdings of the ONB. Since 2011, more than 600,000 books (volumes) from that period have been digitized and OCR-processed in a public-private partnership with Google (Austrian Books Online, ABO). Therefore, nearly all of the library's historical books are currently accessible in a digital form. Within this corpus, we are specifically searching for travelogues.
As a first step, we identified German volumes in the overall ABO corpus, and then split the corpus by century. Then we initiated a ground truth by querying over titles and subject headings. We searched for different keywords in German, namely truncated spellings of 'Reise' (travel) and 'Fahrt' (journey) along with their known variants and with wildcard prefixes and suffixes (in alphabetical order: *faart*, *fahrt*, *fart*, *rais*, *raiß*, *raisz*, *rays*, *rayß*, *raysz*, *reis*, *reiß*, *reisz*, *reys*, *reyß*, *reysz*, *rys*, *ryß*) as well as common subject headings in the library's catalog including 'Forschungsreise' (expedition), 'Reise' (travel) and 'Reisebericht' (travelogue).
As these queries still generated many false positives, we cleaned up the dataset manually. Results were double-checked manually by two annotators fluent in German and experienced with early modern German, a historian and a librarian, who read parts of the texts and utilized external bibliographies, biographies and catalogs (e.g., https://lb-eutin.kreis-oh.de/, https://kvk.bibliothek.kit.edu/, https://viaf.org/, Wikipedia) to confirm whether a document meets our definition of travelogue or belongs to another genre. Uncertainties were resolved unanimously and there were no disagreements on the final annotations. The result of this step is a manually annotated and verified sample of travelogues, which took approximately three months of full-time work for both annotators.
Since training and validation of machine learning models also requires counterexamples, in this case non-travelogues, we implemented an automated procedure for randomly selecting an equally sized sample of documents from the subset of German volumes. Via a manual investigation process conducted by the same annotators, we ensured that those documents were not travelogues. This provides a manually verified sample of non-travelogues.
In total, our travelogues ground truth dataset contains a balanced sample of 6,048 volumes, representing 3.67% of 167,570 German language volumes from the complete ONB corpus. Table 1 provides an overview of our ground truth and its distribution over centuries. One can easily observe that the number of volumes, as well as the size of each publication, increases with time. To provide insight into how likely it is to find a travelogue randomly, we also included the number of travelogues that were found while reviewing the randomized sample we used as counterexamples. This approach was replicated for the 16th, 17th, 18th and 19th centuries. Volumes that were not evaluated remain in the candidates pool, upon which we applied our classifier after identifying the best-performing model.
Table 1. Dataset overview. Our corpus consists of the total number of digitized German-language books available to us. The ground truth contains an equal amount of travelogues and randomly selected counterexamples; in brackets, we provide the number of travelogues we found by chance. Books not evaluated remain in the candidates pool. A token contains at least two alphanumeric characters; punctuation etc. is not counted.
Century   No. candidate volumes   No. ground truth volumes   Total tokens       Average tokens
16th      8,526                   67/67                      362,244,353        41,829
17th      8,763                   161/161                    651,957,983        71,762
18th      55,971                  873/873                    5,041,741,840      82,274
19th      88,262                  1,897/1,897                11,464,645,150     124,539
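The wildcard keyword query used to seed the ground truth can be illustrated with a short sketch. This is not the catalog's actual search implementation: the function names and the example titles are hypothetical, and the sketch simply treats each truncated spelling with wildcards on both sides as a case-insensitive substring match.

```python
import re

# Truncated spellings of 'Reise' and 'Fahrt' as listed in the text. A pattern
# such as *reis* is interpreted here as "contains the substring 'reis'".
KEYWORD_STEMS = [
    "faart", "fahrt", "fart", "rais", "raiß", "raisz", "rays", "rayß",
    "raysz", "reis", "reiß", "reisz", "reys", "reyß", "reysz", "rys", "ryß",
]

# One case-insensitive alternation over all stems.
KEYWORD_PATTERN = re.compile(
    "|".join(re.escape(stem) for stem in KEYWORD_STEMS), re.IGNORECASE
)


def is_keyword_hit(title: str) -> bool:
    """Return True if the title contains any of the truncated spellings."""
    return KEYWORD_PATTERN.search(title) is not None


def select_candidates(titles):
    """Keep only titles matched by the keyword query (hypothetical helper)."""
    return [title for title in titles if is_keyword_hit(title)]


if __name__ == "__main__":
    # Hypothetical titles, for illustration only.
    titles = [
        "Orientalische Reisebeschreibung",
        "Preisschrift über die Landwirthschaft",  # matches 'reis' but is no travelogue
        "Predigten und Gebete",
    ]
    print(select_candidates(titles))
```

The second example title shows why such a query produces many false positives and why the manual clean-up described above is needed.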
The preprocessing phase involves several steps. First, the texts were tokenized at the word level, using whitespace and punctuation as separators. The German language uses upper- and lowercase spelling, depending on the word type and its position in the sentence, but to compensate for OCR and orthographic errors we transformed all tokens to lowercase. Furthermore, we removed all tokens that do not contain at least two alphanumeric characters, as this removes OCR artifacts, which are often special characters. For the same reason, we only kept tokens that appear at least twice in the whole corpus.
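The preprocessing steps above can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not the code from the project notebook: the function names are hypothetical, and "appears at least twice" is interpreted as the total number of occurrences across all documents.

```python
import re
from collections import Counter

# Split on every run of characters that are not letters or digits
# (whitespace, punctuation, special characters). "\W" alone would keep
# the underscore, so it is listed explicitly.
TOKEN_SPLIT = re.compile(r"[\W_]+")


def tokenize(text):
    """Lowercase the text and return tokens with at least two alphanumeric characters."""
    tokens = TOKEN_SPLIT.split(text.lower())
    # After splitting, tokens consist of alphanumeric characters only,
    # so the length check enforces the two-character minimum.
    return [token for token in tokens if len(token) >= 2]


def preprocess_corpus(documents, min_count=2):
    """Tokenize all documents and drop tokens seen fewer than min_count times overall."""
    tokenized = [tokenize(document) for document in documents]
    counts = Counter(token for tokens in tokenized for token in tokens)
    return [[t for t in tokens if counts[t] >= min_count] for tokens in tokenized]


if __name__ == "__main__":
    docs = ["Die Reise, nach Wien!", "Eine REISE nach Rom ~ a"]
    print(preprocess_corpus(docs))
```

Filtering rare tokens at corpus level requires one pass to count and a second pass to filter, which is why the sketch keeps the tokenized documents in memory; for a corpus of this size a streaming variant would be more realistic.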
As shown in Table 1, the documents that we seek to classify are rather large, as they contain on average 41,000–124,000 tokens. We decided to use a combination of BOW and bag-of-n-grams features, as shown by Wang and Manning [30]. With this approach, we can handle the intricacies of our data while retaining a computationally efficient method that still provides competitive results.
Experiments were performed on the above-mentioned ground truth. We tested different classification algorithms:
– Multinomial Naive Bayes (MNB)
– Support Vector Machine (SVM)
– Logistic regression (Log)
– Multilayer perceptron (MLP)
For the MNB, SVM and Log algorithms we used the sklearn [18] implementation. We use the TensorFlow [1] and Keras [4] implementation for the MLP. The data for all algorithms was vectorized and hashed with the sklearn HashingVectorizer.
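To make the feature representation concrete, the following sketch shows the idea behind a hashed bag of unigrams and bigrams in plain Python. It is a conceptual stand-in, not the sklearn HashingVectorizer itself: the hash function (MD5 here), the number of buckets and the function names are assumptions chosen for illustration.

```python
import hashlib
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Yield all word n-grams with n_min <= n <= n_max as space-joined strings."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def hashed_bag_of_ngrams(tokens, n_features=2 ** 18):
    """Count n-grams per hash bucket and return a sparse {bucket: count} dict.

    n_features is an assumed value; fewer buckets save memory but cause
    more collisions between unrelated n-grams.
    """
    vector = Counter()
    for gram in ngrams(tokens):
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % n_features
        vector[bucket] += 1
    return dict(vector)


if __name__ == "__main__":
    tokens = ["reise", "nach", "wien"]
    print(list(ngrams(tokens)))
    print(hashed_bag_of_ngrams(tokens))
```

Hashing avoids building and storing a vocabulary, which is attractive for documents with tens of thousands of tokens each; the price is that bucket indices cannot be mapped back to the n-grams that produced them.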
As a baseline, we applied a random classification. In all the experiments, we treat every book as a single document.
First, we split the ground truth into a training set (75%) and a validation set (25%) for every time period. We first evaluated all classifiers presented here through five-fold cross-validation on the training split. This essentially means that the training set was split into five equally sized subsets, and for each fold one subset serves as test data while the other four become the training data. When the results across the folds are comparable, good scores are less likely to occur by chance. Subsequently, we applied the classifiers to the held-out validation data. The classification results are discussed in Section 4.
The evaluation of our work follows a two-step approach. First, we gauge the effectiveness of a given method by precision, recall and F1 metrics on our training set. Precision is the number of correct results divided by the number of all returned results. Recall shows how many of the documents that should have been found are actually found: it divides the number of correctly classified documents by the number of documents that actually belong to that class. F1 is the harmonic mean of precision and recall (with a range between 0 and 1, with 1 representing the perfect result).
For the second step of our evaluation, we apply the model that performs best on our training data to the remaining documents of our corpus. This results in a list of all those documents, with probability scores indicating how likely they are to belong to our travelogues class. Starting with the highest probability, those documents are then manually evaluated by domain experts (see Section 4), to judge how well this model identifies travelogues in a set of unseen documents.
Additionally, we wanted to understand how many ground truth documents are needed to train an effective classifier. We approached this by testing the top classification approach against different amounts of ground truth documents and evaluated the results. The same setup as described above is used, but we varied the number of ground truth documents, as well as randomizing their selection. For each time frame, we evaluated 5, 10, 15, 20, 25, 30 and 50 examples each for the positive class (travelogue) and the negative class (anything else); for the 18th and 19th centuries, we extended to 100 examples each. The model created with those documents was then tested on the remaining ground truth documents. For each sample size, we repeated this a total of five times with a different randomized sample each time.
Table 2 shows the evaluation of our classification algorithms.
Table 2. Classification results. We provide precision (P), recall (R) and F1 scores for multinomial naive Bayes (MNB), support vector machine (SVM), logistic regression (Log) and a multilayer perceptron neural network (MLP).
Century   MNB (P, R, F1)   SVM (P, R, F1)   Log (P, R, F1)   MLP (P, R, F1)
16th      0.73
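The metrics reported in Table 2 and the five-fold protocol follow the definitions given in the evaluation section; a minimal sketch of both is shown below. The function names are hypothetical, and the fold assignment assumes the items have already been shuffled.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    # Correct results divided by all returned results.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Correct results divided by all documents that belong to the class.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def five_fold_indices(n_items, n_folds=5):
    """Yield (train_indices, test_indices); every item is tested exactly once."""
    folds = [list(range(k, n_items, n_folds)) for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


if __name__ == "__main__":
    print(precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
    for train, test in five_fold_indices(10):
        print(test, train)
```

Comparing the per-fold scores returned by such a loop is what allows one to judge whether a good score is stable or an artifact of a particular split.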
Our results show that it is possible to achieve good classification scores with our dataset, even without extensive feature engineering or pre-trained word embeddings. We assume that this is due to the comparably large size of our data points, as can be seen in Table 1 (many works focus on datasets that have more, but shorter documents; cf. [31] for comparisons of multiple classification methods and datasets). Additionally, through the randomness involved in the selection of non-travelogues, those volumes are expected to have a high variance in genres, which matches the whole corpus as well. Comparing numerous examples of one genre against an equal number of volumes covering many more genres certainly benefits the classification, especially when taking into account the length of the documents.
Taking the results from this evaluation, we were confident in approaching our main task, which was the identification of travelogues from a much larger dataset: the other digitized books in our corpus not yet evaluated by us.
Following the training of models suitable for classification, one specifically designed for each century, we applied them to our pool of candidates. We used this process to create a list of books that are potentially travelogues, ranked from highest to lowest probability, and evaluated the first 200 items. The results of this are shown in Table 3, and have been subject to the same scrutiny as our initial ground truth.
We can show that our methodology is a clear improvement over a less guided evaluation: within our evaluated findings, true positives made up 12.5% (16th c.), 30% (17th c.), 41.5% (18th c.) and 89.5% (19th c.) respectively. Due to time constraints, our evaluation was discontinued after the first 200 items, but this already means that we discovered 345 books of the travelogue genre that were not found by traditional search queries on metadata, as we explained in Section 3.2. Discovery by chance only resulted in 3% (16th c.) and 0.8% (18th c.) positives, or none at all (17th c., 19th c.).
It also has to be noted that the increase in the percentage of true positives in a diachronic perspective is connected to the equally increasing number of travelogues. There simply are many more travelogues that were printed in the 19th century than in the previous time periods.
Table 3. Applying the classifier on the candidates pool. By chance shows how many travelogues were found when randomly selecting examples for the ground truth.
Century   No. candidates   Confirmed (top 200)   By chance
During this evaluation, it became apparent that the majority of potential findings with a high probability belong to the group we attribute as historiography. Due to their nature, they have a strong overlap with travelogues but lack the required criterion of describing a journey actually experienced by the author. This crucial information can in many cases only be gathered from external sources (cf. Section 2.1). From a purely technical perspective, there is no difference in the content of the books between travelogues and historiographies. This means that while, for the purposes of this project, they are very different, right now we cannot further differentiate between them. Additionally, in the 18th century, many false positives include publications on geography; a possible explanation here is that they often describe locations, which naturally overlap with travelogues as well.
The result of our efficiency evaluation is depicted in Figure 2. We provide the average F1 score for each time frame, as well as the variance between the samples.
For every time frame, our general observation is that the performance of the MLP classification fluctuates heavily when only a small dataset is available. With at least 20, or better 30, examples each for the positive and negative class, a stable performance above an F1 score of 0.8 can be reached. After that, it slowly increases up to an F1 score of 0.9 at 50 examples each, with only very minor changes for 100 examples each.
This experiment shows that it is possible to create a working classification methodology, which reaches acceptable results with a modest time investment up front, as shown in Table 2.
Fig. 2. Classifier efficiency evaluation for MLP. Every step has a balanced number of travelogues and non-travelogues (5/5, 10/10, 15/15, 20/20, 25/25, 30/30, 50/50 and 100/100). The experiment was repeated five times for each step.
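The sample-size experiment summarized in Figure 2 can be sketched as a small harness. This is an illustration under assumptions, not the project's notebook code: `train_and_score` is a hypothetical callable that trains a classifier on the given training pairs and returns an F1 score on the test pairs, and the population variance is used because the text does not state which variance estimator was applied.

```python
import random
import statistics

# Examples per class; the text adds 100 for the 18th and 19th centuries.
SAMPLE_SIZES = [5, 10, 15, 20, 25, 30, 50]


def sample_size_experiment(positives, negatives, train_and_score,
                           sizes=SAMPLE_SIZES, repeats=5, seed=0):
    """Return {size: (mean F1, variance of F1)} over repeated random samples.

    For each size, `size` travelogues and `size` non-travelogues are drawn
    for training; all remaining ground truth documents are used for testing.
    """
    rng = random.Random(seed)
    results = {}
    for size in sizes:
        if size >= len(positives) or size >= len(negatives):
            continue  # nothing would be left over for testing
        scores = []
        for _ in range(repeats):
            pos, neg = positives[:], negatives[:]
            rng.shuffle(pos)
            rng.shuffle(neg)
            train = [(d, 1) for d in pos[:size]] + [(d, 0) for d in neg[:size]]
            test = [(d, 1) for d in pos[size:]] + [(d, 0) for d in neg[size:]]
            scores.append(train_and_score(train, test))
        results[size] = (statistics.mean(scores), statistics.pvariance(scores))
    return results


if __name__ == "__main__":
    # Dummy scorer standing in for a real classifier, for demonstration only.
    dummy = lambda train, test: len(train) / (len(train) + len(test))
    print(sample_size_experiment(list(range(40)), list(range(40, 80)), dummy))
```

Passing the classifier in as a callable keeps the sampling logic independent of the model, so the same harness could be reused for any of the four algorithms.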
Our results show that standard machine learning techniques combined with relatively easily computable features (BOW and bag-of-n-grams) can effectively support scholars in identifying travelogues in a large-scale document corpus. Using the same features for an MLP neural network generates even superior results. This is an important first step in the development of a broader mixed-method approach for the large-scale serial analysis of travelogues.
Specifically, we discovered a total of 345 travelogues in the evaluated time periods using the top 200 findings with the highest confidence scores each (800 in total). We previously were not able to find any of these files through search queries based on metadata or manual search in our catalog; hence this directly translates into a re-discovery of sources for scholars in the humanities.
Additionally, a large fraction of false positives is, at the level of words and their semantics, extremely similar to the true positives. However, going by the definition provided earlier, we are required to use external information that could not be included as a feature, as it is dependent on domain knowledge. This severely limits the efficacy of unsupervised machine learning and deep learning approaches.
A clear limitation of our effort lies in the time and effort required to create a high-quality ground truth. While this effort could possibly be reduced by applying unsupervised clustering techniques beforehand, annotations provided by domain experts will always be key for effective learning techniques. Applying active learning techniques (cf. [25]) for iteratively developing ground truths could be a possible strategy for reducing this manual annotation effort. Another limitation of our approach lies in the focus on entire volumes, which currently neglects the fact that volumes may include both travelogues and non-travelogues. Using a wider spectrum of semantically richer features (cf. [16]) such as named entities could support classification at the paragraph or page level.
Nonetheless, our experiments on the efficiency of the classification method presented here show that it is possible to achieve robust results above an F1 score of 0.8 with a relatively small ground truth size. Knowing this, future research in different domains should require substantially less time investment to get started (depending on the sources, and on whether additional definitions etc. are needed, between several hours and a few weeks of full-time work).
A remaining challenge lies in the distinction between highly similar genres, in our case historiographies or geographic books and travelogues. We hope that this can be tackled by further refining the ground truth to fit the given genres, taking into account a wide range of external sources, to include domain knowledge in a structured way.
In this paper, we have described a methodology to identify historical travelogues in a large dataset. Our approach combines the knowledge of both domain experts (historical science, library and information science) and data scientists to create a ground truth and subsequently build an MLP model, successfully identifying 345 previously unknown travelogues. Furthermore, we have shown that a ground truth for this kind of data can be as small as 30 examples each for the positive and negative class and still perform well.
In the upcoming weeks and months, we will begin looking at the discovered travelogues in more detail. In a first step, we will identify intertextual relations in our corpus to find out in what way the travelogues depended on each other, and why and how (certain) stereotypes and prejudices were handed down, evolved or disappeared over the centuries.
Ultimately, we want to know how foreign cultures, places and people were perceived, and whether the perceptions differed depending on the socio-cultural background of the people involved. This will allow us to come to an understanding of how and why something was perceived as Other, and if and how this changed across the centuries. For this, we have done the groundwork here, as it is crucial to rely on as much data as possible; concrete next steps in this direction include creating a formal description of intertextuality and Otherness, and translating it into a set of machine-readable text features.
Acknowledgments
The work in the Travelogues project ( ) is funded through an international project grant by the Austrian Science Fund (FWF, Austria: I 3795) and the German Research Foundation (DFG, Germany: 398697847).
References
1. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems (2015), software available from tensorflow.org
2. Agai, B., Conermann, S. (eds.): Wenn einer eine Reise tut, hat er was zu erzählen. Präfiguration – Konfiguration – Refiguration in muslimischen Reiseberichten. ebv, Berlin (2013)
3. Bellingradt, D., Salman, J.: Books and Book History in Motion: Materiality, Sociality and Spatiality. In: Daniel Bellingradt, P.N., Salman, J. (eds.) Books in Motion in Early Modern Europe. Beyond Production, Circulation and Consumption, pp. 1–11. Springer International Publishing, Cham (2017)
4. Chollet, F., et al.: Keras. https://keras.io (2015)
5. Dai, X., Bikdash, M., Meyer, B.: From social media to public health surveillance: Word embedding based clustering method for twitter classification. In: SoutheastCon 2017. pp. 1–7. IEEE (2017)
6. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: Liblinear: A library for large linear classification. Journal of Machine Learning Research (Aug), 1871–1874 (2008)
7. Genz, J., Gévaudan, P.: Medialität, Materialität, Kodierung: Grundzüge einer allgemeinen Theorie der Medien, pp. 201–209. transcript, Bielefeld (2016)
8. Greve, A.: Die Konstruktion Amerikas: Bilderpolitik in den "Grands Voyages" aus der Werkstatt de Bry, Europäische Kulturstudien, vol. 14. Böhlau, Köln, Weimar und Wien (2004)
9. Höfert, A.: Den Feind beschreiben. »Türkengefahr« und europäisches Wissen über das Osmanische Reich 1450–1600, pp. 120–122. Ferdinand Schöningh, Paderborn (2014)
10. Joachims, T.: Text categorization with support vector machines: Learning with many relevant features. In: European conference on machine learning. pp. 137–142. Springer (1998)
11. von Krusenstjern, B.: Was sind Selbstzeugnisse? Begriffskritische und quellenkundliche Überlegungen anhand von Beispielen aus dem 17. Jahrhundert. Forum Historische Anthropologie, 462–471 (1994)
12. Kürbis, H.: Hispania descripta: Von der Reise zum Bericht. Deutschsprachige Reiseberichte des 16. und 17. Jahrhunderts über Spanien. Ein Beitrag zur Struktur und Funktion der frühneuzeitlichen Reiseliteratur, pp. 345–356. Peter Lang, Frankfurt am Main (2004)
13. Lüdtke, A., et al. (eds.): Selbstzeugnisse der Neuzeit (1993–)
14. Mai, F., Galke, L., Scherp, A.: Using deep learning for title-based semantic subject indexing to reach competitive performance to full-text. In: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries. pp. 169–178. ACM (2018)
15. Michel, J.B., Shen, Y.K., Aiden, A.P., Veres, A., Gray, M.K., Pickett, J.P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., et al.: Quantitative analysis of culture using millions of digitized books. Science (6014), 176–182 (2011)
16. Momeni, E., Tao, K., Haslhofer, B., Houben, G.J.: Identification of useful user comments in social media: a case study on flickr commons. In: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital libraries. pp. 1–10. ACM (2013)
17. Nünning, A.: Zur mehrfachen Präfiguration/Prämediation der Wirklichkeitsdarstellung im Reisebericht: Grundzüge einer narratologischen Theorie, Typologie und Poetik der Reiseliteratur. In: Gymnich, M., et al. (eds.) Points of Arrival: Travels in Time, Space, and Self/Zielpunkte: Unterwegs in Zeit, Raum und Selbst, pp. 11–32. Francke, Tübingen (2008)
18. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2825–2830 (2011)
19. Pfister, M.: Intertextuelles Reisen, oder: Der Reisebericht als Intertext. In: Wetzel, H.H. (ed.) Reisen in den Mittelmeerraum, pp. 55–101. Passavia Universitätsverlag, Passau (1993)
20. Piera, M.: Travel as episteme–an Introductory Journey. In: Piera, M. (ed.) Remapping Travel Narratives (1000–1700), pp. 1–22. Arc Humanities Press, Leeds (2018)
21. Presser, J.: Memoires als geschiedbron. In: Presser, J. (ed.) Uit het werk van Jacob Presser, pp. 277–282. Athenaeum-Polak & Van Gennep, Amsterdam (1969)
22. Purschwitz, A.: Netzwerke des Wissens – Thematische und personelle Relationen innerhalb der halleschen Zeitungen und Zeitschriften der Aufklärungsepoche (1688–1818). Journal of Historical Network Research (1), 109–142 (2018)
23. Salzani, C., Tötösy de Zepetnek, S.: Bibliography for work in travel studies. CLCWeb Library p. travelstudiesbibliography (2010), https://docs.lib.purdue.edu/clcweblibrary/travelstudiesbibliography
24. Sandrock, K.: Truth and Lying in Early Modern Travel Narratives: Coryat's Crudities, Lithgow's Totall Discourse and Generic Change. European Journal of English Studies (2), 189–203 (2015)
25. Settles, B.: Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning (1), 1–114 (2012)
26. Stagl, J.: Apodemiken. Eine räsonnierte Bibliographie der reisetheoretischen Literatur des 16., 17. und 18. Jahrhunderts, Quellen und Abhandlungen zur Geschichte der Staatsbeschreibung und Statistik (QASS), vol. 2. Ferdinand Schöningh, Paderborn et al. (1983)
27. Stagl, J.: Eine Geschichte der Neugier. Die Kunst des Reisens 1550–1800, pp. 77–122, 242–251. Böhlau, Köln, Weimar und Wien (2002)
28. Treue, W.: Abenteuer und Anerkennung. Reisende und Gereiste in Spätmittelalter und Frühneuzeit (1400–1700), p. 8. Ferdinand Schöningh, Paderborn (2014)
29. Van Groesen, M.: The representations of the overseas world in the De Bry collection of voyages (1590–1634). Brill, Boston and Leiden (2008)
30. Wang, S., Manning, C.D.: Baselines and bigrams: Simple, good sentiment and topic classification. In: Proceedings of the 50th annual meeting of the association for computational linguistics: Short papers-volume 2. pp. 90–94. Association for Computational Linguistics (2012)
31. Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., Hovy, E.: Hierarchical attention networks for document classification. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 1480–1489 (2016)
32. Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. In: Advances in neural information processing systems. pp. 649–657 (2015)
33. von Zimmermann, C.: Texttypologische Überlegungen zum frühneuzeitlichen Reisebericht: Annäherung an eine Gattung. Archiv für das Studium der neueren Sprachen und Literaturen 154