A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching
Mariona Coll Ardanuy, Kasra Hosseini, Katherine McDonough, Amrey Krause, Daniel van Strien, Federico Nanni
ABSTRACT

Recognizing toponyms and resolving them to their real-world referents is required to provide advanced semantic access to textual data. This process is often hindered by the high degree of variation in toponyms. Candidate selection is the task of identifying the potential entities that can be referred to by a previously recognized toponym. While it has traditionally received little attention, candidate selection has a significant impact on downstream tasks (i.e. entity resolution), especially in noisy or non-standard text. In this paper, we introduce a deep learning method for candidate selection through toponym matching, using state-of-the-art neural network architectures. We perform an intrinsic toponym matching evaluation based on several datasets, which cover various challenging scenarios (cross-lingual and regional variations, as well as OCR errors), and assess its performance in the context of geographical candidate selection in English and Spanish.
Conceptualization: Mariona Coll Ardanuy, Kasra Hosseini, Federico Nanni
Methodology: Kasra Hosseini, Mariona Coll Ardanuy, Federico Nanni
Implementation: Kasra Hosseini, Federico Nanni, Mariona Coll Ardanuy
Reproducibility: Kasra Hosseini, Federico Nanni, Daniel van Strien
Data Curation: Mariona Coll Ardanuy, Kasra Hosseini, Federico Nanni, Amrey Krause
Annotation: Katherine McDonough, Mariona Coll Ardanuy, Daniel van Strien
Analysis: Mariona Coll Ardanuy, Federico Nanni, Kasra Hosseini, Katherine McDonough
Writing and Editing: Federico Nanni, Mariona Coll Ardanuy, Kasra Hosseini, Katherine McDonough

The Alan Turing Institute, The British Library, Queen Mary University of London, Edinburgh Parallel Computing Centre
Contacts of the corresponding authors: {mcollardanuy,khosseini,fnanni}@turing.ac.uk

Work for this paper was produced as part of “Living with Machines”. This project, funded by the UK Research and Innovation (UKRI) Strategic Priority Fund, is a multidisciplinary collaboration delivered by the Arts and Humanities Research Council (AHRC grant AH/S01179X/1), with The Alan Turing Institute, the British Library and the Universities of Cambridge, East Anglia, Exeter, and Queen Mary University of London. This work was also supported by The Alan Turing Institute (EPSRC grant EP/N510129/1). Newspaper data was kindly shared by Findmypast.

INTRODUCTION
With increasingly larger amounts of unstructured text becoming digitally available in many different fields, the need for robust geographically-aware retrieval of information from large textual collections is now more urgent than ever. Textual data is often deeply geographical, and it has been shown that geographic queries make up a large part of all search queries [1, 12, 14, 25]. Toponym resolution is a class of entity linking that focuses specifically on geographical entities. Given a toponym (i.e. a geographical name) that has been recognized in text, its aim is to resolve it to its spatial footprint (often represented as a set of coordinates that define its location on the Earth’s surface). This step requires an external source of knowledge, which usually comes in the shape of a gazetteer, that is, a dictionary of geographical entities with their associated alternative place names and geospatial information. On the other hand, candidate selection is the task of identifying the potential entities that can be referred to by a named entity recognized in text. As the intermediary step between named entity recognition and the downstream task of entity disambiguation, candidate selection is an integral part of entity linking. And yet, it has often been an overlooked component of the entity linking pipeline, even though it has been shown to have a significant impact on the final performance [15, 26], especially in noisy or non-standard text.

Toponyms are particularly prone to name variations and changes, which can arise from multiple causes, such as regional spelling differences, diachronic spelling variation, and changes of geopolitical status [4]. In toponyms, variation is common not only at a token level (e.g. ‘Republic of Ireland’ matching ‘Ireland’), but also at a character level (e.g. ‘Killarra’ for ‘Killala’), and at both token and character level (e.g. ‘Canouan’ and ‘Cannouan Island’).
In addition to these, noisy text often presents other types of character-level variations, such as spelling errors, typographical errors, and OCR errors (e.g. ‘Worchestershire’ for ‘Worcestershire’, or ‘Cockcnnotith’ for ‘Cockermouth’). The number of potential variations can be very high, and yet candidate selection should ensure that the correct location is provided among the pool of retrieved entities.

In this paper, we present a new and flexible deep learning approach to geographical candidate selection through toponym matching, which is specifically tailored to dealing with the challenges characteristic of noisy scenarios. Our method consists of two main components: (1) toponym matching, formulated as a binary classification of toponym query-candidate pairs, and (2) candidate selection, formulated as a ranking task where the aim is to rank the good candidates first while minimizing the presence of noisy candidates. The main contributions of this paper are:

• A new flexible, user-friendly, efficient software library, with extensive documentation, for performing candidate selection through fuzzy string matching. We discuss its relevance and application in the context of geographical candidate selection, and evaluate its performance and efficiency on datasets of various sizes. We call our method DeezyMatch (DEEp fuzZY string MATCHing).

• New realistic datasets for the evaluation of toponym matching methods. These datasets cover a wide range of challenging scenarios (e.g. cross-lingual, diachronic, and regional variations, as well as OCR errors).

• A comprehensive evaluation framework for the task of geographical candidate selection in the downstream task of toponym resolution on noisy or non-standard datasets (i.e. two existing datasets in English and Spanish and one new manually annotated dataset of English nineteenth-century OCR’d text). We conduct an extensive quantitative evaluation covering both the binary classification of toponyms in these datasets and the ranking of potential candidates in real toponym resolution scenarios with models created from these datasets.

(We do not consider toponym detection as part of the toponym resolution task in this paper. There is a large body of research in the natural language processing community that deals with the specific problem of named entity recognition, of which toponym detection is a part.)

Our method has been designed to be as language-independent as possible. It only relies upon a character tokenizer when processing the string inputs and a reference gazetteer. We have tested its downstream application on datasets from different languages, time periods and origins, from seventeenth-century Latin America to nineteenth-century Britain and the United States. All code, datasets, gazetteers and evaluation settings are openly available to support research reproducibility and to foster the use of
DeezyMatch in other downstream tasks. (DeezyMatch code can be found at https://github.com/Living-with-machines/DeezyMatch/; for a more detailed description of the DeezyMatch architecture and functionalities, see Hosseini et al. [16]. All experiments can be found at https://github.com/Living-with-machines/LwM_SIGSPATIAL2020_ToponymMatching, where we provide all resources to allow full reproducibility of the results.)

RELATED WORK

To date, most entity linking and toponym resolution systems have approached candidate selection by performing exact or partial string matching between the mention of the toponym in a text and a name variation of the entry in a knowledge base (KB) for a specific entity (e.g. ‘NYC’ for ‘New York’). Well-established entity linking pipelines such as the ones presented by Ferragina and Scaiella [11], Mendes et al. [22], Raiman and Raiman [27], Sil et al. [31] and Moro et al. [24] depend on exact or super-string matching to select the set of potential candidates a query can refer to. This approach to selecting candidates relies on the assumption that the mention is present as a name variation of a specific entity in the KB. There have been a number of studies on enriching the entries in a KB with alternate names (such as abbreviations, historical names, or names in other languages) [3, 7, 24]. Thanks to such studies, most of the effort the research community has invested in candidate selection has gone into developing algorithms for scoring and ranking a set of retrieved candidates [13, 18, 20], while less effort has been put into dealing with candidate selection in noisy text. Nevertheless, even a KB highly enriched with name variations will not cover all possible name variations (especially of less popular entities), or spelling mistakes and OCR errors.

Alternatives to perfect-match linking include the adoption of edit-distance techniques, such as Levenshtein distance [21, 23], but these methods suffer from poor scalability.
More recently, researchers have proposed deep learning solutions to address this problem. Le and Titov [19] use a noise detector in their entity linking system that operates at the token level (e.g. ‘Bill Clinton (President)’ matching ‘Presidency of Bill Clinton’), which learns true matchings from lists of positive and negative candidate pairs. Tam et al. [34] have recently presented STANCE, a model for computing the similarity between two strings by encoding the characters of each of them, aligning the encodings using Sinkhorn iteration, and scoring the alignments using a convolutional neural network.

The most similar work to ours is by Santos et al. [29], who proposed a deep learning architecture using Gated Recurrent Units (GRUs) to classify pairs of toponyms as either potentially referring to the same entity or not. The method is trained and intrinsically evaluated on a large dataset collected from the GeoNames gazetteer, composed of 5 million positive and negative toponym pairs. Our work builds on this by leveraging current research in natural language processing (NLP), and expands it in several different directions: by supporting various state-of-the-art neural network architectures, allowing the application of an existing model to new data, offering the possibility of further fine-tuning it, and employing it for the task of candidate ranking. More critically, our approach allows the candidate ranking component to be seamlessly integrated into entity linking and toponym resolution pipelines. In the remainder of the paper, we describe our method and assess its performance in the context of geographical candidate selection.
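As a reference point for what follows, the normalized edit-distance similarity on which such baselines rely can be sketched in a few lines of Python. This is a generic restricted Damerau-Levenshtein (optimal string alignment), normalized by the length of the longer string; it is an illustrative sketch, not the pyxDamerauLevenshtein implementation used later in the paper:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

def similarity(a: str, b: str) -> float:
    """Normalize the distance into a [0, 1] similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - damerau_levenshtein(a.lower(), b.lower()) / max(len(a), len(b))
```

Under this measure, ‘Killarra’ vs ‘Killala’ scores 0.75 (two edits over eight characters). The scalability problem noted above follows from having to run such a comparison between every query and every alternate name in the knowledge base.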
METHOD

In this section, we provide a brief overview of DeezyMatch, a free, open-source library written in Python for fuzzy string matching and candidate ranking, and show how it can be used for the task of geographical candidate selection. DeezyMatch consists of two main components: a pair classifier, which we use for the subtask of toponym matching and is described in Section 3.1, and a candidate ranker, which is used for the task of geographical candidate selection and is described in Section 3.2.
3.1 Pair Classifier

The DeezyMatch pair classifier component is largely inspired by previous work by Santos et al. [29] on toponym matching, which is formulated as a binary classification task of toponym query-candidate pairs. The authors developed a siamese deep neural network for binary classification of toponym pairs, implemented using Keras [6]. DeezyMatch builds upon the neural network architecture of Santos et al. [29] and extends it to allow more control over its architecture and each of its components and parameters, direct application on unseen data, and further fine-tuning of already trained models. DeezyMatch is designed following a modular approach, and has been implemented in PyTorch and tested on both CPU and GPU. It is not fixed to a pre-defined architecture: the user can specify the preferred architecture (GRU, LSTM, or RNN); moreover, the dimensionality of the hidden units in the recurrent neural networks, fully connected layers and embeddings can be changed in the input file. The user can choose to work with a forward or bi-directional RNN/GRU/LSTM architecture and can specify the number of layers in the networks. The preprocessing steps and other hyperparameters, such as learning rate, number of epochs, batch size, maximum sequence length and dropout, can likewise all be changed in the input file.

During training, a dataset of string pairs is read and preprocessed, and strings are converted into dense vectors. Instead of having a fixed two-fold cross-validation as in Santos et al. [29], we split the dataset into training, validation and test sets (the ratio of which is specified by the user). The resulting model can be further fine-tuned on other datasets, an approach especially promising where only limited training examples are available. In contrast to Santos et al. [29], DeezyMatch provides functionality for model inference, where a trained model can be applied to other datasets, not used in training and validation, and evaluated through various metrics (loss and precision/recall/F1 scores).

Section 5.1.1 summarizes the model architectures and the choice of hyperparameters used in this study.
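To make the preceding description concrete, the kind of settings exposed through such an input file can be sketched as a plain dictionary. The key names below are illustrative, not DeezyMatch’s actual input-file schema; the values mirror those reported later in Section 5.1.1:

```python
# Illustrative hyperparameter set; key names are hypothetical,
# values follow the experimental setup described in Section 5.1.1.
config = {
    "architecture": "gru",    # one of "rnn", "gru", "lstm"
    "bidirectional": True,    # forward-only is also supported
    "rnn_layers": 2,          # number of recurrent layers
    "hidden_size": 60,        # hidden state dimensionality
    "embedding_size": 60,     # character embedding dimensionality
    "max_seq_len": 120,       # maximum input sequence length
    "batch_size": 64,
    "learning_rate": 0.001,
    "dropout": 0.01,
    "tokenization": "char",   # word- and ngram-level also supported
    "test_fraction": 0.10,    # 90/10 training+validation vs test split
}
```

The point of routing all of these through a single input file is that swapping, say, a GRU for an LSTM requires no code change, only a different configuration.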
3.2 Candidate Ranker

The pair classifier component described in Section 3.1 returns a classifier trained to capture the transformations present in the input dataset of toponym pairs. The candidate selection component then uses this model to retrieve potential candidates for a given toponym query — or set of queries — from a KB. This is achieved through the following steps:

(0) Generating gazetteer vector representations: a trained DeezyMatch model is used in this initial step to generate vector representations for all the alternate names (i.e. the set of all locations’ toponyms) in the KB or gazetteer to which we want to link our toponym queries. This step is done only once for each model and gazetteer. The vectors (e.g. forward/backward vectors in a bi-directional neural network) are then combined to form one file containing all the gazetteer vectors.

(1) Generating query vector representations: the same model used to generate gazetteer vector representations in the previous step reads in a set of toponym queries (e.g. toponyms recognized in a text) and generates a vector representation for each query term. As above, the vectors are then combined to form one file containing all the query vectors.

(2) Ranking candidates: the representations generated in the previous steps encode toponym similarities based on the transformations learned during training. For example, the vector representations of ‘Manchtftcr’ and ‘Manchester’ are more similar (i.e., the vectors are close to each other) when they have been generated from a DeezyMatch model that has been trained on a dataset which encloses these types of transformations (in this case, OCR-induced). In this step, we compute the distance of each query vector with respect to all gazetteer vectors and rank them according to the distance. We use faiss, a library for efficient similarity search, to compute the L2-norm distances [17].

(DeezyMatch supports character-, word-, and ngram-level tokenization. Preprocessing steps include lower-casing, stripping, dealing with missing characters in the vocabulary (particularly in the case of fine-tuning) and normalizing strings. Experiments on the impact of fine-tuning on both toponym matching and candidate selection are underway.)

In practice, the gazetteer usually has many more entries than the number of queries (i.e. toponyms for which we want to find candidates). As remarked above, an advantage of the proposed method is that vector representations for the gazetteer are computed only once (for a given trained model). For all subsequent queries, only the query vectors are generated and compared to the gazetteer vectors. This significantly reduces the computation time compared to more traditional string-matching methods (e.g. Levenshtein distance) in which one query is compared to n possible variations of all potential candidates in each run. DeezyMatch also supports on-the-fly ranking, that is, the toponym queries are converted into vector representations and compared to the gazetteer vectors automatically.

DATASETS AND RESOURCES

In this section we introduce the datasets and resources used in the experiments of Section 5.
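Before turning to the datasets, the ranking in step (2) above can be illustrated with a minimal pure-Python sketch. DeezyMatch delegates this computation to faiss for scale; the toy two-dimensional vectors and place names below are invented for illustration only:

```python
import math

def rank_candidates(query_vec, gazetteer_vecs, top_k=3):
    """Rank gazetteer entries by L2 distance to a query vector (smaller = better)."""
    scored = sorted((math.dist(query_vec, vec), name)
                    for name, vec in gazetteer_vecs.items())
    return [name for _, name in scored[:top_k]]

# Toy two-dimensional "embeddings" (real DeezyMatch vectors are much larger).
gazetteer = {
    "manchester": [0.9, 0.1],
    "london": [0.1, 0.9],
    "poole": [0.5, 0.5],
}
# A query whose encoding lies close to 'manchester' (e.g. an OCR'd variant).
ranked = rank_candidates([0.85, 0.15], gazetteer, top_k=2)
```

The key property exploited here is the one described above: the gazetteer vectors are computed once, so each new query only costs one encoding pass plus a nearest-neighbour search.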
We first describe the datasets that we use for the downstream task of selecting candidates for toponyms in text, in Section 4.1. They inform the choice of the gazetteers that we use both for creating the datasets for toponym matching and for evaluating the performance of candidate selection. The gazetteers used in our experiments are described in Section 4.2. In Section 4.3, we describe the datasets used for training and evaluating the toponym matching models.
4.1 Candidate Selection Datasets

To assess the performance of geographical candidate selection, we use two datasets in English and one in Spanish. They are historical datasets (from the nineteenth and seventeenth centuries), and have been selected because they present interesting challenges, including diachronic or spelling variations and OCR errors. The three datasets consist of text documents in which toponyms have been recognized and resolved to latitude and longitude coordinate points. We discuss the characteristics of these datasets in the following paragraphs, and summarize them in Table 1.

War of the Rebellion (WOTR) [9].
This corpus is composed of historical texts in English (historical letters and reports) from the American Civil War (1861-1865), manually annotated with geographic references. The documents had been previously OCR’d and manually corrected. Annotators were free to obtain coordinates from different sources; however, they were specifically shown how to retrieve them from Wikipedia pages, which therefore was the most-used resource. Larger geographical entities in the WOTR dataset (e.g. countries or states) were annotated both with geographical points and polygons. For consistency with the other datasets, we only considered points. We use their test set for our experiments, which has 1,479 annotated toponyms (of which 584 are unique, after lower-casing).

British Newspaper Archive articles (BNA-FMP). The second dataset has been created as part of our project from historical newspaper articles in English obtained from the British Newspaper Archive, abbreviated BNA-FMP. This dataset consists of 1,248 toponyms (of which 509 are unique, after lower-casing) from 191 articles published between 1780 and 1870 in local newspapers based in Manchester and Ashton-under-Lyne (broadly representing the industrial north of England), and Dorchester and Poole (representing the rural south). We selected articles that are between 150 and 550 words long and with an OCR confidence score greater than 0.7, as reported in the metadata. We did not correct errors produced in the OCR or layout recognition steps. The annotator was asked to recognize every location mentioned in the text and map it to the URL of the Wikipedia article that refers to it. We then derived the latitude and longitude of the entry in question from WikiGazetteer, a Wikipedia-based gazetteer enhanced with information from Geonames [2]. The toponyms recognized in this dataset often contain OCR errors (e.g. ‘iHancfjrcter’ for ‘Manchester’, or ‘WEYBIOIJTII’ for ‘WEYMOUTH’), spelling variations (e.g.
‘Leipsic’ for ‘Leipzig’, or ‘Montpelier’ for ‘Montpellier’), historical anglicizations and other forms of foreign toponym domestication (e.g. ‘Kingstown’ for ‘Dún Laoghaire’, ‘Queenstown’ for ‘Cobh’, or ‘Carlowitz’ for ‘Sremski Karlovci’), name changes due to external factors (e.g. ‘Constantinople’ for ‘Istanbul’), or a combination of these. Out of the 509 unique toponyms in the dataset, there are 167 toponyms for which a true referent exists in the gazetteer but cannot be directly retrieved because there is no exact-matching toponym in the gazetteer. In most cases, this is due to OCR errors (111 instances) or spelling variations (26 instances).

(ArgManuscrita). This dataset has been created as part of the Digital Humanities project
La Argentina Manuscrita, which used the semantic annotation tool Recogito to geolocate toponyms from a seventeenth-century chronicle and travelogue in Spanish, describing the area around the Río de la Plata basin. This dataset is in Spanish and is composed of 799 toponyms (of which 200 are unique, after lower-casing), and has been annotated with coordinates from different gazetteers, such as Geonames, Pleiades, and HGIS de las Indias.

Dataset | Unique toponyms | Language | Period
ArgManuscrita | 200 | Spanish | 1610s
WOTR (test) | 584 | English | 1860s
BNA-FMP | 509 | English | 1780-1870
Table 1: Candidate selection datasets

(Living with Machines: http://livingwithmachines.ac.uk/. A description of the La Argentina Manuscrita project is found at https://arounddh.org/en/la-argentina-manuscrita; the data is openly available at https://recogito.pelagios.org/document/wzqxhk0h3vpikm. Recogito is an initiative of Pelagios Commons, http://recogito.pelagios.org/. Pleiades: http://pleiades.stoa.org/.)

4.2 Gazetteers

Gazetteers are geographical dictionaries. They can be either global (aiming at worldwide geographic coverage) or local (covering a specific region or time period). The choice of the gazetteer(s) to which toponyms are linked inevitably has an impact on candidate selection and therefore on final resolution. For many digital humanities applications, a gazetteer for toponym resolution should faithfully reflect the geographical knowledge of the writer and intended audience of the texts. Noisy gazetteers — those containing anachronistic records — not only complicate candidate selection, they also introduce geographical information that may have been unknown to people in a particular historical context. For example, if the goal is to resolve the toponyms in an English-translated collection of texts from second-century Greece, it would be preferable that the city of Athens in Georgia is not even present in the gazetteer, so that it would not be retrieved as a possible candidate. Our candidate selection method is flexible in the choice of the gazetteer, as long as its entries (i.e. places) include at least latitude and longitude coordinate points and potential alternate names that may be used to refer to them.

Gazetteers serve two different goals in this paper: (1) to create the sets of positive and negative pairs used in the toponym matching step to train the classifiers (toponym pair datasets are described in the following section, 4.3), and (2) as the knowledge base against which we perform geographical candidate selection.
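The minimal gazetteer entry our method requires (a coordinate point plus alternate names) can be sketched as a simple record. This is a hypothetical structure for illustration, not the actual WikiGazetteer schema:

```python
from dataclasses import dataclass, field

@dataclass
class GazetteerEntry:
    """Minimal information candidate selection needs from a gazetteer entry."""
    name: str       # main place name
    lat: float      # latitude of the coordinate point
    lon: float      # longitude of the coordinate point
    altnames: list = field(default_factory=list)  # alternate/historical names

# Example entry; alternate names are where fuzzy matching earns its keep.
athens = GazetteerEntry("Athens", 37.98, 23.73, ["Athinai", "Athenae"])
```

Any gazetteer that can be projected onto such a record (WikiGazetteer, GeoNames, HGIS de las Indias) can serve as the knowledge base for the candidate ranker.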
We have created a WikiGazetteer [2] (a Wikipedia-based gazetteer enriched with Geonames data) for the different languages of the datasets described in Section 4.1, i.e. English and Spanish, and a WikiGazetteer for Greek, to test DeezyMatch on an alphabet other than Latin. The three gazetteers have been created from the (at the time) latest versions of Wikipedia in the corresponding languages. These three WikiGazetteers have been used to create the sets of positive and negative pairs that will be described in the following section. In addition, for the candidate selection experiments we have expanded the Spanish WikiGazetteer with the
HGIS de las Indias gazetteer, which collects the historical geography of colonial Spanish America and corresponds with the period of the Argentina Manuscrita dataset (described in Section 4.1.3). The four resulting gazetteers are summarized in Table 2.

Gazetteer | Language | Locations | Unique altnames
WG:en_gz | English | 1,144,016 | 2,455,966
WG:es_gz | Spanish | 338,239 | 550,697
WG:es+HGISIndias | Spanish | 351,040 | 556,985
WG:el_gz | Greek | 21,037 | 34,572
Table 2: Summary of gazetteers: Language indicates the language of the Wikipedia version from which the gazetteer has been built, Locations is the number of entity entries in the gazetteer, and Unique altnames is the number of unique place names present in the gazetteer.

(See the instructions to create a WikiGazetteer from a specific Wikipedia version at https://github.com/Living-with-machines/lwm_GIR19_resolving_places/tree/master/gazetteer_construction. To learn more about HGIS de las Indias, see Stangl [33] and the webpage https://hgis.club/historical-geography-of-bourbon-spanish-america.)
4.3 Toponym Pair Datasets

We use several datasets to evaluate the intrinsic performance of our toponym matching approach: an existing dataset, introduced in Santos et al. [30] (Section 4.3.1), and two new realistic datasets that cover different types of variations (Sections 4.3.2 and 4.3.3). They are summarized in Table 5.
The Santos dataset (Santos). GeoNames is a large, publicly available gazetteer. Each location is associated with multiple names (corresponding to historical or regional denominations, to names in different languages and alphabets, etc.). Santos et al. [30] generated a dataset of 5M toponym pairs from Geonames alternate names, half of which are matching pairs. A matching pair of toponyms consists of two alternate names that correspond to the same entity (e.g. ‘London’ and ‘Londres’), as long as both names are longer than two characters and they are not identical after lower-casing. A non-matching pair consists of alternate names that do not correspond to the same entity (e.g. ‘Salsipuedes’ and ‘Isla San Pedro’). To make sure not all non-matching pairs are completely dissimilar, the authors discarded pairs with a Jaccard similarity equal to zero with a probability of 0.75. This resource is the largest employed in our work and contains toponym pairs from different languages and alphabets.
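The negative-pair filtering just described can be sketched as follows. For illustration we compute Jaccard similarity over character sets; the exact granularity used by Santos et al. [30] is not stated in this excerpt, so treat that choice as an assumption:

```python
import random

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the character sets of two toponyms."""
    sa, sb = set(a.lower()), set(b.lower())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def keep_negative_pair(a: str, b: str, rng: random.Random) -> bool:
    """Discard completely dissimilar negatives (Jaccard == 0) with probability 0.75."""
    if jaccard(a, b) > 0:
        return True
    return rng.random() >= 0.75
```

The effect is to bias the retained negatives towards pairs that share at least some surface material, which is what makes the classification task non-trivial.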
The WikiGazetteer datasets (WG:en, WG:es, WG:el). We created the Spanish, English, and Greek versions of WikiGazetteer to build three new datasets of toponym pairs, created in a similar way as in Santos et al. [30]. The resulting datasets (WG:es, WG:en, and WG:el) are significantly smaller and less ambitious (they do not have toponym pairs from across alphabets), and are biased towards place names in Spanish, English, and Greek respectively, which fitted our downstream scenarios better. Half of each dataset is composed of trivial cases: negative pairs that are extremely dissimilar (e.g. ‘London’ and ‘Paris’) and positive pairs that are either identical or nearly identical (except for differences in letter case). We found trivial negative cases by randomly selecting 50 toponyms from the gazetteer, sorting them from more to less dissimilar to the source toponym, and selecting the top most dissimilar. The other half of the dataset is comprised of very challenging cases, to force the model to learn the more nuanced toponymic variations (e.g. ‘Edinburgh’ and ‘Edinborg’ as positive, and ‘Sheverin’ and ‘Neverin’ as negative). To do so, we considered as positive matches alternate names that can correspond to the same entity and that have a normalized Levenshtein-Damerau similarity of above 0.25. We then created non-matching pairs by collecting the most similar alternate names for a toponym and ranking them using normalized Levenshtein-Damerau distance, removing alternate names of entities that are within a distance of 50 km of each other. We chose this threshold heuristically: it is conservative enough to filter out unwelcome noise, while at the same time allowing us to find enough potential non-matching pairs. (We added distance as a restriction to minimize the incidence of highly related, though still distinct, entities in the gazetteer, such as ‘Barcelona’ and ‘Barcelonès’ (its enclosing administrative territorial entity) or ‘Port of Barcelona’ (a nested entity).) We made sure that for each toponym there were as many positive as negative pairs. This resulted in a balanced dataset for each WikiGazetteer (see examples of trivial and challenging positive and negative pairs for the toponym ‘Aintourine’, extracted from the English WikiGazetteer dataset
WG:en_gz, in Table 3).

Toponym 1 | Toponym 2 | Matching
Aintourine | Aintourine | True
Aintourine | AINTOURINE | True
Aintourine | Haagsche Bosch | False
Aintourine | Sorkhankalateh | False
Aintourine | Am Toûrîne | True
Aintourine | Aïn Toûrîne | True
Aintourine | Tigantourine | False
Aintourine | Tiguentourine | False
Table 3: Positive and negative toponym pairs extracted from WG:en_gz.

The OCR dataset (OCR). Evershed and Fitch [10] released a corpus of OCR’d newspaper texts that were aligned with corrections performed by volunteers. Following the procedure described in van Strien et al. [35], we aligned the texts at the token level and identified tokens recognized as being part of named entities in the human-corrected text. Since the goal of this dataset is to allow learning OCR transformations, we filtered out pairs of aligned tokens if: (1) the OCR’d token and its correction are identical, (2) the OCR’d token has less than two characters, or (3) the OCR’d text is exactly a substring of the human-corrected text, or vice-versa. We kept a pair only if: (4) the human correction does not contain a hyphen, (5) the correction is composed only of alphabetical characters, and (6) the edit operation that transforms one token into another (e.g. ‘c’ into ‘e’ in the pair ‘Jagclman-Jagelman’) occurs more than once in the dataset. We then created a dataset that has similar characteristics to the Santos and WikiGazetteer-based datasets: for each human-corrected token, we consider all its observed OCR’d variations in the dataset as positive pairings. We then capture the most observed OCR transformations in the dataset, and artificially build negative pairs by introducing unobserved random transformations for characters in the human-corrected string. We build as many negative pairs as positive pairs exist for a corrected string. See some examples in Table 4.
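The negative-pair construction just described can be sketched as follows. This is a simplified, hypothetical helper: it applies one random character substitution that was never observed among the aligned OCR transformations:

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def make_negative(correction: str, observed_ops: set, rng: random.Random) -> str:
    """Corrupt one character of a human-corrected token with an UNOBSERVED
    (src, dst) transformation, producing an artificial negative pairing."""
    i = rng.randrange(len(correction))
    src = correction[i].lower()
    choices = [c for c in ALPHABET
               if c != src and (src, c) not in observed_ops]
    return correction[:i] + rng.choice(choices) + correction[i + 1:]
```

Because the substitution is drawn from transformations the OCR process was never seen to make, the resulting string is a plausible-looking token that should nonetheless be classified as a non-match.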
Table 5 summarizes the key characteristics of the different datasets that we have used to create the models and assess the performance of our toponym matching component. As a post-processing step, for each of the five datasets we removed pairs if one of the elements was an empty string, and removed duplicates (including reverse duplicates, such as ‘Florence, Firenze, True’ and ‘Firenze, Florence, True’). When a true pair was removed, we also removed a corresponding false pair, and vice versa, so that the datasets remained balanced. Finally, for each dataset, we provide a balanced training/validation set and a test set (90% and 10% of the whole dataset, respectively).

(The OCR data comes from the National Library of Australia Trove digitized newspaper collection. We decided to filter out tokens containing hyphens because of hyphenated words at the end of the line, sometimes resulting in partial tokens matching full tokens.)
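The de-duplication step, including reverse duplicates, can be sketched as follows (a hypothetical helper operating on (toponym, toponym, label) tuples):

```python
def deduplicate(pairs):
    """Drop pairs with empty strings, exact duplicates, and reverse duplicates
    such as ('Florence', 'Firenze', True) vs ('Firenze', 'Florence', True)."""
    seen, kept = set(), []
    for a, b, label in pairs:
        if not a or not b:
            continue  # discard pairs with an empty element
        key = (tuple(sorted((a, b))), label)  # order-insensitive key
        if key not in seen:
            seen.add(key)
            kept.append((a, b, label))
    return kept
```

Sorting the two names inside the key is what makes the check symmetric, so a pair and its reverse collapse to the same entry.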
Correction | Variation | Matching
Zurich | Zmich | True
Zurich | 7urich | True
Zurich | Zuiich | True
Zurich | Zunch | True
Zurich | Zururn | False
Zurich | Zur¿/ | False
Zurich | ZuhSch | False
Zurich | Zuæch | False

Table 4: Positive and negative OCR pairs
Dataset | Toponym Pairs | Source | Alphabet
Santos [30] | 4,337,446 | Geonames | Multiple
WG:en | 669,376 | Wikigaz (EN) | Latin
WG:es | 152,026 | Wikigaz (ES) | Latin
WG:el | 3,086 | Wikigaz (EL) | Greek
OCR | 93,111 | OCR | Latin

Table 5: Toponym matching datasets
EXPERIMENTS

Our approach is built around two main components, toponym matching and candidate selection. We assess the performance of each of them in the next two sub-sections.
5.1 Toponym Matching

The goal of toponym matching is to assess whether two strings can refer to the same location. In this section, we evaluate the performance of our toponym matching component in different settings and scenarios, and in comparison with well-established baselines.
The DeezyMatch models used in thisstudy have similar neural network architectures and hyperparame-ters. In all models, the underlying dataset is preprocessed by nor-malizing the text to the ASCII encoding standard, by removingboth the leading and the trailing empty characters, and by addinga prefix and suffix (character ‘|’) to the string. We keep the lettercase in toponym pairs. The training/validation datasets are used fortraining, hyperparameter tuning and model selection. The test setis used for reporting the final results. A character-level embeddingis employed to convert the preprocessed text into vectors of size60. The two embedding vectors of a toponym pair are then fed (inbatches of size 64) to two parallel bi-directional GRUs with twolayers. Each GRU network has a hidden state of size 60 and a maxi-mum sequence length of 120. The learnable parameters (i.e. weightsand biases) of the two GRUs are shared, which helps the model tolearn transformations regardless of the order of toponyms in aninput pair. Each bi-directional GRU network outputs two vectorscorresponding to the last hidden states of the forward and back-ward passes. These two vectors are then concatenated to form onevector with the length of 2 × hidden-state-size (i.e., 120) per GRU.We call them h GRU and h GRU for the first and second networks, igure 1: Tracking metrics during a model training. DeezyMatch logs various evaluation metrics at each epoch: (a) train andvalidation losses; (b) macro F1 scores (the harmonic mean of the precision and recall); (c) Accuracy on both train and validationsets; (d) precision and recall. The selected model (epoch 5 in this example) is shown by vertical dashed lines. respectively. DeezyMatch supports different ways of combiningthese two vectors. In our experiments, we create one vector foreach toponym pair: 1 − | h GRU − h GRU | . 
The resulting vector is then passed to a feedforward neural network with one hidden layer of size 120 with ReLU activation functions and one output unit with sigmoid nonlinearity. We use the Binary Cross Entropy criterion and the Adam optimization method with a learning rate of 0.001 to adjust the learnable parameters in our model (591,122 parameters in total). A dropout probability of 0.01 was used in all layers (i.e., GRUs and fully-connected layers) for regularization. To avoid overfitting, we also use early stopping and select the model where the validation loss starts to increase. As a baseline, we use normalized Levenshtein-Damerau edit distance, a traditional string similarity measure based on the number of operations needed to transform one string into another, to classify pairs of toponyms as either matching or not. We find the optimal threshold on the training/validation set. For comparison, we also report the performance of the toponym matching implementation by Santos et al. [29]. As done in previous work [29], we used the pyxDamerauLevenshtein Python implementation: https://pypi.org/project/pyxDamerauLevenshtein/. We had to slightly modify the original implementation to be compatible with the TensorFlow backend. We have notified the authors.
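The edit-distance baseline can be illustrated with a small self-contained sketch. The paper uses the pyxDamerauLevenshtein package and tunes the threshold on the training/validation split; the restricted Damerau-Levenshtein implementation and the 0.35 threshold below are ours and only illustrative.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein distance: insertions, deletions,
    substitutions, and adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def is_match(a: str, b: str, threshold: float = 0.35) -> bool:
    """Classify a toponym pair: a normalized distance below the threshold
    counts as a match. The 0.35 value is illustrative; in practice the
    threshold is tuned on the training/validation set."""
    norm = damerau_levenshtein(a, b) / max(len(a), len(b), 1)
    return norm < threshold

print(is_match("Yangji-mal", "yangjimal"))  # distance 2, norm 0.2 -> True
```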
Following Santos et al. [29], we treat toponym matching as a binary classification task. We report the F-Score (i.e. the harmonic mean of precision and recall). Table 6 reports the performance of DeezyMatch and competing methods on several toponym matching datasets. Note that we tested the performance of
DeezyMatch and
LevDam on the test split (10% of the datasets), to ensure reproducibility. For Santos et al. [29], this was not possible without extensively changing their code, as their implementation does not allow testing an existing model on new data. Therefore, we evaluated it only through two-fold cross validation on the training/validation dataset. Though not strictly comparable, we show in Table 6 the differences in performance between the two implementations. The inference function of DeezyMatch additionally provides accuracy, precision, and recall. We have not included them in the paper for readability reasons, but all experiments can be found in our GitHub repository: https://github.com/Living-with-machines/LwM_SIGSPATIAL2020_ToponymMatching. We provide all resources to allow full reproducibility of the results. Performance of Santos et al. [29] on their dataset is considerably lower than that reported in their paper (0.89 F1 Score). This difference is due to the fact that we removed all duplicates (including reverse duplicates, such as 'Yangji-mal, yangjimal' and 'yangjimal, Yangji-mal') in the dataset (∼7% of the original resource).

                    Santos  WG:en  WG:es  WG:el  OCR
LevDam              0.70    0.74   0.75   0.83   0.76
Santos et al. [29]  0.82    0.92   0.90   0.80   0.95
DeezyMatch          0.89    0.94   0.92   0.84   0.95
Table 6: Evaluation of toponym matching methods in terms of F1 Score on the different datasets described in Section 4.3.
Candidate selection is the task of ensuring that the correct entity is found among the retrieved candidates. In this section, we report the performance of our candidate selection component on different toponym resolution datasets.
Evaluation of geographical candidate selection in toponym resolution systems is not always straightforward. This is in part due to the lack of a true gold standard for places, as gazetteers indicate the position of a place on the Earth's surface through its approximate coordinates, which may not coincide with the exact coordinates used by the dataset annotators. Because of this, it has been common in the literature to allow an error distance, be it in km (usually 161km, i.e. 100 miles) or degrees [5, 8, 28, 32]. We decided to be more restrictive and considered a candidate as correct if it was within 10km of its location in the gazetteer. Our candidate selection module finds potential matching toponyms in the gazetteer, not entities. During evaluation, if a toponym in the gazetteer can refer to more than one entity, we select the one closest to the gold standard coordinates; and if this falls within 10km of the gold standard location, we consider it a true positive. We report our results based on two metrics: precision at 1 candidate (
P@1) and mean average precision at 5, 10, and 20 candidates (MAP@5, MAP@10, and MAP@20), to evaluate the quality of the ranking. To better understand the performance of DeezyMatch for candidate selection, we provide the following baselines: (1) Exact: a candidate is retrieved if it exactly matches the toponym in the text (case insensitive), which is the most common type of candidate selection in entity linking and toponym resolution methods; and (2) LevDam: candidates are ranked by string similarity, based on normalized Levenshtein-Damerau edit distance. While this is a strong baseline, it is often impracticable in downstream toponym resolution because of time complexity (see Table 7). We observed that 161km was too large a distance for some of our datasets. A smaller window ensures higher reliability of our precision metrics. Given the lack of a true gold standard mentioned above, some true candidates are incorrectly considered as false positives. Chile is an extreme example of this: the annotator assigned coordinates with -37.78 latitude and -71.36 longitude to this country, and the coordinates in the WG:es+HGISIndias gazetteer are -33.45 latitude and -70.67 longitude, almost 500km apart; and yet, they are both correctly a point in Chile. In our evaluation results, we excluded cases where none of the methods retrieved any correct results, because our intention is not to evaluate gazetteer-to-dataset compatibility, but the quality of our method's candidate selection compared to other methods. We used the pyxDamerauLevenshtein implementation: https://pypi.org/project/pyxDamerauLevenshtein/.
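The evaluation procedure above can be made concrete with a short sketch. The 10km threshold and the Chile coordinates come from the text; the function names and the AP@k formulation are our own (metric libraries differ in the exact normalization), so treat this as an illustration rather than the evaluation code used in the paper.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    a = (sin(radians(lat2 - lat1) / 2) ** 2
         + cos(phi1) * cos(phi2) * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def is_true_positive(gold, cand, max_km=10.0):
    """A retrieved candidate counts as correct if it falls within
    max_km (10km in our evaluation) of the gold coordinates."""
    return haversine_km(gold[0], gold[1], cand[0], cand[1]) <= max_km

def precision_at_1(ranked_relevance):
    """ranked_relevance: list of booleans, best-ranked candidate first."""
    return 1.0 if ranked_relevance and ranked_relevance[0] else 0.0

def average_precision_at_k(ranked_relevance, k):
    """One common AP@k formulation: average of the precision values at
    each relevant rank within the top k (illustrative only)."""
    hits, total = 0, 0.0
    for i, rel in enumerate(ranked_relevance[:k], start=1):
        if rel:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

# The Chile example from the text: annotator vs. gazetteer coordinates,
# almost 500km apart, hence not a true positive under the 10km window.
print(is_true_positive((-37.78, -71.36), (-33.45, -70.67)))  # False
```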
We report the performance of our method and baselines on three datasets for candidate selection in Table 7 (datasets and gazetteers are described in detail in Sections 4.1 and 4.2, respectively). We show how the exact baseline, while being the most common approach in entity linking systems, is insufficient for the task of toponym candidate selection from noisy datasets. We observe that the performance of the LevDam and DeezyMatch methods varies significantly depending on the dataset: while DeezyMatch clearly outperforms LevDam on the Wotr dataset and on the ArgManuscrita dataset (in the second case, in terms of MAP@5, MAP@10, and MAP@20, while being comparable in P@1), our method shows lower performance on the BNA-FMP dataset in comparison with the LevDam baseline, particularly in terms of P@1. We argue that the reason behind these differences in performance lies either in the nature of the toponymic variations present in the datasets or in the nature of the toponym matching datasets from which we learned the transformations. To better understand this, for each dataset we investigated the results by looking at the candidates correctly retrieved by one method but not by another. DeezyMatch seems clearly better than LevDam when the transformation affects a significant part of the toponym; this is particularly the case for long multi-token place names. In the ArgManuscrita dataset, for example, given the toponym 'provincia del Paraguay', DeezyMatch returns 'Republica del Paraguay', 'República del Paraguay', 'Paraguai - Paraguay', 'Republic of Paraguay', and 'Departamento Alto Paraguay' as the most likely candidates, whereas LevDam returns 'Provincia de Paragua', 'Provincia de Veragua', 'Provincia de Camaguey', followed by a large number of other provinces from around the world, sorted by string surface similarity. To provide another example, DeezyMatch ranks 'Departamento de Tarija' as the most likely match for the toponym 'corregimiento de Tarija', while LevDam retrieves 'corregimiento de Tunja', followed by other 'corregimientos' (a type of country subdivision), such as 'corregimiento de Loja'. DeezyMatch is able to rank these candidates better because it has learned similar transformations from the corresponding toponym matching dataset (in this case, WG:es), which, even though it does not contain the alternate names 'corregimiento de Tarija' and 'provincia del Paraguay', has other similarly-shaped multi-token toponyms from which these transformations can be learned.

On the contrary, LevDam is a very strong baseline when the string surface difference between the two toponyms is very small (i.e. a difference of very few characters). This is particularly common in the BNA-FMP dataset, where most toponym variations are caused by OCR errors. While DeezyMatch offers better results in comparison with exact matching, and while the quality of its ranking remains consistently high when more candidates are retrieved, it is clearly behind LevDam in retrieving the best first candidate (P@1). In order to better understand this issue, we tried two DeezyMatch models, one trained on WG:en and one trained on OCR. However, disappointingly, the second model produced worse results. This might be due to the fact that, while some typical OCR transformations seem to be correctly captured by the OCR-based DeezyMatch model, they are not always aligned between the toponym matching dataset and the OCR errors in the gold standard candidate selection dataset. The
OCR matching dataset does not in fact have the same origin as the BNA-FMP dataset, and while some OCR transformations are probably largely generic (e.g. 'e' to 'c', or 'B' to 'P'), different typographies or OCR software may lead to learning unwelcome transformations. On the other hand, by using a model trained only on the OCR dataset, we are disregarding the other types of transformations that are captured in more generic toponym-based resources, such as WG:en. We will continue our experiments on OCR-induced noise in future publications, considering other hyperparameters and exploring transfer learning approaches, a functionality that DeezyMatch already provides.

                                  Gazetteer            P@1   MAP@5  MAP@10  MAP@20  Time
ArgManuscrita:exact               WG:es_gz+HGISIndias  0.69  -      -       -       -
ArgManuscrita:LevDam              WG:es_gz+HGISIndias  0.78  0.77   0.72    0.70    29.77m
ArgManuscrita:DeezyMatch (WG:es)  WG:es_gz+HGISIndias  0.78  0.78   0.76    0.74    0.73m
Wotr:exact                        WG:en_gz             0.86  -      -       -       -
Wotr:LevDam                       WG:en_gz             0.92  0.89   0.84    0.80    308m
Wotr:DeezyMatch (WG:en)           WG:en_gz             0.93  0.92   0.90    0.87    4.75m
BNA-FMP:exact                     WG:en_gz             0.77  -      -       -       -
BNA-FMP:LevDam                    WG:en_gz             0.92  0.88   0.82    0.76    120m
BNA-FMP:DeezyMatch (WG:en)        WG:en_gz             0.85  0.85   0.82    0.78    4.27m
BNA-FMP:DeezyMatch (OCR)          WG:en_gz             0.83  0.83   0.82    0.80    4.27m

Table 7: DeezyMatch candidate ranker performance. All DeezyMatch models have been trained using the model architectures and the choice of hyperparameters described in Section 5.1.1. The datasets on which they have been trained are specified in parentheses in the first column; Gazetteer specifies the gazetteer from which candidates are retrieved for each scenario. All methods are evaluated using the same metrics (columns P@1, MAP@5, MAP@10, and MAP@20). Time indicates total computation time on CPU, which mostly depends on the number of queries and the size of the gazetteer.

Nevertheless, whereas LevDam is generally a strong baseline, its high computational cost makes it impracticable to use in many real applications of candidate selection. In this regard, DeezyMatch undeniably offers a strong alternative. While training the model and generating the gazetteer candidate vectors is computationally expensive, these are steps that need to be done only once and can be reused for all subsequent candidate selection tasks that employ the same gazetteer. The time needed for generating query vectors for a set of toponyms and finding candidates in a gazetteer is reported in the last column of Table 7 and is significantly lower than that of LevDam in all cases.
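The query-time economics discussed above can be illustrated with a toy sketch. The array shapes and function names are our own assumptions, and DeezyMatch's actual candidate ranker (and approximate search libraries such as FAISS [17]) is more sophisticated; the point is that the expensive gazetteer vectors are computed once, and each incoming toponym only needs one encoder pass plus a nearest-neighbour search.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for encoder outputs: vectors for 100,000 gazetteer alternate
# names, computed once and reused for every subsequent query.
gazetteer_vecs = rng.normal(size=(100_000, 120)).astype(np.float32)

def rank_candidates(query_vec, candidate_vecs, k=20):
    """Indices of the k nearest candidates by L2 distance."""
    dists = np.linalg.norm(candidate_vecs - query_vec, axis=1)
    top = np.argpartition(dists, k)[:k]   # O(n) partial selection
    return top[np.argsort(dists[top])]    # sort only the top k

# One vector comparison pass per query, instead of an edit-distance
# computation against every gazetteer entry.
query_vec = rng.normal(size=120).astype(np.float32)
top20 = rank_candidates(query_vec, gazetteer_vecs, k=20)
print(top20.shape)  # (20,)
```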
In this paper, we discussed the importance of precisely identifying candidates in order to resolve toponyms to their real-world referents. In particular, we highlighted its necessity when working with noisy and non-standard texts (e.g. documents digitized with OCR). To foster further research on this intermediary step, we have introduced DeezyMatch, a flexible deep learning method for candidate selection through toponym matching. It is based on state-of-the-art neural network architectures and has been tested in different evaluation settings, considering various challenging scenarios (cross-lingual, diachronic, and regional variations, as well as OCR errors) and in comparison with a series of well-established baselines. DeezyMatch, the evaluation framework presented in this paper, and all other resources employed are useful contributions to other researchers working at the intersection of geospatial information retrieval and digital humanities.

DeezyMatch training time (on GPU) until the validation loss starts to increase: Santos (4,337,446 pairs): 10h4, WG:en (669,376): 56m, WG:es (152,026): 21m, WG:el (3,086): 1m, OCR (93,111): 5m. Generating candidate vectors for the largest gazetteer (i.e. WG:en_gz, with 2,455,966 unique alternate names) takes 204m on CPU. Query vector representations can be generated, compared to candidate vectors, and ranked on-the-fly. This can be directly integrated into an entity linking pipeline.
REFERENCES

[1] Saad Aloteibi and Mark Sanderson. 2014. Analyzing geographic query reformulation: An exploratory study. Journal of the Association for Information Science and Technology (2014).
[2] Mariona Coll Ardanuy, Katherine McDonough, Amrey Krause, Daniel CS Wilson, Kasra Hosseini, and Daniel van Strien. 2019. Resolving places, past and present: toponym resolution in historical British newspapers using multiple resources. In Proc. of GIR.
[3] Razvan Bunescu and Marius Pasca. 2006. Using encyclopedic knowledge for named entity disambiguation. (2006).
[4] James O. Butler, Christopher E. Donaldson, Joanna E. Taylor, and Ian N. Gregory. 2017. Alts, Abbreviations, and AKAs: Historical Onomastic Variation and Automated Named Entity Recognition. Journal of Map & Geography Libraries (2017). https://doi.org/10.1080/15420353.2017.1307304
[5] Zhiyuan Cheng, James Caverlee, and Kyumin Lee. 2010. You are where you tweet: a content-based approach to geo-locating twitter users. In Proc. of CIKM. 759–768.
[6] François Chollet et al. 2015. Keras. https://keras.io.
[7] Silviu Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proc. of EMNLP-CoNLL.
[8] Grant DeLozier, Jason Baldridge, and Loretta London. 2015. Gazetteer-independent toponym resolution using geographic word profiles. In Proc. of AAAI.
[9] Grant DeLozier, Ben Wing, Jason Baldridge, and Scott Nesbit. 2016. Creating a novel geolocation corpus from historical texts. In Proc. of LAW-X.
[10] John Evershed and Kent Fitch. 2014. Correcting noisy OCR: Context beats confusion. In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage. ACM.
[11] Paolo Ferragina and Ugo Scaiella. 2010. Tagme: on-the-fly annotation of short text fragments (by wikipedia entities). In Proceedings of the 19th ACM International Conference on Information and Knowledge Management. 1625–1628.
[12] Qingqing Gan, Josh Attenberg, Alexander Markowetz, and Torsten Suel. 2008. Analysis of Geographic Queries in a Search Engine Log. In Proc. of LOCWEB. New York, NY, USA.
[13] Octavian-Eugen Ganea and Thomas Hofmann. 2017. Deep Joint Entity Disambiguation with Local Neural Attention. In Proc. of EMNLP.
[14] Ian Gregory, Christopher Donaldson, Patricia Murrieta-Flores, and Paul Rayson. 2015. Geoparsing, GIS, and Textual Analysis: Current Developments in Spatial Humanities Research. (2015).
[15] Ben Hachey, Will Radford, Joel Nothman, Matthew Honnibal, and James R Curran. 2013. Evaluating entity linking with wikipedia. Artificial Intelligence (2013).
[16] Kasra Hosseini, Federico Nanni, and Mariona Coll Ardanuy. 2020. DeezyMatch: A Flexible Deep Learning Approach to Fuzzy String Matching. In EMNLP: System Demonstrations (accepted).
[17] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data (2019).
[18] Phong Le and Ivan Titov. 2019. Boosting Entity Linking Performance by Leveraging Unlabeled Documents. In Proc. of ACL.
[19] Phong Le and Ivan Titov. 2019. Distant Learning for Entity Linking with Automatic Noise Detection. In Proc. of ACL.
[20] Pedro Henrique Martins, Zita Marinho, and André F. T. Martins. 2019. Joint Learning of Named Entity Recognition and Entity Linking. In Proc. of ACL.
[21] Paul McNamee, James Mayfield, Dawn Lawrie, Douglas W Oard, and David Doermann. 2011. Cross-language entity linking. In Proceedings of the 5th International Joint Conference on Natural Language Processing. 255–263.
[22] Pablo N Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. 2011. DBpedia spotlight: shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems. 1–8.
[23] Jose G Moreno, Romaric Besançon, Romain Beaumont, Eva D'hondt, Anne-Laure Ligozat, Sophie Rosset, Xavier Tannier, and Brigitte Grau. 2017. Combining word and entity embeddings for entity linking. In European Semantic Web Conference. Springer, 337–352.
[24] Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity linking meets word sense disambiguation: a unified approach. TACL (2014).
[25] Ross S. Purves, Paul Clough, Christopher B. Jones, Mark H. Hall, and Vanessa Murdock. [n.d.]. Geographic Information Retrieval: Progress and Challenges in Spatial Search of Text. 12, 2 ([n.d.]), 164–318. https://doi.org/10.1561/1500000034
[26] Gianluca Quercini, Hanan Samet, Jagan Sankaranarayanan, and Michael D Lieberman. 2010. Determining the spatial reader scopes of news sources using local lexicons. In Proc. of SIGSPATIAL.
[27] Jonathan Raphael Raiman and Olivier Michel Raiman. 2018. DeepType: multilingual entity linking by neural type system evolution. In Thirty-Second AAAI Conference on Artificial Intelligence.
[28] Stephen Roller, Michael Speriosu, Sarat Rallapalli, Benjamin Wing, and Jason Baldridge. 2012. Supervised text-based geolocation using language models on an adaptive grid. In Proc. of EMNLP.
[29] Rui Santos, Patricia Murrieta-Flores, Pável Calado, and Bruno Martins. 2018. Toponym matching through deep neural networks. International Journal of Geographical Information Science (2018).
[30] Rui Santos, Patricia Murrieta-Flores, and Bruno Martins. 2018. Learning to combine multiple string similarity metrics for effective toponym matching. International Journal of Digital Earth (2018).
[31] Avirup Sil, Gourab Kundu, Radu Florian, and Wael Hamza. 2018. Neural cross-lingual entity linking. In Thirty-Second AAAI Conference on Artificial Intelligence.
[32] Michael Speriosu and Jason Baldridge. 2013. Text-driven toponym resolution using indirect supervision. In Proc. of ACL. 1466–1476.
[33] Werner Stangl. 2018. 'The Empire Strikes Back'?: HGIS de las Indias and the Postcolonial Death Star. IJHAC (2018).
[34] Derek Tam, Nicholas Monath, Ari Kobren, Aaron Traylor, Rajarshi Das, and Andrew McCallum. 2019. Optimal transport-based alignment of learned character representations for string similarity. arXiv preprint arXiv:1907.10165 (2019).
[35] Daniel van Strien, Kaspar Beelen, Mariona Coll Ardanuy, Kasra Hosseini, Barbara McGillivray, and Giovanni Colavizza. 2020. Assessing the Impact of OCR Quality on Downstream NLP Tasks. In