Joint Spatio-Textual Reasoning for Answering Tourism Questions
Danish Contractor∗, Shashank Goel†, Mausam, Parag Singla
IBM Research AI, New Delhi; Indian Institute of Technology, New Delhi
[email protected], [email protected], {mausam, parags}@cse.iitd.ac.in

Abstract
Our goal is to answer real-world tourism questions that seek Points-of-Interest (POI) recommendations. Such questions express various kinds of spatial and non-spatial constraints, necessitating a combination of textual and spatial reasoning. In response, we develop the first joint spatio-textual reasoning model, which combines geo-spatial knowledge with information in textual corpora to answer questions. We first develop a modular spatial-reasoning network that uses geo-coordinates of location names mentioned in a question, and of candidate answer POIs, to reason over only spatial constraints. We then combine our spatial-reasoner with a textual reasoner in a joint model and present experiments on a real-world POI recommendation task. We report substantial improvements over existing models without joint spatio-textual reasoning.
1 Introduction

Users of travel forums often post questions seeking personalized recommendations for their travel needs. Consider the example in Figure 1, which shows a real-world Points-of-Interest (POI) seeking question. Answering such a recommendation question is a challenging problem: it not only requires reasoning over a text corpus describing potential restaurants (e.g., reviews), but it also requires resolving spatial constraints ("near Hotel Florida") over the physical location of a restaurant. In addition, the question is also under-specified and ambiguous (e.g., "dont have to venture too far"), making the spatial-inference task harder.

Recently, there has been work on QA models that fuse knowledge from multiple sources; for example, by combining data from knowledge bases with textual passages (Xia et al., 2019; Bi et al., 2019), or incorporating multi-modal data sources (Guo et al., 2018; Vo et al., 2019). But we do not know of systems that fuse geo-spatial knowledge with text. In addition, there exist several geo-spatial IR systems (e.g., Santos and Cabral, 2009; Scheider et al., 2020); however, to the best of our knowledge, none of them perform joint reasoning over geo-spatial and textual knowledge sources.

In response, we present our joint spatio-textual QA model for returning answers to questions that require textual as well as spatial reasoning. We first develop a modular spatial-reasoning network that uses geo-coordinates of location names mentioned in a question, and of candidate answer entities, to reason over only spatial constraints. It learns to associate contextual distance-weights with each location-mention in the question; these weights are combined with their respective spatial distances from a candidate answer to generate a 'spatial relevance' score for that answer.

We then combine the spatial-reasoner with a textual QA system to develop a joint spatio-textual QA model. We demonstrate the model using a recently introduced QA task, which contains tourism questions seeking POI (entity) answers (Contractor et al., 2019). It also contains a collection of entity reviews as a knowledge source for answering these questions. We provide the geo-spatial knowledge for the task by mapping location-mentions in questions to their geographical coordinates using publicly available APIs. Similarly, candidate answer POIs are also mapped to their geographical coordinates, included as part of the dataset (Contractor et al., 2019).

Figure 1: A sample POI recommendation question. The answers correspond to POI IDs of the form <city id> <POI type> <number>. The Tourism QA dataset has three classes of POIs: restaurants (R), attractions (A) and hotels (H).

∗ This work was carried out as part of PhD research at IIT Delhi. The author is also a regular employee at IBM Research.
† Work carried out when the author was a student at IIT Delhi.
https://bit.ly/2zIxQpj
To the best of our knowledge, we are the first to develop a joint QA model that combines reasoning over external geo-spatial knowledge with textual reasoning.

Contributions:
Our paper makes the following contributions:
1. We develop a spatial-reasoner that uses geo-coordinates of locations and POIs to reason over spatial constraints specified in a question.
2. We demonstrate, using a simple toy dataset, that our spatial-reasoner is not only able to reason over "near" and "far" constraints but is also able to determine location references that are not useful for reasoning (e.g., a location reference mentioning where a user last went on vacation).
3. We develop a spatio-textual QA model, which fuses spatial knowledge (geo-coordinates) with textual knowledge (POI reviews) using sub-networks designed for spatial and textual reasoning.
4. We demonstrate that our joint spatio-textual model performs significantly better than models employing only spatial- or textual-reasoning. It also obtains state-of-the-art results on a real-world tourism questions dataset, with substantial improvement in answering location questions.
2 Related Work

Our work is related to four broad areas of question answering and information retrieval:
Geographical Information Systems:
There is significant prior work on Geographical Information Systems where standard IR models are augmented with spatial knowledge (Ferrés Domènech, 2017; Purves et al., 2018). Models have been developed to address challenges in adhoc-retrieval tasks with locative references (Gey et al., 2006; Mandl et al., 2008; Santos and Cabral, 2009). However, such models deal primarily with inference problems in toponyms (e.g., "Beijing is located in China"), location disambiguation, and the use of topographical classes (e.g., "Union lake is a water-body"). Methods for IR involving locative references use three strategies: (i) a pipeline of filtering based on spatial information followed by text-based IR, (ii) a pipeline of filtering based on text-based IR followed by ranking based on geo-spatial ranking or coverage, and (iii) a weighted or linear combination of two independent rankings (Leidner et al., 2020). Our work builds on the third strategy by jointly training a model with both geo-spatial and textual components. To the best of our knowledge, joint reasoning over text and geo-spatial data has not been investigated in the geographical IR literature.
Geo-Spatial Querying:
There has been considerable work in the research areas of geo-parsing (toponym discovery and disambiguation) (Kew et al., 2019), geo-spatial query processing over structured or RDF knowledge bases (KB) (Vorona et al., 2019; Scheider et al., 2020), and geocoding and geo-tagging documents (De Rassenfosse et al., 2019; Lim et al., 2019; Huang and Carley, 2019). However, such querying methods require KB- and task-specific annotations for training and are thus specialized in application and scope (Scheider et al., 2020).
Numerical Reasoning for Question Answering:
Spatial reasoning in our task is effectively a form of numerical reasoning over distances between location-mentions in a question and a candidate entity (POI). Recently introduced tasks such as DROP (Dua et al., 2019) and QuaRTz (Tafjord et al., 2019) require reasoning that includes addition, subtraction, counting, etc. for answering reading-comprehension style questions. Other tasks such as MathQA (Amini et al., 2019) and MathSAT (Hopkins et al., 2019) present high-school and SAT-level algebraic word problems.

Models developed for numerical reasoning tasks, such as NAQANet (Dua et al., 2019) and NumNet (Ran et al., 2019), reason over the explicit mentions of numerical quantities within a question or passage. In contrast, the questions in our task do not explicitly mention geographical coordinates, and also do not contain all the information required for numerical reasoning (since the distances need to be computed with respect to a candidate answer under consideration). Further, in contrast to algebraic word problems and numerical reasoning questions, answers in the POI-recommendation task are also heavily influenced by text-based reasoning on subjective POI-entity reviews.
Points-of-Interest (POI) Recommendation:
Existing models for POI recommendation typically rely on the presence of structured data, including geo-spatial coordinates. Queries may be structured or semi-structured and can consist of both spatial and textual arguments. Textual arguments are usually associated with the structured attributes or may serve as filters. Approaches include efficient indexing of 'spatial' and 'preference' features along with specialized data structures such as IR-Trees (Cong et al., 2009; Zhang et al., 2016; Tsatsanifos and Vlachou, 2015; Li et al., 2016), methods based on Matrix Factorization (Yiu et al., 2007) for user-specific recommendations, and click-through logs used for recommendations from search engines (Zhao et al., 2019).

Our work builds on the recently released POI entity-recommendation QA task (Contractor et al., 2019, 2020). Two approaches have been developed for this task: semantic parsing of unstructured user questions to query a semi-structured knowledge store (Contractor et al., 2020), and an end-to-end trainable neural model operating over a corpus of unstructured reviews to represent POIs (Contractor et al., 2019). Neither of these approaches explicitly reasons over spatial constraints, even though the questions contain them.
3 Spatio-Textual Reasoning Network

The Spatio-Textual Reasoning Network (Figure 2) consists of three components: (i) a Geo-Spatial Reasoner, (ii) a Textual Reasoner, and (iii) a Joint Scoring Layer.

Our geo-spatial reasoner consists of the following components: (1) a Distance-aware Question Encoder, to encode questions along with the geo-spatial distances between location mentions (in the question) and a candidate entity; (2) a Distance Reasoning Layer, to enable reasoning over geo-spatial distances with respect to the spatial constraints mentioned in the question; and (3) a Spatial Relevance Scorer, to score and rank candidates for spatial relevance.
Distance-aware Question Encoder: We generate question representations by using embedding representations of their constituent tokens along with embedding representations of their location-mentions. A question token can be represented by traditional word-vector embeddings, or contextual embeddings such as BERT (Devlin et al., 2019). Each token representation is further appended with a one-hot encoding representing Begin (B), Intermediate (I) or Other (O) labels, indicating the presence of location tokens. The B-I labels help the model recognize a single continuous location-mention. In addition, we concatenate the distance of the candidate entity $c$ from a location-mention to each token representation. Thus, the question representations are distance-aware and candidate-dependent.

Formally, let the token embedding representations in a question be given by $v_i$ ($v_0 \ldots v_i \ldots v_{m-1}$), where $m$ is the length of the question. Let the distance between the $k$-th location-mention $lm_k$ and $c$ be denoted by $d_k$. Further, let $\phi(lm_k)$ be a function that returns the set of position indices occupied by location mention $lm_k$, i.e., it returns the set of position indices of question tokens that have been assigned the B or I label from the B-I encoding for location mention $lm_k$ ($\phi(lm_k) \subset \{0, \ldots, m-1\}$). We create an $m$-dimensional distance vector $d'$ where each element $d'_i$ of the vector is given by:

$$d'_i = \begin{cases} d_k & \text{if } i \in \phi(lm_k) \\ 0 & \text{otherwise} \end{cases} \quad (1)$$

Let the one-hot vector (two dimensional) of the B-I labels for the $i$-th position be $g_i$. The input question embedding $t_i$ ($t_0 \ldots t_i \ldots t_{m-1}$) is then given by:

$$t_i = \text{concat}[v_i, d'_i, g_i] \quad (2)$$

We encode the question using a bi-directional GRU (Cho et al., 2014), which results in output states $q_i$.

Distance-Reasoning Layer (DRL):
We first used a series of down-projecting feed-forward layers, applied to the output state of the GRU, to generate the final score for each candidate, but we found this was not effective (Section 4.1.2). We therefore include a component designed for distance reasoning, referred to as the 'Distance Reasoning Layer', which uses the representations generated by the distance-aware question encoder.

A model could score candidate entities for relevance if, for each location mentioned in the question, it is able to (i) learn whether a location-mention needs to be considered for answering, and (ii) learn how a location-mention needs to be used for answering. Our design of the DRL is motivated by this insight: it learns a function which, for each location-mention $lm_k$ in the question, outputs a distance-weight $w_k$. Here, $w_k$ captures the contribution of the spatial distance between $lm_k$ and the candidate entity $c$, under the constraints mentioned in the question. For instance, a question may include location-mentions that could be involved in simple 'near' or 'far' constraints, or other complex constraints such as "within driving distance" or "within walking distance". The DRL uses the distance-aware question encoding to understand the nature of the constraint being expressed, as well as to figure out how to compute distance-reasoning weights that express those constraints.

Figure 2: Spatio-Textual reasoning network consisting of (i) Geo-Spatial Reasoner, (ii) Textual-Reasoning subnetwork, (iii) Joint Scoring Layer.

Let the output states of the question encoder be given by $q_0 \ldots q_i \ldots q_{m-1}$, where $m$ is the length of the question. To compute distance-weights, we use a series of position-wise feed-forward blocks (Vaswani et al., 2017) that consist of a linear layer with ReLU activation applied at each output position of the Question Encoder:

$$q_i^l = \text{Block}^l(q_i^{l-1}) = \max(0, A^l q_i^{l-1} + b^l) \quad (3)$$

where $q_i^l$ is the output of the block at layer $l$, $A^l$ is a weight matrix and $b^l$ the bias term. The initial block input uses the output state of the GRU ($q_i$) concatenated with the final hidden state ($q_{m-1}$). Thus, the output $q_i^1$ from the application of the first block layer, corresponding to position $i$ in the input, is given by:

$$q_i^1 = \text{Block}^1(\text{concat}[q_i, q_{m-1}]) \quad (4)$$

The blocks apply the same linear transformations at each position but we vary the parameters across layers (see appendix). The final layer gives a single-dimensional output for each position, resulting in an $m$-dimensional vector $r$ ($r_0 \ldots r_i \ldots r_{m-1}$). Let $B$ be an $m$-dimensional one-hot vector based on the position indices that have been assigned only the B label from the B-I encoding used in the input layer.
The distance-weight vector $w$ for a question is given by:

$$w = \tanh(r \odot B) \quad (5)$$

We use the distance-weights for scoring, as described below.

Spatial Relevance Scorer:
The final score $S_L$ of a candidate $c$ is given by:

$$S_L = w \cdot d' \quad (6)$$

Note that since we concatenate the distance values along with token embeddings while encoding locations as part of the Question Encoder (Equation 2), the model learns distance weights $w$ which depend on the distance value as well as on the semantic information present in the question. Thus, the spatial relevance score is not just a simple linear combination of distances, which makes the model representationally more powerful (see experiments in Section 4.2).

(An element of $B$ is 1 whenever it corresponds to a position index indicating the start of a location mention in a question.)

We refer to the Geo-Spatial Reasoner as SPNET for brevity in the rest of the paper.

Textual Reasoner: We use the CRQA (Contractor et al., 2019) model as our textual-reasoning sub-network. It consists of a Siamese-Encoder (Lai et al., 2018) that uses question representations to attend over entity-review sentences and generate question-aware entity-embeddings. These entity embeddings are combined with question representations to generate an overall relevance score. For scalability, instead of using full review documents, the model uses a set of representative sentences from reviews after clustering them in USE-embedding space (Cer et al., 2018). We follow Contractor et al. and use k-means to cluster sentences in USE embedding space. We set k=10 and select 10 sentences per cluster, creating a set of at most 100 representative sentences per entity.

Joint Scoring Layer: Let the score generated by the textual reasoner be $S_T$ and the score generated by the spatial reasoner be $S_L$. Let the rescaling weights for $S_T$ and $S_L$ be $w_T$ and $w_L$ respectively. Then, the overall score $S$ is given by:

$$S = \alpha \cdot \sigma(w_T S_T) \cdot \tanh(w_L S_L) + \beta \cdot \sigma(w_T S_T)$$

where $\sigma$ is the sigmoid function and $\alpha$, $\beta$ are combination weights.
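To make the scoring pipeline concrete, the sketch below strings together Equations 5-6 and the joint score. All learned quantities (the DRL outputs r, and the weights w_T, w_L, alpha, beta) are replaced by fixed illustrative values, and the function names are ours, not the paper's.

```python
import math

def spatial_score(r, begin_mask, distances):
    # Eq. 5-6: w = tanh(r ⊙ B);  S_L = w · d'
    # A negative weight penalizes large distances ('near' constraints);
    # a positive weight rewards them ('far' constraints).
    w = [math.tanh(ri * bi) for ri, bi in zip(r, begin_mask)]
    return sum(wi * di for wi, di in zip(w, distances))

def joint_score(s_t, s_l, w_t=1.0, w_l=1.0, alpha=0.5, beta=0.5):
    # S = alpha * sigmoid(w_T S_T) * tanh(w_L S_L) + beta * sigmoid(w_T S_T)
    sig = 1.0 / (1.0 + math.exp(-w_t * s_t))
    return alpha * sig * math.tanh(w_l * s_l) + beta * sig

# With no location mentions, d' is all zeros, so S_L = 0 and tanh(S_L) = 0;
# only the purely textual term beta * sigmoid(w_T * S_T) survives.
```

This also illustrates the selector behavior discussed in the text: a zero spatial score switches the model to textual evidence alone.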
The weights are computed by returning a two-dimensional output (corresponding to each weight) after a series of feed-forward operations on the self-attended representation (Cheng et al., 2016) of the question, using the outputs of a Question Encoder with the same architecture as in SPNET (see appendix for hyperparameters). Note that the first term of the scoring equation uses $S_L$ as a selector: for questions with no location-mentions, the spatial score will be 0 (due to the equation for $w$). This lets the model rely only on textual scores in these cases.

Training:
We train the joint model using a max-margin loss, teaching the network to score correct-answer entities higher than negative samples.
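The max-margin objective can be sketched with a standard hinge formulation; the margin value and the way negatives are supplied here are illustrative, not the paper's exact hyperparameters:

```python
def max_margin_loss(pos_score, neg_scores, margin=1.0):
    # Hinge loss: penalize any negative sample whose score comes
    # within `margin` of the correct-answer entity's score.
    return sum(max(0.0, margin - pos_score + s) for s in neg_scores)

# A well-separated gold entity incurs zero loss:
# max_margin_loss(5.0, [1.0, 2.0]) == 0.0
```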
Figure 3:
Sample questions from the Toy Dataset. The dataset has questions from three categories: (1) close to set X, (2) far from set X, (3) Combination.
4 Experiments

We first present a detailed study of the spatial-reasoner using a simple artificially generated toy dataset. This allows us to probe and study different aspects of spatial reasoning in the absence of textual reasoning. We then present our experiments with the joint spatio-textual model using a real-world POI-recommendation QA dataset (Sec 4.2).
We conduct this study on a simple toy dataset generated using linguistically diverse templates specifying spatial constraints, with location names chosen at random from a list of entities across several cities. The templates can be broadly divided into three types of proximity queries, based on whether the correct answer entity is expected to be: (1) close to one or more locations (mentioned in the question), (2) far from one or more locations, or (3) close to some and far from others (combination). We create different templates for each category with linguistic variations. Figure 3 shows a sample question from each category. See the appendix for more details, including the list of templates.
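Each template induces a signed distance-based score over candidates, which determines the gold answer (as described under 'Gold-entity generation' below). A minimal sketch, using 1-D coordinates and helper names that are our own simplification:

```python
def template_score(cand, close_to, far_from, dist):
    # Reward small distances to 'close' mentions and
    # large distances to 'far' mentions.
    return -(sum(dist(cand, b) for b in close_to)
             - sum(dist(cand, k) for k in far_from))

def gold_entity(candidates, close_to, far_from, dist):
    # The highest-scoring candidate in the universe is the gold answer.
    return max(candidates,
               key=lambda c: template_score(c, close_to, far_from, dist))

# 1-D toy example: the candidate at 0 is close to the 'close' mention at 1
# and far from the 'far' mention at 9, so it wins over the candidate at 10.
d = lambda a, b: abs(a - b)
```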
Use of distractor-locations:
In order to make the task more reflective of real-world challenges, we also randomly insert a distractor sentence that contains a location reference which does not need to be reasoned over (e.g., the location "Pinati" in a question in Figure 3).

Gold-entity generation:
The gold answer entity is uniquely determined for each question based on its template. For example, consider a template T, "I am staying at $A! Please suggest a hotel close to $B but far from $K." The score of a candidate entity X is given by dist_T(X) = -(dist(X, B) - dist(X, K)) (the distance from B needs to be small, while the distance from K needs to be large). A is a distractor. The candidate with the maximum dist_T(X) in the universe is chosen as the gold-answer entity for that question. We use the geo-coordinates of locations to compute the distance.

Table 1: Results of SPNET on the artificial spatial-questions dataset (t-test p-value for Acc@3).

                     Close to set X          Far from set X          Combination             Aggregate
Models               Acc@3  MRR    Dist_g   Acc@3  MRR    Dist_g    Acc@3  MRR    Dist_g    Acc@3  MRR    Dist_g
SPNET w/o DRL        62.60  0.608  2.88     89.00  0.858  15.24     23.40  0.229  9.72      58.33  0.565  9.28
SPNET                -      -      -        -      -      -         -      -      -         -      -      -
BERT-SPNET w/o DRL   63.60  0.616  3.68     90.80  0.881  15.32     26.80  0.242  12.96     60.40  0.579  10.65
BERT-SPNET           -      -      -        -      -      -         -      -      -         -      -      -

Dataset Statistics:
The train, dev and test sets consist of questions generated using different templates, split equally across all template categories. Each question consists of location-names from only one city, and thus the candidate search space for that question is restricted to that city. The size of the candidate search space varies across cities. The dataset includes questions containing distractor-locations, distributed evenly across all template classes.

We study SPNET using the artificial dataset to answer the following questions: (1) What is the model performance across template classes? (2) How does the network compare with baseline models that do not use the DRL? (3) How well does the model deal with distractor-locations, i.e., locations not relevant for the scoring task? For all experiments in this section we use perfectly tagged location-mentions.

Metrics:
We study the performance of models using Acc@N (N=3, 5, 30), which requires that any one of the top-N answers be correct, Mean Reciprocal Rank (MRR), and the average distance of the top-ranked answer from the gold entity, Dist_g. Dist_g helps quantify the spatial goodness of the returned answers (lower is better).

We use the following models in our experiments: (i) SPNET, (ii) SPNET without DRL, (iii) BERT-SPNET, (iv) BERT-SPNET without DRL. Models without DRL use the final hidden states of the Question Encoder and a series of down-projecting feed-forward layers to generate the final score.

Performance across template classes:
As can be seen in Table 1, all models perform the worst on the template class that contains a combination of both 'close-to' and 'far' constraints. Models based on SPNET perform exceedingly well on the 'Far' templates because the difference between the dist_T(X) scores of the best and the second-best candidate is almost always large enough for every model to easily separate them. (We report results with N=3 in the main paper; please see the appendix for full results.)

Table 2: Performance of spatial-reasoning networks degrades in the presence of location-distractor sentences.

              Without Distractors    With Distractors
Models        Acc@3    MRR           Acc@3    MRR
SPNET         -        -             -        -
BERT-SPNET    -        -             -        -

Importance of Distance-Reasoning Layer:
As can be seen in Table 1, the performance of each configuration (with and without BERT) suffers a serious degradation in the absence of the DRL. Recall that all models have access to spatial knowledge in their input layer via the question encoding. This indicates that the DRL is an important component for reasoning over spatial constraints. To further assess whether our model is able to do distance reasoning, we computed the correlation between ranking-by-distances (the appropriate ranking order for each template class) and SPNET's ranking on the toy dataset. We found the rank correlation to be high and statistically significant, suggesting that the model is able to use physical distance to compute the best answer.

Effect of distractor-locations:
We report results on two splits of the test set: questions with and without distractor-locations. We report the aggregate performance over all template classes due to space constraints. As can be seen in Table 2, models suffer a degradation of performance in the presence of distractor-locations. We hypothesize that this is because the reasoning task becomes harder; models now need to also account for location-mentions that do not need to be reasoned over.
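For reference, the Acc@N and MRR metrics reported in these tables can be computed as below; the rank-correlation analysis can likewise be sketched with Spearman's coefficient (the exact correlation statistic used is our assumption, as the paper does not name it here):

```python
def acc_at_n(ranked_ids, gold_ids, n):
    # Acc@N: 1 if any of the top-N ranked answers is a gold answer.
    return 1.0 if any(a in gold_ids for a in ranked_ids[:n]) else 0.0

def mrr(ranked_ids, gold_ids):
    # Reciprocal rank of the first gold answer; 0 if none is retrieved.
    for rank, a in enumerate(ranked_ids, start=1):
        if a in gold_ids:
            return 1.0 / rank
    return 0.0

def spearman(xs, ys):
    # Rank correlation between two score lists (distinct values assumed).
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for i, idx in enumerate(order, start=1):
            r[idx] = i
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```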
Probing Study:
We conduct a probing study (Figure 4) on SPNET to get some insights into the reasoning process employed by the trained network. We use a question that has both 'near' and 'far' constraints (case 1) and then interchange the constraints (case 2). In both cases we study the corresponding distance-weights assigned to the location-mentions with respect to two candidates, "Santa Isabel" and "Parque Central".

Figure 4: Probing study of the Distance Reasoning Layer (DRL) using the question: "I came from Tropicoco today. Any nice ideas for a coffee shop [far from/close to] 'Be Live Havana' but [close to/far from] 'Melia Cohiba'?"

Consider the first case; as can be seen, each candidate entity assigns a higher weight (column-wise comparison), compared to the other candidate, on the distance property it is most likely to benefit from with respect to the spatial constraint. For example, when the spatial constraint requires an answer to be close to "Melia Cohiba", the candidate "Parque Central" assigns a higher weight to this location than candidate "Santa Isabel" does, since "Parque Central" has a smaller distance value to this location. On the other hand, with respect to the "far" constraint, candidate "Santa Isabel" has a larger distance value from "Be Live Havana" than candidate "Parque Central", and thus assigns a higher distance weight to this location-mention.

When we interchange the constraints (case 2), we see the same pattern, and the comparative weight trends (at each location-mention) invert due to the inversion of spatial constraints. This suggests that the DRL is learning to transform the inputs and generate weights based on the spatial constraint at hand.

Effect of Candidate Space Size:
We analyzed the errors made by the SPNET model and find that nearly 40% of the errors were made on questions that have large candidate spaces; such questions form only a portion of the test set.

Effect of the No. of Location-mentions:
The complexity of the spatial-reasoning task increases as the number of location-mentions (including distractor-locations) in the question increases. We find that SPNET makes no errors when spatial reasoning involves only a single location-mention, but nearly 57% of the errors are made on questions with multiple location-mentions (see appendix).

For the joint model, we investigate the following research questions: (i) Does joint spatio-textual ranking result in improved performance over a model with only spatial-reasoning or only textual-reasoning? (ii) How do pipelined baseline models that use spatial re-ranking perform on the task? (iii) Does distance-aware question encoding help in spatio-textual reasoning? (iv) Is the spatio-textual reasoning model more robust to distractor-locations than the baselines? (v) What kind of errors does the model make?
Dataset:
We use the recently released dataset on Tourism Questions (Contractor et al., 2019), which consists of over 47,000 real-world POI question-answer pairs along with a universe of nearly 200,000 candidate POIs; questions are long and complex, as presented in Figure 1, while the recommendations (answers) are represented by an ID corresponding to each POI. Each POI comes with a collection of reviews and meta-data that includes its geo-coordinates. The training set contains nearly 43,500 QA pairs, with roughly 4,200 and 4,300 QA pairs in the validation and test sets respectively (see Table 3). The candidate space for each question is large, varying across cities.

Task Challenges:
The task presents novel challenges of reasoning and scale; the nature of entity reviews (e.g., inference over subjective language, sarcasm, etc.) makes methods such as BM25 (Robertson and Zaragoza, 2009), which are often used to prune the search space quickly in large-scale QA tasks (Chen et al., 2017; Dunn et al., 2017), ineffective. Thus, even simple BERT-based architectures or popular models such as BiDAF (Seo et al., 2016) do not scale for the answering task in this dataset (Contractor et al., 2019). We therefore use the non-BERT-based SPNET subnetwork in the rest of the QA experiments.

Evaluation Challenges:
It is infeasible to construct a dataset of POI recommendation QA pairs with an exhaustively labeled answer set for each question, since the candidate space is very large. Hence, this dataset suffers from the problem of false negatives, and Acc@N metrics under-report system performance. Still, they are shown to be correlated with human relevance judgments (Contractor et al., 2019). We therefore use these metrics for all experiments, but additionally present a small human study on the end task, verifying the robustness of our results.

Table 3: Dataset statistics: questions with and without location-mentions across train, dev & test sets from (Contractor et al., 2019).

            Location                   Non-location
Dataset     Questions    QA pairs     Questions    QA pairs
Train       9,617        21,396       10,342       22,150
Dev         1,065        2,209        1,054        1,987
Test        1,086        2,198        1,087        2,144

https://github.com/dair-iitd/TourismQA
(CRQA is also not based on BERT due to this reason.)
Location Tagging in Questions:
In order to get mentions of locations in questions, we manually label a set of questions from the training set for location mentions. We then use a BERT-based sequence tagger trained on this set to label locations. The tagger has a macro-F1 of 88.03. This tagger tags all location mentions in a question without considering their utility for spatial reasoning; it is possible that a question contains only distractor-locations, i.e., location-mentions that do not need to be reasoned over for the answering task.

Once the location-mentions are tagged, we remove the punctuation and stopwords from the tagged location span. We then query the Bing Maps Location API using the location-mention along with the city (known from question meta-data) to get the geo-tags. To reduce noise in geo-tagging, we ignore the location-mention if the remaining text is very short or is identified as a popular acronym, continent, country, city or state (lists from Wikipedia). We further reduce noise by ignoring a location mention (1) if no results were found from Bing, or (2) if the geo-tag is beyond 40km from the city center. We found the location-mention geo-tagging precision on a small set of location-mentions to be 96%. We label all questions in the full dataset using this tagger, resulting in approximately 49.54% of the QA pairs containing at least one location-mention (see Table 3). In all our experiments, we use the Manhattan distance as our distance value, because it is generally closer to real-world driving/walking distance within a city than straight-line distance.

github.com/codedecde/BiLSTM-CCM/tree/allennlp
https://bit.ly/36Vazwo
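The Manhattan-distance computation and the 40km geo-tag filter can be sketched as below; the per-degree scale factor and the function names are our assumptions, not values from the paper:

```python
import math

KM_PER_DEG_LAT = 111.0  # rough km per degree of latitude (assumption)

def manhattan_km(lat1, lon1, lat2, lon2):
    # Axis-aligned north-south + east-west distance: a rough proxy for
    # within-city driving/walking distance, as opposed to straight-line.
    dy = abs(lat1 - lat2) * KM_PER_DEG_LAT
    dx = abs(lon1 - lon2) * KM_PER_DEG_LAT * math.cos(
        math.radians((lat1 + lat2) / 2))
    return dx + dy

def keep_geotag(tag, city_center, max_km=40.0):
    # Discard noisy geocoder results falling far outside the city.
    return manhattan_km(*tag, *city_center) <= max_km
```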
Table 4: Comparison of the joint Spatio-Textual model with baselines on questions that have location mentions (t-test p-value for Acc@3).

Models           Acc@3   Acc@5   Acc@30   MRR     Dist_g
SD               2.49    3.41    14.29    0.029   3.07
SPNET            -       -       -        -       -
CRQA → SD        13.73   19.26   50.65    0.125   -
CRQA → SPNET     -       -       -        -       -

Apart from the textual-reasoning model CRQA, we also use the following baselines in our experiments:
Sort-by-distance (SD): Given a set of tagged locations in a question and their geo-coordinates, rank candidate entities by their minimum distance from the set of tagged locations.
SPNET: Use only the spatial-reasoning network to rank candidate entities using their geo-coordinates. No textual reasoning is performed.

CRQA → SD:
Rank candidates using CRQA and then re-rank the top-30 answers using SD.

CRQA → SPNET: Rank candidates using CRQA and then re-rank the top-30 answers using SPNET.

Training:
We pretrain SPNET on this dataset by allowing entities within a small radius of the actual gold entity to be considered as gold (only for pretraining). To train the joint network, we initialize model parameters learnt from component-wise pretraining of both SPNET and CRQA.
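The pipelined baselines (e.g., CRQA → SD, CRQA → SPNET) follow a simple two-stage pattern; a minimal sketch with arbitrary scorer callables (the names are ours):

```python
def pipeline_rerank(candidates, first_scorer, second_scorer, k=30):
    # Stage 1: rank the full candidate set with the first model.
    shortlist = sorted(candidates, key=first_scorer, reverse=True)[:k]
    # Stage 2: re-rank only the top-k shortlist with the second model.
    return sorted(shortlist, key=second_scorer, reverse=True)
```

For instance, `pipeline_rerank(pois, crqa_score, sd_score)` would mimic CRQA → SD if `sd_score` returned the negated minimum distance to the tagged locations (hypothetical helper names).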
We present our experiments on two slices of the test set: questions with tagged location-mentions (called Location Questions) and those without any location mentions (Non-Location Questions). As can be seen in Table 4, sorting by distance (SD) performs very poorly, indicating that simple methods for ranking based on entity distance do not work for such questions. Further, the poor performance of SPNET also indicates that the task cannot be solved just by reasoning on location data. In addition, pipelined re-ranking using SD or SPNET over the textual reasoning model decreases the average distance (Dist_g) from the gold entity but does not result in improved answering performance (Acc@N), indicating the need for spatio-textual reasoning. Finally, from Tables 4 & 5 we note that the spatio-textual model performs better than its textual counterpart on the Location Questions subset, while continuing to perform well on questions without location mentions.

Table 5: Comparison of Spatio-Textual CRQA (with and without (w/o) distance-aware question encoding) and CRQA (t-test p-value for Acc@3).

                             Location Questions                       Non-location Questions           Full Set
Models                       Acc@3  Acc@5  Acc@30  MRR    Dist_g     Acc@3  Acc@5  Acc@30  MRR        Acc@3  Acc@5  Acc@30  MRR
CRQA                         14.83  21.27  50.65   0.143  3.41       18.95  26.22  54.37   0.177      16.89  23.75  52.51   0.159
Spatio-Textual CRQA          -      -      -       -      -          -      -      -       -          -      -      -       -
Spatio-Textual CRQA
 (w/o distance-aware QE)     16.85  23.39  53.04   0.159  2.84       20.06  26.86  56.49   0.185      18.45  25.13  54.76   0.172
Effect of distance-aware question encoding: In order to demonstrate the importance of distance-aware question encoding, we present an experiment where we remove the distance values from the input encoding. Thus, Equation 2 changes to t_i = concat[v_i, g_i]. As Table 5 shows, the performance of the Spatio-Textual CRQA model in the absence of distance-aware encoding drops (last row), but it still performs better than the text-only CRQA model (first row). This indicates that the distance-aware question encoding helps learn better distance weights for spatio-textual reasoning.
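A minimal sketch of this ablation: with distance-aware encoding, each token input concatenates the word embedding v_i, a location feature g_i, and a distance feature d_i, while the ablation drops d_i so that t_i = concat[v_i, g_i]. The feature dimensions and contents below are illustrative assumptions, not the paper's exact features.

```python
def encode_token(word_emb, loc_feat, dist_feat, use_distance=True):
    """Build the per-token input t_i for the question encoder.

    With distance-aware encoding: t_i = concat[v_i, g_i, d_i].
    Ablation (Equation 2 without distances): t_i = concat[v_i, g_i].
    """
    t = list(word_emb) + list(loc_feat)
    if use_distance:
        t += list(dist_feat)
    return t
```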
Effect of distractor-locations: As mentioned earlier, we use a location tagger that is oblivious to the reasoning task to tag locations in the dataset. We manually create a small set of questions, randomly selected from the test set, ensuring that half of it contains at least one non-distractor location mentioned in the question while the other half contains questions with only distractor locations.
Questions requiring Spatial-reasoning
Models                Acc@3   Acc@5   Acc@30   MRR     Dist_g
SD                     5.00    7.00   22.00    0.053    2.10
SPNet                   —       —       —        —       —
CRQA → SD             15.00   22.00   51.00    0.142     —
CRQA → SPNet          13.00   17.00   51.00    0.108    3.26
Spatio-textual CRQA     —       —       —        —       —

Table 6: Experiments on two subsets from the test-set: (i) Questions requiring Spatial-reasoning (ii) Questions with distractor-locations only.
As can be seen from Table 6, all models, including the spatio-textual model, deteriorate in performance if a question only contains distractors; the spatio-textual model, however, suffers a less significant drop in performance.
Error Type                        Percentage
Textual Reasoning Error           37.9%
Far from the required location    22.3%
Influenced by Distractor          12.6%
Not in requested Neighbourhood    10.7%
Location Tagger Error              5.8%
Repeated Location Names            4.9%
Error in Geo-Spatial Data          2.9%
Invalid Question                   2.9%

Table 7: Spatio-Textual CRQA: Classification of Errors
Location Questions
Models                  Acc@3   Acc@5   Acc@30   MRR   Dist_g
CSRQA                   19.89   26.43     —       —      —
Spatio-textual CSRQA    21.45   28.21     —       —      —

Table 8: Comparison with the current state-of-the-art CSRQA on (i) Location Questions (ii) All data
Qualitative Study: We randomly selected QA pairs with location mentions from the test set to conduct a qualitative error analysis of Spatio-Textual CRQA (Table 7). We find that nearly 37% of the errors can be traced to the textual reasoner, 22% of the errors were due to a 'near' constraint not being satisfied, while about 13% of the errors were due to the model reasoning on distractor locations. Lastly, about 8% of the errors were due to errors made by the location tagger and incorrect geo-spatial data.
Effect of Candidate Search Space: Past work (Contractor et al., 2019) has improved overall task performance by employing a neural IR method to reduce the search space (Mitra and Craswell, 2019), and then using the CRQA textual reasoner to re-rank only the top-30 selected candidates (a pipeline referred to as CSRQA). In line with their work, we create a spatio-textual counterpart to CSRQA by using spatio-textual reasoning in the re-rank step. We find that this final model results in a pt (Acc@3) improvement overall (see Table 16), and a . pt improvement on location questions (Acc@3), establishing a new state of the art on the task. We note that, because the IR selector is incapable of spatial reasoning, it possibly reduces the gains made by the spatio-textual re-ranking. An interesting direction of future work could be to augment general-purpose neural IR methods with such spatial reasoning.

                        Automated evaluation         Human evaluation
                        Location   Non-location      Location   Non-location
CSRQA                   28.00      36.00             64.00      70.00
Spatio-textual CSRQA    32.00      32.00             84.00      72.00

Table 9: Acc@3 results on a blind human study using randomly selected questions from the test set.
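The select-then-rerank pipelines used throughout (e.g. CRQA → SPNet, or CSRQA's re-ranking of the top-30 IR-selected candidates) share a simple structure, sketched below with hypothetical ranker callables; the rankers themselves are stand-ins for the paper's models.

```python
def pipeline_rerank(candidates, first_ranker, second_ranker, k=30):
    """Rank all candidates with first_ranker, then re-rank only its top-k
    with second_ranker; candidates below rank k keep their original order.
    Each ranker maps a candidate list to a reordered candidate list."""
    first = first_ranker(candidates)
    head, tail = first[:k], first[k:]
    return second_ranker(head) + tail
```

Note that such a pipeline cannot change Acc@30 when k=30, which matches the observation above that re-ranking changes Dist_g without improving Acc@N across the board.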
Effect of False Negatives: To supplement the automatic evaluation, we additionally conducted a blind human study using the top-ranked CSRQA and spatio-textual CSRQA models on another subset of questions from the test set. Two human evaluators (κ = 0.81) were presented the top-3 answers from both models in random order and were asked to mark each answer for relevance. As Table 9 shows, the manual annotation resulted in much higher Acc@3 values for both CSRQA and spatio-textual CSRQA, with the spatio-textual model again ahead on the subset of location questions. This underscores the value of joint spatio-textual reasoning for the task, and signifies a substantial improvement in the overall QA performance.

Our paper presents the first joint spatio-textual QA model that combines spatial and textual reasoning. Experiments on an artificially constructed (spatial-only) toy QA dataset show that our spatial reasoner effectively trains to satisfy spatial constraints. We also presented detailed experiments on the recently released POI recommendation task for tourism questions. Compared against textual-only and spatial-only QA models, the joint model obtains significant improvements. Our final model establishes a new state of the art on the task. In future work, we would like to also support reasoning on questions that require directional or topographical inference (e.g. "north of X", "on the river beach").
We would like to thank Krunal Shah, Gaurav Pandey, Biswesh Mohapatra, and Azalenah Shah for their helpful suggestions to improve early versions of this paper. We would also like to acknowledge the IBM Research India PhD program that enables the first author to pursue a PhD at IIT Delhi. This work is supported by an IBM AI Horizons Network grant, IBM SUR awards, Visvesvaraya faculty awards by the Govt. of India to both Mausam and Parag, as well as grants by Google, Bloomberg and 1MG to Mausam.
References
Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2357–2367, Minneapolis, Minnesota. Association for Computational Linguistics.

Bin Bi, Chen Wu, Ming Yan, Wei Wang, Jiangnan Xia, and Chenliang Li. 2019. Incorporating external knowledge into machine reading for generative question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 2521–2530.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder. CoRR, abs/1803.11175.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL).

Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 551–561, Austin, Texas. Association for Computational Linguistics.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.

Gao Cong, Christian S. Jensen, and Dingming Wu. 2009. Efficient retrieval of the top-k most relevant spatial web objects. Proc. VLDB Endow., 2(1):337–348.

Danish Contractor, Barun Patra, Mausam, and Parag Singla. 2020. Constrained BERT BiLSTM CRF for understanding multi-sentence entity-seeking questions. Natural Language Engineering, pages 1–23.

Danish Contractor, Krunal Shah, Aditi Partap, Mausam, and Parag Singla. 2019. Large scale question answering using tourism data. CoRR, abs/1909.03527.

Gaétan De Rassenfosse, Jan Kozak, and Florian Seliger. 2019. Geocoding of worldwide patent data. Scientific Data, 6(1):1–15.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proc. of NAACL.

Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Güney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. CoRR, abs/1704.05179.

Daniel Ferrés Domènech. 2017. Knowledge-based and data-driven approaches for geographical information access.

Fredric Gey, Ray Larson, Mark Sanderson, Kerstin Bischoff, Thomas Mandl, Christa Womser-Hacker, Diana Santos, Paulo Rocha, Giorgio M Di Nunzio, and Nicola Ferro. 2006. GeoCLEF 2006: The CLEF 2006 cross-language geographic information retrieval track overview. In Workshop of the Cross-Language Evaluation Forum for European Languages, pages 852–876. Springer.

Yangyang Guo, Zhiyong Cheng, Liqiang Nie, Xin-Shun Xu, and Mohan Kankanhalli. 2018. Multi-modal preference modeling for product search. In Proceedings of the 26th ACM International Conference on Multimedia, pages 1865–1873.

Mark Hopkins, Ronan Le Bras, Cristian Petrescu-Prahova, Gabriel Stanovsky, Hannaneh Hajishirzi, and Rik Koncel-Kedziorski. 2019. SemEval-2019 task 10: Math question answering. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 893–899, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Binxuan Huang and Kathleen Carley. 2019. A hierarchical location prediction neural network for Twitter user geolocation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4732–4742, Hong Kong, China. Association for Computational Linguistics.

Tannon Kew, Anastassia Shaitarova, Isabel Meraner, Janis Goldzycher, Simon Clematide, and Martin Volk. 2019. Geotagging a diachronic corpus of alpine texts: Comparing distinct approaches to toponym recognition. In Proceedings of the Workshop on Language Technology for Digital Historical Archives, pages 11–18.

Tuan Manh Lai, Trung Bui, and Sheng Li. 2018. A review on deep learning techniques applied to answer selection. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 2132–2144.

Jochen L. Leidner, Bruno Martins, Katherine McDonough, and Ross S. Purves. 2020. Text meets space: Geographic content extraction, resolution and information retrieval. In Advances in Information Retrieval, pages 669–673, Cham. Springer International Publishing.

Miao Li, Lisi Chen, Gao Cong, Yu Gu, and Ge Yu. 2016. Efficient processing of location-aware group preference queries. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM '16, pages 559–568, New York, NY, USA. Association for Computing Machinery.

Kwan Hui Lim, Shanika Karunasekera, Aaron Harwood, and Yasmeen George. 2019. Geotagging tweets to landmarks using convolutional neural networks with text and posting time. In Proceedings of the 24th International Conference on Intelligent User Interfaces: Companion, IUI '19, pages 61–62, New York, NY, USA. Association for Computing Machinery.

Thomas Mandl, Paula Carvalho, Giorgio Maria Di Nunzio, Fredric Gey, Ray R Larson, Diana Santos, and Christa Womser-Hacker. 2008. GeoCLEF 2008: The CLEF 2008 cross-language geographic information retrieval track overview. In Workshop of the Cross-Language Evaluation Forum for European Languages, pages 808–821. Springer.

Bhaskar Mitra and Nick Craswell. 2019. An updated duet model for passage re-ranking. arXiv preprint arXiv:1903.07666.

R. S. Purves, P. Clough, C. B. Jones, M. H. Hall, and V. Murdock. 2018. Geographic Information Retrieval: Progress and Challenges in Spatial Search of Text.

Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, and Zhiyuan Liu. 2019. NumNet: Machine reading comprehension with numerical reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2474–2484, Hong Kong, China. Association for Computational Linguistics.

Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333–389.

Diana Santos and Luís Miguel Cabral. 2009. GikiCLEF: Expectations and lessons learned. In Workshop of the Cross-Language Evaluation Forum for European Languages, pages 212–222. Springer.

Simon Scheider, Enkhbold Nyamsuren, Han Kruiger, and Haiqi Xu. 2020. Geo-analytical question-answering with GIS. International Journal of Digital Earth, 0(0):1–14.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603.

Oyvind Tafjord, Matt Gardner, Kevin Lin, and Peter Clark. 2019. QuaRTz: An open-domain dataset of qualitative relationship questions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 5940–5945.

George Tsatsanifos and Akrivi Vlachou. 2015. On processing top-k spatio-textual preference queries. In Proceedings of the 18th International Conference on Extending Database Technology, EDBT 2015, Brussels, Belgium, March 23-27, 2015, pages 433–444.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 6000–6010, Red Hook, NY, USA. Curran Associates Inc.

Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays. 2019. Composing text and image for image retrieval: an empirical odyssey. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6439–6448.

Dimitri Vorona, Andreas Kipf, Thomas Neumann, and Alfons Kemper. 2019. DeepSpace: Approximate geospatial query processing with deep learning. In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 500–503.

Jiangnan Xia, Chen Wu, and Ming Yan. 2019. Incorporating relation knowledge into commonsense reading comprehension with multi-task learning. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, pages 2393–2396, New York, NY, USA. Association for Computing Machinery.

M. L. Yiu, X. Dai, N. Mamoulis, and M. Vaitis. 2007. Top-k spatial preference queries. In , pages 1076–1085.

C. Zhang, Y. Zhang, W. Zhang, and X. Lin. 2016. Inverted linear quadtree: Efficient top k spatial keyword search. IEEE Transactions on Knowledge and Data Engineering, 28(7):1706–1721.

Ji Zhao, Meiyu Yu, Huan Chen, Boning Li, Lingyu Zhang, Qi Song, Li Ma, Hua Chai, and Jieping Ye. 2019. POI semantic model with a deep convolutional structure. CoRR, abs/1903.07279.
A Appendix
This appendix is organized as follows.
• Section A.1 provides more details about the Toy Dataset and supplementary experimental information, including additional tables referred to in the main paper on spatial reasoning.
• Section A.2 includes more results of the Location Tagger used in the end task.
• Section A.3 contains supplementary experiments on Spatio-Textual Reasoning.
• Section A.4 gives details about the model hyper-parameters.
A.1 Toy Dataset
We create a simple toy dataset generated using linguistically diverse templates that specify spatial constraints, with locations chosen at random from across 200,000 entities. These entities were sourced from the recently released Points-of-Interest (POI) recommendation task (Contractor et al., 2019). Each POI entity is labeled with its geo-coordinates, apart from other meta-data such as its address, timings, etc. Further, each entity in a city has a specific type, viz. Restaurant (R), Attraction (A) or Hotel (H). Table 10 shows the list of templates used for generating the dataset. These templates have been designed to make the toy dataset reflective of real-world challenges; for instance, some templates contain distractor locations. To generate questions, $LOCATION and $ENTITY values are updated by randomly selecting values from the POI set for each entity, as described in the next section.

A.1.1 Dataset Generation
To generate a question, a city c, type t and a template T are chosen at random. The "ENTITY" token in each template is replaced by a randomly chosen metonym of the type t; Table 11 shows the list of metonyms for each type. Each instance of the "LOCATION" token is replaced by a randomly chosen entity from the city c and type t. The candidate set consists of the entities from the city c and type t. The entities used as location mentions are sampled without replacement and removed from the candidate set.

The gold answer entity is uniquely determined for each question based on its template. For example, consider a template T, "I am staying at $A! Please suggest a hotel close to $B but far from $C." The score of a candidate entity X is given by dist_T(X) = -(dist(X, B) - dist(X, C)) (the distance from B needs to be low, while the distance from C needs to be high); A is a distractor. The candidate with the maximum dist_T(X) in the universe is chosen as the gold entity for that question.

Each question further consists of 500 negative samples (35% hard, 65% soft). The negative samples are generated as part of the gold-generation process: a hard negative sample has a dist_T(X) value closer to the gold than a soft negative sample does. We release the samples used for training along with the dataset for reproducibility.

A.1.2 Template classes
We create templates (Table 10) that can be broadly divided into three categories based on whether the correct answer entity is expected to be: (1) close to one or more locations [1-16]; (2) far from one or more locations [17-32]; (3) close to some and far from others (combination) [33-48]. To make the task more reflective of real-world challenges, we also randomly insert a distractor location that does not need to be reasoned about. The second half of each category (i.e. [9-16], [25-32], and [41-48]) consists of templates that have a distractor locative reference. Further, for the close (or far) category, the templates contain either one location ([1-4] + [9-12]) or two locations ([5-8] + [13-16]) that need to be reasoned about for close (or far).
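The gold-selection rule of Section A.1.1 can be sketched as follows. Summing distances over multiple "close" and "far" locations and using planar distance are our simplifying assumptions; the paper's single-B, single-C example reduces to dist_T(X) = dist(X, C) − dist(X, B).

```python
import math

def dist_t(x, close_to, far_from):
    """dist_T(X): higher when X is near the 'close to' locations and away
    from the 'far from' locations; distractor locations are simply excluded."""
    return -(sum(math.dist(x, b) for b in close_to)
             - sum(math.dist(x, c) for c in far_from))

def gold_entity(candidates, close_to, far_from):
    """The candidate with maximum dist_T(X) is chosen as the gold answer."""
    return max(candidates, key=lambda poi: dist_t(candidates[poi], close_to, far_from))
```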
A.1.3 Results
We use the following models in our experiments: (i) SPNet, (ii) SPNet without (w/o) DRL, (iii) BERT-SPNet, (iv) BERT-SPNet without (w/o) DRL. Models without DRL use the final hidden states of the Question Encoder and a series of down-projecting feed-forward layers to generate the final score. We study our models' performance using Acc@N (N=3, 5, 30), which requires that any one of the top-N answers be correct, Mean Reciprocal Rank (MRR), and the average distance of the top-ranked answers from the gold entity, Dist_g. Table 12 summarizes the results on the test set.

A.1.4 Error Analysis
Tables 13 and 14 show the effect of the candidate search space and of the number of location mentions in the question on the performance of the SPNet model.
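The metrics used in these experiments (Acc@N and MRR) can be computed as follows; this is a standard implementation, not taken from the paper's code.

```python
def acc_at_n(ranked_lists, golds, n):
    """Fraction of questions whose gold answer appears in the top-n ranking."""
    hits = sum(g in r[:n] for r, g in zip(ranked_lists, golds))
    return hits / len(golds)

def mrr(ranked_lists, golds):
    """Mean reciprocal rank of the gold answer (contributes 0 if absent)."""
    total = 0.0
    for r, g in zip(ranked_lists, golds):
        if g in r:
            total += 1.0 / (r.index(g) + 1)
    return total / len(golds)
```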
                    Correctly Answered          Incorrectly Answered
Search Space size   Questions   Percentage      Questions   Percentage

Table 13: Performance of SPNet decreases with increase in universe size.

                    Correctly Answered          Incorrectly Answered

Table 14: Performance of SPNet decreases with increase in the number of location mentions in the question.
A.2 Location Tagger
In order to get mentions of locations in questions, we manually label a set of questions from the training set for location mentions. We then use a BERT-BiLSTM CRF (Contractor et al., 2020) based tagger trained on this set to label locations. Table 15 describes the performance of the tagger on an unseen set of questions.

                Precision   Recall   F1
Micro Average
Macro Average

Table 15: Performance of the BERT-BiLSTM CRF for tagging locations on a small set of 75 unseen questions.
A.3 Spatio-textual Reasoning Network
The Spatio-Textual Reasoning Network consists of three components: (i) Spatial Reasoner, (ii) Textual Reasoner, (iii) Joint Scoring Layer.
Training:
We train the joint model using a max-margin loss, teaching the network to score the
1. … ENTITY near the LOCATION?
2. Does anyone have ideas on ENTITY close to LOCATION? Thank you!
3. Hello! Could anyone please suggest ENTITY in the neighborhood of LOCATION?
4. Good Morning! Can someone please propose ENTITY not very far from LOCATION?
5. Suggestions for ENTITY close to both LOCATION and LOCATION?
6. Some good ideas of ENTITY between LOCATION and LOCATION? Thanks much!
7. Please advise ENTITY close to LOCATION and not very far off the LOCATION.
8. Any ideas for ENTITY near LOCATION and also close to LOCATION would be welcomed?
9. I once lived around LOCATION. Does anyone have ideas of ENTITY close to the LOCATION? Thanks!
10. Any nice suggestions of ENTITY near the LOCATION? I will be going to LOCATION the next day.
11. I just came from LOCATION. Someone, please recommend ENTITY in the neighborhood of LOCATION.
12. Could anyone propose ENTITY not far from the LOCATION? I need to leave for LOCATION urgently.
13. We came from LOCATION this morning. Suggestions for ENTITY close to both LOCATION and LOCATION?
14. Any ideas of ENTITY between LOCATION and LOCATION? I would be going to LOCATION. Thanks.
15. We might be staying around LOCATION. Please advise ENTITY close to LOCATION and not far from LOCATION.
16. Could anyone suggest ideas for ENTITY close to LOCATION and around LOCATION? We could be going to LOCATION soon.
17. Any suggestions for ENTITY quite far from the LOCATION? Thank you very much!
18. Somebody please suggest ENTITY cut off from LOCATION. Have a good day!
19. Does anyone have suggestions for ENTITY away from LOCATION? Thanks a lot!
20. Good Afternoon! Any proposals for ENTITY not very close to the LOCATION?
21. Suggestions on ENTITY far from both LOCATION and LOCATION? Thank!
22. Hi! Any idea of ENTITY far away from LOCATION and LOCATION?
23. Could anyone please propose ENTITY not close to LOCATION and also far from LOCATION?
24. Does anyone have any suggestions for ENTITY far from LOCATION and not around LOCATION?
25. Hey! I will be staying at LOCATION. Please suggest ENTITY cut off from LOCATION.
26. Any pleasant ideas of ENTITY far off the LOCATION? I might then be visiting LOCATION.
27. I came from LOCATION this afternoon. Any proposal for ENTITY not close to the LOCATION?
28. Does anyone have a suggestion for ENTITY distant from LOCATION? By the way, I came from LOCATION yesterday.
29. We will be staying near the LOCATION. Suggestions for ENTITY far from both LOCATION and LOCATION will be welcomed.
30. Any idea of ENTITY far away from LOCATION and LOCATION? I would then be visiting LOCATION.
31. Hi, I will be staying near the LOCATION. Could anyone propose ENTITY not very close to LOCATION and far from LOCATION?
32. Does anyone have suggestions for ENTITY far from LOCATION and also far from LOCATION? I will then be visiting LOCATION too.
33. Any good ideas of ENTITY far from LOCATION but close to LOCATION would be appreciated? Best Regards.
34. Anyone having ideas of ENTITY close to LOCATION but far from LOCATION?
35. Someone please advise ENTITY far from LOCATION but not very far from LOCATION.
36. Suggest ENTITY close to LOCATION but not in the neighborhood of LOCATION. Thank you so much!
37. Does anyone have good ideas of ENTITY far from LOCATION but near LOCATION? Regards.
38. Please suggest ideas of ENTITY in the neighborhood of LOCATION but far from LOCATION.
39. Could anyone advise ENTITY far from LOCATION but not too far from LOCATION?
40. Any nice ideas of ENTITY close to LOCATION but not in the neighborhood of LOCATION. Thanks!
41. Tomorrow, I would be coming to stay at LOCATION. Anyone having ideas of ENTITY close to LOCATION but far from LOCATION?
42. Please propose ENTITY far from LOCATION but not far from LOCATION. I will then be exploring LOCATION.
43. I came from LOCATION this evening. Any nice ideas for ENTITY far from LOCATION but close to LOCATION would be appreciated?
44. Suggest ENTITY close to LOCATION but not near LOCATION. Tomorrow, I will be leaving for LOCATION.
45. Yesterday, I came to stay at LOCATION. Any ideas of ENTITY close to LOCATION but far from LOCATION?
46. Suggestions of ENTITY far from LOCATION but not very far from LOCATION. I will then be moving to LOCATION.
47. I came from LOCATION today. Any good ideas for ENTITY far from LOCATION but near to LOCATION would be welcomed?
48. Advise ENTITY close to LOCATION but not close to LOCATION. I might be leaving for LOCATION soon.

Table 10: Templates used for generating the Toy-dataset
Entity type       Metonyms
R (Restaurant)    a restaurant, an eatery, an eating joint, a cafeteria, an outlet, a coffee shop, a fast food place, a lunch counter, a lunch room, a snack bar, a chop house, a steak house, a pizzeria, a coffee shop, a tea house, a bar room
H (Hotel)         a hotel, an inn, a motel, a guest house, a hostel, a boarding house, a lodge, an auberge, a caravansary, a public house, a tavern, an accommodation, a resort, a youth hostel, a bunk house, a dormitory, a flop house
A (Attraction)    an attraction, a tourist spot, a tourist attraction, a popular wonder, a sightseeing place, a tourist location, a place of tourist interest, a crowd pleaser, a scenic spot, a popular landmark, a monument

Table 11: List of metonyms for each entity type in the Toy-dataset

Models               Acc@3   Acc@5   Acc@30   MRR     Dg
Close to Set X
SPNet w/o DRL        62.60   66.00   79.00    0.608    2.88
SPNet                90.20     —       —        —       —
Far from Set X
SPNet w/o DRL        89.00   90.80   96.40    0.858   15.24
SPNet                  —       —       —        —       —
Combination
SPNet w/o DRL        23.40   28.00   50.60    0.229    9.72
SPNet                52.80   60.20   82.00    0.486    3.90
BERT SPNet w/o DRL   26.80   32.60   59.00    0.242   12.96
BERT SPNet             —       —       —        —       —
Aggregate
SPNet w/o DRL        58.33   61.60   75.33    0.565    9.28
SPNet                80.33   83.80   92.93    0.778    6.21
BERT SPNet w/o DRL   60.40   64.07   79.13    0.579   10.65
BERT SPNet             —       —       —        —       —
Table 12: Results of the spatial-reasoning network on the toy-data test set

correct answer higher than a negatively sampled candidate entity. Model parameters are described in the next section.
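The max-margin objective (with margin 0.5, per Table 17) can be sketched in plain Python as follows; the scoring functions and the negative-sampling procedure are outside this fragment, and the batching is an illustrative simplification.

```python
def hinge(pos_score, neg_score, margin=0.5):
    """Margin ranking loss for one (gold, negative) pair: zero once the gold
    candidate is scored at least `margin` above the negative sample."""
    return max(0.0, margin - (pos_score - neg_score))

def max_margin_loss(pos_scores, neg_scores, margin=0.5):
    """Average hinge loss over paired gold/negative candidate scores."""
    pairs = list(zip(pos_scores, neg_scores))
    return sum(hinge(p, n, margin) for p, n in pairs) / len(pairs)
```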
A.3.1 Results
Similar to Contractor et al. (2019), we also experiment on this dataset by employing a neural method to reduce the search space (Mitra and Craswell, 2019) before using the CRQA textual reasoner to re-rank only the top-30 selected candidates (a pipeline referred to as CSRQA). Unlike CRQA, which uses two levels of attention between question and review sentences to score candidate entities, CSQA does not reason deeply over the text: it compares elements of a question with different parts of a review document to aggregate relevance for scoring, using local and distributed representations to capture lexical and semantic features. We report some experiments using this model, referred to as CSQA, and compare it with CSRQA and spatio-textual CSRQA. As can be seen, re-ranking with SD or SPNet does not help the system. An interesting direction of future work could thus be to augment general-purpose neural-IR methods such as the Duet model used by CSQA with spatial reasoning. Another interesting approach could be to extend ideas from existing graph neural network based approaches, such as NumNet (Ran et al., 2019): each entity could be viewed as a node in a graph for reasoning, but we note that such methods will need to be made more scalable to be useful. The entity space (and thus the nodes in the graph) would run into thousands of nodes per question, making current message-passing based inference methods
Location Questions
Models                 Acc@3   Acc@5   Acc@30   MRR   Dg
CSQA                   15.84   20.26     —       —     —
CSQA → SD              11.34   17.26     —       —     —
CSQA → LocNet           8.38   13.72     —       —     —
CSRQA                  19.89   26.43     —       —     —
Spatio-textual CSRQA   21.45   28.21     —       —     —

Table 16: Comparison of re-ranking models operating on a reduced search space returned by CSQA on (i) Location Questions (ii) Comparison with the current state-of-the-art CSRQA on the full task.

prohibitively expensive.
A.4 Model settings

A.4.1 Experiments on Toy Dataset

The hyperparameters for the best-performing configurations of all models were identified through manual testing on the validation set (Table 17). The models were trained on 2x NVIDIA K40 (12GB, 2880 CUDA cores) GPUs on a shared cluster. The BERT models were trained with a learning rate of 0.0002, whereas the non-BERT models used a learning rate of 0.001.
A.4.2 Spatio-textual Reasoning Network

The hyperparameters for the best-performing configuration were identified through manual testing on the validation set (Table ??). The Spatio-Textual Reasoner was trained on 4 K-80 GPUs on a shared cluster.

Hyperparameter          Value
Negative samples        40
Batch size              20
Optimizer               Adam
Loss                    MarginRankingLoss
Margin                  0.5
Max no. of epochs       15
GRU Input dimension     131
GRU Output dimension    32
DRL Block Layer 1       64 (Input), 64 (Output)
DRL Block Layer 2       64 (Input), 64 (Output)
DRL Block Layer 3       64 (Input), 64 (Output)
DRL Block Layer 4       64 (Input), 1 (Output)

Table 17: Hyperparameter settings for experiments on the toy-dataset
Hyperparameter Value