Joint Spatio-Textual Reasoning for Answering Tourism Questions
Danish Contractor∗, Shashank Goel†, Mausam, Parag Singla
IBM Research AI, New Delhi; Indian Institute of Technology, New Delhi
[email protected], [email protected], {mausam, parags}@cse.iitd.ac.in

Abstract
Our goal is to answer real-world tourism questions that seek Points-of-Interest (POI) recommendations. Such questions express various kinds of spatial and non-spatial constraints, necessitating a combination of textual and spatial reasoning. In response, we develop the first joint spatio-textual reasoning model, which combines geo-spatial knowledge with information in textual corpora to answer questions. We first develop a modular spatial-reasoning network that uses geo-coordinates of location names mentioned in a question, and of candidate answer POIs, to reason over only spatial constraints. We then combine our spatial-reasoner with a textual reasoner in a joint model and present experiments on a real-world POI recommendation task. We report substantial improvements over existing models without joint spatio-textual reasoning.
1 Introduction

Users of travel forums often post questions seeking personalized recommendations for their travel needs. Consider the example in Figure 1, which shows a real-world Points-of-Interest (POI) seeking question. Answering such a recommendation question is a challenging problem: it not only requires reasoning over a text corpus describing potential restaurants (e.g., reviews), but it also requires resolving spatial constraints ("near Hotel Florida") over the physical location of a restaurant. In addition, the question is also under-specified and ambiguous (e.g., "dont have to venture too far"), making the spatial-inference task harder.

Recently, there has been work on QA models that fuse knowledge from multiple sources; for example, by combining data from knowledge bases with textual passages (Xia et al., 2019; Bi et al., 2019), or incorporating multi-modal data sources (Guo et al., 2018; Vo et al., 2019). But we do not know of systems that fuse geo-spatial knowledge with text. In addition, there exist several geo-spatial IR systems (e.g., Santos and Cabral, 2009; Scheider et al., 2020); however, to the best of our knowledge, none of them perform joint reasoning over geo-spatial and textual knowledge sources.

In response, we present our joint spatio-textual QA model for returning answers to questions that require textual as well as spatial reasoning. We first develop a modular spatial-reasoning network that uses geo-coordinates of location names mentioned in a question, and of candidate answer entities, to reason over only spatial constraints. It learns to associate contextual distance-weights with each location-mention in the question; these weights are combined with their respective spatial distances from a candidate answer to generate a 'spatial relevance' score for that answer.

We then combine the spatial-reasoner with a textual QA system to develop a joint spatio-textual QA model. We demonstrate the model using a recently introduced QA task, which contains tourism questions seeking POI (entity) answers (Contractor et al., 2019). It also contains a collection of entity reviews as a knowledge source for answering these questions. We provide the geo-spatial knowledge for the task by mapping location-mentions in questions to their geographical coordinates using publicly available APIs. Similarly, candidate answer POIs are also mapped to their geographical coordinates, included as part of the dataset (Contractor et al., 2019).

Figure 1: A sample POI recommendation question. The answers correspond to POI IDs of the form <city id> <POI type> <number>. The Tourism QA dataset has three classes of POIs: restaurants (R), attractions (A) and hotels (H).

∗ This work was carried out as part of PhD research at IIT Delhi. The author is also a regular employee at IBM Research.
† Work carried out when the author was a student at IIT Delhi.
https://bit.ly/2zIxQpj
To the best of our knowledge, we are the first to develop a joint QA model that combines reasoning over external geo-spatial knowledge with textual reasoning.

Contributions:
Our paper makes the following contributions:
1. We develop a spatial-reasoner that uses geo-coordinates of locations and POIs to reason over spatial constraints specified in a question.
2. We demonstrate, using a simple toy dataset, that our spatial-reasoner is not only able to reason over "near" and "far" constraints but is also able to determine location references that are not useful for reasoning (e.g., a location reference mentioning where a user last went on vacation).
3. We develop a spatio-textual QA model, which fuses spatial knowledge (geo-coordinates) with textual knowledge (POI reviews) using sub-networks designed for spatial and textual reasoning.
4. We demonstrate that our joint spatio-textual model performs significantly better than models employing only spatial- or textual-reasoning. It also obtains state-of-the-art results on a real-world tourism questions dataset, with substantial improvement in answering location questions.
2 Related Work

Our work is related to four broad areas of question answering and information retrieval:
Geographical Information Systems:
There is significant prior work on Geographical Information Systems where standard IR models are augmented with spatial knowledge (Ferrés Domènech, 2017; Purves et al., 2018). Models have been developed to address challenges in adhoc-retrieval tasks with locative references (Gey et al., 2006; Mandl et al., 2008; Santos and Cabral, 2009). However, such models deal primarily with inference problems in toponyms (e.g., "Beijing is located in China"), location disambiguation, and the use of topographical classes (e.g., "Union lake is a water-body"). Methods for IR involving locative references use three strategies: (i) a pipeline of filtering based on spatial information followed by text-based IR, (ii) a pipeline of filtering based on text-based IR followed by ranking based on geo-spatial ranking or coverage, and (iii) a weighted or linear combination of two independent rankings (Leidner et al., 2020). Our work builds on the third strategy by jointly training a model with both geo-spatial and textual components. To the best of our knowledge, joint reasoning over text and geo-spatial data has not been investigated in the geographical IR literature.
Geo-Spatial Querying:
There has been considerable work in the research areas of geo-parsing (toponym discovery and disambiguation) (Kew et al., 2019), geo-spatial query processing over structured or RDF knowledge bases (KB) (Vorona et al., 2019; Scheider et al., 2020), and geocoding and geo-tagging documents (De Rassenfosse et al., 2019; Lim et al., 2019; Huang and Carley, 2019). However, such querying methods require KB- and task-specific annotations for training and are thus specialized in application and scope (Scheider et al., 2020).
Numerical Reasoning for Question Answering:
Spatial reasoning in our task is effectively a form of numerical reasoning over distances between location-mentions in a question and a candidate entity (POI). Recently introduced tasks such as DROP (Dua et al., 2019) and QuaRTz (Tafjord et al., 2019) require reasoning that includes addition, subtraction, counting, etc. for answering reading-comprehension style questions. Other tasks such as MathQA (Amini et al., 2019) and MathSAT (Hopkins et al., 2019) present high-school and SAT-level algebraic word problems.

Models developed for numerical reasoning tasks, such as NAQANet (Dua et al., 2019) and NumNet (Ran et al., 2019), reason over the explicit mentions of numerical quantities within a question or passage. In contrast, the questions in our task do not explicitly mention geographical coordinates, and also do not contain all the information required for numerical reasoning (since the distances need to be computed with respect to a candidate answer under consideration). Further, in contrast to algebraic word problems and numerical reasoning questions, answers in the POI-recommendation task are also heavily influenced by text-based reasoning on subjective POI-entity reviews.
Points-of-Interest (POI) Recommendation:
Existing models for POI recommendation typically rely on the presence of structured data, including geo-spatial coordinates. Queries may be structured or semi-structured and can consist of both spatial and textual arguments. Textual arguments are usually associated with the structured attributes or may serve as filters. Approaches include efficient indexing of 'spatial' and 'preference' features along with specialized data structures such as IR-Trees (Cong et al., 2009; Zhang et al., 2016; Tsatsanifos and Vlachou, 2015; Li et al., 2016), methods based on Matrix Factorization (Yiu et al., 2007) for user-specific recommendations, and click-through logs used for recommendations from search engines (Zhao et al., 2019).

Our work builds on the recently released POI entity-recommendation QA task (Contractor et al., 2019, 2020). Two approaches have been developed for this task: semantic parsing of unstructured user questions to query a semi-structured knowledge store (Contractor et al., 2020), and an end-to-end trainable neural model operating over a corpus of unstructured reviews to represent POIs (Contractor et al., 2019). Neither of these approaches explicitly reasons over spatial constraints, even though the questions contain them.
3 Spatio-Textual Reasoning Network

The Spatio-Textual Reasoning Network (Figure 2) consists of three components: (i) a Geo-Spatial Reasoner, (ii) a Textual Reasoner, and (iii) a Joint Scoring Layer.

Our geo-spatial reasoner consists of the following components: (1) a Distance-aware Question Encoder, to encode questions along with the geo-spatial distances between location mentions (in the question) and a candidate entity; (2) a Distance Reasoning Layer, to enable reasoning over geo-spatial distances with respect to the spatial constraints mentioned in the question; and (3) a Spatial Relevance Scorer, to score and rank candidates for spatial relevance.
Distance-aware Question Encoder: We generate question representations by using embedding representations of their constituent tokens along with embedding representations of their location-mentions. A question token can be represented by traditional word-vector embeddings, or contextual embeddings such as BERT (Devlin et al., 2019). Each token representation is further appended with a one-hot encoding representing Begin (B), Intermediate (I) or Other (O) labels, indicating the presence of location tokens. The B-I labels help the model recognize a single continuous location-mention. In addition, we concatenate the distance of the candidate entity $c$ from a location-mention to each token representation. Thus, the question representations are distance-aware and candidate-dependent.

Formally, let the token embedding representations in a question be given by $v_i$ ($v_0 \ldots v_i \ldots v_{m-1}$), where $m$ is the length of the question. Let the distance between the $k$-th location-mention $lm_k$ and $c$ be denoted by $d_k$. Further, let $\phi(lm_k)$ be a function that returns the set of position indices occupied by location mention $lm_k$, i.e., it returns the set of position indices of question tokens that have been assigned the B or I label from the B-I encoding for location mention $lm_k$ ($\phi(lm_k) \subset \{0, \ldots, m-1\}$). We create an $m$-dimensional distance vector $d'$ where each element $d'_i$ of the vector is given by:

$$d'_i = \begin{cases} d_k & \text{if } i \in \phi(lm_k) \\ 0 & \text{otherwise} \end{cases} \quad (1)$$

Let the one-hot vector (two dimensional) of the B-I labels for the $i$-th position be $g_i$. The input question embedding $t_i$ ($t_0 \ldots t_i \ldots t_{m-1}$) is then given by:

$$t_i = \text{concat}[v_i, d'_i, g_i] \quad (2)$$

We encode the question using a bi-directional GRU (Cho et al., 2014), which results in output states $q_i$.

Distance-Reasoning Layer (DRL):
We first used a series of down-projecting feed-forward layers, applied to the output state of the GRU, to generate the final score for each candidate, but we found this was not effective (Section 4.1.2). We therefore include a component designed for distance reasoning, referred to as the 'Distance Reasoning Layer', which uses the representations generated by the distance-aware question encoder.

A model could score candidate entities for relevance if, for each location mentioned in the question, it is able to (i) learn whether a location-mention needs to be considered for answering, and (ii) learn how a location-mention needs to be used for answering. Our design of the DRL is motivated by this insight: it learns a function which, for each location-mention $lm_k$ in the question, outputs a distance-weight $w_k$. Here, $w_k$ captures the contribution of the spatial distance between $lm_k$ and the candidate entity $c$, under the constraints mentioned in the question. For instance, a question may include location-mentions that could be involved in simple 'near' or 'far' constraints, or other complex constraints such as "within driving distance" or "within walking distance". The DRL uses the distance-aware question encoding to understand the nature of the constraint being expressed, as well as to figure out how to compute distance-reasoning weights that express those constraints.

Figure 2: Spatio-Textual reasoning network consisting of (i) Geo-Spatial Reasoner, (ii) Textual-Reasoning subnetwork, (iii) Joint Scoring Layer.

Let the output states of the question encoder be given by $q_0 \ldots q_i \ldots q_{m-1}$, where $m$ is the length of the question. To compute distance-weights, we use a series of position-wise feed-forward blocks (Vaswani et al., 2017) that consist of a linear layer with ReLU activation applied at each output position of the Question Encoder:

$$q_i^l = \text{Block}^l(q_i^{l-1}) = \max(0, A^l q_i^{l-1} + b^l) \quad (3)$$

where $q_i^l$ is the output of the block at layer $l$, $A^l$ is a weight matrix and $b^l$ the bias term. The initial block input uses the output state of the GRU ($q_i$) concatenated with the final hidden state ($q_{m-1}$). Thus, the output $q_i^1$ from the application of the first block layer, corresponding to position $i$ in the input, is given by:

$$q_i^1 = \text{Block}^1(\text{concat}[q_i, q_{m-1}]) \quad (4)$$

The blocks apply the same linear transformations at each position but we vary the parameters across layers (see appendix). The final layer gives a single-dimensional output for each position, resulting in an $m$-dimensional vector $r$ ($r_0 \ldots r_i \ldots r_{m-1}$). Let $B$ be an $m$-dimensional one-hot vector based on the position indices that have been assigned only the B label from the B-I encoding used in the input layer.
The distance-weight vector $w$ for a question is given by:

$$w = \tanh(r \odot B) \quad (5)$$

We use the distance-weights for scoring, as described below.

Spatial Relevance Scorer:
The final score $S_L$ of a candidate $c$ is given by:

$$S_L = w \cdot d' \quad (6)$$

Note that since we concatenate the distance values along with token embeddings while encoding locations as part of the Question Encoder (Equation 2), the model learns distance weights $w$ which depend on the distance value as well as on the semantic information present in the question. Thus, the spatial relevance score is not just a simple linear combination of distances, which makes the model representationally more powerful (see experiments in Section 4.2).

(An element of $B$ is 1 whenever it corresponds to a position index indicating the start of a location mention in a question.)

We refer to the Geo-Spatial Reasoner as SPNET for brevity in the rest of the paper.

Textual Reasoner: We use the CRQA (Contractor et al., 2019) model as our textual-reasoning sub-network. It consists of a Siamese-Encoder (Lai et al., 2018) that uses question representations to attend over entity-review sentences and generate question-aware entity-embeddings. These entity embeddings are combined with question representations to generate an overall relevance score. For scalability, instead of using full review documents, the model uses a set of representative sentences from reviews after clustering them in USE-embedding space (Cer et al., 2018). We follow Contractor et al. and use k-means to cluster sentences in USE embedding space. We set k=10 and select 10 sentences per cluster, creating a set of at most 100 representative sentences per entity.

Joint Scoring Layer: Let the score generated by the textual reasoner be $S_T$ and the score generated by the spatial reasoner be $S_L$. Let the rescaling weights for $S_T$ and $S_L$ be $w_T$ and $w_L$ respectively. Then, the overall score $S$ is given by:

$$S = \alpha \cdot \sigma(w_T S_T) \cdot \tanh(w_L S_L) + \beta \cdot \sigma(w_T S_T)$$

where $\sigma$ is the sigmoid function and $\alpha$, $\beta$ are combination weights.
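To make the scoring pipeline concrete, the sketch below strings together Equations 5-6 and the joint score. All learned quantities (the DRL outputs r, and the weights w_T, w_L, alpha, beta) are replaced by fixed illustrative values, and the function names are ours, not the paper's.

```python
import math

def spatial_score(r, begin_mask, distances):
    # Eq. 5-6: w = tanh(r ⊙ B);  S_L = w · d'
    # A negative weight penalizes large distances ('near' constraints);
    # a positive weight rewards them ('far' constraints).
    w = [math.tanh(ri * bi) for ri, bi in zip(r, begin_mask)]
    return sum(wi * di for wi, di in zip(w, distances))

def joint_score(s_t, s_l, w_t=1.0, w_l=1.0, alpha=0.5, beta=0.5):
    # S = alpha * sigmoid(w_T S_T) * tanh(w_L S_L) + beta * sigmoid(w_T S_T)
    sig = 1.0 / (1.0 + math.exp(-w_t * s_t))
    return alpha * sig * math.tanh(w_l * s_l) + beta * sig

# With no location mentions, d' is all zeros, so S_L = 0 and tanh(S_L) = 0;
# only the purely textual term beta * sigmoid(w_T * S_T) survives.
```

This also illustrates the selector behavior discussed in the text: a zero spatial score switches the model to textual evidence alone.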
The weights are computed by returning a two-dimensional output (corresponding to each weight) after a series of feed-forward operations on the self-attended representation (Cheng et al., 2016) of the question, using the outputs of a Question Encoder with the same architecture as in SPNET (see appendix for hyperparameters). Note that the first term of the scoring equation uses $S_L$ as a selector: for questions with no location-mentions, the spatial score will be 0 (due to the equation for $w$). This lets the model rely only on textual scores in these cases.

Training:
We train the joint model using a max-margin loss, teaching the network to score correct-answer entities higher than negative samples.
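The max-margin objective can be sketched with a standard hinge formulation; the margin value and the way negatives are supplied here are illustrative, not the paper's exact hyperparameters:

```python
def max_margin_loss(pos_score, neg_scores, margin=1.0):
    # Hinge loss: penalize any negative sample whose score comes
    # within `margin` of the correct-answer entity's score.
    return sum(max(0.0, margin - pos_score + s) for s in neg_scores)

# A well-separated gold entity incurs zero loss:
# max_margin_loss(5.0, [1.0, 2.0]) == 0.0
```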
Figure 3:
Sample questions from the Toy Dataset. The dataset has questions from three categories: (1) close to set X, (2) far from set X, (3) Combination.
4 Experiments

We first present a detailed study of the spatial-reasoner using a simple artificially generated toy dataset. This allows us to probe and study different aspects of spatial reasoning in the absence of textual reasoning. We then present our experiments with the joint spatio-textual model using a real-world POI-recommendation QA dataset (Sec 4.2).
We conduct this study on a simple toy dataset generated using linguistically diverse templates specifying spatial constraints, with location names chosen at random from a list of entities across several cities. The templates can be broadly divided into three types of proximity queries, based on whether the correct answer entity is expected to be: (1) close to one or more locations (mentioned in the question), (2) far from one or more locations, or (3) close to some and far from others (combination). We create different templates for each category with linguistic variations. Figure 3 shows a sample question from each category. See the appendix for more details, including the list of templates.
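Each template induces a signed distance-based score over candidates, which determines the gold answer (as described under 'Gold-entity generation' below). A minimal sketch, using 1-D coordinates and helper names that are our own simplification:

```python
def template_score(cand, close_to, far_from, dist):
    # Reward small distances to 'close' mentions and
    # large distances to 'far' mentions.
    return -(sum(dist(cand, b) for b in close_to)
             - sum(dist(cand, k) for k in far_from))

def gold_entity(candidates, close_to, far_from, dist):
    # The highest-scoring candidate in the universe is the gold answer.
    return max(candidates,
               key=lambda c: template_score(c, close_to, far_from, dist))

# 1-D toy example: the candidate at 0 is close to the 'close' mention at 1
# and far from the 'far' mention at 9, so it wins over the candidate at 10.
d = lambda a, b: abs(a - b)
```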
Use of distractor-locations:
In order to make the task more reflective of real-world challenges, we also randomly insert a distractor sentence that contains a location reference which does not need to be reasoned over (e.g., the location "Pinati" in a question in Figure 3).

Gold-entity generation:
The gold answer entity is uniquely determined for each question based on its template. For example, consider a template T, "I am staying at $A! Please suggest a hotel close to $B but far from $K." The score of a candidate entity X is given by dist_T(X) = -(dist(X, B) - dist(X, K)) (the distance from B needs to be small, while the distance from K needs to be large). A is a distractor. The candidate with the maximum dist_T(X) in the universe is chosen as the gold-answer entity for that question. We use the geo-coordinates of locations to compute the distance.

Table 1: Results of SPNET on the artificial spatial-questions dataset (t-test p-value for Acc@3).

                     Close to set X          Far from set X          Combination             Aggregate
Models               Acc@3  MRR    Dist_g   Acc@3  MRR    Dist_g    Acc@3  MRR    Dist_g    Acc@3  MRR    Dist_g
SPNET w/o DRL        62.60  0.608  2.88     89.00  0.858  15.24     23.40  0.229  9.72      58.33  0.565  9.28
SPNET                -      -      -        -      -      -         -      -      -         -      -      -
BERT-SPNET w/o DRL   63.60  0.616  3.68     90.80  0.881  15.32     26.80  0.242  12.96     60.40  0.579  10.65
BERT-SPNET           -      -      -        -      -      -         -      -      -         -      -      -

Dataset Statistics:
The train, dev and test sets consist of questions generated using different templates, split equally across all template categories. Each question consists of location-names from only one city, and thus the candidate search space for that question is restricted to that city. The size of the candidate search space varies across cities. The dataset includes questions containing distractor-locations, distributed evenly across all template classes.

We study SPNET using the artificial dataset to answer the following questions: (1) What is the model performance across template classes? (2) How does the network compare with baseline models that do not use the DRL? (3) How well does the model deal with distractor-locations, i.e., locations not relevant for the scoring task? For all experiments in this section we use perfectly tagged location-mentions.

Metrics:
We study the performance of models using Acc@N (N=3, 5, 30), which requires that any one of the top-N answers be correct, Mean Reciprocal Rank (MRR), and the average distance of the top-ranked answer from the gold entity, Dist_g. Dist_g helps quantify the spatial goodness of the returned answers (lower is better).

We use the following models in our experiments: (i) SPNET, (ii) SPNET without DRL, (iii) BERT-SPNET, (iv) BERT-SPNET without DRL. Models without DRL use the final hidden states of the Question Encoder and a series of down-projecting feed-forward layers to generate the final score.

Performance across template classes:
As can be seen in Table 1, all models perform the worst on the template class that contains a combination of both 'close-to' and 'far' constraints. Models based on SPNET perform exceedingly well on the 'Far' templates because the difference between the dist_T(X) scores of the best and the second-best candidate is almost always large enough for every model to easily separate them. (We report results with N=3 in the main paper; please see the appendix for full results.)

Table 2: Performance of spatial-reasoning networks degrades in the presence of location-distractor sentences.

              Without Distractors    With Distractors
Models        Acc@3    MRR           Acc@3    MRR
SPNET         -        -             -        -
BERT-SPNET    -        -             -        -

Importance of Distance-Reasoning Layer:
As can be seen in Table 1, the performance of each configuration (with and without BERT) suffers a serious degradation in the absence of the DRL. Recall that all models have access to spatial knowledge in their input layer via the question encoding. This indicates that the DRL is an important component for reasoning over spatial constraints. To further assess whether our model is able to do distance reasoning, we computed the correlation between ranking-by-distances (the appropriate ranking order for each template class) and SPNET's ranking on the toy dataset. We found the rank correlation to be high and statistically significant, suggesting that the model is able to use physical distance to compute the best answer.

Effect of distractor-locations:
We report results on two splits of the test set: questions with and without distractor-locations. We report the aggregate performance over all template classes due to space constraints. As can be seen in Table 2, models suffer a degradation of performance in the presence of distractor-locations. We hypothesize that this is because the reasoning task becomes harder; models now need to also account for location-mentions that do not need to be reasoned over.
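For reference, the Acc@N and MRR metrics reported in these tables can be computed as below; the rank-correlation analysis can likewise be sketched with Spearman's coefficient (the exact correlation statistic used is our assumption, as the paper does not name it here):

```python
def acc_at_n(ranked_ids, gold_ids, n):
    # Acc@N: 1 if any of the top-N ranked answers is a gold answer.
    return 1.0 if any(a in gold_ids for a in ranked_ids[:n]) else 0.0

def mrr(ranked_ids, gold_ids):
    # Reciprocal rank of the first gold answer; 0 if none is retrieved.
    for rank, a in enumerate(ranked_ids, start=1):
        if a in gold_ids:
            return 1.0 / rank
    return 0.0

def spearman(xs, ys):
    # Rank correlation between two score lists (distinct values assumed).
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for i, idx in enumerate(order, start=1):
            r[idx] = i
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```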
Probing Study:
We conduct a probing study (Figure 4) on SPNET to get some insights into the reasoning process employed by the trained network. We use a question that has both 'near' and 'far' constraints (case 1) and then interchange the constraints (case 2). In both cases we study the corresponding distance-weights assigned to the location-mentions with respect to two candidates, "Santa Isabel" and "Parque Central".

Figure 4: Probing study of the Distance Reasoning Layer (DRL) using the question: "I came from Tropicoco today. Any nice ideas for a coffee shop [far from/close to] 'Be Live Havana' but [close to/far from] 'Melia Cohiba'?"

Consider the first case; as can be seen, each candidate entity assigns a higher weight (column-wise comparison), compared to the other candidate, on the distance property it is most likely to benefit from with respect to the spatial constraint. For example, when the spatial constraint requires an answer to be close to "Melia Cohiba", the candidate "Parque Central" assigns a higher weight to this location than candidate "Santa Isabel" does, since "Parque Central" has a smaller distance value to this location. On the other hand, with respect to the "far" constraint, candidate "Santa Isabel" has a larger distance value from "Be Live Havana" than candidate "Parque Central", and thus assigns a higher distance weight to this location-mention.

When we interchange the constraints (case 2), we see the same pattern, and the comparative weight trends (at each location-mention) invert due to the inversion of spatial constraints. This suggests that the DRL is learning to transform the inputs and generate weights based on the spatial constraint at hand.

Effect of Candidate Space Size:
We analyzed the errors made by the SPNET model and find that nearly 40% of the errors were made on questions that have large candidate spaces; such questions form only a portion of the test set.

Effect of the No. of Location-mentions:
The complexity of the spatial-reasoning task increases as the number of location-mentions (including distractor-locations) in the question increases. We find that SPNET makes no errors when spatial reasoning involves only a single location-mention, but nearly 57% of the errors are made on questions with multiple location-mentions (see appendix).

For the joint model, we investigate the following research questions: (i) Does joint spatio-textual ranking result in improved performance over a model with only spatial-reasoning or only textual-reasoning? (ii) How do pipelined baseline models that use spatial re-ranking perform on the task? (iii) Does distance-aware question encoding help in spatio-textual reasoning? (iv) Is the spatio-textual reasoning model more robust to distractor-locations than the baselines? (v) What kind of errors does the model make?
Dataset:
We use the recently released dataset on Tourism Questions (Contractor et al., 2019), which consists of over 47,000 real-world POI question-answer pairs along with a universe of nearly 200,000 candidate POIs; questions are long and complex, as presented in Figure 1, while the recommendations (answers) are represented by an ID corresponding to each POI. Each POI comes with a collection of reviews and meta-data that includes its geo-coordinates. The training set contains nearly 43,500 QA pairs, with roughly 4,200 and 4,300 QA pairs in the validation and test sets respectively (see Table 3). The candidate space for each question is large, varying across cities.

Task Challenges:
The task presents novel challenges of reasoning and scale; the nature of entity reviews (e.g., inference over subjective language, sarcasm, etc.) makes methods such as BM25 (Robertson and Zaragoza, 2009), which are often used to prune the search space quickly in large-scale QA tasks (Chen et al., 2017; Dunn et al., 2017), ineffective. Thus, even simple BERT-based architectures or popular models such as BiDAF (Seo et al., 2016) do not scale for the answering task in this dataset (Contractor et al., 2019). We therefore use the non-BERT-based SPNET subnetwork in the rest of the QA experiments.

Evaluation Challenges:
It is infeasible to construct a dataset of POI recommendation QA pairs with an exhaustively labeled answer set for each question, since the candidate space is very large. Hence, this dataset suffers from the problem of false negatives, and Acc@N metrics under-report system performance. Still, they are shown to be correlated with human relevance judgments (Contractor et al., 2019). We therefore use these metrics for all experiments, but additionally present a small human study on the end task, verifying the robustness of our results.

Table 3: Dataset statistics: questions with and without location-mentions across train, dev & test sets from (Contractor et al., 2019).

            Location                   Non-location
Dataset     Questions    QA pairs     Questions    QA pairs
Train       9,617        21,396       10,342       22,150
Dev         1,065        2,209        1,054        1,987
Test        1,086        2,198        1,087        2,144

https://github.com/dair-iitd/TourismQA
(CRQA is also not based on BERT due to this reason.)
Location Tagging in Questions:
In order to get mentions of locations in questions, we manually label a set of questions from the training set for location mentions. We then use a BERT-based sequence tagger trained on this set to label locations. The tagger has a macro-F1 of 88.03. This tagger tags all location mentions in a question without considering their utility for spatial reasoning; it is possible that a question contains only distractor-locations, i.e., location-mentions that do not need to be reasoned over for the answering task.

Once the location-mentions are tagged, we remove the punctuation and stopwords from the tagged location span. We then query the Bing Maps Location API using the location-mention along with the city (known from question meta-data) to get the geo-tags. To reduce noise in geo-tagging, we ignore the location-mention if the remaining text is very short or is identified as a popular acronym, continent, country, city or state (lists from Wikipedia). We further reduce noise by ignoring a location mention (1) if no results were found from Bing, or (2) if the geo-tag is beyond 40km from the city center. We found the location-mention geo-tagging precision on a small set of location-mentions to be 96%. We label all questions in the full dataset using this tagger, resulting in approximately 49.54% of the QA pairs containing at least one location-mention (see Table 3). In all our experiments, we use the Manhattan distance as our distance value, because it is generally closer to real-world driving/walking distance within a city than straight-line distance.

github.com/codedecde/BiLSTM-CCM/tree/allennlp
https://bit.ly/36Vazwo
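The Manhattan-distance computation and the 40km geo-tag filter can be sketched as below; the per-degree scale factor and the function names are our assumptions, not values from the paper:

```python
import math

KM_PER_DEG_LAT = 111.0  # rough km per degree of latitude (assumption)

def manhattan_km(lat1, lon1, lat2, lon2):
    # Axis-aligned north-south + east-west distance: a rough proxy for
    # within-city driving/walking distance, as opposed to straight-line.
    dy = abs(lat1 - lat2) * KM_PER_DEG_LAT
    dx = abs(lon1 - lon2) * KM_PER_DEG_LAT * math.cos(
        math.radians((lat1 + lat2) / 2))
    return dx + dy

def keep_geotag(tag, city_center, max_km=40.0):
    # Discard noisy geocoder results falling far outside the city.
    return manhattan_km(*tag, *city_center) <= max_km
```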
Table 4: Comparison of the joint Spatio-Textual model with baselines on questions that have location mentions (t-test p-value for Acc@3).

Models           Acc@3   Acc@5   Acc@30   MRR     Dist_g
SD               2.49    3.41    14.29    0.029   3.07
SPNET            -       -       -        -       -
CRQA → SD        13.73   19.26   50.65    0.125   -
CRQA → SPNET     -       -       -        -       -

Apart from the textual-reasoning model CRQA, we also use the following baselines in our experiments:
Sort-by-distance (SD): Given a set of tagged locations in a question and their geo-coordinates, rank candidate entities by their minimum distance from the set of tagged locations.
SPNET: Use only the spatial-reasoning network to rank candidate entities using their geo-coordinates. No textual reasoning is performed.

CRQA → SD:
Rank candidates using CRQA and then re-rank the top-30 answers using SD.

CRQA → SPNET: Rank candidates using CRQA and then re-rank the top-30 answers using SPNET.

Training:
We pretrain SPNET on this dataset by allowing entities within a small radius of the actual gold entity to be considered as gold (only for pretraining). To train the joint network, we initialize model parameters learnt from component-wise pretraining of both SPNET and CRQA.
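The pipelined baselines (e.g., CRQA → SD, CRQA → SPNET) follow a simple two-stage pattern; a minimal sketch with arbitrary scorer callables (the names are ours):

```python
def pipeline_rerank(candidates, first_scorer, second_scorer, k=30):
    # Stage 1: rank the full candidate set with the first model.
    shortlist = sorted(candidates, key=first_scorer, reverse=True)[:k]
    # Stage 2: re-rank only the top-k shortlist with the second model.
    return sorted(shortlist, key=second_scorer, reverse=True)
```

For instance, `pipeline_rerank(pois, crqa_score, sd_score)` would mimic CRQA → SD if `sd_score` returned the negated minimum distance to the tagged locations (hypothetical helper names).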
We present our experiments on two slices of the test set: questions with tagged location-mentions (called Location Questions) and those without any location mentions (Non-Location Questions). As can be seen in Table 4, sorting by distance (SD) performs very poorly, indicating that simple methods for ranking based on entity distance do not work for such questions. Further, the poor performance of SPNET also indicates that the task cannot be solved just by reasoning on location data. In addition, pipelined re-ranking using SD or SPNET over the textual reasoning model decreases the average distance (Dist_g) from the gold entity but does not result in improved answering performance (Acc@N), indicating the need for spatio-textual reasoning. Finally, from Tables 4 & 5 we note that the spatio-textual model performs better than its textual counterpart on the Location Questions subset, while continuing to perform well on questions without location mentions.

Table 5: Comparison of Spatio-Textual CRQA (with and without (w/o) distance-aware question encoding) and CRQA (t-test p-value for Acc@3).

                             Location Questions                       Non-location Questions           Full Set
Models                       Acc@3  Acc@5  Acc@30  MRR    Dist_g     Acc@3  Acc@5  Acc@30  MRR        Acc@3  Acc@5  Acc@30  MRR
CRQA                         14.83  21.27  50.65   0.143  3.41       18.95  26.22  54.37   0.177      16.89  23.75  52.51   0.159
Spatio-Textual CRQA          -      -      -       -      -          -      -      -       -          -      -      -       -
Spatio-Textual CRQA
 (w/o distance-aware QE)     16.85  23.39  53.04   0.159  2.84       20.06  26.86  56.49   0.185      18.45  25.13  54.76   0.172
Effect of distance-aware question encoding: In order to demonstrate the importance of distance-aware question encoding, we present an experiment where we remove the distance values from the input encoding. Thus, Equation 2 changes to t_i = concat[v_i, g_i]. As Table 5 shows, the performance of the Spatio-Textual CRQA model in the absence of distance-aware encoding drops (last row), but it still performs better than the text-only CRQA model (first row). This indicates that the distance-aware question encoding helps learn better distance weights for spatio-textual reasoning.
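A minimal sketch of this ablation: with distance-aware encoding, each token input concatenates the word embedding v_i, a location feature g_i, and a distance feature d_i, while the ablation drops d_i so that t_i = concat[v_i, g_i]. The feature dimensions and contents below are illustrative assumptions, not the paper's exact features.

```python
def encode_token(word_emb, loc_feat, dist_feat, use_distance=True):
    """Build the per-token input t_i for the question encoder.

    With distance-aware encoding: t_i = concat[v_i, g_i, d_i].
    Ablation (Equation 2 without distances): t_i = concat[v_i, g_i].
    """
    t = list(word_emb) + list(loc_feat)
    if use_distance:
        t += list(dist_feat)
    return t
```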
Effect of distractor-locations: As mentioned earlier, we use a location tagger that is oblivious to the reasoning task to tag locations in the dataset. We manually create a small set of questions, randomly selected from the test set, ensuring that half of it contains at least one non-distractor location mentioned in the question while the other half contains questions with only distractor locations.
Questions requiring Spatial-reasoning
Models                Acc@3   Acc@5   Acc@30   MRR     Dist_g
SD                     5.00    7.00   22.00    0.053    2.10
SPNet                   —       —       —        —       —
CRQA → SD             15.00   22.00   51.00    0.142     —
CRQA → SPNet          13.00   17.00   51.00    0.108    3.26
Spatio-textual CRQA     —       —       —        —       —

Table 6: Experiments on two subsets from the test-set: (i) Questions requiring Spatial-reasoning (ii) Questions with distractor-locations only.
As can be seen from Table 6, all models, including the spatio-textual model, deteriorate in performance if a question only contains distractors; the spatio-textual model, however, suffers a less significant drop in performance.
Error Type                        Percentage
Textual Reasoning Error           37.9%
Far from the required location    22.3%
Influenced by Distractor          12.6%
Not in requested Neighbourhood    10.7%
Location Tagger Error              5.8%
Repeated Location Names            4.9%
Error in Geo-Spatial Data          2.9%
Invalid Question                   2.9%

Table 7: Spatio-Textual CRQA: Classification of Errors
Location Questions
Models                  Acc@3   Acc@5   Acc@30   MRR   Dist_g
CSRQA                   19.89   26.43     —       —      —
Spatio-textual CSRQA    21.45   28.21     —       —      —

Table 8: Comparison with the current state-of-the-art CSRQA on (i) Location Questions (ii) All data
Qualitative Study: We randomly selected QA pairs with location mentions from the test set to conduct a qualitative error analysis of Spatio-Textual CRQA (Table 7). We find that nearly 37% of the errors can be traced to the textual reasoner, 22% of the errors were due to a 'near' constraint not being satisfied, while about 13% of the errors were due to the model reasoning on distractor locations. Lastly, about 8% of the errors were due to errors made by the location tagger and incorrect geo-spatial data.
Effect of Candidate Search Space: Past work (Contractor et al., 2019) has improved overall task performance by employing a neural IR method to reduce the search space (Mitra and Craswell, 2019), and then using the CRQA textual reasoner to re-rank only the top-30 selected candidates (a pipeline referred to as CSRQA). In line with their work, we create a spatio-textual counterpart to CSRQA by using spatio-textual reasoning in the re-rank step. We find that this final model results in a pt (Acc@3) improvement overall (see Table 16), and a . pt improvement on location questions (Acc@3), establishing a new state of the art on the task. We note that, because the IR selector is incapable of spatial reasoning, it possibly reduces the gains made by the spatio-textual re-ranking. An interesting direction of future work could be to augment general-purpose neural IR methods with such spatial reasoning.

                        Automated evaluation         Human evaluation
                        Location   Non-location      Location   Non-location
CSRQA                   28.00      36.00             64.00      70.00
Spatio-textual CSRQA    32.00      32.00             84.00      72.00

Table 9: Acc@3 results on a blind human study using randomly selected questions from the test set.
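The select-then-rerank pipelines used throughout (e.g. CRQA → SPNet, or CSRQA's re-ranking of the top-30 IR-selected candidates) share a simple structure, sketched below with hypothetical ranker callables; the rankers themselves are stand-ins for the paper's models.

```python
def pipeline_rerank(candidates, first_ranker, second_ranker, k=30):
    """Rank all candidates with first_ranker, then re-rank only its top-k
    with second_ranker; candidates below rank k keep their original order.
    Each ranker maps a candidate list to a reordered candidate list."""
    first = first_ranker(candidates)
    head, tail = first[:k], first[k:]
    return second_ranker(head) + tail
```

Note that such a pipeline cannot change Acc@30 when k=30, which matches the observation above that re-ranking changes Dist_g without improving Acc@N across the board.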
Effect of False Negatives: To supplement the automatic evaluation, we additionally conducted a blind human study using the top-ranked CSRQA and spatio-textual CSRQA models on another subset of questions from the test set. Two human evaluators (κ = 0.81) were presented the top-3 answers from both models in random order and were asked to mark each answer for relevance. As Table 9 shows, the manual annotation resulted in much higher Acc@3 values for both CSRQA and spatio-textual CSRQA, with the spatio-textual model again ahead on the subset of location questions. This underscores the value of joint spatio-textual reasoning for the task, and signifies a substantial improvement in the overall QA performance.

Our paper presents the first joint spatio-textual QA model that combines spatial and textual reasoning. Experiments on an artificially constructed (spatial-only) toy QA dataset show that our spatial reasoner effectively trains to satisfy spatial constraints. We also presented detailed experiments on the recently released POI recommendation task for tourism questions. Compared against textual-only and spatial-only QA models, the joint model obtains significant improvements. Our final model establishes a new state of the art on the task. In future work, we would like to also support reasoning on questions that require directional or topographical inference (e.g. "north of X", "on the river beach").
We would like to thank Krunal Shah, Gaurav Pandey, Biswesh Mohapatra, and Azalenah Shah for their helpful suggestions to improve early versions of this paper. We would also like to acknowledge the IBM Research India PhD program that enables the first author to pursue a PhD at IIT Delhi. This work is supported by an IBM AI Horizons Network grant, IBM SUR awards, Visvesvaraya faculty awards by the Govt. of India to both Mausam and Parag, as well as grants by Google, Bloomberg and 1MG to Mausam.
References
Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2357–2367, Minneapolis, Minnesota. Association for Computational Linguistics.

Bin Bi, Chen Wu, Ming Yan, Wei Wang, Jiangnan Xia, and Chenliang Li. 2019. Incorporating external knowledge into machine reading for generative question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 2521–2530.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder. CoRR, abs/1803.11175.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL).

Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 551–561, Austin, Texas. Association for Computational Linguistics.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.

Gao Cong, Christian S. Jensen, and Dingming Wu. 2009. Efficient retrieval of the top-k most relevant spatial web objects. Proc. VLDB Endow., 2(1):337–348.

Danish Contractor, Barun Patra, Mausam, and Parag Singla. 2020. Constrained BERT BiLSTM CRF for understanding multi-sentence entity-seeking questions. Natural Language Engineering, pages 1–23.

Danish Contractor, Krunal Shah, Aditi Partap, Mausam, and Parag Singla. 2019. Large scale question answering using tourism data. CoRR, abs/1909.03527.

Gaétan De Rassenfosse, Jan Kozak, and Florian Seliger. 2019. Geocoding of worldwide patent data. Scientific Data, 6(1):1–15.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proc. of NAACL.

Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Güney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. CoRR, abs/1704.05179.

Daniel Ferrés Domènech. 2017. Knowledge-based and data-driven approaches for geographical information access.

Fredric Gey, Ray Larson, Mark Sanderson, Kerstin Bischoff, Thomas Mandl, Christa Womser-Hacker, Diana Santos, Paulo Rocha, Giorgio M Di Nunzio, and Nicola Ferro. 2006. GeoCLEF 2006: The CLEF 2006 cross-language geographic information retrieval track overview. In Workshop of the Cross-Language Evaluation Forum for European Languages, pages 852–876. Springer.

Yangyang Guo, Zhiyong Cheng, Liqiang Nie, Xin-Shun Xu, and Mohan Kankanhalli. 2018. Multi-modal preference modeling for product search. In Proceedings of the 26th ACM International Conference on Multimedia, pages 1865–1873.

Mark Hopkins, Ronan Le Bras, Cristian Petrescu-Prahova, Gabriel Stanovsky, Hannaneh Hajishirzi, and Rik Koncel-Kedziorski. 2019. SemEval-2019 task 10: Math question answering. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 893–899, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Binxuan Huang and Kathleen Carley. 2019. A hierarchical location prediction neural network for Twitter user geolocation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4732–4742, Hong Kong, China. Association for Computational Linguistics.

Tannon Kew, Anastassia Shaitarova, Isabel Meraner, Janis Goldzycher, Simon Clematide, and Martin Volk. 2019. Geotagging a diachronic corpus of alpine texts: Comparing distinct approaches to toponym recognition. In Proceedings of the Workshop on Language Technology for Digital Historical Archives, pages 11–18.

Tuan Manh Lai, Trung Bui, and Sheng Li. 2018. A review on deep learning techniques applied to answer selection. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 2132–2144.

Jochen L. Leidner, Bruno Martins, Katherine McDonough, and Ross S. Purves. 2020. Text meets space: Geographic content extraction, resolution and information retrieval. In Advances in Information Retrieval, pages 669–673, Cham. Springer International Publishing.

Miao Li, Lisi Chen, Gao Cong, Yu Gu, and Ge Yu. 2016. Efficient processing of location-aware group preference queries. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM '16, pages 559–568, New York, NY, USA. Association for Computing Machinery.

Kwan Hui Lim, Shanika Karunasekera, Aaron Harwood, and Yasmeen George. 2019. Geotagging tweets to landmarks using convolutional neural networks with text and posting time. In Proceedings of the 24th International Conference on Intelligent User Interfaces: Companion, IUI '19, pages 61–62, New York, NY, USA. Association for Computing Machinery.

Thomas Mandl, Paula Carvalho, Giorgio Maria Di Nunzio, Fredric Gey, Ray R Larson, Diana Santos, and Christa Womser-Hacker. 2008. GeoCLEF 2008: The CLEF 2008 cross-language geographic information retrieval track overview. In Workshop of the Cross-Language Evaluation Forum for European Languages, pages 808–821. Springer.

Bhaskar Mitra and Nick Craswell. 2019. An updated duet model for passage re-ranking. arXiv preprint arXiv:1903.07666.

R. S. Purves, P. Clough, C. B. Jones, M. H. Hall, and V. Murdock. 2018. Geographic Information Retrieval: Progress and Challenges in Spatial Search of Text.

Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, and Zhiyuan Liu. 2019. NumNet: Machine reading comprehension with numerical reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2474–2484, Hong Kong, China. Association for Computational Linguistics.

Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333–389.

Diana Santos and Luís Miguel Cabral. 2009. GikiCLEF: Expectations and lessons learned. In Workshop of the Cross-Language Evaluation Forum for European Languages, pages 212–222. Springer.

Simon Scheider, Enkhbold Nyamsuren, Han Kruiger, and Haiqi Xu. 2020. Geo-analytical question-answering with GIS. International Journal of Digital Earth, 0(0):1–14.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603.

Oyvind Tafjord, Matt Gardner, Kevin Lin, and Peter Clark. 2019. QuaRTz: An open-domain dataset of qualitative relationship questions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 5940–5945.

George Tsatsanifos and Akrivi Vlachou. 2015. On processing top-k spatio-textual preference queries. In Proceedings of the 18th International Conference on Extending Database Technology, EDBT 2015, Brussels, Belgium, March 23-27, 2015, pages 433–444.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 6000–6010, Red Hook, NY, USA. Curran Associates Inc.

Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays. 2019. Composing text and image for image retrieval: an empirical odyssey. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6439–6448.

Dimitri Vorona, Andreas Kipf, Thomas Neumann, and Alfons Kemper. 2019. DeepSpace: Approximate geospatial query processing with deep learning. In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 500–503.

Jiangnan Xia, Chen Wu, and Ming Yan. 2019. Incorporating relation knowledge into commonsense reading comprehension with multi-task learning. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, pages 2393–2396, New York, NY, USA. Association for Computing Machinery.

M. L. Yiu, X. Dai, N. Mamoulis, and M. Vaitis. 2007. Top-k spatial preference queries. In , pages 1076–1085.

C. Zhang, Y. Zhang, W. Zhang, and X. Lin. 2016. Inverted linear quadtree: Efficient top k spatial keyword search. IEEE Transactions on Knowledge and Data Engineering, 28(7):1706–1721.

Ji Zhao, Meiyu Yu, Huan Chen, Boning Li, Lingyu Zhang, Qi Song, Li Ma, Hua Chai, and Jieping Ye. 2019. POI semantic model with a deep convolutional structure. CoRR, abs/1903.07279.
A Appendix
This appendix is organized as follows.
• Section A.1 provides more details about the Toy Dataset and supplementary experimental information, including additional tables referred to in the main paper on spatial reasoning.
• Section A.2 includes more results of the Location Tagger used in the end task.
• Section A.3 contains supplementary experiments on Spatio-Textual Reasoning.
• Section A.4 gives details about the model hyper-parameters.
A.1 Toy Dataset
We create a simple toy dataset generated using linguistically diverse templates that specify spatial constraints, with locations chosen at random from across 200,000 entities. These entities were sourced from the recently released Points-of-Interest (POI) recommendation task (Contractor et al., 2019). Each POI entity is labeled with its geo-coordinates, apart from other meta-data such as its address, timings, etc. Further, each entity in a city has a specific type, viz. Restaurant (R), Attraction (A) or Hotel (H). Table 10 shows the list of templates used for generating the dataset. These templates have been designed to make the toy dataset reflective of real-world challenges; for instance, some templates contain distractor locations. To generate questions, $LOCATION and $ENTITY values are updated by randomly selecting values from the POI set for each entity, as described in the next section.

A.1.1 Dataset Generation
To generate a question, a city c, type t and a template T are chosen at random. The "ENTITY" token in each template is replaced by a randomly chosen metonym of the type t; Table 11 shows the list of metonyms for each type. Each instance of the "LOCATION" token is replaced by a randomly chosen entity from the city c and type t. The candidate set consists of the entities from the city c and type t. The entities used as location mentions are sampled without replacement and removed from the candidate set.

The gold answer entity is uniquely determined for each question based on its template. For example, consider a template T, "I am staying at $A! Please suggest a hotel close to $B but far from $C." The score of a candidate entity X is given by dist_T(X) = -(dist(X, B) - dist(X, C)) (the distance from B needs to be low, while the distance from C needs to be high); A is a distractor. The candidate with the maximum dist_T(X) in the universe is chosen as the gold entity for that question.

Each question further consists of 500 negative samples (35% hard, 65% soft). The negative samples are generated as part of the gold-generation process: a hard negative sample has a dist_T(X) value closer to the gold than a soft negative sample does. We release the samples used for training along with the dataset for reproducibility.

A.1.2 Template classes
We create templates (Table 10) that can be broadly divided into three categories based on whether the correct answer entity is expected to be: (1) close to one or more locations [1-16]; (2) far from one or more locations [17-32]; (3) close to some and far from others (combination) [33-48]. To make the task more reflective of real-world challenges, we also randomly insert a distractor location that does not need to be reasoned about. The second half of each category (i.e. [9-16], [25-32], and [41-48]) consists of templates that have a distractor locative reference. Further, for the close (or far) category, the templates contain either one location ([1-4] + [9-12]) or two locations ([5-8] + [13-16]) that need to be reasoned about for close (or far).
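The gold-selection rule of Section A.1.1 can be sketched as follows. Summing distances over multiple "close" and "far" locations and using planar distance are our simplifying assumptions; the paper's single-B, single-C example reduces to dist_T(X) = dist(X, C) − dist(X, B).

```python
import math

def dist_t(x, close_to, far_from):
    """dist_T(X): higher when X is near the 'close to' locations and away
    from the 'far from' locations; distractor locations are simply excluded."""
    return -(sum(math.dist(x, b) for b in close_to)
             - sum(math.dist(x, c) for c in far_from))

def gold_entity(candidates, close_to, far_from):
    """The candidate with maximum dist_T(X) is chosen as the gold answer."""
    return max(candidates, key=lambda poi: dist_t(candidates[poi], close_to, far_from))
```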
A.1.3 Results
We use the following models in our experiments: (i) SPNet, (ii) SPNet without (w/o) DRL, (iii) BERT-SPNet, (iv) BERT-SPNet without (w/o) DRL. Models without DRL use the final hidden states of the Question Encoder and a series of down-projecting feed-forward layers to generate the final score. We study our models' performance using Acc@N (N=3, 5, 30), which requires that any one of the top-N answers be correct, Mean Reciprocal Rank (MRR), and the average distance of the top-ranked answers from the gold entity, Dist_g. Table 12 summarizes the results on the test set.

A.1.4 Error Analysis
Tables 13 and 14 show the effect of the candidate search space and of the number of location mentions in the question on the performance of the SPNet model.
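The metrics used in these experiments (Acc@N and MRR) can be computed as follows; this is a standard implementation, not taken from the paper's code.

```python
def acc_at_n(ranked_lists, golds, n):
    """Fraction of questions whose gold answer appears in the top-n ranking."""
    hits = sum(g in r[:n] for r, g in zip(ranked_lists, golds))
    return hits / len(golds)

def mrr(ranked_lists, golds):
    """Mean reciprocal rank of the gold answer (contributes 0 if absent)."""
    total = 0.0
    for r, g in zip(ranked_lists, golds):
        if g in r:
            total += 1.0 / (r.index(g) + 1)
    return total / len(golds)
```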
                    Correctly Answered          Incorrectly Answered
Search Space size   Questions   Percentage      Questions   Percentage

Table 13: Performance of SPNet decreases with increase in universe size.

                    Correctly Answered          Incorrectly Answered

Table 14: Performance of SPNet decreases with increase in the number of location mentions in the question.
A.2 Location Tagger
In order to get mentions of locations in questions, we manually label a set of questions from the training set for location mentions. We then use a BERT-BiLSTM CRF (Contractor et al., 2020) based tagger trained on this set to label locations. Table 15 describes the performance of the tagger on an unseen set of questions.

                Precision   Recall   F1
Micro Average
Macro Average

Table 15: Performance of the BERT-BiLSTM CRF for tagging locations on a small set of 75 unseen questions.
A.3 Spatio-textual Reasoning Network
The Spatio-Textual Reasoning Network consists of three components: (i) Spatial Reasoner, (ii) Textual Reasoner, (iii) Joint Scoring Layer.
Training:
We train the joint model using a max-margin loss, teaching the network to score the
1. … ENTITY near the LOCATION?
2. Does anyone have ideas on ENTITY close to LOCATION? Thank you!
3. Hello! Could anyone please suggest ENTITY in the neighborhood of LOCATION?
4. Good Morning! Can someone please propose ENTITY not very far from LOCATION?
5. Suggestions for ENTITY close to both LOCATION and LOCATION?
6. Some good ideas of ENTITY between LOCATION and LOCATION? Thanks much!
7. Please advise ENTITY close to LOCATION and not very far off the LOCATION.
8. Any ideas for ENTITY near LOCATION and also close to LOCATION would be welcomed?
9. I once lived around LOCATION. Does anyone have ideas of ENTITY close to the LOCATION? Thanks!
10. Any nice suggestions of ENTITY near the LOCATION? I will be going to LOCATION the next day.
11. I just came from LOCATION. Someone, please recommend ENTITY in the neighborhood of LOCATION.
12. Could anyone propose ENTITY not far from the LOCATION? I need to leave for LOCATION urgently.
13. We came from LOCATION this morning. Suggestions for ENTITY close to both LOCATION and LOCATION?
14. Any ideas of ENTITY between LOCATION and LOCATION? I would be going to LOCATION. Thanks.
15. We might be staying around LOCATION. Please advise ENTITY close to LOCATION and not far from LOCATION.
16. Could anyone suggest ideas for ENTITY close to LOCATION and around LOCATION? We could be going to LOCATION soon.
17. Any suggestions for ENTITY quite far from the LOCATION? Thank you very much!
18. Somebody please suggest ENTITY cut off from LOCATION. Have a good day!
19. Does anyone have suggestions for ENTITY away from LOCATION? Thanks a lot!
20. Good Afternoon! Any proposals for ENTITY not very close to the LOCATION?
21. Suggestions on ENTITY far from both LOCATION and LOCATION? Thank!
22. Hi! Any idea of ENTITY far away from LOCATION and LOCATION?
23. Could anyone please propose ENTITY not close to LOCATION and also far from LOCATION?
24. Does anyone have any suggestions for ENTITY far from LOCATION and not around LOCATION?
25. Hey! I will be staying at LOCATION. Please suggest ENTITY cut off from LOCATION.
26. Any pleasant ideas of ENTITY far off the LOCATION? I might then be visiting LOCATION.
27. I came from LOCATION this afternoon. Any proposal for ENTITY not close to the LOCATION?
28. Does anyone have a suggestion for ENTITY distant from LOCATION? By the way, I came from LOCATION yesterday.
29. We will be staying near the LOCATION. Suggestions for ENTITY far from both LOCATION and LOCATION will be welcomed.
30. Any idea of ENTITY far away from LOCATION and LOCATION? I would then be visiting LOCATION.
31. Hi, I will be staying near the LOCATION. Could anyone propose ENTITY not very close to LOCATION and far from LOCATION?
32. Does anyone have suggestions for ENTITY far from LOCATION and also far from LOCATION? I will then be visiting LOCATION too.
33. Any good ideas of ENTITY far from LOCATION but close to LOCATION would be appreciated? Best Regards.
34. Anyone having ideas of ENTITY close to LOCATION but far from LOCATION?
35. Someone please advise ENTITY far from LOCATION but not very far from LOCATION.
36. Suggest ENTITY close to LOCATION but not in the neighborhood of LOCATION. Thank you so much!
37. Does anyone have good ideas of ENTITY far from LOCATION but near LOCATION? Regards.
38. Please suggest ideas of ENTITY in the neighborhood of LOCATION but far from LOCATION.
39. Could anyone advise ENTITY far from LOCATION but not too far from LOCATION?
40. Any nice ideas of ENTITY close to LOCATION but not in the neighborhood of LOCATION. Thanks!
41. Tomorrow, I would be coming to stay at LOCATION. Anyone having ideas of ENTITY close to LOCATION but far from LOCATION?
42. Please propose ENTITY far from LOCATION but not far from LOCATION. I will then be exploring LOCATION.
43. I came from LOCATION this evening. Any nice ideas for ENTITY far from LOCATION but close to LOCATION would be appreciated?
44. Suggest ENTITY close to LOCATION but not near LOCATION. Tomorrow, I will be leaving for LOCATION.
45. Yesterday, I came to stay at LOCATION. Any ideas of ENTITY close to LOCATION but far from LOCATION?
46. Suggestions of ENTITY far from LOCATION but not very far from LOCATION. I will then be moving to LOCATION.
47. I came from LOCATION today. Any good ideas for ENTITY far from LOCATION but near to LOCATION would be welcomed?
48. Advise ENTITY close to LOCATION but not close to LOCATION. I might be leaving for LOCATION soon.

Table 10: Templates used for generating the Toy-dataset
Entity type       Metonyms
R (Restaurant)    a restaurant, an eatery, an eating joint, a cafeteria, an outlet, a coffee shop, a fast food place, a lunch counter, a lunch room, a snack bar, a chop house, a steak house, a pizzeria, a coffee shop, a tea house, a bar room
H (Hotel)         a hotel, an inn, a motel, a guest house, a hostel, a boarding house, a lodge, an auberge, a caravansary, a public house, a tavern, an accommodation, a resort, a youth hostel, a bunk house, a dormitory, a flop house
A (Attraction)    an attraction, a tourist spot, a tourist attraction, a popular wonder, a sightseeing place, a tourist location, a place of tourist interest, a crowd pleaser, a scenic spot, a popular landmark, a monument

Table 11: List of metonyms for each entity type in the Toy-dataset

Models               Acc@3   Acc@5   Acc@30   MRR     Dg
Close to Set X
SPNet w/o DRL        62.60   66.00   79.00    0.608    2.88
SPNet                90.20     —       —        —       —
Far from Set X
SPNet w/o DRL        89.00   90.80   96.40    0.858   15.24
SPNet                  —       —       —        —       —
Combination
SPNet w/o DRL        23.40   28.00   50.60    0.229    9.72
SPNet                52.80   60.20   82.00    0.486    3.90
BERT SPNet w/o DRL   26.80   32.60   59.00    0.242   12.96
BERT SPNet             —       —       —        —       —
Aggregate
SPNet w/o DRL        58.33   61.60   75.33    0.565    9.28
SPNet                80.33   83.80   92.93    0.778    6.21
BERT SPNet w/o DRL   60.40   64.07   79.13    0.579   10.65
BERT SPNet             —       —       —        —       —
Table 12: Results of the spatial-reasoning network on the toy-data test set

correct answer higher than a negatively sampled candidate entity. Model parameters are described in the next section.
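The max-margin objective (with margin 0.5, per Table 17) can be sketched in plain Python as follows; the scoring functions and the negative-sampling procedure are outside this fragment, and the batching is an illustrative simplification.

```python
def hinge(pos_score, neg_score, margin=0.5):
    """Margin ranking loss for one (gold, negative) pair: zero once the gold
    candidate is scored at least `margin` above the negative sample."""
    return max(0.0, margin - (pos_score - neg_score))

def max_margin_loss(pos_scores, neg_scores, margin=0.5):
    """Average hinge loss over paired gold/negative candidate scores."""
    pairs = list(zip(pos_scores, neg_scores))
    return sum(hinge(p, n, margin) for p, n in pairs) / len(pairs)
```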
A.3.1 Results
Similar to Contractor et al. (2019), we also experiment on this dataset by employing a neural method to reduce the search space (Mitra and Craswell, 2019) before using the CRQA textual reasoner to re-rank only the top-30 selected candidates (a pipeline referred to as CSRQA). Unlike CRQA, which uses two levels of attention between question and review sentences to score candidate entities, CSQA does not reason deeply over the text: it compares elements of a question with different parts of a review document to aggregate relevance for scoring, using local and distributed representations to capture lexical and semantic features. We report some experiments using this model, referred to as CSQA, and compare it with CSRQA and spatio-textual CSRQA. As can be seen, re-ranking with SD or SPNet does not help the system. An interesting direction of future work could thus be to augment general-purpose neural-IR methods such as the Duet model used by CSQA with spatial reasoning. Another interesting approach could be to extend ideas from existing graph neural network based approaches, such as NumNet (Ran et al., 2019): each entity could be viewed as a node in a graph for reasoning, but we note that such methods will need to be made more scalable to be useful. The entity space (and thus the nodes in the graph) would run into thousands of nodes per question, making current message-passing based inference methods
Location Questions
Models                 Acc@3   Acc@5   Acc@30   MRR   Dg
CSQA                   15.84   20.26     —       —     —
CSQA → SD              11.34   17.26     —       —     —
CSQA → LocNet           8.38   13.72     —       —     —
CSRQA                  19.89   26.43     —       —     —
Spatio-textual CSRQA   21.45   28.21     —       —     —

Table 16: Comparison of re-ranking models operating on a reduced search space returned by CSQA on (i) Location Questions (ii) Comparison with the current state-of-the-art CSRQA on the full task.

prohibitively expensive.
A.4 Model settings

A.4.1 Experiments on Toy Dataset

The hyperparameters for the best-performing configurations of all models were identified through manual testing on the validation set (Table 17). The models were trained on 2x NVIDIA K40 (12GB, 2880 CUDA cores) GPUs on a shared cluster. The BERT models were trained with a learning rate of 0.0002, whereas the non-BERT models used a learning rate of 0.001.
A.4.2 Spatio-textual Reasoning Network

The hyperparameters for the best-performing configuration were identified through manual testing on the validation set (Table ??). The Spatio-Textual Reasoner was trained on 4 K-80 GPUs on a shared cluster.

Hyperparameter          Value
Negative samples        40
Batch size              20
Optimizer               Adam
Loss                    MarginRankingLoss
Margin                  0.5
Max no. of epochs       15
GRU Input dimension     131
GRU Output dimension    32
DRL Block Layer 1       64 (Input), 64 (Output)
DRL Block Layer 2       64 (Input), 64 (Output)
DRL Block Layer 3       64 (Input), 64 (Output)
DRL Block Layer 4       64 (Input), 1 (Output)

Table 17: Hyperparameter settings for experiments on the toy-dataset
Hyperparameter Value