A Hierarchical Location Normalization System for Text
Dongyun Liang, Guohua Wang, Jing Nie, Binxu Zhai and Xiusen Gu
Tencent, China
{dylanliang, guohuawang, dennisgu}@tencent.com

Abstract
It is natural these days for people to learn about local events from massive documents. Many texts contain location information, such as a city name or road name, which is often incomplete or latent. It is significant to extract the administrative area of a text and organize the hierarchy of that area, a task we call location normalization. Existing location detection systems either exclude hierarchical normalization or cover only a few specific regions. We propose a system named ROIBase that normalizes text by the Chinese hierarchical administrative divisions. ROIBase adopts a co-occurrence constraint as the basic framework to score the hits of administrative areas, achieves inference by special embeddings, and expands recall by ROIs (regions of interest). It has high efficiency and interpretability because it is mainly built on definite knowledge and has less complex logic than supervised models. We demonstrate that ROIBase achieves better performance than feasible alternatives and is useful as a strong support system for location normalization.

Introduction

Every day and in every place, various events are reported in the form of text, and many of these texts do not present hierarchical, standard locations. In context-aware text, location is a fundamental component that supports a wide range of applications. We need to focus on normalizing locations to process massive texts effectively in specific scenarios.
As text streams in social media move quickly in accident or disaster response (Munro, 2011), location normalization is crucial for situational awareness in these fields, where an elliptical writing style often omits redundant content. For example, "十陵立交路段交通拥堵 (Traffic congestion at Shiling Interchange)" refers to a definite location, but there is no indication of where the Shiling Interchange is, so no exact response can be made unless we know it belongs to Longquanyi district, Chengdu city, Sichuan province. (Code: github.com/waterblas/ROIBase-lite)

Countries are divided into different units to manage their land and the affairs of their people more easily. An administrative division (AD) is a portion of a country or other region delineated for the purpose of administration. Due to China's large population and area, the ADs of China have consisted of several levels since ancient times. For clarity and convenience, we cover three levels in our system and treat the largest administrative division of a country as 1st-level, with the next subdivisions as 2nd-level and 3rd-level. These match the provincial level (province, autonomous region, municipality, and special administrative region), prefecture-level city, and county in China, as shown in Table 1. China administers more than 3,200 divisions across these flattened levels.

In such a large and complex hierarchy, much work stops at extracting the relevant locations, such as named entity tagging (Srihari, 2000). There are many similar named entity recognition (NER) toolkits (Che et al., 2010; Finkel et al., 2005) for location extraction. As ambiguity is very high for location names, Li et al. (2002) and Al-Olimat et al. (2017) extend to the disambiguation of extracted locations. We take a step further to extract normalization information and determine which three hierarchical administrative areas a document mainly describes.

The challenges in our location normalization are somewhat different, lying mainly in ambiguity and explicit absence.
For example, there is a duplicate Chaoyang: a 3rd-level district of that name exists in both Beijing and Changchun city, and "Chaoyang" also means the rising sun in Chinese, which may cause ambiguity. If "Beijing" and "Chaoyang" are mentioned in the same context, it is confident that "Chaoyang" should refer to the district of Beijing city. Similarly, Yarowsky (1995) proposes a corpus-based unsupervised approach that avoids the need for costly truthed training data. However, it is common that some contexts lack enough co-occurrence of ADs to disambiguate, or that the explicit information is completely missing. We refer to this as the explicit absence problem, and neither NER nor disambiguation can handle it unless more hidden information is explored. There are many specific AD-related points identifying which division is meant, including:

• Location aliases, e.g. "鹏城 (Pengcheng)" is the alias of Shenzhen city;

• Old or customary titles, e.g. "老闸北 (Old Zhabei)" is a municipal district that once existed in Shanghai city;

• Phrases about spatial region events, e.g. "中国国际徽商大会 (China Huishang Conference)" has been held in Hefei city;

• Some POIs (points of interest), e.g. the well-known "颐和园 (Summer Palace)" is situated in the northwestern suburbs of Beijing.

We summarize these as a concept named ROI, which is both similar to and different from POI. A POI dataset collects specific location points that someone may find useful or interesting; it maps a detailed address that covers the administrative division. However, many POIs only build a uni-directional association with an AD. For example, Bank of China, as a common POI, has branches across China. We can find many Bank of China branches in a specific AD, but if only "Bank of China" appears in a context, we cannot directly confirm its location without more area information. Since POIs are naturally uncertain, we propose the concept of ROI, which has a bi-directional association with an AD. Given an ROI mapping to a fixed hierarchical administrative area, the ROI has high confidence to represent that area, and the area definitely contains it. In the absence of explicit patterns, a co-occurring ROI in the context can be good evidence to predict the most likely administrative area.

Table 1: Structural hierarchy of the administrative divisions and basic-level autonomies of China.

Provincial level (1st) 省级行政区: Province 省, Autonomous region 自治区, Municipality 直辖市, Special administrative region 特别行政区 (part of "one country, two systems").

Prefectural level (2nd) 地级行政区: Prefectural-level city 地级市, Sub-provincial-level city 副省级城市, Autonomous prefecture 自治州, Sub-provincial-level autonomous prefecture 副省级自治州, Prefecture 地区, League 盟, Sub-prefectural-level city 副地级市; under a special administrative region, Region 地区 (informal), Municipality 市 (informal), or the Civic and Municipal Affairs Bureau 民政总署.

County level (3rd) 县级行政区: District 市辖区, County-level city 县级市, County 县, Autonomous county 自治县, Banner 旗, Autonomous banner 自治旗, Special district 特区, Forestry district 林区, Sub-provincial-level new area 副省级市辖新区; under a special administrative region, District 区 or Freguesia 堂区 (informal).
The main contributions of the system, which can be applied to other languages, are as follows:

1. We provide a structured AD database and use co-occurrence constraints to make decisions;

2. ROIBase is equipped with geographic embeddings trained on special location sequences to make inferences;

3. We use a large news corpus to build a knowledge base made up of ROIs, which helps normalization.

We design a web-based online demo to show the location normalization. As shown in Figure 1, there are three cases split by blue lines, and each case mainly contains two components: query and result.

Query: Input the document into the textbox with a green border to query ROIBase. The query accepts Chinese-format sentences, such as text from news or social media.
Result: On the right of the textbox, ROIBase shows the structured result after the query is submitted (demo: http://research.dylra.com/v/roibase/). The result consists of three parts: Confidence, Inference and ROI.

Figure 1: User interface of ROIBase.

Confidence represents the result that can be extracted and identified from explicit information. For example, we have confidence to fill in "新疆 (Xinjiang)" when "尉犁县 (Yuli County)" and "巴音郭楞蒙古自治州 (Bayingol Mongolian Autonomous Prefecture)" come together in a context.

Inference complements the Confidence by embeddings, where the nearest uncertain administrative level is inferred from the implicit information of the input. For example, no explicit administrative area appears in the middle case of Figure 1, so the Inference starts with the 1st level (the largest division) and infers "广东省 (Guangdong Province)". If the Confidence comes up with the 1st level, the Inference starts with the 2nd level. If the Confidence is filled with all three levels, the Inference does nothing and keeps it as before.

ROI is derived from the ROI knowledge base. We match the input against the ROI knowledge base and return the ROI associated with an administrative area when the match succeeds. The types of ROI are many and varied; what they have in common is that each builds a bidirectional relation with a hierarchical AD. As shown in Figure 1, "梧桐山 (Wutong Mountain)", the highest peak in Shenzhen city, maps to three levels: [Yantian district, Shenzhen city, Guangdong province].

When the user queries, the input is segmented into tokens by a Chinese tokenizer. Two processes run in parallel: one calculates the Confidence and then the Inference; the other retrieves the ROI knowledge base. The final result is restructured and returned to the front end in green.
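The parallel query flow described above can be sketched as follows; the three stage functions passed in are stand-ins for the real tokenizer, Confidence/Inference chain, and ROI retrieval, which the paper does not specify as APIs:

```python
from concurrent.futures import ThreadPoolExecutor

def normalize(text, tokenize, confidence_then_inference, retrieve_roi):
    """Query pipeline sketch: tokenize once, then run the Confidence/Inference
    chain and the ROI knowledge-base lookup in parallel, and merge the results."""
    tokens = tokenize(text)
    with ThreadPoolExecutor(max_workers=2) as pool:
        levels = pool.submit(confidence_then_inference, tokens)
        roi = pool.submit(retrieve_roi, tokens)
        return {"levels": levels.result(), "roi": roi.result()}
```

The two `submit` calls overlap the statistical scoring with the knowledge-base retrieval, matching the system's cascaded-but-parallel design.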
We provide an administrative division database, including the names and partial aliases of the administrative areas in China, organized as a hierarchy. Each record is associated with its parent and children; for example, "襄阳市 (Xiangyang city)" is at the 2nd level, its alias is Xiangfan, its parent is Hubei province, and some of its child divisions are Gucheng County, Xiangzhou District, etc. We develop a co-occurrence constraint based on this database to compute the Confidence result, shown in Algorithm 1.
Algorithm 1: Processing Confidence

Input: S, sentences from text
Output: D, hierarchical administrative division

T ← ∅, Q ← {}
foreach word phrase w ∈ S do
    if w hits the AD database then
        expand w to three levels [l1, l2, l3] by the standard AD, and add them into T
foreach hierarchical candidate t ∈ T do
    count the hit number of levels of t in S: Q[t] = CountLevel(S, t)
filter out t ∈ Q when Q[t] < max(Q)
foreach remaining t ∈ Q do
    foreach sentence s ∈ S do
        Q[t] += Count(s, t) / (1 + CountOtherAD(s))
return D = arg max_t Q[t]

Firstly, we expand the possible AD hierarchies as candidates based on the input segments, and keep those with the most confirmed levels for the next calculation. If a sentence is full of various AD information, it is probably just a listing of addresses that carries no signal, such as:

青少年橄榄球天行联赛总决赛在上海森兰体育公园举行。由来自北京、上海、深圳、重庆、贵阳等地的青少年选手组成的... (The youth rugby Tianxing League finals were held at Shanghai Senlan Sports Park, with youth players from Beijing, Shanghai, Shenzhen, Chongqing, Guiyang and other places...)

Here the underlined words are related to administrative areas. The more varied the area-related words are, the less certainty a sentence provides. We consider the frequency of the hits as well as a penalty for other surrounding area-related words, and construct a function to accumulate the weight of each sentence for an AD. Finally, we get the Confidence result based on these explicit statistics.
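The scoring loop of Algorithm 1 can be sketched in Python as below; the toy AD database, the substring-based counting, and the tokenizer are illustrative assumptions rather than the system's actual components:

```python
from collections import defaultdict

# Hypothetical flat AD database: name -> (province, city, county); a level may
# be None when a name only determines part of the hierarchy. Illustrative only.
AD_DB = {
    "北京": ("北京市", None, None),
    "朝阳": ("北京市", "北京市", "朝阳区"),
    "海淀": ("北京市", "北京市", "海淀区"),
    "上海": ("上海市", None, None),
}

def confidence(sentences, tokenize):
    """Score hierarchical AD candidates by co-occurrence, following Algorithm 1."""
    # 1. Expand every AD hit into a full three-level candidate.
    candidates = set()
    for s in sentences:
        for w in tokenize(s):
            if w in AD_DB:
                candidates.add(AD_DB[w])
    # 2. Keep only candidates whose levels are confirmed in the most sentences.
    def level_hits(t):
        names = {w for w in AD_DB if AD_DB[w] == t}
        return sum(any(w in s for w in names) for s in sentences)
    hits = {t: level_hits(t) for t in candidates}
    if not hits:
        return None
    best = max(hits.values())
    survivors = [t for t, h in hits.items() if h == best]
    # 3. Accumulate per-sentence weight, penalized by other AD mentions.
    score = defaultdict(float)
    for t in survivors:
        names = {w for w in AD_DB if AD_DB[w] == t}
        for s in sentences:
            cnt = sum(s.count(w) for w in names)
            other = sum(s.count(w) for w in AD_DB if AD_DB[w] != t)
            score[t] += cnt / (1 + other)
    return max(score, key=score.get)
```

The penalty term `1 + other` in step 3 implements the intuition that a sentence crowded with many different area names (an address listing) should contribute little weight to any single candidate.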
We propose to train geographic embeddings on word sequences related to ADs. Since location information in a document is usually only a small part, the standard AD names are sparse and dispersed, and the words related to geographic locations (hereafter, geographic words) form a long tail and are rarely seen. We do not obtain embeddings directly from the raw word sequences; instead, we assume the raw sequences are made up of records of the AD database, geographic words, and others. To keep the former two, we pass through a large news corpus of more than 14.3 million documents, take every phrase of a news sentence that hits an AD record as a starting point, use a NER toolkit to recognize the location entities in the surrounding two sentences, and extract, preserving order, candidate sequences consisting of standard AD records and location entities. The NER model is not extremely accurate in this setting, and various types of location-related phrases are generically recognized. We collect the candidate sequences longer than a threshold length to train the geographic embeddings.

Given a set S of candidate sequences extracted from documents, each sequence s = (w_1, ..., w_m) ∈ S is made up of AD records and location entities, where the relative order of elements in s stays the same as in the raw text. The aim is to learn a d-dimensional real-valued embedding v_{w_i} of each word w_i, so that administrative areas and geographic words lie in the same embedding space, and adjacent administrative areas lie nearby in that space.
We learn the embeddings using the skip-gram model (Mikolov et al., 2013) by maximizing the objective function L over the set S, defined as follows:

L = \sum_{s \in S} \sum_{w_i \in s} \sum_{-n \le j \le n, j \ne 0} \log P(w_{i+j} \mid w_i)

P(w_{i+j} \mid w_i) = \frac{\exp(v_{w_i}^{\top} v'_{w_{i+j}})}{\sum_{w=1}^{|V|} \exp(v_{w_i}^{\top} v'_{w})}

where v and v' are the input and output vectors, n is the size of the sequence window, and V is the vocabulary consisting of the administrative areas and geographic words.

Figure 2: The clustering distribution of geographic embeddings over administrative areas.
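In practice the embeddings can be trained with any off-the-shelf skip-gram implementation (e.g. gensim's Word2Vec with sg=1). As a minimal sketch of the objective above, the following enumerates the (center, context) pairs inside a window of size n, the pairs whose log-probabilities L sums, and evaluates the skip-gram softmax for toy vectors:

```python
import numpy as np

def skipgram_pairs(sequences, n=2):
    """Enumerate (center, context) pairs with context offset j in [-n, n], j != 0:
    exactly the pairs whose log-probabilities the objective L sums over."""
    pairs = []
    for seq in sequences:
        for i, w in enumerate(seq):
            for j in range(max(0, i - n), min(len(seq), i + n + 1)):
                if j != i:
                    pairs.append((w, seq[j]))
    return pairs

def sg_prob(v, v_out, i, j):
    """P(w_j | w_i) under the skip-gram softmax; v holds input vectors and
    v_out holds output vectors, rows indexed by vocabulary id."""
    logits = v_out @ v[i]
    e = np.exp(logits - logits.max())  # numerically stabilized softmax
    return float(e[j] / e.sum())
```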
To evaluate whether regional characteristics are captured by the geographic embeddings, we design a visualization. First, we perform k-means clustering on the learned embeddings of the records in the AD database, clustering 4,000+ standard ADs into 100 clusters, and then plot the points on a map of China with division borders, where different colors represent different clusters and the coordinates are the rough locations of the standard ADs themselves. As shown in Figure 2, points in the same cluster are mainly located in the same administrative area, which means that geographic similarity is well encoded.
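The clustering step can be sketched with a minimal k-means over the embedding matrix; this plain-numpy version is a stand-in for any standard implementation (in the system, the 4,000+ AD embeddings are clustered into 100 groups before plotting):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: random initial centers, then alternate assignment
    and center update for a fixed number of iterations."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(axis=0)
    return labels
```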
Based on the Confidence result, we utilize the geographic embeddings trained above to infer the next administrative level. We first take the intersection of the input text and the geographic word vocabulary V, and average the embeddings of the intersection at each dimension as the representation of the input, v_input. Then we embed the subdivisions of the Confidence result's latest level as candidates. For example, if the Confidence ends at the 2nd level, denoted [l_1, l_2], the embeddings of the subdivisions of l_2 can be denoted v_{l_i}, i = 1, ..., k, where k is the number of subdivisions of l_2. We observe that the cosine similarity between the correct candidate and the input representation is often higher than for the other candidates. We make the Inference by

\arg\max_{l_i} \mathrm{Cosine}(v_{l_i}, v_{input})

as the complement of the Confidence.

Since embeddings are implicit, we build an ROI knowledge base to improve interpretability and reduce the bias of
Inference. Unlike traditional taxonomies that require a lot of manual labor, we propose a novel method to extract ROIs from a large corpus, using statistics to model the inconsistent, ambiguous and uncertain information it contains. Given the geographic sequences s of section 3.2, where w̄ ∈ s is a geographic word, we assume that the administrative area appearing most frequently in the window of the geographic word probably corresponds to its division. In fact, some administrative area records appear more frequently in general, such as Beijing, Shanghai and other big cities. We count how often the pair (w̄, w_i) appears in S, where w_i represents an administrative area name, and offset it by the total count of w_i in the whole corpus. Therefore, a tf-idf-like weighting scheme is applied to balance the exact division:

\mathrm{score}(\bar{w}, w_i) = \mathrm{Count}(\bar{w}, w_i) \times \mathrm{IDF}(w_i)

where Count denotes the co-occurrence count of w̄ and w_i within each geographic sequence, and IDF denotes the inverse document frequency of w_i over all sequences in S.

We score each pair (w̄, w_i) and filter the valid pairs by a high threshold. Then a sorted mapping {w̄ | (w_1, g_1), ..., (w_t, g_t)} is obtained for each w̄, where g_i denotes the score weight; the higher g_i ranks further ahead. It is noteworthy that a geographic word is not necessarily an ROI. We use information entropy to filter the valid candidates:

E(\bar{w}) = -\sum_i P_i \log P_i, \quad P_i = \frac{g_i}{\sum_{j=1}^{t} g_j}

If w̄ cannot represent an administrative area, the weights of its candidate mappings will be dispersed: the higher E(w̄) is, the less certain the mapping is. We cut off high E(w̄) to keep the candidate ROIs.

For a specific candidate ROI, it is common that the upper level of the mapping has a higher frequency than the lower level in the news corpus. For example, the co-occurrence of Summer Palace and
Haidian, and Haidian district is a subdivision of Beijing city. We use the subdivision relation to correct the weight of w_i when w_j is the parent division of w_i, where j < i:

g_i = g_i / P(w_i \mid w_j, \neg w_i, s)

P(w_i \mid w_j, \neg w_i, s) = \frac{\sum_{s \in S} H(w_i \cap f(s))}{\sum_{s \in S} H(w_j \cap \neg w_i \cap s)}

where ¬w_i denotes the absence of w_i, P(w_i | w_j, ¬w_i, s) denotes the probability that only w_j appears in s but the sequence actually belongs to w_i, f(s) denotes the sequences in the same document excluding s, and H is the Heaviside step function.

We sort the mapping again by this re-weighting scheme and take the top few pairs, which are on the same order of magnitude, to compose ROI pairs (w̄, <l_1, l_2, l_3>), where l_1, l_2, l_3 represent the three levels of AD and a level is set to null if it is missing. Finally, the pairs are inserted into an Elasticsearch engine to build the knowledge base.

There are no publicly available datasets for text location normalization, and thus no directly comparable methods. As many similar location detection schemes start from NER, we build NER+pattern as a baseline, which uses NER to recognize locations and retrieves the AD database. We conduct experiments on news and Weibo (a Chinese social medium) corpora. News contains a title and content; the title is usually short and cohesive, and the content usually has hundreds of words with more location information, where the variation lies in redundancy and efficiency. The Weibo corpus is short-text, and its location information is usually implicit.

We manually sample finance and social news, and obtain 760 news articles that can be assigned to a definite place to build the news dataset. Likewise, 1,228 short texts are picked from the Weibo corpus. Location information is extracted by ROIBase and NER (Che et al., 2010)+pattern respectively on these datasets.
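The two implicit components described in the preceding sections, the embedding-based Inference and the ROI score with its entropy filter, can be sketched as follows; the embeddings, thresholds and example data are illustrative assumptions, not the system's:

```python
import math
import numpy as np
from collections import Counter

def infer_next_level(input_words, emb, candidates):
    """Embedding Inference: average the geographic-word embeddings found in the
    input, then return arg max over candidates of Cosine(v_candidate, v_input)."""
    vecs = [emb[w] for w in input_words if w in emb]
    if not vecs:
        return None
    v_in = np.mean(vecs, axis=0)
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(candidates, key=lambda l: cos(emb[l], v_in))

def roi_scores(sequences, geo_word, ad_names):
    """score(w, a) = Count(w, a) x IDF(a), counted per geographic sequence."""
    n = len(sequences)
    df, co = Counter(), Counter()
    for seq in sequences:
        present = set(seq) & set(ad_names)
        df.update(present)                 # document frequency of each AD name
        if geo_word in seq:
            co.update(present)             # co-occurrence with the geographic word
    return {a: co[a] * math.log(n / df[a]) for a in co}

def entropy(scores):
    """High entropy means the mapping is dispersed: the word is not a valid ROI."""
    total = sum(scores.values())
    ps = [g / total for g in scores.values() if g > 0]
    return -sum(p * math.log(p) for p in ps)
```

A geographic word whose score mass concentrates on one administrative area passes the entropy cutoff and becomes an ROI pair; a word like "Bank of China", spread over many areas, is rejected.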
As Table 2 shows with example results, NER+pattern matching alone cannot utilize the hidden information to completely normalize the locations. ROIBase contains 1.51 million geographic embeddings and 0.42 million ROIs, so it recovers more links to ADs from the underlined phrases.

Table 2: Examples of location extraction by ROIBase and NER+pattern.

news:
- ROIBase: -, 呼和浩特市, 内蒙古自治区 | NER+pattern: 内蒙古 | Text: 内蒙古大兴安岭原始林区雷击火蔓延... (Lightning fire spreads in the virgin forest area of the Greater Xing'an Mountains, Inner Mongolia...)
- ROIBase: -, 深圳市, 广东省 | NER+pattern: - | Text: 日前, 华为基地启用了无人机送餐业务... (A few days ago, the Huawei base launched a drone food delivery business...)
- ROIBase: 双流区, 成都市, 四川省 | NER+pattern: 海口 | Text: 四川航空成都至海口航班...安全落地成都双流国际机场... (Sichuan Airlines flight 3u8751 from Chengdu to Haikou returned and landed safely at Chengdu Shuangliu International Airport...)

Weibo:
- ROIBase: -, 丽江市, 云南省 | NER+pattern: - | Text: 拍不出泸沽湖万分之一的美 这个时节少了喧嚣多了闲适 (Can't capture one ten-thousandth of the beauty of Lugu Lake...)
- ROIBase: -, 武汉市, 湖北省 | NER+pattern: 湖北 | Text: 湖北经济学院学生爆料质疑校园联通宽带垄断性经营 (Students from Hubei University of Economics questioned the campus Unicom broadband monopoly...)

A variant of the F1 score is used to measure performance, which counts an incomplete output as a 0.5 hit.

Table 3: F1 score on the two datasets.

METHOD      | news  | Weibo
ROIBase     | 0.812 | 0.780
NER+pattern | 0.525 | 0.582

As shown in Table 3, ROIBase achieves better performance than NER with AD patterns by large margins. Some Weibo texts carry a location label, which contributes to the recognition of AD patterns and narrows the gap with us. Long texts provide more abundant information, and ROIBase can eliminate confusion to improve performance.

Table 4: ROIBase statistics on 100,000 news.

total | 1st | 2nd   | 3rd   | speed
36.8% | 23% | 48.7% | 28.3% | 751KB/s

We run ROIBase over 100 thousand news articles from the financial and social domains and collect detailed statistics. As shown in Table 4, we can normalize locations for 36.8 percent of the articles in general. Among them, 23 percent are normalized only at the 1st level, 48.7 percent at the 2nd level, and 28.3 percent with complete divisions. We measure speed on a machine with a Xeon 2.0GHz CPU and 4GB memory: ROIBase runs at up to 751KB/s, whereas the NER method (Che et al., 2010) processes 14.4KB/s. ROIBase lets the user process vast amounts of long text for location normalization.
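One plausible reading of the 0.5-hit F1 variant, assuming each prediction and gold label is a (province, city, county) triple with None marking a missing level (the paper does not spell out the exact counting), is:

```python
def f1_half_credit(preds, golds):
    """F1 variant where a partially filled but correct prediction counts
    as a 0.5 hit instead of a full hit. Interpretation is an assumption."""
    tp, n_pred, n_gold = 0.0, 0, 0
    for p, g in zip(preds, golds):
        n_gold += 1
        if p is None or all(x is None for x in p):
            continue  # empty output contributes nothing to precision
        n_pred += 1
        # Correct if every level the system did fill matches the gold label.
        correct = all(px == gx for px, gx in zip(p, g) if px is not None)
        if correct:
            tp += 1.0 if all(x is not None for x in p) else 0.5
    prec = tp / n_pred if n_pred else 0.0
    rec = tp / n_gold if n_gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```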
Zubiaga et al. (2017) make use of eight tweet-inherent features for classification at the country level. Qian et al. (2017) formalize the inference of social media locations as a semi-supervised factor graph model and operate at the level of countries and provinces. A hierarchical location prediction neural network (Huang and Carley, 2019) has been presented for user geolocation on Twitter. However, many of these approaches focus on a single level, cover only a few countries or states, or utilize extra features beyond text, and there is room for improvement in performance. Since Mikolov et al. (2013) proposed the word vector technique, there have been many applications. Grbovic and Cheng (2018) introduce listing and user embeddings trained on bookings to capture users' real-time and long-term interests. Wu et al. (2012) demonstrate that a taxonomy knowledge base can be constructed from the entire web with special patterns. Inspired by these cases, we present the first solution that normalizes the location of text by hierarchical administrative areas.
Through our investigation, we found that there is very little work on location normalization of text, and that popular related solutions, such as NER, are not directly transferable to it. The ROIBase system provides an efficient and interpretable solution to location normalization through a web interface, processing its modules with a cascaded mechanism. We propose it as a baseline that can be applied to different languages easily, and look forward to more work on improving location normalization.

References
Hussein S. Al-Olimat, Krishnaprasad Thirunarayan, Valerie Shalin, and Amit Sheth. 2017. Location name extraction from targeted text streams using gazetteer-based statistical language models. arXiv preprint arXiv:1708.03105.

Wanxiang Che, Zhenghua Li, and Ting Liu. 2010. LTP: A Chinese language technology platform. In Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations, pages 13-16. Association for Computational Linguistics.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 363-370. Association for Computational Linguistics.

Mihajlo Grbovic and Haibin Cheng. 2018. Real-time personalization using embeddings for search ranking at Airbnb. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 311-320. ACM.

Binxuan Huang and Kathleen Carley. 2019. A hierarchical location prediction neural network for Twitter user geolocation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4731-4741. Association for Computational Linguistics.

Huifeng Li, Rohini K. Srihari, Cheng Niu, and Wei Li. 2002. Location normalization for information extraction. In Proceedings of the 19th International Conference on Computational Linguistics, Volume 1, pages 1-7. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119.

Robert Munro. 2011. Subword and spatiotemporal models for identifying actionable information in Haitian Kreyol. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 68-77. Association for Computational Linguistics.

Yujie Qian, Jie Tang, Zhilin Yang, Binxuan Huang, Wei Wei, and Kathleen M. Carley. 2017. A probabilistic framework for location inference from social media. arXiv preprint arXiv:1702.07281.

Rohini Srihari. 2000. A hybrid approach for named entity and sub-type tagging. In Sixth Applied Natural Language Processing Conference, pages 247-254.

Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q. Zhu. 2012. Probase: A probabilistic taxonomy for text understanding. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 481-492. ACM.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. Pages 189-196.

Arkaitz Zubiaga, Alex Voss, Rob Procter, Maria Liakata, Bo Wang, and Adam Tsakalidis. 2017. Towards real-time, country-level location classification of worldwide tweets.