A Concept Knowledge-Driven Keywords Retrieval Framework for Sponsored Search
Yijiang Lian, Yubo Liu, Zhicong Ye, Liang Yuan, Yanfeng Zhu, Min Zhao, Jianyi Cheng, Xinwei Feng
Yijiang Lian [email protected]
Yubo Liu [email protected]
Zhicong Ye [email protected]
Liang Yuan [email protected]
Yanfeng Zhu [email protected]
Min Zhao [email protected]
Jianyi Cheng [email protected]
Xinwei Feng [email protected]
ABSTRACT
In sponsored search, retrieving synonymous keywords for the exact match type is important for accurately targeted advertising. Data-driven deep learning-based methods have been proposed to tackle this problem. An apparent disadvantage of these methods is their poor generalization performance on entity-level long-tail instances, even though such instances might share similar concept-level patterns with frequent instances. With the help of a large knowledge base, we find that most commercial synonymous query-keyword pairs can be abstracted into meaningful conceptual patterns through concept tagging. Based on this fact, we propose a novel knowledge-driven conceptual retrieval framework to mitigate this problem, which consists of three parts: data conceptualization, matching via conceptual patterns, and concept-augmented discrimination. Both offline and online experiments show that our method is very effective. This framework has been successfully applied to Baidu's sponsored search system, yielding a significant improvement in revenue.
KEYWORDS
Sponsored Search; Keyword Matching; Keyword Retrieval; Paraphrase Pattern; Knowledge Graph
1 INTRODUCTION
Sponsored search advertising refers to the placement of ads on search result pages, above or next to organic search results. As these sponsored ads directly target the user's query intention, they usually have a much higher conversion ratio. In the past few years, search advertising has become one of the most popular forms of digital advertising worldwide.

One of the most critical modules of a sponsored search system is the keyword matching module, which matches users' queries to advertisers' bidding keywords (keyword is used here to denote the queries purchased by advertisers). Mainstream search engine companies provide a structured bidding language with which advertisers can specify how their purchased keywords should be matched to online queries. Generally, three match types are provided: exact, phrase, and broad (https://support.google.com/google-ads/answer/7478529?hl=en). Under the exact match type, ads are eligible to appear when a user searches for the specific keyword or its synonymous variants. For the phrase match type, the matched queries should contain the keyword or a synonymous variant of the keyword. The broad match type further relaxes the matching restriction to the level of semantic relevance. Because it targets precise traffic, most customers prefer the exact match type, which accounts for a large portion of keyword revenue in most search engine companies. In this article, we focus on the problem of synonymous matching under the exact match type.

Since the synonymous query-keyword relationship is quite scarce, while the volumes of industrial queries and keywords are extremely large, the traditional Boolean retrieval framework, which works well in broad match scenarios, is quite inefficient in this exact match scenario [13]. Recently, data-driven deep learning methods have been applied in this scenario [13], where a translation model is trained on a high-quality paraphrasing dataset and is used to generalize and link more synonymous queries and keywords.

One problem of the data-driven deep learning-based approach is its poor generalization performance on long-tail instances [9]. Considering that nearly 70% of queries are entity related [7], we especially focus on entity-level long-tail cases. For example, if double-fold eyelid operation and
Los Angeles are commonly observed in the training data, the translation model can generate lots of high-quality paraphrases for queries like the price of double-fold eyelid operation in Los Angeles; however, it would fail on queries like the price of liposuction in Denver if liposuction and Denver are rare in the training data, even though the two queries share the same abstract conceptual pattern the price of [aesthetic surgery] in [location]. The same is true for the discriminant model. With a limited amount of synonymous training data in hand and a huge number of queries to be addressed, long-tail cases are regularly encountered by industrial models. Making our retrieval system robust on these long-tail cases is therefore an important and urgent problem.

Conceptual abstraction and reasoning are common in the human learning process. Given a synonymous training instance (how much does double-fold eyelid operation cost in Los Angeles = the price of double-fold eyelid operation in Los Angeles), if we know that double-fold eyelid operation belongs to the concept [aesthetic surgery] and Los Angeles belongs to the concept [location], we can easily abstract the original paraphrase instance into a conceptual pattern form (how much does [aesthetic surgery] cost in [location] = the price of [aesthetic surgery] in [location]). When a long-tail query like the price of liposuction in Denver comes, if we know that liposuction belongs to [aesthetic surgery] and has an alias "lipo", we can effortlessly infer a new paraphrase instance: (how much does lipo cost in Denver = the price of liposuction in Denver). In this way, the paraphrasing capability is transferred onto the long-tail instances.

Inspired by this idea, we propose a concept knowledge-driven keyword retrieval framework, which comprises three phases: data conceptualization, matching via conceptual patterns, and concept-augmented discrimination. In the first phase, data conceptualization, the original paraphrasing data is transformed into a conceptual pattern form by concept tagging, where each token in the sentence is labeled with a concept label. Core commercial concepts in the pattern are selected for further generalization and the others are ignored. Secondly, a deep neural translation model is trained directly on this conceptualized data to capture the variations of synonymous pattern expressions. This model is used to link query patterns with keyword patterns. Finally, a concept-augmented discrimination model is utilized to filter out nonsynonymous cases.

Our offline experiments revealed that this conceptual retrieval framework can significantly boost retrieval performance for long-tail cases. Besides, this framework has been successfully applied in Baidu's sponsored search retrieval system, yielding remarkable revenue growth.

2 RELATED WORK
Patterns are widely used in task-oriented dialogue systems [3][11][1]. Given an utterance, the natural language understanding module of a dialogue system parses it into semantic entity types (known as slots) and determines the user intents. Usually, slots and intents are pre-defined in a system. Patterns have also been studied in paraphrase generation and identification tasks [14][20][19][24]. [19] describes a syntax-based algorithm that automatically builds paraphrase patterns from semantically equivalent translation sets. [29] proposed a pivot approach for extracting paraphrase patterns from bilingual parallel corpora.

Knowledge graphs are commonly used in information retrieval systems [27][16][6] and recommendation systems [26][22][25][2]. Microsoft built its Concept Graph, named Probase [28], to optimize web search and document understanding services. [18] used Resource Description Framework triples to compute thematic similarity for information retrieval. The Wikidata knowledge graph [8] was applied to Amazon's virtual assistant Alexa to offer better answers to factual questions. A large-scale cognitive concept net was constructed by [17] to better define user needs and offer a more intelligent shopping experience on their e-commerce platform. [15] utilized query logs and a search click graph to discover user-centered concepts at a more precise granularity that can represent users' interests.
3 METHOD
Our method consists of three parts: data conceptualization, matching via conceptual patterns, and concept-augmented discrimination.
3.1 Data Conceptualization
In this step, the original query/keyword is transformed into a conceptual pattern form based on concept tagging, which means assigning each token in the sentence a domain concept label. Different from the traditional named entity recognition (NER) task, which focuses on recognizing entities in a small number of categories such as organizations and locations, our concept tagging task covers entities and function words from all domains. The taxonomy is constructed based on Baidu's Knowledge Graph, which covers 5 billion entities and 550 billion facts. To expand a conceptual pattern's application scope, we prefer coarse concepts over refined concepts. For example, the entity double eyelid surgery belongs to the refined concept [eye plastic], which has a coarse hypernym concept named [aesthetic surgery]. Considering that most eye plastic related paraphrase patterns can also be applied in aesthetic surgery scenarios, the coarse concept [aesthetic surgery] is selected.
Figure 1: The process of commercially customized concept tagging.
As shown in Figure 1, the concept tagging procedure consists of three steps:
• Firstly, each token in the sentence is assigned a concept label based on concept tagging.
• Secondly, core concepts are selected for further generalization and the remaining concepts are neglected. The core concept set is a manually selected subset of the ontology, which focuses on commercial entity related categories such as education, traveling, cosmetology, food, and so on, while function words like adverbs and prepositions are overlooked.
• Finally, the original sentence is extracted into a pattern form mixed with concept slots and normal text, and the slots' corresponding entities are represented as slot values.
Therefore, the query in the figure, how much does liposuction cost in Denver, will be conceptualized as how much does [aesthetic surgery] cost in [location] with slot values [aesthetic surgery: liposuction] and [location: Denver]. A minimal sketch of this conceptualization step is given below.
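To make the step concrete, here is a minimal Python sketch of the conceptualization; the concept tagger is assumed to be available as a black box that labels every token, and the CORE_CONCEPTS set and the function names are illustrative rather than part of the production system.

```python
from typing import Dict, List, Tuple

# Hypothetical set of commercial core concepts kept for generalization;
# all other labels (function words, adverbs, etc.) are ignored.
CORE_CONCEPTS = {"aesthetic surgery", "location"}

def conceptualize(tokens: List[str], labels: List[str]) -> Tuple[str, Dict[str, str]]:
    """Turn a concept-tagged sentence into a conceptual pattern plus slot values.

    tokens: word-level tokens of the query/keyword.
    labels: one concept label per token ("O" for tokens without a concept).
    Returns the conceptual pattern and a slot -> entity mapping.
    """
    pattern_tokens: List[str] = []
    slot_values: Dict[str, str] = {}
    i = 0
    while i < len(tokens):
        label = labels[i]
        if label in CORE_CONCEPTS:
            # Merge the contiguous span carrying the same core-concept label into one slot.
            j = i
            while j < len(tokens) and labels[j] == label:
                j += 1
            slot = f"[{label}]"
            pattern_tokens.append(slot)
            slot_values[slot] = " ".join(tokens[i:j])
            i = j
        else:
            # Non-core tokens are kept verbatim in the pattern.
            pattern_tokens.append(tokens[i])
            i += 1
    return " ".join(pattern_tokens), slot_values

# Example from the paper; in practice the labels come from the concept tagger.
tokens = "how much does liposuction cost in Denver".split()
labels = ["O", "O", "O", "aesthetic surgery", "O", "O", "location"]
pattern, slots = conceptualize(tokens, labels)
# pattern -> "how much does [aesthetic surgery] cost in [location]"
# slots   -> {"[aesthetic surgery]": "liposuction", "[location]": "Denver"}
```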
Figure 2: The specific query-keyword matching procedures.
3.2 Matching via Conceptual Patterns
The basic idea of our framework is to transform queries and keywords into conceptual patterns and to conduct the synonymous query-keyword matching through conceptual pattern matching.
The matching between query patterns and keyword patterns is based on a pattern-to-pattern translation model. Given a high-quality paraphrase dataset, data conceptualization is performed on both sides to obtain parallel pattern training data. For example, the original instance (how much does double eyelid surgery cost in Los Angeles = the price of double-fold eyelid operation in Los Angeles) is changed into (the price of [Aesthetic Surgery] in [Location] = how much does [Aesthetic Surgery] cost in [Location]). For precision, we further clean the conceptualized parallel pattern data with the requirement that the left pattern must be strictly aligned with the right pattern: not only the number of concept slots but also the corresponding entities should be the same.

A transformer-based translation model is then trained on the conceptualized parallel data. In detail, concept slots are treated as normal tokens and added to the model's vocabulary. The structure of the model follows the common sequence-to-sequence framework [23], where the encoder encodes the source sentence into a list of hidden states and the decoder generates words one by one until a special end symbol is produced.

When an ad-hoc query comes, the following four steps are carried out to generate its synonymous candidates.

The first step is query conceptualization. It follows the same procedure as the data conceptualization described above.

The second step is pattern matching. The pattern translation model is utilized to link the query's pattern with the keyword pattern repository, which is a conceptualized version of the original keyword repository. A prefix-tree-based targeted decoding trick [12] is utilized to ensure that all the decoded hypotheses are valid keyword patterns.

The third step is keyword pattern instantiation. The concept slots in the previously retrieved keyword patterns are replaced with the query's corresponding entities. To increase recall, the entities' aliases from the knowledge database are also utilized. Since the instantiated patterns might not be real keywords, we further join them with the original keyword repository to remove illegal results.

The final step is synonymous keyword expansion. Considering that there are also a number of synonymous relationships which cannot be strictly expressed as aligned pattern forms, a synonymous expanding process is performed at the end. In detail, synonymous keywords are clustered in advance by utilizing the keyword-to-keyword synonymous retrieval method [13]. If a previously retrieved keyword belongs to a keyword cluster, the whole keyword cluster is merged into the final candidate queue.

The whole matching process is illustrated in Figure 2. When an ad-hoc query How much does liposuction cost in New York arrives, it is first conceptualized as
How much does [Aesthetic Surgery] cost in [Location] with slot values [Aesthetic Surgery: liposuction] and [Location: New York]; entities' aliases such as liposuction (lipo) are fetched from the knowledge database in the meanwhile. Then the pattern translation model is utilized to link the query's pattern with the conceptual keyword repository, and the following two synonymous keyword patterns are produced: The price of [Aesthetic Surgery] in [Location] and What is the price of [Aesthetic Surgery] in [Location]. In the next step, the concept slots in these keyword patterns are replaced with the corresponding query entities and their aliases, which generates three sentences: The price of liposuction in New York, The price of lipo in New York, and What is the price of liposuction in New York. Then, we join these sentences with the real keyword repository to remove invalid keywords; What is the price of liposuction in New York is removed in this example. Finally, synonymous keyword expansion is performed.
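Putting the four online steps together, a highly simplified sketch is shown below; the concept tagger, the constrained pattern translator, the alias table, the keyword repository, and the synonym clusters are all stand-ins for the production components described above (in particular, the real system performs prefix-tree constrained beam search over the conceptualized keyword repository).

```python
from typing import Callable, Dict, List, Set

def retrieve_keywords(
    query: str,
    conceptualize: Callable,        # query -> (pattern, {slot: entity}), cf. the conceptualize() sketch above
    translate: Callable,            # pattern -> list of keyword patterns (trie-constrained beam search)
    aliases: Dict[str, List[str]],  # entity -> known aliases from the knowledge graph
    keyword_repo: Set[str],         # the original (non-conceptualized) keyword repository
    synonym_clusters: Dict[str, Set[str]],  # keyword -> its pre-computed synonym cluster
) -> Set[str]:
    # Step 1: query conceptualization.
    pattern, slots = conceptualize(query)

    # Step 2: pattern matching; only valid keyword patterns are decoded.
    keyword_patterns = translate(pattern)

    # Step 3: keyword pattern instantiation with the query's entities and their aliases.
    candidates: Set[str] = set()
    for kw_pattern in keyword_patterns:
        fillings = [kw_pattern]
        for slot, entity in slots.items():
            expanded = []
            for text in fillings:
                for surface in [entity] + aliases.get(entity, []):
                    expanded.append(text.replace(slot, surface))
            fillings = expanded
        candidates.update(fillings)

    # Instantiated patterns may not be real keywords: join with the repository.
    candidates &= keyword_repo

    # Step 4: synonymous keyword expansion via the pre-computed clusters.
    expanded = set(candidates)
    for kw in candidates:
        expanded |= synonym_clusters.get(kw, set())
    return expanded
```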
3.3 Concept-Augmented Discrimination
To guarantee the synonymous quality of the final query-keyword pairs, an end-to-end concept-augmented discrimination is performed based on a domain fine-tuned BERT [5] model.

Since the discriminant model also faces the long-tail instance generalization problem, using it directly might misjudge lots of conceptually retrieved cases. Therefore, we augment the fine-tuning data by replacing entities with same-concept entities. In detail, for the original synonymous query-keyword instances, we replace the aligned concept slot of the query and the keyword with the same rare entity to get augmented positive cases, and we replace them with different rare entities to get augmented negative cases, where literally confusable entities are particularly used. For the original negative query-keyword instances, with a probability of 50% we replace the aligned concept slots with the same rare entity, and with a probability of 50% we replace them with different rare entities. Offline experiments show that this method greatly improves the model's robustness for long-tail instances.
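A minimal sketch of this augmentation procedure is given below; the rare-entity pool, the confusable-entity lookup, and the slot representation are illustrative assumptions rather than the exact production implementation.

```python
import random
from typing import Dict, List, Tuple

def augment_pair(
    query_pattern: str,
    keyword_pattern: str,
    slot: str,                         # an aligned concept slot, e.g. "[aesthetic surgery]"
    rare_entities: List[str],          # long-tail entities under this concept (assumed len >= 2)
    confusable: Dict[str, List[str]],  # entity -> literally confusable entities of the same concept
    is_positive: bool,
) -> List[Tuple[str, str, int]]:
    """Create augmented (query, keyword, label) pairs from one original instance."""
    out = []
    if is_positive:
        # Same rare entity on both sides -> still synonymous (label 1).
        e = random.choice(rare_entities)
        out.append((query_pattern.replace(slot, e), keyword_pattern.replace(slot, e), 1))
        # Different (and preferably literally confusable) entities -> not synonymous (label 0).
        e_q = random.choice(rare_entities)
        e_k = random.choice(confusable.get(e_q, [x for x in rare_entities if x != e_q]))
        out.append((query_pattern.replace(slot, e_q), keyword_pattern.replace(slot, e_k), 0))
    else:
        # Original negatives keep label 0; replace the slots with the same rare entity
        # or with different rare entities, each with probability 50%.
        if random.random() < 0.5:
            e_q = e_k = random.choice(rare_entities)
        else:
            e_q, e_k = random.sample(rare_entities, 2)
        out.append((query_pattern.replace(slot, e_q), keyword_pattern.replace(slot, e_k), 0))
    return out
```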
4 EXPERIMENTS
Data conceptualization is performed on one day's query-keyword exact matching weblog, and 22 million paraphrasing pattern pairs $\widetilde{D}^{gen}$ are sampled from it. This pattern pair dataset is split into two parts, $\widetilde{D}^{gen}_{train}$ for training and $\widetilde{D}^{gen}_{dev}$ for development, and the corresponding original datasets are denoted as $D^{gen}_{train}$ and $D^{gen}_{dev}$. The baseline and the conceptual translation models $M^{gen}$ and $\widetilde{M}^{gen}$ are separately trained on $D^{gen}_{train}$ and $\widetilde{D}^{gen}_{train}$. To evaluate the translation model's generalization ability on entity-level long-tail cases, we construct a query test set $D^{gen}_{test}$ by joining all the queries in $D^{gen}_{train}$ with one week's query weblog based on their contained entities. For simplicity, we only consider queries having one entity. Denoting a query's entity frequency in $D^{gen}_{train}$ as $f$, queries in $D^{gen}_{test}$ are grouped into four buckets according to four frequency intervals.

The translation model's word embeddings are randomly initialized. The vocabulary contains 200,000 frequent words and 284 concept slots. The word embedding dimension and the number of hidden units are set to 512. The maximum sequence length is limited to 12 at the word level. Both the encoder and the decoder are implemented with transformers having 4 layers and 8 heads. The model's cross-entropy loss is minimized by Adam [10] with a batch size of 64.

Setting the beam size to 50, $M^{gen}$ and $\widetilde{M}^{gen}$ are used to decode the queries in $D^{gen}_{test}$. For $\widetilde{M}^{gen}$, the concept slots in the decoded patterns are replaced with the original slot values of the query. For each bucket in $D^{gen}_{test}$, we sample 500 generated cases for human synonymous binary evaluation. As shown in Table 1, $\widetilde{M}^{gen}$ performs much better than $M^{gen}$, outperforming it by 26 absolute points on the long-tail query dataset $D^{gen}_{test}$. We can also see that the raw translation model's performance is significantly influenced by the entity frequency in the training dataset. In contrast, $\widetilde{M}^{gen}$'s performance is insensitive to the entity frequency.

Table 1: Accuracy of the two translation models on $D^{gen}_{test}$. The frequency denotes the query's entity occurrence frequency in $D^{gen}_{train}$.

$D^{dis}$ is sampled from the sponsored matching weblog for human synonymous evaluation, covering all three match types, exact match, phrase match, and broad match, with a proportion ratio of 2:1:1. $D^{dis}$ is further split into three parts: $D^{dis}_{train}$ for training, $D^{dis}_{dev}$ for development, and $D^{dis}_{test}$ for testing. The concept augmentation procedure described in subsection 3.3 is then performed on $D^{dis}_{train}$; 55,200 augmented cases are sampled, which accounts for 12% of $D^{dis}_{train}$, and merged with $D^{dis}_{train}$ to get $\hat{D}^{dis}_{train}$. $D^{dis}_{test}$ can be considered a test set from a global perspective; however, since we care more about the discriminant model's generalization ability on conceptually retrieved long-tail cases, another test set $\bar{D}^{dis}_{test}$ is constructed by sampling 1,000 cases from $\widetilde{M}^{gen}$'s generated cases on $D^{gen}_{test}$ for human evaluation.

The paraphrase discriminant model is a binary classifier implemented with BERT. It takes a query-keyword pair separated by a special token as input and predicts 1 if the pair is synonymous and 0 otherwise.
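As an illustration of this pair-wise input, the following library-free sketch packs a query and a keyword into a single BERT-style sequence; the special-token names follow the usual BERT convention, and the whitespace tokenizer merely stands in for the real subword tokenizer.

```python
from typing import Callable, List

def encode_pair(query: str, keyword: str, tokenize: Callable[[str], List[str]]) -> List[str]:
    """Pack a query-keyword pair into one BERT-style input sequence.

    The classifier reads the whole sequence and predicts 1 (synonymous) or 0 (not).
    """
    return ["[CLS]"] + tokenize(query) + ["[SEP]"] + tokenize(keyword) + ["[SEP]"]

# Toy usage with whitespace tokenization standing in for the real subword tokenizer.
tokens = encode_pair("the price of lipo in Denver",
                     "how much does liposuction cost in Denver",
                     str.split)
# ['[CLS]', 'the', 'price', ..., '[SEP]', 'how', 'much', ..., '[SEP]']
```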
The model contains 24 layers and 16 self-attention heads, and the hidden dimension size is 1,024. The parameters are initialized with ERNIE [21], a well-known Chinese pre-trained transformer published by Baidu. The baseline model $M^{dis}$ and the concept-augmented model $\hat{M}^{dis}$ are separately fine-tuned on $D^{dis}_{train}$ and $\hat{D}^{dis}_{train}$. The fine-tuning loss is minimized by Adam with a batch size of 128.

We focus on two indicators: AUC (area under the curve) and the recall under 95%/70% precision. Table 2 shows that the concept-augmented model outperforms the baseline model by 16.24 points on the recall under 70% precision on $\bar{D}^{dis}_{test}$. Since the data distributions of $\bar{D}^{dis}_{test}$ and $D^{dis}_{test}$ differ a lot, with long-tail cases occupying only a small proportion of the general test dataset $D^{dis}_{test}$, the indicators on $D^{dis}_{test}$ are only slightly improved.

Table 2: Results of the different discriminant models on $D^{dis}_{test}$ and $\bar{D}^{dis}_{test}$. AUC-G and Recall-G denote the AUC value and the recall ratio under a precision of 95% on the global test set $D^{dis}_{test}$; AUC-L and Recall-L denote the AUC value and the recall ratio under a precision of 70% on the long-tail test set $\bar{D}^{dis}_{test}$.
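For reference, recall under a fixed precision can be computed as in the following generic sketch, assuming model scores and binary human labels for a test set are available; this is not the exact evaluation script used in the paper.

```python
from typing import List

def recall_at_precision(scores: List[float], labels: List[int], target_precision: float) -> float:
    """Highest recall achievable at any score threshold whose precision >= target_precision."""
    pairs = sorted(zip(scores, labels), reverse=True)  # rank by score, descending
    total_pos = sum(labels)
    tp = 0
    best_recall = 0.0
    for rank, (_, label) in enumerate(pairs, start=1):
        tp += label
        precision = tp / rank
        recall = tp / total_pos if total_pos else 0.0
        if precision >= target_precision:
            best_recall = max(best_recall, recall)
    return best_recall

# Example: Recall-G uses target_precision=0.95, Recall-L uses target_precision=0.70.
```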
Table 3 shows the performance of the discriminant model $\hat{M}^{dis}$ with different proportions of augmented data. We can see that the model's recall ratio on $\bar{D}^{dis}_{test}$ grows as the proportion increases. However, too much augmented data would distort the original data distribution, which results in poor performance on the global test set $D^{dis}_{test}$. As shown in the table, the performance on $D^{dis}_{test}$ peaks around a proportion of 12%.

Table 3: Results of $\hat{M}^{dis}$ with different proportions of the augmentation data.
Proportion   Recall-G   Recall-L
8%           76.14%     72.18%
10%          76.58%     76.95%
12%

Table 4 shows some typical long-tail queries and their corresponding conceptual patterns. Table 5 further shows the top-3 results decoded by $M^{gen}$ and $\widetilde{M}^{gen}$. We can see that the traditional translation model performs badly on the long-tail queries and cannot preserve the source entities in the decoding phase. For example, in the case of Can pregnant women eat cantaloupe, $M^{gen}$'s decoding results change the entity cantaloupe into oranges or other nonsynonymous items, whereas the concept-based decoding precisely retains the entity. Table 6 shows the discriminant model's inference scores on some typical long-tail cases. We can see that the model's predicted scores have been well adjusted: the new model predicts much higher scores for positive cases and much lower scores for negative ones after fine-tuning on the conceptually augmented data.

The synonymous patterns' coverage plays a crucial role in the practical application of our method. Our human evaluation statistics show that nearly 70% of the existing commercial paraphrases can be transformed into conceptual forms. As a matter of fact, people's commercial intentions are relatively concentrated in the sponsored search scenario, and most of these paraphrases can be expressed as stable pattern forms. For example, a good's price is one of the most common commercial intentions, which can easily be abstracted as how much is [Goods]. Table 4 shows some typical commercial patterns related to goods' quality, food recipes, and medicine functions.
Table 4: Some typical long-tail queries and their conceptual patterns, where entities are underlined and concepts are enclosed in square brackets. Entity Freq denotes the entity's frequency in $D^{gen}_{train}$.
Query / Pattern
海马丘比特质量怎么样 (How about the quality of the Haima Qiubite) / [品牌]质量怎么样 (How about the quality of [Brand])
欧米伽哪里有 (Where can one find omega) / [食物营养素]哪里有 (Where can one find [Food Nutrient])
炮附片的功效 (The effects of prepared aconite slices) / [植物类中药材]的功效 (The effects of [Herbal Medicine])
哈蜜瓜孕妇能吃么 (Can pregnant women eat cantaloupe) / [饮食]孕妇能吃么 (Can pregnant women eat [Food])
伦敦糕的做法与配方 (The recipe and method for London cake) / [饮食]的做法与配方 (The recipe and method for [Food])

Table 5: The translation results of typical cases.
Query: 海马丘比特质量怎么样 (How about the quality of the Haima Qiubite)
  $M^{gen}$: 海马好吗 (How is Haima); 海马下垂怎么样 (How about hippocampus sagging); 海马法语怎么样 (How about Haima French)
  $\widetilde{M}^{gen}$: 海马丘比特这车怎么样 (How about the car named Haima Qiubite); 海马丘比特质量好不好 (Is the quality of Haima Qiubite good); 海马丘比特质量如何 (How is the quality of Haima Qiubite)
Query: 哈蜜瓜孕妇能吃么 (Can pregnant women eat cantaloupe)
  $M^{gen}$: 孕妇能吃橙子吗 (Can pregnant women eat oranges); 孕妇可以吃益母草吗 (Can pregnant women eat leonotis); 孕妇能吃黄芪吗 (Can pregnant women eat astragalus)
  $\widetilde{M}^{gen}$: 怀孕了能吃哈蜜瓜吗 (Can I eat cantaloupe while pregnant); 孕妇可以吃哈蜜瓜吗 (Can pregnant women eat cantaloupe); 怀孕期间可以吃哈蜜瓜吗 (Can I eat cantaloupe during pregnancy)

Table 6: The prediction scores of different discriminant models on some typical long-tail query-keyword cases. The label 1 denotes that the query-keyword pair is synonymous, while 0 stands for the opposite.
Query: 血HCG检查多少钱 (How much does a blood HCG test cost)  Bidword: 血疮检查费用 (The cost of a blood-boil examination)
Query: 花牛苹果什么科 (What family does the Huaniu apple belong to)  Bidword: 嘎拉苹果属于什么科 (What family does the Gala apple belong to)
Query: 道奇汽车是哪国生产 (Which country produces Dodge cars)  Bidword: 道奇汽车产地在哪里 (Where are Dodge cars produced)

5 ONLINE DEPLOYMENT
As illustrated in Figure 3, we deployed a new module named CKBR (Concept Knowledge Based Retrieval) on Baidu's sponsored search engine for retrieving exact-match-type keywords.

For latency concerns, the pattern matching phase is implemented under an online-offline mixed architecture. For frequent query patterns, their matched keyword patterns are computed ahead of time with a complex transformer model, and these results are saved in a lookup table for fast online retrieval. This offline process is repeated periodically to follow the changes in the query population and the keyword supply over time. For infrequent queries, a phrase-based statistical machine translation (PBSMT) framework is utilized to generate keyword patterns from scratch. Although neural machine translation (NMT) has outperformed PBSMT in lots of translation tasks, we find that PBSMT is still a cost-effective choice for industry. In contrast with NMT, it is much faster and no GPU is required. As a matter of fact, our PBSMT is able to decode 1,200 synonymous variations in less than 60 ms, whereas a simple two-layer GRU [4] gated NMT can only decode 30 synonymous variations in the same time window (the performance test is conducted on a machine equipped with a 12-core Intel(R) Xeon(R) E5-2620 v3 clocked at 2.40 GHz, 128 GB of RAM, and 16 Tesla K40m GPUs; PBSMT runs with 10 CPU threads and a stack size of 100, while NMT runs on a single GPU with a beam size of 30). Another advantage of SMT is that all of the alignments are stored in an editable and readable phrase table, which makes it explainable and controllable.

The online discriminant model is implemented with a transformer having 2 layers and 4 heads, with a hidden size of 128. This model is distilled from the concept-augmented discriminant model $\hat{M}^{dis}$ described above. Due to space limitations, we do not elaborate on the details.

A 10-day real online A/B test is deployed on two platforms of Baidu's sponsored search engine, corresponding to the query flows from desktop and mobile devices. We focus on the following metrics.
• SHOW denotes the total number of ads shown to users.
• CTR = CLICK / SEARCH denotes the average click-through rate, where CLICK denotes the number of all ads' clicks and SEARCH denotes the number of all searches.
• CPM = REVENUE / SEARCH × 1000 denotes the revenue per thousand searches.
• SHOW-EXACT denotes the number of ads shown under the exact match type.

Table 7: The online A/B test results of the CKBR module, which indicate the relative improvements over the current system.
Platform   SHOW    CPM     CTR     SHOW-EXACT
Mobile     0.54%   1.51%   0.92%   9.14%
Desktop    1.58%   2.52%   1.43%   8.42%

As shown in Table 7, the CKBR module leads to an evident improvement in SHOW-EXACT of 8.42% on desktop flows and 9.14% on mobile flows. Besides, it also yields significant growth in CPM, with 2.52% on desktop flows and 1.51% on mobile flows.
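The online-offline mixed serving logic described above can be summarized with the following hedged sketch; the lookup table, the PBSMT decoder interface, and the candidate limit are assumptions made for illustration.

```python
from typing import Callable, Dict, List

def match_keyword_patterns(
    query_pattern: str,
    offline_table: Dict[str, List[str]],     # precomputed offline with the full transformer for frequent patterns
    smt_decode: Callable[[str], List[str]],  # online PBSMT decoder used for infrequent patterns
    max_candidates: int = 1200,
) -> List[str]:
    """Serve keyword patterns for one query pattern under the online-offline mixed architecture."""
    # Frequent query patterns: answered from the periodically refreshed lookup table.
    if query_pattern in offline_table:
        return offline_table[query_pattern][:max_candidates]
    # Infrequent query patterns: generated from scratch with PBSMT, which stays
    # within the online latency budget on CPUs (no GPU required).
    return smt_decode(query_pattern)[:max_candidates]
```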
Figure 3: The online implementation framework of the concept knowledge-driven retrieval module.

In the meanwhile, a quality evaluation has been conducted, in which 600 query-keyword cases under the exact match type were sampled from the system's weblog and sent for synonymous binary judgment. The evaluation result shows that the synonym accuracy has increased by 0.7%.
6 CONCLUSION
In this paper, we have developed a novel knowledge-driven framework for addressing the synonymous keyword retrieval problem in sponsored search. Under this framework, the synonymous transformation can be understood as a combination of pattern transformation and alias replacement. Based on a large Chinese knowledge graph, we are able to conceptualize most of the commercial query-keyword synonymous pairs into abstract patterns. A deep translation model is trained on this conceptualized data to capture the synonymous pattern variations. Our offline experiments show that the new framework performs much better on long-tail cases. The whole framework has been implemented in Baidu's keyword retrieval system based on an NMT/SMT mixed architecture, and a significant improvement in revenue has been achieved without degrading the quality of the user experience. This method's application scope is not limited to synonymous retrieval under exact match; exploration in phrase match and broad match is in progress. We hope our method will shed some light on the further design of industrial sponsored search systems.
REFERENCES
[1] Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2016. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683 (2016).
[2] Yixin Cao, Xiang Wang, Xiangnan He, Zikun Hu, and Tat-Seng Chua. 2019. Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences. In The World Wide Web Conference. 151–161.
[3] Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. 2017. A survey on dialogue systems: Recent advances and new frontiers. ACM SIGKDD Explorations Newsletter 19, 2 (2017), 25–35.
[4] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[6] Jeffrey Scott Eder. 2012. Knowledge graph based search system. US Patent App. 13/404,109.
[7] Jiafeng Guo, Gu Xu, Xueqi Cheng, and Hang Li. 2009. Named entity recognition in query. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 267–274.
[8] Peter Haase, Andriy Nikolov, Johannes Trame, Artem Kozlov, and Daniel M Herzig. 2017. Alexa, Ask Wikidata! Voice interaction with knowledge graphs using Amazon Alexa. In International Semantic Web Conference (Posters, Demos & Industry Tracks).
[9] Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, and Dawn Song. 2020. Pretrained transformers improve out-of-distribution robustness. arXiv preprint arXiv:2004.06100 (2020).
[10] Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR) (2015).
[11] Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. 2017. End-to-end task-completion neural dialogue systems. arXiv preprint arXiv:1703.01008 (2017).
[12] Yijiang Lian, Zhijie Chen, Jinlong Hu, Kefeng Zhang, Chunwei Yan, Muchenxuan Tong, Wenying Han, Hanju Guan, Ying Li, Ying Cao, et al. 2019. An end-to-end generative retrieval method for sponsored search engine – Decoding efficiently into a closed target domain. arXiv preprint arXiv:1902.00592 (2019).
[13] Yijiang Lian, Zhenjun You, Fan Wu, Wenqiang Liu, and Jing Jia. 2020. Retrieve synonymous keywords for frequent queries in sponsored search in a data augmentation way. arXiv preprint arXiv:2008.01969 (2020).
[14] Dekang Lin and Patrick Pantel. 2001. Discovery of inference rules for question-answering. Natural Language Engineering 7, 4 (2001), 343–360.
[15] Bang Liu, Weidong Guo, Di Niu, Chaoyue Wang, Shunnan Xu, Jinghong Lin, Kunfeng Lai, and Yu Xu. 2019. A user-centered concept mining system for query and document understanding at Tencent. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1831–1841.
[16] Zhenghao Liu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. 2018. Entity-duet neural ranking: Understanding the role of knowledge graph semantics in neural information retrieval. arXiv preprint arXiv:1805.07591 (2018).
[17] Xusheng Luo, Luxin Liu, Yonghua Yang, Le Bo, Yuanpeng Cao, Jinghang Wu, Qiang Li, Keping Yang, and Kenny Q Zhu. 2020. AliCoCo: Alibaba e-commerce cognitive concept net. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 313–327.
[18] Jibran Mustafa, Sharifullah Khan, and Khalid Latif. 2008. Ontology based semantic information retrieval. In 2008 4th International IEEE Conference Intelligent Systems, Vol. 3. IEEE, 22–14.
[19] Bo Pang, Kevin Knight, and Daniel Marcu. 2003. Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sentences. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. 181–188.
[20] Deepak Ravichandran and Eduard Hovy. 2002. Learning surface text patterns for a question answering system. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 41–47.
[21] Yu Sun, Shuohuan Wang, Yu-Kun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: Enhanced representation through knowledge integration. CoRR abs/1904.09223 (2019). arXiv:1904.09223 http://arxiv.org/abs/1904.09223
[22] Zhu Sun, Jie Yang, Jie Zhang, Alessandro Bozzon, Long-Kai Huang, and Chi Xu. 2018. Recurrent knowledge graph embedding for effective recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems. 297–305.
[23] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 3104–3112.
[24] Idan Szpektor, Hristo Tanev, Ido Dagan, and Bonaventura Coppola. 2004. Scaling web-based acquisition of entailment relations. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. 41–48.
[25] Hongwei Wang, Fuzheng Zhang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. 2019. Multi-task feature learning for knowledge graph enhanced recommendation. In The World Wide Web Conference. 2000–2010.
[26] Xiang Wang, Xiangnan He, Yixin Cao, Meng Liu, and Tat-Seng Chua. 2019. KGAT: Knowledge graph attention network for recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 950–958.
[27] Colby Wise, Vassilis N Ioannidis, Miguel Romero Calvo, Xiang Song, George Price, Ninad Kulkarni, Ryan Brand, Parminder Bhatia, and George Karypis. 2020. COVID-19 knowledge graph: Accelerating information retrieval and discovery for scientific literature. arXiv preprint arXiv:2007.12731 (2020).
[28] Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q Zhu. 2012. Probase: A probabilistic taxonomy for text understanding. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. 481–492.
[29] Shiqi Zhao, Haifeng Wang, Ting Liu, and Sheng Li. 2008. Pivot approach for extracting paraphrase patterns from bilingual corpora. In Proceedings of ACL-08: HLT.