DBTagger: Multi-Task Learning for Keyword Mapping in NLIDBs Using Bi-Directional Recurrent Neural Networks
To appear in VLDB 2021
Arif Usta
Bilkent University, Ankara, [email protected]
Akifhan Karakayali
Bilkent University, Ankara, [email protected]
Özgür Ulusoy
Bilkent University, Ankara, [email protected]
ABSTRACT
Translating Natural Language Queries (NLQs) to Structured Query Language (SQL) in interfaces deployed in relational databases is a challenging task, which has been widely studied in the database community recently. Conventional rule based systems utilize a series of solutions as a pipeline to deal with each step of this task, namely stop word filtering, tokenization, stemming/lemmatization, parsing, tagging, and translation. Recent works have mostly focused on the translation step, overlooking the earlier steps by using ad-hoc solutions. In the pipeline, one of the most critical and challenging problems is keyword mapping: constructing a mapping between tokens in the query and relational database elements (tables, attributes, values, etc.). We define the keyword mapping problem as a sequence tagging problem, and propose a novel deep learning based supervised approach that utilizes POS tags of NLQs. Our proposed approach, called DBTagger (DataBase Tagger), is an end-to-end and schema independent solution, which makes it practical for various relational databases. We evaluate our approach on eight different datasets, and report new state-of-the-art accuracy results, 92.4% on the average. Our results also indicate that DBTagger is faster than its counterparts by up to 10,000 times and scalable for bigger databases.
PVLDB Reference Format:
Arif Usta, Akifhan Karakayali, and Özgür Ulusoy. DBTagger: Multi-Task Learning for Keyword Mapping in NLIDBs Using Bi-Directional Recurrent Neural Networks. To appear in VLDB 2021. PVLDB, 14(5): XXX-XXX, 2021. doi:10.14778/3446095.3446103
PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/arifusta/DBTagger.
1 INTRODUCTION
The amount of processed data has been growing rapidly owing to advances in technology, leading database systems to have a great deal of importance in today's world. Among these systems, relational databases are still one of the most popular infrastructures to effectively store data in a structured fashion. To extract data out of a relational database, the structured query language (SQL) is used as a standard tool.
This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 14, No. 5 ISSN 2150-8097. doi:10.14778/3446095.3446103
Although SQL is a powerfully expressive language, even technically skilled users have difficulties using it. Along with the syntax of SQL, one has to know the schema underlying the database upon which the query is issued, which causes further hurdles in using SQL. Consequently, casual users find it even more difficult to express their information need, which makes SQL less desirable. To remove this barrier, an ideal solution is to provide a search engine like interface in databases, such as Google or Bing. The goal of Natural Language Interfaces to Databases (NLIDB) is to break through these barriers to make it possible for casual users to employ their natural language to extract information.

To this end, many works have been published recently attacking the research problem of translating natural language queries into SQL, such as conventional pipeline based approaches [4, 24, 32, 35] or end-to-end solutions using encoder-decoder based deep learning approaches [20, 33, 34, 41, 44]. Neural network based solutions seem promising in terms of robustness, covering semantic variations of queries. However, they struggle with complex SQL queries, such as aggregation and nested queries, especially if they include multiple tables. They also have a huge drawback in that they need many SQL-NL pairs for training to perform well, which makes conventional pipeline based solutions still an attractive alternative [30].

In the translation pipeline, one of the most important sub-problems is keyword mapping, noted in [25] as an open challenge to be addressed in NLIDBs. The keyword mapping task requires associating each token or series of consecutive tokens (e.g., keywords) in the natural language query with a corresponding database schema element such as a table, attribute, or value. It is the very first step of resolving ambiguity for translation. Xu et al. [34] also note that during the translation of the query, the where clause is the most difficult part to generate, which further signifies the importance of keyword mapping. Consider the natural language query examples below, run on the sample IMDB movie database shown in Figure 1, to better understand the challenges in the keyword mapping problem.
Example NL Query 1. "What is the writer of The Truman Show?"
Challenge 1.
The very first challenge in keyword mapping is to differentiate and categorize tokens in the query as either database relevant or not. For instance, some of the words in Example 1 (e.g., "is", "the", "of") are just stop words that need not be considered as potential mapping targets. An ad-hoc solution is to filter certain words using a pre-defined vocabulary; however, such a solution removes "The" in Example 1 preceding the actual database value that needs to be mapped, which will cause a wrong translation.

Challenge 2. Another important challenge is to detect multi-word entities (mostly database values), such as "The Truman Show" in Example 1. The most common approach is to build look-up tables or indexes on n-grams of database values and calculate semantic and/or lexical similarity over the candidates. Yet, this is a costly process for on-the-fly calculations regarding possible n-grams of the given NL query.
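The look-up cost behind Challenge 2 can be made concrete with a small sketch: enumerating every candidate n-gram of the query and probing a value index. The `value_index` below is a toy stand-in for a real index over database values, not part of DBTagger; the number of candidates grows with both the query length and the maximum n-gram size.

```python
from typing import List, Tuple

def candidate_ngrams(tokens: List[str], max_n: int = 3) -> List[Tuple[str, ...]]:
    """Enumerate all contiguous n-grams (n = 1..max_n) of the query tokens."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[i:i + n]))
    return grams

# Toy value index standing in for a look-up table over database values;
# the entry here is illustrative, not extracted from the real IMDB data.
value_index = {("the", "truman", "show"): "movie.title"}

tokens = ["what", "is", "the", "writer", "of", "the", "truman", "show"]
matches = [(g, value_index[g]) for g in candidate_ngrams(tokens) if g in value_index]
```

Even this toy query produces 21 candidate n-grams to probe, which illustrates why on-the-fly n-gram matching against large value sets is costly.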
Example NL Query 2. "Find all movies written by Matt Demon."
Example NL Query 3. "How many movies are there that are di-rected by Steven Spielberg and featuring Matt Demon?"
Challenge 3.
Consider the queries given in Examples 2 and 3. In the queries, the tokens ("written" and "featuring") referring to database tables are syntactic and semantic variations of the actual tables ("written_by" and "cast", respectively) that they are mapped to in the database (Figure 1). To handle such a challenge, lexical and semantic similarities of tokens over database elements (tables and attributes) can be calculated using a third party database such as WordNet [29]. However, in addition to being a costly process to calculate such similarities online, such a solution cannot cover all possible variations of every map target in the database schema. Also, the similarity calculation approach requires a manually crafted threshold, 𝜏, to determine how much similarity is sufficient to map to a particular schema element, which makes it undesirable.

Challenge 4. One of the usages of the keyword mapping step is to resolve ambiguities before getting into the translation step. In the above examples, "Matt Demon" refers to a database value residing in multiple tables (e.g., actor, writer). The actual mapping of the keyword is determined by the mappings of the surrounding words, which implies that query-wise labelling considering coherence, rather than independent labelling, can be beneficial.
Challenge 5.
In addition to being effective, an ideal keyword mapping approach must be efficient enough to be deployed on interfaces where users run queries online. The mapper should output the result in reasonable time.

Most of the pipeline-based state-of-the-art works do not provide a novel solution to the problem of keyword mapping; rather, they utilize unsupervised approaches such as simple look-up tables looking for exact matches or pre-defined synonyms [4, 31]; or they make use of an existing lexical database [24] such as WordNet [29]; or they exploit domain information to extract an ontology to be used for the task [32]; or they employ distributed representations of words [35] such as word2vec [28] to calculate semantic similarity of tokens over database elements. Although these approaches are effective to some extent, none of them single-handedly solves all the challenges posed by the keyword mapping task.

In order to address all of the challenges mentioned above, we propose DBTagger, a novel deep sequence tagger architecture used for keyword mapping in NLIDBs. Our approach is applicable to different database domains, requiring only a handful of training query annotations, and practical to be deployed in online scenarios, finding tags in just milliseconds. In particular, we make the following contributions by proposing DBTagger:
• We tackle the keyword mapping problem as a sequence tagging problem and borrow state-of-the-art deep learning approaches tailored for well-known NLP tasks.
Figure 1: ER diagram of a subset of IMDB movie database

• We extend the neural structure for sequence tagging by utilizing multi-task learning and cross-skip connections to exploit the observation we made in natural language query logs of databases, that is, schema tags of keywords are highly correlated with POS tags.
• We manually annotate query logs from three publicly available relational databases, and five different schemas belonging to the Spider [42] dataset.
• We evaluate DBTagger with the above-mentioned query logs in two different setups. First, we compare DBTagger with unsupervised baselines preferred in state-of-the-art NLIDBs. Second, we evaluate the DBTagger architecture by comparing it with different supervised neural architectures. We report new state-of-the-art accuracy results for keyword mapping in all datasets.
• We provide a comprehensive run time and memory usage analysis over the existing keyword mapping approaches. Our results show that DBTagger is the most efficient and scalable approach for both metrics.

The remainder of this paper is organized as follows. In the next section, we explain the problem formulation and methodology we follow. We present the neural network structure we designed to solve the keyword mapping problem, and discuss how annotation of queries is handled. In Section 3, we provide experimental results comparing DBTagger with unsupervised baselines and present the performance of different neural models to justify the DBTagger architecture. We also provide an efficiency analysis on all baselines. In Section 4, we summarize related work. We conclude with a discussion of our results and outline possible future work plans in Section 5.
Figure 2: DBTagger Network

POS tagging and NER refer to sequence tagging problems in NLP, where for a particular sentence the goal is to identify parts-of-speech such as noun, verb, and adjective, and to locate entity names such as person or organization, respectively. Recurrent Neural Networks (RNNs) are at the core of architectures that handle such problems, since they are a family of networks that perform well on sequential input such as a sentence. In RNN networks, the basic goal is to carry past information (previous words) to future time steps (future words) to determine the values of inner states and consequently the final output, which makes them a preferable architecture for sequential data. Given x_t as input at time step t, the hidden state h_t at time step t is calculated as follows:

h_t = f(U x_t + W h_{t-1})    (1)

In practice, however, RNN networks suffer from the vanishing gradient problem; this limitation was overcome by gated variants of RNNs such as LSTM [16] and GRU [8]. Compared to a vanilla RNN, LSTM additionally has forget gates, and GRU comprises reset and update gates. We experimented with both structures and chose GRU due to its better performance in our experiments. In GRU, the update gate decides what information to throw away and what new information to add, whereas the reset gate decides how much past information to forget.

In the sequence tagging problem, at a given time step t we have future information in addition to past information. For a particular word w_i, we know the preceding words (past information) and succeeding words (future information), which can be further exploited in the network architecture called bi-directional RNN, introduced in [11]. A bi-directional RNN has two sets of networks with different parameters, called forward and backward. The concatenation of the two networks is then fed into the last layer, where the output is determined.
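As an illustration of the recurrence in Eq. (1) and its gated extension, the following NumPy sketch runs a GRU cell forward and backward over a toy sequence and concatenates the two state sequences, as a bi-directional GRU does. For brevity a single parameter set is reused for both directions, whereas the actual architecture trains separate forward and backward parameters; all shapes here are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU time step: update gate z, reset gate r, candidate state h~."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev))   # candidate hidden state
    return (1.0 - z) * h_prev + z * h_cand         # one common gating convention

d_in, d_h = 4, 3
rng = np.random.default_rng(0)
# Alternate input-to-hidden (d_h x d_in) and hidden-to-hidden (d_h x d_h) matrices.
params = [rng.normal(size=(d_h, d_in)) if i % 2 == 0 else rng.normal(size=(d_h, d_h))
          for i in range(6)]

# A bi-directional pass: run the sequence forward and backward, then
# concatenate the per-token states of the two directions.
seq = [rng.normal(size=d_in) for _ in range(5)]
h_fwd, h_bwd = np.zeros(d_h), np.zeros(d_h)
fwd, bwd = [], []
for x in seq:
    h_fwd = gru_step(x, h_fwd, params); fwd.append(h_fwd)
for x in reversed(seq):
    h_bwd = gru_step(x, h_bwd, params); bwd.append(h_bwd)
states = [np.concatenate([f, b]) for f, b in zip(fwd, reversed(bwd))]
```

Each of the five tokens ends up with a 2·d_h-dimensional state combining left and right context, which is what the downstream tagging layer consumes.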
This process is demonstrated in the upper part of Figure 2, named bi-directional GRU.

Sequence tagging is a supervised classification problem where the model tries to predict the most probable label from the output space. For that purpose, although conventional softmax classification can be used, a conditional random field (CRF) [22] is preferred. Unlike independent classification by softmax, a CRF tries to predict labels sentence-wise by taking the labels of the neighboring words into consideration as well. This feature of CRF is what makes it an attractive choice especially in a problem like keyword mapping. CRFs for each class of tags are appended to the uni-directional GRU, depicted in the lower part of Figure 2.

Formally, for a given NL query, the input X becomes a series of vectors [x_1, x_2, ..., x_n] where x_i represents the i-th word in the query. Similarly, the output vector Y becomes [y_1, y_2, ..., y_n] where y_i represents the label (actual tag) of the i-th word in the query. The input must be in numerical format, which implies that a numerical representation of words is needed. For that purpose, the word embedding approach is state-of-the-art in various sequence tagging tasks in NLP [9]. So, an embedding matrix W ∈ R^{n×d} is extracted for the given query, where n is the number of words in the query and d is the dimension of the embedding vector for each word. For the pre-calculated embeddings, we used fastText [6], since it is one of the representation techniques that also considers subwords (character n-grams) and thus deals better with the out-of-vocabulary token problem.

We consider G to be the 2-dimensional score output of the uni-directional GRU with size n × k, where k represents the total number of tags. G_{i,j} refers to the score of the j-th tag for the i-th word.
For a sequence Y and given input X, we define the tag score as:

s(X, Y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} G_{i, y_i}    (2)

where A is a transition matrix in which A_{i,j} represents the score of a transition from the i-th tag to the j-th tag. After finding scores, we define the probability of the sequence Y:

p(Y | X) = e^{s(X,Y)} / Σ_{Ȳ ∈ Y_X} e^{s(X,Ȳ)}    (3)

where Ȳ refers to any possible tag sequence. During training we maximize the log-probability of the correct tag sequence, and for inference we simply select the tag sequence with the maximum score.

In our architecture, we utilize multi-task learning by introducing two other related tasks: the POS and type levels (shown in Figure 2). The reason we apply multi-task learning is to exploit the observation that the actual database tags of the tokens in the query are related to POS tags. Besides, multi-task learning helps to increase model accuracy and efficiency by producing more generalized models with the help of shared representations between tasks [7]. The POS and type tasks are trained together with the schema task to improve the accuracy of the schema (final) tags. For each task, we define the same loss function described above. During backpropagation, we simply combine the losses as follows:

L_total = Σ_i w_i × L_i    subject to    Σ_i w_i = 1    (4)

where w_i represents the weight of the i-th task and L_i represents the loss calculated for the i-th task.

Another technique we integrate into the neural architecture is the skip-connection. Skip connections introduce extra node connections between different layers by skipping one or more layers in the architecture. With skip connections, the model provides an alternative path for the gradient during backpropagation, which eventually helps convergence.
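Equations (2) and (3) can be checked with a brute-force NumPy sketch that scores every possible tag sequence and normalizes. Real CRF layers use the forward algorithm for the partition function and Viterbi decoding for inference rather than this exhaustive enumeration, but the quantities computed are the same; A and G are random toy matrices here.

```python
import itertools
import numpy as np

def crf_score(A, G, y):
    """s(X, Y): transition scores A[y_i, y_{i+1}] plus emission scores G[i, y_i]
    (Eq. 2, with the start/stop boundary tags omitted for brevity)."""
    trans = sum(A[y[i], y[i + 1]] for i in range(len(y) - 1))
    emit = sum(G[i, y[i]] for i in range(len(y)))
    return trans + emit

def crf_prob(A, G, y):
    """p(Y | X): normalize over every possible tag sequence (Eq. 3), brute force."""
    n, k = G.shape
    scores = [crf_score(A, G, seq) for seq in itertools.product(range(k), repeat=n)]
    Z = np.logaddexp.reduce(scores)  # log partition function, computed stably
    return float(np.exp(crf_score(A, G, y) - Z))

rng = np.random.default_rng(1)
n, k = 4, 3                       # 4 tokens, 3 candidate tags
A = rng.normal(size=(k, k))       # transition scores between tags
G = rng.normal(size=(n, k))       # per-token tag scores from the GRU
probs = [crf_prob(A, G, y) for y in itertools.product(range(k), repeat=n)]
assert abs(sum(probs) - 1.0) < 1e-9   # Eq. 3 defines a proper distribution
```

Training maximizes log p(Y|X) of the gold sequence; inference picks the argmax-score sequence, which by construction is also the most probable one.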
The technique has become a standard component in many neural architectures deployed in the computer vision community, such as the famous ResNet [13] and DenseNet [17] architectures. In the DBTagger architecture, for each task except the first one (POS), we additionally feed the output of the uni-directional GRU layer of the previous task into the CRF layer of the next task (the (i+1)-th task). With these connections, we further carry the information of previous tasks to later tasks and eventually to the final task, schema tagging.

In our problem formulation, every token (word in the natural language query) is associated with three different tags: a part-of-speech (POS) tag, a type tag, and a schema tag. In the following subsections, we explain how we extract or annotate each of them in detail.
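The cross-skip connection amounts to handing a later task's classifier the earlier task's hidden features in addition to its own. A minimal sketch follows; concatenation is used here purely for illustration, and the shapes and exact combination method are assumptions rather than details of the DBTagger implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 8                     # 5 tokens, feature size 8 per task layer

pos_features = rng.normal(size=(n, d))     # output of the POS task's GRU layer
type_features = rng.normal(size=(n, d))    # output of the type task's GRU layer

# Cross-skip connection: the type task's CRF sees its own features combined
# with the POS task's features, so earlier-task information reaches later
# tasks directly instead of only through the shared lower layers.
type_crf_input = np.concatenate([type_features, pos_features], axis=1)
assert type_crf_input.shape == (n, 2 * d)
```

The same pattern repeats for the schema task, so the final tagging layer receives information from both preceding tasks.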
To obtain the POS tags of our natural language queries, we used the toolkit of the Stanford Natural Language Processing Group, Stanford CoreNLP [27]. We use the tags as they are output from the toolkit, without any further processing, since the reported accuracy of the POS tagger (97%) is sufficient.
In each natural language query, there are keywords (words or consecutive words) which can be mapped to database schema elements such as tables, attributes, or values. We divide this mapping into two levels: type tagging and schema tagging. Type tags represent the type of the mapped schema element to be used in the SQL query. In total we have seven different type tags:
• TABLE: NLQs contain nouns which may be direct references to the tables in the schema, and we tag such nouns with the TABLE tag. In the example NL query given in Table 1, the noun movie has TABLE as its type tag, which also supports the intuition that schema labels and POS tags are related.
• TABLEREF: Although the primary sources for table references are nouns, some verbs contain references to tables, most of which are relation tables. The TABLEREF tag is used to identify such verbs. Revisiting the example given in Table 1, the verb acted refers to the table cast, and therefore it is tagged with TABLEREF to better differentiate the roles of POS tags in the query.
• ATTR: In SQL queries, attributes are mostly used in SELECT, WHERE and GROUP BY clauses. Natural language queries may contain nouns that can be mapped to those attributes. We use the ATTR tag for tagging such nouns in the natural language queries.
• ATTRREF: Like the TABLEREF tag, the ATTRREF tag is used to tag the verbs in the natural language query that can be mapped to the attributes in the SQL query.
• VALUE: In NLQs, there are many entity-like keywords that need to be mapped to their corresponding database values. These words are mostly tagged as proper nouns (NNP), such as the keyword John Nash in the example query. It is also possible for a word with a common noun (NN) POS tag to have a VALUE tag at the schema level; to handle such cases with different POS tags, we use VALUE type tags (e.g., the Mind keyword in the example query is part of a keyword that needs to be mapped as a value to movie.title). Keywords with VALUE tags can later be used in the translation to determine "where" clauses in SQL.
• COND: After determining which keywords in the query are to be mapped as values, it is also important to identify the words that imply which type of conditions are to be met in the SQL query. For that purpose, we have the COND type tag.
• O (OTHER): This type tag represents words in the query that do not need to be mapped to any schema element related to the translation step. Most stop words in the query (e.g., the) fall into this category.

Table 1: An example NL query with its tags corresponding to each word in three different levels

NL query:    who  acted     John       Nash       in    the  movie  A            Beautiful    Mind
POS tags:    WP   VBD       NNP        NNP        IN    DT   NN     DT           JJ           NN
Type tags:   O    TABLEREF  VALUE      VALUE      COND  O    TABLE  VALUE        VALUE        VALUE
Schema tags: O    cast      cast.role  cast.role  cond  O    movie  movie.title  movie.title  movie.title
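The three tagging levels of the Table 1 example can be represented as parallel, equal-length sequences. The record below is only an illustrative format, not DBTagger's actual training data layout:

```python
# The Table 1 example as a three-level annotation record; each token of the
# NL query carries a POS tag, a type tag, and a schema tag.
tokens = "who acted John Nash in the movie A Beautiful Mind".split()
pos    = ["WP", "VBD", "NNP", "NNP", "IN", "DT", "NN", "DT", "JJ", "NN"]
types  = ["O", "TABLEREF", "VALUE", "VALUE", "COND", "O",
          "TABLE", "VALUE", "VALUE", "VALUE"]
schema = ["O", "cast", "cast.role", "cast.role", "cond", "O",
          "movie", "movie.title", "movie.title", "movie.title"]

# One (token, POS, type, schema) tuple per word; all levels must align.
annotation = list(zip(tokens, pos, types, schema))
assert len(tokens) == len(pos) == len(types) == len(schema) == 10
```

Keeping the levels aligned per token is exactly what lets the multi-task model train the POS, type, and schema tasks on the same input sequence.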
Schema tags of keywords represent the database mapping that the keyword refers to: the name of a table or attribute. Tagging a keyword with a type tag is important yet incomplete. To find the exact mapping the keyword refers to, we define a second level of tagging where the output is the name of a table or attribute. For each entity table (e.g., the movie table in Figure 1) and for each non-PK or non-FK attribute (attributes which have semantics), we define a schema tag (e.g., movie, people, movie.title, etc., referring to Figure 1). We complete the set of possible schema tags by carrying OTHER and COND over from the type tags. We use the same schema tag for attributes and values (e.g., movie.title), but differentiate them at the inference step by combining tags from both the type and schema levels. If a word is mapped to the VALUE type tag as a result of the model, its schema tag refers to the attribute in which the value resides.

In order to annotate queries, we annotate each word in the query at the three different levels mentioned above. While POS tags are extracted automatically, we manually annotate the other two levels. Annotations were done by three graduate and three undergraduate computer science students who are familiar with databases. Although annotation time varies from person to person, on the average it took a week for a single person to annotate the tokens at two levels (type and schema) for a query log with 150 NL questions, which we believe is practical in many domains.
In our experiments we used the yelp, imdb [35], and mas [24] datasets, which are heavily used in many NLIDB related works by the database community [2, 24, 32, 35]. In addition to these datasets, we also used different schemas from the Spider dataset [42]: academic, college, hr, imdb, and yelp. Spider comprises approximately 200 schemas from different domains; however, there are only a handful (around 10) of schemas with more than 100 NL questions. The number of questions is important for our deep learning based solution, since it requires a certain amount of training data to train effectively. Each schema we picked from the Spider dataset is among the schemas with the most NL questions, having over 100 queries to work with. Due to the lack of sufficient database values (many schemas have no or very few database rows), we used the Spider dataset only in the supervised setup.

The statistics for each annotated dataset are shown in Table 2. In Table 2 (referring to Figure 1), entity tables refer to main tables (i.e., Movie), relation tables refer to hub tables that store connections between entity tables (i.e., cast, written_by), nonPK-FK attributes refer to attributes in any table that are neither PK nor FK (i.e., gender in the People table), and finally total tags refer to the number of unique tags extracted from that particular schema depending on the above mentioned values. The final schema tags of a particular database are determined by composing the table names and the names of the nonPK-FK attributes, in addition to COND and OTHER. In the last two rows of Table 2, we show the annotated number of NL questions, referred to as queries, and the total number of words inside these queries, referred to as tokens.
We first split the datasets into train and validation sets. We implemented the unsupervised approaches utilized in the state-of-the-art NLIDB works for the keyword mapping task as baselines to compare with DBTagger.
- tf-idf: Similar to ATHENA [32], for each unique value present in the database, we first create an exact matching index, and then perform tf-idf for tokens in the NLQ. In case of matches to multiple columns, the column with the biggest tf value is chosen as the matching. In order to handle multi-word keywords, we use n-grams of tokens up to n = 3. For relation matching, we used lexical similarity based on the Edit Distance algorithm.
- NALIR: NALIR [24] uses WordNet for relation matching. For non-relation matching, it utilizes regex or full text search queries over each database column whose type is text. In case of matches to multiple columns, the column which returns more rows as a result is chosen as the matching. For fast retrieval, we limit the number of rows returned from the query to 2000, as in the implementation of NALIR.

Table 2: Statistics of the databases used

Table 3: Accuracy scores of unsupervised baselines for relation and non-relation matching

Baseline   imdb         mas          yelp
tf-idf     0.594-0.051  0.734-0.084  0.659-0.557
NALIR      0.574-0.103  0.742-0.476  0.661-0.188
word2vec   0.625-0.093  0.275-0.379  0.677-0.269
TaBERT     NA-0.251     NA-0.094     NA-0.114
DBTagger   0.908-0.861  0.964-0.950  0.947-0.923

- word2vec:
For each unique value present in the database, cosine similarity over tokens in the NLQ is applied to find mappings using pre-trained word2vec embeddings. The matching with the highest similarity over a certain threshold is chosen.
- TaBERT: TaBERT [38] is a transformer based encoder which generates dynamic word representations (unlike word2vec) using database content. The approach also generates column encodings for a given table, which makes it an applicable keyword mapper for non-relation matching. For a particular token, the matching with maximum similarity over a certain threshold is chosen.

We categorize the keyword mapping task as relation matching and non-relation matching. The former refers to matching for table or column names, and the latter refers to matching for database values. For a fair comparison, we do not apply any pre or post processing over the NL queries or use external sources of knowledge, such as a keyword parser or metadata extractor. Results are shown in Table 3. Each pair of scores represents token-wise accuracy for relation and non-relation matching.

DBTagger outperforms the unsupervised baselines in each dataset significantly, by up to 31% and 65% compared to the best counterpart for relation and non-relation matching, respectively. For relation matching, the results of all approaches are similar to each other, except the word2vec method on the mas dataset. The main reason for such poor performance is that the mas dataset has column names such as venueName for which word2vec cannot produce word representations, which radically reduces the chances of semantic matching.

tf-idf gives promising results on the yelp dataset, whereas it fails on the imdb and mas datasets for non-relation matching. This behavior is due to the presence of ambiguous values (the same database value in multiple columns) and not being able to find a match
for values having more than three words. For the imdb dataset, none of the baselines performs well for non-relation matching. The imdb dataset has entity-like values that are comprised of multiple words, such as movie names, which makes it impossible for semantic matching approaches to generate meaningful representations to perform similarity. NALIR's approach of querying over the database has difficulties on the imdb and yelp datasets, since the approach does not resolve ambiguities without user interaction. TaBERT performs poorly on all datasets. TaBERT has its own tokenizer, which tries to deal with tokens that are out of vocabulary (OOV) by breaking the token into sub-words that have representations. OOV keywords appearing in the natural language query are therefore divided by the tokenizer into pieces, which eventually leads to unrelated word representations and therefore non-predictive similarity calculation.

Table 4: Translation Accuracy

NLIDB System        imdb   mas    yelp
NALIR               0.383  0.330  0.472
TEMPLAR (on NALIR)  0.500  0.402  0.528
DBTagger Pipeline   0.564  0.551  0.461
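For reference, the Edit Distance computation underlying the tf-idf baseline's relation matching can be sketched as follows. The `schema_elements` list is a toy subset, not the full IMDB schema:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via a rolling-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution / match
        prev = cur
    return prev[-1]

# Toy relation matching: map a query token to the lexically closest
# schema element, as the tf-idf baseline does for tables and columns.
schema_elements = ["written_by", "cast", "movie", "director"]
token = "written"
best = min(schema_elements, key=lambda s: edit_distance(token, s))
```

This lexical matching finds "written" → written_by, but as noted above it needs a hand-tuned similarity threshold and cannot cover purely semantic variations such as "featuring" → cast.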
In order to show the effectiveness of the tags output by DBTagger, we implemented a simple translation pipeline, similar to the methodology in [2]. The pipeline generates join paths for SQL translation using the shortest length path over the schema graph to cover all the mappings output by DBTagger. We count a query as inaccurate if the algorithm cannot output a joining path. We compare our pipeline with a state-of-the-art system, NALIR [24], and TEMPLAR [2], which is an enhancer over an existing NLIDB system. The results are presented in Table 4. The pipeline over DBTagger tags outperforms both systems on the imdb and mas datasets, by up to 66% and 37% compared to NALIR and TEMPLAR, respectively. For queries which do not include nested or group by constructs, such as simple select-join queries, our pipeline produces 67%, 77% and 53% translation accuracy for the imdb, mas and yelp datasets, respectively. Considering the simplicity of the translation algorithm, the results demonstrate the efficacy of the predicted outputs of DBTagger.
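The join path generation in such a pipeline reduces to shortest paths over the schema graph. The sketch below finds a shortest path between two mapped tables with BFS over a toy version of the Figure 1 graph; covering all mapped tables at once requires combining such pairwise paths, which is left out here:

```python
from collections import deque

# Toy schema graph over the Figure 1 subset; edges follow FK links between
# entity tables (movie, people) and relation tables (cast, written_by).
schema_graph = {
    "movie": ["cast", "written_by"],
    "people": ["cast", "written_by"],
    "cast": ["movie", "people"],
    "written_by": ["movie", "people"],
}

def join_path(start: str, goal: str) -> list:
    """Shortest join path between two mapped tables via BFS."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in schema_graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return []  # no join path: the translation is counted as inaccurate

path = join_path("movie", "people")
```

Here the mapped tables movie and people are connected through the relation table cast, which is exactly the join the Table 1 example query needs.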
In this experimental setup, we perform keyword mapping in a supervised fashion with different neural network architectures, along with a non-Deep Learning (DL) baseline, to evaluate architectural decisions.

Table 5: Performance of Neural Models with Different Architectures in accuracy-F1 metrics

Figure 3: Run Time and Memory Usage of state-of-the-art keyword mapping approaches ((a) Run Time, (b) Memory Usage)
- CRF: As a non-DL baseline, we use a vanilla CRF. Semantic word representations of the NLQ are fed as input to the model.
- ST_Uni: We create a two-layer stack of uni-directional GRUs, followed by a CRF as the classification layer. This model is trained on only a single task, schema tags.
- ST_Bi: Different from the previous architecture, we use bi-directional GRUs instead of uni-directional GRUs. Classification is done in the CRF layer.
- MT_Seq: In this model, training is performed on all three tasks. However, each task is trained separately. The predicted tag of the previous task is fed into the next task. To do that, 1-hot vector representations of predicted tags are concatenated with semantic word representations. We stack a bi-directional GRU with a uni-directional GRU to encode the sentence and feed the output vector to the CRF layer.
- DBTagger: This model represents the DBTagger architecture, where all tasks are trained concurrently. DBTagger also has cross-skip connections between tasks, as depicted in Figure 2.

For all the models, the same hyperparameters are used during training for a fair comparison, as explained in Section 3.2. The results are shown in Table 5. Each pair of scores represents the accuracy and F1 measures, respectively. DBTagger performs better than the other supervised architectures in six different datasets, in both accuracy and F1. Especially for the yelp and college datasets the performance improvement is remarkable, up to around 4.5% and 5%, respectively. The vanilla CRF performs well overall (best in two datasets), which signifies its role in the architecture for the sequence tagging problem. ST_Bi performs better than ST_Uni in all datasets, which shows the positive impact of bi-directional GRUs. Compared to single task models, multi task models perform better in all datasets. Except for the mas dataset on the F1 metric, DBTagger produces better tags than the other multi task model, MT_Seq, in which tasks are trained separately.
Efficiency is one of the most important properties of a good keyword mapper to be deployable in online interfaces. Therefore, the run time performance of the keyword mapping approaches mentioned in Section 3.3 is also evaluated.
- NALIR: We analyze both querying-over-database-column approaches used in NALIR [24], named q_regex and q_ftext, which use the like and match against operators, respectively.
- tf-idf: Similar to ATHENA [32], we created an exact matching index beforehand, an inverted index named inv_index, to avoid querying over the database.
- word2vec: Many works such as SQLizer [35] make use of pre-trained word embeddings to find mappings, which requires keeping the model in memory to perform similarity calculations.
- tabert_on: TaBERT requires database content (a content snapshot) to generate encodings for both NL tokens and columns. We call this setup tabert online, where the model generates the content snapshot to perform mapping when the query arrives.
- tabert_off: We also use TaBERT in an offline setup. For each table, the database content is generated beforehand to perform encodings. In this setup, we keep the content in memory to serve the query faster.

We measured the time elapsed for a single query to extract tags and the memory consumption needed to perform mapping for each approach. We also ran each experiment with different numbers of row values to capture the impact of the database size. Figure 3 presents the run time and memory usage analysis of the keyword mappers. DBTagger outputs the tags faster than any other baseline and is scalable to much bigger databases. However, q_regex, q_ftext, tabert_on and word2vec do not seem applicable for bigger tables having more than 10000 rows. The tf-idf technique has a nice balance between run time and memory usage, but it is limited in terms of effectiveness (Table 3). tabert_off performs the tagging in a reasonable time, yet it requires huge memory consumption, especially for bigger tables.
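A measurement harness of this kind can be sketched in a few lines using Python's standard timing and memory-tracing facilities. The stand-in mapper below is a trivial dictionary lookup; the real experiments would of course plug in DBTagger and the baselines:

```python
import time
import tracemalloc

def measure(mapper, query, repeats: int = 100):
    """Average wall-clock latency (ms) and peak traced memory (bytes)
    of a keyword mapper on a single query."""
    tracemalloc.start()
    start = time.perf_counter()
    for _ in range(repeats):
        mapper(query)
    elapsed_ms = (time.perf_counter() - start) * 1000.0 / repeats
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed_ms, peak

# A trivial stand-in mapper used only to exercise the harness.
lookup = {"movie": "TABLE", "acted": "TABLEREF"}
mapper = lambda q: [lookup.get(t, "O") for t in q.split()]
latency_ms, peak_bytes = measure(mapper, "who acted in the movie")
```

Note that `tracemalloc` only accounts for Python-level allocations; measuring the resident memory of an external model (e.g., a loaded word2vec table) would need an OS-level tool instead.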
Although the very first effort [14] of providing a natural language interface to databases dates back multiple decades, the popularity of the problem has increased due to some recent pipeline-based systems proposed by the database community, such as SODA [4], NALIR [24], ATHENA [32] and SQLizer [35]. Recently, end-to-end approaches utilizing encoder-decoder based architectures [3, 5, 12, 18, 33, 34, 37, 40, 41, 44] in deep learning have become more popular for dealing with the translation problem. Seq2SQL [44] uses a Bi-LSTM to encode a sequence that contains the columns of the related table, SQL keywords and the question. The study [44] also provided a dataset called
WikiSQL to the research community working on the NLIDB problem for evaluation. SQLNet [34] defines a sequence-to-set approach to eliminate the reinforcement learning process of Seq2SQL. In another study using the WikiSQL dataset, Yavuz et al. [37] employ a process called candidate generation to create keyword mappings to be used specifically in the where clause of the SQL translation. All of the proposed deep learning based methods use pre-trained word embedding vectors as input to the model. Therefore, keyword mapping is implicitly handled by the model. However, TypeSQL [40] tries to enrich the input data by augmenting entity tags, performing similarity checks over the database or a knowledge base. Similarly, [18] tries to find possible constant values in the query by performing similarity matching. Due to the limited nature of the WikiSQL dataset, having a single table for each database, another important dataset called Spider [42] was provided to the community. Consequently, many recently proposed studies [5, 12, 33, 38, 41] have evaluated their solutions on the Spider dataset. Different from the others, TaBERT [38], a transformer based encoder, makes use of database content to generate dynamic representations along with contextual encodings to represent database columns. For a comprehensive survey covering existing solutions in NLIDB, the reader can refer to [1, 21]. Similar to our work, Baik et al. [2] propose TEMPLAR, to be augmented on top of existing NLIDB systems to improve keyword mapping, and thereby translation, using query logs.
However, TEMPLAR is not a standalone mapper, since it requires multiple preliminaries from an NLIDB system to function properly, including parsed keywords and the metadata associated with each keyword, which are the main challenges posed by the keyword mapping problem. Therefore, the mapper cannot be plugged into NLIDB pipelines that do not perform detailed keyword recognition and parsing. Different from the previous works, DBTagger is an end-to-end keyword mapper, which does not require any preprocessing or external source of knowledge. Also, to the best of our knowledge, our work is the first study utilizing deep neural networks in a supervised learning setup for keyword mapping.
In the NLP community, neural network architectures have been utilized in many research problems. As a pioneer in the field, Collobert et al. [9] proposed a Convolutional Neural Network (CNN) based architecture with a CRF layer on top to deal with the sequence tagging problem. Yao et al. [36] applied LSTMs to sequence tagging without having a CRF as the classification layer. A bi-directional RNN structure was first employed for a speech recognition problem in [11]. Later, instead of simple RNN networks, a bi-directional LSTM was adopted and employed by Huang et al. [19] for the NER problem. Following that study, Lample et al. [23] proposed a similar architecture with the inclusion of word and character embeddings. They used pre-trained word embeddings along with character-level embeddings to extract the input matrix fed into the network. Their study stands as the state of the art in sequence tagging problems in NLP. Similar to [23], Ma and Hovy [26] proposed a neural architecture where character embedding is done through a CNN instead of an LSTM. For a comprehensive survey discussing deep learning solutions for research problems in the NLP community, [39] is a great read.
In this paper, we present DBTagger, a keyword mapper to be used in translation pipelines in NLIDB systems. DBTagger is a standalone system which does not require any preprocessing or external knowledge such as parser or metadata preliminaries. Inspired by sequence tagging architectures used for well-known problems such as POS tagging in the NLP community, DBTagger utilizes a deep neural architecture based on bi-directional GRUs. We exploit the observation that POS tags of keywords are related to schema tags by applying multi-task learning in our architecture. DBTagger provides the best accuracy results on three publicly available databases and five schemas in the Spider dataset, producing keyword tags with 92.4% accuracy on average over all the datasets within 3 milliseconds, which is up to 10000 times faster than unsupervised approaches. Our results also show that DBTagger is scalable to large databases containing millions of rows. We believe that DBTagger can be applied in existing NLIDB systems as the first step to improve translation, especially in pipeline-based systems. For deep learning based approaches, DBTagger can be augmented on the neural network to enrich the input query before feeding it into the network.
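As a rough illustration of the multi-task setup summarized above, the per-task tagging losses (e.g., for POS and schema tags) can be combined into a single weighted objective that is backpropagated through a shared encoder. The task names, weights, and loss values below are purely illustrative, not DBTagger's actual configuration:

```python
# Hypothetical sketch: one shared encoder feeds several tagging heads,
# and their losses are summed with per-task weights into one objective.

def combined_loss(task_losses, weights):
    """Weighted sum of per-task losses for joint (multi-task) training."""
    return sum(weights[t] * loss for t, loss in task_losses.items())

task_losses = {"pos": 0.8, "type": 0.5, "schema": 1.2}  # per-head losses
weights = {"pos": 0.5, "type": 0.75, "schema": 1.0}     # illustrative weights
print(round(combined_loss(task_losses, weights), 3))    # 1.975
```

Training the tasks jointly, rather than separately as in the MT_Seq baseline, lets the auxiliary POS signal shape the shared representation used for schema tagging.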
REFERENCES
[1] Katrin Affolter, Kurt Stockinger, and Abraham Bernstein. 2019. A comparative survey of recent natural language interfaces for databases. The VLDB Journal.
[3] Fuat Basik, Benjamin Hättasch, Amir Ilkhechi, Arif Usta, Shekar Ramaswamy, Prasetya Utama, Nathaniel Weir, Carsten Binnig, and Ugur Cetintemel. 2018. DBPal: A Learned NL-Interface for Databases. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). 1765–1768.
[4] Lukas Blunschi, Claudio Jossen, Donald Kossmann, Magdalini Mori, and Kurt Stockinger. 2012. SODA: Generating SQL for Business Users. Proc. VLDB Endow. 5, 10 (2012).
[5] Ben Bogin, Jonathan Berant, and Matt Gardner. 2019. Representing Schema Structure with Graph Neural Networks for Text-to-SQL Parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL '19). 4560–4565.
[6] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics.
[7] Machine Learning 28, 1 (1997), 41–75.
[8] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2015. Gated Feedback Recurrent Neural Networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML '15). 2067–2075.
[9] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research 12 (2011), 2493–2537.
[10] Timothy Dozat. 2016. Incorporating Nesterov Momentum into Adam. In International Conference on Learning Representations Workshop.
[11] A. Graves, A. Mohamed, and G. Hinton. 2013. Speech recognition with deep recurrent neural networks. 6645–6649.
[12] Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL '19). 4524–4535.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '16). 770–778.
[14] Gary G. Hendrix, Earl D. Sacerdoti, Daniel Sagalowicz, and Jonathan Slocum. 1978. Developing a Natural Language Interface to Complex Data. ACM Trans. Database Syst. 3, 2 (1978), 105–147.
[15] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. ArXiv abs/1207.0580 (2012).
[16] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[17] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '17). 4700–4708.
[18] Po-Sen Huang, Chenglong Wang, Rishabh Singh, Wen-tau Yih, and Xiaodong He. 2018. Natural Language to Structured Query Generation via Meta-Learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (NAACL '18). 732–738.
[19] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging. ArXiv abs/1508.01991 (2015).
[20] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. 2017. Learning a Neural Semantic Parser from User Feedback. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL '17). 963–973.
[21] Hyeonji Kim, Byeong-Hoon So, Wook-Shin Han, and Hongrae Lee. 2020. Natural language to SQL: Where are we today? Proceedings of the VLDB Endowment.
[22] In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01). 282–289.
[23] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL '16). 260–270.
[24] Fei Li and H. V. Jagadish. 2014. Constructing an Interactive Natural Language Interface for Relational Databases. Proc. VLDB Endow. 8, 1 (2014), 73–84.
[25] Yunyao Li and Davood Rafiei. 2017. Natural Language Data Management and Interfaces: Recent Development and Open Challenges. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD '17). 1765–1770.
[26] Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL '16). 1064–1074.
[27] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Association for Computational Linguistics (ACL) System Demonstrations. 55–60.
[28] Tomas Mikolov, Greg S. Corrado, Kai Chen, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the International Conference on Learning Representations (ICLR '13). 1–12.
[29] George A. Miller. 1995. WordNet: A Lexical Database for English. Commun. ACM 38, 11 (1995), 39–41.
[30] Fatma Özcan, Abdul Quamar, Jaydeep Sen, Chuan Lei, and Vasilis Efthymiou. 2020. State of the Art and Open Challenges in Natural Language Interfaces to Data. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD '20). 2629–2636.
[31] Ana-Maria Popescu, Oren Etzioni, and Henry Kautz. 2003. Towards a Theory of Natural Language Interfaces to Databases. In Proceedings of the 8th International Conference on Intelligent User Interfaces (IUI '03). 149–157.
[32] Diptikalyan Saha, Avrilia Floratou, Karthik Sankaranarayanan, Umar Farooq Minhas, Ashish R. Mittal, and Fatma Özcan. 2016. ATHENA: An Ontology-Driven System for Natural Language Querying over Relational Data Stores. Proc. VLDB Endow. 9, 12 (2016).
[33] Nathaniel Weir, Prasetya Utama, Alex Galakatos, Andrew Crotty, Amir Ilkhechi, Shekar Ramaswamy, Rohin Bhushan, Nadja Geisler, Benjamin Hättasch, Steffen Eger, Ugur Cetintemel, and Carsten Binnig. 2020. DBPal: A Fully Pluggable NL2SQL Training Pipeline. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD '20). 2347–2361.
[34] Xiaojun Xu, Chang Liu, and Dawn Song. 2017. SQLNet: Generating structured queries from natural language without reinforcement learning. arXiv preprint arXiv:1711.04436 (2017).
[35] Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig. 2017. SQLizer: Query Synthesis from Natural Language. Proceedings of the ACM on Programming Languages 1, OOPSLA (2017), 1–26.
[36] Kaisheng Yao, Baolin Peng, Yu Zhang, Dong Yu, Geoffrey Zweig, and Yangyang Shi. 2014. Spoken language understanding using long short-term memory neural networks. 189–194.
[37] Semih Yavuz, Izzeddin Gur, Yu Su, and Xifeng Yan. 2018. What It Takes to Achieve 100% Condition Accuracy on WikiSQL. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP '18). 1702–1711.
[38] Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL '20). 8413–8426.
[39] Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2018. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine 13, 3 (2018), 55–75.
[40] Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir Radev. 2018. TypeSQL: Knowledge-Based Type-Aware Neural Text-to-SQL Generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (NAACL '18). 588–594.
[41] Tao Yu, Michihiro Yasunaga, Kai Yang, Rui Zhang, Dongxu Wang, Zifan Li, and Dragomir Radev. 2018. SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP '18). 1653–1663.
[42] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP '18). 3911–3921.
[43] Matthew D. Zeiler. 2012. ADADELTA: An Adaptive Learning Rate Method. ArXiv abs/1212.5701 (2012).
[44] Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103 (2017).