An Investigation Between Schema Linking and Text-to-SQL Performance
Yasufumi Taniguchi, Hiroki Nakayama, Kubo Takahiro, Jun Suzuki
TIS, Inc.    Tohoku University / RIKEN
{taniguchi.yasufumi,nakayama.hiroki,kubo.takahiro}@    [email protected]

Abstract
Text-to-SQL is a crucial task toward developing methods for understanding natural language by computers. Recent neural approaches deliver excellent performance; however, models that are difficult to interpret inhibit future developments. Hence, this study aims to provide a better approach toward the interpretation of neural models. We hypothesize that the internal behavior of models at hand becomes much easier to analyze if we identify the detailed performance of schema linking simultaneously as additional information alongside the text-to-SQL performance. We provide the ground-truth annotation of schema linking information onto the Spider dataset. We demonstrate the usefulness of the annotated data and how to analyze the current state-of-the-art neural models.

Introduction

Text-to-SQL is the task of converting a question in natural language into SQL (the logical form). Attempts to solve text-to-SQL are crucial for establishing methodologies for understanding natural language by computers. Currently, neural models are widely used for tackling text-to-SQL (Choi et al., 2020; Zhang et al., 2019; Bogin et al., 2019b; Guo et al., 2019; Wang et al., 2020). However, state-of-the-art neural models on the Spider dataset (Yu et al., 2018b), the current mainstream text-to-SQL benchmark, yield only 60–65 exact matching accuracy. This indicates that current technologies leave immense room for improvement before commercialization and utilization in real-world systems. A severe drawback of the neural approach is the difficulty of analyzing how models capture the clues needed to solve a task. Hence, researchers often struggle to decide which direction to focus on to obtain further improvement. This paper focuses on this problem and considers a methodology that can reduce the enormous effort required to analyze model behaviors and find the next direction. For this goal, we focus on schema linking. Schema linking is a special case of entity linking and a method to link the phrases in a given question with the column names or the table names in the database schema. Guo et al. (2019) and Wang et al. (2020) show that schema linking is an essential module for solving the text-to-SQL task effectively. We hypothesize that if the detailed performance of schema linking is known simultaneously as additional information for text-to-SQL performance, then the analysis of the internal behavior of the models at hand becomes easier.

To investigate the above-mentioned hypothesis and offer a better analysis of text-to-SQL models, we annotate ground-truth schema linking information onto the Spider dataset (Yu et al., 2018b). (Our schema linking annotation on the Spider data is publicly available at: https://github.com/yasufumy/spider-schema-linking-dataset.) The experiments reveal the usefulness of the schema linking information in the annotated dataset for understanding model behaviors. We also demonstrate how the current state-of-the-art neural models can be analyzed by comparing the schema linking performance with the text-to-SQL performance.

Related work

Text-to-SQL dataset
There exist many benchmark datasets, such as WikiSQL (Zhong et al., 2017), Advising (Finegan-Dollak et al., 2018), and Spider (Yu et al., 2018b). WikiSQL is the largest benchmark dataset in the text-to-SQL domain. However, Finegan-Dollak et al. (2018) pointed out that WikiSQL includes almost the same SQL queries in the training and test sets, even though the task aims to generate correct SQL for unseen questions. They proposed Advising (Finegan-Dollak et al., 2018), which does not include the same SQL in the training and test sets, but it still consists only of SQL with limited clauses from a single domain. Yu et al. (2018b) proposed the Spider dataset, which includes complicated SQL with many clauses and 138 different domains. Currently, Spider is considered the most challenging dataset in the text-to-SQL field.

Schema linking
In text-to-SQL, schema linking is the task of linking a phrase in the given question to a table name or a column name. The methods used for schema linking are often categorized as explicit or implicit approaches. The explicit approach is treated as the first step of the text-to-SQL pipeline, and thus the linking information is directly available (Yu et al., 2018a; Guo et al., 2019; Wang et al., 2020). In contrast, the implicit approach is a module included inside text-to-SQL models, and thus the linking is a black box during the process. To obtain linking information in the implicit approach, one mostly inspects the attention module (Bahdanau et al., 2015) from question tokens to the database schema with which such models are usually equipped (Krishnamurthy et al., 2017; Bogin et al., 2019a,b; Zhang et al., 2019; Dong et al., 2019). In this paper, we focus on the explicit approach for a clear discussion.
Initial dataset
The Spider dataset (Yu et al., 2018b) is a large-scale, human-annotated, cross-domain text-to-SQL dataset. It consists of 8,625 training examples, 1,034 development examples, and 2,147 test examples. Moreover, it contains 200 databases, and no database overlaps between the training, development, and test sets. We annotate ground-truth schema linking information onto the Spider dataset. Note that we annotate only the development set, not the training and test sets. This is because this study aims to provide a detailed analysis tool for text-to-SQL models, mainly for investigating the behavior of models and seeking directions for subsequent developments, not to train models for further improving performance. Moreover, the test set of the Spider dataset is not publicly available; it is used only in the leaderboard system to prevent the test-set tuning that often arises in the evaluation phase.
Annotation detail and statistics
The annotation is performed by two software engineers who are familiar with SQL. They use Doccano (https://github.com/doccano/doccano) as the annotation tool. Figure 1 shows an annotation example; several other examples are given in Appendix E. Table 1 shows the statistics of the annotated data.

Figure 1: Annotated example.

Table 1: Statistics of the annotated data for each sentence.

Total  (l = 1)   2,359   8   0   2.28   1.229
Table  (l = 1)   1,031   5   0   1.00   0.764
Column (l = 1)   1,328   0   0   1.28   0.948
Total  (l ≥ 2)     718   4   0   0.69   0.851
Table  (l ≥ 2)     192   3   0   0.19   0.424
Column (l ≥ 2)     526   0   0   0.51   0.751
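Each annotated record pairs a question with character-offset labels pointing to the linked schema items, as in Figure 1 (and the examples in Appendix E). The following Python sketch reads such records and recovers the linked phrases; it assumes a JSON-lines export with "question" and "labels" fields and uses a hypothetical file name.

import json

def load_links(path):
    # Yield (phrase, schema item) pairs from a JSON-lines annotation file.
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            question = record["question"]
            for start, end, item in record["labels"]:
                # Linked phrases are recovered from character offsets.
                yield question[start:end], item

for phrase, item in load_links("schema_linking_dev.jsonl"):  # hypothetical file
    print(f"{phrase!r} -> {item}")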
Quality check

For the annotation quality check, we validate the annotation agreement between the two annotators, who independently annotated the same 100 examples. The annotation agreement in terms of Cohen's kappa is 0.764. Because unannotated tokens unfairly increase the kappa score in sequence segmentation tasks (Brandsen et al., 2020), we follow Brandsen et al. (2020) and calculate the kappa score only on tokens that at least one annotator labeled. According to Landis and Koch (1977), a kappa value in the range 0.61–0.80 is categorized as substantial agreement. Moreover, the F1 score between the two annotators' annotations is 87.5; we calculate the F1 score as suggested in several previous studies (Brandsen et al., 2020; Grouin et al., 2011; Alex et al., 2010). According to these results, we believe that our annotated schema linking data are highly reliable as the ground truth.
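The following is a minimal sketch of the agreement computation described above, assuming token-level labels from the two annotators; as in the procedure above, Cohen's kappa is restricted to tokens that at least one annotator labeled. The label scheme in the example is hypothetical.

from sklearn.metrics import cohen_kappa_score

def agreement_kappa(labels_a, labels_b, outside="O"):
    # Cohen's kappa restricted to tokens labeled by at least one annotator.
    assert len(labels_a) == len(labels_b)
    pairs = [(a, b) for a, b in zip(labels_a, labels_b)
             if a != outside or b != outside]
    kept_a, kept_b = zip(*pairs)
    return cohen_kappa_score(kept_a, kept_b)

# Hypothetical example: two annotators labeling the same five tokens.
ann_a = ["O", "B-Table", "O", "B-Column", "I-Column"]
ann_b = ["O", "B-Table", "O", "B-Column", "O"]
print(agreement_kappa(ann_a, ann_b))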
Data split

We split the annotated data into two distinct sets and use one as the development set and the other as the test set. Hereafter, we refer to these new sets as the development set and the test set, respectively; it is crucial to note that this paper does not deal with the true test data in the Spider dataset. Consequently, the development and test sets contain 517 examples each. We confirmed that there is no database overlap between the new development and test sets; this is identical to the configuration of the original Spider dataset.

name                alias   explanation
w/o uni-gram-(a)    a       The uni-grams are ignored.
w/o uni-gram-(b)    b       The uni-grams are ignored. The partial matches are ignored.
w/o column-match    c       The column names are ignored.
w/o table-match     d       The table names are ignored.
only uni-gram-(a)   e       Only the uni-grams are considered.
only uni-gram-(b)   f       Only the uni-grams are considered. However, the partial matches are ignored.
random              g       Randomly linking.
w/o all             h       No schema linking.
Table 2: Schema linking methods. We use the aliases in later experiments.
Table 3: Schema linking and text-to-SQL results. EM: exact match, Pre.: precision, Rec.: recall.

Evaluation metric
Schema linking is a task similar to named entity recognition and relation extraction (Marsh and Perzanowski, 1998). Therefore, we calculate precision, recall, and F1-score (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003) to evaluate the schema linking performance.
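As a minimal sketch of this evaluation, each link can be treated as a (start, end, schema item) triple and scored by exact span matching, analogous to CoNLL-style named entity evaluation. The span representation and the example values below are assumptions, not the official scoring code.

def prf(gold_links, pred_links):
    # Span-level precision, recall, and F1 over exact (start, end, item) matches.
    gold, pred = set(gold_links), set(pred_links)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example for the question "How many countries exist?".
gold = [(9, 18, "COUNTRIES")]
pred = [(9, 18, "COUNTRIES"), (19, 24, "car_makers")]
print(prf(gold, pred))  # (0.5, 1.0, 0.666...)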
Experiments

This section demonstrates how the proposed annotated dataset can be used to understand model behavior and determine the next directions for further improvement. We use the Spider dataset to evaluate text-to-SQL performance, measured by the exact matching (EM) accuracy introduced in Yu et al. (2018b) and computed with their official evaluation script. Moreover, we evaluate the schema linking performance by F1-score, as explained in the previous section.

Table 4: Schema linking and text-to-SQL results.
Table 5: Evaluation on the annotated dataset (anno) and on a mix of the annotations and the estimated predictions (mix).
Baseline models
We select IRNet (Guo et al., 2019) and RAT-SQL (Wang et al., 2020) as the baseline models of the experiments to reveal the effectiveness of the proposed dataset, and we refer to them as IRNet and RAT-SQL, respectively. It should be noted that both models employ an explicit approach whose first step is schema linking; thus, their settings are suitable for evaluating the usefulness of the proposed annotated data. However, we also emphasize that their schema linking methods differ from each other, although both consist of combinations of similar rules: IRNet maps a phrase to a single table or column, whereas RAT-SQL maps a phrase to multiple tables or columns. Further, IRNet and RAT-SQL mark top-line scores on the leaderboard of the Spider dataset (https://yale-lily.github.io/spider); specifically, RAT-SQL is the current state-of-the-art model. These facts support using them as baseline models in our experiments. We use the same hyper-parameter values for IRNet and RAT-SQL as in their original papers, i.e., Guo et al. (2019) and Wang et al. (2020). See Appendix C for detailed descriptions of IRNet and RAT-SQL.

Investigations

The schema linking methods used in IRNet and RAT-SQL follow a rule-based approach that allows easier interpretation of model behavior; see Appendix B for the rules used in their methods. To investigate the usefulness of the schema linking information, we conduct fine-grained schema linking ablation experiments. Through these experiments, we explore the general behaviors of text-to-SQL models when the performance of schema linking changes. To accomplish this, we prepare the eight methods shown in Table 2.
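As an illustration of how such an ablation can be realized, the sketch below drops uni-gram links from a linker's output, in the spirit of "w/o uni-gram-(a)" in Table 2. The link representation (a span of token indices mapped to a class) is an assumption, matching the sketch in Appendix B rather than the exact implementations.

def drop_unigram_links(links):
    # Remove links whose phrase covers exactly one question token.
    return {span: cls for span, cls in links.items() if span[1] - span[0] > 1}

links = {(2, 3): "table", (5, 7): "column"}   # hypothetical linker output
print(drop_unigram_links(links))              # {(5, 7): 'column'}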
Behaviors of IRNet and RAT-SQL
Table 3 shows the schema linking and text-to-SQL performance of IRNet and RAT-SQL. The Spider EM of RAT-SQL is significantly better than that of IRNet, whereas the schema linking F1 of RAT-SQL is much worse than that of IRNet. This mismatch arises from the difference in schema linking strategy: RAT-SQL prioritizes recall over precision, as presented in Table 3.
Correlation
Table 4 shows a type of ablation study in which we gradually decrease the F1 scores by eliminating the schema linking rules. Further, Table 5 shows the simulated evaluation results when we obtain the perfect prediction (F1 = 100), or better predictions than those of the original IRNet and RAT-SQL; we obtain "anno" from the human annotation, and "mix" by randomly choosing each example's links from either the human annotation or the original schema linking result. We observe a strong correlation between the schema linking F1 and the Spider EM for IRNet; the correlation coefficient between them is 0.937. This indicates that schema linking considerably affects the final Spider EM score. Thus, we can roughly estimate the EM score from the schema linking F1 without performing the entire training and evaluation procedure of IRNet. Unlike IRNet, the Spider EM of RAT-SQL does not seem to be strongly correlated with the schema linking F1; the correlation coefficient between them is 0.737. However, because RAT-SQL prioritizes recall over precision, correlating the Spider EM with the number of true positives instead raises the coefficient to 0.81. Therefore, RAT-SQL still shows a strong correlation with the schema linking results. Additionally, the Spider EM of IRNet-anno is higher than that of the original IRNet (62.5 vs. 58.8). Similarly, RAT-SQL-anno is higher than the original RAT-SQL (69.6 vs. 69.2). These results also support the reliability of the proposed annotation, as the performance gain should be derived from the correct (better) schema linking.
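Such a correlation can be reproduced with a short script; the sketch below computes Pearson's r between the schema linking F1 and the Spider EM across ablation settings. The paired values are placeholders for illustration, not the measured results behind Table 4.

from scipy.stats import pearsonr

schema_linking_f1 = [88.0, 80.0, 72.0, 65.0, 50.0, 30.0]  # placeholder values
spider_em         = [60.0, 55.0, 50.0, 46.0, 35.0, 20.0]  # placeholder values

r, p = pearsonr(schema_linking_f1, spider_em)
print(f"Pearson r = {r:.3f}, p = {p:.3g}")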
Question: How many countries exist?
Gold:     SELECT count(*) FROM COUNTRIES;
IRNet-f:  SELECT count(*) FROM COUNTRIES;
IRNet-h:  SELECT count(T1.Country) FROM car_makers AS T1;

Figure 2: Example of IRNet outputs.
Error analysis
Figure 2 shows actual examples of IRNet outputs; the corresponding examples from RAT-SQL are presented in Appendix D because of the space limitation. IRNet-f successfully generates the correct SQL query, while IRNet-h does not. In the absence of the schema linking annotation, it is relatively difficult to determine the cause of this failure. However, using the schema linking annotation, we can easily find the reason for the failure of IRNet-h: it failed to link "countries" in the question to the table name. This is a simple example of leveraging the proposed annotation for analyzing model behaviors. We believe there are many more ways to utilize the proposed annotation to further analyze model behaviors.

Conclusion

Schema linking is an essential module for performing the text-to-SQL task effectively. We annotated schema linking information onto the Spider dataset. Then, we investigated the usefulness of the proposed annotation for understanding the behaviors of text-to-SQL models and for seeking the next directions for further development. As a demonstration, we selected IRNet and RAT-SQL, the state-of-the-art methods on the Spider data, and evaluated both the schema linking and the Spider EM scores. The results showed strong correlations between the schema linking F1 and the Spider EM scores for IRNet, and between the number of true positives and the Spider EM scores for RAT-SQL. These correlations may offer a rough estimate of the final Spider exact match scores without training the models. We hope the proposed schema linking annotation helps future studies on the text-to-SQL task.

References
Bea Alex, Claire Grover, Rongzhou Shen, and Mijail Kabadjov. 2010. Agile corpus annotation in practice: An overview of manual and automatic annotation of CVs. In Proceedings of the Fourth Linguistic Annotation Workshop, pages 29-37, Uppsala, Sweden. Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.

Ben Bogin, Jonathan Berant, and Matt Gardner. 2019a. Representing schema structure with graph neural networks for text-to-SQL parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4560-4565, Florence, Italy. Association for Computational Linguistics.

Ben Bogin, Matt Gardner, and Jonathan Berant. 2019b. Global reasoning over database structures for text-to-SQL parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3659-3664, Hong Kong, China. Association for Computational Linguistics.

Alex Brandsen, Suzan Verberne, Milco Wansleeben, and Karsten Lambers. 2020. Creating a dataset for named entity recognition in the archaeology domain. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4573-4577, Marseille, France. European Language Resources Association.

DongHyun Choi, Myeong Cheol Shin, EungGyun Kim, and Dong Ryeol Shin. 2020. RYANSQL: Recursively applying sketch-based slot fillings for complex text-to-SQL in cross-domain databases. CoRR, abs/2004.03125.

Zhen Dong, Shizhao Sun, Hongzhi Liu, Jian-Guang Lou, and Dongmei Zhang. 2019. Data-anonymous encoding for text-to-SQL generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5405-5414, Hong Kong, China. Association for Computational Linguistics.

Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. 2018. Improving text-to-SQL evaluation methodology. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 351-360, Melbourne, Australia. Association for Computational Linguistics.

Cyril Grouin, Sophie Rosset, Pierre Zweigenbaum, Karën Fort, Olivier Galibert, and Ludovic Quintard. 2011. Proposal for an extension of traditional named entities: From guidelines to evaluation, an overview. In Proceedings of the 5th Linguistic Annotation Workshop, pages 92-100, Portland, Oregon, USA. Association for Computational Linguistics.

Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards complex text-to-SQL in cross-domain database with intermediate representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4524-4535, Florence, Italy. Association for Computational Linguistics.

Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gardner. 2017. Neural semantic parsing with type constraints for semi-structured tables. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1516-1526, Copenhagen, Denmark. Association for Computational Linguistics.

J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33(1).

Elaine Marsh and Dennis Perzanowski. 1998. MUC-7 evaluation of IE technology: Overview of results. In Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29 - May 1, 1998.

Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002).

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142-147.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR, abs/1706.03762.

Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2020. RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7567-7578, Online. Association for Computational Linguistics.

Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir Radev. 2018a. TypeSQL: Knowledge-based type-aware neural text-to-SQL generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 588-594, New Orleans, Louisiana. Association for Computational Linguistics.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018b. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911-3921, Brussels, Belgium. Association for Computational Linguistics.

Rui Zhang, Tao Yu, Heyang Er, Sungrok Shim, Eric Xue, Xi Victoria Lin, Tianze Shi, Caiming Xiong, Richard Socher, and Dragomir Radev. 2019. Editing-based SQL query generation for cross-domain context-dependent questions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5338-5349, Hong Kong, China. Association for Computational Linguistics.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103.
A Spider data
Question: What are the names and the descriptions for all the sections?
SQL:      SELECT section_name, section_description FROM Sections;

Figure 3: Example pair of a question and SQL query from Spider (Yu et al., 2018b).

Figure 3 shows an example from the dataset. A single data sample consists of a natural language question and the corresponding SQL query.
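For reference, a minimal Python sketch for loading such examples is shown below; it assumes the distributed Spider JSON files expose "db_id", "question", and "query" fields and uses a hypothetical file path.

import json

with open("spider/dev.json", encoding="utf-8") as f:  # hypothetical path
    examples = json.load(f)

for ex in examples[:3]:
    # Each example pairs a natural language question with its SQL query.
    print(ex["db_id"], "|", ex["question"], "|", ex["query"])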
B Details of schema linking methods
The schema linking methods in IRNet and RAT-SQL both classify word n-grams in the question into three classes, namely table, column, or NONE. They enumerate the word n-grams of length 1-6 in the question and classify longer n-grams first. During the scan of the n-grams, an n-gram is classified as column or table when it matches a column name or a table name exactly or partially. If the n-gram matches both a column and a table, column is prioritized. If the n-gram matches nothing, it is classified as NONE.
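The following is a minimal sketch of this procedure; partial matching is simplified here to substring containment, which is an assumption rather than the exact rule set used by IRNet or RAT-SQL.

def link_schema(tokens, tables, columns):
    # Scan question n-grams from length 6 down to 1, longest first.
    links = {}          # maps (start, end) token spans to a class
    covered = set()     # token indices already linked by a longer n-gram
    for n in range(6, 0, -1):
        for i in range(len(tokens) - n + 1):
            span = set(range(i, i + n))
            if covered & span:
                continue
            phrase = " ".join(tokens[i:i + n]).lower()
            if any(phrase == c or phrase in c for c in columns):
                links[(i, i + n)] = "column"   # column wins over table
            elif any(phrase == t or phrase in t for t in tables):
                links[(i, i + n)] = "table"
            else:
                continue                       # unmatched n-grams stay NONE
            covered |= span
    return links

# Hypothetical example.
print(link_schema("how many countries exist".split(),
                  tables={"countries", "car makers"},
                  columns={"country name"}))   # {(2, 3): 'table'}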
C Details of baseline models
IRNet is a model that successfully utilizes schema linking. IRNet has three stages for generating the SQL query. The first stage is the schema linking explained above. The second stage is the main part of the model: it generates SemQL, an intermediate representation between the question and the SQL query. SemQL has a much simpler grammar than SQL. The last stage converts SemQL to SQL.

RAT-SQL is the first-place model on the Spider leaderboard. RAT-SQL also uses the schema linking technique proposed in IRNet. In RAT-SQL, Wang et al. (2020) proposed relation-aware self-attention, which effectively encodes the directed graph of the database schema. Their approach uses the self-attention mechanism (Vaswani et al., 2017) to combine the phrases in the database schema and the phrases in the question.
D Output examples
We show the RAT-SQL outputs in Figure 4. Both models fail to generate the gold SQL query. However, RAT-SQL successfully predicts the SELECT clause, while RAT-SQL-f does not. This is because the schema linking of RAT-SQL can capture the bi-gram matches.
Question: Find the first name, country code and birth date of the winner who has the highest rank points in all matches.

Gold:
SELECT T1.first_name, T1.country_code, T1.birth_date
FROM players AS T1
JOIN matches AS T2 ON T1.player_id = T2.winner_id
ORDER BY T2.winner_rank_points DESC
LIMIT 1

RAT-SQL:
SELECT players.first_name, players.country_code, players.birth_date
FROM players
JOIN matches ON players.player_id = matches.loser_id
ORDER BY matches.winner_ht ASC
LIMIT 1

RAT-SQL-f:
SELECT players.first_name, rankings.ranking_date, matches.tourney_date
FROM players
JOIN matches ON players.player_id = matches.loser_id
JOIN rankings ON players.player_id = rankings.player_id
ORDER BY rankings.ranking ASC
LIMIT 1

Figure 4: Example of RAT-SQL outputs.
E Annotated dataset examples
We show randomly picked examples from our annotated dataset in Figure 5.

{"question": "Count the number of templates.", "labels": [[20, 29, "Templates"]]}
{"question": "Which airline has abbreviation 'UAL'?", "labels": [[6, 13, "airlines.Airline"], [18, 30, "airlines.Abbreviation"]]}
{"question": "Show the names of high schoolers who have likes, and numbers of likes for each.", "labels": [[9, 14, "Highschooler.name"], [18, 32, "Highschooler"], [42, 47, "Likes"], [64, 69, "Likes"]]}
{"question": "How many orchestras does each record company manage?", "labels": [[9, 19, "orchestra"], [30, 44, "orchestra.Record_Company"]]}
{"question": "What is the first name of every student who has a dog but does not have a cat?", "labels": [[12, 22, "Student.Fname"], [32, 39, "Student"]]}
{"question": "Show different citizenship of singers and the number of singers of each citizenship.", "labels": [[15, 26, "singer.Citizenship"], [30, 37, "singer"], [56, 63, "singer"], [72, 83, "singer.Citizenship"]]}
{"question": "What are 3 most highly rated episodes in the TV series table and what were those ratings?", "labels": [[23, 28, "TV_series.Rating"], [29, 37, "TV_series.Episode"], [45, 54, "TV_series"], [81, 88, "TV_series.Rating"]]}
{"question": "Find the semester when both Master students and Bachelor students got enrolled in.", "labels": [[9, 17, "Student_Enrolment.semester_id"], [35, 43, "Degree_Programs.degree_summary_name"], [57, 65, "Degree_Programs.degree_summary_name"], [70, 81, "Student_Enrolment"]]}
{"question": "What are the contestant numbers and names of the contestants who had at least two votes?", "labels": [[13, 23, "CONTESTANTS"], [24, 31, "CONTESTANTS.contestant_number"], [36, 41, "CONTESTANTS.contestant_name"], [49, 60, "CONTESTANTS"], [82, 87, "VOTES"]]}
{"question": "Show names, results and bulgarian commanders of the battles with no ships lost in the 'English Channel'.", "labels": [[5, 10, "battle.name"], [12, 19, "battle.result"], [24, 44, "battle.bulgarian_commander"], [52, 59, "battle"], [68, 73, "ship"], [74, 78, "ship.lost_in_battle"], [79, 81, "ship.location"]]}

Figure 5: Examples of our annotated data.