Semantic Parsing to Manipulate Relational Database For a Management System
SSemantic Parsing to Manipulate RelationalDatabase For a Management System
Muhammad Hamzah Mushtaq
FAST National University of Computers and Emerging Sciences
Abstract
Chatbots and AI assistants have claimed their importance in today’slife. The main reason behind adopting this technology is to connectwith the user, understand their requirements, and fulfill them. Thishas been achieved but at the cost of heavy training data and com-plex learning models. This work is carried out proposes a simplealgorithm, a model which can be implemented in different fieldseach with its own work scope. The proposed model converts hu-man language text to computer-understandable SQL queries. Themodel requires data only related to the specific field, saving dataspace. This model performs linear computation hence solving thecomputational complexity. This work also defines the stages wherea new methodology is implemented and what previous methodwas adopted to fulfill the requirement at that stage. Two datasetsavailable online will be used in this work, the ATIS dataset, andWikiSQL. This work compares the computation time among the 2datasets and also compares the accuracy of both. This paper worksover basic Natural language processing tasks like semantic parsing,NER, parts of speech and tends to achieve results through thesesimple methods.
Keywords
Semantic Parsing,NER,SQL,Text-to-SQL
1. Problem statement
To generate a generic model for low end individual systems whichwhen provided with any limited database would convert Humanlanguage to respected database query efficiently.
2. Introduction
Many companies have their own customer support department.Sometimes, users queries do get overbooked and there are notenough staff to handle those queries in time. Instead of employingmore workforce, having a smart voice query and result systemwould save more time and money. More importantly, in a low endsystem which doesn’t have a huge processing power and memoryand which needs to perform operations just related to its field, it ishard for it to compute such huge algorithms that today’s algorithmsrequire. Training models, requiring loads of training and testing
Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page. Copyrights for components of this work owned by others than ACMmust be honored. Abstracting with credit is permitted. To copy otherwise, or republish,to post on servers or to redistribute to lists, requires prior specific permission and/or afee. Request permissions from [email protected].
CONF ’yy , Month d–d, 20yy, City, ST, Country.Copyright © 20yy ACM 978-1-nnnn-nnnn-n/yy/mm. . . $15.00.http://dx.doi.org/10.1145/nnnnnnn.nnnnnnn data first and dependency over multiple computers/nodes for heavyprocessing. It is true that these high tech solution make our lifeeasier but just for solving and achieving bigger goals. In orderto work in a limited scope environment, one doesn’t need suchamount of data and dependencies. This paper aims to provide asolution for such limited scope operations. For this work purpose,we refer to a hotel room management system which only needs tocommunicate data and information regarding hotel and its vacantrooms and availability. In the race of developing high end genericsolutions, we have lost vision and though of achievements thatcould come by individually capturing each field. Same is the casefor semantic parsing in Speech recognition systems. Here, we aretalking about a smart system which understands human languagequeries through their voice and interprets their question and givesrequired answer in human language. For example, as DDL queriesare simple, the support assistant just needs to view the data fromdatabase and reveal it to user. If a smart user assistant handlesthese queries, the actual support assistants could get more time inhandling users with more difficult queries and issues. This impliesthe importance of voice assistants which entertain users throughdirectly communicating with database. Henceforth, there must bea channel in which users natural language is translated to querylanguage for computers to understand and interpret. Traditionalapproach revolves around training over huge amount of data whichcould fail when out of training set data is occurred. Therefore, themotivation behind the research is to develop a solution which isindependent of training set and computes efficient result over someset of rules.
Understanding human language by converting it into machine un-derstandable form and then retrieving information back is a tedioustask. In this path the main big concept is of Semantic parsing re-sponsible for converting human language to query structure. Itrequires multiple functions like tokenization, semantic analysis,parsing and passing through language models. If we talk particu-larly for a specific area of focus, a specific field of work where weneed to implement our semantic parsing of human language, thenmuch less data is needed for training purposes. Moreover, infor-mation extraction becomes more easy when there is just a singledatabase involved. Instead of training models on various keywordsand sentences, just a set of main keywords and there replacementQuery syntax are required. This leads to less memory and timeconsumption. Also, situation specific data can be processed easily.If we generalize the overall approaches used globally for theNLP tasks,[8] we say there are three categories namely: • Symbolic Approach or Rule-Based Approach a r X i v : . [ c s . C L ] F e b igure 1. Sample for natural language and related SQL extracted from [1]. In multi turns, these previous models save the query structureand re-query the database. Proposed model adopts different technique for multi turn questions by querying from previous results directly. • Empirical Approach or Corpus-Based Approach • Connectionist Approach or Using Neural NetworksThis research on makes use of both the rule based and corpusbased machine learning approach. This is made possible by uti-lizing the text-to-sql dataset. This dataset itself contains multiplefields databases, mostly with limited database table usage. Thisis highly likely to compensate in this research since the aim isto manipulate relational database for single scope databases. Sec-ondly, the template that would assist in refining the query consistof a classification model that classifies the table names and columnnames to the most accurate within the dataset. Researc SPYDERcarried out clearly outlines the specifications of multiple text-to-sql datasets. This research doesn’t consider dataset Spyder since itdoes contain multiple database with multiple tables but with differ-ent scopes which is not the aim this research.The original working flow for the research was that a black box textto sql generator would be used so as to provide a dummy or body ofrespected sql. This would save time. But translating through modelwould saved time and complexity. This research focuses generatingand utilizing a generic Query Language template which assists indesigning a query, and also how to use already processed data formulti turn dialogues. For template generation, NER will be thor-oughly used. We will learn the implications of NER over querygeneration. Multiple datasets are available as given in cath:cgo02.We will be using the structured database of many of these datasetsinstead of their Natural language to query language translation.We will compare our generated queries with that already given indataset and distinguish the correctness of our methodology. An ex-ample for comparison is shown below. We compare the total num-ber of statements that are translated to query and how many ofthem were correctly translated. Since the true translations are givenwithin the respective datasets, we can match our results and get thecomparison.
3. Related work
A lot of work has been done over voice assistants and chat bots.Mainly working over Open domain question answering (QA).[1]For this purpose, a huge amount of data corpus is required and fromvarious sources. Traditional approach requires training state of theart models which utilize this data corpus and learn user behaviorover voice commands.The research [1] also shows the use of domain ontology tripleswhich carry this format.
4. Implementation
Our first approach over this research is to convert human voice totext. The text is passed through semantic parsing to develop theinitial syntax for our query. We proposed a rule based approachin which the initially developed syntax is refined through the help igure 2.
Control flow logic of the being implemented modelof predefined set to Query Language template or rules. In thisway, the computer gets to learn the correct form.Then the queryis run over the database to fetch results. These queries can be ofany type; Data Definition Language (DDL) or Data ManipulationLanguage(DML). This process is to be adopted for single-turn dia-logue, where there is just one single operation. The second changethat is to implemented is using already gathered results for multiturn dialogues.Instead of redesigning the query and gather the similar resultsover it, we propose a set of grammar and rule which define whento run query over already gathered results and when to query thedatabase. Also, this research will put in use NER. This will behelpful when matching parsed query with domain ontology tuplesor predefined templates.
This is the first and initial part of our model in which voice input isconverted to text. The main work of this model resides in efficientlyprocessing and computing users text command in sql query form,hence, we will be using third party API for just converting humanvoice to text. For best results we would use google’s speech to textapi. We convert human voice and then perform further process onit.Also, our model as it is trained over recognizing English verbs orpronouns which would declare what type of operation to perform,the model is prone to some errors because of its dependency ongenerated language.
The generated English sentence is broken down here into parts ofspeech and object/ subject distinction. Hence each word carriesits attributes further for query generation. Major use here will beof Named Entity Recognition NER which will tag the words asnames, location, time etc. The reason for adopting this processis easily match the data with our already defined template. Thatquery template would convey us the refined SQL based on somepredefined set of conditions. For example, when certain words fallinto certain category, a different SQL syntax will be use. In parallel, a model is trained which generates a sql body for thenext method to work on. The model is trained over input sentencesand predict what type of sql query would be used, for example aselect statement or a insert or delete statement. This model woulddecide if to form DDL or DML.
The already defined SQL template also tells us whether to exe-cute our statement on previously generated result or re-query overdatabase. This would help us shortening the query and saving a lotof time which would be wasted in conditional query over database.Secondly, this is the part which is different from most of the mod-ern techniques. Since our goal is to develop a model for low endsystems running individually for a specific field and purpose, ourgoal to convert language to SQL query becomes easy. When thefield and scope is narrowed down, so is the data needed to identityhuman language meaning.The field specific template would include precise NER sub-jects and there respective SQL query mapping, along with the thestructure of conveying results in text form. First column woulddescribe the NER subjects like object, name, place etc, second col-umn would define the assigned query along with the positions ofthe entities. The third column denotes the structure of the resultsthat would be displayed. For example if a SELECT statement with
Figure 3.
Refining query includes the field specific template andparsed text as input igure 4. [7]Datasets comparisonsSatement SQL mapping Result statementHow manyrooms areavailable? SELECTCOUNT(ID)FROM
Table 1.
Table defining the intermediate structure after the modeltranslates text into a query.count is generated then the resultant query would start with "Thereare
5. Evaluation and Results
This study tends to identify the behavior of voice assistant overRule based approach during semantic parsing, rather than pre-learned models which require pre-processed data in a huge amount.Also, the use and dependency of NER over query generation will bestudied. With this research model, a new Human Language transla-tion system is generated which would ideally just get the connectedDatabase information from the user and create queries likewise.Theinitial idea has been highlighted in this paper and the drawbacksand setbacks of previous approaches over low core database sys-tems have been highlighted.Code available at : https://github.com/hamzah1947/textToSQL.git
Since data and naming convention for different databases and fieldsdiffer, this research works on achieving a single trained modelwhich when connected to any database generates efficient queriesfor that database respectively. text-to-SQL dataset contains multi-ple databases respected to different fields. It is most suitable in in-tegrating and testing the current model. The only dependencies ofproposed model are for the initial text to sql model which is trainedover pre-defined semantics translations. Secondly, the model thatclassifies best table name and column train on the text-to-sql datasetas whole. Once the table is determined, it is easy for model to drilldown to that database table and classify only among its columns.
In the proposed methodology we also touch multi turn QA inour system. This is done through the help of our QL templatewhich directs from where to query, either from database or previousresults. Process are to be implemented over both data-sets and theresults will be displayed in this manner.The model consists of 2 SVM models. First one trained to predict Dataset Statements executed No. of correct results1 50 402 40 383 40 40
Table 2.
Table comparing correctly generated queries among dif-ferent data-setsSQL syntax from human language and the second model classifiestable name and column names.Dataset Average Computation time Multi-turn1 4 Yes2 10 Yes
Table 3.
Table comparing correctly generated queries among dif-ferent data-sets
6. Conclusion
This study proposes a model which best benefits low score databasesystems which are unable to handle large training data and compu-tation requirements. Once the complete implementation of model,it would be able to integrate with any data and have a competitiveaccuracy of text to sql and database manipulation with accurateresults.
References [1] Colas, Anthony and Bui, Trung and Dernoncourt, Franck andSinha, Moumita and Kim, Doo Soon 2020. Efficient Deploymentof Conversational Natural Language Interfaces over Databases In
Proceedings of the First Workshop on Natural Language Interfaces ,27–36.[2] Catherine Finegan-Dollak, Jonathan K Kummerfeld, Li Zhang, KarthikRamanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev 2018.Improving text-to-sql evaluation methodology.
In Proceedings of the56th Annual Meeting of the Association for Computational Linguistics(Volume 1: Long Papers) , 351–360.[3] J Ambient Intell Human Comput (2019) Jeong, S., Seo, Y. 2019.Improving response capability of chatbot using twitter. In
J AmbientIntell Human Comput .[4] Yih, Wen-tau and He, Xiaodong and Meek, Christopher 2014. SemanticParsing for Single-Relation Question Answering In
Proceedings of the52nd Annual Meeting of the Association for Computational Linguistics ,643–648.[5] Yujie Qian, Enrico Santus , Zhijing Jin, Jiang Guo, Barzilay Regina2019. GraphIE: A Graph-Based Framework for Information Extraction.n
Proceedings of the 2019 Conference of the North American Chapterof the Association for Computational Linguistics: Human LanguageTechnologies, Volume 1 [6] Gur, Izzeddin and Yavuz, Semih and Su, Yu and Yan, Xifengr 2018.DialSQL: Dialogue Based Structured Query Generation In