Few-Shot Semantic Parsing for New Predicates
Zhuang Li, Lizhen Qu∗, Shuo Huang, Gholamreza Haffari
Faculty of Information Technology, Monash University
[email protected]  [email protected]
∗ Corresponding author
Abstract
In this work, we investigate the problem of semantic parsing in a few-shot learning setting, in which we are provided with k utterance-logical form pairs per new predicate. The state-of-the-art neural semantic parsers achieve less than 25% accuracy on benchmark datasets when k = 1. To tackle this problem, we propose to i) apply a designated meta-learning method to train the model; ii) regularize attention scores with alignment statistics; and iii) apply a smoothing technique in pre-training. As a result, our method consistently outperforms all the baselines in both one- and two-shot settings.

1 Introduction

Semantic parsing is the task of mapping natural language (NL) utterances to structured meaning representations, such as logical forms (LFs). One key obstacle preventing the wide application of semantic parsing is the lack of task-specific training data. New tasks often require new predicates in LFs. Suppose a personal assistant (e.g. Alexa) is capable of booking flights; due to a new business requirement, it needs to book ground transport as well. A user could ask the assistant
"How much does it cost to go from Atlanta downtown to airport?". The corresponding LF is as follows:

(lambda $0 e (exists $1 (and (ground_transport $1) (to_city $1 atlanta:ci) (from_airport $1 atlanta:ci) (= (ground_fare $1) $0))))

where both ground_transport and ground_fare are new predicates, while the other predicates, such as to_city and from_airport, are already used in flight booking. As the manual construction of large parallel training data is expensive and time-consuming, we consider the few-shot formulation of the problem, which requires only a handful of utterance-LF training pairs for each new predicate. The cost of preparing few-shot training examples is low, thus the corresponding techniques permit significantly faster prototyping and development than supervised approaches for business expansions.

Semantic parsing in the few-shot setting is challenging. In our experiments, the accuracy of the state-of-the-art (SOTA) semantic parsers drops to less than 25% when there is only one example per new predicate in the training data. Moreover, the SOTA parsers achieve less than 32% accuracy on five widely used corpora when the LFs in the test sets do not share LF templates with the training sets (Finegan-Dollak et al., 2018). An LF template is derived by normalizing the entities and attribute values of an LF into typed variable names (Finegan-Dollak et al., 2018). The few-shot setting imposes two major challenges for SOTA neural semantic parsers. First, there is insufficient data to learn effective representations for new predicates in a supervised manner. Second, new predicates bring in new LF templates, which mix known and new predicates. In contrast, the tasks (e.g. image classification) studied by prior work on few-shot learning (Snell et al., 2017; Finn et al., 2017) consider an instance as belonging exclusively to either a known class or a new class. Thus, it is non-trivial to apply conventional few-shot learning algorithms to generate LFs with mixed types of predicates.

To address the above challenges, we present ProtoParser, a transition-based neural semantic parser, which applies a sequence of parse actions to transduce an utterance into an LF template and then fills the corresponding slots. The parser is pre-trained on a training set with known predicates, followed by fine-tuning on a support set that contains few-shot examples of new predicates. It extends the attention-based sequence-to-sequence architecture (Sutskever et al., 2014) with the following novel techniques to alleviate the specific problems in the few-shot setting:

• Predicate-dropout. Predicate-dropout is a meta-learning technique to improve representation learning for both known and new predicates. We empirically found that known predicates are better represented by embeddings learned in a supervised manner, while new predicates are better initialized by a metric-based few-shot learning algorithm (Snell et al., 2017). To let the two types of embeddings work together in a single model, we devise a training procedure called predicate-dropout that simulates the testing scenario during pre-training.
• Attention regularization. In this work, new predicates appear approximately once or twice during training, so it is difficult to learn reliable attention scores in the Seq2Seq architecture for those predicates. In the spirit of supervised attention (Liu et al., 2016), we propose to regularize them with alignment scores estimated from co-occurrence statistics and string similarity between words and predicates. Prior work on supervised attention is not applicable, because it requires either large parallel data (Liu et al., 2016) or significant manual effort (Bao et al., 2018; Rabinovich et al., 2017), or it is designed for applications other than semantic parsing (Liu et al., 2017; Kamigaito et al., 2017).
• Pre-training smoothing. The vocabulary of predicates in fine-tuning is larger than that in pre-training, which leads to a distribution discrepancy between the two training stages. Inspired by Laplace smoothing (Manning et al., 2008), we achieve a significant performance gain by applying a smoothing technique during pre-training that alleviates this discrepancy.

Our extensive experiments on three benchmark corpora show that
ProtoParser outperforms the competitive baselines by a significant margin. The ablation study demonstrates the effectiveness of each individual proposed technique. The results are statistically significant with p ≤ 0.05.

2 Related Work

Semantic Parsing
There is ample work on machine learning models for semantic parsing; recent surveys (Kamath and Das, 2018; Zhu et al., 2019) cover a wide range of work in this area. The semantic formalisms of meaning representations range from lambda calculus (Montague, 1973) and SQL to abstract meaning representation (Banarescu et al., 2013). At the core of most recent models (Chen et al., 2018; Cheng et al., 2019; Lin et al., 2019; Zhang et al., 2019b; Yin and Neubig, 2018) is Seq2Seq with attention (Bahdanau et al., 2014), which formulates the task as a machine translation problem. Coarse2Fine (Dong and Lapata, 2018) reports the highest accuracy on GeoQuery (Zelle and Mooney, 1996) and
ATIS (Price, 1990) in a supervised setting. IRNet (Guo et al., 2019) and RAT-SQL (Wang et al., 2019) are the two best performing models on the text-to-SQL benchmark Spider (Yu et al., 2018); they are also designed to generalize to unseen database schemas. However, supervised models perform well only when there is sufficient training data.
Data Sparsity
Most semantic parsing datasets are small in size. To address this issue, one line of research augments existing datasets with automatically generated data (Su and Yan, 2017; Jia and Liang, 2016; Cai and Yates, 2013). Another line of research exploits available resources, such as knowledge bases (Krishnamurthy et al., 2017; Herzig and Berant, 2018; Chang et al., 2019; Lee, 2019; Zhang et al., 2019a; Guo et al., 2019; Wang et al., 2019), semantic features shared across domains (Dadashkarimi et al., 2018; Li et al., 2020), or unlabeled data (Yin et al., 2018; Kočiský et al., 2016; Sun et al., 2019). Those works are orthogonal to our setting because our approach aims to efficiently exploit a handful of labeled examples of new predicates, which are not limited to the ones in knowledge bases. Our setting also does not require humans in the loop, as in active learning (Duong et al., 2018; Ni et al., 2019) and crowd-sourcing (Wang et al., 2015; Herzig and Berant, 2019). We assume the availability of resources different from the prior work and focus on the problems caused by new predicates. We develop an approach that generalizes to unseen LF templates consisting of both known and new predicates.
Few-Shot Learning
Few-shot learning is a type of machine learning problem in which only a handful of labeled training examples is provided for a specific task. The survey (Zhu et al., 2019) gives a comprehensive overview of the data, models, and algorithms
proposed for this type of problem. It categorizes the models into multitask learning (Hu et al., 2018), embedding learning (Snell et al., 2017; Vinyals et al., 2016), learning with external memory (Lee and Choi, 2018; Sukhbaatar et al., 2015), and generative modeling (Reed et al., 2017), in terms of what prior knowledge is used. Lee et al. (2019) tackle the problem of poor generalization across SQL templates for SQL query generation in the one-shot learning setting. In their setting, they assume that all the SQL templates in the test set are shared with the templates in the support set. In contrast, we assume only the sharing of new predicates between a support set and a test set. In our one-shot setting, only around 10% of the LF templates in the test set are shared with the ones in the support set of the GeoQuery dataset.

t    Action
t1   GEN[(ground_transport v_a)]
t2   GEN[(to_city v_a v_e)]
t3   GEN[(from_airport v_a v_e)]
t4   GEN[(= (ground_fare v_a) v_a)]
t5   REDUCE[and :- NT NT NT NT]
t6   REDUCE[exists :- v_a NT]
t7   REDUCE[lambda :- v_a e NT]

Table 1: An example action sequence.
3 ProtoParser

ProtoParser follows the SOTA neural semantic parsers (Dong and Lapata, 2018; Guo et al., 2019) in mapping an utterance into an LF in two steps: template generation and slot filling. It implements a designated transition system to generate templates, followed by filling the slot variables with values extracted from utterances. To address the challenges in the few-shot setting, we propose three training methods, detailed in Sec. 4.

Code and datasets can be found in this repository: https://github.com/zhuang-li/few-shot-semantic-parsing

Many LFs differ only in the mentioned atoms, such as entities and attribute values. An LF template is created by replacing the atoms in LFs with typed slot variables. As an example, the LF template of our example in Sec. 1 is created by substituting i) a typed atom variable v_e for the entity "atlanta:ci"; and ii) a shared variable name v_a for the variables "$0" and "$1":

(lambda v_a e (exists v_a (and (ground_transport v_a) (to_city v_a v_e) (from_airport v_a v_e) (= (ground_fare v_a) v_a))))
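To make this normalization concrete, the following minimal Python sketch replaces entity atoms with the typed slot variable v_e and lambda-bound variables with v_a. The whitespace tokenizer and the hard-coded entity set are simplifying assumptions for illustration; they are not the released preprocessing code, which also handles typing and idiom collapsing.

    import re

    def lf_to_template(lf, entities):
        # Tokenize the s-expression, then map atoms to typed slot variables.
        tokens = lf.replace("(", " ( ").replace(")", " ) ").split()
        out = []
        for tok in tokens:
            if re.fullmatch(r"\$\d+", tok):   # lambda variables $0, $1, ... -> v_a
                out.append("v_a")
            elif tok in entities:             # entity atoms -> v_e
                out.append("v_e")
            else:
                out.append(tok)
        return " ".join(out)

    lf = ("(lambda $0 e (exists $1 (and (ground_transport $1) "
          "(to_city $1 atlanta:ci) (from_airport $1 atlanta:ci) "
          "(= (ground_fare $1) $0))))")
    # Prints a tokenized form of the template shown above.
    print(lf_to_template(lf, entities={"atlanta:ci"}))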
Formally, let x = {x_1, ..., x_n} denote an NL utterance, and let its LF be represented as a semantic tree y = (V, E), where V = {v_1, ..., v_m} denotes the node set with v_i ∈ V, and E ⊆ V × V is its edge set. The node set V = V_p ∪ V_v is further divided into a template predicate set V_p and a slot value set V_v. A template predicate node represents a predicate symbol or a term, while a slot value node represents an atom mentioned in utterances. Thus, a semantic tree y is composed of an abstract tree τ_y representing a template and a set of slot value nodes V_{v,y} attached to the abstract tree.

In the few-shot setting, we are provided with a train set D_train, a support set D_s, and a test set D_test. Each example in any of those sets is an utterance-LF pair (x_i, y_i). The new predicates appear only in D_s and D_test but not in D_train. For K-shot learning, there are K pairs (x_i, y_i) per new predicate p in D_s. Each new predicate also appears in the test set. The goal is to maximize the accuracy of estimating LFs given utterances in D_test by using a parser trained on D_train ∪ D_s.

We apply the transition system of Cheng et al. (2019) to perform a sequence of transition actions that generate the template of a semantic tree. The transition system maintains partially-constructed outputs using a stack. The parser starts with an empty stack. At each step, it performs one of the following transition actions to update the parsing state and generate a tree node. The process repeats until the stack contains a complete tree.
• GEN[y] creates a new leaf node y and pushes it on top of the stack.

• REDUCE[r] identifies an implication rule head :- body. The rule body is first popped from the stack. A new subtree is formed by attaching the rule head as a new parent node to the rule body. Then the whole subtree is pushed back onto the stack.

Table 1 shows such an action sequence for generating the above LF template (a code sketch of the stack manipulation is given below). Each action produces known or new predicates.

ProtoParser generates an LF in two steps: i) template generation, and ii) slot filling. The base architecture largely resembles that of Cheng et al. (2019).
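The following is a minimal sketch of the stack manipulation behind GEN and REDUCE, assuming tree nodes are plain nested tuples. Terminal symbols in a rule (e.g. v_a, e) are taken from the rule itself, and only nonterminal (NT) children are popped from the stack; the helper names are illustrative, not the actual implementation.

    def gen(stack, leaf):
        # GEN[y]: push a new leaf node onto the stack.
        stack.append(leaf)

    def reduce_(stack, head, rule_terminals, n_nt):
        # REDUCE[head :- body]: pop the NT children, attach the head
        # (plus any terminal symbols of the rule) and push the subtree back.
        children = stack[-n_nt:]
        del stack[-n_nt:]
        stack.append((head, *rule_terminals, *children))

    stack = []
    for leaf in ["(ground_transport v_a)", "(to_city v_a v_e)",
                 "(from_airport v_a v_e)", "(= (ground_fare v_a) v_a)"]:
        gen(stack, leaf)
    reduce_(stack, "and", [], 4)               # and :- NT NT NT NT
    reduce_(stack, "exists", ["v_a"], 1)       # exists :- v_a NT
    reduce_(stack, "lambda", ["v_a", "e"], 1)  # lambda :- v_a e NT
    assert len(stack) == 1                     # a single complete tree remains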
Template Generation  Given an utterance, the task is to generate a sequence of actions a = a_1, ..., a_k to build an abstract tree τ_y. We found that LFs often contain idioms, which are frequent subtrees shared across LF templates. Thus, we apply a template normalization procedure similar to that of Iyer et al. (2019) to preprocess all LF templates. It collapses idioms into single units such that all LF templates are converted into a compact form.

The neural transition system consists of an encoder and a decoder for estimating action probabilities:

P(a | x) = ∏_{t=1}^{|a|} P(a_t | a_{<t}, x)    (1)

We apply a bidirectional Long Short-Term Memory (LSTM) network (Gers et al., 1999) to map a sequence of n words into a sequence of contextual word representations {e_i}_{i=1}^{n}.

Template Decoder  The decoder applies a stack-LSTM (Dyer et al., 2015) to generate action sequences. A stack-LSTM is a unidirectional LSTM augmented with a pointer. The pointer points to a particular hidden state of the LSTM, which represents a particular state of the stack. It moves to a different hidden state to indicate a different state of the stack.

At time t, the stack-LSTM produces a hidden state h_t^d by h_t^d = LSTM(µ_t, h_{t-1}^d), where µ_t is the concatenation of the embedding c_{a_{t-1}} of the action estimated at time t-1 and the representation h_{y_{t-1}} of the partial tree generated by the history of actions up to time t-1.

As a common practice, h_t^d is concatenated with an attended representation h_t^a over encoder hidden states to yield h_t, with h_t = W [h_t^d ; h_t^a], where W is a weight matrix and h_t^a is created by soft attention:

h_t^a = Σ_{i=1}^{n} P(e_i | h_t^d) e_i    (2)

We apply the dot product to compute the normalized attention scores P(e_i | h_t^d) (Luong et al., 2015). Supervised attention (Rabinovich et al., 2017; Yin and Neubig, 2018) is also applied to facilitate the learning of attention weights. Given h_t, the probability of an action is estimated by:

P(a_t | h_t) = exp(c_{a_t}^⊤ h_t) / Σ_{a′ ∈ A_t} exp(c_{a′}^⊤ h_t)    (3)

where c_a denotes the embedding of action a, and A_t denotes the set of applicable actions at time t. The initialization of those embeddings will be explained in the following section.
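The following numpy sketch mirrors Eqs. (2)-(3) with dot-product attention over toy dimensions. In the actual model h_d comes from the stack-LSTM and C contains the action embeddings; here every tensor is randomly initialized purely for illustration.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def action_distribution(h_d, E, W, C):
        # E: encoder states (n x d), h_d: decoder state, C: action embeddings.
        att = softmax(E @ h_d)                  # P(e_i | h_t^d), dot-product scores
        h_a = att @ E                           # attended representation, Eq. (2)
        h_t = W @ np.concatenate([h_d, h_a])    # h_t = W [h_t^d ; h_t^a]
        return softmax(C @ h_t)                 # P(a_t | h_t), Eq. (3)

    rng = np.random.default_rng(0)
    d, n, n_actions = 8, 5, 7
    probs = action_distribution(rng.normal(size=d), rng.normal(size=(n, d)),
                                rng.normal(size=(d, 2 * d)),
                                rng.normal(size=(n_actions, d)))
    assert abs(probs.sum() - 1.0) < 1e-6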
Slot Filling  A tree node in a semantic tree may contain more than one slot variable due to template normalization. Since there are two types of slot variables, given a tree node with slot variables, we employ an LSTM-based decoder with the same architecture as the template decoder to fill each type of slot variable, respectively. The output of such a decoder is a value sequence of the same length as the number of slot variables of that type in the given tree node.

4 Training

The few-shot setting differs from the supervised setting by having a support set at test time in addition to the train/test sets. The support set contains k utterance-LF pairs per new predicate, while the training set contains only known predicates. To evaluate model performance on new predicates, the test set contains LFs with both known and new predicates. Given the support set, we can tell whether a predicate is known or new by checking whether it appears in the train set.

We take two steps to train our model: i) pre-training on the training set, and ii) fine-tuning on the support set. Its predictive performance is measured on the test set. We take the two-step approach because i) our experiments show that it performs better than training on the union of the train set and the support set; and ii) for any new support set, it is computationally more efficient than training from scratch on the union of the train set and the support set.

There is a distribution discrepancy between the train set and the support set due to new predicates. The meta-learning algorithms (Snell et al., 2017; Finn et al., 2017) suggest simulating the testing scenario in pre-training by splitting each batch into a meta-support set and a meta-test set. The models utilize the information (e.g. prototype vectors) acquired from the meta-support set to minimize errors on the meta-test set. In this way, the meta-support and meta-test sets simulate the support and test sets sharing new predicates.

However, we cannot directly apply such a training procedure, for two reasons. First, each LF in the support and test sets is a mixture of both known and new predicates. To simulate the support and test sets, the meta-support and meta-test sets should include both types of predicates as well; we cannot assume that there is only one type of predicate. Second, our preliminary experiments show that when there is sufficient training data, it is better to train the action embeddings of known predicates c (Eq. (3)) in a supervised way, while action embeddings initialized by a metric-based meta-learning algorithm (Snell et al., 2017) perform better for rarely occurring new predicates. Therefore, we cope with the differences between known and new predicates by using a customized initialization method in fine-tuning and a designated pre-training procedure that mimics fine-tuning on the train set. In the following, we introduce fine-tuning first because it helps in understanding our pre-training procedure.

During fine-tuning, the model parameters and the action embeddings in Eq. (3) for known predicates are obtained from the pre-trained model. The embeddings of actions that produce new predicates, c_{a_t}, are initialized using prototype vectors as in prototypical networks (Snell et al., 2017). The prototype representations act as a type of regularization, which shares a similar idea with deep learning techniques that use pre-trained models.

A prototype vector of an action a_t is constructed by using the hidden states of the template decoder collected at the time of predicting a_t on a support set. Following Snell et al. (2017), a prototype vector is built by taking the mean of such a set of hidden states h_t:

c_{a_t} = (1 / |M|) Σ_{h_t ∈ M} h_t    (4)

where M denotes the set of all hidden states at the time of applying the action a_t. After initialization, all model parameters and the action embeddings are further improved by fine-tuning the model on the support set with a supervised training objective L_f:

L_f = L_s + λ Ω    (5)

where L_s is the cross-entropy loss and Ω is an attention regularization term explained below. The degree of regularization is adjusted by λ ∈ R+.

Attention Regularization  We address the poorly learned attention scores P(e_i | h_t^d) of infrequent actions by introducing a novel attention regularization. We observe that the probability P(a_j | x_i) = count(a_j, x_i) / count(x_i) and the character similarity between the predicates generated by action a_j and the token x_i are often strong indicators of their alignment. The indicators can be further strengthened by manually annotating the predicates with their corresponding natural language tokens. In our work, we adopt 1 − dist(a_j, x_i) as the character similarity, where dist(a_j, x_i) is the normalized Levenshtein distance (Levenshtein, 1966). Both measures are in the range [0, 1], thus we apply g(a_j, x_i) = σ(·) P(a_j | x_i) + (1 − σ(·)) char_sim(a_j, x_i) to compute alignment scores, where the sigmoid function σ(w_p^⊤ h_t^d) combines the two constant measures into a single score. The corresponding normalized attention score is given by:

P′(x_i | a_k) = g(a_k, x_i) / Σ_{j=1}^{n} g(a_k, x_j)    (6)

The attention scores P(x_i | a_k) should be similar to P′(x_i | a_k). Thus, we define the regularization term as Ω = Σ_{ij} |P(x_i | a_j) − P′(x_i | a_j)| during training.
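Below is a sketch of how the alignment targets of Eq. (6) and the regularizer Ω can be computed from co-occurrence counts and normalized Levenshtein similarity. The learned gate σ(w_p^⊤ h_t^d) is replaced by a fixed scalar here, and the count dictionaries are hypothetical inputs used only to show the computation.

    import numpy as np

    def lev(a, b):
        # Standard dynamic-programming Levenshtein distance.
        d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
        d[:, 0] = np.arange(len(a) + 1)
        d[0, :] = np.arange(len(b) + 1)
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                              d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
        return d[len(a), len(b)]

    def char_sim(pred, word):
        # 1 - normalized Levenshtein distance, a value in [0, 1].
        return 1.0 - lev(pred, word) / max(len(pred), len(word), 1)

    def alignment_targets(cooccur, word_count, preds, words, gate=0.5):
        # cooccur[a][x] = count(a_j, x_i); word_count[x] = count(x_i);
        # `gate` stands in for the learned sigmoid combination weight.
        P_prime = np.zeros((len(preds), len(words)))
        for j, a in enumerate(preds):
            g = np.array([gate * cooccur.get(a, {}).get(x, 0) / word_count.get(x, 1)
                          + (1 - gate) * char_sim(a, x) for x in words])
            P_prime[j] = g / (g.sum() + 1e-12)    # Eq. (6)
        return P_prime

    def attention_regularizer(P_att, P_prime):
        # Omega = sum_ij |P(x_i | a_j) - P'(x_i | a_j)|;
        # P_att holds the model's attention over tokens for each action.
        return np.abs(P_att - P_prime).sum()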
The pre-training objective is two-fold: i) learn action embeddings for known predicates in a supervised way, and ii) ensure our model can quickly adapt to the actions of new predicates, whose embeddings are initialized by prototype vectors before fine-tuning.

Predicate-dropout  Starting with randomly initialized model parameters, we alternately use one batch to optimize the meta-loss L_m and one batch to optimize the supervised loss L_s.

In a batch for L_m, we split the data into a meta-support set and a meta-test set. To simulate the existence of new predicates, we randomly select a subset of predicates as "new", and their action embeddings c are replaced by prototype vectors constructed by applying Eq. (4) over the meta-support set. The actions of the remaining predicates keep their embeddings learned from previous batches. The resulting action embedding matrix C is a combination of both:

C = (1 − m^⊤) C_s + m^⊤ C_m    (7)

where C_s is the embedding matrix learned in a supervised way, and C_m is constructed by using prototype vectors on the meta-support set. The mask vector m is generated by setting the indices of actions of the "new" predicates to ones and the others to zeros. We refer to this operation as predicate-dropout. The training algorithm for the meta-loss is summarised in Algorithm 1.

Algorithm 1: Predicate-Dropout
Input: Training set D, supervisedly trained action embedding C_s, number of meta-support examples k, number of meta-test examples n per support example, predicate-dropout ratio r
Output: The loss L_m
  Extract a template set T from the training set D
  Sample a subset T_i of size k from T
  S := ∅  (meta-support set); Q := ∅  (meta-test set)
  for t in T_i do
      Sample a meta-support example s′ with template t from D without replacement
      Sample a meta-test set Q′ of size n with template t from D
      S = S ∪ {s′}; Q = Q ∪ Q′
  end
  Build a prototype matrix C_m on S
  Extract a predicate set P from S
  Sample a subset P_s of size r × |P| from P as new predicates
  Build a mask m using P_s
  With C_s, C_m and m, apply Eq. (7) to compute C
  Compute L_m, the cross-entropy on Q with C

In a batch for L_s, we update the model parameters and all action embeddings with a cross-entropy loss L_s, together with the attention regularization. Thus, the overall training objective becomes:

L_p = L_m + L_s + λ Ω    (8)
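A minimal sketch of prototype construction (Eq. (4)) and the embedding mixing of Eq. (7) follows. The hidden states and embedding sizes are toy values; in the real training loop the prototypes come from the template decoder run over the meta-support set.

    import numpy as np

    def prototype_embeddings(states_per_action, n_actions, dim):
        # Eq. (4): the prototype of an action is the mean of the decoder hidden
        # states collected where that action was applied on the (meta-)support set.
        C_m = np.zeros((n_actions, dim))
        for a, states in states_per_action.items():
            C_m[a] = np.mean(states, axis=0)
        return C_m

    def predicate_dropout(C_s, C_m, new_action_ids):
        # Eq. (7): actions of predicates sampled as "new" use prototype vectors;
        # the rest keep their supervisedly learned embeddings.
        m = np.zeros(C_s.shape[0], dtype=bool)
        m[new_action_ids] = True
        return np.where(m[:, None], C_m, C_s)

    rng = np.random.default_rng(0)
    C_s = rng.normal(size=(7, 8))                           # supervised embeddings
    states = {a: rng.normal(size=(3, 8)) for a in range(7)} # support-set decoder states
    C_m = prototype_embeddings(states, n_actions=7, dim=8)
    C = predicate_dropout(C_s, C_m, new_action_ids=[2, 5])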
Pre-training smoothing  Due to the new predicates, the number of candidate actions during prediction in fine-tuning and testing is larger than during pre-training. That leads to a distribution discrepancy between pre-training and testing. To minimize the difference, we assume prior knowledge of the number of actions for new predicates by adding a constant k to the denominator of Eq. (3) when estimating the action probability P(a_t | h_t) during pre-training:

P(a_t | h_t) = exp(c_{a_t}^⊤ h_t) / (Σ_{a′ ∈ A_t} exp(c_{a′}^⊤ h_t) + k)    (9)

We do not apply this smoothing technique during fine-tuning and testing. Despite its simplicity, the experimental results show a significant performance gain on benchmark datasets.
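A one-line illustration of Eq. (9), assuming the unnormalized action scores c_a^⊤ h_t are already computed; the constant k simply inflates the softmax denominator during pre-training.

    import numpy as np

    def smoothed_action_probs(scores, k):
        # Eq. (9): exp(c_a^T h_t) / (sum_a' exp(c_a'^T h_t) + k);
        # used only in pre-training, not in fine-tuning or testing.
        # The outputs sum to less than 1, reserving mass for unseen actions.
        e = np.exp(scores)
        return e / (e.sum() + k)

    print(smoothed_action_probs(np.array([2.0, 1.0, 0.5]), k=3))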
5 Experiments

Datasets.  We use three semantic parsing datasets: Jobs, GeoQuery, and ATIS. Jobs contains 640 question-LF pairs in Prolog about job listings. GeoQuery (Zelle and Mooney, 1996) and ATIS (Price, 1990) include 880 and 5,410 utterance-LF pairs in lambda calculus about US geography and flight booking, respectively. The number of predicates in Jobs, GeoQuery, and ATIS is 15, 24, and 88, respectively. All atoms in the datasets are anonymized as in (Dong and Lapata, 2016).

For each dataset, we randomly selected m predicates as the new predicates, where m is 3 for Jobs and 5 for GeoQuery and ATIS. Then we split each dataset into a train set and an evaluation set, and removed the instances whose template is unique within the dataset; the number of such instances is around 100, 150 and 600 in Jobs, GeoQuery, and ATIS, respectively. The ratios between the evaluation set and the train set are 1:4, 2:5, and 1:7 in Jobs, GeoQuery, and ATIS, respectively. Each LF in an evaluation set contains at least one new predicate, while an LF in a train set contains only known predicates. To evaluate k-shot learning, we build a support set by randomly sampling k pairs per new predicate without replacement from an evaluation set, and keep the remaining pairs as the test set. To avoid evaluation bias caused by randomness, we repeat the above process six times to build six different splits of support and test sets from each evaluation set: one for hyperparameter tuning and the rest for evaluation. We consider at most 2-shot learning due to the limited number of instances per new predicate in each evaluation set.

Training Details.  We pre-train our parser on the training sets for {80, 100} epochs with the Adam optimizer (Kingma and Ba, 2014). The batch size is fixed to 64. The initial learning rate is 0.0025, and the weights are decayed after 20 epochs with decay rate 0.985. The predicate-dropout rate is 0.5. The smoothing term is set to {3, 6}. The number of meta-support examples is 30 and the number of meta-test examples per support example is 15. The coefficient of attention regularization is set to 0.01 on Jobs and 1 on the other datasets. We employ 200-dimensional GloVe embeddings (Pennington et al., 2014) to initialize the word embeddings for utterances. The hidden state size of all LSTM models (Hochreiter and Schmidhuber, 1997) is 256. During fine-tuning, the batch size is 2, and the learning rates and the epochs are selected from { } and {20, 30, 40, 60, 120}, respectively.

                   One-shot                       Two-shot
                   Jobs    GeoQuery  ATIS         Jobs    GeoQuery  ATIS      p-values
Seq2Seq (pt)       11.27   20.00     17.23        14.58   33.01     18.76     3.32e-04
Seq2Seq (cb)       11.70    7.64      2.25        21.49   14.36      7.91     6.65e-06
Seq2Seq (os)       14.18   11.38      4.45        30.46   33.59     10.17     5.30e-05
Coarse2Fine (pt)   10.91   24.07     17.44        13.83   35.63     21.08     1.48e-04
Coarse2Fine (cb)    9.28   14.50      0.42        19.61   28.93      9.25     2.35e-06
Coarse2Fine (os)    6.73   10.35      5.26        16.08   28.55     17.73     1.13e-05
IRNet (pt)         16.00   20.00     17.12        19.06   35.05     20.11     2.86e-05
IRNet (cb)         19.67   21.90      5.60        28.22   44.08     15.73     2.76e-03
IRNet (os)         14.91   18.78      4.95        30.84   40.97     18.05     2.47e-04
DA                 18.91    9.67      4.29        21.31   20.88     17.18     1.13e-06
PT-MAML            11.64    9.76      6.83        17.76   22.52     12.28     1.73e-06
Ours

Table 2: Evaluation of learning results on three datasets. (Left) The one-shot results. (Right) The two-shot results.

Baselines.  We compared our method with five competitive baselines: Seq2Seq with attention (Luong et al., 2015), Coarse2Fine (Dong and Lapata, 2018), IRNet (Guo et al., 2019), PT-MAML (Huang et al., 2018) and DA (Li et al., 2020). Coarse2Fine is the best performing supervised model on the standard splits of the GeoQuery and ATIS datasets. PT-MAML is a few-shot learning semantic parser that adopts Model-Agnostic Meta-Learning (Finn et al., 2017); we adapt it to our scenario by considering a group of instances that share the same template as a pseudo-task. DA is the most recently proposed neural semantic parser applying domain adaptation techniques. IRNet is the strongest semantic parser that can generalize to unseen database schemas. In our case, we consider the list of predicates in a support set as the columns of a new database schema and incorporate the schema encoding module of IRNet into the encoder of our base parser. We choose IRNet over RAT-SQL (Wang et al., 2019) because IRNet achieves superior performance on our datasets.

We consider three different supervised learning settings. First, we pre-train a model on a train set, followed by fine-tuning it on the corresponding support set, coined pt. Second, a model is trained on the combination of a train set and a support set, coined cb. Third, the support set in cb is over-sampled by 10 times and 5 times for one-shot and two-shot learning respectively, coined os.

Evaluation Details.  Following prior work (Dong and Lapata, 2018; Li et al., 2020), we report the accuracy of exactly matched LFs as the main evaluation metric. To investigate whether the results are statistically significant, we conducted the Wilcoxon signed-rank test (Wilcoxon, 1992), which assesses whether our model consistently performs better than another baseline across all evaluation sets. It is considered superior to the t-test in our case, because it supports comparison across different support sets and does not assume normality of the data (Demšar, 2006). We include the corresponding p-values in our result tables.
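For reference, the test can be run with scipy as in the sketch below; the per-split accuracies are placeholder numbers for illustration only, not results from this paper.

    from scipy.stats import wilcoxon

    # Paired accuracies of our parser and one baseline over the evaluation
    # splits (placeholder values).
    ours     = [0.41, 0.44, 0.39, 0.46, 0.42]
    baseline = [0.30, 0.35, 0.28, 0.37, 0.31]
    stat, p_value = wilcoxon(ours, baseline, alternative="greater")
    print(p_value)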
Results  Table 2 shows the average accuracies and significance test results of all parsers on the three datasets. Overall, ProtoParser outperforms all baselines by at least 2% accuracy on average in both the one-shot and two-shot settings. The results are statistically significant w.r.t. the strongest baselines, IRNet (cb) and Coarse2Fine (pt); the corresponding p-values are 0.00276 and 0.000148, respectively. With one-shot examples on Jobs, our parser achieves 7% higher accuracy than the best baseline, and the gap is 4% on GeoQuery with two-shot examples. In addition, none of the SOTA baseline parsers consistently outperforms the other SOTA parsers when there are few parallel data for new predicates. In the one-shot setting, the best supervised baseline IRNet (cb) achieves the best results on GeoQuery and Jobs among all baselines, while in the two-shot setting it performs best only on GeoQuery. It is also difficult to achieve good performance by adapting existing meta-learning or transfer learning algorithms to our problem, as evidenced by the moderate performance of PT-MAML and DA on all datasets.

The problems of few-shot learning demonstrate the challenges imposed by infrequent predicates. There are significant proportions of infrequent predicates in the existing datasets. For example, on GeoQuery, 10 predicates contribute only 4% of the total frequency of all 24 predicates, while the top two frequent predicates amount to 42%. As a result, the SOTA parsers achieve merely less than 25% and 44% accuracy with one-shot and two-shot examples, respectively. In contrast, those parsers achieve more than 84% accuracy on the standard splits of the same datasets in the supervised setting.

Infrequent predicates in semantic parsing can also be viewed as a class imbalance problem when support sets and train sets are combined in a certain manner. In this work, the ratio between the support set and the train set in Jobs, GeoQuery, and ATIS is 1:130, 1:100, and 1:1000, respectively. Different models prefer different ways of using the train sets and support sets. The best option for Coarse2Fine and Seq2Seq is to pre-train on a train set followed by fine-tuning on the corresponding support set, while IRNet favors oversampling in the two-shot setting.

                   One-shot                       Two-shot
                   Jobs    GeoQuery  ATIS         Jobs    GeoQuery  ATIS      p-values
Ours               27.09
- sup              23.63   18.86     12.91        26.91   39.51     14.89     1.44e-05
- proto            22.91   18.77     13.24        29.16   38.93     16.81     1.77e-05
- reg

Table 3: Ablation study results. (Left) The one-shot learning results. (Right) The two-shot learning results.

Ablation Study  We examine the effect of different components of our parser by removing each of them individually and reporting the corresponding average accuracy. As shown in Table 3, removing any of the components almost always leads to a statistically significant drop in performance; the corresponding p-values are all less than 0.00327.

To investigate predicate-dropout, we exclude either the supervised loss during pre-training (-sup) or the initialization of new predicate embeddings by prototype vectors before fine-tuning (-proto). It is clear from Table 3 that ablating either the supervisedly trained action embeddings or the prototype vectors hurts performance severely.

We further study the efficacy of attention regularization by removing it completely (-reg), removing only the string similarity feature (-strsim), or removing the conditional probability feature (-cond). Removing the regularization completely degrades performance sharply except on Jobs in the one-shot setting. Our further inspection shows that model learning is easier on Jobs than on the other two datasets: each predicate in Jobs almost always aligns to the same word across examples, while a predicate can align with different words or phrases in different examples in GeoQuery and ATIS. The performance drop with -strsim and -cond indicates that we cannot rely on a single statistical measure for regularization. For instance, we cannot always find predicates that take the same string form as the corresponding words in the input utterances.
In fact, theproportion of predicates present in input utterancesis only 42%, 38% and 44% on J OBS , ATIS, andG EO Q UERY , respectively.Furthermore, without pre-training smoothing (-smooth), the accuracy drops at least 1.6% in termsof mean accuracy on all datasets. Smoothing en-ables better model parameter training by more ac-curate modelling in pre-training. Support Set Analysis We observe that all mod-els consistently achieve high accuracy on certainsupport sets of the same dataset, while obtaininglow accuracies on the other ones. We illustrate thereasons of such effects by plotting the evaluationset of G EO Q UERY . Each data point in Figure 1 de-picts an representation, which is generated by theencoder of our parser after pre-training. We appliedT-SNE (Maaten and Hinton, 2008) for dimensionreduction. We highlight two support sets used inthe one-shot setting on G EO Q UERY . All exam-les in the highest performing support set tend toscatter evenly and cover different dense regions inthe feature space, while the examples in the lowestperforming support set are far from a significantnumber of dense regions. Thus, the examples ingood support sets are more representative of theunderlying distribution than the ones in poor sup-port sets. When we leave out each example in thehighest performing support set and re-evaluate ourparser each time, we observe that the good ones(e.g. the green box in Figure 1) locate either in orclose to some of the dense regions. We propose a novel few-shot learning based seman-tic parser, coined ProtoParser , to cope withnew predicates in LFs. To address the challengesin few-shot learning, we propose to train the parserwith a pre-training procedure involving predicate-dropout, attention regularization, and pre-trainingsmoothing. The resulted model achieves superiorresults over competitive baselines on three bench-mark datasets. References Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben-gio. 2014. Neural machine translation by jointlylearning to align and translate. arXiv preprintarXiv:1409.0473 .Laura Banarescu, Claire Bonial, Shu Cai, MadalinaGeorgescu, Kira Griffitt, Ulf Hermjakob, KevinKnight, Philipp Koehn, Martha Palmer, and NathanSchneider. 2013. Abstract meaning representationfor sembanking. In Proceedings of the 7th Linguis-tic Annotation Workshop and Interoperability withDiscourse , pages 178–186.Yujia Bao, Shiyu Chang, Mo Yu, and Regina Barzilay.2018. Deriving machine attention from human ra-tionales. In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processing ,pages 1903–1913.Qingqing Cai and Alexander Yates. 2013. Semanticparsing freebase: Towards open-domain semanticparsing. In Second Joint Conference on Lexical andComputational Semantics (* SEM), Volume 1: Pro-ceedings of the Main Conference and the SharedTask: Semantic Textual Similarity , pages 328–338.Shuaichen Chang, Pengfei Liu, Yun Tang, Jing Huang,Xiaodong He, and Bowen Zhou. 2019. Zero-shot text-to-sql learning with auxiliary task. arXivpreprint arXiv:1908.11052 . Bo Chen, Le Sun, and Xianpei Han. 2018. Sequence-to-action: End-to-end semantic graph generation forsemantic parsing. arXiv preprint arXiv:1809.00773 .Jianpeng Cheng, Siva Reddy, Vijay Saraswat, andMirella Lapata. 2019. Learning an executable neu-ral semantic parser. Computational Linguistics ,45(1):59–94.Javid Dadashkarimi, Alexander Fabbri, SekharTatikonda, and Dragomir R Radev. 2018. Zero-shottransfer learning for semantic parsing. arXivpreprint arXiv:1808.09889 .Janez Demˇsar. 2006. 
Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan):1–30.
Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. arXiv preprint arXiv:1601.01280.
Li Dong and Mirella Lapata. 2018. Coarse-to-fine decoding for neural semantic parsing. arXiv preprint arXiv:1805.04793.
Long Duong, Hadi Afshar, Dominique Estival, Glen Pink, Philip Cohen, and Mark Johnson. 2018. Active learning for deep semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 43–48.
Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A Smith. 2015. Transition-based dependency parsing with stack long short-term memory. arXiv preprint arXiv:1505.08075.
Catherine Finegan-Dollak, Jonathan K Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. 2018. Improving text-to-sql evaluation methodology. arXiv preprint arXiv:1806.09029.
Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1126–1135. JMLR.org.
Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. 1999. Learning to forget: Continual prediction with LSTM.
Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards complex text-to-sql in cross-domain database with intermediate representation. arXiv preprint arXiv:1905.08205.
Jonathan Herzig and Jonathan Berant. 2018. Decoupling structure and lexicon for zero-shot semantic parsing. arXiv preprint arXiv:1804.07918.
Jonathan Herzig and Jonathan Berant. 2019. Don't paraphrase, detect! Rapid and effective data collection for semantic parsing. arXiv preprint arXiv:1908.09940.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Zikun Hu, Xiang Li, Cunchao Tu, Zhiyuan Liu, and Maosong Sun. 2018. Few-shot charge prediction with discriminative legal attributes. In Proceedings of the 27th International Conference on Computational Linguistics, pages 487–498.
Po-Sen Huang, Chenglong Wang, Rishabh Singh, Wen-tau Yih, and Xiaodong He. 2018. Natural language to structured query generation via meta-learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 732–738.
Srinivasan Iyer, Alvin Cheung, and Luke Zettlemoyer. 2019. Learning programmatic idioms for scalable semantic parsing. arXiv preprint arXiv:1904.09086.
Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. arXiv preprint arXiv:1606.03622.
Aishwarya Kamath and Rajarshi Das. 2018. A survey on semantic parsing. CoRR, abs/1812.00978.
Hidetaka Kamigaito, Katsuhiko Hayashi, Tsutomu Hirao, Masaaki Nagata, Hiroya Takamura, and Manabu Okumura. 2017. Supervised attention for sequence-to-sequence constituency parsing. IJCNLP 2017, page 7.
Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Tomáš Kočiský, Gábor Melis, Edward Grefenstette, Chris Dyer, Wang Ling, Phil Blunsom, and Karl Moritz Hermann. 2016. Semantic parsing with semi-supervised sequential autoencoders. arXiv preprint arXiv:1609.09315.
Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gardner. 2017. Neural semantic parsing with type constraints for semi-structured tables.
In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1516–1526.
Dongjun Lee. 2019. Clause-wise and recursive decoding for complex and cross-domain text-to-sql generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6047–6053.
Dongjun Lee, Jaesik Yoon, Jongyun Song, Sanggil Lee, and Sungroh Yoon. 2019. One-shot learning for text-to-sql generation. arXiv preprint arXiv:1905.11499.
Yoonho Lee and Seungjin Choi. 2018. Gradient-based meta-learning with learned layerwise metric and subspace. arXiv preprint arXiv:1801.05558.
Vladimir I Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.
Zechang Li, Yuxuan Lai, Yansong Feng, and Dongyan Zhao. 2020. Domain adaptation for semantic parsing. arXiv preprint arXiv:2006.13071.
Kevin Lin, Ben Bogin, Mark Neumann, Jonathan Berant, and Matt Gardner. 2019. Grammar-based neural text-to-sql generation. arXiv preprint arXiv:1905.13326.
Lemao Liu, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. 2016. Neural machine translation with supervised attention. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3093–3102.
Shulin Liu, Yubo Chen, Kang Liu, and Jun Zhao. 2017. Exploiting argument information to improve event detection via supervised attention mechanisms. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1789–1798.
Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.
Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
Richard Montague. 1973. The proper treatment of quantification in ordinary English. In Approaches to Natural Language, pages 221–242. Springer.
Ansong Ni, Pengcheng Yin, and Graham Neubig. 2019. Merging weak and active supervision for semantic parsing. arXiv preprint arXiv:1911.12986.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Patti J Price. 1990. Evaluation of spoken language systems: The ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.
Maxim Rabinovich, Mitchell Stern, and Dan Klein. 2017. Abstract syntax networks for code generation and semantic parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1139–1149.
Scott Reed, Yutian Chen, Thomas Paine, Aäron van den Oord, SM Eslami, Danilo Rezende, Oriol Vinyals, and Nando de Freitas. 2017. Few-shot autoregressive density estimation: Towards learning to learn distributions. arXiv preprint arXiv:1710.10304.
Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087.
Yu Su and Xifeng Yan. 2017.
Cross-domain semantic parsing via paraphrasing. arXiv preprint arXiv:1704.05974.
Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448.
Yibo Sun, Duyu Tang, Nan Duan, Yeyun Gong, Xiaocheng Feng, Bing Qin, and Daxin Jiang. 2019. Neural semantic parsing in low-resource settings with back-translation and meta-learning. arXiv preprint arXiv:1909.05438.
I Sutskever, O Vinyals, and QV Le. 2014. Sequence to sequence learning with neural networks. Advances in NIPS.
Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638.
Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2019. RAT-SQL: Relation-aware schema encoding and linking for text-to-sql parsers. arXiv preprint arXiv:1911.04942.
Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1332–1342.
Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Breakthroughs in Statistics, pages 196–202. Springer.
Pengcheng Yin and Graham Neubig. 2018. TRANX: A transition-based neural abstract syntax parser for semantic parsing and code generation. arXiv preprint arXiv:1810.02720.
Pengcheng Yin, Chunting Zhou, Junxian He, and Graham Neubig. 2018. StructVAE: Tree-structured latent variable models for semi-supervised semantic parsing. In The 56th Annual Meeting of the Association for Computational Linguistics (ACL).
Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887.
John M Zelle and Raymond J Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the National Conference on Artificial Intelligence, pages 1050–1055.
Rui Zhang, Tao Yu, He Yang Er, Sungrok Shim, Eric Xue, Xi Victoria Lin, Tianze Shi, Caiming Xiong, Richard Socher, and Dragomir Radev. 2019a. Editing-based sql query generation for cross-domain context-dependent questions. arXiv preprint arXiv:1909.00786.
Sheng Zhang, Xutai Ma, Kevin Duh, and Benjamin Van Durme. 2019b. Broad-coverage semantic parsing as transduction. arXiv preprint arXiv:1909.02607.
Q. Zhu, X. Ma, and X. Li. 2019. Statistical learning for semantic parsing: A survey. Big Data Mining and Analytics, 2(4):217–239.

Algorithm 2: Template Normalization
Input: A set of abstract trees T, a minimal support τ
Output: A set of normalized trees
  O := mapping of subtrees to their occurrences in T
  for tree t in T do
      update the occurrences of all leaf nodes v of t in O[v]
  end
  while O is updated with new trees do
      for tree t, occurrence list l in O do
          build the occurrence list l′ for a supertree t′ of t
          if size(l′) ≥ size(l) then
              O[t′] = l′
          end
      end
  end
  for tree t, occurrence list l in O do
      if size(l) ≥ τ then
          collapse t into a node for all t′ in l
      end
  end

A Template Normalization

Many LF templates in the existing corpora have shared subtrees in the corresponding abstract semantic trees.
The tree normalization algorithm aims to treat those subtrees as single units. The identification of such shared structures is conducted by finding frequent subtrees. Given an LF dataset, the support of a tree t is the number of LFs in which it occurs as a subtree. We call a tree frequent if its support is greater than or equal to a pre-specified minimal support.

We also observe that in an LF dataset, some frequent subtrees always have the same supertree. For example, (ground_fare $1) is always the child of (= ... $0) in the whole dataset. We call a subtree complete w.r.t. a dataset if every supertree of it in the dataset occurs significantly less often than that subtree. Another observation is that some tree nodes have fixed siblings. To check whether two tree nodes sharing the same root are fixed siblings, we merge the two tree paths together; if the merged tree has the same support as that of each of the two trees, we say the two trees pass the fixed-sibling test. In the same manner, we collapse tree nodes with fixed siblings, as well as their parent node, into a single tree node to save unnecessary parse actions.

Thus, the normalization is conducted by collapsing a frequent complete abstract subtree into a tree node. We call a tree normalized if all its frequent complete abstract subtrees are collapsed into the corresponding tree nodes. The pseudocode of the tree normalization algorithm is provided in Algorithm 2.

B One Example Transition Sequence

As shown in Table 4, we provide an example transition sequence to display the stack states and the corresponding action sequence when parsing the utterance from the Introduction, "how much is the ground transportation between atlanta and downtown?".

t    Stack    Action
t1   []    GEN[(ground_transport v_a)]
t2   [(ground_transport v_a)]    GEN[(to_city v_a v_e)]
t3   [(ground_transport v_a), (to_city v_a v_e)]    GEN[(from_airport v_a v_e)]
t4   [(ground_transport v_a), (to_city v_a v_e), (from_airport v_a v_e)]    GEN[(= (ground_fare v_a) v_a)]
t5   [(ground_transport v_a), (to_city v_a v_e), (from_airport v_a v_e), (= (ground_fare v_a) v_a)]    REDUCE[and :- NT NT NT NT]
t6   [(and (ground_transport v_a) (to_city v_a v_e) (from_airport v_a v_e) (= (ground_fare v_a) v_a))]    REDUCE[exists :- v_a NT]
t7   [(exists v_a (and (ground_transport v_a) (to_city v_a v_e) (from_airport v_a v_e) (= (ground_fare v_a) v_a)))]    REDUCE[lambda :- v_a e NT]
t8   [(lambda v_a e (exists v_a (and (ground_transport v_a) (to_city v_a v_e) (from_airport v_a v_e) (= (ground_fare v_a) v_a))))]