Language Generation via Combinatorial Constraint Satisfaction: A Tree Search Enhanced Monte-Carlo Approach
Maosen Zhang†, Nan Jiang†, Lei Li‡, and Yexiang Xue†
†Department of Computer Science, Purdue University, Indiana, USA
‡ByteDance AI Lab
{maosen,jiang631,yexiang}@purdue.edu, [email protected]

Abstract
Generating natural language under complex constraints is a principled formulation towards controllable text generation. We present a framework that allows the specification of combinatorial constraints for sentence generation. We propose TSMH, an efficient method to generate high-likelihood sentences with respect to a pre-trained language model while satisfying the constraints. Our approach is highly flexible, requires no task-specific training, and leverages efficient constraint satisfaction solving techniques. To better handle the combinatorial constraints, a tree search algorithm is embedded into the proposal process of the Markov chain Monte Carlo (MCMC) to explore candidates that satisfy more constraints. Compared to existing MCMC approaches, our sampling approach has better mixing performance. Experiments show that TSMH achieves consistent and significant improvement on multiple language generation tasks.

1 Introduction

Supervised techniques still dominate natural language generation tasks. Despite their success, supervised approaches need to be trained with massive datasets of input-output pairs, which are non-trivial to acquire. In addition, it is hard to guarantee that the output sentences satisfy constraints. Recent approaches first pre-train a language model on a general-purpose dataset, then fine-tune the neural net on a task-specific dataset (Devlin et al., 2019; Radford et al., 2019). These approaches partially mitigate data hunger in training large and flexible neural networks. Nevertheless, they still require carefully crafted datasets for fine-tuning.

Our code is available at https://github.com/Milozms/TSMH.

We present a constraint satisfaction driven approach for language generation. In particular, we
Figure 1: (a) Natural language generation via constraint satisfaction (bottom), compared to the supervised approach (top). (b) Our proposed tree search enhanced MCMC (TSMH, pink line) traverses the probabilistic space of high-quality sentences more effectively than the baseline (blue line).

sample sentences that attain high likelihoods from a language model and satisfy task-specific constraints. Sampling sentences that attain high likelihoods in the language model ensures the quality of the generated sentence. Constraints guarantee that the sentences fit the specific language task. The constraints can be hard ones such as grammar rules, or soft ones such as attaining positive sentiment scores.

Our method harnesses constraint satisfaction, rather than learning, to guide language generation. In fact, there is no task-specific training in our approach. Our approach is highly flexible since constraints can be switched quickly to adapt to a different task, even faster than fine-tuning. It also allows us to leverage the latest developments of automated reasoning for language generation. Although the field of language generation is dominated by learning, reasoning should play an equally important role. Human beings can write beautiful words by reasoning over what is needed in the specific writing task, without learning from previous examples.

To better handle the combinatorial constraints, a tree search is embedded into the proposal process of the Markov chain Monte Carlo (MCMC) for constrained language generation, which suggests candidate proposals that satisfy more constraints. Our approach is motivated by SampleSearch (Gogate and Dechter, 2007a,b, 2011), which integrates backtrack search into importance sampling.
Making multiple word-level changes within one proposal step of MCMC allows direct transitions between legitimate sentences, while previous approaches must go through infeasible intermediate states. Such moves are typically rejected by MCMC and therefore result in a slow mixing rate (see Figure 1(b) and Section 3.1).

In the literature, constrained language generation has been attacked in a supervised way (Sutskever et al., 2014; Berglund et al., 2015; Hu et al., 2017; Zhang et al., 2019; Miao et al., 2020). There are also multiple works that model language rules as decomposed tree structures (Lee et al., 2019) or sentiment tags (Su et al., 2018). Markov Logic networks (Richardson and Domingos, 2006; Khot et al., 2015) have also been used to formulate grammar rules. The distance between vectors representing sentence meanings is considered as a soft constraint in (Prabhumoye et al., 2018; Belanger and McCallum, 2016; Amato and MacDonald, 2010). In a nutshell, we summarize our contributions as follows:

1. We define the problem of constraint satisfaction driven natural language generation, and propose a sampling-based approach to tackle the problem with combinatorial constraints.

2. We propose a Tree Search enhanced Metropolis-Hastings approach (TSMH) for the proposed task, which mixes faster than standard MCMC in the presence of combinatorial constraints.

3. Experimental results on generating interrogative and imperative sentences with keywords, and sentences with given sentiments, demonstrate that TSMH is able to generate sentences that satisfy more hard and soft constraints while retaining good quality.

2 Language Generation via Constraint Satisfaction

We provide a general framework for constrained natural language generation. In this framework, sentences are generated by sampling from a probability distribution that is proportional to the score of a pre-trained language model times a constraint score.
Formally, let x be a sentence and π(x) be the probability that x is sampled; then π(x) should be:

    π(x) ∝ P_LM(x) · Constraint(x).    (1)

Here, P_LM(x) is the score of a language model (Sundermeyer et al., 2012; Radford et al., 2019), which measures the quality of sentence x. A higher P_LM(x) means the sentence x is better in quality. Constraint(x) is a task-specific penalty term. For example, in interrogative sentence generation, we would enforce Constraint(x) to guarantee that only sentences in the interrogative form receive high scores. Constraints are composed of hard and soft constraint terms:

    Constraint(x) = Φ_hard(x) · Φ_soft(x).    (2)

Both the hard constraint score Φ_hard(x) and the soft constraint score Φ_soft(x) are float values ranging from 0 to 1. The closer to 1, the more satisfied the constraints are.

Unlike supervised methods, which need to be trained with paired input-output data, our framework can solve language generation tasks without task-specific training. P_LM(x) comes from a language model trained only on general-purpose language tasks; there is no fine-tuning of P_LM(x) on the specific task. Φ_hard(x) is based on crafted constraints. Φ_soft(x) comes from either user-defined functions or pre-trained neural networks, which again are not fine-tuned on the specific task. The overall formulation, composed of the language model and the task-specific constraints, allows us to sample sentences that are close to natural language while satisfying the constraints.

2.1 Hard Constraints

In this paper, we use propositional logic to define hard constraints Φ_hard(x). Nevertheless, our sampling approach generalizes to other logic forms. We leave the generalization to first-order logic as future work.
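As a minimal sketch, the unnormalized target of Equations 1 and 2 is just a product of three scores. The callables below (a language-model log-probability and the two constraint scores) are hypothetical stand-ins, not the paper's implementation:

```python
import math

def stationary_score(x, log_p_lm, phi_hard, phi_soft):
    """Unnormalized pi(x) = P_LM(x) * Phi_hard(x) * Phi_soft(x) (Eq. 1-2).

    log_p_lm, phi_hard, phi_soft are placeholder callables: a language-model
    log-probability and hard/soft constraint scores in [0, 1].
    """
    return math.exp(log_p_lm(x)) * phi_hard(x) * phi_soft(x)
```

Because MCMC only ever uses ratios π(x*)/π(x), the normalization constant of this distribution never has to be computed.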
For hard constraints, we define Φ_hard(x) as

    Φ_hard(x) = β^(M − Σ_i c_i(x)),    (3)

where c_i(x) is an indicator variable that takes 1 if the sentence x satisfies the i-th constraint, and M is the total number of hard constraints. β is between 0 and 1. We use quite small β values in our experiments, which puts a large penalty on violating even one hard constraint. We also define the constraint error C(x) as the number of hard constraints a sentence violates, i.e., C(x) = M − Σ_i c_i(x). Constraints are defined in the logical form of word categories.

Word Category Division
We divide the entire vocabulary into several categories of words. Given the vocabulary U, we partition it into non-overlapping subsets V = {V_1, V_2, ..., V_|V|}, satisfying: (i) all V_i are subsets of U: V_i ⊆ U, ∀V_i ∈ V; (ii) categories are non-overlapping: V_i ∩ V_j = ∅, ∀V_i, V_j ∈ V, i ≠ j; (iii) the V_i together cover the whole vocabulary: ∪_{i=1}^{|V|} V_i = U.

The word category division strategy varies for different tasks. For example, we split the whole vocabulary into V = {[QWH], [AUX], [OTH]} for generating interrogative sentences. Here, V_1 = [QWH] represents the set of wh-words leading a question: what, when, where, which, who, whom, whose, why, how. V_2 = [AUX] represents the set of auxiliary verbs and copula words: do, does, did, be, am, are, is, etc. V_3 = [OTH] contains all other words in the vocabulary. We may use another division for, e.g., generating imperative sentences. Sometimes we need to generate sentences with keywords; in that case we let each keyword form its own category. For example, to generate interrogative sentences with the keyword learning, the division would be V = {[QWH], [AUX], [learning], [OTH]}.

Hard Constraints
Given a sentence with length m, let w_i^[V_j] ∈ {true, false} be an indicator variable stating that the i-th word in the sentence is in category V_j. (As we conduct sampling over sentences, the sentence length is known in advance, and we set m as the length of the longest sentence.) For example, the variable w_1^[QWH] = true if and only if the first word in the sentence is a wh-like word. Sentence-level constraints can then be defined using propositional logic over the w_i^[V_j] (and (∧), or (∨), not (¬)). We give a few examples below.

Enforcing Keywords in a Sentence
Given one keyword K, we can enforce its existence in the sentence using the following constraint:

    w_1^[K] ∨ w_2^[K] ∨ ··· ∨ w_m^[K],

where [K] is a category containing only the keyword K. We formulate this constraint assuming a known sentence length m. Indeed, the length m is a variable and can vary over the sampling procedure. Nevertheless, as we will see shortly, the lengths of both sentences are known when transiting from one sentence to another, so the semantic meaning of m is clear during sampling. Details on the sampling process are in Section 3.2.

Enforcing Imperative Sentence
According to the definition in (Aarts, 1989), the starting word of an imperative sentence should be either a verb, w_1^[VERB], or an adverb followed by a verb, w_1^[ADV] ∧ w_2^[VERB]. We encode this constraint as:

    w_1^[VERB] ∨ (w_1^[ADV] ∧ w_2^[VERB]).

Enforcing Interrogative Sentence
We use the following two constraints to enforce that the sentence is interrogative: (i) the first word is in [QWH]; (ii) the second or third word in the sentence is in [AUX]. Constraints (i, ii) can be written together as:

    w_1^[QWH] ∧ ((w_2^[AUX] ∧ ¬w_3^[AUX]) ∨ (w_3^[AUX] ∧ ¬w_2^[AUX])).

This constraint is similar to the definition in (Zhang et al., 2017). We acknowledge that this is a relaxed constraint. Nevertheless, our sampling approach also considers the score from the language model. These constraints, accompanied by the language model, guide us to good interrogative sentences in practice.
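As a concrete sketch, the keyword and interrogative constraints can be evaluated on a tokenized sentence as follows. The category word lists here are illustrative assumptions, not the paper's full vocabulary division:

```python
# Illustrative word-category sets; the actual [QWH]/[AUX] lists are larger.
QWH = {"what", "when", "where", "which", "who", "whom", "whose", "why", "how"}
AUX = {"do", "does", "did", "be", "am", "are", "is", "was", "were",
       "can", "could", "will", "would"}

def in_cat(words, i, cat):
    """Indicator w_i^[cat]: the i-th word (1-indexed) is in category `cat`."""
    return i <= len(words) and words[i - 1].lower() in cat

def hard_constraints(words, keywords=()):
    """Indicator values c_i(x): the two interrogative constraints plus one
    keyword-occurrence constraint per keyword."""
    c = [in_cat(words, 1, QWH),
         (in_cat(words, 2, AUX) and not in_cat(words, 3, AUX))
         or (in_cat(words, 3, AUX) and not in_cat(words, 2, AUX))]
    for k in keywords:
        c.append(any(w.lower() == k for w in words))  # w_1^[K] v ... v w_m^[K]
    return c

def constraint_error(words, keywords=()):
    """C(x) = M - sum_i c_i(x): the number of violated hard constraints."""
    c = hard_constraints(words, keywords)
    return len(c) - sum(c)
```

Following Equation 3, Φ_hard(x) would then be computed as `beta ** constraint_error(words, keywords)`.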
2.2 Soft Constraints

A soft constraint assigns a float value between 0 and 1 to indicate how well the constraint is satisfied. For tasks with only hard constraints, Φ_soft(x) is set to 1. Soft constraints can be derived quite flexibly: from a user-defined function (see "sentence similarity" for an example), or from a pre-trained neural network (see "sentiment score").

Sentence Similarity
We can define a soft constraint function ensuring that the generated sentence x is close to a reference sentence y in semantic meaning. For each word in sentence x, we first find the closest word in sentence y by computing their cosine similarity. Then either the minimum or the average of these cosine similarities is taken as the similarity score between sentences x and y.

Sentiment Score
We can enforce that the generated sentence has a given sentiment by constraining the score the sentence receives from a sentiment analysis model. The output score of a sentiment analysis neural net represents whether the sentence has a positive or negative sentiment. We use this score as a soft constraint to control whether the generated sentence has a positive or negative attitude. Notice that the sentiment analysis neural net is pre-trained on a separate dataset and remains intact in our framework.

This setup gives us additional flexibility. To be specific, if we need to generate sentences that contain keywords while having a given sentiment, it is difficult to find a large dataset of this type, and the performance of a pure learning approach may be limited. To summarize, the main attribute of the constraint satisfaction framework is that it allows a formulation using both hard and soft constraints, without the need for task-specific training or tuning.
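The sentence-similarity soft constraint above can be sketched as follows. The embedding table `emb` is a stand-in; a real system would plug in pre-trained word embeddings (an assumption here, not the paper's exact setup):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_soft_constraint(x_words, y_words, emb, reduce="avg"):
    """Phi_soft(x) sketch: for each word of x, take its best cosine match in
    the reference sentence y, then aggregate by minimum or average (the two
    variants mentioned in the text). `emb` maps a word to a vector."""
    best = [max(cosine(emb[w], emb[v]) for v in y_words) for w in x_words]
    return min(best) if reduce == "min" else sum(best) / len(best)
```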
3 Tree Search Enhanced MCMC

Markov chain Monte Carlo (MCMC) is a classical approach to sample sentences from the probability distribution π(x) defined in Equation 1. Starting from one sentence x, MCMC moves to the next sentence by first generating a sample x* from the proposal distribution Q(x*|x) and then accepting x* with the following acceptance rate A(x*|x):

    A(x*|x) = min{1, (π(x*) Q(x|x*)) / (π(x) Q(x*|x))}.    (4)

If sentence x* is rejected, the sample remains at x. The distribution of samples converges to the stationary distribution of the Markov chain, π(x). Previous work (Miao et al., 2019) proposes to use MCMC for constrained sentence generation, namely the CGMH algorithm. Its proposal distribution only suggests sentences with one-word modifications. As a result, CGMH cannot handle the combinatorial constraints in our problem definition, because of the low acceptance ratio caused by the locality of the proposal distribution. In other words, the sampling process can only visit a limited number of neighbors, so the Markov chain is easily trapped at one infeasible state, resulting in many rejections. We illustrate this problem in detail, and hence motivate our tree search embedded MCMC approach, using the following example.

3.1 A Motivating Example

Suppose we need to generate a question whose answer comes from an underlined part of a sentence. For example, suppose we underline
France in the sentence:
A: Paris is located in France.
The question we would like to generate is:
B: Which country is Paris located in?
Under our constraint satisfaction framework, we define
Constraint(x) so that real interrogative sentences such as question B receive high probability under the defined π(x). Our constraints are: (i) the whole sentence is in the interrogative form; (ii) Paris and located must appear in the sentence. We run MCMC starting from sentence A.

It is hard for MCMC without tree search to generate question B in a reasonable time starting from A. Because the edit distance between sentences A and B is larger than 1, we cannot generate B from A with one step of word insertion, removal, or replacement. In order for CGMH to reach B from A, it has to pass through a few intermediate steps. Without loss of generality, suppose CGMH proposes sentence C in one MCMC step by removing is:

C: Paris located in France.
Notice that C is not a legitimate English sentence, so its language model score P_LM(C) is much smaller than that of the original sentence A. In addition, C violates more constraints than A, which decreases its Constraint(x) score as well. In MCMC, the probability of accepting the move from A to sentence C is given by Equation 4, in which the dominating term is π(C)/π(A) = (P_LM(C) Constraint(C)) / (P_LM(A) Constraint(A)). Because both P_LM(C) and Constraint(C) are smaller, the acceptance ratio becomes very small. In fact, we found the acceptance ratio to be on the order of 10^-12 in our experiment. This means that it will take CGMH many steps (on the order of 10^12) to move one step from sentence A to C. Figure 2 (left) demonstrates this. It is easy to verify that barriers of low acceptance rate exist on every path from sentence A to B, and thus the rejection problem persists.

Figure 2: Our method, tree search embedded MCMC (TSMH), outperforms CGMH in generating sentences with complex combinatorial constraints. (Left) CGMH must pass through intermediate sentence states, which have a very low acceptance rate, to reach the intermediate sentence "Is Paris located in France?" starting from the sentence "Paris is located in France." This results in the poor performance of CGMH when handling combinatorial constraints. (Right) By embedding a tree search into MCMC, TSMH can reach the intermediate sentence from the starting sentence in one step, with an acceptance rate close to 100%. R, I, D mean replace, insert, delete. See Section 3.1 for a detailed discussion.

On the other hand, if we allow the proposal distribution to suggest sentences with multiple word-level changes, one can transit from sentence A to B through legitimate sentences only as intermediate steps. Consider the following two-step change:

1. First delete is and insert Is before Paris. This changes sentence A to D:

D: Is Paris located in France?
2. Delete France and insert Which and country. This changes sentence D to B.

Because the intermediate sentence D is a legitimate English sentence and Constraint(D) = Constraint(A), π(D)/π(A) is close to 1, resulting in a high acceptance ratio for this step. When changing from D to B, notice that B is also a legitimate sentence and it satisfies more constraints than D; the acceptance ratio for this step is likewise close to 1. Figure 2 (right) demonstrates this case.

For tasks with soft constraints, CGMH suffers from similar rejection problems. For example, "Nothing is impossible" is a sentence with positive sentiment. If we insert, replace, or delete one word, it is hard to keep the sentence valid and preserve the positive sentiment.

Motivated by these examples, we propose to embed a tree search into the proposal process of MCMC to solve the rejection problem; the tree search suggests candidate sentences that make multiple word-level changes and satisfy more constraints.

3.2 Tree Search Embedded MCMC

Our Tree Search enhanced Metropolis-Hastings (TSMH) still follows the classical MCMC procedure. The only difference is a new proposal distribution Q(x*|x) generated by a tree search process. The tree search defines a probability distribution over templates of sentence moves. Each template defines a subset of possible moves, and the sentences within the same template have the same hard constraint score Φ_hard(x). The proposal distribution induced by the tree search algorithm biases towards templates that have high Constraint(x) scores.

A template defines a set of sentences where each word is either given or specified by a word category. For example, a template [[QWH], [AUX], [OTH], [OTH]] restricts that the first word must be a wh-word, the second word must be an auxiliary verb, and the last two words must be other words.

Notice that we can decide how many hard constraints a sentence satisfies at the template level, since the indicator variables in the constraints defined in this paper only restrict the categories of words.
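This template-level evaluation can be sketched as follows, for the interrogative constraints of Section 2.1; the label strings and layout are illustrative assumptions:

```python
def template_constraint_error(template):
    """Template-level constraint check (sketch): a template is a list of word
    category labels, so the interrogative hard constraints can be evaluated
    before any concrete word is filled in. Returns the number of violated
    constraints."""
    def cat(i):  # category at 1-indexed position i, or None past the end
        return template[i - 1] if i <= len(template) else None
    c1 = cat(1) == "[QWH]"                           # first word is a wh-word
    c2 = (cat(2) == "[AUX]") != (cat(3) == "[AUX]")  # exactly one of pos. 2/3
    return 2 - (c1 + c2)
```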
For example, the template [[QWH], [AUX], [OTH], [OTH]] satisfies the constraints of being an interrogative sentence defined in Section 2. Our proposal procedure first samples a template and then fills in this template with words based on a language model.

Overview of the Proposal Process
During the sampling process, suppose we are at sentence x. We sample a new sentence x* from the proposal distribution as follows. First, our algorithm decides the positions of the words to change by random selection; typically, it changes more than one word. Then we use a tree search, which enumerates all possible operations on the selected words. This includes deciding the operation on each word (insert, delete, or replace) as well as the associated word category in the case of insert and replace. Every leaf branch of the search tree is then a sentence template. Because the number of word categories is limited, the tree search procedure is often cheap. As discussed, we can infer the number of hard constraints satisfied from the template associated with each tree leaf. We then rank these templates by the number of constraints satisfied and sample one template based on a geometric series, favoring templates that satisfy more constraints. Finally, we fill in the sampled template with words suggested by a language model, and select one filled sentence x̂ as the proposal according to the language model score times the soft constraint score, P_LM(x̂) · Φ_soft(x̂). Soft constraints Φ_soft(x) give us a real number, similar to the language model P_LM(x), so we treat them together with the language model in the proposal process.

Based on the analysis in Section 3.1, our approach alleviates the rejection problem of MCMC by enumerating all possibilities in the space of multiple word changes at the template level. This enables us to handle combinatorial constraints. The tree search also allows us to prune useless branches.

The procedure for searching proposals in our tree search embedded MCMC is as follows, and is shown in Figure 3.
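The four steps (Position, Search, Rank and Group Selection, Fill and Template Selection) can be sketched end to end as follows. `enumerate_templates`, `constraint_error`, and `fill_with_lm` are hypothetical stand-ins for the components described in the text, not the paper's implementation:

```python
import random

def group_weights(errors, beta):
    """Rank and Group Selection: group i (constraint error C_i) is chosen with
    probability proportional to (1 - beta) * beta ** (C_i - min_j C_j)."""
    cmin = min(errors)
    return [(1 - beta) * beta ** (c - cmin) for c in errors]

def sample_index(weights, u):
    """Pick an index from unnormalized weights with one uniform draw u in [0, 1)."""
    total, acc = sum(weights), 0.0
    for i, w in enumerate(weights):
        acc += w
        if u * total < acc:
            return i
    return len(weights) - 1

def propose(sentence, k, enumerate_templates, constraint_error, fill_with_lm,
            beta=1e-3, rng=random):
    """One TSMH proposal step (sketch). `fill_with_lm` is assumed to return a
    pair (filled_sentence, P_LM * Phi_soft)."""
    positions = rng.sample(range(len(sentence)), k)                # Position
    templates = enumerate_templates(sentence, positions)           # Search
    errors = sorted(set(constraint_error(t) for t in templates))
    gi = sample_index(group_weights(errors, beta), rng.random())   # Rank/Group
    group = [t for t in templates if constraint_error(t) == errors[gi]]
    filled = [fill_with_lm(sentence, t) for t in group]            # Fill
    ti = sample_index([s for _, s in filled], rng.random())        # Template sel.
    return filled[ti][0]
```

With a very small β, `group_weights` concentrates almost all mass on the group with the fewest violated constraints, while keeping every group reachable.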
Position
Randomly select k positions {t_1, ..., t_k} at which to perform word-level operations, with uniform probability, where k is the number of positions changed in one proposal. The probability of getting each combination of positions is P_pos = 1 / C(m, k), where m is the length of the sentence and C(m, k) is the binomial coefficient.

Search
Search and iterate over all different operations and all different word categories (Section 2.1) for each selected position. For example, if we have |V| word categories and the operation set {replace, insert, delete, none}, we need to enumerate (2|V| + 2)^k different combinations of operations and word categories. We use the word placeholder [MASK] to represent an unknown inserted or replaced word. We keep track of all the generated templates and their corresponding numbers of violated constraints.

Figure 3: The proposal process of Tree Search Embedded MCMC. The input is the current sentence (state) and the output is the proposed sentence. This proposal process favors sentences satisfying a large number of constraints.
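The Search step's enumeration can be sketched as follows; the `(position, (operation, category))` layout is illustrative only:

```python
import itertools

OPS_WITH_CATEGORY = ("insert", "replace")  # these carry a [MASK] word category
OPS_WITHOUT = ("delete", "none")

def search_templates(positions, categories):
    """Search step (sketch): enumerate every combination of operation and word
    category at the k selected positions. insert/replace carry a category for
    the new [MASK] word, delete/none do not, giving (2|V| + 2)**k templates."""
    per_position = ([(op, c) for op in OPS_WITH_CATEGORY for c in categories]
                    + [(op, None) for op in OPS_WITHOUT])
    return [tuple(zip(positions, combo))
            for combo in itertools.product(per_position, repeat=len(positions))]
```

For |V| = 3 categories and k = 2 positions this yields (2·3 + 2)² = 64 templates, matching the count given in the text.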
Rank and Group Selection
We define a group as the set of templates that violate the same number of constraints. We sort all templates by their number of violated constraints (constraint error) C in ascending order, and put templates with the same C into one group. We then randomly select group i with probability P_group = (1 − β) · β^(C_i − min_j C_j), where C_i is the constraint error of group i and β is a very small float value. In this way, we favor choosing the group satisfying the largest number of constraints, while also ensuring the irreducibility of the Markov chain. Let the chosen group at this step be G_i.

Fill and Template Selection
In this step we first fill every template in the selected group G_i with words, then select one filled template as the proposal. Because a template restricts each masked word to be chosen only from the corresponding word category, we fill it by selecting words from the given word category. The probability of selecting the word at position t_i, P_fill_i, is the conditional probability of filling in the word at this location given the context: P_LM(x_{t_i} | x_1, ..., x_{t_i−1}, x_{t_i+1}, ..., x_m). The probability of obtaining one sampled sentence is P_fill = Π_{i=1}^k P_fill_i, where i indexes the word-level action at the i-th selected position. If the operation at t_i is delete or none, then P_fill_i = 1. We sample one template within the group (together with the corresponding sampled sentence) according to the sentence probability times the soft constraint score:

    P_template = (P_LM(x*) · Φ_soft(x*)) / (Σ_{x̂ ∈ G_i} P_LM(x̂) · Φ_soft(x̂)).

The proposal distribution Q(x*|x) leading from sentence state x to x* in this procedure is Q(x*|x) = P_pos · P_group · P_fill · P_template.

4 Experiments

We evaluate our approach on three applications: generating interrogative, imperative, and fixed-sentiment sentences. In each task, we construct the specified type of sentences by sampling, starting from keywords and enforcing task-specific constraints. For each task, we run our TSMH algorithm for 100 steps, generating 100 candidate sentences, with k set to 3. Since the tree search in TSMH considers changing 3 words at each iteration, we run the baseline CGMH for 300 steps as a comparison. We select the sentence with the highest π(x) value among the sentences generated by each algorithm as the output. Our results are summarized in Table 1.

In general, our method TSMH outperforms baselines and generates sentences that satisfy more constraints, are of good quality, and are likely to be close to natural language.
In Table 1, Valid% denotes the percentage of generated sentences that satisfy all constraints, π(x) is the value of the stationary probability P_LM(x) · Constraint(x), and P_GPT-2(x) is the language model probability estimated by a pre-trained GPT-2 model, which measures the quality of the sentences. Accept% is the acceptance rate of MCMC. Detailed experiment settings can be found in Appendix A.1.

Interrogative Sentence Generation

For interrogative sentence generation, we construct interrogative sentences by sampling, starting from the keywords. We enforce that sentences with a high probability of being sampled must satisfy the grammar constraints of being interrogative and contain a few given keywords. The constraint definition for interrogative sentences is in Section 2.1.

According to the results, in the experiment with keywords, 92.67% of the output sentences of our TSMH algorithm satisfy all the constraints, while merely 18.33% do for the baseline. The numbers are 83.17% and 45.50% for the experiment without keywords, respectively. This demonstrates that TSMH generates sentences with more constraints satisfied. In addition, our method has a higher π(x) (stationary probability) and acceptance rate, suggesting that the embedded tree search helps MCMC mix faster. Overall, TSMH can handle more complicated constraints in language generation tasks.

Human Evaluation
We conduct a human evaluation of the interrogative sentences generated with keywords. We present human participants from Amazon Mechanical Turk with a pair of sentences at a time: one sentence generated by our TSMH model and the other by the baseline CGMH. We ask participants which sentence is better in terms of fluency and grammar.

In terms of the experimental setting, we use 100 sentence pairs generated by CGMH and TSMH with the same keyword inputs. We randomly split these 100 test sentence pairs into 5 survey groups, deploy them on Amazon Mechanical Turk, and randomly assign participants to survey groups. When showing the sentence pairs, we also provide the keywords that the sentences must contain. We ask participants to vote on which sentence in the pair is better in terms of grammar coherence, keyword coverage, and fluency. We use a gold-standard question to detect whether a voter is answering randomly. Every valid survey contains a randomized set of 20 questions. We received 580 votes in all, with each question pair receiving several votes. As shown in Table 2, sentences from our model received almost twice as many votes as the baseline, which suggests that the sentences generated by our approach are better under human evaluation.

Case Studies
As shown in Table 3, we compare some output sentences of our method with the baseline, using the same inputs and keywords. More examples can be found in Appendix A.2. From these cases, we can see that our method generates sentences of better quality.
Comparison with Other Methods
We compareasks Methods π ( x ) P GPT − ( x ) Accept%Interrogative CGMH 300 1 18.33% 2.60E-04 1.78E-18 5.45%TSMH (Ours) 100 3
Imperative CGMH 300 1 91.32% 0.0004 9.86E-16 5.49%TSMH (Ours) 100 3
Sentiment CGMH 300 1 96.33% 4.93E-19 4.57E-22 6.72%TSMH (Ours) 100 3
Table 1: Our method TSMH outperforms CGMH by generating sentences that satisfy more constraints, are ofgood quality and are likely to be natural language. Column Valid% shows the percentage of generated sentencesthat satisfy all constraints, which TSMH clearly leads baselines. In addition, TSMH has better acceptance rates(Accept%). The language generated by TSMH is also of good quality, because it matches other models in languagemodel scores P GPT − ( x ) . Multiplying both the language model score and the constraint score, the sentencesgenerated by TSMH tend to attain higher stationary probability π ( x ) . Methods
384 66.36 % Table 2: Human evaluation of the quality of the gen-erated interrogative sentences from keywords in termsof fluency and grammar. Most human participants (na-tive speakers) agree that the sentences generated by ourTSMH are better in quality compared to CGMH.
Keys waste heat waterCGMH what waste is there, it seems now?TSMH where was the waste - water heater?Keys responses protect lungsCGMH how can immune responses also occur bynot only infecting pathogens in thecentral nervous system?TSMH what responses do your lungs have to protectyou from pathogenic bacteria?Keys median temperature winterCGMH what do you mean we have median temperaturewinter and spring, anyways?TSMH what is the median temperature range in thewinter months?Keys catholics concentrated franceCGMH the catholics are now mainly concentrated there.TSMH why are the french roman catholics so denselyconcentrated in southern france?
Table 3: Case study of generating interrogative sen-tences with keywords, where Keys stands for keywords.Full case study is in the supplementary materials. our TSMH method with UQA (Lewis et al., 2019).The setting of UQA is different from us: it takes aparagraph as input and generates a correspondingquestion. Although this comparison is not fair, thebaseline is the most similar and the best framework that we can compare with. To run UQA, we usethe corresponding original sentences from whichthe keywords of TSMH are extracted as the input.In other words, for TSMH, the inputs are keywordsextracted from the SQuAD 2.0 (Rajpurkar et al.,2018) questions. For UQA, we take the correspond-ing paragraphs of the selected questions as input.This also gives UQA additional advantage becauseit has access to a paragraph, rather than keywords.To make it more comparable, we remove the key-word constraints in this experiment. In Table 4, wecompare the language model scores log P LM of thegenerated sentences that reflect the naturalness andfluency, and the stationary probability π ( x ) andvalid percentage Valid% that show how good it sat-isfies our pre-defined constraints. We pointed outthat UQA was trained on the specific interrogativesentences while our method was not trained at all.Methods π ( x ) Valid% log P LM UQA 0.0024 50% -92.75TSMH
Table 4: Comparison with UQA. Our TSMH outperforms UQA in terms of the percentage of sentences satisfying the interrogative constraints, and obtains a higher language model score, even though UQA is trained on interrogative sentences while our method is not trained at all.
We generate imperative sentences via sampling, starting from the keywords. We enforce the grammar constraints of an imperative sentence: the starting word should be either a verb (w1[VERB]) or an adverb followed by a verb (w1[ADV] ∧ w2[VERB]). We also enforce keyword constraints in this task. As shown in Table 1, our method has a higher valid percentage of 97.75% compared to 91.32% for the baseline, showing that the sentences generated by our method satisfy more constraints. Our method also has a higher stationary probability π(x) and acceptance rate, suggesting better mixing behavior. Overall, the results show that our Tree Search Embedded MCMC can handle more complicated combinatorial constraints in language generation.

In this task, we require the sentences to contain the specified keywords and to have positive sentiment (Fu et al., 2019). We enforce the sentences to attain high scores from a sentiment analysis neural network, and we enforce the keyword constraints as hard constraints. We emphasize that our method uses a sentiment analysis model pre-trained on a separate dataset, which is kept intact in our experiment; no additional fine-tuning of the sentiment analysis model was performed. We consider two sub-tasks in Table 5: (i) positive sentiment to positive sentiment (P2P), where the input keywords are extracted from sentences that originally have positive sentiment; and (ii) negative sentiment to positive sentiment (N2P), where the keywords are extracted from sentences with negative sentiment. N2P is more difficult, as it requires transforming the sentiment. Our method has a higher sentiment score, suggesting that it generates sentences with more positive sentiment (better aligned with the target of this experiment). The gain over CGMH is larger on the more difficult N2P task, which requires flipping the sentiment. Our model also leads in terms of language model scores, suggesting better language quality.
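The imperative constraint described earlier in this section (the first word is a verb, or an adverb immediately followed by a verb) reduces to a small Boolean check over POS tags. A minimal sketch; the tiny hand-made `TOY_POS` lexicon is purely illustrative, whereas the actual implementation uses real POS taggers (see the appendix):

```python
# Check the imperative-sentence constraint: the sentence must start with a
# verb, or with an adverb immediately followed by a verb.
# The toy POS lexicon below is illustrative only.
TOY_POS = {
    "turn": "VERB", "stay": "VERB", "have": "VERB", "be": "VERB",
    "please": "ADV", "quickly": "ADV",
    "lights": "NOUN", "window": "NOUN",
}

def is_imperative(sentence):
    words = sentence.lower().split()
    tags = [TOY_POS.get(w, "OTHER") for w in words]
    if not tags:
        return False
    # w1[VERB]  or  (w1[ADV] and w2[VERB])
    return tags[0] == "VERB" or (
        tags[0] == "ADV" and len(tags) > 1 and tags[1] == "VERB"
    )

print(is_imperative("turn on the lights"))   # True: starts with a verb
print(is_imperative("please be careful"))    # True: adverb + verb
print(is_imperative("the window is open"))   # False
```

Because the check is a conjunction/disjunction over per-position predicates, it composes naturally with the other hard constraints the tree search enforces during proposal generation.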
Tasks  Method   π(x)    P_GPT-2  Accept%  Senti
P2P    CGMH     9E-19   8E-22    8.16%    0.8647
       TSMH
N2P    CGMH     5E-20   6E-23    5.65%    0.3470
       TSMH
Table 5: Generating sentences with positive sentiment. Half of the input keywords are extracted from positive sentences (P2P), and the other half from negative sentences (N2P), which are harder to transform into positive sentences.
Methods  π(x)      P_GPT-2(x)  Sentiment
CtrlGen  3.19E-07  4.64E-22    0.4614
TSMH
Table 6: Comparison with CtrlGen (Hu et al., 2017) on the N2P subtask in terms of acceptance rate, language score, and sentiment score.
Comparison with Other Methods
We compare our method with CtrlGen (Hu et al., 2017). Its setting is slightly different from ours: it takes a sentence with negative sentiment as input and transforms it into a positive one, without any guarantee of satisfying keyword constraints, whereas our method takes a set of keywords as input. To make the outputs comparable, we select the same set of negative sentences as the input to CtrlGen and extract the keywords of those sentences as the input to TSMH. Our method requires no additional training beyond a pre-trained sentiment analysis model and a pre-trained language model, while CtrlGen requires training an auto-encoder. The results in Table 6 show that our method outperforms CtrlGen in terms of both sentence quality and sentiment: the sentences generated by our method receive higher language model scores and higher sentiment scores.
We propose a framework for constraint-driven language generation via sampling and combinatorial constraint satisfaction. Our strategy is to sample sentences from the constrained space with probability proportional to the scores of the language model. To better handle the combinatorial constraints, a tree search is embedded into the proposal process of MCMC to suggest candidate proposals that satisfy more constraints. Experiments demonstrate that our approach generates sentences that satisfy more constraints, are of good quality, and are close in quality to natural language.
Acknowledgements
This research was supported by the National Science Foundation (award numbers IIS-1850243 and CCF-1918327). The computing infrastructure was partially supported by the Microsoft AI for Earth computing award. The authors would like to thank Mr. Ning Miao for valuable suggestions.

References
Flor Aarts. 1989. Imperative sentences in English: Semantics and pragmatics. Studia Linguistica, 43(2):119–134.

Michael S. Amato and Maryellen C. MacDonald. 2010. Sentence processing in an artificial language: Learning and using combinatorial constraints. Cognition, 116(1):143–148.

David Belanger and Andrew McCallum. 2016. Structured prediction energy networks. In Proceedings of the 33rd International Conference on Machine Learning, pages 983–992.

Mathias Berglund, Tapani Raiko, Mikko Honkala, Leo Kärkkäinen, Akos Vetek, and Juha Karhunen. 2015. Bidirectional recurrent neural networks as generative models. In Advances in Neural Information Processing Systems, pages 856–864.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4171–4186. Association for Computational Linguistics.

Yao Fu, Hao Zhou, Jiaze Chen, and Lei Li. 2019. Rethinking text attribute transfer: A lexical analysis. In Proceedings of the 12th International Conference on Natural Language Generation (INLG).

Vibhav Gogate and Rina Dechter. 2007a. Approximate counting by sampling the backtrack-free search space. In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, pages 198–203.

Vibhav Gogate and Rina Dechter. 2007b. SampleSearch: A scheme that searches for consistent samples. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS), pages 147–154.

Vibhav Gogate and Rina Dechter. 2011. SampleSearch: Importance sampling in presence of determinism. Artificial Intelligence, 175(2):694–729.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1587–1596.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Volume 2: Short Papers, pages 427–431. Association for Computational Linguistics.

Tushar Khot, Niranjan Balasubramanian, Eric Gribkoff, Ashish Sabharwal, Peter Clark, and Oren Etzioni. 2015. Exploring Markov logic networks for question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 685–694.

Jay Yoon Lee, Sanket Vaibhav Mehta, Michael Wick, Jean-Baptiste Tristan, and Jaime G. Carbonell. 2019. Gradient-based inference for networks with output constraints. In The Thirty-Third AAAI Conference on Artificial Intelligence, pages 4147–4154.

Patrick S. H. Lewis, Ludovic Denoyer, and Sebastian Riedel. 2019. Unsupervised question answering by cloze translation. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL), pages 4896–4910. Association for Computational Linguistics.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), System Demonstrations, pages 55–60.

Ning Miao, Yuxuan Song, Hao Zhou, and Lei Li. 2020. Do you have the right scissors? Tailoring pre-trained language models via Monte-Carlo methods. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 3436–3441.

Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and Lei Li. 2019. CGMH: Constrained sentence generation by Metropolis-Hastings sampling. In The Thirty-Third AAAI Conference on Artificial Intelligence, pages 6834–6842.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W. Black. 2018. Style transfer through back-translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 866–876.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pages 784–789.

Matthew Richardson and Pedro M. Domingos. 2006. Markov logic networks. Machine Learning, 62(1-2):107–136.

Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. 2010. Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, 1:1–20.

Jinyue Su, Jiacheng Xu, Xipeng Qiu, and Xuanjing Huang. 2018. Incorporating discriminator in sentence generation: A Gibbs sampling method. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 5496–5503.

Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. Pages 194–197.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771.

Russ Wolfinger and Michael O'Connell. 1993. Generalized linear mixed models: A pseudo-likelihood approach. Journal of Statistical Computation and Simulation, 48(3-4):233–243.

Huangzhao Zhang, Hao Zhou, Ning Miao, and Lei Li. 2019. Generating fluent adversarial examples for natural languages. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL), pages 5564–5569. Association for Computational Linguistics.

Shijie Zhang, Lizhen Qu, Shaodi You, Zhenglu Yang, and Jiawan Zhang. 2017. Automatic generation of grounded visual questions. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), pages 4235–4243.

Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28, pages 649–657.
Appendix
A.1 Detailed Experiment Settings
In this section, we detail our experimental settings for the interrogative, imperative, and sentiment sentence generation tasks, along with the human evaluation process.

In the expression of the stationary distribution in Eq. (1), the first term P_LM(x) is evaluated by the BERT model, based on HuggingFace's BERT implementation (Wolf et al., 2019). We use BERT-base in our experiments, with hyper-parameters L=12, H=768, A=12, and 110M total parameters. To evaluate P_LM(x) with the BERT model, we multiply the conditional probabilities obtained by masking each word of sentence x in turn and querying the model, similar in form to the pseudo-likelihood (Wolfinger and O'Connell, 1993). Since we only require π(x) to be proportional to P_LM(x) times the constraint score, P_LM(x) does not need to be normalized.

A.1.1 Interrogative Sentences Generation
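The pseudo-likelihood scoring of P_LM(x) described in the setup above drives all three tasks, including this one. A minimal sketch; a toy conditional model stands in for BERT here, since the scoring scheme only needs a per-position conditional probability (the stand-in model and its probabilities are illustrative only):

```python
import math

# Pseudo-likelihood scoring: mask each position in turn, query the model for
# the conditional probability of the true word given the rest, and sum logs.
# A toy conditional model stands in for BERT (illustrative numbers only).
def toy_conditional_prob(masked_words, position, true_word):
    # Pretend model: common function words are more predictable in context.
    common = {"the": 0.20, "is": 0.15, "in": 0.10}
    return common.get(true_word, 0.01)

def pseudo_log_likelihood(sentence):
    words = sentence.lower().split()
    total = 0.0
    for i, w in enumerate(words):
        masked = words[:i] + ["[MASK]"] + words[i + 1:]
        total += math.log(toy_conditional_prob(masked, i, w))
    return total

print(round(pseudo_log_likelihood("paris is located in france"), 2))
```

Because the sampler only compares ratios of π(x) between candidates, this unnormalized product suffices; no partition function over all sentences is ever computed.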
According to the adapted definition of interrogative sentence grammar, the first word should be a question word, and there should be an auxiliary verb at a suitable position. The constraint definition for interrogative sentences is given in Section 2.1. In our implementation, we additionally enforce that there is only one question word and one auxiliary verb in the sentence, which improves the quality of the generated sentences. The question words are what, when, where, which, who, whom, whose, why, and how; the auxiliary verbs are do, does, did, be, am, are, is, was, were, shall, will, should, would, can, could, may, might, and must. For the task of generating interrogative sentences with keywords, we also enforce that each keyword appears only once in the sentence. The dataset for this task is based on the SQuAD 2.0 dataset (Rajpurkar et al., 2018): we select 600 questions and remove the stop words using the Rake toolkit (Rose et al., 2010).
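The interrogative constraints above (a question word first, exactly one question word and one auxiliary verb, each keyword appearing once) can be expressed as a single predicate over the word sequence. A minimal sketch; the auxiliary-position check is simplified to "exactly one auxiliary occurs", matching the counting constraints stated above:

```python
QUESTION_WORDS = {"what", "when", "where", "which", "who", "whom",
                  "whose", "why", "how"}
AUX_VERBS = {"do", "does", "did", "be", "am", "are", "is", "was", "were",
             "shall", "will", "should", "would", "can", "could", "may",
             "might", "must"}

def satisfies_interrogative(sentence, keywords=()):
    words = sentence.lower().rstrip("?").split()
    # The first word must be a question word.
    if not words or words[0] not in QUESTION_WORDS:
        return False
    # Exactly one question word and exactly one auxiliary verb.
    if sum(w in QUESTION_WORDS for w in words) != 1:
        return False
    if sum(w in AUX_VERBS for w in words) != 1:
        return False
    # Each keyword must appear exactly once.
    return all(words.count(k) == 1 for k in keywords)

print(satisfies_interrogative("where was the waste water heater?",
                              keywords=("waste", "heater")))   # True
print(satisfies_interrogative("paris is located in france"))   # False
```

In the sampler, candidates violating this predicate receive a heavily discounted constraint score, so the chain is steered toward the valid region rather than rejecting outright.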
A.1.2 Imperative Sentences Generation
The dataset for generating imperative sentences is retrieved from https://github.com/lettergram/sentence-classification. We select 300 sentences and extract the keywords from them as our input. According to the grammar of imperative sentences, we need to verify whether a word is a present-tense verb. In the implementation, we use the POS tag information in WordNet and Stanford CoreNLP as the criterion for deciding a word's POS tag. We first select all words with at least one verb sense in WordNet (Miller, 1995), then use Stanford CoreNLP (Manning et al., 2014) to get POS tags for each word, and only preserve the present-tense forms of verbs.

A.1.3 Sentiment Sentence Generation
This application requires a set of input keywords and an external sentiment classifier, which estimates whether the sentiment of a sentence is positive. To estimate the sentiment scores, we train a sentiment analysis model with fastText (Joulin et al., 2017) on the Yelp Review Polarity dataset (Zhang et al., 2015). The input keywords are extracted from 300 selected sentences in the Yelp test set. Half of the original sentences are positive and the other half are negative (which are harder to transform into positive sentences). Given input keywords of either sentiment, we enforce the model to generate sentences with positive sentiment. The sub-task with negative-sentiment keywords is much more difficult than the one with positive-sentiment keywords, as it requires transforming negative sentiment into positive.
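The classifier's positive-class probability acts as a soft constraint score multiplied into the stationary probability. The sketch below substitutes a tiny logistic cue-word scorer for the trained fastText model so it stays self-contained; the lexicon and weighting are illustrative only, not the model used in the experiments:

```python
import math

# Toy stand-in for the trained fastText sentiment classifier: a logistic
# score over counts of positive/negative cue words (illustrative lexicon).
POSITIVE = {"great", "love", "nice", "friendly", "delicious"}
NEGATIVE = {"bad", "terrible", "rude", "awful", "slow"}

def sentiment_score(sentence):
    words = sentence.lower().split()
    logit = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return 1.0 / (1.0 + math.exp(-logit))  # probability of positive sentiment

# The sampler multiplies this soft score into the stationary probability,
# so candidates with more positive sentiment are favored during sampling.
print(sentiment_score("the food was delicious and the staff friendly"))
print(sentiment_score("the service was slow and rude"))
```

Keeping the classifier fixed (no fine-tuning) means any off-the-shelf scorer exposing a positive-class probability could be plugged in the same way.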
A.2 Case Studies
As shown in Table 7, we compare some output sentences of our method with the baseline, using the same inputs and keywords. From these cases, we can see that the baseline sometimes generates awkward or disordered sentences. For example, the baseline generates the sentence "how was lower normandy ever truly founded?". Although it seems to satisfy the constraints of an interrogative sentence, its meaning is awkward. The sentence generated by our method, "when was the duchy of normandy founded?", is more natural. Also, the baseline sentence "and please be a very very careful" does not follow imperative grammar, and "the catholics are now mainly concentrated there" is not a question.

Keys  university warsaw established
TSMH  when was the technical university of warsaw first formally established?
CGMH  polish polytechnical institute - university of technology warsaw - was established here in 1964?

Keys  organization charge running
TSMH  who would charge her with running such an organization?
CGMH  who else would charge him with running a very profitable business?

Keys  tribes khan fight
TSMH  what tribes would fight back against the genghis khans?
CGMH  why else would tribesmen like gen. and gen. genghis khan fight them off?

Keys  european travel amazon
TSMH  why did early european explorers not travel to amazonia?
CGMH  see below, also : did any european settlers ever travel to build the " first north american sailing canoes "?

Keys  economic growth schooling
TSMH  how do economic growth rates in the united states make children receive high - quality schooling?
CGMH  what good is economic growth in comparison with being among the best in public schooling?

(1) Interrogative Sentences

Keys  seat
TSMH  please get up from your seat
CGMH  go on in and take your seat

Keys  careful
TSMH  please be so very very careful.
CGMH  and please be a very very careful

Keys  turn, lights
TSMH  turn on the lights all the time
CGMH  turn on near all the main lights

Keys  close, window
TSMH  stay close enough to the window
CGMH  stick close enough to meet the window

Keys  nice, weekend
TSMH  have yourself a very nice private weekend
CGMH  please be nice about spending the weekend

(2) Imperative Sentences