Language Generation via Combinatorial Constraint Satisfaction: A Tree Search Enhanced Monte-Carlo Approach
Maosen Zhang†, Nan Jiang†, Lei Li‡, and Yexiang Xue†
†Department of Computer Science, Purdue University, Indiana, USA
‡ByteDance AI Lab
{maosen,jiang631,yexiang}@purdue.edu, [email protected]

Abstract
Generating natural language under complex constraints is a principled formulation towards controllable text generation. We present a framework that allows the specification of combinatorial constraints for sentence generation. We propose TSMH, an efficient method to generate high-likelihood sentences with respect to a pre-trained language model while satisfying the constraints. Our approach is highly flexible, requires no task-specific training, and leverages efficient constraint satisfaction solving techniques. To better handle the combinatorial constraints, a tree search algorithm is embedded into the proposal process of the Markov chain Monte Carlo (MCMC) to explore candidates that satisfy more constraints. Compared to existing MCMC approaches, our sampling approach has better mixing performance. Experiments show that TSMH achieves consistent and significant improvement on multiple language generation tasks.

1 Introduction

Supervised techniques still dominate natural language generation tasks. Despite their success, supervised approaches need to be trained with massive datasets of input-output pairs, which are non-trivial to acquire. In addition, it is hard to guarantee that the output sentences satisfy constraints. Recent approaches first pre-train a language model on a general-purpose dataset, then fine-tune the neural net on a task-specific dataset (Devlin et al., 2019; Radford et al., 2019). These approaches partially mitigate data hunger in training large and flexible neural networks. Nevertheless, they still require carefully crafted datasets for fine-tuning.

Our code is available at https://github.com/Milozms/TSMH.

We present a constraint satisfaction driven approach for language generation. In particular, we
Figure 1: (a) Natural language generation via constraint satisfaction (bottom), compared to the supervised approach (top). (b) Our proposed tree search enhanced MCMC (TSMH, pink line) traverses the probabilistic space of high-quality sentences more effectively than the baseline (blue line).

sample sentences that attain high likelihoods from a language model and satisfy task-specific constraints. Sampling sentences that attain high likelihoods in the language model ensures the quality of the generated sentence. Constraints guarantee that the sentences fit the specific language task. The constraints can be hard ones such as grammar rules, or soft ones such as attaining positive sentiment scores.

Our method harnesses constraint satisfaction, rather than learning, to guide language generation. In fact, there is no task-specific training in our approach. Our approach is highly flexible since constraints can be switched quickly to adapt to a different task, even faster than fine-tuning. It also allows us to leverage the latest developments of automated reasoning for language generation. Although the field of language generation is dominated by learning, reasoning should play an equally important role. Human beings can write beautiful words by reasoning over what is needed in the specific writing task, without learning from previous examples.

To better handle the combinatorial constraints, a tree search is embedded into the proposal process of the Markov chain Monte Carlo (MCMC) for constrained language generation, which suggests candidate proposals that satisfy more constraints. Our approach is motivated by SampleSearch (Gogate and Dechter, 2007a,b, 2011), which integrates backtrack search into importance sampling.
Making multiple word-level changes within one proposal step of MCMC allows direct transitions between legitimate sentences, while previous approaches must go through infeasible intermediate states. Such moves are typically rejected by MCMC and therefore result in a slow mixing rate (see Figure 1(b) and Section 3.1).

In the literature, constrained language generation has been attacked in a supervised way (Sutskever et al., 2014; Berglund et al., 2015; Hu et al., 2017; Zhang et al., 2019; Miao et al., 2020). There are also multiple works that model language rules as decomposed tree structures (Lee et al., 2019) or sentiment tags (Su et al., 2018). Markov Logic networks (Richardson and Domingos, 2006; Khot et al., 2015) have also been used to formulate grammar rules. The distance between vectors representing sentence meanings is considered as a soft constraint in (Prabhumoye et al., 2018; Belanger and McCallum, 2016; Amato and MacDonald, 2010). In a nutshell, we summarize our contributions as follows:

1. We define the problem of constraint satisfaction driven natural language generation, and propose a sampling-based approach to tackle the problem with combinatorial constraints.

2. We propose a Tree Search enhanced Metropolis-Hastings approach (TSMH) for the proposed task, which mixes faster than standard MCMC in the presence of combinatorial constraints.

3. Experimental results on generating interrogative and imperative sentences with keywords, and sentences with given sentiments, demonstrate that TSMH is able to generate sentences that satisfy more hard and soft constraints while retaining good quality.

2 Language Generation via Constraint Satisfaction

We provide a general framework for constrained natural language generation. In this framework, sentences are generated by sampling from a probability distribution that is proportional to the score of a pre-trained language model times a constraint score.
Formally, let x be a sentence and π(x) be the probability that x is sampled; then π(x) should be:

    π(x) ∝ P_LM(x) · Constraint(x).    (1)

Here, P_LM(x) is the score of a language model (Sundermeyer et al., 2012; Radford et al., 2019), which measures the quality of sentence x. A higher P_LM(x) means the sentence x is better in quality. Constraint(x) is a task-specific penalty term. For example, in interrogative sentence generation, we would enforce Constraint(x) to guarantee that only sentences in the interrogative form receive high scores. Constraints are composed of hard and soft constraint terms:

    Constraint(x) = Φ_hard(x) · Φ_soft(x).    (2)

Both the hard constraint score Φ_hard(x) and the soft constraint score Φ_soft(x) are float values ranging from 0 to 1. The closer to 1, the more satisfied the constraints are.

Unlike supervised methods, which need to be trained with paired input-output data, our framework can solve language generation tasks without task-specific training. P_LM(x) comes from a language model trained only on general-purpose language tasks; there is no fine-tuning of P_LM(x) on the specific task. Φ_hard(x) is based on crafted constraints. Φ_soft(x) comes from either user-defined functions or pre-trained neural networks, which again are not fine-tuned on the specific task. The overall formulation, composed of the language model and the task-specific constraints, allows us to sample sentences that are close to natural language while satisfying the constraints.

2.1 Hard Constraints

In this paper, we use propositional logic to define hard constraints Φ_hard(x). Nevertheless, our sampling approach generalizes to other logic forms. We leave the generalization to first-order logic as future work.
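As a minimal sketch, the unnormalized target of Equations 1 and 2 is just a product of three scores. The callables below (a language-model log-probability and the two constraint scores) are hypothetical stand-ins, not the paper's implementation:

```python
import math

def stationary_score(x, log_p_lm, phi_hard, phi_soft):
    """Unnormalized pi(x) = P_LM(x) * Phi_hard(x) * Phi_soft(x) (Eq. 1-2).

    log_p_lm, phi_hard, phi_soft are placeholder callables: a language-model
    log-probability and hard/soft constraint scores in [0, 1].
    """
    return math.exp(log_p_lm(x)) * phi_hard(x) * phi_soft(x)
```

Because MCMC only ever uses ratios π(x*)/π(x), the normalization constant of this distribution never has to be computed.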
For hard constraints, we define Φ_hard(x) as

    Φ_hard(x) = β^(M − Σ_i c_i(x)),    (3)

where c_i(x) is an indicator variable that takes 1 if the sentence x satisfies the i-th constraint, and M is the total number of hard constraints. β is between 0 and 1. We use quite small β values in our experiments, which puts a large penalty on violating even one hard constraint. We also define the constraint error C(x) as the number of hard constraints a sentence violates, i.e., C(x) = M − Σ_i c_i(x). Constraints are defined in the logical form of word categories.

Word Category Division
We divide the entire vocabulary into several categories of words. Given the vocabulary U, we partition it into non-overlapping subsets V = {V_1, V_2, ..., V_|V|}, satisfying: (i) all V_i are subsets of U: V_i ⊆ U, ∀V_i ∈ V; (ii) categories are non-overlapping: V_i ∩ V_j = ∅, ∀V_i, V_j ∈ V, i ≠ j; (iii) the V_i together cover the whole vocabulary: ∪_{i=1}^{|V|} V_i = U.

The word category division strategy varies for different tasks. For example, we split the whole vocabulary into V = {[QWH], [AUX], [OTH]} for generating interrogative sentences. Here, V_1 = [QWH] represents the set of wh-words leading a question: what, when, where, which, who, whom, whose, why, how. V_2 = [AUX] represents the set of auxiliary verbs and copula words: do, does, did, be, am, are, is, etc. V_3 = [OTH] contains all other words in the vocabulary. We may use another division for, e.g., generating imperative sentences. Sometimes we need to generate sentences with keywords; in that case we let each keyword form its own category. For example, to generate interrogative sentences with the keyword learning, the division would be V = {[QWH], [AUX], [learning], [OTH]}.

Hard Constraints
Given a sentence with length m, let w_i^[V_j] ∈ {true, false} be an indicator variable stating that the i-th word in the sentence is in category V_j. (As we conduct sampling over sentences, the sentence length is known in advance, and we set m as the length of the longest sentence.) For example, the variable w_1^[QWH] = true if and only if the first word in the sentence is a wh-like word. Sentence-level constraints can then be defined using propositional logic over the w_i^[V_j] (and (∧), or (∨), not (¬)). We give a few examples below.

Enforcing Keywords in a Sentence
Given one keyword K, we can enforce its existence in the sentence using the following constraint:

    w_1^[K] ∨ w_2^[K] ∨ ··· ∨ w_m^[K],

where [K] is a category containing only the keyword K. We formulate this constraint assuming a known sentence length m. Indeed, the length m is a variable and can vary over the sampling procedure. Nevertheless, as we will see shortly, the lengths of both sentences are known when transiting from one sentence to another, so the semantic meaning of m is clear during sampling. Details on the sampling process are in Section 3.2.

Enforcing Imperative Sentence
According to the definition in (Aarts, 1989), the starting word of an imperative sentence should be either a verb, w_1^[VERB], or an adverb followed by a verb, w_1^[ADV] ∧ w_2^[VERB]. We encode this constraint as:

    w_1^[VERB] ∨ (w_1^[ADV] ∧ w_2^[VERB]).

Enforcing Interrogative Sentence
We use the following two constraints to enforce that the sentence is interrogative: (i) the first word is in [QWH]; (ii) the second or third word in the sentence is in [AUX]. Constraints (i, ii) can be written together as:

    w_1^[QWH] ∧ ((w_2^[AUX] ∧ ¬w_3^[AUX]) ∨ (w_3^[AUX] ∧ ¬w_2^[AUX])).

This constraint is similar to the definition in (Zhang et al., 2017). We acknowledge that this is a relaxed constraint. Nevertheless, our sampling approach also considers the score from the language model. These constraints, accompanied by the language model, guide us to good interrogative sentences in practice.
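As a concrete sketch, the keyword and interrogative constraints can be evaluated on a tokenized sentence as follows. The category word lists here are illustrative assumptions, not the paper's full vocabulary division:

```python
# Illustrative word-category sets; the actual [QWH]/[AUX] lists are larger.
QWH = {"what", "when", "where", "which", "who", "whom", "whose", "why", "how"}
AUX = {"do", "does", "did", "be", "am", "are", "is", "was", "were",
       "can", "could", "will", "would"}

def in_cat(words, i, cat):
    """Indicator w_i^[cat]: the i-th word (1-indexed) is in category `cat`."""
    return i <= len(words) and words[i - 1].lower() in cat

def hard_constraints(words, keywords=()):
    """Indicator values c_i(x): the two interrogative constraints plus one
    keyword-occurrence constraint per keyword."""
    c = [in_cat(words, 1, QWH),
         (in_cat(words, 2, AUX) and not in_cat(words, 3, AUX))
         or (in_cat(words, 3, AUX) and not in_cat(words, 2, AUX))]
    for k in keywords:
        c.append(any(w.lower() == k for w in words))  # w_1^[K] v ... v w_m^[K]
    return c

def constraint_error(words, keywords=()):
    """C(x) = M - sum_i c_i(x): the number of violated hard constraints."""
    c = hard_constraints(words, keywords)
    return len(c) - sum(c)
```

Following Equation 3, Φ_hard(x) would then be computed as `beta ** constraint_error(words, keywords)`.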
2.2 Soft Constraints

A soft constraint assigns a float value between 0 and 1 to indicate how well the constraint is satisfied. For tasks with only hard constraints, Φ_soft(x) is set to 1. Soft constraints can be derived quite flexibly: from a user-defined function (see "sentence similarity" for an example), or from a pre-trained neural network (see "sentiment score").

Sentence Similarity
We can define a soft constraint function ensuring that the generated sentence x is close to a reference sentence y in semantic meaning. For each word in sentence x, we first find the closest word in sentence y by computing their cosine similarity. Then either the minimum or the average of these cosine similarities is taken as the similarity score between sentences x and y.

Sentiment Score
We can enforce that the generated sentence has a given sentiment by constraining the score the sentence receives from a sentiment analysis model. The output score of a sentiment analysis neural net represents whether the sentence has a positive or negative sentiment. We use this score as a soft constraint to control whether the generated sentence has a positive or negative attitude. Notice that the sentiment analysis neural net is pre-trained on a separate dataset and remains intact in our framework.

This setup gives us additional flexibility. To be specific, if we need to generate sentences that contain keywords while having a given sentiment, it is difficult to find a large dataset of this type, and the performance of a pure learning approach may be limited. To summarize, the main attribute of the constraint satisfaction framework is that it allows a formulation using both hard and soft constraints, without the need for task-specific training or tuning.
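The sentence-similarity soft constraint above can be sketched as follows. The embedding table `emb` is a stand-in; a real system would plug in pre-trained word embeddings (an assumption here, not the paper's exact setup):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_soft_constraint(x_words, y_words, emb, reduce="avg"):
    """Phi_soft(x) sketch: for each word of x, take its best cosine match in
    the reference sentence y, then aggregate by minimum or average (the two
    variants mentioned in the text). `emb` maps a word to a vector."""
    best = [max(cosine(emb[w], emb[v]) for v in y_words) for w in x_words]
    return min(best) if reduce == "min" else sum(best) / len(best)
```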
3 Tree Search Enhanced MCMC

Markov chain Monte Carlo (MCMC) is a classical approach to sample sentences from the probability distribution π(x) defined in Equation 1. Starting from one sentence x, MCMC moves to the next sentence by first generating a sample x* from the proposal distribution Q(x*|x) and then accepting x* with the following acceptance rate A(x*|x):

    A(x*|x) = min{1, (π(x*) Q(x|x*)) / (π(x) Q(x*|x))}.    (4)

If sentence x* is rejected, the sample remains at x. The distribution of samples converges to the stationary distribution of the Markov chain, π(x). Previous work (Miao et al., 2019) proposes to use MCMC for constrained sentence generation, namely the CGMH algorithm. Its proposal distribution only suggests sentences with one-word modifications. As a result, CGMH cannot handle the combinatorial constraints in our problem definition, because of the low acceptance ratio caused by the locality of the proposal distribution. In other words, the sampling process can only visit a limited number of neighbors, so the Markov chain is easily trapped at one infeasible state, resulting in many rejections. We illustrate this problem in detail, and hence motivate our tree search embedded MCMC approach, using the following example.

3.1 A Motivating Example

Suppose we need to generate a question whose answer comes from an underlined part of a sentence. For example, suppose we underline
France in the sentence:
A: Paris is located in France.
The question we would like to generate is:
B: Which country is Paris located in?
Under our constraint satisfaction framework, we define
Constraint(x) so that real interrogative sentences such as question B receive high probability under the defined π(x). Our constraints are: (i) the whole sentence is in the interrogative form; (ii) Paris and located must appear in the sentence. We run MCMC starting from sentence A.

It is hard for MCMC without tree search to generate question B in a reasonable time starting from A. Because the edit distance between sentences A and B is larger than 1, we cannot generate B from A with one step of word insertion, removal, or replacement. In order for CGMH to reach B from A, it has to pass through a few intermediate steps. Without loss of generality, suppose CGMH proposes sentence C in one MCMC step by removing is:

C: Paris located in France.
Notice that C is not a legitimate English sentence, so its language model score P_LM(C) is much smaller than that of the original sentence A. In addition, C violates more constraints than A, which decreases its Constraint(x) score as well. In MCMC, the probability of accepting the move from A to sentence C is given by Equation 4, in which the dominating term is π(C)/π(A) = (P_LM(C) Constraint(C)) / (P_LM(A) Constraint(A)). Because both P_LM(C) and Constraint(C) are smaller, the acceptance ratio becomes very small. In fact, we found the acceptance ratio to be on the order of 10^-12 in our experiment. This means that it will take CGMH many steps (on the order of 10^12) to move one step from sentence A to C. Figure 2 (left) demonstrates this. It is easy to verify that barriers of low acceptance rate exist on every path from sentence A to B, and thus the rejection problem persists.

Figure 2: Our method, tree search embedded MCMC (TSMH), outperforms CGMH in generating sentences with complex combinatorial constraints. (Left) CGMH must pass through intermediate sentence states, which have a very low acceptance rate, to reach the intermediate sentence "Is Paris located in France?" starting from the sentence "Paris is located in France." This results in the poor performance of CGMH when handling combinatorial constraints. (Right) By embedding a tree search into MCMC, TSMH can reach the intermediate sentence from the starting sentence in one step, with an acceptance rate close to 100%. R, I, D mean replace, insert, delete. See Section 3.1 for a detailed discussion.

On the other hand, if we allow the proposal distribution to suggest sentences with multiple word-level changes, one can transit from sentence A to B through legitimate sentences only as intermediate steps. Consider the following two-step change:

1. First delete is and insert Is before Paris. This changes sentence A to D:

D: Is Paris located in France?
2. Delete France and insert Which and country. This changes sentence D to B.

Because the intermediate sentence D is a legitimate English sentence and Constraint(D) = Constraint(A), π(D)/π(A) is close to 1, resulting in a high acceptance ratio for this step. When changing from D to B, notice that B is also a legitimate sentence and it satisfies more constraints than D; the acceptance ratio for this step is likewise close to 1. Figure 2 (right) demonstrates this case.

For tasks with soft constraints, CGMH suffers from similar rejection problems. For example, "Nothing is impossible" is a sentence with positive sentiment. If we insert, replace, or delete one word, it is hard to keep the sentence valid and preserve the positive sentiment.

Motivated by these examples, we propose to embed a tree search into the proposal process of MCMC to solve the rejection problem; the tree search suggests candidate sentences that make multiple word-level changes and satisfy more constraints.

3.2 Tree Search Embedded MCMC

Our Tree Search enhanced Metropolis-Hastings (TSMH) still follows the classical MCMC procedure. The only difference is a new proposal distribution Q(x*|x) generated by a tree search process. The tree search defines a probability distribution over templates of sentence moves. Each template defines a subset of possible moves, and the sentences within the same template have the same hard constraint score Φ_hard(x). The proposal distribution induced by the tree search algorithm biases towards templates that have high Constraint(x) scores.

A template defines a set of sentences where each word is either given or specified by a word category. For example, a template [[QWH], [AUX], [OTH], [OTH]] restricts that the first word must be a wh-word, the second word must be an auxiliary verb, and the last two words must be other words.

Notice that we can decide how many hard constraints a sentence satisfies at the template level, since the indicator variables in the constraints defined in this paper only restrict the categories of words.
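This template-level evaluation can be sketched as follows, for the interrogative constraints of Section 2.1; the label strings and layout are illustrative assumptions:

```python
def template_constraint_error(template):
    """Template-level constraint check (sketch): a template is a list of word
    category labels, so the interrogative hard constraints can be evaluated
    before any concrete word is filled in. Returns the number of violated
    constraints."""
    def cat(i):  # category at 1-indexed position i, or None past the end
        return template[i - 1] if i <= len(template) else None
    c1 = cat(1) == "[QWH]"                           # first word is a wh-word
    c2 = (cat(2) == "[AUX]") != (cat(3) == "[AUX]")  # exactly one of pos. 2/3
    return 2 - (c1 + c2)
```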
For example, the template [[QWH], [AUX], [OTH], [OTH]] satisfies the constraints of being an interrogative sentence defined in Section 2. Our proposal procedure first samples a template and then fills in this template with words based on a language model.

Overview of the Proposal Process
During the sampling process, suppose we are at sentence x. We sample a new sentence x* from the proposal distribution as follows. First, our algorithm decides the positions of the words to change by random selection; typically, it changes more than one word. Then we use a tree search, which enumerates all possible operations on the selected words. This includes deciding the operation on each word (insert, delete, or replace) as well as the associated word category in the case of insert and replace. Every leaf branch of the search tree is then a sentence template. Because the number of word categories is limited, the tree search procedure is often cheap. As discussed, we can infer the number of hard constraints satisfied from the template associated with each tree leaf. We then rank these templates by the number of constraints satisfied and sample one template based on a geometric series, favoring templates that satisfy more constraints. Finally, we fill in the sampled template with words suggested by a language model, and select one filled sentence x̂ as the proposal according to the language model score times the soft constraint score, P_LM(x̂) · Φ_soft(x̂). Soft constraints Φ_soft(x) give us a real number, similar to the language model P_LM(x), so we treat them together with the language model in the proposal process.

Based on the analysis in Section 3.1, our approach alleviates the rejection problem of MCMC by enumerating all possibilities in the space of multiple word changes at the template level. This enables us to handle combinatorial constraints. The tree search also allows us to prune useless branches.

The procedure for searching proposals in our tree search embedded MCMC is as follows, and is shown in Figure 3.
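The four steps (Position, Search, Rank and Group Selection, Fill and Template Selection) can be sketched end to end as follows. `enumerate_templates`, `constraint_error`, and `fill_with_lm` are hypothetical stand-ins for the components described in the text, not the paper's implementation:

```python
import random

def group_weights(errors, beta):
    """Rank and Group Selection: group i (constraint error C_i) is chosen with
    probability proportional to (1 - beta) * beta ** (C_i - min_j C_j)."""
    cmin = min(errors)
    return [(1 - beta) * beta ** (c - cmin) for c in errors]

def sample_index(weights, u):
    """Pick an index from unnormalized weights with one uniform draw u in [0, 1)."""
    total, acc = sum(weights), 0.0
    for i, w in enumerate(weights):
        acc += w
        if u * total < acc:
            return i
    return len(weights) - 1

def propose(sentence, k, enumerate_templates, constraint_error, fill_with_lm,
            beta=1e-3, rng=random):
    """One TSMH proposal step (sketch). `fill_with_lm` is assumed to return a
    pair (filled_sentence, P_LM * Phi_soft)."""
    positions = rng.sample(range(len(sentence)), k)                # Position
    templates = enumerate_templates(sentence, positions)           # Search
    errors = sorted(set(constraint_error(t) for t in templates))
    gi = sample_index(group_weights(errors, beta), rng.random())   # Rank/Group
    group = [t for t in templates if constraint_error(t) == errors[gi]]
    filled = [fill_with_lm(sentence, t) for t in group]            # Fill
    ti = sample_index([s for _, s in filled], rng.random())        # Template sel.
    return filled[ti][0]
```

With a very small β, `group_weights` concentrates almost all mass on the group with the fewest violated constraints, while keeping every group reachable.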
Position
Randomly select k positions {t_1, ..., t_k} at which to perform word-level operations, with uniform probability, where k is the number of positions changed in one proposal. The probability of getting each combination of positions is P_pos = 1 / C(m, k), where m is the length of the sentence and C(m, k) is the binomial coefficient.

Search
Search and iterate over all different operations and all different word categories (Section 2.1) for each selected position. For example, if we have |V| word categories and the operation set {replace, insert, delete, none}, we need to enumerate (2|V| + 2)^k different combinations of operations and word categories. We use the word placeholder [MASK] to represent an unknown inserted or replaced word. We keep track of all the generated templates and their corresponding numbers of violated constraints.

Figure 3: The proposal process of Tree Search Embedded MCMC. The input is the current sentence (state) and the output is the proposed sentence. This proposal process favors sentences satisfying a large number of constraints.
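The Search step's enumeration can be sketched as follows; the `(position, (operation, category))` layout is illustrative only:

```python
import itertools

OPS_WITH_CATEGORY = ("insert", "replace")  # these carry a [MASK] word category
OPS_WITHOUT = ("delete", "none")

def search_templates(positions, categories):
    """Search step (sketch): enumerate every combination of operation and word
    category at the k selected positions. insert/replace carry a category for
    the new [MASK] word, delete/none do not, giving (2|V| + 2)**k templates."""
    per_position = ([(op, c) for op in OPS_WITH_CATEGORY for c in categories]
                    + [(op, None) for op in OPS_WITHOUT])
    return [tuple(zip(positions, combo))
            for combo in itertools.product(per_position, repeat=len(positions))]
```

For |V| = 3 categories and k = 2 positions this yields (2·3 + 2)² = 64 templates, matching the count given in the text.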
Rank and Group Selection
We define a group as the set of templates that violate the same number of constraints. We sort all templates by their number of violated constraints (constraint error) C in ascending order, and put templates with the same C into one group. We then randomly select group i with probability P_group = (1 − β) · β^(C_i − min_j C_j), where C_i is the constraint error of group i and β is a very small float value. In this way, we favor choosing the group satisfying the largest number of constraints, while also ensuring the irreducibility of the Markov chain. Let the chosen group at this step be G_i.

Fill and Template Selection
In this step we first fill every template in the selected group G_i with words, then select one filled template as the proposal. Because a template restricts each masked word to be chosen only from the corresponding word category, we fill it by selecting words from the given word category. The probability of selecting the word at position t_i, P_fill_i, is the conditional probability of filling in the word at this location given the context: P_LM(x_{t_i} | x_1, ..., x_{t_i−1}, x_{t_i+1}, ..., x_m). The probability of obtaining one sampled sentence is P_fill = Π_{i=1}^k P_fill_i, where i indexes the word-level action at the i-th selected position. If the operation at t_i is delete or none, then P_fill_i = 1. We sample one template within the group (together with the corresponding sampled sentence) according to the sentence probability times the soft constraint score:

    P_template = (P_LM(x*) · Φ_soft(x*)) / (Σ_{x̂ ∈ G_i} P_LM(x̂) · Φ_soft(x̂)).

The proposal distribution Q(x*|x) leading from sentence state x to x* in this procedure is Q(x*|x) = P_pos · P_group · P_fill · P_template.

4 Experiments

We evaluate our approach on three applications: generating interrogative, imperative, and fixed-sentiment sentences. In each task, we construct the specified type of sentences by sampling, starting from keywords and enforcing task-specific constraints. For each task, we run our TSMH algorithm for 100 steps, generating 100 candidate sentences, with k set to 3. Since the tree search in TSMH considers changing 3 words at each iteration, we run the baseline CGMH for 300 steps as a comparison. We select the sentence with the highest π(x) value among the sentences generated by each algorithm as the output. Our results are summarized in Table 1.

In general, our method TSMH outperforms baselines and generates sentences that satisfy more constraints, are of good quality, and are likely to be close to natural language.
In Table 1, Valid% denotes the percentage of generated sentences that satisfy all constraints, π(x) is the value of the stationary probability P_LM(x) · Constraint(x), and P_GPT-2(x) is the language model probability estimated by a pre-trained GPT-2 model, which measures the quality of the sentences. Accept% is the acceptance rate of MCMC. Detailed experiment settings can be found in Appendix A.1.

Interrogative Sentence Generation

For interrogative sentence generation, we construct interrogative sentences by sampling, starting from the keywords. We enforce that sentences with a high probability of being sampled must satisfy the grammar constraints of being interrogative and contain a few given keywords. The constraint definition for interrogative sentences is in Section 2.1.

According to the results, in the experiment with keywords, 92.67% of the output sentences of our TSMH algorithm satisfy all the constraints, while merely 18.33% do for the baseline. The numbers are 83.17% and 45.50% for the experiment without keywords, respectively. This demonstrates that TSMH generates sentences with more constraints satisfied. In addition, our method has a higher π(x) (stationary probability) and acceptance rate, suggesting that the embedded tree search helps MCMC mix faster. Overall, TSMH can handle more complicated constraints in language generation tasks.

Human Evaluation
We conduct a human evaluation of the interrogative sentences generated with keywords. We present human participants from Amazon Mechanical Turk with a pair of sentences at a time: one sentence generated by our TSMH model and the other by the baseline CGMH. We ask participants which sentence is better in terms of fluency and grammar.

In terms of the experimental setting, we use 100 sentence pairs generated by CGMH and TSMH with the same keyword inputs. We randomly split these 100 test sentence pairs into 5 survey groups, deploy them on Amazon Mechanical Turk, and randomly assign participants to survey groups. When showing the sentence pairs, we also provide the keywords that the sentences must contain. We ask participants to vote on which sentence in the pair is better in terms of grammar coherence, keyword coverage, and fluency. We use a gold-standard question to detect whether a voter is answering randomly. Every valid survey contains a randomized set of 20 questions. We received 580 votes in all, with each question pair receiving several votes. As shown in Table 2, sentences from our model received almost twice as many votes as the baseline, which suggests that the sentences generated by our approach are better under human evaluation.

Case Studies
As shown in Table 3, we compare some output sentences of our method with the baseline, using the same inputs and keywords. More examples can be found in Appendix A.2. From these cases, we can see that our method generates sentences of better quality.
Comparison with Other Methods
We compareasks Methods π ( x ) P GPT − ( x ) Accept%Interrogative CGMH 300 1 18.33% 2.60E-04 1.78E-18 5.45%TSMH (Ours) 100 3
Imperative CGMH 300 1 91.32% 0.0004 9.86E-16 5.49%TSMH (Ours) 100 3
Sentiment CGMH 300 1 96.33% 4.93E-19 4.57E-22 6.72%TSMH (Ours) 100 3
Table 1: Our method TSMH outperforms CGMH by generating sentences that satisfy more constraints, are ofgood quality and are likely to be natural language. Column Valid% shows the percentage of generated sentencesthat satisfy all constraints, which TSMH clearly leads baselines. In addition, TSMH has better acceptance rates(Accept%). The language generated by TSMH is also of good quality, because it matches other models in languagemodel scores P GPT − ( x ) . Multiplying both the language model score and the constraint score, the sentencesgenerated by TSMH tend to attain higher stationary probability π ( x ) . Methods
384 66.36 % Table 2: Human evaluation of the quality of the gen-erated interrogative sentences from keywords in termsof fluency and grammar. Most human participants (na-tive speakers) agree that the sentences generated by ourTSMH are better in quality compared to CGMH.
Keys waste heat waterCGMH what waste is there, it seems now?TSMH where was the waste - water heater?Keys responses protect lungsCGMH how can immune responses also occur bynot only infecting pathogens in thecentral nervous system?TSMH what responses do your lungs have to protectyou from pathogenic bacteria?Keys median temperature winterCGMH what do you mean we have median temperaturewinter and spring, anyways?TSMH what is the median temperature range in thewinter months?Keys catholics concentrated franceCGMH the catholics are now mainly concentrated there.TSMH why are the french roman catholics so denselyconcentrated in southern france?
Table 3: Case study of generating interrogative sen-tences with keywords, where Keys stands for keywords.Full case study is in the supplementary materials. our TSMH method with UQA (Lewis et al., 2019).The setting of UQA is different from us: it takes aparagraph as input and generates a correspondingquestion. Although this comparison is not fair, thebaseline is the most similar and the best framework that we can compare with. To run UQA, we usethe corresponding original sentences from whichthe keywords of TSMH are extracted as the input.In other words, for TSMH, the inputs are keywordsextracted from the SQuAD 2.0 (Rajpurkar et al.,2018) questions. For UQA, we take the correspond-ing paragraphs of the selected questions as input.This also gives UQA additional advantage becauseit has access to a paragraph, rather than keywords.To make it more comparable, we remove the key-word constraints in this experiment. In Table 4, wecompare the language model scores log P LM of thegenerated sentences that reflect the naturalness andfluency, and the stationary probability π ( x ) andvalid percentage Valid% that show how good it sat-isfies our pre-defined constraints. We pointed outthat UQA was trained on the specific interrogativesentences while our method was not trained at all.Methods π ( x ) Valid% log P LM UQA 0.0024 50% -92.75TSMH
Table 4: Comparison with UQA. Our TSMH outperforms UQA in terms of the percentage of sentences satisfying the interrogative constraints, and obtains a higher language model score, even though UQA is trained on interrogative sentences while our method is not trained at all.
We generate imperative sentences via sampling, starting from the keywords. We enforce the grammar constraints of an imperative sentence: the starting word should be either a verb (w1[VERB]) or an adverb followed by a verb (w1[ADV] ∧ w2[VERB]). We also enforce keyword constraints in this task. As shown in Table 1, our method has a higher valid percentage of 97.75% compared to 91.32% for the baseline, showing that the sentences generated by our method satisfy more constraints. Our method also has a higher stationary probability π(x) and acceptance rate, suggesting better mixing behavior. Overall, the results show that our Tree Search Embedded MCMC can handle more complicated combinatorial constraints in language generation.

In this task, we require the sentences to contain the specified keywords and to have positive sentiment (Fu et al., 2019). We enforce the sentences to attain high scores from a sentiment analysis neural network, and we enforce the keyword constraints as hard constraints. We emphasize that our method uses a sentiment analysis model pre-trained on a separate dataset, which is kept intact in our experiment; no additional fine-tuning of the sentiment analysis model was performed. We consider two sub-tasks in Table 5: (i) positive sentiment to positive sentiment (P2P), where the input keywords are extracted from sentences that originally have positive sentiment; and (ii) negative sentiment to positive sentiment (N2P), where the keywords are extracted from sentences with negative sentiment. N2P is more difficult, as it requires transforming the sentiment. Our method has a higher sentiment score, suggesting that it generates sentences with more positive sentiment (better aligned with the target of this experiment). The gain over CGMH is larger on the more difficult N2P task, which requires flipping the sentiment. Our model also leads in terms of language model scores, suggesting better language quality.
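The imperative constraint described earlier in this section (the first word is a verb, or an adverb immediately followed by a verb) reduces to a small Boolean check over POS tags. A minimal sketch; the tiny hand-made `TOY_POS` lexicon is purely illustrative, whereas the actual implementation uses real POS taggers (see the appendix):

```python
# Check the imperative-sentence constraint: the sentence must start with a
# verb, or with an adverb immediately followed by a verb.
# The toy POS lexicon below is illustrative only.
TOY_POS = {
    "turn": "VERB", "stay": "VERB", "have": "VERB", "be": "VERB",
    "please": "ADV", "quickly": "ADV",
    "lights": "NOUN", "window": "NOUN",
}

def is_imperative(sentence):
    words = sentence.lower().split()
    tags = [TOY_POS.get(w, "OTHER") for w in words]
    if not tags:
        return False
    # w1[VERB]  or  (w1[ADV] and w2[VERB])
    return tags[0] == "VERB" or (
        tags[0] == "ADV" and len(tags) > 1 and tags[1] == "VERB"
    )

print(is_imperative("turn on the lights"))   # True: starts with a verb
print(is_imperative("please be careful"))    # True: adverb + verb
print(is_imperative("the window is open"))   # False
```

Because the check is a conjunction/disjunction over per-position predicates, it composes naturally with the other hard constraints the tree search enforces during proposal generation.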
Tasks  Method   π(x)    P_GPT-2  Accept%  Senti
P2P    CGMH     9E-19   8E-22    8.16%    0.8647
       TSMH
N2P    CGMH     5E-20   6E-23    5.65%    0.3470
       TSMH
Table 5: Generating sentences with positive sentiment. Half of the input keywords are extracted from positive sentences (P2P), and the other half from negative sentences (N2P), which are harder to transform into positive sentences.
Methods  π(x)      P_GPT-2(x)  Sentiment
CtrlGen  3.19E-07  4.64E-22    0.4614
TSMH
Table 6: Comparison with CtrlGen (Hu et al., 2017) on the N2P subtask in terms of acceptance rate, language score, and sentiment score.
Comparison with Other Methods
We compare our method with CtrlGen (Hu et al., 2017). Its setting is slightly different from ours: it takes a sentence with negative sentiment as input and transforms it into a positive one, without any guarantee of satisfying keyword constraints, whereas our method takes a set of keywords as input. To make the outputs comparable, we select the same set of negative sentences as the input to CtrlGen and extract the keywords of those sentences as the input to TSMH. Our method requires no additional training beyond a pre-trained sentiment analysis model and a pre-trained language model, while CtrlGen requires training an auto-encoder. The results in Table 6 show that our method outperforms CtrlGen in terms of both sentence quality and sentiment: the sentences generated by our method receive higher language model scores and higher sentiment scores.
We propose a framework for constraint-driven language generation via sampling and combinatorial constraint satisfaction. Our strategy is to sample sentences from the constrained space with probability proportional to the scores of the language model. To better handle the combinatorial constraints, a tree search is embedded into the proposal process of MCMC to suggest candidate proposals that satisfy more constraints. Experiments demonstrate that our approach generates sentences that satisfy more constraints, are of good quality, and are close in quality to natural language.
Acknowledgements
This research was supported by the National Science Foundation (award numbers IIS-1850243 and CCF-1918327). The computing infrastructure was partially supported by the Microsoft AI for Earth computing award. The authors would like to thank Mr. Ning Miao for valuable suggestions.

References
Flor Aarts. 1989. Imperative sentences in English: Semantics and pragmatics. Studia Linguistica, 43(2):119–134.

Michael S. Amato and Maryellen C. MacDonald. 2010. Sentence processing in an artificial language: Learning and using combinatorial constraints. Cognition, 116(1):143–148.

David Belanger and Andrew McCallum. 2016. Structured prediction energy networks. In Proceedings of the 33rd International Conference on Machine Learning, pages 983–992.

Mathias Berglund, Tapani Raiko, Mikko Honkala, Leo Kärkkäinen, Akos Vetek, and Juha Karhunen. 2015. Bidirectional recurrent neural networks as generative models. In Advances in Neural Information Processing Systems, pages 856–864.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4171–4186. Association for Computational Linguistics.

Yao Fu, Hao Zhou, Jiaze Chen, and Lei Li. 2019. Rethinking text attribute transfer: A lexical analysis. In Proceedings of the 12th International Conference on Natural Language Generation (INLG).

Vibhav Gogate and Rina Dechter. 2007a. Approximate counting by sampling the backtrack-free search space. In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, pages 198–203.

Vibhav Gogate and Rina Dechter. 2007b. SampleSearch: A scheme that searches for consistent samples. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS), pages 147–154.

Vibhav Gogate and Rina Dechter. 2011. SampleSearch: Importance sampling in presence of determinism. Artificial Intelligence, 175(2):694–729.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1587–1596.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Volume 2: Short Papers, pages 427–431. Association for Computational Linguistics.

Tushar Khot, Niranjan Balasubramanian, Eric Gribkoff, Ashish Sabharwal, Peter Clark, and Oren Etzioni. 2015. Exploring Markov logic networks for question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 685–694.

Jay Yoon Lee, Sanket Vaibhav Mehta, Michael Wick, Jean-Baptiste Tristan, and Jaime G. Carbonell. 2019. Gradient-based inference for networks with output constraints. In The Thirty-Third AAAI Conference on Artificial Intelligence, pages 4147–4154.

Patrick S. H. Lewis, Ludovic Denoyer, and Sebastian Riedel. 2019. Unsupervised question answering by cloze translation. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL), pages 4896–4910. Association for Computational Linguistics.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), System Demonstrations, pages 55–60.

Ning Miao, Yuxuan Song, Hao Zhou, and Lei Li. 2020. Do you have the right scissors? Tailoring pre-trained language models via Monte-Carlo methods. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 3436–3441.

Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and Lei Li. 2019. CGMH: Constrained sentence generation by Metropolis-Hastings sampling. In The Thirty-Third AAAI Conference on Artificial Intelligence, pages 6834–6842.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W. Black. 2018. Style transfer through back-translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 866–876.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pages 784–789.

Matthew Richardson and Pedro M. Domingos. 2006. Markov logic networks. Machine Learning, 62(1-2):107–136.

Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. 2010. Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, 1:1–20.

Jinyue Su, Jiacheng Xu, Xipeng Qiu, and Xuanjing Huang. 2018. Incorporating discriminator in sentence generation: A Gibbs sampling method. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 5496–5503.

Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. Pages 194–197.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771.

Russ Wolfinger and Michael O'Connell. 1993. Generalized linear mixed models: A pseudo-likelihood approach. Journal of Statistical Computation and Simulation, 48(3-4):233–243.

Huangzhao Zhang, Hao Zhou, Ning Miao, and Lei Li. 2019. Generating fluent adversarial examples for natural languages. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL), pages 5564–5569. Association for Computational Linguistics.

Shijie Zhang, Lizhen Qu, Shaodi You, Zhenglu Yang, and Jiawan Zhang. 2017. Automatic generation of grounded visual questions. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), pages 4235–4243.

Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28, pages 649–657.
Appendix
A.1 Detailed Experiment Settings
In this section, we detail our experimental settings for the interrogative, imperative, and sentiment sentence generation tasks, along with the human evaluation process.

In the expression of the stationary distribution in Eq. (1), the first term P_LM(x) is evaluated by the BERT model, based on HuggingFace's BERT implementation (Wolf et al., 2019). We use BERT-base in our experiments, with hyper-parameters L=12, H=768, A=12, and 110M total parameters. To evaluate P_LM(x) with the BERT model, we multiply the conditional probabilities obtained by masking each word of sentence x in turn and querying the model, similar in form to the pseudo-likelihood (Wolfinger and O'Connell, 1993). Since we only require π(x) to be proportional to P_LM(x) times the constraint score, P_LM(x) does not need to be normalized.

A.1.1 Interrogative Sentences Generation
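The pseudo-likelihood scoring of P_LM(x) described in the setup above drives all three tasks, including this one. A minimal sketch; a toy conditional model stands in for BERT here, since the scoring scheme only needs a per-position conditional probability (the stand-in model and its probabilities are illustrative only):

```python
import math

# Pseudo-likelihood scoring: mask each position in turn, query the model for
# the conditional probability of the true word given the rest, and sum logs.
# A toy conditional model stands in for BERT (illustrative numbers only).
def toy_conditional_prob(masked_words, position, true_word):
    # Pretend model: common function words are more predictable in context.
    common = {"the": 0.20, "is": 0.15, "in": 0.10}
    return common.get(true_word, 0.01)

def pseudo_log_likelihood(sentence):
    words = sentence.lower().split()
    total = 0.0
    for i, w in enumerate(words):
        masked = words[:i] + ["[MASK]"] + words[i + 1:]
        total += math.log(toy_conditional_prob(masked, i, w))
    return total

print(round(pseudo_log_likelihood("paris is located in france"), 2))
```

Because the sampler only compares ratios of π(x) between candidates, this unnormalized product suffices; no partition function over all sentences is ever computed.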
According to the adapted definition of interrogative sentence grammar, the first word should be a question word, and there should be an auxiliary verb at a suitable position. The constraint definition for interrogative sentences is given in Section 2.1. In our implementation, we additionally enforce that there is only one question word and one auxiliary verb in the sentence, which improves the quality of the generated sentences. The question words are what, when, where, which, who, whom, whose, why, and how; the auxiliary verbs are do, does, did, be, am, are, is, was, were, shall, will, should, would, can, could, may, might, and must. For the task of generating interrogative sentences with keywords, we also enforce that each keyword appears only once in the sentence. The dataset for this task is based on the SQuAD 2.0 dataset (Rajpurkar et al., 2018): we select 600 questions and remove the stop words using the Rake toolkit (Rose et al., 2010).
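The interrogative constraints above (a question word first, exactly one question word and one auxiliary verb, each keyword appearing once) can be expressed as a single predicate over the word sequence. A minimal sketch; the auxiliary-position check is simplified to "exactly one auxiliary occurs", matching the counting constraints stated above:

```python
QUESTION_WORDS = {"what", "when", "where", "which", "who", "whom",
                  "whose", "why", "how"}
AUX_VERBS = {"do", "does", "did", "be", "am", "are", "is", "was", "were",
             "shall", "will", "should", "would", "can", "could", "may",
             "might", "must"}

def satisfies_interrogative(sentence, keywords=()):
    words = sentence.lower().rstrip("?").split()
    # The first word must be a question word.
    if not words or words[0] not in QUESTION_WORDS:
        return False
    # Exactly one question word and exactly one auxiliary verb.
    if sum(w in QUESTION_WORDS for w in words) != 1:
        return False
    if sum(w in AUX_VERBS for w in words) != 1:
        return False
    # Each keyword must appear exactly once.
    return all(words.count(k) == 1 for k in keywords)

print(satisfies_interrogative("where was the waste water heater?",
                              keywords=("waste", "heater")))   # True
print(satisfies_interrogative("paris is located in france"))   # False
```

In the sampler, candidates violating this predicate receive a heavily discounted constraint score, so the chain is steered toward the valid region rather than rejecting outright.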
A.1.2 Imperative Sentences Generation
The dataset for generating imperative sentences is retrieved from https://github.com/lettergram/sentence-classification. We select 300 sentences and extract the keywords from them as our input. According to the grammar of imperative sentences, we need to verify whether a word is a present-tense verb. In the implementation, we use the POS tag information in WordNet and Stanford CoreNLP as the criterion for deciding a word's POS tag. We first select all words with at least one verb sense in WordNet (Miller, 1995), then use Stanford CoreNLP (Manning et al., 2014) to get POS tags for each word, and only preserve the present-tense forms of verbs.

A.1.3 Sentiment Sentence Generation
This application requires a set of input keywords and an external sentiment classifier, which estimates whether the sentiment of a sentence is positive. To estimate the sentiment scores, we train a sentiment analysis model with fastText (Joulin et al., 2017) on the Yelp Review Polarity dataset (Zhang et al., 2015). The input keywords are extracted from 300 selected sentences in the Yelp test set. Half of the original sentences are positive and the other half are negative (which are harder to transform into positive sentences). Given input keywords of either sentiment, we enforce the model to generate sentences with positive sentiment. The sub-task with negative-sentiment keywords is much more difficult than the one with positive-sentiment keywords, as it requires transforming negative sentiment into positive.
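The classifier's positive-class probability acts as a soft constraint score multiplied into the stationary probability. The sketch below substitutes a tiny logistic cue-word scorer for the trained fastText model so it stays self-contained; the lexicon and weighting are illustrative only, not the model used in the experiments:

```python
import math

# Toy stand-in for the trained fastText sentiment classifier: a logistic
# score over counts of positive/negative cue words (illustrative lexicon).
POSITIVE = {"great", "love", "nice", "friendly", "delicious"}
NEGATIVE = {"bad", "terrible", "rude", "awful", "slow"}

def sentiment_score(sentence):
    words = sentence.lower().split()
    logit = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return 1.0 / (1.0 + math.exp(-logit))  # probability of positive sentiment

# The sampler multiplies this soft score into the stationary probability,
# so candidates with more positive sentiment are favored during sampling.
print(sentiment_score("the food was delicious and the staff friendly"))
print(sentiment_score("the service was slow and rude"))
```

Keeping the classifier fixed (no fine-tuning) means any off-the-shelf scorer exposing a positive-class probability could be plugged in the same way.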
A.2 Case Studies
As shown in Table 7, we compare some output sentences of our method with the baseline, using the same inputs and keywords. From these cases, we can see that the baseline sometimes generates awkward or disordered sentences. For example, the baseline generates the sentence "how was lower normandy ever truly founded?". Although it seems to satisfy the constraints of an interrogative sentence, its meaning is awkward. The sentence generated by our method, "when was the duchy of normandy founded?", is more natural. Also, the baseline sentence "and please be a very very careful" does not follow imperative grammar, and "the catholics are now mainly concentrated there" is not a question.

Keys  university warsaw established
TSMH  when was the technical university of warsaw first formally established?
CGMH  polish polytechnical institute - university of technology warsaw - was established here in 1964?

Keys  organization charge running
TSMH  who would charge her with running such an organization?
CGMH  who else would charge him with running a very profitable business?

Keys  tribes khan fight
TSMH  what tribes would fight back against the genghis khans?
CGMH  why else would tribesmen like gen. and gen. genghis khan fight them off?

Keys  european travel amazon
TSMH  why did early european explorers not travel to amazonia?
CGMH  see below, also : did any european settlers ever travel to build the " first north american sailing canoes "?

Keys  economic growth schooling
TSMH  how do economic growth rates in the united states make children receive high - quality schooling?
CGMH  what good is economic growth in comparison with being among the best in public schooling?

(1) Interrogative Sentences

Keys  seat
TSMH  please get up from your seat
CGMH  go on in and take your seat

Keys  careful
TSMH  please be so very very careful.
CGMH  and please be a very very careful

Keys  turn, lights
TSMH  turn on the lights all the time
CGMH  turn on near all the main lights

Keys  close, window
TSMH  stay close enough to the window
CGMH  stick close enough to meet the window

Keys  nice, weekend
TSMH  have yourself a very nice private weekend
CGMH  please be nice about spending the weekend

(2) Imperative Sentences