Generate and Revise: Reinforcement Learning in Neural Poetry

Andrea Zugarini*
DINFO, University of Florence, Florence 50139, Italy
DIISM, University of Siena, Siena 53100, Italy
[email protected]

Luca Pasqualini
DIISM, University of Siena, Siena 53100, Italy
[email protected]

Stefano Melacci
DIISM, University of Siena, Siena 53100, Italy
[email protected]

Marco Maggini
DIISM, University of Siena, Siena 53100, Italy
[email protected]

* Corresponding author: http://sailab.diism.unisi.it/people/andrea-zugarini/

February 9, 2021

Abstract
Writers, poets, and singers usually do not create their compositions in just one breath. Text is revisited, adjusted, modified, and rephrased, even multiple times, in order to better convey the meanings, emotions and feelings that the author wants to express. Amongst the noble written arts, Poetry is probably the one that needs to be elaborated the most, since the composition has to formally respect predefined meter and rhyming schemes. In this paper, we propose a framework to generate poems that are repeatedly revisited and corrected, as humans do, in order to improve their overall quality. We frame the problem of revising poems in the context of Reinforcement Learning and, in particular, using Proximal Policy Optimization. Our model generates poems from scratch and it learns to progressively adjust the generated text in order to match a target criterion. We evaluate this approach in the case of matching a rhyming scheme, without having any information on which words are responsible for creating rhymes and on how to coherently alter the poem words. The proposed framework is general and, with an appropriate reward shaping, it can be applied to other text generation problems.
1 Introduction

Developing machines that reproduce artistic behaviours and learn to be creative is a long-standing goal of the scientific community in the context of Artificial Intelligence [1, 2]. Recently, several studies have focused on the case of the noble art of Poetry, motivated by the success of Deep Learning approaches to Natural Language Processing (NLP) and, more specifically, to Natural Language Generation [3, 4, 5, 6, 7, 8]. However, existing Machine Learning-based poem generators do not model the natural way poems are created by humans, i.e., poets usually do not create their compositions all in one breath. Usually a poet revisits, rephrases and adjusts a poem many times before reaching a text that perfectly conveys their intended meanings and emotions. In particular, a typical feature of poems is that the composition also has to formally respect predefined meter and rhyming schemes.

With the aim of developing an artificial agent that learns to mimic this behaviour, we design a framework to generate poems that are repeatedly revisited and corrected, in order to improve the overall quality of the poem. We frame this problem as a navigation task approached with Reinforcement Learning (RL), exploiting Proximal Policy Optimization (PPO) [9] that, to the best of our knowledge, is not commonly applied to Natural Language Generation, despite being an improved instance of the more common Vanilla Policy Gradient (VPG). In the task of generating and progressively editing the draft of a poem until it matches a target rhyming scheme, we show that PPO leads to better results than VPG. The agent is not informed about what a rhyme is and how to implement the considered scheme, making the task extremely challenging from an RL perspective. The agent generates a draft poem and it corrects the draft one word at a time. It not only understands that the ending words of each verse are the ones that are important with respect to the rhyming scheme, but also that other words of the poem might need to be adjusted to make the poem coherent with the rhyming words. Despite the application to poetry generation, the proposed framework is general and it can be applied to other text generation problems, provided an opportune reward shaping.

This paper is organized as follows. After discussing related work (Section 1.1), the neural models are described in Section 2, while the RL-based poem revision dynamics is detailed in Section 3. Experiments are reported in Section 4 and, finally, conclusions are drawn in Section 5.
1.1 Related Work

Early methods for Poetry Generation [10] addressed the problem with rule-based techniques, whereas more recent approaches focused on learnable neural language models. The first deep learning solutions tackled Chinese Poetry. In [3], the authors combined convolutional and recurrent networks to generate quatrains. Afterwards, both [5] and [4] proposed a sequence-to-sequence model with attention mechanisms. In the context of English Poetry, transducers were exploited to generate poetic text [6]. The generation structure (meter and rhyme) is learned from characters by cascading a module considering the context with a weighted state transducer. Recently, in Deep-speare [7], the authors generated English quatrains with a combination of three neural models that share the same character-based embeddings. One network is a character-aware language model predicting at word level, another neural model learns the meter, and the last one identifies rhyming pairs. Generated quatrains are finally selected from the output of the three modules after a post-processing step. In [8], the authors focused on a single Italian poet, Dante Alighieri, by making use of a syllable-based language model that was trained with a multi-stage procedure on non-poetic works of the same author and on a large Italian corpus.

Reinforcement Learning has recently been used in several Natural Language Generation applications, such as Text Summarization [11, 12, 13], Machine Translation [14] and Poem Generation [15, 16] as well. However, most of the proposed approaches exploit RL as a means to make common evaluation metrics, such as BLEU and ROUGE scores [17], differentiable. Of course, these metrics can be computed only in those tasks in which the target text (ground truth) is available. In [15] the authors extended Generative Adversarial Networks (GANs) [18] to the generation of sequences of symbols, through Reinforcement Learning. The GAN discriminator is used as a reward signal for an RL-based language generator, and, among a variety of tasks, their framework was applied to Chinese quatrain generation. In [16], a mutual Reinforcement Learning scheme was used to improve the quality of the generated Chinese quatrains. In both works, different generic rewards were designed exploiting the simplest policy-based RL algorithm, i.e. Vanilla Policy Gradient. Surprisingly, Proximal Policy Optimization is less commonly used in the scope of Natural Language Generation, despite leading to a more robust and efficient RL algorithm [19].

Our generate-and-revise framework is related to retrieve-and-edit seq2seq approaches [20, 21, 22, 23, 24], where text generation reduces to an adaptation/paraphrasing of the retrieved template(s) related to the current input. There, the refinement process can be optimized with standard seq2seq learning algorithms because of the presence of revised targets. In our generate-and-revise setting, instead, we neither start from retrieved templates, nor do we have reference revisions. That is why we cast the problem as a navigation task and exploit RL to learn a revision policy that adjusts draft poems in order to improve their quality.
2 Neural Models

Our framework is rooted in the idea that creating a poem is a multi-step process. First, the draft of a new poem is generated. Then, an iterative revision procedure is activated, in which the initial draft is progressively edited. We model this problem by means of a generator, that creates the draft, and a reviser, that edits the draft up to the final version of the poem. The reviser is structured as an iterative procedure that, at each iteration, identifies a word of the poem which does not suit well the context in which it is located, and substitutes it with a better word. At each step the reviser has to decide both which word to replace and with what. A straightforward approach to implement this idea is to design an RL agent that jointly addresses both tasks. Thus, given an m-word poem with vocabulary size |V|, the agent has to choose among a large number of actions, i.e. |V| · m, due to the usually large |V| (in the order of tens of thousands in our experiments). Therefore the problem quickly becomes extremely hard to tackle.

We keep the idea of exploiting an RL-based approach, but we decouple the problem, implementing the reviser with two learnable models, namely the detector and the prompter, each of them responsible for one of the two aforementioned tasks, i.e., detecting a word to substitute (detector) and suggesting how to change a target word (prompter), respectively. The generator, the detector, and the prompter are based on neural architectures, trained from scratch with appropriate criteria, while the detector is fully developed by means of RL. The whole scheme is sketched in Fig. 1. The structure of this module allows us to reduce the action space of the RL procedure to (up to) N words in the poem, making it independent of |V|. The prompter identifies the words in V that are most compatible with the surrounding context.

In the following, the generator (Section 2.1), the detector (Section 2.2) and the prompter (Section 2.3) will be described in detail, whereas the RL dynamics of the detector are presented in Section 3.

2.1 Poem Generator

The poem generation procedure is an instance of Natural Language Generation based on a learnable Language Model (LM). Before considering the specific details of Poetry, we describe the LM used in this work. Let us consider a sequence of tokens (w_1, ..., w_{n+m}) taken from a text corpus in a target language. For convenience in the description, let us divide the tokens into two sequences x and y, where x = (x_1, ..., x_n) and y = (y_1, ..., y_m). The former (x) is the context provided to the text generator from which to start the production of new text. More generally, x is a source of information that conditions the generation of y (x could also be empty). The goal of the LM is to estimate the probability p(y), that is factorized as follows,

p(y) = \prod_{i=1}^{m} p(y_i | y_{<i}, x).    (1)

In the case of Poetry, the generation is further conditioned on the embedding a of the poem author and on the embedding r of the target rhyme scheme, yielding the conditional LM

p(y) = \prod_{i=1}^{m} p(y_i | y_{<i}, x, a, r).    (2)
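To make the factorization above concrete, the following minimal Python sketch (a toy stand-in, not the network used in the paper) accumulates the per-token conditional probabilities returned by an arbitrary next-token model:

    import math

    def sequence_log_prob(tokens, next_token_dist, context=()):
        # log p(y) = sum_i log p(y_i | y_<i, x); `next_token_dist(prefix, context)`
        # is any callable returning a dict of token -> probability (hypothetical toy model)
        logp, prefix = 0.0, []
        for tok in tokens:
            dist = next_token_dist(tuple(prefix), context)
            logp += math.log(dist[tok])
            prefix.append(tok)
        return logp

    # toy usage with a uniform "model" over a 4-word vocabulary
    vocab = ["the", "rose", "is", "red"]
    uniform = lambda prefix, context: {w: 1.0 / len(vocab) for w in vocab}
    print(sequence_log_prob(["the", "rose", "is", "red"], uniform))  # 4 * log(1/4)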
Encoding. The encoder of x computes a contextual representation of each word x_j of the input sequence x (n words), by means of a bidirectional LSTM (bi-LSTM). The output of this module is the set H_x = {h_1, ..., h_n}, being h_j the contextualized representation of the j-th word. In detail, at each time step j, the bi-LSTM is fed with the concatenation of the word embedding w_j ∈ R^d associated to x_j, and u_j ∈ R^r, a character-based representation of x_j. We indicate with \overrightarrow{h}_j, \overleftarrow{h}_j the internal states of the bi-LSTM processing the sequence of augmented word representations,

\overrightarrow{h}_j = \overrightarrow{LSTM}_{enc_x}([w_j, u_j], \overrightarrow{h}_{j-1}),    \overleftarrow{h}_j = \overleftarrow{LSTM}_{enc_x}([w_j, u_j], \overleftarrow{h}_{j+1}),

where \overrightarrow{LSTM}, \overleftarrow{LSTM} are the functions computed by the LSTMs in the two directions. The final representation of the j-th word of the input sequence is h_j = [\overrightarrow{h}_j, \overleftarrow{h}_j]. Overall, the encoder outputs H_x = {h_1, ..., h_n}. The char-based representation u_j is obtained by processing the word characters with another bi-LSTM. We augment w_j with a char-based representation to better encode sub-word information, which is crucial to capture rhyming schemes and meter in the poems.

Decoding. The decoder is responsible for returning the distribution p(y_i | y_{<i}, x, a, r) of (2).
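For reference, a minimal PyTorch sketch of the character-aware bi-LSTM encoder described above; all sizes, and the way the character states are pooled into u_j, are illustrative assumptions rather than the configuration used in the paper, and the decoder is omitted:

    import torch
    import torch.nn as nn

    class CharAwareEncoder(nn.Module):
        """Bi-LSTM over [word embedding, char-based representation] pairs, returning H_x."""
        def __init__(self, vocab_size, n_chars, d_word=64, d_char=16, d_hidden=128):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, d_word)
            self.char_emb = nn.Embedding(n_chars, d_char)
            # character-level bi-LSTM producing u_j for each word
            self.char_lstm = nn.LSTM(d_char, d_char, bidirectional=True, batch_first=True)
            # word-level bi-LSTM producing h_j = [forward state, backward state]
            self.word_lstm = nn.LSTM(d_word + 2 * d_char, d_hidden,
                                     bidirectional=True, batch_first=True)

        def forward(self, word_ids, char_ids):
            # word_ids: (batch, n); char_ids: (batch, n, max_word_len)
            b, n, L = char_ids.shape
            w = self.word_emb(word_ids)                        # (b, n, d_word)
            c = self.char_emb(char_ids.view(b * n, L))         # (b*n, L, d_char)
            _, (c_h, _) = self.char_lstm(c)                    # final states: (2, b*n, d_char)
            u = c_h.transpose(0, 1).reshape(b, n, -1)          # (b, n, 2*d_char)
            H, _ = self.word_lstm(torch.cat([w, u], dim=-1))   # (b, n, 2*d_hidden)
            return H                                           # contextual representations H_x

    enc = CharAwareEncoder(vocab_size=1000, n_chars=60)
    H_x = enc(torch.randint(0, 1000, (2, 8)), torch.randint(0, 60, (2, 8, 12)))
    print(H_x.shape)  # torch.Size([2, 8, 256])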
Poems are generated by sampling from p. As a matter of fact, the sampling strategy plays a crucial role in the quality of the generated text, and it has recently been shown to have a major impact in Natural Language Generation [29]. We preferred nucleus (top-p) sampling over multinomial and top-k sampling to generate quatrains. We indicate with o the sequence of words sampled from p that will constitute a draft poem. The draft poems generated by the model are then revised by the joint work of the detector and prompter modules.

2.2 Detector

Once we have generated a draft poem using the model of Section 2.1, a detection module learns to select the next word of the draft that needs to be revised. The detector is a neural model that yields a probability distribution π(o_i | o, a, r) over the N words of the poem. Of course, in order to detect which words to replace, it is important to take into account the author and rhyme information. In detail, the words of the poem o are encoded by a network that is analogous to the encoder of x in Section 2.1. The word representations, collected in H_o, are processed by an attention mechanism attn_det, building a compact embedding of the whole poem that is also a function of the author and of the rhyme scheme. Then, a Multi-Layer Perceptron (MLP) with softmax activation in the output layer returns the probability over the N words,

π(o_j | o, a, r) = MLP_j(attn_det([a, r], H_o)),    (3)

being MLP_j the j-th output unit. Multinomial sampling applied to π leads to the selection of the word(s) that should be replaced. This module is trained by RL, as we will describe in Section 3.

2.3 Prompter

The role of the prompter module is to provide valid candidates to replace the word previously selected by the detector of Section 2.2. The prompter module solves the problem of modeling language given the left and right contexts of each word, which can be formulated following an approach similar to the one exploited by the conditional LM of (2). Thus, given an author a and a rhyme scheme r, we use a neural model to learn the following distribution from data,

p(o) = \prod_{i=1}^{N} p(o_i | o_{\setminus i}, a, r),    (4)

being o_{\setminus i} the words in the left and right context of o_i. Once p(o) has been learnt, we can sample p(o_i | o_{\setminus i}, a, r) to get one or more candidate words for replacing the selected one.

The prompter network follows the context encoding schemes of [30] and [31]. In particular, the words of the poem o are encoded by a network that computes representations of the left and right contexts around each target word, discarding the target word itself. Differently from the encoding of o in Section 2.2, here the final representation of the j-th word is [\overrightarrow{h}_{j-1}, \overleftarrow{h}_{j+1}]. This representation is concatenated with the author embedding a and the rhyme scheme embedding r, followed by a learnable linear layer with softmax activation that projects the concatenated vector to the space of vocabulary indices. Including a and r in the prompter module is crucial in order to allow the network to learn how to revise a target word as a function of the poet and rhyme scheme. Candidate(s) for replacing the selected word are sampled from p(o_i | o_{\setminus i}, a, r), as discussed for the Poem Generator of Section 2.1. In this case we used top-k sampling (k = 50) to have a large pool of candidates. The prompter is trained to maximize p(o) on a text corpus of poems (Section 4). In our implementation, we used the same LSTMs when encoding data in the detector and in the prompter modules.
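For completeness, a small NumPy sketch of the two sampling strategies mentioned above (nucleus/top-p for the generator, top-k for the prompter); the probability vector is assumed to be the softmax output of the corresponding model, and the p and k values below are only illustrative:

    import numpy as np

    def top_k_sample(probs, k, rng=np.random.default_rng(0)):
        # sample among the k most probable tokens, after renormalization
        idx = np.argsort(probs)[::-1][:k]
        return int(rng.choice(idx, p=probs[idx] / probs[idx].sum()))

    def nucleus_sample(probs, p=0.9, rng=np.random.default_rng(0)):
        # keep the smallest set of tokens whose cumulative mass reaches p, then sample
        order = np.argsort(probs)[::-1]
        cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
        idx = order[:cutoff]
        return int(rng.choice(idx, p=probs[idx] / probs[idx].sum()))

    probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
    print(top_k_sample(probs, k=2), nucleus_sample(probs, p=0.8))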
[Figure 1 about here: a draft quatrain ("In fait I do not love thee with mine heart, / for they in thee a thousand errors note; / but 'tis my heart that loves what they despise; / who in despite of view is pleased to dote;") and its revision, in which the word heart in the first verse is replaced by eyes.]
Figure 1: Overall Generate and Revise scheme on an example poem. The conditional poem generator (light blue module) produces a draft poem, which is iteratively revised by the detector (pale yellow) and prompter (light orange) modules until a certain criterion is satisfied. At each step the detector identifies the word to replace (heart, highlighted in red), while the prompter is responsible for finding the substitute (eyes, highlighted in green).
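The interaction among the three modules in Fig. 1 can be summarized by the high-level sketch below. All the callables (generate_draft, detect, prompt, is_done) are hypothetical stand-ins for the neural generator, detector and prompter of Section 2 and for the stopping criterion, whose RL-based training is discussed in Section 3:

    def generate_and_revise(generate_draft, detect, prompt, is_done,
                            author, rhyme, max_steps=20):
        # 1) generate a draft, 2) repeatedly replace one word at a time until satisfied
        poem = generate_draft(author, rhyme)
        for _ in range(max_steps):
            if is_done(poem, rhyme):                     # e.g., target rhyme scheme matched
                break
            i = detect(poem, author, rhyme)              # position of the word to replace
            poem[i] = prompt(poem, i, author, rhyme)[0]  # best candidate from the prompter
        return poem

    # toy usage with stand-in callables (the real modules are neural networks)
    draft = lambda a, r: ["roses", "are", "blue"]
    detect = lambda p, a, r: 2
    prompt = lambda p, i, a, r: ["red"]
    done = lambda p, r: p[-1] == "red"
    print(generate_and_revise(draft, detect, prompt, done, "unknown", "AABB"))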
3 Revising Poems with Reinforcement Learning

Once the poem generator and the prompter modules have been trained, the task of revising a generated poem consists in detecting which words to change and letting the prompter replace them. If we assume to change one word at a time, we can easily consider this task as a decision process in the space of the dictionary words V. Each decision defines which word to change at a given step, and the prompter replaces it with a suitable candidate. The sequence of decisions is the policy of an agent whose goal is to improve the text, according to a given reward function. Text revision stops when a satisfying score has been reached. This task may be cast as a navigation problem, where the current state of the agent is identified by the sequence of the words in the current text revision. This allows us to reformulate the problem as an RL task where the navigation space is the environment [32], while the decisions are identified by actions executed by the agent in the environment. We provide a brief introduction to Reinforcement Learning in Appendix A.

An RL task can be framed as a sequential decision-making problem in which, at each step t, the agent observes a state S_t ∈ S from the environment, and then selects an action A_t ∈ A. The environment yields a numerical reward R_{t+1} ∈ R and then it moves to the next state S_{t+1}. This interaction gives rise to a trajectory of random variables. In our task, since words are elements of the vocabulary V, S is the space of the poems of length N with words from V for the target author a and with rhyming scheme r, A is the set of indices of the word positions in the poem plus the do-nothing action, while R ⊂ ℝ. To define a reward function we use the shortest path problem formulation. The agent aims at reaching the final text revision in the least amount of steps. Conventionally this means that the reward R_t is defined as a negative number for each state that is not the goal state, and a positive number or zero when the goal state is reached. Formally, if o_t is the poem revision at step t, we have

A_t = \hat{A}_t ∈ \bigcup_{g=1}^{N+1} \{g\},    (5)

S_{t+1} = (o_{t+1}, a, r),    (6)

R_{t+1} = 0 if S_{t+1} = S_f, and R_{t+1} = -1 otherwise,    (7)

where S_f is the goal state, in which the text is not revised anymore.

The natural connection between the modules presented in Section 2 and the RL-based setting is easily established once we redefine (3) as the probability of an action in the state described by the triple (o, a, r), which perfectly suits the definition in (6), yielding a policy function. Using Deep Neural Networks (DNNs) to approximate the RL-related functions, as we do in the case of the probability distribution π over the action space, is a pretty common approach in nowadays RL-based problems (see, e.g., [32]). In the following descriptions, we compactly rewrite (3) adding the symbol θ to refer to the network weights that are learned by means of the RL procedure, i.e., π(·|·; θ). Policy Gradient methods are suitable for navigation tasks, as shown in [33], especially when the state space becomes large [34]. In such spaces, off-policy algorithms (like Q-Learning) are indeed often observed to be unable to converge. In this work, we compare two on-policy RL algorithms: Vanilla Policy Gradient (Section 3.1) and Proximal Policy Optimization (Section 3.2).

3.1 Vanilla Policy Gradient

Vanilla Policy Gradient (VPG) [35] is an on-policy RL algorithm whose aim is to learn a policy without using q-values as a proxy. This is obtained by increasing the probabilities of actions that lead to higher return, and decreasing the probabilities of actions that lead to lower return. Actions are usually sampled from a multinomial distribution for discrete action spaces and from a normal distribution for continuous action spaces. VPG works by updating the policy parameters θ via stochastic gradient ascent on policy performance over a buffer built from a certain number of trajectories,

θ ← θ + α ∇_θ J(π(·|·; θ)),

where J(π(·|·; θ)) denotes the expected finite-horizon undiscounted return of the policy and ∇_θ its gradient with respect to θ (α > 0). In order to compute J(π(·|·; θ)), the algorithm requires to evaluate further actions for each state s in the buffer. In this paper, we use Generalized Advantage Estimation (GAE) [36] to compute such actions, and the obtained rewards are saved and normalized with respect to "when" they are collected (the so-called rewards-to-go). These are solutions reported in the literature to be stable and to improve the overall training performance of the model.
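The following NumPy sketch illustrates, under simplified assumptions (a single episode, externally provided value estimates, illustrative γ and λ), the two quantities mentioned above, rewards-to-go and the Generalized Advantage Estimation of [36]; it is not the exact implementation used in our experiments:

    import numpy as np

    def rewards_to_go(rewards, gamma=1.0):
        # RTG_t = sum_{k >= t} gamma^(k-t) * r_k (undiscounted when gamma = 1)
        out, running = np.zeros(len(rewards)), 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            out[t] = running
        return out

    def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
        # `values` holds one entry per state, including the final/bootstrap state
        deltas = rewards + gamma * values[1:] - values[:-1]
        adv, running = np.zeros(len(rewards)), 0.0
        for t in reversed(range(len(rewards))):
            running = deltas[t] + gamma * lam * running
            adv[t] = running
        return adv

    r = np.array([-1.0, -1.0, 0.0])            # shortest-path style rewards (cf. Eq. (7))
    v = np.array([-2.0, -1.0, -0.5, 0.0])      # hypothetical value estimates
    print(rewards_to_go(r), gae_advantages(r, v))

In a policy-gradient update, the (normalized) rewards-to-go or advantages are then used to weight the log-probabilities of the performed actions in the gradient ascent step on θ.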
3.2 Proximal Policy Optimization

Proximal Policy Optimization (PPO) [9] is another on-policy RL algorithm which improves upon VPG. It is considered the state of the art in policy optimization methods and it is a modified version of Trust Region Policy Optimization (TRPO) [37]. Both methods try to take the biggest possible improvement step on a policy using the currently available data, without stepping "too far" and making the performance collapse. This is done by maximizing a surrogate objective, subject to a constraint on the amount of policy update, where such constraint depends on the KL-divergence between the old policy and the new policy after the update. Specifically, PPO uses a clipped objective to heuristically constrain the KL-divergence,

max_θ E[ min( ρ_t \bar{A}_t, clip(ρ_t, 1 - ε, 1 + ε) · \bar{A}_t ) ],

where ρ_t = π(A_t | S_t; θ) / π(A_t | S_t; θ_old) is a policy ratio, clip(ρ_t, ·, ·) clips ρ_t in the interval defined by the last two arguments, ε is a hyperparameter, \bar{A}_t is the estimated advantage function at time step t, and A_t, S_t are respectively the action and the state at time step t. In the implementation used in this paper, when the parameters θ are updated over a buffer of trajectories, the update process is early stopped if the constraint is not respected, thus preventing the new policy from stepping "too far" from the previous one.
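In PyTorch, the clipped surrogate loss can be written in a few lines; this is a generic sketch with an illustrative ε, not the exact code used in the paper:

    import torch

    def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
        # L = -E[ min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) ], rho = pi_new / pi_old
        ratio = torch.exp(logp_new - logp_old)
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
        return -torch.min(ratio * advantages, clipped * advantages).mean()

    logp_new = torch.tensor([-0.9, -2.1], requires_grad=True)  # log pi_theta(a|s) for two steps
    logp_old = torch.tensor([-1.0, -2.0])                      # log-probs under the old policy
    adv = torch.tensor([1.5, -0.3])                            # estimated advantages
    loss = ppo_clip_loss(logp_new, logp_old, adv)
    loss.backward()
    print(float(loss))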
4 Experiments

We collected poems in the English language from Project Gutenberg, using the GutenTag tool [38] to filter out non-poetic works and collections. We also discarded non-English contents that occasionally appeared in the retrieved documents. Poems are organized in stanzas, according to their XML-based description. Each stanza was then divided into quatrains, if not already in such format, and we assigned a rhyming scheme to each stanza, from a fixed dictionary of rhyming schemes. Rhymes were automatically detected with the Pronouncing library (https://pypi.org/project/pronouncing/) and a few additional heuristic rules to cover most of the undetected rhymes. Long poems without any rhyming pattern were discarded as well. We used the meta-information about the author to define the authorship of the stanza, when available. We considered the most frequent authors, while the rest were marked as unknown. Overall, we obtained a corpus of quatrains, divided into three sets used to train, validate and test the models. We limited the word vocabulary to the most frequent words, using the same embedding size for all the models in all the experiments. A maximum sequence length was fixed for quatrains, and longer verses were truncated.

We define multiple experimental settings and tasks in order to evaluate the quality of each module proposed in this work, up to the entire system that includes all the modules and the full pipeline of generation and iterative revision.

4.1 Poem Generator

While the proposed generator of Section 2.1 follows an established neural architecture, the innovative elements we introduce in this work concern the poem-related conditional features, author and rhyme scheme, and their use in Poetry with character-aware representations. We considered the task of generating a quatrain y given the context sequence x, that is the previous quatrain, where the rhyme scheme is a symbol indicating the rhymes of an eight-verse poem. We considered the most frequent rhyme schemes of size eight. The architecture hyper-parameters were selected by choosing the best configuration on the validation set for the vanilla (i.e., not conditioned on author and rhyme) generator; the same state size was used for the bi-LSTM encoding the context sequence x, for the decoder LSTM_{enc_t}, and for the GRU cell, while dedicated sizes were chosen for the author and rhyme embeddings.

We compared, in terms of generation perplexity, a model trained with or without the newly introduced conditional features, reporting results in Table 1. The conditional features allow the LM to be more accurate, which is an important result considering the open-ended, challenging nature of the poem generation task.
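For reference, perplexity is the exponential of the average per-token negative log-likelihood assigned by the model; a tiny sketch with made-up probabilities:

    import math

    def perplexity(token_probs):
        # PPL = exp( -(1/m) * sum_i log p(y_i | y_<i, x) )
        return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

    print(perplexity([0.1, 0.02, 0.25, 0.05]))  # less probable tokens -> higher perplexity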
4.2 Prompter

A similar analysis was carried out to evaluate the quality of the prompter model of Section 2.3. In particular, we trained a prompter model on single quatrains, enforcing it to learn how to predict a word given its context. Again, the role of the new conditional features is what we are mostly interested in and, observing the results of Table 2, we can see that they improve the suggestion quality. This result is in line with the case of Section 4.1, confirming the importance of further poem-related information.

Table 1: Perplexity measured on the validation (Val) and test (Test) sets of the poem generator, trained with or without conditional features.

                            Val      Test
  Vanilla Generator         52.98    59.78
  Conditional Generator

Table 2: Perplexity measured on the validation (Val) and test (Test) sets of the prompter module, trained with or without conditional features.

                            Val      Test
  Vanilla Prompter          14.09    14.78
  Conditional Prompter
4.3 Detector

In order to show the quality of the detector module, and that approaching text correction as a shortest path problem is feasible, we created "corrupted" poems from real poems in the dataset by replacing one or more words in random positions with words sampled from the entire vocabulary V (frequent words are sampled as replacements more often than rare ones). The agent operates in an environment where each episode starts with a corrupted poem, and it has to learn to reconstruct the original, non-corrupted poem, selecting at each step which word to change. In this artificial setting we assume that, once the agent picks which word to substitute, a perfect prompter (oracle) replaces it with the ground truth, i.e. the word originally positioned there in the real poem. This means that after each agent action, the selected position will either be replaced with the original word, in case of a corrupted word, or nothing will be changed, in case of a correct word. The navigation terminates when the goal state is reached, which occurs when all the corrupted words have been removed from the poem.
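A minimal sketch of the reconstruction environment just described, under simplifying assumptions: the poem is a flat list of words, corruption samples uniformly from a toy vocabulary (in the paper, replacements are sampled from V according to word frequency), and the reward follows the shortest-path formulation of Eq. (7):

    import random

    class PoemReconstructionEnv:
        """Corrupt k words of a reference poem; the agent picks one position per step and
        an oracle prompter restores the original word there. The episode ends (goal state)
        when the poem equals the reference again."""
        def __init__(self, reference, vocab, k=1, seed=0):
            self.reference, self.vocab, self.k = list(reference), list(vocab), k
            self.rng = random.Random(seed)

        def reset(self):
            self.state = list(self.reference)
            for i in self.rng.sample(range(len(self.state)), self.k):
                self.state[i] = self.rng.choice(self.vocab)     # random corruption
            return list(self.state)

        def step(self, position):
            self.state[position] = self.reference[position]     # oracle replacement
            done = self.state == self.reference
            return list(self.state), (0.0 if done else -1.0), done

    reference = "the rose is red".split()
    env = PoemReconstructionEnv(reference, vocab=["moon", "blue", "sky"], k=1)
    state = env.reset()
    corrupted = [i for i, w in enumerate(state) if w != reference[i]]
    print(state, env.step(corrupted[0]))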
The MLP predicting the actions has a single hidden layer. We performed different experiments over this poem reconstruction environment, using a PPO-based agent. Each experiment differs in the number of poems that the agent has to fix, and in the number of words perturbed in the poem. We considered {1, 10, 100} poems that, at the beginning of each episode, are randomly "corrupted" by altering a single word or several (referred to as "multiple") words of the original poem. Please note that even in the simplest case, the experiment with one poem only and a single perturbed word, the number of possible "corrupted" poems is huge, |V| · |x|, where |x| is the poem length.

The PPO-based agent is trained for ten volleys (numbered from 0 to 9) in each experiment, with a number of episodes per volley that depends on the number of poems in the environment. We also set a maximum episode length. Hence, the total reward of an episode is bounded: the best value corresponds to the case in which the agent immediately identifies the "corrupted" word (when there is only one corrupted word), while the worst value indicates a full failure. Results are shown in Table 3. The "Volley 0" column reports the average total reward at the end of the first volley, while the "Volley 9" column reports the same value at the end of the last volley. The reward value improves during the training volleys, while increasing the number of poems makes the task more challenging.

Table 3: Results of the experiments with the PPO-based agent on the poem reconstruction task of Section 4.3. The averaged total rewards after the first volley (R Volley 0) and after the last volley (R Volley 9) are reported.

Table 4: Two examples of draft quatrains and their revisions (agent trained with PPO in the dynamic environment), for the target rhyme schemes AABB and ABBB.

  Rhyme scheme: AABB
    Draft:     the mist that made us sweat and ache / with toil, from doing good or ill, / the hour when we were led to play / the children of the people's brood,
    Revision:  the mist that made us sweat and chill / with toil, from doing good or ill, / the hour when we were led to play / the children of the people's way,

  Rhyme scheme: ABBB
    Draft:     and when, above, the winter's snow / has risen in the wintry sky / away and leaves their path to cloud's decay, / and life is spent, and life is drear,
    Revision:  and when, above, the winter's snow / has risen in the wintry night / away and leaves their path to cloud's decay, / and life is spent, and life is drear today

4.4 Complete System

Now we consider the complete system in which all the modules are active, as in Fig. 1. We focus on the task of generating poems and progressively revising them, in which the agent's goal is to substitute words so that the poem matches a target rhyme scheme. Episodes begin with poems generated by the conditional generator. This task is significantly more challenging than the previously described ones, since there is no ground truth for generated poems, and word replacements are provided by the prompter model described in Section 2.3. We let the model free to change any word in the quatrain, without restricting the agent actions to the words at the end of each verse. Basically, the agent does not know that rhymes are related to the ending words of some verses: the only information it receives is the reward (or penalty) signal that tells whether the poem fulfills the target rhyming scheme or not.
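The goal test, i.e., whether a quatrain matches the target rhyme scheme, can be sketched with the Pronouncing library mentioned in Section 4; the check below is simplified with respect to the heuristics used for building the dataset, and treats two verses as rhyming when their last words share the same CMUdict rhyming part:

    import pronouncing

    def rhyming_part(word):
        # rhyming part of a word according to CMUdict (None if the word is unknown)
        phones = pronouncing.phones_for_word(word.lower())
        return pronouncing.rhyming_part(phones[0]) if phones else None

    def matches_scheme(verses, scheme):
        # verses with the same scheme letter (e.g. "AABB") must end with rhyming words
        endings = [v.strip(" ,;.!?").split()[-1] for v in verses]
        for i in range(len(verses)):
            for j in range(i + 1, len(verses)):
                if scheme[i] == scheme[j]:
                    a, b = rhyming_part(endings[i]), rhyming_part(endings[j])
                    if a is None or a != b:
                        return False
        return True

    quatrain = ["the mist that made us sweat and chill",
                "with toil, from doing good or ill,",
                "the hour when we were led to play",
                "the children of the people's way,"]
    print(matches_scheme(quatrain, "AABB"))  # True for the revised quatrain of Table 4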
We ran several experiments comparing PPO with VPG, varying in each experiment the number of poems to revise in the environment in {10, 100, 200, 500, 1000}. The number of training steps per volley was smaller for the experiment with the fewest poems and was increased in the other experiments. Additionally, we considered another experiment, indicated as dynamic, in the most difficult scenario, i.e., one where the environment spawns new, unseen, artificially generated quatrains at each episode. In such a case we report results of PPO only, because using VPG always resulted in a failure. An episode ends either when the target rhyme scheme is matched or when the maximum episode length is reached; therefore, the reward of an episode is bounded. Differently from our previous work [8], we do not carry out human evaluations, since rhyme matching can be quantitatively measured through the reward. Indeed, the reward is a direct way to assess the quality of the revised poem, because it is proportional to the number of steps needed to adjust the target rhyme scheme. In particular, from Equation (7) we can observe that the number of revising steps in an episode that reaches the goal state S_f is equivalent to |R_f| + 2, being R_f the total reward collected in the episode.

Results are presented in Table 5. We can see that, while the agent improves the reward in all the experiments with PPO, learning with VPG is not stable and performs poorly. The superiority of PPO over VPG is also illustrated in Fig. 2, where we can see the instability of VPG in contrast to the steady progress of PPO. Even if the task is very challenging, the model is able to strongly improve the average reward R, thus indicating that it is actually moving the right steps in progressively fixing the rhymes. We also report in Table 4 two examples of draft revisions obtained with the agent trained with PPO in the dynamic environment.

Table 5: VPG vs PPO: reward on the experiment of Section 4.4 with 10, 100, 200, 500 and 1000 poems. PPO is also evaluated with an environment that continuously generates new drafts (dynamic).

            N poems     R first Volley    R last Volley
  VPG       10          -18.752           -14.630
  PPO       10          -10.239           -1.186
  VPG       100         -19.415           -19.264
  PPO       100         -15.598           -5.200
  VPG       200         -20.950           -18.432
  PPO       200         -15.323           -3.757
  VPG       500         -21.191           -19.623
  PPO       500         -15.043           -7.780
  VPG       1,000       -26.150           -21.179
  PPO       1,000       -11.579           -9.733
  PPO       dynamic     -14.796           -12.415

Figure 2: Rewards yielded by using PPO and VPG with respect to the number of training volleys, in one of the experiments of Section 4.4.
5 Conclusions

In this paper we presented an innovative way of implementing the notion of creativity in a machine. Considering the task of automatically generating new poems, we proposed a model that implements the human-like behaviour of writing a draft and revising it multiple times. We proposed to create drafts that are conditioned on author and rhyme information, while the revision process is built around an iterative procedure that can be described as a navigation problem and solved with Reinforcement Learning with Proximal Policy Optimization, which significantly outperformed Vanilla Policy Gradient. Multiple experiments confirmed that the proposed approach is feasible and that it allows the machine to learn how to revise text, even if it is not explicitly instructed on which portion of the text it should revise. The proposed framework is also general enough to be eventually applied to other text generation tasks, which is what we plan to explore in future work.
References

[1] Margaret A. Boden. "Chapter 9 - Creativity". In: Artificial Intelligence. Ed. by Margaret A. Boden. Handbook of Perception and Cognition. San Diego: Academic Press, 1996, pp. 267-291. ISBN: 978-0-12-161964-0. DOI: 10.1016/B978-012161964-0/50011-X.
[2] Simon Colton, Geraint A. Wiggins, et al. "Computational creativity: The final frontier?" In: European Conference on Artificial Intelligence (ECAI). Vol. 12. Montpellier. 2012, pp. 21-26.
[3] Xingxing Zhang and Mirella Lapata. "Chinese poetry generation with recurrent neural networks". In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014, pp. 670-680.
[4] Qixin Wang et al. "Chinese song iambics generation with neural attention-based model". In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence. AAAI Press. 2016, pp. 2943-2949.
[5] Xiaoyuan Yi, Ruoyu Li, and Maosong Sun. "Generating Chinese classical poems with RNN encoder-decoder". In: Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, 2017, pp. 211-223.
[6] Jack Hopkins and Douwe Kiela. "Automatically generating rhythmic verse with neural networks". In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2017, pp. 168-178.
[7] Jey Han Lau et al. "Deep-speare: A joint neural model of poetic language, meter and rhyme". In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, pp. 1948-1958.
[8] Andrea Zugarini, Stefano Melacci, and Marco Maggini. "Neural Poetry: Learning to Generate Poems Using Syllables". In: International Conference on Artificial Neural Networks. Springer. 2019, pp. 313-325.
[9] John Schulman et al. "Proximal policy optimization algorithms". In: arXiv preprint arXiv:1707.06347 (2017).
[10] Simon Colton, Jacob Goodwin, and Tony Veale. "Full-FACE Poetry Generation". In: ICCC. 2012, pp. 95-102.
[11] Romain Paulus, Caiming Xiong, and Richard Socher. "A deep reinforced model for abstractive summarization". In: arXiv preprint arXiv:1705.04304 (2017).
[12] Yen-Chun Chen and Mohit Bansal. "Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting". In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, pp. 675-686.
[13] Shashi Narayan, Shay B. Cohen, and Mirella Lapata. "Ranking Sentences for Extractive Summarization with Reinforcement Learning". In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018, pp. 1747-1759.
[14] Luisa Bentivogli, Matteo Negri, Marco Turchi, et al. "Machine Translation for Machines: the Sentiment Classification Use Case". In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019, pp. 1368-1374.
[15] Lantao Yu et al. "SeqGAN: Sequence generative adversarial nets with policy gradient". In: Thirty-First AAAI Conference on Artificial Intelligence. 2017.
[16] Xiaoyuan Yi et al. "Automatic Poetry Generation with Mutual Reinforcement Learning". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018, pp. 3143-3153.
[17] Anja Belz and Ehud Reiter. "Comparing automatic and human evaluation of NLG systems". 2006.
[18] Ian Goodfellow et al. "Generative adversarial nets". In: Advances in Neural Information Processing Systems. 2014, pp. 2672-2680.
[19] Yi-Lin Tuan et al. "Proximal Policy Optimization and its Dynamic Version for Sequence Generation". In: arXiv preprint arXiv:1808.07982 (2018).
[20] Kelvin Guu et al. "Generating sentences by editing prototypes". In: Transactions of the Association for Computational Linguistics (2018).
[21] In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, pp. 152-161.
[22] Yuan Li et al. "Hybrid retrieval-generation reinforced agent for medical image report generation". In: Advances in Neural Information Processing Systems. 2018, pp. 1530-1540.
[23] Jason Weston, Emily Dinan, and Alexander H. Miller. "Retrieve and refine: Improved sequence generation models for dialogue". In: arXiv preprint arXiv:1808.04776 (2018).
[24] Nabil Hossain, Marjan Ghazvininejad, and Luke Zettlemoyer. "Simple and effective retrieve-edit-rerank text generation". In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020, pp. 2532-2538.
[25] Yoshua Bengio et al. "A neural probabilistic language model". In: Journal of Machine Learning Research 3 (2003), pp. 1137-1155.
[26] Ari Holtzman et al. "The Curious Case of Neural Text Degeneration". In: arXiv preprint arXiv:1904.09751 (2019).
[27] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate". In: arXiv preprint arXiv:1409.0473 (2014).
[28] Junyoung Chung et al. "Empirical evaluation of gated recurrent neural networks on sequence modeling". In: arXiv preprint arXiv:1412.3555 (2014).
[29] Abigail See et al. "Do Massively Pretrained Language Models Make Better Storytellers?" In: Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL). 2019, pp. 843-861.
[30] Oren Melamud, Jacob Goldberger, and Ido Dagan. "context2vec: Learning generic context embedding with bidirectional LSTM". In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. 2016, pp. 51-61.
[31] Giuseppe Marra et al. "An unsupervised character-aware neural approach to word and context representation learning". In: International Conference on Artificial Neural Networks. Springer. 2018, pp. 126-136.
[32] V. Madhu Babu, U. Vamshi Krishna, and S. K. Shahensha. "An autonomous path finding robot using Q-learning". IEEE. 2016, pp. 1-6.
[33] Matt Knudson and Kagan Tumer. "Policy Search and Policy Gradient Methods for Autonomous Navigation". In: Adaptive and Learning Agents Workshop at AAMAS 2010. 2010.
[34] Luca Pasqualini and Maurizio Parton. "Pseudo Random Number Generation: a Reinforcement Learning approach". In: Procedia Computer Science 170 (2020), pp. 1122-1127.
[35] Richard S. Sutton et al. "Policy gradient methods for reinforcement learning with function approximation". In: Advances in Neural Information Processing Systems. 2000, pp. 1057-1063.
[36] John Schulman et al. "High-dimensional continuous control using generalized advantage estimation". In: arXiv preprint arXiv:1506.02438 (2015).
[37] John Schulman et al. "Trust region policy optimization". In: International Conference on Machine Learning. 2015, pp. 1889-1897.
[38] Julian Brooke, Adam Hammond, and Graeme Hirst. "GutenTag: an NLP-driven tool for digital humanities research in the Project Gutenberg corpus". In: Proceedings of the Fourth Workshop on Computational Linguistics for Literature. 2015, pp. 42-47.
[39] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning series. MIT Press, 2018. ISBN: 9780262039246. URL: https://books.google.it/books?id=6DKPtQEACAAJ.

A Reinforcement Learning
Reinforcement Learning (RL) is learning what to do in order to accumulate as much reward as possible during the course of actions. This very general description, known as the RL problem, can be framed as a sequential decision-making problem as follows.

Let us consider an agent interacting with an environment through a set of possible actions that depends on the current situation, namely the state. An action affects the environment, therefore after each action the state will change. Some states are "better" than others, and the goodness of a state can be numerically quantified with a value, called reward. For a comprehensive introduction to RL, see [39].
The pair "state, reward" may possibly be drawn from a joint probability distribution, called the model or the dynamics of the environment. The agent chooses actions according to a certain strategy, called policy in the RL setting. The RL problem can then be stated as finding a policy maximizing the expected value of the total reward accumulated during the agent-environment interaction.

The RL problem implicitly assumes that the joint probability distribution of S_{t+1}, R_{t+1}, i.e. the state of the environment and the reward obtained at the next time step t + 1, depends on the past only via S_t and A_t, corresponding to the state of the environment and the action executed by the agent at time step t. In fact, the environment is fed only with the last action, and no other data from the history. This means that, for a fixed policy, the corresponding stochastic process {S_t} is Markovian. When the agent experiences a trajectory, or episode, starting at time t, it accumulates a discounted return G_t:

G_t := R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = \sum_{k=0}^{∞} γ^k R_{t+k+1},    γ ∈ [0, 1].

The return G_t is a random variable, whose probability distribution depends not only on the environment dynamics, but also on how the agent chooses actions in a certain state s. Choices of actions are encoded by the policy, i.e. a discrete probability distribution π on A:

π(a | s) := π(a, s) := Pr(A_t = a | S_t = s).

A discount factor γ < 1 is used mainly when rewards far in the future are less and less reliable or important, or in continuing tasks, that is, when the trajectories do not decompose naturally into episodes.

The average return from a state s, that is, the average total reward the agent can accumulate starting from s, represents how good the state s is for the agent following the policy π, and it is called the state-value function:

v_π(s) := E_π[G_t | S_t = s].

Likewise, one can define the action-value function (also known as quality or q-value), encoding how good it is to choose an action a from s and then follow the policy π:

q_π(s, a) := E_π[G_t | S_t = s, A_t = a].

In most problems, like the one at hand, we have only a partial knowledge of the environment dynamics. This can be overcome by sampling trajectories S_t = s, A_t = a, R_{t+1}, S_{t+1}, A_{t+1}, R_{t+2}, .... Policy Gradient (PG) algorithms estimate directly the policy π(a | s; θ) from sampled trajectories, without using a value function. The parameter vector θ_t at time t often parameterizes a neural network, in a Deep Reinforcement Learning (DRL) fashion, and it is modified to maximize a suitable scalar performance function J(θ) with the gradient ascent update rule:

θ_{t+1} := θ_t + α \widehat{∇J(θ_t)}.

Here the learning rate α is the step size of the gradient ascent algorithm, determining how much we are trying to improve the policy at each update, and \widehat{∇J(θ_t)} is any estimate of the performance gradient ∇J(θ).
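As a concrete illustration of these definitions, the short sketch below computes the discounted return of sampled episodes and a Monte Carlo estimate of the state-value v_π(s) as the average return over episodes starting from the same state s (the γ value and the rewards are illustrative):

    def discounted_return(rewards, gamma=0.9):
        # G = R_1 + gamma * R_2 + gamma^2 * R_3 + ...
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    # three hypothetical episodes starting from the same state s
    episodes = [[-1.0, -1.0, 0.0], [-1.0, 0.0], [0.0]]
    v_hat = sum(discounted_return(ep) for ep in episodes) / len(episodes)  # estimate of v_pi(s)
    print(v_hat)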