GraphPlan: Story Generation by Planning with Event Graph
Hong Chen, Raphael Shu, Hiroya Takamura, Hideki Nakayama
The University of Tokyo, Tokyo Institute of Technology, National Institute of Advanced Industrial Science and Technology, Japan
{chen, nakayama}@nlab.ci.i.u-tokyo.ac.jp, [email protected], [email protected]

Abstract
Story generation is a task that aims to automatically produce multiple sentences to make up a meaningful story. This task is challenging because it requires a high-level understanding of the semantic meaning of sentences and the causality of story events. Naive sequence-to-sequence models generally fail to acquire such knowledge, as logical correctness can hardly be guaranteed in a text generation model without strategic planning. In this paper, we focus on planning a sequence of events assisted by event graphs, and use the events to guide the generator. Instead of using a sequence-to-sequence model to output a storyline as in some existing works, we propose to generate an event sequence by walking on an event graph. The event graphs are built automatically based on the corpus. To evaluate the proposed approach, we conduct human evaluation on both event planning and story generation. Based on large-scale human annotation results, our proposed approach is shown to produce more logically correct event sequences and stories.
Introduction

Narrative Intelligence (Mateas and Sengers 2003) is one form of Humanistic Artificial Intelligence that requires the system to organize, comprehend, and reason about narratives and produce meaningful responses. Story generation tasks can be considered as a test bed for examining whether a system develops a good understanding of narratives. Other than leaving the model to output random sentences, the model is usually given a specific topic (e.g., title or prompt) or visual information (e.g., image or video). One straightforward approach for these story generation tasks is to leverage a sequence-to-sequence model to predict sentences sequentially. Although the model can be trained to capture the word-prediction distribution from training data, it has two serious drawbacks when applied to generate stories: 1) the conditional language model (i.e., the decoder) tends to assign high probabilities to generic, repetitive words, especially when beam search is applied in the decoding phase (Holtzman et al. 2019); 2) sequence-to-sequence models often fail to produce logically correct stories.

Recently, much interest has been aroused in decomposing story generation into two phases: planning and generation (Yao et al. 2019; Goldfarb-Tarrant, Feng, and Peng 2019; Xu et al. 2018; Fan, Lewis, and Dauphin 2019). Planning (Meehan 1976; Riedl and Young 2010) creates a high-level abstraction or a blueprint to encourage the generator to focus on the flow of a story, similar to making an outline before writing. The planned elements are referred to as events in many papers. However, the detailed definition of events varies. For instance, an event can be represented as a verb-argument pair (e.g., (admits, subj)) (Chambers and Jurafsky 2008), a tuple of subject, verb, object and modifier or "wildcard" (e.g., (PERSON0, correspond-36.1, empty, PERSON1)) (Martin et al. 2018; Ammanabrolu et al. 2019), or a reconstructed verb phrase (e.g., decide(go)) (Peng and Roth 2016). In this paper, we follow Peng and Roth (2016) to represent an event with verb phrases.

Figure 1: Comparison between the sequence-to-sequence model and GraphPlan (ours). Two problems arise in the sequence-to-sequence model when generating events: repetition and logical inconsistency. Repeated words (e.g., play) in the storyline result in repeated sentences in the generated stories. Besides, the logic between "land" and "snap" lacks causality, thus generating incoherent stories. On the contrary, our GraphPlan method does not rely on any language model; it applies beam search on the event graph based on a well-designed score function. The mutually exclusive set further ensures global logical consistency for the planned events.

Figure 2: Overview of our approach. In the preprocessing step, we cluster the stories into K topics and build an event graph for each topic. In the planning step, an event graph selection module selects an event graph based on the input. Then a related event graph is retrieved. The event planning model generates a sequence of events. Finally, based on the input and the planned events, a story generation module generates the story. The dashed line denotes mutually exclusive events that can hardly coexist in one storyline.

Previous works (Yao et al. 2019) have shown that if the events are well planned, then the correctness of the generated stories is almost guaranteed; furthermore, the stories can be easily controlled by modifying the events. However, existing approaches (Meehan 1976; Goldfarb-Tarrant, Feng, and Peng 2019; Martin et al. 2018; Ammanabrolu et al. 2019) regard event generation as an abstracted version of story generation. In other words, they treat each event as one token and use a sequence-to-sequence model to make a plan of the events. Our preliminary experiments show that repetition and logical inconsistency problems happen in the event sequence, and the same problems occur in the generated stories. Figure 1 shows an example of using sequence-to-sequence in event planning. We can see that both the events and the generated stories are repeated and illogical.

In this paper, instead of leveraging a sequence-to-sequence model for event planning, we propose a planning method, GraphPlan. To plan the events, GraphPlan walks on a topic-specific event graph with beam search. Event graphs were adopted for story generation even before the emergence of neural-based models (Weyhrauch 1997; Chen et al. 2009; Regneri, Koller, and Pinkal 2010; McIntyre and Lapata 2010; Li et al. 2013). An event graph represents the logical flow of events based on the facts presented in a corpus. With a learned event graph, we can walk on it and produce a reasonable event sequence.
We follow the graph setting in Li et al. (2013), in which each graph is composed of event nodes, connections, and a set of mutually exclusive events.

To generate a story, we first identify the topic based on the input (e.g., title or image) and retrieve a related event graph. We then plan the events by running beam search with a score function that takes the event-event coherence and input-event coherence into account. Finally, a story generation module transforms the planned event sequence into a readable story. Figure 1 shows an example of using GraphPlan, and Figure 2 depicts the whole pipeline of our proposed approach.

We conduct experiments on open story generation to evaluate how event graphs benefit the task. Our approach is shown to significantly outperform baseline models that generate events with sequence-to-sequence models in terms of logical consistency. We also conduct the Story Cloze Test to further validate the effectiveness of the event graphs and the mutually exclusive sets. Our contributions can be summarized as follows:
• We propose a score-based beam search approach to plan story events with an event graph.
• Compared to baseline models, our graph-based planning approach results in much better logical correctness in story generation tasks according to the human evaluation.
• Experiments on the Story Cloze Test directly confirm the high accuracy of the proposed event planning approach.
Planning for Story Generation
Several approaches have been explored to plan a skeleton of the story before actual generation. Before the emergence of neural-based models, Reiter and Dale (1997) and Riedl (2010) attempted to use hand-crafted rules to arrange actions into character sequences. Recently, with the help of neural sequence-to-sequence models, Xu et al. (2018) proposed to generate multiple key phrases and expand them into a complete story. A built-in key-phrase generation module is used in their model architecture. In contrast to Xu et al. (2018), some works explicitly plan a sequence of events (Martin et al. 2018; Ammanabrolu et al. 2019; Tambwekar et al. 2019), keywords (Yao et al. 2019; Ippolito et al. 2019) or actions (Fan, Lewis, and Dauphin 2019) before generating the story based on the planned items. All of these planning models rely on a language model for planning without following an external structure of events, which results in degraded performance (Holtzman et al. 2019). Compared with these works, the main contribution of this paper is to propose a planning method based on automatically created event graphs. Instead of a language model, we use score-based beam search to generate a sequence of events by walking on the graph.
Graph-based Story Planning
An event graph is a variant of plot graph whose nodes represent events. A quantity of research has made progress on generating stories from plot graphs (Weyhrauch 1997; Chen et al. 2009; Regneri, Koller, and Pinkal 2010; McIntyre and Lapata 2010; Li et al. 2019). Li et al. (2013) proposed a plot graph for story generation tasks, which is the work most related to ours. They crowd-sourced the story corpus and manually created the plot nodes and edges in the graph. In their graph, mutually exclusive events are not allowed to be present in the same story. In this work, both the event graphs and the mutually exclusive sets are automatically generated. We further propose an event planning method taking into account the relations between events and various inputs (i.e., title or image).
As a preprocessing step, we first extract events automatically from a corpus. Then, we divide the corpus into several topics. Finally, we build an event graph for each topic.
Data-based Event Extraction
Following Peng and Roth (2016), we represent each event with a verb phrase. Unlike other representations, a verb phrase is the minimum unit in one sentence that is abstract, simple, and comprehensible for humans. From our observation, this representation does not suffer from a severe sparseness problem: on average, each event connects to over 3 possible next events. Please note that our work does not investigate event representation; we focus on planning a more logical event sequence. Specifically, as a preprocessing step, we parse all sentences with semantic role labeling and extract the verb phrases. If an extracted verb has an argument with the semantic role "AM-NEG" (negation), we add (not) before it (e.g., (not)take). If a verb is followed by a preposition, we append the prepositional word to the verb (e.g., take(over)). If the label is "AM-PRD" (secondary predicate), we make an event from it (e.g., be(excite)). Finally, if two verbs are close to each other within a five-word distance in the corpus, we combine them to make an event (e.g., decide(buy)). All words in the events are stemmed with NLTK (Bird, Klein, and Loper 2009).
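The extraction rules above can be sketched as follows. The SRL frame format and helper names here are our own assumptions (a real pipeline would consume the output of an off-the-shelf SRL parser), and the crude suffix-stripping stemmer stands in for NLTK's stemmer to keep the sketch dependency-free.

```python
# Sketch of the verb-phrase event extraction rules described above.
# Each SRL frame is assumed to be (verb, {role: word}); this format is
# a simplification, not the paper's actual data structure.

def stem(word):
    # The paper stems with NLTK; this crude suffix stripper is a stand-in.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_event(verb, args, following_word=None):
    """Build one event string from a (hypothetical) SRL frame."""
    event = stem(verb)
    if "AM-NEG" in args:                      # negation -> (not)take
        event = "(not)" + event
    if following_word in {"over", "up", "down", "out", "off"}:
        event = f"{event}({following_word})"  # take + over -> take(over)
    if "AM-PRD" in args:                      # secondary predicate -> be(excite)
        event = f"be({stem(args['AM-PRD'])})"
    return event

def combine_close_verbs(events_with_pos, max_dist=5):
    """Merge two verbs within a five-word window: decide + buy -> decide(buy)."""
    merged, i = [], 0
    while i < len(events_with_pos):
        if (i + 1 < len(events_with_pos)
                and events_with_pos[i + 1][1] - events_with_pos[i][1] <= max_dist):
            merged.append(f"{events_with_pos[i][0]}({events_with_pos[i + 1][0]})")
            i += 2
        else:
            merged.append(events_with_pos[i][0])
            i += 1
    return merged

print(extract_event("take", {}, following_word="over"))
print(combine_close_verbs([("decide", 2), ("buy", 4)]))
```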
Topic Modelling
Generally, a story dataset contains a variety of topics ranging from animals and health to robbery. Here, we use Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003) to infer the topics in the corpus. Considering that the relation between events drastically changes according to the topic, in this work we build an independent event graph for each topic. Formally, we denote by e_{k,1}, ..., e_{k,t} the event set from the stories that belong to the k-th topic T_k in the corpus. These events are used as the nodes for the event graph of T_k. LDA clusters the stories and thus reduces the number of unique events in each graph, which makes the graph walking algorithm more efficient.

Event Connection
After collecting the events from a corpus for each topic, we need to find connections among these events to build a graph. The connections are represented as directed edges whose direction indicates the possible next events. In practice, if events e_i and e_j occur adjacently in the text, we add an edge e_i → e_j. An example of an event graph can be found in Figure 2.

Mutually Exclusive Set
Following the graph setting in Li et al. (2013), there are events (e.g., "die" and "be(happy)") that are mutually exclusive and cannot be placed in one story. These mutually exclusive relations are considered as exceptions and are difficult to represent along with the event graph. We create a held-out set consisting of mutually exclusive event pairs for each graph.

Figure 3: Coherence models used in this paper. The event-event coherence model outputs a coherence score for two events. The input-event coherence model takes a title and an event as input. Both coherence models produce a score within 0 to 1. These coherence scores decide the next event when running beam search.

To identify these mutually exclusive events from the constructed graphs, we prepare an event-event coherence model to compute the coherence score between two events. We prevent two events with low coherence scores from coexisting in the planned events. The model architecture is based on compositional neural networks (Granroth-Wilding and Clark 2016), as shown in Figure 3. The model takes two events (e_i, e_j) represented with unique embeddings and outputs a coherence score normalized with the sigmoid function, f_event(e_i, e_j) ∈ [0, 1]. We use contrastive training to optimize the model. Here, positive examples are events extracted from the same story or title, whereas negative examples are randomly sampled from the events in different stories. Let (e_i, e_j) denote a positive pair of events and ẽ_j denote a randomly sampled event. The training loss for the event-event coherence model is defined as:

L_event = max(0, −f_event(e_i, e_j) + f_event(e_i, ẽ_j) + m)   (1)

where m is a fixed margin. Finally, we consider two events mutually exclusive if their coherence score falls below a certain threshold τ. On average, after taking the mutually exclusive sets into account, each event graph can still plan over one million different possible event sequences.
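The coherence score, the margin loss of Equation (1), and the threshold test can be sketched as below. The toy dot-product "network" and the embedding values are our own stand-ins for the trained compositional neural network; only the formulas match the paper.

```python
import math

# Sketch of the event-event coherence score, the margin loss (Eq. 1),
# and the mutual-exclusion test. The dot product over toy embeddings
# stands in for the trained compositional neural network.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def f_event(emb, e_i, e_j):
    """Coherence score in [0, 1] for an ordered event pair."""
    dot = sum(a * b for a, b in zip(emb[e_i], emb[e_j]))
    return sigmoid(dot)

def margin_loss(emb, e_i, e_j, e_neg, m=0.5):
    """max(0, -f(e_i, e_j) + f(e_i, e_neg) + m): push the positive
    pair's score above the sampled negative's by at least the margin m."""
    return max(0.0, -f_event(emb, e_i, e_j) + f_event(emb, e_i, e_neg) + m)

def mutually_exclusive(emb, e_i, e_j, tau=0.3):
    """Events are treated as mutually exclusive when their coherence
    score falls below the threshold tau."""
    return f_event(emb, e_i, e_j) < tau

# Toy embeddings: "die" and "be(happy)" point in opposite directions.
emb = {"die": [1.0, -1.0], "be(happy)": [-1.0, 1.0], "be(sad)": [1.0, -0.5]}
print(mutually_exclusive(emb, "die", "be(happy)"))
```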
Please refer to the supplementary materials for more statistics of the event graphs. Additionally, these in-topic event graphs can be hierarchically combined into a larger graph if the model is required to generate longer discourse-level stories. This will be a future direction of our work.

In this section, we describe our approach for planned story generation. We separate the whole pipeline into two steps: 1) our GraphPlan walks on the event graph and produces a sequence of events as a blueprint of the story; 2) the story generation module then finalizes the text following the planned events.
By now, each topic has a corresponding event graph. Before story generation, we propose GraphPlan to plan event sequences from the event graph. These planned events will be used to guide the story generation module in the next step. GraphPlan contains two steps: 1) selecting an event graph for the input (i.e., title or image); 2) generating an event sequence by walking on the graph.
Event Graph Selection
Firstly, we identify the topic of the input to retrieve the corresponding event graph. Depending on the task, the inputs can be titles for open story generation, or images for the visual storytelling task. If the input is a piece of text, we directly use the LDA model we trained earlier to identify the topic.
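The graph selection step can be sketched as follows. The topic-word weights here are toy stand-ins for a trained LDA model's word distributions, and the scoring rule (sum of the title's word weights per topic) is our simplification of LDA inference.

```python
# Sketch of event-graph selection: pick the topic (and thus the event
# graph) whose word distribution best explains the input title. The
# weights below are toy stand-ins for a trained LDA model.

topic_word = {
    "health": {"sick": 0.08, "doctor": 0.06, "arm": 0.03, "broken": 0.02},
    "sports": {"run": 0.07, "ball": 0.06, "play": 0.05},
}

def select_event_graph(title, topic_word):
    """Return the topic id maximizing the summed word weight of the title."""
    words = title.lower().split()
    def score(topic):
        return sum(topic_word[topic].get(w, 0.0) for w in words)
    return max(topic_word, key=score)

print(select_event_graph("Broken arm", topic_word))
```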
Event Sequence Generation
Once we identify the topic of the input, we walk on the corresponding event graph to generate a sequence of events. In our experiments, we found that an autoregressive language model tends to produce repetitions, thus resulting in degraded performance. Hence, we propose to use a score-based generation method. The algorithm can be seen as a type of beam search, in which the candidate event sequences are ranked by a score function. Starting from a random event e_0, we progressively search for the next event e_t in the following candidate set:

{ e_t | e_t ∈ Graph(e_{t−1}), e_t ∉ Exclusive(e_0, ..., e_{t−1}) }   (2)

where Graph(·) returns the set of possible next events in the graph and Exclusive(·) returns the set of mutually exclusive events. This filtering step greatly reduces the number of candidate events to consider. To select the event from the candidate set, we rank all remaining candidate events with the following score function:

Score(e_t) = (1 / Σ_{i=0}^{t−1} λ^{t−1−i}) Σ_{i=0}^{t−1} λ^{t−1−i} log f_event(e_i, e_t) + log f_input(x, e_t)   (3)

where the first term sums the event-event coherence scores of the candidate event e_t with each partially generated event e_i, giving more weight to recent events; λ denotes the decay rate, and a decayed average is applied over the scores. The model producing the event-event coherence score is the same model used in detecting mutually exclusive events. The second term is an input-event coherence score f_input(x, e_t), which indicates the coherence between the event e_t and the input x. We propose an input-event coherence model to compute this score; please refer to Figure 3 for the details of its parameterization. For the task of open story generation, the input-event coherence model is implemented with compositional neural networks, where the input x in Equation (3) is the title.
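Equations (2) and (3) combined with beam search can be sketched as below. The event graph, the exclusion set, and the two coherence functions are toy stand-ins; a real system would plug in the trained coherence models and a learned graph.

```python
import math

# Sketch of GraphPlan's score-based beam search (Eqs. 2-3).
# graph: adjacency dict; exclusive: set of frozenset event pairs.

def plan(graph, exclusive, f_event, f_input, x, start, steps=3,
         beam_size=3, lam=0.9):
    beams = [([start], 0.0)]                 # (event sequence, running score)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for e_t in graph.get(seq[-1], []):
                # Eq. 2: drop candidates mutually exclusive with the prefix
                if any(frozenset((e, e_t)) in exclusive for e in seq):
                    continue
                # Eq. 3: decayed average of event-event coherence
                # (recent events weighted more) plus input-event coherence
                w = [lam ** (len(seq) - 1 - i) for i in range(len(seq))]
                ee = sum(wi * math.log(f_event(e, e_t))
                         for wi, e in zip(w, seq)) / sum(w)
                step = ee + math.log(f_input(x, e_t))
                candidates.append((seq + [e_t], score + step))
        if not candidates:                   # every beam hit a dead end
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]

# Toy graph loosely following Figure 1's "Broken arm" example.
graph = {"trip": ["hurt"], "hurt": ["be(painful)", "die"],
         "be(painful)": ["be(alright)", "die"], "die": []}
exclusive = {frozenset(("die", "be(alright)"))}
plan_events = plan(graph, exclusive, lambda a, b: 0.9, lambda x, e: 0.8,
                   "Broken arm", "trip")
print(plan_events)
```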
As a common practice for beam search, we set a budget on the number of candidates to explore (i.e., the beam size). The candidates with the highest scores are maintained in the beam. The final candidate with the highest score is selected as the result of planning.

The generated event sequence is then sent to a story generation module, which converts the events into a story. This story generation module can be any type of model. Recently, large pre-trained language models have shown great capability of generating knowledgeable and informative sentences. Taking advantage of this, the planned events are more likely to be logically connected in the generated story. Therefore, we apply GPT2-small (Radford et al. 2019) as our story generation module. During training, we feed the module with the title words and the events. A special token "<EOT>" separates the title and the events, another special token "<SEP>" is placed in every interval between the events, and an "<|endofinput|>" token is added at the end of the input. Besides, we also train an RNN-based sequence-to-sequence model that is fed with the same inputs for comparison.

However, as stated in Yao et al. (2019) and Tan et al. (2020), the exposure bias problem arises when a plan-and-write strategy is applied. To mitigate this problem, we alternately add two kinds of noise to the inputs: 1) mask 20% of the events with a "[MASK]" token; 2) mask all events. The first noise encourages the model to generate sentences referring to all planned events, while the second promotes the model to generate stories more related to the title. The effectiveness of the two noises is analyzed in the supplementary material.

We design two experiments to explicitly evaluate event quality and story quality. Firstly, we calculate diversity scores and conduct human evaluation on the planned events. Secondly, we use the story generation module to transform the events into full stories and conduct human evaluation to assess the story quality.
Moreover, to further verify the correctness of our GraphPlan, we conduct experiments on the Story Cloze Test. The details of the model implementations for all experiments can be found in the supplementary material.
ROCStories Corpora (Mostafazadeh et al. 2017) consists of 98,162 training stories and 1,874 stories each for validation and testing. Each story contains a title, which we use as the input, and a five-sentence story as the target. Since titles are annotated only for the training data, we split this training set 8:1:1 for training, validation and testing. We applied clustering to the training split (i.e., the 8 of 8:1:1) and obtained 500 topics, in which each topic represents one specific domain. Each story is generated from one specific domain in the following experiments. Gold event sequences that are used in the planning methods are extracted from the stories in the corpus.

Diversity | S2S | S2S(R) | GP | GOLD
Dist-1 | 10.17% | 11.35% | — | 24.92%
Dist-2 | 56.55% | 58.92% | — | 87.75%

Table 1: Diversity of planned events. We can see that both sequence-to-sequence models achieve low diversity, while GraphPlan achieves high diversity.
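The Dist-1/Dist-2 scores reported in Table 1 are the fraction of unique unigrams/bigrams among all planned events, which can be sketched as:

```python
# Distinct-n over planned event sequences: the fraction of unique
# event n-grams among all event n-grams produced by a method.

def distinct_n(event_sequences, n):
    ngrams = []
    for seq in event_sequences:
        ngrams += [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Toy planned sequences (two 4-event plans).
plans = [["be", "be", "go", "marry"], ["be", "go", "see", "marry"]]
print(distinct_n(plans, 1))   # unique unigrams / total unigrams
print(distinct_n(plans, 2))   # unique bigrams / total bigrams
```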
S2S
Following Yao et al. (2019), we use a sequence-to-sequence model (Bahdanau, Cho, and Bengio 2015), which straightforwardly generates events given the input titles.
S2S(R)
To build a more competitive baseline, we adopt reward shaping in the sequence-to-sequence model. Like Tambwekar et al. (2018), we apply a policy gradient:

∇_θ J(θ) = R(e_i) ∇_θ log P(e_i | e_0, ..., e_{i−1}; θ)
R(v) = α × r_1(v) × r_2(v)
r_1(v) = log f_input(x, v)
r_2(v) = ( Σ_{e ∈ E ∧ e ≠ v} log f_event(e, v) ) / (N − 1)   (4)

where e_i denotes an event in the planned sequence, E denotes the events in the story, N denotes the number of events in the story, x denotes the input title, and α denotes the normalization constant across the events in each training sample. During training, the gradient from e_i is multiplied by the reward R(e_i), which is proportional to r_1(e_i) and r_2(e_i). In brief, r_1(e_i) gets larger when e_i is more related to the input x, while r_2(e_i) becomes larger when e_i is more likely to coexist with all events {e | e ∈ E ∧ e ≠ e_i}. This method enforces the model to focus on events that have a high coherence score with the input and the events in each training sample.

GR

In this method, we apply a random walk on the event graphs while considering the mutually exclusive sets. We aim to show the importance of using coherence models by comparing with this method.
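The reward terms of Equation (4) used by the S2S(R) baseline can be sketched as below; the two coherence functions are toy stand-ins for the trained models, and the event/title values are illustrative only.

```python
import math

# Sketch of the reward in Eq. 4: r1 rewards relatedness of event v to
# the input x, r2 rewards coexistence of v with the story's other events.

def reward(v, x, story_events, f_input, f_event, alpha=1.0):
    r1 = math.log(f_input(x, v))
    others = [e for e in story_events if e != v]
    # Average log-coherence of v with the N - 1 other story events.
    r2 = sum(math.log(f_event(e, v)) for e in others) / (len(story_events) - 1)
    return alpha * r1 * r2

R = reward("marry", "Married too fast", ["be", "ask", "marry"],
           lambda x, v: 0.8, lambda a, b: 0.9)
print(R)
```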
GP(Ours)
This is our proposed method, which plans events on an event graph with the mutually exclusive sets and the coherence models.
We plan events on 1000 randomly selected test samples with the different baselines and our proposal. We first test the diversity of the generated sequences. We calculate Distinct-1 and Distinct-2 scores, i.e., the percentage of unique unigram and bigram events among all generated events. Table 1 shows that the sequence-to-sequence models suffer from producing repeated unigram and bigram events. GraphPlan produces more events from the full event set (more unigrams) and more combinations between events (more bigrams).

To further evaluate the quality of the planned events, we conduct human evaluation. Instead of using an overly abstract event representation as in Tambwekar et al. (2019), we use verb phrases, which are more comprehensible for humans. Thus, we request the annotators to compare the event sequences by two criteria: Relevance and Logicality.

Table 2: Human evaluation on event planning, reporting pairwise choices (%) for GP vs. S2S, GP vs. S2S(R) and GP vs. GR on Relevance and Logicality. Cohen's Kappa coefficients (κ) for all annotations are in the moderate agreement range (0.4-0.6). Sign tests further confirm that all pairwise differences are significant.

Table 2 shows the human evaluation results. From the results, we can see that our planned events (i.e., verb phrases) are more related to the input title and can be more easily transformed into a story.

Title: Married too fast
S2S: be → be → be → fall → marry
S2S(R): be → go(up) → ask → say → marry
GR: want(do) → go → sit → wonder → call
GP: feel → decide → begin → start → regret

Title: New glasses
S2S: sit → be(unhappy) → have → go → find
S2S(R): break → need → go → get → be(glad)
GR: wake(up) → be → (not)care → take → make
GP: buy → wear → break → shatter → decide(buy)

Title: The new dress
S2S: want → find → decide(make) → buy → have
S2S(R): skip → go → have → let → be(black) → make
GR: celebrate → wear → look → pick(out) → wear → make
GP: want(dress) → want(change) → dress → wear → feel(beautiful)

Title: Grilled cheese
S2S: love → be → decide → forget → end(up)
S2S(R): make → get → go → go → look → see
GR: feel(comfortable) → like → smile → decide → feel(full)
GP: melt → put → decide(roast) → burn → taste

Table 3: Examples of the planned events.

Table 3 shows some examples generated by the different methods. The results show that sequence-to-sequence models tend to generate repetitive events. Specifically, they tend to output events that occur with high frequency in the corpus, such as "be" (there is sth.) and "go" (sb. goes somewhere). This is common for a model trained under the maximum likelihood estimation framework. Although reward shaping (S2S(R)) helps a lot, the problem is still not eliminated.
Without the constraint of the coherence score, GR walks on the graphs randomly to produce a sequence. As the graph is relatively large, obtaining a good event sequence this way is extremely challenging. We can see that our proposed GP produces more logical and diverse events, from which humans can easily tell a story.
The ultimate goal of event planning is to generate more relevant and logically coherent stories. Human evaluation on the event sequence alone is subjective and tricky, since the events are highly abstract. To prove that better event planning improves the story quality, we generate stories using the planned events and conduct human evaluation to assess the Relevance, Logicality, Interestingness and Overall scores. We use a story generation module (i.e., GPT2 or RNN) to transform the planned events into full stories.

Table 4: Human evaluation on open story generation, reporting pairwise choices (%) for GP vs. GPT2, S2S, S2S(R), GR and GP+RNN. (+GPT2) is omitted for all methods except GP+RNN. We calculate Cohen's Kappa coefficients (κ); they are in the moderate agreement range (0.4-0.6), with a few in the fair agreement range (0.2-0.4). We also conduct sign tests; (*) and (**) denote the two levels of statistical significance. The results show that our method achieves a significant improvement on Logicality, thereby achieving better overall performance.

Title: Fire next door
GPT-2: After the fire, the fire was out. The family who lived next door started to panic, so they followed it. [The firehouse across the street was still burning. The fire house was still burning next door.] (Repetition)
S2S: make → think → go → see → get → apologize. Lisa's new neighbor made a big fire outside his house. Lisa thought she would be sorry. She went outside to check on him. She saw that he was not there. Lisa's neighbor could not get a word with him and apologized.
S2S(R): try → tell → leave → throw → take. Sally tried to fight a burning bush. The homeowner told her she couldn't do it. She left a piece of wood in the bush next door. A woman threw the wood at Sally. Sally took the wood to the fire department.
GR: quit → be → look(buy) → find → go. I quit my job when I was 16. The only thing I ever didn't do was take care of my children. Now I'm looking to buy a car. I found a cheap car on eBay. Now I can go out in the sun every day.
GP+RNN: catch(on) → come → go(off) → evacuate → extinguish. The fire alarm was caught on. The firefighters came to help. The fire alarm went off. Everyone evacuated. The firemen extinguished it before it could go off.
GP: catch(on) → come → go(off) → evacuate → extinguish. The house next door caught on fire. The fire department came to the scene. The fire alarm went off. The entire neighborhood evacuated. The fire department extinguished the fire.
Gold: John woke up smelling like something was burning. He went outside. He saw the fire next door. He called the authorities. The firemen came to put out the fire.

Table 5: Examples of open story generation. The red words represent the events.

We compare the following methods in this experiment.
GPT2. A large-scale language model that has shown great performance in generating stories in recent research. In this method, we directly input the title to GPT2 and generate the full stories.

*+GPT2. We pair each of the aforementioned event planning methods with GPT2 as the story generation module. Thus, we compare S2S+GPT2, S2S(R)+GPT2, GR+GPT2 and GP+GPT2.
GP+RNN. In this method, we use an RNN-based sequence-to-sequence model to generate the full story, taking the title and the events as inputs. We compare this method to GP+GPT2 to show the effectiveness of large-scale language models in transforming events into stories.

We conduct human evaluation on Amazon Mechanical Turk (AMT) over four aspects:
Relevance (whether the story is related to the topic),
Interestingness (whether the story content and style are interesting),
Logicality (whether the story is logical), and
Overall (overall quality). The full details of the human evaluation are listed in the supplementary materials. We randomly sample 300 titles from the testing set and generate the stories via each method. A pairwise comparison is conducted for each criterion, and each sample is assigned to two different workers to avoid randomness or personal bias. Table 4 shows that our approach performs better in Logicality and Overall. In particular, our method greatly outperforms the other planning methods on the Logicality measure, which suggests that our planned events are logically sound. We believe two factors are the primary reasons for the improved logic: 1) each event graph is built from the corpus, so walking on the graph preserves the events' logical relations; and 2) the coherence models filter the candidates, and the mutually exclusive set further eliminates non-logical combinations when planning the events. Table 5 shows an example of stories generated by these methods. We show both the planned events and the stories. Directly using GPT2 produces repeated sentences. Both equipped with an autoregressive model for event planning, S2S and S2S(R) fail to output satisfactory events and thus obtain low logicality scores on the generated sentences. Since there is no restriction on event selection in GR, the produced events can be irrelevant to the title and even mutually contradictory. By using our proposed method, GP can plan a reasonable set of events and thus generate the most logical story.
To better validate the effectiveness of our event graphs and the mutually exclusive relations between events, we conduct the Story Cloze Test. This task aims to select the right ending sentence from two candidates. We incorporate the event feature generated by different methods into the Story Cloze Test.

Acc(%) | Test v1.0 | Test v1.5
DN | 77.60 | 64.45
DN+Origin | 78.87 | 67.64
DN+RandomWalk | 79.36 | 68.09
DN+GraphPlan | — | —

Table 6: Results on the Story Cloze Test. DN denotes DiffNet. From the results, events planned by our event graphs and mutually exclusive sets have positive effects on this task.

The accuracy of the Story Cloze Test reflects the quality of the event features. The event feature is learned by a masked language model (MLM) (i.e., a BERT model with fewer parameters) (Devlin et al. 2018): if the training event sequence is more logical and reasonable, the feature learned by the MLM will better fit the Story Cloze Test. To prove that our event graphs and mutually exclusive relations can help us to generate reasonable event sequences, we compare the features generated by the MLM model with different training data: (1)
Origin : event sequences extracted fromROCStories Corpora. (2)
RandomWalk : Randomly walkon the event graphs and sample training data. (3)
Graph-Plan : Use our planning method to generate training data.Note that the input-event coherence score is excluded in thescore function due to no input being given. We then use theevent feature and adopt the state-of-the-art model DiffNet(Cui et al. 2019) for the story cloze test. For fair compari-son, RandomWalk and GraphPlan sample the same amountof event chains in the dataset during training. Further detailsof the model could be found in the supplementary material.Results of Story Cloze Test are presented in Table 6.It shows that RandomWalk and GraphPlan achieves betterscores in both SCTv1.0 and v1.5, which prove that our eventgraphs and mutually exclusive events have positive effectson event planning.
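The masked-event training described above is analogous to BERT's masked language modeling, but over event tokens such as "feel(sick)" rather than word pieces. Below is a minimal sketch of how such training pairs might be constructed; the function name, masking rate, and event strings are illustrative assumptions, and the paper's actual MLM is a reduced BERT.

```python
import random

def make_mlm_examples(event_chain, mask_token="[MASK]", p=0.15, rng=None):
    """Create one masked-event training pair from an event chain.

    Returns (masked_chain, labels), where labels hold the original
    event at masked positions and None elsewhere, so the model is
    trained to recover only the masked events.
    """
    rng = rng or random.Random(0)  # fixed seed for a reproducible sketch
    masked, labels = [], []
    for event in event_chain:
        if rng.random() < p:
            masked.append(mask_token)
            labels.append(event)
        else:
            masked.append(event)
            labels.append(None)
    return masked, labels

chain = ["cough", "be(sick)", "feel(sick)", "vomit", "diagnose"]
masked, labels = make_mlm_examples(chain, p=0.5)
```

A model trained on more logical chains should assign higher likelihood to endings whose events fit the preceding context, which is what the cloze accuracy measures indirectly.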
To further verify the effectiveness of our proposed method, we also conduct experiments on visual storytelling tasks. Due to the page limit, we report them in the supplementary material; there, GraphPlan also shows improvement in terms of logicality.
Controllable Generation
As mentioned before, the stories can be easily controlled by modifying the events. Table 7 shows an example: selecting different upcoming events for "feel(sick)" changes the subsequent storyline.
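This kind of control can be pictured as choosing among the successors of the current event in the graph while respecting mutual-exclusion constraints. The sketch below is a toy version: the edges and the mutually exclusive pair mirror the "feel(sick)" example in Table 7, but the dictionary encoding and function names are our illustrative assumptions, not the paper's implementation.

```python
# Toy event graph mirroring the Table 7 example.
GRAPH = {
    "cough": ["be(sick)"],
    "be(sick)": ["feel(sick)"],
    "feel(sick)": ["vomit", "try(rest)", "eat"],
    "vomit": ["diagnose"],
    "try(rest)": ["feel(not)"],
    "eat": ["starve"],
}
MUTUALLY_EXCLUSIVE = {frozenset({"cough", "starve"})}

def extend(chain, choice):
    """Extend a planned event chain with a chosen successor, rejecting
    candidates that do not follow the last event or that violate a
    mutual-exclusion constraint with any earlier event."""
    if choice not in GRAPH.get(chain[-1], []):
        raise ValueError(f"{choice!r} does not follow {chain[-1]!r}")
    for prev in chain:
        if frozenset({prev, choice}) in MUTUALLY_EXCLUSIVE:
            raise ValueError(f"{choice!r} is mutually exclusive with {prev!r}")
    return chain + [choice]

plan = ["cough", "be(sick)", "feel(sick)"]
ending_1 = extend(extend(plan, "vomit"), "diagnose")        # storyline (1)
ending_2 = extend(extend(plan, "try(rest)"), "feel(not)")   # storyline (2)
```

Attempting the third storyline, `extend(extend(plan, "eat"), "starve")`, raises an error here because "cough" and "starve" form a mutually exclusive pair, which matches the inconsistency the paper points out for chain (3).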
Why GraphPlan Works?
Our experiments show that event graphs can produce more logical stories than planning with language models. Here we give an empirical explanation. Sequence-to-sequence models usually fail to capture long-term relations and order information in the event sequence: the decoder is not guaranteed to take all previous events into account during decoding. In contrast, our approach applies event-event coherence scores, which forces the model to consider long-term relations during planning. In addition, the order of events is captured from the gold cases, which is guaranteed by our event graphs. Moreover, mutually exclusive sets let us decide whether two events can co-occur in one sequence, no matter how distant the two events are. Table 7 gives an example: "cough" and "starve" are considered mutually exclusive events in our event graph, so if we generate a story based on event chain (3), the last sentence "he starved himself" is not reasonable.

[Event chain: cough → be(sick) → feel(sick), continued by (1) vomit → diagnose, (2) try(rest) → feel(not), or (3) eat → starve; "cough" and "starve" form a mutually exclusive pair.]

(1) The man was coughing a lot. He was sick. He felt sick for days. He vomited on the couch. He was later diagnosed with the flu.

(2) The man was coughing a lot. He was sick. He felt sick. He tried to rest for an hour. The man felt better!

(3) The man was coughing a lot. He was sick. He felt sick. He couldn't eat anything. He starved himself.

Table 7: Example of controllable generation. (1) and (2) extend different events after "feel(sick)" to achieve different endings, and (3) shows the logical inconsistency that arises when generating with two mutually exclusive events.

The findings in this work also open up new research questions: 1) better event definitions; 2) the exposure bias issue in the story generation module; 3) better topic modeling methods. These are left for future work.
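The planning mechanism discussed in this section, scoring each candidate event against all previously planned events rather than only the last one, and keeping the best partial plans with beam search, can be sketched as follows. The function names, toy graph, and score table are our illustrative assumptions; the paper's actual score function also includes further terms such as input-event coherence.

```python
def plan_events(graph, coherence, start, length, beam_size=2):
    """Beam search over an event graph.

    Each candidate successor is scored by its summed coherence with
    *all* previously planned events (not just the last one), which is
    how long-range relations enter the plan.  `coherence(a, b)` is
    assumed to return a learned event-event coherence score.
    """
    beams = [([start], 0.0)]
    for _ in range(length - 1):
        candidates = []
        for chain, score in beams:
            for nxt in graph.get(chain[-1], []):
                gain = sum(coherence(prev, nxt) for prev in chain)
                candidates.append((chain + [nxt], score + gain))
        if not candidates:
            break  # dead end: no outgoing edges from any beam
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]

# Toy usage (graph and scores are illustrative):
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
coh_table = {("a", "b"): 1.0, ("a", "c"): 0.5}
toy_coherence = lambda x, y: coh_table.get((x, y), 0.1)
best = plan_events(toy_graph, toy_coherence, "a", length=3)
```

A mutual-exclusion check could be added inside the candidate loop, rejecting any `nxt` that conflicts with an event already in `chain`, which is distance-independent by construction.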
In this paper, we show that a graph-based event planning approach can indeed produce more natural event sequences than conventional language models. We propose to walk on automatically learned event graphs by performing beam search with a score function dedicated to event planning; the story is then generated following the planned events. We evaluate our approach on event planning and open story generation with large-scale human judgments. Results show that our proposed approach clearly outperforms the non-planning baseline and the sequence-to-sequence planning models. In the human evaluation, the events and stories generated by our approach are judged to be more logical and coherent. An additional experiment on the Story Cloze Test further demonstrates the advantages of event graphs and mutually exclusive sets.
References
Ammanabrolu, P.; Tien, E.; Cheung, W.; Luo, Z.; Ma, W.; Martin, L. J.; and Riedl, M. O. 2019. Story Realization: Expanding Plot Events into Sentences. arXiv preprint arXiv:1909.03480.

Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.

Bird, S.; Klein, E.; and Loper, E. 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc.

Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research.

Proceedings of ACL-08: HLT, 789–797.

Chen, S.; Nelson, M. J.; Sullivan, A.; and Mateas, M. 2009. Evaluating the Authorial Leverage of Drama Management. In AAAI Spring Symposium: Intelligent Narrative Technologies II, 20–23.

Cui, Y.; Che, W.; Zhang, W.-N.; Liu, T.; Wang, S.; and Hu, G. 2019. Discriminative Sentence Modeling for Story Ending Prediction. arXiv preprint arXiv:1912.09008.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Fan, A.; Lewis, M.; and Dauphin, Y. 2019. Strategies for structuring story generation. arXiv preprint arXiv:1902.01109.

Goldfarb-Tarrant, S.; Feng, H.; and Peng, N. 2019. Plan, Write, and Revise: an Interactive System for Open-Domain Story Generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), 89–97.

Granroth-Wilding, M.; and Clark, S. 2016. What Happens Next? Event Prediction Using a Compositional Neural Network Model. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, 2727–2733. AAAI Press.

Holtzman, A.; Buys, J.; Du, L.; Forbes, M.; and Choi, Y. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.

Ippolito, D.; Grangier, D.; Callison-Burch, C.; and Eck, D. 2019. Unsupervised Hierarchical Story Infilling. In Proceedings of the First Workshop on Narrative Understanding, 37–43. Minneapolis, Minnesota: Association for Computational Linguistics. doi:10.18653/v1/W19-2405.

Li, B.; Lee-Urban, S.; Johnston, G.; and Riedl, M. 2013. Story generation with crowdsourced plot graphs. In Twenty-Seventh AAAI Conference on Artificial Intelligence.

Li, W.; Xu, J.; He, Y.; Yan, S.; Wu, Y.; and Sun, X. 2019. Coherent Comments Generation for Chinese Articles with a Graph-to-Sequence Model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 4843–4852. Florence, Italy: Association for Computational Linguistics. doi:10.18653/v1/P19-1479.

Martin, L. J.; Ammanabrolu, P.; Wang, X.; Hancock, W.; Singh, S.; Harrison, B.; and Riedl, M. O. 2018. Event representations for automated story generation with deep neural nets. In Thirty-Second AAAI Conference on Artificial Intelligence.

Mateas, M.; and Sengers, P. 2003. Narrative intelligence. J. Benjamins Pub.

McIntyre, N.; and Lapata, M. 2010. Plot induction and evolutionary search for story generation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 1562–1572. Association for Computational Linguistics.

Meehan, J. R. 1976. The metanovel: writing stories by computer. Technical report, Yale University, Department of Computer Science.

Mostafazadeh, N.; Roth, M.; Louis, A.; Chambers, N.; and Allen, J. 2017. LSDSem 2017 shared task: The story cloze test. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, 46–51.

Peng, H.; and Roth, D. 2016. Two Discourse Driven Language Models for Semantics. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 290–300. Berlin, Germany: Association for Computational Linguistics. doi:10.18653/v1/P16-1028.

Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI Blog.

Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 979–988.

Reiter, E.; and Dale, R. 1997. Building applied natural language generation systems. Natural Language Engineering.

Minds and Machines.

Journal of Artificial Intelligence Research 39: 217–268.

Tambwekar, P.; Dhuliawala, M.; Martin, L. J.; Mehta, A.; Harrison, B.; and Riedl, M. O. 2018. Controllable Neural Story Plot Generation via Reinforcement Learning. arXiv preprint arXiv:1809.10736.

Tambwekar, P.; Dhuliawala, M.; Martin, L. J.; Mehta, A.; Harrison, B.; and Riedl, M. O. 2019. Controllable neural story plot generation via reward shaping. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, 5982–5988. AAAI Press.

Tan, B.; Yang, Z.; AI-Shedivat, M.; Xing, E. P.; and Hu, Z. 2020. Progressive Generation of Long Text. arXiv preprint arXiv:2006.15720.

Weyhrauch, P. 1997. Guiding interactive fiction. Ph.D. thesis, Carnegie Mellon University.

Xu, J.; Ren, X.; Zhang, Y.; Zeng, Q.; Cai, X.; and Sun, X. 2018. A Skeleton-Based Model for Promoting Coherence Among Sentences in Narrative Story Generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 4306–4315.

Yao, L.; Peng, N.; Weischedel, R.; Knight, K.; Zhao, D.; and Yan, R. 2019. Plan-and-write: Towards better automatic storytelling. In Proceedings of the AAAI Conference on Artificial Intelligence.