Automatic Story Generation: Challenges and Attempts
Amal Alabdulkarim∗, Siyan Li∗, Xiangyu Peng∗
Georgia Institute of Technology, Atlanta, GA 30332
{amal, lisiyansylvia, xpeng62}@gatech.edu
Storytelling is central to human communication. People use stories to communicate effectively with one another. As humans, we engage with well-told stories and comprehend more information from stories (Suzuki et al., 2018). However, when it comes to automatic storytelling, computers still have a long way to go. The field of automated story generation, or computational narrative, has received more attention because of recent technological advances. Computational narrative matters because it can improve human interaction with intelligent systems: storytelling helps computers communicate with humans (Riedl, 2016), and automated story generation drives improvements in natural language processing. Computational narrative research involves story understanding, story representation, and story generation. In this survey, we focus on the story generation capabilities of computational systems.

Many surveys have been written on different facets of computational storytelling. Gervás (2009) provides a chronological summary of storytelling systems focusing on computational creativity, measured using metrics including the stories' novelty and the users' involvement in the storytelling process. Riedl and Bulitko (2013) focus on interactive intelligence, a digital interactive storytelling experience where users interact with the computational system to build storylines; that survey touches on generating narrative structures and character building. Riedl (2016) discusses human-centered computational narrative and how it can improve artificial intelligence applications. The paper sheds some light on machine learning challenges concerned with story generation and commonsense reasoning, but does not go into these challenges in depth, as they are not its primary focus.

Past survey papers focused primarily on story generation using specific approaches or on specific sub-problems in story generation.
For example, Kybartas and Bidarra (2017) summarize progress in the areas of plot and space generation without much discussion of neural language models. Hou et al. (2019) examine different deep learning models used in story generation and categorize them by their goals. However, there is still motivation to organize a survey in a different manner. The process of automatically generating a logically coherent and interesting narrative is complex. It is therefore more useful to detail the major problems present in the field and the techniques used to address them than to summarize different types of models. For people who are new to the field, our survey should serve as a solid starting point for conducting innovative research.

Some of the existing surveys, albeit comprehensive, do not include the latest developments in story generation driven by transformers. Riedl and Bulitko (2013) chronicle interactive narrative prior to 2013, yet the discussed approaches do not include the large-scale neural language models that we now have access to and that have been fueling new research in the field. Another example is the paper by Gervás (2009), where the author comments on storytelling systems and different evaluation criteria for creativity; similarly, all of the systems discussed consist of planning, with no neural approaches.

∗ Equal contributions. Preprint.

We acknowledge that more survey papers exist with different areas of focus within the domain of computational narratives, such as narrative theories (Cavazza and Pizzi, 2006), interactive intelligence (Luo et al., 2015), drama management (Roberts and Isbell, 2008), and plan-based story generation (Young et al., 2013).

As demonstrated above, the field of automated story generation has a gap in up-to-date survey papers.
Our paper fills this gap by laying out the prominent research problems in story generation and the previous literature addressing them. The scope of this survey is to explore the challenges in automatic story generation. We hope to contribute in the following ways:

1. Explore how previous research in story generation addressed those challenges.
2. Discuss future research directions and new technologies that may aid further advancements.
3. Shed light on emerging and often overlooked challenges such as creativity and discourse.

There are several background concepts crucial to understanding the problem of story generation. Automated story generation is the use of computer systems to create written stories, often involving artificial intelligence (AI). Story generation requires story understanding and representation, which are usually handled by natural language processing. Hence, the first concentration in this paper is content encoding and comprehension. A system is conventionally defined as capable of story comprehension if, given a textual story, it can read and answer questions about it (Lehnert et al., 1983; Reeves, 1991). Recently, state-of-the-art neural text generation models, such as GPT-2 (Radford et al., 2019), have been used to generate stories. These models are trained on the WebText corpus, a collection of texts scraped from the internet. Hence, the key challenge of applying these language models to story generation is to ensure that the generated story remains on topic and maintains entity and event consistency. In our paper, we consider the following two concepts as crucial starting points: controllability, having human inputs influence the generation results (Section 2.1), and commonsense, narrative systems with pre-existing knowledge that helps generate coherent stories (Section 2.2).
2.1 Controllability

The controllability problem in story generation concerns the user input's ability to influence the generation results. Such influence often takes the form of a plot the user wishes the system to adhere to when producing a new narrative. Controlling story generation is a significant challenge that has gained more attention in the last few years due to the limitations of neural-based story generation approaches. Most modern story generators use neural techniques that need little to no manual modeling to generate stories. Neural models solve the lack-of-novelty issues found in symbolic systems thanks to their unstructured generation, yet this advance comes at the cost of less controllability and plot coherence. In this section, we shed light on a few approaches to the problem of controllability, discuss their strengths and weaknesses, and compare their methodologies.
Reinforcement Learning. Tambwekar et al. (2019) aimed at controlling the story plot by controlling its ending and the order of events. They proposed a deep reinforcement learning approach to controlled story generation, with a reward-shaping technique to optimize the pre-trained sequence-to-sequence model of Martin et al. (2017). Their reward function encompasses two main parts: the distance to the goal verb and the story-verb frequency. The distance to the goal verb measures how many lines separate a generated verb from the goal verb in training stories, while the story-verb frequency counts the stories containing both the goal verb and the generated verb. They evaluated their model on plot coherence and goal achievement, length, and perplexity, and their method outperformed the base model alone in the aspects assessed. However, this approach requires retraining the model for every new goal, which can be inconvenient for users. Another drawback is that it relies on the sequence-to-sequence model of Martin et al. (2017), which generates stories as sequences of abstract events encapsulating sentence components (verb and subject) that require translation into full sentences.
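The two reward components can be illustrated with a toy sketch. This is not the authors' implementation: the corpus, the inverse-distance form of the closeness score, and the mixing weight `alpha` are all illustrative stand-ins for the paper's learned, corpus-scale statistics.

```python
# Toy corpus: each story is the sequence of its sentences' main verbs.
CORPUS = [
    ["meet", "argue", "fight", "apologize", "marry"],
    ["meet", "travel", "fight", "escape", "marry"],
    ["work", "argue", "quit", "travel"],
]

def distance_reward(verb, goal, corpus):
    """Closeness between a generated verb and a later occurrence of the
    goal verb, averaged over stories containing both (higher = closer)."""
    gaps = []
    for story in corpus:
        if verb in story and goal in story[story.index(verb):]:
            gap = story.index(goal, story.index(verb)) - story.index(verb)
            gaps.append(1.0 / (1 + gap))
    return sum(gaps) / len(gaps) if gaps else 0.0

def frequency_reward(verb, goal, corpus):
    """Fraction of training stories that contain both verbs."""
    return sum(1 for s in corpus if verb in s and goal in s) / len(corpus)

def shaped_reward(verb, goal, corpus, alpha=0.5):
    """Weighted combination of the two components (alpha is illustrative)."""
    return alpha * distance_reward(verb, goal, corpus) \
        + (1 - alpha) * frequency_reward(verb, goal, corpus)
```

A verb that tends to appear near the goal verb, in many of the same stories, earns a higher shaped reward and is therefore favored during policy optimization.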
Model Fusion. Fan et al. (2018) attempt to solve the plot controllability problem by dividing the generation process into two levels of hierarchy: a premise and a story. The premise provides an overall sketch of the story, which is then used to write the story. Their fusion model combines a convolutional sequence-to-sequence model with a self-attention mechanism to improve the quality of the generated story. A convolutional network first generates a writing prompt, which then becomes the input to the sequence-to-sequence model and guides it in generating a story conditioned on the prompt. Their model was superior, in both human evaluations and perplexity, to a traditional sequence-to-sequence method. Conditioning on the generated premise makes the generated story plot consistent and improves long-term dependency. Overall, this approach improves on the previous work by writing the stories directly and by being conditioned on different prompts without retraining. Yet this model also has its limitations. First, it relies heavily on random sampling for generation, which is prone to errors. Second, it suffers from text repetition in the generated stories. Lastly, the generated prompts are generic and less interesting than human-written writing prompts, which often leads to boring stories.
Plan and Write. Yao et al. (2019) proposed the Plan-and-Write story generation framework. The authors leveraged some of the characteristics of symbolic planning and integrated them into a neural system. Their work improves on the previous literature in that it uses titles to generate controlled storylines rather than auto-generated writing prompts. They utilize storyline planning to improve the quality and coherence of the generated stories and thus control the generation, and they explore several storyline planning strategies to see their effect on story generation. The framework takes as input the title of the story and generates a storyline; the storyline and the title are then used as input to control story generation in a sequence-to-sequence model. They also proposed two metrics to evaluate their model: inter-story repetition and intra-story repetition. The evaluations showed that the model is superior to the conditional language model baselines used. They also showed that the model suffers from several major problems: repetition, going off topic, and logical inconsistencies. It also utilizes a sequential language model to approximate the story plot, which oversimplifies the structure and depth of a good story plot, suggesting that generating coherent and logical story plots is still far from solved.
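The two-stage interface can be sketched as follows, with trivial stand-ins for the learned models: in the real framework both the storyline planner and the story writer are trained neural networks, whereas here `plan_storyline` is a hypothetical toy lookup and `write_story` a template.

```python
def plan_storyline(title, num_points=5):
    """Stage 1: map a title to a short sequence of storyline keywords.
    Stand-in for the learned planning model (toy lookup, illustrative)."""
    toy_plans = {
        "the lost dog": ["dog", "escape", "search", "found", "home"],
    }
    return toy_plans.get(title.lower(), ["begin", "middle", "end"])[:num_points]

def write_story(title, storyline):
    """Stage 2: condition sentence generation on the title and storyline.
    Stand-in for the seq2seq generator: one templated sentence per keyword."""
    return " ".join(
        f"({title}) This part of the story is about {kw}." for kw in storyline
    )

def plan_and_write(title):
    """Title -> storyline -> story, the pipeline described above."""
    storyline = plan_storyline(title)
    return storyline, write_story(title, storyline)
```

The point of the interface is that the intermediate storyline is an inspectable, editable object, which is what makes the generation controllable.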
Generation by Interpolation. Wang et al. (2020) introduced a generation-by-interpolation story generation model. While previously introduced methods require minimal human input, they still suffer from logical inconsistencies and off-topic wandering; the generation-by-interpolation model is designed to overcome these challenges. It is an ending-guided model, which has an advantage over storyline-guided models: a storyline-guided model can easily be misled by a very general prompt, whereas an ending-guided model can use a single ending sentence to develop a good story plot. Their ending-guided method centers on conditioning the generation on the first and last sentences of the story. A GPT-2 model (Radford et al., 2019) generates several candidate storyline sentences, and these candidates are ranked by coherence scores from a RoBERTa model (Liu et al., 2019); the sentence with the highest coherence with the first and last sentences is chosen, and generation continues from it. Their evaluations demonstrate the informativeness of the ending guide and the effectiveness of the coherence ranking approach: the generated stories were of higher quality and better coherence than those of previous state-of-the-art models. The human evaluations also suggested that assessing good stories needs better and deeper evaluation metrics that match how humans define an excellent story, for example, measuring how the organization of events and characters can constitute better narratives. Lastly, using a transformer-language-model-based system improved the model's coherence and repetition, but it could not manage commonsense inference beyond a small extent, establishing the need to integrate more human knowledge into the model.
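The generate-then-rank step can be sketched with a toy coherence scorer. In the real system the candidates come from GPT-2 and the scores from a fine-tuned RoBERTa; the word-overlap score below is purely illustrative.

```python
def overlap_coherence(candidate, anchors):
    """Toy coherence score: fraction of the candidate's words that also
    appear in the anchor (first and last) sentences. A real system would
    use a trained scorer such as RoBERTa instead."""
    anchor_words = {w.lower() for a in anchors for w in a.split()}
    cand_words = [w.lower() for w in candidate.split()]
    if not cand_words:
        return 0.0
    return sum(w in anchor_words for w in cand_words) / len(cand_words)

def pick_next_sentence(candidates, first_sentence, last_sentence):
    """Rank candidate continuations by coherence with the story's first
    and last sentences and return the best one."""
    anchors = (first_sentence, last_sentence)
    return max(candidates, key=lambda c: overlap_coherence(c, anchors))
```

Conditioning every interpolated sentence on both endpoints is what keeps the middle of the story from wandering off topic.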
PlotMachines. Rashkin et al. (2020) proposed a transformer-language-model-based system that generates multi-paragraph stories conditioned on specified outlines, showing improvements in narrative quality over the previous work. The approach utilizes memory state tracking and discourse structures to better control the generated story plot and to keep track of the generated lines to maintain coherence. The outlines are represented as unordered lists of high-level, multi-word descriptions of events occurring in the story. At every step, the model generates based on the representation of the given outline, the high-level discourse representation, the preceding story context, and the previous memory. The discourse representation encodes the type of the current paragraph, introduction ( _i_ ), body ( _b_ ), or conclusion ( _c_ ), and is appended to the outline representation at every time step. The preceding story context is the hidden state vectors output by the transformer's attention blocks upon feeding the generated sentences into a static GPT-2 model. Finally, the memory is a concatenated vector containing both the generated tokens and an encoded state of the story. When evaluated on human preferences, the proposed system outperforms baseline models, including Fusion (Fan et al., 2018), GPT-2 (Radford et al., 2019), and Grover (Zellers et al., 2019), in metrics measuring logical ordering, narrative flow, and the level of repetitiveness. In PlotMachines, generation is conditioned on a general outline that includes events and phrases for ease of extraction. Even with its better performance, the stories could benefit from incorporating a comprehensive plot outline, such as the output of an event-based planning system, to improve the generated stories' depth and interestingness.

Narrative controllability is still an open challenge for automatic story generation.
Albeit an active research area in natural language generation, some of its problems can be attributed to the very technologies introduced to improve it, and they manifested after neural-based systems were introduced to story generation. As summarized in Table 1 in Appendix A, narrative controllability approaches are typically ending-focused or storyline-focused. In the ending-focused approaches, the goal is to generate a story with a specific desired ending; examples of such systems are Tambwekar et al. (2019) and Wang et al. (2020). In the storyline-focused approaches, the generated stories follow an outline of the plot; Rashkin et al. (2020), Yao et al. (2019), and Fan et al. (2018) are examples of such systems. The two approaches reflect different controllability goals, which need to be accounted for when comparing generation systems. We also notice a shift from Seq2Seq models (Tambwekar et al., 2019; Fan et al., 2018; Yao et al., 2019) to transformer-based architectures in newer models (Rashkin et al., 2020; Wang et al., 2020).

After examining those solutions, we notice three main challenges that need to be solved. First, controlled models generally suffer from low creativity and interestingness, which is especially obvious with more rigid controls such as story outlines. Second, the evaluation metrics for the controllability of automatic story generation systems are neither sufficient nor unified, making it harder to evaluate and compare systems. Third, despite the controls added to the generation process, coherence and logical plot generation still need improvement. These challenges are an open invitation for more research in controllability.
2.2 Commonsense

Commonsense is regarded as obvious to most humans (Cambria et al., 2011) and comprises shared knowledge about how the world works (Nunberg, 1987). Commonsense underpins a deep understanding of language. The two major bottlenecks here are how to acquire commonsense knowledge and how to incorporate it into state-of-the-art story generation systems.
Before commonsense knowledge is integrated into neural language models, the models are often trained on commonsense knowledge bases: datasets containing information detailing well-known facts or causal relationships. We first introduce these commonsense benchmarks.
ConceptNet. ConceptNet (Speer et al., 2017) is a large semantic knowledge graph that connects words and phrases of natural language with labeled edges, describing general human knowledge and how it is expressed in natural language. The data takes the form of triples of a start node, a relation label, and an end node. For example, the assertion that "a dog has a tail" can be represented as (dog, HasA, tail). ConceptNet lays the foundation for incorporating real-world knowledge into a variety of AI projects and applications, and many newer benchmarks are extracted from ConceptNet to serve other purposes.
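A minimal sketch of this triple representation; the graph entries and the `query` helper are illustrative, not ConceptNet's actual API.

```python
from collections import namedtuple

# A ConceptNet-style assertion: (start node, relation label, end node).
Triple = namedtuple("Triple", ["start", "relation", "end"])

# A tiny toy knowledge graph (invented entries, not the real data).
GRAPH = [
    Triple("dog", "HasA", "tail"),
    Triple("dog", "IsA", "pet"),
    Triple("cat", "HasA", "whiskers"),
]

def query(graph, start=None, relation=None):
    """Return the end nodes of all triples matching the given start
    node and/or relation label (None matches anything)."""
    return [t.end for t in graph
            if (start is None or t.start == start)
            and (relation is None or t.relation == relation)]
```

Downstream systems traverse exactly this kind of (start, relation, end) structure when injecting world knowledge into generation.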
CommonsenseQA. CommonsenseQA (Talmor et al., 2019) is a benchmark built by extracting, from ConceptNet, multiple target concepts that have the same semantic relation to a single source concept. It provides a challenging dataset for commonsense question answering: each question requires one to disambiguate a target concept from three connected concepts in ConceptNet. The best pre-trained language model tuned on question answering achieves only 55.9% accuracy on CommonsenseQA, posing an important challenge for incorporating commonsense into large language models.
ATOMIC. Sap et al. (2019a) presented the ATlas Of MachIne Commonsense (ATOMIC), an atlas of commonsense knowledge with 877K textual descriptions of nine different types of If-then relations. Instead of capturing general commonsense knowledge like ConceptNet, ATOMIC focuses on sequences of events and the social commonsense relating to them. The purpose of the dataset is to allow neural networks to abstract commonsense inferences and make predictions on previously unseen events. The data takes the form of If-then relations between an event and its likely causes, effects, and participant reactions.
GLUCOSE. ATOMIC is person-centric, so it cannot be used for sentences describing general events. Mostafazadeh et al. (2020) construct GLUCOSE (GeneraLized and COntextualized Story Explanations), a large-scale dataset of implicit commonsense causal knowledge whose sentences can describe any event or state. Each GLUCOSE entry is organized as a story-specific causal statement paired with an inference rule generalized from that statement. Given a short story and a sentence X in the story, GLUCOSE captures ten dimensions of causal explanation related to X. GLUCOSE shares the same purpose as ATOMIC.
SocialIQA. SocialIQA (Sap et al., 2019b) is a large-scale benchmark for commonsense reasoning about social situations, providing 38k multiple-choice questions. Each question consists of a brief context, a question about the context, and three answer options, covering various types of inference about people's actions as described in situational contexts. The purpose of SocialIQA is to support reasoning about social situations.

There are also many other benchmarks in the commonsense domain.
MC-Script (Ostermann et al., 2018) provides narrative texts and questions collected based on script scenarios. OpenBookQA (Mihaylov et al., 2018) is a question answering dataset modeled after open-book exams for assessing human understanding of a subject. Cosmos QA (Huang et al., 2019) provides 35k multiple-choice problems that require commonsense-based reading comprehension.

Techniques for generating commonsense datasets have also been developed. For example, Davison et al. (2019) proposed a method for generating commonsense knowledge by transforming relational triples into masked sentences and then using a large, pre-trained bidirectional language model to rank a triple's validity by the estimated pointwise mutual information between the two entities. Schwartz et al. (2017) and Trinh and Le (2018) demonstrate a similar approach of using language models for tasks requiring commonsense, such as the Story Cloze Task and the Winograd Schema Challenge, respectively (Mostafazadeh et al., 2016; Levesque et al., 2012).
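The triple-to-sentence transformation at the core of this line of work can be sketched with hand-written templates. The template table is hypothetical, and the actual LM scoring step (e.g., ranking with BERT) is omitted.

```python
# Hypothetical templates mapping a relation label to a sentence pattern.
TEMPLATES = {
    "HasA": "{start} has {end}.",
    "IsA": "{start} is a kind of {end}.",
    "UsedFor": "{start} is used for {end}.",
}

def triple_to_sentence(start, relation, end):
    """Render a relational triple as a natural language sentence."""
    return TEMPLATES[relation].format(start=start, end=end)

def triple_to_masked(start, relation, end, mask_token="[MASK]"):
    """Mask the tail entity so a bidirectional LM can score how well it
    predicts `end` in context; the PMI estimate itself would require a
    real model such as BERT and is not shown here."""
    return TEMPLATES[relation].format(start=start, end=mask_token)
```

The same template idea is used in the opposite direction below, to turn knowledge-base triples into fine-tuning text for a language model.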
Three ways of applying these benchmarks to commonsense story generation are (1) fine-tuning pre-trained language models (LMs) on commonsense benchmarks, (2) analyzing perceptions of causality after generating stories, and (3) incorporating benchmarks into a language model's encoding.

An intuitive approach to utilizing commonsense knowledge is to train a language model on commonsense datasets. Yang et al. (2019) integrate external commonsense knowledge into BERT (Devlin et al., 2019) to enhance language representation for reading comprehension. Guan et al. (2020) fine-tuned GPT-2 (Radford et al., 2019) on knowledge-augmented data, ATOMIC and ConceptNet, for better commonsense story generation. They first transform ConceptNet and ATOMIC into readable natural language sentences and then post-train on these transformed sentences by minimizing the negative log-likelihood of predicting the next token. Mao et al. (2019) and Guan et al. (2020) also fine-tuned GPT-2 on ConceptNet and the BookCorpus (Kiros et al., 2015). These models achieve lower perplexity and higher BLEU scores; however, such knowledge-enhanced pre-training models for commonsense story generation are still far from generating stories with long-range coherence.

Instead of directly training language models on commonsense datasets, which improves an LM's logicality and grammaticality, an alternative way of incorporating commonsense into a language model is to analyze perceptions of causality or overall story quality. Bosselut et al. (2019) extended the work on ATOMIC (Sap et al., 2019a) and ConceptNet (Speer et al., 2017) and trained a GPT model (Radford et al., 2018) on commonsense knowledge tuples, in the format of
(subject phrase, relation, object phrase). The resulting model, COMeT, is capable of generating new commonsense triples for novel phrases. With this feature, automatically generated stories can be evaluated more easily. The model has proven efficient at learning commonsense knowledge tuples, in that humans deem most COMeT-generated triples for novel phrases to be correct. It provides an easy way of making inferences on generated text. However, COMeT performs sentence-level commonsense inference and can only handle short inputs; story generation usually needs paragraph-level commonsense inference, because when combined with context, an inference could be completely different.

To incorporate paragraph-level information and generate coherent commonsense inferences from narratives, Gabriel et al. (2020) proposed a discourse-aware model,
PARA-COMeT. PARA-COMeT first creates a discourse-aware commonsense dataset by (1) using COMeT to provide inferences on sentences in the ROCStories corpus (Mostafazadeh et al., 2016), (2) transforming the inferences into natural language with human-written templates, and (3) filtering out those with low coherence with the narrative. PARA-COMeT itself consists of (1) a memory-less model, focusing on extracting semantic knowledge from the context, and (2) a model augmented with recurrent memory, used for incorporating episodic knowledge. Compared with COMeT, PARA-COMeT demonstrated the effectiveness of generating more implicit and novel discourse-aware inferences at the paragraph level.

Ammanabrolu et al. (2020) also proposed Causal, Commonsense Plot Ordering (CCPO), built on COMeT. CCPO establishes plot points by (1) extracting all the coreference clusters from a given textual story plot using a pre-trained neural coreference resolution model (Clark and Manning, 2016), and (2) extracting a set of candidate plot points from those clusters.
2.3.1 Characters

How characters are motivated and interact with each other influences the progression of a story. Different approaches have been taken to model how focusing on characters can produce higher-quality generated narratives, some from the perspective of character affect, and some from entity representation in narrative generation.

ENGEN. Clark et al. (2018) presented an entity-based generation model, ENGEN, which produces narratives relying on: (1) the current sentence; (2) the previous sentence, encoded by a Seq2Seq model (S2SA); and (3) dynamic, up-to-date representations of all the entities in the narrative. The entity representation vectors are based on EntityNLM (Ji et al., 2017), and the vectors are updated every time their corresponding entities are mentioned. The model was evaluated on a series of tasks, including a novel mention generation task, where the model fills a slot with all previous mentions of entities, including coreferences. Similarly, the automated sentence selection task examines ENGEN's ability to distinguish between the ground-truth continuation sentence and a distractor sentence. ENGEN outperforms both S2SA and EntityNLM on these tasks. Another task involved Mechanical Turk workers reading sentences generated by both ENGEN and S2SA from the same prompts and deciding which continuation is more fluent. Out of the 50 prompt passages, Turkers preferred the ENGEN stories for 27 of them and S2SA for the remaining 23, although most of the human evaluations yielded similar results between the two models. Incorporating character or entity information into the context for generation can improve model performance on some automated and human-evaluated tasks, and the authors contended that this design improves the fluency of the generated texts. However, the generated segments for the human-evaluation task are very short, usually fragments of sentences; it is therefore unlikely that these generated texts help propel the plot.
Furthermore, the paper does not indicate how the entity representations model character interactions or how these interactions contribute to the stories.

Using Character Affinities.
A dive into character interactions in particular is detailed in Méndez et al. (2016), where the authors attempted to model character interactions using numerical affinity values. Character relationships are categorized into four types: foe (lowest affinity), indifferent (medium affinity), friend (higher affinity), and mate (highest affinity). The system consists of a Director Agent, which sets up the environment for interactions to occur, and a set of Character Agents, each representing a character. Each Character Agent interacts with the character's foes, friends, and mates. Actions pertinent to different interactions are templated using defined interaction protocols and are relatively restricted in scope. These actions are independent and can be stacked to alter the affinity values. The primary parameter of concern in this model is the affinity between characters, a factor related to character emotions. Although this modeling approach has been suggested for narrative generation, the authors did not provide examples of stories generated using the character affinity model. Instead, they presented affinity changes for different Character Agents in the story to illustrate how different affinity thresholds for foe interactions affect the affinity evolution in the narratives. The model might be considered useful for modeling character interactions, yet the effect affinity changes have on the story plot remains unclear.
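A sketch of the affinity mechanism: a numeric affinity is bucketed into the four relationship types, and independent actions shift it. The threshold values and deltas below are illustrative, not the paper's.

```python
# Illustrative affinity thresholds; the paper's actual values differ.
THRESHOLDS = [(0.25, "foe"), (0.5, "indifferent"), (0.75, "friend"), (1.0, "mate")]

def relationship(affinity):
    """Map a numeric affinity in [0, 1] to one of the four relationship
    types used between Character Agents."""
    for upper, label in THRESHOLDS:
        if affinity <= upper:
            return label
    return "mate"

def apply_action(affinity, delta):
    """Actions independently raise or lower affinity; clamp to [0, 1]."""
    return min(1.0, max(0.0, affinity + delta))
```

Because actions only move a scalar, stacking several of them can walk a pair of characters across a threshold, e.g. from friend to mate, which is exactly the kind of affinity evolution the authors plot.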
EC-CLF. Brahman and Chaturvedi (2020) proposed a method for story generation conditioned on the emotion arc of the protagonist, using reinforcement learning to train a GPT-2 model. The authors suggested two emotion-consistency rewards: EC-EM and EC-CLF. EC-EM calculates how well the generated story aligns with the given arc using character reaction inferences from COMET (Bosselut et al., 2019); it is a modified Levenshtein distance that considers the cosine similarities between words from the given arc and the COMET inferences. EC-CLF, on the other hand, involves training a BERT (Devlin et al., 2019) classifier to identify the emotion in the generated sentences; the reward value is the classifier's probability of the desired emotions throughout the narrative. On human-evaluated tasks such as assessing emotion faithfulness and content quality, RL-CLF (GPT-2 trained with the EC-CLF reward) outperformed baselines including GPT-2 trained with the emotion arc as an additional input to the narrative examples (EmoSup) and GPT-2 trained with the EC-EM reward. This work augments current state-of-the-art models with the ability to generate narratives in which the protagonist's emotion changes follow a specified arc. It is an example of how character emotions can be used to inform story progression and improve narrative quality. Despite the enhancement in generation quality, the model still focuses only on one character rather than on interactions between characters.
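An EC-CLF-style reward can be sketched by replacing the trained BERT classifier with a toy emotion lexicon; the lexicon entries and scores below are invented for illustration.

```python
# Toy emotion lexicon standing in for a trained classifier's
# probabilities (all words and numbers here are illustrative).
LEXICON = {
    "joy": {"smiled": 0.9, "laughed": 0.8, "wept": 0.1},
    "sadness": {"wept": 0.9, "sighed": 0.7, "smiled": 0.1},
}

def segment_prob(segment, emotion, default=0.5):
    """Probability that a story segment expresses `emotion`: max lexicon
    score over its words (a real system would run a classifier instead)."""
    scores = [LEXICON[emotion].get(w, default) for w in segment.split()]
    return max(scores) if scores else default

def arc_reward(segments, arc):
    """EC-CLF-style reward: average probability of the desired emotion
    at each step of the given arc."""
    assert len(segments) == len(arc)
    probs = [segment_prob(s, e) for s, e in zip(segments, arc)]
    return sum(probs) / len(probs)
```

During RL fine-tuning, generations whose segments match the requested arc receive a higher reward, pushing the policy toward emotion-faithful stories.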
SRL + Entity. Fan et al. (2019) generated action-driven narratives with the following pipeline: (1) based on the given prompt, produce an action plan in which all entities are represented by placeholder tags; (2) create an entity-anonymized story from the action plan; (3) output the full story after replacing the anonymized, generalized entities with natural language entities. Every entry in the action sequence consists of a predicate, which is a verb, and a series of arguments, which are the entities involved in the action. This representation allows models to learn more in-depth and generalizable relationships between different verbs and characters. A convolutional Seq2Seq model is trained on prompts from the WRITINGPROMPTS dataset (Fan et al., 2018) and their corresponding action sequences. The network has an attention head dedicated to past verbs to improve verb diversity in generations. Human preference studies showed that the novel model generated more coherent narratives than the Fusion model from Fan et al. (2018); additionally, the new model had more diversity in the generated verbs. The technique of abstraction and generalization can prove useful in the story generation process, since abstractions reveal more widely applicable rules in storytelling. Again, it is not clear whether character interactions are implicitly learned by the models in this work, so further investigation would be required to determine whether this work is suitable for multi-agent narrative generation.

In this section, we examined four works in the sub-field of character- and entity-focused automated narrative generation. Generally, representing entities in a suitable format can improve the quality of the plotline, and character emotions can help inform the story generation process. Interactions between multiple characters are currently not the focus of the field, but they should be a target of future research.

2.3.2 Creativity
Creativity in human-authored narratives manifests in ways including figures of speech, character traits, and the environments in which the narratives occur. Martin et al. (2016) developed a system for improvisational interactive storytelling based on a plot graph that serves as a general guideline for the generated storyline. The recent introduction of transformer-based language models has inspired people to generate novel content with them, including using GPT-2 to generate fantasy descriptions with explicit subjects and weblinks (Austin, 2019). Nonetheless, there has still not been much specific research into further improving the creativity of transformer-based language models.

3 Conclusion

This survey discussed several directions in automatic story generation research and their respective challenges, and summarized research attempts at solving them. The research in automatic story generation is far from done. As with every new technology, new challenges arise. With automated story generation, such challenges include controlling the story content, commonsense knowledge, inferring reasonable character actions, and creativity. This survey provides a dive into some of these active research problems. Its value is that it is a good starting point for researchers who want to learn more about the domain and the current state-of-the-art solutions for several story generation challenges.

In Section 2.1, we summarized a few approaches addressing the problem of story generation controllability. We noticed that the papers we reviewed shared one of two goals: controlling the story outline or controlling the story ending. We also observed an emerging trend toward using transformer-based language models for story generation.

In Section 2.2, we introduced methods to incorporate commonsense knowledge into story generation systems, along with frameworks that integrate such knowledge.
Frequent approaches include: (1) fine-tuning on commonsense datasets, (2) analyzing perceptions of causality, and (3) incorporating commonsense knowledge graphs into encoders. These methods are able to increase overall story quality; however, none of them can ensure the generation of reasonable and coherent stories. One potential path to major improvements in this area would be to combine these different approaches.

In Section 2.3, we provided insight into some currently less-researched areas, including characters in generated narratives and the creativity of generated stories. Incorporating representations of entities into the generation process seems to improve the coherence of the plot, and character affect can help navigate the generation space as well. Extending the work on character affect from a single character to multiple characters could further enhance the generated narratives. There has not been much emphasis on the creativity of generated texts.

Additionally, we highlight a few future research problems that are worth exploring:

1. In the controllability systems we examined, we noticed that stories become less interesting as the generation process becomes more controlled. There is a trade-off between narrative creativity and the structural coherence of narratives.

2. The evaluation metrics used are generally those used for other natural language generation tasks, such as BLEU, perplexity, and ROUGE. These metrics are weak and do not perform well for this task. The story generation domain needs different metrics that capture story-specific characteristics, such as measures of creativity and interestingness. There is also a need to develop more robust and unified metrics to facilitate comparisons between systems.

3. The problems of plot incoherence and illogical plot generation are far from solved. Both are still very active research areas and can be interesting future research directions.

4.
Instead of sentence-level and paragraph-level commonsense inference, story-level commonsense inference could increase the accuracy of inference and provide a better tool for generating more logically coherent stories.

References

Prithviraj Ammanabrolu, Wesley Cheung, William Broniec, and Mark O. Riedl. 2020. Automated storytelling via causal, commonsense plot ordering. arXiv preprint arXiv:2009.00829.

Gabor Angeli, Melvin Jose Johnson Premkumar, and Christopher D. Manning. 2015. Leveraging linguistic structure for open domain information extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 344–354.

John Austin. 2019. The book of endless history: Authorial use of gpt2 for interactive storytelling. In International Conference on Interactive Digital Storytelling, pages 429–432. Springer.

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. Comet: Commonsense transformers for automatic knowledge graph construction. pages 4762–4779.

Faeze Brahman and Snigdha Chaturvedi. 2020. Modeling protagonist emotions for emotion-aware storytelling. arXiv preprint arXiv:2010.06822.

Erik Cambria, Yangqiu Song, Haixun Wang, and Amir Hussain. 2011. Isanette: A common and common sense knowledge base for opinion mining. In , pages 315–322. IEEE.

Marc Cavazza and David Pizzi. 2006. Narratology for interactive storytelling: A critical introduction. In Technologies for Interactive Digital Storytelling and Entertainment, Lecture Notes in Computer Science, pages 72–83, Berlin, Heidelberg. Springer.

Elizabeth Clark, Yangfeng Ji, and Noah A. Smith. 2018. Neural text generation in stories using entity representations as context. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).

Kevin Clark and Christopher D. Manning. 2016. Deep reinforcement learning for mention-ranking coreference models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2256–2262.

Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1173–1178.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. pages 889–898.

Angela Fan, Mike Lewis, and Yann Dauphin. 2019. Strategies for structuring story generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2650–2660.

Saadia Gabriel, Chandra Bhagavatula, Vered Shwartz, Ronan Le Bras, Maxwell Forbes, and Yejin Choi. 2020. Paragraph-level commonsense transformers with recurrent memory. arXiv preprint arXiv:2010.01486.

Pablo Gervás. 2009. Computational approaches to storytelling and creativity. AI Magazine, 30(3):49–49.

Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan Zhu, and Minlie Huang. 2020. A knowledge-enhanced pretraining model for commonsense story generation. Transactions of the Association for Computational Linguistics, 8:93–108.

Jian Guan, Yansen Wang, and Minlie Huang. 2019. Story ending generation with incremental encoding and commonsense knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6473–6480.

Chenglong Hou, Chensong Zhou, Kun Zhou, Jinan Sun, and Sisi Xuanyuan. 2019. A survey of deep learning applied to story generation. In Smart Computing and Communication, pages 1–10, Cham. Springer International Publishing.

Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos qa: Machine reading comprehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2391–2401.

Yangfeng Ji, Chenhao Tan, Sebastian Martschat, Yejin Choi, and Noah A. Smith. 2017. Dynamic entity representations in neural language models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1830–1839.

Ryan Kiros, Yukun Zhu, Russ R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302.

B. Kybartas and R. Bidarra. 2017. A survey on story generation techniques for authoring computational narratives. IEEE Transactions on Computational Intelligence and AI in Games, 9(3):239–253.

Wendy G. Lehnert, Michael G. Dyer, Peter N. Johnson, C.J. Yang, and Steve Harley. 1983. Boris—an experiment in in-depth understanding of narratives. Artificial Intelligence, 20(1):15–62.

Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning. Citeseer.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Linbo Luo, Wentong Cai, Suiping Zhou, Michael Lees, and Haiyan Yin. 2015. A review of interactive narrative systems and technologies: a training perspective. SIMULATION, 91(2):126–147.

Huanru Henry Mao, Bodhisattwa Prasad Majumder, Julian McAuley, and Garrison Cottrell. 2019. Improving neural story generation by targeted common sense grounding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5990–5995.

Lara J. Martin, Prithviraj Ammanabrolu, Xinyu Wang, William Hancock, Shruti Singh, Brent Harrison, and Mark O. Riedl. 2017. Event representations for automated story generation with deep neural nets. arXiv preprint arXiv:1706.01331.

Lara J. Martin, Brent Harrison, and Mark O. Riedl. 2016. Improvisational computational storytelling in open worlds. In Interactive Storytelling, pages 73–84, Cham. Springer International Publishing.

Gonzalo Méndez, Pablo Gervás, and Carlos León. 2016. On the use of character affinities for story plot generation. In Knowledge, Information and Creativity Support Systems, pages 211–225. Springer.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391.

Todor Mihaylov and Anette Frank. 2018. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 821–832.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and evaluation framework for deeper understanding of commonsense stories. arXiv preprint arXiv:1604.01696.

Nasrin Mostafazadeh, Aditya Kalyanpur, Lori Moon, David Buchanan, Lauren Berkowitz, Or Biran, and Jennifer Chu-Carroll. 2020. Glucose: Generalized and contextualized story explanations. arXiv preprint arXiv:2009.07758.

Geoffrey Nunberg. 1987. Position paper on common-sense and formal semantics. In Theoretical Issues in Natural Language Processing 3.

Simon Ostermann, Ashutosh Modi, Michael Roth, Stefan Thater, and Manfred Pinkal. 2018. Mcscript: A novel dataset for assessing machine comprehension using script knowledge. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).

Hannah Rashkin, Asli Celikyilmaz, Yejin Choi, and Jianfeng Gao. 2020. Plotmachines: Outline-conditioned generation with dynamic plot state tracking. arXiv preprint arXiv:2004.14967.

John Fairbanks Reeves. 1991. Computational morality: A process model of belief conflict and resolution for story understanding.

Mark O. Riedl. 2016. Computational narrative intelligence: A human-centered goal for artificial intelligence. arXiv preprint arXiv:1602.06484.

Mark Owen Riedl and Vadim Bulitko. 2013. Interactive narrative: An intelligent systems approach. AI Magazine, 34(1):67–67.

David L. Roberts and Charles L. Isbell. 2008. A survey and qualitative analysis of recent advances in drama management. page 15.

Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. 2019a. Atomic: An atlas of machine commonsense for if-then reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3027–3035.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019b. Social iqa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4453–4463.

Roy Schwartz, Maarten Sap, Ioannis Konstas, Leila Zilles, Yejin Choi, and Noah A. Smith. 2017. The effect of different writing tasks on linguistic style: A case study of the roc story cloze task. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 15–25.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. Conceptnet 5.5: An open multilingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 4444–4451.

Wendy A. Suzuki, Mónica I. Feliú-Mójer, Uri Hasson, Rachel Yehuda, and Jean Mary Zarate. 2018. Dialogues: The science and power of storytelling. The Journal of Neuroscience, 38(44):9468–9470.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158.

Pradyumna Tambwekar, Murtaza Dhuliawala, Lara J. Martin, Animesh Mehta, Brent Harrison, and Mark O. Riedl. 2019. Controllable neural story plot generation via reward shaping. pages 5982–5988.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In International Conference on Learning Representations.

Su Wang, Greg Durrett, and Katrin Erk. 2020. Narrative interpolation for generating and understanding stories. arXiv preprint arXiv:2008.07466.

An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu, Hua Wu, Qiaoqiao She, and Sujian Li. 2019. Enhancing pre-trained language representations with rich knowledge for machine reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2346–2357.

Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin Knight, Dongyan Zhao, and Rui Yan. 2019. Plan-and-write: Towards better automatic storytelling. Proceedings of the AAAI Conference on Artificial Intelligence, 33:7378–7385.

R. Michael Young, Stephen Ware, Brad Cassell, and Justus Robertson. 2013. Plans and planning in narrative generation: A review of plan-based approaches to the generation of story, discourse and interactivity in narratives. page 24.

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. In Advances in Neural Information Processing Systems, pages 9054–9065.

Appendix A Controllability Approaches
Table 1: Summary of controllability approaches