WikiHow: A Large Scale Text Summarization Dataset
Mahnaz Koupaee
University of California, Santa Barbara [email protected]
William Yang Wang
University of California, Santa Barbara [email protected]
Abstract
Sequence-to-sequence models have recently attained state-of-the-art performance in summarization. However, few large-scale high-quality datasets are available, and almost all the available ones are news articles with a specific writing style. Moreover, abstractive human-style systems involving description of the content at a deeper level require data with higher levels of abstraction. In this paper, we present WikiHow, a dataset of more than 230,000 article and summary pairs extracted and constructed from an online knowledge base written by different human authors. The articles span a wide range of topics and therefore represent a high diversity of styles. We evaluate the performance of the existing methods on WikiHow to present its challenges and set some baselines for further improvement.
Introduction

Summarization, the process of generating a shorter version of a piece of text while preserving important context information, is one of the most challenging NLP tasks. Sequence-to-sequence neural networks have recently obtained significant performance improvements on summarization (Rush et al., 2015; Chopra et al., 2016). However, the existence of large-scale datasets is the key to the success of these models. Moreover, the length of the articles and the diversity in their styles can create more complications.

Almost all existing summarization datasets such as DUC (Harman and Over, 2004), Gigaword (Napoles et al., 2012), New York Times (Sandhaus, 2008) and CNN/Daily Mail (Nallapati et al., 2016) consist of news articles. News articles have their own specific styles, and therefore systems trained only on news may not generalize well. On the other hand, the existing datasets may not be large enough (DUC) to train a sequence-to-sequence model, the summaries may be limited to only headlines (Gigaword), they may be more useful as an extractive summarization dataset (New York Times), and their abstraction level might be limited (CNN/Daily Mail).

To overcome the issues of the existing datasets, we present a new large-scale dataset called WikiHow, built using the online WikiHow knowledge base. It contains articles about various topics written in different styles, making them different from existing news datasets. Each article consists of multiple paragraphs and each paragraph starts with a sentence summarizing it. By merging the paragraphs to form the article and the paragraph outlines to form the summary, the resulting version of the dataset contains more than 200,000 long-sequence pairs. We then present two features to show how abstractive our dataset is. Finally, we analyze the performance of some of the existing extractive and abstractive systems on WikiHow as benchmarks for further studies.
The contribution of this work is three-fold:
• We introduce a large-scale, diverse dataset with various writing styles, convenient for long-sequence text summarization.
• We introduce level of abstractedness and compression ratio metrics to show how abstractive the new dataset is.
• We evaluate the performance of the existing systems on WikiHow to create benchmarks and understand the challenges better.
Related Work

There are several datasets used to evaluate summarization systems. We briefly describe the properties of these datasets as follows.

[Figure 1 depicts the Inverted Pyramid: the Lead carries the most important information about an event (Who? What? Where? When? Why? How?); the Body carries the crucial information expanding the topic (argument, controversy, story, evidence, background details); and the Tail carries extra, interesting related items, ordered by the journalist's assessment.]

Figure 1: Inverted Pyramid writing style. The first few sentences of news articles contain the important information, making Lead-3 baselines outperform most of the systems.
DUC:
The Document Understanding Conference dataset (Harman and Over, 2004) contains 500 news articles and their summaries capped at 75 bytes. The summaries are written by human authors, and there exists more than one summary per article, which is its major advantage over other existing datasets. The DUC dataset cannot be used for training models with a large number of parameters and therefore is used along with other datasets (Rush et al., 2015; Nallapati et al., 2017).
Gigaword:
Another collection of news articles used for summarization is Gigaword (Napoles et al., 2012). The original articles in the dataset do not have summaries paired with them. However, some prior work (Rush et al., 2015; Chopra et al., 2016) used a subset of this dataset and constructed pairs of summaries by using the first line of the article and its headline, making the dataset suitable for short text summarization tasks.
New York Times:
The New York Times (NYT) dataset (Sandhaus, 2008) is a large collection of articles published between 1996 and 2007. While this dataset has been mainly used for extractive systems (Hong and Nenkova, 2014; Durrett et al., 2016), Paulus et al. (2017) are the first to evaluate their abstractive system using NYT.
CNN/Daily Mail:
This dataset, mainly used in recent summarization papers (Nallapati et al., 2016; See et al., 2017; Nallapati et al., 2017), consists of online CNN and Daily Mail news articles and was originally developed for question answering systems. The highlights associated with each article are concatenated to form the summary. Two versions of this dataset exist, depending on the preprocessing: Nallapati et al. (2017) used entity anonymization to create the anonymized version of the dataset, while See et al. (2017) replaced the anonymized entities with their actual values to create the non-anonymized version.
Dataset Size              230,843
Average Article Length      579.8
Average Summary Length       62.1
Vocabulary Size           556,461

Table 1: The WikiHow dataset statistics.
NEWSROOM:
This corpus (Grusky et al., 2018) is the most recent large-scale dataset introduced for text summarization. It consists of diverse summaries combining abstractive and extractive strategies, yet it is another news dataset and the average length of its summaries is limited.

WikiHow Dataset

The existing summarization datasets consist of news articles. These articles are written by journalists and follow the journalistic style. Journalists usually follow the Inverted Pyramid style (Pöttker, 2003) (depicted in Figure 1) to prioritize and structure a text: they start by mentioning the most important, interesting or attention-grabbing elements of a story in the opening paragraphs and later add details and any background information. This writing style might be the reason why Lead-3 baselines (where the first three sentences are selected to form the summary) usually score higher than the existing summarization systems. We introduce a new dataset called WikiHow, obtained from the WikiHow data dump. This dataset contains articles written by ordinary people, not journalists, describing the steps of doing a task throughout the text. Therefore, the Inverted Pyramid does not apply to it, as all parts of the text can be of similar importance.
The WikiHow knowledge base contains online articles describing a procedural task about various topics (from arts and entertainment to computers and electronics) with multiple methods or steps, and new articles are added to it regularly. Each article consists of a title starting with "How to" and a short description of the article. There are two types of articles: the first type describes single-method tasks in different steps, while the second type represents multiple steps of different methods for a task. Each step description starts with a bold line summarizing that step and is followed by a more detailed explanation. A truncated example of a WikiHow article and how the data pairs are constructed is shown in Figure 2.

[Figure 2 shows a truncated "How to" article with two methods, "Reducing Your Water Usage" and "Using River-Friendly Products". The bold line of each step, e.g., "Take quicker showers to conserve water.", becomes a summary sentence, while the step's detailed description, e.g., "One easy way to conserve water is to cut down on your shower time...", contributes to the corresponding article.]

Figure 2: An example of our new dataset: the WikiHow summary dataset, which includes 200K+ summaries. The bold lines summarizing the paragraphs (shown in red boxes) are extracted and form the summary. The detailed descriptions of each step (except the bold lines) form the article. Note that the articles and the summaries are truncated and the presented texts are not in their actual lengths.
We made use of the Python Scrapy library[1] to write a crawler to get the data from the WikiHow website. The articles, classified into different categories, cover a wide range of topics. Our crawler obtained the unique articles available at the time of crawling (some containing more than one method; new articles are added regularly). To prepare the data for the summarization task, each method (if any) described in an article is considered as a separate article. To generate the reference summaries, the bold lines representing the summary of the steps are extracted and concatenated. The remaining parts of the steps (the detailed descriptions) are also concatenated to form the source article. After this step, the article and reference summary pairs are generated. There are some articles with only the bold lines, i.e., there is no further explanation of the steps, so they cannot be used for the summarization task. To filter out these articles, we used a size threshold so that pairs with summaries longer than the article are removed. The final dataset is made of 230,843 articles and their summaries. The statistics of the dataset are shown in Table 1. The dataset is released to the public.[2]

The large scale of the WikiHow dataset, with more than 230,000 pairs, and its average article and summary lengths make it a better choice compared to the DUC and Gigaword corpora. We also define two metrics to represent the abstraction level of WikiHow by comparing it with CNN/Daily Mail, known as one of the most abstractive and common datasets in recent summarization papers (Nallapati et al., 2016, 2017; See et al., 2017; Paulus et al., 2017).

[1] https://scrapy.org/
[2] https://github.com/mahnazkoupaee/WikiHow-Dataset

Figure 3: Uniqueness of n-grams in the CNN/Daily Mail and WikiHow datasets.
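The pair-construction and filtering procedure described above can be sketched as follows. This is a simplified illustration with hypothetical helper names, not the authors' released crawler code: for each method, the bold first line of every step forms a summary sentence, the remaining step text forms the article, and pairs whose summary is not shorter than the article are dropped.

```python
# Hypothetical sketch of the WikiHow pair construction and size-threshold
# filter (helper names are ours, not from the released code).

def build_pair(steps):
    """steps: list of (bold_line, detailed_description) tuples for one method."""
    summary = " ".join(bold for bold, _ in steps)
    article = " ".join(detail for _, detail in steps)
    return article, summary

def keep_pair(article, summary):
    # Drop step-only articles: if the summary is not shorter than the
    # article, there is no extra source text to summarize from.
    return len(summary.split()) < len(article.split())

steps = [
    ("Take quicker showers to conserve water.",
     "One easy way to conserve water is to cut down on your shower time."),
    ("Turn off the water when you're not using it.",
     "Avoid letting the water run while you're brushing your teeth or shaving."),
]
article, summary = build_pair(steps)
print(keep_pair(article, summary))  # True: the article is longer than the summary
```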
Level of Abstractedness

The abstractedness of the dataset is measured by calculating the unique n-grams in the reference summary which are not in the article. The comparison is shown in Figure 3. Apart from some common uni-grams, bi-grams and tri-grams between the articles and the summaries, hardly any longer common n-grams exist in the WikiHow pairs. The higher level of abstractedness creates new challenges for summarization systems, as they have to be more creative in generating more novel summaries.
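The abstractedness measure above can be sketched in a few lines. This is a minimal illustration with our own helper names (the paper does not specify its tokenization): the fraction of summary n-grams that never appear in the source article.

```python
# Minimal sketch of the abstractedness measure: the fraction of summary
# n-grams that do not occur in the article (whitespace tokenization is a
# simplifying assumption).

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_ratio(article, summary, n):
    art = ngrams(article.split(), n)
    summ = ngrams(summary.split(), n)
    return len(summ - art) / len(summ) if summ else 0.0

article = "hold off on laundry until you can fill the machine"
summary = "wait for a full load before running the machine"
# 7 of the summary's 8 bigrams are unseen in the article.
print(novel_ngram_ratio(article, summary, 2))  # → 0.875
```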
Compression Ratio

We define the compression ratio to further characterize the summarization task. We first calculate the average length of sentences for both the articles and the summaries. The compression ratio is then defined as the ratio between the average sentence length of the articles and the average sentence length of the summaries. The higher the compression ratio, the more difficult the summarization task, as the system needs to capture higher levels of abstraction and semantics. Table 3 shows the results for WikiHow and CNN/Daily Mail. The higher compression ratio of WikiHow shows the need for higher levels of abstraction.

                           WikiHow    CNN/Daily Mail
Article Sentence Length    100.68     118.73
Summary Sentence Length     42.27      82.63
Compression Ratio            2.38       1.44

Table 3: Compression ratio of the WikiHow and CNN/Daily Mail datasets. The reported article and summary lengths are means over all sentences.

                              CNN/Daily Mail                WikiHow
Model                         R-1    R-2    R-L    MET.     R-1    R-2    R-L    MET.
TextRank                      35.23  13.90  31.48  18.03    27.53  7.4    20.00  –
Seq-to-seq with attention     31.33  11.81  28.83  12.03    22.04  6.27   20.87  10.06
Pointer-generator             36.44  15.66  33.42  15.35    27.30  9.10   25.65  9.70
Pointer-generator + coverage  39.53  17.28  36.38  17.32    –      –      –      –

Table 2: The ROUGE F1 and METEOR (exact match) scores of different methods on the non-anonymized version of the CNN/Daily Mail dataset and the WikiHow dataset. The ROUGE scores are reported within the 95% confidence interval of the official ROUGE script.
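The compression ratio definition above can be sketched as follows. Period-based sentence splitting is a simplifying assumption of ours; the paper does not specify its preprocessing.

```python
# Minimal sketch of the compression ratio: mean article sentence length
# divided by mean summary sentence length (naive period-based splitting).

def mean_sentence_length(text):
    sentences = [s.split() for s in text.split(".") if s.strip()]
    return sum(len(s) for s in sentences) / len(sentences)

def compression_ratio(articles, summaries):
    art = sum(mean_sentence_length(a) for a in articles) / len(articles)
    summ = sum(mean_sentence_length(s) for s in summaries) / len(summaries)
    return art / summ

articles = ["Washing machines take up a lot of water. "
            "Hold off on laundry until you can fill the machine."]
summaries = ["Wait for a full load."]
print(compression_ratio(articles, summaries))  # → 1.8
```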
Benchmarks

We evaluate the WikiHow dataset using existing extractive and abstractive baselines. The systems used and the results generated for WikiHow and CNN/Daily Mail are described in the following sections.
TextRank:
An extractive summarization system (Mihalcea and Tarau, 2004; Barrios et al., 2016) using a graph-based ranking method to select sentences from the article and form the summary.
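The graph-based ranking idea behind this baseline can be sketched compactly. Note this is our simplified re-implementation of the TextRank scheme (sentences as nodes, word-overlap edge weights, PageRank-style power iteration), not the Barrios et al. (2016) variant evaluated in the paper.

```python
# Simplified TextRank sketch: sentence nodes, overlap-weighted edges,
# PageRank-style scoring (our illustration, not the evaluated system).
import math

def similarity(s1, s2):
    # Word-set overlap normalized by log sentence lengths, as in
    # Mihalcea and Tarau (2004).
    overlap = len(set(s1) & set(s2))
    if overlap == 0 or min(len(s1), len(s2)) < 2:
        return 0.0
    return overlap / (math.log(len(s1)) + math.log(len(s2)))

def textrank(sentences, d=0.85, iters=50):
    toks = [s.lower().split() for s in sentences]
    n = len(sentences)
    w = [[similarity(toks[i], toks[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    scores = [1.0] * n
    for _ in range(iters):
        scores = [(1 - d) + d * sum(w[j][i] / sum(w[j]) * scores[j]
                                    for j in range(n) if w[j][i] > 0)
                  for i in range(n)]
    return scores

sentences = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "quantum physics is hard",
]
scores = textrank(sentences)
# The two mutually similar sentences reinforce each other and outrank
# the isolated one.
print([round(s, 2) for s in scores])
```

An extractive summary is then formed by picking the top-scoring sentences in document order.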
Sequence-to-sequence model with attention:
A baseline system applied by Chopra et al. (2016); Nallapati et al. (2016) to the abstractive summarization task to generate summaries using a predefined vocabulary. This baseline is not able to handle out-of-vocabulary words (OOVs).
Pointer-generator abstractive system:
A pointer-generator mechanism (See et al., 2017) allowing the model to freely switch between copying a word from the input sequence and generating a word from the predefined vocabulary.
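The copy/generate switch of the pointer-generator (See et al., 2017) mixes two distributions: the decoder's vocabulary distribution, weighted by a generation probability p_gen, and the attention distribution over source tokens, weighted by 1 - p_gen. The toy numbers below are ours, not model outputs:

```python
# Sketch of the pointer-generator's final word distribution: an
# interpolation of the vocabulary distribution (weight p_gen) and the
# attention distribution over source tokens (weight 1 - p_gen), so an
# out-of-vocabulary source word can still be produced by copying.

def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """vocab_dist: {word: prob}; attention: one prob per source position."""
    final = {w: p_gen * p for w, p in vocab_dist.items()}
    for a, w in zip(attention, source_tokens):
        final[w] = final.get(w, 0.0) + (1 - p_gen) * a
    return final

vocab_dist = {"the": 0.6, "machine": 0.4}   # decoder's (tiny) vocabulary
attention = [0.7, 0.3]                      # attention over source positions
source = ["washing", "machine"]             # "washing" is OOV for the decoder
dist = final_distribution(0.5, vocab_dist, attention, source)
print(dist["washing"])  # 0.35: copyable from the source even though it is OOV
```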
Pointer-generator with coverage abstractivesystem:
The pointer-generator baseline with an added coverage loss (See et al., 2017) to reduce repetition in the final generated summary.
Lead-3 baseline:
A baseline selecting the first three sentences of the article to form the summary. This baseline cannot be directly used for the WikiHow dataset, as the first sentences of each article only describe a small portion of the whole article. We therefore created the Lead-3 baseline by extracting the first sentence of each paragraph and concatenating them to create the summary.

To study the performance of the evaluated systems, we used the pyrouge package[3] to report the F1 scores for ROUGE-1, ROUGE-2 and ROUGE-L (Lin, 2004) and METEOR (Banerjee and Lavie, 2005), both based on exact matches and on inclusion of stems, paraphrases and synonyms (s/p/s). Table 2 presents the results of multiple baselines on both CNN/Daily Mail (the well-known, most common abstractive summarization dataset) and the proposed WikiHow dataset. As can be seen, the summarization systems perform much better on CNN/Daily Mail than on the WikiHow dataset, with Lead-3 outperforming other baselines due to the news Inverted Pyramid writing style described earlier. On the other hand, the poor performance of Lead-3 on WikiHow shows the different writing styles of its articles. Moreover, all baselines obtain substantially higher ROUGE scores on CNN/Daily Mail than on WikiHow. This difference suggests new features and aspects inherent in the new dataset which can be used to further improve summarization systems.
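The modified Lead-3 variant above can be sketched as follows. Paragraph boundaries and period-based sentence splitting are simplifying assumptions of ours:

```python
# Sketch of the modified Lead-3 baseline for WikiHow: the first sentence
# of each paragraph is extracted and concatenated to form the summary.

def wikihow_lead(article_paragraphs):
    leads = []
    for para in article_paragraphs:
        sentences = [s.strip() for s in para.split(".") if s.strip()]
        if sentences:
            leads.append(sentences[0] + ".")
    return " ".join(leads)

paras = [
    "Hold off on laundry until the machine is full. "
    "Washing machines use a lot of water.",
    "Turn off the tap while brushing. Running water is wasted water.",
]
print(wikihow_lead(paras))
```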
Conclusion

We present WikiHow, a new large-scale summarization dataset consisting of diverse articles from the WikiHow knowledge base. The WikiHow features discussed in the paper can create new challenges for summarization systems. We hope that the new dataset can attract researchers' attention as a choice to evaluate their systems.

[3] pypi.python.org/pypi/pyrouge/0.1.3

References

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.

Federico Barrios, Federico López, Luis Argerich, and Rosa Wachenchauzer. 2016. Variations of the similarity function of TextRank for automated summarization. arXiv preprint arXiv:1602.03606.

Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In HLT-NAACL, pages 93–98.

Greg Durrett, Taylor Berg-Kirkpatrick, and Dan Klein. 2016. Learning-based single-document summarization with compression and anaphoricity constraints. arXiv preprint arXiv:1603.08887.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708–719.

Donna Harman and Paul Over. 2004. The effects of human variation in DUC summarization evaluation. In Text Summarization Branches Out.

Kai Hong and Ani Nenkova. 2014. Improving the estimation of word importance for news multi-document summarization. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 712–721.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, volume 8. Barcelona, Spain.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In AAAI.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In CoNLL 2016, page 280.

Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated Gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 95–100. Association for Computational Linguistics.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Horst Pöttker. 2003. News and its communicative quality: the inverted pyramid, when and why did it appear? Journalism Studies, 4(4):501–511.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.

Evan Sandhaus. 2008. The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia, 6(12):e26752.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.