SHARI -- An Integration of Tools to Visualize the Story of the Day
Shawn M. Jones, Alexander C. Nwala, Martin Klein, Michele C. Weigle, Michael L. Nelson
Abstract
Tools such as Google News and Flipboard exist to convey daily news, but what about the past? In this paper, we describe how to combine several existing tools with web archive holdings to perform news analysis and visualization of the “biggest story” for a given date. StoryGraph clusters news articles together to identify a common news story. Hypercane leverages ArchiveNow to store URLs produced by StoryGraph in web archives. Hypercane analyzes these URLs to identify the most common terms, entities, and highest quality images for social media storytelling. Raintale then takes the output of these tools to produce a visualization of the news story for a given day. We name this process SHARI (StoryGraph Hypercane ArchiveNow Raintale Integration).
Tools such as Google News and Flipboard exist to convey daily news, but what about the news of the past? We have combined StoryGraph with tools from the Dark and Stormy Archives Toolkit to produce the StoryGraph Hypercane ArchiveNow Raintale Integration (SHARI) process. These tools represent disparate research efforts in news analysis, corpus summarization, web archiving, and visualization. The integration produces a summary of the “biggest story” for a given date. SHARI combines the following components from Old Dominion University’s Web Science and Digital Libraries Research Group:
– StoryGraph: a platform that downloads RSS feeds and analyzes the linked articles to cluster news stories [12] – http://storygraph.cs.odu.edu/
– Hypercane: a framework for intelligently sampling and analyzing documents from web archive collections [5] – https://oduwsdl.github.io/hypercane
– ArchiveNow: a library developed by Aturban et al. [2] that submits live web URI-Rs to web archives to create URI-Ms – https://github.com/oduwsdl/archivenow
– Raintale: a MementoEmbed [3] client that creates stories from a sample of mementos – https://oduwsdl.github.io/raintale
Shawn M. Jones · Martin Klein
Los Alamos National Laboratory, Los Alamos, NM

Alexander C. Nwala · Michele C. Weigle · Michael L. Nelson
Old Dominion University, Norfolk, VA

StoryGraph: http://storygraph.cs.odu.edu
Dark and Stormy Archives Toolkit: https://oduwsdl.github.io/dsa/software.html
Web Science and Digital Libraries Research Group: https://ws-dl.cs.odu.edu

{
  "config": "/files/config/polar-media-consensus-graph/f6e84be9969ecef7adb20689002608d0/",
  "connected-comps": [
    {
      "avg-degree": 4.318181818181818,
      "density": 0.10042283298097252,
      "node-details": {"annotation": "polarity", "color": "green", "connected-comp-type": "event"},
      "nodes": [0, 1, ... additional node ids omitted for brevity ...],
      "unique-source-count": 14
    },
    {
      "avg-degree": 1,
      "density": 1,
      "node-details": {"annotation": "polarity", "color": "red", "connected-comp-type": "cluster"},
      "nodes": [9, 67],
      "unique-source-count": 2
    }
  ],
  "links": [
    {"rank": 1, "sim": 0.57, "source": 2, "target": 21, "label": "1 (0.57)", "label-description": "rank (sim)"},
    ... additional link definitions omitted for brevity ...
    {"rank": 96, "sim": 0.3, "source": 53, "target": 73, "label": "96 (0.3)", "label-description": "rank (sim)"}
  ],
  "ner-version": "3.8.0",
  "nodes": [
    ... other nodes omitted for brevity ...
    {
      "entities": [
        {"class": "LOCATION", "entity": "Coney Island"},
        {"class": "LOCATION", "entity": "Brooklyn"},
        {"class": "PERSON", "entity": "Victor J. Blue"},
        ...
      ],
      "extraction-time": "2020 - - - assets/static-assets/favicon - - - restrictions-us.html",
      "node-details": {"annotation": "polarity", "color": "blue", "connected-comp-type": "event", "type": "left"},
      "published": "Sun, 22 Mar 2020 22:00:52 +0000",
      "rss-uri-m": "https://web.archive.org/web/20200323000609id_/https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml",
      "text": "Health | Harsh Steps Are Needed to Stop the Coronavirus, Experts Say\nhttps://nyti.ms/3dkfoCc\nA beach stroller in the Coney Island neighborhood of Brooklyn on Saturday.Credit...Victor J. Blue for The New York Times\nHarsh Steps Are Needed to Stop the Coronavirus, Experts Say\nScientists who have fought pandemics describe difficult measures needed to defend the United States against a fast-moving pathogen.\nA beach stroller in the Coney Island neighborhood of Brooklyn on Saturday.Credit...Victor J. Blue for The New York Times\nSupported by\nBy Donald G. McNeil Jr.\nMarch 22, 2020, 6:00 p.m. ET\nTerrifying though the coronavirus may be, it can be turned back. China, South Korea, Singapore and Taiwan have demonstrated that, with furious efforts, the contagion can be brought to heel.\nWhether they can keep it suppressed remains to be seen...",
      "title": "Harsh Steps Are Needed to Stop the Coronavirus, Experts Say - The New York Times"
    },
    ... other articles omitted for brevity ...
  ],
  "self": "http://storygraph.cs.odu.edu/graphs/polar-media-consensus-graph/ - - - - - pointer": {"cursor": 0, "hist": 1440, "cur-path": "2020/03/23"}
}

Fig. 1: An abridged version of the JSON file generated by StoryGraph that drives the visualization in Figure 2.
Fig. 2: The StoryGraph news similarity graph for March 23, 2020.
URL: http://storygraph.cs.odu.edu/graphs/polar-media-consensus-graph/
Fig. 3: The “biggest news story” of March 23, 2020 produced by the SHARI process.
URL: https://oduwsdl.github.io/dsa-puddles/stories/shari/2020/03/23/storygraph_biggest_story_2020-03-23/
Fig. 4: Annotations detail which SHARI components provide each part of the visualization shown in Figure 3.

Nwala et al. [10,11] have focused on finding seeds within search engine result pages (SERPs), social media stories, and news feeds. As part of this research, Nwala et al. also developed StoryGraph [12], a service that saves RSS feeds from 17 news sources (Table 1 in Appendix A) every ten minutes. With these RSS feeds, StoryGraph analyzes the lexical connections between articles across feeds to generate JSON output, which drives a graph visualization. Figure 1 displays some of this JSON output for March 23, 2020. StoryGraph then visualizes this output, as shown in Figure 2.

Collections on specific topics exist at various web archives [7]. AlNoamany et al. [1] introduced how to use social media storytelling to summarize web archive collections. Klein et al. [8] have built collections from web archives by conducting focused crawls. Jones developed Hypercane [5] to intelligently sample mementos from larger collections. Jones also developed Raintale [4] for generating social media stories to summarize groups of mementos, providing visualizations that employ familiar techniques, like cards, that require no training for most users to understand.

The JSON data structure from Figure 1 provides all information gathered but is difficult for humans to understand at a glance. The graph shown in Figure 2 provides an overview of the JSON through favicons and edges, but a user requires some training to fully comprehend what it represents. Figure 3 displays the largest connected component from this graph visualized via the SHARI process. Through images, text snippets, titles, cards, domain names, favicons, and other content, the SHARI output allows the viewer to intuitively understand that the biggest news story for this date consists of different reactions to the growing COVID-19 pandemic.
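To make the idea of the "largest connected component" concrete, the following is a minimal sketch of how one might select it from JSON shaped like Figure 1. It rests on several assumptions that the paper does not confirm: that "largest" means the component listing the most node ids, that each node id is an index into the top-level "nodes" array, and that each node carries its article URI-R under a key named "link" (a hypothetical name, since that part of Figure 1 is not legible). The toy data is invented for illustration.

```python
def biggest_story_nodes(graph):
    """Return the node records of the component that lists the most node ids.

    Assumption: node ids in "connected-comps" index into graph["nodes"].
    """
    components = graph.get("connected-comps", [])
    if not components:
        return []
    # Assumption: "biggest" is measured by the number of nodes in a component.
    biggest = max(components, key=lambda comp: len(comp["nodes"]))
    return [graph["nodes"][node_id] for node_id in biggest["nodes"]]


# Invented toy graph that mimics the key names visible in Figure 1.
# The "link" key is hypothetical.
toy_graph = {
    "connected-comps": [
        {"nodes": [0, 1, 3], "node-details": {"connected-comp-type": "event"}},
        {"nodes": [2, 4], "node-details": {"connected-comp-type": "cluster"}},
    ],
    "nodes": [
        {"title": "Article A", "link": "https://example.com/a"},
        {"title": "Article B", "link": "https://example.com/b"},
        {"title": "Article C", "link": "https://example.com/c"},
        {"title": "Article D", "link": "https://example.com/d"},
        {"title": "Article E", "link": "https://example.com/e"},
    ],
}

story = biggest_story_nodes(toy_graph)
uri_rs = [node["link"] for node in story]
print(uri_rs)
```

In this sketch the resulting list of URI-Rs is what a later archiving step would consume; the actual StoryGraph Toolkit may select and expose the biggest story differently.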
Fig. 5: SHARI process for creating a visualization of the biggest news story for a given day
The StoryGraph Hypercane ArchiveNow Raintale Integration (SHARI) [6] process automatically creates stories summarizing news for a day. Figure 4 details what each tool contributes to the story. Figure 5 shows the steps of the SHARI process.
1. With the StoryGraph Toolkit, we query the StoryGraph service for the URI-Rs belonging to the biggest story of the day.
2. Hypercane converts these URI-Rs to URI-Ms by first attempting to find a corresponding URI-M by querying the LANL Memento Aggregator via the Memento Protocol [13]. For each URI-R that does not have a memento, Hypercane creates one by calling ArchiveNow [2] (Figure 6).
3. Hypercane runs the mementos through spaCy to generate a list of named entities, sorted by frequency (Figure 7).
4. Hypercane runs the mementos through sumgram [9] and generates a list of sumgrams, sorted by frequency (Figure 8).
5. Hypercane scores all of the mementos’ embedded images. Images that article authors reference in HTML META tags are favored first, followed by MementoEmbed [3] score, then pixel size, color count, the ratio of width to height, and finally position on the page (Figure 9).
6. Hypercane runs the mementos through newspaper3k to extract each article’s publication date and orders the URI-Ms by that date (Figure 10).
7. Hypercane consolidates the entities, terms, image scores, and ordered URI-Ms into a JSON file containing the structured data for the summary. During this step, Hypercane uses the highest scoring image as the striking image for the summary (Figure 11). In Figure 4, the highest-ranking image is the UK Prime Minister addressing his country about the COVID-19 pandemic.
8. Raintale renders the output as Jekyll HTML based on the contents of this JSON file, a template file, and information on each memento provided by MementoEmbed (Figure 12).
9. The SHARI script publishes the summary story to GitHub Pages for distribution. Figure 13 shows the output of our dsa tweeter bot, which announces the story after publication through the @StormyArchives Twitter account.
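The image ranking in step 5 reads as a lexicographic ordering: one criterion decides, and the next is consulted only to break ties. The following is a minimal sketch of that idea, not Hypercane's implementation. The class name, field names, and the direction of each criterion (higher score, more pixels, more colors, and a larger width-to-height ratio are preferred; an earlier page position is preferred) are assumptions made for illustration, as are the sample values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageCandidate:
    # All field names are hypothetical; they mirror the criteria named in step 5.
    url: str
    in_meta_tag: bool          # referenced by the author in an HTML META tag
    embed_score: float         # stand-in for the MementoEmbed score
    pixel_count: int           # width * height
    color_count: int
    width_height_ratio: float
    page_position: int         # 0 = first image on the page


def rank_key(image):
    """Build a tuple so that sorting ascending puts the preferred image first."""
    return (
        not image.in_meta_tag,       # False sorts before True: META images first
        -image.embed_score,          # assumed: higher score preferred
        -image.pixel_count,          # assumed: larger image preferred
        -image.color_count,          # assumed: more colors preferred
        -image.width_height_ratio,   # assumed: larger ratio preferred
        image.page_position,         # assumed: earlier on the page preferred
    )


def rank_images(images):
    """Return the candidates ordered from most to least preferred."""
    return sorted(images, key=rank_key)


candidates = [
    ImageCandidate("a.png", False, 9.0, 1_000_000, 5000, 1.5, 0),
    ImageCandidate("b.png", True, 2.0, 200_000, 900, 1.3, 4),
    ImageCandidate("c.png", True, 5.0, 150_000, 700, 1.8, 7),
]

ranked = rank_images(candidates)
striking_image = ranked[0]
print([image.url for image in ranked])
```

Because Python compares tuples element by element, the META tag criterion dominates: "a.png" has the best score and size in this invented data but still ranks last, since it is the only candidate not referenced in a META tag.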
StoryGraph is a valuable resource with additional unrealized potential. We are able to create stories not only for today or yesterday but for any date back to August 8, 2017, when Nwala launched StoryGraph. Figures 14, 15, and 16 show how the world has evolved each year on StoryGraph’s launch date. In Figure 14, the biggest news story was that of North Korea threatening other nations with nuclear weapons. One year later, in Figure 15, the biggest news story is the results of several United States Congressional and gubernatorial primaries. Two years after StoryGraph’s launch, Figure 16 shows that the biggest news story is the aftermath of the 2019 shootings in El Paso and Dayton.
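The story URLs listed in the captions of Figures 3, 14, 15, and 16 share a date-based pattern. The sketch below builds a URL of that shape for an arbitrary date. The function and constant names are hypothetical, the pattern is inferred only from those four captions, and nothing here guarantees that a story has actually been published for a given date.

```python
from datetime import date

# Launch date of StoryGraph as stated in the text.
STORYGRAPH_LAUNCH = date(2017, 8, 8)

# Base path inferred from the figure captions; treat it as an assumption.
SHARI_BASE = "https://oduwsdl.github.io/dsa-puddles/stories/shari"


def shari_story_url(day):
    """Build the URL where a SHARI story for `day` would be expected to live."""
    if day < STORYGRAPH_LAUNCH:
        raise ValueError("StoryGraph has no data before its launch on 2017-08-08")
    path = day.strftime("%Y/%m/%d")
    slug = day.strftime("storygraph_biggest_story_%Y-%m-%d")
    return f"{SHARI_BASE}/{path}/{slug}/"


print(shari_story_url(date(2020, 3, 23)))
```

For March 23, 2020 this reproduces the URL given in the caption of Figure 3.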
Fig. 6: SHARI steps 1-2 illustrated with a single URI-R from the story shown in Figure 3. Here SHARI extracts the URI-R from StoryGraph and then creates a corresponding URI-M with ArchiveNow.

Fig. 7: SHARI step 3 reporting entities from the URI-M generated in Figure 6.

Fig. 8: SHARI step 4 reporting sumgrams from the URI-M generated in Figure 6.

Fig. 9: SHARI step 5 reporting image metrics from the URI-M generated in Figure 6.

Fig. 10: SHARI step 6 orders all mementos first by publication date, then memento-datetime.

Fig. 11: SHARI step 7 combines all data into a JSON format used by Raintale for storytelling.

Fig. 12: SHARI step 8 feeds the JSON file from Step 7 and a template file into Raintale to generate the story. Raintale queries MementoEmbed for information about each memento.

SHARI produces a familiar yet novel method of viewing news for a given day. SHARI can create stories for today, yesterday, and back to StoryGraph’s creation on August 8, 2017. It is different from other storytelling services like Wakelet because SHARI is entirely automated. The stories produced by SHARI are different from services like Google News or Flipboard because those tools focus on current events and personalized topics. Because StoryGraph samples content from multiple sides of the political spectrum, the SHARI process can provide a visualization of articles not tied to one interest area or even a single side’s terminology. This process works because each component is loosely coupled, has high cohesion, has explicit interfaces, and engages in information hiding. Each command passes data in the expected format to the next.

We are also exploring how to improve striking image selection for stories. One could use SHARI to consider how the same story is told in different venues. For instance, one could ask StoryGraph only to include left-leaning sources and produce a SHARI story. One could then do the same for only the right-leaning sources. With both stories, one could compare the striking images and sumgrams that SHARI produces. We are investigating how to produce and render other news stories for a given day and any given period of time. Finally, we are examining how to best visualize significant events that span substantial periods of time, like the entire COVID-19 news story. Though StoryGraph is an existing service that gathers current news, we also want to apply its algorithm directly to mementos and tell the news stories of past events like the Hurricane Katrina disaster.

LANL Memento Aggregator: https://timetravel.mementoweb.org
spaCy: https://spacy.io/
newspaper3k: https://newspaper.readthedocs.io/en/latest/
Wakelet: https://wakelet.com/
Google News: https://news.google.com/
Flipboard: https://flipboard.com/
This work was supported in part by the Institute of Museum and Library Services (LG-71-15-0077-15).
Fig. 13: The dsa tweeter bot announces the availability of new SHARI stories each day.

References
1. AlNoamany, Y., Weigle, M. C., and Nelson, M. L. Generating Stories From Archived Collections. In WebSci 2017 (Troy, New York, USA, 2017), pp. 309–318. http://doi.org/10.1145/3091478.3091508
2. Aturban, M., Kelly, M., Alam, S., Berlin, J. A., Nelson, M. L., and Weigle, M. C. ArchiveNow: Simplified, Extensible, Multi-Archive Preservation. In JCDL 2018 (Fort Worth, Texas, USA, 2018), pp. 321–322. https://doi.org/10.1145/3197026.3203880
3. Jones, S. M. A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages. https://ws-dl.blogspot.com/2018/08/2018-08-01-preview-of-mementoembed.html, 2018.
4. Jones, S. M. Raintale – A Storytelling Tool For Web Archives. https://ws-dl.blogspot.com/2019/07/2019-07-11-raintale-storytelling-tool.html, 2019.
5. Jones, S. M. Hypercane Part 1: Intelligent Sampling of Web Archive Collections. https://ws-dl.blogspot.com/2020/06/2020-06-03-hypercane-part-1-intelligent.html, 2020.
6. Jones, S. M. SHARI: StoryGraph Hypercane ArchiveNow Raintale Integration – Combining WS-DL Tools For Current Events Storytelling. https://ws-dl.blogspot.com/2020/04/2020-04-01-shari-storygraph-hypercane.html, 2020.
7. Jones, S. M., Nwala, A., Weigle, M. C., and Nelson, M. L. The Many Shapes of Archive-It. In iPres 2018 (Boston, Massachusetts, USA, 2018), pp. 1–10. https://doi.org/10.17605/OSF.IO/EV42P
8. Klein, M., Balakireva, L., and Van de Sompel, H. Focused crawl of web archives to build event collections. In WebSci 2018 (Amsterdam, Netherlands, 2018), pp. 333–342. https://doi.org/10.1145/3201064.3201085
9. Nwala, A. C. Introducing sumgram, a tool for generating the most frequent conjoined ngrams. https://ws-dl.blogspot.com/2019/09/2019-09-09-introducing-sumgram-tool-for.html, 2019.
10. Nwala, A. C., Weigle, M. C., and Nelson, M. L. Bootstrapping Web Archive Collections from Social Media. In Hypertext 2018 (Baltimore, Maryland, USA, 2018), pp. 64–72. http://doi.org/10.1145/3209542.3209560
11. Nwala, A. C., Weigle, M. C., and Nelson, M. L. Scraping SERPs for Archival Seeds: It Matters When You Start. In JCDL 2018 (Fort Worth, Texas, USA, 2018), pp. 263–272. http://doi.org/10.1145/3197026.3197056
12. Nwala, A. C., Weigle, M. C., and Nelson, M. L. 365 Dots in 2019: Quantifying Attention of News Sources. Tech. Rep. arXiv:2003.09989, 2020. https://arxiv.org/abs/2003.09989
13. Van de Sompel, H., Nelson, M., and Sanderson, R. RFC 7089 - HTTP Framework for Time-Based Access to Resource States – Memento. https://tools.ietf.org/html/rfc7089, Dec. 2013.
Fig. 14: SHARI output for August 8, 2017 - the launch date of StoryGraph
URL: https://oduwsdl.github.io/dsa-puddles/stories/shari/2017/08/08/storygraph_biggest_story_2017-08-08/
Fig. 15: SHARI output for August 8, 2018 - a year after the launch date of StoryGraph
URL: https://oduwsdl.github.io/dsa-puddles/stories/shari/2018/08/08/storygraph_biggest_story_2018-08-08/

Fig. 16: SHARI output for August 8, 2019 - two years after the launch date of StoryGraph
URL: https://oduwsdl.github.io/dsa-puddles/stories/shari/2019/08/08/storygraph_biggest_story_2019-08-08/
Table 1: The 17 news sources analyzed by StoryGraph