Analytic Provenance Datasets: A Data Repository of Human Analysis Activity and Interaction Logs
Sina Mohseni, Andrew Pachuilo, Ehsanul Haque Nirjhar, Rhema Linder, Alyssa Pena, Eric D. Ragan
Sina Mohseni, Andrew Pachuilo, Ehsanul Haque Nirjhar, Rhema Linder: Department of Computer Science & Engineering, Texas A&M University
Alyssa Peña: Department of Visualization, Texas A&M University
Eric D. Ragan: Department of Visualization and Department of Computer Science & Engineering, Texas A&M University
Datasets are available online at https://research.arch.tamu.edu/analytic-provenance/datasets/ for research purposes.
Abstract
We present an analytic provenance data repository that can be used to study human analysis activity, thought processes, and software interaction with visual analysis tools during exploratory data analysis. We conducted a series of user studies involving exploratory data analysis scenarios with textual and cyber security data. Interaction logs, think-aloud transcripts, videos, and all coded data from these studies are available online for research purposes. Analysis sessions are segmented into multiple sub-task steps based on user think-alouds and the video and audio captured during the studies. These analytic provenance datasets can be used for research involving tools and techniques for analyzing interaction logs and analysis history. By providing high-quality coded data along with interaction logs, it is possible to compare algorithmic data processing techniques to ground-truth records of analysis history.
Author Keywords
Analytic provenance; Text analysis; Cyber analysis; User interaction logs; Eye tracking; Online dataset.
ACM Classification Keywords
H.5.m [Information interfaces and presentation]: Miscellaneous

Introduction

Visual analytic tools assist analysts with exploratory inspection of large amounts of data to identify, understand, and connect pieces of information. At a meta level, understanding analysis processes is important for improving tools, communicating analysis strategies, and explaining the evidence.
Provenance for data analysis tracks the history of the analysis, including the progression of findings, interactions, data inspection, and visual state [6, 5]. Analyzing user interactions and data provenance reveals more information about the analysis process, helps in understanding how the user discovers insights, and is essential for understanding analysis behavior during open-ended data exploration tasks.

Designing visualizations and techniques to study analysis processes requires sample analysis records for research and development. Thus, our work contributes multiple analytic provenance datasets captured from user studies, with high-quality capture of participant interaction logs, think-aloud comments, screen capture, and transcribed notes from qualitative coding of sample analysis sessions from multiple data analysis scenarios. To collect the provenance records, we conducted a set of user studies using basic visual data analysis tools appropriate for each scenario but generalizable enough to have similarities to many commonly used visualization software. The datasets are fully anonymized, and records are transcribed for easy use by researchers interested in studying human data analysis behaviors. Captured videos, user interaction logs, and insight codings for all studies are available online for research purposes at https://research.arch.tamu.edu/analytic-provenance/. Currently, our provenance data repository contains records from two types of data analysis scenarios: textual intelligence analysis and multidimensional cyber security analysis.

Text Analysis Provenance Dataset
Our text analysis data is based on intelligence-analysis investigations from the publicly available VAST Challenge datasets. Each study session involved one of three intelligence analysis scenarios selected from the VAST Challenge data sets [7], a set of synthetically created data sets and analysis scenarios designed to be similar to real-world cases and problems. Specifically, our studies used data from the 2010 mini-challenge 1, 2011 mini-challenge 3, and 2014 mini-challenge 1. All datasets contain various text records such as news articles, emails, telephone intercepts, bank transaction logs, and web blog posts. For example, the 2014 data set involves articles about events and people related to missing individuals and violence related to a protest group on a fictional island.

Participants were tasked with gathering information and finding connections between events, people, places, and times in the data sets. Text documents varied in length from single sentences up to multiple paragraphs. While all of the data was in plain-text format, some of the documents primarily consisted of numerical data related to financial transactions. The 2010 data set had a total of 102 documents. Due to the larger sizes of the 2011 and 2014 data sets, we reduced the number of included documents to accommodate the constraints of 90-minute user study sessions. We used a subset of each data set to limit these two data scenarios to 152 documents.

All participants were university students from varying majors, with ages ranging from 20 to 30. None of the participants were experts in analytic tasks. Participants used the document explorer tool and our cyber analysis tool on a desktop computer with two 27-inch monitors to analyze the data for 90 minutes. At the beginning of each session, the text explorer tool and the analysis task were explained to the participants, who then had 15 minutes to work with the tool and ask questions prior to the start. Currently, our provenance repository includes data records from 24 participants (6 female, 18 male) for the text analysis sessions.

Figure 1: Screenshot of the document analysis tool used for collecting provenance data in the text analysis scenarios. All text documents are listed in a collapsed format in a random order on the left monitor at the beginning of the study. Documents have titles, and users are able to drag and reposition documents in the explorer tool space. In this tool, users can write notes, move and link documents, highlight text, and search for keywords.

To complete the analysis task, participants used a basic visual analysis tool (see Figure 1). The tool supports spatial arrangement of articles, the ability to link documents, keyword searching, highlighting, and note-taking. When loading the data in our document explorer tool, each document starts collapsed with only its title visible. Users could “open” any document by double-clicking the title bar or by clicking a dedicated button on the document's title bar, which would expand the document to a window containing its text. The document could be collapsed back to the title in the same way. Within an open document, users could highlight text by selecting it, right-clicking, and activating a menu item. When a window has highlighted text, the window could be “reduced to highlight”, which would hide all text in the document except for the highlighted content. At the beginning of the study, documents were arranged on the left screen without a specific order or grouping. Users clicked and dragged documents, freely re-arranging them in the workspace. They could also create editable note windows in the same workspace.
When using the search functionality, both matching words within windows and the windows themselves were highlighted. Users could also draw connection lines across document windows, which created a line to denote relationships visually.

All user interactions at a rudimentary level, such as mouse movements and clicks, are captured during the study using the text explorer. Later, we transform the basic data log recorded from the explorer tool into nine types of user actions; see Table 1.

Table 1: Types of interactions logged from the text analysis tool during the user studies.

Interaction        Purpose
Open documents     Explore new articles
Read documents     Explore new information
Search             Keyword search
Highlight          Highlight document text
Bookmark           Select documents
Connect            Linking documents and notes
Move documents     Arrange documents on screen
Brush titles       Review document titles
Creating notes     Making sticky notes
Writing notes      Writing notes
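To make the transformation concrete, the mapping from raw input events to the higher-level action types in Table 1 could be sketched as follows. The event schema (`time`, `event`, `target` fields) and the specific mapping pairs are illustrative assumptions, not the actual schema of the released logs.

```python
# Hypothetical mapping from raw (event, target) pairs to the action
# types of Table 1; field names are assumptions, not the study's schema.
RAW_TO_ACTION = {
    ("dblclick", "title"):    "Open documents",
    ("hover", "text"):        "Read documents",
    ("keypress", "searchbox"): "Search",
    ("menu", "highlight"):    "Highlight",
    ("click", "bookmark"):    "Bookmark",
    ("drag", "link"):         "Connect",
    ("drag", "document"):     "Move documents",
    ("hover", "title"):       "Brush titles",
    ("click", "new_note"):    "Creating notes",
    ("keypress", "note"):     "Writing notes",
}

def to_actions(raw_events):
    """Collapse a raw event stream into labeled actions, dropping
    low-level events (e.g., bare mouse moves) with no mapping."""
    actions = []
    for ev in raw_events:
        label = RAW_TO_ACTION.get((ev["event"], ev["target"]))
        if label:
            actions.append({"time": ev["time"], "action": label})
    return actions

events = [
    {"time": 12.4, "event": "dblclick", "target": "title"},
    {"time": 15.0, "event": "hover", "target": "text"},
    {"time": 15.2, "event": "mousemove", "target": "canvas"},  # dropped
]
print(to_actions(events))
```

A table-driven mapping like this keeps the raw log untouched, so alternative action vocabularies can be derived from the same recording.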
We associate analytical reasoning with the different interactions available to the user, and later use these associations to modify the topic models. Based on prior observations that mouse input can correspond with attention (e.g., [1, 2, 4]), we treat hovering the mouse over new document titles as intent to explore new information, and hovering the mouse over document text as reading the article.
Multidimensional Data Analysis Dataset
The multidimensional data analysis scenario currently has provenance records from 10 participants. The analysis scenario used a cyber analysis dataset taken from the 2009 VAST Challenge [3], mini-challenge 1. The backstory of the scenario involves an employee of a fictional embassy trying to exfiltrate sensitive information to an outside criminal organization using office computers. Participants explored this tabular multidimensional data set with a visual analysis tool comprised of multiple coordinated views. Views include histograms of the traffic data, a network graph, a table of IP traffic details, and a workstation layout showing proximity card status. Participants were asked to find the suspicious IP addresses used to transfer data to the criminal organization by exploring different views of the tool.

Figure 2: Screenshot of the cyber analysis tool used for collecting provenance data in the multidimensional analysis scenario. Charts on the top left are the detailed histogram and overview histogram, respectively. A network graph is shown in the bottom left. Boxes in the top right corner show the office view with slider tools. Below the office view is an information panel. Finally, at the bottom right, an IP traffic table shows detailed network data.

Participants explored the multidimensional dataset to determine the suspicious behavior. A static view of the tool is shown in Figure 2, and its six different views are described in Table 2. Due to the large amount of data, filtering is required to find specific patterns. Brushing and linking is enabled in the overview histogram view: any portion of it can be selected, and the detailed histogram will change according to the selection. The office view shows the current status of the employees inside the embassy in a specific time frame using color codes. A slider tool also allows the participant to select a specific time and day to check employee status and network traffic. Moreover, participants can select specific IP addresses from a multi-page IP table for future reference. User interactions with the tool are recorded in the form of mouse interactions and eye areas of interest (AOIs). Mouse tracking is done within the tool, while eye tracking is performed using a Tobii EyeX, a standard eye tracking device that tracks eye gaze fixation points.

Table 2: Types of eye and mouse interactions logged from the cyber analysis tool during the user studies.

Eye Area of Interest (AOI)   Mouse interaction data logs
Overview histogram           Brush start and end
Detailed histogram           Mouse enter and bar click
Network graph                Mouse enter
Office view                  Mouse click and slider move
Information box              Mouse hover
IP table                     Page change and row select
Data Coding
In order for the provenance datasets to be useful for a wide range of research purposes, we prioritized the capture of users' thought processes and actions throughout the analysis activities. We used a think-aloud protocol to capture participants' thoughts and insights during the study. We transcribed users' think-aloud comments by watching the screen-captured videos of each session, along with notes from the research team about observations from the study sessions. Transcripts include all user actions, comments, and timestamps of events.
Coding for Text Analysis Dataset
Two members of the research team reviewed all analysis records and identified times where the user changed the topic of the investigation. We save all topic-change moments during the exploratory task and code them as topic-changing (inflection) points. For example, participant P7 was working on the third dataset and said “I'm looking for these caterers at the executive breakfast” and searched for “caterers”. The participant continued reading documents from this search for about 10 minutes. Then the user said “I'm trying to figure out what the government was doing at the company”, which is a change in the topic of investigation. The user looked through titles and picked a couple of documents about the government for about 8 minutes. While reading new documents, P7 found the name “Edward” and searched for incidents related to this name for the next 4 minutes. There are also moments when the user is done with the current topic and wants to change the subject. For instance, participant P3, working on the second dataset, said “Let's search for some keywords” after 3 minutes of thinking and taking no actions. Then the user searched for the keyword “thread” to find new articles about it. Also, in many cases, topic changing does not include think-alouds, such as opening a random document and continuing with it, writing a note about an old topic, or returning to an old topic after a while.

Coding for Cyber Analysis Dataset
A similar approach was used to identify the inflection points in the cyber analysis data. The research team identified key points by examining the task video and the audio of the think-aloud process. Heuristics for marking the inflection points relied on the change in strategy attempted by the participant to complete the task. A change in strategy can be identified by the use of different views, different focal attributes within a view, or other means based on observations or verbal comments from the participant.

For example, participant Cyber-F used the overview histogram to select some random times and tried to find unusual traffic patterns in the detailed histogram. After spending about 5 minutes, the participant moved on to a new strategy involving the office view. Cyber-F then started using the slider tool to find the proximity card status of different employees to learn their current positions and cross-check with the IP table. Another participant, Cyber-J, started the analysis by selecting each IP address and trying to find unusual traffic patterns. But with the large amount of data in the IP traffic table, the participant moved on to a new strategy after about 7 minutes. The new strategy for Cyber-J involved looking at the network graph to find unique destination IP addresses with large traffic. These changes in strategy are noted as inflection points by the coders and included in the transcripts.
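Once inflection points are coded, they can be used to segment a session's event timeline into strategy phases, one phase per interval between inflection points. A minimal sketch, with timestamps and actions loosely modeled on the Cyber-J example above (the specific times are made up):

```python
# Sketch: split a (time, action) event stream into phases at coded
# inflection-point times. All timestamps below are illustrative.
def segment(events, inflections):
    """Return one list of events per phase, split at inflection times."""
    bounds = sorted(inflections) + [float("inf")]
    phases, i = [[] for _ in bounds], 0
    for t, action in sorted(events):
        while t >= bounds[i]:
            i += 1          # advance to the phase containing time t
        phases[i].append((t, action))
    return phases

events = [(30, "select IP"), (200, "select IP"),
          (450, "view graph"), (500, "view graph")]
phases = segment(events, inflections=[420])  # strategy change at ~7 min
print(len(phases[0]), len(phases[1]))        # events before / after
```

Segmenting this way lets per-phase statistics (duration, dominant view, action mix) be compared against the coders' ground-truth transcripts.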
Online Dataset
These analytic provenance datasets can be used for research involving tools and techniques for analyzing interaction logs and analysis history. By providing high-quality coded data along with interaction logs, it is possible to compare algorithmic data processing techniques to ground-truth records of analysis history. The Provenance Analytics Dataset is free and publicly available for research purposes. Captured videos, user interaction logs, the analysis tools used in the studies, and transcripts from think-aloud comments and observations from all studies are available online at https://research.arch.tamu.edu/analytic-provenance/.
Acknowledgements
This material is based on work supported by NSF 1565725.
References

[1] Mon Chu Chen, John R. Anderson, and Myeong Ho Sohn. 2001. What Can a Mouse Cursor Tell Us More?: Correlation of Eye/Mouse Movements on Web Browsing. In CHI '01 Extended Abstracts on Human Factors in Computing Systems (CHI EA '01). ACM, New York, NY, USA, 281–282.
[2] Jeremy Goecks and Jude Shavlik. 2000. Learning users' interests by unobtrusively observing their normal behavior. In Proceedings of the 5th International Conference on Intelligent User Interfaces. ACM, 129–132.
[3] Georges Grinstein, Jean Scholtz, Mark Whiting, and Catherine Plaisant. 2009. VAST 2009 challenge: an insider threat. In IEEE Symposium on Visual Analytics Science and Technology (VAST 2009). IEEE, 243–244.
[4] Sampath Jayarathna, Atish Patra, and Frank Shipman. 2015. Unified Relevance Feedback for Multi-Application User Interest Modeling. In Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM, 129–138.
[5] Sina Mohseni, Alyssa Pena, and Eric D. Ragan. 2017. ProvThreads: Analytic Provenance Visualization and Segmentation. Proceedings of IEEE VIS (2017).
[6] Eric D. Ragan, Alex Endert, Jibonananda Sanyal, and Jian Chen. 2016. Characterizing provenance in visualization and data analysis: an organizational framework of provenance types and purposes. IEEE Transactions on Visualization and Computer Graphics 22, 1 (2016), 31–40.
[7] Jean Scholtz, Mark A. Whiting, Catherine Plaisant, and Georges Grinstein. 2012. A reflection on seven years of the VAST challenge. In