Explainable Patterns: Going from Findings to Insights to Support Data Analytics Democratization
Leonardo Christino, Martha Dais Ferreira, Asal Jalilvand, and Fernando V. Paulovich, Member, IEEE
Fig. 1. Explainable Patterns interface setup to explore world demographic indicators. Using the line chart visualization, users can select patterns of interest (A). A world map is then rendered to display if other places in the globe present similar behavior (B), and textual explanations are created using an external source of information to describe the selection (C). Also, recommendations of other countries and indicators are shown (D), offering an environment to promote data exploration without relying on domain experts.
Abstract—In the past decades, massive efforts involving companies, non-profit organizations, governments, and others have been put into supporting the concept of data democratization, promoting initiatives to educate people to confront information with data. Although this represents one of the most critical advances in our free world, access to data without concrete facts to check, or without an expert to help in understanding the existing patterns, hampers its intrinsic value and lessens its democratization. So the benefits of giving full access to data will only be impactful if we go a step further and support the Data Analytics Democratization, assisting users in transforming findings into insights without the need of domain experts, to promote unconstrained access to data interpretation and verification. In this paper, we present Explainable Patterns (ExPatt), a new framework to support lay users in exploring and creating data storytelling, automatically generating plausible explanations for observed or selected findings using an external (textual) source of information, avoiding or reducing the need for domain experts. ExPatt applicability is confirmed via different use cases involving world demographics indicators and Wikipedia as an external source of explanations, showing how it can be used in practice towards the data analytics democratization.
Index Terms—Storytelling, Visualization to text, Data Analytics Democratization, Interpretability
1 INTRODUCTION
It is undeniable that the past decades represent a revolution without precedent in human history for information and knowledge dissemination. The amount of data available and accessible is increasing at a pace never seen before, with massive efforts involving companies, non-profit organizations, governments, and others to support its democratization. The concept of giving access to everyone, everywhere, anytime, unthinkable in the past, is becoming the norm. Initiatives like Gapminder [25] to help educate people to confront information with data have been showing how wrong our premises about the world are, biased by outdated information that does not reflect the current reality [27]. It is a paradigm shift in the way we handle information, even when official and (self-declared) credible sources release it.

Data democratization is one of the most critical advances in our free world. However, most of the initiatives are instrumental when hypotheses are known or when a domain expert is available to explain what we are seeing, being limited when such assumptions do not hold [28]. In exploratory scenarios, when users access data without concrete questions or facts to check, the lack of a domain expert to help in understanding the data patterns or findings hampers the intrinsic value of data, lessening the advantages of its democratization.

• L. Christino, M. D. Ferreira, A. Jalilvand, and F. V. Paulovich are with Dalhousie University. E-mail: {christinoleo, asal.jalilvand, daismf, paulovich}@dal.ca.
Manuscript received xx xxx. 201x; accepted xx xxx. 201x. Date of Publication xx xxx. 201x; date of current version xx xxx. 201x. For information on obtaining reprints of this article, please send e-mail to: [email protected]. Digital Object Identifier: xx.xxxx/TVCG.201x.xxxxxxx
Therefore, the benefits of giving full access to data will only be impactful for the general public if we go a step further and support the Data Analytics Democratization, assisting users in transforming findings into insights without the need of experts, to promote unconstrained access to data interpretation and verification.

The idea of reducing the need for experts in data analytics pipelines is a recent trend in the machine learning community with AutoML methods [3, 11, 12, 13, 23], replacing them by automatic and exhaustive procedures to build computational models. A similar trend is also observed in the visualization community with the proliferation of approaches to facilitate and automate parts of the storytelling process [7, 17, 18], including automatic infographics creation [8] and pattern identification [9, 29]. Although representing a step towards the data analysis democratization, both movements still rely on domain experts to provide explanations for the findings observed in the data, limiting the extent lay users can understand and take advantage of the produced results, even in domains of general public interest like government open data initiatives.

In this paper, we propose
Explainable Patterns (ExPatt), a new framework to explore and analyze world demographics indicators, supporting lay users in understanding and creating storytelling by automatically generating plausible explanations for observed patterns without the need for domain experts. To provide the explanations, an external textual source of information is automatically linked to the data using different natural language processing and pattern analysis strategies, so that every time a user selects a pattern or finding in a visual representation, potential explanations are derived to assist in understanding what is being observed. ExPatt also recommends similar indicators (or data sets) based on the selected pattern, enabling users to go from finding to finding to compose a story or narrative supported by automated explanations, representing another step towards the full implementation of Data Analytics Democratization.

In summary, the main contributions of this paper are:

• A novel query-based strategy to link data patterns with (textual) information contained in an external source of data;
• A recommendation strategy to go from textual information back to related data using different natural language methods;
• An approach to support users in going from findings to (textual) explanations and from explanations back to data, supporting an exploratory cycle to create narrative storytelling without demanding domain experts in the process.

The remainder of the paper is structured as follows. In Sect. 2, we discuss related work involving visualization techniques that seek to reduce the need for experts in the data analytics process. In Sect. 3, we formalize the problem and outline the ExPatt solution. In Sect. 4, we present the ExPatt interface design and a usage scenario showing how it can be used in practice for user-driven data-based storytelling. Also, an evaluation is presented, indicating a good degree of stability and reproducibility in the results. Finally, in Sect. 5 we discuss ExPatt limitations, and in Sect. 6 we draw our conclusions.
2 RELATED WORK
In general, two different experts are heavily involved in the data analytics process. One, the data science expert, is accountable for developing and deploying the analytics pipeline. The second, the domain expert, is responsible for using the developed tools to make sense of the data under analysis. In a data analytics democratization scenario, where the goal is to promote unconstrained access to data interpretation and verification for lay users or the general public, the need to reduce the dependency on both types of experts is becoming more and more evident. Aiming at addressing this issue, an emerging field has been attracting considerable attention inside the machine learning area, the Automated Machine Learning (AutoML) [3, 11, 12, 13, 23]. AutoML strategies reduce or eliminate the need for data science experts, focusing on automating the process to find the most suitable computational models and their parametrizations. A similar trend has also been observed in the visualization domain with some strategies that seek to automate parts of the visual analytics process to better support narrative and storytelling.

One example is the work presented by Cui et al. [8]. In this paper, the authors introduce an approach to automatically generate infographics from natural language statements, providing visualizations to represent storytelling from textual information. Similarly, Lin et al. [17] proposed a method to automatically render visual representations for news articles, including a contextual representation to improve the understanding of the extracted information. Although both approaches are related to narrative visualization and storytelling, they do not provide further interpretations or explanations for the data, limiting the extent a non-domain-expert user can understand and take advantage of the visual representations.
Gapminder [26], the well-known application for exploring world indicators data through visualizations, has a similar limitation, the lack of support to understand the observed data patterns.

Aiming to enrich storytelling with explanations, Bryan et al. [7] present an approach called Temporal Summary Images (TSIs). TSIs include an automatic annotation strategy to visually indicate relevant regions and features, connecting data exploration with storytelling. Although it provides textual explanations for the visualizations, their content needs to be provided by the user or domain expert. Tang et al. [29] also present a strategy to improve storytelling, supporting automatic identification of findings or patterns. Likewise, Ding et al. [9] propose a technique to automatically identify interesting patterns in multidimensional data, avoiding the detection of easily inferable common patterns. Although some level of automation is added to the data analytics process, all these techniques still rely upon domain experts to explain the discovered patterns, limiting the extent lay users can benefit.

In terms of approaches that unify visualization and natural language processing methods, Yu and Silva [33] propose the FlowSense system. FlowSense is a natural language interface to assist dataflow diagram construction where users can expand and adjust diagrams using plain English. Luo et al. [18] also proposed a query-based system called DeepEye. DeepEye supports data exploration based on an information retrieval strategy where users provide a textual query, and the system returns the most meaningful visualizations. Although both provide support for storytelling considering textual and visual information, the textual query and the visual search do not offer explanations for the findings or features found in the data. Metoyer et al.
[20] also integrate text and visualization elements, using natural language processing strategies to automatically link narrative components (who, what, when, where) to data in an offline approach. Although representing a fascinating approach to connect textual information with data, it does not focus on using text to explain findings in the data but to link complementary information to better support storytelling.

In summary, the existing approaches to reduce the dependency on experts, although representing considerable advances in replacing data science experts, still require domain experts to provide explanations for the patterns or findings discovered in the data. For companies or in applied domains where the final user and domain expert are typically the same, reducing the need for data science experts is highly beneficial. However, in data democratization initiatives, like Gapminder and others, where users are lay people or the general public without deep knowledge of the domain under analysis, replacing data science experts is only one step to make them impactful. Such initiatives will only reach their ultimate goal if the need for domain experts is also reduced, henceforth democratizing the data analytics. This second step is the novelty of our framework, offering support to automatically derive plausible explanations for observed or selected patterns and unusual behaviors. We address this challenge by linking the data sets under analysis with external sources of textual information, from where the explanations are derived. To promote the connection between data and the external sources, we devise a novel scheme based on pattern analysis and natural language processing strategies to translate the findings selected by a user into textual queries that are used to fetch the explanations.
Our framework also recommends similar data sets based on the selected pattern or recovered explanations, allowing users to navigate through the data and build up storytelling to understand its behavior, patterns, and features, replacing domain experts by (external) textual sources of information.
3 METHODOLOGY
In this paper, we present Explainable Patterns (ExPatt), a novel framework designed to support data exploration and interpretation of world demographics indicators time-series. Based on user-driven interactive selections of patterns (or findings) of interest, ExPatt supports the identification of related indicators and the creation of explanations automatically extracted from an external corpus of textual data containing general information about world history. Our reasoning is to allow users to identify and select findings in the time-series and automatically link them to the textual information, offering potential explanations for the selected findings. The storytelling cycle is then completed by using the retrieved textual information to recommend other related time-series indicators, effectively implementing a complete data analytics process without requiring experts to support it.

To set notation, consider a dataset $I = \{I_1, \ldots, I_N\}$ of $N$ indicators, where each indicator $I_r$ is composed of a set of time-series $I_r = \{S_{r1}, \ldots, S_{rM_r}\}$ representing real-valued measures over time of $M_r$ distinct countries. In our framework, the user initially selects a time-series $S_{rc} = \{x_1, \ldots, x_{K_c}\} \in \mathbb{R}^{K_c}$ that represents the indicator $r$ of a particular country $c$. The selected time-series is then displayed using a line chart (Fig. 1(A)), where the user can select any finding of interest, that is, any sequence of consecutive values $F_{rc} = \{x_{k \geq 1}, x_{k+1}, \ldots, x_{k' \leq K_c}\} \subset S_{rc}$. Every time a pattern is selected by the user, a map is colored to indicate the similarity of the selected finding of the country under analysis with other countries considering the same indicator (Fig. 1(B)) and to suggest highly similar countries (Fig. 1(D)). Such selection is also translated into a query for the external source of information. The fetched information (documents) is then summarized using several natural language processing strategies to aid with the selected finding explanation and displayed to the user (Fig.
1(C)). Also, topic terms are extracted from the recovered documents and used to suggest other related indicators of the same country or other countries. These steps are detailed in the next sections.

One of the core concepts in our framework is to automatically link the selected finding to textual information that (potentially) describes it. As discussed, this process is performed by translating the selected findings into textual queries that are submitted to a search engine. In this process, all information related to the user selection, that is, indicator description, country, time period, and pattern type, is combined and decoded into the text query. This text query is composed of a set of terms, each one representing a different attribute of the data, taking the form

$$q = \bigcup_{t_d \in D_r} t_d \;\cap\; \bigcup_{t_c \in C_r} t_c \;\cap\; \bigcup_{t_p \in P_r} t_p \;\cap\; \bigcup_{t_e \in E_r} t_e, \qquad (1)$$

where $D_r = \{t_{d_1}, t_{d_2}, \ldots\}$ is the set of names or tags associated with the indicator $r$ under analysis (Sect. 3.2.1), $C_r = \{t_{c_1}, t_{c_2}, \ldots\}$ is the country name, country citizen adjectives (Sect. 3.2.2), or world regions, $P_r = \{t_{p_1}, t_{p_2}, \ldots\}$ is the pattern or finding description (Sect. 3.2.3), and $E_r = \{t_{e_1}, t_{e_2}, \ldots\}$ is the time period. The strategy to compose $E_r$ is straightforward: all the years corresponding to the selected interval are added as terms. The processes to compose the other sets are more complex and are described in the following sections.

Each indicator is related to a different aspect of the countries. To compose the set $D_r$ of terms representing such an aspect, we apply a text mining approach to assign to each indicator $I_n$ a set of related tags $G_n = \{g_1, g_2, \ldots\}$. For example, the "life expectancy" indicator, besides being represented by the terms "life" and "expectancy", could also be represented by terms like "longevity" and "lifetime", resulting in $G_{life\ expectancy} = \{life, expectancy, longevity, lifetime\}$.
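As a rough illustration of Equation 1 (function names and the AND/OR rendering are our own; the actual ExPatt query syntax may differ), the four term sets can be combined into a single boolean text query:

```python
def build_query(indicator_tags, country_terms, pattern_terms, years):
    """Combine the four term sets D_r, C_r, P_r, E_r of Eq. 1 into one
    text query: terms inside a set are alternatives (the unions), and
    the four sets are jointly required (the intersections)."""
    groups = [indicator_tags, country_terms, pattern_terms,
              [str(y) for y in years]]
    # One parenthesized OR-group per attribute, AND-ed together.
    return " AND ".join("(" + " OR ".join(g) + ")" for g in groups if g)
```

For the Fig. 1 scenario, `build_query(["oil", "production"], ["iran", "iranian"], ["decrease", "valley"], range(1973, 1982))` would yield a query requiring one term from each attribute group.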
In this way, each indicator has its semantic meaning defined by the tags associated with it.

In this paper, user intervention is expected and encouraged with the purpose of best emulating how a specific user would encode the indicators' relevant characteristics. In that sense, we allow users to select the $G_n$ tags from a list of words to compose the set $D_r$ used in the text query. $G_n$ tags are automatically generated every time a new indicator is uploaded into the system, in which the indicator's name is tokenized and used (after removing stopwords) on a pre-trained GloVe model [24] to find the vector representation of each token. Then, we average the vectors to get one vector that represents the combined semantics of all tokens and search the external corpus for the $K$ most similar terms using the cosine distance, where $K$ is a system parameter. Also, we add synonyms and antonyms of the initial tokens to the candidate list using WordNet [10], resulting in a more comprehensive list of terms to represent the indicators' semantics. Notice that this list of words is just a suggestion, and users are free to use other, not suggested, terms as tags for an indicator.

To create the set $C_r$ containing the country information, the country name of the time-series $S_{rc}$ under analysis is added along with other related country names or world regions. Although some findings and their respective explanations/reasons may be specific to a single country, other findings occur in multiple countries simultaneously. When this is the case, their explanations can either be specific to each individual country or to the aggregation of countries, that is, to the continent or sub-continent the country belongs to.
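The tag suggestion step described above (averaging token vectors and ranking vocabulary terms by cosine similarity) can be sketched as follows; the two-dimensional toy vectors stand in for a real pre-trained GloVe model, and the WordNet synonym/antonym expansion is omitted:

```python
import numpy as np

# Toy stand-in for a pre-trained GloVe model: word -> vector.
# A real deployment would load actual GloVe embeddings instead.
VECS = {
    "life":       np.array([0.9, 0.1]),
    "expectancy": np.array([0.7, 0.3]),
    "longevity":  np.array([0.8, 0.2]),
    "oil":        np.array([0.0, 1.0]),
}

def suggest_tags(tokens, k=2):
    """Average the tokens' vectors and return the k nearest vocabulary
    terms by cosine similarity (excluding the tokens themselves)."""
    mean = np.mean([VECS[t] for t in tokens], axis=0)

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    cands = [(w, cos(mean, v)) for w, v in VECS.items() if w not in tokens]
    return [w for w, _ in sorted(cands, key=lambda p: -p[1])[:k]]
```

With the toy vectors above, the tokens of "life expectancy" average to a vector closest to "longevity", mirroring the $G_{life\ expectancy}$ example.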
This generalization is important to best match the external corpus, since some textual information may only contain more general references to the geographical location where a specific event occurred.

To complement $C_r$ with the continent or sub-continent information, we use similarity to analyse the set of time-series $S_{r1}, \ldots, S_{rM_r}$ that compose the indicator $I_r$ for different countries, considering only the time interval of the selected finding $F_{rc}$. A continent or sub-continent name is added to $C_r$ if the average similarity between the finding $F_{rc}$ and the same time interval for all other countries belonging to the same continent of the country under analysis is larger than a threshold, defined as follows

$$\frac{1}{|W_c|} \sum_{v \in W_c} sim(F_{rc}, F_{rv}) > \alpha, \qquad (2)$$

where $F_{rc} \subset S_{rc}$ is the selected finding in the indicator $r$ of country $c$, $W_c$ is the list of countries belonging to the same continent of country $c$, $F_{rv} \subset S_{rv}$ represents the time-series interval of the same indicator $r$ of a different country $v \in W_c$, $\alpha$ is the threshold, and $sim(\cdot, \cdot)$ denotes the similarity between the interval time-series (see Sect. 3.4). The threshold $\alpha$ was adjusted empirically, resulting in continent names (africa, asia, america, europe, australia) to be used in the query.

The finding description, set $P_r$, is a list of terms describing the trend and the pattern of the selected finding $F_{rc} = \{x_{k \geq 1}, x_{k+1}, \ldots, x_{k' \leq K_c}\} \subset S_{rc}$. The type of trend can be ascending, descending, or stable, while the pattern can be peak, valley, and neutral, resulting in 9 different combinations. Fig. 2 presents examples of these combinations.

To define the trending type, we apply the moving average (MA) strategy. MA is a well-known technique in time-series analysis to define if the series is stationary or not, providing the trend estimation [6].
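Equation 2 needs a plug-in similarity, and Sect. 3.4 offers two: the Pearson correlation (Eq. 7) and a DTW-based similarity (Eq. 8). A self-contained sketch of the continent test together with both measures (our own naming, plain Python over 1-D sequences; not ExPatt's actual code):

```python
def pearson(f, g):
    """Pearson correlation between two equal-length patterns (Eq. 7)."""
    n = len(f)
    mf, mg = sum(f) / n, sum(g) / n
    sf = (sum((x - mf) ** 2 for x in f) / n) ** 0.5
    sg = (sum((y - mg) ** 2 for y in g) / n) ** 0.5
    return sum((x - mf) * (y - mg) for x, y in zip(f, g)) / (n * sf * sg)

def dtw(f, g):
    """Classic O(|f|*|g|) dynamic time warping distance between two
    sequences, with absolute point-wise cost."""
    inf = float("inf")
    n, m = len(f), len(g)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(f[i - 1] - g[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def dtw_sim(f, g):
    """Map the unbounded DTW distance into a (0, 1] similarity (Eq. 8)."""
    return 1.0 / (1.0 + dtw(f, g))

def continent_matches(finding, peer_findings, sim, alpha):
    """Equation 2: True if the average similarity between the selected
    finding and the same interval in all peer countries of the
    continent exceeds the threshold alpha."""
    if not peer_findings:
        return False
    return sum(sim(finding, p) for p in peer_findings) / len(peer_findings) > alpha
```

For instance, `continent_matches(f, peers, dtw_sim, 0.8)` would add the continent name to $C_r$ only when the regional behavior tracks the selection closely.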
MA first transforms the time-series that defines the finding $F_{rc}$ into a new series $F'_{rc}$, setting its values $x'_i \in F'_{rc}$ to the average value in the time interval $x_{i-q}, \ldots, x_i, \ldots, x_{i+q}$ (here, $q$ was defined empirically). We then take the differences between consecutive values of $F'_{rc}$ and sum the normalized values as follows

$$tr = \sum_{x'_i \in F'_{rc}} \frac{x'_i - x'_{i-1}}{|x'_i - x'_{i-1}|}. \qquad (3)$$

Fig. 2. Examples of combinations of types of trends and patterns in user selections. Each dotted line shows the patterns detected if the user selects the specific interval.

Based on that, the trend type is defined as

$$trend = \begin{cases} ascending, & tr > 0 \\ descending, & tr < 0 \\ stable, & \text{otherwise.} \end{cases} \qquad (4)$$

To define the pattern type, we use a peak detection method that attempts to identify a local maximum by comparing neighboring values [31, 32]. Using this method, we assign a pattern factor $pf$ to the finding $F_{rc}$ computing the following equation

$$pf = \frac{|F_{rc}| \times (w^+ - w^-)}{\sigma(F_{rc})}, \qquad (5)$$

where $w^+$ is the weight obtained from the peak detection method representing the peak width and prominence, $w^-$ is the weight obtained from the peak detection method considering the inverse of $F_{rc}$ (multiplying its values by $-1$), and $\sigma(F_{rc})$ is the standard deviation of $F_{rc}$. Based on the pattern factor $pf$, the type of pattern is defined as

$$pattern = \begin{cases} peak, & pf > \lambda_1 \text{ and } \sigma(F_{rc}) > \lambda_2 \\ valley, & pf < -\lambda_1 \text{ and } \sigma(F_{rc}) > \lambda_2 \\ unstable, & \lambda_1 \geq pf \geq -\lambda_1 \text{ and } \sigma(F_{rc}) > \lambda_2, \end{cases} \qquad (6)$$

where $\lambda_2$ is a threshold to consider if a finding contains a pattern or not and, if a pattern is detected, $\lambda_1$ is a threshold to define if the finding is a peak, a valley, or an unstable oscillation. Both thresholds were set empirically. Once identified, the trend and pattern types are translated into the terms that compose $P_r$. This process is the same presented in Sect. 3.2.1, in which a set of tags is associated with each identifier.
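Under our reading of Eqs. 3 and 4 (the empirically chosen thresholds were lost in extraction, so zero is assumed here as the decision boundary), the trend classification can be sketched as:

```python
def moving_average(series, q=1):
    """Centered moving average with half-window q; at the edges, only
    the neighbours that exist are averaged."""
    out = []
    for i in range(len(series)):
        window = series[max(0, i - q): i + q + 1]
        out.append(sum(window) / len(window))
    return out

def trend_type(finding, q=1):
    """Classify a finding's trend by summing the signs of consecutive
    differences of the smoothed series (Eqs. 3-4). Zero differences
    contribute nothing; the zero decision boundary is our assumption."""
    s = moving_average(finding, q)
    tr = sum(1 if b > a else -1 for a, b in zip(s, s[1:]) if a != b)
    if tr > 0:
        return "ascending"
    if tr < 0:
        return "descending"
    return "stable"
```

A monotonically increasing selection yields "ascending", a decreasing one "descending", and a flat one "stable"; the pattern factor of Eq. 5 would be computed separately from the peak detector's width and prominence weights.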
In our framework, a list of suggested terms is taken from a thesaurus [2] with synonyms for the identifiers ascending, descending, peak, valley, and neutral, and the user selects, during a setup phase, what is more appropriate. Users can also add terms that are not suggested in this process to represent the identifiers. For example, we translate a stable and peak finding to $P_{stable, peak} = \{high, peak, boom, boost\}$, and an ascending and neutral finding to $P_{ascending, neutral} = \{increase, higher, growth\}$.

Once the terms of the query $q$ are defined (Equation 1), it is submitted to a text-based search engine to fetch relevant documents inside the external corpus. In this process, we use an open-source search engine called Elasticsearch [16] since it is fast for text indexing, processing, and searching large databases. Besides fetching documents according to a given query, the Elasticsearch engine ranks each retrieved document based on how well it matches the query, returning the top $n$ documents (where $n$ is a setup parameter). This ranking can be manipulated by boosting specific keywords or by giving more importance to terms in specific areas of a document, such as the title or URL link. Since the document's title represents a brief summary of what is expected to be contained within the document, we boost the search result ranking by two when the title matches the query $q$ versus matching only in the text body. With this, documents with relevant titles, and therefore likely relevant content, have a higher chance of being ranked above documents with less relevant titles that are less likely to be specifically about the subject portrayed by the query.

Once the documents are retrieved, the last step in transforming findings into explanations is to present the fetched information.
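Two retrieval-side details lend themselves to small sketches: the title boost maps to a standard Elasticsearch `multi_match` query with a `^2` field boost (the index field names "title" and "text" are our assumption about the schema), and the vocabulary pruning applied before LDA training (described just below) is a plain document-frequency filter:

```python
from collections import Counter

def es_query_body(query_text, top_n=10):
    """Elasticsearch request body where a title match counts twice as
    much as a match in the document body."""
    return {
        "size": top_n,
        "query": {
            "multi_match": {
                "query": query_text,
                "fields": ["title^2", "text"],  # ^2 doubles the title weight
            }
        },
    }

def filter_vocabulary(docs, min_docs=15, max_frac=0.5):
    """Keep only terms appearing in at least min_docs documents and in
    at most max_frac of all documents (the LDA vocabulary pruning).
    Each doc is a list of tokens."""
    df = Counter(t for d in docs for t in set(d))
    n = len(docs)
    return {t for t, c in df.items() if c >= min_docs and c / n <= max_frac}
```

The query body would be POSTed to the index's `_search` endpoint; the pruned vocabulary then defines the dictionary over which the LDA model is trained.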
In this process, we present an overview visualization outlining the top $k$ retrieved documents ($k$ is a user parameter) and visualizations summarizing each document on demand.

To create the overview visualization, we extract topic keywords for each retrieved document. In this process, we use Latent Dirichlet Allocation (LDA) [5], initially training a model using the entire external corpus but removing terms that occur in less than 15 documents and terms that occur in more than 50% of the documents. We empirically set these values following the common practice, but users can change that in the framework setup phase. Using this model, for each retrieved document, we compute the probability of its terms composing a meaningful topic, keeping the top terms. We use LDA since it is nearly real-time considering a pre-trained model, an essential feature for an exploratory tool. Once the lists of topic terms per document are computed, we create an Explanation River Overview visualization based on the ThemeRiver metaphor [14]. In this overview, the documents are positioned on the x-axis, and the rivers' widths represent the probability of each topic term in each document. The topic terms are also listed below the rivers to facilitate navigation. Fig. 1(C) presents an example of such visual representation.

Using the explanation river overview to navigate the results, detailed information about each document is displayed on demand every time a mouse click on a document happens. In the detailed representation, instead of showing the content of the entire document, we opt to show summaries and a tag cloud. The tag cloud is composed of the extracted topic terms, and the summary is created using Gensim's summarization method [4, 21]. The summary is displayed as plain text with the terms used in the query in bold. Moreover, a map highlighting other countries mentioned in the document is displayed with a list of related datasets (see Sect. 3.5). Fig.
1(C) presents an example of the detailed view of a selected document.

Every time a finding $F_{rc} = \{x_{k \geq 1}, x_{k+1}, \ldots, x_{k' \leq K_c}\} \subset S_{rc}$ is selected in a time-series representing an indicator $r$ of country $c$, a world map is displayed comparing $c$ to each other country in the world, and different ranking lists are computed. For the map, each other country $c'$ is colored according to the similarity between $F_{rc}$ and $F_{rc'} = \{x_{k \geq 1}, x_{k+1}, \ldots, x_{k' \leq K_{c'}}\} \subset S_{rc'}$. For the lists, one ranks the countries using the same values used to color the map, from the highest to the lowest similarity. Another ranks all other indicators $S_{r'c}$ for the country $c$, calculating the similarity between $F_{rc}$ and $F_{r'c} = \{x_{k \geq 1}, x_{k+1}, \ldots, x_{k' \leq K_c}\} \subset S_{r'c}$. This enables users to find other time-series with similar patterns and shapes, varying the country or the indicator, showing how similar or divergent the data is.

Our framework enables users to choose between two different strategies to compute the similarity among patterns (or subsets of time-series): the Pearson correlation and the Dynamic Time-Warping (DTW) [15, 22]. If the user wishes to compare time-series according to their shapes, ignoring amplitude, the Pearson correlation is the choice. If the user wishes to compare time-series according to differences in amplitude, DTW is the option. The Pearson correlation is calculated as follows

$$corr(F, F') = \frac{1}{N} \sum_{i} \frac{(x_i - \mu(F))(x'_i - \mu(F'))}{\sigma(F)\,\sigma(F')}, \qquad (7)$$

where $\mu(\cdot)$ is the average value, $\sigma(\cdot)$ is the standard deviation, and $N$ is the number of values in the patterns. The correlation $corr(F, F')$ ranges in $[-1, 1]$. Positive values indicate linearly related series, negative values inversely related series, and values near zero no relationship. The second option, the DTW, is a robust dissimilarity measure that finds the non-linear alignment that has the lowest accumulated Euclidean distance between points, resulting in an optimal shape match preserving magnitude [15, 22]. Since the correlation is a similarity and the DTW is an unbounded dissimilarity, we transform the DTW dissimilarity into a similarity to keep consistency, as follows

$$dtw_{sim}(F, F') = \frac{1}{1 + dtw(F, F')}, \qquad (8)$$

where $dtw(F, F')$ is the DTW distance between two patterns, and the resulting similarity $dtw_{sim}(F, F')$ ranges in $(0, 1]$.

In addition to the similarity ranking lists, we have a third list that, whenever a peak (or valley) in $F_{rc}$ is detected (see Sect. 3.2.3), ranks all other countries $c'$ from the highest peak (or deepest valley) to the lowest peak (or shallowest valley). The idea is to rank the countries to show the ones that are more (negatively or positively) "impacted", by extracting and sorting all pattern factors, calculated through Equation 5, while considering a specific indicator in the same period of time.

In addition to the recommendations of countries or indicators given by the similarity ranking lists, our framework also recommends countries and indicators given the set of documents returned by the query $q$, closing the loop (data to documents, documents to data). The recommendation of countries is straightforward: considering that $q$ was derived from a finding $F_{rc}$ of an indicator $r$ and country $c$, for each returned document, every other country $c' \neq c$ that is mentioned in the document is listed as a recommendation (same indicator, different country, $S_{rc'}$). In this process, we look for country names or their corresponding adjectival and demonymic forms.

For the recommendation of other indicators (and the same country, $S_{r'c}$), the process is more involving. We use the topics extracted to represent the documents (see Sect.
3.3) and compare them to the indicators' tags (see Sect. 3.2.1) to find indicators that are "semantically" related to the retrieved documents. To calculate the similarity among topics and indicators' tags, we use the GloVe model [24] to find the vector representation of indicators' tags and topics and use the cosine similarity to compare them. Considering $T = \{t_1, t_2, \ldots\}$ the list of topic terms and $\hat{T}_r = \{\hat{t}_1, \hat{t}_2, \ldots\}$ the list of tags associated with an indicator $r$, the similarity is calculated as

$$sim(T, \hat{T}_r) = \cos\!\left(\frac{1}{|T|}\sum_{t_i \in T} glove(t_i), \; \frac{1}{|\hat{T}_r|}\sum_{\hat{t}_i \in \hat{T}_r} glove(\hat{t}_i)\right), \qquad (9)$$

where $glove(\cdot)$ returns the vector representation of a topic term or a tag. The result of this process is a ranked list of the most similar indicators considering the retrieved documents. One alternative would be to evaluate the similarity between the keywords of the retrieved documents and the indicators' tags. However, the advantage of training an LDA model using the entire external corpus is that we have a broader range of terms and topics to look at. Thus, the suggestion might include something that is not mentioned in a document but is related to it in terms of its high-level theme or topic.

4 RESULTS AND EVALUATION
In this section, we present the Explainable Patterns (ExPatt) interface design, showing how to employ the features and how it can be used in practice for user-driven data-based storytelling through a usage scenario. For the ExPatt implementation, we use JavaScript for the front-end and Python for the back-end. The ExPatt design is data agnostic, allowing its application in different scenarios as long as textual information related to the time-series under analysis exists. However, for demonstration purposes, we loaded time-series datasets of world demographics indicators from the well-known Gapminder [26] initiative and a cirrussearch dump of the English Wikipedia database [1] as the external textual source of information.

We start by presenting the design and implementation of the ExPatt interface. To develop this interface, we created a preliminary mock-up using well-established visual metaphors employed to display time-series data (line charts) and text (wordcloud and theme river). Starting from it, we performed multiple sessions using the thinking-aloud approach with different members of our lab to refine and define the final design. Fig. 1 presents the resulting interface composed of four distinct modules.

The
Line Chart Module (A) is responsible for loading one (or multiple) time-series indicator(s) of a particular country (we later explain how to change the country under analysis) and displaying it as a line chart. Using this graph, the user can interact with the time-series, selecting findings by clicking and dragging the mouse. The selected pattern is then used to build the query for retrieving the explanations from the external source of information (see Sect. 3.2) and to update the other modules. Fig. 1(A) shows an example of selecting a finding (the gray area) – a massive decrease in the "Oil Production Per Person" for Iran in the period between 1973 and 1981. The
Similarity Graph Module (B) contains a map showing the similarity between the country under analysis and other nations considering the same indicator and selection (time period). Users can use the map to investigate if the finding occurs in other countries (Fig. 1(B)), if the selected country is different from other regions of the world, or to get an overview of the similarity distribution of the finding around the globe. By inspecting Fig. 1(B), it is possible to sense that behavior similar to the selected finding is also observed in multiple countries (e.g., United States, Canada, and Venezuela). In contrast, the behavior looks the opposite for some others (e.g., Mexico and the United Kingdom). The
Explanation Module (C) comprises a visual representation of the retrieved documents (or explanations) for the selected finding, showing an overview of the documents' topic keywords and details on demand. The explanation river overview shows the top 10 retrieved documents with their titles and topic terms. While titles and topic terms provide a summary of each document, details on demand are shown by clicking on the titles. The detailed view provides an overview of what can be found inside a document, as well as other mentioned countries and similar indicators, helping users navigate from the explanations back to the indicators (datasets). According to the example presented in Fig. 1(C), the finding observed in (A) is probably related to the different oil crises of 1973 and 1979, in which production was reduced due to an embargo and a strike, respectively. Finally, the
Recommendation Lists Module (D) includes a set of lists containing other recommended indicators or countries based on the selected finding. If a user wants to discover countries with similar findings, s/he can use the "Similar/Dissimilar Countries" lists, which show the same (or inverse) information of the correlation map but in a ranked list. However, if a user wants to analyze similarities using other indicators while considering only the country under analysis, s/he can use the "Similar/Dissimilar Datasets" lists. Finally, to discover the most prominent peaks and valleys over the selected time-frame, the "Prominent Peaks/Valleys" lists can be used, allowing users to see which are the most impacted countries over the selected time-frame. According to the "Different Countries" list in Fig. 1(D), Mexico and the United Kingdom present considerably different behaviors in the same period of the decrease in oil production in Iran. ExPatt can be accessed at http://vav.research.cs.dal.ca/explainablepatterns/.

2 Usage Scenario – United States Life Expectancy
Using these modules, we present a hypothetical scenario to show how ExPatt can assist non-expert users in understanding discovered findings and building up storytelling based on data.

Justin, an American high-school student, wants to investigate how the average life-span has changed over the years in the United States. With that in mind, he selects the "Life Expectancy" indicator and the "United States" country. The resulting graph shows a positive trend of increasing overall American life-span over the years, but he notices two interesting patterns, one valley between 1860 and 1866 and another between 1917 and 1919, with some instability between 1901 and 1930 (see Fig. 3).

To further inspect these patterns, aiming at understanding what is happening, Justin selects the first valley (between 1860 and 1866). Internally, the ExPatt framework builds a query, retrieves documents, and generates a visual representation summarizing the results, describing the selected finding. By checking the similarity map (see Fig. 4(a)) and adding the two countries with the highest similarity (Sweden and the United Kingdom), Justin realizes that this drop in life expectancy is probably an American effect (see Fig. 4(b)) since even the most related countries do not present a similar valley in the same period. With that in mind, Justin goes to check the explanations, starting with the explanation river overview (see Fig. 5(a)). He observes the prevalence of the terms "war" and "wars" and the multiple resulting document titles and concludes that the lower life expectancy may be the result of a civil war. Clicking on the documents "Union (American Civil War)" and "1862 and 1863 United States House of Representatives elections" and reading the summaries, he notices that the trigger of the civil war was Abraham Lincoln's election and the US southern states feeling unrepresented and/or challenged due to slavery (see Fig. 5(a)), being able to discover one important piece for the storytelling of the US life-span variations, including its trigger.
Fig. 3. Life expectancy line chart of the United States. Interesting patterns can be observed: one valley between 1860 and 1866, another between 1917 and 1919, and some instability between 1901 and 1930. The gray area represents the user selection.
Continuing from his previous experience with the system, Justin follows up to investigate the second dip in the life expectancy, between 1917 and 1919. The results are, however, less clear, with different causes for the drop, as seen in Fig. 5(b). By investigating the correlation map and reading the summary snippet of each explanation, Justin builds a hypothesis in his mind of which area to drill down into. Differently from the correlation map of the 1860–1866 valley, Justin observes high similarity with other countries and finds more indications of both World War I and the Spanish flu (see Fig. 5(b)) impacting life expectancy. Justin decides to check the most prominent valley among the ranking lists to see which was the country most impacted during this time (here omitted due to space constraints) and concludes that Samoa, a small island near Australia, was the heaviest impacted country during this period, with its life expectancy dropping to close to 1 year, as seen in Fig. 5(b). Finally, Justin concludes that most of the world had its life expectancy affected by both the Spanish flu and World War I in this period.

Justin, however, is not comfortable with the fact that one of the only countries that do not have a high similarity is Russia (see Fig. 6(a)), and he starts looking for reasons. Following the same investigation procedure, Justin finds out that the reason for the similarity to be low in this period is that Russia's valley pattern is much wider (see Fig. 7(b)) than that of the United States.

Fig. 4. Similarity map of the finding selected in Fig. 3, using Pearson Correlation (a), and the corresponding line chart (b). For most countries the pattern has a low correspondence (a), and an inspection of the two countries with the highest similarity (Sweden and the United Kingdom) suggests that this drop in life expectancy is an American effect (b).
By focusing only on Russia’s linechart and selecting the valley, he obtains a different similarity map withneighborhood countries to Russia (Belarus, Ukraine, Turkmenistan,Uzbekistan, Tajikistan, and Kazakhstan) presenting a mush highersimilarity than the rest of the world, which is confirmed by the similarityranking list. Justin switches to the Dynamic Time-Warping similaritymap of Fig. 7(a) and verifies that Uzbekistan is the most similar countryto Russia by adding it to the line-chart (see Fig. 7(b)). After checkingthe explanation river overview (here omitted due to space constraints),he realizes that during the same period Russia was facing, besides theWorld War I and Spanish Flu, the abolition of Russia’s monarchy in1917, a civil war, and the beginning of the Soviet Union.While checking the “History of Soviet Russia and the Soviet Union(1917–1927)” explanation details, Justin notices that ExPatt recom-mends “Democracy Index” as one of the related indicators of thisexplanation (see Fig. 8(a)). At the same time, the “Different Datasets”ranking list (here omitted due to space constraints) also suggests thesame indicator. Indeed, Justin confirms that by adding the secondindicator to the line chart view. By removing the “life expectancy” a) Explanation river overview of the finding selected in Fig. 3. Terms “war” and “wars” are frequent among the returned documents, and many of these are aboutthe American civil war, indicating that the drop in the life expectancy in the United States in the period between 1860 and 1866 is probably a result of the AmericanCivil War.(b) Explanation river overview for the United States life expectancy in the period between 1917 and 1919. Spanish flu or the Influenza Pandemic and the WorldWar I are the most plausible explanations for the observed drop in the American’s life-span.
Fig. 5. Different examples of explanation river overviews for the two valleys observed in the United States life expectancy indicator (see Fig. 3). indicator and focusing only on the “democracy index” indicator, Justinselects the inclines and declines of the indicator individually and lookfor explanations for this sudden change in the democracy index. Hegoes on to find that Russia wrote a constitution on 1906 and anotheron 1918 by selecting the period of 1903 and 1907 and 1915 and 1918,respectively. Finally, by selecting the period of 1922 and 1924, Justinreads the “Russian Revolution” explanation, which tells about howthe Monarchy was abolished in 1917 and the Soviet Union was estab-lished in 1923, matching the steep fall of Russia’s democracy index in1923. Justin concludes that in Russia, the changes in the political land-scape definitively had an impact on the lives of the Russian population,including their life expectancy.Although an elementary analysis, through data, Justin was able todiscover one piece for the storytelling of the life-span variations in theUS, including its trigger. Two different global situations, World War Iand the Spanish Flu, and why Russia was differently affected than therest of the world. This simple storytelling generated from aggregatingtextual information with information obtained from the line chart andmap shows how important it is for final users to be in the center of the analytical process without the need of a domain expert, that may not beavailable.
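The two similarity notions Justin alternates between on the map, Pearson correlation and Dynamic Time Warping [15, 22], can be illustrated with a minimal Python sketch. The series and country labels below are invented toy data; this is only an illustration of the two measures, not ExPatt's actual implementation.

```python
import numpy as np

def pearson_similarity(a, b):
    """Pearson correlation between two equally long series over the selected window."""
    return float(np.corrcoef(np.asarray(a, float), np.asarray(b, float))[0, 1])

def dtw_distance(a, b):
    """Classic O(n*m) dynamic time warping distance (lower means more similar)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three possible warping steps.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

# Toy series standing in for one indicator over a selected time window.
iran = [10, 9, 6, 3, 2, 2]    # steep decrease
usa = [8, 7, 5, 3, 2, 1]      # similar decreasing behavior
mexico = [2, 3, 4, 6, 7, 8]   # opposite trend

assert pearson_similarity(iran, usa) > pearson_similarity(iran, mexico)
assert dtw_distance(iran, usa) < dtw_distance(iran, mexico)
```

Pearson captures whether two series move together point by point, while DTW tolerates small temporal shifts, which is why the two maps in the scenario can rank countries differently.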
In this section, we present a formal evaluation of our framework. Although the standard practice would be to test it with users, since we rely on well-established and straightforward visual representations (line chart and theme river), the relevance of such a test would be minimal to support what we are proposing. Also, user tests would not give us any better insight into our framework, since the setup depends on users' specific knowledge about the data being used, and the accuracy of the explanations shown to the users is heavily dependent on how well the textual source describes the events of the temporal datasets. User tests would only allow us to gather application-specific results, which would only apply to the datasets and not to the framework itself.

Therefore, instead of testing the interface design, we opt to check how stable our framework is given variations in the user interactions; in other words, how much the explanations differ if slightly different intervals in the time-series are selected representing a pattern. Again, users could be involved in this test, but it would not be comprehensive enough. Instead, we follow Memon et al. [19] and create an oracle that emulates the user behavior, allowing us to test our approach exhaustively. Notice that we are not evaluating the quality of the provided explanations, since it reflects the accuracy of the search engine and the reliability of the external textual information, and it is not our objective to measure how reliable Wikipedia is nor whether Elasticsearch is good. Indeed, such an evaluation is not possible since we were not able to find a labeled dataset that allows us to evaluate the connection between time-series patterns and their textual explanations.

Fig. 6. Similarity map of the finding between 1917 and 1919 in Fig. 3, using Pearson Correlation (a), and the corresponding line chart (b). For most countries the pattern has high correspondence (a), and the same valley can be observed in other countries of different world regions, indicating a global reason for the drop in life expectancy.

The oracle implementation we use is straightforward. For a time-series $S_c^r$, it first automatically extracts the top findings, valleys and peaks classified by their magnitude, and then varies a window around the patterns to generate similar selections that emulate the variations in users' behaviors. Given a detected finding $F_c^r = \{x_k, x_{k+1}, \ldots, x_{k+q}\} \subset S_c^r$, we first expand the selection interval to $R_{p^-}^{p^+} = \{k - p^-, k - p^- + 1, \ldots, k + p^+ + q\}$, where $p^+, p^- \in [0, P]$, and then for each $R_{p^-}^{p^+}$ a new query is generated centering a window of size $q + p^- + p^+$ on the point, resulting in $P$ distinct queries for each original extracted top finding. In our tests, the expansion size $P$ of the window was set to 3.
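The oracle's two steps, extracting prominent findings and deriving slightly shifted selections around them, can be emulated in a few lines. The sketch below uses SciPy's peak detection for the extraction step; the specific detector, prominence threshold, and window-enumeration scheme are assumptions for illustration, not ExPatt's exact implementation.

```python
import numpy as np
from scipy.signal import find_peaks

def extract_findings(series, prominence=1.0):
    """Detect peak findings by prominence; valleys can be found by negating the series."""
    peaks, _ = find_peaks(series, prominence=prominence)
    return list(peaks)

def derive_selections(k, q, P):
    """Expand a finding covering points k..k+q by p-, p+ in [0, P] on each side,
    emulating slightly different user selections (the original selection is skipped)."""
    derived = []
    for p_minus in range(P + 1):
        for p_plus in range(P + 1):
            if p_minus == 0 and p_plus == 0:
                continue
            derived.append((k - p_minus, k + q + p_plus))
    return derived

series = np.array([1.0, 3.0, 1.0, 6.0, 1.0, 2.0, 1.0])
assert 3 in extract_findings(series)                            # the tallest peak is at index 3
assert derive_selections(k=3, q=0, P=1) == [(3, 4), (2, 3), (2, 4)]
```

Each derived interval is then submitted as its own query, so one extracted finding yields a family of slightly different retrievals to compare.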
This value was defined by running a test with our lab members, asking them to manually select findings in many different time-series and considering the average variation among them to set $P$.

The stability of our query process is then measured by comparing the intersection of the set of documents retrieved using the original extracted finding and the sets of documents fetched using the derived patterns. Given $D$, the set of documents retrieved using the original finding, and $D_i$, the list of documents returned using one of the derived findings, the stability is computed as

$$\frac{1}{|D| \cdot |\Delta|} \sum_{D_i \in \Delta} |D \cap D_i|, \quad (10)$$

where $\Delta$ is the set containing the lists of documents produced by all derived patterns. Notice that the number of documents in $D$ and $D_i$ is the same and defined by the number of documents we display in our interface. In our tests, we set it to 10.

Fig. 7. Life expectancy analysis of Russia, with similarity maps using Pearson Correlation (A) and Dynamic Time-Warping (B) (a), and the corresponding line chart (b). Compared to the United States, Russia presents a much wider valley spanning much longer (a), with much higher similarity with neighboring countries, indicating that something happened in that region of the world beyond what is observed globally (see Fig. 5(b)).

To execute a comprehensive test, we select 960 time-series from our database and measure the query stability for each one. In total, we automatically detect 5,286 relevant patterns and generate 8 derived patterns per original pattern, resulting in 47,574 queries submitted to the search engine. Notice that, if we had tested with users, considering 50 users, each one would have had to execute approximately 951 selections to reach the same statistical representativity, an impractical number even if we tripled the number of users. Fig. 9 summarizes the results for the 9 different combinations of trends and patterns. Overall, the stability presents a standard deviation of 0.22, while 'unstable patterns' have less document stability, with an average of 39% and a standard deviation of 0.27. Fig. 9 also shows how 'neutral unstable' patterns are the least stable, which is expected since this combination means no coherent pattern in the data. Although this cannot be used to quantify the quality of the explanations, it indicates that users with similar behavior are expected to receive similar explanations when selecting the same pattern, indicating a good degree of stability and reproducibility.
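The stability measure of Eq. 10 reduces to a set-overlap computation over the retrieved document lists. A minimal sketch with toy document identifiers (top-4 lists instead of the top-10 used in the paper):

```python
def stability(original_docs, derived_doc_lists):
    """Eq. 10: average overlap between the original top-k document set D and
    the document set of each derived selection, normalized by |D|."""
    D = set(original_docs)
    delta = [set(docs) for docs in derived_doc_lists]
    return sum(len(D & D_i) for D_i in delta) / (len(D) * len(delta))

D = ["d1", "d2", "d3", "d4"]
derived = [
    ["d1", "d2", "d3", "d5"],  # 3 of 4 original documents preserved
    ["d1", "d2", "d6", "d7"],  # 2 of 4 original documents preserved
]
assert stability(D, derived) == (3 + 2) / (4 * 2)  # 0.625
```

A value of 1 means every derived selection returned exactly the original documents; lower values indicate that small changes in the selection alter the retrieved explanations.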
Discussions and Limitations
Since we finished the first implementations of our framework, most of the exploratory tests we executed have resulted in useful explanations. However, in some cases, it failed to bring any meaningful information about the pattern under analysis. The problem is that the quality of our explanations is bounded by what we call the explainability factor of the external textual corpus considering the time-series data. In other words, if the corpus has no information on the events/findings of the time-series, our framework may still return documents, but without actual information to aggregate for the user. The straightforward solution is to use a richer corpus, for instance, the entire Internet. We tried this using the Google API [30] as our external source of information, and the results were more informative, to say the least. Indeed, the Google API was our first choice, but the system responsiveness was not ideal due to API limitations, and a real-time response is mandatory for systems dedicated to the general public. Also, we had problems obtaining the topic terms from the retrieved documents, since we need access to the content of the documents, and our attempts to create generic parsers did not work as expected. So we opted for Wikipedia, even though it is less comprehensive than Google.

Fig. 8. Democracy Index analysis of Russia, showing the line chart (a) and a document explanation with its recommended indicators/countries (b). Russia presents a peak in the democracy index in the same time-frame as a valley in life expectancy (a). The document shown summarizes findings of this time-frame while recommending the "Democracy score index" indicator as related to the document for further analysis (b).

In this paper, we have explored world indicators datasets as a way to exemplify the usage of our framework. However, other domains may also benefit from it. For instance, stock market time-series datasets could be explored, not for forecasting but for connecting observable patterns with facts in the same period. The challenge in this example is to find good sources of external information from which the explanations can be derived.
Another example of a domain that can also benefit from our approach is finer-grained demographics analysis (e.g., at city or neighborhood levels) supported by newspaper press datasets as external sources of information. In any case, a fundamental observation has to be made: the explanations are not intended to serve as causal information, since spurious correlations can be derived. They are only a way to enrich users' knowledge and a solution to break some barriers in the democratization of information.
Fig. 9. Query stability analysis. On average, slight variations in the selection return a large fraction of the documents returned by the original query, depending on the pattern type. Thereby, users with similar selections are expected to receive similar explanations from ExPatt when selecting the same pattern, indicating a good degree of stability and reproducibility, especially for patterns like peaks and valleys.

Finally, our visualizations are not perfect. Line charts, for instance, have inherent limitations in terms of visual scalability when displaying too many time-series at once, the same restriction presented by the theme river metaphor and the wordcloud. However, we opt to use simple and popular visual representations that are (probably) familiar to our intended audience, the general public, and not only to computer science graduate students. So, although other more scalable visual representations could be used, the added complexity and the prolonged user learning curve would reduce the reach of our framework, conflicting with our primary goal of democratization.
Conclusion
In this paper, we present Explainable Patterns (ExPatt), a framework to support exploratory analysis, replacing the need for domain experts to aid in understanding discovered findings with explanations automatically derived from external textual sources of information. Although the employed strategies are generic, we have reported encouraging results using world demographics indicators databases and Wikipedia as an external source of information. The core of ExPatt is a novel strategy to translate user selections into queries used to fetch the information that composes the explanations. In our tests, this strategy showed itself to be very stable and reliable, returning practically the same list of explanations given slight variations in user selection. The usefulness of ExPatt is shown through a usage scenario where a lay user explores a time-series of interest and navigates through explanations and other time-series to understand remarkable patterns in the data in a user-driven storytelling process. Despite its limitations, ExPatt is unique in what it proposes, representing a real advance towards providing solutions for data analytics democratization in domains of general public interest.
Acknowledgments

References

[1] Wikimedia downloads. URL https://dumps.wikimedia.org/other/cirrussearch/.
[2] Thesaurus.com, 1995.
[3] Adithya Balaji and Alexander Allen. Benchmarking automatic machine learning frameworks. arXiv preprint arXiv:1808.06492, 2018.
[4] Federico Barrios, Federico López, Luis Argerich, and Rosa Wachenchauzer. Variations of the similarity function of TextRank for automated summarization. CoRR, abs/1602.03606, 2016.
[5] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
[6] Peter J. Brockwell and Richard A. Davis. Introduction to Time Series and Forecasting. Springer, 2016.
[7] Chris Bryan, Kwan-Liu Ma, and Jonathan Woodring. Temporal summary images: An approach to narrative visualization via interactive annotation generation and placement. IEEE Transactions on Visualization and Computer Graphics, 23(1):511–520, 2016.
[8] Weiwei Cui, Xiaoyu Zhang, Yun Wang, He Huang, Bei Chen, Lei Fang, Haidong Zhang, Jian-Guang Lou, and Dongmei Zhang. Text-to-Viz: Automatic generation of infographics from proportion-related natural language statements. IEEE Transactions on Visualization and Computer Graphics, 26(1):906–916, 2019.
[9] Rui Ding, Shi Han, Yong Xu, Haidong Zhang, and Dongmei Zhang. QuickInsights: Quick and automatic discovery of insights from multi-dimensional data. In Proceedings of the 2019 International Conference on Management of Data, pages 317–332, 2019.
[10] Christiane Fellbaum. WordNet. In Theory and Applications of Ontology: Computer Applications, pages 231–243. Springer, 2010.
[11] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, pages 2962–2970, 2015.
[12] Isabelle Guyon, Kristin Bennett, Gavin Cawley, Hugo Jair Escalante, Sergio Escalera, Tin Kam Ho, Núria Macià, Bisakha Ray, Mehreen Saeed, Alexander Statnikov, et al. Design of the 2015 ChaLearn AutoML challenge. pages 1–8. IEEE, 2015.
[13] Isabelle Guyon, Imad Chaabane, Hugo Jair Escalante, Sergio Escalera, Damir Jajetic, James Robert Lloyd, Núria Macià, Bisakha Ray, Lukasz Romaszko, Michèle Sebag, et al. A brief review of the ChaLearn AutoML challenge: any-time any-dataset learning without human intervention. In Workshop on Automatic Machine Learning, pages 21–30, 2016.
[14] Susan Havre, Elizabeth Hetzler, Paul Whitney, and Lucy Nowell. ThemeRiver: Visualizing thematic changes in large document collections. IEEE Transactions on Visualization and Computer Graphics, 8(1):9–20, 2002.
[15] Eamonn Keogh and Chotirat Ann Ratanamahatana. Exact indexing of dynamic time warping. Knowledge and Information Systems, 7(3):358–386, 2005.
[16] Oleksii Kononenko, Olga Baysal, Reid Holmes, and Michael W. Godfrey. Mining modern repositories with Elasticsearch. pages 328–331, New York, New York, USA, 2014. Association for Computing Machinery.
[17] Allen Yilun Lin, Joshua Ford, Eytan Adar, and Brent Hecht. VizByWiki: Mining data visualizations from the web to enrich news articles. In Proceedings of the 2018 World Wide Web Conference, pages 873–882, 2018.
[18] Yuyu Luo, Xuedi Qin, Nan Tang, Guoliang Li, and Xinran Wang. DeepEye: Creating good data visualizations by keyword search. In Proceedings of the 2018 International Conference on Management of Data, pages 1733–1736, 2018.
[19] A. Memon, I. Banerjee, and A. Nagarajan. What test oracle should I use for effective GUI testing? pages 164–173. IEEE, 2004.
[20] Ronald Metoyer, Qiyu Zhi, Bart Janczuk, and Walter Scheirer. Coupling story to visualization: Using textual analysis as a bridge between data and interpretation. pages 503–507, 2018.
[21] R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. In Proceedings of EMNLP 2004, the Conference on Empirical Methods in Natural Language Processing, July 2004.
[22] Meinard Müller. Dynamic time warping. In Information Retrieval for Music and Motion, pages 69–84, 2007.
[23] Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore. Evaluation of a tree-based pipeline optimization tool for automating data science. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, pages 485–492. ACM, 2016.
[24] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[25] Anna Rosling Rönnlund and Ola Rosling. Free software for a world in motion. In Proceedings - Second International Conference on Creating, Connecting and Collaborating Through Computing, pages 14–19, 2004. ISBN 0769521665. doi: 10.1109/c5.2004.1314363.
[26] Hans Rosling. Data - gapminder.org, 2012.
[27] Hans Rosling, Ola Rosling, and Anna Rosling Rönnlund. Factfulness: Ten Reasons We're Wrong About the World - and Why Things Are Better Than You Think. New York: Flatiron Books, 2018.
[28] David S. Sawicki and William J. Craig. The democratization of data: Bridging the gap for community groups. Journal of the American Planning Association, 62(4):512–523, 1996. ISSN 01944363.
[29] Bo Tang, Shi Han, Man Lung Yiu, Rui Ding, and Dongmei Zhang. Extracting top-k insights from multi-dimensional data. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1509–1524, 2017.
[30] Mario Vilas. Google API, 2020. URL https://pypi.org/project/google/.
[31] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, CJ Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 2020.
[32] Chao Yang, Zengyou He, and Weichuan Yu. Comparison of public peak detection algorithms for MALDI mass spectrometry data analysis. BMC Bioinformatics, 10:4, 2009. ISSN 14712105.
[33] B. Yu and C. T. Silva. FlowSense: A natural language interface for visual data exploration within a dataflow system.