Leam: An Interactive System for In-situ Visual Text Analysis
Sajjadur Rahman
Megagon Labs
Peter Griggs
Megagon Labs & MIT
Çağatay Demiralp
Megagon Labs
ABSTRACT
With the increase in scale and availability of digital text generated on the web, enterprises such as online retailers and aggregators often use text analytics to mine and analyze the data to improve their services and products alike. Text data analysis is an iterative, non-linear process with diverse workflows spanning multiple stages, from data cleaning to visualization. Existing text analytics systems usually accommodate a subset of these stages and often fail to address challenges related to data heterogeneity, provenance, workflow reusability and reproducibility, and compatibility with established practices. Based on a set of design considerations we derive from these challenges, we propose Leam, a system that treats the text analysis process as a single continuum by combining the advantages of computational notebooks, spreadsheets, and visualization tools. Leam features an interactive user interface for running text analysis workflows, a new data model for managing multiple atomic and composite data types, and an expressive algebra that captures diverse sets of operations representing various stages of text analysis and enables coordination among different components of the system, including data, code, and visualizations. We report our current progress in Leam's development while demonstrating its usefulness with usage examples. Finally, we outline a number of enhancements to Leam and identify several research directions for developing an interactive visual text analysis system.
1 INTRODUCTION
With the rapid growth of the e-commerce economy, the internet has become the platform for many of our everyday activities, from shopping to dating, to job searching, to travel booking. A recent study projects worldwide e-commerce sales to be around six trillion dollars by 2023 [32], nearly a 50% increase over the current market. This growth has contributed to the proliferation of digital text, particularly user-generated text (reviews, Q&As, discussions), which often contains useful information for improving the services and products on the web. Enterprises increasingly adopt text mining technologies to extract, analyze, and summarize such information from unstructured text data. Like data analysis at large, text data analysis also benefits from interactive workflows and visualizations, as they facilitate accessible, rapid iterative analysis. Therefore, we characterize the text data analysis process more formally as visual interactive text analysis (VITA hereafter). A challenging aspect of VITA is its iterative and non-linear nature: it is a multistage process that involves tasks like data preprocessing and transformation, model building, hypothesis testing, and insight exploration, all of which require multiple iterations to obtain satisfactory outcomes. While there are many commercial and open-source tools that support various stages of VITA [21, 30], none of them captures the end-to-end VITA pipeline. For example, spreadsheets are suitable for directly processing and manipulating data, computational notebooks enable flexible exploratory analysis and modeling, and
[Figure 1 screenshot. Inset (F) notebook commands: vis1 = vis_view.get_vis(0); vis1.select("comfy")]
Figure 1: Leam user interface. (A) Operator View enables users to perform visual text analytics (VITA) operations using drop-down menus, (B) Visualization View holds a carousel of interactive visualizations created by users, (C) Table View displays the data and its subsequent transformations, and (D) Notebook View allows users to compose and run VITA operations using a declarative language. Inset (E) shows the VTA json specification for the bar chart operator in Operator View. Inset (F) shows a declarative VTA command for interacting with the bar chart from Notebook View.

visualization systems, typically based on chart templates, facilitate quick interactive visual analysis. There are also many customized text analytics tools [21], often in the form of research prototypes, that support specific use-cases like review exploration [38], sentiment analysis [18], and text summarization [8]. Unfortunately, none of these solutions accommodates the inherently cyclic, trial-and-error-based nature of VITA pipelines in an integrated manner. While supporting the entire VITA life-cycle within a single system does seem natural, developing it leads to several challenges related to (a) extensibility and expressivity of VITA workflows, (b) their continuity and reproducibility, (c) data heterogeneity and provenance, and (d) synchronization of user interactions. We document these challenges in Section 2. These challenges are a culmination of (1) our conversations with practitioners while working in an industry research lab, part of a larger company with more than three hundred e-commerce subsidiaries, (2) prior research, particularly reports from interview studies on data analysis workflows [20, 38], and (3) our experience in developing and evaluating interactive data systems. As such, the list of challenges here is intended to be a useful guide informing research and development on text analytics systems, not a comprehensive enumeration, and it inevitably reflects our personal taste. We propose a set of design criteria (desiderata) to address the challenges of a VITA system (Section 2). One crucial theme underlying these criteria is the consideration of data analysis as a single continuum, not as discrete steps of tasks performed in isolation.

In this paper, we introduce Leam, a one-stop shop for visual interactive text analysis.
Leam combines the advantages of spreadsheets, computational notebooks, and visualization tools by integrating a Notebook View with interactive views of raw and transformed data (Figure 1). A key component in the design of Leam is a visual text algebra (VTA) that enables users to specify complex VITA operations over heterogeneous data and their visual representations, either using a declarative language in the Notebook View (Figure 1F) or by creating operators in the front end that translate to json-style VTA specifications (Figure 1E). Through usage examples, we demonstrate the expressivity of VTA and how it enables Leam to support diverse tasks ranging from data cleaning to visualization. Moreover, to facilitate efficient execution of VTA on heterogeneous data, we introduce a new data model extending dataframes, called VITAframe. Based on our experience in developing Leam, we have identified several system design challenges related to storage model efficiency, scalable computation of VITA workflows, data and workflow versioning, and workflow optimization. To address these challenges, we identify several research directions that may enhance the capabilities of Leam. We have made the current version of Leam open-source at https://github.com/megagonlabs/leam.

The goal of this paper is to explore challenges in and design considerations for VITA systems development, present vital components of Leam that enable interactive text analysis (e.g., VITAframe and VTA), and identify some concrete research directions based on our experience of developing Leam.

2 VITA CHALLENGES
Motivating Example.
Ada, a data scientist in the e-commerce department of a retail business, has been tasked to analyze customer reviews of products purchased from their website. Ada would like to capture the underlying topics by performing topic modeling and clustering to characterize the review corpus better. Figure 2 captures the use-case, which involves preprocessing the data (clean), creating feature vectors from the text reviews (featurize), creating topic vectors from the corpus (topic modeling), clustering reviews into topics (cluster assignment), and finally visualizing the clusters by projecting the topic vectors to lower dimensions (2D) using feature transformation techniques such as PCA (visualize). In practice, the workflow may be non-linear, and each step may require multiple passes and different tools. In the process, visualizations are useful not only for exploratory analysis or final presentation but also for every other step; VITA workflows resemble the read-eval-print loop (REPL) approach, where users perform incremental operations on data and examine intermediate results. We now characterize the challenges of existing VITA workflows in the context of this use-case as follows:
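To make the stages concrete, the use-case above can be sketched end-to-end in a few lines. The sketch below is a stdlib-only approximation: a real workflow would use NLP/ML libraries for topic modeling and PCA, and every function name here is illustrative rather than part of any tool.

```python
# A minimal, stdlib-only sketch of the Figure 2 stages
# (clean -> featurize -> cluster assignment); names are illustrative.
import re
from collections import Counter

docs = ["Comfy beds, great beds.",
        "Cheap rooms, free wifi.",
        "Fast wifi, free parking."]

def clean(doc):                       # lowercase + tokenize
    return re.findall(r"[a-z]+", doc.lower())

vocab = sorted({w for d in docs for w in clean(d)})

def featurize(doc):                   # term-frequency vector over the vocabulary
    counts = Counter(clean(doc))
    return [counts[w] for w in vocab]

vectors = [featurize(d) for d in docs]

def assign(vec, centroids):           # cluster assignment: nearest centroid
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda k: dist(vec, centroids[k]))

centroids = [vectors[0], vectors[1]]  # seed two "topics" from the first two docs
labels = [assign(v, centroids) for v in vectors]
```

Each stage feeds the next, and in practice an analyst would iterate on any of them, which is precisely the non-linearity discussed above.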
C1. Workflow discontinuity.
As mentioned in Section 1, data scientists lack tools that support different VITA operations and workflows within an integrated environment. For example, to define data cleaning rules, Ada first visually inspects the data using tools like spreadsheets. Next, they execute those rules in a computational notebook, e.g., Jupyter. Upon inspecting the data in the spreadsheet, Ada may revise the rules in the notebook. To visualize top-ranked words after the featurization step (e.g., a bar chart of words ranked by their TF-IDF scores), they need to either use dedicated visualization tools or write scripts in the notebook. Therefore, even completing simple tasks may require accessing different tools, which can be a cumbersome user experience due to, for example, the logistical and cognitive overhead of context switching.

Figure 2: An example VITA use-case: topic exploration.
C2. Limited coordination.
VITA necessitates coordination among different views (e.g., between visualization and raw data). High-dimensional text data can be difficult to interpret, and users often map different facets of the data to visualizations for better interpretability. However, without coordination between perceptual components and the data space, understanding the relations between the facets of the same entities on demand can be challenging. For example, say Ada wants to inspect which reviews contain a top-word in the bar chart (generated after featurization). However, visualizations in notebooks or visualization tools are decoupled from the data. As a result, Ada has to either open and then filter the data in a spreadsheet, or programmatically filter the data from the notebook to inspect the relevant reviews. Therefore, the lack of coordination impacts both workflow continuity and the user's ability to explore data effectively.
C3. Data types and workflow diversity.
VITA workflows deal with heterogeneous data (e.g., text, visualizations) and workflows (e.g., in use-cases like text summarization and sentiment analysis). While there are a number of VITA tools for specific workflows [21], more often than not these tools use a stack of independent solutions for data storage and processing, glued together by scripting languages like Python and R. These bespoke solutions may not capture all VITA requirements, like direct data manipulation and interactive visual coordination. As a result, users are often forced to develop new and heavily customized systems on top of these solutions.
C4. On-demand workflow authoring.
VITA workflows contain a variety of operations, e.g., cleaning, featurization, interactive visualization, and classification. Similar to relational [10] or data visualization algebra [29], VITA operations with similar objectives can be grouped into high-level categories. Moreover, operations in different categories can be combined to compose new operation pipelines. For example, cleaning and featurization can be combined into a preprocessing pipeline. As existing systems lack any formalization of the operations and their application, the onus is on the user to design the optimal analytics workflow for different use-cases.
C5. VITA session management.
As demonstrated in the usage example, VITA workflows inspire a trial-error-correct-style iterative approach: users often need to reproduce previous steps of the workflow, make updates, and rerun the subsequent steps. Therefore, ensuring the reproducibility of VITA sessions requires management of the dataset versions produced by various operations, the operation logs, and the different states of and interactions on the visual representations of the data. Prior work from the data management community focused on versioning structured datasets [16], versioning code for debugging workflows [6, 23], and managing deep learning models [22]. However, these systems lack support for versioning an end-to-end VITA workflow involving heterogeneous data types and user interactions spanning multiple views.

Design goals.
We propose the following design principles to address the challenges related to VITA:
D1. In-situ analytics.
VITA systems should provide a one-stop shop (C1) where users can directly manipulate (spreadsheets) and visualize (visualization tools) data while writing scripts (notebooks) to immediately view the effects on data and visualizations without context switching between tools.

D2. Multi-view coordination.
Beyond integrating multiple views within a single interface, VITA systems should enable coordination between these views (C2). Multiple coordinated views capture the context of the user's exploration across different views [36] and help users understand the data better as they view it through different connected representations.

D3. Heterogeneous data management.
VITA systems should support heterogeneous data types (e.g., texts, visualizations), treating them as first-class citizens of the underlying data model (C3). Instead of developing bespoke data management solutions, VITA systems should adapt their underlying storage model to accommodate these data types and also enable a tight coupling between the data model and the analytical workflows to ensure fast and efficient data access.

D4. Expressivity and accessibility.
VITA systems should provide an expressive specification language to represent and communicate the entire breadth of workflows within the domain (C4). Moreover, the specification language should be accessible to existing tools to allow more expressive operations. For example, the specification language can be packaged as a Python library with an interactive widget supporting a subset of VITA operations in a computational notebook.

D5. Provenance.
VITA systems should support advanced provenance tracking for heterogeneous data types and various workflows to ensure reproducibility and encourage workflow and data re-use. Moreover, these systems should track user interactions on visual components to enable versioning of the states of, and dependencies among, different views.

3 Leam: IN-SITU ANALYSIS
The Leam backend features an in-memory dataframe, a versioning database, a compiler for translating VTA commands that the execution engine runs, and a session manager to manage data, code, and visualizations. Figure 3 shows an overview of the Leam system architecture. The components with dashed borders ("- -") are partially implemented and require further refinement. We discuss the front-end components later.
The underlying data structure of Leam is a dataframe. A dataframe is a multidimensional array, A_mn, with a vector of row labels, R = {R_1, R_2, ..., R_m}, and a vector of column labels, C = {C_1, C_2, ..., C_n} [25]. We opted for dataframes instead of a database since VITA tasks (e.g., featurization, classification) cannot always be conveniently performed inside a database [19]. Dataframes are widely used in exploratory data analysis, including VITA, due to their coverage of a wide variety of data analysis operators [25]. The Pandas dataframe API within Python (pandas.pydata.org) has been downloaded more than 300 million times and serves as a dependency for over 265,000 repositories on GitHub. Dataframes also do not require data to be defined schema-first, allowing flexibility in supported data types and data structures. Finally, dataframes provide a functional interface suitable for REPL-style VITA workflows to perform "quick and dirty" analysis.

Figure 3: Leam system architecture. [The figure shows the presentation layer (user interface), the VTA API (operators and notebook commands), the main-memory components (VTA compiler, VTA execution engine, session manager, and VITAframe), and a version DBMS storing command logs and VTA specifications.]
Figure 4: VITAframe with column and metadata schema specification. The Review column is of type Text and the metadata, the collection of unique tokens, is of type List(token).

To support VITA use-cases, we assign a schema to dataframe columns: each column C_i ∈ C may have a schema defined over a set of data domains, D = {d_1, d_2, ...}, that spans heterogeneous data types like text and visualizations (D3). We call this data structure a VITAframe. We discuss the underlying data domain for VITA in Section 4. REPL-style VITA workflows involve users creating and examining intermediate results that are also from the data domain D. An intermediate result with one-to-one correspondence with a VITAframe column (e.g., n reviews are featurized into n feature vectors) is added as a new column. However, these results may not have one-to-one correspondence (e.g., the dictionary of words in the set of n reviews) and are then stored in a separate data structure as metadata of the corresponding column. Therefore, for each column C_i ∈ C in a VITAframe, there is a schema specification function that assigns a domain d_j ∈ D to the column and a domain d_k ∈ D to each column metadata (see Figure 4). For example, in Figure 4, the column Review's type is Text and its metadata type is List(token) (i.e., the list of unique tokens in the text reviews). Metadata are often used for computing aggregate statistics and visualizations. For example, to visualize the TF-IDF scores of top-ranked words in the reviews, a user can first construct a new metadata of type Dictionary(token, float) that contains the average TF-IDF score of the unique tokens. They can then use the metadata to create and visualize a ranked list of words as a bar chart (see Figure 5c). Note that there can be columns with no metadata.

(a) { "data": { "view": "explorer.vitaframe", "source": "column", "name": "review" },
     "combine": [ {"class": "project", "type": "lowercase"},
                  {"class": "project", "type": "stopwords"},
                  {"class": "set", "type": "unique_tokens"} ],
     "actions": [ {"action": "update", ... }, {"action": "update", ... }, {"action": "add", ... } ] }

(b) { "data": { "view": "explorer.vitaframe", "source": "column", "name": "review" },
     "operator": { "class": "mutate", "type": "tf_idf" },
     "actions": [ {"action": "create", "view": "explorer.table", "target": "column", "name": "review_tf_idf" } ] }

(c) { "data": { ... },
     "combine": [ {"class": "aggregate", "type": "AVG"},
                  {"class": "visualize", "type": "bar"} ],
     "actions": [ {"action": "add", "view": "explorer.vitaframe", ... },
                  {"action": "create", "view": "explorer.summary", ... } ] }

Figure 5:
VTA specification. (a) Create a composite cleaning operator (lowercase and remove_stopwords) using combine and generate metadata (unique_tokens). (b) Create TF-IDF vectors from reviews using the mutate operator (added as a new column in Table View). (c) Create a bar chart by combining the aggregate (average TF-IDF score for each token) and visualize operators.
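The column-plus-metadata structure and the Figure 5 pipeline can be approximated in a short, stdlib-only sketch. The class and operator names below are ours, not Leam's actual API, and the TF-IDF weighting is a simplified stand-in for the mutate operator's real output.

```python
# Toy stand-in for a VITAframe: columns plus per-column type tags and metadata.
import math
import re
from collections import Counter

class VITAFrame:
    def __init__(self, columns):
        self.columns = dict(columns)   # column name -> list of values
        self.schema = {}               # column name -> domain tag, e.g. "Text"
        self.metadata = {}             # column name -> (domain tag, value)

    def project(self, col, fn):        # update column content in place
        self.columns[col] = [fn(v) for v in self.columns[col]]

    def mutate(self, col, new_col, fn):  # new representation as a new column
        self.columns[new_col] = [fn(v) for v in self.columns[col]]

reviews = ["The beds were plush and comfy.",
           "Friendly stuff, cheap rooms, comfy beds. Free wifi."]
vf = VITAFrame({"review": reviews})
vf.schema["review"] = "Text"

# (a) clean: lowercase, tokenize, drop a tiny stopword list; then a set
# operation whose result becomes column metadata (the "add" action)
stopwords = {"the", "and", "were"}
def tokens(s):
    return [w for w in re.findall(r"[a-z]+", s.lower()) if w not in stopwords]
vf.project("review", lambda s: " ".join(tokens(s)))
unique_tokens = sorted({w for s in vf.columns["review"] for w in s.split()})
vf.metadata["review"] = ("List(token)", unique_tokens)

# (b) mutate: simplified TF-IDF vectors, added as a new column
docs = [s.split() for s in vf.columns["review"]]
n = len(docs)
idf = {t: math.log(n / sum(t in d for d in docs)) + 1 for t in unique_tokens}
vf.mutate("review", "review_tf_idf",
          lambda s: {t: c * idf[t] for t, c in Counter(s.split()).items()})

# (c) aggregate: average TF-IDF per token, the Dictionary(token, float)
# metadata that a bar chart of top words would visualize
avg = {t: sum(v.get(t, 0.0) for v in vf.columns["review_tf_idf"]) / n
       for t in unique_tokens}
```

Note how each intermediate result either maps one-to-one onto rows (a new column) or summarizes the column (metadata), mirroring the distinction drawn above.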
VTA Compiler. In Section 4, we present VTA, an algebra for specifying VITA operations. Leam compiles the user interactions in Operator View (see Figure 1a) or the VTA commands in Notebook View (see Figure 1d) into VTA specifications. However, VTA specifications are incomplete, in the sense that they may omit details ranging from visual encodings, such as fonts and line widths, to input data types. To resolve these ambiguities, Leam currently uses a rule-based compiler that translates a VTA specification into lower-level operators for the execution engine to run backend computation (see Section 4).
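One way to picture such a rule-based compiler is as a lookup from {"class", "type"} pairs to concrete callables. The registry, the stopword default, and the function names below are our illustrative assumptions, not Leam's implementation; only the operator names follow Figure 5.

```python
# Illustrative sketch of rule-based VTA compilation: resolve each operator
# entry in a json spec to an executable step (registry contents are assumed).
OPERATORS = {
    ("project", "lowercase"): lambda rows: [r.lower() for r in rows],
    ("project", "stopwords"): lambda rows: [
        " ".join(w for w in r.split() if w not in {"the", "and"}) for r in rows],
    ("set", "unique_tokens"): lambda rows: sorted({w for r in rows for w in r.split()}),
}

def compile_spec(spec):
    """Translate each {"class", "type"} entry into a concrete callable."""
    return [OPERATORS[(op["class"], op["type"])] for op in spec.get("combine", [])]

spec = {"data": {"source": "column", "name": "review"},
        "combine": [{"class": "project", "type": "lowercase"},
                    {"class": "project", "type": "stopwords"},
                    {"class": "set", "type": "unique_tokens"}]}
steps = compile_spec(spec)
rows = ["The beds AND the wifi"]
for step in steps:
    rows = step(rows)   # rows ends as the sorted list of unique tokens
```

A real compiler would additionally fill in the omitted defaults (encodings, input types) before handing the lowered operators to the execution engine.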
Execution Engine.
The VITA execution engine takes the following input generated by the VTA compiler: (1) the input data schema in D and (2) the translated VTA specifications. The text analysis operations, namely data preprocessing (e.g., cleaning), feature extraction (e.g., TF-IDF featurization), feature transformation/selection (e.g., PCA), estimation (e.g., classify), and more advanced post-processing operations (e.g., anomaly detection), are mapped to existing ML and NLP libraries like Spacy and Scikit-learn, while the other VTA operators, related to visual coordination, are mapped to built-in implementations.

Version Control.
After each operation, Leam checkpoints the current state of the visualizations, the VITAframe, and the notebook commands. The development of a fully functional system for fine-grained (i.e., operation-level) versioning is currently in progress (D5). We discuss how to support fine-grained version management of a VITA session involving heterogeneous data types, notebook commands, and user interactions in Section 5.

Leam Front End
The Leam user interface has four components, Operator View, Table View, Visualization View, and Notebook View (see Figure 1), which enable users to perform in-place text analytics (D1). Users can perform various VITA operations using the operators in Operator View (see Figure 1a), for example, cleaning the data in Table View or adding visual summaries in Visualization View. The Table View (see Figure 1c) design is inspired by traditional spreadsheets and tabular data visualization tools [31] and enables users to directly operate on the data. Table View data can be transformed into visual summaries like bar charts and scatterplots using the visualization operators or VTA commands (see Figure 1). Moreover, a cell in Table View can also be a visualization, similar to tabular data analytics tools [31]. The visualizations in Visualization View (see Figure 1b) are displayed as a carousel of charts. Interactive visualizations are generated by translating the visualization operators or commands to Vega-Lite specifications [29]. The Notebook View design (see Figure 1d) is inspired by computational notebooks and enables users to author VITA workflows using a Python-based VTA library. The different views in Leam can be linked: interactions on one view are reflected in the other views (D2). Through VTA, users can declaratively specify interactive coordination (e.g., brushing-and-linking) between these views, which is translated to Vega signals [28].

4 VTA: AN ALGEBRA OF VITA
We now discuss our visual text algebra, VTA, in detail. We demonstrate how VTA captures various tasks in the usage example in Section 2 (see Figures 1, 5, and 6).
VTA Specification
Data domain.
VITA involves many data types, including different forms of text (e.g., words, tokens, sentences), complex data types like lists, vectors, and even visualizations. We define the data domain of VITA as D = {P, S}, where P and S are sets of primitive and synthesized (i.e., composite) data types. The primitive data types are taken from P = {α*, int, float, bool, datetime, List, Vector, Dictionary}. The domain α* is the set of finite strings over an alphabet α. The composite data types are taken from S = {Text, Visualization}. If a schema of the data is not specified upfront, the data is initially assumed to be from the domain α*. Each domain d_i ∈ D includes a generator function g_i : α* → d_i, which defines the rule for inferring the exact data types of the respective domain. For example, the composite Text data types (e.g., words, tokens, sentences) are generated using a context-free grammar [9]. The generator function of the visualization data type is defined based on Vega-Lite [29]: Visualization = (data, transforms, mark-type, encodings). VTA operators can be specified in json or as declarative commands that are available through a VTA library in Python.

Selection operator. VTA enables the specification of interaction through selections. Selection operations select data points of interest on which subsequent operations in the workflow may be performed (e.g., row(s) in Table View, visualization mark(s) in Visualization View). Supported selection types include a single point (e.g., a table row, marks like a bar or a circle in a chart), a list of points (e.g., rows, bars, or circles), or an interval of points (e.g., ten rows starting from row i, circles in a scatterplot within an x-axis range). The selection criteria are specified by a predicate that determines the set of selected points. Filter is an example of list selection where the selection predicate is the filtering condition. VTA also supports similar types of selections on visualizations (Figure 6a). Besides the json specification, users can also perform such selections by writing commands in Notebook View. For example, Figure 1F shows how a user can select a bar (e.g., the word "comfy") in the bar chart using a VTA command: vis1 = vis_view.get_vis(0); vis1.select("comfy").

[Figure 6. The coordinate specification for panel (a):
{ "data": { "view": "explorer.summary", ... },
  "coordinate": [ {"class": "select", "type": "single"},
                  {"class": "select", "type": "multi"} ],
  "actions": [ {"action": "update", "view": "explorer.summary"},
               {"action": "update", "view": "explorer.table"} ] }
The filtered Table View shows the reviews containing the selected word, e.g., "The beds were plush and comfy." and "Friendly stuff, cheap rooms, comfy beds. Free wifi."]
Figure 6: Multi-view coordination specification. (a) Use the coordinate operator to create a unidirectional coordination between the bar chart and Table View. Selecting a bar (word) in the bar chart triggers a filter operation on Table View by the selected word. (b) Creating a unidirectional coordination between the bar chart and scatterplot automatically links the three views (multi-view coordination). For the previous action on the bar chart, all relevant points in the scatterplot are selected, in addition to filtering Table View.
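The three selection types reduce to predicates over data points, and filter is just a list selection whose predicate is the filtering condition. The sketch below makes that reading concrete; the function names are ours, not the actual VTA API.

```python
# Sketch of VTA selection types as predicates over row indices; names are
# illustrative assumptions, not Leam's implementation.
import re

reviews = ["The beds were plush and comfy.",
           "Friendly stuff, cheap rooms, comfy beds. Free wifi."]

def select_single(i):                  # a single point, e.g. one table row
    return lambda idx: idx == i

def select_interval(start, count):     # an interval, e.g. ten rows from row i
    return lambda idx: start <= idx < start + count

def select_filter(word):               # list selection via a filtering predicate
    return lambda idx: word in re.findall(r"[a-z]+", reviews[idx].lower())

# Selecting the bar "comfy" resolves to the rows whose review contains it.
matched = [i for i in range(len(reviews)) if select_filter("comfy")(i)]
```

Subsequent operators in a workflow would then run only over the points the predicate admits.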
Transformation operators.
While developing Leam, we identified a set of core transformation operators that encompasses the various VITA workflows. These transforms manipulate the components of the selection they are applied to. Note that the core operator set is minimal, and there is room for adding more operators to make VTA more expressive (D4). The transformation operation has five subclasses: project, mutate, aggregate, set, and visualize. The project operators change the dimensionality (e.g., LDA, PCA) or cardinality (e.g., stopword removal) of the input data or update its content (e.g., lowercase). The mutate operator generates a new representation of the input data (e.g., create a list of tokens or feature vectors from text). The aggregate operator computes summary statistics of the input data (e.g., average review length in a corpus). The set operations enable set-like operators on the input data (e.g., get the unique tokens in the corpus). The visualize operator generates visualizations of the data. Figure 5 captures the data preprocessing phase of the workflow discussed in Section 2. Ada first cleans the reviews using project operators (Figure 5a), then applies a mutate operator (e.g., TF-IDF feature vector creation) to featurize the reviews (Figure 5b). Finally, Ada computes the average TF-IDF score of each word (aggregate) and visualizes the top-ranking words (visualize) using a bar chart (Figure 5c). Each operation takes an input from the given data domain D and generates an output that also belongs to the same domain D. An action defines how the resulting output should be used. An action can be of three types: add, create, and update. For the add action, the output becomes metadata of the input (e.g., metadata of a VITAframe column). For create, the output becomes part of the data the user directly operates on (e.g., create a new VITAframe column). For update, the output replaces the input data (e.g., update an existing VITAframe column).

Composite Transforms.
VTA currently supports two composite transforms that combine unit transformation operations: combine and synthesize. The combine operator enables users to specify an operation pipeline. In Figure 5a, a user combines two project operations with a set operation into a single operation. Similarly, in Figure 5c, a user combines the aggregate and visualize operators into a single operation for bar chart creation. The synthesize operator enables users to create new operations from these combinations, which can be reused later. For example, a user can synthesize the previous cleaning pipeline into a clean operator, which then becomes an operation in the Operator View and can be used later.
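Viewed functionally, combine is operator composition and synthesize registers a composition under a reusable name. The sketch below is a hedged reading of that semantics; the function names and the registry are illustrative assumptions, not Leam's code.

```python
# Illustrative sketch of combine (pipelining) and synthesize (naming and
# registering a pipeline for reuse); all names here are assumptions.
def combine(*ops):
    """Pipeline unit operators into a single operation."""
    def pipeline(data):
        for op in ops:
            data = op(data)
        return data
    return pipeline

lowercase = lambda rows: [r.lower() for r in rows]
strip_punct = lambda rows: ["".join(c for c in r if c.isalnum() or c.isspace())
                            for r in rows]

REGISTRY = {}  # stand-in for the Operator View's catalog of operators
def synthesize(name, op):
    REGISTRY[name] = op
    return op

clean = synthesize("clean", combine(lowercase, strip_punct))
```

After synthesis, `REGISTRY["clean"]` can be invoked like any unit operator, which is what makes the composition reusable in later workflows.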
So far, we have explained selections that are defined within a single view (e.g., Table View or Visualization View). However, selections that involve multiple views cannot be captured by the default single-view-based VTA specification. We define a coordination operator called coordinate that captures such multi-view coordination. Coordination can be either unidirectional or bidirectional. For example (Figure 6a), selecting a top-word bar in the bar chart and filtering the corresponding Table View rows is a unidirectional coordination. Adding a selection in the reverse direction, for example, changing the opacity of the bars in the bar chart based on the reviews selected in Table View, makes the coordination bidirectional. Currently, Leam supports only unidirectional coordination within a single specification; to create a bidirectional specification, users are required to specify two unidirectional specifications. Besides the direction of coordination, the coordinate operator needs to resolve the mapping between coordinated views and composite selections across views.
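A bidirectional coordination as two unidirectional handlers can be sketched as follows; the handler names and the view state variables are illustrative assumptions, not Leam's wiring.

```python
# Sketch of Figure 6a-style coordination: a "single" selection on the bar
# chart drives a "multi" selection on the table, and a second, independent
# handler covers the reverse direction (all names are assumptions).
import re

reviews = ["The beds were plush and comfy.",
           "Friendly stuff, cheap rooms, comfy beds. Free wifi."]
table_selection = []                     # state of the coordinated Table View
bar_opacity = {}                         # state of the coordinated bar chart

def coordinate_bar_to_table(word):
    # one-to-many mapping: one bar -> every row containing the selected token
    table_selection[:] = [i for i, r in enumerate(reviews)
                          if word in re.findall(r"[a-z]+", r.lower())]

def coordinate_table_to_bar(row_indices):
    # reverse direction: dim the bars whose token is absent from the selection
    selected = {w for i in row_indices
                for w in re.findall(r"[a-z]+", reviews[i].lower())}
    for w in ["comfy", "wifi", "plush"]:
        bar_opacity[w] = 1.0 if w in selected else 0.3

coordinate_bar_to_table("comfy")         # selecting the "comfy" bar
coordinate_table_to_bar([0])             # selecting the first review
```

Registering both handlers yields the bidirectional behavior that, in Leam today, requires two unidirectional specifications.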
Mapping coordination.
In Figure 6a, there is a one-to-many mapping from the bar chart to Table View: selecting a bar may filter multiple reviews that contain the top-word (e.g., two reviews contain the word "comfy"). In Figure 6b, there is a one-to-many mapping from the bar chart to the scatterplot and a one-to-one mapping from the scatterplot to Table View. Therefore, the coordinate operator needs to resolve the coordination mapping among views on the fly, so that the relevant visualization marks are selected or highlighted in the respective views. In our current VTA implementation, users are required to explicitly specify the mapping using the type tag in the respective select operators of the views. For example, in Figure 6a, the user selects type single (i.e., one) for the bar chart and multi (i.e., any) for Table View. We aim to automate the mapping process in the future by resolving the mapping between the underlying data of the respective views.

Resolving composite selections.
Users can add multiple single-view selections to a view. For example, in Figure 6b, the user adds a link from the bar chart to the scatterplot, which creates a multi-view coordination among the bar chart, scatterplot, and Table View. Selecting a bar in the bar chart highlights multiple circles (scatterplot) and reviews (Table View). However, following the top-word selection, the user may select a rectangular area in the scatterplot or select multiple reviews in the table. It is not clear whether we should deselect the previous selection and only highlight the current selection in the scatterplot, updating the corresponding bar chart and Table View selections, i.e., always perform independent selections. The other option is to perform a union or intersection among the selected data points of all the selections. Currently, Leam only supports independent selection. We outline advanced composite specifications such as union and intersection as future work.

5 Leam ENHANCEMENTS
We now outline our vision for Leam development.
Efficient storage model.
While VITAframe currently supports data that fits in memory, in practice datasets will be larger. The storage layer of Leam should therefore enable operations on both in-memory and disk-resident data—an important requirement for scalability. Instead of Pandas, we can leverage scalable dataframes such as Modin [25], which supports both main-memory and persistent out-of-core storage, allowing intermediate dataframes to exceed memory limitations. The extension would require integrating the VTA library in Modin and adapting its column definition to include the metadata schema. Another alternative is embedded analytical systems that tightly integrate an RDBMS with analytical tools and provide fast and efficient access to the data stored within them [26]. Leam employs PostgreSQL for session management and version control. Designing a storage layer that enables efficient data sharing, allowing seamless passing of data back and forth between VITA sessions and the RDBMS, is a possible research direction.
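The hybrid in-memory/disk-resident layout can be illustrated with a toy store that spills large intermediates to SQLite. The `SpillingStore` class and its row threshold are hypothetical stand-ins; Leam's actual layer would delegate to a scalable dataframe such as Modin or an embedded analytical RDBMS rather than raw SQLite.

```python
# Sketch of a storage layer that keeps small intermediates in memory
# and spills larger ones to disk-resident SQLite. Class name and
# threshold are illustrative assumptions, not Leam's implementation.
import sqlite3

class SpillingStore:
    def __init__(self, max_rows_in_memory=1000, path=":memory:"):
        self.max_rows = max_rows_in_memory
        self.mem = {}                      # name -> list of rows
        self.db = sqlite3.connect(path)    # a file path in practice

    def put(self, name, rows):
        if len(rows) <= self.max_rows:
            self.mem[name] = rows          # small: stay in memory
        else:                              # large: spill to disk
            self.db.execute(f'CREATE TABLE "{name}" (val TEXT)')
            self.db.executemany(f'INSERT INTO "{name}" VALUES (?)',
                                [(r,) for r in rows])

    def count(self, name):
        # Callers see one interface regardless of where the data lives.
        if name in self.mem:
            return len(self.mem[name])
        cur = self.db.execute(f'SELECT COUNT(*) FROM "{name}"')
        return cur.fetchone()[0]

store = SpillingStore(max_rows_in_memory=2)
store.put("small", ["a", "b"])            # stays in memory
store.put("large", ["x"] * 10_000)        # spills to SQLite
print(store.count("small"), store.count("large"))  # 2 10000
```

The key design point is that operators address intermediates by name through a single interface, so the spill decision stays invisible to the rest of the pipeline.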
Interaction at scale.
One of the primary requirements of analytics tools like Leam is to ensure interactivity even with larger datasets. One approach is to allow users to select a sample of the data upfront via the selection operator and then conduct the rest of the session on the sample for preliminary EDA—a common practice among data scientists [38]. However, this approach still does not solve the scalability problem. An alternative is to pursue system-driven sampling, where the goal is to draw samples of the data progressively and then display approximate results. We call this approach Pro-VITA, i.e., progressive VITA. How do we provide users with approximate, yet informative, intermediate responses? While we can leverage existing work from the DB community on approximate query processing to support operations like aggregation [1–3, 15] and visualization [11, 13, 27], it is not clear how to provide meaningful intermediate results progressively for operations like classification or clustering. For example, how do we determine crucial model meta-parameters without having access to the complete data? Progressive computation can be complemented by optimistic analytics [24], where precise computations run in the background as users explore approximate results. When there is a significant difference between the approximate and precise results (e.g., classification results vary from the ground truth), the analyst can decide which parts of the exploration have to be redone. Larger datasets also impede direct data manipulation, a common problem of spreadsheet-style interfaces, and necessitate the design of interactive and navigable representations of the data [5].
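For aggregation, the Pro-VITA idea can be sketched with a progressive running mean that reports a confidence interval after each batch and stops once the estimate is tight enough. The batch size and tolerance below are illustrative assumptions.

```python
# Sketch of Pro-VITA-style progressive aggregation: draw growing
# random samples, report a running mean with an approximate 95%
# confidence interval, and stop when the interval is tight enough.
import random
import statistics

def progressive_mean(data, batch=200, tol=0.05, seed=0):
    rng = random.Random(seed)
    order = list(data)
    rng.shuffle(order)           # one shuffle = sampling w/o replacement
    seen = []
    for i in range(0, len(order), batch):
        seen.extend(order[i:i + batch])
        mean = statistics.fmean(seen)
        stderr = statistics.stdev(seen) / len(seen) ** 0.5
        yield mean, 1.96 * stderr        # estimate, ~95% CI half-width
        if 1.96 * stderr < tol * mean:   # tight enough: stop early
            return

# Simulated review lengths; the analyst sees usable estimates long
# before a full scan of the 100,000 rows would finish.
lengths = [i % 50 + 10 for i in range(100_000)]
for est, ci in progressive_mean(lengths):
    print(f"mean ~ {est:.1f} +/- {ci:.1f}")
```

This works because aggregates have well-understood error bounds; as the text notes, no comparably simple bound exists for intermediate classification or clustering results, which is what makes Pro-VITA a research question rather than an engineering task.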
Versioning VITA sessions.
The Leam versioning system can maintain a version graph to keep track of fine-grained changes at the unit-operation level. Leam needs to consider the storage-latency trade-off as a user adds new nodes to the graph: storing the entire data ensures faster session reconstruction at the cost of storage, while storing deltas between subsequent sessions reduces storage overhead at the cost of increased reconstruction time. Designing a fine-grained version control system for Leam offers unique research challenges—besides VITAframe, Leam also needs to checkpoint (a) the states of all the front-end components (e.g., formatting such as font, color, and opacity of views), (b) the coordination mappings of views and composite selections, (c) user-defined operator pipelines and custom models, and (d) VTA commands in Notebook View. Existing systems address some of these challenges in isolation (e.g., data [16] and model [22] versioning, workflow debugging [6, 23]).
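The storage-latency trade-off in such a version graph can be sketched with a chain that stores a full snapshot every k nodes and deltas in between, so checkout replays at most k-1 deltas. The class, the snapshot interval, and the dict-based state are illustrative assumptions; a real system would also track deletions and the non-data state listed above.

```python
# Sketch of the snapshot-vs-delta trade-off in a version graph:
# every k-th commit stores full state, others store only the changed
# keys, bounding both storage overhead and reconstruction work.

class VersionGraph:
    def __init__(self, snapshot_every=4):
        self.k = snapshot_every
        self.nodes = []   # each: ("full", state) or ("delta", changes)

    def commit(self, state, prev_state):
        if len(self.nodes) % self.k == 0:
            self.nodes.append(("full", dict(state)))      # fast checkout
        else:
            delta = {key: v for key, v in state.items()   # small storage
                     if prev_state.get(key) != v}
            self.nodes.append(("delta", delta))

    def checkout(self, version):
        # Walk back to the nearest full snapshot, then replay deltas.
        base = version - version % self.k
        state = dict(self.nodes[base][1])
        for _, delta in self.nodes[base + 1:version + 1]:
            state.update(delta)
        return state

g = VersionGraph(snapshot_every=2)
prev = {}
for s in [{"a": 1}, {"a": 1, "b": 2}, {"a": 3, "b": 2}]:
    g.commit(s, prev)
    prev = s
print(g.checkout(1))  # {'a': 1, 'b': 2}
```

Tuning `snapshot_every` is exactly the storage-latency dial: larger values shrink storage but lengthen the delta replay on reconstruction.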
VTA: coverage, accessibility, and automation.
Our goal is to increase Leam's coverage of VITA workflows [21] by introducing new VTA operators and adding popular ML and NLP libraries [30] as default operators in Operator View. We can further improve Leam's extensibility by enabling users to add their custom-built models as new operators in Operator View and reuse them later. To make VTA more accessible to a wider audience, we are working on integrating VTA with an interactive widget that allows users to issue VTA operators from Jupyter notebooks. Other goals include automatically generating VITA workflows given an analysis goal [4], recommending the next operator based on a user's current workflow [37], and training an autoregressive language model like GPT-3 [7] on VTA to automatically compose coordinated views or a Leam-style user interface based on user specifications in natural language.
VITA workflow optimization.
Operator pipelines in a VITA workflow can be executed in different orders. For example, text reviews can be tokenized first and then cleaned, or the other way around. However, tokenizing a cleaned text is less expensive due to its smaller cardinality compared to the original text. We can leverage VTA to design a VITA workflow optimizer, similar to a query optimizer in databases. Other approaches to workflow optimization can use parallel execution similar to Modin [25], enabling distributed processing of partitions of a VITAframe to speed up computation. Designing an optimizer for progressive computation is another interesting research direction [12, 14].
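The clean-before-tokenize reordering can be sketched with a greedy heuristic that runs output-shrinking operators first. The operators, the selectivity table, and the `plan` function are hypothetical; a real VTA optimizer would estimate selectivities from data statistics and also check that reorderings preserve operator input types.

```python
# Toy cost-based reordering in the spirit of a VITA workflow optimizer:
# run cardinality-reducing operators (clean) before expensive per-token
# operators (tokenize). Selectivities are assumed, not measured.
import re

STOPWORDS = {"the", "a", "is"}

def clean(texts):
    # Drops stopwords, shrinking the text every later operator must scan.
    return [" ".join(w for w in t.split() if w not in STOPWORDS)
            for t in texts]

def tokenize(texts):
    return [re.findall(r"\w+", t) for t in texts]

def plan(ops):
    # Greedy heuristic: most selective (output-shrinking) operator first.
    selectivity = {clean: 0.6, tokenize: 1.0}  # assumed output ratios
    return sorted(ops, key=lambda op: selectivity[op])

texts = ["the room is clean", "a comfy bed"]
out = texts
for op in plan([tokenize, clean]):  # optimizer picks clean -> tokenize
    out = op(out)
print(out)  # [['room', 'clean'], ['comfy', 'bed']]
```

As in a database optimizer, the user states *what* pipeline to run while the system chooses the cheapest equivalent order, which is what makes an algebra like VTA a natural substrate for this optimization.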
Evaluation of Leam.
While we demonstrate the expressivity and on-demand workflow authoring capabilities of VTA with several usage examples, we did not report on the performance of other Leam components. Future iterations should include experiments that evaluate the storage model's efficiency in managing large datasets, the storage overhead and responsiveness of the versioning system, the performance of the VTA query optimizer, and the overall interactivity of Leam. Moreover, user studies should be conducted to evaluate the usability of Leam.
RELATED WORK
We now summarize existing work on VITA, data management fordata science, and domain-specific algebra design.
Visual interactive text analytics.
VITA systems employ visualization techniques—both basic (e.g., scatterplot, line chart, treemap) and complex (e.g., wordcloud, stream graph, flow graph) [21]—for numerous use cases like review exploration [38], sentiment analysis [18], and text summarization [8]. While these earlier works highlight the appeal of integrating interactive visualization with text analysis, they lack the generalizability of Leam, where users can employ an expressive algebra to author different VITA use cases within a single system.
Data management for VITA.
Prior work focuses on designing systems for scalable computation (e.g., scalable dataframes and query optimization [25], caching and prefetching for visualization [34]) and storage models for efficient data access [26, 35]. We discussed related work on versioning [6, 16, 22, 23] and approximate query processing [1–3, 13, 27] in earlier sections. Leam builds on this earlier work with a specific focus on developing an efficient storage model, enabling scalable computation, and performing fine-grained version control.
Data model and algebra.
Our work takes inspiration from existing algebras that provide well-founded semantics for relational databases [10], dataframes [17, 25], and interactive visualizations [28, 29, 33]. Building on this earlier work, we introduce a new grammar for visual text data analytics and interactive view coordination. To the best of our knowledge, VTA is the first algebra defined for VITA.
CONCLUSION
This paper presents our vision for Leam, an integrated system that supports VITA workflows end to end. Leam's design is based on several considerations that we derived by identifying existing challenges in developing VITA systems. Leam enables users to perform interactive text analysis in situ—direct data manipulation (Table View), REPL-style analytics (Notebook View), and coordinated visual data exploration (Visualization View). We introduce a novel algebra for visual text analysis, VTA, that provides a suite of operators to author any VITA workflow on demand and to enable different modes of coordination among views. We present our current progress in developing Leam's underlying data management system and outline several research directions related to VTA extensibility and coverage, and to the storage, computation, and versioning of data and VITA workflows. Addressing these challenges requires interdisciplinary research efforts from the DB, NLP, HCI, and Visualization communities.
REFERENCES
[1] Acharya et al. 1999. The Aqua approximate query answering system. In ACM SIGMOD. 574–576.
[2] Agarwal et al. 2013. BlinkDB: queries with bounded errors and bounded response times on very large data. In ACM EUROSYS. 29–42.
[3] Babcock et al. 2003. Dynamic sample selection for approximate query processing. In ACM SIGMOD. 539–550.
[4] Bar et al. 2020. Automatically generating data exploration sessions using deep reinforcement learning. In ACM SIGMOD. 1527–1537.
[5] Bendre et al. 2019. Faster, higher, stronger: Redesigning spreadsheets for scale. In ICDE. IEEE, 1972–1975.
[6] Brachmann et al. 2020. Your notebook is not crumby enough, REPLace it. In Proc. CIDR.
[7] Brown et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020).
[8] Carenini et al. 2006. Interactive multimedia summaries of evaluative text. In IUI. 124–131.
[9] Eugene Charniak. 1997. Statistical parsing with a context-free grammar and word statistics. In AAAI/IAAI.
[10] E. F. Codd. 1970. A relational model of data for large shared data banks. Commun. ACM 13, 6 (1970), 377–387.
[11] Fisher et al. 2012. Trust me, I'm partially right: incremental visualization lets analysts explore large datasets faster. In SIGCHI. 1673–1682.
[12] Haas et al. 1996. Selectivity and cost estimation for joins based on random sampling. J. Comput. System Sci. 52, 3 (1996), 550–569.
[13] Hellerstein et al. 1997. Online aggregation. In SIGMOD. 171–182.
[14] Hou et al. 1988. Statistical estimators for relational algebra expressions. In PODS. 276–287.
[15] Hou et al. 1989. Processing aggregate relational queries with hard time constraints. In ACM SIGMOD. 68–77.
[16] Huang et al. 2017. ORPHEUSDB: Bolt-on Versioning for Relational Databases. PVLDB 10, 10 (2017).
[17] Hutchison et al. 2017. LaraDB: A minimalist kernel for linear and relational algebra computation. In ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond. 1–10.
[18] Kucher et al. 2018. The state of the art in sentiment visualization. In Computer Graphics Forum. 71–96.
[19] Lajus et al. 2014. Efficient data management and statistics with zero-copy integration. In SSDBM. 1–10.
[20] Lee et al. 2020. Demystifying a Dark Art: Understanding Real-World Machine Learning Model Development. arXiv preprint arXiv:2005.01520 (2020).
[21] Liu et al. 2018. Bridging text visualization and mining: A task-driven survey. IEEE TVCG 25, 7 (2018), 2482–2504.
[22] Miao et al. 2016. ModelHub: Towards unified data and lifecycle management for deep learning. arXiv preprint arXiv:1611.06224 (2016).
[23] Miao et al. 2016. ProvDB: A system for lifecycle management of collaborative analysis workflows. arXiv preprint arXiv:1610.04963 (2016).
[24] Moritz et al. 2017. Trust, but verify: Optimistic visualizations of approximate queries for exploring big data. In SIGCHI. 2904–2915.
[25] Petersohn et al. 2020. Towards Scalable Dataframe Systems. PVLDB 13, 11 (2020), 2033–2046.
[26] Raasveldt et al. [n.d.]. Data Management for Data Science - Towards Embedded Analytics.
[27] Rahman et al. 2017. I've seen "enough": incrementally improving visualizations to support rapid decision making. PVLDB 10, 11 (2017), 1262–1273.
[28] Satyanarayan et al. 2015. Reactive Vega: A streaming dataflow architecture for declarative interactive visualization. IEEE TVCG 22, 1 (2015), 659–668.
[29] Satyanarayan et al. 2016. Vega-Lite: A grammar of interactive graphics. IEEE TVCG 23, 1 (2016), 341–350.
[30] Smith et al. 2020. The machine learning bazaar: Harnessing the ML ecosystem for effective system development. In ACM SIGMOD. 785–800.
[31] Spenke et al. 1996. FOCUS: the interactive table for product comparison and selection. In ACM UIST. 41–50.
[32] Statista. 2020. Retail e-commerce sales worldwide from 2014 to 2023.
[33] Stolte et al. 2002. Polaris: A system for query, analysis, and visualization of multidimensional relational databases. IEEE TVCG 8, 1 (2002), 52–65.
[34] Tao et al. [n.d.]. Kyrix: Interactive Visual Data Exploration at Scale.
[35] TileDB Core Team. 2020. TileDB: The Universal Storage Engine. TileDB, Inc., USA. https://tiledb.com/
[36] Wang et al. 2000. Guidelines for using multiple views in information visualization. In Proc. AVI. 110–119.
[37] Yan et al. 2020. Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks. In ACM SIGMOD. 1539–1554.
[38] Zhang et al. 2020. Teddy: A System for Interactive Review Analysis. In SIGCHI. 1–13.