Taggle: Combining Overview and Details in Tabular Data Visualizations

Abstract

Most tabular data visualization techniques focus on overviews, yet many practical analysis tasks are concerned with investigating individual items of interest. At the same time, relating an item to the rest of a potentially large table is important. In this work we present Taggle, a tabular visualization technique for exploring and presenting large and complex tables. Taggle takes an item-centric, spreadsheet-like approach, visualizing each row in the source data individually using visual encodings for the cells. At the same time, Taggle introduces data-driven aggregation of data subsets. The aggregation strategy is complemented by interaction methods tailored to answer specific analysis questions, such as sorting based on multiple columns and rich data selection and filtering capabilities. We demonstrate Taggle using a case study conducted by a domain expert on complex genomics data analysis for the purpose of drug discovery.


Journal Title XX(X):1–16

SAGE

Katarina Furmanova*, Samuel Gratzl*, Holger Stitz, Thomas Zichner, Miroslava Jaresova, Alexander Lex, and Marc Streit

Keywords

Visualization techniques, tabular data, multidimensional data visualization, aggregation, hierarchical grouping and sorting, degree of interest, focus and context.

Introduction

Visualization of tabular or multidimensional data is important in many application domains and is a mainstay of visualization research. Many multidimensional data visualization techniques, however, focus on providing overviews. To answer questions about the high-level similarity of items, projection techniques have proven useful, and exploring correlations between dimensions is well addressed by axes-based techniques such as scatterplot matrices and parallel coordinates plots. A third type of task is concerned with understanding the properties of an item in all dimensions, which is well addressed by tabular techniques. Tabular techniques use a spreadsheet-like layout, with each item in a row and each dimension in a column. In contrast to spreadsheets, the cells use visual encodings to make the data easy to view and to make it possible to explore higher level trends. Prominent examples of tabular visualization are the Table Lens, Bertifier, LineUp, and ComplexHeatmap.

A shortcoming of current tabular visualization techniques is their lack of sophisticated focus and context. A common solution implemented in both the Table Lens and LineUp is to scale down the rows in the visualization, and then use geometric distortion (lenses) to reveal details about selected items. Distortion, however, is associated with a variety of drawbacks, such as the difficulty of maintaining object constancy. Also, lens-based approaches in tables rely on linear orderings, which cannot leverage higher level semantics of the data to provide compact but meaningful aggregations. Aggregation approaches based on grouping, in contrast, can stratify a table in a data-driven and hence semantically meaningful way.

Our primary contribution is Taggle, a tabular visualization method that displays large tabular datasets with up to a million data items by selectively grouping and aggregating subsets of a dataset. The goal of Taggle is to provide a high-level overview of large tabular datasets while allowing users to drill down to individual items. Groupings and aggregations of rows can be dynamically defined by users using selection, or in a data-driven way based on categorical or numerical dimensions. Hierarchical combinations of aggregations enable fine-grained control of what to show in a dataset at which level of detail. Taggle also introduces grouping and aggregation of columns for cases where columns represent data of the same type, as, for example, in time-series data. The grouping and aggregation capabilities are complemented by sorting and filtering techniques.

We showcase Taggle using a public health dataset: the spread of AIDS across the nations of the world. We also demonstrate Taggle using a variety of datasets, including a dataset of soccer players, programming language popularity, world happiness measures, economic data, and many others at https://taggle.caleydoapp.org/.

Masaryk University, Brno, Czech Republic
datavisyn GmbH, Linz, Austria
Johannes Kepler University, Linz, Austria
Boehringer Ingelheim RCV GmbH & Co KG., Vienna, Austria
Czechitas, z.s., Prague, Czech Republic
University of Utah, Salt Lake City, UT, USA
* These authors contributed equally to this work.

Corresponding author:

Marc Streit, Johannes Kepler University Linz, Institute of Computer Graphics, Altenbergerstraße 69, 4020 Linz, Austria
Email: [email protected]

Prepared using sagej.cls [Version: 2017/01/17 v1.20]

Figure 1. The Taggle interface consisting of a table view (a) and a data selection panel (b) showing a dataset on AIDS in several countries grouped by continent and level of human development index. The data selection panel consists of grouping (c) and sorting (d) hierarchy panels and attribute filter views that allow users to filter out records by interacting with the histograms. The rows with individual African countries indicate the relationship between new HIV infections and AIDS-related deaths over time. It can be seen that an outburst of new HIV infections in the 1990s in southern African countries resulted in high AIDS-related death rates about a decade later in the 2000s (e). The rows of countries in Asia, Europe, and North America have been aggregated to histograms, box plots, and stacked bars (f).

We demonstrate Taggle’s utility by means of a case study on analyzing a cancer genomics dataset for the purpose of drug discovery.

Tabular Data

Throughout this paper, we use an AIDS dataset from UNAIDS AIDSinfo*1 as a guiding example. This dataset was enriched with metadata about the countries, such as population, which we retrieved from the United Nations Population Division*2 and the yearly Human Development Report of the United Nations Development Programme*3. The combined dataset consists of 17 numerical columns (e.g., population, sex before the age of 15 in percent), 4 categorical columns (e.g., continent, human development index), and 10 time-series matrices (e.g., AIDS-related deaths or new HIV infections over a period of 27 years) collected for 160 countries.

Tabular datasets are usually composed of items stored in rows, which often correspond to independent variables (countries, in our example), and values (i.e., observations about these variables) stored in columns, which commonly correspond to dependent variables (e.g., population or continent, in our example). Lex et al. discuss heterogeneity and sources of heterogeneity in tabular data: semantics (the columns in the table have different meanings); characteristics (the columns have different data types and value ranges); and statistics (the columns have different behaviors or distributions).

Homogeneous datasets lend themselves to compact and simple visual representations, as all data items share the same meaning and scales. Heatmaps, for example, are well suited to homogeneous datasets, as they encode each cell with a color value, which makes it possible to represent individual items at minimal scale.

Heterogeneous datasets have different semantics, characteristics, and statistics. Consequently, they may need separate scales and visual representations for each column. For instance, the population is given in absolute numbers and sex before the age of 15 is stated in percent.

We distinguish between the following data types.

Attribute columns are columns where all associated records are of the same type and semantics, such as the name, gender, and age columns in a table of people. Attributes can be categorical, numerical, temporal (date and time), or textual.

Matrices are composed of attribute columns of the same semantics and data type, as is commonly found in, but not exclusive to, time series. An example is a country’s GDP over multiple years, where each year is a column in the matrix. A non-time-series example, common in the field of genomics, is a gene expression dataset, where the rows are genes and each patient is a column in the matrix. Although it is possible to interpret matrices as a list of columns, it is beneficial to treat them as a matrix, because the homogeneity of the data is an opportunity for compact representation. The columns in matrices can also be associated with attributes that describe a common property of the column, such as the decade associated with a year, or a shared phenotype of patients.
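The distinction between attribute columns and matrices can be made concrete with a small data model. The following is only an illustrative sketch, not Taggle's actual implementation: the class names (`AttributeColumn`, `Matrix`, `Table`), the `validate` helper, and all values are assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AttributeColumn:
    """One column whose records share a single type and meaning."""
    name: str
    kind: str  # "categorical", "numerical", "temporal", or "textual"
    values: List[object]  # one entry per item; None marks a missing value


@dataclass
class Matrix:
    """A block of homogeneous columns, e.g. one column per year."""
    name: str
    column_keys: List[str]  # second key identifying each matrix column
    rows: List[List[Optional[float]]]  # rows[i][j]: item i, matrix column j
    # Attributes describing the matrix columns, e.g. the decade of each year.
    column_attributes: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class Table:
    item_keys: List[str]  # one key per row, e.g. country names
    attributes: List[AttributeColumn]
    matrices: List[Matrix]

    def validate(self) -> None:
        """Raise ValueError if a column disagrees with the item or column keys."""
        n = len(self.item_keys)
        for col in self.attributes:
            if len(col.values) != n:
                raise ValueError(f"{col.name}: expected {n} values")
        for m in self.matrices:
            if len(m.rows) != n:
                raise ValueError(f"{m.name}: expected {n} rows")
            width = len(m.column_keys)
            if any(len(row) != width for row in m.rows):
                raise ValueError(f"{m.name}: expected {width} cells per row")
            for attr, labels in m.column_attributes.items():
                if len(labels) != width:
                    raise ValueError(f"{m.name}.{attr}: expected {width} labels")


# Made-up values for two placeholder countries, for illustration only.
table = Table(
    item_keys=["Country A", "Country B"],
    attributes=[
        AttributeColumn("continent", "categorical", ["Africa", "Europe"]),
        AttributeColumn("population", "numerical", [1_000_000, None]),
    ],
    matrices=[
        Matrix(
            name="new HIV infections",
            column_keys=["1999", "2000", "2001"],
            rows=[[4.0, 3.5, 3.1], [0.2, None, 0.1]],
            column_attributes={"decade": ["1990s", "2000s", "2000s"]},
        )
    ],
)
table.validate()
```

Keeping a matrix as one object, rather than as a list of unrelated columns, is what allows a single shared scale and the column-level attributes (here the hypothetical `decade` labels) to be attached to it.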

Design Goals

Based on discussions with experts from various application domains who regularly analyze large tabular datasets, literature reviews, and our own experience, we developed a set of design goals for Taggle. Our first goal is to develop an item-centric visualization technique that also explicitly shows all dimensions relevant to an analysis task. This goal by itself is addressed by prior tabular data visualization techniques, but currently no tabular data visualization technique addresses our second goal: providing a seamless combination of overview and details through selective, data-driven aggregation. A technique that satisfies this goal would remedy the major drawback of tabular data visualization techniques: limited context. Current tabular visualization techniques can provide context only by showing neighbors through a single, global sorting, which makes it difficult to compare items of different categories. This design goal is hence concerned with showing the details about selected items and providing context, e.g., through aggregations of data-driven groups.

To fully leverage the potential of an overview plus detail tabular data visualization technique, we need to give users the ability to flexibly define the parameters of the display. To address that, our third goal is to provide rich interaction techniques that support answering specific questions, such as sorting, filtering, and grouping. Finally, to appropriately visualize the diverse data types and different levels of aggregations, we need to provide a variety of visual encodings suitable for the specific situations. One goal is to provide sensible defaults, but we also need to provide the ability to flexibly choose visual encodings tailored to data types and aggregation levels, to account for the diversity of analysis questions and dataset characteristics.

*1 http://aidsinfo.unaids.org/
*3 http://hdr.undp.org/

Related Work

We discuss related work in light of two considerations: (1) a review of tabular data visualization techniques, and (2) approaches to aggregation.

Tabular Data Visualization

Since tabular data analysis plays an important role in many research fields, a substantial body of work exists on visualizing such data. We distinguish between four types of tabular data visualization techniques:

1. dimensionality reduction techniques, which show a lower dimensional projection of a high-dimensional dataset,
2. axes-based techniques, which position marks for each cell based on its value, such as parallel coordinates, star plots, and scatterplot matrices,
3. tabular techniques, which retain item positions across columns and encode the data within the cells,
4. multiple coordinated view (MCV) and hybrid techniques, which show visualizations of individual dimensions or subsets of attributes in separate but linked views.

Our four types of tabular data visualization techniques are related to the three families proposed by Dimara et al. In their work, they distinguish between lossy and lossless geometric projection techniques. Lossy techniques do not preserve the raw values of individual dimensions, i.e., this category corresponds to the dimensionality reduction techniques. Their family of lossless techniques includes axes-based and tabular techniques, which we keep separate, as they employ different data encoding principles.

Dimensionality Reduction Techniques

Projection or dimensionality reduction techniques visualize the structure of items associated with high-dimensional data in a lower dimensional space. There are various commonly used approaches, such as principal component analysis, multidimensional scaling techniques, or t-SNE. For data visualization, usually a 2D or sometimes also a 3D representation of the projected items is displayed. These low-dimensional projections show groups of similar items close to each other. One problem of projections is that they can produce artifacts that place quite different items in close proximity. A variety of techniques have been proposed to address this and related shortcomings. Another challenge with dimensionality reduction is the sensitivity of the results to the choice of algorithm and to its parameters, which often makes an iterative approach with multiple parameters and/or algorithms necessary.

A special case of dimensionality reduction is to turn relationships and items into a network, and then render that network using, for example, force-directed layout algorithms. Examples of this approach are Ploceus, Orion, and Origraph.

We argue that projection techniques are well suited to visualize structure in a high-dimensional dataset, but they cannot adequately show why items in a cluster belong together. Projection techniques are especially useful in cases where the dimensions themselves are not meaningful to human analysts, such as a table of term frequencies when analyzing text documents. Taggle is concerned with exactly the opposite use cases: where the properties of the dimensions are critical in making decisions.

Axes-based Techniques

Axes-based techniques use axes representing individual attributes and spatially encode the items’ values. Key examples are scatterplot matrices, which place scatterplots consisting of orthogonal axes to show pairwise relationships between attributes in a matrix, and parallel coordinates, which place axes in parallel and connect individual items to their position on the axes using polylines. Variations of parallel coordinates are star plots, where all axes originate from a common point, or other, more general axes-based layouts. Generalizations of axes-based techniques include FLINA, a technique that lets users flexibly arrange axes and choose between connection lines or dots, and GPLOM, which generalizes the scatterplot matrix idea to other visualization techniques shown in the cells.

Axes-based techniques can effectively show correlations between neighboring axes. However, the quality of insights depends on the order of the axes. Other limitations are the visual clutter caused by crossing polylines and the fact that axes-based techniques are problematic for encoding categorical and textual attributes.

Tabular Techniques

Tabular visualization techniques use a grid layout where rows represent items and columns dimensions (although the inverse is also possible); the value of each item in each dimension is encoded in a cell. Within the class of tabular techniques, we further distinguish tabular visualizations for homogeneous tables, visualizations for heterogeneous tables, and spreadsheet tools. An overview summarizing the features and supported tasks of individual tabular visualization techniques listed in this section can be found in Table 1 of the supplementary material.

The prototypical example of a homogeneous tabular visualization technique is a heatmap, where cell values are encoded using color (hue, saturation, value, or opacity). Homogeneous table visualization tools are useful for data that has the same type and scale across all dimensions (matrices, according to our definition in Section Tabular Data). Heatmaps are exceptionally scalable, as the cells can be allocated as little as a single pixel of space. A key aspect is to find good orderings of the rows and columns, which is often done using clustering or seriation approaches. Visualization tools that provide advanced features for heatmaps include the Hierarchical Cluster Explorer, GAP, PermutMatrix, Clustergrammer, and SmartExplore. Taggle can efficiently visualize homogeneous tables, but in contrast to the techniques discussed here, Taggle also supports heterogeneous tables, and can combine homogeneous parts of a heterogeneous table (matrices) and heterogeneous columns in a single visualization.

The Table Lens is a tabular visualization technique suitable for heterogeneous tables. It is probably the most closely related technique to Taggle and inspired its development. It uses visual encodings tailored to different data types to represent values in cells. Rich sorting operations allow users to compare trends between separate attributes. Scalability is achieved by downscaling rows, and a combination of appropriately chosen visual encodings and lens techniques ensures readability of trends and individual items. The most important difference from Taggle is that the Table Lens does not support aggregation and is therefore limited in terms of scalability. Taggle also introduces a variety of subtle new ideas, such as embedding space-efficient techniques for homogeneous subsets of a table. A variety of tools, such as DataComb, the Visual Spreadsheet, and the table views in some multivariate tree and network visualization tools implement ideas of the Table Lens. Another technique employing various visual encodings suitable for heterogeneous tables is Bertifier. It was inspired by Jacques Bertin’s matrix analysis methods and supports interactive data reordering based on similarities between rows and columns. However, the technique is intended mainly for presenting small- or medium-sized tables.

Widely used spreadsheet tools, such as Microsoft Excel*4, Google Sheets*5, and Apache OpenOffice Calc*6, typically support tabular operations such as sorting, filtering, and grouping. However, although spreadsheet tools usually support rich charting operations, they provide only limited support for direct visual encoding of cells, using techniques such as conditional formatting.

FOCUS and its successor InfoZoom are hybrid spreadsheet/tabular visualization tools. In addition to the Table-Lens-like layout, InfoZoom provides an overview mode that shows the distribution of values for individual attributes, sorting each attribute row individually. Although this provides an overview of the distribution of values, it is no longer a tabular layout.

Multiple Coordinated View Techniques and Hybrids

Multiple coordinated view (MCV) systems represent (sets of) attributes of a tabular dataset in separate, linked views. These systems allow users to choose representations that are suitable for the subset of data represented by a single view, and usually rely on linked highlighting to highlight the same items in different views. Representative systems in this category include Improvise and Keshif. Common configurations of Keshif, for example, use a tabular view to identify specific items, but represent other attributes in other views using histograms or bar charts.

Although MCV systems can leverage visualization techniques that are ideal for certain attributes and that would potentially not fit into the confines of a tabular layout, they also add complexity and increase the cognitive load for the user. Tabular layouts, in contrast, make the association of all attributes to their item easy, but make it harder to see correlations between attributes or trends across the whole dataset.

As the Keshif example shows, tabular visualization techniques, such as Taggle, are an ideal complement to MCV systems: although selected attributes can be shown in dedicated views, for example, on a map or in a node-link layout, other attributes can be shown as part of the tabular visualization.

Note that the line between MCVs and other techniques is fluid; a scatterplot matrix, for example, can be considered as both an axes-based technique and an MCV system.

Hybrid approaches that use multiple views and combine overview and tabular approaches or overview and projection approaches are also available. In hybrid overview-tabular approaches, the rows are preserved within subsets of the data, but the relationships between subsets are visualized using an overview technique. Examples of this class include NodeTrix, VisBricks, StratomeX, Domino, and Furby. In hybrid overview-projection approaches, selected attributes are plotted on top of a plot of projected data, as in the technique developed by Stahnke et al. Domino is a hybrid tabular/overview MCV technique. It is based on the concept of placing subsets of a dataset on a canvas and choosing a suitable representation (view) for each. Multiple subsets can then be connected to show their relationships in various ways. Matchmaker, VisBricks, and StratomeX are related hybrid techniques, but they are more restricted with respect to the selection and layout of subsets.

Aggregation Methods

Orthogonal to the design space discussed above are aggregation methods for tabular data: representing the underlying distribution or statistical measures of a set of items is an important approach to increasing the scalability of visualization techniques. Aggregation can be applied to a whole dataset or to multiple groups of items and/or attributes separately. Elmqvist and Fekete proposed several design guidelines for aggregation, including: Visual Summary (aggregates should convey information about the underlying data); Discriminability (aggregates can easily be distinguished from individual data items); and Fidelity (measures are taken to counteract artifacts of the aggregation process that misrepresent true effects). The aggregation techniques in Taggle were designed with these guidelines in mind.

Examples of overview techniques using aggregation are hierarchical parallel coordinates, which visualize cluster centroids rather than individual items, and VisBricks, which can visualize clusters using various techniques, including statistical summaries such as histograms. An example MCV technique that predominantly uses aggregations is Keshif. In Keshif, a table of items is supplemented with multiple views showing distributions for interaction-driven exploration.

To our knowledge, there is currently no interactive general tabular visualization technique that allows aggregation. When working with large tabular data, not all data can be shown in detail, as the number of rows quickly exceeds the available display space. There are two potential remedies: scrolling and aggregation. Although scrolling is common when working with tables, it does not preserve the context of off-screen data items. Aggregation, in contrast, can be leveraged to preserve both details about a set of items in focus and context about the rest.

Various specialized tabular visualization tools use aggregation in tabular layouts. iHAT aggregates amino acid sequences and associated metadata using the most frequent category or the average to represent aggregated items, depending on the data type. Holzhüter et al. use the average for numerical values for aggregates. Both techniques employ transparency to communicate fidelity (the higher the variation in a cell, the higher the transparency), but neither addresses fidelity well. The Breakdown Visualization technique by Conklin and North aggregates rows or columns of a table based on a pre-existing aggregation hierarchy. Users can traverse the hierarchy and pivot through intersecting hierarchies. The UpSet technique aggregates items based on set memberships. It uses visualizations such as box plots for representing aggregated group statistics. In contrast to these techniques, Taggle provides the user with the flexibility to aggregate subsets of the table, while keeping details of other parts of the table visible in place.

*4 https://products.office.com/en-us/excel/

Visualization and Interaction Design

Taggle is an item-centric visualization technique that shows all dimensions relevant to an analysis task and at the same time provides a seamless combination of overview and details through selective, data-driven aggregation. Here we introduce this approach.

Taggle enables users to group items based on hierarchical combinations of attributes. The result of these nested grouping levels is an ordered tree where all leaves are items (Figure 2 (a)). Data-driven filter and sorting operations (Figure 2 (b) and (c)) can be used to reveal items of interest. By defining groups, we can add new levels to the tree (Figures 2 (d) and (e)). For example, we can group the countries in the AIDS dataset by continent. Groups can be defined based on categorical attributes, numerical thresholds, or user selections. Groups are represented as a row showing summary representations for the items in the group. Each branch in the tree can be collapsed independently, hiding the items while the group summary remains, as shown in Figure 2 (f). Each row of the resulting table then corresponds to either one item or one group. We can use this approach, for example, to show summaries of all continents, but also to show the individual countries on the African continent. By adjusting the level at which to aggregate, users can dynamically control the level of detail of the rows when rendering the table.

Finally, we introduce a degree of interest operation to reveal aggregated items that are especially relevant to the analysis. Our current implementation is naive, revealing only the first N items of an aggregated group. By leveraging sorting, we ensure that these items are the most relevant to the current analysis task. The operation allows us, for example, to show a summary about the AIDS epidemic by continent and reveal the ten most affected countries for each continent at the same time. The degree of interest can be adjusted to reveal more or fewer items (Figure 2 (g)). This function could be improved to take other aspects of the data into account, such as a cut-off of an attribute or the size of the group.

Overall Design

The Taggle interface consists of two parts, as shown in Figure 1: (a) the main table view and (b) a data selection panel that is the interface for various operations. The table view implements the overview plus detail concept for visualizing tabular data. The column headers of the table view provide the means for sorting, changing visual encoding, filtering, and grouping. The data selection panel provides access to all available numerical, categorical, text, and matrix attributes. Its primary use is to enable analysts to choose which attributes to show in the table view. For each column that is shown in the table view, the data selection panel shows a visual summary of the data in the form of a histogram, when appropriate. Below, we introduce the visual elements and interactions in detail, together with justifications of our design decisions.

Layout Strategy

Figure 2. Illustration of topological operations on a heterogeneous table (a) consisting of numerical and categorical attributes and their results reflected in the aggregation hierarchy: filtering (b), sorting (c), grouping by a single categorical attribute (d), grouping by the Cartesian product of two categorical attributes (e), aggregating (f), and degree of interest (g).

Complementary to our overview plus detail concept described above, we introduce two different layout modes serving the high-level tasks of (1) obtaining an overview and (2) seeing details for a subset of the items.

The goal of the detail mode is to allow users to see all details for selected items including labels, numerical values, and category names. Although this maximizes the readability of items, it comes at the cost of reducing the number of visible items.

In overview mode, the goal is to show as many rows as possible in order to give users a good sense of the overall patterns and distributions. To achieve this, Taggle decreases the height of items until the whole table fits on the screen, or until each item has a height of a single pixel, as lower values would introduce uncertainty due to interpolation artifacts. Aggregated groups are shown using a fixed height. Overview mode is a complementary strategy to aggregation: it is useful to get an idea about the distribution of the data in the columns and does not require that meaningful groups are defined. When viewing the table in overview mode, users can still increase the level of detail for one or multiple items by selecting them, which is useful in cases where users spot items of interest that they want to inspect in detail.
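The row height rule of overview mode can be sketched as a small calculation. This is a minimal sketch under stated assumptions, not the actual Taggle layout code: the function name `overview_row_height` and the default pixel values (18 px for a detail row, 36 px for a group row) are hypothetical; only the one-pixel lower bound, the fixed group height, and the detail height for selected items come from the description above.

```python
def overview_row_height(
    available_px: float,
    n_items: int,
    n_groups: int = 0,
    n_selected: int = 0,
    detail_px: float = 18.0,  # assumed height of a row in detail mode
    group_px: float = 36.0,   # assumed fixed height of an aggregated group row
    min_px: float = 1.0,      # rows are never drawn thinner than one pixel
) -> float:
    """Return the height of each unselected item row in overview mode."""
    n_compact = n_items - n_selected
    if n_compact <= 0:
        return detail_px
    # Space left after group rows and selected (detail height) rows are placed.
    remaining = available_px - n_groups * group_px - n_selected * detail_px
    # Shrink rows to fit, but stay between one pixel and the detail height.
    return max(min_px, min(detail_px, remaining / n_compact))
```

With these assumed defaults, 100 items on a 1000 px tall view get 10 px each, while 10,000 items hit the one-pixel floor and the table no longer fits without scrolling or aggregation.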

Sorting

Sorting is a simple way of identifying minima and maxima in columns. Sorting also reveals relationships between columns. In addition to sorting in ascending or descending order by a numerical, textual, or categorical column, Taggle enables users to sort items hierarchically, where a top-level column determines the initial sorting, a second column breaks ties from the initial sorting, and so on. This sorting strategy is particularly useful when sorting by categorical columns. Users can also sort matrix columns by specifying a statistical measure (minimum, maximum, lower and upper quartile, median, mean) as the sorting criterion.

Although other table visualizations such as the Visual Spreadsheet sort attributes hierarchically based on the order of the columns, we decided to separate the sorting from the layout. Since we expect that in most cases users are satisfied with simple sorting by one attribute, clicking on the sort button in the column header always results in the data being sorted by the corresponding attribute. Once the user activates the sorting by one attribute, a dedicated sorting hierarchy panel appears in the data selection panel. The panel allows users to add additional sorting attributes and change their order (see Figure 1 (d)).

Filtering

Filters can be defined by interacting with the histograms in the data selection panel either by brushing a range in the case of numerical data (Figure 1 (b), people knowing they have HIV) or by selecting categories that are to be removed from the table (Figure 1 (b), continent). Textual data can be filtered by string matching or by a regular expression. In addition, users can filter out items with missing values. As an alternative to setting filters in the data selection panel, users can open a filter dialog via the header of the columns.
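The filters above, together with the hierarchical sorting from the previous section, can be sketched in a few lines. This is an illustrative sketch only: all function names, the row representation as dictionaries, and the example values are assumptions made here, and the sort assumes that rows with missing values in the sort columns have already been filtered out.

```python
import re
from typing import Callable, Dict, List, Sequence, Set, Tuple

Row = Dict[str, object]
Filter = Callable[[Row], bool]


def range_filter(column: str, low: float, high: float) -> Filter:
    """Keep rows whose numerical value lies in the brushed range."""
    return lambda row: row[column] is not None and low <= row[column] <= high


def exclude_categories(column: str, removed: Set[str]) -> Filter:
    """Drop rows whose category was deselected."""
    return lambda row: row[column] not in removed


def regex_filter(column: str, pattern: str) -> Filter:
    """Keep rows whose text matches a regular expression."""
    compiled = re.compile(pattern)
    return lambda row: row[column] is not None and bool(compiled.search(str(row[column])))


def drop_missing(column: str) -> Filter:
    """Drop rows with a missing value in the given column."""
    return lambda row: row[column] is not None


def apply_filters(rows: Sequence[Row], filters: Sequence[Filter]) -> List[Row]:
    return [row for row in rows if all(f(row) for f in filters)]


def hierarchical_sort(rows: Sequence[Row], criteria: Sequence[Tuple[str, bool]]) -> List[Row]:
    """Sort by (column, descending) pairs, highest priority first."""
    result = list(rows)
    # list.sort is stable, so sorting by the lowest-priority column first
    # lets each higher-priority column keep earlier orderings as tie-breakers.
    for column, descending in reversed(criteria):
        result.sort(key=lambda row, c=column: row[c], reverse=descending)
    return result


# Made-up rows, for illustration only.
rows = [
    {"country": "A", "continent": "Africa", "rate": 3.0},
    {"country": "B", "continent": "Europe", "rate": 1.0},
    {"country": "C", "continent": "Africa", "rate": 5.0},
    {"country": "D", "continent": "Asia", "rate": None},
    {"country": "E", "continent": "Europe", "rate": 2.0},
]
kept = apply_filters(rows, [drop_missing("rate"), exclude_categories("continent", {"Asia"})])
ordered = hierarchical_sort(kept, [("continent", False), ("rate", True)])
print([row["country"] for row in ordered])  # ['C', 'A', 'E', 'B']
```

Here continent determines the initial order and rate breaks ties within each continent, which mirrors the top-level and second-level columns of the sorting hierarchy.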

Grouping and Aggregation

Being able to stratify tables into meaningful groups is not only an important feature for structuring tabular data, but also an essential prerequisite for aggregation operations in Taggle. Grouping is related to sorting since grouping also influences the order of items. Taggle, however, separates these operations in order to enable more fine-grained control of groups. As discussed before, we leverage categorical or binned numerical attributes to group datasets. Similarly, we can leverage regular expressions on string columns to create groups, or use dates and date ranges on date columns. Users can also split the table into two groups based on the current selection. Combining multiple hierarchically sorted grouping attributes creates fine-grained groups that correspond to the Cartesian product of the constituting categories. In practice, we found that two to three grouping levels are sufficient, because more lead to fragmented groups.

Setting the grouping hierarchy is analogous to hierarchical sorting: the order of grouping attributes is indicated in a dedicated grouping hierarchy panel. Since grouping takes precedence over sorting, the hierarchy is shown above the sorting hierarchy panel (see Figure 1 (c)). The separation of grouping and sorting operations gives the user tighter control over the order of the table items. Users can, for example, group items based on a binned numerical attribute but sort the items inside the groups according to a different attribute.

A group name column summarizes the current grouping and how many items are contained. In Figure 1, for instance, the combination of the attributes continent and the human development index constitutes the grouping, which is indicated in the first column. Groups can also be sorted by their name, by the number of contained items, by statistical measures of numerical attributes (e.g., mean or median), or by the most frequent category. Selected options are shown in an additional group sorting hierarchy in the panel.

Figure 3. Taggle table showing countries grouped by bins of the percentage of the population who had sex before the age of 15 (a). The fertility rate values (b) are colored according to the human development index (c), showing the correlation between the two attributes. Missing values are encoded using a dash (d).

Figure 3 illustrates a case in which the countries were first grouped based on the percentage of women who had sex before the age of 15 with a threshold set to 15 percent (Figure 3 (a)), but sorted according to fertility rate (Figure 3 (b)). Interestingly, only African and North American countries fell within the group with high percentages of sex before the age of 15. Sorting the table by fertility rates shows a clear difference between the countries of the two continents, with North American countries having much lower fertility rates than the African countries in this group. Fertility rate also correlates inversely with the level of human development index.

Groups are represented by rows showing an aggregate of the items they contain. Group headers are assigned a uniform height that is about twice that of a row shown in detail mode. We use dedicated visual encodings for aggregate items. For example, instead of bar plots for individual items, we show a histogram or a box plot that represents the whole group (see Figure 1 (f)). As discussed earlier, the items in a group can be shown below its header, partially hidden based on a degree of interest function, or completely hidden. In Figure 1, for instance, only the first 10 African countries with a low and medium human development index are displayed.
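The separation of grouping and sorting described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Taggle's implementation: the helper names `group_rows` and `sort_table` and the sample rows are hypothetical. Groups are keyed by the combination of grouping attributes, groups are ordered first, and items are then sorted inside each group by a different attribute.

```python
from collections import defaultdict

# Hypothetical rows; names and values are illustrative only.
rows = [
    {"country": "A", "continent": "Africa", "hdi": "low", "fertility": 5.1},
    {"country": "B", "continent": "Africa", "hdi": "medium", "fertility": 3.2},
    {"country": "C", "continent": "Africa", "hdi": "low", "fertility": 4.4},
    {"country": "D", "continent": "Asia", "hdi": "medium", "fertility": 2.0},
]

def group_rows(rows, group_by):
    """Group rows by a hierarchy of attributes; each key is a tuple of category values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[attr] for attr in group_by)].append(row)
    return dict(groups)

def sort_table(groups, item_key, group_key, reverse_items=False):
    """Order the groups first (grouping takes precedence), then the items inside each group."""
    ordered = sorted(groups.items(), key=lambda kv: group_key(kv[0], kv[1]))
    return [
        (name, sorted(items, key=item_key, reverse=reverse_items))
        for name, items in ordered
    ]

groups = group_rows(rows, ["continent", "hdi"])
table = sort_table(
    groups,
    item_key=lambda r: r["fertility"],           # sort items by a different attribute
    group_key=lambda name, items: -len(items),   # largest group first
    reverse_items=True,
)
```

Only combinations that actually occur in the data produce a group, so the number of groups is at most the size of the Cartesian product of the categories.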

Visualizing Matrices

Although many tools offer support for time-series data (e.g., by showing sparklines), these tools usually do not support general matrices. For example, the option to reorder the data points is usually missing, because it is not necessary for time-series data. In our technique, adding a matrix to a table visualization introduces a second key for the columns of the matrix. We allow grouping of matrix columns based on this key. The individual groups of columns are then treated as separate matrices: they can be manually reordered, aggregated, and sorted, and the visual encoding of each group can be adjusted individually. For example, the years in the new HIV infections per 1,000 people matrix and the AIDS-related deaths per 1,000 people matrix in Figure 1 (e) introduce years as the second key, which is then used to group these matrices by decades. Here, the 2010s use a different visual encoding for the groups.

Figure 4. Attribute column visualization techniques for items and aggregated groups by data type. Numerical items can be encoded with bars, dot plots, proportional symbols, or brightness. For categorical items, we offer color encoding plus labels, and two variants of matrix representations, one with and one without color used redundantly. All items can also be displayed as strings. Numerical attributes can be aggregated into box plots and histograms. Distributions of categorical values can be shown as a histogram, a stacked bar, a binary presence/absence matrix inspired by UpSet, or an aggregated matrix with brightness encoding the frequency of individual categories in the group. An aggregated textual attribute shows examples of the group members.
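Grouping matrix columns by their second key, as in the decades example, can be sketched as follows. The years, values, and helper names are hypothetical; the sketch only shows how a year key maps columns to decade groups and how one row is split into per-group value lists that a group-specific encoding could then summarize.

```python
from collections import defaultdict

# Hypothetical matrix row: one value per year (the second key of the matrix).
years = [1995, 1999, 2003, 2008, 2012, 2015]
row_values = [0.9, 1.1, 1.4, 1.2, 0.8, 0.6]

def group_columns_by_decade(years):
    """Map each decade (e.g., 1990) to the indices of the columns that fall into it."""
    groups = defaultdict(list)
    for index, year in enumerate(years):
        groups[year // 10 * 10].append(index)
    return dict(groups)

def split_row(values, column_groups):
    """Split one matrix row into per-group value lists, one list per column group."""
    return {
        decade: [values[i] for i in indices]
        for decade, indices in column_groups.items()
    }

column_groups = group_columns_by_decade(years)
per_decade = split_row(row_values, column_groups)
```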

Encoding and Multiform Visualizations

The table view encodes each selected column or matrix using one of multiple alternative visual encodings suitable for the data type, including bars, dots, proportional symbols, or brightness for numerical data; color or positional/matrix encoding for categorical data; and heatmaps for matrices. Following the multiform principle, the visual encoding for each column can be changed on demand. For example, the default bar encoding a single numerical attribute can be interactively changed to a proportional symbol, if desired. Dedicated visual encodings are used for aggregates: box plots and histograms show the distribution of numerical values; stacked bars and histograms show relative frequencies of categories in an aggregate. A list of textual items is represented as a truncated list of examples.

Figure 4 gives an overview of the visual encodings available for numerical, categorical, and textual attributes with and without aggregation. Figure 5 summarizes how a matrix can be aggregated in column and row directions. In theory, the aggregation choices for the matrix rows and columns should be symmetric. Our limited choices of visualizations for aggregated rows and columns stem from our design decision to show the aggregated rows with fixed height, whereas for aggregated columns the width is flexible and by default reflects the width of the matrix. Thus, most of the visualizations available for aggregated columns (e.g., box plots or dot plots) are not suitable for aggregated rows, as there would not be sufficient space.

We limit ourselves to these choices because they either offer good perceptual properties (e.g., encoding by position) or are very compact, thus allowing users to choose between perceptual accuracy and space utilization. We deliberately do not offer visual encodings that we consider to be problematic. For example, a bar representing an average of a group does not communicate any variability and is therefore not a suitable visualization for an aggregated attribute.

We chose a dash to encode missing values (Figure 3 (d)). We also considered a dedicated color, but dashes have the advantage that their visual saliency is lower (i.e., they do not draw as much attention), while they remain clearly visible at all levels of detail.

Figure 5. Matrix visualization techniques. Matrix items can be encoded using brightness and as sparklines. Matrices can be aggregated in both column and row directions. When a matrix is aggregated in the column direction, a group of matrix columns within one row is merged into a single cell. The aggregated values can then be visualized using box plots, dot plots, and heatmap. When a matrix is aggregated in the row direction, a group of rows is merged into one row. Values of aggregated rows can be displayed using a heatmap and superimposed sparklines. A matrix aggregated in both directions is encoded using a box plot, histogram, or dot plot of all matrix values.

Figure 6. Example of encodings at different scales. In the first column, the items are displayed at full height with white space separating the rows. If a textual label is part of the visualization, it is displayed at a readable size. Compact representations (columns two and three) remove white space and string labels. Some visualizations, such as the box plot, UpSet, or the matrix representation, are simplified to account for the limited space.
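The argument in this section against a bar that shows only a group's average can be made concrete with a small worked example. The two groups below are hypothetical; they share the same mean but differ in spread, which the five statistics of a box plot reveal. The sketch assumes whiskers drawn at the minimum and maximum, which is one common box plot variant and not necessarily the one used in Taggle.

```python
from statistics import mean, quantiles

def five_number_summary(values):
    """Return (min, q1, median, q3, max), the statistics a simple box plot encodes."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))

# Two hypothetical groups with identical means but different variability.
narrow = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

same_mean = mean(narrow) == mean(wide)   # a mean bar would look identical for both
narrow_summary = five_number_summary(narrow)
wide_summary = five_number_summary(wide)
```

A mean bar would draw both groups identically, whereas the interquartile range is 2 for the first group and 40 for the second.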

Compact Encodings

When the height of rows is reduced in overview mode, we take various measures to adapt the visualization to the diminished space. We not only make the visual representations smaller, but also reduce details and/or adapt the visualization. In the compact representation of box plots, for instance, we fill the available vertical space at the position of the box and indicate the start and end of the whiskers by drawing vertical tick marks. However, some visualizations, such as strings and proportional symbols, do not have an adequate downscaled version. We do not render such cells in overview mode. Examples of visual compression for individual visualization options can be found in Figure 6.

Animated Transitions

We support users in understanding changes in the visualization by applying animated transitions, as demonstrated in the accompanying video. Our implementation incorporates smooth transitions for the switch between overview and detail as well as for changes resulting from filter, sort, and aggregation operations.

Instead of simply morphing item position, we apply staged transitions, where animations are split into multiple phases. In the first phase of a filter animation, for instance, we fade out the filtered rows and then move up the remaining rows of the table to fill the white space. This animation is designed to help users understand why rows outside the viewport become visible at the bottom of the table. Similarly, when items in a group are collapsed, we first fade out the items and then gradually reduce the height of the empty group area to the fixed height of the aggregated group.

Figure 7.

Possible column combinations: (a, e) nested column that semantically groups columns of various types; (b) stacked column that creates a stacked bar plot based on multiple weighted numerical columns; (c) interleaved column that stacks the visualizations of multiple numerical columns; (d) scripted column that, in this case, visualizes only the maximum values of selected columns; and (f) column imposition where the marks of a numerical column are colored by the imposed categorical column.

Combining Columns

Giving users the ability to flexibly combine columns supports various tasks. Users can interactively create combined columns by dragging existing columns either onto an empty container or onto another column. The possible combinations are specific to the data type of the column.

The most basic combined column is a nested column, shown in Figure 7 (a) and (e). It encloses multiple individual columns by adding a joint header above all columns contained. Nested columns are useful for creating semantic groups. The nested column is the most flexible column combiner: it works for all types and can mix columns of different types.

Taggle also enables users to create stacked columns by combining two or more numerical columns into a weighted sum of the items, where the individual contributions are represented as stacked bars (see Figure 7 (b)). Users can interactively change the weights of individual columns by adapting their widths. Stacked columns can be used to create a "score", which, in turn, can be used to create rankings. Aggregate representations for stacked columns are shown as box plots, where the values feeding the box plots are the weighted sums of the composing values.

To enable a more effective comparison of items across multiple columns, an interleaved column (Figure 7 (c)) stacks the encoded values from multiple numerical columns vertically. Depending on whether the row is an item or group, the stacked representations can be made from bars or dots, or, in case an aggregate is interleaved, from a box plot.

With imposition columns, users can color the visual marks (bar, proportional symbol, etc.) of a numerical column by the color coding of a categorical attribute, as shown in Figure 7 (f).

Taggle also enables more complex combinations, based on a set of predefined functions, such as minimum, maximum (Figure 7 (d)), and mean, for combining multiple numerical attributes into a single numerical column.
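The stacked column described above combines numerical columns into a weighted sum that can serve as a score for ranking. The following sketch illustrates the idea with hypothetical columns and weights. It assumes that each column is min-max normalized before weighting so that columns on different scales are comparable; this normalization step is an assumption of the sketch and is not stated in this section.

```python
def normalize(values):
    """Min-max scale a numerical column to [0, 1] (assumed preprocessing step)."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def stacked_scores(columns, weights):
    """Weighted sum per item; each term corresponds to one segment of the stacked bar."""
    total = sum(weights.values())
    normalized = {name: normalize(vals) for name, vals in columns.items()}
    n_items = len(next(iter(columns.values())))
    return [
        sum(weights[name] / total * normalized[name][i] for name in columns)
        for i in range(n_items)
    ]

# Hypothetical columns; in the interface, column widths play the role of the weights.
columns = {"a": [1.0, 3.0, 5.0], "b": [10.0, 0.0, 5.0]}
weights = {"a": 3, "b": 1}

scores = stacked_scores(columns, weights)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Changing the weights, which corresponds to resizing a column in the interface, changes the scores and therefore potentially the ranking.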
In addition, users can add scripted columns that allow them to define their own functions via a scripting interface.

Sorting and Grouping of Column Subsets

Although Taggle focuses primarily on tabular visualization, keeping items in constant rows across all columns, it also supports splitting a table into multiple segments and sorting and grouping each instance independently. To encode the relationships between table segments, we utilize slope graphs for connecting individual items of the tables compared, or bands for showing relationships between aggregated groups, de facto enabling users to create hybrid tabular/overview representations (see Figure 8), and in the extreme, even visualization techniques such as parallel sets.

Implementation

In the demo application *7, users can switch between multiple preloaded datasets, upload datasets, and download existing datasets in various formats. Users can locally save and restore a Taggle table together with the analysis session that includes the history of all user interactions.

The Taggle feature set is fully integrated into the LineUp.js library *8, which is written in TypeScript and available as open source *9. A demo version can be accessed at https://taggle.caleydoapp.org/. Making Taggle available as an open-source library increases the potential for adoption of the technique. Taggle is also designed to be combined with other techniques. To that end, we provide various interfaces. For example, the library can be embedded in Jupyter Notebooks *4,*10 and used as an HTML widget *11,*12, which allows integration into Shiny applications *13, R Notebooks *14, Angular.js *15, Vue.js *16, and React.js *17. We provide examples for how to embed Taggle in each of these frameworks in the repository. Note that Taggle can also be embedded as a component inside a larger web-based application. The case study described in the following section is based on the Ordino visual cancer analysis tool. The server side of Ordino retrieves over 500 GB of cancer data from a PostgreSQL database. Complex aggregation queries that need to iterate over a large set of table entries are handled by the database, while the client side with Taggle then receives only the data subset needed for rendering.

*4 https://jupyter.org/
*7 https://taggle.caleydoapp.org/
*8 https://lineup.js.org/
*9 https://github.com/lineupjs/lineupjs/
*10 https://github.com/lineupjs/lineup_widget/
*12 https://github.com/lineupjs/lineup_htmlwidget
*13 http://shiny.rstudio.com/
*14 http://rmarkdown.rstudio.com/r_notebooks.html
*15 https://angularjs.org/
*16 https://vuejs.org/
*17 https://reactjs.org/

Figure 8.

Comparison of two table segments. The segment on the left shows countries grouped by continent. The segment on the right visualizes countries grouped by the human development index. Both table segments are ranked by the number of people knowing they have HIV. The steeper the angle of the lines connecting the two instances, the greater the change in the ranking. Bands show relationships between aggregated groups.
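The quantity that the slope of a connecting line conveys, namely how far an item moves between two table segments, can be computed as the difference of its row positions. The sketch below uses hypothetical item orders and a hypothetical helper name; it is an illustration of the idea rather than the slope graph implementation.

```python
def rank_changes(left_order, right_order):
    """Return {item: right_position - left_position} for items present in both segments."""
    right_pos = {item: i for i, item in enumerate(right_order)}
    return {
        item: right_pos[item] - i
        for i, item in enumerate(left_order)
        if item in right_pos
    }

# Hypothetical row orders of the same items in two table segments.
left = ["A", "B", "C", "D"]
right = ["C", "A", "B", "D"]

changes = rank_changes(left, right)
steepest = max(changes, key=lambda item: abs(changes[item]))
```

An item with a change of zero is connected by a horizontal line, and the item with the largest absolute change gets the steepest line.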

Case Study: Drug Target Discovery

Taggle was developed in tight collaboration with domain experts working on a drug discovery team at a pharmaceutical company. We demonstrate Taggle by means of a case study conducted on complex genomics data for the purpose of drug target discovery. The case study summarizes an analysis session carried out by one of our collaborators. For the case study, we integrated Taggle into the Ordino Target Discovery Platform that provides access to the required cancer genomics data *18. Note that the collaborator has experience using interactive visualization tools, was involved in all phases of the project, and provided continuous feedback during the development. For the case study, the domain expert operated Taggle himself without the help of visualization experts.

In order to identify potential drug targets in a set of tumor types, the analyst performs experiments with cancer cell lines, that is, cultured cells that are derived from tumors and that can proliferate indefinitely in the laboratory. These cell lines are characterized by various properties, such as tumor type (lung cancer, prostate cancer, etc.) and the set of genes that are mutated.

One very important gene in the context of cancer is TP53. It encodes the p53 protein, whose presence is known to suppress the uncontrolled division of cells. However, when TP53 is mutated (which is the case for over 50% of cancer patients), it can lose its suppressing function, which results in tumor growth. Due to its important role, scientists want to know whether TP53 is mutated in a set of cell lines. However, the mutation status of TP53 is not always known. It has recently been shown that the mean expression level (expression is a measure of the activity of genes) of 13 genes that are biologically related to TP53 is correlated with its mutation status. The expression level of these genes can hence be used to predict the mutation status of TP53.

In this case study, the analyst first wants to find out how well this predictor works for the set of cell lines contained in the database. Based on this knowledge and other criteria, the analyst then wants to select cell lines for a wet-lab experiment.

The analyst starts by loading a list of 1,009 cell lines from the public CCLE dataset into Taggle. By default, the table contains a textual column representing the names of cell lines and a categorical column indicating tumor type. Since only a subset of tumor types is of interest, the analyst filters for astrocytoma/glioblastoma (a type of brain cancer), bone sarcoma, melanoma, and non-small-cell lung cancer (NSCLC), after which 255 cell lines remain.

As the analyst wants to investigate the TP53 gene, he loads a categorical column with the mutation status (mutated vs. nonmutated) and a textual column that provides further details about the mutation (if present). According to the mutation histogram in the data selection panel, the status is unknown for 59 cell lines. To investigate the effectiveness of the 13 genes in predicting the TP53 status, the analyst loads the average expression of these genes together with a matrix column containing the individual expression values. Furthermore, he hides cell lines with unknown mutation status. After sorting the table by average expression in descending order and switching to the overview (see Figure 9), the analyst observes the overall good correlation between expression and mutation status: there is a clear enrichment of TP53 mutants among the cell lines with a low score.

In order to test whether the correlation is present for all selected tumor types, the analyst groups the table by tumor type. He observes that the prediction seems to work particularly well for the astrocytoma/glioblastoma cell lines (almost perfect separation between mutated and nonmutated) and further investigates this observation by also stratifying by mutation status and aggregating all groups (see Figure 10). The expression box plots show good separation for astrocytoma/glioblastoma and melanoma, whereas the expression ranges are overlapping for NSCLC.

*18 https://ordino.caleydoapp.org/

Figure 9. After sorting the cell lines by the TP53 predictor score (brown), the analyst notices that those with a low average score are much more likely to be mutated (green). From this observation, the analyst concludes that predicting the mutation status based on the average expression of the 13 genes that constitute the predictor score works reasonably well.

Having confirmed that the prediction of the TP53 mutation status works reasonably well in several tumor types, the analyst wants to select a set of cell lines for a wet-lab experiment. He is interested in melanoma cell lines that have no TP53 mutation. Furthermore, the activity of CDKN2A, another important tumor suppressor gene, should be impaired due to a reduced number of CDKN2A gene copies in the genome. The analyst removes the mutation status grouping, includes cell lines for which it is unclear whether TP53 is mutated, and unfolds the melanoma cell lines group. Based on the ranking, he decides to consider all cell lines with unknown TP53 mutation status and a TP53 predictor score greater than 110 as nonmutated. He adds a column with the CDKN2A relative copy number, sorts by it in ascending order, and filters out missing data. Finally, he selects the top hits of the resulting list (see Figure 11). All these cell lines fulfill the analyst's requirements.
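The final selection step of the case study can be summarized as a short sketch. All cell line names and numbers below are made up for illustration, and the field and function names are hypothetical; only the threshold of 110 and the order of operations (restrict to melanoma, treat unknown status with a high predictor score as nonmutated, drop missing copy numbers, sort ascending, take the top hits) follow the description above.

```python
# Hypothetical cell lines; identifiers and numbers are invented for illustration.
cell_lines = [
    {"name": "CL1", "tumor": "melanoma", "tp53": "nonmutated", "score": 130.0, "cdkn2a_cn": 0.2},
    {"name": "CL2", "tumor": "melanoma", "tp53": "unknown", "score": 118.0, "cdkn2a_cn": 0.1},
    {"name": "CL3", "tumor": "melanoma", "tp53": "unknown", "score": 95.0, "cdkn2a_cn": 0.3},
    {"name": "CL4", "tumor": "melanoma", "tp53": "mutated", "score": 80.0, "cdkn2a_cn": 0.1},
    {"name": "CL5", "tumor": "melanoma", "tp53": "nonmutated", "score": 125.0, "cdkn2a_cn": None},
    {"name": "CL6", "tumor": "NSCLC", "tp53": "nonmutated", "score": 140.0, "cdkn2a_cn": 0.1},
]

SCORE_THRESHOLD = 110  # predictor score threshold chosen by the analyst

def treated_as_nonmutated(line):
    """Nonmutated, or unknown status with a predictor score above the threshold."""
    return line["tp53"] == "nonmutated" or (
        line["tp53"] == "unknown" and line["score"] > SCORE_THRESHOLD
    )

def select_candidates(lines, tumor, top_n):
    """Filter, sort by ascending relative copy number, and return the top names."""
    kept = [
        line for line in lines
        if line["tumor"] == tumor
        and treated_as_nonmutated(line)
        and line["cdkn2a_cn"] is not None   # filter out missing data
    ]
    kept.sort(key=lambda line: line["cdkn2a_cn"])
    return [line["name"] for line in kept[:top_n]]

candidates = select_candidates(cell_lines, "melanoma", top_n=2)
```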

Expert Feedback

Our collaborators initially planted the seed that led to the development of Taggle by pointing out restrictions they face in current drug discovery tools. They particularly mentioned the need to seamlessly combine overview and details in tabular data analysis for drug discovery.

During the conception and development of Taggle, we had biweekly feedback sessions and in-depth discussions with our collaborators on every aspect of both the concept and the visual interface. The most critical feedback on early prototypes was about the limited rendering performance that hindered their use in real-world scenarios. After making the prototypes more scalable, we received valuable and very detailed feedback on a conceptual level but also regarding the usability of the prototype implementation. For example, the user interface workflow and visual encoding of the hierarchical grouping and sorting capabilities led to confusion. We resolved this problem by introducing an explicit sorting and grouping hierarchy in the data selection panel (see Sections Sorting and Grouping and Aggregation). Based on follow-up feedback, we also added the capability of controlling the order of groups, to sort them by number of items or by group name, for instance.

The fact that Taggle recently replaced the LineUp technique as a core component in the Ordino drug discovery tool, which is in productive use at Boehringer Ingelheim, demonstrates that the domain experts are convinced of its effectiveness and added value.

In additional high-level feedback, the domain experts mentioned that they would like to confirm the statistical significance of visual patterns they see in the overview as well as between groups of items. However, this approach could easily lead to incorrect inferences unless some precautions are taken. In a follow-up project, we are working on a solution that supports such confirmatory analysis in a way that users can understand without being trained in statistics.

Discussion and Limitations

Revisiting our discussion of visualization techniques for tabular data (overview, projection, tabular, and MCV techniques), we argue that Taggle is primarily a tabular visualization technique, as it retains a tabular layout and encodes data within a cell, but also has some aspects of an overview technique due to its capabilities to aggregate and its ability to sort and group subsets of columns independently. Interactive definition of groups and their aggregation in summary visualizations, such as box plots and histograms, provides a meaningful overview even for large datasets and enables an intuitive comparison of grouped data subsets (Figure 10). At the same time, Taggle enables the exploration of items at a detailed level to identify their precise properties (Figure 11). We also designed Taggle so that it can be used within an MCV framework.

This combination sets Taggle apart from existing tabular techniques, which provide only a coarse overview of items (e.g., using the lens technique, which is insufficient for representation or comparison of large datasets) or lack interactivity, which is essential to the exploration process.

Scalability

Taggle scales to more than 1 million rows on a modern browser, as demonstrated in the demo application. We achieve this performance by leveraging rendering optimizations, which ensure that only visible rows are processed. Although the rendering time stays almost constant, larger datasets require more time for data operations, such as sorting, grouping, or computing histograms, which always need to be done for the full dataset. The performance depends on the number of CPU cores available on the client machine, as the workload is distributed between multiple parallel web workers, if possible.

To demonstrate the computational scalability, we executed performance measurements for common operations on five datasets consisting of 100, 1,000, 10,000, 100,000, and 1,000,000 data items. Each dataset consisted of one textual, two numerical, and two categorical attributes generated with uniform distribution. For each tested operation, we measured the time between triggering the operation (e.g., pressing the sort button) and the end result appearing on screen. Animation is not useful when rearranging large datasets; hence, it is disabled by default for such datasets. To make the results comparable across all datasets, we disabled animations for all conditions when benchmarking.
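The rendering optimization mentioned in this section, processing only the visible rows, can be illustrated with a minimal sketch. It assumes a uniform row height, which is a simplification: the actual table mixes item rows and taller group rows. The function name and the numbers are hypothetical.

```python
def visible_row_range(scroll_top, viewport_height, row_height, n_rows):
    """Return the half-open index range [first, last) of rows intersecting the viewport."""
    first = max(0, scroll_top // row_height)
    # Ceiling division finds the first row that starts at or below the viewport bottom.
    last = min(n_rows, -(-(scroll_top + viewport_height) // row_height))
    return first, last

# Hypothetical numbers: a tall table scrolled down by 2,200 pixels.
first, last = visible_row_range(
    scroll_top=2_200, viewport_height=880, row_height=22, n_rows=1_000_000
)
```

Under this assumption, the number of rows to render depends only on the viewport and row height, not on the size of the table, which is consistent with the almost constant rendering time reported above.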

Figure 10. The analyst groups the cell lines first by the attribute tumor type and then by TP53 mutation status. For the tumor type astrocytoma/glioblastoma, the box plots representing the TP53 predictor score show a clear separation between the groups mutated and nonmutated. For the other tumor types, the whiskers of the box plots overlap, indicating that the predictor score does not work as effectively.

Figure 11. Continuing from the visualization state shown in Figure 10, the analyst removes the grouping on the TP53 mutation column and unaggregates the melanoma group to inspect the cell lines in further detail. With the goal of finding cell lines for a wet-lab experiment, the analyst adds the copy number value of CDKN2A as an additional column (shown in pink). Finally, he selects cell lines that have a low copy number value and are either nonmutated or have unknown mutation status and a TP53 predictor score above 110.

For measurements we used the performance profiler from Google Chrome DevTools (v. 71.0.3578.98). We repeated each measurement five times. Table 1 presents the average times in milliseconds. The tests were done on a machine with an Intel Core i7-5930K processor (3.5 GHz, 6 cores), 32 GB RAM, and an NVIDIA GeForce GTX 970 graphics card. Note that the browser-based tracking tool may slightly decrease the actual performance.

Since the full dataset needs to be loaded into memory first, the size of the data table determines the loading time. Naturally, the number of rendered items also influences the run-time performance. Although we optimize the rendering to process only visible rows, there can be notable performance differences between detail and overview mode, since the number of rendered items is much larger in overview mode. For example, in our full-HD setup with a viewport size of 1,387 × 882 pixels, detail mode (DM) allowed for 39 item rows, but in overview mode (OM) we had up to 775 rows on screen. Note that due to the design decision that every item is at least one pixel high, the table grows out of the visible screen space for larger tables. Table 1 shows that for the smallest dataset, where the number of rendered elements is low in either case, the performance difference between overview and detail mode is minimal. For other datasets, the time necessary for preparing and rendering the elements is much more apparent.

Table 1. Completion time in milliseconds for various operations using five datasets with 100 to 1 million items. DM indicates operations performed in detail mode, and OM indicates operations performed in overview mode.

Operation           Mode    100     1K    10K   100K     1M
Load                        529    545    611  1,012  4,107
Sort numerical      DM      321    338    358    643  2,642
Sort numerical      OM      288    681    741    970  3,626
Sort grouped        DM      324    324    381    518  1,911
Sort grouped        OM      306    661    743    923  2,830
Sort textual        DM      300    367    397    728  3,639
Sort textual        OM      302    647    730  1,069  5,075
Filter numerical    DM      407    415    460    598  1,442
Filter numerical    OM      419  1,883  1,745  1,876  2,858
Filter categorical  DM      357    435    475    562  1,372
Filter categorical  OM      403  1,982  1,079  1,196  2,261

Aggregation of Categorical and Textual Attributes

Although there are numerous possibilities for aggregation of numerical data, ranging from aggregation in data space (mean, median) to spatial aggregation (box plots, histograms), the options are limited for categorical and textual data. In our prototype we offer three possibilities for aggregation of categorical data: a matrix, a histogram, and a distribution bar. Due to spatial constraints and the limited scalability of the color channel, there are limits with respect to the number of categories that can be sensibly encoded this way. Taggle uses a predefined color scheme with 22 distinct colors. If a categorical attribute has more than 22 categories, we treat the column as textual. Colors are automatically assigned, but users can also adjust colors manually to resolve cases where colors are repeated between columns.

Aggregation of textual attributes, however, is even more limited. In our prototype implementation, we list a sample of items from the aggregated group to summarize the group's content using the order of the items. An alternative approach would be to select samples based on other criteria such as frequency of occurrence. This approach, however, is practical only for data attributes with repetitive values.
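The two rules described in this section, falling back to a textual column beyond 22 categories and summarizing a textual group by its first few members, can be sketched as follows. The function names, the number of examples shown, and the sample values are assumptions made for illustration.

```python
MAX_CATEGORIES = 22  # size of the predefined color scheme described above

def effective_type(values):
    """Treat a categorical column as textual once it exceeds the color scheme."""
    return "categorical" if len(set(values)) <= MAX_CATEGORIES else "textual"

def textual_aggregate(items, max_examples=5):
    """Summarize a group by listing its first few members in table order."""
    shown = items[:max_examples]
    suffix = ", ..." if len(items) > max_examples else ""
    return ", ".join(shown) + suffix

# Hypothetical group members in their current table order.
summary = textual_aggregate(
    ["Angola", "Benin", "Burkina Faso", "Burundi", "Chad", "Congo"]
)
```

Because the sample follows the current order of the items, re-sorting the table changes which examples are shown.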

Automatic Aggregation

In the design process, we investigated methods for automatically aggregating rows and columns, with the goal of increasing scalability. For example, when in overview mode, we tried to automatically aggregate groups to make space for user-selected rows that are shown with increased height. We found, however, that users had difficulties understanding the unexpected changes and subsequently interpreting the individual items and aggregated groups. As this violated the discriminability design guideline proposed by Elmqvist and Fekete, we removed the automatic aggregation feature. Instead, as part of future work, we plan to implement and evaluate a recommendation approach that suggests possible layout changes without automatically applying them.

Stacking of Matrices and Vectors

Our current prototype supports grouping of matrix columns based on a categorical attribute (see Figure 1), but provides no means of sorting and filtering the matrix columns. Furthermore, it is not possible to stack additional attributes on top of a matrix, as done, for instance, in Figures 4 and 6 of the cited work. However, we plan to address this technical limitation in future versions.

Conclusion and Future Work

In this work, we presented Taggle, an item-centric, spreadsheet-like visualization technique for exploring and presenting large and complex tables. Taggle is unique among tabular data visualization techniques due to its ability to dynamically aggregate subsets of a table, which allows users to flexibly drill down into details of large tables while keeping the overview as context.

The open-source implementation presented as part of this work goes beyond a research prototype, providing a rich set of visual encodings and rendering optimizations that make it scale to a million items. Taggle can be used as a standalone tool but also integrated as a widget into MCV systems or notebook-style environments such as R Markdown or Jupyter Notebooks.

As part of future work, we plan to integrate data-driven guidance capabilities into Taggle, as implemented in StratomeX. Following the idea of guided visual exploration, we plan to assist users in finding correlated attributes or similar groups based on their input.

Acknowledgements

We thank Bikram Kawan and Martin Ennemoser for their contributions to the initial prototype implementation as well as Christian Haslinger and Andreas Wernitznig for providing valuable conceptual feedback.

Declaration of conflicting interests

Samuel Gratzl, Holger Stitz, Alexander Lex, and Marc Streit hold shares and/or are employed by datavisyn GmbH, which provides its customers with support for using and deploying the open source Taggle software.

Funding

This work was supported in part by Boehringer Ingelheim Regional Center Vienna; the State of Upper Austria (FFG

References

1. Rao R and Card SK. The Table Lens: Merging Graphical and Symbolic Representations in an Interactive Focus + Context Visualization for Tabular Information. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’94). ACM, pp. 318–322. DOI:10.1145/191666.191776.
2. Perin C, Dragicevic P and Fekete J. Revisiting Bertin Matrices: New Interactions for Crafting Tabular Visualizations. IEEE Transactions on Visualization and Computer Graphics (InfoVis ’14)

IEEE Transactions on Visualization and Computer Graphics (InfoVis ’13)

Bioinformatics

Prepared using sagej.cls Journal Title XX(X)

5. Munzner T.

Visualization Analysis and Design . Boca Raton:CRC Press, Taylor & Francis Group, 2014. ISBN 978-1-4665-0891-0.6. Lex A, Schulz HJ, Streit M et al. VisBricks: MultiformVisualization of Large, Inhomogeneous Data.

IEEETransactions on Visualization and Computer Graphics (InfoVis’11)

Proceedings ofthe National Academy of Sciences USA

IEEE Transactionson Visualization and Computer Graphics (InfoVis’17)

Journal of Machine Learning Research

Proceedings of the SIGCHI Conference on HumanFactors in Computing Systems (CHI ’12) . ACM, pp. 443–452.DOI:10.1145/2207676.2207738.11. Stahnke J, D¨ork M, M¨uller B et al. Probing Projections:Interaction Techniques for Interpreting Arrangements andErrors of Dimensionality Reductions.

IEEE Transactionson Visualization and Computer Graphics (InfoVis ’15)

IEEE Transactions on Visualization and Computer Graphics(InfoVis ’16)

Distill

Infor-mation Visualization

Information Visualization

Proceedings of IEEE Conference onVisual Analytics Science and Technology (VAST ’19) . IEEE. .17. Becker RA and Cleveland WS. Brushing Scatterplots.

Technometrics

IEEE Transactions on Visualization and ComputerGraphics (InfoVis ’08)

Parallel Coordinates. Visual MultidimensionalGeometry and Its Applications . Springer, 2009. ISBN 978-0-387-68628-8 0-387-68628-2. 20. Wegman EJ. Hyperdimensional Data Analysis Using ParallelCoordinates.

Journal of the American Statistical Association

Eurographics 2013 - State of the Art Reports .Girona, Spain: The Eurographics Association, pp. 95–116.DOI:10.2312/conf/EG2013/stars/095-116.22. Kandogan E. Visualizing Multi-dimensional Clusters, Trends,and Outliers using Star Coordinates. In

Proceedings ofthe SIGKDD Conference on Knowledge Discovery and DataMining (SIGKDD ’01) . ACM, pp. 107–116. DOI:10.1145/502512.502530.23. Tominski C, Abello J and Schumann H. Axes-BasedVisualizations with Radial Layouts. In

Proceedings of the ACMSymposium on Applied Computing (SAC ’04) . ACM, pp. 1242–1247. DOI:10.1145/967900.968153.24. Claessen JH and van Wijk JJ. Flexible Linked Axesfor Multivariate Data Visualization.

IEEE Transactions onVisualization and Computer Graphics (InfoVis ’11)

IEEE Transactions on Visualization and ComputerGraphics (InfoVis ’13)

The American Statistician

Statistical Analysis and Data Mining: The ASAData Science Journal

Computer

ComputationalStatistics & Data Analysis

Bioinformatics

Scientiﬁc Data

Proceedings of IEEEConference on Visual Analytics Science and Technology (VAST’18)

Transaction

Prepared using sagej.cls urmanova et al. on Visualization and Computer Graphics Transactionon Visualization and Computer Graphics (InfoVis ’18)

Proceedingsof the ACM Symposium on User Interface Software andTechnology (UIST ’96) . ACM, pp. 41–50. DOI:10.1145/237091.237097.38. Spenke M and Beilken C. InfoZoom-Analysing Formula Oneracing results with an interactive data mining and visualisationtool.

WIT Transactions on Information and CommunicationTechnologies

Proceedings of the IEEE Symposium onInformation Visualization (InfoVis ’04) . IEEE, pp. 159–166.DOI:10.1109/INFVIS.2004.12.40. Yalcin MA, Elmqvist N and Bederson BB. Keshif: Rapidand Expressive Tabular Data Exploration for Novices.

IEEETransactions on Visualization and Computer Graphics

Proceedings of the Working Conference on Advanced VisualInterfaces (AVI ’00) . ACM. ISBN 978-1-58113-252-6, pp.110–119. DOI:10.1145/345513.345271.42. Henry N, Fekete JD and McGufﬁn MJ. NodeTrix: A HybridVisualization of Social Networks.

IEEE Transactions onVisualization and Computer Graphics (InfoVis ’07)

Computer Graphics Forum

BMC Bioinformatics

IEEE Transactions on Visualization and ComputerGraphics (InfoVis’14)

BMC Bioinformatics

IEEE Transactions onVisualization and Computer Graphics (InfoVis ’10)

IEEE Transactions on Visualization and ComputerGraphics

Proceedings of the IEEE Conference on Visualization (Vis ’99) . IEEE. ISBN 0-7803-5897, pp. 43–50. DOI:10.1109/VISUAL.1999.809866.50. Heinrich J, Vehlow C, Battke F et al. iHAT: InteractiveHierarchical Aggregation Table for Genetic Association Data.

BMC Bioinformatics

Proceedingsof the SPIE Conference on Visualization and Data Analysis .IS&T/SPIE, p. 82940O. DOI:10.1117/12.908516.52. Conklin N, Prabhakar S and North C. Multiple Foci Drill-down Through Tuple and Attribute Aggregation Polyarchiesin Tabular Data. In

Proceedings of the IEEE Symposium onInformation Visualization (InfoVis ’02) . IEEE, pp. 131–134.DOI:10.1109/INFVIS.2002.1173158.53. Lex A, Gehlenborg N, Strobelt H et al. UpSet: Visualizationof Intersecting Sets.

IEEE Transactions on Visualization andComputer Graphics (InfoVis ’14)

Proceedings ofthe ACM SIGCHI Conference on Human Factors in ComputingSystems (CHI ’86) . ACM, pp. 16–23. DOI:10.1145/22339.22342.55. Tyner C, Barber GP, Casper J et al. The UCSC GenomeBrowser database: 2017 update.

Nucleic Acids Research

Nature Methods

IEEE Transactions on Visualization andComputer Graphics (InfoVis ’07)

Proceedings of theConference on Advanced Visual Interfaces (AVI ’04) . ACM, pp.150–157. DOI:10.1145/989863.989885.59. Tufte E.

The Visual Display of Quantitative Information . 2nded. Cheshire, CT, USA: Graphics Press, 2001.60. Kosara R, Bendix F and Hauser H. Parallel Sets: Interactiveexploration and visual analysis of categorical data.

IEEETransactions on Visualization and Computer Graphics

Bioinformatics eLife

Nature

Science

Proceedings of the SIGCHI Conference on Human Factors in

Prepared using sagej.cls Journal Title XX(X)

Computing Systems (CHI ’18) . ACM, p. 479. DOI:10.1145/3173574.3174053.66. Eckelt K, Adelberger P, Zichner T et al. TourDino: A SupportView for Conﬁrming Patterns in Tabular Data. In

Proceedingsof the EuroVis Workshop on Visual Analytics (EuroVA’ 19) . pp.7–11. DOI:10.2312/eurova.20191117.67. Cherniack AD, Shen H, Walter V et al. Integrated MolecularCharacterization of Uterine Carcinosarcoma.

Cancer Cell

Nature Methods