A grammar of graphics framework for generalized parallel coordinate plots

Abstract

Parallel coordinate plots (PCP) are a useful tool in exploratory data analysis of high-dimensional numerical data. The use of PCPs is limited when working with categorical variables or a mix of categorical and continuous variables. In this paper, we propose generalized parallel coordinate plots (GPCP) to extend the ability of PCPs from just numeric variables to dealing seamlessly with a mix of categorical and numeric variables in a single plot. In this process we find that existing solutions for categorical values only, such as hammock plots or parsets, become edge cases in the new framework. By focusing on individual observations rather than marginal frequencies we gain additional flexibility. The resulting approach is implemented in the R package ggpcp.


Yawei Ge ∗
Department of Statistics, Iowa State University
and
Heike Hofmann
Department of Statistics, Iowa State University

September 29, 2020


Keywords: high dimensional visualization, data exploration, categorical variables

∗ The authors gratefully acknowledge this research was partially funded by the 2019 Google Summer of Code.

Introduction

Few approaches in data visualization exist that are truly high-dimensional. Most visualizations are (projections of data into) two or three dimensions enhanced by additional mappings to plot aesthetics, such as point size and color, or faceting. Parallel coordinate plots are one of the exceptions: in a parallel coordinate plot we can visualize an arbitrarily large number of variables to get a visual summary of a high-dimensional data set. In a parallel coordinate plot each variable takes the role of a vertical (or parallel) axis (giving the visualization its name). Multivariate observations are then plotted by connecting their respective values on each axis across all axes using polylines (cf. Figure 1). For just two variables this switch from orthogonal axes to parallel axes is equivalent to a switch from the familiar Euclidean geometry to the projective space. In the projective space, points take the role of lines, while lines are replaced by points, i.e. points falling on a line in the Euclidean space correspond to lines crossing in a single point in the projective space. This duality provides a good basis for interpreting geometric features observed in a parallel coordinate plot (Inselberg 1985).

The origins of parallel coordinate plots date back to the 19th century and are, depending on the source, attributed either to Maurice d’Ocagne (1885) or to Henry Gannett (1880). Modern era parallel coordinate plots go back to Inselberg (1985) and Wegman (1990).

Parallel coordinate plots are used in an exploratory setting as a way to get a high-level overview of the marginal distributions involved, to identify outliers in the data and to find potential clusters of points. In the absence of those, parallel coordinate plots are often criticized for the amount of clutter they produce, resembling a game of mikado rather than organized data.
This clutter is sometimes combated by the use of α-blending (Miller & Wegman 1991), density estimation (Heinrich & Weiskopf 2009), or edge-bundling parallel coordinate plots (McDonnell & Mueller 2008). For a detailed overview of these and other techniques see Heinrich & Weiskopf (2013).

However, parallel coordinate plots have some shortcomings. The biggest challenge comes when working with categorical variables. In current solutions, levels of categorical variables are transformed to numbers and variables are then used as if they were numeric. This introduces a lot of ties into the data, and the resulting parallel coordinate plot becomes uninformative, as it only shows lines from each level of one variable to all levels of the next variable. Some versions of parallel coordinate plots have been specifically developed to deal with categorical data, e.g. parallel set plots (Kosara et al. 2006), Hammock plots (Schonlau 2003), and common angle plots (Hofmann & Vendettuoli 2013). These solutions all have in common that they work with tabularized data and show bands of observations from one categorical variable to the next. Hammock plots and common angle plots provide solutions to mitigate the effects of the sine illusion (Day & Stecher 1991) on parallel sets plots.

An attempt to combine categorical and numeric variables in a parallel coordinate plot is introduced in the categorical parallel coordinate plots of Pilhöfer & Unwin (2013). These plots provide an extension to parallel sets that allows numeric variables to be included in the plot. Similar to parallel sets, this approach is also based on marginal frequencies for the categorical variables. Categorical parallel coordinate plots are the closest of these variations to our solution, but they are not implemented in the ggplot2 framework and can therefore not be further extended.

Various packages in R (R Core Team 2019) exist that contain an implementation of one of the parallel coordinate plots.
The function “parcoord” in the MASS package (Venables & Ripley 2002) makes use of the base plot system of R to draw parallel coordinate plots. The function cpcp in package iplots implements the parallel coordinate plot (Pilhöfer & Unwin 2013). Developments based on the grammar of graphics (Wilkinson 2005) and the ggplot2 (Wickham 2016) framework are, e.g., GGally (Schloerke et al. 2018) or ggparallel (Hofmann & Vendettuoli 2016), which provides an implementation of Hammock and common-angle plots. These packages build on ggplot2, but are in effect wrappers around existing functions for highly specialized plots with tens of parameters; they neither allow the full flexibility of ggplot2 nor make use of ggplot2’s layer framework.

The remainder of the paper is organized as follows: Section 2 presents the conceptual framework of generalized parallel coordinate plots and general principles informing their construction. Section 3 describes the connection between generalized parallel coordinates and the grammar of graphics. Section 4 provides three examples highlighting different aspects of the use of generalized parallel coordinate plots in an exploratory setting.

Figure 1: Sketch of a parallel coordinate plot of two observations in four dimensions. Each dimension is shown as a vertical axis, observations are connected by polylines from one axis to the next.
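The point-line duality mentioned in the introduction can be checked numerically. The following is a minimal sketch in base R, assuming two parallel axes placed at horizontal positions 0 and 1; the slope, intercept and sample points are hypothetical values chosen for illustration.

```r
# Points on the line y = m * x + b (slope m != 1); hypothetical values.
m <- -1
b <- 1
x <- c(0.1, 0.4, 0.7, 0.9)
y <- m * x + b

# In parallel coordinates each point (x, y) becomes the segment joining
# (0, x) on the first axis to (1, y) on the second axis. Its height at
# horizontal position t is (1 - t) * x + t * y.
height_at <- function(t, x, y) (1 - t) * x + t * y

# All segments meet where the height no longer depends on x:
# 1 - t + t * m = 0, i.e. t = 1 / (1 - m), at height b / (1 - m).
t_star <- 1 / (1 - m)
h_star <- height_at(t_star, x, y)

print(t_star)
print(h_star)
```

For a slope of exactly 1 the segments are parallel and the crossing point lies at infinity, which is where the projective view becomes necessary.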

Figure 2 shows a first example of a generalized parallel coordinate plot. Shown are Edgar Anderson’s (in)famous iris data (Anderson 1935). Each iris is shown by one polyline. Lines are colored by species. The species variable is included several times as an axis in the plot. Sepal widths of irises show the worst discrimination by species, their petal widths the best. As can be seen, the categorical variable species is seamlessly incorporated into the parallel coordinate plot.

Figure 2: Generalized Parallel Coordinate Plot of E. Anderson’s iris data. (Axes, left to right: Species, Sepal.Length, Species, Sepal.Width, Species, Petal.Length, Petal.Width, Species; lines colored by Species: setosa, versicolor, virginica.)

The main idea of generalized parallel coordinate plots is that we integrate categorical variables in a way that allows us to keep track of individual observations across all variables. This means that we need to address the inherently existing ties of each level of a categorical variable. Figure 3 shows an implementation of this approach. The same variables are shown on the left as on the right (in an A-B-B-A pattern). Instead of using one value for each level, the observations within each level are spread out uniformly. A rectangle around these values visually groups all observations of a level together. Some additional space is inserted between rectangles (for a total of 10% of the vertical space) to visually separate the levels. With this modification we preserve the spirit of parallel coordinate plots by drawing a trackable polyline across the variables. In comparison, by using a single point for each level of a categorical variable as on the left hand side of Figure 3, we end up with a plot that is much less informative. The most interesting piece of information from the left hand side of the plot is that there are no observations in the top level of the first variable that go into the middle level of the second variable.

Figure 3: Sketch of a parallel coordinate plot with two old-style categorical variables (left) and the same (inverted) categorical axes as implemented in the generalized PCPs (right).

In the remainder of this section we discuss different conceptual aspects of the construction of generalized parallel coordinate plots before discussing the specific implementation in the ggpcp package.
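The uniform spreading of observations within the levels of a categorical axis can be sketched in base R as follows. This is an illustration under stated assumptions rather than the ggpcp implementation: the function name `spread_levels`, its arguments, and the choice to place observations at the midpoints of equally sized slots are all hypothetical; only the 10% total gap is taken from the text.

```r
# Map a categorical variable to vertical positions in [0, 1].
# `gap` is the total share of vertical space left between level boxes
# (0.1 mirrors the 10% mentioned in the text); `ord` is an optional
# numeric vector used to order observations within each level.
spread_levels <- function(f, gap = 0.1, ord = seq_along(f)) {
  f <- droplevels(as.factor(f))
  counts <- as.vector(table(f))
  k <- length(counts)
  n <- length(f)
  gap_each <- if (k > 1) gap / (k - 1) else 0
  avail <- if (k > 1) 1 - gap else 1
  heights <- avail * counts / n              # box height per level
  lower <- cumsum(c(0, head(heights, -1))) + gap_each * (seq_len(k) - 1)
  y <- numeric(n)
  for (j in seq_len(k)) {
    idx <- which(as.integer(f) == j)
    idx <- idx[order(ord[idx])]              # within-level order
    # midpoints of equally sized slots inside the box of level j
    y[idx] <- lower[j] + heights[j] * (seq_along(idx) - 0.5) / length(idx)
  }
  y
}

y <- spread_levels(iris$Species)
print(tapply(y, iris$Species, range))
```

Because box heights are proportional to level counts, the boxes double as a bar-like display of the marginal distribution, while every observation keeps its own position.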

What might not be apparent at first glance is that the order of the observations within each level has to be chosen carefully in order to create a visualization with as little visual clutter as possible. As all of the observations in each level of a categorical variable share the same value, the “values” on a categorical axis are not determined by the data; instead they are just positions based on tie-breaking considerations, which gives us a lot of freedom in choosing them. This way we are able to resolve a lot of potential line crossings between adjacent axes.

Figure 4: Numeric-categorical case, values in each level of the categorical variable are sorted according to the values in the numeric variable on the left.

There are four combinations of adjacent variables (N-N, N-C, C-N, and C-C) to be considered with respect to their tie-breaking behavior in constructing generalized parallel coordinate plots. While there might be ties in some numeric variables, we are not changing any of the established behavior of parallel coordinate plots, i.e. values of any numeric variable are marked along the axis (after appropriately scaling the axis) and connected to their respective counterparts on adjacent axes. When we are dealing with adjacent numeric and categorical variables (N-C, C-N), we use the values of the numeric variable to inform the position of the observations within each level of the categorical variable. Note, however, that it is not possible to sort a categorical variable with respect to numeric variables on both sides (N-C-N), as shown in Figure 4. In that situation, our choice is to always sort the categorical variable according to the numeric variable on the left hand side first and only regard the values on the right if there is no numeric variable on the left.

In dealing with adjacent categorical variables, we make use of the basic idea of parallel set plots (Kosara et al. 2006), applied to individual observations rather than based on frequencies, with the aforementioned advantages. For categorical variables, we extend the construction of tie breakers to include all adjacent categorical variables. We will call these sets of categorical variables factor blocks. Within each factor block the position of observations within each level of a categorical variable is informed by the joint distribution of a categorical variable with all of its right-sided categorical neighbors.

Figure 5: Factor block of three categorical variables. The order within each level of a categorical variable is determined by the joint distribution with all of the categorical variables on its right side.

That means that for the left-most categorical variable, the positions of the observations within each level are ordered according to their corresponding positions in its adjacent categorical right-hand neighbor, which itself is ordered by the position of observations of its right-hand categorical neighbor. This is equivalent to an order given by hierarchically sorting categorical variables from right to left. Any remaining ties within a level are, as in the N-C case, resolved by the order of values of the variable on the left. As a result, the parallel coordinate plot appears to become gradually more ordered to the right, as can be seen in Figure 5. This approach minimizes the number of crossed lines.

2.2 Break Factor Blocks

While generalized parallel coordinate plots can now deal with multiple categorical variables, we do have to pay a price in terms of complexity by adding more and more categorical variables into the plot. As the number of categorical variables in a factor block increases, the total number of combinations in the corresponding joint distribution increases multiplicatively. We can see the rapidly increasing number of combinations in Figure 6. This figure shows survival on board the RMS Titanic by gender, age, and class (Dawson 1995). Only for the right-most section in the factor block can we see a simple relationship between the eight possible combinations of survival and passenger status/crew. Moving from the ordered right hand side to the left, the display rapidly devolves into an incomprehensible mess, as a result of utilizing the full joint distribution of the factor combinations.

In order to direct visual attention to the useful structure within the data, rather than the increasing fragmentation of the joint distribution of all of the displayed variables, we introduce a method for breaking factor blocks into sub-blocks. The joint distribution of variables within the sub-blocks is preserved, but between adjacent sub-blocks, only the immediately relevant relationships are maintained. Visually, this is manifested in a re-ordering of cases within a factor block, shown by diagonal lines contained within the factor boxes. This hides some of the complexity of the full joint distribution, but allows the viewer to focus on the primary relationships of interest while preserving most of the utility of the ordering described above. Figure 7 shows the same data as Figure 6, but has breakpoints inserted at each of the interior variables; the re-ordering can be seen within the interior blocks, and the resulting chart is cleaner and easier to read.

Computationally, the logic behind the factor block break points is as follows: at each breakpoint, we arrange the right and left side of the data separately, and then reconcile the two orderings within the breakpoint box. This contains much of the visual clutter within the box indicating the factor level, which reduces the visual impact of the reordering significantly, but also makes it harder to track individual observations across the plot.
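The multiplicative growth of combinations in a factor block, and the right-to-left hierarchical ordering it rests on, can be illustrated with the `Titanic` table that ships with base R. This is a sketch under assumptions: the helper `block_order` is a hypothetical name, the axis order Age, Sex, Class, Survived is read off the figure, and breakpoints are not modelled.

```r
# Expand the built-in Titanic contingency table to one row per person.
tab <- as.data.frame(Titanic)
people <- tab[rep(seq_len(nrow(tab)), tab$Freq),
              c("Age", "Sex", "Class", "Survived")]
rownames(people) <- NULL

# Axis order assumed from the figure: Age, Sex, Class, Survived.
# Possible level combinations for the block starting at each axis and
# extending to the right: the count grows multiplicatively.
n_levels <- vapply(people, nlevels, integer(1))
combos <- rev(cumprod(rev(n_levels)))
print(combos)

# Ordering for axis j: sort by the variable itself, then by all of its
# right-hand categorical neighbours (hierarchical sort, right to left).
block_order <- function(data, j) {
  do.call(order, unname(as.list(data[j:ncol(data)])))
}
ord_class <- block_order(people, 3)   # Class, then Survived
head(people[ord_class, ], 3)
```

The count for the Class axis is the eight combinations of class and survival mentioned in the text; each axis further to the left multiplies that number by its own number of levels.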

Figure 6: Rapidly accumulating combinations from right to left. Each line in the plot corresponds to one person on board the RMS Titanic. Lines are colored by survival (No, Yes). (Axis levels, left to right: Child/Adult, Male/Female, 1st/2nd/3rd/Crew, No/Yes.)

Figure 7: Same data and structure as the previous plot, with breakpoints inserted for the second and third variable. Relationships between adjacent variables are emphasized.

2.3 Organized Over-plotting

One of the primary advantages of the generalized approach to categorical variables described above is the ability to follow a single observation throughout the plot. As the number of observations increases, this becomes less feasible because with more polylines, there are more intersections between polylines, exponentially increasing the effort required to follow a polyline from one side of the plot to the other. To reduce this tendency toward chaos, it is necessary to carefully control the order in which lines are plotted to ensure that it is relatively easy to follow a line (or group of polylines) across the plot. We have developed three different methods to control this overplotting in order to maintain visual order:

1. Plot the smaller groups of lines on top of the larger groups,
2. Plot the larger groups of lines on top of the smaller groups of lines,
3. Use the hierarchical arrangement of factor variables to order the line plotting.

The user can specify which ordering method should be used with an additional parameter. When there are multiple factor blocks, or breakpoints between factor blocks, it is necessary to reconcile plotting order within axes as well, using the same type of logic. Note that we have used two different overplotting methods in Figure 6 and Figure 7. In Figure 6 we plotted larger groups of lines on top of smaller groups of lines, while for Figure 7 we plotted smaller groups of lines on top of the larger groups. The fact that we can still see smaller lines in the first figure is due to the additional use of α-blending, i.e. we are not drawing lines at full saturation, but make them partially see-through. Usually, the hierarchical arrangement produces the best plots.

Since its publication on CRAN in 2005, the R package ggplot2 has seen a stellar rise in use with now tens of thousands of downloads a day.

This success is due to the underlying conceptual framework. ggplot2 is based on the Grammar of Graphics (Wilkinson 2005), implemented and adjusted for usability in R (Wickham 2010). This means that a plot in ggplot2 is assembled descriptively as a set of layers. Each of these layers consists of a functional mapping between the variables in the data set and a component of the plot, such as an x or a y axis, or plot specific aesthetics, such as the color or size of points. Generally, layers have a single geometric representation (such as points, lines or rectangles).

What is unique about ggplot2 is that its implementation facilitates the creation of extensions. Our package ggpcp is one such extension for generalized parallel coordinate plots.

In accordance with the modular design principle of the tidyverse (Wickham et al. 2019) we have developed a set of functions to deal with separate aspects of generalized parallel coordinate plots.

Parallel coordinate plots are somewhat unique in that there is no one-to-one mapping between a variable and an axis; instead, there are arbitrarily many variables provided, and both the x and y positions are calculated for each variable, and thus, each polyline. To accommodate this complication, we have utilized the vars function used throughout the tidyverse to allow the user to specify a set of variables using the familiar syntax of the select helper functions found in the tidyselect package. Interestingly, this also enables us to specify variables multiple times in a plot, as shown in Figure 2.

The user primarily interacts with the geom_pcp function, which is constructed to handle the different aspects of parallel coordinate plots through a consistent API. This function allows us to draw a set of polylines as in a traditional parallel coordinate plot.

Consistent with the modular approach of ggplot2, this function only draws lines, and the user must specify additional layers for additional plot components, such as the boxes for categorical variables or text for labels.

The code below generates Figure 4.
As can be seen, the layers of ggpcp work seamlessly with functionality from the tidyverse and elements of ggplot2.

    library(dplyr)     # group_by(), sample_n(), %>%
    library(ggplot2)
    library(ggpcp)

    set.seed(20191019)
    iris %>%
      group_by(Species) %>%
      sample_n(10) %>%                      # 10 irises per species
      ggplot(aes(vars = vars(Sepal.Length, Species, Sepal.Width))) +
      geom_pcp(boxwidth = 0.1, aes(color = Species), size = 1.25) +  # polylines
      geom_pcp_box(boxwidth = 0.1) +        # boxes around factor levels
      geom_pcp_label(boxwidth = 0.1) +      # level labels
      theme_bw() +
      scale_colour_manual(values = c("purple4", "darkorange", "darkgreen"))

In the next section we will discuss some examples highlighting some more sophisticated aspects of the plot.
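To make concrete the remark that both x and y positions are computed for every selected variable, the following base R sketch derives the coordinates of a traditional, numeric-only parallel coordinate plot. It does not reproduce the internals of ggpcp; the min-max rescaling and the names `vars_shown` and `rescale01` are assumptions made for illustration.

```r
# Rescale each selected numeric variable to [0, 1]; these are the y values.
vars_shown <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
y <- sapply(iris[vars_shown], rescale01)   # one row per iris, one column per axis

# The x position of an axis is simply its index in the selection.
x <- seq_along(vars_shown)

# One polyline per observation, e.g. with base graphics:
# matplot(x, t(y), type = "l", lty = 1, col = as.integer(iris$Species),
#         xaxt = "n", xlab = "", ylab = "")
# axis(1, at = x, labels = vars_shown)
dim(y)
```

Since the x position depends only on where a variable appears in the selection, listing a variable twice simply yields two axes, which is the behavior the text describes for Figure 2.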

Figure 8 shows data from Agresti (2002) published as part of the poLCA package (Linzer & Lewis 2011). Seven pathologists were asked to assess the same 118 slides for the presence or absence of carcinoma in the uterine cervix. Binary responses for each slide were recorded (yes/no). Pathologists all agreed on about 25% of slides, which they considered to be carcinoma free, and a further 12.5% of slides, which were considered to show carcinoma by all pathologists.

For the remaining 62.5% of slides there was some disagreement. However, we see that this disagreement is not random. When pathologists are ordered (by moving the corresponding axes) left to right from the fewest overall carcinoma diagnoses to the most, we see that, for a given slide, pathologists further to the right are generally more likely to make a carcinoma diagnosis.

Figure 8: Pathologists’ diagnoses of absence (no) or presence (yes) of carcinoma in the uterine cervix based on 118 slides. Each slide is shown by a polyline. (Axes, left to right: pathologists F, D, C, A, G, E, B; lines colored by number of carcinoma diagnoses.)
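The axis ordering used for this figure, from the rater with the fewest positive diagnoses to the rater with the most, amounts to sorting columns by their totals. A minimal sketch on hypothetical toy ratings (these are not the carcinoma data, and the rater names are placeholders):

```r
# Hypothetical toy ratings (not the carcinoma data): rows are slides,
# columns are raters, TRUE = carcinoma diagnosed.
ratings <- data.frame(
  A = c(FALSE, TRUE,  TRUE,  FALSE, TRUE),
  B = c(TRUE,  TRUE,  TRUE,  FALSE, TRUE),
  C = c(FALSE, FALSE, TRUE,  FALSE, TRUE),
  D = c(FALSE, FALSE, TRUE,  FALSE, FALSE)
)

# Order axes from fewest to most positive diagnoses.
axis_order <- names(sort(colSums(ratings)))
print(axis_order)

# Number of positive diagnoses per slide, usable as a color variable.
n_yes <- rowSums(ratings)
```

The resulting `axis_order` could then be used to arrange the variables of the plot, and `n_yes` plays the role of the color legend in the figure.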

Figure 9: Missing migrants - generalized parallel coordinate plot of incidents involvingmigrants in three regions of the world. Each line corresponds to one incident report. Linesare colored by cause of the incident. The variable on the right shows number of migrantsinvolved in each incident.

The Missing Migrants Project (https://missingmigrants.iom.int/) tracks incidents involving asylum seekers who have gone missing, were injured or have died on the way to their destination. The project aims to identify and track missing migrants and provide a reliable data source for media, researchers and the general public. The Missing Migrants Project started as a response to the tragedies of October 2013, when at least 368 migrants died in shipwrecks near the Italian island of Lampedusa. Here, we are exploring data from the three regions with the highest number of incidents: North Africa, the Mediterranean and the US-Mexico Border. In total we are considering 3273 incidents between Jan 1 2015 and Dec 31 2018.

Incidents are reported with several hundred classifications of causes. Here, we are focusing on the top three (drowning, sickness, and unknown) and combine all other causes under ‘other’.

Figure 9 shows a generalized parallel coordinate plot of the Missing Migrants Project data. Each line corresponds to one incident, lines are colored by the cause of the incident. We can see that the number of incidents in each of the three regions is similar, but in terms of the number of people involved, the Mediterranean clearly dominates the picture, with the worst incident estimated to have cost the lives of more than 1000 people.

We see that in 2016 one leading cause of reported incidents was sickness and disease, mostly reported in North Africa. Further investigation reveals that these are likely experienced by refugees from Sudanese camps, where poor sanitation and complete lack of medical care led to epidemic outbreaks of cholera, typhus and other preventable diseases.

Figure 10 shows a generalized parallel coordinate plot of a different aspect of the same data. Each line now corresponds to an individual instead of a group involved in an incident. An outcome variable is added to report on each individual’s presumed status.

Figure 10: Missing migrants - each line corresponds to one migrant.

The number of migrants involved in incidents peaked in 2016 and numbers have since decreased. Drowning is the leading cause of death for migrants in the Mediterranean, but as can be seen in Figure 9 there are a substantial number of drowning incidents along the US-Mexico border.

Even though we saw in Figure 9 that sickness accounted for one of the largest numbers of incident reports in 2016, fortunately the number of migrants affected is relatively small.
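The two data-preparation steps behind these figures, lumping rare causes into ‘other’ and moving from one row per incident to one row per migrant, can be sketched in base R. The records below are hypothetical and only illustrate the reshaping; the column names `region`, `cause` and `n` are assumptions, not the project's schema.

```r
# Hypothetical incident records (illustrative values only).
incidents <- data.frame(
  region = c("Mediterranean", "North Africa", "US-Mexico Border",
             "Mediterranean", "North Africa"),
  cause  = c("Drowning", "Sickness", "Dehydration", "Drowning", "Unknown"),
  n      = c(3, 2, 1, 4, 2),      # migrants involved per incident
  stringsAsFactors = FALSE
)

# Keep the causes of interest and lump everything else into "Other".
keep <- c("Drowning", "Sickness", "Unknown")
incidents$cause <- factor(
  ifelse(incidents$cause %in% keep, incidents$cause, "Other"),
  levels = c(keep, "Other")
)

# Incident level: one row (one polyline) per incident.
# Individual level: repeat each incident row once per migrant involved.
individuals <- incidents[rep(seq_len(nrow(incidents)), incidents$n),
                         c("region", "cause")]
rownames(individuals) <- NULL
table(individuals$cause)
```

Either data frame can be handed to the same plotting layers; only the observational unit behind each polyline changes.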

In this last example, we re-visit a data set that was used for the ASA Data Expo in 2006. The nasa data, made available as part of the ggpcp package, provides an extension to the data provided in the GGally package (Schloerke et al. 2018). It consists of monthly measurements of several climate variables, such as cloud coverage, temperature, pressure, and ozone values, captured on a 24x24 grid across Central America between 1995 and 2000.

Using a hierarchical clustering (based on Ward’s distance) of all January and July measurements of all climate variables and the elevation, we group locations into 6 clusters. The resulting cluster membership can then be summarized visually. Figure 11 shows a tile plot of the geography colored by cluster. We see that the clusters have a very distinct geographic pattern.

Figure 11: Tile plot of the (gridded) geographic area underlying the data. Each tile is colored by its cluster membership.

From the parallel coordinate plot in Figure 12 we see that cloud coverage in low, medium and high altitude distinguishes quite clearly between some of the clusters. (Relative) temperatures in January and July are very effective at separating between clusters in the Southern and Northern hemisphere. The connection between the US gulf coast line and the upper region of the Amazon (cluster 2) can probably be explained by a relatively low elevation combined with similar humidity levels.

A parallel coordinate plot allows us to visualize a part of the dendrogram corresponding to the hierarchical clustering. Using the generalized parallel coordinate plots we can visualize the clustering process in plots similar to what Schonlau (2002, 2004) coined the clustergram, see Figure 13 and Figure 14.

Along the x-axis the number of clusters are plotted with one PCP axis each, from two clusters (left) to 10 clusters (right-most PCP axis). Each line corresponds to one location, lines are colored by cluster assignment in the ten-cluster solution. This essentially replicates the dendrogram while providing information about the number of observations in each cluster as well as the relationship between successive clustering steps.
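The cluster-membership columns behind such a clustergram-style display can be produced with base R's `hclust` and `cutree`. The sketch below is an assumption-laden stand-in: it uses the built-in `USArrests` data instead of the nasa data, and `method = "ward.D2"` is one of two Ward variants in `hclust`; the text does not say which one was used.

```r
# Stand-in data: the built-in USArrests table, standardized. The paper
# clusters gridded climate measurements instead.
x <- scale(USArrests)

# Hierarchical clustering with Ward's criterion on Euclidean distances.
hc <- hclust(dist(x), method = "ward.D2")

# Cluster labels for the 2- to 10-cluster solutions, one column each;
# these play the role of the axes cl2, ..., cl10.
cl <- as.data.frame(cutree(hc, k = 2:10))
names(cl) <- paste0("cl", 2:10)
cl[] <- lapply(cl, factor)          # treat labels as categorical

# Each cluster of a finer solution falls inside one coarser cluster,
# which is why the lines trace out the dendrogram.
nested <- all(rowSums(table(cl$cl3, cl$cl2) > 0) == 1)
print(nested)
```

Treating each column as a categorical axis and drawing one line per observation then shows how clusters split from one step to the next.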

Figure 12: Overview of all variables involved in the clustering. (Axes: cl10, cloudhigh-Jan, cloudhigh-Jul, cloudlow-Jan, cloudlow-Jul, cloudmid-Jan, elevation-Jul, ozone-Jan, ozone-Jul, pressure-Jan, pressure-Jul, surftemp-Jan, surftemp-Jul; lines colored by cluster.)

Figure 13: Dendrogram showing number of clusters at each step of the hierarchical process using the old-style parallel coordinate plot. (Axes: cl2 to cl10, labelled “Number of clusters”.)

Figure 14: Same dendrogram as above using the much more informative generalized parallel coordinate plot.

Discussion and Further Work

The generalized parallel coordinate plot provides a visualization for a mix of categorical and numeric variables that incorporates existing variants of numeric only and categorical variables only as (trivial) special cases. Fundamental to this modified framework is the switch from a frequency based representation of categorical variables to an observation-based representation, which allows the viewer to track an individual observation across the entire plot. While the observations are drawn individually, visually the proximity of lines creates an implicit grouping that characterizes the joint distribution of the data and preserves the functionality of the original implementations of both the categorical and the numeric variants of parallel sets/parallel coordinate plots. Drawing lines for each individual might not be the fastest strategy computationally, but it allows us to focus on the human behind the data instead of aggregating the same information into a faceless statistic.

In the implementation in ggpcp we combined the extended PCP framework with the powerful grammar of graphics implementation of ggplot2. The layer-based construction of generalized parallel coordinate plots allows the user to explicitly describe each layer of the plot in a familiar manner and therefore provides flexibility in specifying and changing each aspect of the plot’s appearance. As shown in the example of migrant casualties, the tidyverse facilitates transitioning between different levels of aggregation; the generalized parallel coordinate plots tap into that power to provide visualizations for different observational units.
The ability to link between these plots in an interactive manner might be achievable by additionally leveraging the plotly framework (Sievert 2018) and would further improve upon the usefulness of generalized parallel coordinate plots.

While we are quite convinced that the generalized parallel coordinate plots are indeed an improvement over the traditional parallel coordinate plot or parallel sets, we would like to confirm this in future user studies.

While the α-blending of the lines mitigates some of the effect of the sine illusion, the exact magnitude of the mitigation heavily depends on the choice of α and the data set. Future versions of the package might implement the idea of the common angle plot, thereby adjusting for the effect by forcing all lines to appear under the same angle. This might also help with reducing the visual complexity.

References

Agresti, A. (2002), Categorical Data Analysis, 2 edn, John Wiley & Sons, Hoboken.

Anderson, E. (1935), ‘The irises of the Gaspé peninsula’, Bulletin of the American Iris Society, 2–5.

Dawson, R. J. M. (1995), ‘The “unusual episode” data revisited’, Journal of Statistics Education (3).

Day, R. H. & Stecher, E. J. (1991), ‘Sine of an illusion’, Perception, 49–55.

d’Ocagne, M. (1885), ‘Coordonnées parallèles et axiales : Méthode de transformation géométrique et procédé nouveau de calcul graphique déduits de la considération des coordonnées parallèles’, Gauthier-Villars p. 112. URL: https://archive.org/details/coordonnesparal00ocaggoog/page/n10

Gannett, H. (1880), General summary showing the rank of states by ratios 1880, plate 71, in ‘Scribner’s statistical atlas of the United States, showing by graphic methods their present condition and their political, social and industrial development’, Charles Scribner’s Sons, New York.

Heinrich, J. & Weiskopf, D. (2009), ‘Continuous Parallel Coordinates’, IEEE Transactions on Visualization and Computer Graphics (6), 1531–1538. URL: http://ieeexplore.ieee.org/document/5290770/

Heinrich, J. & Weiskopf, D. (2013), State of the Art of Parallel Coordinates, in M. Sbert & L. Szirmay-Kalos, eds, ‘Eurographics 2013 - State of the Art Reports’, The Eurographics Association.

Hofmann, H. & Vendettuoli, M. (2013), ‘Common Angle Plots as Perception-True Visualizations of Categorical Associations’, IEEE Transactions on Visualization and Computer Graphics (12), 2297–2305. URL: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6634157

Hofmann, H. & Vendettuoli, M. (2016), ggparallel: Variations of Parallel Coordinate Plots for Categorical Data. R package version 0.2.0. URL: https://cran.r-project.org/package=ggparallel

Inselberg, A. (1985), ‘The plane with parallel coordinates’, The Visual Computer (2), 69–91. URL: http://link.springer.com/10.1007/BF01898350

Kosara, R., Bendix, F. & Hauser, H. (2006), ‘Parallel Sets: interactive exploration and visual analysis of categorical data’, IEEE Transactions on Visualization and Computer Graphics (4), 558–568. URL: http://ieeexplore.ieee.org/document/1634321/

Linzer, D. A. & Lewis, J. B. (2011), ‘poLCA: An R package for polytomous variable latent class analysis’, Journal of Statistical Software (10), 1–29.

McDonnell, K. T. & Mueller, K. (2008), ‘Illustrative Parallel Coordinates’, Computer Graphics Forum (3), 1031–1038. URL: http://doi.wiley.com/10.1111/j.1467-8659.2008.01239.x

Miller, J. J. & Wegman, E. J. (1991), Computing and graphics in statistics, Springer-Verlag New York, Inc., New York, NY, USA, chapter Construction of Line Densities for Parallel Coordinate Plots, pp. 107–123. URL: http://dl.acm.org/citation.cfm?id=140806.140816

Pilhöfer, A. & Unwin, A. (2013), ‘New Approaches in Visualization of Categorical Data: R Package extracat’, Journal of Statistical Software (7).

R Core Team (2019), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria.

Schloerke et al. (2018), GGally: Extension to ’ggplot2’. R package version 1.4.0. URL: https://CRAN.R-project.org/package=GGally

Schonlau, M. (2002), ‘The clustergram: a graph for visualizing hierarchical and non-hierarchical cluster analyses’, The Stata Journal (4), 391–402.

Schonlau, M. (2003), Visualizing Categorical Data Arising in the Health Sciences Using Hammock Plots, in ‘Proceedings of the Section on Statistical Graphics, American Statistical Association’.

Schonlau, M. (2004), ‘Visualizing hierarchical and non-hierarchical cluster analyses with clustergrams’, Computational Statistics (1), 95–111.

Sievert, C. (2018), plotly for R. URL: https://plotly-r.com

Venables, W. N. & Ripley, B. D. (2002), Modern Applied Statistics with S, 4 edn, Springer.

Wegman, E. J. (1990), ‘Hyperdimensional data analysis using parallel coordinates’, Journal of the American Statistical Association, 664–675.

Wickham, H. (2010), ‘A layered grammar of graphics’, Journal of Computational and Graphical Statistics, 3–28. URL: https://doi.org/10.1198/jcgs.2009.07098

Wickham, H. (2016), ggplot2: Elegant Graphics for Data Analysis, 2 edn, Springer-Verlag New York. URL: https://ggplot2.tidyverse.org

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., Takahashi, K., Vaughan, D., Wilke, C., Woo, K. & Yutani, H. (2019), ‘Welcome to the tidyverse’,