Scatterplot Selection Applying a Graph Coloring Problem
Takayuki Itoh Asuka Nakabayashi Mariko Hagita
1) Ochanomizu University
Figure 1. Visualization example by our technique. Several scatterplots show strong correlations between dimension pairs, several ones clearly show clusters or outliers, while several others show how two labels drawn in red and blue are separated. Our technique selects a variety of scatterplots to show various characteristics of the input multidimensional dataset in a single display space.
ABSTRACT
Scatterplot selection is an effective approach to represent essential portions of multidimensional data in a limited display space. Various metrics for evaluating scatterplots, such as scagnostics, have been presented and applied to scatterplot selection. This paper presents a new scatterplot selection technique that applies multiple metrics. The technique first calculates scores of scatterplots with multiple metrics and then constructs a graph by connecting similar scatterplots. It then solves a graph coloring problem so that different colors are assigned to similar scatterplots. We can extract a set of diverse scatterplots by selecting those to which the same color is assigned. This paper introduces visualization examples with a retail dataset containing multidimensional climate and sales values.
1 INTRODUCTION
Multidimensional data visualization has been one of the most active research issues in the visualization community. There have been various techniques on multidimensional data visualization including geometric techniques such as scatterplot matrix (SPM) and parallel coordinate plots (PCP), iconic techniques, and pixel-based techniques.
Dimension selection [5,19,21] is one of the most important issues for the visualization of high-dimensional data. It is not reasonable to represent every dimension in a limited display space; therefore, it is important to remove noisy or meaningless dimensions and to focus on the visualization of informative dimensions.
Many recent studies on multidimensional data visualization presented a variety of metrics that denote the informativeness of scatterplots. Scagnostics [18] is one of the typical metrics for the scatterplots and has been applied to scatterplot selection problems [10,17,21]. Meanwhile, it is not always appropriate to apply a single metric for scatterplot selection to represent the overall characteristics of the multidimensional data. For example, interesting correlations are observed from some pairs of dimensions while interesting clusters are observed from some other pairs of dimensions. In this case, we may want to display scatterplots that have various characteristics in a single display space.
This paper presents a new and fast technique for scatterplot selection with multiple metrics. This technique firstly generates scatterplots with arbitrary pairs of dimensions. Then, multiple scores based on multiple metrics are calculated for each scatterplot and a vector is formed from the scores. The technique constructs a graph by connecting pairs of scatterplots if the similarity between their vectors is larger than a user-defined threshold. It then assigns colors to the vertices corresponding to the scatterplots while complying with a rule that different colors are assigned to a pair of vertices connected by an edge. In other words, the same color is assigned to a set of significantly different scatterplots. The technique selectively displays a constant number of scatterplots that have the same color. As shown in Figure 1, the technique realizes the selection of a variety of scatterplots that show various characteristics of the input dataset. This paper introduces a case study with a consumer business dataset including climate and revenue values.
2 RELATED WORK
Dimension selection techniques have been widely applied to multidimensional data visualization so that an important subset of dimensions can be effectively represented. Claessen et al. [2] visualized high-dimensional datasets by representing a set of low-dimensional subspaces as a combination of PCPs and scatterplots. Suematsu et al. [15] and Zheng et al. [22] also converted high-dimensional datasets into low-dimensional subsets and visualized these subsets using multiple PCPs or scatterplots, respectively. These techniques did not provide rich interaction mechanisms to freely select the number of dimensions.
Several recent studies have demonstrated interaction mechanisms to freely visualize interesting low-dimensional subspaces. Lee et al. [6] and Liu et al. [7] applied dimension reduction schemes to interactively select subsets of the high-dimensional data. Nohno et al. [11] presented a technique to interactively contract highly correlated dimensions to adjust the number of axes displayed in PCPs. Itoh et al. [5], Watanabe et al. [17] and Nakabayashi et al. [10] presented a series of techniques that easily control the number of dimensions displayed in the PCPs or the number of dimension pairs represented by scatterplots.
It is also important to understand relationships among dimensions while extracting low-dimensional subspaces. Several recent studies [10,17,21] have visualized dimension spaces by applying scatterplots or graphs. This is also an effective approach to interactively select meaningful sets of dimensions.
Although many studies on multidimensional data visualization have applied dimension selection techniques, few studies have addressed the automatic selection of a limited number of varied, informative scatterplots. We address this problem and present a new technique in this paper.
Numeric evaluation of the informativeness of scatterplots has been an active research topic. Scagnostics is a well-known concept for quantitatively evaluating the informativeness of scatterplots. Wilkinson et al. [18] proposed nine features of scagnostics based on the appearance of the scatterplots. Wang et al. [16] proposed improved scagnostics by taking human perception into account for several metrics including "Outlying" and "Clumpy." Several more studies focus on specific metrics of scatterplots, including correlation [4,13] and class separation [1,12,14].
There have been several visualization studies on overview and exploration of a large number of scatterplots. Dang et al. [3] presented an exploration mechanism for finding similar scatterplots and filtering scagnostics. Matute et al. [8] presented another approach to representing the distribution of characteristics of scatterplots. The goal of our study is somewhat similar to the above studies since we also focus on presenting a variety of scatterplots; however, our focus differs from these studies in that we aim to select a fixed number of varied scatterplots.
3 SCATTERPLOT SELECTION APPLYING A GRAPH COLORING PROBLEM
This section presents a processing flow of the presented scatterplot selection technique. We suppose that scores of scatterplots are calculated with multiple metrics and stored as vector values. Figure 2 illustrates the concept of scatterplot selection. Scatterplots are depicted as vectors in the metrics space. The requirements for scatterplot selection in this study are summarized as follows:
R1: Select distant and long vectors to select a variety of informative scatterplots.
R2: Avoid selecting multiple close vectors to avoid selecting similar scatterplots.
R3: Avoid selecting short vectors to avoid selecting less informative scatterplots.
We present a graph coloring problem to satisfy the above requirements and display a variety of informative scatterplots.
Figure 2. Concept of scatterplot selection in the metrics space. Blue arrows illustrate the vectors of metrics. Our technique selects a variety of scatterplots while satisfying R1, R2 and R3.
This paper formalizes the problem as follows. An input multidimensional dataset $A$ has $n$ individuals as $A = \{a_1, a_2, \ldots, a_n\}$. The $i$-th individual $a_i$ has the $m$-dimensional values as $a_i = (a_{i1}, a_{i2}, \ldots, a_{im})$. A set of scatterplots formed from arbitrary pairs of dimensions is described as $S = \{s_1, s_2, \ldots, s_N\}$, where $N$ is the total number of scatterplots. Each scatterplot has a set of scores calculated based on predefined metrics, described as $s_i = (s_{i1}, s_{i2}, \ldots, s_{iM})$, where $M$ is the number of metrics. The cosine similarity between the $i$-th and the $j$-th scatterplots is described as $d_{ij} = (s_i \cdot s_j) / (|s_i||s_j|)$.
Our technique calculates multiple scores for each scatterplot based on multiple metrics. Our current implementation supports the following metrics.
Correlation is one of the most common metrics to determine the relationship between a pair of dimensions. Our current implementation simply calculates the score of the $k$-th scatterplot as follows:
$$s_{k1} = |S_{pear}(i, j)^2|$$
where $S_{pear}(i, j)$ is the Spearman's rank correlation between the $i$-th and the $j$-th dimensions. A dimension pair gets a higher score if it has a strong positive or negative correlation. Newer approaches on correlation [4,13] can also be applied.
It is easier to fit a mathematical model to a set of individuals if they form thin regions in a scatterplot. We measure the thinness of the region where the individuals are placed in the scatterplot as Wilkinson et al. [18] did. Our implementation generates a Delaunay triangular mesh $T$ connecting the individuals in a scatterplot and then removes all triangles that have at least one edge longer than a pre-defined threshold. Then, we calculate the score as follows:
$$s_{k2} = 1 - \sqrt{4 \pi A_{rea}(T)} / P_{erimeter}(T)$$
where $A_{rea}(T)$ is the total area of $T$, and $P_{erimeter}(T)$ is the total length of the boundary of $T$.
3.2.3 Clumping
It is remarkable if the individuals in a scatterplot are well-separated into several clusters. Our current implementation simply applies the metric "Clumpy" presented by Wilkinson et al. [18], defined as follows:
$$s_{k3} = 1 - \mathrm{length}(e_{maxr}) / \mathrm{length}(e_{mind})$$
Here, our implementation generates a Delaunay triangular mesh as described in the previous section, and deletes the edges that are longer than $e_{mind}$. Meanwhile, $e_{maxr}$ is the longest remaining edge.
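As a minimal sketch of the thinness score described above, the following Python function takes a triangle mesh that is assumed to have been computed elsewhere (for example by a Delaunay triangulation library), removes the triangles with an overly long edge, and evaluates one minus the ratio of $\sqrt{4 \pi \cdot \text{area}}$ to the boundary length. The function name `thinness_score`, the parameter `max_edge` (standing in for the pre-defined threshold), and the choice to return 0 for an empty mesh are illustrative assumptions, not part of the original implementation.

```python
import math
from collections import Counter


def thinness_score(points, triangles, max_edge):
    """Return 1 - sqrt(4*pi*Area(T)) / Perimeter(T) for the filtered mesh T.

    points    : list of (x, y) positions of the individuals
    triangles : list of (a, b, c) index triples (assumed Delaunay output)
    max_edge  : hypothetical name for the pre-defined edge-length threshold
    """
    def edge_len(u, v):
        return math.dist(points[u], points[v])

    # Remove every triangle that has at least one edge longer than max_edge.
    kept = [
        t for t in triangles
        if all(edge_len(t[i], t[(i + 1) % 3]) <= max_edge for i in range(3))
    ]
    if not kept:
        return 0.0  # assumption: an empty mesh gets the lowest score

    area = 0.0
    edge_count = Counter()
    for a, b, c in kept:
        (ax, ay), (bx, by), (cx, cy) = points[a], points[b], points[c]
        # Triangle area from the 2D cross product.
        area += abs((bx - ax) * (cy - ay) - (cx - ax) * (by - ay)) / 2.0
        for u, v in ((a, b), (b, c), (c, a)):
            edge_count[(min(u, v), max(u, v))] += 1

    # A boundary edge belongs to exactly one remaining triangle.
    perimeter = sum(edge_len(u, v) for (u, v), n in edge_count.items() if n == 1)
    if perimeter == 0.0:
        return 0.0  # assumption: degenerate mesh gets the lowest score
    return 1.0 - math.sqrt(4.0 * math.pi * area) / perimeter
```

Under this definition a disc-shaped region scores close to 0, while a long thin strip scores close to 1, which matches the intent of favoring thin regions.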
Newer approaches on clumping [16] can also be applied.
Suppose that one of the labels is assigned to each of the individuals. It is remarkable if the individuals that share a particular label are well-separated in a scatterplot. We measure the separateness of a particular label by calculating the entropy of the labels. In particular, we compute the entropy of the labels in the scatterplot generated with the $i$-th and the $j$-th dimensions as follows:
$$H(i, j) = -\frac{1}{n} \sum_{k=1}^{n} \sum_{c=1}^{C} p(y_k = c \mid (a_{ki}, a_{kj})) \log p(y_k = c \mid (a_{ki}, a_{kj}))$$
where $y_k$ is the label of the $k$-th individual, $(a_{ki}, a_{kj})$ is the position in the scatterplot of the $k$-th individual, and $C$ is the number of labels. Our implementation divides the scatterplot into $L$ subareas, calculates the entropy at the $l$-th subarea $H(i, j)_l$ by the above equation, and finally calculates the score of the $k$-th scatterplot as follows:
$$s_{k4} = \left(H_{max} - \sum_{l} H(i, j)_l\right) / H_{max}$$
where $H_{max}$ is the maximum value of $\sum_{l} H(i, j)_l$. Other approaches [1,12] can also be applied to determine the class separateness.
This technique applies a graph coloring problem to select a variety of scatterplots that have different characteristics. This idea was originally presented for the selection of a variety of photos from a large-scale collection [9]. Suppose a graph $G = \{S, E\}$, where $S$ is a set of vertices corresponding to the scatterplots, and $E$ is a set of edges connecting pairs of scatterplots.
The technique constructs the graph by generating an edge between the $i$-th and the $j$-th scatterplots if their similarity $d_{ij}$ is larger than the pre-defined threshold $d_{thres}$. Then, the technique assigns colors to the scatterplots while complying with the rule that different colors are assigned to a pair of vertices connected by an edge. In other words, the same color is assigned to a set of significantly different scatterplots.
Figure 3 illustrates the process. The process first selects the scatterplot that has the largest $s_k$, and assigns the color identification $c_k = 0$. Then, adjacent vertices connected by edges are traversed in breadth-first order. While visiting the $k$-th vertex, the process finds the minimum color identification that is assigned to none of the vertices adjacent to the $k$-th vertex, and assigns it to the $k$-th vertex. For example, if color identifications 0, 1, and 3 have been assigned to the vertices adjacent to the $k$-th vertex, the process sets $c_k$ to 2. The breadth-first search is repeated until color identifications are assigned to all the vertices.
Figure 3. Graph coloring. The process assigns different colors to the vertex pairs connected by edges. The numbers in this figure denote the order of the breadth-first search.
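The graph construction and breadth-first coloring described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names (`cosine_similarity`, `build_graph`, `smallest_free_color`, `color_graph`) are hypothetical, "largest" is interpreted as the longest score vector, and when the graph is disconnected the search is assumed to restart from the longest vector that is still uncolored.

```python
import math
from collections import deque


def cosine_similarity(u, v):
    """Cosine similarity between two score vectors (0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def build_graph(scores, d_thres):
    """Connect scatterplots i and j if their similarity exceeds d_thres."""
    n = len(scores)
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if cosine_similarity(scores[i], scores[j]) > d_thres:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def smallest_free_color(used):
    """Smallest non-negative color identification not contained in `used`."""
    c = 0
    while c in used:
        c += 1
    return c


def color_graph(scores, adj):
    """Assign color identifications by repeated breadth-first traversal."""
    n = len(scores)
    length = [math.sqrt(sum(a * a for a in s)) for s in scores]
    colors = [None] * n
    # Assumption: each traversal starts from the longest uncolored vector.
    for start in sorted(range(n), key=lambda k: -length[k]):
        if colors[start] is not None:
            continue
        queue = deque([start])
        queued = {start}
        while queue:
            k = queue.popleft()
            used = {colors[v] for v in adj[k] if colors[v] is not None}
            colors[k] = smallest_free_color(used)
            for v in sorted(adj[k]):
                if colors[v] is None and v not in queued:
                    queued.add(v)
                    queue.append(v)
    return colors
```

Because every vertex receives the smallest color identification not used by its already colored neighbors, two vertices joined by an edge never share a color. The number of colors produced depends on the traversal order, so this greedy procedure is not guaranteed to use the minimum possible number of colors.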
Finally, we select a predefined number of scatterplots to be displayed. The technique extracts a set of scatterplots to which the same color is assigned. We calculate the sum of the lengths of the vectors $s_k$ for each color and select the color that yields the largest sum. The extracted set of scatterplots does not include any similar pairs because similar pairs of scatterplots are connected and therefore have different colors. In other words, it satisfies R1 and R2 because the extracted set consists of a variety of different-looking scatterplots. If the number of extracted scatterplots is larger than the predefined number, the technique selects the scatterplots in descending order of $\max(s_{k1}, s_{k2}, s_{k3}, s_{k4})$, the maximum value of the four scores, to satisfy R1 and R3.
The processing flow is as follows.
1. Initialize the vertices $S$. Calculate the interestingness of the $k$-th scatterplot as $s_k$.
2. Construct the graph. Generate an edge between the $i$-th and the $j$-th scatterplots if $d_{ij}$ is larger than the pre-defined threshold.
3. Select the scatterplot that has the largest interestingness as the starting vertex.
4. Traverse the connected vertices by breadth-first search. Assign color identifications to the traversed vertices. Repeat this traversal until color identifications are assigned to all the vertices.
5. Collect the vertices that have the same color identification. Select the predefined number of vertices in descending order of $\max(s_{k1}, s_{k2}, s_{k3}, s_{k4})$.
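The final selection step (step 5) can be sketched as follows, assuming the score vectors and the color identifications from the previous steps are already available. The function name `select_scatterplots` and the parameter `num_display` (the predefined number of scatterplots) are hypothetical names introduced for this illustration.

```python
import math


def select_scatterplots(scores, colors, num_display):
    """Pick one color group and return (color, indices of chosen scatterplots).

    scores      : list of score vectors, one per scatterplot
    colors      : color identification assigned to each scatterplot
    num_display : hypothetical name for the predefined number to display
    """
    # Group scatterplot indices by color identification.
    groups = {}
    for k, c in enumerate(colors):
        groups.setdefault(c, []).append(k)

    def length(k):
        return math.sqrt(sum(a * a for a in scores[k]))

    # Choose the color whose members have the largest total vector length.
    best = max(groups, key=lambda c: sum(length(k) for k in groups[c]))

    # Within that color, rank by the maximum of the individual scores
    # and keep at most num_display scatterplots.
    ranked = sorted(groups[best], key=lambda k: max(scores[k]), reverse=True)
    return best, ranked[:num_display]
```

Ranking by the maximum score rather than by the vector length favors scatterplots that stand out strongly on at least one metric, which is how the text motivates R1 and R3.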
4 EXAMPLE
This paper introduces an example of visualization by the presented technique applied to a retail transaction and climate dataset. Table 1 shows the explanatory variables (climate values) assigned to the horizontal axis, while Table 2 shows the objective functions (retail transaction values) assigned to the vertical axis in this dataset. The dataset contained the records of 457 days from May 1, 2016, to July 31, 2017, corresponding to 457 data points in the scatterplots. We generated 35 scatterplots consisting of five horizontal axes and seven vertical axes. Note that this dataset is perturbed by adding small random real values to each column of the original dataset. The data points are drawn in red or blue: red denotes holidays while blue denotes weekdays.
Figure 1 shows an example of scatterplot selection by our technique. Here, several scatterplots show correlations between dimension pairs, several ones show clusters or outliers, while several others show how two labels drawn in red and blue are separated. This figure demonstrates that our technique successfully selects a variety of scatterplots to show various characteristics of the dataset.
Table 1: The explanatory variables (climate values).
Variable      Description
MinTemp       Minimum temperature
MaxTemp       Maximum temperature
SumRain       Precipitation
SumSunTime    Sunshine duration
MaxWind       Maximum wind speed
Table 2: The objective functions (retail transaction values).
Variable      Description
Revenue       Revenue
Guest1        Number of customers
Guest2        Number of visitors
Ratio         Conversion rate
PerGuest      Average revenue per customer
AveUnit       Average price of purchased items
AveNum        Average number of purchased items
Figures 4, 5 and 6 show the top four scatterplots that achieved the highest scores on correlation, entropy, and clumpy, respectively. In Figure 4, the horizontal axes of the scatterplots are MinTemp or MaxTemp while the vertical axes are PerGuest or AveUnit. This indicates that the average revenue or price correlates well with the temperature. Meanwhile, the vertical axes of the scatterplots in Figure 5 are Revenue, Guest1, or Guest2. This indicates that revenue and the number of guests differ drastically between holidays and weekdays. Figure 6 suggests a set of dimension pairs that provide better views for discovering outliers and clusters. The scatterplot selection result shown in Figure 1 is well-balanced because it represents various characteristics of the input dataset by selecting scatterplots that score highly on different metrics. Meanwhile, Figure 7 shows examples of scatterplots that do not score highly on any of the metrics. Indeed, these scatterplots do not look characteristic or informative. The presented technique does not aggressively select such types of scatterplots.
Figure 4. Scatterplots that achieved the highest scores on correlation.
Figure 5. Scatterplots that achieved the highest scores on entropy.
Figure 6. Scatterplots that achieved the highest scores on clumpy.
Figure 7. Examples of scatterplots that do not score highly on any of the metrics.
Figure 8 shows the normalized scores of the four metrics for the sixteen scatterplots shown in Figure 1. It demonstrates that scatterplots with a variety of score profiles are selected. The result of scatterplot selection strongly depends on the choice of $d_{thres}$. A smaller $d_{thres}$ yields a larger number of edges and consequently a larger number of scatterplot groups, corresponding to the number of colors in Figure 3. Table 3 shows the numbers of edges and colors in our experiments. Here, the selection of very similar scatterplots can be avoided by making a larger number of groups, but at the same time, the informativeness of the selected scatterplots may decrease. Figure 9 shows the maximum scores $\max(s_{k1}, s_{k2}, s_{k3}, s_{k4})$ of the selected scatterplots while adjusting the $d_{thres}$ values. This result suggests that we need to carefully adjust this threshold to select a variety of informative scatterplots.
Table 3: The numbers of edges and colors for each $d_{thres}$.
Figure 8. Scores of four metrics of sixteen scatterplots shown in Figure 1.
Figure 9. Ranks of maximum scores while adjusting $d_{thres}$.
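To illustrate how the threshold affects the graph, the following sketch counts the edges produced for several threshold values on a small set of made-up score vectors. The data in `toy_scores`, the threshold values, and the function names are invented for illustration only and are unrelated to the retail dataset or to the values reported in the experiments.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two score vectors (0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def count_edges(scores, d_thres):
    """Number of scatterplot pairs whose similarity exceeds d_thres."""
    n = len(scores)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if cosine_similarity(scores[i], scores[j]) > d_thres
    )


# Made-up score vectors (four metrics per scatterplot), for illustration only.
toy_scores = [
    (0.9, 0.1, 0.0, 0.0),
    (0.8, 0.1, 0.0, 0.0),
    (0.0, 0.0, 0.7, 0.1),
    (0.0, 0.1, 0.6, 0.0),
    (0.1, 0.0, 0.0, 0.5),
]

# Sweep the threshold from permissive to strict and report the edge count.
for d_thres in (0.1, 0.5, 0.9, 0.99):
    print(d_thres, count_edges(toy_scores, d_thres))
```

Since an edge is generated only when the similarity exceeds the threshold, the edge count can never increase as the threshold grows. The number of colors, in contrast, also depends on the traversal order of the coloring step and is best checked empirically.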
5 CONCLUSION AND FUTURE WORK
This paper presented a new scatterplot selection technique applying a graph coloring problem. The technique calculates scores based on several independent metrics for each scatterplot. Then, the technique constructs a graph by connecting vertex pairs corresponding to scatterplot pairs if these scores are similar. The graph coloring problem is applied to the graph, and the scatterplots to which the user-specified color is assigned are extracted. The paper introduced examples of scatterplot selection applied to a retail transaction and climate dataset.
Our future issues include the following. First, we would like to add and modify the metrics. There have been various improved metrics for scagnostics as mentioned in Section 2.2. We would like to apply them and explore the best combination of the metrics for this study. Then, we would like to test the scalability of the presented technique. In particular, we suppose it is necessary to test datasets that have a large number of dimensions, from which a large number of scatterplots can be generated. It is also necessary to test datasets with a large number of individuals. After the above improvements and tests, we would like to conduct case studies with various real-world datasets and user evaluations.
REFERENCES
[1] M. Aupetit and M. Sedlmair. SepMe: 2002 new visual separation measures. IEEE Pacific Visualization Symposium 2016, 43-52, 2016.
[2] J. H. T. Claessen and J. J. van Wijk. Flexible linked axes for multivariate data visualization. IEEE Transactions on Visualization and Computer Graphics, 17(12):2310-2316, 2011.
[3] T. N. Dang and L. Wilkinson. ScagExplorer: Exploring scatterplots by their scagnostics. IEEE Pacific Visualization Symposium 2014, 73-80, 2014.
[4] L. Harrison, F. Yang, S. Franconeri, and R. Chang. Ranking visualizations of correlation using Weber's law. IEEE Transactions on Visualization and Computer Graphics, 20(12):1943-1952, 2014.
[5] T. Itoh, A. Kumar, A. Klein, and J. Kim. High-dimensional data visualization by interactive construction of low-dimensional parallel coordinate plots. Journal of Visual Languages and Computing, 43(1):1-13, 2017.
[6] J. H. Lee, K. T. McDonnell, A. Zelenyuk, D. Imre, and K. Mueller. A structure-based distance metric for high-dimensional space exploration with multidimensional scaling. IEEE Transactions on Visualization and Computer Graphics, 20(3):351-364, 2013.
[7] S. Liu, B. Wang, P.-T. Bremer, and V. Pascucci. Distortion-guided structure-driven interactive exploration of high-dimensional data. Computer Graphics Forum, 33(3):101-110, 2014.
[8] J. Matute, A. C. Telea, and L. Linsen. Skeleton-based scagnostics. IEEE Transactions on Visualization and Computer Graphics, 24(1):542-552, 2017.
[9] N. Morishita, M. Hagita, H. Shioya, and T. Itoh. Graph coloring algorithms for photo selection. Conference on Applied Mathematics (in Japanese), 106-109, 2016.
[10] A. Nakabayashi and T. Itoh. A technique for selection and drawing of scatterplots for multi-dimensional data visualization. 23rd International Conference on Information Visualisation (IV2019), 62-67, 2019.
[11] K. Nohno, H.-Y. Wu, K. Watanabe, S. Takahashi, and I. Fujishiro. Spectral-based contractible parallel coordinates. 18th International Conference on Information Visualisation, 7-12, 2014.
[12] M. Sedlmair, A. Tatu, T. Munzner, and M. Tory. A taxonomy of visual cluster separation factors. Computer Graphics Forum, 31(3):1335-1344, 2012.
[13] L. Shao, A. Mahajan, T. Schreck, and D. J. Lehmann. Interactive regression lens for exploring scatter plots. Computer Graphics Forum, 36(3):157-166, 2017.
[14] M. Sips, B. Neubert, J. P. Lewis, and P. Hanrahan. Selecting good views of high-dimensional data using class consistency. Computer Graphics Forum, 28(3):831-838, 2009.
[15] H. Suematsu, Y. Zheng, T. Itoh, R. Fujimaki, S. Morinaga, and Y. Kawahara. Arrangement of low-dimensional parallel coordinate plots for high-dimensional data visualization. 17th International Conference on Information Visualisation, 59-65, 2013.
[16] Y. Wang, Z. Wang, T. Liu, M. Correll, Z. Cheng, O. Deussen, and M. Sedlmair. Improving the robustness of scagnostics. IEEE Transactions on Visualization and Computer Graphics, 26(1):759-769, 2020.
[17] A. Watanabe, T. Itoh, M. Kanazaki, and K. Chiba. A scatterplots selection technique for multi-dimensional data visualization combining with parallel coordinate plots. 21st International Conference on Information Visualisation (IV2017), 78-83, 2017.
[18] L. Wilkinson, A. Anand, and R. Grossman. Graph-theoretic scagnostics. IEEE Symposium on Information Visualization, 157-164, 2005.
[19] X. Yuan, D. Ren, Z. Wang, and C. Guo. Dimension projection matrix/tree: Interactive subspace visual exploration and analysis of high dimensional data. IEEE Transactions on Visualization and Computer Graphics, 19(12):2625-2633, 2013.
[20] Z. Zhang, K. T. McDonnell, E. Zadok, and K. Mueller. Visual correlation analysis of numerical and categorical data on the correlation map. IEEE Transactions on Visualization and Computer Graphics, 21(2):289-303, 2015.
[21] Z. Zhang, K. T. McDonnell, and K. Mueller. A network-based interface for the exploration of high-dimensional data spaces. IEEE Pacific Visualization Symposium 2012, 17-24, 2012.
[22]