HyperTools: A Python toolbox for visualizing and manipulating high-dimensional data
Andrew C. Heusser, Kirsten Ziman, Lucy L. W. Owen, Jeremy R. Manning
Andrew C. Heusser†, Kirsten Ziman†, Lucy L. W. Owen, and Jeremy R. Manning‡
Department of Psychological and Brain Sciences, Dartmouth College, Hanover, NH 03755
† Denotes equal contribution
‡ Address correspondence to [email protected]
Abstract
Data visualizations can reveal trends and patterns that are not otherwise obvious from the raw data or summary statistics. While visualizing low-dimensional data is relatively straightforward (for example, plotting the change in a variable over time as (x, y) coordinates on a graph), it is not always obvious how to visualize high-dimensional datasets in a similarly intuitive way. Here we present HyperTools, a Python toolbox for visualizing and manipulating large, high-dimensional datasets. Our primary approach is to use dimensionality reduction techniques [14, 17] to embed high-dimensional datasets in a lower-dimensional space, and plot the data using a simple (yet powerful) API with many options for data manipulation (e.g. hyperalignment [8], clustering, normalizing, etc.) and plot styling. The toolbox is designed around the notion of data trajectories and point clouds. Just as the position of an object moving through space can be visualized as a 3D trajectory, HyperTools uses dimensionality reduction algorithms to create similar 2D and 3D trajectories for time series of high-dimensional observations. The trajectories may be plotted as interactive static plots or visualized as animations. These same dimensionality reduction and alignment algorithms can also reveal structure in static datasets (e.g. collections of observations or attributes). We present several examples showcasing how using our toolbox to explore data through trajectories and low-dimensional embeddings can reveal deep insights into datasets across a wide variety of domains.
Introduction

“To deal with hyper-planes in a fourteen dimensional space, visualize a 3D space and say ‘fourteen’ to yourself very loudly. Everyone does it.” –Geoffrey Hinton [9]

The HyperTools toolbox is designed to reveal geometric structure in high-dimensional data through visualizations and manipulations. Modern data visualizations date back to at least the 16th century, when early data pioneers began to develop the sorts of accurate maps and diagrams we might still recognize today [5, 19]. Visualizations can reveal deep insights and intuitions about geometric structure and patterns in complex datasets by capitalizing on the human visual system’s ability to quickly and efficiently extract meaning and structure from highly complex visual information [20]. This is perhaps especially true of high-dimensional datasets, where different dimensions or features may interact in complex ways that may not be immediately obvious through conventional summary statistics.

As an illustration of the potential for summary statistics to mislead in the absence of visualization, consider the classic example, Anscombe’s Quartet [1] (Fig. 1). Anscombe’s Quartet comprises four datasets that share a common statistical profile. Because the datasets are exactly equal along several common summary measures (mean, variance, trend lines), at first glance they seem highly similar. However, plotting the datasets and comparing them visually reveals that they differ substantially in structure. Whereas low-dimensional datasets like those in Anscombe’s Quartet can be easily plotted, it is not always obvious how to visualize high-dimensional datasets (e.g. with greater than 3 dimensions) in a similarly intuitive way.

Figure 1. Anscombe’s Quartet. Each dataset shares the same descriptive statistics but exhibits a unique shape.
[Figure 1 panel "All Datasets": Mean (x) = 9; Sample Variance (x) = 11; Mean (y) = 7.50; Sample Variance (y) = 4.125; Correlation = 0.816; Linear Regression: y = 3.00 + 0.500x.]

A number of techniques, collectively referred to as dimensionality reduction algorithms, have been developed over the past half-century to map high-dimensional data onto lower-dimensional representations that may be more easily manipulated and visualized.
Some well-known examples include Principal Components Analysis (PCA) [14], Probabilistic Principal Components Analysis (PPCA) [17], Independent Components Analysis (ICA) [4, 11], Multidimensional Scaling (MDS) [18], and t-Distributed Stochastic Neighbor Embedding (t-SNE) [21]. While the details of these algorithms differ, they each provide a means of obtaining a low-dimensional representation of the original high-dimensional dataset that preserves many of the geometric properties (e.g. the overall covariance structure of the data, data grouping, etc.) to the extent possible within a low-dimensional space. In the HyperTools toolbox, we leverage these dimensionality reduction algorithms (Fig. 2a) to aid in visualizing high-dimensional data.

A second class of algorithms leveraged in our toolbox provides techniques for manipulating and aligning different high-dimensional datasets (Fig. 2b). These algorithms draw inspiration from the Procrustean transformation [16], which computes the affine transformations (i.e. translation, reflection, rotation, and scaling) that bring one trajectory into alignment with another (in terms of minimizing the mean squared Euclidean distances between the corresponding points). The hyperalignment algorithm [8] and the Shared Response Model (SRM) [3] extend this technique to find a common set of transformations that bring many (more than two) high-dimensional trajectories into common alignment. Our HyperTools toolbox leverages these alignment algorithms to allow users to manipulate and compare high-dimensional data, even when the dimensions (features) of the original observations are different (e.g. brain patterns from different people, observations from different modalities, etc.).

Taken together, the HyperTools toolbox provides a set of powerful functions for visualizing and manipulating high-dimensional data using dimensionality reduction and data alignment algorithms. The toolbox is designed with ease of use as a primary goal, such that complex visualizations and analyses may often be carried out with a single line of code. Another major goal is to enable users to easily produce visually appealing publication-quality plots, also often with only a single line of code. Our toolbox is open-source and is distributed with the MIT License.

In the next section we provide a detailed overview of the components of the HyperTools toolbox and describe how the codebase is organized. We then describe a series of analyses of datasets from a wide array of domains to highlight many of the main functions of the toolbox.
Materials and Methods
Overview
The HyperTools toolbox is written in Python and can be downloaded from our GitHub page or with pip:

    pip install hypertools    (1)

HyperTools depends on the following open-source software packages: Matplotlib [10] for plotting functionality, Seaborn [23] for plot styling, scikit-learn [15] for data manipulation (dimensionality reduction, clustering, etc.), and PPCA for inferring missing data using PPCA. The toolbox also includes a port of the hyperalignment algorithm [8] from the PyMVPA library, as well as the shared response model from the BrainIAK toolbox, as an alternative data alignment technique. In addition to providing a simple interface to several functions from these libraries, HyperTools adds a number of custom arguments to facilitate visualization and manipulation of high-dimensional data. Table 1 provides a summary of the HyperTools code base. In the remainder of this section, we provide descriptions of the primary toolbox functions, but we have not provided an exhaustive list. A feature-complete description of the API may be found on the project’s GitHub page and in the documentation included with the toolbox download.

Figure 2. Visualizing and manipulating high-dimensional data. a. HyperTools uses dimensionality reduction algorithms to project high-dimensional data onto 2D and 3D plots. As shown in the panel, the dimensionality reduction algorithm PCA may be used to find the axes that explain the most variance in the original data (left panel). The data may then be projected onto a small number of those axes to facilitate plotting (right panel). b. Another important feature of HyperTools concerns aligning datasets with different fundamental coordinate systems. The left panel displays three trajectories with similar geometries but different coordinate systems, and the right panel displays how those trajectories may be aligned (via linear transformations) into a common space using hyperalignment.
[Figure 2 axis labels: 1st PC, 2nd PC.]

    Filename                 Description
    plot/plot.py             Main plotting function: parses arguments, dispatches to static.py and animate.py
    plot/static.py           Handles all static plot logic
    plot/animate.py          Handles all animated plot logic
    tools/align.py           Aligns the coordinate space of a list of matrices using hyperalignment
    tools/cluster.py         Parcellates observations into discrete clusters using k-means clustering
    tools/describe_pca.py    Analyzes and plots how many principal components are needed to capture the covariance structure of the data
    tools/missing_inds.py    Finds nans in data and returns their indices
    tools/normalize.py       z-scores rows/columns of matrices
    tools/df2mat.py          Converts Pandas dataframes to Numpy arrays
    tools/procrustes.py      Aligns the coordinate spaces of two arrays
    tools/reduce.py          Reduces the dimensionality of one or more arrays using PCA and PPCA
    externals/srm.py         Implements the Shared Response Model (alternative alignment algorithm)
    shared/helpers.py        Collection of helper functions used across many files

Table 1. HyperTools code organization. The table lists the main files and functions that comprise the toolbox. We provide a feature-complete description of the API on the project’s GitHub page and in the documentation included with the toolbox download.

Nearly all of the HyperTools functions may be accessed through the main plot function. This design enables complex data analysis, data manipulation, and plotting to be carried out in a single function call. There are two general types of plots supported by the toolbox: static plots and animated plots.

Static Plots
Accessing the HyperTools plot functionality entails first loading the to-be-analyzed dataset into the Python workspace and converting it to a Numpy array [22] or a Pandas dataframe [13]. The format of the data should be samples (S) by features (F). Once the dataset conforms to this format, simply import the library and call the plot function:

    import hypertools as hyp    (2)
    hyp.plot(data)    (3)

By default (i.e. with no additional arguments specified), this function will perform dimensionality reduction (using PCA), convert the S × F data matrix into an S × 3 matrix, and plot the result. If there are nans present in the dataset, these missing values will be automatically interpolated using PPCA [17]. If F < 3, a 2D plot is created instead of a 3D plot. This simple interface to plotting is deceptively powerful: with a single command, the toolbox automatically fills in missing data and determines whether to create a 2D or 3D plot (reducing the dimensionality of the observations as needed).
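To make the default reduction step concrete, the following is a minimal sketch of projecting an S × F matrix onto its first three principal components with plain NumPy. It is an illustration under stated assumptions, not the HyperTools implementation: the function name pca_project and the synthetic data are hypothetical, and missing values (nans) are not handled here.

```python
import numpy as np


def pca_project(data, ndims=3):
    """Project an S x F matrix onto its first `ndims` principal components."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)  # center each feature (column)
    # Rows of vt are the principal axes, ordered by decreasing explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:ndims].T  # S rows, at most `ndims` columns


rng = np.random.default_rng(0)
data = rng.normal(size=(100, 10))  # synthetic data: S = 100 samples, F = 10 features
coords = pca_project(data)  # S x 3 coordinates, suitable for a 3D plot
flat = pca_project(data[:, :2])  # F = 2 < 3, so only two coordinates come back
```

When F < 3 the sketch returns fewer than three columns, which mirrors the case in which a 2D plot would be drawn instead of a 3D plot.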
HyperTools can also accommodate lists of Numpy arrays or Pandas dataframes (only single-level indexed dataframes are currently supported):

    hyp.plot([array1, array2, array3])    (4)

When passed a list of arrays, HyperTools will plot each array in a distinct color. Colors and styling can be customized in several ways. Like Matplotlib, HyperTools can parse format strings passed as positional arguments. For example:

    hyp.plot(array1, 'k-')    (5)
    hyp.plot([array1, array2, array3], ['bo', 'r--', 'g:'])    (6)

Line colors may also be specified via the color (or colors) keyword argument:

    hyp.plot(array1, color='g')    (7)
    hyp.plot([array1, array2, array3], colors=['b', ...])    (8)

Colors may be defined using format strings, hex codes, RGB values, or a mix of these formats. Rather than specifying the colors of each data array, colors may instead be specified for each individual sample by providing labels for each sample:

    hyp.plot(data, group=group_labels)    (9)

where group_labels is a list of length S (number of samples). (Lists of group label arrays are also supported, e.g. if the data are passed in as a list of arrays; the list of labels must be of the same length as the list of data arrays.)

HyperTools parses this function call by sub-dividing each data matrix into new lists defined by each unique label in group_labels. For example, if each sample label is a string from the set ('a', 'b', 'c') then each of these unique labels will be assigned a unique color, and the datapoints assigned to each label will be assigned that label’s color.

In addition to specifying string labels for each sample, HyperTools also supports numerical labeling. If group_labels is a list of numbers, HyperTools will bin the range covered by those numerical values (excluding nans, Nones, and infs) into n evenly spaced bins (default: n = 100) and map these values onto a color palette. The color palette used for this mapping may also be customized using the palette keyword:

    hyp.plot([array1, array2], palette='husl')    (10)

All Matplotlib and Seaborn color palettes and plot styles are supported by HyperTools.

In addition to specifying group-level labels (which are used to determine the colors of each sample), each sample may also be labeled with an additional text label that may be shown (as text) on the plot. The label and labels keyword arguments allow the user to define a list of strings (or a list of lists of strings) to be displayed next to each sample datapoint with an arrow pointing to it. Each list of labels must be of length S (number of samples). (The None value may be used to specify “blank” labels, which will not show up on the plot.)

By default, all datapoint labels are shown if the label or labels keyword is specified. However, HyperTools also supports a “data exploration” mode whereby the datapoint labels will only be shown when the mouse pointer hovers over the corresponding datapoint:

    hyp.plot(data, labels=['a', None, 'a', 'b'], explore=True)    (11)

This plotting mode is especially useful when there are many datapoints, or when the data labels are long. If explore is set to True and no labels are specified, the labels will be auto-generated as an index and the PCA coordinate (e.g. '45: (3.0, 4.0, 5.0)'). Note: at the time of this writing, the labels and explore arguments are only supported for 3D static plots.
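The numerical labeling described above (binning the range of the labels into n evenly spaced bins) can be sketched in a few lines of plain Python. This is only an illustration of the idea: the function name bin_labels, the choice to map invalid labels to None, and the example values are assumptions, not the toolbox's code.

```python
import math


def bin_labels(values, n_bins=100):
    """Map numeric labels to bin indices in [0, n_bins - 1]; invalid labels map to None."""
    def is_valid(v):
        return v is not None and not math.isnan(v) and not math.isinf(v)

    valid = [v for v in values if is_valid(v)]
    lo, hi = min(valid), max(valid)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all labels are equal
    bins = []
    for v in values:
        if not is_valid(v):
            bins.append(None)  # nans, Nones, and infs are excluded from binning
        else:
            # the largest value is placed in the last bin rather than a new one
            bins.append(min(int((v - lo) / width), n_bins - 1))
    return bins


group_labels = [0.0, 2.5, None, 5.0, float('nan'), 10.0]
bin_indices = bin_labels(group_labels, n_bins=4)  # [0, 1, None, 2, None, 3]
```

Each bin index could then select one of n colors sampled from the chosen palette, so that samples with similar label values receive similar colors.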
Animated Plots
Animated 3D plots are especially useful for visualizing high-dimensional timeseries data. To create an animation, simply toggle the animate keyword:

    hyp.plot(data, animate=True)    (12)

This will create a 3D animated representation of the data, where the animation occurs over the rows of the data matrix. As with static plots, the user may pass a list of data matrices to plot multiple datasets on a single plot, and format strings and keyword arguments may be used to customize the plot appearance. Each frame of the animation displays a portion of the total data trajectory enclosed in a cube (Fig. 3). In successive frames, the displayed portion of the data trajectory advances by a small amount, and the camera angle rotates around the cube, providing visual access to different aspects of the data as the animation progresses.

The formats of animated HyperTools plots may be customized using the following keyword arguments: duration specifies the animation duration in seconds, tail_duration specifies the duration of the trailing tail in seconds, rotations specifies the number of camera rotations around the data (over the course of the entire animation), zoom will zoom the camera in (positive number) or out (negative number) from the data, and setting chemtrails=True will plot a low-opacity version of samples prior to the currently active trajectory so that the full structure and history of the data may be visualized. For a complete list of animation-specific arguments, please see the API documentation.

Figure 3. Frames from an animated plot. Three frames (with time increasing moving from left to right) from an example animation are displayed in each panel.

Both animated and static plots can be saved by including the save_path argument (with the file extension included):

    hyp.plot(data, save_path='path/to/the/file.pdf')    (13)
    hyp.plot(data, animate=True, save_path='path/to/the/file.mp4')    (14)

where Snippet 13 saves a static plot to a resolution-independent PDF, and Snippet 14 saves an animation as an MP4 movie. Note that saving animated plots requires that ffmpeg is installed on your computer; see the API documentation for more details.

Reduce
When passed high-dimensional data, the plot function uses PPCA to fill in missing data and PCA to project the data onto 3 dimensions. We provide API access to the reduce function that underlies these transformations. At its core, the reduce function is primarily a wrapper for scikit-learn’s PCA function and the PPCA package. HyperTools extends the functionality of these tools by providing an easier-to-use syntax and adding support for lists of matrices. The function may be used as follows:

    reduced_data = hyp.tools.reduce(data, ndims=3)    (15)

Because dimensionality reduction results in information loss (relative to the original dataset), it is important to consider how accurately a low-dimensional projection of the data reflects the original high-dimensional dataset. HyperTools includes a function that plots the correlation between the covariance matrices of the reduced and full datasets, as a function of the number of principal components in the reduced dataset (Fig. 4):

    fig, ax, data = hyp.tools.describe_pca(data)    (16)

The describe_pca function computes these correlations iteratively (i.e. starting with one principal component, then two, then three, etc.) until a local maximum is detected. The resulting plot provides insights into the increase in explanatory power (in terms of the across-sample covariance) associated with each new principal component.

Figure 4. Covariance preserved as a function of the number of principal components. For an example dataset (Example 2, Results) the panel displays the correlation between the upper triangle of the across-sample covariance matrices of the reduced versus original data, as a function of the number of principal components.
[Figure 4 y-axis: Correlation.]
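The quantity plotted in Fig. 4 can be approximated with a short NumPy sketch. The version below is one plausible reading of the description and not the describe_pca source: it treats the S × S matrix of inner products between feature-centered samples as the across-sample covariance, evaluates every number of components rather than stopping at a local maximum, and uses the hypothetical name covariance_preserved with synthetic data.

```python
import numpy as np


def covariance_preserved(data, max_dims=None):
    """Correlate across-sample structure of reduced vs. full data for 1..max_dims PCs."""
    centered = np.asarray(data, dtype=float)
    centered = centered - centered.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u * s  # PCA coordinates, one column per principal component
    upper = np.triu_indices(centered.shape[0], k=1)  # upper triangle, no diagonal
    full = (centered @ centered.T)[upper]  # S x S across-sample matrix, full data
    max_dims = max_dims or scores.shape[1]
    correlations = []
    for k in range(1, max_dims + 1):
        reduced = (scores[:, :k] @ scores[:, :k].T)[upper]
        correlations.append(np.corrcoef(reduced, full)[0, 1])
    return np.array(correlations)


rng = np.random.default_rng(0)
# Synthetic data whose features have unequal variances
data = rng.normal(size=(50, 8)) * np.arange(8, 0, -1)
correlations = covariance_preserved(data)
```

With all components retained, the reduced and full matrices coincide, so the final correlation is 1 up to floating-point error; earlier values indicate how much across-sample structure a lower-dimensional projection keeps.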
Align
Two or more datasets may share geometrical structure, but reside in different coordinate systems. Hyperalignment is a method that aligns the representational spaces over a list of datasets, effectively co-registering them to a common space [8]. Using linear transformations, hyperalignment finds a common space that minimizes the distance between two or more datasets (Fig. 2b). Aligning the datasets to a common space allows one to visualize commonalities between different kinds of data. The align function accepts a list of arrays as input and returns a hyperaligned list of arrays in a common geometric space:

    hyperaligned_list = hyp.tools.align([array1, array2, array3])    (17)

In addition to supporting alignment via the hyperalignment algorithm proposed by [8], we have also added support for alignment via the Shared Response Model [3], which was ported from the BrainIAK toolbox:

    SRM_aligned_list = hyp.tools.align([array1, array2, array3], method='SRM')    (18)
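As background for these alignment functions, the two-dataset Procrustean transformation that they build on can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the toolbox's procrustes or align code: the function name procrustes_align is hypothetical, both arrays are assumed to have the same shape, and the example trajectory is synthetic.

```python
import numpy as np


def procrustes_align(source, target):
    """Return `source` translated, rotated/reflected, and scaled to best match `target`."""
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    src_c = source - source.mean(axis=0)  # remove translation
    tgt_mean = target.mean(axis=0)
    tgt_c = target - tgt_mean
    # Optimal rotation/reflection from the SVD of the cross-product matrix
    u, s, vt = np.linalg.svd(src_c.T @ tgt_c)
    rotation = u @ vt
    scale = s.sum() / (src_c ** 2).sum()  # optimal isotropic scaling
    return scale * src_c @ rotation + tgt_mean


rng = np.random.default_rng(0)
target = rng.normal(size=(100, 3)).cumsum(axis=0)  # a random-walk "trajectory"
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random rotation/reflection
source = 2.5 * target @ q + np.array([5.0, -3.0, 1.0])  # same shape, new coordinates
aligned = procrustes_align(source, target)
```

Because the source here is an exact rotated, scaled, and shifted copy of the target, the aligned result recovers the target; with real data the fit minimizes the summed squared distances between corresponding points instead of matching them exactly.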
Cluster
Some datasets exhibit clustering tendencies, whereby the data may be divided into discrete groups of similar or related samples (i.e. samples with similar features). When these discrete groups are unlabeled or unknown, clustering algorithms provide heuristics for recovering these clusters of similar samples automatically.
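One widely used heuristic of this kind is k-means clustering. The following is a minimal NumPy sketch of Lloyd's algorithm, offered as an illustration only: it is not the scikit-learn implementation, and the function name kmeans, the random initialization, and the synthetic two-blob dataset are assumptions made for the example.

```python
import numpy as np


def kmeans(data, n_clusters, n_iter=100, seed=0):
    """Lloyd's algorithm: returns (labels, centroids) for an S x F data matrix."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize centroids as randomly chosen, distinct samples
    centroids = data[rng.choice(len(data), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each sample joins its nearest centroid
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned samples
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(n_clusters)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids


rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 4))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 4))
data = np.vstack([blob_a, blob_b])  # two well-separated groups of samples
cluster_labels, centroids = kmeans(data, n_clusters=2)
```

The returned labels play the same role as group labels in a plot: each sample would be colored according to the cluster it was assigned to.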
HyperTools incorporates the k-means clustering algorithm [7] to facilitate automatic data clustering. Given a pre-chosen number of clusters, k, the n_clusters keyword argument to the plot function uses k-means clustering to automatically assign each observation to a cluster, and then colors each observation’s point according to its cluster membership:

    hyp.plot(data, n_clusters=k)    (19)

We also expose the k-means clustering algorithm directly through the cluster function:

    cluster_labels = hyp.tools.cluster(data, n_clusters=k)    (20)

The cluster function wraps the scikit-learn implementation of k-means clustering and extends it to work with lists of data matrices.

Results
Example 1: Visualizing hypercubes in 3D
To illustrate how a user might visualize high-dimensional data with HyperTools, we start by examining four synthetic datasets with unique, known structures. We generated datasets of one cube (3 dimensions) and three hypercubes of increasing dimensionality (4, 5 and 6 dimensions), each comprising 100 points along each of their respective edges. We then used HyperTools to project the hypercubes into 3-dimensional space (using PCA) and visualize the result.

Figure 5 illustrates how projecting hypercubes of different dimensionalities into 3-dimensional space distorts some aspects of their shapes, while preserving others. In the original (high-dimensional) data, all edges of each respective cube are of equal length, and each vertex comprises n adjacent edges converging orthogonally (where n is the dimensionality of the hypercube). However, in Fig. 5, some edges appear longer than others, and some vertices appear to form acute and obtuse angles.

Despite these differences, many of the underlying structural components are accurately reflected in the visualization. Namely, the visualization of each n-dimensional cube correctly depicts $2^n$ vertices, $2^{n-1} \cdot n$ edges, and n edges converging at each vertex. Each edge is also reliably reconstructed as a straight (rather than curved) line segment. The visualizations also depict increasing complexity with increasing dimensionality.

Example 2: Dimensionality reduction and clustering with various types of mushrooms
In this section, we highlight the dimensionality reduction and clustering capabilities of HyperTools. We retrieved the ‘mushroom classification’ dataset from the Kaggle database. The dataset contains annotated descriptive features of 8,124 mushrooms spanning 23 mushroom species from the Audubon Society Field Guide to North American Mushrooms [12]. Each observation comprises a list of 22 descriptive features (e.g. cap shape, cap surface, habitat, etc.) along with a tag identifying each mushroom exemplar as poisonous or non-poisonous (features for five example mushrooms are shown in Tab. 2).
Figure 5. Hypercubes with increasing dimensionality. Each dataset comprises 100 evenly spaced points along each edge of the corresponding cube with dimensionality a. 3. b. 4. c. 5. d. 6.

       class  cap-shape  cap-surface  cap-color  bruises  odor  ...  habitat
    0  p      x          s            n          t        p     ...  u
    1  e      x          s            y          t        a     ...  g
    2  e      b          s            w          t        l     ...  m
    3  p      x          y            w          t        p     ...  u
    4  e      x          s            g          f        n     ...  g

Table 2. Example of mushrooms dataset. The dataset contains annotated features (columns) of each mushroom (row), along with labels indicating whether each mushroom is poisonous or non-poisonous (not shown).
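Because the entries in Table 2 are single-character category codes, they have to be turned into numbers before they can be plotted. A common way to do this is binary (one-hot) encoding, sketched below in plain Python using five of the columns from Table 2. The function name to_binary_matrix and the naming of the output columns are assumptions made for illustration; this is not the df2mat source.

```python
def to_binary_matrix(rows, columns):
    """One-hot encode categorical rows; returns (matrix, names of the binary columns)."""
    # One (column index, value) pair per output column, in a fixed order
    pairs = [(j, value)
             for j in range(len(columns))
             for value in sorted({row[j] for row in rows})]
    names = [f"{columns[j]}_{value}" for j, value in pairs]
    matrix = [[int(row[j] == value) for j, value in pairs] for row in rows]
    return matrix, names


# Five feature columns from the example rows in Table 2
columns = ["cap-shape", "cap-surface", "cap-color", "bruises", "odor"]
rows = [
    ["x", "s", "n", "t", "p"],
    ["x", "s", "y", "t", "a"],
    ["b", "s", "w", "t", "l"],
    ["x", "y", "w", "t", "p"],
    ["x", "s", "g", "f", "n"],
]
matrix, names = to_binary_matrix(rows, columns)
```

Each original feature contributes one binary column per unique value, so every row of the resulting matrix contains exactly one 1 per original feature.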
Figure 6. Three-dimensional embeddings of the mushrooms dataset using several dimensionality reduction techniques. Each point represents a sample (mushroom). Red dots indicate poisonous mushrooms and blue dots indicate non-poisonous mushrooms.
[Figure 6 panel titles: PCA, ICA, t-SNE, MDS; legend: poisonous, not poisonous.]

Figure 7. Mushrooms dataset, colored by k-means cluster.

Because the mushroom features are provided as character strings, they must be transformed into numerical vectors to plot them. When passed a Pandas dataframe with columns containing text, HyperTools automatically converts the data into a binary matrix, where each column reflects one of the unique values of one of the features. The underlying function for converting dataframes into matrices may also be called directly:

    matrix = hyp.tools.df2mat(dataframe)    (21)

Plotting the resulting matrix with HyperTools reveals a striking clustered structure. Overall, the samples appear to cluster by whether or not they are poisonous, but they also appear to group into sub-clusters (Fig. 6). By default, HyperTools uses PCA for dimensionality reduction, but different dimensionality reduction techniques can reveal distinct geometrical properties of a dataset. To highlight this, we plotted the transformed binary matrix using several dimensionality reduction techniques (PCA, ICA, t-SNE, and MDS) to visualize their effects on clustering (Fig. 6). Each technique produces a unique low-dimensional projection of the data, highlighting distinct structural aspects.

To highlight the sub-clustering structure in this dataset, we use the n_clusters argument to plot:

    hyp.plot(mushrooms_data, n_clusters= )    (22)

This command relies on k-means clustering to empirically derive cluster labels, and then plots each cluster in a different color (Fig. 7).

Example 3: Exploring factors that influence educational outcomes.
Next, we analyzed an education dataset containing, for each of 480 students around the world, performance ratings (high, medium, and low performance), demographic descriptors (e.g. gender, nationality, place of birth, etc.) as well as classroom behaviors (number of times the student raised their hand, days absent from class, number of times the student visited online resources, etc.) and others (features for five example students are displayed in Tab. 3; for the full list of features and to download the data, see the Kaggle database). Given a dataframe with the student features, HyperTools automatically converts this into a binary data matrix (as described above) for visualization.

       gender  NationalITy  PlaceofBirth  StageID     GradeID  ...  Class
    0  M       KW           KuwaIT        lowerlevel  G-04     ...  M
    1  M       KW           KuwaIT        lowerlevel  G-04     ...  M
    2  M       KW           KuwaIT        lowerlevel  G-04     ...  L
    3  M       KW           KuwaIT        lowerlevel  G-04     ...  L
    4  M       KW           KuwaIT        lowerlevel  G-04     ...  M

Table 3. Example features in education dataset. The dataset contained categorical and numerical features, as well as student performance labels.

In contrast to the mushroom dataset, where the samples formed clear clusters, the distribution of samples in this dataset appears to form a single contiguous mass. Further, coloring each sample (student) by their performance rating reveals a striking correspondence between the student’s attributes and performance ratings (Fig. 8a). For example, as the student attributes vary along the first principal component, the student performance ratings appear to transition smoothly from low, to medium, to high (Fig. 8b). To highlight this pattern, we fit a linear regression model whose output variable was student performance and whose input variables were the first three principal components (Fig. 8c). In this way, the regression model’s outputs provide a continuous estimate of student performance, whereas the original data contained only discrete (categorized) estimates. In Figure 8d, each dot from Panel a has been re-colored according to the regression model’s performance predictions, resulting in a smooth gradient from low to high performance.
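The regression step described above can be sketched with NumPy as follows. The example uses synthetic data as a stand-in for the education dataset, codes performance as 0 (low), 1 (middle), and 2 (high), and fits ordinary least squares on the first three principal components; all of the variable names, the numeric coding, and the data are assumptions made for illustration rather than the analysis code behind Fig. 8.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students = 200
features = rng.normal(size=(n_students, 12))  # synthetic stand-in for student features
# Synthetic ratings (0 = low, 1 = middle, 2 = high) that depend on the features
latent = features @ rng.normal(size=12)
performance = np.digitize(latent, np.quantile(latent, [1 / 3, 2 / 3]))

# First three principal components of the feature matrix
centered = features - features.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pcs = centered @ vt[:3].T

# Ordinary least squares: performance ~ intercept + PC1 + PC2 + PC3
design = np.column_stack([np.ones(n_students), pcs])
coefficients, *_ = np.linalg.lstsq(design, performance, rcond=None)
predicted = design @ coefficients  # continuous estimate of performance
```

The predicted values are continuous even though the ratings take only three discrete levels, which is what allows the points to be re-colored along a smooth gradient.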
Example 4: Exploring linguistic data from presidential nominees’ Twitter posts.
Whereas the above examples illustrate how simple numerical and categorical features are processed by HyperTools to reveal geometric patterns in the data, we can use a similar approach to extract and visualize more complex features. For example, topic models [2] may be used to derive a vector representation of each document in a corpus according to its linguistic properties. Specifically, topic models identify “themes” that are reflected in varying amounts by different documents in the corpus, where each theme (topic) is defined formally as a distribution over words in the vocabulary. In other words, a neuroscience-themed topic might heavily weight words like neuron and brain, whereas a sports-themed topic might heavily weight words like running and athlete. (Fitting a topic model to a text corpus reveals what the specific topics are and how much each document reflects each topic.) Once we have derived topic vectors for each document in the corpus, we can use HyperTools to visualize the full corpus to potentially gain insights into its geometric structure.

Figure 8. Relationship between student attributes and performance. a. Each point represents the feature vector associated with a student, and the points are colored by student performance (red: high, green: middle, blue: low). b. Swarm plot of the first principal component split by student performance (coloring same as above). c. Predicted student performance from linear regression of each PC on student performance. d. Same as (a), but points are colored by linear regression predictions of student performance by PC1 value (red to violet gradient represents high to low predicted performance).
[Figure 8 panels: a. legend: High, Middle, Low. b. x-axis: Student Performance (Low, Middle, High); y-axis: PC Value. c. x-axis: PC Value; y-axis: Regression Output; legend: PC1, PC2, PC3. d. color bar: predicted performance (high to low).]

As an example of this approach, we next turn to an analysis of Twitter data (“tweets”) from the Twitter accounts of Hillary Clinton (@HillaryClinton) and Donald Trump (@realDonaldTrump) over the course of their 2016 political campaigns. The dataset, sourced from FiveThirtyEight, contains 6,444 tweets sent from the candidates’ primary Twitter accounts between April 17, 2016 and September 26, 2016.

We began our analysis by fitting a 20-topic topic model to the entire collection of tweets from both candidates, yielding a single topic vector for each tweet. Separately for each candidate, we next computed daily average topic vectors over the six-month interval covered by the dataset, and we used HyperTools to visualize the resulting day-by-day Twitter topics.

Figure 9. Topic models of political Twitter data. a. Two-dimensional representation of Clinton’s (blue) and Trump’s (red) day-by-day tweet content. b. Top ten words from each of the top three topics on selected days. c. Representative tweets from the selected days.
[Figure 9 legend: Clinton, Trump.]
[Figure 9b word lists:]
    trump https donald vote today women president says millions believe
    https thank amp makeamericagreatagain crowd rights fight votetrump making
    realdonaldtrump clinton hillary https trump tax did woman rncincle failing
    realdonaldtrump clinton hillary https trump tax did woman rncincle failing
    people americans https said speech enjoy american years history hillary
    trump https donald vote today women president says millions believe
    trump https donald vote today women president says millions believe
    great america trump2016 https media plan carolina isis ratings tickets
    job work jobs better look wow man world life hard
    trump https donald vote today women president says millions believe
    new https going say hillaryclinton york hillary timkaine people run
    president obama bad hillary states totally united person senator crooked
[Figure 9c tweets:]
    1. Why doesn’t the failing @nytimes write the real story on the Clintons and women? The media is TOTALLY dishonest!
    2. “@justininglv: @realDonaldTrump great speech today!! It's all about America and that's why you will become president!!!!” Thank you.
    3. The labor movement pioneered the basic work that made America great: If you work hard and do your part you should be able to get ahead.
    4. Obama on whether Trump could be trusted with US nuclear weapons: “Make your own judgement” https://t.co/6OZtrfIwim https://t.co/Nj20PaXF2o

Plotting the candidates’ tweet content in two dimensions reveals that Clinton’s and Trump’s tweets were primarily about different topics, resulting in a V-like topic cloud (Fig. 9a). We leveraged this structure revealed by HyperTools to select several days of interest to examine further. Specifically, we examined (1) a day of Trump tweets whose topic coordinates were especially Trump-like (i.e. at the end of the Trump side of the V), (2) a day of Trump tweets whose topic coordinates fell at the intersection of the V, (3) a day of Clinton tweets whose topic coordinates fell at the intersection of the V, and (4) a day of Clinton tweets that fell at the end of the Clinton side of the V. We wondered whether the candidates’ tweets that fell at the extreme ends of the V might be especially representative of each candidate’s unique features, whereas tweets that fell at the intersection of the V might express points of similarity between the candidates. Figure 9b displays the top 10 words from each of the top three topics for each of these days of interest, and Figure 9c provides representative tweets from each day. Strikingly (perhaps), the most Trump-like tweets appear to disparage Clinton, the most Clinton-like tweets appear to disparage Trump, and the overlapping tweets appear to praise America’s greatness.
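The daily averaging step in this analysis (collapsing per-tweet topic vectors into one vector per day) can be sketched in plain Python. The example below uses made-up 3-topic vectors for four tweets, whereas the analysis described above used a 20-topic model; the function name daily_average, the date strings, and the numbers are all hypothetical.

```python
from collections import defaultdict


def daily_average(days, topic_vectors):
    """Average the topic vectors of all tweets sent on the same day."""
    grouped = defaultdict(list)
    for day, vector in zip(days, topic_vectors):
        grouped[day].append(vector)
    # zip(*vectors) iterates over topics, so each mean is taken across tweets
    return {
        day: [sum(column) / len(vectors) for column in zip(*vectors)]
        for day, vectors in sorted(grouped.items())
    }


# Hypothetical topic vectors: one row per tweet, one column per topic
days = ["2016-04-17", "2016-04-17", "2016-04-18", "2016-04-18"]
topic_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
]
averages = daily_average(days, topic_vectors)
```

Stacking the daily averages for one candidate, in date order, gives a days × topics matrix of the kind that can be passed to a plotting function as a trajectory or point cloud.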
Example 5: Cyclical increases in global temperatures over time.
In addition to generating static point cloud plots, HyperTools may be used to generate trajectory plots to illustrate dynamic patterns in the data. To highlight this feature, we used a global temperatures dataset which we acquired from Berkeley Earth. The Berkeley Earth averaging method takes temperature observations from a large array of weather monitoring stations throughout the world and produces a time-varying estimate of the underlying global temperature field across all of the Earth’s land areas. This temperature field may then be sampled to obtain location-specific temperature estimates.

To visualize how the global temperature field changes over time, we acquired monthly average temperature estimates for 20 cities throughout the world (Fig. 10c) over the 138-year interval from 1875–2013. We used HyperTools to plot the resulting temperature trajectory (Fig. 10a,b). To visualize systematic changes over time, we plotted the month-by-month trajectory for each year in a different color using the group keyword argument to plot:

    hyp.plot(data, group=years, palette='RdBu_r')    (23)

Two general trends were revealed by plotting the temperature data in this way. First, the month-by-month temperatures within a year are cyclical (e.g. reflecting the changing seasons), which appears in the trajectory as a “figure 8” (this trend is most visible in Fig. 10b). Second, there has been a systematic shift in global temperatures over the 138-year period we examined. This appears as a systematic shift in the position of the trajectory over time (Fig. 10a), and can also be seen by directly plotting the temperatures over time (Fig. 10d).
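To make the inputs to this kind of plot concrete, the sketch below builds a synthetic months-by-cities matrix, a matching vector of per-row year labels, and a three-dimensional PCA embedding computed with NumPy. Everything here is an assumption made for illustration: the data are simulated, not the Berkeley Earth estimates, the sizes are reduced, and the SVD-based PCA stands in for whatever reduction the toolbox applies internally.

```python
import numpy as np

rng = np.random.default_rng(0)

n_years, n_cities = 10, 20  # toy sizes; the text uses 138 years and 20 cities
months = np.arange(n_years * 12)

# Synthetic stand-in for the temperature matrix (rows: months, columns: cities):
# a seasonal cycle with a city-specific amplitude, a slow warming trend, and noise.
amplitude = rng.uniform(5.0, 15.0, size=n_cities)
seasonal = np.sin(2 * np.pi * months / 12)[:, None] * amplitude[None, :]
trend = 0.01 * months[:, None]
data = seasonal + trend + rng.normal(scale=0.5, size=(months.size, n_cities))

# One label per row, so that each year can be drawn in its own color.
# Following the call shown in the text, these would be passed as group=years.
years = np.repeat(np.arange(1875, 1875 + n_years), 12)

# PCA via SVD of the mean-centered data, keeping three components.
centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
trajectory = centered @ vt[:3].T
```

In this toy setting the first component is dominated by the seasonal cycle, which is the kind of within-year structure that shows up as a repeating loop in the plotted trajectory.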
Example 6: Visualizing the correspondence between neural trajectories and a movie stimulus
In addition to providing plotting tools for visualizing complex data, HyperTools also provides tools for aligning trajectories from different sources (see Align). For example, suppose we have brain recordings from different people who all watched the same movie. The general shapes of different people’s brain data trajectories (showing how everyone’s brain responses changed over time while watching the movie), as well as the movie trajectory (showing how the movie itself changed over time), might all share similar properties (e.g. reflecting the covariance structure of the movie and how people responded to it). However, different people’s brains may have reflected those similar responses differently, and the dimensions of “brain space” and “movie space” are not directly comparable. As described in Materials and Methods, the HyperTools toolbox provides an easy-to-use interface for aligning datasets. In this example we demonstrate some uses of the align function using a previously published fMRI dataset [8], which is available for download. The dataset comprises voxel responses from ventral temporal cortex, from each of 11 people, as they watched the feature-length film Raiders of the Lost Ark. The data were processed and hyperaligned as described in the original manuscript [8].

Figure 11a displays the trajectory plots for the averaged hyperaligned brain responses from two groups of participants in the original experiment (six in group 1, the remaining five in group 2). The trajectories appear similar in their overall shape (indicating that the two groups of participants had roughly similar brain responses to the movie), but the alignment is imperfect (indicating that understanding individual differences between people’s responses might be an interesting future direction to explore).
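The idea of aligning two trajectories that share a shape but live in different coordinate systems can be illustrated with a small orthogonal Procrustes example [16]. The sketch below is written directly in NumPy and is not the toolbox’s implementation: the function name orthogonal_procrustes_align, the simulated “movie” and “brain” trajectories, and the omission of centering and scaling are all assumptions made to keep the example short.

```python
import numpy as np


def orthogonal_procrustes_align(source, target):
    """Rotate/reflect `source` to best match `target` in the least-squares sense.

    Both arrays are (timepoints x dimensions) with matching shapes. Returns the
    aligned copy of `source` and the orthogonal matrix that was applied.
    """
    # The orthogonal map R minimizing ||source @ R - target||_F is R = U @ Vt,
    # where U, S, Vt is the SVD of source.T @ target.
    u, _, vt = np.linalg.svd(source.T @ target)
    rotation = u @ vt
    return source @ rotation, rotation


rng = np.random.default_rng(0)

# A toy 3D "movie" trajectory and a "brain" trajectory that traces the same
# shape in a rotated coordinate system, plus a little noise.
movie = np.cumsum(rng.normal(size=(200, 3)), axis=0)
random_rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
brain = movie @ random_rotation + rng.normal(scale=0.01, size=movie.shape)

brain_aligned, rotation = orthogonal_procrustes_align(brain, movie)

# Distance between the two trajectories before and after alignment.
error_before = np.linalg.norm(brain - movie)
error_after = np.linalg.norm(brain_aligned - movie)
```

Because the simulated brain trajectory differs from the movie trajectory only by an orthogonal transformation and a small amount of noise, the aligned trajectory falls almost exactly on top of the movie trajectory. Real datasets will generally align less cleanly.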
[Figure 10 labels: Year; Average Temperature (°C); cities: Bangkok, Mumbai, Cairo, Cape Town, Chicago, Istanbul, London, Los Angeles, Mexico City, Montreal, Moscow, New York, Rome, Santiago, Sao Paulo, Seoul, Shanghai, Somalia, Sydney, Tokyo]

Figure 10. Global temperatures from 1875–2013. a. and b. The global temperatures dataset plotted using PCA dimensionality reduction in two views. The line colors change over time (from the earliest time point in blue to the most recent time point in red). The view on the left shows the temporal progression in one of the dimensions while the view on the right highlights the cyclical nature of the dataset. c. Locations of the 20 cities in the dataset. d. Yearly mean temperatures colored by location and black LOWESS line fit to the full dataset.

[Figure 11 labels: group 1, group 2; movie, brain; original, reconstruction]

Figure 11. Brain/movie trajectories during movie viewing. a. Group-averaged trajectory of brain activity from ventral visual cortex split into two randomly-selected groups of subjects (group 1: n = 6, group 2: n = 5) watching the same movie. b. Group-averaged trajectory of brain activity from ventral visual cortex and trajectory of movie (pixel intensities over time) hyperaligned to a common space. c. Movie frame reconstructed from ventral visual brain activity that is aligned to movie space.

We next demonstrate how HyperTools may be used to visualize the correspondence between datasets with different coordinate systems: specifically, time-varying brain responses to the movie and the time-varying pixel intensities of the movie frames. To align these spaces, we first preprocessed the movie frames to convert the movie into the S × F matrix format required by HyperTools (here S is the number of movie frames and F is the number of pixels per frame). We downsampled the movie frames from 540 × 960 RGB pixels at 30 FPS to 108 × 192 grayscale pixels at 1 FPS. We then reshaped each downsampled frame into a 20,736-dimensional vector.

We next averaged the (hyperaligned) brain responses from the 11 experimental participants to obtain a single brain response matrix. We used piecewise cubic interpolation [6] to resample this averaged brain response matrix from the original data acquisition rate (one image acquired every 2.5 s) to the downsampled movie frame rate (one image per second). We used the reduce function to project both the movie and brain data onto 6,641 dimensions (i.e. the number of voxels in the original brain data) and shifted the time labels of the brain matrix backwards by 5 s to account for the hemodynamic response. We then used the procrustes function to align the brain and movie data:

    brain_aligned_to_movie = hyp.tools.procrustes(movie_data, brain_data)    (24)

The resulting aligned brain data matrix may then be plotted in the same space as the movie data matrix (Fig. 11b). This visualization can provide insights into the similarities and differences between the geometric structure of the original movie and the structure of the brain responses to the movie.

In addition to facilitating visual comparisons of the geometries of the movie and brain data, the aligned data may also be compared in the “native” data space. For example, each coordinate of “movie space” corresponds to an image, which may be displayed and examined. Aligning the brain data to this movie space (using the procrustes function) means that each brain pattern now corresponds to a coordinate in movie space, and therefore the corresponding image may also be displayed and examined (Fig. 11c). This provides a means of viewing the original movie through the “lens” of the brain responses to that movie. This general approach could also be carried out in a cross-validated way (i.e. using one portion of the data to compute the Procrustean transformation from brain space to movie space, and then applying that transformation to the held-out brain data). We plan to explore this form of alignment-based decoding in future work.

Discussion
Visualizing high-dimensional data via low-dimensional embeddings provides an intuitive means of exploring the geometric and statistical properties of complex datasets. This can help to guide analysis decisions and facilitate hypothesis generation and testing. Returning briefly to the example of Anscombe’s quartet we discussed in the Introduction (Fig. 1), striking differences between datasets with very different geometries may be overlooked when solely considering their summary statistics, and this principle can be extended to high-dimensional data as well. Our HyperTools toolbox aims to assist in high-dimensional data visualization by providing a simple (yet powerful) set of plotting functions and data manipulation tools.

We have provided brief examples of how our toolbox may be used to examine data from a wide array of domains: geometry (Example 1), biological data (Example 2), educational and sociological data (Example 3), political and linguistic data (Example 4), climate data (Example 5), and neuroscientific data (Example 6). We chose these particular examples to showcase a broad sampling of the types of visualizations and analyses our toolbox supports, but they are not intended to indicate that our toolbox may be used in only these ways or in these domains.

We hope that HyperTools will prove useful in analyzing and visualizing complex data from a wide array of domains. We have released the toolbox under an open-source license to facilitate transparency and widespread adoption. We also hope that users will contribute to the toolbox by providing feedback and suggestions, and by sharing their own extensions and applications with the community.
Acknowledgments
We are grateful for useful discussions with Luke J. Chang and Matthijs van der Meer. We are also grateful for the help of J. Swaroop Guntupalli in implementing our align function. Our work was supported in part by NSF EPSCoR Award Number 1632738. The content is solely the responsibility of the authors and does not necessarily represent the official views of our supporting organizations.
References

[1] F. J. Anscombe. Graphs in statistical analysis. American Statistician, 27(1):17–21, 1973.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[3] P.-H. C. Chen, J. Chen, Y. Yeshurun, U. Hasson, J. Haxby, and P. J. Ramadge. A Reduced-Dimension fMRI Shared Response Model. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 460–468. Curran Associates, Inc., 2015.
[4] P. Comon, C. Jutten, and J. Herault. Blind separation of sources, part II: Problems statement. Signal Processing, 24(1):11–20, 1991.
[5] M. Friendly. A brief history of data visualization. In C. Chen, W. Härdle, and A. Unwin, editors, Handbook of Computational Statistics: Data Visualization, volume III. Springer-Verlag, Heidelberg, 2006.
[6] F. N. Fritsch and R. E. Carlson. Monotone piecewise cubic interpolation. SIAM J. Numer. Anal., 17(2):238–246, 1980.
[7] J. A. Hartigan and M. A. Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979.
[8] J. V. Haxby, J. S. Guntupalli, A. C. Connolly, Y. O. Halchenko, B. R. Conroy, M. I. Gobbini, M. Hanke, and P. J. Ramadge. A common, high-dimensional model of the representational space in human ventral temporal cortex. Neuron, 72:404–416, 2011.
[9] G. Hinton. Neural networks for machine learning. Coursera
[10] Computing In Science & Engineering, 9(3):90–95, 2007.
[11] C. Jutten and J. Herault. Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 41(1):1–10, 1991.
[12] G. Lincoff and National Audubon Society. The Audubon Society Field Guide to North American Mushrooms. A Chanticleer Press edition. Knopf, 1981.
[13] W. McKinney. Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference, pages 51–56, 2010.
[14] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2:559–572, 1901.
[15] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[16] P. Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31:1–10, 1966.
[17] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of Royal Statistical Society, Series B, 61(3):611–622, 1999.
[18] W. S. Torgerson. Theory and methods of scaling. Wiley, New York, 1958.
[19] E. R. Tufte and P. Graves-Morris. The visual display of quantitative information, volume 2. Graphics Press, Cheshire, CT, 1983.
[20] S. Uddenberg, G. Newman, and B. Scholl. Perceptual averaging of scientific data: Implications of ensemble representations for the perception of patterns in graphs. Journal of Vision, 16(12):1081, 2016.
[21] L. J. P. van der Maaten and G. E. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
[22] S. van der Walt, S. C. Colbert, and G. Varoquaux. The NumPy array: A structure for efficient numerical computation.