ChemVA: Interactive Visual Analysis of Chemical Compound Similarity in Virtual Screening

Abstract

In the modern drug discovery process, medicinal chemists deal with the complexity of analysis of large ensembles of candidate molecules. Computational tools, such as dimensionality reduction (DR) and classification, are commonly used to efficiently process the multidimensional space of features. These underlying calculations often hinder interpretability of results and prevent experts from assessing the impact of individual molecular features on the resulting representations. To provide a solution for scrutinizing such complex data, we introduce ChemVA, an interactive application for the visual exploration of large molecular ensembles and their features. Our tool consists of multiple coordinated views: Hexagonal view, Detail view, 3D view, Table view, and a newly proposed Difference view designed for the comparison of DR projections. These views display DR projections combined with biological activity, selected molecular features, and confidence scores for each of these projections. This conjunction of views allows the user to drill down through the dataset and to efficiently select candidate compounds. Our approach was evaluated on two case studies of finding structurally similar ligands with similar binding affinity to a target protein, as well as on an external qualitative evaluation. The results suggest that our system allows effective visual inspection and comparison of different high-dimensional molecular representations. Furthermore, ChemVA assists in the identification of candidate compounds while providing information on the certainty behind different molecular representations.


María Virginia Sabando*, Pavol Ulbrich*, Matías Selzer, Jan Byška, Jan Mičan, Ignacio Ponzoni, Axel J. Soto, María Luján Ganuza, Barbora Kozlíková

Fig. 1: Overview of the ChemVA interface: a) Hexagonal view with an overview of a selected 2D projection, b) Detail view showing all data items, which are colored according to a selected feature, c) 3D view enabling the user to observe the structural similarities of selected compounds, d) Table view listing other important features of the compounds. All views are interactively linked.


Index Terms —Virtual screening, visual analysis, dimensionality reduction, coordinated views, cheminformatics.

* These authors contributed equally.
• M. V. Sabando, I. Ponzoni, and A. J. Soto are with the Institute for Computer Science and Engineering (UNS–CONICET) and with the Department of Computer Science and Engineering, Universidad Nacional del Sur, Bahía Blanca, Argentina.
• P. Ulbrich, J. Byška, and B. Kozlíková are with the Visitlab, Faculty of Informatics, Masaryk University, Brno, Czech Republic. E-mail: kozlikova@fi.muni.cz.
• M. N. Selzer and M. L. Ganuza are with the Institute for Computer Science and Engineering (UNS–CONICET) and with the VyGLab Research Laboratory (UNS-CICPBA), Department of Computer Science and Engineering, Universidad Nacional del Sur, Bahía Blanca, Argentina.
• J. Mičan is with the Loschmidt Laboratories, Department of Experimental Biology and RECETOX, and with the Faculty of Medicine, Masaryk University, Brno, Czech Republic.

1 Introduction

Small organic chemical compounds are the cornerstone of drug design. New medications are found by exploring a large number of candidate compounds or by designing new ones. In the last decades, high-throughput screening has been the main procedure applied during the early stages of the drug discovery process [19, 41]. This process requires chemical synthesis and experimental testing of large libraries of compounds against a biological target (protein), and it has a high attrition rate, which makes it costly and time-consuming. These drawbacks stimulated the development of virtual screening methods, which consist of computational techniques for identifying compounds binding to a drug target. Virtual screening makes it possible to significantly narrow down the number of drug candidate compounds at a faster pace while lowering costs [39, 90]. These reasons make virtual screening an essential part of the early-stage drug discovery process.

Computational techniques involved in virtual screening make it possible to simulate and test the fitness of a candidate compound for the desired function without the need for chemical synthesis and expensive wet laboratory work [71]. They typically consist of a high-dimensional vector-based abstraction of a given molecule, which involves dealing with the challenges inherent in high-dimensional data [27]. Many visualization tools for virtual screening have been proposed in the past years to help medicinal chemists deal with the complex computational methods. Such tools have focused on providing proper means to explore the chemical space and to enhance the interpretability of results so that trusted and explainable decisions can be made [20, 55]. A common strategy followed by such tools is the application of dimensionality reduction (DR) techniques [16, 55], which map compounds from a high-dimensional chemical space into a lower-dimensional (often 2D or 3D) representation [16]. However, visual exploration of the chemical space in the context of virtual screening entails a series of new challenges not yet addressed by the existing tools.

One of these important yet unsolved challenges is to study compound similarity under the premise that similar molecules tend to have similar bioactivity profiles [26]. Visualization tools should enable the domain expert to interact with different sources of molecular information, and provide complementary views that help find similarity determinants. Most existing tools rely on a single mapping based on an arbitrarily chosen set of molecular features. A visualization tool for virtual screening should also provide views and interactions that allow the domain expert to assess the relationship between the bioactivity profiles of the analyzed compounds and their spatial organization in the projected low-dimensional space. Lastly, as a consequence of DR, pairwise distances between compounds in the low-dimensional mapping can differ from the corresponding pairwise distances in the high-dimensional space. Information about the trustworthiness of these mappings should be visually presented to the domain expert in an interpretable way. Despite its importance, most of the existing tools do not display this information.

These issues were the main driving force behind our research and, as a result, we introduce ChemVA, an interactive system for the visual exploration of chemical compounds, targeted at virtual screening. ChemVA provides domain experts with several linked views. One set of views, consisting of 2D plots, supports the visual inspection of multiple molecular representations after undergoing DR. Each DR method produces a projection that can be displayed and mutually compared using our newly proposed Difference view. Our tool handles the overplotting problem of such projections, which is common for large datasets. ChemVA also incorporates a correlation encoding into the plots, which conveys the trustworthiness of a low-dimensional projection based on the distortion with regard to the pairwise distances in the original space. The plots are complemented by an interactive Table view, which allows for sorting and filtering and shows detailed information about the compounds being displayed, focusing primarily on features related to drug-likeness. The tool also contains a 3D view for exploring the structural similarity among selected compounds. This view performs an alignment of selected compounds with respect to their maximum common substructure, which serves to identify commonalities and differences among them. Finally, ChemVA allows loading new compounds into an existing dataset, which makes it possible to study potential new drug candidates in the context of the dataset under study. The main contributions of this paper can be summarized as follows:

• A novel visualization tool for virtual screening that makes it possible to compare and contrast multiple dimensionality reduction projections conveying complementary sources of molecular information.
• An approach for visually assessing the trustworthiness of each DR projection, which enables the user to focus on the most suitable vector-based molecular representations.
• A newly proposed Difference view that enables the comparison of multiple DR projections and the assessment of their trustworthiness in terms of neighborhood preservation.
• Support for de novo drug design by means of a set of coordinated views for analyzing newly designed compounds and comparing them with an existing set of chemical compounds.
• Functional validation of the proposed tool by means of two case studies and a qualitative testing of the proposed views. These evaluation activities were conducted by three domain experts and one visualization expert.

2 Related Work

In this section, we conduct a brief survey of existing approaches in the different areas related to our work. We review DR techniques and their application to the visualization of high-dimensional spaces, focusing mainly on parametric DR methods. Then, we discuss the state-of-the-art molecular visualization and visual exploration tools for large molecular ensembles.

Over the last few decades, a variety of visualization methods for multidimensional data have been proposed [5, 40, 82]. Discovering patterns in multidimensional data using a combination of visual and machine learning techniques represents a well-known challenge in visual analytics. Visual exploration of multidimensional data allows for the injection of unique human perceptual and cognitive abilities directly into the process of discovering multidimensional patterns [32]. The challenge mostly lies in having a large number of variables and relationships that have to be considered simultaneously. As the number of variables increases, the user's ability to understand interactions and correlations between them becomes severely limited [80].

In this context, a variety of approaches have been introduced to visually convey high-dimensional data by using two-dimensional projections, so that salient structures or patterns can be perceived while exploring the projected data [45, 61, 91]. One typical approach is to transform the original dataset using a DR technique, and then to visually encode only the reduced data [50, 65]. A frequently used visual encoding technique for showing the projected low-dimensional data is a scatter plot. Sedlmair et al. [66] carried out an extensive investigation on the effectiveness of visual encoding choices, including 2D scatter plots, interactive 3D scatter plots, and scatter plot matrices. Their findings suggest that the 2D scatter plot is the most suitable approach to explore the output of different DR algorithms.

Many DR techniques have been proposed throughout the last decades to transform multidimensional feature spaces onto a low-dimensional manifold (typically a two- or three-dimensional space) while trying to preserve neighborhood relationships among the original data instances [78, 86]. Among these methods, t-distributed Stochastic Neighbor Embedding (t-SNE) [77] has been widely adopted, specifically for the visualization of very high-dimensional data [43, 57, 70, 85]. Many of the proposed DR techniques are non-parametric, i.e., they find patterns in specific manifolds and do not provide means to map new data points from the original high-dimensional space to the latent space. Therefore, their out-of-sample extension is not possible. Linear methods for DR, such as PCA [84], LDA [12], or MSR [69], have been extensively used, but are constrained to linear transformations of the original data. In contrast to other non-linear techniques, e.g., SOMs [31], MDS [9] or autoencoders [34], which can map non-linear data, t-SNE tackles the crowding problem effectively by using a heavy-tailed distribution for arranging data points in the low-dimensional space [77]. Practically speaking, addressing the crowding problem yields better-looking visualizations. Although t-SNE is non-parametric, there is a parametric t-SNE version [76], which seeks to overcome this limitation. Nonetheless, it is often difficult to find an optimal configuration of hyperparameters for these models, which in turn yields noisy projections compared to those obtained using the non-parametric version of the algorithm [44, 87, 93].

In particular, visualization tools for virtual screening should allow medicinal chemists to analyze how a previously unseen compound interacts with or relates to other known compounds. For this reason, DR techniques used in the visualization tools for virtual screening should provide means for mapping unseen compounds, and they should be fast enough to enable interactive use.
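To make the notion of out-of-sample extension concrete, the following minimal sketch fits a parametric (here linear, PCA-style) mapping on a set of feature vectors and then reuses the learned parameters to place a previously unseen point. PCA is used only because it is the simplest parametric method named above; it is not meant to represent the DR model used by ChemVA. The function names `fit_linear_projection` and `project`, the power-iteration solver, and the toy data are assumptions made for illustration.

```python
import math
import random


def _normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def fit_linear_projection(rows, n_components=2, iters=200, seed=0):
    """Fit a PCA-style linear map with power iteration (illustrative only)."""
    n, dim = len(rows), len(rows[0])
    mean = [sum(row[j] for row in rows) / n for j in range(dim)]
    centered = [[row[j] - mean[j] for j in range(dim)] for row in rows]
    # Covariance matrix of the centered data.
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(dim)]
           for a in range(dim)]
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        vec = _normalize([rng.uniform(-1.0, 1.0) for _ in range(dim)])
        for _ in range(iters):
            nxt = [sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
            if math.sqrt(sum(x * x for x in nxt)) < 1e-12:
                break  # no variance left in any direction
            vec = _normalize(nxt)
        eigval = sum(vec[a] * cov[a][b] * vec[b]
                     for a in range(dim) for b in range(dim))
        components.append(vec)
        # Deflate: remove the direction just found before searching for the next.
        cov = [[cov[a][b] - eigval * vec[a] * vec[b] for b in range(dim)]
               for a in range(dim)]
    return mean, components


def project(point, mean, components):
    """Map any point, seen or unseen, with the stored parameters."""
    centered = [p - m for p, m in zip(point, mean)]
    return [sum(c * x for c, x in zip(comp, centered)) for comp in components]


# Fit on known data, then place a new point without refitting.
known = [[-2, -1, 0], [-2, 1, 0], [2, -1, 0], [2, 1, 0]]
mean, components = fit_linear_projection(known)
new_xy = project([3, 0.5, 7], mean, components)
```

Because the mapping is stored as explicit parameters (`mean` and `components`), projecting a new compound costs a few dot products. A non-parametric method would instead have to recompute the whole embedding.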

Molecular visualization is one of the oldest branches of visualization, as it has a well-established basis of visual representations, which has been largely embraced by biochemists and biologists. These (mostly 3D) representations are integrated into many available molecular visualization tools, such as PyMOL [64], VMD [21], Chimera [54], or YASARA [35]. Besides these tools, there are also several others, mainly web-based ones, that can be embedded into other applications; for instance, Jmol [25], JSmol [17], or 3Dmol.js [58]. However, none of these tools is directly applicable to the problem of visual exploration of large ensembles of molecular structures and the similarities in their structure and properties. A more extensive overview of the currently available approaches to molecular visualization and molecular systems can be found in the survey by Kozlíková et al. [33].

Visual inspection of small molecules is supported in different ways by some available tools. Smaller molecules are commonly visualized in 2D in order to better depict their structure, such as in LigPlot+ [37]. This tool serves mainly for exploring the interactions between a ligand and its target protein, by projecting its molecular structure and the protein amino acids onto the 2D plane. However, such a tool is not applicable to the task of analyzing multiple molecular compounds.

The exploration of large sets of chemical compounds constitutes an ideal task from the visual analytics viewpoint, and several such tools have already been proposed. The closest to our proposed tool is CheS-Mapper [16]. It can load data from several chemical databases, calculate molecular descriptors, show clusters of compounds in a reduced 2D space, and display the 3D representation of the compounds. Nonetheless, the tool does not provide an option to interactively compare different projections and does not support inspecting the physical-chemical properties of a newly added compound.

A similar tool is DataWarrior [62], which consists of several basic visualization methods and specialized views to show an overview of the whole chemical and pharmacophore space of the input dataset. However, the tool is not designed for the specific tasks of the early-stage drug discovery process that our tool addresses. More recently, Yoshimori et al. [89] proposed a technique for visually summarizing information about multiple Structure-Activity Relationship Matrices, based on molecular grid maps and 3D activity landscapes. Nevertheless, due to its specificity, it cannot be extended to operate with additional physical-chemical properties and descriptors.

Janssen et al. [24] introduced a method that uses a visual mapping based on t-SNE projections for finding new potential kinase inhibitors. The views provided by the tool show the results of clustering and a tree-like structure of compounds. However, such views are not designed to easily extract information about the structural similarity of compounds. A similar concept was introduced by Probst and Reymond [56]. Their approach is tailored to process very large high-dimensional datasets, and the compound similarity is expressed by the proximity of compounds through tree branches.

Naveja and Medina-Franco [49] introduced the constellation graphs of chemical compounds clustered according to a shared core scaffold in the 2D projections of their geometry. Each cluster is annotated with a small view containing the projected core. While such constellation graphs provide an innovative approach for the visualization of compounds, they do not fit our requirements for visual exploration in the context of virtual screening. Synergy Maps [38] is another web-based application for displaying relationships between compounds and understanding potential synergies between them. The tool shows pairwise combinations of properties of compounds using a network representation. Nonetheless, the node-link diagrams do not scale properly, which hinders its application to large ensembles of compounds.

Although methods for assessing the trustworthiness of DR techniques have been proposed [2, 10, 53], to the best of our knowledge none of the existing approaches for visualization of molecular structures supports this task. Moreover, they do not provide means to compare the results of multiple projections.

3 Background

In this section, we provide details on the theoretical background of our proposed tool, which includes the molecular representations used and their semantics, as well as a short description of the chemical features of major relevance for drug-likeness.

Research in cheminformatics suggests that there are several different factors involved in assessing chemical compound similarity, which go beyond having similar geometric arrangements of atoms and bonds [3, 42]. Given the relevance of assessing compound similarity in virtual screening, it is important to provide medicinal chemists with a variety of vector-based molecular representations that capture different aspects of the compounds under study and that are complementary to each other. In this sense, ChemVA provides four different sources of data:

• Extended Connectivity Fingerprints (ECFPs) [59] encode the topological information of the molecular structure as a fixed-length vector of bits.
• Daylight Fingerprints [11] are path-based vectors of bits that encode all fragments of a molecule, which are obtained by traversing its molecular graph.
• Molecular Descriptors [74] are numerical values associated with the chemical constitution of a molecule. They convey information such as molecular weight, electronic configuration, solubility, etc.
• Molecular Embeddings [23] are a rather novel type of vector-based representation. They are typically obtained by applying machine learning models that are trained to learn data representations of molecules from large and diverse sets of compounds.

We selected these four vector-based molecular representations as our sources of data because they capture different information about the compounds, thus allowing for a broader analysis of the data under study, and because each of them is considered state of the art for its type. While we use these four representations, it is worth noting that ChemVA remains independent from the chosen sources of data. Details of their calculation are provided in the Supplementary Material.
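As a rough illustration of what a path-based bit-vector representation looks like, the sketch below enumerates linear atom paths in a toy molecular graph, hashes each path into a fixed-length bit vector, and compares two such vectors. It is a deliberately simplified assumption-laden example: it ignores bond orders, rings, and aromaticity, it is not the Daylight algorithm, and the Tanimoto coefficient used for comparison is a common choice for bit vectors that is not prescribed by the text above. All function names are hypothetical.

```python
import zlib


def path_fragments(atoms, bonds, max_atoms=3):
    """Enumerate linear atom paths with up to `max_atoms` atoms."""
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    fragments = set()

    def walk(path):
        symbols = tuple(atoms[i] for i in path)
        # A path and its reverse describe the same fragment.
        fragments.add("-".join(min(symbols, symbols[::-1])))
        if len(path) < max_atoms:
            for nxt in neighbors[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return fragments


def hashed_fingerprint(fragments, n_bits=64):
    """Fold fragments into a fixed-length bit vector (collisions are possible)."""
    bits = [0] * n_bits
    for frag in fragments:
        bits[zlib.crc32(frag.encode("utf-8")) % n_bits] = 1
    return bits


def tanimoto(fp_a, fp_b):
    """Shared on-bits divided by on-bits in either vector (0.0 if both empty)."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 0.0


# Heavy-atom graphs of ethanol (C-C-O) and propane (C-C-C).
ethanol = path_fragments(["C", "C", "O"], [(0, 1), (1, 2)])
propane = path_fragments(["C", "C", "C"], [(0, 1), (1, 2)])
similarity = tanimoto(hashed_fingerprint(ethanol), hashed_fingerprint(propane))
```

The fixed vector length is what allows such fingerprints to serve directly as input to a DR method, at the cost of occasional hash collisions between unrelated fragments.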

Some molecular features are of major interest to drug designers and medicinal chemists in the context of virtual screening. In particular, molecular features related to drug-likeness make it possible to scrutinize the compounds in terms of their viability as potential new drugs. ChemVA takes into consideration a subset of such features:

• Molecular weight represents the average mass of the molecule expressed in Daltons (Da). Drug-like compounds are expected to weigh from 160 to 500 Da.
• logP is a quantitative measure of lipophilicity, and indicates how easily a substance is absorbed by living tissue. Drug-like compounds often exhibit logP values from − . + .
• Acidic Dissociation Constant and Basic Dissociation Constant are quantitative measures of the strength of an acid (or base) in a chemical solution. They are good indicators for a drug's ability to enter the bloodstream and to accumulate in tissues or secretions.
• Lipinski's Rule of 5 (RO5) is a rule of thumb to determine whether a compound is likely to be active in humans. This rule states that an orally active drug violates no more than one of four criteria based on threshold values of specific chemical features.
• Weighted Quantitative Estimate of Drug-likeness (QED) is a score computed based on chemical features linked to drug-likeness, and it ranges from 0 (non drug-like) to 1 (drug-like).

4 Requirements

Over the course of one year, we conducted numerous sessions with the group of protein engineers from the Loschmidt Laboratories at Masaryk University. Based on their input, we identified several limitations of the existing approaches for virtual screening and summarized them into a list of requirements. The experts agreed that these requirements cover the most critical aspects of a virtual screening workflow. One of the experts, who is also a co-author of this paper, has dedicated significant time to ensuring that our implementation addresses the requirements by iteratively checking and commenting on the progress. His 4-year research experience in protein engineering, along with his internal consultations with the head of his group, provided the necessary domain knowledge for the design of the tool.

R1: Overview and detailed analysis of a molecular ensemble in the low-dimensional space.

Scatter plots, which are commonly used to represent the DR output, may suffer from occlusion problems for large datasets. Therefore, the tool should provide visual support for the analysis of data on different levels of abstraction, from the overall distribution of the compounds within the 2D space to the detailed view of individual compounds for a selected region of interest.

R2: Visual inspection of multiple projections.

A set of compounds can be expressed by different vector-based molecular representations, each yielding a different DR projection. The tool should enable the user to intuitively combine the information encoded in the individual projections, so that the similarity between compounds can be studied at once under different molecular representations. More specifically, this includes exploring the similarities and differences between chemical compounds expressed by the different projections. Additionally, the visual representations and interactions should help the domain experts evaluate the suitability of the selected DR model.

R3: Evaluation of the trustworthiness of projections.

Users need proper visual support for assessing the trustworthiness of a low-dimensional projection based on the distortion with regard to pairwise distances between compounds in the original space. When such trustworthiness can be compared on different DR projections, users can focus the exploration on a subset of molecular representations.
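One way to quantify the distortion that R3 refers to is sketched below: for every compound, the distances to all other compounds are computed both in the original space and in the 2D projection, and the two distance lists are correlated. This is only an assumed formulation for illustration. The Euclidean metric, the per-compound scheme, and the names `trust_scores` and `pearson` are hypothetical choices; the exact calculation used by ChemVA is described in its Supplementary Material.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either list is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def trust_scores(high_dim, low_dim):
    """One score per compound: how well its 2D distances track the original ones."""
    scores = []
    for i in range(len(high_dim)):
        others = [j for j in range(len(high_dim)) if j != i]
        d_high = [euclidean(high_dim[i], high_dim[j]) for j in others]
        d_low = [euclidean(low_dim[i], low_dim[j]) for j in others]
        scores.append(pearson(d_high, d_low))
    return scores


# A projection that merely rescales distances is perfectly faithful.
original = [[0, 0, 0], [1, 0, 0], [3, 0, 0], [6, 0, 0]]
projected = [[0, 0], [2, 0], [6, 0], [12, 0]]
scores = trust_scores(original, projected)
```

Scores close to 1 indicate that the neighborhood of a compound is preserved by the projection, whereas low or negative scores flag compounds whose position in the 2D plot should be interpreted with care.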

R4: Comparison of compounds according to features related to drug-likeness.

Chemical compounds have many features and descriptors related to drug-likeness that can complement other molecular representations in the task of virtual screening. Therefore, it is desirable to provide users with an option to visualize these additional features along with the projections.
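As an example of such a drug-likeness feature, the sketch below evaluates Lipinski's Rule of 5 from Section 3. The four thresholds are the commonly cited ones (molecular weight of at most 500 Da, logP of at most 5, at most 5 hydrogen-bond donors, and at most 10 hydrogen-bond acceptors); they are not spelled out in this paper and are stated here as an assumption, as are the function names and the purely hypothetical input values.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """List the Rule-of-5 criteria that a compound violates."""
    violations = []
    if mol_weight > 500:
        violations.append("molecular weight > 500 Da")
    if logp > 5:
        violations.append("logP > 5")
    if h_donors > 5:
        violations.append("more than 5 hydrogen-bond donors")
    if h_acceptors > 10:
        violations.append("more than 10 hydrogen-bond acceptors")
    return violations


def passes_rule_of_five(mol_weight, logp, h_donors, h_acceptors):
    """A compound passes if it violates no more than one criterion."""
    return len(lipinski_violations(mol_weight, logp, h_donors, h_acceptors)) <= 1


# Hypothetical compounds, not taken from any dataset.
print(passes_rule_of_five(350.0, 3.2, 2, 6))   # no violations
print(passes_rule_of_five(720.0, 6.1, 2, 6))   # two violations
```

A categorical result of this kind (pass or fail, or the number of violations) is the sort of feature that can be color-encoded on top of a projection or listed in a table next to the numerical descriptors.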

R5: Comprehensible viewing of 3D structural similarity.

The tool should support the inspection of individual compounds in terms of their 3D geometry, as well as the visual comparison of common 3D substructures in a selected set of compounds. For multiple compounds, such a view should convey the information about similarities and differences in their 3D structure.

R6: Possibility to add new compounds and comparison with the existing data.

The tool should support the process of exploration of different features and bioactivity for newly added compounds. The new compound should be projected using the DR model and integrated into the remaining views, so that the user can compare its features to those of the compounds in the existing dataset.

5 ChemVA Design and Implementation

The design process of ChemVA followed the requirements listed in Section 4. These requirements are addressed through several coordinated views, which pose challenges to the layout of the tool. ChemVA was developed in JavaScript using D3.js v5 [6] and a Node.js server [73] built using the Express framework and a REST API for the connection to supporting services. We used Unity3d [72] ported to WebGL [29] for the development of 3D components. Two back-end web services, used for computing the alignment of 3D structures and for computing the 2D coordinates of newly added compounds, were developed using the Flask web framework [60] and written in Python v3. Excluding one of the views [13], all of the functionality was designed and written in-house, prioritizing system responsiveness and a real-time experience during interactions.

In this section, we first describe the individual views supported by ChemVA in detail, then we briefly introduce the DR model applied in our tool, and lastly we present the main functionality supporting the analysis of newly added compounds.

The initial layout, depicted in Figure 1, contains the Hexagonal view, Detail view, and 3D view (Figure 1 A, B, and C, respectively) in the top row of the canvas, whereas the bottom row contains the Table view (Figure 1 D). These views account for requirements R1, R4 and R5. For tasks involving the comparison of projections, stated in requirement R2, additional views need to be displayed. After careful consideration, we decided to position these views between the two rows of the basic layout (see Figure 8), as they need to be as close as possible to the default 2D plot views. This row can be shown or collapsed on demand and allows the user to choose which views should be included (e.g., Hexagonal, Detail, or Difference views). The purpose of this layout is to allow the user to observe the data from different perspectives. In particular, the Difference view shows the differences between the compound neighborhoods from one projection into another, thus allowing the user to compare different projections. The tool uses the color scheme proposed by Okabe & Ito [22], so it is accessible to users with color vision deficiency.

The core of ChemVA consists of 2D plots, which give the user an overview of the distribution of compounds in a selected DR projection using a given molecular representation. This overview is supported by the Hexagonal view, a well-adopted and commonly used approach to visualize the outcome of DR techniques [66]. The Hexagonal view aims to overcome the overplotting problem, in which the projected data items overlap and cause visual clutter, thus limiting the interpretability, especially for datasets exhibiting high similarity between data items.

The Hexagonal view of ChemVA seeks to solve this problem by aggregating individual data items into individual hexagons. The user can interactively select a subset of hexagons of interest and explore the distribution of individual data items within the Detail view. The combination of the Hexagonal view and the Detail view aims to fulfill requirement R1. Finally, since ChemVA is tailored to support the visual comparison of different projections, it also offers the Difference view. This view was specifically designed to address that task, which is stated in requirements R2 and R3.

Hexagonal View

The Hexagonal view (Figure 2) provides the users with an overview of one particular 2D projection of the dataset. In order to avoid overplotting when dealing with large datasets, the chemical compounds are aggregated into hexagonal bins. We opted for the hexagonal binning approach, first described by Carr et al. [7], as it offers significant advantages for efficient data aggregation in comparison to other approaches. This is related to the low perimeter-to-area ratio of the hexagonal shape, which reduces the sampling bias. The ideal shape in this respect is a circle, but circles cannot cover the plane without gaps. Of the regular polygons that can tile the plane, the hexagon is the closest to a circle, so it allows such a division to be constructed efficiently.

The Hexagonal view can be applied to any of the vector-based molecular representations described in Section 3.1. Each hexagon has an assigned opacity, which corresponds to the number of compounds aggregated within the hexagon. A higher opacity corresponds to a larger number of compounds inside the hexagon. In this way, the Hexagonal view depicts the distribution of the compounds yielded by the DR, so users can recognize areas with a high density of compounds. The color of a hexagon encodes the prevailing trend among its compounds for a selected feature, which is by default their bioactivity but can be switched to other molecular properties, including the trustworthiness of the projection (requirement R3).

Fig. 2: The Hexagonal view shows the density of the distribution of compounds in the 2D projection, which is encoded using opacity, and the prevalence of the bioactivity of the compounds, which is encoded using color. The size of the hexagons can be adjusted.

The granularity of the Hexagonal view can be changed using a slider, which enables a more detailed view of the distribution of the compounds. In order to preserve readability across the range of granularity levels, the opacity is increased linearly as the hexagon size decreases. Additionally, the Hexagonal view enables the user to select several hexagons at once. The compounds within the selected hexagons are then filtered through the other linked views (Detail view, 3D view, and Table view), where they can be explored in more detail.
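The aggregation behind the Hexagonal view can be sketched as follows. The code assigns each projected point to a hexagon of a pointy-top grid using axial coordinates with cube rounding, and then derives a count-based opacity and a majority label per hexagon. It is a minimal sketch under stated assumptions, not the ChemVA implementation: the linear normalisation of opacity by the largest bin, the majority vote used as the "prevailing trend", and all function names are hypothetical.

```python
from collections import Counter, defaultdict
import math


def _cube_round(q, r):
    """Round fractional axial coordinates to the nearest hexagon."""
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Re-derive the coordinate with the largest rounding error.
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return int(rq), int(rr)


def hex_cell(x, y, size):
    """Axial (q, r) of the pointy-top hexagon with circumradius `size` containing (x, y)."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    return _cube_round(q, r)


def hexbin(points, labels, size):
    """Aggregate labeled 2D points into hexagons with opacity and majority label."""
    cells = defaultdict(list)
    for (x, y), label in zip(points, labels):
        cells[hex_cell(x, y, size)].append(label)
    if not cells:
        return {}
    max_count = max(len(members) for members in cells.values())
    return {
        cell: {
            "count": len(members),
            "opacity": len(members) / max_count,  # assumed linear mapping
            "majority": Counter(members).most_common(1)[0][0],
        }
        for cell, members in cells.items()
    }


points = [(0.0, 0.0), (0.1, 0.1), (-0.1, 0.05), (1.732, 0.0)]
labels = ["active", "active", "inactive", "inactive"]
bins = hexbin(points, labels, size=1.0)
```

Changing `size` corresponds to moving the granularity slider: smaller hexagons split dense regions into more bins, and each bin then holds fewer compounds.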

Detail View

Upon selecting a subset of compounds in the Hexagonal view, the user can explore the selected data in the Detail view, depicted in Figure 3. In this view, the compounds are represented using a standard scatter plot, enhanced by a subtle overlay of the same hexagonal grid as in the Hexagonal view, which helps users keep the correspondence between the zoom level in these views. This is important because the Detail view displays only the selected hexagons zoomed in after the selection. To further enhance the link between the Hexagonal and Detail views, the corresponding hexagons are highlighted when the user hovers over them in any of these views. In order to perform a selection of individual compounds in this view, a lasso-shaped selector is supported. The selected compounds are then displayed in the 3D view and also highlighted in the Table view (Section 5.1.2).

As in the case of the Hexagonal view, the Detail view can be set to any of the vector-based molecular representations described in Section 3.1. In addition, all properties and features described in Section 3 can be color-encoded on the points representing each compound. These features are selected from a drop-down menu, and their color encodings are chosen according to their type, i.e., quantitative, such as molecular weight, or categorical, such as bioactivity towards a target protein. Another quantitative property that can be used for color encoding corresponds to the correlation scores, which encode the trustworthiness of the DR projection of the compound (requirement R3). These scores were computed using Pearson and Kendall correlation, whose calculation is explained in the Supplementary Material.

Fig. 3: The Detail view displays individual compounds as dots in a selected projection (ECFP fingerprints in this case). The color can be mapped to one of the features (activity towards the Serotonin 1a receptor in this case).

Fig. 4: The Difference view showing the decomposition of the local neighborhoods from one projection to another one. Here, the reference projection A corresponds to ECFP fingerprints and its selected hexagons are highlighted in gray. Projection B corresponds to Daylight fingerprints (inner hexagons in black). The opacity of hexagons in A encodes the value of the metric used for the trustworthiness of projection (see Section 6).

Difference View

In order to support the task of comparing the outputs of different 2D projections, as stated in our requirement R2, we propose a novel view called Difference view, which combines and contrasts two selected 2D Hexagonal views, A and B. Initially it displays a hexagonal layout similar to that presented in the Hexagonal view, where the opacity of each hexagon encodes the computed correlation score of the trustworthiness of the projections A and B under study (requirement R3).

The Difference view adopts the hexagonal layout from one of the projections to be compared, which we denote as the reference projection. When we choose projection A as the reference and perform a selection operation in the Hexagonal view of A, we search for the positions of the compounds falling into this selection in projection B. These positions are encoded as inner smaller hexagons inscribed into the original hexagonal grid of the Difference view, as depicted in Figure 4. In other words, this view shows where the compounds from the selected hexagons in projection A are located in projection B. The size of inner hexagons corresponds to the number of compounds falling into the same hexagonal bin.

The primary purpose of this graph is to help illustrate which regions of the dataset preserve their neighborhoods from one vector-based molecular representation to another one when undergoing a DR procedure. In this way, the user can rapidly compare two different projections and assess the trustworthiness of the neighborhoods generated by the DR techniques. If the level of fragmentation from one projection to another is high, i.e., a segregation of the content of one hexagon into many small, fairly separated and scattered hexagons is observed, it can be inferred that these molecules behave differently according to the chosen molecular representation, thus they might be taken into consideration cautiously and inspected in more detail.

Fig. 5: The Table view. Each row corresponds to one compound and its selected features. The right side panel shows the statistical overview of the distribution of values across the dataset.
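The lookup behind the Difference view can be illustrated with a minimal sketch. It assumes pointy-top hexagons addressed by axial coordinates and a shared hexagon size for both projections; the function names `hex_bin` and `difference_counts` and the dictionary layout are hypothetical and are not taken from the ChemVA implementation.

```python
import math
from collections import Counter


def hex_bin(x, y, size):
    """Return the axial (q, r) index of the pointy-top hexagon containing (x, y)."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 * y / 3) / size
    # Cube rounding: round all three cube coordinates, then repair the one
    # with the largest rounding error so that they still sum to zero.
    cx, cz = q, r
    cy = -cx - cz
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return (rx, rz)


def difference_counts(proj_a, proj_b, selected_hexes_a, size):
    """Count, per hexagon of projection B, the compounds selected in projection A.

    proj_a and proj_b map a compound id to its (x, y) position (assumed layout).
    The returned counts would drive the size of the inner hexagons.
    """
    selected = [
        cid for cid, (x, y) in proj_a.items()
        if hex_bin(x, y, size) in selected_hexes_a
    ]
    return Counter(hex_bin(*proj_b[cid], size) for cid in selected)


# Two compounds share a hexagon in A but land in different hexagons in B.
proj_a = {"c1": (0.0, 0.0), "c2": (0.1, 0.1), "c3": (math.sqrt(3), 0.0)}
proj_b = {"c1": (0.0, 0.0), "c2": (math.sqrt(3), 0.0), "c3": (0.0, 0.0)}
counts = difference_counts(proj_a, proj_b, {(0, 0)}, size=1.0)
```

A selection that spreads over many hexagons of B with small counts corresponds to the high fragmentation discussed above.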
Besides the vector-based molecular representations used in the DR projections and displayed in our 2D plot views, there are several other molecular features related to drug-likeness that should be taken into consideration when analyzing the compounds, as stated in requirement R4. ChemVA enables the users to explore such features, listed in Section 3.2, by means of a Table view which offers advanced interaction options. We adopted a well-established tool, published by Gratzl et al. [15], and its extension [13]. As these tools fit our needs well, we incorporated them into ChemVA. Further details about the broad range of interaction possibilities can be found in the original papers.

In addition to the list of compounds, the Table view provides the users with graphical elements in the form of juxtaposed bar charts and box plots in the right side panel. By default, the compounds in the table are logically grouped by their membership to hexagons in the Hexagonal view. These groups can be either expanded or compressed. When compressed, the table displays the box plots of the distribution of the values for each feature in the hexagon, as shown in Figure 5.

This view is interactively linked with the other visual components of ChemVA. When performing a selection in the 2D plot views, the corresponding compounds are automatically highlighted in the Table view. Conversely, when the user selects compounds in the Table view, the corresponding compounds are highlighted in the Detail view, displayed in the 3D view, and the corresponding hexagons are highlighted in the Hexagonal view.
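The per-hexagon box plots shown for compressed groups can be sketched as a five-number summary computed per group. This is only an illustration: the row layout, the `hexagon` key, and the function names are assumptions, and the Table view component adopted by ChemVA may compute its summaries differently.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Five-number summary of one feature within one hexagon group."""
    if len(values) < 2:
        # quantiles() needs at least two points; a singleton is its own summary.
        v = values[0]
        return {"min": v, "q1": v, "median": v, "q3": v, "max": v}
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}


def summarize_groups(rows, feature):
    """Group table rows by their hexagon and summarize one feature per group."""
    groups = {}
    for row in rows:
        groups.setdefault(row["hexagon"], []).append(row[feature])
    return {hexagon: boxplot_summary(vals) for hexagon, vals in groups.items()}


rows = [{"hexagon": "A", "mw": float(v)} for v in (1, 2, 3, 4, 5)]
rows.append({"hexagon": "B", "mw": 7.0})
summary = summarize_groups(rows, "mw")
```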

The geometric similarity between selected compounds can be explored using our 3D view. The visual representation of atoms and bonds, as well as the coloring of the compounds, are based on standard representations used in molecular chemistry, with specific colors reserved for each chemical element. It also offers standard interactions, such as panning, zooming, and rotation of the displayed structures.

Compound similarity can be better perceived when the compounds are structurally aligned in the view. To serve this purpose, we use a structural alignment functionality provided by the OpenBabel tool [51]. Further details about this functionality are provided in the Supplementary Material. Once molecules are aligned, the user should be able to easily identify their common parts, i.e., the subsets of atoms and bonds that are present in most of the selected compounds, as stated in requirement R5. In addition, we incorporate opacity modulation with respect to the frequency of occurrence of atoms and bonds. In other words, the opacity of atoms and bonds is calculated based on the number of atoms of the same type that are aligned to the same spatial position. As a consequence, common substructures are rendered more opaque. An example of such alignment can be seen in Figure 6.

Fig. 6: 3D view with 66 aligned molecules. Common atoms and bonds are rendered more opaque than the less common components.

Depending on the number of selected compounds and their similarities, the 3D visual representation can become fairly cluttered (Figure 6). To reduce this visual clutter, the 3D view offers hiding and showing hydrogen atoms and bonds on demand, changing the size of atoms and bonds, as can be seen in Figure 7(a), and also changing the opacity of the whole structure. Figure 7(b) shows an example of using these functionalities on a cluttered subset of compounds.

Fig. 7: Reducing the visual clutter visible in Figure 6 by (a) filtering out the non-common parts of compounds and by (b) changing the opacity of the whole structure.

In some cases, the user wants to focus only on the common parts of the structure. Therefore, we have also included functionality for filtering out atoms and bonds. Figure 7(a) shows how the visibility of the common substructure among 66 compounds is improved by using this feature. Finally, as the user might also want to analyze the non-common parts of the compounds, the 3D view provides an option to invert the opacity, so that the common part of the structure becomes more transparent than the rest.
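The frequency-based opacity of the 3D view can be sketched as follows. The text does not state how "the same spatial position" is decided or how counts map to opacity, so the grid snapping (`grid`), the linear mapping, and the `min_opacity` floor are all assumptions made for illustration, as is the function name.

```python
from collections import Counter


def atom_opacities(molecules, grid=0.5, min_opacity=0.1, invert=False):
    """Map each (element, cell) to an opacity based on how many molecules share it.

    molecules: list of aligned molecules, each a list of (element, x, y, z).
    grid: cell edge used to decide that two atoms occupy the same position
          (an assumed tolerance, not a value from the paper).
    invert: if True, common atoms become more transparent than rare ones.
    """
    n_molecules = len(molecules)
    counts = Counter()
    for atoms in molecules:
        # A set ensures each molecule contributes at most once per cell.
        cells = {
            (element, round(x / grid), round(y / grid), round(z / grid))
            for element, x, y, z in atoms
        }
        counts.update(cells)
    opacities = {}
    for cell, count in counts.items():
        frequency = count / n_molecules
        if invert:
            frequency = 1.0 - frequency
        opacities[cell] = min_opacity + (1.0 - min_opacity) * frequency
    return opacities


mols = [
    [("C", 0.0, 0.0, 0.0), ("O", 1.0, 0.0, 0.0)],
    [("C", 0.05, 0.0, 0.0)],
]
normal = atom_opacities(mols)
inverted = atom_opacities(mols, invert=True)
```

In this toy input the carbon shared by both molecules is fully opaque, while the oxygen present in only one of them is rendered at an intermediate opacity; the `invert` flag mirrors the invert opacity option described above.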
t-Distributed Stochastic Neighbor Embedding (t-SNE) [77] is considered to be the state-of-the-art technique for dimensionality reduction, particularly for visualizing very high dimensional data. t-SNE maps neighboring data points to a lower-dimensional space, aiming to preserve the neighborhood relationships (locality). It applies the Student's t-distribution on the low-dimensional space aiming to tackle the crowding problem, which happens when many points clump together in the low-dimensional projection. These traits make t-SNE a suitable tool for visually exploring molecular spaces, where local neighborhoods are of utmost interest for analyzing compound similarity.

However, t-SNE has a major limitation, which is that it is a non-parametric technique: it yields no explicit mapping that could be reused to project new data points. After testing Parametric t-SNE [76] and consistently failing to attain good quality projections, we developed our own parametric dimensionality reduction model. This model was constructed based on a feed-forward neural network, which was trained to learn the 2-dimensional coordinates of a previously computed t-SNE projection of the data. We trained four of these parametric models for each reference dataset used in ChemVA, one for each molecular representation. The performance of our parametric models was evaluated by measuring the Pearson correlation coefficient between the predicted coordinates and the actual coordinates, i.e., those obtained by means of the t-SNE projection. We provide details on the parameterization used for t-SNE and the parametric models in the Supplementary Material. The t-SNE projections were computed using tools from Scikit-Learn [52], whereas the parametric models were built using Keras and TensorFlow [8].

ChemVA provides an option to add new compounds to the dataset being studied in order to explore their features and compare them with those of other compounds. By means of this functionality, the expert can assess the potential of this newly added compound prior to extensive wet lab testing. The support for this important feature fulfills requirement R6.

A new compound is loaded to ChemVA by specifying its SMILES formula [83]. A back-end service of ChemVA calculates the molecular features listed in Section 3.2. After computing such molecular representations and features, ChemVA uses the parametric models (described in Section 5.2) to obtain the 2D coordinates of the compound in each projection. Finally, the newly added compound is displayed in all the supported views with yellow color to ease its identification. This color encoding prevails throughout the views, as can be seen in Figure 8.

EVALUATION AND CASE STUDIES

The evaluation of ChemVA consisted of two main stages. The first stage was conducted by one domain expert involved in the functional design of the tool (see Section 4), and consisted of two case studies. The second stage was conducted by one visualization expert and two domain experts, who were not involved at any point during the design or development of ChemVA. This stage consisted of a session where the different views were qualitatively evaluated and general feedback about the usability and functionality of the tool was gathered afterwards.

During the first stage of the evaluation, several additional functional requirements for the tool itself were identified by the domain expert, such as loading and storing selections, downloading the displayed 3D conformations, and highlighting the compounds in the Table view after clicking on them in the 3D view. These additional functional requirements were addressed in time for the case study evaluation.

ChemVA was tested on two case studies, which were built upon two different datasets retrieved from ChEMBL [14]. The first dataset was composed by merging ligands binding to the Serotonin 1a receptor and Dopamine D2 receptor, whereas the second dataset comprised ligands to the P-glycoprotein 1. We assigned a categorical label to each compound according to its experimentally measured IC50 bioactivity value towards the target(s) under study. Compounds showing IC50 values below 10 nM were labeled as Active; compounds between 10 and 1000 nM were labeled as Moderately Active, and those over 1000 nM were labeled as Inactive. The serotonin-dopamine dataset comprises 118 compounds, whereas the P-glycoprotein dataset contains 893 compounds.
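The labeling rule above amounts to a simple threshold function. In this sketch the boundary values of exactly 10 nM and exactly 1000 nM are assigned to Moderately Active, which is an assumption; the text only says "between 10 and 1000 nM". The function name is hypothetical.

```python
def activity_label(ic50_nm):
    """Assign a bioactivity class from an IC50 value given in nM."""
    if ic50_nm < 10:
        return "Active"
    if ic50_nm <= 1000:  # assumed inclusive at both boundaries
        return "Moderately Active"
    return "Inactive"


labels = [activity_label(v) for v in (2.5, 10, 350, 1000, 4200)]
```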

This case study is based on the serotonin-dopamine dataset, a set of compounds that have data about antagonistic activity against the serotonin 5HT1A and dopamine D2 receptors. These neurotransmitter receptors are targets of many psychoactive pharmaceuticals, e.g., antidepressants, antipsychotics, and anxiolytics. In the field of psychopharmacology, small molecules with different activity towards several receptors are used to alleviate side-effects [48, 67].

The goal of the study was to find chemical determinants of biological activity towards serotonin and dopamine receptors. According to the domain expert, similar compounds could be easily found in the Detail view in all four projections, based on their proximity. After identifying groups of potentially similar compounds, the domain expert searched for groups of compounds that were active towards both receptors and that had desirable pharmacological properties, such as following Lipinski's RO5 and having a high QED score. This exploratory search was done on the Hexagonal view, by adjusting the size of the hexagons to match the observed groups of compounds in the different projections.

Afterwards, the domain expert used the Kendall and Pearson correlation color encoding in order to observe whether the projections were trustworthy, and thus whether the hexagons being considered would effectively group similar compounds. A set of hexagons was preselected and the Table view was used, in which the domain expert analyzed the distributions of all drug-likeness features of the selected compounds using the summary information displayed in the column headers. As a result, the expert identified a hexagon that grouped compounds active towards both receptors with desirable drug-likeness features.

The domain expert selected those compounds and performed an alignment in the 3D view, finding that their structures were very similar. These compounds were also contrasted to inactive compounds within the same hexagon using the 3D view.
This allowed the domain expert to gain insight into the relevant substructures of a potential new drug candidate. At this point, the domain expert highlighted how easy it was to find and visualize the common 3D structure of a group of compounds.

Finally, the structures of active compounds towards both receptors were downloaded from ChemVA. Five new compounds were created and loaded to ChemVA, as shown in Figure 8. They were compared to the downloaded structures using molecular docking. This process is described in the Supplementary Material. The domain expert found out that the new structures effectively bound to a known antagonist binding pocket and three out of the five newly created structures showed slightly better binding energies than structures from the dataset used for their design, which were already highly active. This shows that ChemVA could be effectively used for drug design purposes, leading to newly designed compounds that have the desired qualities and bioactivity profiles.

This case study was based on the P-glycoprotein dataset, consisting of small molecules with inhibitory activity against human P-glycoprotein. This protein is exclusively overexpressed in cells of many cancer types, causing multidrug resistance of these cells and thus impacting on the performance of chemotherapy treatments. It also affects the effectivity of many drugs by altering their ADME-Tox properties [68]. P-glycoprotein has proven to bind to many structurally dissimilar substrates, thus an interesting use case for our tool was to find structural determinants of a good P-glycoprotein inhibitor and to compare known active ligands towards this target.

According to the domain expert, the P-glycoprotein dataset contains a vast number of diverse compounds that interact with the target protein. The goal of this case study was to find chemical determinants of compounds with a very high logP value, which are scarce in the dataset. In order to achieve this goal, the domain expert used the Table view to filter compounds with logP values higher than 6.75 and then sorted them based on their logP value and their correlation scores, which measure the trustworthiness of the molecular representation currently being projected onto the 2D plot views.

The domain expert used the filtering options provided by the Table view in order to find a set of compounds complying with all the requested criteria, i.e., high logP value, activity towards the P-glycoprotein and following Lipinski's RO5. Afterwards, chemically similar compounds were selected from this subset and further analyzed using the Hexagonal view. Hexagons containing active compounds were selected and thoroughly explored using the Detail view, in which logP was used for color encoding. Finally, the domain expert used the 3D view to find common substructures among the selected compounds and successfully identified chemical determinants of bioactivity. This was achieved by analyzing structural patterns related to lipophilicity, indicated by high logP values. This workflow is depicted in Figure 9.


Fig. 8: The newly designed compounds were projected by ChemVA near the selected dataset compounds in the Daylight projection (top), occupying the same hexagons, and slightly further from them in the ECFP Fingerprints projection (bottom). The Difference view on the bottom right illustrates the redistribution of the selected regions. a) Hexagonal view, b, d, f) Detail view, c) 3D view, e) Difference view, g) Table view.

In the second stage we introduced ChemVA to two domain experts from the Loschmidt Laboratories at Masaryk University who evaluated our tool in terms of functionality and user experience. We also introduced our newly proposed Difference view to one visualization expert (see Acknowledgments), who has vast experience in DR techniques. None of the experts were involved at any stage during the design and development of ChemVA.

First, we conducted a brief introduction to the layout and the proposed visualization methods. This stage took approximately thirty minutes for each of the evaluations with the domain experts, after which they were able to use the tool without any further assistance. While both of the domain experts highlighted the intuitiveness of the Hexagonal and 3D views, they also pointed out the need for appropriate documentation describing other views, such as our novel Difference view, in order to support usability. According to the domain experts, ChemVA makes it easy to compare numerous properties, which is an uncommon but very useful feature in tools for virtual screening. Both of the domain experts highlighted the usefulness of having different projections, and the ability to focus on a specific source of data using the correlation scores and comparing them by means of the Difference view. According to the visualization expert, the Difference view does a good job by enabling the user to visualize how the two mappings, hence the two feature spaces, are correlated in terms of grouping the same set of molecules together. Both domain experts enumerated several examples from their own ongoing research activities which they claimed could be enhanced by ChemVA in terms of time and effort: for instance, a drug design task aimed at identifying on- and off-target drug responses, or the analysis of 4300 FDA-approved drugs for usability in the treatment of COVID-19, which is currently carried out using cumbersome scripting and manual feature selection.

RESULTS AND DISCUSSION

In this section we summarize some final remarks and feedback provided by the domain experts who conducted both stages of the evaluation of ChemVA. We discuss the results of both case studies, as well as strengths and limitations of our tool. As shown in Section 6.1, the goals for both case studies were effectively fulfilled using ChemVA, which suggests that the tool has the potential to be widely adopted and used by medicinal chemists. This claim is also supported by the results of the qualitative evaluation performed by external experts, described in Section 6.2. Our tool allowed an intuitive comparison of sets of chemically similar compounds. In contrast to other tools, which only allow sorting and filtering compounds by a set of properties, ChemVA enabled a comprehensive analysis of the data by taking into consideration multiple molecular representations and levels of granularity. In the case studies, this was aided by specific features of our tool, such as grouping compounds in hexagons and allowing their size to be adjusted.

In the first case study, the Detail view assisted the domain expert in distinguishing similar compounds based on proximity in all four projections (R1 & R2). Furthermore, the colors and shapes presented in the Hexagonal and Detail views allowed the domain expert to find a specific hexagon very quickly (R1).

The task of analyzing compound similarity could be thoroughly tested by means of different projections provided by ChemVA (R2). Furthermore, during the two case studies it was possible to examine these projections in terms of their trustworthiness, using the provided correlation scores (R3), and to compare them by means of the Difference view (R2).
These two features provided by ChemVA aided the domain expert in the first case study in order to assess whether the projections were trustworthy and thus hexagons were effectively grouping similar compounds.

The evaluation also showed the usefulness of the 3D view to discover and analyze similar 3D structures among the selected compounds (R5). Examining the chemical determinants for bioactivity in both case studies was straightforward by means of the molecular alignment function provided by ChemVA, according to the domain expert. Every subset of molecules could be quickly aligned on demand, in contrast to other chemical software tools, where this task constitutes a strenuous process and it is often restricted to molecules above a certain threshold of structural similarity [16]. Moreover, structural patterns could be further explored using the opacity filter and invert opacity option, which allow highlighting the different parts of the compounds.

Fig. 9: Illustration of the exploratory process in the P-glycoprotein dataset, displaying only a subset of interesting molecules in the Detail view and selecting those with high logP values.

The coordination of the plots with the Table view was also found useful by the domain experts (R4), offering fast access to summary information in its column headers. The Table view proved to be effective for exploring a large dataset, such as P-glycoprotein, enabling the search for rare feature values through filtering and hierarchical grouping during the second case study. Other functionalities, such as showing SMILES formulas for copying, the option for exporting 3D structures, or searching compounds by SMILES formula, made ChemVA a flexible tool, easily interoperable with other cheminformatics tools.

One limitation identified during the first case study is that new compounds were not always projected near to structurally similar compounds. This functionality performed consistently better on some molecular representations than the others, as can be seen in Figure 8. This might be related to the fact that the serotonin-dopamine dataset used in the experiment is small (118 compounds), thus the neural-based parametric model trained for projecting new compounds is not able to generalize well for every molecular representation. Another limitation we identified is that not all compounds have available 3D structures, which may limit the usage of the 3D view with some datasets.

ChemVA proved to be useful both as an exploration and visualization tool, and it also helped the domain expert with the design and evaluation of new compounds (R6) during the first case study. As discussed in Section 6.2, ChemVA was also praised as a useful and innovative tool by the domain experts during the external evaluation.

We also identified potential directions for future extensions of the tool, accounting both for user experience and functionality. The external domain experts suggested expanding the current functionality of the tool by enabling the user to load and customize their own data. One of the domain experts also suggested to enable adding new compounds by drawing them in a sketcher.
Other possible directions include providing options for substructure search and enhancing navigation by providing breadcrumbs and saving analytical snapshots.

Lastly, it is worth noting that although our tool is domain-specific, ChemVA has been developed by following techniques and strategies that could be applied to other types of data, including non-chemical data. For instance, our newly proposed Difference view could assist in the process of visually analyzing the preservation of clusters and neighborhoods for most clustering methods. Also, the analysis of trustworthiness of DR projections could be applicable to any type of data undergoing a DR procedure. The idea behind the 3D view, which allows for overlapping common molecular substructures, could be applied to other graph-based data or even text.

CONCLUSIONS

In this paper, we presented ChemVA, a novel tool for visual analysis of chemical compounds, which is especially focused on evaluating molecular similarity for virtual screening. Our tool proposes a set of coordinated views that support visual exploration and comparison of 2D projections, which are obtained from applying dimensionality reduction on different (and ideally complementary) molecular representations. Our case studies, conducted by a medicinal chemist, confirmed that ChemVA addresses the functional requirements and provided appropriate support for his analysis on two different datasets. The qualitative evaluation conducted by three other external experts revealed the potential of our tool for its adoption in the domain, as well as the usefulness of our newly proposed Difference view. The evaluation process allowed us to identify potential extensions to ChemVA, which will steer our future work on the tool.

ACKNOWLEDGMENTS

The presented work has been supported by DFG-GACR research project no. GC18-18647J, by CONICET research grant PIP 112-2017-0100829, by UNS research grants PGI 24/N042 and PGI 24/N048, and by ANPCyT (Argentina) research grant PICT-2017-1246. We thank Sérgio M. Marques and Ondřej Vávra from the Loschmidt Laboratories, and Michaël Aupetit from the Qatar Computing Research Institute (QCRI) for evaluating ChemVA and for their insightful comments and suggestions on the features of the tool.

SUPPLEMENTARY MATERIAL

In this Supplementary Material, we provide the readers with detailed information about data preprocessing performed for ChemVA and computational times, as well as the results of the molecular docking performed for the case study. Although this document contains important information for reproducing our approach and the experiments conducted during the evaluation of the tool, its contents are not crucial for understanding the visualization concepts used in ChemVA, which constitute the core of our proposal.

DATA PREPROCESSING

In this section we describe the computation of the vector-based molecular representations and the equations used for computing the correlation scores. We also provide details on the parameterization of our DR method, parametric models and molecular alignment back-end service. The information presented in this section is relevant for reproducibility purposes.

Vector-based Molecular Representations

In order to provide domain experts with diverse information about the ligands in our datasets, we computed four different molecular representations, which we then used as sources of data for our 2-dimensional projections. For each dataset, we computed radius 2 ECFPs and Daylight fingerprints, both 1024 bits long, using the Chem and AllChem packages of RDKit [36]. We also computed 0D, 1D, and 2D molecular descriptors using Mordred [46], obtaining a total of 1613 descriptors. We removed all descriptors that had more than 10% NaN values and replaced all remaining NaN with a fixed value (maximum value per descriptor), which yielded 1454 molecular descriptors. Finally, we used a pretrained Mol2Vec model [23], which computed 300-dimensional embeddings based on the SMILES formulas of the compounds in the datasets. The computation of ECFPs, Daylight fingerprints and molecular descriptors took less than an hour, and it was performed in a 32-core cluster with 12GB memory. This preprocessing stage was performed offline, which means that it was done once and then the results were stored for future use.
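The descriptor cleaning step described above (drop descriptors with more than 10% NaN values, then fill the remaining NaN with the per-descriptor maximum) can be sketched in plain Python. The column-oriented dictionary layout and the function name are assumptions made for illustration; the original pipeline operated on Mordred output and its exact code is not given in the text.

```python
import math


def clean_descriptors(table, max_nan_fraction=0.10):
    """Drop sparse descriptors and impute the rest with the column maximum.

    table: dict mapping descriptor name -> list of floats, one per compound
           (assumed layout; every column is expected to be non-empty).
    """
    cleaned = {}
    for name, column in table.items():
        n_missing = sum(1 for v in column if math.isnan(v))
        if n_missing / len(column) > max_nan_fraction:
            continue  # more than 10% missing: discard the descriptor
        fill = max(v for v in column if not math.isnan(v))
        cleaned[name] = [fill if math.isnan(v) else v for v in column]
    return cleaned


nan = float("nan")
table = {
    "complete": [1.0] * 10,
    "one_missing": [float(i) for i in range(9)] + [nan],  # exactly 10% NaN: kept
    "two_missing": [nan, nan] + [1.0] * 8,                # 20% NaN: dropped
}
cleaned = clean_descriptors(table)
```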

Features for Drug-Likeness

The features related to drug-likeness are displayed in the Table view and can be color encoded onto the Detail and Hexagonal views as well. The values for such features were obtained using a publicly available REST API by ChEMBL [14]. The collected values did not undergo any further preprocessing steps.

t-SNE Projections and Parametric Models

We computed a t-SNE projection for each of the vector-based molecular representations, for both of the datasets used in the case study. The t-SNE projections were computed using the manifold module from Scikit-Learn [52], and the parameters used were selected by grid search. The tested perplexity values varied from 5 to 10% of the total amount of compounds in the dataset. The same parameterization on each dataset was used to fit t-SNE for each of the vector-based molecular representations. Fitting each t-SNE projection took around 20 minutes on a 32-core cluster with 12GB memory. A summary of the parameterization used for t-SNE is provided in Table 1.

Table 1: Summary of the parameters used to fit the t-SNE projections of each dataset.

Dataset       P-glycoprotein   Serotonin-Dopamine
Perplexity    45               5

After fitting each t-SNE projection, we trained one parametric model to learn each of such projections. The parametric models were built using Keras and TensorFlow [8]. The parameterization of these models is summarized in Tables 2 and 3. The computational time invested in training these models was approximately 30 minutes per model, on a 32-core cluster with 12GB memory. Both the fitting of the t-SNE projections and the training of the parametric models were offline tasks.

Correlation Scores for Measuring the Trustworthiness of Projections

t-SNE applies two different distributions to map data points to a lower-dimensional space in an effort to preserve neighborhoods according to a specific optimization criterion. As a result, pairwise distances among the compounds in the low-dimensional space could be distorted with respect to distances in the high-dimensional space, and those differences could mislead the domain expert during the virtual screening process. For this reason, as a means to assess the trustworthiness of the projections, we computed two different correlation scores for each of the compounds.

The first score, namely r, was computed by calculating the cosine distance of each compound k to the rest of the compounds in the high-dimensional space and then measuring the Pearson correlation (see Equation 1) among these distances and the ones in the low-dimensional space. Pearson correlation has been previously used by Strickert et al. in MSR [69], which has also been applied to chemical compounds. The second score, namely τ, was computed by generating two ranks for each compound k, one in the origin space and another one in the mapped space, which sort all the compounds according to the cosine distance to k. Afterwards, the score was computed by comparing both ranks using the Kendall rank correlation (see Equation 2).

Both correlation scores were computed for each compound individually, and once for each of the four vector-based molecular representations used in ChemVA, which yields four different views on which the trustworthiness of each projection can be assessed. In addition, the Difference view uses the Kendall rank correlation scores as the default encoding. These scores were computed using tools in the Python Data Science suite SciPy [28] and Scikit-Learn [52].
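The two per-compound scores can be illustrated with a small pure-Python sketch following Equations 1 and 2. The original scores were computed with SciPy and Scikit-Learn; the functions below are illustrative re-implementations with hypothetical names. The use of Euclidean distance in the 2D space for `trust_scores` is an assumption, since the text does not fully specify the low-dimensional metric, and input vectors are assumed to be non-zero.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation r, written as in Equation 1."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


def kendall_tau(xs, ys):
    """Kendall rank correlation, written as in Equation 2 (ties count as neither)."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def trust_scores(high, low, k):
    """Return (r, tau) for compound k given high- and low-dimensional coordinates."""
    others = [j for j in range(len(high)) if j != k]
    d_high = [cosine_distance(high[k], high[j]) for j in others]
    d_low = [math.dist(low[k], low[j]) for j in others]  # assumed Euclidean in 2D
    return pearson(d_high, d_low), kendall_tau(d_high, d_low)


high = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.1)]
low = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (0.1, 0.0)]
r, tau = trust_scores(high, low, k=0)
```

In this toy example the ordering of neighbors of compound 0 is identical in both spaces, so the rank-based score is maximal even though the distances themselves are not linearly related.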
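$$ r = \frac{n \sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n \sum x^2 - \left(\sum x\right)^2\right]\left[n \sum y^2 - \left(\sum y\right)^2\right]}} \tag{1} $$

$$ \tau = \frac{n_{\text{concordants}} - n_{\text{discordants}}}{n(n-1)/2} \tag{2} $$

Molecular Alignment and 3D Conformations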

ChemVA supports a 3D view, which allows the user to conduct a detailed analysis of the molecular structures under study, and to visually explore common geometrical patterns among compounds. The 3D conformations of compounds were retrieved from PubChem [30] and computed using OpenBabel [51]. The computed 3D conformations were obtained using Ghemical force field energy minimization, force field cleanup (500 cycles) and slow rotor search. The computation of these 3D conformations was an offline task. It took approximately between 2 and 20 minutes for each compound, depending on the complexity of its molecular structure, and it was performed on a 32-core cluster with 12GB memory.

The molecular alignment function involves two steps. The first step consists of finding the maximum common substructure (MCS) among the selected compounds, which is achieved using RDKit [36]. The second step consists of aligning all the selected molecules with respect to the found MCS, and it is done using the obfit function provided by OpenBabel [51]. The molecular alignment task is performed on demand: it is an online task, which means that it is performed in our back-end servers upon the selection of a set of compounds in the Detail view. The response time of this web service is approximately one second, tested on random selections of 5 to 30 compounds.

Adding New Compounds

The addition of new compounds to the Detail view is an online task, performed on demand in our back-end servers. The parametric models are previously loaded, which significantly reduces the response time: approximately four seconds per compound.

Table 2: Summary of the parameters used to build and train the four parametric models for the P-glycoprotein dataset.

Parameters ECFPs Daylight fps Molecular descriptors Molecular embeddings

Table 3: Summary of the parameters used to build and train the four parametric models for the serotonin-dopamine dataset.

Parameters ECFPs Daylight fps Molecular descriptors Molecular embeddings

CASE STUDY: MOLECULAR DOCKING

This section provides details on the two datasets used in the case studies, as well as on the results of the molecular docking study performed by the domain expert in the context of our case study 1 (see Section 6 of the paper).

Dataset Selection and Preprocessing

The two datasets used in the case studies, namely serotonin-dopamine and P-glycoprotein, were built upon two different datasets retrieved from ChEMBL [14]. We performed a preliminary selection step of the SMILES formulas on each of the datasets, keeping only those compounds that could be loaded using RDKit [36]. Details on the number of compounds belonging to each class are provided in Tables 4 and 5. The compounds were labeled according to their experimentally measured IC50 bioactivity value towards the targets under study.

Table 4: Details on the composition per class of the P-glycoprotein dataset.

Class P-glycoprotein

Total 893

Table 5: Details on the composition per class of the serotonin-dopamine dataset.

Class Serotonin 1a Dopamine D2

Total 118 118

Preparation of Receptor Structures

The structure of the human 5HT1A receptor was constructed by homology modeling with the I-TASSER web server, using default settings [88]. The model was constructed using homologous 5HT receptors and monoamine receptors with resolved structures. Model 1 was selected for the docking study with a TM-score of 0.52 ± 0.

Preparation of Ligand Structures

Based on the three structures selected in ChemVA, five new ligand structures were designed using Avogadro [18]. Molecular docking aimed to compare binding energies and binding modes of the three dataset structures and those of the five designed structures. The molecular docking was done using Autodock Vina [75]. Autodock atom types and Gasteiger charges were added to the two receptors and the ligands using MGLTools [47, 63]. The docking grid was selected to be a 30 × 30 × 30 Å cube. For the 5HT1A receptor, the cube was centered on the bound ligand in the aligned structure of 5HT1B. For the dopamine D2 receptor, the cube was centered on the bound ligand in the receptor structure.

Results and Discussion

All of the seven structures successfully bound to the binding pockets of the two receptors, suggesting a plausible binding mode was found. Three of the five designed structures showed a lower binding energy, indicating a stronger binding. The binding energies of the best binding modes are shown in Table 6, and the SMILES formulas for each of the designed compounds are provided in Table 7.

In this comparison, designed compounds 2, 4, and 5 exhibited higher predicted affinities to both receptors. These predicted results do not necessarily mean higher inhibitory activity, given that such bioactivity can be influenced by other factors such as flexibility of the receptor proteins.

In addition, we provide supplementary files comprising the results from the I-TASSER web server and the input files and results from the Autodock Vina procedure.

Table 6: Predicted binding affinities towards the serotonin 5HT1A receptor (5HT1A) and dopamine D2 receptor (dopD2) of both selected dataset compounds and designed compounds, measured in kcal/mol. Rows in bold show the compounds exhibiting the highest predicted affinities to both receptors.

Compound               Affinity to 5HT1A   Affinity to dopD2
Compound 45            -8.0                -8.7
Compound 79            -8.0                -8.3
Compound 117           -9.9                -11.0
Designed compound 1    -8.4                -11.5
Designed compound 2    -10.5               -11.9
Designed compound 3    -9.9                -10.8
Designed compound 4    -10.3               -11.2
Designed compound 5    -10.2               -11.2

Table 7: SMILES formulas of the designed compounds. Rows in bold show the compounds exhibiting the highest predicted affinities to both receptors.

Designed Compound   SMILES formula
1                   Fc1ccc(cc1)c2cncc(CNCC3CCc4ccccc4C3)c2

REFERENCES

[1] R. Anandakrishnan, B. Aguilar, and A. V. Onufriev. H++ 3.0: automating pK prediction and the preparation of biomolecular structures for atomistic molecular modeling and simulations. Nucleic Acids Research, 40(W1):W537–W541, 2012.
[2] M. Aupetit. Visualizing distortions and recovering topology in continuous projection techniques. Neurocomputing, 70(7-9):1304–1330, 2007. doi: 10.1016/j.neucom.2006.11.018
[3] F. Barbosa and D. Horvath. Molecular similarity and property similarity. Current Topics in Medicinal Chemistry, 4(6):589–600, 2004. doi: 10.2174/1568026043451186
[4] H. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. Bhat, H. Weissig, I. Shindyalov, P. Bourne, P. Rose, A. Prlic, et al. Rcsb protein data bank: Structural biology views for basic and applied research. Nucleic Acids Res, 28:235–242, 2000.
[5] D. Borland, W. Wang, J. Zhang, J. Shrestha, and D. Gotz. Selection bias tracking and detailed subset comparison for high-dimensional data. IEEE Transactions on Visualization and Computer Graphics, 26(1):429–439, 2019. doi: 10.1109/TVCG.2019.2934209
[6] M. Bostock, V. Ogievetsky, and J. Heer. D3 data-driven documents. IEEE Transactions on Visualization and Computer Graphics, 17(12):2301–2309, 2011. doi: 10.1109/TVCG.2011.185
[7] D. B. Carr, R. J. Littlefield, W. Nicholson, and J. Littlefield. Scatterplot matrix techniques for large n. Journal of the American Statistical Association, 82(398):424–436, 1987. doi: 10.2307/2289444
[8] F. Chollet. Keras, 2015. https://github.com/fchollet/keras, online July 2020.
[9] M. A. Cox and T. F. Cox. Multidimensional scaling. In Handbook of Data Visualization, pp. 315–347. Springer, 2008. doi: 10.1007/978-3-540-33037-0_14
[10] R. Cutura, S. Holzer, M. Aupetit, and M. Sedlmair. VisCoDeR: A tool for visually comparing dimensionality reduction algorithms. In Esann

Annals of Eugenics, 7(2):179–188, 1936. doi: 10.1111/j.1469-1809.1936.tb02137.x
[13] K. Furmanová, S. Gratzl, H. Stitz, T. Zichner, M. Jarešová, A. Lex, and M. Streit. Taggle: Combining overview and details in tabular data visualizations. Information Visualization, 19(2):114–136, 2020. doi: 10.1177/1473871619878085
[14] A. Gaulton, A. Hersey, M. Nowotka, A. P. Bento, J. Chambers, D. Mendez, P. Mutowo, F. Atkinson, L. J. Bellis, E. Cibrián-Uhalte, et al. The ChEMBL database in 2017. Nucleic Acids Research, 45(D1):D945–D954, 2017. doi: 10.1093/nar/gkw1074
[15] S. Gratzl, A. Lex, N. Gehlenborg, H. Pfister, and M. Streit. LineUp: Visual analysis of multi-attribute rankings. IEEE Transactions on Visualization and Computer Graphics (InfoVis '13), 19(12):2277–2286, 2013. doi: 10.1109/tvcg.2013.173
[16] M. Gütlein, A. Karwath, and S. Kramer. CheS-Mapper 2.0 for visual validation of (Q)SAR models. Journal of Cheminformatics, 6(1), 2014. doi: 10.1186/s13321-014-0041-7
[17] R. M. Hanson, J. Prilusky, Z. Renjian, T. Nakane, and J. L. Sussman. JSmol and the next-generation web-based representation of 3d molecular structure as applied to proteopedia. Israel Journal of Chemistry, 53(3-4):207–216, 2013. doi: 10.1002/ijch.201300024
[18] M. D. Hanwell, D. E. Curtis, D. C. Lonie, T. Vandermeersch, E. Zurek, and G. R. Hutchison. Avogadro: an advanced semantic chemical editor, visualization, and analysis platform. Journal of Cheminformatics, 4(1):17, 2012.
[19] R. P. Hertzberg and A. J. Pope. High-throughput screening: new technology for the 21st century. Current Opinion in Chemical Biology, 4(4):445–451, 2000. doi: 10.1016/S1367-5931(00)00110-1
[20] T. Hoffmann and M. Gastreich. The next level in chemical space navigation: going far beyond enumerable compound libraries. Drug Discovery Today, 24(5):1148–1156, 2019. doi: 10.1016/j.drudis.2019.02.013
[21] W. Humphrey, A. Dalke, and K. Schulten. VMD: Visual molecular dynamics. Journal of Molecular Graphics, 14(1):33–38, 1996. doi: 10.1016/0263-7855(96)00018-5
[22] Y. G. Ichihara, M. Okabe, K. Iga, Y. Tanaka, K. Musha, and K. Ito. Color universal design: the selection of four easily distinguishable colors for all color vision types. In Color Imaging XIII: Processing, Hardcopy, and Applications, vol. 6807, p. 68070O. International Society for Optics and Photonics, 2008. doi: 10.1117/12.765420
[23] S. Jaeger, S. Fulle, and S. Turk. Mol2vec: Unsupervised machine learning approach with chemical intuition. Journal of Chemical Information and Modeling, 58(1):27–35, 2018. doi: 10.1021/acs.jcim.7b00616
[24] A. P. Janssen, S. H. Grimm, R. H. Wijdeven, E. B. Lenselink, J. Neefjes, C. A. van Boeckel, G. J. van Westen, and M. van der Stelt. Drug discovery maps, a machine learning model that visualizes and predicts kinome–inhibitor interaction landscapes. Journal of Chemical Information and Modeling
Concepts and applications of molecular similarity. Wiley, 1990. doi: 10.1002/jcc.540130415
[27] I. M. Johnstone and D. M. Titterington. Statistical challenges of high-dimensional data. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences
Nucleic Acids Research, 44(D1):D1202–D1213, 2016. doi: 10.1093/nar/gkv951
[31] T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480, 1990. doi: 10.1109/5.58325
[32] B. Kovalerchuk. Visual knowledge discovery and machine learning, vol. 144. Springer, 2018. doi: 10.1007/978-3-319-73040-0
[33] B. Kozlíková, M. Krone, M. Falk, N. Lindow, M. Baaden, D. Baum, I. Viola, J. Parulek, and H.-C. Hege. Visualization of biomolecular structures: State of the art revisited. Computer Graphics Forum, 36(8):178–204, 2016. doi: 10.1111/cgf.13072
[34] M. A. Kramer. Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37(2):233–243, 1991. doi: 10.1002/aic.690370209
[35] E. Krieger and G. Vriend. YASARA View–molecular graphics for all devices–from smartphones to workstations. Bioinformatics

Journal of Chemical Information and Modeling, 51(10):2778–2786, 2011. doi: 10.1021/ci200227u
[38] R. Lewis, R. Guha, T. Korcsmaros, and A. Bender. Synergy maps: exploring compound combinations using network-based visualization. Journal of Cheminformatics, 7(1), 2015. doi: 10.1186/s13321-015-0090-6
[39] E. Lionta, G. Spyrou, D. K. Vassilatis, and Z. Cournia. Structure-based virtual screening for drug discovery: Principles, applications and recent advances. Current Topics in Medicinal Chemistry, 14(16):1923–1938, 2014. doi: 10.2174/1568026614666140929124445
[40] S. Liu, D. Maljovec, B. Wang, P.-T. Bremer, and V. Pascucci. Visualizing high-dimensional data: Advances in the past decade. IEEE Transactions on Visualization and Computer Graphics, 23(3):1249–1268, 2016. doi: 10.2312/eurovisstar.20151115
[41] R. Macarron, M. N. Banks, D. Bojanic, D. J. Burns, D. A. Cirovic, T. Garyantes, D. V. Green, R. P. Hertzberg, W. P. Janzen, J. W. Paslay, et al. Impact of high-throughput screening in biomedical research. Nature Reviews Drug Discovery, 10(3):188–195, 2011. doi: 10.1038/nrd3368
[42] G. M. Maggiora and V. Shanmugasundaram. Molecular similarity measures. In Chemoinformatics, pp. 1–50. Springer, 2004. doi: 10.1385/1-59259-802-1:001
[43] L. McInnes, J. Healy, and J. Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
[44] M. R. Min, H. Guo, and D. Shen. Parametric t-distributed stochastic exemplar-centered embedding. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 477–493. Springer, 2018. doi: 10.1007/978-3-030-10925-7_29
[45] K. R. Moon, D. van Dijk, Z. Wang, D. Burkhardt, W. S. Chen, A. van den Elzen, M. J. Hirn, R. R. Coifman, N. B. Ivanova, G. Wolf, et al. Visualizing transitions and structure for high dimensional data exploration. bioRxiv, 2017. doi: 10.1101/120378
[46] H. Moriwaki, Y.-S. Tian, N. Kawashita, and T. Takagi. Mordred: a molecular descriptor calculator. Journal of Cheminformatics, 10(1):4, 2018.
[47] G. M. Morris, R. Huey, W. Lindstrom, M. F. Sanner, R. K. Belew, D. S. Goodsell, and A. J. Olson. Autodock4 and autodocktools4: Automated docking with selective receptor flexibility. Journal of Computational Chemistry, 30(16):2785–2791, 2009.
[48] H. Nasrallah. Atypical antipsychotic-induced metabolic side effects: insights from receptor-binding profiles. Molecular Psychiatry, 13(1):27–35, 2008. doi: 10.1038/sj.mp.4002066
[49] J. J. Naveja and J. L. Medina-Franco. Finding constellations in chemical space through core analysis. Frontiers in Chemistry, 7, 2019. doi: 10.3389/fchem.2019.00510
[50] L. G. Nonato and M. Aupetit. Multidimensional projection for visual analytics: Linking techniques with distortions, tasks, and layout enrichment. IEEE Transactions on Visualization and Computer Graphics, 25(8):2650–2673, 2019. doi: 10.1109/TVCG.2018.2846735
[51] N. M. O'Boyle, M. Banck, C. A. James, C. Morley, T. Vandermeersch, and G. R. Hutchison. Open babel: An open chemical toolbox. Journal of Cheminformatics, 3(1), 2011. doi: 10.1186/1758-2946-3-33
[52] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[53] J. Peltonen and Z. Lin. Information retrieval approach to meta-visualization. Machine Learning, 99(2):189–229, 2015. doi: 10.1007/s10994-014-5464-x
[54] E. F. Pettersen, T. D. Goddard, C. C. Huang, G. S. Couch, D. M. Greenblatt, E. C. Meng, and T. E. Ferrin. UCSF Chimera–A visualization system for exploratory research and analysis. Journal of Computational Chemistry, 25(13):1605–1612, 2004. doi: 10.1002/jcc.20084
[55] D. Probst and J.-L. Reymond. Exploring drugbank in virtual reality chemical space. Journal of Chemical Information and Modeling, 58(9):1731–1735, 2018. doi: 10.1021/acs.jcim.8b00402
[56] D. Probst and J.-L. Reymond. Visualization of very large high-dimensional data sets as minimum spanning trees. Journal of Cheminformatics, 12(1):1–13, 2020. doi: 10.1186/s13321-020-0416-x
[57] P. E. Rauber, A. X. Falcão, and A. C. Telea. Visualizing time-dependent data using dynamic t-SNE. In E. Bertini, N. Elmqvist, and T. Wischgoll, eds., EuroVis 2016 - Short Papers. The Eurographics Association, 2016. doi: 10.2312/eurovisshort.20161164
[58] N. Rego and D. Koes. 3Dmol.js: molecular visualization with WebGL. Bioinformatics, 31(8):1322–1324, 2015. doi: 10.1093/bioinformatics/btu829
[59] D. Rogers and M. Hahn. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010. doi: 10.1021/ci100050t
[60] A. Ronacher. Flask: a lightweight WSGI web application framework, 2009. https://palletsprojects.com/p/flask/, online April 2020.
[61] D. Sacha, L. Zhang, M. Sedlmair, J. A. Lee, J. Peltonen, D. Weiskopf, S. C. North, and D. A. Keim. Visual interaction with dimensionality reduction: A structured literature analysis. IEEE Transactions on Visualization and Computer Graphics, 23(1):241–250, 2016. doi: 10.1109/tvcg.2016.2598495
[62] T. Sander, J. Freyss, M. von Korff, and C. Rufener. DataWarrior: An open-source program for chemistry aware data visualization and analysis. Journal of Chemical Information and Modeling, 55(2):460–473, 2015. doi: 10.1021/ci500588j
[63] M. F. Sanner, A. J. Olson, and J.-C. Spehner. Reduced surface: an efficient way to compute molecular surfaces. Biopolymers, 38(3):305–320, 1996.
[64] L. Schrödinger. The PyMOL molecular graphics system, version 1.8. 2015.
[65] M. Sedlmair, M. Brehmer, S. Ingram, and T. Munzner. Dimensionality reduction in the wild: Gaps and guidance. The University of British Columbia, Tech. Rep, 2012.
[66] M. Sedlmair, T. Munzner, and M. Tory. Empirical guidance on scatterplot and dimension reduction technique choices. IEEE Transactions on Visualization and Computer Graphics, 19(12):2634–2643, 2013. doi: 10.1109/tvcg.2013.153
[67] S. Siafis, D. Tzachanis, M. Samara, and G. Papazisis. Antipsychotic drugs: from receptor-binding profiles to metabolic side effects. Current Neuropharmacology, 16(8):1210–1223, 2018. doi: 10.2174/1570159x15666170630163616
[68] K. M. R. Srivalli and P. Lakshmi. Overview of p-glycoprotein inhibitors: a rational outlook. Brazilian Journal of Pharmaceutical Sciences, 48(3):353–367, 2012. doi: 10.1590/s1984-82502012000300002
[69] M. Strickert, A. J. Soto, and G. E. Vazquez. Adaptive matrix distances aiming at optimum regression subspaces. In ESANN 2010, European Symposium on Artificial Neural Networks, pp. 93–98, 2010.
[70] J. Tang, J. Liu, M. Zhang, and Q. Mei. Visualizing large-scale and high-dimensional data. In Proceedings of the 25th International Conference on World Wide Web, pp. 287–297, 2016. doi: 10.1145/2872427.2883041
[71] Y. Tanrikulu, B. Krüger, and E. Proschak. The holistic integration of virtual screening in drug discovery. Drug Discovery Today, 18(7-8):358–364, 2013. doi: 10.1016/j.drudis.2013.01.007
[72] U. Technologies. Unity Engine, 2005. https://unity.com, online March 2020.
[73] S. Tilkov and S. Vinoski. Node.js: Using javascript to build high-performance network programs. IEEE Internet Computing, 14(6):80–83, 2010. doi: 10.1109/MIC.2010.145
[74] R. Todeschini and V. Consonni. Handbook of molecular descriptors, vol. 11. John Wiley & Sons, 2010. doi: 10.1002/9783527613106
[75] O. Trott and A. J. Olson. Autodock vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of Computational Chemistry, 31(2):455–461, 2010.
[76] L. Van Der Maaten. Learning a parametric embedding by preserving local structure. In Artificial Intelligence and Statistics, pp. 384–391, 2009.
[77] L. Van Der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[78] L. Van Der Maaten, E. Postma, and J. Van den Herik. Dimensionality reduction: a comparative review. Journal of Machine Learning Research, 10(66-71):13, 2009.
[79] C. Wang, Y. Jiang, J. Ma, H. Wu, D. Wacker, V. Katritch, G. W. Han, W. Liu, X.-P. Huang, E. Vardy, et al. Structural basis for molecular recognition at serotonin receptors. Science, 340(6132):610–614, 2013.
[80] J. Wang, X. Liu, and H.-W. Shen. High-dimensional data analysis with subspace comparison using matrix visualization. Information Visualization, 18(1):94–109, 2019. doi: 10.1177/1473871617733996
[81] S. Wang, T. Che, A. Levit, B. K. Shoichet, D. Wacker, and B. L. Roth. Structure of the d2 dopamine receptor bound to the atypical antipsychotic drug risperidone. Nature, 555(7695):269–273, 2018.
[82] Y. C. Wang, Q. Zhang, F. Lin, C. K. Goh, and H. S. Seah. Polarviz: a discriminating visualization and visual analytics tool for high-dimensional data. The Visual Computer, 35(11):1567–1582, 2019. doi: 10.1007/s00371-018-1558-y
[83] D. Weininger. Smiles. 3. depict. graphical depiction of chemical structures. J. Chem. Inf. Comput. Sci., 30(3):237–243, 1990. doi: 10.1021/ci00067a005
[84] S. Wold, K. Esbensen, and P. Geladi. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3):37–52, 1987. doi: 10.1016/0169-7439(87)80084-9
[85] J. Wu, J. Wang, H. Xiao, and J. Ling. Visualization of high dimensional turbulence simulation data using t-SNE. In , 2017. doi: 10.2514/6.2017-1770
[86] X. Xu, T. Liang, J. Zhu, D. Zheng, and T. Sun. Review of classical dimensionality reduction and sample selection methods for large-scale data processing. Neurocomputing, 328:5–15, 2019. doi: 10.1016/j.neucom.2018.02.100
[87] J. Xue, H. Zhang, and K. Dana. Deep texture manifold for ground terrain recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 558–567, 2018. doi: 10.1109/CVPR.2018.00065
[88] J. Yang, R. Yan, A. Roy, D. Xu, J. Poisson, and Y. Zhang. The I-TASSER Suite: protein structure and function prediction. Nature Methods, 12(1):7, 2015.
[89] A. Yoshimori, T. Tanoue, and J. Bajorath. Integrating the structure–activity relationship matrix method with molecular grid maps and activity landscape models for medicinal chemistry applications. ACS Omega, 4(4):7061–7069, 2019. doi: 10.1021/acsomega.9b00595
[90] W. Yu and A. D. MacKerell. Computer-Aided Drug Design Methods, pp. 85–106. Springer New York, New York, NY, 2017. doi: 10.1007/978-1-4939-6634-9_5
[91] G. G. Zanabria, L. G. Nonato, and E. Gomez-Nieto. iStar (i*): An interactive star coordinates approach for high-dimensional data exploration. Computers & Graphics, 60:107–118, 2016. doi: 10.1016/j.cag.2016.08.007
[92] Y. Zhang and J. Skolnick. Scoring function for automated assessment of protein structure template quality. Proteins: Structure, Function, and Bioinformatics, 57(4):702–710, 2004.
[93] W. Zhu, Z. Webb, X. Han, K. Mao, W. Sun, and J. Romagnoli. Generic process visualization using parametric t-SNE.