S. Joshua Swamidass | Researchain

Publication

Featured research published by S. Joshua Swamidass.

intelligent systems in molecular biology | 2005

Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity

S. Joshua Swamidass; Jonathan H. Chen; Jocelyne Bruand; Peter Phung; Liva Ralaivola; Pierre Baldi

MOTIVATION Small molecules play a fundamental role in organic chemistry and biology. They can be used to probe biological systems and to discover new drugs and other useful compounds. As increasing numbers of large datasets of small molecules become available, it is necessary to develop computational methods that can deal with molecules of variable size and structure and predict their physical, chemical and biological properties. RESULTS Here we develop several new classes of kernels for small molecules using their 1D, 2D and 3D representations. In 1D, we consider string kernels based on SMILES strings. In 2D, we introduce several similarity kernels based on conventional or generalized fingerprints. Generalized fingerprints are derived by counting in different ways subpaths contained in the graph of bonds, using depth-first searches. In 3D, we consider similarity measures between histograms of pairwise distances between atom classes. These kernels can be computed efficiently and are applied to problems of classification and prediction of mutagenicity, toxicity and anti-cancer activity on three publicly available datasets. The results derived using cross-validation methods are state-of-the-art. Tradeoffs between various kernels are briefly discussed. AVAILABILITY Datasets available from http://www.igb.uci.edu/servers/servers.html
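The 2D generalized fingerprints described above can be illustrated with a small sketch. It counts labeled paths in a molecular graph by depth-first search and compares two molecules with a MinMax similarity on the path counts. The graph encoding (`atoms` as a list of element symbols, `bonds` as a dictionary of index pairs to bond symbols), the function names, and the choice of MinMax as the similarity are assumptions made for illustration; the abstract does not specify which of these the authors used.

```python
from collections import Counter


def path_counts(atoms, bonds, max_depth):
    """Count labeled simple paths with at most max_depth bonds.

    atoms: list of atom labels, e.g. ["C", "C", "O"] (hypothetical encoding).
    bonds: dict mapping (i, j) index pairs to a bond symbol such as "-".
    A depth-first search is started from every atom, so each path of
    length >= 1 is seen once from each of its two ends.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for (i, j), order in bonds.items():
        neighbors[i].append((j, order))
        neighbors[j].append((i, order))

    counts = Counter()

    def dfs(atom, label, visited, depth):
        counts[label] += 1
        if depth == max_depth:
            return
        for nxt, order in neighbors[atom]:
            if nxt not in visited:  # simple paths only, no revisits
                dfs(nxt, label + order + atoms[nxt], visited | {nxt}, depth + 1)

    for start in range(len(atoms)):
        dfs(start, atoms[start], {start}, 0)
    return counts


def minmax_similarity(c1, c2):
    """MinMax similarity between two path-count vectors (Counters)."""
    keys = set(c1) | set(c2)
    num = sum(min(c1[k], c2[k]) for k in keys)
    den = sum(max(c1[k], c2[k]) for k in keys)
    return num / den if den else 1.0


# Toy heavy-atom graphs: ethanol (C-C-O) and dimethyl ether (C-O-C).
ethanol = path_counts(["C", "C", "O"], {(0, 1): "-", (1, 2): "-"}, max_depth=2)
ether = path_counts(["C", "O", "C"], {(0, 1): "-", (1, 2): "-"}, max_depth=2)
similarity = minmax_similarity(ethanol, ether)
```

On binary (presence or absence) fingerprints the same formula reduces to the Tanimoto similarity, which is why it is a convenient stand-in here.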

Bioinformatics | 2005

ChemDB: a public database of small molecules and related chemoinformatics resources

Jonathan H. Chen; S. Joshua Swamidass; Yimeng Dou; Jocelyne Bruand; Pierre Baldi

MOTIVATION The development of chemoinformatics has been hampered by the lack of large, publicly available, comprehensive repositories of molecules, in particular of small molecules. Small molecules play a fundamental role in organic chemistry and biology. They can be used as combinatorial building blocks for chemical synthesis, as molecular probes in chemical genomics and systems biology, and for the screening and discovery of new drugs and other useful compounds. RESULTS We describe ChemDB, a public database of small molecules available on the Web. ChemDB is built using the digital catalogs of over a hundred vendors and other public sources and is annotated with information derived from these sources as well as from computational methods, such as predicted solubility and three-dimensional structure. It supports multiple molecular formats and is periodically updated, automatically whenever possible. The current version of the database contains approximately 4.1 million commercially available compounds and 8.2 million counting isomers. The database includes a user-friendly graphical interface, chemical reactions capabilities, as well as unique search capabilities. AVAILABILITY Database and datasets are available on http://cdb.ics.uci.edu.

Briefings in Bioinformatics | 2016

A survey of current trends in computational drug repositioning

Jiao Li; Si Zheng; Bin Chen; Atul J. Butte; S. Joshua Swamidass; Zhiyong Lu

Computational drug repositioning or repurposing is a promising and efficient tool for discovering new uses for existing drugs and holds great potential for precision medicine in the age of big data. The explosive growth of large-scale genomic and phenotypic data, as well as data of small molecular compounds with granted regulatory approval, is enabling new developments for computational repositioning. To achieve the shortest path toward new drug indications, advanced data processing and analysis strategies are critical for making sense of these heterogeneous molecular measurements. In this review, we show recent advancements in the critical areas of computational drug repositioning from multiple aspects. First, we summarize available data sources and the corresponding computational repositioning strategies. Second, we characterize the commonly used computational techniques. Third, we discuss validation strategies for repositioning studies, including both computational and experimental methods. Finally, we highlight potential opportunities and use-cases, including a few target areas such as cancers. We conclude with a brief discussion of the remaining challenges in computational drug repositioning.

Journal of Chemical Information and Modeling | 2007

Bounds and algorithms for fast exact searches of chemical fingerprints in linear and sublinear time

S. Joshua Swamidass; Pierre Baldi

Chemical fingerprints are used to represent chemical molecules by recording the presence or absence, or by counting the number of occurrences, of particular features or substructures, such as labeled paths in the 2D graph of bonds, of the corresponding molecule. These fingerprint vectors are used to search large databases of small molecules, currently containing millions of entries, using various similarity measures, such as the Tanimoto or Tversky measures and their variants. Here, we derive simple bounds on these similarity measures and show how these bounds can be used to considerably reduce the subset of molecules that need to be searched. We consider both the case of single-molecule and multiple-molecule queries, as well as queries based on fixed similarity thresholds or aimed at retrieving the top K hits. We study the speedup as a function of query size and distribution, fingerprint length, similarity threshold, and database size |D| and derive analytical formulas that are in excellent agreement with empirical values. The theoretical considerations and experiments show that this approach can provide linear speedups of one or more orders of magnitude in the case of searches with a fixed threshold, and achieve sublinear speedups in the range of O(|D|^0.6) for the top K hits in current large databases. This pruning approach yields subsecond search times across the 5 million compounds in the ChemDB database, without any loss of accuracy.
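The pruning idea for fixed-threshold searches can be sketched as follows, assuming binary fingerprints stored as sets of on-bit indices. For a query with a on-bits and a database fingerprint with b on-bits, the intersection has at most min(a, b) bits and the union at least max(a, b), so the Tanimoto similarity is at most min(a, b) / max(a, b). A hit at threshold t therefore requires a·t ≤ b ≤ a/t, and every fingerprint outside that range can be skipped without computing its similarity. The function names and the sorted-list index are illustrative assumptions, not the authors' implementation.

```python
from bisect import bisect_left, bisect_right


def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def build_index(database):
    """Sort fingerprints by on-bit count so a count range is a contiguous slice."""
    ordered = sorted(database, key=len)
    counts = [len(fp) for fp in ordered]
    return ordered, counts


def threshold_search(query, index, t):
    """Return (hits, number_of_candidates_scored) for threshold 0 < t <= 1.

    Only fingerprints whose bit count b satisfies a*t <= b <= a/t are scored,
    because T(A, B) <= min(a, b) / max(a, b) rules out all the others.
    """
    ordered, counts = index
    a = len(query)
    lo = bisect_left(counts, a * t)
    hi = bisect_right(counts, a / t)
    candidates = ordered[lo:hi]
    scored = [(fp, tanimoto(query, fp)) for fp in candidates]
    return [(fp, s) for fp, s in scored if s >= t], len(candidates)


# Toy database: fingerprints with 1, 2, 3, 4, 6, 8, 9 and 20 on-bits.
database = [frozenset(range(n)) for n in (1, 2, 3, 4, 6, 8, 9, 20)]
index = build_index(database)
query = frozenset(range(4))
hits, n_scored = threshold_search(query, index, 0.5)
```

In this toy run only five of the eight fingerprints are scored, and the hits are identical to those of an exhaustive scan, which is the sense in which the search stays exact.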

Bioinformatics | 2007

ChemDB update—full-text search and virtual chemical space

Jonathan H. Chen; Erik Linstead; S. Joshua Swamidass; Dennis Ding-Hwa Wang; Pierre Baldi

ChemDB is a chemical database containing nearly 5M commercially available small molecules, important for use as synthetic building blocks, probes in systems biology and as leads for the discovery of drugs and other useful compounds. The data is publicly available over the web for download and for targeted searches using a variety of powerful methods. The chemical data includes predicted or experimentally determined physicochemical properties, such as 3D structure, melting temperature and solubility. Recent developments include optimization of chemical structure (and substructure) retrieval algorithms, enabling full database searches in less than a second. A text-based search engine allows efficient searching of compounds based on over 65M annotations from over 150 vendors. When searching for chemicals by name, fuzzy text matching capabilities yield productive results even when the correct spelling of a chemical name is unknown, taking advantage of both systematic and common names. Finally, built-in reaction models enable searches through virtual chemical space, consisting of hypothetical products readily synthesizable from the building blocks in ChemDB. AVAILABILITY ChemDB and Supplementary Materials are available at http://cdb.ics.uci.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

Journal of Chemical Information and Modeling | 2007

One- to Four-Dimensional Kernels for Virtual Screening and the Prediction of Physical, Chemical, and Biological Properties

Chloé-Agathe Azencott; Alexandre Ksikes; S. Joshua Swamidass; Jonathan H. Chen; Liva Ralaivola; Pierre Baldi

Many chemoinformatics applications, including high-throughput virtual screening, benefit from being able to rapidly predict the physical, chemical, and biological properties of small molecules to screen large repositories and identify suitable candidates. When training sets are available, machine learning methods provide an effective alternative to ab initio methods for these predictions. Here, we leverage rich molecular representations including 1D SMILES strings, 2D graphs of bonds, and 3D coordinates to derive efficient machine learning kernels to address regression problems. We further expand the library of available spectral kernels for small molecules developed for classification problems to include 2.5D surface and 3D kernels using Delaunay tetrahedrization and other techniques from computational geometry, 3D pharmacophore kernels, and 3.5D or 4D kernels capable of taking into account multiple molecular configurations, such as conformers. The kernels are comprehensively tested using cross-validation and redundancy-reduction methods on regression problems using several available data sets to predict boiling points, melting points, aqueous solubility, octanol/water partition coefficients, and biological activity with state-of-the-art results. When sufficient training data are available, 2D spectral kernels in general tend to yield the best and most robust results, better than state-of-the-art. On data sets containing thousands of molecules, the kernels achieve a squared correlation coefficient of 0.91 for aqueous solubility prediction and 0.94 for octanol/water partition coefficient prediction. Averaging over conformations improves the performance of kernels based on the three-dimensional structure of molecules, especially on challenging data sets. Kernel predictors for aqueous solubility (kSOL), LogP (kLOGP), and melting point (kMELT) are available over the Web through: http://cdb.ics.uci.edu.

Journal of the Royal Society Interface | 2018

Opportunities and obstacles for deep learning in biology and medicine

Travers Ching; Daniel Himmelstein; Brett K. Beaulieu-Jones; Alexandr A. Kalinin; Brian T. Do; Gregory P. Way; Enrico Ferrero; Paul-Michael Agapow; Michael Zietz; Michael M. Hoffman; Wei Xie; Gail Rosen; Benjamin J. Lengerich; Johnny Israeli; Jack Lanchantin; Stephen Woloszynek; Anne E. Carpenter; Avanti Shrikumar; Jinbo Xu; Evan M. Cofer; Christopher A. Lavender; Srinivas C. Turaga; Amr Alexandari; Zhiyong Lu; David J. Harris; Dave DeCaprio; Yanjun Qi; Anshul Kundaje; Yifan Peng; Laura Wiley

Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes and treatment of patients) and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.

Journal of Chemical Information and Modeling | 2007

Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval.

Pierre Baldi; Ryan W. Benz; Daniel S. Hirschberg; S. Joshua Swamidass

Many modern chemoinformatics systems for small molecules rely on large fingerprint vector representations, where the components of the vector record the presence or number of occurrences in the molecular graphs of particular combinatorial features, such as labeled paths or labeled trees. These large fingerprint vectors are often compressed to much shorter fingerprint vectors using a lossy compression scheme based on a simple modulo procedure. Here, we combine statistical models of fingerprints with integer entropy codes, such as Golomb and Elias codes, to encode the indices or the run lengths of the fingerprints. After reordering the fingerprint components by decreasing frequency order, the indices are monotone-increasing and the run lengths are quasi-monotone-increasing, and both exhibit power-law distribution trends. We take advantage of these statistical properties to derive new efficient, lossless, compression algorithms for monotone integer sequences: monotone value (MOV) coding and monotone length (MOL) coding. In contrast to lossy systems that use 1024 or more bits of storage per molecule, we can achieve lossless compression of long chemical fingerprints based on circular substructures in slightly over 300 bits per molecule, close to the Shannon entropy limit, using a MOL Elias Gamma code for run lengths. The improvement in storage comes at a modest computational cost. Furthermore, because the compression is lossless, uncompressed similarity (e.g., Tanimoto) between molecules can be computed exactly from their compressed representations, leading to significant improvements in retrieval performance, as shown on six benchmark data sets of druglike molecules.
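As a rough illustration of the integer entropy codes mentioned above, the sketch below applies a plain Elias gamma code to the gaps between consecutive on-bit indices of a sparse binary fingerprint. This is the textbook Elias gamma code only, not the MOV or MOL schemes the abstract introduces, and the bit-string representation and function names are assumptions chosen for readability rather than speed.

```python
def elias_gamma_encode(n):
    """Elias gamma code of a positive integer, returned as a '0'/'1' string.

    The code is (len(binary) - 1) zeros followed by the binary digits of n,
    so small integers get short codewords.
    """
    if n < 1:
        raise ValueError("Elias gamma is defined for integers >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode_stream(bits):
    """Decode a concatenation of Elias gamma codewords into a list of integers."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # leading zeros give the codeword length
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def compress_fingerprint(on_bits):
    """Losslessly encode on-bit indices as gamma-coded gaps (each gap >= 1)."""
    out = []
    prev = -1
    for idx in sorted(set(on_bits)):
        out.append(elias_gamma_encode(idx - prev))
        prev = idx
    return "".join(out)


def decompress_fingerprint(bits):
    """Recover the sorted on-bit indices from the gamma-coded gap stream."""
    indices = []
    prev = -1
    for gap in elias_gamma_decode_stream(bits):
        prev += gap
        indices.append(prev)
    return indices


fingerprint = [1000, 0, 3, 4, 10]
encoded = compress_fingerprint(fingerprint)
restored = decompress_fingerprint(encoded)
```

Because the exact on-bit indices are recovered, set-based similarities such as Tanimoto can be computed on the restored indices without approximation, unlike with modulo folding, where distinct indices can collide.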

ACS central science | 2015

Modeling Epoxidation of Drug-like Molecules with a Deep Machine Learning Network.

Tyler B. Hughes; Grover P. Miller; S. Joshua Swamidass

Drug toxicity is frequently caused by electrophilic reactive metabolites that covalently bind to proteins. Epoxides comprise a large class of three-membered cyclic ethers. These molecules are electrophilic and typically highly reactive due to ring tension and polarized carbon–oxygen bonds. Epoxides are metabolites often formed by cytochromes P450 acting on aromatic or double bonds. The specific location on a molecule that undergoes epoxidation is its site of epoxidation (SOE). Identifying a molecule’s SOE can aid in interpreting adverse events related to reactive metabolites and direct modification to prevent epoxidation for safer drugs. This study utilized a database of 702 epoxidation reactions to build a model that accurately predicted sites of epoxidation. The foundation for this model was an algorithm originally designed to model sites of cytochromes P450 metabolism (called XenoSite) that was recently applied to model the intrinsic reactivity of diverse molecules with glutathione. This modeling algorithm systematically and quantitatively summarizes the knowledge from hundreds of epoxidation reactions with a deep convolution network. This network makes predictions at both an atom and molecule level. The final epoxidation model constructed with this approach identified SOEs with 94.9% area under the curve (AUC) performance and separated epoxidized and non-epoxidized molecules with 79.3% AUC. Moreover, within epoxidized molecules, the model separated aromatic or double bond SOEs from all other aromatic or double bonds with AUCs of 92.5% and 95.1%, respectively. Finally, the model separated SOEs from sites of sp2 hydroxylation with 83.2% AUC. Our model is the first of its kind and may be useful for the development of safer drugs. The epoxidation model is available at http://swami.wustl.edu/xenosite.

IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2006

Functional Census of Mutation Sequence Spaces: The Example of p53 Cancer Rescue Mutants

Samuel A. Danziger; S. Joshua Swamidass; Jue Zeng; Lawrence R. Dearth; Qiang Lu; Jonathan H. Chen; Jianlin Cheng; Vinh P. Hoang; Hiroto Saigo; Ray Luo; Pierre Baldi; Rainer K. Brachmann; Richard H. Lathrop

Many biomedical problems relate to mutant functional properties across a sequence space of interest, e.g., flu, cancer, and HIV. Detailed knowledge of mutant properties and function improves medical treatment and prevention. A functional census of p53 cancer rescue mutants would aid the search for cancer treatments from p53 mutant rescue. We devised a general methodology for conducting a functional census of a mutation sequence space by choosing informative mutants early. The methodology was tested in a double-blind predictive test on the functional rescue property of 71 novel putative p53 cancer rescue mutants iteratively predicted in sets of three (24 iterations). The first double-blind 15-point moving accuracy was 47 percent and the last was 86 percent; r = 0.01 before an epiphanic 16th iteration and r = 0.92 afterward. Useful mutants were chosen early (overall r = 0.80). Code and data are freely available (http://www.igb.uci.edu/research/research.html, corresponding authors: R.H.L. for computation and R.K.B. for biology)
