David A. duVerle
University of Tokyo
Publication
Featured research published by David A. duVerle.
PLOS ONE | 2011
David A. duVerle; Yasuko Ono; Hiroyuki Sorimachi; Hiroshi Mamitsuka
Calpain, an intracellular Ca2+-dependent cysteine protease, is known to play a role in a wide range of metabolic pathways through limited proteolysis of its substrates. However, only a limited number of these substrates are currently known, with the exact mechanism of substrate recognition and cleavage by calpain still largely unknown. While previous research has successfully applied standard machine-learning algorithms to accurately predict substrate cleavage by other similar types of proteases, their approach does not extend well to calpain, possibly due to its particular mode of proteolytic action and the limited amount of experimental data. Through the use of Multiple Kernel Learning, a recent extension to the classic Support Vector Machine framework, we were able to train complex models based on rich, heterogeneous feature sets, leading to significantly improved prediction quality (6% over the highest AUC score produced by state-of-the-art methods). In addition to producing a stronger machine-learning model for the prediction of calpain cleavage, we were able to highlight the importance and role of each feature of substrate sequences in defining specificity: primary sequence, secondary structure and solvent accessibility. Most notably, we showed that there are significant specificity differences across calpain sub-types, despite previous assumptions to the contrary. Prediction accuracy was further successfully validated using, as an unbiased test set, mutated sequences of calpastatin (the endogenous inhibitor of calpain) modified to no longer block calpain's proteolytic action. An online implementation of our prediction tool is available at http://calpain.org.
Proceedings of the 9th Annual International Workshop on Bioinformatics and Systems Biology (IBSB 2009) | 2010
David A. duVerle; Ichigaku Takigawa; Yasuko Ono; Hiroyuki Sorimachi; Hiroshi Mamitsuka
While the importance of modulatory proteolysis in research has steadily increased, knowledge on this process has remained largely disorganized, with the nature and role of entities composing modulatory proteolysis still uncertain. We built CaMPDB, a resource on modulatory proteolysis, with a focus on calpain, a well-studied intracellular protease which regulates substrate functions by proteolytic processing. CaMPDB contains sequences of calpains, substrates and inhibitors as well as substrate cleavage sites, collected from the literature. Some cleavage efficiencies were evaluated by biochemical experiments and a cleavage site prediction tool is provided to assist biologists in understanding calpain-mediated cellular processes. CaMPDB is freely accessible at http://calpain.org.
Briefings in Bioinformatics | 2012
David A. duVerle; Hiroshi Mamitsuka
A fundamental component of systems biology, proteolytic cleavage is involved in nearly all aspects of cellular activities: from gene regulation to cell lifecycle regulation. Current sequencing technologies have made it possible to compile large amounts of cleavage data and brought greater understanding of the underlying protein interactions. However, the practical impossibility of exhaustively retrieving substrate sequences through experimentation alone has long highlighted the need for efficient computational prediction methods. Such methods must be able to quickly mark substrate candidates and putative cleavage sites for further analysis. Available methods and expected reliability depend heavily on the type and complexity of proteolytic action, as well as the availability of well-labelled experimental data sets: factors varying greatly across enzyme families. For this review, we chose to give a quick overview of the general issues and challenges in cleavage prediction methods followed by a more in-depth presentation of major techniques and implementations, with a focus on two particular families of cysteine proteases: caspases and calpains. Through their respective differences in proteolytic specificity (high for caspases, broader for calpains) and data availability (much lower for calpains), we aimed to illustrate the strengths and limitations of techniques ranging from position-based matrices and decision trees to more flexible machine-learning methods such as hidden Markov models and Support Vector Machines. In addition to a technical overview for each family of algorithms, we tried to provide elements of evaluation and performance comparison across methods.
BMC Bioinformatics | 2016
David A. duVerle; Sohiya Yotsukura; Seitaro Nomura; Hiroyuki Aburatani; Koji Tsuda
Background: Single-cell RNA sequencing is fast becoming one of the standard methods for gene expression measurement, providing unique insights into cellular processes. A number of methods, based on general dimensionality reduction techniques, have been suggested to help infer and visualise the underlying structure of cell populations from single-cell expression levels, yet their models generally lack proper biological grounding and struggle to identify complex differentiation paths. Results: Here we introduce cellTree: an R/Bioconductor package that uses a novel statistical approach, based on document analysis techniques, to produce tree structures outlining the hierarchical relationship between single-cell samples, while identifying latent groups of genes that can provide biological insights. Conclusions: With cellTree, we provide experimentalists with an easy-to-use tool, based on statistically and biologically sound algorithms, to efficiently explore and visualise single-cell RNA data. The cellTree package is publicly available in the online Bioconductor repository at: http://bioconductor.org/packages/cellTree/.
Briefings in Bioinformatics | 2017
Sohiya Yotsukura; David A. duVerle; Timothy Hancock; Yayoi Natsume-Kitatani; Hiroshi Mamitsuka
Since the completion of the Human Genome Project, it has been widely established that most DNA is not transcribed into proteins. These non-protein-coding regions are believed to be moderators within transcriptional and post-transcriptional processes, which play key roles in the onset of diseases. Long non-coding RNAs (lncRNAs) generally lack the conserved motifs typically used for detection and are thus hard to identify, but nonetheless present certain characteristic features that can be exploited by bioinformatics methods. By combining lncRNA detection with known miRNA, RNA-binding protein and chromatin interactions, current tools are able to recognize and functionally annotate large numbers of lncRNAs. This review discusses databases and platforms dedicated to cataloging and annotating lncRNAs, as well as tools geared toward discovering novel sequences. We emphasize the issues posed by the diversity of lncRNAs and their complex interaction mechanisms, as well as technical issues such as the lack of a unified nomenclature. We hope that this wide overview of existing platforms and databases might help guide biologists toward the tools they need to analyze their experimental data, while our discussion of limitations and of current lncRNA-related methods may assist in the development of new computational tools.
IEEE Symposium on Security and Privacy | 2015
David A. duVerle; Shohei Kawasaki; Yoshiji Yamada; Jun Sakuma; Koji Tsuda
Logistic regression is the method of choice in most genome-wide association studies (GWAS). Due to the heavy cost of performing iterative parameter updates when training such a model, existing methods have prohibitive communication and computational complexities that make them impractical for real-life usage. We propose a new sampling-based secure protocol to compute exact statistics that requires a constant number of communication rounds and a much lower number of computations. The publicly available implementation of our protocol (and its many optional optimisations adapted to different security scenarios) can, in a matter of hours, perform statistical testing of over 600 SNP variables across thousands of patients while accounting for potential confounding factors in the clinical data.
Workshop on Privacy in the Electronic Society | 2014
Hirohito Sasakawa; Hiroki Harada; David A. duVerle; Hiroki Arimura; Koji Tsuda; Jun Sakuma
Various string matching problems can be solved by means of a deterministic finite automaton (DFA) or a non-deterministic finite automaton (NFA). In non-oblivious cases, DFAs are often preferred for their run-time efficiency despite larger sizes. In oblivious cases, however, the inevitable computation and communication costs associated with the automaton size are more favorable to NFAs. We propose oblivious protocols for NFA evaluation based on homomorphic encryption and demonstrate that our method can be orders of magnitude faster than DFA-based methods, making it applicable to real-life scenarios, such as privacy-preserving detection of viral infection using genomic data.
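As a rough illustration of the automaton-size argument above, the following minimal sketch simulates an NFA in plain (non-oblivious) Python by tracking the set of active states. It is not the paper's protocol and involves no homomorphic encryption; the function names `build_substring_nfa` and `nfa_accepts` and the example pattern are hypothetical and chosen only for illustration.

```python
from typing import Dict, Set, Tuple

# Maps (state, symbol) to the set of possible next states.
Transitions = Dict[Tuple[int, str], Set[int]]


def build_substring_nfa(pattern: str, alphabet: str) -> Tuple[Transitions, int, Set[int]]:
    """Build an NFA accepting any text over `alphabet` that contains `pattern`.

    The automaton has len(pattern) + 1 states: state 0 is the start state
    and the last state is the only accepting state.
    """
    last = len(pattern)
    transitions: Transitions = {}
    for symbol in alphabet:
        # Stay in the start state while waiting for a match to begin.
        transitions.setdefault((0, symbol), set()).add(0)
        # Once the pattern has been seen, remain in the accepting state.
        transitions.setdefault((last, symbol), set()).add(last)
    for state, symbol in enumerate(pattern):
        # Advance one state for each matched pattern symbol.
        transitions.setdefault((state, symbol), set()).add(state + 1)
    return transitions, 0, {last}


def nfa_accepts(transitions: Transitions, start: int, accepting: Set[int], text: str) -> bool:
    """Simulate the NFA by propagating the set of active states over `text`."""
    active = {start}
    for symbol in text:
        active = {
            nxt
            for state in active
            for nxt in transitions.get((state, symbol), ())
        }
    return bool(active & accepting)


# Hypothetical example: does a nucleotide string contain the motif "GAT"?
nfa = build_substring_nfa("GAT", "ACGT")
found = nfa_accepts(*nfa, "ACGATC")
missing = nfa_accepts(*nfa, "ACGTAC")
```

Each input symbol costs work proportional to the number of states and transitions, which is why a compact automaton matters when every step must be carried out on encrypted data.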
Pacific-Asia Conference on Knowledge Discovery and Data Mining | 2016
Aika Terada; David A. duVerle; Koji Tsuda
Recent pattern mining algorithms such as LAMP allow us to compute the statistical significance of patterns with respect to an outcome variable. Their p-values are adjusted to control the family-wise error rate, which is the probability of at least one false discovery occurring. However, they are a poor fit for medical applications, due to their inability to handle potential confounding variables such as age or gender. We propose a novel pattern mining algorithm that evaluates statistical significance under confounding variables. Using a new testability bound based on the exact logistic regression model, the algorithm can exclude a large number of combinations without testing them, limiting the amount of correction required for multiple testing. Using synthetic data, we showed that our method could remove the bias introduced by confounding variables while still detecting true patterns correlated with the class. In addition, we demonstrated an application to data integration using a confounding variable.
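The link between the number of hypotheses tested and the amount of correction required can be sketched numerically. The snippet below uses plain Bonferroni correction under an assumption of independent tests; it is not LAMP or the testability bound described in the abstract, and the function names and numbers are hypothetical.

```python
def fwer_if_uncorrected(alpha: float, num_tests: int) -> float:
    """Probability of at least one false discovery across `num_tests`
    independent tests, each run at significance level `alpha`."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below `alpha`."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


# Hypothetical numbers: excluding untestable patterns shrinks the number of
# hypotheses, so the per-test threshold becomes less strict.
all_patterns = 1_000_000
testable_patterns = 5_000
strict = bonferroni_threshold(0.05, all_patterns)
relaxed = bonferroni_threshold(0.05, testable_patterns)
```

Here `relaxed` is 200 times larger than `strict`, which illustrates why excluding combinations that need not be tested preserves statistical power.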
Bioinformatics | 2013
David A. duVerle; Ichiro Takeuchi; Yuko Murakami-Tonami; Kenji Kadomatsu; Koji Tsuda
Motivation: Although several methods exist to relate high-dimensional gene expression data to various clinical phenotypes, finding combinations of features in such input remains a challenge, particularly when fitting complex statistical models such as those used for survival studies. Results: Our proposed method builds on existing ‘regularization path-following’ techniques to produce regression models that can extract arbitrarily complex patterns of input features (such as gene combinations) from large-scale data that relate to a known clinical outcome. Through the use of the data’s structure and itemset mining techniques, we are able to avoid combinatorial complexity issues typically encountered with such methods, and our algorithm performs in similar orders of duration as single-variable versions. Applied to data from various clinical studies of cancer patient survival time, our method was able to produce a number of promising gene-interaction candidates whose tumour-related roles appear confirmed by literature. Availability: An R implementation of the algorithm described in this article can be found at https://github.com/david-duverle/regularisation-path-following Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
Archive | 2018
Kei-ichiro Takahashi; David A. duVerle; Sohiya Yotsukura; Ichigaku Takigawa; Hiroshi Mamitsuka
Biclustering extracts coexpressed genes under certain experimental conditions, providing more precise insight into genetic behaviors than one-dimensional clustering. For understanding the biological features of genes in a single bicluster, visualizations such as heatmaps or parallel coordinate plots and tools for enrichment analysis are widely used. However, simultaneously handling many biclusters remains a challenge. Thus, we developed a web service named SiBIC, which uses maximal frequent itemset mining to exhaustively discover significant biclusters. These are turned into networks of overlapping biclusters, where nodes are gene sets and edges show their overlaps in the detected biclusters. SiBIC provides a graphical user interface for manipulating a gene set network, where users can find target gene sets based on the enriched network. This chapter provides a user guide to SiBIC, along with the background to its development. SiBIC is available at http://utrecht.kuicr.kyoto-u.ac.jp:8080/sibic/faces/index.jsp.