Network


Latest external collaborations at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Marco Capuccini is active.

Publication


Featured research published by Marco Capuccini.


Journal of Cheminformatics | 2018

Efficient iterative virtual screening with Apache Spark and conformal prediction.

Laeeq Ahmed; Valentin Georgiev; Marco Capuccini; Salman Zubair Toor; Wesley Schaal; Erwin Laure; Ola Spjuth

Background: Docking and scoring large libraries of ligands against target proteins forms the basis of structure-based virtual screening. The problem is trivially parallelizable, and calculations are generally carried out on computer clusters or on large workstations in a brute-force manner, by docking and scoring all available ligands.

Contribution: In this study we propose a strategy based on iteratively docking a set of ligands to form a training set, training a ligand-based model on this set, and predicting the remainder of the ligands to exclude those predicted as 'low-scoring' ligands. Then another set of ligands is docked, the model is retrained, and the process is repeated until a certain model efficiency level is reached. Thereafter, the remaining ligands are docked or excluded based on this model. We use SVM and conformal prediction to deliver valid prediction intervals for ranking the predicted ligands, and Apache Spark to parallelize both the docking and the modeling.

Results: We show on 4 different targets that conformal prediction based virtual screening (CPVS) is able to reduce the number of docked molecules by 62.61% while retaining an accuracy for the top 30 hits of 94% on average and a speedup of 3.7. The implementation is available as open source via GitHub (https://github.com/laeeq80/spark-cpvs) and can be run on high-performance computers as well as on cloud resources.
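The iterative dock-train-exclude loop described above can be sketched in plain Python. This is a toy illustration only, not the paper's Spark/SVM implementation: each ligand is a single descriptor value, `dock` stands in for the expensive docking computation, and the "model" is a simple cutoff learned from the top-scoring half of the docked ligands.

```python
import random

def dock(ligand):
    """Stand-in for an expensive docking computation (here the score
    simply equals the ligand's single descriptor value)."""
    return ligand

def train_model(docked):
    """Toy stand-in for the SVM: accept ligands whose descriptor is at
    least the smallest descriptor in the top-scoring half of docked ones."""
    ranked = sorted(docked, key=docked.get, reverse=True)
    cutoff = min(ranked[: len(ranked) // 2])
    return lambda ligand: ligand >= cutoff

def iterative_screen(ligands, batch_size=100, rounds=3):
    docked, remaining = {}, list(ligands)
    for _ in range(rounds):
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        docked.update({lig: dock(lig) for lig in batch})        # dock a training batch
        predict = train_model(docked)                           # retrain on all docked ligands
        remaining = [lig for lig in remaining if predict(lig)]  # drop predicted low-scorers
    return docked, remaining

ligands = [i / 1000 for i in range(1000)]
random.Random(0).shuffle(ligands)
docked, survivors = iterative_screen(ligands)
```

Only `len(docked)` ligands ever pay the docking cost; the rest are either excluded by the model or kept as candidates for a final docking pass, which is where the reported speedup comes from.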


international conference on e-science | 2017

SNIC Science Cloud (SSC): A National-Scale Cloud Infrastructure for Swedish Academia

Salman Zubair Toor; Mathias Lindberg; Ingemar Falman; Andreas Vallin; Olof Mohill; Pontus Freyhult; Linus Nilsson; Martin Agback; Lars Viklund; Henric Zazzik; Ola Spjuth; Marco Capuccini; Joakim Moller; Donal P. Murtagh; Andreas Hellander

The cloud computing paradigm has fundamentally changed the way computational resources are offered. Although the number of large-scale providers in academia is still relatively small, there is rapidly increasing interest in and adoption of cloud Infrastructure-as-a-Service in the scientific community. The added flexibility in how applications can be implemented compared to traditional batch computing systems is one of the key success factors for the paradigm, and scientific cloud computing promises to increase adoption of simulation and data analysis in scientific communities that are not traditionally users of large-scale e-Infrastructure, the so-called "long tail of science". In 2014, the Swedish National Infrastructure for Computing (SNIC) initiated a project to investigate the cost and constraints of offering cloud infrastructure for Swedish academia. The aim was to build a platform where academics could evaluate cloud computing for their use cases. SNIC Science Cloud (SSC) has since evolved into a national-scale cloud infrastructure based on three geographically distributed regions. In this article we present the SSC vision, architectural details and user stories. We summarize the experiences gained from running a national-scale cloud facility into "ten simple rules" for starting up a science cloud project based on OpenStack. We also highlight some key areas that require careful attention in order to offer cloud infrastructure for ubiquitous academic needs, and in particular scientific workloads.


2015 IEEE/ACM 2nd International Symposium on Big Data Computing (BDC) | 2015

Conformal Prediction in Spark: Large-Scale Machine Learning with Confidence

Marco Capuccini; Lars Carlsson; Ulf Norinder; Ola Spjuth

The increasing size of datasets is challenging for machine learning, and Big Data frameworks such as Apache Spark have shown promise for facilitating model building on distributed resources. Conformal prediction is a mathematical framework that makes it possible to assign valid confidence levels to object-specific predictions. This contrasts with current best practices, where the overall confidence level for predictions on unseen objects is estimated based on previous performance, assuming exchangeability. Here we report a Spark-based distributed implementation of conformal prediction, which introduces valid confidence estimation in predictive modeling for Big Data analytics. Experimental results on two large-scale datasets show the validity and the scalability of the method, which is freely available as open source.
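As a minimal, single-machine illustration of the underlying idea (not the paper's Spark implementation), an inductive conformal predictor assigns each test object a p-value computed from nonconformity scores on a held-out calibration set. The nonconformity measure below, absolute distance from the training mean, is deliberately simplistic and chosen only for clarity:

```python
# Minimal inductive conformal prediction sketch.
# Nonconformity score = absolute distance from the training-set mean
# (a toy measure; model-based measures are used in practice).

def conformal_p_value(calibration, x, mean):
    """p-value: smoothed fraction of calibration examples that are at
    least as nonconforming as the test object x."""
    alpha_x = abs(x - mean)
    alphas = [abs(c - mean) for c in calibration]
    ge = sum(1 for a in alphas if a >= alpha_x)
    return (ge + 1) / (len(alphas) + 1)

train = [1.0, 1.2, 0.9, 1.1, 1.0]          # used only to fit the "model" (the mean)
calibration = [0.8, 1.3, 1.05, 0.95]       # held out for calibration
mean = sum(train) / len(train)

p_typical = conformal_p_value(calibration, 1.02, mean)  # object close to the mean
p_outlier = conformal_p_value(calibration, 5.0, mean)   # object far from the mean
```

A typical object receives a high p-value and a strange one a low p-value; under exchangeability these p-values are valid, which is the property the Spark implementation preserves at scale.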


bioRxiv | 2018

PhenoMeNal: Processing and analysis of Metabolomics data in the Cloud

Kristian Peters; James Bradbury; Sven Bergmann; Marco Capuccini; Marta Cascante; Pedro de Atauri; Timothy M. D. Ebbels; Carles Foguet; Robert C. Glen; Alejandra Gonzalez-Beltran; Evangelos Handakas; Thomas Hankemeier; Stephanie Herman; Kenneth Haug; Petr Holub; Massimiliano Izzo; Daniel Jacob; David Johnson; Fabien Jourdan; Namrata Kale; Ibrahim Karaman; Bita Khalili; Payam Emami Khoonsari; Kim Kultima; Samuel Lampa; Anders Larsson; Pablo Moreno; Steffen Neumann; Jon Ander Novella; Claire O'Donovan

Background: Metabolomics is the comprehensive study of a multitude of small molecules to gain insight into an organism's metabolism. The research field is dynamic and expanding, with applications across biomedical, biotechnological and many other applied biological domains. Its computationally intensive nature has driven requirements for open data formats, data repositories and data analysis tools. However, the rapid progress has resulted in a mosaic of independent, and sometimes incompatible, analysis methods that are difficult to connect into a useful and complete data analysis solution.

Findings: The PhenoMeNal (Phenome and Metabolome aNalysis) e-infrastructure provides a complete, workflow-oriented, interoperable metabolomics data analysis solution for a modern infrastructure-as-a-service (IaaS) cloud platform. PhenoMeNal seamlessly integrates a wide array of existing open-source tools which are tested and packaged as Docker containers through the project's continuous integration process and deployed based on a Kubernetes orchestration framework. It also provides a number of standardized, automated and published analysis workflows in the user interfaces Galaxy, Jupyter, Luigi and Pachyderm.

Conclusions: PhenoMeNal constitutes a keystone solution in cloud infrastructures available for metabolomics. It provides scientists with a ready-to-use, workflow-driven, reproducible and shareable data analysis platform harmonizing software installation and configuration through user-friendly web interfaces. The deployed cloud environments can be dynamically scaled to enable large-scale analyses which are interfaced through standard data formats, versioned, and have been tested for reproducibility and interoperability. The flexible implementation of PhenoMeNal allows easy adaptation of the infrastructure to other application areas and 'omics research domains.


Bioinformatics | 2018

Container-based bioinformatics with Pachyderm

Jon Ander Novella; Payam Emami Khoonsari; Stephanie Herman; Daniel Whitenack; Marco Capuccini; Joachim Burman; Kim Kultima; Ola Spjuth

Motivation: Computational biologists face many challenges related to data size, and they need to manage complicated analyses often including multiple stages and multiple tools, all of which must be deployed to modern infrastructures. To address these challenges and maintain reproducibility of results, researchers need (i) a reliable way to run processing stages in any computational environment, (ii) a well-defined way to orchestrate those processing stages and (iii) a data management layer that tracks data as it moves through the processing pipeline.

Results: Pachyderm is an open-source workflow system and data management framework that fulfils these needs by creating a data pipelining and data versioning layer on top of projects from the container ecosystem, with Kubernetes as the backbone for container orchestration. We adapted Pachyderm and demonstrated its attractive properties in bioinformatics. A Helm Chart was created so that researchers can use Pachyderm in multiple scenarios. The Pachyderm File System was extended to support block storage. A wrapper for initiating Pachyderm on cloud-agnostic virtual infrastructures was created. The benefits of Pachyderm are illustrated via a large metabolomics workflow, demonstrating that Pachyderm enables efficient and sustainable data science workflows while maintaining reproducibility and scalability.

Availability and implementation: Pachyderm is available from https://github.com/pachyderm/pachyderm. The Pachyderm Helm Chart is available from https://github.com/kubernetes/charts/tree/master/stable/pachyderm. Pachyderm is available out-of-the-box from the PhenoMeNal VRE (https://github.com/phnmnl/KubeNow-plugin) and general Kubernetes environments instantiated via KubeNow. The code of the workflow used for the analysis is available on GitHub (https://github.com/pharmbio/LC-MS-Pachyderm).

Supplementary information: Supplementary data are available at Bioinformatics online.
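Pachyderm pipelines of the kind described above are declared as JSON specifications that wire a containerized command to a versioned input repository. The sketch below is a hypothetical example for illustration only; the image name, repository name and command are invented, not taken from the paper's workflow:

```json
{
  "pipeline": { "name": "peak-picking" },
  "transform": {
    "image": "example/mzml-tool:latest",
    "cmd": ["/bin/sh", "-c", "process /pfs/raw-data/* /pfs/out/"]
  },
  "input": {
    "atom": { "repo": "raw-data", "glob": "/*" }
  }
}
```

Inputs are mounted read-only under `/pfs/<repo>` and results written to `/pfs/out` become a new commit in the pipeline's output repository, which is how Pachyderm tracks data as it moves through the pipeline.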


2017 Imperial College Computing Student Workshop (ICCSW 2017) | 2018

KubeNow: A Cloud Agnostic Platform for Microservice-Oriented Applications

Marco Capuccini; Anders Larsson; Salman Zubair Toor; Ola Spjuth

KubeNow is a platform for rapid and continuous deployment of microservice-based applications over cloud infrastructure. Within the field of software engineering, the microservice-based architecture is a methodology in which complex applications are divided into smaller, narrower services. These services are independently deployable and compatible with each other like building blocks, and the blocks can be combined in multiple ways according to specific use cases. Microservices are designed around a few concepts: they offer a minimal and complete set of features, they are portable and platform-independent, they are accessible through language-agnostic APIs, and they are encouraged to use standard data formats. These characteristics promote separation of concerns, isolation and interoperability, while coupling nicely with test-driven development. Among many others, some well-known companies that build their software around microservices are Google, Amazon, PayPal Holdings Inc. and Netflix [11].


bioRxiv | 2017

Interoperable and scalable metabolomics data analysis with microservices

Payam Emami Khoonsari; Pablo Moreno; Sven Bergmann; Joachim Burman; Marco Capuccini; Matteo Carone; Marta Cascante; Pedro de Atauri; Carles Foguet; Alejandra Gonzalez-Beltran; Thomas Hankemeier; Kenneth Haug; Sijin He; Stephanie Herman; David Johnson; Namrata Kale; Anders Larsson; Steffen Neumann; Kristian Peters; Luca Pireddu; Philippe Rocca-Serra; Pierrick Roger; Rico Rueedi; Christoph Ruttkies; Noureddin Sadawi; Reza M. Salek; Susanna-Assunta Sansone; Daniel Schober; Vitaly A. Selivanov; Etienne A. Thévenot

Developing a robust and performant data analysis workflow that integrates all necessary components whilst still being able to scale over multiple compute nodes is a challenging task. We introduce a generic method based on the microservice architecture, where software tools are encapsulated as Docker containers that can be connected into scientific workflows and executed in parallel using the Kubernetes container orchestrator. The access point is a virtual research environment which can be launched on demand on cloud resources and desktop computers. IT-expertise requirements on the user side are kept to a minimum, and established workflows can be re-used effortlessly by any novice user. We validate our method in the field of metabolomics on two mass spectrometry studies, one nuclear magnetic resonance spectroscopy study and one fluxomics study, showing that the method scales dynamically with increasing availability of computational resources. We achieved a complete integration of the major software suites, resulting in the first turn-key workflow encompassing all steps for mass-spectrometry-based metabolomics, including preprocessing, multivariate statistics and metabolite identification. Microservices is a generic methodology that can serve any scientific discipline and opens up new types of large-scale integrative science.
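As a rough illustration of the orchestration layer described above, a single containerized analysis step can be submitted to Kubernetes as a batch Job. The manifest below is a hedged sketch: the image, command, and volume names are hypothetical and stand in for whichever containerized metabolomics tool a workflow engine would schedule:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: metabolomics-step
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: feature-detection
          image: example/ms-tool:latest          # hypothetical container image
          command: ["process", "--in", "/data/sample.mzML", "--out", "/data/features.xml"]
          volumeMounts:
            - name: shared-data
              mountPath: /data                   # shared storage between workflow steps
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: workflow-data             # hypothetical claim provisioned by the VRE
```

A workflow engine running in the same cluster can launch one such Job per tool invocation and let Kubernetes handle scheduling and scaling, which is what allows the method to grow with available compute resources.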


Journal of Cheminformatics | 2017

Large-scale virtual screening on public cloud resources with Apache Spark

Marco Capuccini; Laeeq Ahmed; Wesley Schaal; Erwin Laure; Ola Spjuth


arXiv: Distributed, Parallel, and Cluster Computing | 2018

MaRe: Container-Based Parallel Computing with Data Locality.

Marco Capuccini; Salman Zubair Toor; Staffan Arvidsson; Ola Spjuth


arXiv: Distributed, Parallel, and Cluster Computing | 2018

On-Demand Virtual Research Environments using Microservices.

Marco Capuccini; Anders Larsson; Matteo Carone; Jon Ander Novella; Noureddin Sadawi; Jianliang Gao; Salman Zubair Toor; Ola Spjuth

Collaboration


Dive into Marco Capuccini's collaborations.

Top Co-Authors

Erwin Laure (Royal Institute of Technology)