Publication


Featured research published by David Koslicki.


Nature Methods | 2017

Critical assessment of metagenome interpretation − a benchmark of computational metagenomics software

Alexander Sczyrba; Peter Hofmann; Peter Belmann; David Koslicki; Stefan Janssen; Johannes Droege; Ivan Gregor; Stephan Majda; Jessika Fiedler; Eik Dahms; Andreas Bremges; Adrian Fritz; Ruben Garrido-Oter; Tue Sparholt Jørgensen; Nicole Shapiro; Philip D. Blood; Alexey Gurevich; Yang Bai; Dmitrij Turaev; Matthew Z. DeMaere; Rayan Chikhi; Niranjan Nagarajan; Christopher Quince; Fernando Meyer; Monika Balvociute; Lars Hestbjerg Hansen; Søren J. Sørensen; Burton K H Chia; Bertrand Denis; Jeff Froula

Methods for assembly, taxonomic profiling and binning are key to interpreting metagenome data, but a lack of consensus about benchmarking complicates performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from ∼700 newly sequenced microorganisms and ∼600 novel viruses and plasmids and representing common experimental setups. Assembly and genome binning programs performed well for species represented by individual genomes but were substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below family level. Parameter settings markedly affected performance, underscoring their importance for program reproducibility. The CAMI results highlight current challenges but also provide a roadmap for software selection to answer specific research questions.


IEEE Signal Processing Letters | 2014

Sparse Recovery by Means of Nonnegative Least Squares

Simon Foucart; David Koslicki

This letter demonstrates that sparse recovery can be achieved by an L1-minimization ersatz easily implemented using a conventional nonnegative least squares algorithm. A connection with orthogonal matching pursuit is also highlighted. The preliminary results call for more investigations on the potential of the method and on its relations to classical sparse recovery algorithms.
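As a rough illustration of the idea, the sketch below recovers a sparse nonnegative vector from an underdetermined system by solving a single nonnegative least squares problem. It is not the authors' code. The abstract does not give the formulation, so the one used here (a row of ones stacked on top of a scaled measurement matrix, which penalizes the sum of the entries) is an assumption, as are the function names `nnls_projected_gradient` and `sparse_recover`, the `weight` parameter, and the toy matrix. A plain projected-gradient loop stands in for a conventional NNLS routine so that the example needs no third-party packages.

```python
def nnls_projected_gradient(A, y, iterations=20000):
    """Approximately minimize 0.5 * ||A x - y||^2 subject to x >= 0."""
    rows, cols = len(A), len(A[0])
    # Step size 1/L, where L is bounded above by the squared Frobenius norm of A.
    step = 1.0 / sum(a * a for row in A for a in row)
    x = [0.0] * cols
    for _ in range(iterations):
        residual = [sum(A[i][j] * x[j] for j in range(cols)) - y[i]
                    for i in range(rows)]
        gradient = [sum(A[i][j] * residual[i] for i in range(rows))
                    for j in range(cols)]
        # Gradient step followed by projection onto the nonnegative orthant.
        x = [max(0.0, x[j] - step * gradient[j]) for j in range(cols)]
    return x


def sparse_recover(A, y, weight=10.0):
    """Assumed formulation: minimize (sum x)^2 + weight^2 * ||A x - y||^2, x >= 0.

    For x >= 0 the sum of the entries equals the L1 norm, so the extra row
    of ones acts as an L1 penalty inside an ordinary NNLS problem.
    """
    augmented_A = [[1.0] * len(A[0])] + [[weight * a for a in row] for row in A]
    augmented_y = [0.0] + [weight * v for v in y]
    return nnls_projected_gradient(augmented_A, augmented_y)


if __name__ == "__main__":
    # Two measurements, three unknowns; the sparsest nonnegative solution is [0, 0, 1].
    A = [[1.0, 0.0, 1.0],
         [0.0, 1.0, 1.0]]
    y = [1.0, 1.0]
    print(sparse_recover(A, y))
```

On this toy system every vector of the form [1 - t, 1 - t, t] with t between 0 and 1 fits the measurements exactly; the row of ones is what steers the solver toward the one-sparse solution. Because `weight` is finite, the recovered nonzero entry comes out slightly below 1.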


Knowledge Discovery and Data Mining | 2014

On Entropy-Based Data Mining

Andreas Holzinger; Matthias Hörtenhuber; Christopher C. Mayer; Martin Bachler; Siegfried Wassertheurer; Armando J. Pinho; David Koslicki

In the real world, we are confronted not only with complex and high-dimensional data sets, but usually with noisy, incomplete and uncertain data, where the application of traditional methods of knowledge discovery and data mining always entails the danger of modeling artifacts. Originally, information entropy was introduced by Shannon (1949), as a measure of uncertainty in the data. But up to the present, there have emerged many different types of entropy methods with a large number of different purposes and possible application areas. In this paper, we briefly discuss the applicability of entropy methods for use in knowledge discovery and data mining, with particular emphasis on biomedical data. We present a very short overview of the state-of-the-art, with focus on four methods: Approximate Entropy (ApEn), Sample Entropy (SampEn), Fuzzy Entropy (FuzzyEn), and Topological Entropy (FiniteTopEn). Finally, we discuss some open problems and future research challenges.
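To make two of the quantities named above concrete, here is a minimal sketch of Shannon entropy for a symbol sequence and of Sample Entropy for a numeric series. It follows the commonly used textbook definitions rather than anything specified in the abstract: the function names, the default `m` and `r`, and the use of an absolute tolerance `r` (in practice `r` is often set relative to the standard deviation of the series) are all assumptions.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy, in bits, of the empirical symbol distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def sample_entropy(series, m=2, r=0.2):
    """Sample Entropy, -ln(A / B), using Chebyshev distance and tolerance r.

    B counts pairs of length-m templates within distance r of each other,
    A counts the same for length m + 1; self-matches are excluded.
    """
    n = len(series)

    def count_matches(length):
        # The same n - m starting positions are used for both template lengths.
        templates = [series[i:i + length] for i in range(n - m)]
        matches = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                distance = max(abs(a - b)
                               for a, b in zip(templates[i], templates[j]))
                if distance <= r:
                    matches += 1
        return matches

    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        return float("inf")  # undefined when no matches are found
    return -math.log(a / b)


if __name__ == "__main__":
    print(shannon_entropy("AACC"))                      # two equally likely symbols
    print(sample_entropy([0.0, 1.0] * 10, m=2, r=0.1))  # perfectly regular series
```

A perfectly periodic series yields a Sample Entropy of zero because every length-m match extends to a length-(m + 1) match; less regular series give larger values.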


Bioinformatics | 2013

Quikr: a Method for Rapid Reconstruction of Bacterial Communities via Compressive Sensing

David Koslicki; Simon Foucart; Gail Rosen

MOTIVATION Many metagenomic studies compare hundreds to thousands of environmental and health-related samples by extracting and sequencing their 16S rRNA amplicons and measuring their similarity using beta-diversity metrics. However, one of the first steps, classifying the operational taxonomic units within the sample, can be a computationally time-consuming task because most methods rely on computing the taxonomic assignment of each individual read out of tens to hundreds of thousands of reads. RESULTS We introduce Quikr: a QUadratic, K-mer-based, Iterative, Reconstruction method, which computes a vector of taxonomic assignments and their proportions in the sample using an optimization technique motivated by the mathematical theory of compressive sensing. On both simulated and actual biological data, we demonstrate that Quikr typically has less error and is typically orders of magnitude faster than the most commonly used taxonomic assignment technique (the Ribosomal Database Project's Naïve Bayesian Classifier). Furthermore, the technique is shown to be unaffected by the presence of chimeras, thereby allowing for the circumvention of the time-intensive step of chimera filtering. AVAILABILITY The Quikr computational package (in MATLAB, Octave, Python and C) for the Linux and Mac platforms is available at http://sourceforge.net/projects/quikr/.
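The abstract does not describe the computation in detail, but a k-mer-based method of this kind has to summarize a whole sample as a vector of k-mer frequencies instead of classifying reads one at a time. The sketch below shows only that counting step; it is not taken from the Quikr package, and the function name `kmer_frequencies`, the default `k`, and the choice to skip k-mers containing ambiguous bases are assumptions.

```python
from itertools import product


def kmer_frequencies(sequences, k=2, alphabet="ACGT"):
    """Return the normalized k-mer frequency vector of a collection of sequences.

    Entries are ordered lexicographically over the alphabet (AA, AC, AG, ...).
    """
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            if kmer in index:  # skip k-mers with ambiguous bases such as N
                counts[index[kmer]] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


if __name__ == "__main__":
    reads = ["ACGT", "ACGN"]
    vector = kmer_frequencies(reads, k=2)
    print(len(vector), sum(vector))
```

Once the sample and every reference sequence are expressed as such vectors, the sample vector can be modeled as a nonnegative mixture of the reference vectors, and the mixture weights play the role of the proportions mentioned in the abstract.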


PLOS ONE | 2014

WGSQuikr: Fast Whole-Genome Shotgun Metagenomic Classification

David Koslicki; Simon Foucart; Gail Rosen

With the decrease in cost and increase in output of whole-genome shotgun technologies, many metagenomic studies are utilizing this approach in lieu of the more traditional 16S rRNA amplicon technique. Due to the large number of relatively short reads output from whole-genome shotgun technologies, there is a need for fast and accurate short-read OTU classifiers. While there are relatively fast and accurate algorithms available, such as MetaPhlAn, MetaPhyler, PhyloPythiaS, and PhymmBL, these algorithms still classify samples in a read-by-read fashion and so execution times can range from hours to days on large datasets. We introduce WGSQuikr, a reconstruction method which can compute a vector of taxonomic assignments and their proportions in the sample with remarkable speed and accuracy. We demonstrate on simulated data that WGSQuikr is typically more accurate and up to an order of magnitude faster than the aforementioned classification algorithms. We also verify the utility of WGSQuikr on real biological data in the form of a mock community. WGSQuikr is a Whole-Genome Shotgun QUadratic, Iterative, K-mer based Reconstruction method which extends the previously introduced 16S rRNA-based algorithm Quikr. A MATLAB implementation of WGSQuikr is available at: http://sourceforge.net/projects/wgsquikr.


Bioinformatics | 2014

SEK: sparsity exploiting k-mer-based estimation of bacterial community composition

Saikat Chatterjee; David Koslicki; Siyuan Dong; Nicolas Innocenti; Lu Cheng; Yueheng Lan; Mikko Vehkaperä; Mikael Skoglund; Lars Kildehöj Rasmussen; Erik Aurell; Jukka Corander

MOTIVATION Estimation of bacterial community composition from a high-throughput sequenced sample is an important task in metagenomics applications. As the sample sequence data typically harbors reads of variable lengths and different levels of biological and technical noise, accurate statistical analysis of such data is challenging. Currently popular estimation methods are typically time-consuming in a desktop computing environment. RESULTS Using sparsity enforcing methods from the general sparse signal processing field (such as compressed sensing), we derive a solution to the community composition estimation problem by a simultaneous assignment of all sample reads to a pre-processed reference database. A general statistical model based on kernel density estimation techniques is introduced for the assignment task, and the model solution is obtained using convex optimization tools. Further, we design a greedy algorithm solution for a fast solution. Our approach offers a reasonably fast community composition estimation method, which is shown to be more robust to input data variation than a recently introduced related method. AVAILABILITY AND IMPLEMENTATION A platform-independent Matlab implementation of the method is freely available at http://www.ee.kth.se/ctsoftware; source code that does not require access to Matlab is currently being tested and will be made available later through the above Web site.


bioRxiv | 2016

MetaPalette: a k-mer Painting Approach for Metagenomic Taxonomic Profiling and Quantification of Novel Strain Variation

David Koslicki; Daniel Falush

ABSTRACT Metagenomic profiling is challenging in part because of the highly uneven sampling of the tree of life by genome sequencing projects and the limitations imposed by performing phylogenetic inference at fixed taxonomic ranks. We present the algorithm MetaPalette, which uses long k-mer sizes (k = 30, 50) to fit a k-mer “palette” of a given sample to the k-mer palette of reference organisms. By modeling the k-mer palettes of unknown organisms, the method also gives an indication of the presence, abundance, and evolutionary relatedness of novel organisms present in the sample. The method returns a traditional, fixed-rank taxonomic profile which is shown on independently simulated data to be one of the most accurate to date. Tree figures are also returned that quantify the relatedness of novel organisms to reference sequences, and the accuracy of such figures is demonstrated on simulated spike-ins and a metagenomic soil sample. The software implementing MetaPalette is available at: https://github.com/dkoslicki/MetaPalette. Pretrained databases are included for Archaea, Bacteria, Eukaryota, and viruses. IMPORTANCE Taxonomic profiling is a challenging first step when analyzing a metagenomic sample. This work presents a method that facilitates fine-scale characterization of the presence, abundance, and evolutionary relatedness of organisms present in a given sample but absent from the training database. We calculate a “k-mer palette” which summarizes the information from all reads, not just those in conserved genes or containing taxon-specific markers. The compositions of palettes are easy to model, allowing rapid inference of community composition. In addition to providing strain-level information where applicable, our approach provides taxonomic profiles that are more accurate than those of competing methods.


bioRxiv | 2016

Total RNA Sequencing reveals microbial communities in human blood and disease specific effects

Serghei Mangul; Loes M. Olde Loohuis; Anil Ori; Guillaume Jospin; David Koslicki; Harry Taegyun Yang; Timothy Wu; Marco P. Boks; Catherine Lomen-Hoerth; Martina Wiedau-Pazos; Rita M. Cantor; Willem M. de Vos; René S. Kahn; Eleazar Eskin; Roel A. Ophoff

The role of the human microbiome in health and disease is increasingly appreciated. We studied the composition of microbial communities present in blood across 192 individuals, including healthy controls and patients with three disorders affecting the brain: schizophrenia, amyotrophic lateral sclerosis and bipolar disorder. By using high quality unmapped RNA sequencing reads as candidate microbial reads, we performed profiling of microbial transcripts detected in whole blood. We were able to detect a wide range of bacterial and archaeal phyla in blood. Interestingly, we observed an increased microbial diversity in schizophrenia patients compared to the three other groups. We replicated this finding in an independent schizophrenia case-control cohort. This increased diversity is inversely correlated with estimated cell abundance of a subpopulation of CD8+ memory T cells in healthy controls, supporting a link between microbial products found in blood, immunity and schizophrenia.


PLOS ONE | 2015

ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition

David Koslicki; Saikat Chatterjee; Damon Shahrivar; Alan W. Walker; Suzanna C. Francis; Louise Fraser; Mikko Vehkaperä; Yueheng Lan; Jukka Corander

Motivation Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging. Results There has been a recent surge of interest in using compressed sensing inspired and convex-optimization based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order k-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing approach where we use a standard K-means clustering algorithm that partitions a large set of reads into subsets with reasonable computational cost to provide several vectors of first order statistics instead of only single statistical summarization in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity. Availability An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.
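The aggregation step described above can be sketched roughly as follows: cluster per-read feature vectors with standard K-means, then keep one mean vector per cluster together with the fraction of reads it represents. This is an illustrative sketch, not the ARK implementation. The function names, the seeding of centroids from the first `k` points, and the use of cluster-size fractions as weights are assumptions; the abstract only says that the clustering output is processed further to obtain the final estimate.

```python
def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def aggregate(points, k):
    """Return one mean vector per cluster and the fraction of points in it."""
    centroids, labels = kmeans(points, k)
    weights = [labels.count(c) / len(points) for c in range(k)]
    return centroids, weights


if __name__ == "__main__":
    # Hypothetical per-read feature vectors forming two well separated groups.
    vectors = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
    print(aggregate(vectors, k=2))
```

Under these assumptions, a composition estimator would be run once per cluster mean and the per-cluster estimates combined using the weights, instead of being run once on a single summary vector for the whole sample.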

Collaboration


Top Co-Authors

Serghei Mangul
University of California

Eleazar Eskin
University of California

Jeff Froula
Joint Genome Institute

Philip D. Blood
Pittsburgh Supercomputing Center

Rayan Chikhi
Pennsylvania State University