Oliver Rübel

Lawrence Berkeley National Laboratory

Publication

Featured research published by Oliver Rübel.

Lawrence Berkeley National Laboratory | 2009

FastBit: interactively searching massive data

Kesheng Wu; Sean Ahern; Edward W Bethel; Jacqueline H. Chen; Hank Childs; E. Cormier-Michel; Cameron Geddes; Junmin Gu; Hans Hagen; Bernd Hamann; Wendy S. Koegler; Jerome Lauret; Jeremy S. Meredith; Peter Messmer; Ekow J. Otoo; V Perevoztchikov; A. M. Poskanzer; Prabhat; Oliver Rübel; Arie Shoshani; Alexander Sim; Kurt Stockinger; Gunther H. Weber; W. M. Zhang

As scientific instruments and computer simulations produce more and more data, the task of locating the essential information to gain insight becomes increasingly difficult. FastBit is an efficient software tool to address this challenge. In this article, we present a summary of the key underlying technologies, namely bitmap compression, encoding, and binning. Together these techniques enable FastBit to answer structured (SQL) queries orders of magnitude faster than popular database systems. To illustrate how FastBit is used in applications, we present three examples involving a high-energy physics experiment, a combustion simulation, and an accelerator simulation. In each case, FastBit significantly reduces the response time and enables interactive exploration on terabytes of data.
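As a rough illustration of the binning and bitmap ideas named in this abstract, the Python sketch below builds one bitmap per value bin and answers a range query by OR-ing the bitmaps of fully covered bins, checking raw values only for rows in partially covered edge bins. It is a toy under stated assumptions, not FastBit's implementation or API: the names `BinnedBitmapIndex` and `query_range` are hypothetical, Python integers stand in for bitmaps, and bitmap compression is omitted.

```python
from bisect import bisect_right


class BinnedBitmapIndex:
    """Toy binned bitmap index (hypothetical; not the FastBit API)."""

    def __init__(self, values, edges):
        # edges are sorted bin boundaries; bin i covers [edges[i], edges[i+1])
        self.values = list(values)
        self.edges = list(edges)
        self.bitmaps = [0] * (len(self.edges) - 1)
        for row, v in enumerate(self.values):
            # set bit `row` in the bitmap of the bin that contains v
            self.bitmaps[self._bin_of(v)] |= 1 << row

    def _bin_of(self, v):
        b = bisect_right(self.edges, v) - 1
        if not 0 <= b < len(self.bitmaps):
            raise ValueError(f"value {v} is outside the binned range")
        return b

    def query_range(self, lo, hi):
        """Return the sorted row ids whose value satisfies lo <= value < hi."""
        hits = 0        # rows in bins fully inside the query range
        candidates = 0  # rows in bins that only partially overlap it
        for i, bitmap in enumerate(self.bitmaps):
            left, right = self.edges[i], self.edges[i + 1]
            if right <= lo or left >= hi:
                continue  # bin does not overlap the query at all
            if lo <= left and right <= hi:
                hits |= bitmap
            else:
                candidates |= bitmap

        rows = []
        combined = hits | candidates
        row = 0
        while combined:
            if combined & 1:
                # rows from fully covered bins need no check against raw data
                if (hits >> row) & 1 or lo <= self.values[row] < hi:
                    rows.append(row)
            combined >>= 1
            row += 1
        return rows


index = BinnedBitmapIndex([0.5, 3.2, 7.9, 4.4, 9.1, 2.0], [0, 2, 4, 6, 8, 10])
print(index.query_range(2, 8))
print(index.query_range(3, 5))
```

The first query aligns with bin boundaries and is answered from bitmaps alone; the second touches two edge bins, so only the rows in those bins are compared against the raw values.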

IEEE International Conference on High Performance Computing Data and Analytics | 2011

Parallel index and query for large scale data analysis

Jerry Chi-Yuan Chou; Mark Howison; Brian Austin; Kesheng Wu; Ji Qiang; E. Wes Bethel; Arie Shoshani; Oliver Rübel; Prabhat; Robert D. Ryne

Modern scientific datasets present numerous data management and analysis challenges. State-of-the-art index and query technologies are critical for facilitating interactive exploration of large datasets, but numerous challenges remain in terms of designing a system for processing general scientific datasets. The system needs to be able to run on distributed multi-core platforms, efficiently utilize underlying I/O infrastructure, and scale to massive datasets. We present FastQuery, a novel software framework that addresses these challenges. FastQuery utilizes a state-of-the-art index and query technology (FastBit) and is designed to process massive datasets on modern supercomputing platforms. We apply FastQuery to the processing of a massive 50TB dataset generated by a large scale accelerator modeling code. We demonstrate the scalability of the tool to 11,520 cores. Motivated by the scientific need to search for interesting particles in this dataset, we use our framework to reduce search time from hours to tens of seconds.

Analytical Chemistry | 2013

OpenMSI: A High-Performance Web-Based Platform for Mass Spectrometry Imaging

Oliver Rübel; Annette M. Greiner; Shreyas Cholia; Katherine Louie; E. Wes Bethel; Trent R. Northen; Benjamin P. Bowen

Mass spectrometry imaging (MSI) enables researchers to probe endogenous molecules directly within the architecture of the biological matrix. Unfortunately, efficient access, management, and analysis of the data generated by MSI approaches remain major challenges to this rapidly developing field. Despite the availability of numerous dedicated file formats and software packages, it is a widely held viewpoint that the biggest challenge is simply opening, sharing, and analyzing a file without loss of information. Here we present OpenMSI, a software framework and platform that addresses these challenges via an advanced, high-performance, extensible file format and Web API for remote data access (http://openmsi.nersc.gov). The OpenMSI file format supports storage of raw MSI data, metadata, and derived analyses in a single, self-describing format based on HDF5 and is supported by a large range of analysis software (e.g., Matlab and R) and programming languages (e.g., C++, Fortran, and Python). Careful optimization of the storage layout of MSI data sets using chunking, compression, and data replication accelerates common, selective data access operations while minimizing data storage requirements and is a critical enabler of rapid data I/O. The OpenMSI file format has been shown to provide >2000-fold improvement for image access operations, enabling spectrum and image retrieval in less than 0.3 s across the Internet even for 50 GB MSI data sets. To make remote high-performance compute resources accessible for analysis and to facilitate data sharing and collaboration, we describe an easy-to-use yet powerful Web API, enabling fast and convenient access to MSI data, metadata, and derived analysis results stored remotely to facilitate high-performance data analysis and enable implementation of Web-based data sharing, visualization, and analysis.

IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2010

Integrating Data Clustering and Visualization for the Analysis of 3D Gene Expression Data

Oliver Rübel; Gunther H. Weber; Min-Yu Huang; E.W. Bethel; Mark D. Biggin; Charless C. Fowlkes; C.L. Luengo Hendriks; Soile V.E. Keranen; Michael B. Eisen; David W. Knowles; Jitendra Malik; Hans Hagen; Bernd Hamann

The recent development of methods for extracting precise measurements of spatial gene expression patterns from three-dimensional (3D) image data opens the way for new analyses of the complex gene regulatory networks controlling animal development. We present an integrated visualization and analysis framework that supports user-guided data clustering to aid exploration of these new complex data sets. The interplay of data visualization and clustering-based data classification leads to improved visualization and enables a more detailed analysis than previously possible. We discuss 1) the integration of data clustering and visualization into one framework, 2) the application of data clustering to 3D gene expression data, 3) the evaluation of the number of clusters k in the context of 3D gene expression clustering, and 4) the improvement of overall analysis quality via dedicated postprocessing of clustering results based on visualization. We discuss the use of this framework to objectively define spatial pattern boundaries and temporal profiles of genes and to analyze how mRNA patterns are controlled by their regulatory transcription factors.

Journal of Natural Products | 2015

Dirigent Protein-Mediated Lignan and Cyanogenic Glucoside Formation in Flax Seed: Integrated Omics and MALDI Mass Spectrometry Imaging

Doralyn S. Dalisay; Kye Won Kim; Choonseok Lee; Hong Yang; Oliver Rübel; Benjamin P. Bowen; Laurence B. Davin; Norman G. Lewis

An integrated omics approach using genomics, transcriptomics, metabolomics (MALDI mass spectrometry imaging, MSI), and bioinformatics was employed to study spatiotemporal formation and deposition of health-protecting polymeric lignans and plant defense cyanogenic glucosides. Intact flax (Linum usitatissimum) capsules and seed tissues at different development stages were analyzed. Transcriptome analyses indicated distinct expression patterns of dirigent protein (DP) gene family members encoding (-)- and (+)-pinoresinol-forming DPs and their associated downstream metabolic processes, respectively, with the former expressed at early seed coat development stages. Genes encoding (+)-pinoresinol-forming DPs were, in contrast, expressed at later development stages. Recombinant DP expression and DP assays also unequivocally established their distinct stereoselective biochemical functions. Using MALDI MSI and ion mobility separation analyses, the pinoresinol downstream derivatives, secoisolariciresinol diglucoside (SDG) and SDG hydroxymethylglutaryl ester, were localized and detectable only in early seed coat development stages. SDG derivatives were then converted into higher molecular weight phenolics during seed coat maturation. By contrast, the plant defense cyanogenic glucosides, the monoglucosides linamarin/lotaustralin, were detected throughout the flax capsule, whereas diglucosides linustatin/neolinustatin only accumulated in endosperm and embryo tissues. A putative biosynthetic pathway to the cyanogens is proposed on the basis of transcriptome coexpression data. Localization of all metabolites was at ca. 20 μm resolution, with the web-based tool OpenMSI enabling not only resolution enhancement but also an interactive system for real-time searching for any ion in the tissue under analysis.

IEEE International Conference on High Performance Computing Data and Analytics | 2008

High performance multivariate visual data exploration for extremely large data

Oliver Rübel; Prabhat; Kesheng Wu; Hank Childs; Jeremy S. Meredith; Cameron Geddes; E. Cormier-Michel; Sean Ahern; Gunther H. Weber; Peter Messmer; Hans Hagen; Bernd Hamann; E. Wes Bethel

One of the central challenges in modern science is the need to quickly derive knowledge and understanding from large, complex collections of data. We present a new approach that deals with this challenge by combining and extending techniques from high performance visual data analysis and scientific data management. This approach is demonstrated within the context of gaining insight from complex, time-varying datasets produced by a laser wakefield accelerator simulation. Our approach leverages histogram-based parallel coordinates both for visual information display and as a vehicle for guiding a data mining operation. Data extraction and subsetting are implemented with state-of-the-art index/query technology. This approach, while applied here to accelerator science, is generally applicable to a broad set of science applications, and is implemented in a production-quality visual data analysis infrastructure. We conduct a detailed performance analysis and demonstrate good scalability on a distributed memory Cray XT4 system.
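The idea behind histogram-based parallel coordinates can be sketched as follows: instead of drawing one polyline per record, each pair of adjacent axes is summarized by a 2D histogram, and a renderer would then draw one shaded band per non-empty bin pair. The Python below is a minimal sketch of that summarization step only, assuming uniform bins over each variable's observed range; the function names are hypothetical and do not come from the paper or its software.

```python
from collections import Counter


def bin_index(value, lo, hi, nbins):
    """Map a value in [lo, hi] to a uniform bin index in 0..nbins-1."""
    if hi == lo:
        return 0
    b = int((value - lo) / (hi - lo) * nbins)
    return min(max(b, 0), nbins - 1)  # clamp so value == hi lands in the last bin


def parallel_coordinate_histograms(records, nbins):
    """Build one 2D histogram per pair of adjacent axes.

    records: list of equal-length numeric tuples (one tuple per data record).
    Returns a list of Counters; entry d maps (bin on axis d, bin on axis d+1)
    to the number of records falling into that bin pair.
    """
    ndim = len(records[0])
    lows = [min(r[d] for r in records) for d in range(ndim)]
    highs = [max(r[d] for r in records) for d in range(ndim)]

    histograms = []
    for d in range(ndim - 1):
        counts = Counter()
        for r in records:
            left = bin_index(r[d], lows[d], highs[d], nbins)
            right = bin_index(r[d + 1], lows[d + 1], highs[d + 1], nbins)
            counts[(left, right)] += 1
        histograms.append(counts)
    return histograms


records = [(0.0, 10.0, 1.0), (0.1, 10.5, 5.0), (1.0, 20.0, 5.0), (0.9, 19.0, 1.0)]
for axis, hist in enumerate(parallel_coordinate_histograms(records, nbins=2)):
    print(axis, dict(hist))
```

The size of the result depends on the number of bins rather than the number of records, which is what makes this representation attractive for very large datasets.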

IEEE International Conference on High Performance Computing Data and Analytics | 2012

Parallel I/O, analysis, and visualization of a trillion particle simulation

Surendra Byna; J. Chou; Oliver Rübel; Prabhat; H. Karimabadi; W. S. Daughton; V. Roytershteyn; E. W. Bethel; Mark Howison; Ke-Jou Hsu; Kuan-Wu Lin; Arie Shoshani; A. Uselton; Kesheng Wu

Petascale plasma physics simulations have recently entered the regime of simulating trillions of particles. These unprecedented simulations generate massive amounts of data, posing significant challenges in storage, analysis, and visualization. In this paper, we present parallel I/O, analysis, and visualization results from a VPIC trillion particle simulation running on 120,000 cores, which produces ~30TB of data for a single timestep. We demonstrate the successful application of H5Part, a particle data extension of parallel HDF5, for writing the dataset at a significant fraction of system peak I/O rates. To enable efficient analysis, we develop hybrid parallel FastQuery to index and query data using multi-core CPUs on distributed memory hardware. We show good scalability results for the FastQuery implementation using up to 10,000 cores. Finally, we apply this indexing/query-driven approach to facilitate the first-ever analysis and visualization of the trillion particle dataset.

Analytical Chemistry | 2015

Identifying important ions and positions in mass spectrometry imaging data using CUR matrix decompositions

Jiyan Yang; Oliver Rübel; Prabhat; Michael W. Mahoney; Benjamin P. Bowen

Mass spectrometry imaging enables label-free, high-resolution spatial mapping of the chemical composition of complex, biological samples. Typical experiments require selecting ions and/or positions from the images: ions for fragmentation studies to identify keystone compounds and positions for follow-up validation measurements using microdissection or other orthogonal techniques. Unfortunately, with modern imaging machines, these must be selected from an overwhelming amount of raw data. Existing techniques to reduce the volume of data, the most popular of which are principal component analysis and non-negative matrix factorization, have the disadvantage that they return difficult-to-interpret linear combinations of actual data elements. In this work, we show that CX and CUR matrix decompositions can be used directly to address this selection need. CX and CUR matrix decompositions use empirical statistical leverage scores of the input data to provide provably good low-rank approximations of the measured data that are expressed in terms of actual ions and actual positions, as opposed to difficult-to-interpret eigenions and eigenpositions. We show that this leads to effective prioritization of information for both ions and positions. In particular, important ions can be found either by using the leverage scores as a ranking function and using a deterministic greedy selection algorithm or by using the leverage scores as an importance sampling distribution and using a random sampling algorithm; however, selection of important positions from the original matrix performed significantly better when they were chosen with the random sampling algorithm. Also, we show that 20 ions or 40 locations can be used to reconstruct the original matrix to a tolerance of 17% error for a widely studied image of brain lipids; and we provide a scalable implementation of this method that is applicable for analysis of the raw data where there are often more than a million rows and/or columns, which is larger than SVD-based low-rank approximation methods can handle. These results introduce the concept of CX/CUR matrix factorizations to mass spectrometry imaging, describing their utility and illustrating principled algorithmic approaches to deal with the overwhelming amount of data generated by modern mass spectrometry imaging.
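To make the two selection strategies in this abstract concrete, the Python sketch below computes column leverage scores and then picks columns either deterministically (highest scores first) or by importance sampling. It is a deliberately simplified illustration, not the paper's scalable implementation: it assumes a small, dense, non-negative matrix (rows as positions, columns as ions), uses only rank-1 leverage scores obtained by power iteration rather than rank-k scores from a truncated SVD, and all function names are hypothetical.

```python
import math
import random


def top_right_singular_vector(A, iters=200):
    """Power iteration on A^T A. Returns a unit vector with one entry per column.

    Assumes non-negative data, so the uniform start vector is not orthogonal
    to the dominant singular vector.
    """
    n_cols = len(A[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iters):
        Av = [sum(a * x for a, x in zip(row, v)) for row in A]      # A v
        w = [sum(A[i][j] * Av[i] for i in range(len(A)))            # A^T (A v)
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # all-zero matrix: keep the uniform vector
        v = [x / norm for x in w]
    return v


def column_leverage_scores(A, iters=200):
    """Rank-1 leverage scores: squared entries of the top right singular vector."""
    return [x * x for x in top_right_singular_vector(A, iters)]


def select_columns_greedy(A, c):
    """Deterministic selection: the c columns with the largest leverage scores."""
    scores = column_leverage_scores(A)
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(order[:c])


def sample_columns(A, c, seed=0):
    """Random selection: draw c column ids (with replacement) using the
    leverage scores as the importance sampling distribution."""
    scores = column_leverage_scores(A)
    return random.Random(seed).choices(range(len(scores)), weights=scores, k=c)


# Hypothetical intensities: 3 positions (rows) by 4 ions (columns).
A = [[10.0, 0.1, 5.0, 0.2],
     [20.0, 0.2, 9.0, 0.1],
     [30.0, 0.1, 16.0, 0.3]]
print(select_columns_greedy(A, 2))
print(sample_columns(A, 3))
```

Because the selected columns are actual ions (and, applied to the transpose, actual positions), the result stays directly interpretable, which is the property the abstract contrasts with eigenions and eigenpositions.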

Journal of Trading | 2012

Federal Market Information Technology in the Post–Flash Crash Era: Roles for Supercomputing

E. Wes Bethel; David Leinweber; Oliver Rübel; Kesheng Wu

This article describes collaborative work between active traders, regulators, economists, and supercomputing researchers to replicate and extend investigations of the Flash Crash and other market anomalies in a National Laboratory high-performance computing (HPC) environment. Our work suggests that supercomputing tools and methods will be valuable to market regulators in achieving the goals of market safety, stability, and security. Research results using high-frequency data and analytics are described, and directions for future development are discussed. Currently, the key mechanism for preventing catastrophic market action is “circuit breakers.” We believe a more graduated approach, similar to the “yellow light” approach in motorsports to slow down traffic, might be a better way to achieve the same goal. To enable this objective, we study a number of indicators that could foresee hazards in market conditions and explore options to confirm such predictions. Our tests confirm that volume-synchronized probability of informed trading (VPIN) and a version of the volume Herfindahl–Hirschman index (HHI) for measuring market fragmentation could have indeed given strong signals ahead of the Flash Crash event on May 6, 2010. This is a preliminary step toward a full-fledged early-warning system for unusual market conditions.
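For readers unfamiliar with the two indicators named here, the Python sketch below shows their textbook forms: the HHI as the sum of squared volume shares, and VPIN as the total absolute buy/sell imbalance over equal-volume buckets divided by the total volume. The abstract refers to "a version of" the volume HHI, so the exact variant used in the article may differ; the function names, the per-venue input layout, and the assumption that trades have already been classified into buy and sell volume are all illustrative.

```python
def volume_hhi(volumes):
    """Herfindahl-Hirschman index of trading volume.

    volumes: traded volume per venue (an assumed input layout).
    Returns a value between 1/n (evenly fragmented across n venues)
    and 1.0 (all volume on a single venue).
    """
    total = sum(volumes)
    if total <= 0:
        raise ValueError("total volume must be positive")
    return sum((v / total) ** 2 for v in volumes)


def vpin(buckets):
    """Volume-synchronized probability of informed trading.

    buckets: list of (buy_volume, sell_volume) pairs, one per equal-volume
    bucket; buy/sell classification is assumed to have been done upstream.
    Returns the summed absolute imbalance divided by the total volume.
    """
    total = sum(buy + sell for buy, sell in buckets)
    if total <= 0:
        raise ValueError("total volume must be positive")
    return sum(abs(buy - sell) for buy, sell in buckets) / total


print(volume_hhi([100, 100, 100, 100]))  # evenly fragmented market
print(volume_hhi([400, 0, 0, 0]))        # fully concentrated market
print(vpin([(75, 25), (50, 50)]))        # mild order-flow imbalance
```

Both functions return values in a bounded range, which makes them convenient to monitor against thresholds in an early-warning setting.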

International Conference on Conceptual Structures | 2010

Coupling visualization and data analysis for knowledge discovery from multi-dimensional scientific data

Oliver Rübel; Sean Ahern; E. Wes Bethel; Mark D. Biggin; Hank Childs; E. Cormier-Michel; Angela H. DePace; Michael B. Eisen; Charless C. Fowlkes; Cameron Geddes; Hans Hagen; Bernd Hamann; Min-Yu Huang; Soile V.E. Keranen; David W. Knowles; Chris L. Luengo Hendriks; Jitendra Malik; Jeremy S. Meredith; Peter Messmer; Prabhat; Daniela Ushizima; Gunther H. Weber; Kesheng Wu

Knowledge discovery from large and complex scientific data is a challenging task. With the ability to measure and simulate more processes at increasingly finer spatial and temporal scales, the growing number of data dimensions and data objects presents tremendous challenges for effective data analysis and data exploration methods and tools. The combination and close integration of methods from scientific visualization, information visualization, automated data analysis, and other enabling technologies, such as efficient data management, supports knowledge discovery from multi-dimensional scientific data. This paper surveys two distinct applications in developmental biology and accelerator physics, illustrating the effectiveness of the described approach.
