Mark Murphy
University of California, Berkeley
Publications
Featured research published by Mark Murphy.
ieee international conference on high performance computing data and analytics | 2008
Kaushik Datta; Mark Murphy; Vasily Volkov; Samuel Williams; Jonathan Carter; Leonid Oliker; David A. Patterson; John Shalf; Katherine A. Yelick
Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions faced by the mainstream and scientific computing industries in several decades. Our work explores multicore stencil (nearest-neighbor) computations, a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime, while maximizing performance portability. To evaluate the effectiveness of these strategies we explore the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural tradeoffs of emerging multicore designs and their implications on scientific algorithm development.
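As a rough illustration of what a stencil sweep and a parameter search look like, here is a minimal pure-Python sketch. It is not the authors' code: the 5-point averaging stencil, the `block` tile size, and the `autotune` helper are all assumptions chosen for illustration, and a real auto-tuner would search many more optimizations on compiled kernels.

```python
import time


def stencil_sweep(grid, block=None):
    """One Jacobi sweep of a 5-point nearest-neighbor stencil on a 2D grid.

    Each interior point becomes the average of its four neighbors; boundary
    values are copied unchanged. `block` is a hypothetical tuning parameter:
    the tile edge length used when traversing the interior.
    """
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    b = block or max(n, m)  # no blocking: one tile covers the whole grid
    for i0 in range(1, n - 1, b):
        for j0 in range(1, m - 1, b):
            for i in range(i0, min(i0 + b, n - 1)):
                for j in range(j0, min(j0 + b, m - 1)):
                    out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                        + grid[i][j - 1] + grid[i][j + 1])
    return out


def autotune(grid, candidates=(None, 4, 8, 16)):
    """Time one sweep per candidate block size and return the fastest."""
    best, best_time = None, float("inf")
    for b in candidates:
        start = time.perf_counter()
        stencil_sweep(grid, b)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = b, elapsed
    return best
```

Blocking changes only the traversal order, so every candidate must produce the same result as the unblocked sweep; the search then picks whichever order runs fastest on the machine at hand.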
Magnetic Resonance in Medicine | 2014
Martin Uecker; Peng Lai; Mark Murphy; Patrick Virtue; Michael Elad; John M. Pauly; Shreyas S. Vasanawala; Michael Lustig
Parallel imaging allows the reconstruction of images from undersampled multicoil data. The two main approaches are: SENSE, which explicitly uses coil sensitivities, and GRAPPA, which makes use of learned correlations in k‐space. The purpose of this work is to clarify their relationship and to develop and evaluate an improved algorithm.
IEEE Transactions on Medical Imaging | 2012
Mark Murphy; Marcus T. Alley; James Demmel; Kurt Keutzer; Shreyas S. Vasanawala; Michael Lustig
We present l1-SPIRiT, a simple algorithm for autocalibrating parallel imaging (acPI) and compressed sensing (CS) that permits an efficient implementation with clinically-feasible runtimes. We propose a CS objective function that minimizes cross-channel joint sparsity in the wavelet domain. Our reconstruction minimizes this objective via iterative soft-thresholding, and integrates naturally with iterative self-consistent parallel imaging (SPIRiT). Like many iterative magnetic resonance imaging reconstructions, l1-SPIRiT's image quality comes at a high computational cost. Excessively long runtimes are a barrier to the clinical use of any reconstruction approach, and thus we discuss our approach to efficiently parallelizing l1-SPIRiT and to achieving clinically-feasible runtimes. We present parallelizations of l1-SPIRiT for both multi-GPU systems and multi-core CPUs, and discuss the software optimization and parallelization decisions made in our implementation. The performance of these alternatives depends on the processor architecture, the size of the image matrix, and the number of parallel imaging channels. Fundamentally, achieving fast runtime requires the correct trade-off between cache usage and parallelization overheads. We demonstrate image quality via a case from our clinical experimentation, using a custom 3DFT spoiled gradient echo (SPGR) sequence with up to 8× acceleration via Poisson-disc undersampling in the two phase-encoded directions.
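The soft-thresholding step with cross-channel joint sparsity can be sketched as follows. This is a generic group soft-threshold, written under the assumption that coefficients at the same wavelet position are shrunk together by their combined l2 norm across channels; the function name, the `coeffs[channel][position]` layout, and the threshold `lam` are illustrative and not taken from the paper's implementation.

```python
import math


def joint_soft_threshold(coeffs, lam):
    """Shrink wavelet coefficients jointly across channels.

    coeffs[c][k] is coefficient k of channel c (real or complex).
    For each position k, the l2 norm over all channels is reduced by
    `lam` (floored at zero) and every channel is scaled by the same factor.
    """
    n_channels, n_coeffs = len(coeffs), len(coeffs[0])
    out = [[0.0] * n_coeffs for _ in range(n_channels)]
    for k in range(n_coeffs):
        norm = math.sqrt(sum(abs(coeffs[c][k]) ** 2 for c in range(n_channels)))
        scale = max(norm - lam, 0.0) / norm if norm > 0 else 0.0
        for c in range(n_channels):
            out[c][k] = scale * coeffs[c][k]
    return out
```

Because every channel at a given position shares one scale factor, a coefficient survives only if the channels together carry enough energy there, which is what makes the sparsity joint rather than per-channel.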
international conference on computer vision | 2009
Bryan Catanzaro; Bor-Yiing Su; Narayanan Sundaram; Yunsup Lee; Mark Murphy; Kurt Keutzer
Image contour detection is fundamental to many image analysis applications, including image segmentation, object recognition and classification. However, highly accurate image contour detection algorithms are also very computationally intensive, which limits their applicability, even for offline batch processing. In this work, we examine efficient parallel algorithms for performing image contour detection, with particular attention paid to local image analysis as well as the generalized eigensolver used in Normalized Cuts. Combining these algorithms into a contour detector, along with careful implementation on highly parallel, commodity processors from Nvidia, our contour detector provides uncompromised contour accuracy, with an F-metric of 0.70 on the Berkeley Segmentation Dataset. Runtime is reduced from 4 minutes to 1.8 seconds. The efficiency gains we realize enable high-quality image contour detection on much larger images than previously practical, and the algorithms we propose are applicable to several image segmentation approaches. Efficient, scalable, yet highly accurate image contour detection will facilitate increased performance in many computer vision applications.
American Journal of Roentgenology | 2012
Albert Hsiao; Michael Lustig; Marcus T. Alley; Mark Murphy; Frandics P. Chan; Robert J. Herfkens; Shreyas S. Vasanawala
OBJECTIVE: The quantification of cardiac flow and ventricular volumes is an essential goal of many congenital heart MRI examinations, often requiring acquisition of multiple 2D phase-contrast and bright-blood cine steady-state free precession (SSFP) planes. Scan acquisition, however, is lengthy and highly reliant on an imager who is well-versed in structural heart disease. Although it can also be lengthy, 3D time-resolved (4D) phase-contrast MRI yields global flow patterns and is simpler to perform. We therefore sought to accelerate 4D phase contrast and to determine whether equivalent flow and volume measurements could be extracted. MATERIALS AND METHODS: Four-dimensional phase contrast was modified for higher acceleration with compressed sensing. Custom software was developed to process 4D phase-contrast images. We studied 29 patients referred for congenital cardiac MRI who underwent a routine clinical protocol, including cine short-axis stack SSFP and 2D phase contrast, followed by contrast-enhanced 4D phase contrast. To compare quantitative measurements, Bland-Altman analysis, paired Student t tests, and F tests were used. RESULTS: Ventricular end-diastolic, end-systolic, and stroke volumes obtained from 4D phase contrast and SSFP were well correlated (ρ = 0.91-0.95; r² = 0.83-0.90), with no statistically significant difference. Ejection fractions were well correlated in a subpopulation that underwent higher-resolution compressed-sensing 4D phase contrast (ρ = 0.88; r² = 0.77). Four-dimensional phase contrast and 2D phase contrast flow rates were also well correlated (ρ = 0.90; r² = 0.82). Excluding ventricles with valvular insufficiency, cardiac outputs derived from outlet valve flow and stroke volumes were more consistent by 4D phase contrast than by 2D phase contrast and SSFP. CONCLUSION: Combined parallel imaging and compressed sensing can be applied to 4D phase contrast. With custom software, flow and ventricular volumes may be extracted with comparable accuracy to SSFP and 2D phase contrast. Furthermore, cardiac outputs were more consistent by 4D phase contrast.
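For readers unfamiliar with the Bland-Altman analysis named in the methods, the following minimal sketch computes its two standard outputs: the mean difference (bias) between two measurement methods and the 95% limits of agreement. The function name is hypothetical and the conventional 1.96 multiplier is assumed; the study's own software and data are not reproduced here.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Return (bias, lower_limit, upper_limit) for paired measurements.

    bias is the mean of the pairwise differences a - b; the limits of
    agreement are bias +/- 1.96 sample standard deviations of those
    differences.
    """
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread
```

Called with two equal-length lists of paired measurements, for example stroke volumes from two techniques on the same patients, a bias near zero with narrow limits indicates that the methods agree closely.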
ieee international conference on high performance computing data and analytics | 2009
Marghoob Mohiyuddin; Mark Murphy; Leonid Oliker; John Shalf; John Wawrzynek; Samuel Williams
As power has become the pre-eminent design constraint for future HPC systems, computational efficiency is being emphasized over peak performance alone. Recently, static benchmark codes have been used to find a power-efficient architecture. Unfortunately, because compilers generate sub-optimal code, benchmark performance can be a poor indicator of the performance potential of architecture design points. Therefore, we present hardware/software co-tuning as a novel approach for system design, in which traditional architecture space exploration is tightly coupled with software auto-tuning for delivering substantial improvements in area and power efficiency. We demonstrate the proposed methodology by exploring the parameter space of a Tensilica-based multi-processor running three of the most heavily used kernels in scientific computing, each with widely varying micro-architectural requirements: sparse matrix vector multiplication, stencil-based computations, and general matrix-matrix multiplication. Results demonstrate that co-tuning significantly improves hardware area and energy efficiency, a key driver for the next generation of HPC system design.
ieee international symposium on workload characterization | 2009
Mark Murphy; Kurt Keutzer; Hong Wang
High-quality cameras are a standard feature of mobile platforms, but the computational capabilities of mobile processors limit the applications capable of exploiting them. Emerging mobile application domains, for example Mobile Augmented Reality (MAR), rely heavily on techniques from computer vision, requiring sophisticated analyses of images followed by higher-level processing. An important class of image analyses is the detection of sparse localized interest points. The Scale Invariant Feature Transform (SIFT), the most popular such analysis, is computationally representative of many other feature extractors. Using a novel code-generation framework, we demonstrate that a small set of optimizations produces high-performance SIFT implementations for three very different architectures: a laptop CPU (Core 2 Duo), a low-power CPU (Intel Atom), and a low-power GPU (GMA X3100). We improve the runtime of SIFT by more than 5X on our low-power architectures, enabling a low-power mobile device to extract SIFT features up to 63% as fast as the laptop CPU.
Multiprocessor System-on-Chip | 2011
Michael J. Anderson; Bryan Catanzaro; Jike Chong; Ekaterina Gonina; Kurt Keutzer; Chao-Yue Lai; Mark Murphy; Bor-Yiing Su; Narayanan Sundaram
Parallel programming using the current state-of-the-art in software engineering techniques is hard. Expertise in parallel programming is necessary to deliver good performance in applications; however, domain experts very commonly lack the requisite expertise in parallel programming. In order to drive computer science research toward effectively using the available parallel hardware platforms, it is very important to make parallel programming systematic and productive. We believe that the key to designing parallel programs in a systematic way is software architecture, and the key to improving the productivity of developing parallel programs is software frameworks. The basis of both is design patterns and a pattern language.
Radiology | 2012
Albert Hsiao; Michael Lustig; Marcus T. Alley; Mark Murphy; Shreyas S. Vasanawala
usenix conference on hot topics in parallelism | 2011
Michael J. Anderson; Bryan Catanzaro; Jike Chong; Ekaterina Gonina; Kurt Keutzer; Chao-Yue Lai; Mark Murphy; David Sheffield; Bor-Yiing Su; Narayanan Sundaram
