Andres Marquez
Pacific Northwest National Laboratory
Publication
Featured research published by Andres Marquez.
computing frontiers | 2007
Jarek Nieplocha; Andres Marquez; John Feo; Daniel G. Chavarría-Miranda; George Chin; Chad Scherrer; Nathaniel Beagley
The resurgence of current and upcoming multithreaded architectures and programming models led us to conduct a detailed study to understand the potential of these platforms to increase the performance of data-intensive, irregular scientific applications. Our study is based on a power system state estimation application and a novel anomaly detection application applied to network traffic data. We also conducted a detailed evaluation of the platforms using microbenchmarks in order to gain insight into their architectural capabilities and their interaction with programming models and application software. The evaluation was performed on the Cray MTA-2 and the Sun Niagar.
international parallel and distributed processing symposium | 2014
Yang You; Shuaiwen Leon Song; Haohuan Fu; Andres Marquez; Maryam Mehri Dehnavi; Kevin J. Barker; Kirk W. Cameron; Amanda Randles; Guangwen Yang
Support Vector Machine (SVM) has been widely used in data-mining and Big Data applications as modern commercial databases start to attach increasing importance to analytic capabilities. In recent years, SVM was adapted to the field of High Performance Computing for power/performance prediction, auto-tuning, and runtime scheduling. However, even at the risk of losing prediction accuracy due to insufficient runtime information, researchers can only afford to apply offline model training to avoid significant runtime training overhead. Advanced multi- and many-core architectures offer massive parallelism with complex memory hierarchies, which can make runtime training possible but form a barrier to efficient parallel SVM design. To address the challenges above, we designed and implemented MIC-SVM, a highly efficient parallel SVM for x86-based multi-core and many-core architectures, such as the Intel Ivy Bridge CPUs and Intel Xeon Phi co-processor (MIC). We propose various novel analysis methods and optimization techniques to fully utilize the multilevel parallelism provided by these architectures and serve as general optimization methods for other machine learning tools. MIC-SVM achieves 4.4-84x and 18-47x speedups against the popular LIBSVM, on MIC and Ivy Bridge CPUs respectively, for several real-world data-mining datasets. Even compared with GPUSVM, run on a top-of-the-line NVIDIA K20x GPU, the performance of our MIC-SVM is competitive. We also conduct a cross-platform performance comparison analysis, focusing on Ivy Bridge CPUs, MIC and GPUs, and provide insights on how to select the most suitable advanced architectures for specific algorithms and input data patterns.
2006 IEEE Power Engineering Society General Meeting | 2006
Jaroslaw Nieplocha; Andres Marquez; Vinod Tipparaju; Daniel G. Chavarría-Miranda; Ross T. Guttromson; H. Huang
We are investigating the effectiveness of parallel weighted-least-square (WLS) state estimation solvers on shared-memory parallel computers. Shared-memory parallel architectures are rapidly becoming ubiquitous due to the advent of multi-core processors. In the current evaluation, we are using an LU-based solver as well as a conjugate gradient (CG)-based solver for a 1177-bus system. In lieu of a very wide multi-core system, we evaluate the effectiveness of the solvers on an SGI Altix system on up to 32 processors. On this platform, as expected, the shared memory implementation (pthreads) of the LU solver was found to be more efficient than the MPI version. Our implementation of the CG solver scales and performs significantly better than the state-of-the-art implementation of the LU solver: with CG we can solve the problem 4.75 times faster than using LU. These findings indicate that CG algorithms should be quite effective on multicore processors.
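As a rough illustration of the solver comparison above, the following is a minimal serial sketch of a WLS estimate computed with conjugate gradient. It assumes a linear measurement model z = Hx + e, so the estimate solves the normal equations (HᵀWH)x = HᵀWz; the abstract does not describe the authors' formulation or their parallel implementation, and the names `wls_estimate`, `conjugate_gradient`, and `matvec` are hypothetical.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x, with x = 0 initially
    p = list(r)              # first search direction
    rs_old = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break            # residual norm is small enough
        Ap = matvec(A, p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def wls_estimate(H, z, weights):
    """Weighted least squares via the normal equations (assumed linear model)."""
    m, n = len(H), len(H[0])
    # Gain matrix G = H^T W H with diagonal weights W
    G = [[sum(weights[k] * H[k][i] * H[k][j] for k in range(m))
          for j in range(n)] for i in range(n)]
    # Right-hand side H^T W z
    rhs = [sum(weights[k] * H[k][i] * z[k] for k in range(m))
           for i in range(n)]
    return conjugate_gradient(G, rhs)


# Toy example: three measurements of two state variables
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z = [2.0, 3.0, 5.0]
x_hat = wls_estimate(H, z, weights=[1.0, 1.0, 1.0])
```

CG only needs matrix-vector products and dot products, which is one plausible reason it maps well onto shared-memory threads; a practical power-system estimator would use sparse storage and iterate over a nonlinear measurement model.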
green computing and communications | 2010
Abhinav Vishnu; Shuaiwen Song; Andres Marquez; Kevin J. Barker; Darren J. Kerbyson; Kirk W. Cameron; Pavan Balaji
The insatiable demand of high performance computing is being driven by the most computationally intensive applications such as computational chemistry, climate modeling, nuclear physics, etc. The last couple of decades have observed a tremendous rise in supercomputers with architectures ranging from traditional clusters to system-on-a-chip in order to achieve the petaflop computing barrier. However, with the advent of petaflop-plus computing, we have ushered in an era where a power-efficient system software stack is imperative for execution on exascale systems and beyond. At the same time, computationally intensive applications are exploring programming models beyond traditional message passing, as a combination of Partitioned Global Address Space (PGAS) languages and libraries, providing a one-sided communication paradigm with put, get and accumulate primitives. To support the PGAS models, it is critical to design power-efficient and high-performance one-sided communication runtime systems. In this paper, we design and implement PASCoL, a high-performance power-aware one-sided communication library using Aggregate Remote Memory Copy Interface (ARMCI), the communication runtime system of Global Arrays. For various communication primitives provided by ARMCI, we study the impact of Dynamic Voltage/Frequency Scaling (DVFS) and a combination of interrupt (blocking)/polling based mechanisms provided by most modern interconnects. We implement our design and evaluate it with synthetic benchmarks using an InfiniBand cluster. Our results indicate that PASCoL can achieve significant reduction in energy consumed per byte transfer without additional penalty for various one-sided communication primitives and various message sizes and data transfer patterns.
international parallel and distributed processing symposium | 2008
Daniel G. Chavarría-Miranda; Andres Marquez; Jaroslaw Nieplocha; Kristyn J. Maschhoff; Chad Scherrer
This paper describes our early experiences with a pre-production Cray XMT system that implements a scalable shared memory architecture with hardware support for multithreading. Unlike its predecessor, the Cray MTA-2 that had very limited I/O capability, the Cray XMT offers Lustre, a scalable high-performance parallel filesystem. Therefore it enables development of out-of-core applications that can deal with very large data sets that otherwise would not fit in the system main memory. Our application performs statistically-based anomaly detection for categorical data that can be used for analysis of Internet traffic data. Experimental results indicate that the pre-production version of the machine is able to achieve good performance and scalability for the in- and out-of-core versions of the application.
The Journal of Supercomputing | 2013
Abhinav Vishnu; Shuaiwen Song; Andres Marquez; Kevin J. Barker; Darren J. Kerbyson; Kirk W. Cameron; Pavan Balaji
As the march to exascale computing gains momentum, energy consumption of supercomputers has emerged as the critical roadblock. While architectural innovations are imperative in achieving computing of this scale, it is largely dependent on the systems software to leverage the architectural innovations. Parallel applications in many computationally intensive domains have been designed to leverage these supercomputers, with legacy two-sided communication semantics using Message Passing Interface. At the same time, Partitioned Global Address Space (PGAS) models are being designed which provide global address space abstractions and one-sided communication for exploiting data locality and communication optimizations. PGAS models rely on one-sided communication runtime systems for leveraging high-speed networks to achieve best possible performance. In this paper, we present a design for Power Aware One-Sided Communication Library – PASCoL. The proposed design detects communication slack and leverages Dynamic Voltage and Frequency Scaling (DVFS) and interrupt-driven execution to exploit the detected slack for energy efficiency. We implement our design and evaluate it using synthetic benchmarks for one-sided communication primitives, Put, Get, and Accumulate and uniformly noncontiguous data transfers. Our performance evaluation indicates that we can achieve significant reduction in energy consumption without performance loss on multiple one-sided communication primitives. The achieved results are close to the theoretical peak available with the experimental test bed.
international conference on e-science | 2010
George Chin; Sutanay Choudhury; Lars J. Kangas; Sally A. McFarlane; Andres Marquez
The Atmospheric Radiation Measurement (ARM) program operated by the U.S. Department of Energy is one of the largest climate research programs dedicated to the collection of long-term continuous measurements of cloud properties and other key components of the earth’s climate system. Given the critical role that collected ARM data plays in the analysis of atmospheric processes and conditions and in the enhancement and evaluation of global climate models, the production and distribution of high-quality data is one of ARM’s primary mission objectives. Fault detection in ARM’s distributed sensor network is one critical ingredient towards maintaining high quality and useful data. We are modeling ARM’s distributed sensor network as a dynamic Bayesian network where key measurements are mapped to Bayesian network variables. We then define the conditional dependencies between variables by discovering highly correlated variable pairs from historical data. The resultant dynamic Bayesian network provides an automated approach to identifying whether certain sensors are malfunctioning or failing in the distributed sensor network. A potential fault or failure is detected when an observed measurement is not consistent with its expected measurement and the observed measurements of other related sensors in the Bayesian network. We present some of our experiences and promising results with the fault detection dynamic Bayesian network.
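The dependency-discovery step described above can be sketched roughly as follows. This is only an illustration of finding highly correlated variable pairs from historical data; it assumes Pearson correlation and a fixed threshold, neither of which is stated in the abstract, and the names `pearson`, `correlated_pairs`, and the sample variables are hypothetical. It does not build or evaluate the dynamic Bayesian network itself.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / sqrt(var_x * var_y)


def correlated_pairs(history, threshold=0.9):
    """Return variable pairs whose |correlation| meets the threshold.

    `history` maps a variable name to its list of historical readings.
    Each returned pair is a candidate dependency between network variables.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(history), 2)
        if abs(pearson(history[a], history[b])) >= threshold
    ]


# Hypothetical sensor readings, for illustration only
history = {
    "temp": [1.0, 2.0, 3.0, 4.0],
    "humidity": [2.0, 4.0, 6.0, 8.1],
    "wind": [5.0, 1.0, 4.0, 2.0],
}
pairs = correlated_pairs(history)
```

In this toy data only the temperature and humidity series track each other closely, so they form the single candidate pair; a sensor whose reading later disagrees with its correlated partner would be the kind of inconsistency the abstract describes as a potential fault.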
international parallel and distributed processing symposium | 2009
George Chin; Andres Marquez; Sutanay Choudhury; Kristyn J. Maschhoff
Commonly represented as directed graphs, social networks depict relationships and behaviors among social entities such as people, groups, and organizations. Social network analysis denotes a class of mathematical and statistical methods designed to study and measure social networks. Beyond sociology, social network analysis methods are being applied to other types of data in other domains such as bioinformatics, computer networks, national security, and economics. For particular problems, the size of a social network can grow to millions of nodes and tens of millions of edges or more. In such cases, researchers could benefit from the application of social network analysis algorithms on high-performance architectures and systems.
Journal of Parallel and Distributed Computing | 2015
Yang You; Haohuan Fu; Shuaiwen Leon Song; Amanda Randles; Darren J. Kerbyson; Andres Marquez; Guangwen Yang; Adolfy Hoisie
Support Vector Machines (SVM) have been widely used in data-mining and Big Data applications as modern commercial databases start to attach an increasing importance to the analytic capabilities. In recent years, SVM was adapted to the field of High Performance Computing for power/performance prediction, auto-tuning, and runtime scheduling. However, even at the risk of losing prediction accuracy due to insufficient runtime information, researchers can only afford to apply offline model training to avoid significant runtime training overhead. Advanced multi- and many-core architectures offer massive parallelism with complex memory hierarchies which can make runtime training possible, but form a barrier to efficient parallel SVM design. To address the challenges above, we designed and implemented MIC-SVM, a highly efficient parallel SVM for x86 based multi-core and many-core architectures, such as the Intel Ivy Bridge CPUs and Intel Xeon Phi co-processor (MIC). We propose various novel analysis methods and optimization techniques to fully utilize the multilevel parallelism provided by these architectures and serve as general optimization methods for other machine learning tools. MIC-SVM achieves 4.4-84x and 18-47x speedups against the popular LIBSVM, on MIC and Ivy Bridge CPUs respectively, for several real-world data-mining datasets. Even compared with GPUSVM, running on the NVIDIA k20x GPU, the performance of our MIC-SVM is competitive. We also conduct a cross-platform performance comparison analysis, focusing on Ivy Bridge CPUs, MIC and GPUs, and provide insights on how to select the most suitable advanced architectures for specific algorithms and input data patterns.
An efficient parallel support vector machine for x86 based multi-core platforms. The novel optimization techniques to fully utilize the multi-level parallelism. The improvement for the deficiencies of the current SVM tools. Select the best architectures for input data patterns to achieve best performance. The large-scale distributed algorithm and power-efficient approach.
ACM Journal on Emerging Technologies in Computing Systems | 2012
Landon H. Sego; Andres Marquez; Andrew R. Rawson; Tahir Cader; Kevin M. Fox; William I. Gustafson; Christopher J. Mundy
As data centers proliferate in size and number, the endeavor to improve their energy efficiency and productivity is becoming increasingly important. We discuss the properties of a number of the proposed metrics of energy efficiency and productivity. In particular, we focus on the Data Center Energy Productivity (DCeP) metric, which is the ratio of useful work produced by the data center to the energy consumed performing that work. We describe our approach for using DCeP as the principal outcome of a designed experiment using a highly instrumented, high-performance computing data center. We found that DCeP was successful in clearly distinguishing different operational states in the data center, thereby validating its utility as a metric for identifying configurations of hardware and software that would improve (or even maximize) energy productivity. We also discuss some of the challenges and benefits associated with implementing the DCeP metric, and we examine the efficacy of the metric in making comparisons within a data center and among data centers.
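The DCeP definition above (useful work divided by the energy consumed performing that work) can be written as a small sketch. The function name `dcep`, the units, and the sample figures are illustrative assumptions; the abstract does not specify how useful work is quantified.

```python
def dcep(useful_work, energy_consumed):
    """Data Center Energy Productivity: useful work per unit of energy.

    `useful_work` is in whatever unit the operator defines (for example,
    completed jobs); `energy_consumed` is the energy used over the same
    assessment window (for example, kWh). Both units are assumptions here.
    """
    if energy_consumed <= 0:
        raise ValueError("energy_consumed must be positive")
    return useful_work / energy_consumed


# Hypothetical comparison of two operational states over equal windows
baseline = dcep(useful_work=1200, energy_consumed=400)   # 3.0 jobs per kWh
tuned = dcep(useful_work=1200, energy_consumed=300)      # 4.0 jobs per kWh
improvement = tuned / baseline
```

Because the numerator depends on a site-specific definition of useful work, ratios like `improvement` are easiest to interpret within one data center, where the same workload definition applies to both states.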