
Publication


Featured research published by Ana Lucia Varbanescu.


International Conference on Parallel Processing | 2011

A Comprehensive Performance Comparison of CUDA and OpenCL

Jianbin Fang; Ana Lucia Varbanescu; Henk J. Sips

This paper presents a comprehensive performance comparison between CUDA and OpenCL. We have selected 16 benchmarks ranging from synthetic applications to real-world ones. We make an extensive analysis of the performance gaps, taking into account programming models, optimization strategies, architectural details, and underlying compilers. Our results show that, for most applications, CUDA performs at most 30% better than OpenCL. We also show that this difference is due to unfair comparisons: in fact, OpenCL can achieve performance similar to CUDA under a fair comparison. Therefore, we define a fair comparison of the two types of applications, providing guidelines for further analyses. We also investigate OpenCL's portability by running the benchmarks on other prevailing platforms with minor modifications. Overall, we conclude that OpenCL's portability does not fundamentally affect its performance, and that OpenCL can be a good alternative to CUDA.


International Conference on Performance Engineering | 2014

Test-driving Intel Xeon Phi

Jianbin Fang; Henk J. Sips; Lilun Zhang; Chuanfu Xu; Yonggang Che; Ana Lucia Varbanescu

Based on Intel's Many Integrated Core (MIC) architecture, Intel Xeon Phi is one of the few truly many-core CPUs, featuring around 60 fairly powerful cores, two levels of caches, and graphics memory, all interconnected by a very fast ring. Given its promised ease-of-use and high performance, we took Xeon Phi out for a test drive. In this paper, we present this experience at two different levels: (1) the microbenchmark level, where we stress each nut and bolt of the Phi in the lab, and (2) the application level, where we study the Phi's performance response in a real-life environment. At the microbenchmark level, we show the high performance of five components of the architecture, focusing on their maximum achieved performance and the prerequisites to achieve it. Next, we choose a medical imaging application (Leukocyte Tracking) as a case study. We observed that it is rather easy to get functional code and start benchmarking, but the first performance numbers can be far from satisfying. Our experience indicates that a simple data structure and massive parallelism are critical for Xeon Phi to perform well. When compiler-driven parallelization and/or vectorization fails, programming Xeon Phi for performance can become very challenging.


International Parallel and Distributed Processing Symposium | 2014

How Well Do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis

Yong Guo; Marcin Biczak; Ana Lucia Varbanescu; Alexandru Iosup; Claudio Martella; Theodore L. Willke

Graph-processing platforms are increasingly used in a variety of domains. Although both industry and academia are developing and tuning graph-processing algorithms and platforms, the performance of graph-processing platforms has never been explored or compared in-depth. Thus, users face the daunting challenge of selecting an appropriate platform for their specific application. To alleviate this challenge, we propose an empirical method for benchmarking graph-processing platforms. We define a comprehensive process, and a selection of representative metrics, datasets, and algorithmic classes. We implement a benchmarking suite of five classes of algorithms and seven diverse graphs. Our suite reports on basic (user-level) performance, resource utilization, scalability, and various overheads. We use our benchmarking suite to analyze and compare six platforms. We gain valuable insights for each platform and present the first comprehensive comparison of graph-processing platforms.
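The benchmarking method described above — timing representative algorithms on chosen datasets and reporting basic user-level performance — can be sketched in miniature. The following is illustrative only: it times a plain-Python BFS (one of the algorithmic classes a suite like this covers) on a toy graph, not any of the six platforms evaluated in the paper; all names here are invented for the sketch.

```python
import time
from collections import deque

def bfs_levels(adj, source):
    """Level-synchronous BFS: distance from source for every reachable vertex."""
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                frontier.append(w)
    return dist

def benchmark(algorithm, graph, source, runs=3):
    """Report the best wall-clock time over several runs (a basic, user-level metric)."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        result = algorithm(graph, source)
        best = min(best, time.perf_counter() - start)
    return result, best

# Toy dataset: a small directed graph as an adjacency list.
graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
dist, seconds = benchmark(bfs_levels, graph, source=0)
print(dist)  # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```

A real suite would additionally sample resource utilization and vary graph scale to measure scalability, but the harness shape — algorithm, dataset, repeated timed runs — is the same.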


ACM Transactions on Architecture and Code Optimization | 2015

Cross-Loop Optimization of Arithmetic Intensity for Finite Element Local Assembly

Fabio Luporini; Ana Lucia Varbanescu; Florian Rathgeber; Gheorghe-Teodor Bercea; J. Ramanujam; David A. Ham; Paul H. J. Kelly

The numerical solution of partial differential equations using the finite element method is one of the key applications of high performance computing. Local assembly is its characteristic operation. This entails the execution of a problem-specific kernel to numerically evaluate an integral for each element in the discretized problem domain. Since the domain size can be huge, executing efficient kernels is fundamental. Their optimization is, however, a challenging issue. Even though affine loop nests are generally present, the short trip counts and the complexity of mathematical expressions make it hard to determine a single or unique sequence of successful transformations. Therefore, we present the design and systematic evaluation of COFFEE, a domain-specific compiler for local assembly kernels. COFFEE manipulates abstract syntax trees generated from a high-level domain-specific language for PDEs by introducing domain-aware composable optimizations aimed at improving instruction-level parallelism, especially SIMD vectorization, and register locality. It then generates C code including vector intrinsics. Experiments using a range of finite-element forms of increasing complexity show that significant performance improvement is achieved.
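The flavor of transformation such a compiler applies can be shown with a hand-written example (not COFFEE's actual output): in an assembly-like loop nest, a subexpression invariant in one loop can be hoisted out of it, cutting redundant floating-point work without changing the result. The kernel shape and names below are illustrative.

```python
def assemble_naive(A, B, w, n):
    """Local-assembly-style loop nest: w[i]*B[i][k] is recomputed for every j."""
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                M[j][k] += w[i] * B[i][k] * A[i][j]
    return M

def assemble_hoisted(A, B, w, n):
    """Same computation, with the j-invariant product precomputed once per i."""
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        wb = [w[i] * B[i][k] for k in range(n)]  # hoisted out of the j loop
        for j in range(n):
            a = A[i][j]
            for k in range(n):
                M[j][k] += a * wb[k]
    return M
```

The naive version performs two multiplications per innermost iteration; the hoisted version amortizes one of them over the j loop, which matters precisely because trip counts in local assembly are too short for a general-purpose compiler to reliably find this on its own.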


International Conference on Performance Engineering | 2014

Benchmarking graph-processing platforms: a vision

Yong Guo; Ana Lucia Varbanescu; Alexandru Iosup; Claudio Martella; Theodore L. Willke

Processing graphs, especially at large scale, is an increasingly useful activity in a variety of business, engineering, and scientific domains. Already, there are tens of graph-processing platforms, such as Hadoop, Giraph, GraphLab, etc., each with a different design and functionality. For graph-processing to continue to evolve, users have to find it easy to select a graph-processing platform, and developers and system integrators have to find it easy to quantify the performance and other non-functional aspects of interest. However, the state of performance analysis of graph-processing platforms is still immature: there are few studies and, for the few that exist, there are few similarities, and relatively little understanding of the impact of dataset and algorithm diversity on performance. Our vision is to develop, with the help of the performance-savvy community, a comprehensive benchmarking suite for graph-processing platforms. In this work, we take a step in this direction, by proposing a set of seven challenges, summarizing our previous work on performance evaluation of distributed graph-processing platforms, and introducing our on-going work within the SPEC Research Group's Cloud Working Group.


Computing Frontiers | 2009

Evaluating multi-core platforms for HPC data-intensive kernels

Alexander S. van Amesfoort; Ana Lucia Varbanescu; Henk J. Sips; Rob V. van Nieuwpoort

Multi-core platforms have proven themselves able to accelerate numerous HPC applications. But programming data-intensive applications on such platforms is a hard, and not yet solved, problem. Not only do modern processors favor compute-intensive code, they also have diverse architectures and incompatible programming models. And even after making a difficult platform choice, extensive programming effort must be invested with an uncertain performance outcome. By taking the plunge on an irregular, data-intensive application, we present an evaluation of three platform types, namely the generic multi-core CPU, the STI Cell/B.E., and the GPU. We evaluate these platforms in terms of application performance, programming effort and cost. Although we do not select a clear winner, we do provide a list of guidelines to assist in platform choice and development of similar data-intensive applications.


International Conference on Multimedia and Expo | 2007

Digital Media Indexing on the Cell Processor

Lurng-Kuo Liu; Qiang Liu; Apostol Natsev; Kenneth A. Ross; John R. Smith; Ana Lucia Varbanescu

We present a case study of developing a digital media indexing application, code-named MARVEL, on the STI Cell Broadband Engine (CBE) processor. There are two aspects of the target application that require significant computing power: image analysis for feature extraction, and support vector machine (SVM) based pattern classification for concept detection. We discuss the mapping of a large application like MARVEL onto a multicore processor, and show how feature extraction and concept detection can be implemented on the CBE. We discuss how the synergistic processing units of a CBE can be used to gain dramatic performance improvements. The empirical results of our experiments, conducted on a Cell blade running at 3.2 GHz, show that the CBE provides a significant performance speed-up in our digital media indexing application.
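The two compute-heavy stages named above, feature extraction and SVM-based concept scoring, can be sketched in miniature. This is not MARVEL's actual pipeline: the histogram feature, the model weights, and every name below are illustrative assumptions; a real linear SVM would have trained weights and many more features.

```python
def histogram_features(pixels, bins=4):
    """Feature extraction: normalized intensity histogram of an 8-bit image."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // 256, bins - 1)] += 1
    total = len(pixels)
    return [c / total for c in counts]

def svm_score(features, weights, bias):
    """Linear SVM decision value: positive means the concept is detected."""
    return sum(f * w for f, w in zip(features, weights)) + bias

# Illustrative weights for a single hypothetical concept detector.
pixels = [10, 20, 30, 200, 220, 230, 240, 250]
feats = histogram_features(pixels)
score = svm_score(feats, weights=[-1.0, 0.0, 0.0, 1.0], bias=0.0)
print(score > 0)  # True: the bright bins dominate this toy image
```

Both stages are data-parallel over pixels and over feature dimensions, which is what makes them natural candidates for offloading to the CBE's synergistic processing units.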


IEEE International Conference on High Performance Computing, Data and Analytics | 2012

A polyphase filter for GPUs and multi-core processors

Karel van der Veldt; Rob V. van Nieuwpoort; Ana Lucia Varbanescu; Chris R. Jesshope

Software radio telescopes are a new development in radio astronomy. Rather than using expensive dishes, they form distributed sensor networks of tens of thousands of simple receivers. Signals are processed in software instead of custom-built hardware, taking advantage of the flexibility that software solutions offer. However, the data rates are high and the processing requirements challenging. GPUs and multi-core processors are promising devices to provide the required processing power. LOFAR, the largest radio telescope, is a prime example of a software radio telescope. In this paper, we discuss an optimized implementation of the polyphase filter bank used by LOFAR. We compare the following architectures: Intel Core i7, NVIDIA GTX580, ATI HD5870, and MicroGrid [7]. We present a novel way to compute polyphase filters efficiently on GPUs, and also discuss hardware limitations and energy efficiency.
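A polyphase filter bank combines an M-branch FIR filter with an M-point DFT across the branches. The following is a minimal serial reference version to show that structure only — it is an assumption-laden sketch, not the optimized GPU implementation the paper presents, and the function name and layout are invented here.

```python
import cmath

def polyphase_filter_bank(x, h, M):
    """Split prototype filter h (length M*P) into M branches, FIR-filter the
    input in blocks of M samples, then take an M-point DFT across the branch
    outputs. Returns one M-channel spectrum per fully-filtered input block."""
    P = len(h) // M                       # taps per branch
    blocks = [x[i:i + M] for i in range(0, len(x) - M + 1, M)]
    spectra = []
    for n in range(P - 1, len(blocks)):   # wait until the delay line is full
        # FIR part: branch b uses its own tap subset h[b], h[b+M], ...
        v = [sum(h[b + p * M] * blocks[n - p][b] for p in range(P))
             for b in range(M)]
        # DFT part: M-point transform across the branch outputs
        spectra.append([sum(v[b] * cmath.exp(-2j * cmath.pi * k * b / M)
                            for b in range(M)) for k in range(M)])
    return spectra
```

With P = 1 and a flat prototype filter this degenerates to a per-block DFT; the FIR branches are what give the filter bank its sharper channel response, and on GPUs they map naturally onto one thread per branch.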


European Conference on Parallel Processing | 2008

Radioastronomy Image Synthesis on the Cell/B.E.

Ana Lucia Varbanescu; Alexander S. van Amesfoort; Tim J. Cornwell; Andrew Mattingly; Bruce G. Elmegreen; Rob V. van Nieuwpoort; Ger van Diepen; Henk J. Sips

Now that large radio telescopes like SKA, LOFAR, or ASKAP become available in different parts of the world, radio astronomers foresee a vast increase in the amount of data to gather, store, and process. To keep the processing time bounded, parallelization and execution on (massively) parallel machines are required for the commonly-used radio astronomy software kernels. In this paper, we analyze data gridding and degridding, a very time-consuming kernel of radio astronomy image synthesis. To tackle its dynamic behavior, we devise and implement a parallelization strategy for the Cell/B.E. multi-core processor, offering a cost-efficient alternative compared to classical supercomputers. Our experiments show that the application running on one Cell/B.E. is more than 20 times faster than the original application running on a commodity machine. Based on scalability experiments, we estimate the hardware requirements for a realistic radio telescope. We conclude that our parallelization solution exposes an efficient way to deal with dynamic data-intensive applications on heterogeneous multi-core processors.
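Gridding scatters each irregularly-sampled visibility onto a regular grid through a convolution kernel; the scatter pattern depends on the data, which is the dynamic behavior that makes parallelization hard. A minimal serial sketch (illustrative only — the paper's Cell/B.E. strategy partitions this work across cores, and the kernel and names here are invented):

```python
def grid_visibilities(vis, grid_size, kernel):
    """Convolutional gridding: spread each sample (u, v, value) onto the grid
    cells around (u, v), weighted by the convolution kernel."""
    half = len(kernel) // 2
    grid = [[0.0] * grid_size for _ in range(grid_size)]
    for u, v, value in vis:
        for dy, row in enumerate(kernel):
            for dx, w in enumerate(row):
                grid[v + dy - half][u + dx - half] += w * value
    return grid

# A 3x3 kernel whose weights sum to 1, and two visibility samples on an 8x8 grid.
kernel = [[0.0625, 0.125, 0.0625],
          [0.125,  0.25,  0.125],
          [0.0625, 0.125, 0.0625]]
grid = grid_visibilities([(3, 3, 1.0), (5, 4, 2.0)], 8, kernel)
total = sum(sum(row) for row in grid)
print(total)  # 3.0: a unit-sum kernel preserves total flux
```

Because two nearby samples can touch the same grid cells, the accumulation creates write conflicts under parallel execution — exactly the kind of irregularity a static partitioning cannot resolve and a Cell/B.E.-style strategy must handle explicitly.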


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2011

Towards an Effective Unified Programming Model for Many-Cores

Ana Lucia Varbanescu; Pieter Hijma; Rob V. van Nieuwpoort; Henri E. Bal

Building an effective programming model for many-core processors is challenging. On the one hand, the increasing variety of platforms and their specific programming models force users to take a hardware-centric approach not only for implementing parallel applications, but also for designing them. This approach diminishes portability and, eventually, limits performance. On the other hand, to effectively cope with the increased number of large-scale workloads that require parallelization, a portable, application-centric programming model is desirable. Such a model enables programmers to focus first on extracting and exploiting parallelism from their applications, as opposed to generating parallelism for specific hardware, and only second on platform-specific implementation and optimizations. In this paper, we first present a survey of programming models designed for programming three families of many-cores: general-purpose many-cores (GPMCs), graphics processing units (GPUs), and the Cell/B.E. We analyze the usability of these models, their ability to improve platform programmability, and the specific features that contribute to this improvement. Next, we also discuss two types of generic models: parallelism-centric and application-centric. We also analyze their features and impact on platform programmability. Based on this analysis, we recommend two application-centric models (OmpSs and OpenCL) as promising candidates for a unified programming model for many-cores and we discuss potential enhancements for them.

Collaboration


Explore Ana Lucia Varbanescu's collaborations.

Top Co-Authors

Henk J. Sips
Delft University of Technology

Jianbin Fang
Delft University of Technology

Alexandru Iosup
Delft University of Technology

Cees de Laat
University of Amsterdam

Yong Guo
Delft University of Technology

Henri E. Bal
VU University Amsterdam