Bilel Hadri | Researchain

Publication

Featured research published by Bilel Hadri.

Journal of Physics: Conference Series | 2009

Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects

Emmanuel Agullo; James W. Demmel; Jack J. Dongarra; Bilel Hadri; Jakub Kurzak; Julien Langou; Hatem Ltaief; Piotr Luszczek; Stanimire Tomov

The emergence and continuing use of multi-core architectures and graphics processing units require changes in the existing software and sometimes even a redesign of the established algorithms in order to take advantage of the now prevailing parallelism. Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) and Matrix Algebra on GPU and Multicore Architectures (MAGMA) are two projects that aim to achieve high performance and portability across a wide range of multi-core architectures and hybrid systems, respectively. We present in this document a comparative study of PLASMA's performance against established linear algebra packages and some preliminary results of MAGMA on hybrid multi-core and GPU systems.

IEEE International Conference on High Performance Computing, Data, and Analytics | 2009

Comparative study of one-sided factorizations with multiple software packages on multi-core hardware

Emmanuel Agullo; Bilel Hadri; Hatem Ltaief; Jack Dongarra

The emergence and continuing use of multi-core architectures require changes in the existing software and sometimes even a redesign of the established algorithms in order to take advantage of the now prevailing parallelism. The Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) is a project that aims to achieve both high performance and portability across a wide range of multi-core architectures. We present in this paper a comparative study of PLASMA's performance against established linear algebra packages (LAPACK and ScaLAPACK), against new approaches to parallel execution (Task Based Linear Algebra Subroutines, TBLAS), and against equivalent commercial software offerings (MKL, ESSL and PESSL). Our experiments were conducted on one-sided linear algebra factorizations (LU, QR and Cholesky) and used multi-core architectures (based on Intel Xeon EMT64 and IBM Power6). For instance, a performance improvement of 67% was obtained on the Cholesky factorization of a matrix of order 4000, using 32 cores.

International Parallel and Distributed Processing Symposium | 2010

Tile QR factorization with parallel panel processing for multicore architectures

Bilel Hadri; Hatem Ltaief; Emmanuel Agullo; Jack J. Dongarra

To exploit the potential of multicore architectures, recent dense linear algebra libraries have used tile algorithms, which consist of scheduling a Directed Acyclic Graph (DAG) of fine-grained tasks, where nodes represent tasks (either the factorization of a panel or the update of a block-column) and edges represent the dependencies among them. Although past approaches already achieve high performance on moderate and large square matrices, their way of processing a panel in sequence leads to limited performance when factorizing tall and skinny matrices or small square matrices. We present a new fully asynchronous method for computing a QR factorization on shared-memory multicore architectures that overcomes this bottleneck. Our contribution is to adapt an existing algorithm that performs a panel factorization in parallel (named Communication-Avoiding QR and initially designed for distributed-memory machines) to the context of tile algorithms using asynchronous computations. An experimental study shows significant improvement (up to almost 10 times faster) compared to state-of-the-art approaches. We aim to eventually incorporate this work into the Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) library.
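The sequential-panel bottleneck described in this abstract can be illustrated with a small sketch. It is not the authors' code: it builds the task DAG of a conventional (flat-tree) tile QR on a p x q grid of tiles and measures the longest dependency chain. The kernel names (GEQRT, UNMQR, TSQRT, TSMQR) follow common tile QR naming, while `tile_qr_dag`, `critical_path_length`, and the simplification of tracking only read-after-write dependencies per tile are assumptions made for illustration.

```python
from collections import defaultdict


def tile_qr_dag(p, q):
    """Build the task graph of a flat-tree tile QR on a p x q grid of tiles.

    Returns (tasks, deps): `tasks` lists task names in program order and
    `deps` maps each task to the set of tasks it has to wait for.
    Assumption: a task depends on the last writer of every tile it touches.
    """
    last_writer = {}                 # tile (i, j) -> task that last wrote it
    tasks, deps = [], defaultdict(set)

    def add(name, reads, writes):
        tasks.append(name)
        deps[name]                   # make sure every task has an entry
        for tile in list(reads) + list(writes):   # writes are read-modify-write
            if tile in last_writer:
                deps[name].add(last_writer[tile])
        for tile in writes:
            last_writer[tile] = name

    for k in range(min(p, q)):
        # Factor the diagonal tile of panel k.
        add(f"GEQRT({k})", [], [(k, k)])
        # Apply its reflectors to the tiles on the right.
        for n in range(k + 1, q):
            add(f"UNMQR({k},{n})", [(k, k)], [(k, n)])
        # Eliminate the tiles below the diagonal one after the other.
        for m in range(k + 1, p):
            add(f"TSQRT({m},{k})", [], [(k, k), (m, k)])
            for n in range(k + 1, q):
                add(f"TSMQR({m},{n},{k})", [(m, k)], [(k, n), (m, n)])
    return tasks, deps


def critical_path_length(tasks, deps):
    """Length, in tasks, of the longest dependency chain in the DAG."""
    depth = {}
    for task in tasks:               # program order is a topological order
        depth[task] = 1 + max((depth[d] for d in deps[task]), default=0)
    return max(depth.values())


if __name__ == "__main__":
    for shape in [(4, 4), (8, 1), (16, 1)]:
        tasks, deps = tile_qr_dag(*shape)
        print(shape, len(tasks), "tasks, critical path",
              critical_path_length(tasks, deps))
```

For a single column of tiles (a tall and skinny matrix), every TSQRT task updates the same diagonal tile, so the critical path equals the number of tasks and no parallelism is available. That chain is what a parallel panel factorization aims to break.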

IEEE International Conference on High Performance Computing, Data, and Analytics | 2010

Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems

Fengguang Song; Hatem Ltaief; Bilel Hadri; Jack J. Dongarra

As tile linear algebra algorithms continue achieving high performance on shared-memory multicore architectures, it is a challenging task to make them scalable on distributed-memory multicore cluster machines. The main contribution of this paper is the extension to the distributed-memory environment of the previous work done by Hadri et al. on Communication-Avoiding QR (CA-QR) factorizations for tall and skinny matrices (initially done on shared-memory multicore systems). The fine granularity of tile algorithms associated with communication-avoiding techniques for the QR factorization presents a high degree of parallelism where multiple tasks can be concurrently executed, computation and communication largely overlapped, and computation steps fully pipelined. A decentralized dynamic scheduler has then been integrated as a runtime system to efficiently schedule tasks across the distributed resources. Our experimental results performed on two clusters (with dual-core and 8-core nodes, respectively) and a Cray XT5 system with 12-core nodes show that the tile CA-QR factorization is able to outperform the de facto ScaLAPACK library by up to 4 times for tall and skinny matrices, and has good scalability on up to 3,072 cores.

Extreme Science and Engineering Discovery Environment | 2012

Achieve better performance with PEAK on XSEDE resources

Bilel Hadri; Haihang You; Shirley Moore

As the leading distributed cyberinfrastructure for open scientific research in the United States, XSEDE supports several supercomputers across the country, as well as computational tools that are critical to the success of those researchers. In most cases, users are looking for a systematic way of selecting and configuring the available systems software and libraries for their applications so as to obtain optimal application performance. However, few scientific application developers have the time for an exhaustive search of all the possible configurations to determine the best one, and performing such a search empirically can consume a significant proportion of their allocation hours. We present here a framework, called the Performance Environment Autoconfiguration frameworK (PEAK), to help developers and users of scientific applications to select the optimal configuration for their application on a given platform and to update that configuration when changes in the underlying hardware and systems software occur. The choices to be made include the compiler with its settings of compiling options, the numerical libraries and settings of library parameters, and settings of other environment variables to take advantage of the NUMA systems. The framework has helped us choose the optimal configuration to get a significant speedup for some scientific applications executed on XSEDE platforms such as Kraken and Nautilus.
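The kind of empirical search this abstract describes can be sketched as an exhaustive sweep over a small configuration space. This is a minimal illustration, not PEAK's actual interface: `best_configuration`, the option names, and the synthetic cost table are all hypothetical, and in practice `measure` would build and time the real application for each configuration.

```python
import itertools


def best_configuration(search_space, measure):
    """Try every combination of options and return the cheapest one.

    search_space: dict mapping an option name to its candidate values.
    measure: callable taking a configuration dict and returning a cost
             (for example, a wall-clock time in seconds).
    """
    names = list(search_space)
    best = None
    for values in itertools.product(*(search_space[n] for n in names)):
        config = dict(zip(names, values))
        cost = measure(config)
        if best is None or cost < best[1]:
            best = (config, cost)
    return best


# Hypothetical search space with placeholder names (not real products).
SPACE = {
    "compiler": ["cc_a", "cc_b", "cc_c"],
    "opt_level": ["-O2", "-O3"],
    "numerical_library": ["lib_x", "lib_y"],
    "numa_policy": ["default", "interleave"],
}

# Synthetic per-option costs, used only to make the example deterministic.
SYNTHETIC_COST = {
    "cc_a": 3.0, "cc_b": 2.0, "cc_c": 2.5,
    "-O2": 1.0, "-O3": 0.8,
    "lib_x": 1.0, "lib_y": 1.4,
    "default": 0.5, "interleave": 0.3,
}


def synthetic_runtime(config):
    """Stand-in for a real benchmark run: sums the made-up option costs."""
    return sum(SYNTHETIC_COST[value] for value in config.values())


if __name__ == "__main__":
    config, cost = best_configuration(SPACE, synthetic_runtime)
    print(config, cost)
```

The sweep visits 3 x 2 x 2 x 2 = 24 combinations here. The number of runs grows multiplicatively with every added option, which is why the abstract notes that an exhaustive empirical search can consume a significant share of a user's allocation hours.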

IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

Abstract: Interface for Performance Environment Autoconfiguration Framework

Liang Men; Bilel Hadri; Haihang You

Summary form only given. The Performance Environment Autoconfiguration frameworK (PEAK) is presented to help developers and users of scientific applications find the optimal configuration for their application on a given platform with rich computational resources and complicated options. The choices to be made include the compiler with its settings of compiling options, the numerical libraries and settings of library parameters, and settings of other environment variables to take advantage of the NUMA systems. A web-based interface has been developed so that users can conveniently choose the optimal configuration and obtain a significant speedup for some scientific applications executed on different systems.

Extreme Science and Engineering Discovery Environment | 2012

Optimization of density functional tight-binding and classical reactive molecular dynamics for high-throughput simulations of carbon materials

Jacek Jakowski; Bilel Hadri; Steven J. Stuart; Predrag S. Krstic; Stephan Irle; Dulma Nugawela; Sophya Garashchuk

Carbon materials and nanostructures (fullerenes, nanotubes) are promising building blocks of nanotechnology. Potential applications include optical and electronic devices, sensors, and nano-scale machines. The multiscale character of processes related to the fabrication and physics of such materials requires using a combination of different approaches such as (a) classical dynamics, (b) direct Born-Oppenheimer dynamics, (c) quantum dynamics for electrons and (d) quantum dynamics for selected nuclei. We describe our effort to optimize classical reactive molecular dynamics and the density functional tight-binding method, which is a core method in our direct and quantum dynamics studies. We find that optimization is critical for efficient use of high-end machines. Choosing the optimal configuration for the numerical library and compilers can result in a four-fold speedup of direct dynamics as compared with the default programming environment. The integration algorithm and parallelization approach must also be tailored for the computing environment. The efficacy of possible choices is discussed.

Archive | 2010