Publications


Featured research published by Claude Tadonki.


International Conference on Parallel Architectures and Compilation Techniques (PACT) | 2009

Algorithmic Skeletons within an Embedded Domain Specific Language for the CELL Processor

Tarik Saidani; Joel Falcou; Claude Tadonki; Lionel Lacassagne; Daniel Etiemble

Efficiently using the hardware capabilities of the Cell processor (a heterogeneous chip multiprocessor that uses several levels of parallelism to deliver high performance) while being able to reuse legacy code is a real challenge for application developers. We propose to use generative programming, more precisely template meta-programming, to design a domain-specific embedded language that uses algorithmic skeletons to generate applications from a high-level mapping description. The method is easy for developers to use and delivers performance close to that of optimized hand-written code, as shown on benchmarks ranging from simple BLAS kernels to image processing applications.
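
The skeleton idea can be sketched in a few lines of C++. This is a minimal illustration of a compile-time "map" skeleton built with templates, assuming a hypothetical interface; it is not the paper's actual API, which also covers mapping descriptions and Cell-specific code generation.

    #include <cstddef>

    // Minimal "map" skeleton: the computation structure is fixed at compile
    // time, while the per-element kernel is supplied as a template parameter.
    // Real skeleton libraries compose such blocks (map, pipeline, farm) and
    // generate the data movement for each target.
    template <typename Kernel>
    struct Map {
        template <typename T>
        static void apply(const T* in, T* out, std::size_t n) {
            for (std::size_t i = 0; i < n; ++i)
                out[i] = Kernel::compute(in[i]);
        }
    };

    // Example kernel: saturated scaling, typical of image processing code.
    struct Scale {
        static unsigned char compute(unsigned char p) {
            unsigned int v = 2u * p;
            return v > 255u ? 255u : static_cast<unsigned char>(v);
        }
    };

    // Usage: Map<Scale>::apply(src, dst, width * height);

Because the kernel is a template parameter, the compiler can inline it into the loop, which is how such skeletons can approach the performance of hand-written code.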


arXiv: Programming Languages | 2012

QIRAL: A High Level Language for Lattice QCD Code Generation

Denis Barthou; G. Grosdidier; Michael Kruse; O. Pène; Claude Tadonki

Quantum chromodynamics (QCD) is the theory of subnuclear physics, aiming at modeling the strong nuclear force, which is responsible for the interactions of nuclear particles. Lattice QCD (LQCD) is the corresponding discrete formulation, widely used for simulations. The computational demand of LQCD is tremendous: it has played a role in the history of supercomputers and has also helped define their future. Designing efficient LQCD codes that scale well on large (possibly hybrid) supercomputers requires expressing many levels of parallelism and then exploring different algorithmic solutions. While algorithmic exploration is the key to efficient parallel codes, the process is hampered by the necessary coding effort. We present in this paper a domain-specific language, QIRAL, for a high-level expression of parallel algorithms in LQCD. Parallelism is expressed through the mathematical structure of the sparse matrices defining the problem. We show that from these expressions, and from algorithmic and preconditioning formulations, a parallel code can be automatically generated. This separates algorithms and mathematical formulations for LQCD (which belong to the field of physics) from the effective orchestration of parallelism, mainly related to compilation and optimization for parallel architectures.
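
For context, the kind of preconditioning formulation that QIRAL lets physicists state separately from the implementation can be written as a generic left-preconditioned system (a textbook form, not an equation quoted from the paper):

    $M^{-1} A x = M^{-1} b$

where $A$ is the sparse operator defining the problem and $M \approx A$ is cheap to apply. An iterative solver then touches $A$ and $M$ only through matrix-vector products, and the sparsity structure of those products is exactly where the expressed parallelism comes from.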


European Conference on Computer Systems (EuroSys) | 2014

Excalibur: an autonomic cloud architecture for executing parallel applications

Alessandro Ferreira Leite; Tainá Raiol; Claude Tadonki; Maria Emilia Telles Walter; Christine Eisenbeis; Alba Cristina Magalhaes Alves de Melo

IaaS providers often allow users to specify many requirements for their applications. However, users without advanced technical knowledge usually do not provide a good specification of the cloud environment, leading to low performance and/or high monetary cost. In this context, users face the challenge of scaling cloud-unaware applications without re-engineering them. Therefore, in this paper, we propose and evaluate a cloud architecture, named Excalibur, to execute applications in the cloud. In our architecture, the users provide the applications, and the architecture sets up the whole environment and adjusts it at run time accordingly. We executed a genomics workflow in our architecture, deployed on Amazon EC2. The experiments show that the proposed architecture dynamically scales this cloud-unaware application up to 10 instances, reducing the execution time by 73% and the cost by 84% compared with the execution in the configuration specified by the user.
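
The autonomic behavior described above boils down to a monitor-analyze-act loop. The sketch below shows one plausible scaling decision in C++; the thresholds, metric names, and function are hypothetical illustrations, not Excalibur's actual interface.

    #include <algorithm>
    #include <cstddef>

    // Hypothetical autonomic scaling decision: scale out when saturated,
    // scale in when idle, otherwise hold. All thresholds are illustrative.
    struct Metrics { double cpu_load; std::size_t queued_tasks; };

    int desired_instances(const Metrics& m, int current, int max_instances) {
        if (m.cpu_load > 0.80 || m.queued_tasks > 100)   // saturated: scale out
            return std::min(current + 1, max_instances);
        if (m.cpu_load < 0.20 && m.queued_tasks == 0)    // idle: scale in to cut cost
            return std::max(current - 1, 1);
        return current;                                  // steady state
    }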


Proceedings of the Real World Domain Specific Languages Workshop (RWDSL 2018) | 2018

CFDlang: High-level code generation for high-order methods in fluid dynamics

Norman A. Rink; Immo Huismann; Adilla Susungi; Jeronimo Castrillon; Jörg Stiller; Jochen Fröhlich; Claude Tadonki

Numerical simulations continue to enable fast and enormous progress in science and engineering. Writing efficient numerical codes is a difficult challenge that encompasses a variety of tasks, from designing the right algorithms to exploiting the full potential of a platform's architecture. Domain-specific languages (DSLs) can ease these tasks by offering the right abstractions for expressing numerical problems. With the aid of domain knowledge, efficient code can then be generated automatically from abstract expressions. In this work, we present the CFDlang DSL for expressing the tensor operations that constitute the performance-critical code sections in a class of real numerical applications from fluid dynamics. We demonstrate that CFDlang can be used to automatically generate code that performs as well as, if not better than, carefully hand-optimized code.
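
As a point of reference, the following C++ shows the naive form of a tensor contraction, the kind of performance-critical operation such a DSL abstracts; this is a generic reference version, not code emitted by the actual CFDlang generator.

    #include <cstddef>

    // Naive tensor contraction C[i][j] = sum_k A[i][k] * B[k][j] over n x n
    // operands (here a matrix product, the simplest case). A DSL code
    // generator can tile, fuse, or vectorize this loop nest; this version
    // only pins down the semantics.
    void contract(const double* A, const double* B, double* C, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                double acc = 0.0;
                for (std::size_t k = 0; k < n; ++k)
                    acc += A[i * n + k] * B[k * n + j];
                C[i * n + j] = acc;
            }
    }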


Future Generation Computer Systems | 2018

Harris corner detection on a NUMA manycore

Olfa Haggui; Claude Tadonki; Lionel Lacassagne; Fatma Ezahra Sayadi; Bouraoui Ouni

Corner detection is a key kernel for many image processing procedures, including pattern recognition and motion detection. The latter, for instance, mainly relies on the corner points, for which spatial analyses are performed, typically on (possibly live) videos or temporal flows of images. Thus, highly efficient corner detection is essential to meet the real-time requirements of the associated applications. In this paper, we consider the corner detection algorithm proposed by Harris, whose main workflow is a composition of basic operators represented by their approximations using 3×3 matrices. The corresponding data access patterns follow a stencil model, which is known to require careful memory organization and management. Cache misses and other hindering factors specific to NUMA architectures need to be skillfully addressed in order to reach an efficient, scalable implementation. In addition, with increasingly wide vector registers, an efficient SIMD version should be designed and explicitly implemented. In this paper, we study a direct and explicit implementation of common and novel optimization strategies, and provide a NUMA-aware parallelization. Experimental results on a dual-socket Intel Broadwell-E/EP machine show noticeably good scalability.
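
The stencil access pattern at the heart of the pipeline looks as follows in C++. This is one generic 3×3 pass (with the Sobel horizontal-gradient coefficients as an example), not the paper's optimized implementation; the Harris detector composes several such passes (gradients, products, smoothing, corner response).

    // One 3x3 stencil pass over a w x h single-channel image.
    // Borders are skipped for simplicity.
    void stencil3x3(const float* in, float* out, int w, int h) {
        const float k[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
        for (int y = 1; y < h - 1; ++y)
            for (int x = 1; x < w - 1; ++x) {
                float acc = 0.0f;
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx)
                        acc += k[dy + 1][dx + 1] * in[(y + dy) * w + (x + dx)];
                out[y * w + x] = acc;
            }
    }

Each output pixel reads a 3×3 neighborhood, so consecutive iterations reuse most of their inputs; exploiting this reuse is what careful memory layout and NUMA-aware placement are about.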


arXiv: High Energy Physics - Lattice | 2014

Automated Code Generation for Lattice Quantum Chromodynamics and beyond

Denis Barthou; Olivier Brand-Foissac; O. Pène; Gilbert Grosdidier; Romain Dolbeau; Christine Eisenbeis; Michael Kruse; Konstantin Petrov; Claude Tadonki

We present our ongoing work on a domain-specific language which aims to simplify Monte-Carlo simulations and measurements in the domain of Lattice Quantum Chromodynamics. The tool-chain, called QIRAL, is used to produce high-performance OpenMP C code from LaTeX sources. We discuss conceptual issues and details of implementation and optimization, and we compare the performance of the generated code with that of well-established simulation software.
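
The generated OpenMP C code ultimately revolves around sparse linear algebra kernels. The sketch below (written here in C++ with OpenMP) shows the generic shape of such a kernel, a CSR sparse matrix-vector product; it illustrates the kind of code involved, and is not output of the actual QIRAL tool-chain.

    #include <cstddef>
    // Compile with OpenMP enabled, e.g. g++ -fopenmp.

    // Sparse matrix-vector product y = A*x in CSR format, rows in parallel.
    void spmv_csr(long long n, const std::size_t* row_ptr,
                  const std::size_t* col_idx, const double* val,
                  const double* x, double* y) {
        #pragma omp parallel for
        for (long long i = 0; i < n; ++i) {
            double acc = 0.0;
            for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                acc += val[k] * x[col_idx[k]];
            y[i] = acc;
        }
    }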


Parallel and Distributed Computing and Networks | 2014

Seamless Multicore Parallelism in MATLAB

Claude Tadonki; Pierre-Louis Caruana

MATLAB is a popular mathematical framework composed of a built-in library implementing a significant set of commonly needed routines. It also provides a language which allows the user to script macro calculations or to write complete programs, hence its tagline “the language of technical computing”. A noticeable effort is maintained to keep MATLAB able to cooperate with other standard programming languages and tools. However, this interoperability, which is essential in many circumstances including performance and portability, is not always easy for ordinary scientists to implement. The case of parallel computing is illustrative and needs to be addressed, as multicore machines are now standard. In this work, we report our efforts to provide a framework that allows the user to intuitively express and launch parallel executions within classical MATLAB code. We study two alternatives: one is a pure MATLAB solution based on the MATLAB Parallel Computing Toolbox; the other involves a symmetric cooperation between MATLAB and C, based on the Pthread library. The latter solution does not require the MATLAB parallel toolbox, thus clearly brings a portability benefit and makes the move to parallel computing within MATLAB less costly for standard users. Experimental results are provided and discussed to illustrate the use and the efficiency of our solution.
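
The C side of such a symmetric MATLAB/C cooperation reduces to slicing the data across Pthreads. The sketch below (C++, using the Pthread API) shows that core; the MEX glue that exchanges arrays with MATLAB is omitted, and all names are illustrative rather than taken from the paper.

    #include <pthread.h>
    #include <cstddef>

    // Each worker processes one contiguous slice of the shared arrays.
    struct Slice { const double* in; double* out; std::size_t begin, end; };

    void* worker(void* arg) {
        Slice* s = static_cast<Slice*>(arg);
        for (std::size_t i = s->begin; i < s->end; ++i)
            s->out[i] = s->in[i] * s->in[i];  // stand-in for the user's kernel
        return nullptr;
    }

    // Spawn nthreads workers (nthreads <= 64 assumed) and wait for them.
    void run_parallel(const double* in, double* out, std::size_t n, int nthreads) {
        pthread_t tid[64];
        Slice slices[64];
        std::size_t chunk = n / nthreads;
        for (int t = 0; t < nthreads; ++t) {
            std::size_t begin = t * chunk;
            std::size_t end = (t == nthreads - 1) ? n : begin + chunk;
            slices[t] = { in, out, begin, end };
            pthread_create(&tid[t], nullptr, worker, &slices[t]);
        }
        for (int t = 0; t < nthreads; ++t)
            pthread_join(tid[t], nullptr);
    }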


International Conference on Image and Signal Processing (ICISP) | 2012

3D shape retrieval using bag-of-feature method basing on local codebooks

El Mostafa Daoudi; Claude Tadonki

Recent investigations show that view-based methods with pose-normalization pre-processing achieve better performance in retrieving rigid models than other approaches, and they remain the most popular and practical methods in the field of 3D shape retrieval [9,10,11,12]. In this paper, we present an improvement of the BF-SIFT method proposed by Ohbuchi et al. [1]. This method uses a bag-of-features approach to integrate a set of features, extracted from 2D views of the 3D objects with the SIFT (Scale-Invariant Feature Transform [2]) algorithm, into a histogram by vector quantization against a global visual codebook. To improve retrieval performance, we propose to associate with each 3D object its own local visual codebook instead of a single global codebook. Experimental results on the Princeton Shape Benchmark database [3] show that the proposed method performs better than the original one.
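
The vector-quantization step that turns SIFT descriptors into a histogram can be sketched as follows in C++ (a generic nearest-codeword assignment with hypothetical signatures; in the paper's variant, the codebook passed in would be the local one of each 3D object rather than a global one).

    #include <cstddef>
    #include <vector>
    #include <limits>

    // Assign each descriptor to its nearest codeword (squared Euclidean
    // distance) and return the histogram of assignments.
    std::vector<int> quantize(const std::vector<std::vector<float>>& descriptors,
                              const std::vector<std::vector<float>>& codebook) {
        std::vector<int> histogram(codebook.size(), 0);
        for (const auto& d : descriptors) {
            std::size_t best = 0;
            float best_dist = std::numeric_limits<float>::max();
            for (std::size_t c = 0; c < codebook.size(); ++c) {
                float dist = 0.0f;
                for (std::size_t k = 0; k < d.size(); ++k) {
                    float diff = d[k] - codebook[c][k];
                    dist += diff * diff;
                }
                if (dist < best_dist) { best_dist = dist; best = c; }
            }
            ++histogram[best];
        }
        return histogram;
    }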


Concurrency and Computation: Practice and Experience | 2018

Performance comparison between Hadoop and Spark frameworks using HiBench benchmarks

Yassir Samadi; Mostapha Zbakh; Claude Tadonki

Big Data has become one of the major areas of research for cloud service providers, due to the large amount of data produced every day and the inability of traditional algorithms and technologies to handle it. Big Data, with characteristics such as volume, velocity, and variety (the 3Vs), requires efficient technologies for real-time processing. To process and analyze this vast amount of data, there are many powerful tools, such as Hadoop and Spark, which are mainly used in the context of Big Data and work according to the principles of parallel computing. The challenge is to determine which Big Data tool is better suited to a given processing context. In this paper, we present and discuss a performance comparison between two popular Big Data frameworks deployed on virtual machines. Hadoop MapReduce and Apache Spark both efficiently process vast amounts of data in parallel and distributed mode on large clusters, and both are suited to Big Data processing. We also present the execution results of Apache Hadoop on Amazon EC2, a major cloud computing environment. To compare the performance of the two frameworks, we use the HiBench benchmark suite, an experimental approach to measuring the effectiveness of a computer system. The comparison is based on three criteria: execution time, throughput, and speedup. We test the Wordcount workload with different data sizes for more accurate results. Our experimental results show that the performance of these frameworks varies significantly with the use case. Furthermore, we conclude that Spark is more efficient than Hadoop at dealing with large amounts of data in most cases. However, Spark requires a higher memory allocation, since it loads the data to be processed into memory and keeps it in cache for a while, like standard databases. The choice therefore depends on the required performance level and the memory constraints.
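
For reference, the three criteria can be read with their usual definitions (our phrasing, not formulas quoted from the paper):

    $\text{speedup} = \dfrac{T_{\text{baseline}}}{T_{\text{measured}}}, \qquad \text{throughput} = \dfrac{\text{input size}}{\text{execution time}}$

so, for example, a speedup of 2 for Spark over Hadoop on a workload means Spark finished in half of Hadoop's execution time.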


2011 Second Workshop on Architecture and Multi-Core Applications (WAMCA 2011) | 2011

Large Scale Kronecker Product on Supercomputers

Claude Tadonki

The Kronecker product, also called the tensor product, is a fundamental matrix algebra operation, widely used as a natural formalism to express a convolution of many interactions or representations. Given a set of matrices, we need to multiply their Kronecker product by a vector. This operation is a critical kernel of iterative algorithms and thus needs to be computed efficiently. In a previous work, we proposed a cost-optimal parallel algorithm for the problem, both in terms of floating-point computation time and interprocessor communication steps. However, the lower bound on data transfers can only be achieved if we actually perform (local) logarithmic broadcasts; in practice, we use a communication loop instead. It therefore becomes important to account for the real cost of each broadcast. As this local broadcast is performed simultaneously by each processor, the situation worsens on a large number of processors (supercomputers). We address the problem in this paper from two angles. On the one hand, we propose a way to build a virtual topology with the smallest gap to the theoretical lower bound. On the other hand, we consider a hybrid implementation, which has the advantage of reducing the number of communicating nodes. We illustrate our work with benchmarks on a large 8-core SMP supercomputer.
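
The key identity behind computing such products without ever forming the Kronecker product explicitly is the standard relation (a well-known matrix algebra fact, stated here for context): for $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{p \times q}$, and $X \in \mathbb{R}^{q \times n}$,

    $(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(B X A^{\top})$

so the multiplication reduces to two ordinary matrix products, which is what makes cost-optimal parallel schedules possible in the first place.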

Collaboration


Dive into Claude Tadonki's collaboration.

Top Co-Authors

Christine Eisenbeis

French Institute for Research in Computer Science and Automation


O. Pène

University of Paris-Sud


Jeronimo Castrillon

Dresden University of Technology
