Publication


Featured research published by Richard Lethin.


General Purpose Processing on Graphics Processing Units | 2010

A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction

Allen K. Leung; Nicolas Vasilache; Benoît Meister; Muthu Manikandan Baskaran; David E. Wohlford; Cédric Bastoul; Richard Lethin

Programmers for GPGPU face a rapidly changing substrate of programming abstractions, execution models, and hardware implementations. It has been established, through numerous demonstrations for particular conjunctions of application kernel, programming language, and GPU hardware instance, that significant improvements in price/performance and energy/performance over general purpose processors are achievable. But these demonstrations are each the result of significant dedicated programmer labor, which is likely to be duplicated for each new GPU hardware architecture to achieve performance portability.

This paper discusses the implementation, in the R-Stream compiler, of a source-to-source mapping pathway from a high-level, textbook-style algorithm expression method in ANSI C to multi-GPGPU accelerated computers. The compiler performs hierarchical decomposition and parallelization of the algorithm between and across the host, multiple GPGPUs, and within each GPU. The semantic transformations are expressed within the polyhedral model, including optimization of integrated parallelization, locality, and contiguity tradeoffs. Hierarchical tiling is performed, and communication and synchronization operations at multiple levels are generated automatically. The resulting mapping is currently emitted in the CUDA programming language.

The GPU backend adds to the range of hardware and accelerator targets for R-Stream and indicates the potential for performance portability of single sources across multiple hardware targets.
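
To make the flavor of this mapping concrete, here is a minimal sketch, assuming a trivial SAXPY kernel and an even split across devices; it illustrates the kind of host/device decomposition and CUDA emission the abstract describes, not R-Stream's actual output. The kernel name, thread-block size, and use of managed memory are assumptions.

#include <cuda_runtime.h>

/* Textbook-style ANSI C source, the level at which the paper's input is written: */
void saxpy(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* One plausible shape for the emitted CUDA: the iteration space is tiled
   and one tile element is mapped to each thread. */
__global__ void saxpy_kernel(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

/* Host side: slice the iteration space across multiple GPUs (assumes x and y
   were allocated with cudaMallocManaged so every device can address them). */
void saxpy_multi_gpu(int n, float a, const float *x, float *y, int ndev) {
    int chunk = (n + ndev - 1) / ndev;
    for (int d = 0; d < ndev; d++) {
        cudaSetDevice(d);
        int lo = d * chunk;
        int len = (lo + chunk > n) ? n - lo : chunk;
        if (len > 0)
            saxpy_kernel<<<(len + 255) / 256, 256>>>(len, a, x + lo, y + lo);
    }
}

The real mapping path handles deep loop nests, tiling, and inter-device communication automatically; the sketch only shows the structure that programmers currently write by hand for each new target.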


High-Performance Computer Architecture | 2013

Runnemede: An architecture for Ubiquitous High-Performance Computing

Nicholas P. Carter; Aditya Agrawal; Shekhar Borkar; Romain Cledat; Howard S. David; Dave Dunning; Joshua B. Fryman; Ivan Ganev; Roger A. Golliver; Rob C. Knauerhase; Richard Lethin; Benoît Meister; Asit K. Mishra; Wilfred R. Pinfold; Justin Teller; Josep Torrellas; Nicolas Vasilache; Ganesh Venkatesh; Jianping Xu

DARPA's Ubiquitous High-Performance Computing (UHPC) program asked researchers to develop computing systems capable of achieving energy efficiencies of 50 GOPS/Watt, assuming 2018-era fabrication technologies. This paper describes Runnemede, the research architecture developed by the Intel-led UHPC team. Runnemede is being developed through a co-design process that considers the hardware, the runtime/OS, and applications simultaneously. Near-threshold voltage operation, fine-grained power and clock management, and separate execution units for runtime and application code are used to reduce energy consumption. Memory energy is minimized through application-managed on-chip memory and direct physical addressing. A hierarchical on-chip network reduces communication energy, and a codelet-based execution model supports extreme parallelism and fine-grained tasks. We present an initial evaluation of Runnemede that shows the design process for our on-chip network, demonstrates 2-4x improvements in memory energy from explicit control of on-chip memory, and illustrates the impact of hardware-software co-design on the energy consumption of a synthetic aperture radar algorithm on our architecture.
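
A loose sketch of what a codelet in such an execution model looks like: a non-preemptive unit of work that becomes runnable only once its input dependences are satisfied. This is a generic illustration of the codelet idea, not Runnemede's actual runtime API; codelet_t, runtime_enqueue, and the dependence counter are assumptions.

#include <stdatomic.h>

typedef struct codelet {
    atomic_int deps_remaining;   /* producers still outstanding */
    void (*fn)(void *);          /* body: runs to completion, never blocks */
    void *arg;
} codelet_t;

/* Hypothetical scheduler hook that puts a ready codelet on a work queue. */
extern void runtime_enqueue(codelet_t *c);

/* Called by each producer when it has satisfied one dependence of c.
   The codelet becomes runnable exactly once, when the count reaches zero. */
void codelet_signal(codelet_t *c) {
    if (atomic_fetch_sub(&c->deps_remaining, 1) == 1)
        runtime_enqueue(c);
}

Tasks of this shape give the runtime the fine-grained, independently schedulable units that the abstract's "extreme parallelism" rests on.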


Proceedings of SPIE | 2010

Generation of high-performance protocol-aware analyzers with applications in intrusion detection systems

Jordi Ros-Giralt; Peter Szilagyi; James Ezick; David E. Wohlford; Richard Lethin

Traditional Intrusion Detection and Prevention (IDP) systems scan packets quickly by applying simple byte-wise pattern signatures to network flows. Such a protocol-agnostic approach can be compromised with polymorphic attacks: slight modifications of exploits that bypass pattern signatures but still reach corresponding vulnerabilities. To protect against these attacks, a solution is to provision the IDP system with protocol awareness, at the risk of degrading performance. To balance vulnerability coverage against network performance, we introduce a hardware-aware, compiler-based platform that leverages hardware engines to accelerate the core functions of protocol parsing and protocol-aware signature evaluation.
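
The contrast the paper draws can be sketched in a few lines of C. Both checks below are hypothetical signatures for an invented "cmd=" protocol field; the assumed vulnerable bound of 64 bytes is illustrative only.

#include <string.h>

/* Byte-wise pattern signature: fast, but a polymorphic re-encoding of the
   exploit that avoids this exact byte string slips through. Assumes a
   NUL-terminated payload for brevity. */
int bytewise_match(const char *payload) {
    return strstr(payload, "cmd=exploit") != NULL;
}

/* Protocol-aware signature: parse the field and test the condition that
   actually reaches the vulnerability (an overlong value), so superficial
   rewrites of the payload no longer evade detection. */
int protocol_aware_match(const char *payload) {
    const char *p = strstr(payload, "cmd=");
    if (!p)
        return 0;
    size_t value_len = strcspn(p + 4, "&\r\n");  /* length of the field value */
    return value_len > 64;  /* hypothetical vulnerable buffer bound */
}

The paper's contribution is generating parsers and checks of the second kind from protocol descriptions and accelerating them with hardware engines, so the added awareness does not sacrifice line-rate performance.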


Certified Programs and Proofs | 2016

A unified Coq framework for verifying C programs with floating-point computations

Tahina Ramananandro; Paul Mountcastle; Benoît Meister; Richard Lethin

We provide concrete evidence that floating-point computations in C programs can be verified in a homogeneous verification setting based on Coq only, by evaluating the practicality of combining the formal semantics of CompCert Clight with the Flocq formal specification of IEEE 754 floating-point arithmetic to verify properties of floating-point computations in C programs. To this end, we develop a framework that automatically computes real-number expressions of C floating-point computations, with rounding error terms, along with their correctness proofs. We apply our framework to the complete analysis of an energy-efficient C implementation of a radar image processing algorithm, for which we provide a certified bound on the total noise introduced by floating-point rounding errors and by energy-efficient approximations of square root and sine.
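
The kind of fact such a framework manipulates is the standard IEEE 754 rounding model that Flocq formalizes; a sketch, for binary64 and a basic operation \(\circ\):

\[
\mathrm{fl}(x \circ y) = (x \circ y)(1 + \delta) + \eta,
\qquad |\delta| \le 2^{-53}, \quad |\eta| \le 2^{-1075}, \quad \delta\eta = 0.
\]

Composing this per-operation bound is what yields a real-number expression with error terms: for instance, absent underflow, the computed sum \(\mathrm{fl}(\mathrm{fl}(a+b)+c)\) differs from \(a+b+c\) by at most roughly \(2^{-53}(|a+b| + |a+b+c|)\), up to second-order terms. The framework's contribution is producing such expressions, and Coq proofs of them, automatically for actual C code.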


IEEE High Performance Extreme Computing Conference | 2014

Low-overhead load-balanced scheduling for sparse tensor computations

Muthu Manikandan Baskaran; Benoît Meister; Richard Lethin

Irregular computations over large-scale sparse data are prevalent in critical data applications and they have significant room for improvement on modern computer systems from the aspects of parallelism and data locality. We introduce new techniques to efficiently map large irregular computations onto modern multi-core systems with non-uniform memory access (NUMA) behavior. Our techniques are broadly applicable for irregular computations with multi-dimensional sparse arrays (or sparse tensors). We implement a static-cum-dynamic task scheduling scheme with low overhead for effective parallelization of sparse computations. We introduce locality-aware optimizations to the task scheduling mechanism that are driven by the sparse input data pattern. We evaluate our techniques using two popular sparse tensor decomposition methods that have wide applications in data mining, graph analysis, signal processing, and elsewhere. Our techniques not only improve parallel performance but also result in improved performance scalability with an increasing number of cores. We achieve around 4-5× improvement in performance over existing parallel approaches and observe “scalable” parallel performance on modern multi-core systems with up to 32 processor cores. We take real sparse data sets as input to the sparse tensor computations and demonstrate the achieved improvements.
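
A minimal sketch of the static-cum-dynamic idea, using OpenMP C and a CSR sparse matrix-vector product as a simpler stand-in for the paper's sparse tensor kernels: tasks are built offline to hold roughly equal nonzero counts (the static part), then claimed by threads at run time (the dynamic part). The task arrays and names here are assumptions, not the paper's implementation.

#include <omp.h>

/* task_row_lo/hi delimit row blocks that were grouped offline so each task
   covers a roughly equal number of nonzeros. */
void spmv_tasks(int ntasks, const int *task_row_lo, const int *task_row_hi,
                const int *rowptr, const int *col, const double *val,
                const double *x, double *y) {
    /* Dynamic claiming of the pre-balanced tasks keeps threads busy even
       when nonzeros are skewed, at far lower overhead than scheduling
       individual rows dynamically. */
    #pragma omp parallel for schedule(dynamic, 1)
    for (int t = 0; t < ntasks; t++)
        for (int i = task_row_lo[t]; i < task_row_hi[t]; i++) {
            double s = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                s += val[k] * x[col[k]];
            y[i] = s;
        }
}

On top of this basic scheme, the paper makes task placement NUMA- and locality-aware, driven by the sparsity pattern of the input.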


Architectural Support for Programming Languages and Operating Systems | 2013

Memory reuse optimizations in the R-Stream compiler

Nicolas Vasilache; Muthu Manikandan Baskaran; Benoît Meister; Richard Lethin

We propose a new set of automated techniques to optimize memory reuse in programs with explicitly managed memory. Our techniques are inspired by hand-tuned seismic kernels on GPUs. The solutions we develop reduce the cost of transferring data across multiple memories with different bandwidth, latency and addressability properties. They result in reduction of communication volumes from main memory and faster execution speeds, comparable to hand-tuned implementations, for out-of-place stencils. We discuss various steps of our source-to-source compiler infrastructure and focus on specific optimizations which comprise: flexible generation of different granularities of communications with respect to computations, reduction of redundant transfers, reuse of data across processing elements using a globally addressable local memory and reuse of data within the same processing elements using a local private memory. The models of memory we consider in our techniques support the GPU model with device, shared and register memories. The techniques we derive are generally applicable and their formulation within our compiler can be extended to other types of architectures.
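
The core reuse pattern the compiler automates can be sketched in a few lines of CUDA C: stage a tile plus its halo into explicitly managed local (shared) memory once, then let neighboring threads reuse it, cutting transfers from the slower global memory. A 3-point out-of-place stencil, illustrative only (TILE and the kernel name are assumptions; launch with TILE threads per block):

#define TILE 256

__global__ void stencil3(const float *in, float *out, int n) {
    __shared__ float buf[TILE + 2];              /* tile plus 1-point halo */
    int i = blockIdx.x * TILE + threadIdx.x;

    if (i < n)
        buf[threadIdx.x + 1] = in[i];            /* one global load per element */
    if (threadIdx.x == 0 && i > 0)
        buf[0] = in[i - 1];                      /* left halo */
    if (threadIdx.x == TILE - 1 && i + 1 < n)
        buf[TILE + 1] = in[i + 1];               /* right halo */
    __syncthreads();

    if (i > 0 && i + 1 < n)                      /* 3 reads, all from shared memory */
        out[i] = (buf[threadIdx.x] + buf[threadIdx.x + 1]
                  + buf[threadIdx.x + 2]) / 3.0f;
}

The paper's techniques generate this kind of staging automatically, choose the granularity of the transfers, and eliminate redundant ones across iterations and processing elements.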


Software Visualization | 2015

Polyhedral user mapping and assistant visualizer tool for the R-Stream auto-parallelizing compiler

Eric Papenhausen; Bing Wang; Harper Langston; Muthu Manikandan Baskaran; Thomas Henretty; Taku Izubuchi; Ann Johnson; Chulwoo Jung; Meifeng Lin; Benoît Meister; Klaus Mueller; Richard Lethin

Existing high-level, source-to-source compilers can accept input programs in a high-level language (e.g., C) and perform complex automatic parallelization and other mappings using various optimizations. These optimizations often require trade-offs and can benefit from the user's involvement in the process. However, because of the inherent complexity, the barrier to entry for new users of these high-level optimizing compilers can often be high. We propose visualization as an effective gateway for non-expert users to gain insight into the effects of parameter choices and so aid them in selecting the levels best suited to their specific optimization goals.


IEEE High Performance Extreme Computing Conference | 2015

Optimization of symmetric tensor computations

Jonathon Cai; Muthu Manikandan Baskaran; Benoît Meister; Richard Lethin

For applications that deal with large amounts of high dimensional multi-aspect data, it is natural to represent such data as tensors or multi-way arrays. Tensor computations, such as tensor decompositions, are increasingly being used to extract and explain properties of such data. An important class of tensors is the symmetric tensor, which shows up in real-world applications such as signal processing, biomedical engineering, and data analysis. In this work, we describe novel optimizations that exploit the symmetry in tensors in order to reduce redundancy in computations and storage and effectively parallelize operations involving symmetric tensors. Specifically, we apply our optimizations on the matricized tensor times Khatri-Rao product (MTTKRP) operation, a key operation in tensor decomposition algorithms such as INDSCAL (individual differences in scaling) for symmetric tensors. We demonstrate improved performance for both sequential and parallel execution using our techniques on various synthetic and real data sets.
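
One way to see the saving, as a hedged C sketch rather than the paper's algorithm: for a 3-way tensor symmetric in its last two modes (T[i][j][k] == T[i][k][j]), only entries with j <= k need to be stored and visited, with off-diagonal entries weighted by their multiplicity. The example computes y[i] = sum over j,k of T[i][j][k]·x[j]·x[k]; the packed layout and function name are assumptions, and the paper's MTTKRP optimization generalizes the same idea.

#include <stddef.h>

/* T is packed per slice i: entries (j, k) with j <= k, n*(n+1)/2 per slice. */
void sym_ttv2(int n, const double *T, const double *x, double *y) {
    size_t slice = (size_t)n * (n + 1) / 2;
    for (int i = 0; i < n; i++) {
        const double *Ti = T + (size_t)i * slice;
        double s = 0.0;
        size_t idx = 0;
        for (int j = 0; j < n; j++)
            for (int k = j; k < n; k++, idx++) {
                double w = (j == k) ? 1.0 : 2.0;  /* symmetry multiplicity */
                s += w * Ti[idx] * x[j] * x[k];
            }
        y[i] = s;
    }
}

Storage and work both drop by nearly half here; exploiting full three-way symmetry, as the paper's setting allows, can save up to roughly a factor of six for order-3 tensors.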


International Conference on Supercomputing | 2014

Parallelizing and optimizing sparse tensor computations

Muthu Manikandan Baskaran; Benoît Meister; Richard Lethin

Irregular computations over large-scale sparse data are prevalent in critical data applications and they have significant room for improvement on modern computer systems from the aspects of parallelism and data locality. We introduce new techniques to efficiently map large irregular computations with multi-dimensional sparse arrays (or sparse tensors) onto modern multi-core systems with non-uniform memory access (NUMA) behavior. We implement a static-cum-dynamic task scheduling scheme with low overhead for effective parallelization of sparse computations. We introduce locality-aware optimizations to the task scheduling mechanism that are driven by the sparse input data pattern. We evaluate our techniques on key sparse tensor decomposition methods that are widely used in areas such as data mining, graph analysis, and elsewhere. We achieve around 4-5x improvement in performance over existing parallel approaches and observe scalable parallel performance on modern multi-core systems with up to 32 processor cores.


International Conference on Parallel Processing | 2016

Scalable Hierarchical Polyhedral Compilation

Benoît Pradelle; Benoît Meister; Muthu Manikandan Baskaran; Athanasios Konstantinidis; Thomas Henretty; Richard Lethin

Computers across the board, from embedded devices to future exascale machines, are consistently designed with deeper memory hierarchies. While this opens up exciting opportunities for improving software performance and energy efficiency, it also makes it increasingly difficult to efficiently exploit the hardware. Advanced compilation techniques are a possible solution to this difficult problem and, among them, the polyhedral compilation technology provides a pathway for performing advanced automatic parallelization and code transformations. However, the polyhedral model is also known for its poor scalability with respect to the number of dimensions in the polyhedra that are used for representing programs. Although current compilers can cope with this limitation when targeting shallow hierarchies, polyhedral optimizations often become intractable as soon as deeper hardware hierarchies are considered. We address this problem by introducing two new operators for polyhedral compilers: focalisation and defocalisation. When applied in the compilation flow, the new operators reduce the dimensionality of polyhedra, which drastically simplifies the mathematical problems solved during the compilation. We prove that the presented operators preserve the original program semantics, allowing them to be safely used in compilers. We implemented the operators in a production compiler, which drastically improved its performance and scalability when targeting deep hierarchies.
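
A rough intuition for the dimensionality problem (a simplified picture under our own assumptions, not the paper's formal definition of the operators): each level of hierarchical tiling multiplies the number of polyhedral dimensions. For a \(d\)-dimensional iteration space tiled across \(h\) hardware levels,

\[
\{\, \vec{i} \in \mathbb{Z}^{d} : A\vec{i} \ge \vec{b} \,\}
\;\longrightarrow\;
\{\, (\vec{t}_1, \dots, \vec{t}_h, \vec{i}) \in \mathbb{Z}^{(h+1)d} : \dots \,\},
\]

while the cost of the integer programming problems solved during scheduling grows very quickly, often exponentially, in that dimension. A focalisation-style operator works on one level at a time, treating the outer tile coordinates as fixed parameters so each subproblem stays close to \(d\)-dimensional; defocalisation then re-embeds the result into the full space, which is where the paper's semantics-preservation proofs matter.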

Collaboration


Dive into Richard Lethin's collaborations.

Top Co-Authors

Eric Papenhausen

State University of New York System

Klaus Mueller

State University of New York System

Bing Wang

State University of New York System

Chulwoo Jung

Brookhaven National Laboratory
