José Nelson Amaral | Researchain

Publication

Featured research published by José Nelson Amaral.

international conference on parallel architectures and compilation techniques | 2012

Evaluation of Blue Gene/Q hardware support for transactional memories

Amy Wang; Matthew Gaudet; Peng Wu; José Nelson Amaral; Martin Ohmacht; Christopher Barton; Raul Esteban Silvera; Maged M. Michael

This paper describes an end-to-end system implementation of the transactional memory (TM) programming model on top of the hardware transactional memory (HTM) of the Blue Gene/Q (BG/Q) machine. The TM programming model supports most C/C++ programming constructs on top of a best-effort HTM with the help of a complete software stack including the compiler, the kernel, and the TM runtime. An extensive evaluation of the STAMP benchmarks on BG/Q is the first of its kind in understanding characteristics of running coarse-grained TM workloads on HTMs. The study reveals several interesting insights on the overhead and the scalability of BG/Q HTM with respect to sequential execution, coarse-grain locking, and software TM.

programming language design and implementation | 2006

Shared memory programming for large scale machines

Christopher Barton; Călin Caşcaval; George S. Almasi; Yili Zheng; Montse Farreras; Siddhartha Chatterjee; José Nelson Amaral

This paper describes the design and implementation of a scalable run-time system and an optimizing compiler for Unified Parallel C (UPC). An experimental evaluation on BlueGene/L®, a distributed-memory machine, demonstrates that the combination of the compiler with the runtime system produces programs with performance comparable to that of efficient MPI programs and good performance scalability up to hundreds of thousands of processors. Our runtime system design solves the problem of maintaining shared object consistency efficiently in a distributed memory machine. Our compiler infrastructure simplifies the code generated for parallel loops in UPC through the elimination of affinity tests, eliminates several levels of indirection for accesses to segments of shared arrays that the compiler can prove to be local, and implements remote update operations through a lower-cost asynchronous message. The performance evaluation uses three well-known benchmarks --- HPC RandomAccess, HPC STREAM and NAS CG --- to obtain scaling and absolute performance numbers for these benchmarks on up to 131072 processors, the full BlueGene/L machine. These results were used to win the HPC Challenge Competition at SC05 in Seattle WA, demonstrating that PGAS languages support both productivity and performance.

systems man and cybernetics | 1995

Designing genetic algorithms for the state assignment problem

José Nelson Amaral; Kagan Tumer; Joydeep Ghosh

Finding the best state assignment for implementing a synchronous sequential circuit is important for reducing silicon area or chip count in many digital designs. This state assignment problem (SAP) belongs to a broader class of combinatorial optimization problems than the well studied traveling salesman problem, which can be formulated as a special case of SAP. The search for a good solution is considerably involved for the SAP due to a large number of equivalent solutions, and no effective heuristic has been found so far to cater to all types of circuits. In this paper, a matrix representation is used as the genotype for a genetic algorithm (GA) approach to this problem. A novel selection mechanism is introduced, and suitable genetic operators for crossover and mutation are constructed. The properties of each of these elements of the GA are discussed and an analysis of parameters that influence the algorithm is given. A canonical form for a solution is defined to significantly reduce the search space and number of local minima. Experiments with several examples show that the GA approach yields results that are often comparable to, or better than, those obtained using established heuristics that embody extensive domain knowledge.

IEEE Transactions on Education | 2005

Teaching digital design to computing science students in a single academic term

José Nelson Amaral; Paul Berube; Paras Mehta

How should digital design be taught to computing science students in a single one-semester course? This work advocates the use of state-of-the-art design tools and programmable devices and presents a series of laboratory exercises to help students learn digital logic. Each exercise introduces new concepts and produces the complete design of a stand-alone apparatus that is fun and interesting to use. These exercises lead to the most challenging capstone designs for a single-semester course of which the authors are aware. Fast progress is made possible by providing students with predesigned input/output modules. Student feedback demonstrates that the students approve of this methodology. An extensive set of slides, supporting teaching material, and laboratory exercises are freely available for downloading.

compiler construction | 2001

Speculative Prefetching of Induction Pointers

Artour Stoutchinin; José Nelson Amaral; Guang R. Gao; James C. Dehnert; Suneel Jain; Alban Douillet

We present an automatic approach for prefetching data for linked list data structures. The main idea is based on the observation that linked list elements are frequently allocated at constant distance from one another in the heap. When linked lists are traversed, a regular pattern of memory accesses with constant stride emerges. This regularity in the memory footprint of linked lists enables the development of a prefetching framework where the address of the element accessed in one of the future iterations of the loop is dynamically predicted based on its previous regular behavior. We automatically identify pointer-chasing recurrences in loops that access linked lists. This identification uses a surprisingly simple method that looks for induction pointers -- pointers that are updated in each loop iteration by a load with a constant offset. We integrate induction pointer prefetching with loop scheduling. A key intuition incorporated in our framework is to insert prefetches only if there are processor resources and memory bandwidth available. In order to estimate available memory bandwidth we calculate the number of potential cache misses in one loop iteration. Our estimation algorithm is based on an application of graph coloring on a memory access interference graph derived from the control flow graph. We implemented the prefetching framework in an industry-strength production compiler, and performed experiments on ten benchmark programs with linked lists. We observed performance improvements between 15% and 35% in three of them.
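The constant-stride observation in this abstract can be illustrated with a small model. The sketch below is not the paper's compiler framework: it only mimics the address prediction step, and the function name `predict_prefetch_address`, the two-stride agreement rule, and the sample addresses are assumptions made for illustration.

```python
from typing import List, Optional


def predict_prefetch_address(addresses: List[int], distance: int) -> Optional[int]:
    """Predict the node address `distance` iterations ahead, or return None.

    `addresses` holds the node addresses observed in successive loop
    iterations. A prediction is made only when the last two strides agree
    (an assumed, deliberately conservative rule).
    """
    if len(addresses) < 3:
        return None
    stride_prev = addresses[-2] - addresses[-3]
    stride_last = addresses[-1] - addresses[-2]
    if stride_prev != stride_last or stride_last == 0:
        return None  # no regular pattern, so issue no prefetch
    return addresses[-1] + distance * stride_last


# Hypothetical trace: list nodes allocated 0x30 bytes apart in the heap.
regular_trace = [0x1000, 0x1030, 0x1060]
predicted = predict_prefetch_address(regular_trace, distance=2)

# Hypothetical trace with an irregular jump: no prediction is made.
irregular = predict_prefetch_address([0x1000, 0x1030, 0x2000], distance=2)
```

In a compiler the same arithmetic would be emitted as instructions inside the loop, with the resulting address passed to a prefetch instruction. The bandwidth test that decides whether to insert the prefetch at all is not modeled here.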

mining software repositories | 2014

Syntax errors just aren't natural: improving error reporting with language models

Joshua Charles Campbell; Abram Hindle; José Nelson Amaral

A frustrating aspect of software development is that compiler error messages often fail to locate the actual cause of a syntax error. An errant semicolon or brace can result in many errors reported throughout the file. We seek to find the actual source of these syntax errors by relying on the consistency of software: valid source code is usually repetitive and unsurprising. We exploit this consistency by constructing a simple N-gram language model of lexed source code tokens. We implemented an automatic Java syntax-error locator using the corpus of the project itself and evaluated its performance on mutated source code from several projects. Our tool, trained on the past versions of a project, can effectively augment the syntax error locations produced by the native compiler. Thus we provide a methodology and tool that exploits the naturalness of software source code to detect syntax errors alongside the parser.
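A minimal sketch of the idea, assuming a trigram model with add-one smoothing (the abstract only says "a simple N-gram language model"): the model is trained on token sequences from valid code, and the token with the highest surprisal in a broken sequence is reported as the likely error location. The function names and the tiny corpus are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

START, END = "<s>", "</s>"


def train_trigrams(corpus: Sequence[Sequence[str]]) -> Tuple[Counter, Counter]:
    """Count trigrams and their two-token contexts over lexed token sequences."""
    trigrams, contexts = Counter(), Counter()
    for tokens in corpus:
        padded = [START, START] + list(tokens) + [END]
        for i in range(2, len(padded)):
            ctx = (padded[i - 2], padded[i - 1])
            trigrams[ctx + (padded[i],)] += 1
            contexts[ctx] += 1
    return trigrams, contexts


def surprisal(tokens: Sequence[str], trigrams: Counter, contexts: Counter,
              vocab_size: int) -> List[float]:
    """Return -log2 P(token | previous two tokens) with add-one smoothing."""
    padded = [START, START] + list(tokens) + [END]
    scores = []
    for i in range(2, len(padded)):
        ctx = (padded[i - 2], padded[i - 1])
        p = (trigrams[ctx + (padded[i],)] + 1) / (contexts[ctx] + vocab_size)
        scores.append(-math.log2(p))
    return scores


def most_surprising_index(tokens: Sequence[str], trigrams: Counter,
                          contexts: Counter, vocab_size: int) -> int:
    """Index of the token the model finds least expected."""
    scores = surprisal(tokens, trigrams, contexts, vocab_size)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical training corpus of already-lexed valid statements.
corpus = [
    ["int", "x", "=", "0", ";"],
    ["int", "y", "=", "1", ";"],
    ["int", "z", "=", "2", ";"],
]
vocab_size = len({tok for seq in corpus for tok in seq}) + 1  # +1 for END
trigrams, contexts = train_trigrams(corpus)

broken = ["int", "x", "0", ";"]  # the "=" token is missing
suspect = most_surprising_index(broken, trigrams, contexts, vocab_size)
```

Here `suspect` points at the token that follows the missing `=`, which is where a reader would start looking. A practical tool would need a much larger corpus and a real lexer.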

international symposium on memory management | 2008

MPADS: memory-pooling-assisted data splitting

Stephen Curial; Peng Zhao; José Nelson Amaral; Yaoqing Gao; Shimin Cui; Raul Esteban Silvera; Roch Georges Archambault

This paper describes Memory-Pooling-Assisted Data Splitting (MPADS), a framework that combines data structure splitting with memory pooling. Although MPADS may call to mind memory padding, a distinction of this framework is that it does not insert padding. MPADS relies on pointer analysis to ensure that splitting is safe and applicable to type-unsafe languages. MPADS makes no assumption about type safety. The analysis can identify cases in which the transformation could lead to incorrect code, and MPADS abandons those cases. To make data structure splitting efficient in a commercial compiler, MPADS is designed with great attention to reducing the number of instructions required to access the data after the data-structure splitting. Moreover, the implementation of MPADS reveals that architecture details should be considered carefully when re-arranging data allocation. For instance, one of the most significant gains from the introduction of data-structure splitting in code targeting the IBM POWER architecture is a dramatic decrease in the amount of data prefetched by the hardware prefetch engine without a noticeable decrease in the cache utilization. Triggering fewer hardware prefetch streams frees memory bandwidth and cache space. Fewer prefetching streams also reduce the interference between the data accessed by multiple cores in modern multicore processors.

acm symposium on parallel algorithms and architectures | 2007

Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms

Timothy Furtak; José Nelson Amaral; Robert Niewiadomski

Most contemporary processors offer some version of Single Instruction Multiple Data (SIMD) machinery - vector registers and instructions to manipulate data stored in such registers. The central idea of this paper is to use these SIMD resources to improve the performance of the tail of recursive sorting algorithms. When the number of elements to be sorted reaches a set threshold, data is loaded into the vector registers, manipulated in-register, and the result stored back to memory. Three implementations of sorting with two different SIMD machineries - x86-64's SSE2 and G5's AltiVec - demonstrate that this idea delivers significant speed improvements. The improvements provided are orthogonal to the gains obtained through empirical search for a suitable sorting algorithm [11]. When integrated with the Dynamically Tuned Sorting Library (DTSL) this new code generation strategy reduces the time spent by DTSL by up to 22% for moderately-sized arrays, with greater relative reductions for small arrays. Wall-clock performance of d-heaps is improved by up to 39% using a similar technique.
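The structure of "recursive sort with a fixed-size, branch-free tail" can be sketched in scalar code. The example below is only an analogy: it uses plain Python `min`/`max` in place of SIMD instructions, and the threshold of four elements, the standard five-comparator network, and the function names are assumptions rather than details from the paper.

```python
from typing import List

TAIL_SIZE = 4  # assumed threshold at which the recursion hands off to the network

# A standard 5-comparator sorting network for 4 inputs.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def sort4_network(values: List[float]) -> List[float]:
    """Sort up to 4 values with a fixed sequence of compare-exchange steps."""
    n = len(values)
    # Pad with +inf so the fixed network always sees exactly 4 lanes.
    lanes = list(values) + [float("inf")] * (TAIL_SIZE - n)
    for i, j in NETWORK_4:
        lo, hi = min(lanes[i], lanes[j]), max(lanes[i], lanes[j])
        lanes[i], lanes[j] = lo, hi
    return lanes[:n]  # padding sorts to the end and is dropped


def hybrid_quicksort(values: List[float]) -> List[float]:
    """Quicksort that switches to the sorting network for small partitions."""
    if len(values) <= TAIL_SIZE:
        return sort4_network(values)
    pivot = values[len(values) // 2]
    less = [v for v in values if v < pivot]
    equal = [v for v in values if v == pivot]
    greater = [v for v in values if v > pivot]
    return hybrid_quicksort(less) + equal + hybrid_quicksort(greater)


result = hybrid_quicksort([5, 3, 9, 1, 7, 2, 8, 2])
```

The compare-exchange steps contain no data-dependent branches, which is the property that lets such a network map onto vector minimum and maximum instructions.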

architectural support for programming languages and operating systems | 2010

Compiling Python to a hybrid execution environment

Rahul Garg; José Nelson Amaral

A new compilation framework enables the execution of numerical-intensive applications, written in Python, on a hybrid execution environment formed by a CPU and a GPU. This compiler automatically computes the set of memory locations that need to be transferred to the GPU, and produces the correct mapping between the CPU and the GPU address spaces. Thus, the programming model implements a virtual shared address space. This framework is implemented as a combination of unPython, an ahead-of-time compiler from Python/NumPy to the C programming language, and jit4GPU, a just-in-time compiler from C to the AMD CAL interface. Experimental evaluation demonstrates that for some benchmarks the generated GPU code is 50 times faster than generated OpenMP code. The GPU performance also compares favorably with optimized CPU BLAS code for single-precision computations in most cases.

ACM Transactions on Programming Languages and Systems | 2007

Forma: A framework for safe automatic array reshaping

Peng Zhao; Shimin Cui; Yaoqing Gao; Raul Esteban Silvera; José Nelson Amaral

This article presents Forma, a practical, safe, and automatic data reshaping framework that reorganizes arrays to improve data locality. Forma splits large aggregated data-types into smaller ones to improve data locality. Arrays of these large data types are then replaced by multiple arrays of the smaller types. These new arrays form natural data streams that have smaller memory footprints, better locality, and are more suitable for hardware stream prefetching. Forma consists of a field-sensitive alias analyzer, a data type checker, a portable structure reshaping planner, and an array reshaper. An extensive experimental study compares different data reshaping strategies in two dimensions: (1) how the data structure is split into smaller ones (maximal partition × frequency-based partition × affinity-based partition); and (2) how partitioned arrays are linked to preserve program semantics (address arithmetic-based reshaping × pointer-based reshaping). This study exposes important characteristics of array reshaping. First, a practical data reshaper needs not only an inter-procedural analysis but also a data-type checker to make sure that array reshaping is safe. Second, the performance improvement due to array reshaping can be dramatic: standard benchmarks can run up to 2.1 times faster after array reshaping. Array reshaping may also result in some performance degradation for certain benchmarks. An extensive micro-architecture-level performance study identifies the causes for this degradation. Third, the seemingly naive maximal partition achieves best or close-to-best performance in the benchmarks studied. This article presents an analysis that explains this surprising result. Finally, address-arithmetic-based reshaping always performs better than its pointer-based counterpart.
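The maximal partition described in this abstract, where every field of an aggregate type gets its own array, can be sketched as follows. This is a hand-written illustration of the data layout only, not Forma's compiler transformation; the `Particle` type, the function name, and the sample values are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass
class Particle:
    """Hypothetical aggregate type stored in one large array of records."""
    x: float
    y: float
    mass: float


def maximal_partition(records: List[Particle]) -> Dict[str, List[float]]:
    """Split an array of records into one array per field.

    Element i of every output array belongs to original record i, so the
    arrays stay linked by index rather than by stored pointers.
    """
    return {f.name: [getattr(r, f.name) for r in records]
            for f in fields(Particle)}


particles = [Particle(0.0, 1.0, 2.5), Particle(3.0, 4.0, 0.5)]
split = maximal_partition(particles)

# A loop that needs only `mass` now walks a single dense array.
total_mass = sum(split["mass"])
```

Linking the split arrays by a shared index loosely corresponds to the address-arithmetic-based reshaping mentioned in the abstract; a pointer-based variant would instead store explicit references between the pieces.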
