Is this you? Create Your Porfile

James A. Ross

United States Army Research Laboratory

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where James A. Ross is active.

Explore More

Publication

Featured researches published by James A. Ross.

Microprocessors and Microsystems | 2016

Parallel programming model for the Epiphany many-core coprocessor using threaded MPI

James A. Ross; David A. Richie; Song Jun Park; Dale R. Shires

We investigate the use of MPI for programming the Epiphany RISC array processor.A threaded MPI implementation adapted for coprocessor offload is presented.Existing MPI code for four scientific applications was re-used with minimal changes.Demonstrated performance exceeds 12 GFLOPS with an efficiency over 20GFLOPS/W.Threaded MPI exhibits the highest performance reported using a standard parallel API. The Adapteva Epiphany many-core architecture comprises a 2D tiled mesh Network-on-Chip (NoC) of low-power RISC cores with minimal uncore functionality. It offers high computational energy efficiency for both integer and floating point calculations as well as parallel scalability. Yet despite the interesting architectural features, a compelling programming model has not been presented to date. This paper demonstrates an efficient parallel programming model for the Epiphany architecture based on the Message Passing Interface (MPI) standard. Using MPI exploits the similarities between the Epiphany architecture and a conventional parallel distributed cluster of serial cores. Our approach enables MPI codes to execute on the RISC array processor with little modification and achieve high performance. We report benchmark results for the threaded MPI implementation of four algorithms (dense matrix-matrix multiplication, N-body particle interaction, five-point 2D stencil update, and 2D FFT) and highlight the importance of fast inter-core communication for the architecture.

international conference on conceptual structures | 2016

Implementing OpenSHMEM for the Adapteva Epiphany RISC Array Processor

James A. Ross; David A. Richie

The energy-efficient Adapteva Epiphany architecture exhibits massive many-core scalability in a physically compact 2D array of RISC cores with a fast network-on-chip (NoC). With fully divergent cores capable of MIMD execution, the physical topology and memory-mapped capabilities of the core and network translate well to partitioned global address space (PGAS) parallel programming models. Following an investigation into the use of two-sided communication using threaded MPI, one-sided communication using SHMEM is being explored. Here we present work in progress on the development of an OpenSHMEM 1.2 implementation for the Epiphany architecture.

arXiv: Distributed, Parallel, and Cluster Computing | 2016

An OpenSHMEM Implementation for the Adapteva Epiphany Coprocessor

James A. Ross; David A. Richie

This paper reports the implementation and performance evaluation of the OpenSHMEM 1.3 specification for the Adapteva Epiphany architecture within the Parallella single-board computer. The Epiphany architecture exhibits massive many-core scalability with a physically compact 2D array of RISC CPU cores and a fast network-on-chip (NoC). While fully capable of MPMD execution, the physical topology and memory-mapped capabilities of the core and network translate well to Partitioned Global Address Space (PGAS) programming models and SPMD execution with SHMEM.

international conference on conceptual structures | 2016

Advances in Run-time Performance and Interoperability for the Adapteva Epiphany Coprocessor

David A. Richie; James A. Ross

The energy-efficient Adapteva Epiphany architecture exhibits massive many-core scalability in a physically compact 2D array of RISC cores with a fast network-on-chip (NoC). The architecture presents many features and constraints which contribute to software design challenges for the application developer. Addressing these challenges within the software stack that supports application development is critical to improving productivity and expanding the range of applications for the architecture. We report here on advances that have been made in the COPRTHR-2 software stack targeting the Epiphany architecture that address critical issues identified in previous work. Specifically, we describe improvements that bring greater control and precision to the design of compact compiled binary programs in the context of the limited per-core local memory of the architecture. We describe a new design for run-time support that has been implemented to dramatically improve the program load and execute performance and capabilities. Finally, we describe developments that advance host-coprocessor interoperability to expand the functionality available to the application developer.

ieee international conference on high performance computing data and analytics | 2016

Truenorth ecosystem for brain-inspired computing: scalable systems, software, and applications

Jun Sawada; Filipp Akopyan; Andrew S. Cassidy; Brian Taba; Michael DeBole; Pallab Datta; Rodrigo Alvarez-Icaza; Arnon Amir; John V. Arthur; Alexander Andreopoulos; Rathinakumar Appuswamy; Heinz Baier; Davis; David J. Berg; Carmelo di Nolfo; Steven K. Esser; Myron Flickner; Thomas A. Horvath; Bryan L. Jackson; Jeff Kusnitz; Scott Lekuch; Michael Mastro; Timothy Melano; Paul A. Merolla; Steven Edward Millman; Tapan Kumar Nayak; Norm Pass; Hartmut Penner; William P. Risk; Kai Schleupen

This paper describes the hardware and software ecosystem encompassing the brain-inspired TrueNorth processor – a 70mW reconfigurable silicon chip with 1 million neurons, 256 million synapses, and 4096 parallel and distributed neural cores. For systems, we present a scale-out system loosely coupling 16 single-chip boards and a scale-up system tightly integrating 16 chips in a 4 × 4 configuration by exploiting TrueNorths native tiling. For software, we present an end-to-end ecosystem consisting of a simulator, a programming language, an integrated programming environment, a library of algorithms and applications, firmware, tools for deep learning, a teaching curriculum, and cloud enablement. For the scale-up systems we summarize our approach to physical placement of neural network, to reduce intra- and inter-chip network traffic. The ecosystem is in use at over 30 universities and government/corporate labs. Our platform is a substrate for a spectrum of applications from mobile and embedded computing to cloud and supercomputers.

arXiv: Distributed, Parallel, and Cluster Computing | 2016

OpenCL + OpenSHMEM Hybrid Programming Model for the Adapteva Epiphany Architecture

David A. Richie; James A. Ross

There is interest in exploring hybrid OpenSHMEM + X programming models to extend the applicability of the OpenSHMEM interface to more hardware architectures. We present a hybrid OpenCL + OpenSHMEM programming model for device-level programming for architectures like the Adapteva Epiphany many-core RISC array processor. The Epiphany architecture comprises a 2D array of low-power RISC cores with minimal uncore functionality connected by a 2D mesh Network-on-Chip (NoC). The Epiphany architecture offers high computational energy efficiency for integer and floating point calculations as well as parallel scalability. The Epiphany-III is available as a coprocessor in platforms that also utilize an ARM CPU host. OpenCL provides good functionality for supporting a co-design programming model in which the host CPU offloads parallel work to a coprocessor. However, the OpenCL memory model is inconsistent with the Epiphany memory architecture and lacks support for inter-core communication. We propose a hybrid programming model in which OpenSHMEM provides a better solution by replacing the non-standard OpenCL extensions introduced to achieve high performance with the Epiphany architecture. We demonstrate the proposed programming model for matrix-matrix multiplication based on Cannon’s algorithm showing that the hybrid model addresses the deficiencies of using OpenCL alone to achieve good benchmark performance.

international conference on conceptual structures | 2017

A Distributed Shared Memory Model and C++ Templated Meta-Programming Interface for the Epiphany RISC Array Processor

David A. Richie; James A. Ross; Jamie Infantolino

The Adapteva Epiphany many-core architecture comprises a scalable 2D mesh Network-on-Chip (NoC) of low-power RISC cores with minimal uncore functionality. Whereas such a processor offers high computational energy efficiency and parallel scalability, developing effective programming models that address the unique architecture features has presented many challenges. We present here a distributed shared memory (DSM) model supported in software transparently using C++ templated metaprogramming techniques. The approach offers an extremely simple parallel programming model well suited for the architecture. Initial results are presented that demonstrate the approach and provide insight into the efficiency of the programming model and also the ability of the NoC to support a DSM without explicit control over data movement and localization.

international conference on computational science | 2018

Remote Procedure Calls for Improved Data Locality with the Epiphany Architecture

James A. Ross; David A. Richie

This paper describes the software implementation of an emerging parallel programming model for partitioned global address space (PGAS) architectures. Applications with irregular memory access to distributed memory do not perform well on conventional symmetric multiprocessing (SMP) architectures with hierarchical caches. Such applications tend to scale with the number of memory interfaces and corresponding memory access latency. Using a remote procedure call (RPC) technique, these applications may see reduced latency and higher throughput compared to remote memory access or explicit message passing. The software implementation of a remote procedure call method detailed in the paper is designed for the low-power Adapteva Epiphany architecture.

international conference on computational science | 2018

Architecture Emulation and Simulation of Future Many-Core Epiphany RISC Array Processors

David A. Richie; James A. Ross

The Adapteva Epiphany many-core architecture comprises a scalable 2D mesh Network-on-Chip (NoC) of low-power RISC cores with minimal uncore functionality. The Epiphany architecture has demonstrated significantly higher power-efficiency compared with other more conventional general-purpose floating-point processors. The original 32-bit architecture has been updated to create a 1,024-core 64-bit processor recently fabricated using a 16 nm process. We present here our recent work in developing an emulation and simulation capability for future many-core processors based on the Epiphany architecture. We have developed an Epiphany SoC device emulator that can be installed as a virtual device on an ordinary x86 platform and utilized with the existing software stack used to support physical devices, thus creating a seamless software development environment capable of targeting new processor designs just as they would be interfaced on a real platform. These virtual Epiphany devices can be used for research in the area of many-core RISC array processors in general.

international applied computational electromagnetics society symposium italy | 2017

Portable high-performance software design using templated meta-programming for EM calculations

Jamie Infantolino; James A. Ross; David A. Richie

The Finite Difference Time Domain (FDTD) Method is used for full-wave electromagnetic (EM) simulations. FDTD is computationally intensive with performance depending critically on architecture-specific optimizations that have become more challenging given the rapidly changing architectures in modern high-performance computing platforms. We examine a templated meta-programming technique to implement the computational kernels in canonical form, without any architecture-specific optimizations, such that data layout and loop order optimizations can be applied through code transformations. These transformations are abstracted behind a simple yet flexible API for the application developer and require no special tools, relying only on a modern optimizing C++ compiler. Optimizations for data layout and loop order are selected at compile-time using C++ typedefs without the need to modify source code implementations of the algorithm.

Explore More