Jiayi Sheng | Researchain

Publication

Featured research published by Jiayi Sheng.

IEEE High Performance Extreme Computing Conference | 2014

Design of 3D FFTs with FPGA clusters

Jiayi Sheng; Benjamin Humphries; Hansen Zhang; Martin C. Herbordt

The three-dimensional Fast Fourier Transform (3D FFT) is widely applied in various scientific applications. Distributed 3D FFTs require global communication: this becomes a serious concern when strong scaling is required, as in long-timescale molecular dynamics simulations. In this paper, we propose a parameterized 3D FFT design that targets a 3D-torus FPGA-based network of various sizes. Characteristics include direct FPGA-FPGA communication links, support for various internal switch designs, and use of table-based routing, which saves chip area and routing cycles. We find that, even assuming extremely conservative parameters, we are able to run the 16³ FFT in 3.9 μs, the 32³ FFT in 5.46 μs, the 64³ FFT in 9.52 μs, and the 128³ FFT in 25.72 μs. These results indicate that clusters based on commodity FPGAs are likely to be appropriate when strong scaling is needed in applications limited by the 3D FFT.
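
A 3D FFT is commonly computed as three passes of 1D FFTs, one pass per axis. The sketch below shows that decomposition in pure Python. It is an illustration only, not the paper's FPGA design: the names `fft1d`, `fft3d`, and `dft3d_direct` are hypothetical, and the input is assumed to be an n x n x n cube with n a power of two. When the cube is partitioned across nodes, the second and third passes generally need data held by other nodes, which is where the global communication mentioned in the abstract comes from.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft3d(cube):
    """3D FFT of an n x n x n nested list via 1D FFTs along z, then y, then x."""
    n = len(cube)
    a = [[[complex(v) for v in row] for row in plane] for plane in cube]
    # Pass 1: transform along z (the innermost index).
    for i in range(n):
        for j in range(n):
            a[i][j] = fft1d(a[i][j])
    # Pass 2: transform along y (the middle index).
    for i in range(n):
        for k in range(n):
            line = fft1d([a[i][j][k] for j in range(n)])
            for j in range(n):
                a[i][j][k] = line[j]
    # Pass 3: transform along x (the outermost index).
    for j in range(n):
        for k in range(n):
            line = fft1d([a[i][j][k] for i in range(n)])
            for i in range(n):
                a[i][j][k] = line[i]
    return a


def dft3d_direct(cube):
    """Direct O(n^6) 3D DFT, used only to check fft3d on tiny inputs."""
    n = len(cube)
    out = [[[0j] * n for _ in range(n)] for _ in range(n)]
    for u in range(n):
        for v in range(n):
            for w in range(n):
                s = 0j
                for i in range(n):
                    for j in range(n):
                        for k in range(n):
                            phase = -2j * cmath.pi * (u * i + v * j + w * k) / n
                            s += cube[i][j][k] * cmath.exp(phase)
                out[u][v][w] = s
    return out
```

Comparing `fft3d` against `dft3d_direct` on a 4 x 4 x 4 cube is a quick way to check that the three passes together reproduce the full 3D transform.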

Field-Programmable Custom Computing Machines | 2014

3D FFTs on a Single FPGA

Benjamin Humphries; Hansen Zhang; Jiayi Sheng; Raphael Landaverde; Martin C. Herbordt

An efficient adaptive multistage multiuser detection (MUD) method, known as adaptive duplicated filters and interference canceller (ADIC), is presented in the context of an uplink terrestrial mobile network DS-CDMA. The performance of this proposed MUD scheme has been proven equivalent to a DF soft MPIC while affording complexity reductions by a factor of 4 (64 kbps). In this paper, an ADIC receiver is applied to yield a link satellite mobile environment based on DS-CDMA systems and able to outperform DF soft MPIC and soft MPIC methods using complexity reduction. At 64 kbps, compared to DF soft MPIC, soft MPIC and Rake receivers, ADIC MUD allows capacity increases of 8%, 30%, and 160% respectively.

The 3D FFT is critical in many physical simulations and image processing applications. On FPGAs, however, the 3D FFT was thought to be inefficient relative to other methods such as convolution-based implementations of multigrid. We find the opposite: a simple design, operating at a conservative frequency, takes 4 μs for 16³, 21 μs for 32³, and 215 μs for 64³ single-precision data points. The first two of these compare favorably with the 25 μs and 29 μs obtained running on a current Nvidia GPU. Some broader significance is that this is a critical piece in implementing a large-scale FPGA-based MD engine: even a single FPGA is capable of keeping the FFT off the critical path for a large fraction of possible MD simulations.

In the era of cloud computing, data centers are well known to be bounded by the power wall. This issue lowers the profit of service providers and obstructs the expansion of data center scale. Because virtual machine (VM) behavior was not explored sufficiently in classic data center power-saving strategies, in this paper we address the power consumption issue in the setting of a virtualized data center. We propose an efficient power-aware resource scheduling strategy that reduces data center power consumption based on VM live migration, a key technical feature of cloud computing. Our scheduling algorithm leverages the Xen platform and consolidates VM workloads periodically to reduce the number of running servers. To satisfy each VM's service-level agreement, our strategy keeps adjusting VM placements between scheduling rounds. We developed a power-aware data center simulator to test our algorithm. The simulator runs in the time domain and includes a segmented linear power model for servers. We validated our simulator using a measured server power trace. Our simulation shows that, compared with event-driven schedulers, our strategy improves the data center power budget by 35% for random workloads resembling web requests, and by 22.7% for workloads exhibiting stable resource requirements, like ScaLAPACK.

ACM SIGARCH Computer Architecture News | 2017

Collective Communication on FPGA Clusters with Static Scheduling

Jiayi Sheng; Qingqing Xiong; Chen Yang; Martin C. Herbordt

FPGA-centric clouds and clusters provide direct and programmable interconnects (DPI) with obvious benefits for communication latency and bandwidth. One rarely studied aspect of DPI is that they facilitate application-aware routing: if communication patterns are static and known a priori, as is usually the case, then judicious routing can reduce congestion, latency, and the hardware required. In this study we explore applying the method of offline/static routing to collective operations, in particular multicast and reduction. An entirely new communication infrastructure is proposed and implemented, including switch design and routing algorithm. A substantial improvement in performance is obtained, especially for multicast. We believe that this is one of the few general offline/static routing solutions for real HPC clusters, and FPGA-centric clusters in particular.

Journal of Parallel and Distributed Computing | 2016

Communication and cooling aware job allocation in data centers for communication-intensive workloads

Jie Meng; Eduard Llamosí; Fulya Kaplan; Chulian Zhang; Jiayi Sheng; Martin C. Herbordt; Gunar Schirner; Ayse Kivilcim Coskun

Energy consumption is an increasingly important concern in data centers. Today, nearly half of the energy in data centers is consumed by the cooling infrastructure. Existing policies on thermally-aware workload allocation do not consider applications that include many tasks (or threads) running on a large set of nodes with significant communication among the tasks. Such jobs, however, constitute most of the cycles in the high performance computing (HPC) domain, and have started to appear in other data centers as well. Job allocation strongly affects the performance of such communication-intensive applications. Communication-aware job allocation methods exist, but they focus solely on performance and do not consider cooling energy. This paper proposes a novel job allocation methodology to jointly minimize communication cost and cooling energy consumption in data centers. We formulate and solve the joint optimization problem using binary quadratic programming. Our joint optimization algorithm reduces cooling energy by 16.4% on average with only a 2.66% average increase in application running time compared to solely performance-aware allocations. To further optimize the communication cost, we develop a Charm++ based framework that extracts the communication behavior of applications. We then integrate our job allocation policy with a recursive coordinate bisection (RCB) based task mapping method to place highly-communicating tasks in close proximity. Experimental results show that task mapping further decreases the communication cost by up to 20.9% compared to assuming all-to-all communication, a popular assumption in much of the prior work.

We jointly optimize the cooling and communication costs via job allocation. Our joint allocation strategy saves 16.4% cooling energy on average. We design a framework to extract the communication patterns of HPC applications. Combining joint allocation with task mapping reduces communication costs by 20.9%.
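
To give a feel for what a binary quadratic program looks like in this setting, the toy sketch below selects k nodes for a job so that a per-node cooling cost plus a pairwise communication distance is minimized. It is an assumption-laden illustration, not the paper's formulation or solver: the cost model, the function name `best_allocation`, and the brute-force search are all hypothetical, and real instances would need a proper optimizer.

```python
from itertools import combinations


def best_allocation(cooling, distance, k):
    """Pick k nodes minimizing cooling plus pairwise distance cost by brute force.

    With binary x_i marking chosen nodes, the objective is
    sum_i cooling[i] * x_i + sum_{i<j} distance[i][j] * x_i * x_j
    subject to sum_i x_i == k, which is a binary quadratic program.
    """
    n = len(cooling)
    best_cost, best_nodes = None, None
    for nodes in combinations(range(n), k):
        # Linear term: hypothetical cooling cost of each selected node.
        cost = sum(cooling[i] for i in nodes)
        # Quadratic term: communication distance between each selected pair.
        cost += sum(distance[i][j] for i, j in combinations(nodes, 2))
        if best_cost is None or cost < best_cost:
            best_cost, best_nodes = cost, nodes
    return best_nodes, best_cost


# Four nodes in a line; distance is the hop count between positions.
distance = [[abs(i - j) for j in range(4)] for i in range(4)]
```

With cooling costs `[3, 1, 1, 3]` the two adjacent middle nodes win on both terms, while with `[1, 5, 5, 1]` the search accepts a longer communication distance to avoid the nodes that are expensive to cool. That trade-off is the point of optimizing the two costs jointly.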

IEEE High Performance Extreme Computing Conference | 2015

Hardware-efficient compressed sensing encoder designs for WBSNs

Jiayi Sheng; Chen Yang; Martin C. Herbordt

Implanted sensors, as might be used with wireless body sensor networks (WBSNs), must have minimal size and power consumption. In this work we examine digital-based compressed sensing encoders for WBSN-enabled ECG and EEG monitoring, a domain that has received much recent attention. We have two major findings. The first is that using a random binary Toeplitz matrix, rather than Bernoulli, has an acceptable effect on recovery quality. The second is that, in this design space, leakage dominates over dynamic power, with the result that it is highly beneficial to reduce the number of accumulators to trade off space for operating frequency. We demonstrate these results with a parameterized design and over three application domains (EEG, ECG, and small images), which together represent a variety of recovery goals and therefore compression methods. Compared with previous implementations, our new design consumes one to two orders of magnitude less area and power while still meeting timing constraints and achieving comparable recovery quality.
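
A compressed sensing encoder multiplies n input samples by an m x n measurement matrix to produce m < n measurements. The sketch below shows why a binary Toeplitz matrix is cheap to generate: every diagonal is constant, so the whole matrix is defined by m + n - 1 random bits, whereas a fully random Bernoulli matrix needs m * n independent entries. This is a software illustration under assumed conventions (0/1 entries, hypothetical function names), not the hardware design described in the paper.

```python
import random


def toeplitz_bits(m, n, seed=0):
    """Return the m + n - 1 random bits that define an m x n binary Toeplitz matrix."""
    rng = random.Random(seed)
    return [rng.randint(0, 1) for _ in range(m + n - 1)]


def toeplitz_entry(bits, n, i, j):
    """Entry (i, j) depends only on i - j, so each diagonal is constant."""
    return bits[i - j + n - 1]


def encode(signal, m, bits):
    """Compress n samples into m measurements: y[i] = sum_j Phi[i][j] * x[j].

    With 0/1 entries each measurement is a plain accumulation of the
    samples selected by row i, so no multipliers are needed.
    """
    n = len(signal)
    return [
        sum(signal[j] for j in range(n) if toeplitz_entry(bits, n, i, j))
        for i in range(m)
    ]
```

For example, with bits `[1, 0, 1, 1]`, m = 2, and n = 3, the matrix rows are `[1, 0, 1]` and `[1, 1, 0]`, so encoding the signal `[5, 7, 11]` yields `[16, 12]`.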

IEEE High Performance Extreme Computing Conference | 2017

OpenCL for HPC with FPGAs: Case study in molecular electrostatics

Chen Yang; Jiayi Sheng; Rushi Patel; Ahmed Sanaullah; Vipin Sachdeva; Martin C. Herbordt

FPGAs have emerged as a cost-effective accelerator alternative in clouds and clusters. Programmability remains a challenge, however, with OpenCL being generally recognized as a likely part of the solution. In this work we seek to advance the use of OpenCL for HPC on FPGAs in two ways. The first is by examining a core HPC application, Molecular Dynamics. The second is by examining a fundamental design pattern that we believe has not yet been described for OpenCL: passing data from a set of producer datapaths to a set of consumer datapaths, in particular, where the producers generate data non-uniformly. We evaluate several designs: single-level versions in Verilog and in OpenCL, a two-level Verilog version with optimized arbiter, and several two-level OpenCL versions with different arbitration and hand-shaking mechanisms, including one with an embedded Verilog module. For the Verilog designs, we find that FPGAs retain their high efficiency with a factor of 50× to 80× performance benefit over a single core. We also find that OpenCL may be competitive with HDLs for the straightline versions of the code, but that for designs with more complex arbitration and hand-shaking, relative performance is substantially diminished.

Field Programmable Logic and Applications | 2017

HPC on FPGA clouds: 3D FFTs and implications for molecular dynamics

Jiayi Sheng; Chen Yang; Ahmed Sanaullah; Michael Papamichael; Adrian M. Caulfield; Martin C. Herbordt

The architecture of the Microsoft Catapult II cloud places the accelerator (FPGA) as a bump-in-the-wire on the way to the network and thus promises a dramatic reduction in latency as layers of hardware and software are avoided. We demonstrate this capability with an implementation of the 3D FFT. Next we examine phased application elasticity, i.e., the use of a reduced set of nodes for some phases of an HPC application. We find that, for the FFT phase within Molecular Dynamics, such contraction is beneficial with a 13%–14% performance improvement. Turning to MD, we show how this elasticity can be integrated into the existing data transformation to hide its communication overhead and increase the performance benefit to 16%–29%.

IEEE High Performance Extreme Computing Conference | 2016

Novo-G#: Large-scale reconfigurable computing with direct and programmable interconnects

Alan D. George; Martin C. Herbordt; Herman Lam; Abhijeet Lawande; Jiayi Sheng; Chen Yang

While High-Performance Computing is ever more pervasive and effective, computing capability is currently only a small fraction of what is needed. Three fundamental issues limiting performance are computational efficiency, power density, and communication latency. All of these issues are being addressed through increased heterogeneity, but the last in particular by integrating communication into the accelerator. This integration enables direct and programmable communication among compute components. Novo-G# is a large-scale FPGA-centric cluster being built to investigate and develop architectures, system and tool infrastructure, and applications for this model. In this report we discuss the motivation behind and particular objectives of Novo-G#, the work completed so far, the products of that work, and their potential impact. We end with a description of and an invitation to join the Novo-G# Forum, the project users group.

IEEE High Performance Extreme Computing Conference | 2017

An FPGA-based data acquisition system for directional dark matter detection

Chen Yang; Jiayi Sheng; Aravind Sridhar; Martin C. Herbordt; Catherine Nicoloff; James Battat

Directional dark matter detection seeks to reconstruct the angular distribution of dark matter particles traveling through the laboratory. A directional detector with high spatial resolution has the potential to increase the sensitivity per unit volume by over two orders of magnitude, but requires the development of a high-channel-count, high-speed readout system. This paper describes an FPGA-based digital back-end system to handle a 16 Gbps data stream from 10³ independent detector channels sampled at 1 MHz. Results of an implementation of this system are presented, along with plans for future development.

IEEE High Performance Extreme Computing Conference | 2016

A hardware design for in-brain neural spike sorting

Yinan Liu; Jiayi Sheng; Martin C. Herbordt

Neural spike sorting is used to classify neural spike signals based on neuron type and so is an essential step in decoding brain signals. Because of its computational complexity, spike sorting is generally carried out offline or, at least, using a transmitted signal. In contrast, in-brain spike sorting would reduce the data that needs to be transmitted by orders of magnitude, with a corresponding reduction in transmission power. This would enable real-time wireless neural recordings. In this paper, we design and characterize a hardware prototype for in-brain spike sorting. Our design is able to reduce the wireless transmission power by a factor of more than 200 compared with direct transmission. Also, compared with the current state of the art, our design increases the sorting accuracy from 75% to 93% while remaining within hard constraints for power, power density, and real-time processing.
