Tiansheng Zhang | Researchain



Publication

Featured research published by Tiansheng Zhang.

Design, Automation, and Test in Europe | 2014

Thermal management of manycore systems with silicon-photonic networks

Tiansheng Zhang; José L. Abellán; Ajay Joshi; Ayse Kivilcim Coskun

Silicon-photonic networks-on-chip (NoCs) provide high bandwidth density; therefore, they are promising candidates to replace electrical NoCs in manycore systems. The silicon-photonic NoCs, however, are sensitive to the temperature gradients that typically occur on the chip, and hence, require proactive thermal management. This paper first provides a design space exploration of silicon-photonic networks in manycore systems and quantifies the performance impact of the temperature gradients for various network bandwidths. The paper then introduces a novel job allocation technique that minimizes the temperature gradients among the ring modulators/filters to improve the application performance. Experimental results for a single-chip 256-core system demonstrate that our policy is able to maintain the maximum network bandwidth. Compared to existing workload allocation policies, the proposed policy improves system performance by up to 26.1% when running a single application and 18.3% for multi-program scenarios.

Networks on Chips | 2014

Sharing and placement of on-chip laser sources in silicon-photonic NoCs

Chao Chen; Tiansheng Zhang; Pietro Contu; Jonathan Klamkin; Ayse Kivilcim Coskun; Ajay Joshi

Silicon-photonic links are projected to replace the electrical links for global on-chip communications in future manycore systems. The use of off-chip laser sources to drive these silicon-photonic links can lead to higher link losses, thermal mismatch between laser source and on-chip photonic devices, and packaging challenges. Therefore, on-chip laser sources are being evaluated as candidates to drive the on-chip photonic links. In this paper, we first explore the power, efficiency and temperature tradeoffs associated with an on-chip laser source. Using a 3D stacked system that integrates a manycore chip with the optical devices and laser sources, we explore the design space for laser source sharing (among waveguides) and placement to minimize laser power by simultaneously considering the network bandwidth requirements, thermal constraints, and physical layout constraints. As part of this exploration we consider Clos and crossbar logical topologies, U-shaped and W-shaped physical layouts, and various sharing/placement strategies: locally-placed dedicated laser sources for waveguides, locally-placed shared laser sources, and shared laser sources placed remotely along the chip edges. Our analysis shows that logical topology, physical layout, and photonic device losses strongly drive the laser source sharing and placement choices to minimize laser power.

IEEE High Performance Extreme Computing Conference | 2014

An investigation of Unified Memory Access performance in CUDA

Raphael Landaverde; Tiansheng Zhang; Ayse Kivilcim Coskun; Martin C. Herbordt

Managing memory between the CPU and GPU is a major challenge in GPU computing. A programming model, Unified Memory Access (UMA), has been recently introduced by Nvidia to simplify the complexities of memory management while claiming good overall performance. In this paper, we investigate this programming model and evaluate its performance and programming model simplifications based on our experimental results. We find that beyond on-demand data transfers to the CPU, the GPU is also able to request subsets of data it requires on demand. This feature allows UMA to outperform full data transfer methods for certain parallel applications and small data sizes. We also find, however, that for the majority of applications and memory access patterns, the performance overheads associated with UMA are significant, while the simplifications to the programming model restrict flexibility for adding future optimizations.

IEEE International Conference on Solid-State and Integrated Circuit Technology | 2010

A small-granularity solution on fault-tolerant in 2D-Mesh Network-on-Chip

Jinxiang Wang; Fangfa Fu; Tiansheng Zhang; Yu-Ping Chen

A small-granularity solution with high performance and low area cost is proposed for fault-tolerant routing of hard errors in a 2D-mesh Network-on-Chip. The solution presents a new fault model that defines node faults and link faults separately, which effectively reduces the number of situations classified as node faults and consequently improves network performance. New paths are defined to substitute for failed paths, so data packets can be routed along paths formed by the neighbors of the faulty node or link. Finally, a fault-tolerant wormhole router based on the XY routing algorithm is designed according to the solution. The evaluation results show that network performance can be improved by 15% when a link fault occurs in the network. Compared to the solution proposed in one reference, the average latency is reduced to 50% and the throughput is almost doubled, while the silicon area penalty of the router is almost the same.
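
As a rough illustration of why separating link faults from node faults matters, the sketch below implements plain XY routing on a 2D mesh and checks whether a path is blocked under each fault model. It is not the router from the paper: the function names (`xy_route`, `is_blocked`) and the fault representation are assumptions made for this example, and the paper's detour paths are not modeled.

```python
from typing import FrozenSet, Iterable, List, Tuple

Node = Tuple[int, int]  # (x, y) coordinate of a router in the mesh


def xy_route(src: Node, dst: Node) -> List[Node]:
    """Return the hop-by-hop path under plain XY routing (X first, then Y)."""
    x, y = src
    path = [src]
    # Travel along the X dimension until the destination column is reached.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Then travel along the Y dimension.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def is_blocked(
    path: List[Node],
    faulty_links: Iterable[FrozenSet[Node]] = (),
    faulty_nodes: Iterable[Node] = (),
) -> bool:
    """Hypothetical check: does the path touch a faulty node or faulty link?

    A link is an undirected pair of adjacent nodes, stored as a frozenset.
    """
    bad_links = set(faulty_links)
    bad_nodes = set(faulty_nodes)
    if any(node in bad_nodes for node in path):
        return True
    return any(frozenset(hop) in bad_links for hop in zip(path, path[1:]))


# One broken link between routers (1, 0) and (2, 0).
broken = frozenset({(1, 0), (2, 0)})
uses_link = xy_route((0, 0), (2, 1))   # crosses the broken link
avoids_link = xy_route((2, 0), (2, 2))  # starts at (2, 0) but goes up only

link_model = (is_blocked(uses_link, faulty_links=[broken]),
              is_blocked(avoids_link, faulty_links=[broken]))
# Coarser model: the whole router (2, 0) is declared faulty instead.
node_model = (is_blocked(uses_link, faulty_nodes=[(2, 0)]),
              is_blocked(avoids_link, faulty_nodes=[(2, 0)]))
```

Under the link-fault model only the first path is blocked, while declaring the whole router faulty blocks both, which is the kind of over-classification the abstract says the finer-grained fault model avoids.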

Design, Automation, and Test in Europe | 2013

3D-MMC: a modular 3D multi-core architecture with efficient resource pooling

Tiansheng Zhang; Alessandro Cevrero; Giulia Beanato; Panagiotis Athanasopoulos; Ayse Kivilcim Coskun; Yusuf Leblebici

This paper demonstrates a fully functional hardware and software design for a 3D stacked multi-core system for the first time. Our 3D system is a low-power 3D Modular Multi-Core (3D-MMC) architecture built by vertically stacking identical layers. Each layer consists of cores, private and shared memory units, and communication infrastructures. The system uses shared memory communication and Through-Silicon-Vias (TSVs) to transfer data across layers. A serialization scheme is employed for inter-layer communication to minimize the overall number of TSVs. The proposed architecture has been implemented in HDL and verified on a test chip targeting an operating frequency of 400MHz with a vertical bandwidth of 3.2Gbps. The paper first evaluates the performance, power and temperature characteristics of the architecture using a set of software applications we have designed. We demonstrate quantitatively that the proposed modular 3D design improves upon the cost and performance bottlenecks of traditional 2D multi-core design. In addition, a novel resource pooling approach is introduced to efficiently manage the shared memory of the 3D stacked system. Our approach reduces the application execution time significantly compared to 2D and 3D systems with conventional memory sharing.

Design, Automation, and Test in Europe | 2016

Cross-layer floorplan optimization for silicon photonic NoCs in many-core systems

Ayse Kivilcim Coskun; Anjun Gu; Warren Jin; Ajay Joshi; Andrew B. Kahng; Jonathan Klamkin; Yenai Ma; John Recchio; Vaishnav Srinivas; Tiansheng Zhang

Many-core chip architectures are now feasible, but the power consumption of electrical networks-on-chip does not scale well. Silicon photonic NoCs (PNoCs) are more scalable and power efficient, but floorplan optimization is challenging. Prior work optimizes PNoC floorplans through simultaneous place and route, but does not address cross-layer effects that span optical and electrical boundaries, chip thermal profiles, or effects of job scheduling policies. This paper proposes a more comprehensive, cross-layer optimization of the silicon PNoC and core cluster floorplan. Our simultaneous placement (locations of router groups and core clusters) and routing (waveguide layout) considers scheduling policy, thermal tuning, and heterogeneity in chip power profiles. The core of our optimizer is a mixed-integer linear programming formulation that minimizes NoC power, including (1) laser source power due to propagation, bend and crossing losses; (2) electrical and electrical-optical-electrical conversion power; and (3) thermal tuning power. Our experiments vary numbers of cores, optical data rate per wavelength, number of waveguides and other parameters to investigate scalability and tradeoffs through a large design space. We demonstrate how the optimal floorplan changes with cross-layer awareness: metrics of interest such as optimal waveguide length or thermal tuning power change significantly (up to 4X) based on power and utilization levels of cores, chip and cluster aspect ratio, and laser source sharing mechanism. Exploration of a large solution space is achieved with reasonable runtimes, and is perfectly parallelizable. Our optimizer thus affords designers more accurate, cross-layer chip planning decision support to accelerate adoption of PNoC-based solutions.

ACM Journal on Emerging Technologies in Computing Systems | 2015

Dynamic Cache Pooling in 3D Multicore Processors

Tiansheng Zhang; Jie Meng; Ayse Kivilcim Coskun

Resource pooling, where multiple architectural components are shared among cores, is a promising technique for improving system energy efficiency and reducing total chip area. 3D stacked multicore processors enable efficient pooling of cache resources owing to the short interconnect latency between vertically stacked layers. This article first introduces a 3D multicore architecture that provides poolable cache resources. We then propose a runtime management policy to improve energy efficiency in 3D systems by utilizing the flexible heterogeneity of cache resources. Our policy dynamically allocates jobs to cores on the 3D system while partitioning cache resources based on cache hungriness of the jobs. We investigate the impact of the proposed cache resource pooling architecture and management policy in 3D systems, both with and without on-chip DRAM. We evaluate the performance, energy efficiency, and thermal behavior for a wide range of workloads running on 3D systems. Experimental results demonstrate that the proposed architecture and policy reduce system energy-delay product (EDP) and energy-delay-area product (EDAP) by 18.8% and 36.1% on average, respectively, in comparison to 3D processors with static cache sizes.
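
For readers unfamiliar with the two metrics quoted above, the sketch below computes them using the standard definitions: EDP is energy multiplied by delay, and EDAP additionally multiplies by chip area. The function names and all input numbers are hypothetical and chosen only to show the arithmetic; they are not measurements from the article.

```python
def edp(energy_j: float, delay_s: float) -> float:
    """Energy-delay product: lower is better."""
    return energy_j * delay_s


def edap(energy_j: float, delay_s: float, area_mm2: float) -> float:
    """Energy-delay-area product: EDP scaled by chip area."""
    return energy_j * delay_s * area_mm2


def reduction_pct(baseline: float, proposed: float) -> float:
    """Percentage by which `proposed` is lower than `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (baseline - proposed) / baseline


# Hypothetical inputs (not from the article): a static-cache baseline and a
# pooled-cache design that uses less energy, finishes sooner, and is smaller.
base_energy, base_delay, base_area = 10.0, 2.0, 100.0
pool_energy, pool_delay, pool_area = 9.0, 1.8, 80.0

edp_gain = reduction_pct(edp(base_energy, base_delay),
                         edp(pool_energy, pool_delay))
edap_gain = reduction_pct(edap(base_energy, base_delay, base_area),
                          edap(pool_energy, pool_delay, pool_area))
print(f"EDP reduction:  {edp_gain:.1f}%")
print(f"EDAP reduction: {edap_gain:.1f}%")
```

Because area enters EDAP as an extra factor, a design that also shrinks the cache area improves EDAP by more than it improves EDP, which is consistent with the larger EDAP figure reported above.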

IFIP IEEE International Conference on Very Large Scale Integration | 2013

Dynamic cache pooling for improving energy efficiency in 3D stacked multicore processors

Jie Meng; Tiansheng Zhang; Ayse Kivilcim Coskun

Resource pooling, where multiple architectural components are shared among multiple cores, is a promising technique for improving the system energy efficiency and reducing the total chip area. 3D stacked multicore processors enable efficient pooling of cache resources owing to the short interconnect latency between vertically stacked layers. This paper introduces a 3D multicore architecture that provides poolable cache resources. We propose a runtime policy that improves energy efficiency in 3D stacked processors by providing flexible heterogeneity of the cache resources. Our policy dynamically allocates jobs to cores on the 3D stacked system in a way that pairs applications with contrasting cache use, while also partitioning the cache resources based on the cache hungriness of the applications. Experimental results demonstrate that the proposed policy improves system energy-delay product (EDP) and energy-delay-area product (EDAP) by up to 39.2% and 57.2%, respectively, compared to 3D processors with static cache sizes.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2017

Adaptive Tuning of Photonic Devices in a Photonic NoC Through Dynamic Workload Allocation

José L. Abellán; Ayse Kivilcim Coskun; Anjun Gu; Warren Jin; Ajay Joshi; Andrew B. Kahng; Jonathan Klamkin; Cristian Morales; John Recchio; Vaishnav Srinivas; Tiansheng Zhang

Photonic network-on-chip (PNoC) is a promising candidate to replace traditional electrical NoC in manycore systems that require substantial bandwidths. The photonic links in the PNoC comprise laser sources, optical ring resonators, passive waveguides, and photodetectors. Reliable link operation requires laser sources and ring resonators to have matching optical frequencies. However, inherent thermal sensitivity of photonic devices and manufacturing process variations can lead to a frequency mismatch. To avoid this mismatch, micro-heaters are used for thermal trimming and tuning, which can dissipate a significant amount of power. This paper proposes a novel FreqAlign workload allocation policy, accompanying an adaptive frequency tuning (AFT) policy, that is capable of reducing thermal tuning power of PNoC. FreqAlign uses thread allocation and thread migration to control temperature for matching the optical frequencies of ring resonators in each photonic link. The AFT policy reduces the remaining optical frequency difference among ring resonators and corresponding on-chip laser sources by hardware tuning methods. We use a full modeling stack of a PNoC that includes a performance simulator, a power simulator, and a thermal simulator with a temperature-dependent laser source power model to design and evaluate our proposed policies. Our experimental results demonstrate that FreqAlign reduces the resonant frequency gradient between ring resonators by 50%–60% when compared to existing workload allocation policies. Coupled with AFT, FreqAlign reduces localized thermal tuning power by 19.28 W on average, and is capable of saving up to 34.57 W when running realistic loads in a 256-core system without any performance degradation.

International Parallel and Distributed Processing Symposium | 2018