Qingchuan Shi
University of Connecticut
Publication
Featured research published by Qingchuan Shi.
IEEE International Symposium on Workload Characterization | 2015
Masab Ahmad; Farrukh Hijaz; Qingchuan Shi; Omer Khan
Algorithms operating on graphs are known to be highly irregular and unstructured. This leads to workload imbalance and data locality challenges when these algorithms are parallelized and executed on evolving multicore processors. Previous parallel benchmark suites for shared-memory multicores have focused on various workload domains, such as scientific, graphics, vision, financial, and media processing. However, these suites lack the graph applications that must be evaluated in the context of architectural design space exploration for futuristic multicores. This paper presents CRONO, a benchmark suite composed of multi-threaded graph algorithms for shared-memory multicore processors. We analyze and characterize these benchmarks using a multicore simulator, as well as a real multicore machine setup. CRONO uses both synthetic and real-world graphs. Our characterization shows that graph benchmarks are diverse and challenging in the context of scaling efficiency. They exhibit low locality due to unstructured memory access patterns, and incur fine-grain communication between threads. Energy overheads also arise from nondeterministic memory and synchronization patterns over network connections. Our characterization reveals that these challenges persist in state-of-the-art graph algorithms, and in this context CRONO can be used to identify, analyze, and develop novel architectural methods to mitigate efficiency bottlenecks in futuristic multicore processors.
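For flavor, here is a minimal sketch (ours, not CRONO code) of the kind of kernel the suite characterizes: a frontier-based parallel BFS on a synthetic graph, where neighbor lookups jump across memory and threads synchronize per vertex. The graph shape, thread count, and static partitioning are illustrative assumptions.

```cpp
// Minimal sketch (ours, not CRONO code): frontier-based parallel BFS over a
// synthetic graph. Neighbor indices jump across memory (low locality), and
// threads claim vertices with fine-grain atomic operations.
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    const int kNodes = 1 << 16;
    // Synthetic CSR graph: node i links to i+1 and to a scattered far node.
    std::vector<int> offsets(kNodes + 1);
    std::vector<int> edges;
    for (int i = 0; i < kNodes; ++i) {
        offsets[i] = (int)edges.size();
        edges.push_back((i + 1) % kNodes);
        edges.push_back((int)((i * 2654435761u) % kNodes));
    }
    offsets[kNodes] = (int)edges.size();

    std::vector<std::atomic<int>> dist(kNodes);
    for (auto& d : dist) d.store(-1);
    dist[0].store(0);

    std::vector<int> frontier{0};
    while (!frontier.empty()) {
        std::vector<int> next;  // next BFS level, merged from all threads
        std::mutex next_mu;
        auto worker = [&](size_t lo, size_t hi) {
            std::vector<int> local;
            for (size_t f = lo; f < hi; ++f) {
                int u = frontier[f];
                for (int e = offsets[u]; e < offsets[u + 1]; ++e) {
                    int v = edges[e], expected = -1;
                    // Claim each vertex exactly once (fine-grain sync).
                    if (dist[v].compare_exchange_strong(expected,
                                                        dist[u].load() + 1))
                        local.push_back(v);
                }
            }
            std::lock_guard<std::mutex> g(next_mu);
            next.insert(next.end(), local.begin(), local.end());
        };
        // Static partition across 4 threads; imbalance appears when vertex
        // degrees are skewed, as in real-world graphs.
        std::vector<std::thread> pool;
        size_t chunk = (frontier.size() + 3) / 4;
        for (size_t t = 0; t < 4; ++t) {
            size_t lo = std::min(t * chunk, frontier.size());
            size_t hi = std::min(lo + chunk, frontier.size());
            pool.emplace_back(worker, lo, hi);
        }
        for (auto& th : pool) th.join();
        frontier.swap(next);
    }
    std::printf("dist[%d] = %d\n", kNodes - 1, dist[kNodes - 1].load());
}
```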
International Conference on Computer Design | 2013
Farrukh Hijaz; Qingchuan Shi; Omer Khan
Near-threshold voltage (NTV) operation is expected to enable up to a 10× improvement in energy efficiency for future processors. However, reliable operation below a minimum voltage (Vccmin) cannot be guaranteed. Specifically, SRAM bit-cell error rates are expected to rise steeply since their margins can easily be violated at near-threshold voltages. Multicore processors rely on fast private L1 caches to exploit data locality and achieve high performance. In the presence of high bit-cell error rates, an L1 cache can either sacrifice capacity or incur additional latency to correct the errors. We observe that L1 cache sensitivity to hit latency offers a design tradeoff between capacity and latency. When the error rate is high at extreme Vccmin, it is worthwhile to incur additional latency to recover and utilize the additional L1 cache capacity. However, at low error rates, the additional constant latency to recover cache capacity degrades performance. With this tradeoff in mind, we propose a novel private L1 cache architecture that dynamically learns and adapts by either recovering cache capacity at the cost of additional latency, or operating at lower capacity while benefiting from optimal hit latency. Using simulations of a 64-core multicore, we demonstrate that our adaptive L1 cache architecture performs better than both individual schemes at low and high error rates (i.e., under various NTV conditions).
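A rough sketch of the tradeoff such an adaptive cache learns, under latencies and error rates we assume here (the abstract does not give the paper's mechanism or constants): an epoch-based controller estimates the average access cost of each mode and picks whichever is cheaper.

```cpp
// Rough sketch of the capacity-vs-latency tradeoff (all constants assumed,
// not from the paper): estimate each mode's average access cost per epoch
// and switch to whichever is cheaper at the current bit-cell error rate.
#include <cstdio>
#include <initializer_list>

struct EpochStats { long hits; long misses; };

enum class L1Mode { CorrectErrors, DisableFaultyLines };

// "Correct" pays an ECC-style recovery latency on every hit but keeps full
// capacity; "disable" keeps fast hits but loses some hits to disabled lines.
double epoch_cost(L1Mode m, EpochStats s, double hit_lat, double fix_lat,
                  double miss_lat, double lost_hit_frac) {
    double n = double(s.hits + s.misses);
    if (m == L1Mode::CorrectErrors)
        return (s.hits * (hit_lat + fix_lat) + s.misses * miss_lat) / n;
    double lost = s.hits * lost_hit_frac;  // hits that become misses
    return ((s.hits - lost) * hit_lat + (s.misses + lost) * miss_lat) / n;
}

int main() {
    EpochStats s{9000, 1000};
    for (double frac : {0.01, 0.10, 0.30}) {  // fraction of hits lost
        double fix = epoch_cost(L1Mode::CorrectErrors, s, 2, 2, 40, frac);
        double off = epoch_cost(L1Mode::DisableFaultyLines, s, 2, 2, 40, frac);
        std::printf("lost=%.2f  correct=%.2f  disable=%.2f -> %s\n",
                    frac, fix, off, fix < off ? "correct" : "disable");
    }
}
```

At a low error rate the fixed recovery latency dominates and disabling wins; as the error rate grows, the capacity loss outweighs it and correction wins, which is the crossover the adaptive design exploits.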
IEEE Computer | 2013
Qingchuan Shi; Omer Khan
A proposed lightweight, soft-error-resilient architecture for shared-memory multicores enables cores to autonomously perform redundant execution of uninterrupted instruction sequences. The distributed redundancy control mechanism operates in concert with the coherence protocol to provide resiliency for both computation and communication hardware. The Web extra at http://youtu.be/9A3oiIerI0w is a video interview in which guest editor Srinivas Devadas and author Omer Khan expand on how a proposed lightweight, soft-error-resilient architecture for shared-memory multicores enables cores to autonomously perform redundant execution of uninterrupted instruction sequences.
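As a software-level analogue only (the paper's mechanism lives in hardware, coordinated with the coherence protocol), the sketch below runs a deterministic work unit twice and compares the outputs before committing; the names and structure are ours.

```cpp
// Software-level analogue only (the paper's redundancy control is in
// hardware): run a deterministic work unit twice and compare outputs
// before committing. Names are ours.
#include <cstdio>
#include <functional>

template <typename T>
bool execute_redundantly(const std::function<T()>& unit, T* out) {
    T primary = unit();   // primary execution
    T shadow  = unit();   // redundant execution (a paired core, conceptually)
    if (primary != shadow) return false;  // divergence: suspect a soft error
    *out = primary;                       // agreement: safe to commit
    return true;
}

int main() {
    long sum = 0;
    bool ok = execute_redundantly<long>([] {
        long s = 0;
        for (long i = 1; i <= 1000; ++i) s += i * i;  // must be deterministic
        return s;
    }, &sum);
    std::printf("%s: sum = %ld\n", ok ? "committed" : "retry", sum);
}
```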
IEEE Computer Architecture Letters | 2015
Qingchuan Shi; Henry Hoffmann; Omer Khan
To protect multicores from soft-error perturbations, resiliency schemes have been developed with high coverage but high power/performance overheads (~2x). We observe that not all soft errors affect program correctness; some only affect program accuracy, i.e., the program completes with certain acceptable deviations from the soft-error-free outcome. Thus, it is practical to improve processor efficiency by trading off resilience overheads against program accuracy. We propose the idea of declarative resilience, which selectively applies resilience schemes to both crucial and non-crucial code while ensuring program correctness. At the application level, crucial and non-crucial code is identified based on its impact on the program outcome. The hardware collaborates with software support to enable efficient resilience with 100 percent soft-error coverage. Only program accuracy is compromised in the worst-case scenario of a soft-error strike during non-crucial code execution. For a set of multithreaded benchmarks, declarative resilience improves completion time by an average of 21 percent over a state-of-the-art hardware resilience scheme that protects all executed code. Its performance overhead is ~1.38x relative to a multicore that does not support resilience.
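A hypothetical sketch of what the software side of such an interface could look like; the marker functions and the fail-safe default are our assumptions, not the paper's API.

```cpp
// Hypothetical software-side interface (our assumption, not the paper's
// API): regions declare whether they are crucial, and the runtime/hardware
// would adjust redundant execution accordingly. Here it is only recorded.
#include <cstdio>

enum class Region { Crucial, NonCrucial };
static Region g_mode = Region::Crucial;  // fail-safe default: protect

void begin_region(Region r) { g_mode = r; }
void end_region() { g_mode = Region::Crucial; }

int main() {
    int n = 1000;

    // This accumulation can tolerate deviation in its numeric result under
    // a soft error, so it is declared non-crucial.
    begin_region(Region::NonCrucial);
    double acc = 0;
    for (int i = 0; i < n; ++i) acc += 1.0 / (i + 1);
    end_region();

    std::printf("harmonic(%d) ~= %.4f (mode=%d)\n", n, acc, (int)g_mode);
}
```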
International Conference on Parallel Architectures and Compilation Techniques | 2015
George Kurian; Qingchuan Shi; Srinivas Devadas; Omer Khan
Data access in modern processors contributes significantly to overall performance and energy consumption. Traditionally, data is distributed among the cores through an on-chip cache hierarchy, and each producer/consumer accesses data through its private level-1 cache, relying on the cache coherence protocol for consistency. Recently, remote access, a mechanism that reduces energy and latency through word-level access to data anywhere on chip, has been proposed. Remote access does not replicate data in the private caches, and thereby removes the need for expensive cache line invalidations or updates. Researchers have implemented remote access as an auxiliary mechanism in cache coherence to improve efficiency. Unfortunately, stronger memory models, such as Intel's TSO, require strict ordering among loads and stores. This introduces serialization penalties for data classified to be accessed remotely, which hampers each core's ability to optimally exploit memory-level parallelism. In this paper, we propose a novel timestamp-based scheme to detect memory consistency violations. The proposed scheme enables remote accesses to be issued and completed in parallel while continuously detecting whether any ordering violations have occurred, and rolling back the pipeline state if needed. We implement our scheme for the locality-aware cache coherence protocol that uses remote access as an auxiliary mechanism for efficient data access. Our evaluation using a 64-core multicore processor with out-of-order speculative cores shows that the proposed technique improves completion time by 26% and energy by 20% over a state-of-the-art cache management scheme.
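A toy model of the violation check, with field names and the timestamp discipline assumed by us: a speculatively issued remote load must be squashed if a conflicting remote store became visible between the moment the load read its value and the load's commit point.

```cpp
// Toy model of a timestamp-based ordering check, not the paper's hardware:
// a speculative remote load records when its value was read; committing is
// unsafe if a conflicting remote store was performed after that read but
// before the commit point. Field names are ours.
#include <cstdio>
#include <vector>

struct SpeculativeLoad {
    unsigned addr;
    unsigned long read_ts;     // timestamp when the value was obtained
};
struct RemoteStore {
    unsigned addr;
    unsigned long perform_ts;  // timestamp when the store became visible
};

// Returns true if committing now (at commit_ts) would violate ordering,
// i.e., the pipeline must roll back and replay from the offending load.
bool must_rollback(const std::vector<SpeculativeLoad>& loads,
                   const std::vector<RemoteStore>& stores,
                   unsigned long commit_ts) {
    for (const auto& ld : loads)
        for (const auto& st : stores)
            if (st.addr == ld.addr &&
                st.perform_ts > ld.read_ts && st.perform_ts <= commit_ts)
                return true;  // the load's value is now stale
    return false;
}

int main() {
    std::vector<SpeculativeLoad> loads = {{0x40, 100}, {0x80, 105}};
    std::vector<RemoteStore> stores = {{0x80, 110}};  // conflicts with load 2
    std::printf("rollback = %d\n", must_rollback(loads, stores, 120));
}
```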
International Conference on Computer Design | 2013
Qingchuan Shi; Farrukh Hijaz; Omer Khan
Next-generation multicores will process massive data with significant sharing. Since future processors will also be inherently limited by off-chip bandwidth, on-chip data management is emerging as a first-order design constraint. On-chip memory latency increases as more cores are added, since the diameter of most on-chip networks grows with the number of cores. We observe that a large fraction of on-chip traffic originates from communication between the cores to maintain cache coherence. Motivated by these observations, we propose a novel on-chip data placement mechanism that optimizes shared data placement by minimizing the distance of data from the requesting cores (improving locality), while also balancing network contention and the utilization of per-core cache capacity. Using simulations of a 64-core multicore, we show that our proposal outperforms state-of-the-art static and dynamic data placement mechanisms by an average of 5.5% and 8.5%, respectively.
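A schematic version of such a placement decision (the weights, mesh model, and cost form are our assumptions): pick the home core that minimizes average hop distance to the requesters, penalized by the utilization of each core's cache slice.

```cpp
// Schematic placement heuristic, not the paper's mechanism: choose a home
// core for a shared line by trading distance to its requesters against
// per-core cache pressure. Weights and the mesh model are assumptions.
#include <cstdio>
#include <cstdlib>
#include <vector>

const int kMeshDim = 8;  // 64 cores as an 8x8 mesh

int hop_distance(int a, int b) {  // Manhattan distance on the mesh
    return std::abs(a % kMeshDim - b % kMeshDim) +
           std::abs(a / kMeshDim - b / kMeshDim);
}

// Lower is better: average hops to the requesters, plus a penalty for
// placing data on a core whose cache slice is already heavily utilized.
int pick_home(const std::vector<int>& requesters,
              const std::vector<double>& slice_utilization,
              double pressure_weight) {
    int best = 0;
    double best_cost = 1e30;
    for (int c = 0; c < kMeshDim * kMeshDim; ++c) {
        double hops = 0;
        for (int r : requesters) hops += hop_distance(c, r);
        double cost = hops / requesters.size() +
                      pressure_weight * slice_utilization[c];
        if (cost < best_cost) { best_cost = cost; best = c; }
    }
    return best;
}

int main() {
    std::vector<double> util(64, 0.3);
    util[27] = 0.95;  // the distance-optimal core's slice is saturated
    std::vector<int> reqs = {26, 27, 28, 35};
    std::printf("home core = %d\n", pick_home(reqs, util, 8.0));
}
```

With the saturation penalty, the heuristic deliberately places the line one hop away from the distance-optimal core, which is the locality/contention balance the abstract describes.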
International Symposium on Microarchitecture | 2012
Farrukh Hijaz; Qingchuan Shi; Omer Khan
Near-threshold voltage operation is widely acknowledged as a potential mechanism to achieve an order-of-magnitude reduction in energy consumption in future processors. However, processors cannot operate reliably below a minimum voltage, Vccmin, since hardware components may fail. SRAM bit-cell failures in memory structures, such as caches, typically determine the Vccmin for a processor. Although the last-level shared caches (LLC) in modern multicores are protected using error-correcting codes (ECC), the private caches have been left unprotected due to their performance sensitivity to the latency overhead of ECC. This limits the operation of the processor at near-threshold voltages. In this paper, we propose mechanisms for near-threshold operation of private caches that do not require ECC support. First, we present a fine-grain mechanism to disable private cache lines with bit-cell failures at the target near-threshold voltage. Second, we propose two mechanisms to better manage the capacity-stressed private caches. (1) We utilize OS-level classification of private and shared data, and evaluate a data placement mechanism that dynamically relocates private data blocks to the LLC slice that is physically co-located with the requesting core. (2) We propose in-hardware, low-overhead runtime profiling of the locality of each cache line classified as private data, and only allow such data to be cached in the private caches if it shows high spatio-temporal locality. These mechanisms allow the private caches to rely on the local LLC slice to cache low-locality private data efficiently, and free up space to hold the more frequently used private data (as well as the shared data). We show that combining cache line disabling with efficient cache management of private data yields better application completion times than using a single-error-correction, double-error-detection (SECDED) ECC mechanism and/or cache line disabling alone.
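The sketch below is our schematic reading of the two ideas combined, not the paper's design: a disable mask models lines turned off due to bit-cell failures, and a small reuse profile only admits private data into L1 once it has shown reuse, steering low-locality lines toward the local LLC slice.

```cpp
// Schematic sketch, not the paper's design: a faulty-line disable mask plus
// a tiny reuse profile that admits private data into L1 only after it has
// shown reuse; low-locality lines fall back to the local LLC slice.
#include <bitset>
#include <cstdio>
#include <unordered_map>

struct PrivateL1Policy {
    std::bitset<512> line_ok;                 // 0 = disabled at this Vccmin
    std::unordered_map<unsigned, int> reuse;  // per-line touch count (assumed)
    int admit_threshold = 2;

    bool may_cache(unsigned line_addr, bool is_private_data) {
        unsigned set = line_addr % line_ok.size();
        if (!line_ok[set]) return false;   // bit-cell failures: line disabled
        if (!is_private_data) return true; // shared data: normal path
        // Only admit private data that has demonstrated reuse.
        return ++reuse[line_addr] >= admit_threshold;
    }
};

int main() {
    PrivateL1Policy l1;
    l1.line_ok.set();     // start with all lines usable...
    l1.line_ok.reset(7);  // ...except one that fails at this voltage
    std::printf("faulty line: %d\n", l1.may_cache(7, true));
    std::printf("1st touch:   %d\n", l1.may_cache(42, true));
    std::printf("2nd touch:   %d\n", l1.may_cache(42, true));
}
```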
ACM Transactions on Embedded Computing Systems | 2018
Hamza Omar; Qingchuan Shi; Masab Ahmad; Halit Dogan; Omer Khan
To protect multicores from soft-error perturbations, research has explored various resiliency schemes that provide high soft-error coverage. However, these schemes incur high performance and energy overheads. We observe that not all soft-error perturbations affect program correctness; some soft errors only affect program accuracy, i.e., the program completes with certain acceptable deviations from the error-free outcome. Thus, it is practical to improve processor efficiency by trading off resiliency overheads against program accuracy. This article proposes the idea of declarative resilience, which selectively applies strong resiliency schemes to code regions that are crucial for program correctness (crucial code) and lightweight resiliency to code regions that can tolerate accuracy deviations under soft errors (non-crucial code). At the application level, crucial and non-crucial code is identified based on its impact on the program outcome. A cross-layer architecture enables efficient resilience along with holistic soft-error coverage. Only program accuracy is compromised in the worst-case scenario of a soft-error strike during non-crucial code execution. For a set of machine-learning and graph-analytic benchmarks, declarative resilience reduces the performance overhead of resiliency from ∼1.43× (for a state-of-the-art system that applies strong resiliency to all program code regions) to ∼1.2×.
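A hypothetical illustration of the two-level split (the function names and mode switch are our assumptions): a non-crucial accumulation runs under a lightweight scheme, since a bit flip there can only perturb the numeric result, and strong redundancy is restored afterwards.

```cpp
// Hypothetical illustration of the strong/lightweight split (names and the
// mode switch are our assumptions, not the paper's interface).
#include <cstdio>

enum class Scheme { StrongRedundancy, LightweightChecks };
static Scheme g_scheme = Scheme::StrongRedundancy;
void set_resilience(Scheme s) { g_scheme = s; }

double dot(const double* a, const double* b, int n) {
    // A flipped bit in this accumulation shifts the numeric result but
    // cannot corrupt control flow or pointers: accuracy, not correctness.
    set_resilience(Scheme::LightweightChecks);
    double s = 0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    set_resilience(Scheme::StrongRedundancy);  // restore full protection
    return s;
}

int main() {
    double a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1};
    std::printf("dot = %.1f\n", dot(a, b, 4));
}
```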
ACM Transactions on Architecture and Code Optimization | 2016
Qingchuan Shi; George Kurian; Farrukh Hijaz; Srinivas Devadas; Omer Khan
The trend of increasing the number of cores to achieve higher performance has challenged efficient management of on-chip data. Moreover, many emerging applications process massive amounts of data with varying degrees of locality. Therefore, exploiting locality to improve on-chip traffic and resource utilization is of fundamental importance. Conventional multicore cache management schemes manage either the private cache (L1) or the last-level cache (LLC), while ignoring the other. We propose a holistic locality-aware cache hierarchy management protocol for large-scale multicores. The proposed scheme improves on-chip data access latency and energy consumption by intelligently bypassing cache line replication in the L1 caches and/or intelligently replicating cache lines in the LLC. The approach relies on low-overhead yet highly accurate in-hardware runtime classification of data locality at both the L1 cache and the LLC. The decision to bypass L1 and/or replicate in the LLC is then based on the measured reuse at the fine granularity of cache lines. The locality tracking mechanism is decoupled from the sharer tracking structures that cause scalability concerns in traditional cache coherence protocols. Moreover, the complexity of the protocol is low, since no additional coherence states are created. However, the proposed classifier incurs a 5.6 KB per-core storage overhead. On a set of parallel benchmarks, the locality-aware protocol reduces average energy consumption by 26% and completion time by 16% compared to the state-of-the-art Reactive-NUCA multicore cache management scheme.
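A simplified sketch of reuse-based classification, with thresholds and table layout assumed by us: per-line counters measure reuse at each level, lines below the L1 threshold are not replicated in L1 (they would be served by word-granularity remote access instead), and LLC replication is likewise gated by measured reuse.

```cpp
// Simplified sketch of reuse-based classification (thresholds and table
// layout are our assumptions, not the protocol's): replication at each
// level is gated by the reuse a cache line has demonstrated there.
#include <cstdio>
#include <unordered_map>

struct LineReuse { int l1 = 0; int llc = 0; };

struct LocalityClassifier {
    std::unordered_map<unsigned, LineReuse> table;  // indexed by line address
    int l1_threshold = 4, llc_threshold = 4;

    // Low-reuse lines are not replicated in L1; they would be served by
    // word-granularity remote access instead.
    bool replicate_in_l1(unsigned line) {
        return ++table[line].l1 >= l1_threshold;
    }
    bool replicate_in_llc(unsigned line) {
        return ++table[line].llc >= llc_threshold;
    }
};

int main() {
    LocalityClassifier c;
    for (int touch = 1; touch <= 5; ++touch)
        std::printf("touch %d -> L1 replicate: %s, LLC replicate: %s\n", touch,
                    c.replicate_in_l1(0x1000) ? "yes" : "no",
                    c.replicate_in_llc(0x1000) ? "yes" : "no");
}
```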
The Journal of Supercomputing | 2016
Farrukh Hijaz; Qingchuan Shi; George Kurian; Srinivas Devadas; Omer Khan