Publication


Featured research published by Chenjie Yu.


Design Automation Conference | 2008

Compiler-driven register re-assignment for register file power-density and temperature reduction

Xiangrong Zhou; Chenjie Yu; Peter Petrov

Temperature hot-spots have been known to cause severe reliability problems and to significantly increase leakage power. The register file has previously been shown to exhibit the highest temperature of all hardware components in a modern high-end embedded processor, which makes it particularly susceptible to faults and elevated leakage power. We show that this is mostly due to highly clustered register file accesses, where a small set of registers placed physically close to each other is accessed with very high frequency. In this paper we propose a compiler-based register re-assignment methodology whose purpose is to break up such clusters and distribute accesses uniformly across the register file. This is achieved with no performance or hardware overhead. We show that the underlying problem is NP-hard, and subsequently introduce an efficient algorithmic heuristic.
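
A minimal sketch of the idea behind such a re-assignment, assuming a 32-entry register file laid out linearly and illustrative profiled access counts (this is not the paper's actual heuristic): the hottest logical registers are remapped to physical slots at a stride coprime with the file size, so frequently accessed entries no longer sit next to each other.

```c
#include <stdio.h>

#define NREGS 32

int main(void) {
    /* illustrative profiled access counts: a hot, physically clustered group
     * occupies the first four registers */
    unsigned access[NREGS] = {
        900, 870, 850, 820, 10, 12,  8,  5,
         40,  35,  30,  25, 20, 15, 14, 13,
         12,  11,  10,   9,  8,  7,  6,  5,
          4,   3,   2,   1,  1,  1,  1,  1
    };
    int order[NREGS], remap[NREGS];

    for (int r = 0; r < NREGS; r++)
        order[r] = r;

    /* selection sort: hottest logical register first */
    for (int i = 0; i < NREGS; i++)
        for (int j = i + 1; j < NREGS; j++)
            if (access[order[j]] > access[order[i]]) {
                int t = order[i]; order[i] = order[j]; order[j] = t;
            }

    /* physical slot = (rank * 13) mod 32; 13 is coprime with 32, so this is a
     * permutation and the hottest registers land far apart from each other */
    for (int rank = 0; rank < NREGS; rank++)
        remap[order[rank]] = (rank * 13) % NREGS;

    for (int r = 0; r < NREGS; r++)
        printf("r%-2d (count %3u) -> physical slot %2d\n", r, access[r], remap[r]);
    return 0;
}
```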


IEEE Transactions on Very Large Scale Integration Systems | 2010

Low-Cost and Energy-Efficient Distributed Synchronization for Embedded Multiprocessors

Chenjie Yu; Peter Petrov

We present a framework for a distributed and low-cost implementation of synchronization mechanisms for embedded shared-memory multiprocessors. The proposed architecture implements the queued-lock semantics in a completely decentralized manner through low-cost, distributed synchronization controllers performing distributed synchronization management protocols. The approach achieves three major benefits. First, it completely eliminates the overwhelming bus contention traffic when multiple cores compete for a synchronization variable. Second, it exhibits extremely low best-case lock acquisition latency (with zero bus transactions). Third, it opens multiple avenues for high energy efficiency, as the local synchronization controllers can determine, without any bus transactions or local cache spinning, exactly when a lock becomes available to, or a barrier is released at, the local processor. This allows the system software or the thread library to employ various low-power policies.
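
Queued-lock semantics can be pictured in software with an MCS-style lock, where each waiter spins only on its own queue node and the lock is handed off in FIFO order; the C11 sketch below is only an analogy for what the paper implements in distributed hardware controllers, not the controllers' protocol itself.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One queue node per waiting core/thread. */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
} mcs_node;

typedef struct {
    _Atomic(mcs_node *) tail;   /* last waiter in the queue, or NULL if free */
} mcs_lock;

void mcs_acquire(mcs_lock *lk, mcs_node *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    mcs_node *prev = atomic_exchange(&lk->tail, me);  /* join the queue */
    if (prev != NULL) {
        atomic_store(&prev->next, me);                /* link behind predecessor */
        while (atomic_load(&me->locked))              /* spin on our own node only */
            ;
    }
}

void mcs_release(mcs_lock *lk, mcs_node *me) {
    mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong(&lk->tail, &expected, NULL))
            return;                                   /* nobody else was waiting */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                                         /* successor still linking in */
    }
    atomic_store(&succ->locked, false);               /* hand the lock off in FIFO order */
}

int main(void) {
    mcs_lock lk;
    mcs_node me;
    atomic_init(&lk.tail, NULL);
    mcs_acquire(&lk, &me);   /* uncontended case: no spinning at all */
    mcs_release(&lk, &me);
    return 0;
}
```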


International Conference on Hardware/Software Codesign and System Synthesis | 2008

Distributed and low-power synchronization architecture for embedded multiprocessors

Chenjie Yu; Peter Petrov

In this paper we present a framework for a distributed and very low-cost implementation of synchronization controllers and protocols for embedded multiprocessors. The proposed architecture effectively implements the queued-lock semantics in a completely distributed way. The proposed approach to synchronization implementation not only completely eliminates the overwhelming bus contention traffic when multiple cores compete for a synchronization variable, but also achieves very high energy efficiency as the local synchronization controller can efficiently determine, without any bus transactions or local cache spinning, the exact timing of when the lock is made available to the local processor. Application-specific information regarding synchronization variables in the local task is exploited in implementing the distributed synchronization protocol. The local synchronization controllers enable the system software or the thread library to implement various low-power policies, such as disabling the cache accesses or even completely powering down the local processor while waiting for a synchronization variable.
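
As a rough illustration of the kind of low-power policy this enables, the sketch below shows a thread-library-style acquire that puts the core to sleep until the local controller reports the grant; sync_ctrl_request_lock, sync_ctrl_lock_granted, and cpu_enter_sleep are hypothetical names stubbed out here, not an interface defined in the paper.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical platform hooks, stubbed out so the sketch compiles; on a real
 * system they would talk to the local synchronization controller and to the
 * core's power-management logic. */
static void sync_ctrl_request_lock(int lock_id) { (void)lock_id; }
static bool sync_ctrl_lock_granted(int lock_id) { (void)lock_id; return true; }
static void cpu_enter_sleep(void) { /* would halt until a controller interrupt */ }

/* Acquire a lock with no bus traffic and no cache spinning: enqueue the
 * request at the local controller, then sleep until it signals the grant. */
static void lock_with_sleep(int lock_id) {
    sync_ctrl_request_lock(lock_id);
    while (!sync_ctrl_lock_granted(lock_id))
        cpu_enter_sleep();
}

int main(void) {
    lock_with_sleep(7);
    puts("lock 7 acquired after a low-power wait");
    return 0;
}
```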


ACM Transactions on Design Automation of Electronic Systems | 2009

Temperature-aware register reallocation for register file power-density minimization

Xiangrong Zhou; Chenjie Yu; Peter Petrov

Increased chip temperature has been known to cause severe reliability problems and to significantly increase leakage power. The register file has previously been shown to exhibit the highest temperature of all hardware components in a modern high-end embedded processor, which makes it particularly susceptible to faults and elevated leakage power. We show that this is mostly due to highly clustered register file accesses, where a small set of registers placed physically close to each other is accessed with very high frequency. We propose compile-time, temperature-aware register reallocation methodologies that break up such clusters and distribute accesses uniformly across the register file. This is achieved with no performance or hardware overhead. We show that the underlying problem is NP-hard, and subsequently introduce and evaluate two efficient algorithmic heuristics. An extensive experimental study demonstrates the efficiency of the proposed methodology.
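
One way to see the intended effect is to compare a crude power-density proxy, here the largest access-count sum over any window of physically adjacent register-file entries, before and after remapping; the sketch below uses made-up counts and a simple stride-based remapping, not the paper's heuristics or thermal model.

```c
#include <stdio.h>

#define NREGS  32
#define WINDOW 4   /* physically adjacent registers assumed to heat each other */

/* Proxy for the hottest spot: the largest access-count sum over any WINDOW
 * of physically consecutive register-file entries. */
static unsigned hottest_window(const unsigned phys_access[NREGS]) {
    unsigned worst = 0;
    for (int base = 0; base + WINDOW <= NREGS; base++) {
        unsigned sum = 0;
        for (int k = 0; k < WINDOW; k++)
            sum += phys_access[base + k];
        if (sum > worst) worst = sum;
    }
    return worst;
}

int main(void) {
    /* made-up profile: a hot cluster sits in physical slots 0..3 */
    unsigned before[NREGS] = { 900, 870, 850, 820 };  /* remaining slots are 0 */
    unsigned after[NREGS]  = { 0 };

    /* stride-13 remapping of the four hot registers (13 is coprime with 32) */
    for (int rank = 0; rank < 4; rank++)
        after[(rank * 13) % NREGS] = before[rank];

    printf("hottest window before remap: %u\n", hottest_window(before)); /* 3440 */
    printf("hottest window after remap:  %u\n", hottest_window(after));  /* 900  */
    return 0;
}
```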


International Conference on Hardware/Software Codesign and System Synthesis | 2007

Aggressive snoop reduction for synchronized producer-consumer communication in energy-efficient embedded multi-processors

Chenjie Yu; Peter Petrov

Snoop-based cache coherence protocols are typically used when multiple processor cores share memory through a common bus. It is well known, however, that these protocols introduce an excessive power overhead. To help alleviate this problem, we propose an application-driven customization technique in which application knowledge regarding data sharing in producer-consumer relationships is used to aggressively eliminate unnecessary and predictable snoop-induced cache tag lookups, even for references to shared data, thus achieving significant power reduction with minimal hardware cost. Snoop-induced cache tag lookups for accesses to both shared and private data are eliminated when it is ensured that such lookups will not yield extra knowledge about the cache state with respect to the other caches and memories. The proposed methodology relies on combined support from the compiler, the operating system, and the hardware architecture. Our experiments show average power reductions of more than 80% compared to a general-purpose snoop protocol.
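
The filtering decision can be pictured as a small check in front of the tag array: a snoop-induced lookup proceeds only if the bus address falls in a known shared buffer that this cache may actually hold. The sketch below is a simplified software model with invented structure and field names, not the paper's hardware design.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified model of one shared producer-consumer buffer as seen by the
 * snoop filter of a single cache; the structure and field names are invented
 * purely for illustration. */
typedef struct {
    uintptr_t base;
    size_t    size;
    bool      may_hold_copy;   /* can this cache currently hold valid data from
                                  the buffer (e.g. it is the consumer and the
                                  producer has already handed the buffer over)? */
} shared_region;

/* Decide whether a snooped bus address needs a tag lookup in this cache. */
static bool snoop_lookup_needed(const shared_region *regions, int nregions,
                                uintptr_t addr) {
    for (int i = 0; i < nregions; i++) {
        const shared_region *r = &regions[i];
        if (addr >= r->base && addr < r->base + r->size)
            return r->may_hold_copy;   /* known shared buffer: skip if no copy */
    }
    return false;   /* another core's private data: a lookup adds no knowledge */
}

int main(void) {
    shared_region regions[1] = { { 0x8000, 0x1000, false } };
    printf("snoop of 0x8100 -> %s\n",
           snoop_lookup_needed(regions, 1, 0x8100) ? "tag lookup" : "filtered");
    return 0;
}
```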


Symposium on Integrated Circuits and Systems Design | 2010

Adaptive multi-threading for dynamic workloads in embedded multiprocessors

Chenjie Yu; Peter Petrov

We present a framework for run-time parallelism adaptation of multithreaded applications in multi-core systems. Multi-core systems often execute diverse workloads of multithreaded programs with different system resource utilizations and varying levels of parallelism. As a result, the availability of system resources for individual components of the workload changes at run-time in an unpredictable manner. Consequently, the level of parallelism determined statically by the system infrastructure, e.g., the number of concurrent threads, can be suboptimal and lead to performance degradation. The proposed framework monitors the dynamically changing shared system resources, such as the available processor cores, and adapts the number of threads used by the application throughout a parallel loop execution so as to match the parallelism level to the changed state of the system resources. The end result is the elimination of the sizable overhead, due to an improper level of parallelism and the resulting serialization of threads on a single core, that can easily occur in a dynamic system environment.
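
A crude software analogue of this adaptation using OpenMP: before each chunk of a parallel loop, the number of worker threads is reset to the number of cores currently considered available. The cores_available_now stub is an invented placeholder; in the paper's framework this information comes from run-time monitoring of the shared system.

```c
#include <omp.h>
#include <stdio.h>

#define N      (1 << 20)
#define CHUNK  (1 << 16)

/* Stub for "how many cores can this application use right now?".  Here it
 * simply falls back to the number of processors OpenMP reports; a real
 * framework would sense the actual availability at run-time. */
static int cores_available_now(void) {
    return omp_get_num_procs();
}

int main(void) {
    static double a[N];

    for (int start = 0; start < N; start += CHUNK) {
        int end = start + CHUNK < N ? start + CHUNK : N;

        /* re-evaluate the parallelism level before every chunk so the thread
         * count tracks the (possibly changed) set of available cores */
        omp_set_num_threads(cores_available_now());

        #pragma omp parallel for
        for (int i = start; i < end; i++)
            a[i] = a[i] * 2.0 + 1.0;
    }

    printf("a[0] = %f, a[N-1] = %f\n", a[0], a[N - 1]);
    return 0;
}
```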


Design Automation Conference | 2008

Latency and bandwidth efficient communication through system customization for embedded multiprocessors

Chenjie Yu; Peter Petrov

We present a cross-layer customization methodology for latency- and bandwidth-efficient inter-core communication in embedded multiprocessors. The methodology integrates compiler, operating system, and hardware support to achieve bandwidth-efficient, snoop-free, and coherence-miss-free shared-memory communication between synchronized producer and consumer cores. A compiler-driven code transformation is introduced that utilizes simple ISA support in the form of a special write-through store instruction. It ensures that producer writes are propagated to the consumers with a single bus transaction per cache block when the producer performs the last write to that cache line before exiting its synchronization region. Information regarding the shared buffers involved in the communication is captured by the OS and provided to the cores in order to filter bus traffic and perform remote updates when necessary. The end result of the proposed methodology is a single bus transaction per shared cache block and snoop-free communication between a producer and a set of consumers, with no intervening coherence misses on the consumer caches. Our experiments demonstrate significant reductions in both bus traffic and cache misses for a set of multiprocessor benchmarks.
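
The transformation can be sketched as follows: inside the producer's synchronization region, the compiler emits the special write-through store only for the last write to each cache line, so every shared line reaches the bus exactly once. STORE_WT below is a hypothetical stand-in for that ISA support and simply degenerates to a plain store here.

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_WORDS 8   /* 32-byte cache line of 4-byte words, assumed for illustration */

/* Hypothetical stand-in for the special write-through store; on a real target
 * this would be an intrinsic or inline asm that pushes the whole dirty line
 * onto the bus in a single transaction. */
#define STORE_WT(p, v)  (*(p) = (v))

/* Producer filling a shared buffer inside its synchronization region.  Only
 * the last write to each cache line uses the write-through store, so each
 * line is propagated to the consumers with one bus transaction. */
static void produce(uint32_t *shared_buf, int nwords) {
    for (int i = 0; i < nwords; i++) {
        uint32_t v = (uint32_t)i * 3u;
        if ((i + 1) % LINE_WORDS == 0 || i == nwords - 1)
            STORE_WT(&shared_buf[i], v);   /* last word of the line: push it out */
        else
            shared_buf[i] = v;             /* ordinary cached store */
    }
}

int main(void) {
    uint32_t buf[64];
    produce(buf, 64);
    printf("buf[63] = %u\n", buf[63]);
    return 0;
}
```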


ACM Transactions on Design Automation of Electronic Systems | 2008

Application-aware snoop filtering for low-power cache coherence in embedded multiprocessors

Xiangrong Zhou; Chenjie Yu; Alokika Dash; Peter Petrov

Maintaining local caches coherently in shared-memory multiprocessors results in significant power consumption. The customization methodology we propose exploits the fact that in embedded systems, important knowledge about memory sharing between tasks is available to the system designers. We demonstrate how snoop-induced cache probing can be significantly reduced by identifying and exploiting, in a deterministic way, the memory regions shared between the processors. Snoop activity is enabled only for accesses referring to known shared regions. The hardware support is not only cost-efficient but also software-programmable, which allows customization across different tasks and applications.
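
In spirit, the hardware resembles a handful of software-programmable range registers: the OS loads the shared regions it knows about, and a snoop probes the cache tags only when the bus address matches one of them. The model below uses invented register names and base/mask matching purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NREGIONS 4   /* assumed number of programmable region registers */

/* Invented model of the region registers: a base/mask pair per entry plus a
 * valid bit.  The OS programs these when it sets up shared mappings. */
static struct { uint32_t base, mask; bool valid; } region_reg[NREGIONS];

static void program_region(int idx, uint32_t base, uint32_t mask) {
    region_reg[idx].base  = base & mask;
    region_reg[idx].mask  = mask;
    region_reg[idx].valid = true;
}

/* A snoop probes the cache tags only for addresses inside a known shared region. */
static bool snoop_enabled(uint32_t addr) {
    for (int i = 0; i < NREGIONS; i++)
        if (region_reg[i].valid && (addr & region_reg[i].mask) == region_reg[i].base)
            return true;
    return false;   /* private memory: skip the tag probing entirely */
}

int main(void) {
    program_region(0, 0x80100000u, 0xFFFF0000u);   /* a 64 KB shared buffer */
    printf("0x80104000 -> %s\n", snoop_enabled(0x80104000u) ? "probe" : "skip");
    printf("0x20000000 -> %s\n", snoop_enabled(0x20000000u) ? "probe" : "skip");
    return 0;
}
```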


IEEE Transactions on Very Large Scale Integration Systems | 2009

Low-Power Snoop Architecture for Synchronized Producer-Consumer Embedded Multiprocessing

Chenjie Yu; Peter Petrov

We introduce a cross-layer customization methodology in which application knowledge regarding data sharing in producer-consumer relationships is used to aggressively eliminate unnecessary and predictable snoop-induced cache lookups, even for references to shared data, thus achieving significant power reductions with minimal hardware cost. The technique exploits application-specific information regarding the exact producer-consumer relationships between tasks, as well as the precise timing of synchronized accesses to shared memory buffers by their corresponding producers and/or consumers. Snoop-induced cache lookups for accesses to the shared data are eliminated when it is ensured that such lookups will not yield extra knowledge about the cache state with respect to the other caches and the memory. Our experiments show average power reductions of more than 80% compared to a general-purpose snoop protocol.


ACM Transactions on Design Automation of Electronic Systems | 2010

Energy- and Performance-Efficient Communication Framework for Embedded MPSoCs through Application-Driven Release Consistency

Chenjie Yu; Peter Petrov

We present a framework for performance-, bandwidth-, and energy-efficient intercore communication in embedded MultiProcessor Systems-on-a-Chip (MPSoCs). The methodology seamlessly integrates compiler, operating system, and hardware support to achieve low-cost communication between synchronized producers and consumers. The technique is especially beneficial for data-streaming applications exploiting pipeline parallelism, with computational phases mapped to separate cores. Code transformations utilizing simple ISA support ensure that producer writes are propagated to consumers with a single interconnect transaction per cache block just before the producer exits its synchronization region. Furthermore, in order to completely eliminate misses to shared data caused by interference with private data, and to minimize cache energy, we integrate into the proposed framework a cache way-partitioning policy based on simple cache-configurability support, which isolates the shared buffers from other cache traffic. This mechanism results in significant power savings, since only a subset of the cache ways needs to be looked up for each cache access. The end result of the proposed framework is a single communication transaction per shared cache block between a producer and a consumer, with no coherence misses on the consumer caches. Our experiments demonstrate significant reductions in interconnect traffic, cache misses, and energy for a set of multiprocessor benchmarks.
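
The way-partitioning idea can be modelled as restricting each lookup to the ways assigned to its traffic class, so shared-buffer accesses never interfere with private data and only a fraction of the ways is activated per access. The toy model below (a 4-way cache with one way reserved for shared buffers, all parameters assumed) shows only the lookup path.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NSETS  64
#define NWAYS  4
#define SHARED_WAY_MASK  0x8u   /* way 3 reserved for shared buffers (assumed) */
#define PRIVATE_WAY_MASK 0x7u   /* ways 0..2 for everything else */

typedef struct { bool valid; uint32_t tag; } line_t;
static line_t cache[NSETS][NWAYS];

/* Look up an address, probing only the ways allowed for its traffic class;
 * the unprobed ways stay idle, which is where the energy saving comes from. */
static bool lookup(uint32_t addr, bool is_shared_buffer) {
    uint32_t set  = (addr >> 5) % NSETS;          /* 32-byte lines, assumed */
    uint32_t tag  = addr >> 11;
    uint32_t ways = is_shared_buffer ? SHARED_WAY_MASK : PRIVATE_WAY_MASK;

    for (int w = 0; w < NWAYS; w++)
        if ((ways & (1u << w)) && cache[set][w].valid && cache[set][w].tag == tag)
            return true;
    return false;
}

int main(void) {
    uint32_t addr = 0x80001040u;
    uint32_t set = (addr >> 5) % NSETS, tag = addr >> 11;
    cache[set][3].valid = true;                   /* pretend the shared line is resident */
    cache[set][3].tag   = tag;
    printf("shared lookup:  %s\n", lookup(addr, true)  ? "hit" : "miss");
    printf("private lookup: %s\n", lookup(addr, false) ? "hit" : "miss");
    return 0;
}
```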

Collaboration


Dive into Chenjie Yu's collaborations.

Top Co-Authors

Xiangrong Zhou

University of Hawaii at Manoa
