Liqun Cheng | Researchain

Publication

Featured research published by Liqun Cheng.

Journal of Parallel and Distributed Computing | 2005

Fast synchronization on shared-memory multiprocessors: An architectural approach

Zhen Fang; Lixin Zhang; John B. Carter; Liqun Cheng; Michael A. Parker

Synchronization is a crucial operation in many parallel applications. Conventional synchronization mechanisms are failing to keep up with the increasing demand for efficient synchronization operations as systems grow larger and network latency increases. The contributions of this paper are threefold. First, we revisit some representative synchronization algorithms in light of recent architecture innovations and provide an example of how the simplifying assumptions made by typical analytical models of synchronization mechanisms can lead to significant performance estimate errors. Second, we present an architectural innovation called active memory that enables very fast atomic operations in a shared-memory multiprocessor. Third, we use execution-driven simulation to quantitatively compare the performance of a variety of synchronization mechanisms based on both existing hardware techniques and active memory operations. To the best of our knowledge, synchronization based on active memory outperforms all existing spinlock and non-hardwired barrier implementations by a large margin.

IEEE International Conference on High Performance Computing Data and Analytics | 2008

Extending CC-NUMA systems to support write update optimizations

Liqun Cheng; John B. Carter

Processor stalls and protocol messages caused by coherence misses limit the performance of shared memory applications. Modern multiprocessors employ write-invalidate coherence protocols, which induce read misses to ensure consistency. Previous research has shown that an invalidate protocol is not optimal for all memory access patterns - an update protocol can significantly outperform an invalidate protocol when data is heavily shared or accessed in predictable patterns. However, update protocols can generate excessive network traffic and are difficult to build on a scalable (non-bus) interconnect. To obtain the benefits of both invalidate and update protocols, we built a speculative sequentially consistent write-update mechanism on top of a write-invalidate protocol. To ensure coherence, a processor wishing to write to a block of data uses a traditional write-invalidate protocol to obtain exclusive access to the block before modifying it. To improve performance, the writing processor can later self-downgrade the modified block to the shared state and flush it back to its home node, which forwards the new data to processors that it predicts are likely to consume the data. We present a practical and cost-effective design for extending CC-NUMA systems to support this speculative update mechanism that requires no changes to the processor core, bus interface, or memory consistency model. We also present two hardware-efficient mechanisms for detecting access patterns that benefit from the speculative update mechanism, stable reader set and stream. We evaluate our update mechanisms on a wide range of scientific benchmarks and commercial applications. Using a cycle-accurate execution-driven simulator of a future 16-node SGI multiprocessor, we find that the mechanisms proposed in this paper reduce the average remote miss rate by 30%, reduce network traffic by 15%, and improve performance by 10%, and in no case hurt performance.

International Symposium on Microarchitecture | 2006

Leveraging Wire Properties at the Microarchitecture Level

Rajeev Balasubramonian; Naveen Muralimanohar; Karthik Ramani; Liqun Cheng; John B. Carter

In future microprocessors, communication will emerge as a major bottleneck. The authors advocate composing future interconnects of some wires that minimize latency, some that maximize bandwidth, and some that minimize power. A microarchitecture aware of these wire characteristics can steer on-chip data transfers to the most appropriate wires, thus improving performance and saving energy.

International Conference on Parallel Processing | 2005

Fast barriers for scalable ccNUMA systems

Liqun Cheng; John B. Carter

The contributions of this paper are threefold. First, we identify and quantify the performance deficiencies of conventional barrier implementations when they are executed on real (non-idealized) hardware. Second, we propose a queue-based barrier algorithm that has effectively O(1) time complexity as measured in round trip message latencies. Third, we demonstrate how matching the barrier implementation to the way that modern shared memory systems operate can improve performance dramatically by exploiting a hardware write-update (PUT) mechanism for signaling. The resulting barrier algorithm only costs one serialized round trip message latency to perform a barrier operation across N processors. Using a cycle-accurate execution-driven simulator of a future-generation SGI multiprocessor, we show that with no special hardware support our queue-based barrier outperforms OpenMP's LL/SC-based barrier implementation by a factor of 7.9 on 256 processors. With hardware that supports a coherent PUT operation, our queue-based barrier outperforms OpenMP barriers by a factor of 94 and outperforms barriers based on SGI's memory controller-based atomic operations by a factor of 6.5 on 256 processors.
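
For readers unfamiliar with barriers that avoid a single contended counter, the following is a minimal illustrative sketch of a flag-array barrier with one coordinator. It is not the paper's queue-based algorithm, whose details the abstract does not give, and it does not model the hardware PUT mechanism. The class name `FlagArrayBarrier`, the per-participant `arrived` slots, the `release` epoch counter, and the `demo` helper are all assumptions made for illustration. The point it shows is that each participant writes only its own slot, so arrivals do not serialize on one shared variable.

```python
import threading
import time


class FlagArrayBarrier:
    """Illustrative sketch only: per-participant arrival slots, one release flag.

    Relies on CPython making list-item and attribute assignment atomic;
    real hardware would need atomic operations and memory fences.
    """

    def __init__(self, n):
        self.n = n
        self.arrived = [0] * n  # hypothetical layout: one slot per participant
        self.release = 0        # epoch of the most recently completed barrier

    def wait(self, tid, epoch):
        # epoch counts barrier episodes: 1, 2, 3, ...
        self.arrived[tid] = epoch  # each participant writes only its own slot
        if tid == 0:
            # Coordinator: wait until every slot has reached this epoch,
            # then release everyone with a single write.
            while any(a < epoch for a in self.arrived):
                time.sleep(0)  # yield so other threads can run
            self.release = epoch
        else:
            while self.release < epoch:
                time.sleep(0)


def demo(n=4, rounds=3):
    """Run n threads through `rounds` barrier episodes and log each phase entry."""
    barrier = FlagArrayBarrier(n)
    log = []
    lock = threading.Lock()

    def worker(tid):
        for epoch in range(1, rounds + 1):
            with lock:
                log.append(epoch)  # record that this thread entered the phase
            barrier.wait(tid, epoch)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

If the barrier works, the log returned by `demo` is non-decreasing: no thread records phase 2 before every thread has recorded phase 1. The coordinator in this sketch still scans all N slots in software, so it only illustrates the separation of arrival writes from the release signal, not the O(1) round-trip cost reported in the abstract.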

International Symposium on Computer Architecture | 2006