Is this you? Create Your Porfile

Changhui Lin

University of California, Riverside

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Changhui Lin is active.

Explore More

Publication

Featured researches published by Changhui Lin.

international conference on parallel architectures and compilation techniques | 2010

Efficient sequential consistency using conditional fences

Changhui Lin; Vijay Nagarajan; Rajiv Gupta

Among the various memory consistency models, the sequential consistency (SC) model, in which memory operations appear to take place in the order specified by the program, is the most intuitive and enables programmers to reason about their parallel programs the best. Nevertheless, processor designers often choose to support relaxed memory consistency models because the weaker ordering constraints imposed by such models allow for more instructions to be reordered and enable higher performance. Programs running on machines supporting weaker consistency models, can be transformed into ones in which SC is enforced. The compiler does this by computing a minimal set of memory access pairs whose ordering automatically guarantees SC. To ensure that these memory access pairs are not reordered, memory fences are inserted. Unfortunately, insertion of such memory fences can significantly slowdown the program. We observe that the ordering of the minimal set of memory accesses that the compiler strives to enforce, is typically already enforced in the normal course of program execution. A study we conducted on programs with compiler inserted memory fences shows that only 8% of the executed instances of the memory fences are really necessary to ensure SC. Motivated by this study we propose the conditional fence mechanism (C-Fence) that utilizes compiler information to decide dynamically if there is a need to stall at each fence. Our experiments with SPLASH-2 benchmarks show that, with C-Fences, programs can be transformed to enforce SC incurring only 12% slowdown, as opposed to 43% slowdown using normal fence instructions. Our approach requires very little hardware support (<300 bytes of on-chip-storage) and it avoids the use of speculation and its associated costs.

architectural support for programming languages and operating systems | 2012

Efficient sequential consistency via conflict ordering

Changhui Lin; Vijay Nagarajan; Rajiv Gupta; Bharghava Rajaram

Although the sequential consistency (SC) model is the most intuitive, processor designers often choose to support relaxed memory consistency models for higher performance. This is because SC implementations that match the performance of relaxed memory models require post-retirement speculation and its associated hardware costs. In this paper we propose an efficient approach for enforcing SC without requiring post-retirement speculation. While prior SC implementations guarantee SC by explicitly completing memory operations within a processor in program order, we guarantee SC by completing conflicting memory operations, within and across processors, in an order that is consistent with the program order. More specifically, we identify those conflicting memory operations whose ordering is critical for the maintenance of SC and explicitly order them. This allows us to safely (non-speculatively) complete memory operations past pending writes, thus reducing memory ordering stalls. Our experiments with SPLASH-2 programs show that SC can be achieved efficiently, with performance comparable to RMO (relaxed memory order).

International Journal of Parallel Programming | 2012

Efficient Sequential Consistency Using Conditional Fences

Changhui Lin; Vijay Nagarajan; Rajiv Gupta

Among the various memory consistency models, the sequential consistency (SC) model is the most intuitive and enables programmers to reason about their parallel programs the best. Nevertheless, processor designers often choose to support relaxed memory consistency models because the weaker ordering constraints imposed by such models allow for more instructions to be reordered and enable higher performance. Programs running on machines supporting weaker consistency models can be transformed into ones in which SC is enforced. The compiler does this by computing a minimal set of memory access pairs whose ordering automatically guarantees SC. To ensure that these memory access pairs are not reordered, memory fences are inserted. Unfortunately, insertion of such memory fences can significantly slowdown the program. We observe that the ordering of the minimal set of memory accesses that the compiler strives to enforce, is typically already enforced in the normal course of program execution. A study we conducted on programs with compiler inserted memory fences shows that only 8% of the executed instances of the memory fences are really necessary to ensure SC. Motivated by this study we propose the conditional fence mechanism, known as C-Fence that utilizes compiler information to decide dynamically if there is a need to stall at each fence, only stalling when necessary. Our experiments with SPLASH-2 benchmarks show that, with C-Fences and a centralized active table, programs can be transformed to enforce SC incurring only 12% slowdown, as opposed to 43% slowdown using normal fence instructions. Our approach requires very little hardware support (<350 bytes of on-chip-storage) and it avoids the use of speculation and its associated costs. Furthermore, to ameliorate the contention in the centralized active table arising from the increasing number of processors, we also design a distributed active table, which further improves the performance of C-Fence for a larger number of processors.

acm sigplan symposium on principles and practice of parallel programming | 2011

Enhanced speculative parallelization via incremental recovery

Chen Tian; Changhui Lin; Min Feng; Rajiv Gupta

The widespread availability of multicore systems has led to an increased interest in speculative parallelization of sequential programs using software-based thread level speculation. Many of the proposed techniques are implemented via state separation where non-speculative computation state is maintained separately from the speculative state of threads performing speculative computations. If speculation is successful, the results from speculative state are committed to non-speculative state. However, upon misspeculation, discard-all scheme is employed in which speculatively computed results of a thread are discarded and the computation is performed again. While this scheme is simple to implement, one disadvantage of discard-all is its inability to tolerate high misspeculation rates due to its high runtime overhead. Thus, it is not suitable for use in applications where misspeculation rates are input dependent and therefore may reach high levels. In this paper we develop an approach for incremental recovery in which, instead of discarding all of the results and reexecuting the speculative computation in its entirety, the computation is restarted from the earliest point at which a misspeculation causing value is read. This approach has two advantages. First, the cost of recovery is reduced as only part of the computation is reexecuted. Second, since recovery takes less time, the likelihood of future misspeculations is reduced. We design and implement a strategy for implementing incremental recovery that allows results of partial computations to be efficiently saved and reused. For a set of programs where misspeculation rate is input dependent, our experiments show that with inputs that result in misspeculation rates of around 40% and 80%, applying incremental recovery technique results in 1.2x-3.3x and 2.0x-6.6x speedups respectively over the discard-all recovery scheme. Furthermore, misspeculations observed during discard-all scheme are reduced when incremental recovery is employed -- reductions range from 10% to 85%.

ACM Transactions on Architecture and Code Optimization | 2011

Dynamic access distance driven cache replacement

Min Feng; Chen Tian; Changhui Lin; Rajiv Gupta

In this article, we propose a new cache replacement policy that makes the replacement decision based on the reuse information of the cache lines and the requested data. We present the architectural support and evaluate the performance of our approach using SPEC benchmarks. We also develop two reuse information predictors: a profile-based static predictor and a runtime predictor. The applicability of each predictor is discussed in this paper. We further extend our reuse information predictors so that the cache can adaptively choose between the reuse information based replacement policy and an approximation of LRU policy. According to the experimental results, our adaptive reuse information based replacement policy performs either better than or close to the LRU policy. Our experiments show that L2 cache misses are reduced by 12.32% and 19.95% using the profiling-based static and runtime adaptive predictors respectively.

ieee international conference on high performance computing data and analytics | 2014

Fence scoping

Changhui Lin; Vijay Nagarajan; Rajiv Gupta

We observe that fence instructions used by programmers are usually only intended to order memory accesses within a limited scope. Based on this observation, we propose the concept fence scope which defines the scope within which a fence enforces the order of memory accesses, called scoped fence (S-Fence). S-Fence is a customizable fence, which enables programmers to express ordering demands by specifying the scope of fences when they only want to order part of memory accesses. At runtime, hardware uses the scope information conveyed by programmers to execute fence instructions in a manner that imposes fewer memory ordering constraints than a traditional fence, and hence improves program performance. Our experimental results show that the benefit of S-Fence hinges on the characteristics of applications and hardware parameters. A group of lock-free algorithms achieve peak speedups ranging from 1.13x to 1.34x, while full applications achieve speedups ranging from 1.04x to 1.23x.

international conference on supercomputing | 2013

Address-aware fences

Changhui Lin; Vijay Nagarajan; Rajiv Gupta

Many modern multicore architectures support shared memory for ease of programming and relaxed memory models to deliver high performance. With relaxed memory models, memory accesses can be reordered dynamically and seen by other processors. Therefore, fence instructions are provided to enforce the memory orderings that are critical to the correctness of a program. However, fence instructions are costly as they cause the processor to stall. Prior works have observed that most of the executions of fence instructions are unnecessary. In this paper we propose address-aware fence, a hardware solution for reducing the overhead of fence instructions without resorting to speculation. Address-aware fence only enforces memory orderings that are necessary to maintain the effect that the traditional fence strives to enforce. This is achieved by dynamically checking a condition for when an execution of a fence must take effect and delay the memory accesses following the fence. When a fence instruction is encountered, first, necessary memory addresses are collected to form a watchlist, and then, only the memory accesses to addresses that are contained in the watchlist are delayed. The memory accesses whose addresses are not contained in the watchlist are allowed to complete without waiting for the completion of pending memory accesses from before the fence. Our experiments conducted on a group of concurrent lock-free algorithms and SPLASH-2 benchmarks show that address-aware fence eliminates nearly all the overhead due to fences and achieves an average improvement of 12.2\% on programs with traditional fences.

high performance embedded architectures and compilers | 2012

PLDS: Partitioning linked data structures for parallelism

Min Feng; Changhui Lin; Rajiv Gupta

Recently, parallelization of computations in the presence of dynamic data structures has shown promising potential. In this paper, we present PLDS, a system for easily expressing and efficiently exploiting parallelism in computations that are based on dynamic linked data structures. PLDS improves the execution efficiency by providing support for data partitioning and then distributing computation across threads based on the partitioning. Such computations often require the use of speculation to exploit dynamic parallelism. PLDS supports a conditional speculation mechanism that reduces the cost of speculation. PLDS can be employed in the context of different forms of parallelism, which to cover a wide range of parallel applications. PLDS provides easy-to-use compiler directives, using enabling programmers to choose from among a variety of data partitionings to distribute computation across threads in a partitioning-sensitive fashion, and to use conditional speculation when required. We evaluate our implementation of PLDS using ten benchmarks, of which six are parallelized using speculation. PLDS achieves 1.3x--6.9x speedups on an 8-core machine.

arXiv: Databases | 2016