Publication


Featured research published by Huimin Cui.


Symposium on Code Generation and Optimization | 2010

An adaptive task creation strategy for work-stealing scheduling

Lei Wang; Huimin Cui; Yuelu Duan; Fang Lu; Xiaobing Feng; Pen-Chung Yew

Work-stealing is a key technique in many multithreaded programming languages for achieving good load balancing. Current work-stealing techniques incur high implementation overhead in some applications and require a large amount of memory for data copying to assure correctness. They also cannot handle many application programs that have an unbalanced call tree or no definitive working set. In this paper, we propose a new adaptive task creation strategy, called AdaptiveTC, which supports effective work-stealing schemes and handles the above problems effectively. Experimental results show that AdaptiveTC runs 2.71x faster than Cilk and 1.72x faster than Tascell for the 16-queen problem with 8 threads.
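The central decision in such a scheme, whether a recursive call is worth wrapping in a stealable task or should simply run inline, can be sketched in Python. This is a hypothetical simplification: AdaptiveTC adapts task creation dynamically at runtime, while this sketch uses a fixed depth cutoff, and `fib` stands in for an arbitrary divide-and-conquer workload.

```python
from concurrent.futures import ThreadPoolExecutor

def fib(n):
    """Plain serial recursion: no task-creation overhead."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def fib_adaptive(n, pool, depth=0, cutoff=3):
    # Near the top of the call tree, subproblems are large, so paying the
    # cost of creating a stealable task is worthwhile; below the cutoff,
    # recurse serially to avoid task-creation and data-copying overhead.
    if n < 2:
        return n
    if depth < cutoff:
        left = pool.submit(fib_adaptive, n - 1, pool, depth + 1, cutoff)
        right = fib_adaptive(n - 2, pool, depth + 1, cutoff)
        return left.result() + right
    return fib(n - 1) + fib(n - 2)

with ThreadPoolExecutor(max_workers=8) as pool:
    print(fib_adaptive(20, pool))  # 6765
```

A work-stealing runtime would replace the fixed cutoff with runtime feedback, for example creating tasks only while idle workers are issuing steal requests; that feedback loop is the adaptivity the paper's strategy provides.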


International Parallel and Distributed Processing Symposium | 2011

Automatic Library Generation for BLAS3 on GPUs

Huimin Cui; Lei Wang; Jingling Xue; Yang Yang; Xiaobing Feng

High-performance libraries, the performance-critical building blocks of high-level applications, will assume greater importance on modern processors as those processors become more complex and diverse. However, automatic library generators are still immature, forcing library developers to tune libraries manually to meet their performance objectives. We are developing a new script-controlled compilation framework that helps domain experts eliminate much of the tedious and error-prone manual tuning, by enabling them to leverage their expertise and reuse past optimization experience. We demonstrate the improved performance and productivity obtained by using our framework to tune BLAS3 routines on three GPU platforms: speedups over CUBLAS of up to 5.4x on an NVIDIA GeForce 9800, 2.8x on a GTX285, and 3.4x on a Fermi Tesla C2050. Our results highlight the potential benefits of exploiting domain expertise and the relations between different routines (in terms of their algorithms and data structures).


International Conference on Parallel Architectures and Compilation Techniques | 2013

An empirical model for predicting cross-core performance interference on multicore processors

Jiacheng Zhao; Huimin Cui; Jingling Xue; Xiaobing Feng; Youliang Yan; Wensen Yang

Despite their widespread adoption in cloud computing, multicore processors are heavily under-utilized in terms of computing resources. To avoid the potential for negative and unpredictable interference, co-location of a latency-sensitive application with others on the same multicore processor is disallowed, leaving many cores idle and causing low machine utilization. To enable co-location while providing QoS guarantees, it is challenging but important to predict performance interference between co-located applications. This research is driven by two key insights. First, the performance degradation of an application can be represented as a predictor function of the aggregate pressures on shared resources from all cores, regardless of which applications are co-running and what their individual pressures are. Second, a predictor function is piecewise rather than non-piecewise as in prior work, thereby enabling different types of dominant contention factors to be more accurately captured by different subfunctions in its different subdomains. Based on these insights, we propose to adopt a two-phase regression approach to efficiently build a predictor function. Validation using a large number of benchmarks and nine real-world datacenter applications on three different platforms shows that our approach is also precise, with an average error not exceeding 0.4%. When applied to the nine datacenter applications, our approach improves overall resource utilization from 50% to 88% at the cost of 10% QoS degradation.
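The notion of a piecewise predictor function can be sketched with a single pressure dimension and a fixed breakpoint (a hypothetical simplification: the paper's two-phase regression also learns the subdomain boundaries and handles multiple shared resources):

```python
# Fit a separate linear model on each side of a breakpoint, so different
# dominant contention factors get their own subfunction.

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def piecewise_predictor(samples, breakpoint):
    lo = [(p, d) for p, d in samples if p <= breakpoint]
    hi = [(p, d) for p, d in samples if p > breakpoint]
    f_lo, f_hi = fit_line(*zip(*lo)), fit_line(*zip(*hi))
    def predict(pressure):
        a, b = f_lo if pressure <= breakpoint else f_hi
        return a * pressure + b
    return predict

# Synthetic profile: degradation grows slowly until a shared resource
# saturates at pressure 10, then steeply.
samples = [(p, 0.5 * p if p <= 10 else 5 + 4 * (p - 10)) for p in range(21)]
predict = piecewise_predictor(samples, 10)
print(predict(5), predict(15))  # close to 2.5 and 25.0
```

A single global line would badly mispredict on one side of the saturation point; the piecewise fit captures both regimes, which is the paper's second insight in miniature.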


International Parallel and Distributed Processing Symposium | 2012

A Highly Parallel Reuse Distance Analysis Algorithm on GPUs

Huimin Cui; Qing Yi; Jingling Xue; Lei Wang; Yang Yang; Xiaobing Feng

Reuse distance analysis is a runtime approach that has been widely used to accurately model the memory-system behavior of applications. However, traditional reuse distance analysis algorithms use tree-based data structures and are hard to parallelize, missing the tremendous computing power of modern architectures such as emerging GPUs. This paper presents a highly parallel reuse distance analysis algorithm (HP-RDA) to speed up the process using the SPMD execution model of GPUs. In particular, we propose a hybrid data structure of a hash table and local arrays to flatten the traditional tree representation of memory access traces. Further, we use a probabilistic model to correct any loss of precision from a straightforward parallelization of the original sequential algorithm. Our experimental results show that on an NVIDIA GPU, our algorithm achieves a factor of 20 speedup over the traditional sequential algorithm with less than 1% loss in precision.
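For reference, the quantity being computed can be expressed sequentially in a few lines of Python (an illustrative O(n²) formulation; the paper's contribution is the GPU-parallel algorithm, in which a hash table plus local arrays replace this scan):

```python
def reuse_distances(trace):
    """For each access, return the number of distinct addresses touched
    since the previous access to the same address (inf for first-time
    accesses)."""
    last_seen = {}  # address -> index of its most recent access
    out = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # Distinct addresses strictly between the two accesses to addr.
            out.append(len(set(trace[last_seen[addr] + 1 : i])))
        else:
            out.append(float("inf"))
        last_seen[addr] = i
    return out

print(reuse_distances(list("abcab")))  # [inf, inf, inf, 2, 2]
```

The histogram of these distances predicts cache miss ratios for any cache size, which is why speeding up the analysis matters.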


International Conference on Supercomputing | 2015

Hadoop+: Modeling and Evaluating the Heterogeneity for MapReduce Applications in Heterogeneous Clusters

Wenting He; Huimin Cui; Binbin Lu; Jiacheng Zhao; Shengmei Li; Gong Ruan; Jingling Xue; Xiaobing Feng; Wensen Yang; Youliang Yan

Despite the widespread adoption of heterogeneous clusters in modern data centers, modeling heterogeneity remains a big challenge, especially for large-scale MapReduce applications. In a CPU/GPU hybrid heterogeneous cluster, allocating more computing resources to a MapReduce application does not always mean better performance, since simultaneously running CPU and GPU tasks contend for shared resources. This paper proposes a heterogeneity model that predicts the shared-resource contention between the simultaneously running tasks of a MapReduce application when heterogeneous computing resources (e.g., CPUs and GPUs) are allocated. To support the approach, we present a heterogeneous MapReduce framework, Hadoop+, which enables CPUs and GPUs to process big data in a coordinated fashion and leverages the heterogeneity model to assist users in selecting computing resources for different purposes. Our experimental results show three benefits. First, Hadoop+ exploits GPU capability, achieving 1.4x to 16.1x speedups over Hadoop for 5 real applications running individually. Second, the heterogeneity model can be used to allocate GPUs among multiple simultaneously running MapReduce applications, bringing up to 36.9% (17.6% on average) speedup when multiple applications run simultaneously. Third, the model is shown to select the optimal or most cost-effective resource allocation.


Journal of Computer Science and Technology | 2012

A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs

Yang Yang; Huimin Cui; Xiaobing Feng; Jingling Xue

In this paper, we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPUs by carefully balancing the usage of registers and shared memory. Unlike earlier methods that rely on circular queues predominantly implemented in indirectly addressable shared memory, our hybrid method exploits a new reuse pattern spanning the multiple time steps of a stencil computation, so that circular queues can be implemented effectively with both shared memory and registers in a balanced manner. We describe a framework that automatically finds the best placement of data in registers and shared memory in order to maximize the performance of stencil computations. Validation using four different types of stencils on three different GPU platforms shows that our hybrid method achieves speedups of up to 2.93x over methods that use circular queues implemented with shared memory only.
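The circular-queue idea itself can be illustrated in 1-D Python: keep only the most recent time planes in a small rotating buffer rather than the full time history. The register/shared-memory split the paper balances has no direct analogue here; this sketch shows only the reuse pattern across time steps.

```python
def jacobi_1d(a, steps):
    """3-point averaging stencil iterated over time with a two-plane
    rotating buffer (the smallest possible circular queue)."""
    cur, nxt = list(a), list(a)
    for _ in range(steps):
        for i in range(1, len(a) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur, nxt = nxt, cur  # rotate the queue instead of allocating a plane
    return cur

print(jacobi_1d([0, 0, 3, 0, 0], 1))  # [0, 1.0, 1.0, 1.0, 0]
```

Deeper stencils (or fusing several time steps per pass) need more planes in the queue; on a GPU, deciding which planes live in registers and which in shared memory is the placement problem the paper's framework automates.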


Computing Frontiers | 2014

A collaborative divide-and-conquer K-means clustering algorithm for processing large data

Huimin Cui; Gong Ruan; Jingling Xue; Rui Xie; Lei Wang; Xiaobing Feng

K-means clustering plays a vital role in data mining. As an iterative computation, its performance suffers when applied to tremendous amounts of data, due to poor temporal locality across its iterations. The state-of-the-art streaming algorithm, which streams the data from disk into memory and operates on the partitioned streams, improves temporal locality but can misplace objects in clusters, since different partitions are processed locally. This paper presents a collaborative divide-and-conquer algorithm that significantly improves on the state-of-the-art, based on two key insights. First, we introduce a break-and-recluster procedure to identify the clusters with misplaced objects. Second, we introduce collaborative seeding between different partitions to accelerate the convergence inside each partition. Compared with the streaming algorithm on datasets of Wikipedia webpages, our collaborative algorithm improves clustering quality by up to 35.3% (8.8% on average) while decreasing execution time by 0.3% to 80.1% (48.6% on average).
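The collaborative-seeding ingredient, processing one partition at a time and seeding each partition with the centroids converged on the previous one, can be sketched on 1-D data. This is an illustrative simplification: the break-and-recluster procedure for misplaced objects is omitted, and all names are hypothetical.

```python
import random

def kmeans(points, seeds, iters=20):
    """Plain Lloyd's algorithm on 1-D points."""
    centroids = list(seeds)
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Empty clusters keep their old centroid.
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids

def collaborative_kmeans(partitions, k):
    # Stream one partition at a time; carry the converged centroids forward
    # as seeds so each partition starts near a good solution, accelerating
    # convergence inside the partition.
    seeds = random.sample(partitions[0], k)
    for part in partitions:
        seeds = kmeans(part, seeds)
    return seeds

# Three partitions of points drawn around two true centers, 0 and 100.
random.seed(0)
parts = [[random.gauss(m, 1.0) for m in (0.0, 100.0) for _ in range(50)]
         for _ in range(3)]
centers = sorted(collaborative_kmeans(parts, 2))
print(centers)  # close to [0, 100]
```

Because later partitions inherit nearly converged seeds, they need far fewer Lloyd iterations than a cold start, which is where the execution-time savings reported above come from.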


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2016

Articulation points guided redundancy elimination for betweenness centrality

Lei Wang; Fan Yang; Liangji Zhuang; Huimin Cui; Fang Lv; Xiaobing Feng

Betweenness centrality (BC) is an important metric in graph analysis that identifies critical vertices in large-scale networks based on shortest-path enumeration. Typically, a BC algorithm constructs a shortest-path DAG for each vertex to calculate its BC score. However, for emerging real-world graphs, even the state-of-the-art BC algorithm introduces a number of redundancies, as suggested by the existence of articulation points. Articulation points imply common sub-DAGs shared among the DAGs of different vertices, but existing algorithms do not leverage this information and miss the optimization opportunity. We propose a redundancy elimination approach that identifies the common sub-DAGs shared between the DAGs for different vertices. Our approach leverages the articulation points and reuses the results of the common sub-DAGs in calculating BC scores, eliminating redundant computations. We implemented the approach as an algorithm with two-level parallelism and evaluated it on a multicore platform. Compared to the state-of-the-art shared-memory implementation, our approach achieves an average speedup of 4.6x across a variety of real-world graphs, with traversal rates between 45 and 2400 MTEPS (millions of traversed edges per second).
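As a building block, the articulation points themselves can be found with a single DFS low-link pass (Tarjan's method). This sketch shows only that detection step, not the sub-DAG reuse in the BC calculation itself.

```python
def articulation_points(adj):
    """Articulation points of an undirected graph given as
    {vertex: [neighbors]} with vertices numbered 0..n-1."""
    n = len(adj)
    disc = [0] * n  # discovery times, 0 = unvisited
    low = [0] * n   # lowest discovery time reachable via one back edge
    ap = set()
    timer = [1]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for v in adj[u]:
            if not disc[v]:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # u separates v's subtree unless a back edge climbs above u.
                if parent is not None and low[v] >= disc[u]:
                    ap.add(u)
            elif v != parent:
                low[u] = min(low[u], disc[v])
        if parent is None and children > 1:
            ap.add(u)

    for u in range(n):
        if not disc[u]:
            dfs(u, None)
    return ap

# Two triangles joined at vertex 2: removing 2 disconnects the graph.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
print(articulation_points(adj))  # {2}
```

Every shortest-path DAG that crosses such a vertex contains the same sub-DAG on the far side, which is the sharing the paper's approach detects and computes only once.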


IEEE Transactions on Parallel and Distributed Systems | 2016

Predicting Cross-Core Performance Interference on Multicore Processors with Regression Analysis

Jiacheng Zhao; Huimin Cui; Jingling Xue; Xiaobing Feng

Despite their widespread adoption in cloud computing, multicore processors are heavily under-utilized in terms of computing resources. To avoid the potential for negative and unpredictable interference, co-location of a latency-sensitive application with others on the same multicore processor is disallowed, leaving many cores idle and causing low machine utilization. To enable co-location while providing QoS guarantees, it is challenging but important to predict performance interference between co-located applications. We observed that the performance degradation of an application can be represented as a piecewise predictor function of the aggregate pressures on shared resources from all cores. Based on this observation, we propose to adopt regression analysis to build a predictor function for an application. Furthermore, the prediction model thus obtained for an application is able to characterize its contentiousness and sensitivity. Validation using a large number of single-threaded and multi-threaded benchmarks and nine real-world datacenter applications on two different platforms shows that our approach is also precise, with an average error not exceeding 0.4 percent.


International Symposium on Microarchitecture | 2014

Specializing Compiler Optimizations through Programmable Composition for Dense Matrix Computations

Qing Yi; Qian Wang; Huimin Cui

General-purpose compilers aim to extract the best average performance for all possible user applications. Due to the lack of specialization for different types of computations, compiler-attained performance often lags behind that of manually optimized libraries. In this paper, we demonstrate a new approach, programmable composition, to enable the specialization of compiler optimizations without compromising their generality. Our approach uses a single pass of source-level analysis to recognize a common pattern among dense matrix computations. It then tags the recognized patterns to trigger a sequence of general-purpose compiler optimizations specially composed for them. We show that by allowing different optimizations to adequately communicate with each other, through a set of coordination handles and dynamic tags inserted inside the optimized code, we can specialize the composition of general-purpose compiler optimizations to attain performance comparable to that of expert-written assembly code, thereby allowing selected computations in applications to benefit from expert-level optimization.

Collaboration


Dive into Huimin Cui's collaborations.

Top Co-Authors

Xiaobing Feng (Chinese Academy of Sciences)
Lei Wang (Chinese Academy of Sciences)
Jingling Xue (University of New South Wales)
Chenggang Wu (Chinese Academy of Sciences)
Fang Lv (Chinese Academy of Sciences)
Yang Yang (Chinese Academy of Sciences)
Dongrui Fan (Chinese Academy of Sciences)
Jiacheng Zhao (Chinese Academy of Sciences)
Qing Yi (University of Colorado Colorado Springs)
Gong Ruan (Chinese Academy of Sciences)