Junmin Wu
University of Science and Technology of China
Publications
Featured research published by Junmin Wu.
IEEE Transactions on Computers | 2015
Zhibin Yu; Lieven Eeckhout; Nilanjan Goswami; Tao Li; Lizy Kurian John; Hai Jin; Cheng Zhong Xu; Junmin Wu
Graphics processing units (GPUs), with their massive computational power of up to thousands of concurrent threads and general-purpose GPU (GPGPU) programming models such as CUDA and OpenCL, have opened up new opportunities for speeding up general-purpose parallel applications. Unfortunately, pre-silicon architectural simulation of modern-day GPGPU architectures and workloads is extremely time-consuming. This paper addresses the GPGPU simulation challenge by proposing a framework, called GPGPU-MiniBench, for generating miniature yet representative GPGPU workloads. GPGPU-MiniBench first summarizes the inherent execution behavior of existing GPGPU workloads in a profile. The central component of the profile is the Divergence Flow Statistics Graph (DFSG), which characterizes the dynamic control flow behavior of a GPGPU kernel, including loops and branches. GPGPU-MiniBench then generates a synthetic miniature GPGPU kernel that exhibits execution characteristics similar to the original workload, yet with a much shorter execution time, thereby dramatically speeding up architectural simulation. Our experimental results show that GPGPU-MiniBench can speed up GPGPU architectural simulation by a factor of 49× on average and up to 589×, with an average IPC error of 4.7 percent across a broad set of GPGPU benchmarks from the CUDA SDK, Rodinia, and Parboil benchmark suites. We also demonstrate the usefulness of GPGPU-MiniBench for driving GPU architecture exploration.
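As a rough illustration of the profiling-and-synthesis flow described in the abstract, the Python sketch below accumulates per-branch and per-loop divergence statistics and then samples them to drive a shortened synthetic kernel. All class, method, and parameter names are hypothetical simplifications; this is not the GPGPU-MiniBench implementation.

```python
# Hypothetical sketch of a DFSG-style profile: branch and loop statistics are
# accumulated while replaying a kernel trace, then sampled to emit a much
# shorter synthetic control flow with similar divergence behavior.
import random
from collections import defaultdict

class DivergenceProfile:
    def __init__(self):
        # branch id -> [taken count, not-taken count]
        self.branch_stats = defaultdict(lambda: [0, 0])
        # loop id -> list of observed trip counts
        self.loop_trips = defaultdict(list)

    def record_branch(self, branch_id, taken):
        self.branch_stats[branch_id][0 if taken else 1] += 1

    def record_loop(self, loop_id, trip_count):
        self.loop_trips[loop_id].append(trip_count)

    def synthesize_branch(self, branch_id):
        # Reproduce the original taken probability in the miniature kernel.
        taken, not_taken = self.branch_stats[branch_id]
        p_taken = taken / max(taken + not_taken, 1)
        return random.random() < p_taken

    def synthesize_trip_count(self, loop_id, shrink=0.1):
        # Scale loop iteration counts down to shorten execution time while
        # preserving the relative loop behavior.
        trips = self.loop_trips[loop_id]
        avg = sum(trips) / max(len(trips), 1)
        return max(1, int(avg * shrink))
```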
international symposium on parallel and distributed processing and applications | 2010
Junmin Wu; Xiufeng Sui; Yixuan Tang; Xiaodong Zhu; Jing Wang; Guoliang Chen
With recent advances in processor technology, the LRU-based shared last-level cache (LLC) has been widely employed in modern chip multiprocessors (CMPs). However, past research [1,2,8,9] indicates that the cache performance of the LLC, and thus of the CMP as a whole, may be degraded severely under LRU when inter-thread interference occurs or when the working set exceeds the cache size. Existing approaches to this performance degradation problem offer limited improvement in overall cache performance because they usually focus on a single type of memory access behavior and thus do not fully consider the tradeoffs among different types of memory access behaviors. In this paper, we propose a unified cache management policy called Partitioning-Aware Eviction and Thread-aware Insertion/Promotion (PAE-TIP) that effectively enhances capacity management and adaptive insertion/promotion, and thereby improves overall cache performance. Specifically, PAE-TIP employs an adaptive mechanism to decide where to place incoming lines or to move hit lines, and chooses a victim line based on the target partitioning given by utility-based cache partitioning (UCP) [2]. We show that PAE-TIP can cover a variety of memory access behaviors simultaneously and provides a good tradeoff for overall cache performance improvement while retaining competitively low hardware and design overhead. An evaluation conducted on 4-way CMPs shows that a PAE-TIP-managed LLC improves overall performance by 19.3% on average over the LRU policy. Furthermore, the performance benefit of PAE-TIP is 1.09x compared to PIPP, 1.11x compared to TADIP, and 1.12x compared to UCP.
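The sketch below is a minimal, hypothetical illustration of the two ideas the abstract combines: a partitioning-aware victim choice driven by UCP-style per-thread targets, and thread-aware insertion/promotion positions instead of strict LRU. The data structures and positions are assumptions for illustration, not the paper's hardware design.

```python
# Illustrative PAE-TIP-like cache set: eviction respects a per-thread target
# partition (as UCP would supply), and insertion/promotion positions are
# adaptive rather than strict LRU.

class CacheSet:
    def __init__(self, ways, target_alloc):
        self.ways = ways
        self.stack = []                    # index 0 = MRU, last = LRU
        self.target_alloc = target_alloc   # thread id -> target number of ways

    def occupancy(self, tid):
        return sum(1 for line in self.stack if line["tid"] == tid)

    def choose_victim(self, requester):
        # Partitioning-aware eviction: evict from the thread most over its
        # target allocation, falling back to plain LRU otherwise.
        over = [(self.occupancy(t) - alloc, t)
                for t, alloc in self.target_alloc.items()
                if self.occupancy(t) > alloc]
        if over:
            _, victim_tid = max(over)
            for line in reversed(self.stack):   # LRU-most line of that thread
                if line["tid"] == victim_tid:
                    return line
        return self.stack[-1]

    def insert(self, tid, tag, insert_pos):
        if len(self.stack) >= self.ways:
            self.stack.remove(self.choose_victim(tid))
        # Thread-aware insertion: streaming threads get a position near LRU,
        # cache-friendly threads near MRU (insert_pos chosen by a monitor).
        self.stack.insert(min(insert_pos, len(self.stack)), {"tid": tid, "tag": tag})

    def hit(self, line, promote_by=2):
        # Incremental promotion instead of jumping straight to MRU.
        i = self.stack.index(line)
        self.stack.insert(max(0, i - promote_by), self.stack.pop(i))
```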
computing frontiers | 2013
Chunkun Bo; Rui Hou; Junmin Wu; Tao Jiang; Liuhang Zhang
Driven by the rapid development of cloud computing, virtualized environments are becoming popular in data centers. A large number of applications require frequent communication among multiple virtual machines. Although many virtualization acceleration techniques have been proposed, network performance remains a hot research topic due to the complicated and costly implementation of I/O virtualization mechanisms. Some previous research focuses on improving the efficiency of communication among virtual machines within the same host, but studying how to accelerate cross-node virtual machine communication is also necessary. On the other hand, many highly efficient, tightly coupled interconnects have been proposed for data centers. They have advantages in performance and efficiency, while traditional Ethernet and InfiniBand have good scalability. These two kinds of interconnects can coexist well: a tightly coupled protocol is suitable for connecting a small group of data center nodes, which we call a super-node, while super-nodes are connected by a traditional interconnect. In our opinion, data centers with such a hybrid interconnect architecture are an important trend. Targeting this hybrid interconnect architecture, this paper proposes an efficient mechanism, named TCNet (tight-coupling network), to accelerate cross-node virtual machine communication. To verify the acceleration mechanism, we build a prototype system that uses PCIe (for the intra-super-node interconnect) and Ethernet (for the inter-super-node interconnect) as the hybrid interconnect and KVM as the software environment. We use several benchmarks to evaluate the mechanism. The latency of TCNet is 23% lower than that of Gigabit Ethernet on average, and its bandwidth is 1.14 times that of Gigabit Ethernet on average. In addition, we use Specweb2006 to evaluate its web service capability: TCNet supports 20% more simultaneous clients than Ethernet and responds to requests 19% faster. The results demonstrate that TCNet has great potential to accelerate cross-node virtual machine communication in data centers with hybrid interconnects.
computing frontiers | 2010
Xiufeng Sui; Junmin Wu; Guoliang Chen; Yixuan Tang; Xiaodong Zhu
In this paper, we augment traditional cache partitioning with thread-aware adaptive insertion and promotion policies to manage shared L2 caches. The proposed mechanism can mitigate destructive inter-thread interference while retaining some fraction of the working set in the cache, and therefore results in better performance.
international symposium on performance analysis of systems and software | 2013
Xiaodong Zhu; Junmin Wu; Guoliang Chen; Tao Li
A common practice for reducing synchronization overheads in parallel simulation of a large-scale cluster is to relax synchronization with lengthened synchronous steps. However, as a side effect, simulation accuracy degrades considerably. This paper proposes a novel mechanism that keeps the running speeds of different nodes consistent by periodically synchronizing logical clocks with the wall clock within each lax step. Because speed deviations among nodes are the main source of time causality errors, aligning speeds allows our mechanism to incur only a modest precision loss while achieving performance close to lax synchronization. Experimental results show that it improves performance by 2 to 11 times relative to baseline barrier synchronization while maintaining high accuracy (e.g., 99% in most cases). Compared to a recently proposed adaptive mechanism, it also achieves nearly 30% performance improvement.
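A minimal sketch of the core idea, assuming a hypothetical per-node simulation loop: within each lax step, a node compares its logical-clock progress against elapsed wall-clock time scaled by a common target rate, and stalls if it is running ahead, so node speeds stay aligned. The interfaces and constants are illustrative, not taken from the paper.

```python
# Hedged sketch: align a node's simulated (logical) clock with the wall clock
# at regular intervals inside a lax synchronization step.
import time

def simulate_lax_step(node, step_cycles, cycles_per_wall_second, align_every=10_000):
    start_wall = time.monotonic()
    start_cycle = node.logical_clock          # hypothetical node interface
    while node.logical_clock - start_cycle < step_cycles:
        node.advance_one_cycle()              # node-specific functional/timing model
        progressed = node.logical_clock - start_cycle
        if progressed % align_every == 0:
            # If this node is simulating faster than the common target rate,
            # stall it until the wall clock catches up.
            target_wall = progressed / cycles_per_wall_second
            ahead = target_wall - (time.monotonic() - start_wall)
            if ahead > 0:
                time.sleep(ahead)
```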
international conference on computer design | 2010
Xiaodong Zhu; Junmin Wu; Xiufeng Sui; Wei Yin; Qingbo Wang; Zhe Gong
With the arrival of the multi-core era, chip multiprocessor (CMP) architectures present a challenge for efficient simulation, combined with the requirement of a detailed simulator running realistic workloads. Parallelization, which exploits the inherent parallelism in CMP simulation, is a common method to reduce simulation time. We design and implement PCAsim, a parallel, cycle-accurate, user-level CMP simulator running on shared-memory platforms. The simulator is parallelized with POSIX threads according to the target system architecture. Each core thread and the manager thread are synchronized with the slack mechanism [11]. However, we find that the slack mechanism cannot protect the simulator against timing violations among events generated by network activity and the cache coherence protocol. To solve this problem, we propose an effective synchronization method called the pending barrier. This method augments traditional conservative parallel synchronization and improves simulation accuracy with negligible performance degradation. Beyond synchronization, we also encountered many other troublesome issues in implementing PCAsim; this paper describes some common ones and illustrates how we address them. The evaluation shows that PCAsim achieves reasonable speed-up and scalability.
IEEE Transactions on Parallel and Distributed Systems | 2017
Xiaodong Zhu; Junmin Wu; Tao Li
Due to synchronization overhead, it is challenging to apply the parallel simulation technique of multi-core processors at larger scales. Although the use of lax synchronization schemes can reduce overhead and balance the load between synchronous points, it introduces timing errors and deteriorates simulation accuracy. Through observing the propagation paths of errors, we find that these paths always concentrate on some pivotal events. Based on this observation, we design a delay-calibration mechanism to alleviate errors. We decouple the timing and functional processes of the pivotal events, leveraging a delay-prediction technique to connect the two categories of processes. Errors are traced throughout the timing processes of the pivotal events and are deducted from the predicted delays before the delays are consumed by the functional processes. Therefore, by cleaning the errors at successive pivotal events, the mechanism efficiently decreases the simulated-time deviations. Since the prediction and error-deduction processes do not impose any constraint on synchronization, our approach largely maintains the scalability of lax synchronization schemes. Furthermore, our proposal is orthogonal to other parallel simulation techniques and can be used in conjunction with them. Experimental results show that error compensation improves the accuracy of lax synchronized simulations by 68 percent and achieves 97.8 percent accuracy when combined with an enhanced lax synchronization.
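The following sketch illustrates, under simplified assumptions, how such a delay-calibration step might look: a predicted delay for a pivotal event is corrected by the error traced so far before the functional process consumes it, and the error ledger is updated once the exact delay is known. The predictor and bookkeeping are placeholders, not the paper's mechanism.

```python
# Illustrative delay-calibration bookkeeping for pivotal events (names and
# predictor are assumptions; the paper's concrete mechanism is not shown here).

class DelayCalibrator:
    def __init__(self):
        self.accumulated_error = 0.0   # simulated-time error traced so far
        self.history = []              # exact delays of preceding pivotal events

    def predict_delay(self):
        # Placeholder predictor: average of recent exact delays.
        recent = self.history[-8:]
        raw = sum(recent) / max(len(recent), 1)
        # Deduct the traced error before the functional process consumes the delay.
        corrected = max(0.0, raw - self.accumulated_error)
        self.accumulated_error -= (raw - corrected)
        return corrected

    def observe_exact_delay(self, predicted, exact):
        # Once the timing process finishes, record the exact delay and trace
        # the new error introduced by this prediction.
        self.history.append(exact)
        self.accumulated_error += predicted - exact
```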
IEEE Transactions on Computers | 2016
Junmin Wu; Xiaodong Zhu; Tao Li; Xiufeng Sui
Parallelization is an efficient approach to accelerating multi-core, multi-processor, and cluster architecture simulators. Nevertheless, frequent synchronization can significantly hinder the performance of a parallel simulator. A common practice for alleviating synchronization cost is to relax synchronization using lengthened synchronous steps. However, as a side effect, simulation accuracy deteriorates considerably. Through analyzing the various factors contributing to causality errors in lax synchronization, we observe that a coherent speed across all nodes is critical to achieving high accuracy. To this end, we propose wall-clock based synchronization (WBSP), a novel mechanism that uses wall-clock time to maintain a coherent running speed across the different nodes by periodically synchronizing simulated clocks with the wall clock within each lax step. Our proposed method results in only a modest precision loss while achieving performance close to lax synchronization. We implement WBSP in a many-core parallel simulator and a cluster parallel simulator. Experimental results show that at a scale of 32 host threads, it improves the performance of the many-core simulator by 4.3× on average with less than a 5.5 percent accuracy loss compared to the conservative mechanism. On the cluster simulator with 64 nodes, our proposed scheme achieves an 8.3× speedup over the conservative mechanism while yielding only a 1.7 percent accuracy loss. Meanwhile, WBSP outperforms the recently proposed adaptive mechanism on simulations that exhibit heavy traffic.
international conference on parallel processing | 2015
Xiaodong Zhu; Junmin Wu; Tao Li
Due to synchronization overhead, it is challenging to apply the parallel simulation techniques of multi-core processors at a larger scale. Although the use of a lax synchronization scheme reduces synchronous overhead and balances the load between synchronous points, it introduces timing errors. To improve the accuracy of lax synchronized simulations, we propose an error compensation technique, which leverages prediction methods to compensate for simulated-time deviations caused by timing errors. The rationale of our approach is that, in simulated multi-core processor systems, errors typically propagate via the delays of some pivotal events that connect subsystem models across different hierarchies. By predicting delays based on the simulation results of the preceding pivotal events, our technique can eliminate errors from the predicted delays before they propagate to the models at higher hierarchies, thereby effectively improving simulation accuracy. Since the predictions do not impose any constraint on synchronizations, our approach largely maintains the scalability of lax synchronization schemes. Furthermore, our proposed mechanism is orthogonal to other parallel simulation techniques and can be used in conjunction with them. Experimental results show that error compensation improves the accuracy of lax synchronized simulations by 60.2% and achieves 98.2% accuracy when combined with an enhanced lax synchronization.
international conference on information science and technology | 2014
Xiaoke Li; Junmin Wu; Zhibin Yu; Cheng Zhong Xu; Kai Chen
Benefiting from the integration of massive parallel processors, Graphics Processing Units (GPUs) have become prevalent computing devices for general-purpose parallel applications - so-called GPGPU computing. While providing powerful computation capability, GPGPUs are power hungry: almost half of the total power of a GPGPU-based system is consumed by the GPGPU, which has seriously hindered its wider application. It is therefore essential to build an accurate and robust model to analyze the performance and power consumption of GPGPUs. In this paper, we propose an adaptive performance and power consumption model using the random forest algorithm. The model is based on GPU architecture performance counters, including multi-processor, memory access pattern, and bandwidth metrics, and adapts to different NVIDIA GPU architectures. The results demonstrate that our model achieves average prediction errors of 2.1% and 3.2% for performance and power consumption, respectively. Furthermore, by identifying the most important impact factors and quantifying their contributions, our proposed approach can help GPGPU programmers and architects gain quick insights into the performance and power consumption of GPGPU systems.
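As a hedged sketch of this kind of counter-based model, the snippet below fits a scikit-learn random forest on placeholder performance-counter samples and prints the feature importances that would point at the dominant factors. The counter names and the randomly generated data are stand-ins; real inputs would come from a GPU profiler and measured runtime or power.

```python
# Sketch of a counter-based random forest performance model (placeholder data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

features = ["active_warps", "dram_reads", "dram_writes",
            "shared_loads", "branch_divergence", "achieved_occupancy"]

rng = np.random.default_rng(0)
X = rng.random((200, len(features)))                               # placeholder counter samples
y_perf = X @ rng.random(len(features)) + 0.1 * rng.random(200)     # placeholder runtimes

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y_perf)

# Feature importances indicate which counters contribute most to the prediction,
# which is how such a model can point programmers at likely bottlenecks.
for name, importance in sorted(zip(features, model.feature_importances_),
                               key=lambda p: -p[1]):
    print(f"{name:20s} {importance:.3f}")
```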