Xiang Long | Researchain

Publication

Featured research published by Xiang Long.

Advanced Parallel Programming Technologies | 2009

GCSim: A GPU-Based Trace-Driven Simulator for Multi-level Cache

Han Wan; Xiaopeng Gao; Xiang Long; Zhiqiang Wang

We describe the design of a parallel trace-driven cache simulator for evaluating different cache structures. As research goes deeper, traditional simulation methods, which can only execute simulation operations in sequence, are no longer practical because of their long simulation cycles. An obvious way to achieve fast parallel simulation is to simulate the independent sets of a cache concurrently on different compute resources. We considered using a generic GPU to accelerate cache simulation, exploiting set partitioning as the main source of parallelism. We show, however, that this technique is inefficient when only one cache configuration is simulated, because the activity of different sets is highly correlated. We therefore developed trace-sort and single-pass multi-configuration simulation techniques that take advantage of the full programmability offered by the Compute Unified Device Architecture (CUDA) on the GPU. Our experimental results demonstrate that the cache simulator on a GPU-CPU platform achieves a 2.44x performance improvement over the traditional sequential algorithm.
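The set-partitioning idea in this abstract can be illustrated with a minimal Python sketch. It is not the GCSim implementation: the cache geometry (`LINE_SIZE`, `NUM_SETS`, `ASSOC`), the LRU replacement policy, and all function names are assumptions chosen for illustration. The sketch splits a trace by set index and simulates each set on its own, which gives the same miss count as processing the trace in order.

```python
from collections import OrderedDict, defaultdict

# Hypothetical cache geometry, chosen only for this illustration.
LINE_SIZE = 64  # bytes per cache line
NUM_SETS = 4    # number of sets
ASSOC = 2       # ways per set


def split_address(addr):
    """Return (set index, tag) for a byte address."""
    line = addr // LINE_SIZE
    return line % NUM_SETS, line // NUM_SETS


def lru_access(lru, tag):
    """Touch `tag` in one set's LRU state; return True on a miss."""
    if tag in lru:
        lru.move_to_end(tag)      # mark as most recently used
        return False
    if len(lru) >= ASSOC:
        lru.popitem(last=False)   # evict the least recently used tag
    lru[tag] = None
    return True


def simulate_sequential(trace):
    """Reference run: process the trace in order, all sets together."""
    sets = defaultdict(OrderedDict)
    misses = 0
    for addr in trace:
        index, tag = split_address(addr)
        misses += lru_access(sets[index], tag)
    return misses


def simulate_partitioned(trace):
    """Split the trace by set, then simulate every set independently.

    Each per-set loop touches only its own state, so the loops could run
    on separate compute resources.
    """
    per_set = defaultdict(list)
    for addr in trace:
        index, tag = split_address(addr)
        per_set[index].append(tag)
    result = {}
    for index, tags in per_set.items():
        lru = OrderedDict()
        result[index] = sum(lru_access(lru, tag) for tag in tags)
    return result


TRACE = [0, 64, 0, 256, 512, 0, 64, 320]
```

On this toy trace, `simulate_partitioned(TRACE)` reports misses per set, and the totals match `simulate_sequential(TRACE)`. The per-set trace lengths also differ (set 0 receives five of the eight accesses, sets 2 and 3 none), a small example of how unevenly work can be spread when only one configuration is partitioned by set.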

PLOS ONE | 2013

Adaptive controller for dynamic power and performance management in the virtualized computing systems.

Chengjian Wen; Xiang Long; Yifen Mu

The power and performance management problem in large-scale computing systems such as data centers has attracted a lot of interest from both enterprises and academic researchers, as power saving has become more and more important in many fields. Because of the multiple objectives, multiple influential factors, and hierarchical structure of the system, the problem is complex and hard. In this paper, the problem is investigated in a virtualized computing system. Specifically, it is formulated as a power optimization problem with constraints on performance. An adaptive controller based on a least-square self-tuning regulator (LS-STR) is designed to track performance as the first step; the resource computed by the controller is then allocated so as to minimize power consumption as the second step. Simulations are designed to test the effectiveness of this method and to compare it with other controllers. The simulation results show that the adaptive controller is generally effective: it is applicable to different performance metrics, to different workloads, and to single and multiple workloads; it tracks the performance requirement effectively and reduces power consumption significantly.
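As a rough illustration of the tracking step only, the sketch below pairs a textbook recursive least-squares estimator with a certainty-equivalence control law. It is not the paper's controller: the first-order plant `y[t+1] = a*y[t] + b*u[t]`, its coefficients, the input bounds, and all names are assumptions made for this example, and the power-minimizing allocation step is not modeled.

```python
def rls_update(theta, P, phi, y, forgetting=1.0):
    """One recursive least-squares step for the model y ~ phi . theta.

    theta: [a_hat, b_hat], P: 2x2 covariance (symmetric), phi: regressor.
    """
    p_phi = [P[0][0] * phi[0] + P[0][1] * phi[1],
             P[1][0] * phi[0] + P[1][1] * phi[1]]
    denom = forgetting + phi[0] * p_phi[0] + phi[1] * p_phi[1]
    gain = [p_phi[0] / denom, p_phi[1] / denom]
    error = y - (phi[0] * theta[0] + phi[1] * theta[1])
    theta = [theta[0] + gain[0] * error, theta[1] + gain[1] * error]
    # P <- (P - gain * phi^T P) / forgetting; phi^T P equals p_phi
    # because P is symmetric.
    P = [[(P[i][j] - gain[i] * p_phi[j]) / forgetting for j in range(2)]
         for i in range(2)]
    return theta, P


def choose_input(theta, y, reference, u_min=0.0, u_max=10.0):
    """Pick u so the predicted next output a_hat*y + b_hat*u hits the reference."""
    a_hat, b_hat = theta
    if abs(b_hat) < 1e-9:
        return u_min  # model says the input has no effect; fall back
    u = (reference - a_hat * y) / b_hat
    return min(max(u, u_min), u_max)  # respect the (assumed) resource bounds


def simulate(steps=30, reference=4.0):
    """Closed loop against a hypothetical plant unknown to the controller."""
    a_true, b_true = 0.5, 2.0           # assumed plant coefficients
    theta = [0.0, 1.0]                  # initial parameter guess
    P = [[100.0, 0.0], [0.0, 100.0]]    # large initial uncertainty
    y = 0.0
    for _ in range(steps):
        u = choose_input(theta, y, reference)
        y_next = a_true * y + b_true * u
        theta, P = rls_update(theta, P, [y, u], y_next)
        y = y_next
    return y, theta
```

Calling `simulate()` returns the final output and the parameter estimates; in this noise-free toy setting the output settles close to the reference while the estimates approach the assumed plant coefficients.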

PLOS ONE | 2013

Combining Instruction Prefetching with Partial Cache Locking to Improve WCET in Real-Time Systems

Fan Ni; Xiang Long; Han Wan; Xiaopeng Gao

Caches play an important role in embedded systems by bridging the performance gap between the fast processor and slow memory, and prefetching mechanisms have been proposed to further improve cache performance. In real-time systems, however, caches complicate Worst-Case Execution Time (WCET) analysis because of their unpredictable behavior. Modern embedded processors often provide a locking mechanism to improve the timing predictability of the instruction cache. However, locking the whole cache may degrade cache performance and increase the WCET of the real-time application. In this paper, we propose an instruction-prefetching combined partial cache locking mechanism, which combines an instruction prefetching mechanism (termed BBIP) with partial cache locking to improve the WCET estimates of real-time applications. BBIP is an instruction prefetching mechanism we proposed earlier to improve the worst-case cache performance and, in turn, the worst-case execution time. The estimations on typical real-time applications show that the partial cache locking mechanism achieves a remarkable WCET improvement over static analysis and full cache locking.

High Performance Computing and Communications | 2016

More Effective Synchronization Scheme in ML Using Stale Parameters

Yabin Li; Han Wan; Bo Jiang; Xiang Long

In machine learning (ML), the model in use is increasingly important, and the model's parameters, the key point of ML, are adjusted by iteratively processing a training dataset until convergence. Although data-parallel ML systems often employ a perfect error tolerance when synchronizing the model parameters to maximize parallelism, the synchronization of model parameters may be delayed, a problem that generally gets worse at large scale. This paper presents a Bounded Asynchronous Parallel (BAP) model of computation that allows computations to use stale model parameters in order to reduce synchronization overheads. At the same time, our BAP model provides theoretical convergence guarantees for large-scale data-parallel ML applications. The model permits distributed workers to use the stale parameters stored in their local cache instead of waiting until the Parameter Server (PS) produces a new version. This substantially reduces the time workers spend waiting. Furthermore, the BAP model guarantees the convergence of the ML algorithm by bounding the maximum distance of the stale parameters. Experiments conducted on 4 cluster nodes with up to 32 GPUs showed that our model significantly improved the proportion of computing time relative to waiting time and led to a 1.2–2× speedup. We also discuss how to choose the staleness threshold when considering the tradeoff between efficiency and speed.
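The bounded-staleness rule described here can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's BAP implementation: the `StaleCache` class, the integer clock used to measure staleness, and the `fetch` callback standing in for the Parameter Server are all hypothetical.

```python
class StaleCache:
    """Worker-side parameter cache with a staleness bound (hypothetical API)."""

    def __init__(self, staleness):
        self.staleness = staleness  # maximum allowed age, in iterations
        self.params = None          # cached copy of the model parameters
        self.version = None         # clock value at which the copy was fetched

    def can_use_cache(self, worker_clock):
        """True if the cached copy is within the staleness bound."""
        return (self.params is not None
                and worker_clock - self.version <= self.staleness)

    def get(self, worker_clock, fetch):
        """Return parameters, contacting the server only when the copy is too old.

        `fetch` is a callable returning (params, version); in a real system
        this is where a worker would wait on the parameter server.
        """
        if not self.can_use_cache(worker_clock):
            self.params, self.version = fetch()
        return self.params


def count_fetches(staleness, iterations):
    """Count how often a single worker has to contact the server."""
    cache = StaleCache(staleness)
    clock = {"now": 0}
    fetches = []

    def fetch():
        fetches.append(clock["now"])
        return [0.0], clock["now"]  # dummy parameters, tagged with the clock

    for step in range(iterations):
        clock["now"] = step
        cache.get(step, fetch)
    return len(fetches)
```

With a staleness bound of 0 the worker fetches on every iteration, which corresponds to fully synchronous behavior in this toy model; a larger bound lets it reuse the cached copy for several iterations, which is the waiting time the abstract aims to reduce.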

Asian Simulation Conference | 2016

Parallel Computing Education Through Simulation

Han Wan; Xiaoyan Luo; Xiaopeng Gao; Xiang Long

With the advent of parallel computing, CS departments must face the question of how to integrate parallel computing knowledge into their curricula. In this paper, we introduce our practice in parallel computing education, which uses simulation methodology. In our course, students learn the basics of GPU architectures and parallel computing, along with optimization techniques for tuning performance. Furthermore, we describe work that combines simulation-based architecture research and parallel computing: a GPU-based cache simulator. In this part, common architecture research methodology and tools, such as simulation methodology and the Pin tool, are introduced. This case study has proven to be an effective supplement to our teaching philosophy: balanced design based on quantitative characterization.

AsiaSim/SCS AutumnSim (1) | 2016

Simulation Methodology Used in Computer Structure Course

Han Wan; Xiaopeng Gao; Xiang Long

We describe our reformed Computer Structure course at Beihang University, which won the national teaching achievement award. In this course, we use simulation methodology to help students understand the MIPS system. We show how to use MARS to help students grasp the MIPS instruction set and how to use Logisim for single-cycle processor design from scratch. We then use ISE to design the pipelined processor, and an FPGA board to evaluate the system design with interrupts. Comparisons in terms of excellent rates, pass rates, and learning assessments show that the blended learning experience with simulation methodology was more rewarding for students.

Journal of Systems Science & Complexity | 2015

Dynamic power saving via least-square self-tuning regulator in the virtualized computing systems

Chengjian Wen; Xiang Long; Yifen Mu

In recent years, the power saving problem has become more and more important in many fields and has attracted a lot of research interest. In this paper, the authors consider the power saving problem in a virtualized computing system. Since there are multiple objectives in the system, as well as many factors influencing them, the problem is complex and hard. The authors formulate the problem as an optimization problem of power consumption with a prior requirement on performance, which is taken to be the response time in the paper. To solve the problem, the authors design an adaptive controller based on a least-square self-tuning regulator that dynamically regulates the computing resource so as to track a given reasonable reference performance, and then minimize the power consumption using the tracking result supplied by the controller at each time step. The simulation is based on data collected from real machines, and the time delay of turning machines on and off is included in the process. The results show that this method, based on adaptive control theory, can greatly reduce power consumption while satisfying the performance requirement, and is thus suitable and effective for solving the problem.

Parallel and Distributed Computing: Applications and Technologies | 2012

Using Basic Block Based Instruction Prefetching to Optimize WCET Analysis for Real-Time Applications

Fan Ni; Xiang Long; Han Wan; Xiaopeng Gao

The cache is an important component of modern computer systems, bridging the performance gap between the fast CPU and the slow memory system. A variety of cache optimization technologies and mechanisms, such as instruction cache prefetching, have been proposed to improve cache performance. Most existing instruction prefetching mechanisms are designed to improve average-case cache performance. However, real-time systems care more about worst-case performance, and the worst-case execution time (WCET) analysis of real-time applications is critical for the schedulability analysis of real-time systems. Because of its unpredictable behaviour, the cache severely complicates the WCET analysis of real-time applications. In this paper, we propose a basic block based instruction prefetching (BBIP) mechanism to improve both the average-case cache performance and the tightness of the WCET analysis of real-time applications. Measurements on typical real-time benchmarks show that BBIP not only eliminates most instruction access misses but also results in lower WCET estimates. To assess the effectiveness of BBIP, we measured the WCET of the benchmarks for three processor configurations with and without BBIP: 1) a processor with an in-order pipeline and perfect branch prediction, 2) a processor with an out-of-order pipeline and perfect branch prediction, and 3) a processor with an out-of-order pipeline and 2-level branch prediction. The results show that BBIP provides notable improvements in the tightness of WCET estimation, with the WCET values being 30.4% to 97.7% of the original ones. Our simulation results also reveal that 70% to 80% of instruction access misses are eliminated with BBIP.

IEEE Youth Conference on Information, Computing and Telecommunications | 2010

GPU-based time parallel cache simulator

Junjie Ma; Han Wan; Xiaopeng Gao; Xiang Long

We present the design of a time-parallel trace-driven cache simulator for the purpose of evaluating different cache architectures. Because of long simulation cycles, traditional sequential simulation methods are no longer practical. An obvious way to achieve fast parallel simulation is time parallelism: the whole trace is split into small slices, which are assigned to parallel processors for concurrent simulation. In this paper, we introduce a novel time-parallel, single-pass, multi-configuration simulation method. It exploits time partitioning as the main source of parallelism and takes full advantage of the computational capability offered by the Compute Unified Device Architecture (CUDA) on the GPU. Our experimental results demonstrate that the cache simulator on a GPU platform achieves a 1.91× performance improvement over the traditional serial algorithm.
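The trace-slicing step can be sketched as follows. This is only an illustration of time partitioning, not the paper's method: it assumes a fully associative LRU cache, starts every slice from an empty cache, and applies no correction afterwards, so its miss count is an upper bound on the sequential one. Function names and the toy trace are hypothetical.

```python
from collections import OrderedDict


def lru_misses(lines, capacity):
    """Count misses of a fully associative LRU cache on a list of line ids."""
    cache = OrderedDict()
    misses = 0
    for line in lines:
        if line in cache:
            cache.move_to_end(line)        # refresh recency on a hit
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = None
    return misses


def time_sliced_misses(lines, capacity, slice_len):
    """Split the trace into time slices and simulate each from a cold cache.

    Every slice is independent of the others, so the per-slice calls could
    be handed to parallel workers. Cold starts can add misses at slice
    boundaries; a real time-parallel simulator has to account for that.
    """
    slices = [lines[i:i + slice_len] for i in range(0, len(lines), slice_len)]
    return [lru_misses(chunk, capacity) for chunk in slices]


TRACE = [1, 2, 1, 3, 1, 2, 3, 3]
```

On `TRACE` with capacity 2 and slices of length 4, the sliced run counts one more miss than the sequential run, because the first access of the second slice would have hit in a warm cache. For LRU the cold-start count never falls below the sequential count, since any reuse that hits within a cold slice also hits in the warm cache.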

IEEE Youth Conference on Information, Computing and Telecommunications | 2010