Da Qi Ren
University of Tokyo
Publication
Featured research published by Da Qi Ren.
parallel and distributed computing: applications and technologies | 2009
Reiji Suda; Da Qi Ren
Power dissipation is one of the most pressing limiting factors in the development of High Performance Computing (HPC). Toward power-efficient HPC on CPU-GPU hybrid platforms, we are investigating software methodologies that optimize power utilization through algorithm design and programming technique. In this paper we discuss power measurements of GPUs, propose a method for automatically extracting the power data of CUDA kernels from a long measurement sequence, and carry out an accurate and effective power analysis of CUDA kernels. Using the proposed method, we measured a sample kernel that performs single-precision floating-point additions on a GeForce 8800 GTS. Our results suggest that the power consumed by a non-working thread in an underoccupied half-warp is 71% of the power consumed by a working thread.
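The abstract does not say how kernel power data is pulled out of the long measurement sequence. As a rough illustration only, the sketch below assumes a simple threshold rule: any sufficiently long run of samples above an idle baseline is treated as one kernel execution. The function name, the `margin` and `min_len` parameters, and the sample trace are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def extract_kernel_segments(
    samples: List[float],
    idle_power: float,
    margin: float = 5.0,
    min_len: int = 3,
) -> List[Tuple[int, int, float]]:
    """Return (start, end, mean_power) for each run above idle_power + margin.

    `end` is exclusive. Runs shorter than `min_len` samples are treated as
    noise and dropped. This is an assumed heuristic, not the paper's method.
    """
    threshold = idle_power + margin
    segments = []
    start = None
    for i, power in enumerate(samples):
        if power > threshold and start is None:
            start = i  # a candidate kernel run begins here
        elif power <= threshold and start is not None:
            if i - start >= min_len:
                run = samples[start:i]
                segments.append((start, i, sum(run) / len(run)))
            start = None
    # Close a run that is still open when the trace ends.
    if start is not None and len(samples) - start >= min_len:
        run = samples[start:]
        segments.append((start, len(samples), sum(run) / len(run)))
    return segments


# Made-up trace in watts: two kernel runs and one single-sample spike.
trace = [50, 51, 50, 90, 92, 91, 90, 50, 49, 88, 50, 50, 95, 96, 94, 50]
segments = extract_kernel_segments(trace, idle_power=50.0)
```

With this trace the single-sample spike at index 9 is discarded, and two segments remain. A real measurement setup would also need to align the sampling clock with kernel launch times, which this sketch ignores.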
international conference on green computing | 2010
Da Qi Ren; Reiji Suda
The CPU-GPU Processing Element (PE) has become a very popular architecture for building modern multiprocessing systems because of its high performance in massively parallel processing and vector computation. Power dissipation is one of the important factors influencing the design of High Performance Computing (HPC) systems, since a large-scale scientific computation may use thousands of processors and hundreds of hours of continuous execution, resulting in enormous energy consumption. Raising the utilization of an individual PE so that it reaches its best computation capability and power efficiency is valuable for reducing the overall power cost of large multiprocessing systems. The power performance of a CUDA PE depends on the electrical features of its hardware components and their interconnections, and also on the high-level applications and parallel algorithms that run on it. Based on measurements and experimental evaluations, in this work we provide a load sharing method that adjusts the workload assignment between the CPU and GPU components inside a CUDA PE in order to optimize the overall power efficiency. The improvement in computation time and power consumption has been validated by examining program executions when the method is applied on real systems.
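The load sharing idea can be illustrated with a toy energy model. The sketch below is not the authors' method. It assumes that each device has a fixed throughput, a fixed active power, and a fixed idle power, and that whichever device finishes first idles until the other is done. It then scans candidate GPU work fractions and keeps the one with the lowest total energy. All names and numbers are hypothetical.

```python
def best_gpu_share(work, cpu_rate, gpu_rate,
                   cpu_active, cpu_idle, gpu_active, gpu_idle, steps=100):
    """Scan GPU work fractions and return (fraction, elapsed, energy).

    work: total work units; *_rate: work units per second;
    *_active / *_idle: power in watts. Assumed model, for illustration only.
    """
    best = None
    for k in range(steps + 1):
        f = k / steps                      # fraction of work given to the GPU
        t_gpu = f * work / gpu_rate        # GPU busy time
        t_cpu = (1 - f) * work / cpu_rate  # CPU busy time
        elapsed = max(t_gpu, t_cpu)        # both run concurrently
        energy = (gpu_active * t_gpu + gpu_idle * (elapsed - t_gpu)
                  + cpu_active * t_cpu + cpu_idle * (elapsed - t_cpu))
        if best is None or energy < best[2]:
            best = (f, elapsed, energy)
    return best


# Hypothetical device parameters.
fraction, elapsed, energy = best_gpu_share(
    work=1000, cpu_rate=10, gpu_rate=40,
    cpu_active=100, cpu_idle=60, gpu_active=150, gpu_idle=60,
)
```

With these made-up parameters the lowest-energy split is also the balanced one, with the GPU taking about 80% of the work. With lower idle power the optimum can shift toward running on one device only, which is why the split has to be derived from measured power values.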
computational science and engineering | 2009
Da Qi Ren; Reiji Suda
Power efficiency is one of the most important issues in high performance computing (HPC), and it involves both software and hardware. The power dissipation of a program depends on the algorithm design and on the power features of the computer components on which the program runs. In this work, we measure and model the power consumption of large matrix multiplication on a multi-core CPU and GPU platform. By combining the major physical power constraints of the hardware components with an analysis of program execution behavior, we aim to reduce overall power consumption by using a multithreaded CPU to control two GPU devices computing synchronously in parallel. By implementing this method on a real system, we show that it saves 22% of the energy and speeds up kernel execution by 71%, compared with solving the same large matrix multiplication using a single CPU and GPU combination.
IEEE Transactions on Magnetics | 2006
Da Qi Ren; Dennis D. Giannacopoulos
Communication strategies in parallel finite element methods can greatly affect system performance. The communication cost for a proposed parallel 3-D mesh refinement method with tetrahedra is analyzed. A Petri Nets-based model is developed for a target mesh refinement algorithm and parallel computing system architecture, which simulates the inter-processor communication. Subsequently, estimates for performance measures are derived from discrete event simulations. The potential benefits of this approach for developing high performance parallel mesh refinement algorithms are demonstrated by optimizing the system communication costs for varying problem size and numbers of processors.
IEEE Transactions on Magnetics | 2012
Da Qi Ren; Eric Bracken; Sergey Polstyanko; Nancy Lambert; Reiji Suda; Dennis D. Giannacopoulos
Software power performance tuning handles the critical design constraints of software running on hardware platforms composed of large numbers of power-hungry components. The power dissipation of a Single Program/Instruction Multiple Data (SPMD/SIMD) computation such as finite element method (FEM) mesh refinement is highly dependent on the underlying algorithm and the power-consuming features of hardware Processing Elements (PE). This contribution presents a practical methodology for modeling and analyzing the power performance of parallel 3-D FEM mesh refinement on CUDA/MPI architecture based on detailed software prototypes and power parameters in order to predict the power functionality and runtime behavior of the algorithm, optimize the program design and thus achieve the best power efficiency. In detail, we have proposed approaches for GPU parallelization, dynamic CPU frequency scaling and dynamic load scheduling among PEs. The performance improvement of our designs has been demonstrated and the results have been validated on a real multi-core and GPU cluster.
Computer Science - Research and Development | 2012
Da Qi Ren; Reiji Suda
Estimating and analyzing the power-consuming features of a program on a hardware platform is important for energy-aware High Performance Computing (HPC) optimization: it helps to handle critical design constraints at the software level and to choose a preferable algorithm in order to reach the best energy performance. Optimizing the power efficiency of a CUDA program on a GPU and multicore processing element is a combinatorial optimization problem because of the complexity of the power factors and criteria. A four-tuple global optimization model has been created to describe the procedure for finding the optimal energy solution. In addition, an experimental method for examining SIMD computing to capture power parameters is illustrated, and five individual energy optimization methods are provided and implemented. The optimization results have been validated by comparative analysis on real systems.
IEEE Transactions on Magnetics | 2006
Dennis D. Giannacopoulos; Da Qi Ren
We develop a simulation-based approach for the computational analysis and design of dynamic load balancing algorithms in parallel three-dimensional unstructured mesh refinement with tetrahedra. A Petri Nets model is implemented based on a random polling algorithm and the target multiprocessor architecture, which simulates the behavior of the parallel mesh refinement. Subsequently, estimates for performance measures are derived from discrete event simulations. The benefits of this new approach for developing high-performance parallel mesh refinement algorithms are demonstrated with results for an example geometric mesh refinement model.
parallel processing and applied mathematics | 2009
Da Qi Ren; Reiji Suda
The power efficiency of large-scale computing on multiprocessing systems is an important issue that is related to both the hardware architectures and the software methodologies. Aiming to design power-efficient high performance programs, we have measured the power consumption of large matrix multiplication on a multi-core and GPU platform. Based on the measured power characteristics of each computing component, we derive energy estimates by combining the physical power constraints of the hardware devices with an analysis of program execution behavior. We optimize the matrix multiplication algorithm to improve its power performance, and the efficiency improvement has been validated by measuring the program execution.
computational sciences and optimization | 2009
Da Qi Ren; Reiji Suda
We model and estimate the power consumption of large-scale matrix multiplication by including the most basic power parameters in the parallel algorithm analysis. The matrix multiplication program has been designed based on multi-core frameworks. A Bridging Model (BM) is employed to incorporate the numerical parameters of ultimate physical constraints from power-relevant components and coarse-grained features of the multi-core platform. Consequently the power consumption is predicted by calculating the timing and power factors in the performance analysis equation. The power model and estimation results are validated by measuring the program running on real systems.
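The abstract gives no details of the Bridging Model, so the following is only a generic sketch of predicting energy from timing and power factors. It assumes a dense n x n product costs about 2n^3 floating-point operations, that the platform sustains a fixed rate, and that each component draws constant power for the whole run. The function name, the component names, and every number are hypothetical.

```python
def estimate_matmul_energy(n, flops_per_second, component_power_watts):
    """Estimate run time (s) and per-component energy (J) for an n x n matmul.

    Assumed model: time = 2 * n**3 / sustained rate, energy = power * time.
    """
    flops = 2 * n ** 3                   # multiply and add per inner-loop step
    seconds = flops / flops_per_second   # timing factor
    energy = {name: watts * seconds      # power factor times timing factor
              for name, watts in component_power_watts.items()}
    return seconds, energy


# Hypothetical platform: 10 GFLOP/s sustained, two power-relevant components.
seconds, energy = estimate_matmul_energy(
    n=1000,
    flops_per_second=1e10,
    component_power_watts={"cpu": 80.0, "memory": 10.0},
)
total_joules = sum(energy.values())
```

Under these assumptions the run takes about 0.2 s and uses about 18 J in total. A validated model would replace the constant rate and constant power draw with measured values.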
international conference on cluster computing | 2008
Da Qi Ren; Dennis D. Giannacopoulos; Reiji Suda
A new Dynamic Load Balancing (DLB) method for automatic performance tuning in parallel, adaptive, 3-D mesh refinement is developed, based on a study of the characteristics of the Finite Element Method (FEM) for electromagnetics with tetrahedra. Building on existing DLB algorithms, the new design optimizes the task pool location of each processing element (PE) and the initial data assignments in a multiprocessor parallel architecture. We evaluate the method by applying the algorithm in implementations of parallel 3-D Hierarchical Tetrahedra and Octahedra (HTO) mesh refinement. The benefits of the new method for achieving high performance parallel mesh refinement are demonstrated by comparing its benchmark results with those of two other existing DLB algorithms running the same HTO example geometric mesh refinement model on the same parallel architecture.