Slo-Li Chu
Chung Yuan Christian University
Publication
Featured research published by Slo-Li Chu.
high performance computing and communications | 2010
Slo-Li Chu; Chih-Chieh Hsiao
Driven by the demands of 3D games and applications, graphics processing units (GPUs) and general-purpose graphics processing units (GPGPUs) have become required components of modern computer systems. While these devices enable high parallelism through a large number of processing elements, their utilization in general scientific applications remains low because of their difficult programming paradigms. An open standard, OpenCL, has therefore been proposed to provide universal APIs and programming paradigms for various GPUs and accelerators. This study adopts several benchmarks with various computation characteristics to demonstrate the capabilities of OpenCL on several platforms. These programs are parallelized with OpenMP and OpenCL and then run on several GPUs and conventional servers. The paper also provides an example that illustrates the migration of a given program from OpenMP to OpenCL. The experimental results show that these inexpensive GPUs deliver better performance than servers when OpenCL paradigms are adopted. This is a preliminary milestone toward cheap supercomputing accelerated by ubiquitously available GPUs.
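The structural change involved in moving from a loop-parallel style (OpenMP) to a kernel-per-work-item style (OpenCL) can be sketched in plain Python. This is only an emulation under stated assumptions, not the paper's own example: `saxpy_loop`, `saxpy_kernel`, and `launch` are hypothetical names, and `launch` runs work-items sequentially where a real OpenCL runtime would run them in parallel.

```python
def saxpy_loop(a, x, y):
    """Loop-parallel form: with OpenMP, a `parallel for` directive on this
    loop would split the iterations among threads."""
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out


def saxpy_kernel(gid, a, x, y, out):
    """Kernel form: the body for ONE work-item. `gid` stands in for the
    index that OpenCL's get_global_id(0) would return inside a kernel."""
    out[gid] = a * x[gid] + y[gid]


def launch(kernel, global_size, *args):
    """Hypothetical stand-in for enqueueing a 1D index range.
    Work-items run one after another here; a device would run them in parallel."""
    for gid in range(global_size):
        kernel(gid, *args)


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)
launch(saxpy_kernel, len(x), 2.0, x, y, out)
print(out == saxpy_loop(2.0, x, y))
```

The point of the sketch is that the loop disappears from the kernel: the loop index becomes the work-item identifier, and the iteration count becomes the launch size.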
embedded and ubiquitous computing | 2011
Slo-Li Chu; Chih-Chieh Hsiao; Chiu-Cheng Hsieh
The programmability of mobile GPUs has risen in recent years, with the shaders inside driven by shading programs to produce realistic 3D effects. The register files of a conventional high-throughput multithreaded shader consume 10% to 20% of its energy. However, the register usage of shading programs is quite low. To reduce the energy of the register file in a multithreaded mobile shader, this paper proposes a unified register file design that reduces both its dynamic and leakage energy. The results show that the proposed design reduces the dynamic energy of a multithreaded register file by 85%. Furthermore, the proposed design reduces leakage energy by 59% and area by 25% with negligible performance degradation. The energy savings of the proposed design are also at least 75% greater than those of related work.
IEEE Transactions on Multimedia | 2014
Chih-Chieh Hsiao; Slo-Li Chu; Chiu-Cheng Hsieh
In response to the remarkable increase in 3D applications in consumer electronics devices in recent years, graphics processing units (GPUs) have become widely available on mobile devices. These GPUs typically use hardware multithreaded shaders to improve their throughput for real-time rendering, but they depend on duplicate register files to maintain the context of each hardware thread, which increases power consumption. However, the register usage of shading programs is often relatively low, so many registers remain unused and waste power. Long-latency memory operations can also consume unnecessary power to activate registers. This study proposes a low-power register file with multiple power modes to reduce the power consumption of the register file. It also presents an adaptive thread scheduling mechanism to achieve a tradeoff between the power consumption of the register file and frames per second (FPS). Results show that the average performance degradation from the proposed low-power register file is only 0.62%. The proposed adaptive thread scheduling has an average under-prediction ratio of 3.32%. The leakage reduction of the proposed low-power register file is 74.80%. This reduction can be improved to 81.49%, 82.22%, and 84.28% with adaptive thread scheduling at frame rates of 30, 25, and 20 FPS, respectively.
Computers & Graphics | 2013
Chih-Chieh Hsiao; Slo-Li Chu; Chen-Yu Chen
As 3D applications on mobile devices have become increasingly popular, mobile GPUs have become one of their most essential components. Because the lifetime of these devices is generally battery-limited, the tradeoff between energy consumption and user experience has become an important issue. Conventional mechanisms for reducing the energy consumption of the shader in a mobile GPU include using fixed-point arithmetic and reducing floating-point precision. Fixed-point has a narrower numerical range than floating-point but is faster and more energy-efficient, whereas reduced-precision floating-point has a wider numerical range but consumes more energy. In this work, an Energy-aware Hybrid Precision Selection (EHPS) framework is proposed to integrate the above mechanisms with a profile-based precision selection mechanism to maximize energy savings. In addition, a built-in energy model is used to evaluate whether fixed-point or reduced floating-point is more energy-efficient for the current application, and the more energy-efficient option is then used to render it. The results reveal that the proposed EHPS framework reduces the energy consumed by the shader by an average of 33.66% and 31.63% in the low- and high-quality modes, respectively. The average PSNRs of the resulting images are 26.89 dB and 45.94 dB in these two rendering modes, respectively. The proposed EHPS framework yields better image quality and uses less energy than related works.
Highlights: An energy-aware hybrid rendering management scheme with fixed-point and reduced floating-point systems. An automatic precision selection mechanism for both vertex and fragment shading at run time. A runtime energy and precision evaluation system for determining a feasible number system to render the current application.
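The quality metric quoted in this abstract, PSNR, and the effect of fixed-point precision on it can be illustrated with a short sketch. This is not the EHPS implementation: `to_fixed`, `psnr`, and the sample pixel values are assumptions made for illustration, and pixel values are assumed to lie in [0, 1].

```python
import math


def to_fixed(value, frac_bits):
    """Quantize a float to a fixed-point grid with `frac_bits` fractional bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def psnr(reference, test, peak=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# Hypothetical pixel intensities standing in for a floating-point rendering.
pixels = [0.1, 0.25, 0.5, 0.73, 0.99]
low = [to_fixed(p, 4) for p in pixels]    # coarse fixed-point
high = [to_fixed(p, 12) for p in pixels]  # finer fixed-point

print(psnr(pixels, low), psnr(pixels, high))
```

More fractional bits shrink the quantization error and raise the PSNR, which is the quality side of the energy/quality tradeoff the abstract describes.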
international symposium on parallel architectures, algorithms and programming | 2012
Slo-Li Chu; Shiue-Ru Chen; Sheng-Fu Weng
Multicore processors are now widely adopted in hand-held systems. Because hand-held systems are battery-powered, battery life becomes the dominant limitation, and an efficient low-power scheduling mechanism for hand-held multicore systems has become important. This paper proposes a novel low-power scheduling mechanism, called Bounded-Power Multicore Dynamic Frequency Scaling (BPM-DFS), which integrates a system configuration selection algorithm and a task re-scheduling mechanism. According to the power budget assigned by the user, BPM-DFS dynamically adjusts the configuration of the multicore system, controlling the number of active cores, the working frequency, and task reassignment, to achieve good performance within the power limit. The proposed BPM-DFS has been implemented on a quad-core x86 Android system and compared with the built-in Linux/Android power managers. The experimental results reveal that BPM-DFS consumes 25% less power than the Linux Performance mode.
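The idea of selecting a core count and frequency under a user-assigned power budget can be sketched as follows. This is not the BPM-DFS algorithm itself, whose details the abstract does not give: the power and throughput models, their constants, and the names `estimate_power`, `estimate_perf`, and `select_config` are all assumptions chosen for illustration.

```python
from itertools import product


def estimate_power(cores, freq_ghz, static_w=0.2, k=0.5):
    """Toy power model (assumption): per-core static power plus
    dynamic power growing with the cube of the frequency."""
    return cores * (static_w + k * freq_ghz ** 3)


def estimate_perf(cores, freq_ghz):
    """Toy throughput model (assumption): ideal scaling with cores and frequency."""
    return cores * freq_ghz


def select_config(budget_w, core_options=(1, 2, 3, 4),
                  freq_options=(0.6, 1.0, 1.4, 1.8)):
    """Return the (cores, freq) pair with the highest estimated throughput
    whose estimated power fits the budget, or None if nothing fits."""
    feasible = [
        (c, f) for c, f in product(core_options, freq_options)
        if estimate_power(c, f) <= budget_w
    ]
    if not feasible:
        return None
    # Prefer higher throughput; break ties with lower power.
    return max(feasible, key=lambda cf: (estimate_perf(*cf), -estimate_power(*cf)))


print(select_config(3.0))
```

Under this toy model a 3 W budget favors four cores at a moderate frequency over fewer cores at a high one, because dynamic power rises much faster than throughput as frequency increases.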
embedded and ubiquitous computing | 2011
Slo-Li Chu; Chih-Chieh Hsiao; Chen-Yu Chen
To extend the battery life of mobile devices while maintaining image quality, this paper presents a dual-mode unified shader for mobile GPUs, consisting of floating-point and fixed-point SIMD shaders, for high-quality or energy-saving rendering. Furthermore, to increase image quality in fixed-point rendering, this paper proposes a frame-based dynamic precision adjustment scheme that selects an appropriate precision for different 3D scenes. The proposed design has the following characteristics: I) floating-point rendering for high quality and fixed-point rendering for energy saving, II) a frame-based dynamic precision adjustment scheme that selects an appropriate precision for a given scene, and III) a workload-based scene change detection mechanism that re-selects the precision in time. This paper also presents a side-by-side comparison of performance, power, and image quality between floating-point and fixed-point rendering in real-world 3D games. In these games, the proposed shader in energy-saving mode achieves, on average, a 48.6% reduction in dynamic power and 33% faster thread execution. Furthermore, the quality loss of images rendered under the proposed dynamic precision is imperceptible to the human eye, and the PSNR outperforms related work by 2.37% on average. This shows a way to use conventional fixed-point arithmetic with dynamic precision to implement a low-power unified shader with quality rendering for such power-limited devices.
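A frame-based precision choice of the kind described above could look roughly like the sketch below. It is an illustration under assumptions, not the paper's scheme: the 16-bit signed word, the 30% workload threshold, and the names `select_frac_bits` and `scene_changed` are all hypothetical.

```python
import math


def select_frac_bits(frame_values, word_bits=16):
    """Pick the number of fractional bits for a signed fixed-point word so
    that the largest magnitude seen in the frame still fits."""
    peak = max(abs(v) for v in frame_values)
    if peak > 0:
        # Integer bits needed to hold the peak magnitude (never negative).
        int_bits = max(0, math.floor(math.log2(peak)) + 1)
    else:
        int_bits = 0
    # One bit is reserved for the sign; the rest go to the fraction.
    return max(0, word_bits - 1 - int_bits)


def scene_changed(prev_workload, workload, threshold=0.3):
    """Flag a scene change when the per-frame workload shifts by more than
    `threshold` relative to the previous frame (assumed heuristic)."""
    if prev_workload == 0:
        return workload > 0
    return abs(workload - prev_workload) / prev_workload > threshold


print(select_frac_bits([0.2, -0.5]), select_frac_bits([100.0, 3.0]))
```

A frame whose values stay below 1 can spend nearly the whole word on the fraction, while a frame with larger values gives up fractional bits for range; a detected scene change would be the trigger to run the selection again.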
international conference on parallel and distributed systems | 2004
Slo-Li Chu
Continuous improvements in semiconductor fabrication density are supporting new classes of system-on-a-chip (SoC) architectures that combine extensive processing logic/processor with high-density memory. Such architectures are generally called processor-in-memory (PIM) or intelligent memory (I-RAM) and can support high-performance computing by reducing the performance gap between the processor and the memory. The PIM architecture combines various processors in a single system. These processors are characterized by their computation and memory-access capabilities. Therefore, a strategy must be developed to identify their capabilities and dispatch the most appropriate jobs to them in order to exploit them fully. Accordingly, this study presents a new automatic source-to-source parallelizing system, called SAGE, to exploit the advantages of PIM architectures. Unlike conventional iteration-based parallelizing systems, SAGE adopts statement-based analyzing approaches. It adopts a new pair-selection scheduling (PSS) mechanism to achieve better utilization and workload balance between the host and memory processors of PIM architectures. This paper also provides performance results and comparison of several benchmarks to demonstrate the capability of this new scheduling algorithm.
annual computer security applications conference | 2008
Slo-Li Chu; Min-Jen Lo; Hsiao-Wen Yang
With the continuous growth of multimedia functionality in modern portable consumer electronics, computer systems have to integrate multiple media processors on a single chip or system to provide better service. However, insufficient memory subsystem bandwidth leaves the performance of the multimedia modules unsatisfactory. In this paper, we propose an innovative memory subsystem architecture that aims to extract more potential memory access bandwidth to dynamically fulfill the requirements of multiple multimedia processors. The proposed architecture, called MediaMem, offers sufficient bandwidth for all attached multimedia processors through two novel scheduling mechanisms that dynamically adjust the access grants, buffer sizes, and transfer sequences according to real-time conditions. Additionally, the memory interconnection is modified to avoid bus contention. The proposed MediaMem architecture has been implemented in SystemC. Whole-system functional verification and performance evaluation were carried out with CoWare ConvergenSC. The experimental results are also discussed.
international symposium on parallel architectures, algorithms and programming | 2012
Slo-Li Chu; Geng-Siao Lee; Yan-Wun Peng
Continuous improvements in semiconductor technology integrate more processors into a single chip. When multiple processors are integrated into a chip, the interconnection network among the cores becomes a dominant performance bottleneck. Accordingly, this paper presents a new on-chip interconnection network, called Self Similar Cubic (SSC), for many-core architectures. Together with the proposed linking mechanism, routing algorithm, and switching architecture, SSC offers better scalability, on-chip fabrication feasibility, and communication performance than conventional on-chip networks such as Mesh and Hypercube. This paper presents a comparison of performance and area cost. The analysis results reveal that SSC provides higher throughput and lower area cost than other on-chip networks. The performance per area cost of SSC is five times better than that of Mesh.
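The two baseline networks named in this abstract can be compared with standard topology formulas, which gives a sense of the cost/performance tradeoff SSC is positioned against. The SSC topology itself is not described in the abstract, so it is not modeled here; the function names are hypothetical, and link count is used only as a rough proxy for wiring cost.

```python
def mesh_metrics(k):
    """k x k 2D mesh: (nodes, bidirectional links, diameter in hops)."""
    nodes = k * k
    links = 2 * k * (k - 1)   # k rows and k columns, each with k - 1 links
    diameter = 2 * (k - 1)    # corner to opposite corner
    return nodes, links, diameter


def hypercube_metrics(n):
    """n-dimensional hypercube: (nodes, bidirectional links, diameter in hops)."""
    nodes = 2 ** n
    links = n * 2 ** (n - 1)  # each node has n links, each link shared by 2 nodes
    diameter = n              # one hop per differing address bit
    return nodes, links, diameter


# Both configurations below have 64 nodes.
print(mesh_metrics(8), hypercube_metrics(6))
```

At 64 nodes the mesh needs fewer links but has a much larger diameter, while the hypercube shortens paths at the price of more links per node.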
IEICE Electronics Express | 2012
Slo-Li Chu; Chih-Chieh Hsiao; Chiu-Cheng Hsieh
Mobile GPUs are used in modern portable devices to satisfy the growing requirements of 3D applications. These GPUs generally integrate hardware multithreaded shaders to improve the throughput for real-time rendering, but they depend on duplicate register files to maintain the context of each hardware thread. This work develops a demand-driven register file (DDRF) to reduce the power consumed by register files. The proposed DDRF is shared on demand among concurrent threads and turns off almost all unused registers. Experimental results reveal that the DDRF uses 85.8% less power than a conventional multithreaded GPU. The chip area and circuit latency of the DDRF are also discussed.
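The on-demand sharing idea can be illustrated with a toy model in which threads draw registers from one shared pool and only allocated registers count as powered on. This is a sketch under assumptions, not the DDRF hardware design: the class `SharedRegisterFile`, its methods, and the register counts are hypothetical.

```python
class SharedRegisterFile:
    """Toy model: one pool of physical registers shared by all threads,
    where a register is powered on only while a thread owns it."""

    def __init__(self, total_registers):
        self.total = total_registers
        self.free = list(range(total_registers))
        self.owned = {}  # thread id -> list of physical register indices

    def allocate(self, thread_id, count):
        """Give `count` registers to a thread; return False if the pool is too small."""
        if count > len(self.free):
            return False
        regs = [self.free.pop() for _ in range(count)]
        self.owned.setdefault(thread_id, []).extend(regs)
        return True

    def release(self, thread_id):
        """Return a finished thread's registers to the pool (powered off again)."""
        self.free.extend(self.owned.pop(thread_id, []))

    def powered_on(self):
        """Number of registers currently holding live thread context."""
        return self.total - len(self.free)


rf = SharedRegisterFile(64)
rf.allocate(0, 6)
rf.allocate(1, 10)
print(rf.powered_on())
```

With duplicated per-thread register files every register stays powered regardless of use; in this model only the 16 registers the two threads actually requested are on.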