Shau-Yin Tseng

Industrial Technology Research Institute

Publication

Featured researches published by Shau-Yin Tseng.

design automation conference | 2010

NTPT: on the end-to-end traffic prediction in the on-chip networks

Yoshi Shih-Chieh Huang; Kaven Chun-Kai Chou; Chung-Ta King; Shau-Yin Tseng

Power and thermal distribution are critical issues in chip multiprocessors (CMPs). Most previous studies focus on cores and on-chip memory subsystems and discuss how to reduce their power and control thermal distribution by using dynamic voltage/frequency scaling. However, the on-chip interconnection network, or network-on-chip (NoC), is also an important source of power consumption and heat generation. In particular, the traffic flowing through the NoC directly affects its power and thermal distribution. Unfortunately, very few works discuss the dynamism of NoC. A key technique for NoC management is to capture its traffic patterns and predict future behaviors. In this paper, we propose a table-driven predictor called Network Traffic Prediction Table (NTPT) for recording and predicting traffic in NoC. The most distinctive feature of NTPT is its ability to predict end-to-end traffic, rather than switch-to-switch traffic. Thus, more application behaviors can be captured and monitored. Evaluations on Tilera's TILE64 show that NTPT has very high prediction accuracy. Analyses also show that it incurs a low area overhead and is very feasible.
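The abstract above does not describe how NTPT indexes its table or forms a prediction. As a rough illustration of what a table-driven, end-to-end predictor can look like, the following sketch keys a table by (source, destination) pair and predicts that the next interval carries the same traffic as the last one. The class name, the last-value rule, and the exact-match accuracy measure are all assumptions made for illustration, not the authors' design.

```python
class EndToEndTrafficTable:
    """Hypothetical last-value predictor keyed by (source, destination).

    This is NOT the NTPT design from the paper; it only illustrates the
    idea of recording traffic per end-to-end pair in a table.
    """

    def __init__(self):
        self._last_seen = {}  # (src, dst) -> traffic volume in the last interval

    def predict(self, src, dst):
        # Predict that the next interval repeats the last observed volume.
        return self._last_seen.get((src, dst), 0)

    def record(self, src, dst, flits):
        # Store the volume actually observed in the interval that just ended.
        self._last_seen[(src, dst)] = flits


def exact_hit_rate(table, src, dst, trace):
    """Fraction of intervals whose volume was predicted exactly."""
    hits = 0
    for observed in trace:
        if table.predict(src, dst) == observed:
            hits += 1
        table.record(src, dst, observed)
    return hits / len(trace)


table = EndToEndTrafficTable()
# Made-up trace of per-interval traffic from tile 0 to tile 5.
rate = exact_hit_rate(table, src=0, dst=5, trace=[10, 10, 10, 12])
print(rate)
```

Keying on the end-to-end pair rather than on individual switches is what lets such a table reflect application-level communication patterns, which is the property the abstract emphasizes.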

international symposium on pervasive systems, algorithms, and networks | 2009

Parallel Implementation and Performance Prediction of Object Detection in Videos on the Tilera Many-Core Systems

Ya-Fei Hung; Shau-Yin Tseng; Chung-Ta King; Huan-Yu Liu; Shih-Chieh Huang

Object detection plays an important role in intelligent video analysis. Unfortunately, its heavy computational complexity makes it very difficult to process in real time. Some recent studies use multi-core platforms to achieve the required performance. In this paper, we study the problem in the context of many-core platforms, e.g., for application-specific embedded systems. We first show how object detection can be parallelized for many-core platforms and then discuss how its performance can be predicted for embedded system designs. The parallel algorithm is verified with a real implementation on a 64-core Tilera system. Our implementation achieves a speedup of 37.20 with 56 cores and a processing rate of 18 frames per second for full-HD (1920 × 1080) videos. Our performance prediction equation is also evaluated using the implementation, and the predicted performance is very close to the real results.
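The figures reported above can be put in context with some simple arithmetic. The sketch below computes the parallel efficiency implied by a speedup of 37.20 on 56 cores and the experimentally determined serial fraction (the Karp-Flatt metric). It also estimates the single-core frame rate, assuming the 18 frames per second figure belongs to the same 56-core run, which the abstract suggests but does not state outright. The paper's own performance prediction equation is not given in the abstract and is not reproduced here.

```python
def parallel_efficiency(speedup, cores):
    """Speedup divided by the number of cores used."""
    return speedup / cores


def karp_flatt_serial_fraction(speedup, cores):
    """Experimentally determined serial fraction: (1/S - 1/p) / (1 - 1/p)."""
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


speedup, cores, fps = 37.20, 56, 18.0  # figures quoted in the abstract

efficiency = parallel_efficiency(speedup, cores)
serial_fraction = karp_flatt_serial_fraction(speedup, cores)
# Assumption: the 18 fps rate was measured in the 56-core configuration.
single_core_fps = fps / speedup

print(f"efficiency      = {efficiency:.3f}")
print(f"serial fraction = {serial_fraction:.4f}")
print(f"single-core fps = {single_core_fps:.2f}")
```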

international conference on parallel and distributed systems | 2009

Multiprocessor System-on-Chip Profiling Architecture: Design and Implementation

Po-Hui Chen; Chung-Ta King; Yuan-Ying Chang; Shau-Yin Tseng

With the growing need for advanced functionality in modern embedded systems, it is now necessary to integrate multiple processors in the system, preferably on a single chip, to support the required computing complexity. The problem is that such a multiprocessor system-on-chip (MPSoC) architecture is very complex and its internal behavior is very difficult to track. An effective tool for profiling the behavior of the MPSoC system is greatly needed. Such a tool is very useful during system design for exploring various options and identifying potential bottlenecks. In this paper, we introduce the MultiProcessor Profiling Architecture (MPPA), a general framework for profiling MPSoC embedded systems. The MPPA framework entails the use of FPGA emulation for the target system, the embedding of performance counters for recording system events, and the development of OS drivers for collecting the profiled data. To demonstrate its use, we show the implementation of an MPSoC emulation system based on Leon3 cores following the MPPA framework. We also show how the MPPA framework and the emulator help designers identify performance problems and improve their MPSoC embedded system design.

international symposium on vlsi design, automation and test | 2010

Implementation of JVM tool interface on Dalvik virtual machine

Chien-Wei Chang; Chun-Yu Lin; Chung-Ta King; Yi-Fan Chung; Shau-Yin Tseng

Mobile devices such as cell phones, GPS navigation systems, and MP3 players have become some of the most important consumer electronic products. As embedded systems, mobile devices are highly integrated in software and hardware for robustness, high performance, and low cost. The problem is that this also makes it very difficult to understand the internal interactions of hardware and software modules in such devices and to identify performance bottlenecks and design faults. Profiling helps developers understand the behavior of a system, especially during the development of new platforms. Android is a new software platform intended for mobile devices. It is composed of Linux and a Java virtual machine called Dalvik. The ability to profile Android helps developers familiarize themselves with Android's features and optimize their applications. In this paper, we discuss the development of a profiling tool interface, JVM TI, on Android. With this tool interface, developers can profile their Java code running on Dalvik using JVM TI.

international symposium on parallel and distributed processing and applications | 2011

Parallel Integral Image Generation Algorithm on Multi-core System

Yi-Ta Wu; Yih-Tyng Wu; Chao-Yi Cho; Shau-Yin Tseng; Chun-Nan Liu; Chung-Ta King

The integral image has become a very useful tool in most computer vision applications in recent years, and parallelization is the most popular approach for generating it efficiently. However, most current parallel integral image generation algorithms are developed for dedicated hardware architectures. In this paper, we develop a parallel integral image generation algorithm on Tile64, which is a MIMD-based embedded system. Our parallel integral image generation algorithm is 7.66 times more efficient than the original sequential integral image generation.
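For readers unfamiliar with the technique, the sketch below shows a sequential integral image computation alongside a common two-pass decomposition in which every row, and then every column, can be processed independently. This is a generic textbook decomposition, not the Tile64 algorithm from the paper, whose details the abstract does not give. The function names and the thread pool are illustrative assumptions; Python threads only show the structure of the independent work and are not expected to reproduce the reported speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate


def integral_image_sequential(img):
    """ii[y][x] = sum of img[0..y][0..x], computed in a single pass."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii


def integral_image_two_pass(img, workers=4):
    """Row-wise prefix sums, then column-wise prefix sums.

    Each row (and then each column) is independent of the others,
    so each pass can be distributed across workers.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = list(pool.map(lambda r: list(accumulate(r)), img))
        cols = list(pool.map(lambda c: list(accumulate(c)), zip(*rows)))
    # cols holds one list per column; transpose back to row-major order.
    return [list(r) for r in zip(*cols)]


image = [[1, 2, 3],
         [4, 5, 6]]
sequential = integral_image_sequential(image)
two_pass = integral_image_two_pass(image)
print(sequential)
print(two_pass)
```

With an integral image in hand, the sum over any axis-aligned rectangle can be obtained from four table lookups, which is why the structure is so widely used in detection pipelines.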

computer and information technology | 2010

Performance and Power Consumption Analysis of DVFS-Enabled H.264 Decoder on Heterogeneous Multi-Core Platform

Shau-Yin Tseng; Kuo-Hung Lin; Wen-Shan Wang; Chung-Ta King; Shih-Hsueh Chang

Power consumption has become a very important criterion for portable embedded devices and, therefore, many Dynamic Voltage/Frequency Scaling (DVFS) techniques have been introduced. This paper breaks down and analyzes the power consumed by three main components of a multi-core SoC platform: the DSP logic, the local memory, and the external DDR2. There are four configurations for this SoC platform: one DSP at full and half clock rates, and two DSPs at full and half clock rates. The DSPs in the SoC are identical in hardware architecture and execute the same H.264/AVC decoder software in all scenarios. With this breakdown, we can identify the key factors for energy saving and offer suggestions for power management and embedded multi-core SoC designs.

network on chip architectures | 2011

Floodgate: application-driven flow control in network-on-chip for many-core architectures

Yoshi Shih-Chieh Huang; Huan-Yu Liu; Yuan-Ying Chang; Chung-Ta King; Shau-Yin Tseng

With the prevalence of multi- and many-core architectures, network-on-chip (NoC) is becoming the main paradigm for on-chip interconnection. However, the performance of NoCs can be degraded significantly if the network flow is not controlled properly. Most previous solutions have tried to detect network congestion by monitoring the hardware status of the network switches or links. Unfortunately, such strategies rely on the backpressure of the traffic flows for congestion detection and may be too slow to respond. This paper proposes a proactive strategy which predicts the global, end-to-end traffic patterns of the running application and takes preventive flow control actions to avoid congestion. The proposed system entails an application-level prediction table for accurate traffic prediction and a packet injection scheduler for congestion avoidance. The proposed scheme is evaluated by a trace-driven simulator with synthetic traffic traces as well as a real application trace of an instance in the SPLASH-2 benchmark. The results show the superior performance of the proposed scheme with negligible execution overhead.

international symposium on parallel and distributed processing and applications | 2011

An Impulse Noise Removal Algorithm by Considering Region-Wise Property for Color Image

Yi-Ta Wu; Yih-Tyng Wu; Shau-Yin Tseng; Chao-Yi Cho

Impulse noise removal is an important image preprocessing technique, since noise can lead subsequent image processing procedures in an unexpected direction. The candidate-oriented strategy, which detects the corrupted pixels (noise candidates) and then updates the intensity values of those pixels, can achieve better performance than the brute-force strategy. However, conventional noise detection algorithms determine noise based on the pixel-wise relationships within a fixed-size observation window, and thus misclassify normal edge and detail pixels as noise. In this paper, a novel region feature is presented to avoid this misclassification problem. The noise pixels are treated as small regions and labeled by the multi-scale connected component labeling algorithm. In this way, the region size can be used as a clue during the noise detection procedure. This newly developed region feature can be easily applied to current noise removal algorithms. The preliminary results show that the number of misclassifications of the ROAD algorithm is dramatically reduced when the region feature is considered, and thus the performance of conventional impulse noise removal algorithms can be improved accordingly.
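The region-size idea described above can be illustrated with a small sketch. Given a binary mask of noise candidates (which a detector such as ROAD would produce; no detector is implemented here), the code labels 4-connected regions with a plain breadth-first search and keeps only candidates that belong to small regions. The paper uses a multi-scale connected component labeling algorithm; this single-scale version, the function name, and the `max_region_size` threshold are simplifying assumptions for illustration.

```python
from collections import deque


def small_region_mask(candidates, max_region_size):
    """Keep candidate pixels only if their 4-connected region is small.

    candidates: 2D list of 0/1 flags produced by some noise detector.
    Returns a 2D list of booleans marking pixels still treated as noise.
    """
    h, w = len(candidates), len(candidates[0])
    seen = [[False] * w for _ in range(h)]
    keep = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not candidates[y][x] or seen[y][x]:
                continue
            # Breadth-first search collects one connected region.
            region = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and candidates[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Large regions are more likely edges or detail than impulse noise.
            if len(region) <= max_region_size:
                for ry, rx in region:
                    keep[ry][rx] = True
    return keep


# One isolated candidate (likely noise) and a 4-pixel line (likely an edge).
mask = [[1, 0, 0, 0],
        [0, 0, 0, 0],
        [1, 1, 1, 1]]
noise = small_region_mask(mask, max_region_size=2)
print(noise)
```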

international conference on parallel and distributed systems | 2011

Energy-Aware Depth Map Generation for 3D Portrait on Android Systems

Chia-Hui Kao; Chung-Ta King; Shau-Yin Tseng

The most important information in transforming a 2D image into a 3D image is the depth of each pixel in the image. However, a normal 2D image usually does not contain such information, which makes the transformation impossible. On the other hand, for certain types of pictures, such as personal portraits, it is possible to infer crude depth information from their known contexts and properties. Unfortunately, depth map generation is computationally involved and, if executed on a mobile phone, will consume a lot of energy. This is undesirable, particularly when the mobile device is running low on battery. The application must be aware of the energy status of the system, make appropriate tradeoffs, and then adapt accordingly. This paper presents such an energy-aware, 2D-to-3D image transformation tool for personal portraits on mobile phones. The tool chooses a suitable depth-map generation algorithm based on the remaining energy of the device. We discuss how to make the tradeoffs and evaluate the idea on real machines.
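The selection step described above, choosing a depth-map algorithm from the remaining energy, might look roughly like the following. The abstract does not name the candidate algorithms or the decision thresholds, so the three tiers, their labels, and the cutoff values below are entirely hypothetical placeholders.

```python
def choose_depth_algorithm(battery_fraction):
    """Pick a depth-map generator tier from the remaining battery level.

    battery_fraction: remaining energy as a value in [0.0, 1.0].
    The tiers and thresholds are hypothetical, not taken from the paper.
    """
    if not 0.0 <= battery_fraction <= 1.0:
        raise ValueError("battery_fraction must be between 0.0 and 1.0")
    if battery_fraction >= 0.5:
        return "detailed"      # highest quality, highest energy cost
    if battery_fraction >= 0.2:
        return "approximate"   # reduced quality, moderate energy cost
    return "template"          # crude depth from portrait assumptions


for level in (0.9, 0.3, 0.05):
    print(level, choose_depth_algorithm(level))
```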

international soc design conference | 2008

On reducing power consumption and code size of H.264 intra luma prediction on multicore DSP

Yan-Fu Chen; Chung-Ta King; Wen-Shan Wang; Shau-Yin Tseng

As the latest international video coding standard, H.264 is gaining importance not only on desktop computers but also on handheld devices. For handheld devices, power consumption of the system is of ultimate importance, which needs to be addressed at every layer of the system design. At the application layer, the code size affects the memory hierarchy behavior, which in turn affects the system power performance. In this paper, we consider the power consumption of H.264 running on handheld devices using multicore DSP processors, focusing particularly on power saving through code reduction. We use the multiple intra-mode prediction as an example. There are nine optional prediction modes for each 4 × 4 intra luma sub-macroblock in H.264, which result in nine separate routines. We examine the access patterns of the modes and propose a general scheme that replaces eight of the nine routines with one. This dramatically reduces the code size and the resultant power consumption. We implemented the code on PACDSP, resulting in a 20% reduction of the code size with negligible performance degradation.
