Publication


Featured research published by Tsung-Yuan Charlie Tai.


Computing Frontiers | 2012

Improving energy efficiency for mobile platforms by exploiting low-power sleep states

Alexander W. Min; Ren Wang; James Tsai; Mesut A. Ergin; Tsung-Yuan Charlie Tai

Reducing energy consumption is one of the most important design aspects for small form-factor mobile platforms, such as smartphones and tablets. Despite its potential for power savings, optimally leveraging system low-power sleep states during active mobile workloads, such as video streaming and web browsing, has not been fully explored. One major challenge is to make intelligent power management decisions based on, among other things, accurate system idle duration prediction, which is difficult due to non-deterministic system interrupt behavior. In this paper, we propose a novel framework, called E2S3 (Energy Efficient Sleep-State Selection), that dynamically enters the optimal low-power sleep state to minimize system power consumption. In particular, E2S3 detects and exploits short idle durations during active mobile workloads by (i) finding optimal thresholds (i.e., energy break-even times) for multiple low-power sleep states, (ii) predicting the sleep-state selection error probabilities heuristically, and (iii) selecting the optimal sleep state based on the expected reward, e.g., power consumption, which incorporates the risk of making a wrong decision. We implemented and evaluated E2S3 on Android-based smartphones, demonstrating the effectiveness of the algorithm. The evaluation results show that E2S3 significantly reduces platform energy consumption, by up to 50% (hence extending battery life), without compromising system performance.
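
The break-even logic the abstract describes can be sketched as follows. The state names, power figures, and transition costs below are hypothetical rather than values from the paper, and the paper's heuristic error-probability model is omitted for brevity.

```python
# Illustrative sketch of break-even-based sleep-state selection in the
# spirit of E2S3; all numbers below are made up for illustration.

ACTIVE_POWER_MW = 500.0  # power if the platform simply stays awake

# Each state: (name, sleep power in mW, one-time entry/exit energy in mJ)
STATES = [
    ("C1", 200.0, 0.1),  # shallow: cheap to enter, saves little
    ("C3", 50.0, 1.0),
    ("C6", 5.0, 5.0),    # deep: expensive to enter, saves the most
]

def break_even_ms(sleep_power_mw, transition_mj):
    """Idle duration beyond which entering the state saves energy."""
    return transition_mj / (ACTIVE_POWER_MW - sleep_power_mw) * 1000.0

def expected_energy_mj(state, idle_ms):
    _, sleep_power_mw, transition_mj = state
    return transition_mj + sleep_power_mw * idle_ms / 1000.0

def select_state(predicted_idle_ms):
    """Pick the sleep state with the lowest expected energy over the idle period."""
    viable = [s for s in STATES
              if predicted_idle_ms >= break_even_ms(s[1], s[2])]
    if not viable:  # predicted idle too short for any deeper state
        return STATES[0][0]
    return min(viable, key=lambda s: expected_energy_mj(s, predicted_idle_ms))[0]
```

With these numbers, a long predicted idle favors the deep C6 state, while a very short one keeps the platform in shallow C1, mirroring the threshold behavior the abstract describes.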


IEEE Journal on Selected Areas in Communications | 2011

Reducing Power Consumption for Mobile Platforms via Adaptive Traffic Coalescing

Ren Wang; James Tsai; Christian Maciocco; Tsung-Yuan Charlie Tai; Jackie Wu

Battery life remains a critical competitive metric for today's mobile platforms, which offer ubiquitous connectivity through their wireless communication interfaces. With most usage models being driven by always-on communication activities, e.g., Internet video streaming, web browsing, etc., it is imperative to understand the impact of network activities on overall platform power, and to optimize power consumption for such activities. As shown by our investigation, various real-world network-driven workloads exhibit bursty and random behavior, which motivates our work on regulating and coalescing incoming packets to reduce platform wake events. To understand the performance impact of packet coalescing, we conduct an extensive investigation into how coalescing may affect throughput and user experience. Armed with this understanding, we propose, implement and evaluate an Adaptive Traffic Coalescing (ATC) scheme that monitors the incoming traffic at the Network Interface Card (NIC) and adaptively coalesces packets for a limited duration in the NIC buffer, thus requiring no network or ecosystem support. The proposed ATC scheme effectively reduces platform wake events, enabling the platform to enter and stay in the low-power state longer for energy efficiency. We have implemented the scheme in commercial wireless NICs. Using various mobile platforms, we evaluate the power savings and performance impact of the proposed ATC scheme. Experiments show that ATC achieves significant power savings for major platform components, around 20% for real-world Internet workloads, without impacting performance or user experience.
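
A minimal sketch of the coalescing mechanism described above; the paper's ATC additionally adapts its holding window to the observed traffic, whereas this sketch uses a fixed window and buffer limit, and all names and numbers are illustrative.

```python
# Hypothetical sketch of NIC-side traffic coalescing: arriving packets are
# held for at most a fixed window and released as one burst, so the host
# wakes once per burst instead of once per packet.

class Coalescer:
    def __init__(self, window_ms=10.0, max_buffer=32):
        self.window_ms = window_ms    # maximum time a packet may be held
        self.max_buffer = max_buffer  # NIC buffer capacity in packets
        self.buffer = []
        self.deadline = None
        self.wake_events = 0          # host wake-ups caused so far

    def on_packet(self, pkt, now_ms):
        """Buffer a packet; return the released burst (possibly empty)."""
        if not self.buffer:
            self.deadline = now_ms + self.window_ms
        self.buffer.append(pkt)
        if len(self.buffer) >= self.max_buffer or now_ms >= self.deadline:
            return self._flush()
        return []

    def _flush(self):
        burst, self.buffer = self.buffer, []
        self.deadline = None
        self.wake_events += 1         # one host wake-up per burst
        return burst
```

Bursty traffic that would have caused dozens of wake events now causes one, letting the platform stay in its low-power state between bursts.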


Proceedings of the 1st Workshop on Architectures and Systems for Big Data | 2011

A collaborative memory system for high-performance and cost-effective clustered architectures

Ahmad Samih; Ren Wang; Christian Maciocco; Tsung-Yuan Charlie Tai; Yan Solihin

With the fast development of highly integrated distributed systems (cluster systems), especially those encapsulated within a single platform [28, 9], designers face interesting memory hierarchy design choices that attempt to avoid disk storage swapping. Disk swapping activities slow down application execution drastically. Leveraging remote free memory through Memory Collaboration has demonstrated its cost-effectiveness compared to overprovisioning for peak load requirements. Recent studies propose several ways of accessing under-utilized remote memory in static system configurations, without detailed exploration of dynamic memory collaboration. Dynamic collaboration is an important aspect given the run-time memory usage fluctuations in clustered systems. In this paper, we propose an Autonomous Collaborative Memory System (ACMS) that manages memory resources dynamically at run time to optimize performance and provide QoS measures for nodes engaging in the system. We implement a prototype realizing the proposed ACMS, experiment with a wide range of real-world applications, and show up to 3x performance speedup compared to a non-collaborative memory system, without perceivable performance impact on nodes that provide memory. Based on our experiments, we conduct a detailed analysis of the remote memory access overhead and provide insights for future optimizations.
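
The core borrowing decision can be sketched as below; the watermark, node data, and field names are made up for illustration, and the real ACMS additionally manages QoS for donor nodes under run-time fluctuations.

```python
# Minimal sketch of the memory-collaboration decision: a node under memory
# pressure borrows free pages from a donor node before falling back to
# disk swap, since remote RAM is far faster than disk.

LOW_WATERMARK = 0.10  # a donor must keep at least 10% of its memory free

def pick_donor(nodes, pages_needed):
    """Return the donor with the most free memory that can safely spare pages."""
    viable = [n for n in nodes
              if n["free"] - pages_needed > LOW_WATERMARK * n["total"]]
    if not viable:
        return None   # no safe donor: the borrower must swap to disk
    return max(viable, key=lambda n: n["free"])

nodes = [
    {"name": "A", "total": 1000, "free": 50},   # itself under pressure
    {"name": "B", "total": 1000, "free": 600},
    {"name": "C", "total": 1000, "free": 300},
]
```

The watermark keeps a donation from pushing the donor itself into memory pressure, which is the "no perceivable impact on nodes that provide memory" property the abstract reports.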


2016 IEEE NetSoft Conference and Workshops (NetSoft) | 2016

Exploiting integrated GPUs for network packet processing workloads

Janet Tseng; Ren Wang; James Tsai; Saikrishna Edupuganti; Alexander W. Min; Shinae Woo; Stephen Junkins; Tsung-Yuan Charlie Tai

Software-based network packet processing on standard high volume servers promises better flexibility, manageability and scalability, and has thus gained tremendous momentum in recent years. Numerous research efforts have focused on boosting packet processing performance by offloading to discrete Graphics Processing Units (GPUs). While integrated GPUs, residing on the same die as the CPU, offer many advanced features such as on-chip CPU-GPU interconnect communication and shared physical/virtual memory, their applicability to packet processing workloads has not been fully understood and exploited. In this paper, we conduct in-depth profiling and analysis to understand the integrated GPU's capabilities and performance potential for packet processing workloads. Based on that understanding, we introduce a GPU-accelerated network packet processing framework that fully utilizes the integrated GPU's massive parallel processing capability without the need for large packet batches, which can cause significant processing delay. We implemented the proposed framework and evaluated its performance with several common, light-weight packet processing workloads on the Intel® Xeon® Processor E3-1200 v4 product family (codename Broadwell) with an integrated GT3e GPU. The results show that our GPU-accelerated packet processing framework improves throughput by 2-2.5x, compared to optimized packet processing on the CPU only.
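
The batching trade-off the abstract points at can be modeled very roughly as follows; all numbers are illustrative, not measurements from the paper. A discrete GPU must amortize a large fixed launch-plus-PCIe-transfer cost over big batches, while an integrated GPU's shared memory shrinks that fixed cost, so small batches stay efficient and latency stays low.

```python
# Back-of-the-envelope model of the GPU batching trade-off.

def per_packet_cost_us(batch_size, fixed_overhead_us, per_packet_us):
    """Fixed per-batch overhead amortized across the batch."""
    return fixed_overhead_us / batch_size + per_packet_us

def batching_delay_us(batch_size, arrival_rate_pps):
    """Time the first packet of a batch waits for the batch to fill."""
    return (batch_size - 1) / arrival_rate_pps * 1e6
```

With an assumed fixed overhead of 100 µs (discrete, PCIe) versus 5 µs (integrated, shared memory), a 32-packet batch is already efficient on the integrated GPU, while the discrete GPU needs batches of roughly a thousand packets and thus adds on the order of a millisecond of queueing delay at 1 Mpps.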


International Conference on Communications | 2014

Joint optimization of DVFS and low-power sleep-state selection for mobile platforms

Alexander W. Min; Ren Wang; James Tsai; Tsung-Yuan Charlie Tai

To provide the ultimate mobile user experience, extended battery life is critical to small form-factor mobile platforms such as smartphones and tablets. Dynamic voltage and frequency scaling (DVFS) and low-power CPU/platform sleep states are commonly used power management features, as they allow power and performance to be controlled dynamically according to the time-varying needs of workloads. Despite the potential power saving benefit of synergistically integrating DVFS and sleep-state selection, it is challenging to optimize them jointly for mobile workloads (e.g., video streaming), and most existing work considers them only individually. To address this problem, we study the joint optimization of CPU frequency (a.k.a. CPU P-states) and CPU/platform sleep-state selection to reduce energy consumption in mobile platforms. This joint optimization becomes feasible with advanced power management techniques and power-aware software development methodologies that regulate (e.g., coalesce/align) system activities, making workload characteristics and system idle durations more deterministic and predictable. We then analyze the optimal operating state that minimizes the expected platform energy consumption based on workload characteristics, and present an algorithm to adapt to it at run time. Our evaluation results on mobile workloads show that the proposed scheme can reduce system power consumption by up to 24%, compared to the conventional CPU-utilization-based approach, which seeks mainly to minimize processor energy.
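
The coupling between the two knobs can be sketched as an exhaustive search over a tiny hypothetical state space; the frequencies, power figures, and entry costs below are made up, and the paper's run-time adaptation is omitted.

```python
import itertools

# Hypothetical sketch of the joint DVFS + sleep-state search: for a
# periodic workload (one video frame per period), the chosen CPU frequency
# determines how long the busy phase lasts, which in turn determines how
# much idle time remains for a sleep state.

PERIOD_MS = 33.3    # one frame at ~30 fps
WORK_CYCLES = 10e6  # CPU cycles needed per frame (assumed)

P_STATES = {1.0e9: 300.0, 2.0e9: 900.0}                # Hz -> active mW
SLEEP_STATES = {"C1": (200.0, 0.1), "C6": (5.0, 2.0)}  # name -> (mW, entry mJ)

def energy_mj(freq_hz, sleep_state):
    busy_ms = WORK_CYCLES / freq_hz * 1000.0
    idle_ms = PERIOD_MS - busy_ms
    if idle_ms < 0:
        return float("inf")  # frequency too low to finish within the period
    sleep_mw, entry_mj = SLEEP_STATES[sleep_state]
    return (P_STATES[freq_hz] * busy_ms + sleep_mw * idle_ms) / 1000.0 + entry_mj

def best_operating_point():
    """Exhaustively pick the (frequency, sleep state) pair with minimal energy."""
    return min(itertools.product(P_STATES, SLEEP_STATES),
               key=lambda fs: energy_mj(*fs))
```

With these numbers the optimum is the lower frequency paired with the deep sleep state; changing the sleep-state entry cost or the period shifts the optimum, which is why optimizing either knob alone, as most prior work does, leaves energy on the table.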


International Conference on Future Energy Systems | 2012

DirectPath: high performance and energy efficient platform I/O architecture for content intensive usages

Ren Wang; Christian Maciocco; Tsung-Yuan Charlie Tai; Raj Yavatkar; Lucas Kecheng Lu; Alexander W. Min

With the widespread development of cloud computing and high speed communications, end users store or retrieve video, music, photos and other content over the cloud or the local network for video-on-demand, wireless display and other usages. The traditional I/O model in a mobile platform consumes time and resources due to excessive memory access and copying when transferring content from a source device, e.g., a network controller, to a destination device, e.g., a hard disk. This model introduces unnecessary overhead and latency, negatively impacting the performance and energy consumption of content-intensive usages. In this paper, we introduce DirectPath, a low overhead I/O architecture that optimizes content movement within a platform to improve energy efficiency and throughput performance. We design, implement and validate the DirectPath architecture for a network-to-storage file download usage model. We evaluate and quantify DirectPath's energy and performance benefits on both laptop and small form-factor SoC-based platforms. The measurement results show that DirectPath reduces energy consumption by up to 50% and improves throughput performance by up to 137%.
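
The data-movement saving can be illustrated with a toy accounting; the hop lists below are illustrative stand-ins, not the platform's exact transfer paths.

```python
# Toy accounting of the data movement a DirectPath-style device-to-device
# transfer avoids: in the traditional model every byte crosses the
# interconnect once per hop (device to DRAM, DRAM to CPU for the copy,
# and back out to the destination device).

TRADITIONAL_PATH = ["nic->dram", "dram->cpu", "cpu->dram", "dram->disk"]
DIRECT_PATH = ["nic->disk"]

def bytes_moved(path, payload_bytes):
    """Total interconnect traffic: the payload crosses once per hop."""
    return len(path) * payload_bytes
```

Cutting the hop count directly cuts memory bandwidth, CPU copy time, and the energy both consume, which is the mechanism behind the reported savings.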


ACM Special Interest Group on Data Communication | 2017

Accelerating Open vSwitch with Integrated GPU

Janet Tseng; Ren Wang; James Tsai; Yipeng Wang; Tsung-Yuan Charlie Tai

With the fast development of Software Defined Networking (SDN) and network virtualization, software-based virtual network switches have emerged as a critical component for providing network services to VMs. Among virtual switches, Open vSwitch (OvS) is a commonly used and well-studied open source implementation. Using the Data Plane Development Kit (DPDK) with OvS to bypass the OS kernel and process packets in userspace provides tremendous performance benefits on general purpose platforms. Integrated GPUs, which reside on the same die as the CPU on general purpose platforms and offer many advanced features such as on-chip CPU-GPU interconnect communication and shared physical/virtual memory, have become a promising additional compute resource for further accelerating OvS processing. In this paper, we design and implement an inline GPU-assisted OvS architecture that offloads the expensive tuple space search to the GPU and balances switching processing between the CPU and GPU. We evaluated the performance on an Intel® Xeon® processor of the E3-1575M v5 product family (codename Skylake) with an integrated GT4e GPU. The results show that our proposed architecture improves OvS throughput by 3x, compared to the optimized CPU-only OvS-DPDK implementation.
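
The tuple space search being offloaded works roughly as follows; this pure-Python stand-in uses hypothetical two-field rules, while real OvS keys span many header fields.

```python
# Sketch of tuple space search: rules are grouped by wildcard mask, and a
# lookup masks the packet key once per group and probes that group's hash
# table, so cost grows with the number of distinct masks. This sequential
# scan over mask groups is what makes the search expensive on the CPU and
# a good fit for the GPU's parallelism.

def tss_lookup(tuple_space, key):
    """key is a tuple of header fields; a 0 in a mask wildcards that field."""
    for mask, table in tuple_space:
        masked = tuple(field if m else None for field, m in zip(key, mask))
        if masked in table:
            return table[masked]
    return "default"

# Two mask groups: exact (src, port) match, and src-only match.
tuple_space = [
    ((1, 1), {("10.0.0.1", 80): "fwd:p1"}),
    ((1, 0), {("10.0.0.2", None): "drop"}),
]
```

Because each mask group is independent, the groups (or batches of keys) can be searched in parallel, which is what the GPU offload exploits.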


Cluster Computing and the Grid | 2012

Evaluating Dynamics and Bottlenecks of Memory Collaboration in Cluster Systems

Ahmad Samih; Ren Wang; Christian Maciocco; Tsung-Yuan Charlie Tai; Ronghui Duan; Jiangang Duan; Yan Solihin

With the fast development of highly-integrated distributed systems (cluster systems), designers face interesting memory hierarchy design choices while attempting to avoid the notorious disk swapping. Swapping to free remote memory through Memory Collaboration has demonstrated its cost-effectiveness compared to overprovisioning the cluster for peak load requirements. Recent memory collaboration studies propose several ways of accessing under-utilized remote memory in static system configurations, without detailed exploration of dynamic memory collaboration. Dynamic collaboration is an important aspect given the run-time memory usage fluctuations in clustered systems. Further, as interest in memory collaboration grows, it is crucial to understand the existing performance bottlenecks, overheads, and potential optimizations. In this paper we address these two issues. First, we propose an Autonomous Collaborative Memory System (ACMS) that manages memory resources dynamically at run time to optimize performance. We implement a prototype realizing the proposed ACMS, experiment with a wide range of real-world applications, and show up to 3× performance speedup compared to a non-collaborative memory system, without perceivable performance impact on nodes that provide memory. Second, we analyze, in depth, the end-to-end memory collaboration overhead and pinpoint the corresponding bottlenecks.


ACM Special Interest Group on Data Communication | 2018

Hash Table Design and Optimization for Software Virtual Switches

Yipeng Wang; Sameh Gobriel; Ren Wang; Tsung-Yuan Charlie Tai; Cristian Florin Dumitrescu

Flow classification is a common first step in various virtual network functions (VNFs), Software Defined Networking (SDN) applications, and network infrastructure components including virtual switches and routers. Software flow classification often employs hash-table-based lookup mechanisms, where a key constructed from an input packet is looked up against the rules stored in the table and the corresponding action (e.g., forward, encapsulate, etc.) is retrieved. In this paper we analyze, in depth, various hash table design options and optimizations used in state-of-the-art virtual switches, and how hardware resources impact their performance. Based on this understanding, we summarize the pros and cons of the different designs and provide insights toward further optimizations. The understanding gained through our analysis also sheds light on how to design optimal hash tables for flow classification for various use cases.
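
One representative design point is the two-choice, bucketized table used in the spirit of DPDK-style cuckoo hashing; the bucket sizing, hashing, and the absence of cuckoo eviction below are simplifications for illustration, not the DPDK implementation.

```python
# Minimal two-choice, bucketized hash table for flow -> action lookup.
# Each key has exactly two candidate buckets, so a lookup costs at most
# two bucket probes regardless of table occupancy; buckets hold a few
# entries so that a probe fits in roughly one cache line.

NUM_BUCKETS = 8
BUCKET_SLOTS = 4

table = [[] for _ in range(NUM_BUCKETS)]

def _bucket(key, which):
    """Two independent candidate buckets per key (which is 0 or 1)."""
    return hash((which, key)) % NUM_BUCKETS

def insert(key, action):
    for which in (0, 1):
        b = table[_bucket(key, which)]
        if len(b) < BUCKET_SLOTS:
            b.append((key, action))
            return True
    return False  # both buckets full: a real table would evict (cuckoo)

def lookup(key):
    for which in (0, 1):
        for k, action in table[_bucket(key, which)]:
            if k == key:
                return action
    return None   # miss: fall back to the slow path / default action
```

The bounded probe count is what makes this family of designs attractive for per-packet lookup budgets, at the cost of more complex insertion.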


Archive | 2012

Energy Efficiency of Connected Mobile Platforms in Presence of Background Traffic

Sameh Gobriel; Christian Maciocco; Tsung-Yuan Charlie Tai

In the last decade there has been explosive growth in the popularity of mobile computing platforms, including laptops, notebooks, tablets, cell phones, etc. However, the usability of these devices from the end users' point of view is directly associated with their battery life, and the fact that they are powered by non-continuous energy sources imposes serious limitations on these devices. As a result, a typical architecture design for such mobile computing platforms includes low power states for individual components (e.g., CPU, memory controller, hard disk, etc.) or for the whole platform. These low power states can be sleep states, where the component or the whole platform is in a non-operational mode (e.g., operating-system-defined sleep states: standby, hibernate, etc.), or scaled states, where they operate at lower than peak performance (e.g., CPU dynamic voltage and frequency scaling (DVFS)). For example, the Advanced Configuration and Power Interface (ACPI) specifies power management concepts and interfaces. ACPI integrates the operating system, device drivers, system hardware components, and applications for power management, and defines several power states for each component, ranging from fully powered on to fully powered off, with each successive state consuming the same or less power.
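
The ordering property of ACPI-style component power states can be encoded as a small check; the state names follow the ACPI device D-state convention, but the milliwatt figures are made up for illustration.

```python
# Illustrative ACPI-style device power states: the ordering property
# described above (each successively deeper state consumes the same or
# less power) is what the check below encodes.

D_STATES_MW = {"D0": 1500.0, "D1": 900.0, "D2": 900.0, "D3": 0.0}

def ordering_ok(states):
    """True if power is non-increasing from the shallowest to deepest state."""
    powers = list(states.values())
    return all(a >= b for a, b in zip(powers, powers[1:]))
```

Deeper states trade longer entry/exit latency for lower power, which is why background traffic, by forcing frequent exits from deep states, erodes the energy savings this chapter examines.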

