Network


Latest external collaborations at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Chueh-Hung Wu is active.

Publication


Featured research published by Chueh-Hung Wu.


Architectural Support for Programming Languages and Operating Systems (ASPLOS) | 2014

DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning

Tianshi Chen; Zidong Du; Ninghui Sun; Jia Wang; Chueh-Hung Wu; Yunji Chen; Olivier Temam

Machine-Learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve towards heterogeneous multi-cores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy. We show that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s (key NN operations such as synaptic weight multiplications and neuron output additions) in a small footprint of 3.02 mm² and 485 mW; compared to a 128-bit 2 GHz SIMD processor, the accelerator is 117.87x faster, and it can reduce the total energy by 21.08x. The accelerator characteristics are obtained after layout at 65 nm. Such a high throughput in a small footprint can open up the usage of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.
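The "key NN operations" counted in the 452 GOP/s figure (synaptic weight multiplications and neuron output additions) are the multiply-accumulate at the heart of every layer. A minimal functional sketch, with illustrative function names and values (this models the arithmetic, not the accelerator's datapath):

```python
def neuron_layer(inputs, weights, biases):
    """Compute one fully connected layer: each output neuron
    accumulates its bias plus the weighted sum of its inputs."""
    outputs = []
    for w_row, b in zip(weights, biases):
        acc = b
        for x, w in zip(inputs, w_row):
            acc += x * w          # synaptic weight multiplication
        outputs.append(acc)       # neuron output accumulation
    return outputs

# Example: 2 output neurons, 3 inputs (values illustrative)
y = neuron_layer([1.0, 2.0, 3.0],
                 [[0.1, 0.2, 0.3], [0.0, 1.0, 0.0]],
                 [0.0, 0.5])
```

An accelerator's throughput figure essentially counts how many of these multiply and add operations it completes per second.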


Radiology | 2011

Sonoelastography of the Plantar Fascia

Chueh-Hung Wu; Ke-Vin Chang; Sun Mio; Wen-Shiang Chen; Tyng-Guey Wang

PURPOSE To compare the stiffness of the plantar fascia by using sonoelastography in healthy subjects of different ages, as well as patients with plantar fasciitis. MATERIALS AND METHODS The study protocol was approved by the Research Ethics Committee of the hospital, and all of the subjects gave their informed consent. Bilateral feet of 40 healthy subjects and 13 subjects with plantar fasciitis (fasciitis group) were examined by using color-coded sonoelastography. Healthy subjects were divided into younger (18-50 years) and older (> 50 years) groups. The color scheme was red (hard), green (medium stiffness), and blue (soft). The color histogram was subsequently analyzed. Each pixel of the image was separated into red, green, and blue components (color intensity range, 0-255). The color histogram then computed the mean intensity of each color component of the pixels within a standardized area. A mixed model for repeated measurements was used for comparison of the plantar fascia thickness and the intensity of the color components on sonoelastogram. RESULTS Quantitative analysis of the color histogram revealed a significantly greater intensity of blue in older healthy subjects than in younger subjects (94.5 ± 5.6 [± standard deviation] vs 90.0 ± 4.6, P = .002). The intensity of red and green was the same between younger and older healthy subjects (P = .68 and .12). The intensity of red was significantly greater in older healthy subjects than in the fasciitis group (147.8 ± 10.3 vs 133.7 ± 13.4, P < .001). The intensity of green and blue was the same between older healthy subjects and those in the fasciitis group (P = .33 and .71). CONCLUSION Sonoelastography revealed that the plantar fascia softens with age and in subjects with plantar fasciitis.
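The color-histogram step described above (separating each pixel into red, green, and blue components and averaging each component over a standardized area) can be sketched as follows; the pixel values and function name are illustrative:

```python
def mean_rgb_intensity(pixels):
    """Mean intensity (0-255) of the red, green, and blue components
    over all pixels in a region of interest, given as (r, g, b) tuples."""
    n = len(pixels)
    r = sum(p[0] for p in pixels) / n
    g = sum(p[1] for p in pixels) / n
    b = sum(p[2] for p in pixels) / n
    return r, g, b

# Illustrative region of interest: four pixels
roi = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)]
r_mean, g_mean, b_mean = mean_rgb_intensity(roi)
```

Comparing these per-component means between groups is what turns the qualitative color map (red = hard, blue = soft) into the quantitative stiffness comparison reported in the results.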


Programming Language Design and Implementation (PLDI) | 2010

Evaluating iterative optimization across 1000 datasets

Yang Chen; Yuanjie Huang; Lieven Eeckhout; Grigori Fursin; Liang Peng; Olivier Temam; Chueh-Hung Wu

While iterative optimization has become a popular compiler optimization approach, it is based on a premise which has never been truly evaluated: that it is possible to learn the best compiler optimizations across data sets. Up to now, most iterative optimization studies find the best optimizations through repeated runs on the same data set. Only a handful of studies have attempted to exercise iterative optimization on a few tens of data sets. In this paper, we truly put iterative compilation to the test for the first time by evaluating its effectiveness across a large number of data sets. We therefore compose KDataSets, a data set suite with 1000 data sets for 32 programs, which we release to the public. We characterize the diversity of KDataSets, and subsequently use it to evaluate iterative optimization. We demonstrate that it is possible to derive a robust iterative optimization strategy across data sets: for all 32 programs, we find that there exists at least one combination of compiler optimizations that achieves 86% or more of the best possible speedup across all data sets using Intel's ICC (83% for GNU's GCC). This optimal combination is program-specific and yields speedups up to 1.71 on ICC and 2.23 on GCC over the highest optimization level (-fast and -O3, respectively). This finding makes the task of optimizing programs across data sets much easier than previously anticipated, and it paves the way for the practical and reliable usage of iterative optimization. Finally, we derive pre-shipping and post-shipping optimization strategies for software vendors.
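The robustness criterion described above (one flag combination that achieves a high fraction of the per-dataset best speedup on every dataset) can be sketched as a search over a speedup matrix. The flag names and numbers below are hypothetical, not KDataSets results:

```python
def robust_combination(speedups):
    """speedups[c][d] = speedup of optimization combination c on dataset d.
    Return the combination whose worst-case fraction of the per-dataset
    best speedup is highest (robust across datasets)."""
    n_datasets = len(next(iter(speedups.values())))
    best_per_dataset = [max(s[d] for s in speedups.values())
                        for d in range(n_datasets)]

    def worst_fraction(c):
        return min(speedups[c][d] / best_per_dataset[d]
                   for d in range(n_datasets))

    return max(speedups, key=worst_fraction)

# Hypothetical speedups of three flag combinations on three datasets
combos = {"-O3":                 [1.0, 1.0, 1.0],
          "-O3 -funroll-loops":  [1.4, 0.9, 1.2],
          "-O3 -fvect":          [1.2, 1.1, 1.15]}
best = robust_combination(combos)
```

Note that the combination best on average can differ from the robust one: here "-O3 -funroll-loops" wins on one dataset but degrades another, so the steadier combination is preferred.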


Asia and South Pacific Design Automation Conference (ASP-DAC) | 2014

Leveraging the error resilience of machine-learning applications for designing highly energy efficient accelerators

Zidong Du; Avinash Lingamneni; Yunji Chen; Krishna V. Palem; Olivier Temam; Chueh-Hung Wu

In recent years, inexact computing has been increasingly regarded as one of the most promising approaches for reducing energy consumption in many applications that can tolerate a degree of inaccuracy. Driven by the principle of trading tolerable amounts of application accuracy in return for significant resource savings - the energy consumed, the (critical path) delay and the (silicon) area being the resources - this approach has been limited to certain application domains. In this paper, we propose to expand the application scope, error tolerance as well as the energy savings of inexact computing systems through neural network architectures. Such neural networks are fast emerging as popular candidate accelerators for future heterogeneous multi-core platforms, and have flexible error tolerance limits owing to their ability to be trained. Our results based on simulated 65 nm technology designs demonstrate that the proposed inexact neural network accelerator could achieve 43.91%-62.49% savings in energy consumption (with corresponding delay and area savings being 18.79% and 31.44% respectively) when compared to an existing baseline neural network implementation, at the cost of an accuracy loss (quantified as the mean square error (MSE), which increases from 0.14 to 0.20 on average).
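The accuracy metric used above, mean square error between exact and inexact outputs, can be illustrated with a crude functional model in which inexactness is mimicked by coarser rounding; the step size and values are illustrative, not the paper's circuit-level technique:

```python
def quantize(x, step):
    """Round to a coarser grid, a crude stand-in for the reduced
    precision of an inexact arithmetic unit."""
    return round(x / step) * step

def mse(exact, approx):
    """Mean square error between two equal-length output vectors."""
    return sum((a - b) ** 2 for a, b in zip(exact, approx)) / len(exact)

exact = [0.12, 0.47, 0.88, 0.33]            # hypothetical exact outputs
approx = [quantize(v, 0.25) for v in exact]  # inexact (coarsened) outputs
error = mse(exact, approx)
```

The design question the paper studies is then how much this error can be allowed to grow (helped by the network's trainability) in exchange for energy, delay, and area savings.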


Field-Programmable Gate Arrays (FPGA) | 2013

Elastic CGRAs

Yuanjie Huang; Paolo Ienne; Olivier Temam; Yunji Chen; Chueh-Hung Wu

Vital technology trends such as voltage scaling and homogeneous multicore scaling have reached their limits and architects turn to alternate computing paradigms, such as heterogeneous and domain-specialized solutions. Coarse-Grain Reconfigurable Arrays (CGRAs) promise the performance of massively spatial computing while offering interesting trade-offs of flexibility versus energy efficiency. Yet, configuring and scheduling execution for CGRAs generally runs into the classic difficulties that have hampered Very-Long Instruction Word (VLIW) architectures: efficient schedules are difficult to generate, especially for applications with complex control flow and data structures, and they are inherently static, and thus ill-adapted to variable-latency components (such as the read ports of caches). Over the years, VLIWs have been relegated to important but specific application domains where such issues are more under the control of the designers; similarly, statically-scheduled CGRAs may prove inadequate for future general-purpose computing systems. In this paper, we introduce Elastic CGRAs, the superscalar processors of computing fabrics: no complex schedule needs to be computed at configuration time, and the operations execute dynamically in the CGRA when data are ready, thus exploiting the data parallelism that an application offers. We designed, down to a manufacturable layout, a simple CGRA where we demonstrated and optimized our elastic control circuitry. We also built a complete compilation toolchain that transforms arbitrary C code into a configuration for the array. The area overhead (26.2%), critical path overhead (8.2%) and energy overhead (53.6%) of Elastic CGRAs over non-elastic CGRAs are significantly lower than the overhead of superscalar processors over VLIWs, while providing the same benefits. At such moderate costs, elasticity may prove to be one of the key enablers to make the adoption of CGRAs widespread.
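The elastic execution model (operations fire dynamically as soon as their input data are ready, with no precomputed static schedule) can be illustrated by token-driven evaluation of a small dataflow graph. This is a software analogy of the firing rule, not the paper's control circuitry:

```python
from collections import deque

def elastic_eval(graph, inputs):
    """Token-driven evaluation of an acyclic dataflow graph: a node fires
    as soon as all of its operands carry data, in no fixed order.
    graph: node -> (op, [operand nodes]); inputs: leaf node -> value."""
    values = dict(inputs)
    pending = deque(graph)
    while pending:
        node = pending.popleft()
        if node in values:
            continue
        op, args = graph[node]
        if all(a in values for a in args):   # all input tokens ready?
            values[node] = op(*[values[a] for a in args])
        else:
            pending.append(node)             # data not ready: retry later
    return values

# (x + y) * z, with operand arrival order irrelevant
g = {"add": (lambda a, b: a + b, ["x", "y"]),
     "mul": (lambda a, b: a * b, ["add", "z"])}
out = elastic_eval(g, {"x": 2, "y": 3, "z": 4})
```

Because firing depends only on data availability, a variable-latency producer simply delays its consumers rather than invalidating a schedule, which is the property static VLIW-style scheduling lacks.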


International Symposium on Computer Architecture (ISCA) | 2014

Going vertical in memory management: handling multiplicity by multi-policy

Lei Liu; Yong Li; Zehan Cui; Yungang Bao; Mingyu Chen; Chueh-Hung Wu

Many emerging applications from various domains often exhibit heterogeneous memory characteristics. When running in combination on parallel platforms, these applications present a daunting variety of workload behaviors that challenge the effectiveness of any memory allocation strategy. Prior partitioning-based or random memory allocation schemes typically manage only one level of the memory hierarchy and often target specific workloads. To handle diverse and dynamically changing memory and cache allocation needs, we augment existing “horizontal” cache/DRAM bank partitioning with vertical partitioning and explore the resulting multi-policy space. We study the performance of these policies for over 2000 workloads and correlate the results with application characteristics via a data mining approach. Based on this correlation we derive several practical memory allocation rules that we integrate into a unified multi-policy framework to guide resource partitioning and coalescing for dynamic and diverse multi-programmed/threaded workloads. We implement our approach in Linux kernel 2.6.32 as a restructured page indexing system plus a series of kernel modules. Extensive experiments show that, in practice, our framework can select the proper memory allocation policy and consistently outperforms the unmodified Linux kernel, achieving up to 11% performance gains compared to prior techniques.
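Horizontal cache/DRAM-bank partitioning of the kind extended here typically works by grouping physical pages according to the address bits that index the cache set ("page color") and the DRAM bank. A sketch with illustrative bit positions, not the paper's actual mapping:

```python
def page_color(phys_addr, cache_color_bits=(12, 15), bank_bits=(16, 18)):
    """Extract a (hypothetical) cache-color index and DRAM-bank index
    from a physical address. Partitioning policies assign pages to
    applications by these indices; the bit ranges here are illustrative,
    not a specific machine's mapping."""
    def field(addr, lo, hi):
        return (addr >> lo) & ((1 << (hi - lo + 1)) - 1)
    return field(phys_addr, *cache_color_bits), field(phys_addr, *bank_bits)

color, bank = page_color(0x0005_3000)
```

A "vertical" policy, in this framing, coordinates which indices an application receives at both levels at once, instead of partitioning the cache and the banks independently.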


ACM Transactions on Architecture and Code Optimization | 2012

Deconstructing iterative optimization

Yang Chen; Shuangde Fang; Yuanjie Huang; Lieven Eeckhout; Grigori Fursin; Olivier Temam; Chueh-Hung Wu

Iterative optimization is a popular compiler optimization approach that has been studied extensively over the past decade. In this article, we deconstruct iterative optimization by evaluating whether it works across datasets and by analyzing why it works. Up to now, most iterative optimization studies are based on a premise which was never truly evaluated: that it is possible to learn the best compiler optimizations across datasets. In this article, we evaluate this question for the first time with a very large number of datasets. We therefore compose KDataSets, a dataset suite with 1000 datasets for 32 programs, which we release to the public. We characterize the diversity of KDataSets, and subsequently use it to evaluate iterative optimization. For all 32 programs, we find that there exists at least one combination of compiler optimizations that achieves at least 83% of the best possible speedup across all datasets on two widely used compilers (Intel's ICC and GNU's GCC). This optimal combination is program-specific and yields speedups up to 3.75× (averaged across datasets of a program) over the highest optimization level of the compilers (-O3 for GCC and -fast for ICC). This finding suggests that optimizing programs across datasets might be much easier than previously anticipated. In addition, we evaluate the idea of introducing compiler choice as part of iterative optimization. We find that it can further improve the performance of iterative optimization because different programs favor different compilers. We also investigate why iterative optimization works by analyzing the optimal combinations. We find that only a handful of optimizations yield most of the speedup. Finally, we show that optimizations interact in a complex and sometimes counterintuitive way through two case studies, which confirms that iterative optimization is an irreplaceable and important compiler strategy.


Archives of Physical Medicine and Rehabilitation | 2010

Evaluating Displacement of the Coracoacromial Ligament in Painful Shoulders of Overhead Athletes Through Dynamic Ultrasonographic Examination

Chueh-Hung Wu; Yi-Chiang Wang; Hsing-Kuo Wang; Wen-Shiang Chen; Tyng-Guey Wang

OBJECTIVE To evaluate displacement of the coracoacromial ligament (CAL), using dynamic ultrasonography (US), for detecting instability-related impingement caused by overhead activities. DESIGN Between-group survey. SETTING Department of Physical Medicine and Rehabilitation in a tertiary care center. PARTICIPANTS Volunteer high school volleyball players with unilateral shoulder pain (n=10) and volunteer asymptomatic high school volleyball players with identical training activities as control subjects (n=16). INTERVENTIONS Not applicable. MAIN OUTCOME MEASURE The displacement of the CAL was measured during throwing simulation using dynamic US. Both shoulders of all subjects were evaluated. RESULTS During throwing simulation, the displacement of the CAL in the painful shoulders of overhead athletes was significantly greater than that in the asymptomatic shoulders (3.0 ± 0.7 mm and 2.2 ± 0.4 mm, respectively; P=.017). No difference was identified between the displacements of the CALs of bilateral shoulders of the control group subjects. CONCLUSIONS Dynamic US, by measuring the displacement of the CAL during simulation of throwing, may be helpful in detecting abnormal humeral head upward migration in overhead athletes.


Radiology | 2016

Elasticity of the Coracohumeral Ligament in Patients with Adhesive Capsulitis of the Shoulder

Chueh-Hung Wu; Wen-Shiang Chen; Tyng-Guey Wang

PURPOSE To evaluate the elasticity of the coracohumeral ligament (CHL) in healthy individuals and patients with clinical findings suggestive of unilaterally involved adhesive capsulitis of the shoulder (ACS). MATERIALS AND METHODS The institutional review board approved this single-institution prospective study, which was performed between November 15, 2012, and July 8, 2014. Informed consent was obtained from all subjects. Measurement of CHL thickness was performed in the axial oblique plane under shoulder maximal external rotation. Shear-wave elastography (SWE) was used to evaluate elasticity of the CHL in healthy individuals (11 men, 19 women aged 22-62 years) and those with clinical findings suggestive of ACS (nine men, 11 women aged 41-70 years). SWE was performed in the shoulder-neutral position and under maximal external rotation. The Wilcoxon signed-rank test was performed to compare the thickness and elastic modulus of the CHL between bilateral shoulders. RESULTS In all subjects, the CHL elastic modulus was larger under maximal external rotation than in the neutral position (P < .001 for all). For healthy subjects, there was no significant difference in the CHL elastic modulus between the dominant and nondominant shoulders. For patients presumed to have ACS, the CHL thickness was significantly greater in the symptomatic shoulder than in the unaffected shoulder (P < .001). The CHL elastic modulus of the symptomatic shoulder (median, 234.8 kPa; interquartile range [IQR], 174.4-256.7 kPa) was significantly greater than that of the unaffected shoulder (median, 203.3 kPa; IQR, 144.1-242.7 kPa) in the shoulder-neutral position (P = .004) but not under maximal external rotation (P = .123). When bilateral shoulders were maintained at the same angle of external rotation, the CHL elastic modulus was greater in the symptomatic shoulder than in the unaffected shoulder (P = .005). 
CONCLUSION In patients with clinical findings suggestive of ACS, SWE showed that the CHL is stiffer in the symptomatic shoulder than in the unaffected shoulder.
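The kPa figures above are elastic moduli estimated from shear-wave speed. Under the usual SWE assumptions (incompressible, isotropic, linear-elastic tissue), the conversion is E = 3ρc², with ρ the tissue density and c the measured shear-wave speed. A minimal sketch with a hypothetical speed value:

```python
def youngs_modulus_kpa(shear_wave_speed_mps, density_kg_m3=1000.0):
    """Young's modulus estimated from shear-wave speed via E = 3*rho*c^2,
    the standard SWE conversion for soft tissue (assumed incompressible,
    isotropic, linear-elastic); result converted from Pa to kPa."""
    return 3.0 * density_kg_m3 * shear_wave_speed_mps ** 2 / 1000.0

# Hypothetical shear-wave speed of 8.85 m/s in the ligament
e = youngs_modulus_kpa(8.85)
```

Because the modulus grows with the square of wave speed, modestly faster shear waves in a stiffened ligament translate into substantially larger kPa readings.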


International Symposium on Microarchitecture (MICRO) | 2015

Neuromorphic accelerators: a comparison between neuroscience and machine-learning approaches

Zidong Du; Daniel David Ben-Dayan Rubin; Yunji Chen; Liqiang He; Tianshi Chen; Lei Zhang; Chueh-Hung Wu; Olivier Temam

A vast array of devices, ranging from industrial robots to self-driven cars or smartphones, require increasingly sophisticated processing of real-world input data (image, voice, radio, …). Interestingly, hardware neural network accelerators are emerging again as attractive candidate architectures for such tasks. The neural network algorithms considered come from two, largely separate, domains: machine-learning and neuroscience. These neural networks have very different characteristics, so it is unclear which approach should be favored for hardware implementation. Yet, few studies compare them from a hardware perspective. We implement both types of networks down to the layout, and we compare the relative merit of each approach in terms of energy, speed, area cost, accuracy and functionality. Within the limit of our study (current SNN and machine learning NN algorithms, current best-effort hardware implementations, and workloads used in this study), our analysis helps dispel the notion that hardware neural network accelerators inspired by neuroscience, such as SNN+STDP, are currently a competitive alternative to hardware neural network accelerators inspired by machine-learning, such as MLP+BP: not only in terms of accuracy, but also in terms of hardware cost for realistic implementations, which is less expected. However, we also outline that SNN+STDP carry potential for reduced hardware cost compared to machine-learning networks at very large scales, if accuracy issues can be controlled (or for applications where they are less important). We also identify the key sources of inaccuracy of SNN+STDP, which are less related to the loss of information due to spike coding than to the nature of the STDP learning algorithm. Finally, we outline that for the category of applications which require permanent online learning and moderate accuracy, SNN+STDP hardware accelerators could be a very cost-efficient solution.
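The STDP learning rule named above adjusts a synaptic weight according to the relative timing of pre- and postsynaptic spikes. A standard pair-based form (parameter values illustrative):

```python
import math

def stdp_dw(dt_ms, a_plus=0.01, a_minus=0.012, tau_ms=20.0):
    """Pair-based STDP weight update: potentiate when the presynaptic
    spike precedes the postsynaptic one (dt = t_post - t_pre > 0),
    depress otherwise; magnitude decays exponentially with |dt|."""
    if dt_ms > 0:
        return a_plus * math.exp(-dt_ms / tau_ms)
    return -a_minus * math.exp(dt_ms / tau_ms)

ltp = stdp_dw(10.0)    # pre fires 10 ms before post -> potentiation
ltd = stdp_dw(-10.0)   # post fires before pre -> depression
```

This purely local, timing-driven update contrasts with backpropagation's global error signal, which is one root of the accuracy gap the comparison above reports.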

Collaboration


Dive into Chueh-Hung Wu's collaboration.

Top Co-Authors

Tyng-Guey Wang, National Taiwan University

Wen-Shiang Chen, National Taiwan University

Ming-Yen Hsiao, National Taiwan University

Yunji Chen, Chinese Academy of Sciences

Zidong Du, Chinese Academy of Sciences

Lei Liu, Chinese Academy of Sciences

Yuanjie Huang, Chinese Academy of Sciences