
Publication


Featured research published by Muhuan Huang.


field programmable gate arrays | 2014

Combining computation and communication optimizations in system synthesis for streaming applications

Jason Cong; Muhuan Huang; Peng Zhang

Data streaming is a widely used technique to exploit task-level parallelism in many application domains such as video processing, signal processing, and wireless communication. In this paper we propose an efficient system-level synthesis flow that maps streaming applications onto FPGAs while optimizing computation and communication simultaneously. The throughput of a streaming system is significantly affected not only by the performance and number of replicas of the computation kernels, but also by the buffer sizes allocated for communication between kernels. Module selection/replication and buffer size optimization were generally addressed separately in previous work. Our approach combines these optimizations in system scheduling, minimizing the area cost of both logic and memory under a required throughput constraint. We first propose an integer linear programming (ILP) solution to the combined problem that yields optimal quality of results. We then propose an iterative algorithm that achieves near-optimal quality of results while scaling significantly better on large and complex designs. The key contributions are a polynomial-time algorithm for exact schedulability checking and a polynomial-time algorithm that improves system performance through better module implementation and buffer size choices. Experimental results show that, compared to optimizing module selection/replication and buffer sizes separately, the combined scheme saves 62% area on average under the same performance requirements. Moreover, our heuristic runs 2-3 orders of magnitude faster than the optimal ILP solution, with less than 10% area overhead.
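The combined search space described above, where module implementation, replica count, and buffer depth all trade off against each other, can be sketched with a toy exhaustive search. This is not the paper's ILP or iterative algorithm; the two-kernel pipeline, all numbers, and the buffer-sizing rule are hypothetical stand-ins.

```python
# Hypothetical sketch: jointly choose module implementations, replica
# counts, and buffer sizes for a two-kernel streaming pipeline, minimizing
# total area under a throughput constraint. All numbers are illustrative.
from itertools import product

# Candidate implementations per kernel: (throughput per replica, logic area).
IMPLS = {
    "k1": [(1.0, 100), (2.0, 180)],
    "k2": [(0.5, 60), (1.5, 150)],
}
BUFFER_AREA_PER_SLOT = 4  # memory area cost per buffer slot (made up)

def buffer_slots_needed(t1, t2):
    """Assume the producer/consumer rate mismatch dictates the minimum
    buffer depth -- a crude stand-in for exact schedulability checking."""
    return max(2, int(2 * max(t1, t2) / min(t1, t2)))

def best_design(required_throughput, max_replicas=4):
    best = None
    for (i1, (tp1, a1)), (i2, (tp2, a2)) in product(
            enumerate(IMPLS["k1"]), enumerate(IMPLS["k2"])):
        for r1, r2 in product(range(1, max_replicas + 1), repeat=2):
            t1, t2 = tp1 * r1, tp2 * r2
            if min(t1, t2) < required_throughput:
                continue  # pipeline throughput is the slowest stage
            slots = buffer_slots_needed(t1, t2)
            area = a1 * r1 + a2 * r2 + slots * BUFFER_AREA_PER_SLOT
            if best is None or area < best[0]:
                best = (area, {"k1": (i1, r1), "k2": (i2, r2),
                               "buffer_slots": slots})
    return best

print(best_design(required_throughput=2.0))
```

The point of the joint search is visible even at this scale: the cheapest feasible design may pair a larger kernel implementation with fewer replicas because that combination needs a smaller buffer.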


international conference on computer design | 2013

Accelerator-rich CMPs: From concept to real hardware

Yu-Ting Chen; Jason Cong; Mohammad Ali Ghodrat; Muhuan Huang; Chunyue Liu; Bingjun Xiao; Yi Zou

Application-specific accelerators provide a 10-100× improvement in power efficiency over general-purpose processors, and accelerator-rich architectures are especially promising. This work discusses a prototype of accelerator-rich CMPs (PARC). During our development of PARC in real hardware, we encountered a set of technical challenges and proposed corresponding solutions. First, we provided system IPs that serve a sea of accelerators, transferring data between user space and accelerator memories without cache overhead. Second, we designed a dedicated interconnect between accelerators and memories to enable memory sharing. Third, we implemented an accelerator manager that virtualizes accelerator resources for users. Finally, we developed an automated flow with a number of IP templates and customizable interfaces to a C-based synthesis flow, enabling rapid design and update of PARC. We implemented PARC on a Virtex-6 FPGA with integrated platform-specific peripherals, booting unmodified Linux. Experimental results show that PARC can fully exploit the energy benefits of accelerators with little system overhead.
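The accelerator-manager idea, a thin layer that virtualizes a fixed pool of physical accelerators across many user tasks, can be illustrated with a small sketch. The class, its allocation policy, and the "denoise" accelerator type are hypothetical and not PARC's actual hardware interface.

```python
# Hypothetical sketch of an accelerator manager: grant a free physical
# accelerator on request, queue tasks when all units of a kind are busy,
# and hand released units straight to the next waiting task.
from collections import deque

class AcceleratorManager:
    def __init__(self, pool):
        self.free = {kind: deque(ids) for kind, ids in pool.items()}
        self.waiting = {kind: deque() for kind in pool}

    def request(self, task, kind):
        """Grant a physical accelerator if one is free, else queue the task."""
        if self.free[kind]:
            return self.free[kind].popleft()
        self.waiting[kind].append(task)
        return None  # caller blocks or falls back to software

    def release(self, kind, acc_id):
        """Return an accelerator; pass it to a waiting task if any."""
        if self.waiting[kind]:
            task = self.waiting[kind].popleft()
            return (task, acc_id)  # next task now owns acc_id
        self.free[kind].append(acc_id)
        return None

mgr = AcceleratorManager({"denoise": [0, 1]})
print(mgr.request("t1", "denoise"))  # grants accelerator 0
```

The design choice this mirrors is that user tasks never name a physical accelerator, only a kind, so the manager is free to multiplex the limited units.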


design, automation, and test in europe | 2012

Combining module selection and replication for throughput-driven streaming programs

Jason Cong; Muhuan Huang; Bin Liu; Peng Zhang; Yi Zou

Stream processing is widely adopted in many data-intensive applications across various domains. FPGAs are commonly used to realize these applications since they can exploit the inherent data parallelism and pipelining in such applications to achieve better performance. In this paper we investigate the design space exploration (DSE) problem of mapping streaming applications onto FPGAs. Previous work narrowly focuses on using either replication or module selection to meet the throughput target. We propose to combine these two techniques to guide the design space exploration, and present a formal formulation and solution to the combined problem. Our objective is to minimize the total area cost subject to a throughput constraint. In particular, we are able to handle feedback loops in streaming programs, which, to the best of our knowledge, has never been discussed in previous work. Our methodology is evaluated with high-level synthesis tools, and we demonstrate our workflow on a set of benchmarks ranging from kernel modules such as FFT to large designs such as an MPEG-4 decoder.
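One way to see why feedback loops matter in this exploration: a stage carrying a loop dependence cannot be sped up by replication, so only module selection helps it. The toy model below (hypothetical stages, rates, and areas, not the paper's formulation) encodes that by ignoring replicas for in-loop stages.

```python
# Illustrative model of throughput-driven module selection + replication.
# A stage inside a feedback loop has a loop-carried dependence, so extra
# replicas do not raise its effective throughput.
from itertools import product

# stage -> candidate (throughput per replica, area) pairs, plus loop flag
STAGES = {
    "fft":    {"impls": [(1.0, 50), (2.0, 90)], "in_loop": False},
    "update": {"impls": [(1.0, 40)],            "in_loop": True},
}

def explore(target, max_rep=3):
    best = None
    names = list(STAGES)
    choices = [list(product(range(len(STAGES[n]["impls"])),
                            range(1, max_rep + 1))) for n in names]
    for combo in product(*choices):
        area, rate = 0, float("inf")
        for name, (impl, rep) in zip(names, combo):
            tp, a = STAGES[name]["impls"][impl]
            eff_rep = 1 if STAGES[name]["in_loop"] else rep
            area += a * rep
            rate = min(rate, tp * eff_rep)  # slowest stage bounds throughput
        if rate >= target and (best is None or area < best[0]):
            best = (area, dict(zip(names, combo)))
    return best

print(explore(target=1.0))
```

With a target above the looped stage's best single-replica throughput, `explore` correctly reports the design infeasible, which is exactly the situation replication-only approaches cannot detect for feedback loops.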


field-programmable logic and applications | 2011

Accelerating Fluid Registration Algorithm on Multi-FPGA Platforms

Jason Cong; Muhuan Huang; Yi Zou

In clinical applications, medical image registration of images taken at different times and/or through different modalities is needed to make an objective clinical assessment of the patient. Viscous fluid registration is a powerful PDE-based method that can register large deformations in the imaging process. This paper presents our implementation of the fluid registration algorithm on the multi-FPGA platform Convey HC-1. We obtain a 35X speedup over single-threaded software on a CPU. The implementation uses a high-level synthesis (HLS) tool, with additional source-code-level optimizations including fixed-point conversion, tiling, prefetching, data reuse, and streaming across modules using a ghost-zone (time-tiling) approach. This case study also identifies further automation steps needed in existing HLS tools.
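The ghost-zone (time-tiling) idea is worth a small sketch: fetch a tile once with a halo wide enough to run several stencil time steps locally, instead of re-reading the tile every step. The 1D three-point averaging stencil and the sizes below are illustrative, not the fluid-registration PDE kernel itself.

```python
# Minimal ghost-zone (time-tiling) sketch for a radius-1 stencil:
# a halo of width T lets a tile advance T time steps with one fetch,
# because invalid values creep inward one cell per step.

def step(a):
    """One time step of a 3-point averaging stencil (boundaries held)."""
    return [a[0]] + [(a[i - 1] + a[i] + a[i + 1]) / 3.0
                     for i in range(1, len(a) - 1)] + [a[-1]]

def tile_with_ghost(data, lo, hi, t_steps):
    """Fetch tile [lo, hi) plus a ghost zone of width t_steps on each
    side, run t_steps locally, and keep only the interior cells, which
    the shrinking region of validity never reaches."""
    g = t_steps
    ext = data[max(0, lo - g):min(len(data), hi + g)]
    for _ in range(t_steps):
        ext = step(ext)
    off = lo - max(0, lo - g)
    return ext[off:off + (hi - lo)]
```

The design trade-off this captures is redundant halo computation in exchange for a T-fold reduction in off-chip traffic, the right direction when the kernel is bandwidth bounded.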


symposium on cloud computing | 2016

Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale

Muhuan Huang; Di Wu; Cody Hao Yu; Zhenman Fang; Matteo Interlandi; Tyson Condie; Jason Cong

With the end of CPU core scaling due to dark silicon limitations, customized accelerators on FPGAs have gained increased attention in modern datacenters due to their low power, high performance, and energy efficiency. Evidenced by Microsoft's FPGA deployment in its Bing search engine and Intel's $16.7 billion acquisition of Altera, integrating FPGAs into datacenters is considered one of the most promising approaches to sustain future datacenter growth. However, it is quite challenging for existing big data computing systems---like Apache Spark and Hadoop---to access the performance and energy benefits of FPGA accelerators. In this paper we design and implement Blaze to provide programming and runtime support for easy and efficient deployment of FPGA accelerators in datacenters. In particular, Blaze abstracts FPGA accelerators as a service (FaaS) and provides a set of clean programming APIs for big data processing applications to easily utilize those accelerators. Our Blaze runtime implements an FaaS framework to efficiently share FPGA accelerators among multiple heterogeneous threads on a single node, and extends Hadoop YARN with accelerator-centric scheduling to efficiently share them among multiple computing tasks in the cluster. Experimental results using four representative big data applications demonstrate that Blaze greatly reduces the programming effort needed to access FPGA accelerators in systems like Apache Spark and YARN, and improves system throughput by 1.7× to 3× (and energy efficiency by 1.5× to 2.7×) compared to a conventional CPU-only cluster.
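The accelerator-as-a-service abstraction can be sketched in a few lines: the application registers a software function alongside the name of an equivalent accelerator, and the runtime transparently routes each call. This is a hypothetical illustration of the idea, not Blaze's actual API; the class, decorator, and "vecadd" runtime entry are invented.

```python
# Hypothetical FaaS sketch: calls go to the named accelerator when the
# runtime has one, and transparently fall back to the software path.

class FaaSClient:
    def __init__(self, runtime):
        self.runtime = runtime  # accelerator name -> callable (or absent)

    def accelerated(self, acc_name):
        """Decorator: route calls to the named accelerator when available."""
        def wrap(sw_func):
            def call(*args):
                acc = self.runtime.get(acc_name)
                return acc(*args) if acc is not None else sw_func(*args)
            return call
        return wrap

# Simulated runtime: only 'vecadd' has an accelerator implementation.
client = FaaSClient({"vecadd": lambda a, b: [x + y for x, y in zip(a, b)]})

@client.accelerated("vecadd")
def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]  # software fallback path

print(vec_add([1, 2], [3, 4]))  # served by the (simulated) accelerator
```

The key property mirrored here is that application code is identical whether or not an accelerator is deployed, which is what makes cluster-wide sharing and scheduling decisions transparent to the programmer.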


international symposium on low power electronics and design | 2013

Energy-efficient computing using adaptive table lookup based on nonvolatile memories

Jason Cong; Milos D. Ercegovac; Muhuan Huang; Sen Li; Bingjun Xiao

Table-lookup-based function computation can significantly reduce energy consumption. However, existing table lookup methods are mostly used in ASIC designs for fixed functions. The goal of this paper is to enable table lookup computation in general-purpose processors, which requires adaptive lookup tables for different applications. We provide a complete design flow to support this requirement. We propose a novel approach to building reconfigurable lookup tables based on emerging nonvolatile memories (NVMs), which takes full advantage of NVMs' benefits over conventional SRAMs while avoiding their limitations. We provide compiler support to optimize table resource allocation among functions within a program. We also develop a runtime table manager that learns from history to improve its arbitration of the limited on-chip table resources among programs.
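The core trade this exploits, one precomputation pass in exchange for replacing repeated function evaluation with an index computation and a memory read, can be shown with a tiny sketch. The grid size and the use of `math.exp` are hypothetical stand-ins for the paper's NVM-backed, per-application tables.

```python
# Illustrative table-lookup evaluation: precompute f on a uniform grid
# once, then answer queries with one index computation and one read.
import math

def build_table(f, lo, hi, entries):
    step = (hi - lo) / (entries - 1)
    return [f(lo + i * step) for i in range(entries)], lo, step

def lookup(table, x):
    values, lo, step = table
    i = min(len(values) - 1, max(0, round((x - lo) / step)))  # nearest entry
    return values[i]

exp_table = build_table(math.exp, 0.0, 1.0, 1025)
print(lookup(exp_table, 0.5))  # close to math.exp(0.5)
```

The accuracy/area knob is the entry count: nearest-entry error scales with the grid step, which is exactly the kind of per-function resource decision the paper's compiler and runtime manager arbitrate.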


design automation conference | 2016

Invited - Heterogeneous datacenters: options and opportunities

Jason Cong; Muhuan Huang; Di Wu; Cody Hao Yu

In this paper we present our ongoing study and deployment efforts for enabling FPGAs in datacenters. An important focus is a quantitative evaluation of a wide range of heterogeneous system designs and integration options, from low-power field-programmable SoCs to server-class compute nodes with high-capacity FPGAs, based on real system prototyping and implementation with real-life applications. We also develop a cloud-friendly programming interface and a runtime environment for efficient accelerator deployment, scheduling, and transparent resource management, integrating FPGAs for large-scale acceleration across different system platforms to enable "write once, execute everywhere".


design automation conference | 2015

CMOST: a system-level FPGA compilation framework

Peng Zhang; Muhuan Huang; Bingjun Xiao; Hui Huang; Jason Cong

Programming difficulty is a key challenge to the adoption of FPGAs as a general high-performance computing platform. In this paper we present CMOST, an open-source automated compilation flow that maps C code to FPGAs for acceleration. CMOST establishes a unified framework for integrating various system-level optimizations and for targeting different hardware platforms. We also present several novel techniques used in CMOST, including task-level dependence analysis, block-based data streaming, and automated synchronous dataflow (SDF) generation. Experimental results show that the automatically generated FPGA accelerators achieve over 8× speedup and 120× energy gain on average compared to multi-core CPU results from similar input C programs. CMOST results are comparable to those obtained after extensive manual source-code transformations followed by high-level synthesis.
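One analysis step a flow like this needs once it has an SDF model is solving the balance equations for each actor's repetition count in a periodic schedule. The sketch below illustrates that standard computation on an invented two-actor graph; it is not CMOST's implementation.

```python
# Toy SDF repetition-vector solver: for every edge, the producer's
# firings times its rate must equal the consumer's firings times its
# rate; propagate ratios, then scale to the smallest integer vector.
from fractions import Fraction
from math import gcd

def repetition_vector(edges, actors):
    """edges: (producer, prod_rate, consumer, cons_rate) tuples for a
    connected SDF graph; actors lists every actor name."""
    q = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for p, pr, c, cr in edges:
            if p in q and c not in q:
                q[c] = q[p] * pr / cr; changed = True
            elif c in q and p not in q:
                q[p] = q[c] * cr / pr; changed = True
    denom_lcm = 1
    for v in q.values():
        denom_lcm = denom_lcm * v.denominator // gcd(denom_lcm, v.denominator)
    ints = {k: int(v * denom_lcm) for k, v in q.items()}
    g = 0
    for v in ints.values():
        g = gcd(g, v)
    return {k: v // g for k, v in ints.items()}

# Actor A produces 2 tokens per firing; B consumes 3 per firing.
print(repetition_vector([("A", 2, "B", 3)], ["A", "B"]))  # {'A': 3, 'B': 2}
```

The resulting vector (A fires 3 times for every 2 firings of B) is what lets a scheduler size buffers and build a rate-matched periodic schedule.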


application specific systems architectures and processors | 2011

Domain-specific processor with 3D integration for medical image processing

Jason Cong; Karthik Gururaj; Muhuan Huang; Sen Li; Bingjun Xiao; Yi Zou

The growth of 3D technology has led to opportunities for stacked multiprocessor-accelerator computing platforms with high-bandwidth, low-latency TSV connections between layers, resulting in higher computing performance and better energy efficiency. This work evaluates the performance and energy benefits of such an advanced architecture and addresses the associated design problems. To better utilize the reconfigurable hardware resources and to explore the opportunity of kernel sharing across applications, we propose a dedicated domain-specific computing platform. In particular, we have chosen medical image processing as the target domain in this work due to its growing demand for real-time processing yet inadequate performance on conventional computing architectures. We propose a design flow for the 3D multiprocessor-accelerator platform and apply a number of methods to optimize the average performance of all applications in the targeted domain under area and bandwidth constraints. Experiments show that applications in this domain gain a 7.4× speedup and 18.8× energy savings on average when running on our platform using CMP cores and domain-specific accelerators, compared to CPU-only counterparts.


symposium on application specific processors | 2011

3D recursive Gaussian IIR on GPU and FPGAs — A case for accelerating bandwidth-bounded applications

Jason Cong; Muhuan Huang; Yi Zou

GPUs typically have higher off-chip bandwidth than FPGA-based systems, so GPUs should generally perform better on bandwidth-bounded massively parallel applications. In this paper, we present our implementations of a 3D recursive Gaussian IIR filter on multi-core CPU, many-core GPU, and multi-FPGA platforms. Our baseline CPU implementation features minimal arithmetic (2 MADDs per dimension). While this application is clearly bandwidth bounded, the differences in memory subsystems translate into different bandwidth optimization techniques. Our GPU and FPGA implementations show 26X and 33X speedups, respectively, over optimized single-threaded CPU code.
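The "2 MADDs per dimension" claim comes from the recursive (IIR) formulation of Gaussian smoothing: a causal and an anti-causal first-order pass, each one multiply-add per sample, so cost is independent of the Gaussian's width. The sketch below shows the 1D building block; the coefficient formula is a common first-order approximation, not necessarily the exact filter used in the paper, and the 3D filter applies this pass along each axis.

```python
# Sketch of recursive (IIR) Gaussian smoothing along one dimension:
# one causal and one anti-causal first-order recursive pass.
import math

def recursive_gaussian_1d(x, sigma):
    b = math.exp(-1.0 / sigma)  # feedback coefficient (approximation)
    a = 1.0 - b
    y = list(x)
    for i in range(1, len(y)):           # causal pass: 1 MADD per sample
        y[i] = a * y[i] + b * y[i - 1]
    for i in range(len(y) - 2, -1, -1):  # anti-causal pass: 1 MADD per sample
        y[i] = a * y[i] + b * y[i + 1]
    return y
```

Because each output depends on the previous output, the passes are inherently serial along the filtered axis, which is precisely what makes the memory system, rather than arithmetic, the bottleneck on both GPU and FPGA.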

Collaboration


Muhuan Huang's most frequent collaborators.

Top Co-Authors

- Jason Cong (University of California)
- Di Wu (University of California)
- Peng Zhang (University of California)
- Yi Zou (University of California)
- Bingjun Xiao (University of California)
- Zhenman Fang (University of California)
- Libo Wang (University of California)
- Cody Hao Yu (University of California)
- Sen Li (University of California)
- Bin Liu (University of California)