Bingjun Xiao
University of California, Los Angeles
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Bingjun Xiao.
field programmable gate arrays | 2015
Chen Zhang; Peng Li; Guangyu Sun; Yijin Guan; Bingjun Xiao; Jason Cong
Convolutional neural network (CNN) has been widely employed for image recognition because it can achieve high accuracy by emulating behavior of optic nerves in living creatures. Recently, rapid growth of modern applications based on deep learning algorithms has further improved research and implementations. Especially, various accelerators for deep CNN have been proposed based on FPGA platform because it has advantages of high performance, reconfigurability, and fast development round, etc. Although current FPGA accelerators have demonstrated better performance over generic processors, the accelerator design space has not been well exploited. One critical problem is that the computation throughput may not well match the memory bandwidth provided an FPGA platform. Consequently, existing approaches cannot achieve best performance due to under-utilization of either logic resource or memory bandwidth. At the same time, the increasing complexity and scalability of deep learning applications aggravate this problem. In order to overcome this problem, we propose an analytical design scheme using the roofline model. For any solution of a CNN design, we quantitatively analyze its computing throughput and required memory bandwidth using various optimization techniques, such as loop tiling and transformation. Then, with the help of rooine model, we can identify the solution with best performance and lowest FPGA resource requirement. As a case study, we implement a CNN accelerator on a VC707 FPGA board and compare it to previous approaches. Our implementation achieves a peak performance of 61.62 GFLOPS under 100MHz working frequency, which outperform previous approaches significantly.
international symposium on nanoscale architectures | 2011
Jason Cong; Bingjun Xiao
In this paper, we introduce a novel FPGA architecture with memristor-based reconfiguration (mrFPGA). The proposed architecture is based on the existing CMOS-compatible memristor fabrication process. The programmable interconnects of mrFPGA use only memristors and metal wires so that the interconnects can be fabricated over logic blocks, resulting in significant reduction of overall area and interconnect delay but without using a 3D die-stacking process. Using memristors to build up the interconnects can also provide capacitance shielding from unused routing paths and reduce interconnect delay further. Moreover we propose an improved architecture that allows adaptive buffer insertion in interconnects to achieve more speedup. Compared to the fixed buffer pattern in conventional FPGAs, the positions of inserted buffers in mrFPGA are optimized on demand. A complete CAD flow is provided for mrFPGA, with an advanced P&R tool named mrVPR that was developed for mrFPGA. The tool can deal with the novel routing structure of mrFPGA, the memristor shielding effect, and the algorithm for optimal buffer insertion. We evaluate the area, performance and power consumption of mrFPGA based on the 20 largest MCNC benchmark circuits. Results show that mrFPGA achieves 5.18x area savings, 2.28x speedup and 1.63x power savings. Further improvement is expected with combination of 3D technologies and mrFPGA.
design automation conference | 2010
Bingjun Xiao; Yiyu Shi; Lei He
State-of-charge (SOC) measures energy left in a battery, and it is critical for modeling and managing batteries. Developing efficient yet accurate SOC algorithms remains a challenging task. Most existing work uses regression based on a time-variant circuit model, which may be hard to converge and often does not apply to different types of batteries. Knowing open-circuit voltage (OCV) leads to SOC due to the well known mapping between OCV and SOC. In this paper, we propose an efficient yet accurate OCV algorithm that applies to all types of batteries. Using linear system analysis but without a circuit model, we calculate OCV based on the sampled terminal voltage and discharge current of the battery. Experiments show that our algorithm is numerically stable, robust to history dependent error, and obtains SOC with less than 4% error compared to a detailed battery simulation for a variety of batteries. Our OCV algorithm is also efficient, and can be used as a real-time electro-analytical tool revealing what is going on inside the battery.
IEEE Transactions on Very Large Scale Integration Systems | 2014
Jason Cong; Bingjun Xiao
In this paper we introduce a novel field programmable gate array (FPGA) architecture with resistive random access memory (RRAM)-based programmable interconnects (FPGA-RPI). Programmable interconnects are the dominant part of FPGA. We use RRAMs to build programmable interconnects, and optimize their structures by exploiting opportunities that emerge in RRAM-based circuits. FPGA-RPI can be fabricated by the existing CMOS-compatible RRAM process. Using an advanced placement and routing tool named VPR-RPI which was developed to deal with the novel architecture, a customized CAD flow is provided for FPGA-RPI. Results show that the programmable interconnects of FPGA-RR have a 96% smaller footprint, 55% higher performance, and 79% lower power consumptions compared to other FPGA counterparts.
asia and south pacific design automation conference | 2013
Jason Cong; Guojie Luo; Kalliopi Tsota; Bingjun Xiao
One of the necessary requirements for the placement process is that it should be capable of generating routable solutions. This paper describes a simple but effective method leading to the reduction of the routing congestion and the final routed wirelength for large-scale mixed-size designs. In order to reduce routing congestion and improve routability, we propose blocking narrow regions on the chip. We also propose dummy-cell insertion inside regions characterized by reduced fixed-macro density. Our placer consists of three major components: (i) narrow channel reduction by performing neighbor-based fixed-macro inflation; (ii) dummy-cell insertion inside large regions with reduced fixed-macro density; and (iii) pre-placement inflation by detecting tangled logic structures in the netlist and minimizing the maximum pin density. We evaluated the quality of our placer using the newly released DAC 2012 routability-driven placement contest designs and we compared our results to the top four teams that participated in the placement contest. The experimental results reveal that our placer improves the routability of the DAC 2012 placement contest designs and effectively reduces the routing congestion.
international conference on computer design | 2013
Yu-Ting Chen; Jason Cong; Mohammad Ali Ghodrat; Muhuan Huang; Chunyue Liu; Bingjun Xiao; Yi Zou
Application-specific accelerators provide 10-100× improvement in power efficiency over general-purpose processors. The accelerator-rich architectures are especially promising. This work discusses a prototype of accelerator-rich CMPs (PARC). During our development of PARC in real hardware, we encountered a set of technical challenges and proposed corresponding solutions. First, we provided system IPs that serve a sea of accelerators to transfer data between userspace and accelerator memories without cache overhead. Second, we designed a dedicated interconnect between accelerators and memories to enable memory sharing. Third, we implemented an accelerator manager to virtualize accelerator resources for users. Finally, we developed an automated flow with a number of IP templates and customizable interfaces to a C-based synthesis flow to enable rapid design and update of PARC. We implemented PARC in a Virtex-6 FPGA chip with integration of platform-specific peripherals and booting of unmodified Linux. Experimental results show that PARC can fully exploit the energy benefits of accelerators at little system overhead.
international conference on computer aided design | 2013
Jason Cong; Bingjun Xiao
Application-specific accelerators provide orders-of-magnitude improvement in energy-efficiency over CPUs, and accelerator-rich computing platforms are showing promise in the dark silicon age. Memory sharing among accelerators leads to huge transistor savings, but needs novel designs of interconnects between accelerators and shared memories. Accelerators run 100x faster than CPUs and post a high demand on data. This leads to resource-consuming interconnects if we follow the same design rules as those for interconnects between CPUs and shared memories, and simply duplicate the interconnect hardware to meet the accelerator data demand. In this work we develop a novel design of interconnects between accelerators and shared memories and exploit three optimization opportunities that emerge in accelerator-rich computing platforms: 1) The multiple data ports of the same accelerators are powered on/off together, and the competition for shared resources among these ports can be eliminated to save interconnect transistor cost; 2) In dark silicon, the number of active accelerators in an accelerator-rich platform is usually limited, and the interconnects can be partially populated to just fit the data access demand limited by the power budget; 3) The heterogeneity of accelerators leads to execution patterns among accelerators and, based on the probability analysis to identify these patterns, interconnects can be optimized for the expected utilization. Experiments show that our interconnect design outperforms prior work that was optimized for CPU cores or signal routing.
field-programmable custom computing machines | 2014
Jason Cong; Hui Huang; Chiyuan Ma; Bingjun Xiao; Peipei Zhou
This paper presents a satellite-based communication system dedicated to disaster recovery. The proposed solution relies on the underlay transmission of low-power emergency signals in the frequency band of a primary transparent satellite telecommunication system. Wideband spreading is used so as to guarantee that the primary system performance is not affected by the inter-system interference. The emergency system capacity is evaluated under realistic assumptions regarding the primary satellite mission. It is shown that depending on the emergency terminal characteristics, various emergency communication services can be envisaged, from simple alert missions to voice communications. As it does not require any specific space segment development nor the full-time reservation of expensive radio resources, the described solution is attractive in terms of deployment cost. Provided an extension of the regulatory framework for exceptional security-oriented missions, it might satisfactorily match a governmental need for reliable, easy-to-deploy emergency communication means.With more and more enterprises and organizations outsourcing their IT services to distributed clouds for cost savings, historical and operational data generated by these services grows exponentially, which usually is stored in the data centers located at different geographic location in the distributed cloud. Such data referred to as big data now becomes an invaluable asset to many businesses or organizations, as it can be used to identify business advantages by helping them make their strategic decisions. Big data analytics thus is emerged as a main research topic in distributed cloud computing. The challenges associated with the query evaluation for big data analytics are that (i) its cloud resource demands are typically beyond the supplies by any single data center and expand to multiple data centers, and (ii) the source data of the query is located at different data centers. This creates heavy data traffic among the data centers in the distributed cloud, thereby resulting in high communication costs. A fundamental question for query evaluation of big data analytics thus is how to admit as many such queries as possible while keeping the accumulative communication cost minimized. In this paper, we investigate this question by formulating an online query evaluation problem for big data analytics in distributed clouds, with an objective to maximize the query acceptance ratio while minimizing the accumulative communication cost of query evaluation, for which we first propose a novel metric model to model different resource utilizations of data centres, by incorporating resource workloads and resource demands of each query. We then devise an efficient online algorithm. We finally conduct extensive experiments by simulations to evaluate the performance of the proposed algorithm. Experimental results demonstrate that the proposed algorithm is promising and outperforms other heuristics.Future processor will not be limited by the transistor resources, but will be mainly constrained by energy efficiency. Reconfigurable architecture offers higher energy efficiency than CPUs through customized hardware and more flexibility than ASICs. FPGAs allow configurability at bit level to keep both efficiency and flexibility. However, in many computation-intensive applications, only word level customizations are necessary, which inspires coarse-grained reconfigurable arrays(CGRAs) to raise configurability to word level and to reduce configuration information, and to enable on-the-fly customization. Traditional CGRAs are designed in the era when transistor resources are scarce. Previous work in CGRAs share hardware resources among different operations via modulo scheduling and time multiplexing processing elements. In the emerging scenario where transistor resources are rich, we develop a novel CGRA architecture that features full pipelining and dynamic composition to improve energy efficiency and implement the prototype on Xilinx Virtex-6 FPGA board. Experiments show that fully pipelined and dynamically composable architecture(FPCA) can exploit the energy benefits of customization for user applications when the transistor resources are rich.
design automation conference | 2014
Jason Cong; Peng Li; Bingjun Xiao; Peng Zhang
High-level synthesis (HLS) tools have made significant progress in compiling high-level descriptions of computation into highly pipelined register-transfer level specifications. The high-throughput computation raises a high data demand. To prevent data accesses from being the bottleneck, on-chip memories are used as data reuse buffers to reduce off-chip accesses. Also memory partitioning is explored to increase the memory bandwidth by scheduling multiple simultaneous memory accesses to different memory banks. Prior work on memory partitioning of data reuse buffers is limited to uniform partitioning. In this paper, we perform an early-stage exploration of nonuniform memory partitioning. We use the stencil computation, a popular communication-intensive application domain, as a case study to show the potential benefits of nonuniform memory partitioning. Our novel method can always achieve the minimum memory size and the minimum number of memory banks, which cannot be guaranteed in any prior work. We develop a generalized microarchitecture to decouple stencil accesses from computation, and an automated design flow to integrate our microarchitecture with the HLS-generated computation kernel for a complete accelerator.
international symposium on low power electronics and design | 2013
Jason Cong; Milos D. Ercegovac; Muhuan Huang; Sen Li; Bingjun Xiao
Table lookup based function computation can significantly save energy consumption. However existing table lookup methods are mostly used in ASIC designs for some fixed functions. The goal of this paper is to enable table lookup computation in general-purpose processors, which requires adaptive lookup tables for different applications. We provide a complete design flow to support this requirement. We propose a novel approach to build the reconfigurable lookup tables based on emerging nonvolatile memories (NVMs), which takes full advantages of NVMs over conventional SRAMs and avoids the limitation of NVMs. We provide compiler support to optimize table resource allocation among functions within a program. We also develop a runtime table manager that can learn from history and improve its arbitration of the limited on-chip table resources among programs.