Mingjie Lin | Researchain

Publication

Featured research published by Mingjie Lin.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2007

Performance Benefits of Monolithically Stacked 3-D FPGA

Mingjie Lin; Abbas El Gamal; Yi-Chang Lu; S. Simon Wong

The performance benefits of a monolithically stacked three-dimensional (3-D) field-programmable gate array (FPGA), whereby the programming overhead of an FPGA is stacked on top of a standard CMOS layer containing logic blocks (LBs) and interconnects, are investigated. A Virtex-II-style two-dimensional (2-D) FPGA fabric is used as a baseline architecture to quantify the relative improvements in logic density, delay, and power consumption achieved by such a 3-D FPGA. It is assumed that only the switch transistors and configuration memory cells can be moved to the top layers and that the 3-D FPGA employs the same LB and programmable interconnect architecture as the baseline 2-D FPGA. Assuming a configuration memory cell with at most 0.7 times the area of a static random-access memory cell, and switch transistors having the same characteristics as n-channel metal-oxide-semiconductor devices in the CMOS layer, it is shown that a monolithically stacked 3-D FPGA can achieve 3.2 times higher logic density, 1.7 times lower critical path delay, and 1.7 times lower total dynamic power consumption than the baseline 2-D FPGA fabricated in the same 65-nm technology node.

field programmable gate arrays | 2006

Performance benefits of monolithically stacked 3D-FPGA

Mingjie Lin; Abbas El Gamal; Yi-Chang Lu; S. Simon Wong

IEEE Communications Letters | 2005

The throughput of a buffered crossbar switch

Mingjie Lin; Nick McKeown

The throughput of an input-queued crossbar switch (with a single FIFO queue at each input) is limited to 2 − √2 ≈ 58.6% for uniformly distributed, Bernoulli i.i.d. arrivals of fixed-length packets. In this letter we prove that if the crossbar switch can buffer one packet at each crosspoint, then the throughput increases to 100% asymptotically as N → ∞, where N is the number of switch ports.
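The 58.6% figure for the unbuffered, single-FIFO case can be checked numerically. The sketch below is not from the letter: it assumes a saturated switch (every input always has a head-of-line packet), uniform random destinations, and random arbitration among contending inputs. The function name and parameters are illustrative only.

```python
import math
import random


def saturated_hol_throughput(n_ports, n_slots, seed=0):
    """Estimate throughput of an N x N input-queued switch with one FIFO per input.

    Assumptions (not from the letter): every input is saturated and each
    packet's destination is uniform over the outputs.
    """
    rng = random.Random(seed)
    # Destination of the head-of-line packet at each input.
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]
    delivered = 0
    for _ in range(n_slots):
        # Group contending inputs by the output they request.
        contenders = {}
        for inp, out in enumerate(hol):
            contenders.setdefault(out, []).append(inp)
        # Each requested output serves one contender chosen at random;
        # the winner's next packet gets a fresh uniform destination.
        for inputs in contenders.values():
            winner = rng.choice(inputs)
            hol[winner] = rng.randrange(n_ports)
            delivered += 1
    return delivered / (n_ports * n_slots)


limit = 2 - math.sqrt(2)
estimate = saturated_hol_throughput(n_ports=32, n_slots=20000)
print(f"asymptotic limit: {limit:.4f}")
print(f"simulated (N=32): {estimate:.4f}")
```

For finite N the simulated value sits slightly above the limit and approaches it as the port count grows. The buffered-crosspoint result in the letter is not modeled here.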

reconfigurable computing and fpgas | 2010

MARC: A Many-Core Approach to Reconfigurable Computing

Ilia A. Lebedev; Shaoyi Cheng; Austin Doupnik; James B. Martin; Christopher W. Fletcher; Daniel Burke; Mingjie Lin; John Wawrzynek

We present a Many-core Approach to Reconfigurable Computing (MARC), enabling efficient high-performance computing for applications expressed using parallel programming models such as OpenCL. The MARC system exploits abundant special FPGA resources such as distributed block memories and DSP blocks to implement complete single-chip high-efficiency many-core microarchitectures. The key benefits of MARC are that it (i) allows programmers to easily express parallelism through the API defined in a high-level programming language, (ii) supports coarse-grain multithreading and dataflow-style fine-grain threading while permitting bit-level resource control, and (iii) greatly reduces the effort required to re-purpose the hardware system for different algorithms or different applications. A MARC prototype machine with 48 processing nodes was implemented using a Virtex-5 (XCV5LX155T-2) FPGA for a well-known Bayesian network inference problem. We compare the runtime of the MARC machine against a manually optimized implementation. With fully synthesized application-specific processing cores, our MARC machine comes within a factor of 3 of the performance of a fully optimized FPGA solution but with a considerable reduction in development effort and a significant increase in retargetability.

International Journal of Heat and Mass Transfer | 2002

A transient liquid crystal method using a 3-D inverse transient conduction scheme

Mingjie Lin; Ting Wang

The present method utilized the hue-angle method to process the color images captured from the liquid crystal color play. Instantaneous temperature readings from embedded thermocouples were utilized for in situ calibration of hue angle for each data set. The convective heat transfer coefficient results were obtained by performing a 3-D inverse transient conduction calculation over the entire jet impingement target surface and the substrate. The results of average heat transfer coefficients agreed well with previous experimental results of point measurements by thermocouples. Comparison between 1-D and 3-D results indicates that the 1-D results are higher than the 3-D results, with the local maximum and minimum heat transfer values being overvalued by about 15–20% and the overall heat transfer by approximately 12%. This is because the 1-D method does not include the lateral heat flows induced by local temperature gradients.

field programmable gate arrays | 2010

High-throughput bayesian computing machine with reconfigurable hardware

Mingjie Lin; Ilia A. Lebedev; John Wawrzynek

We use reconfigurable hardware to construct a high-throughput Bayesian computing machine (BCM) capable of evaluating probabilistic networks with arbitrary DAG (directed acyclic graph) topology. Our BCM achieves high throughput by exploiting the FPGA's distributed memories and abundant hardware structures (such as long carry-chains and registers), which enables us to 1) develop an innovative memory allocation scheme based on a maximal matching algorithm that completely avoids memory stalls, 2) optimize and deeply pipeline the logic design of each processing node, and 3) optimally schedule them. The BCM architecture we present not only can be applied to many important algorithms in artificial intelligence, signal processing, and digital communications, but also has high reusability, i.e., a new application need not change a BCM's hardware design; only new task graph processing and code compilation are necessary. Moreover, the throughput of a BCM scales almost linearly with the size of the FPGA on which it is implemented. A prototype of a Bayesian computing machine with 16 processing nodes was implemented with a Virtex-5 FPGA (XCV5LX155T-2) on a BEE3 (Berkeley Emulation Engine) platform. For a wide variety of sample Bayesian problems, compared with running the same network evaluation algorithm on a 2.4 GHz Intel Core 2 Duo processor and a GeForce 9400m using the CUDA software package, the BCM demonstrates 80x and 15x speedups, respectively, with a peak throughput of 20.4 GFLOPS (Giga Floating-Point Operations per Second).
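The abstract does not describe the memory allocation scheme beyond naming a maximal matching algorithm. As a general illustration of that term only, the sketch below computes a greedy maximal matching on a bipartite graph. The operand and bank names are hypothetical and are not taken from the paper.

```python
def greedy_maximal_matching(edges):
    """Return a maximal matching: no remaining edge can be added without a conflict."""
    matched_left, matched_right, matching = set(), set(), []
    for left, right in edges:
        # Take an edge only if both endpoints are still free.
        if left not in matched_left and right not in matched_right:
            matching.append((left, right))
            matched_left.add(left)
            matched_right.add(right)
    return matching


# Hypothetical example: which memory bank could serve each operand read in one cycle.
edges = [
    ("op0", "bank0"),
    ("op0", "bank1"),
    ("op1", "bank0"),
    ("op2", "bank1"),
    ("op2", "bank2"),
]
matching = greedy_maximal_matching(edges)
print(matching)  # [('op0', 'bank0'), ('op2', 'bank1')]
```

A maximal matching is not necessarily a maximum one: here a different choice of edges would serve all three operands. How the BCM's scheme avoids stalls is not stated in the abstract.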

field-programmable logic and applications | 2010

OpenRCL: Low-Power High-Performance Computing with Reconfigurable Devices

Mingjie Lin; Ilia A. Lebedev; John Wawrzynek

This work presents the Open Reconfigurable Computing Language (OpenRCL) system designed to enable low-power high-performance reconfigurable computing with imperative programming languages such as C/C++. The key idea is to expose the FPGA platform as a compiler target for applications expressed in the OpenCL paradigm. To this end, we present a combination of low-level virtual machine instruction set, execution model, many-core architecture, and associated compiler to achieve high performance and power efficiency by exploiting the FPGA’s distributed memories and abundant hardware structures (such as DSP blocks, long carry-chains, and registers). Our resulting OpenRCL system not only allows programmers to easily express parallelism through the API defined in the OpenCL standard but also supports coarse-grain multithreading and dataflow-style fine-grain threading while permitting bit-level resource control. An OpenRCL prototype machine with 30 processing nodes was implemented using a Virtex-5 (XCV5LX155T-2) FPGA. For the well-known Parallel Prefix Sum (Scan) problem, compared with running the same problem on a GeForce 9400m using the OpenCL SDK from Apple Inc., the OpenRCL machine demonstrates comparable performance with a 5x reduction in core power consumption.
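For readers unfamiliar with the benchmark, the sketch below shows the data-parallel structure of a prefix sum using the Hillis-Steele formulation. It is a sequential Python illustration of the problem, not the OpenRCL or OpenCL implementation, and the abstract does not say which scan variant was used.

```python
def inclusive_scan(values):
    """Hillis-Steele inclusive prefix sum, written as about log2(n) data-parallel steps."""
    result = list(values)
    offset = 1
    while offset < len(result):
        # Every element of a step depends only on the previous step's list,
        # so on parallel hardware all of them could be computed at once.
        result = [
            result[i] + result[i - offset] if i >= offset else result[i]
            for i in range(len(result))
        ]
        offset *= 2
    return result


print(inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [3, 4, 11, 11, 15, 16, 22, 25]
```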

field programmable gate arrays | 2007

A routing fabric for monolithically stacked 3D-FPGA

Mingjie Lin; Abbas El Gamal

A previous study on the benefits of monolithically stacked 3D-FPGA has estimated a 3.2x improvement in logic density, a 1.7x improvement in delay, and a 1.7x improvement in dynamic power consumption over a baseline 2D-FPGA with no change in architecture. This paper describes a new routing fabric and shows that a 3D-FPGA using this fabric can achieve a 3.3x improvement in logic density, a 2.35x improvement in delay, and a 2.82x improvement in dynamic power consumption over the same baseline 2D-FPGA. The additional improvements in delay and power consumption are achieved by reducing net loading in several ways: (i) Only Single and Double interconnect segments are used. This reduces the total interconnect length used to implement each net. (ii) The routing fabric is hierarchical. Each logic block's inputs and outputs connect first to local segments. These segments can then be programmably connected to local segments in neighboring routing blocks via programmable buffers and/or to interconnect segments in routing channels via muxes with buffered outputs. (iii) Interconnect segments can be directly connected to form longer segments using programmable buffers without going through routing blocks. (iv) The routing block provides switching capability beyond that of a conventional switch box. A 3D-FPGA using this new routing fabric can be realized by stacking two configuration memory layers and a switch layer on top of a standard CMOS layer with a total of 12 metal layers interspersed between them. A CAD flow based on VPR with appropriate modifications to the routing graph generation and routing algorithm is developed and used in the performance analysis.

IEEE Transactions on Very Large Scale Integration Systems | 2009

A Low-Power Field-Programmable Gate Array Routing Fabric

Mingjie Lin; A. El Gamal

This paper describes a new programmable routing fabric for field-programmable gate arrays (FPGAs). Our results show that an FPGA using this fabric can achieve 1.57 times lower dynamic power consumption and 1.35 times lower average net delays with only 9% reduction in logic density over a baseline island-style FPGA implemented in the same 65-nm CMOS technology. These improvements in power and delay are achieved by 1) using only short interconnect segments to reduce routed net lengths, and 2) reducing interconnect segment loading due to programming overhead relative to the baseline FPGA without compromising routability. The new routing fabric is also well-suited to monolithically stacked 3-D-IC implementation. It is shown that a 3-D-FPGA using this fabric can achieve a 3.3 times improvement in logic density, a 2.51 times improvement in delay, and a 2.93 times improvement in dynamic power consumption over the same baseline 2-D-FPGA.

field programmable gate arrays | 2008

TORCH: a design tool for routing channel segmentation in FPGAs

Mingjie Lin; Abbas El Gamal

A design tool for routing channel segmentation in island-style FPGAs is presented. Given the FPGA architecture parameters and a set of benchmark designs, the tool optimizes routing channel segmentation using the average interconnect power-delay product as a performance metric estimated from placed and routed designs. A simulated-annealing procedure is used, whereby segmentation is incrementally changed in each iteration, the benchmark designs are mapped using VPR, and the performance metric is computed to decide whether to accept or reject the new segmentation. Run time is significantly reduced by using incremental routing in each iteration and parallelizing the metric evaluation. Experimental results using the MCNC benchmark designs demonstrate an average of 22% and 15% reduction in delay and power, respectively, relative to a baseline segmentation. The results also show that average segment length should decrease with technology scaling. Finally, we demonstrate how the tool can be used to optimize other aspects of programmable routing in an FPGA.
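The accept-or-reject loop described in the abstract follows the usual simulated-annealing pattern, which can be sketched as follows. This is not TORCH itself. The cost function is a toy stand-in for the VPR-based power-delay metric, and the track counts, target mix, cooling schedule, and all function names are assumptions made for illustration.

```python
import math
import random


def toy_cost(segmentation):
    """Stand-in for the power-delay metric; NOT derived from VPR or the paper."""
    # Hypothetical ideal mix of length-1, length-2, and length-4 tracks.
    target = {1: 0.6, 2: 0.3, 4: 0.1}
    total = sum(segmentation.values())
    return sum((segmentation[k] / total - target[k]) ** 2 for k in target)


def perturb(segmentation, rng):
    """Move one track from one segment length to another (an incremental change)."""
    candidate = dict(segmentation)
    src = rng.choice([k for k, v in candidate.items() if v > 0])
    dst = rng.choice([k for k in candidate if k != src])
    candidate[src] -= 1
    candidate[dst] += 1
    return candidate


def anneal(initial, cost, steps=5000, t_start=0.05, t_end=1e-5, seed=0):
    rng = random.Random(seed)
    current, current_cost = dict(initial), cost(initial)
    best, best_cost = dict(current), current_cost
    for step in range(steps):
        # Geometric cooling (an assumption; the abstract gives no schedule).
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        candidate = perturb(current, rng)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Metropolis rule: always accept improvements, sometimes accept regressions.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = dict(current), current_cost
    return best, best_cost


start = {1: 0, 2: 0, 4: 100}  # 100 tracks, all length-4 to begin with
best, best_cost = anneal(start, toy_cost)
print(best, best_cost)
```

In the tool described above, each cost evaluation would involve mapping benchmark designs with VPR, which is why the abstract emphasizes incremental routing and parallel metric evaluation.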
