Jungwook Choi
IBM
Publications
Featured research published by Jungwook Choi.
IEEE Transactions on Circuits and Systems | 2010
Young-kyu Choi; Kisun You; Jungwook Choi; Wonyong Sung
A real-time hardware-based large vocabulary speech recognizer requires high memory bandwidth. We have developed a field-programmable gate array (FPGA)-based 20,000-word speech recognizer utilizing efficient dynamic random access memory (DRAM) access. This system contains all the functional blocks for hidden-Markov-model-based speaker-independent continuous speech recognition: feature extraction, emission probability computation, and intraword and interword Viterbi beam search. The feature extraction is conducted in software on a soft-core-based CPU, while the other functional units are implemented using parallel and pipelined hardware blocks. To reduce the number of memory access operations, we used several techniques such as bitwidth reduction of the Gaussian parameters, multiframe computation of the emission probability, and two-stage language model pruning. We also employed a customized DRAM controller that supports various access patterns optimized for each functional unit of the speech recognizer. The speech recognition hardware was synthesized for the Virtex-4 FPGA and operates at 100 MHz. The experimental result on the Nov92 20k test set shows that the developed system runs 1.52 and 1.39 times faster than real time using the bigram and trigram language models, respectively.
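The Viterbi beam search at the heart of this recognizer prunes hypotheses whose score falls too far below the current best, which directly shrinks the working set that must be streamed from DRAM. A minimal Python sketch of one beam-pruned Viterbi step (illustrative data structures, not the paper's hardware implementation):

```python
import math

def viterbi_beam_step(active, transitions, emission_logp, beam_width):
    """One frame of Viterbi beam search (illustrative sketch).

    active: dict state -> accumulated log-probability (surviving hypotheses)
    transitions: dict state -> list of (next_state, transition log-prob)
    emission_logp: dict state -> emission log-probability for this frame
    beam_width: states scoring more than beam_width below the best are
        pruned, keeping the working set (and memory traffic) small.
    """
    new_active = {}
    for state, score in active.items():
        for nxt, t_logp in transitions.get(state, []):
            cand = score + t_logp + emission_logp.get(nxt, -math.inf)
            if cand > new_active.get(nxt, -math.inf):
                new_active[nxt] = cand          # keep best path into each state
    if not new_active:
        return {}
    best = max(new_active.values())
    return {s: p for s, p in new_active.items() if p >= best - beam_width}
```

In the hardware version this pruning runs in a pipelined unit; the sketch only shows the algorithmic effect of the beam on the active set.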
International Conference on Field-Programmable Logic and Applications (FPL) | 2012
Jungwook Choi; Rob A. Rutenbar
In this paper, we describe hardware for inference computations on Markov random fields (MRFs). MRFs are widely used in applications such as computer vision, but conventional software solvers are slow. Belief propagation (BP) solvers, which use patterns of local message passing on MRFs, have been studied in hardware, but their performance is unreliable. We show how a superior method, sequential tree-reweighted message passing (TRW-S), can be rendered in hardware. TRW-S has reliable convergence, guaranteed by its so-called “sequential” computation. Analysis reveals many opportunities for TRW-S hardware acceleration. We show how to implement TRW-S in FPGA hardware so that it exploits significant parallelism and memory bandwidth. Our implementation is capable of running a standard stereo vision benchmark at rates approaching 40 frames/sec; this represents the first time TRW-S methods have been accelerated to these speeds on an FPGA platform.
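The core TRW-S operation is a min-sum message update in which a node's reparameterized cost (its data cost plus incoming messages, scaled by a tree-averaging factor, minus the reverse message) is combined with the pairwise smoothness cost. A simplified NumPy sketch of a single message update (variable names and the fixed gamma are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def trws_message(unary, pairwise, msg_in_sum, msg_rev, gamma=0.5):
    """One simplified TRW-S-style min-sum message from node s to neighbor t.

    unary: (L,) data cost at s; pairwise: (L, L) smoothness cost;
    msg_in_sum: (L,) sum of all messages arriving at s;
    msg_rev: (L,) previous message from t back to s;
    gamma: tree-averaging weight (illustrative constant here).
    Returns the (L,) outgoing message, normalized so its minimum is zero.
    """
    cost = gamma * (unary + msg_in_sum) - msg_rev   # reparameterized cost at s
    msg = (cost[:, None] + pairwise).min(axis=0)    # min over labels of s
    return msg - msg.min()                          # normalization bounds values
```

The "sequential" character that guarantees convergence comes from sweeping such updates over the nodes in a fixed order, which the FPGA design exploits for streaming.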
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2010
Jungwook Choi; Kisun You; Wonyong Sung
In this paper, we present a hardware architecture for large vocabulary continuous speech recognition that conducts a search over a weighted finite state transducer (WFST) network. A pipelined architecture is proposed to fully utilize the memory bandwidth. A hash table is used to manage small working sets efficiently. We also applied a parallelization technique that increases the traversal speed by 17%. The recognition system is fully functional on an FPGA, which runs at 100 MHz. The experimental result on the Wall Street Journal 5,000-word vocabulary task shows that the recognition speed of the system is 5.3× faster than real time.
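A hash table keyed by WFST state is a natural way to hold the small working set of active hypotheses: tokens reaching the same state are merged, keeping only the best. A minimal Python sketch of one token-passing frame under this idea (illustrative structures, not the paper's hardware design):

```python
def expand_tokens(tokens, arcs):
    """One frame of token passing over a WFST (illustrative sketch).

    tokens: dict state -> best accumulated cost (the small working set).
    arcs: dict state -> list of (destination state, arc weight).
    Costs use the tropical semiring convention: lower is better.
    The hash table keyed by destination merges competing tokens, so the
    working set stays compact between frames.
    """
    nxt = {}
    for state, cost in tokens.items():
        for dest, w in arcs.get(state, ()):
            c = cost + w
            if c < nxt.get(dest, float("inf")):  # keep best token per state
                nxt[dest] = c
    return nxt
```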
Signal Processing Systems | 2013
Jungwook Choi; Eric P. Kim; Rob A. Rutenbar; Naresh R. Shanbhag
Message-passing-based inference algorithms have immense importance in real-world applications. In this paper, the error resilience of a message-passing-based Markov random field (MRF) stereo matching architecture is explored and enhanced through the application of algorithmic noise tolerance (ANT) in order to cope with nanometer imperfections in post-silicon devices. We first explore the inherent robustness of iteration-based MRF inference algorithms. Analysis and simulations show that for a 20-bit architecture, small errors (e ≤ 1024) are tolerable, while large errors (e ≥ 4096) degrade performance significantly. Based on these error characteristics, we propose an ANT architecture to effectively compensate for large-magnitude circuit errors. Introducing timing errors via voltage over-scaling (VOS), experimental results show that the proposed ANT-based hardware can tolerate an error rate of 21.3%, with performance degradation of only 0.47% at a gate complexity overhead of 44.7%, compared to an error-free full-precision hardware, with energy savings of 41%.
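The ANT principle pairs the exact (but error-prone, since voltage over-scaled) main block with a cheap reduced-precision estimator that is kept error-free; when the two disagree by more than a threshold, the main output is presumed corrupted and the estimator's value is used instead. A minimal sketch of that decision rule (threshold and names are illustrative, chosen to echo the paper's small-vs-large error boundary):

```python
def ant_correct(main_out, estimator_out, threshold=1024):
    """Algorithmic noise tolerance (ANT) output selection (sketch).

    main_out: output of the full-precision block, possibly hit by a
        large timing error under voltage over-scaling.
    estimator_out: output of a low-precision, error-free estimator.
    Small discrepancies (within the threshold) pass through untouched;
    large ones are replaced by the estimator's value.
    """
    if abs(main_out - estimator_out) > threshold:
        return estimator_out   # large error: fall back to the estimator
    return main_out            # small error: tolerable, keep precision
```

This matches the paper's observation that small errors (e ≤ 1024) are inherently tolerable while large ones (e ≥ 4096) must be compensated.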
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2009
Young-kyu Choi; Kisun You; Jungwook Choi; Wonyong Sung
We have developed a VLSI chip for 5,000 word speaker-independent continuous speech recognition. This chip employs a context-dependent HMM (hidden Markov model) based speech recognition algorithm, and contains emission probability and Viterbi beam search pipelined hardware units. The feature vector for speech recognition is computed using a host processor in software in order to adopt various enhancement algorithms. The amount of internal SRAM size is minimized by moving data out to the external DRAM, and a custom DRAM controller module is designed to efficiently read and write consecutive data. The experimental result shows that the implemented system has a real-time factor of 0.77 and 0.55 using SDRAM and DDR SDRAM, respectively.
Signal Processing Systems | 2011
Kisun You; Youngkyu Choi; Jungwook Choi; Wonyong Sung
We have developed a VLSI chip with reduced memory access for 5,000-word speaker-independent continuous speech recognition. This chip employs a context-dependent HMM (hidden Markov model)-based speech recognition algorithm and contains parallel and pipelined hardware units for emission probability computation and Viterbi beam search. To maximize performance, we adopted several memory access reduction techniques such as sub-vector clustering and multi-block processing for the emission probability computation. We also employed a custom DRAM controller for efficient access of consecutive data. Moreover, we analyzed the data access pattern to minimize the internal SRAM size while maintaining high performance. The experimental results show that the implemented system performs speech recognition 2.4 and 1.8 times faster than real time utilizing 32-bit DDR SDRAM and SDR SDRAM, respectively.
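Sub-vector clustering cuts memory traffic by splitting the feature vector into short sub-vectors and replacing each with the index of its nearest codeword, so emission probabilities can be served from small precomputed tables instead of streaming full Gaussian parameters. A NumPy sketch of the quantization step (codebook layout is an illustrative assumption, not the chip's exact scheme):

```python
import numpy as np

def subvector_quantize(feature, codebooks):
    """Sub-vector clustering of a feature vector (illustrative sketch).

    feature: (D,) feature vector, split into equal-length sub-vectors.
    codebooks: list of (K, d) arrays; codebooks[i] holds the codewords
        for the i-th sub-vector. Returns one codeword index per
        sub-vector; downstream lookups are indexed by these codes.
    """
    dim = codebooks[0].shape[1]
    codes = []
    for i, cb in enumerate(codebooks):
        sub = feature[i * dim:(i + 1) * dim]
        # nearest codeword by squared Euclidean distance
        codes.append(int(np.argmin(((cb - sub) ** 2).sum(axis=1))))
    return codes
```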
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA) | 2013
Jungwook Choi; Rob A. Rutenbar
We demonstrate a video-rate stereo matching system implemented on a hybrid CPU+FPGA platform (Convey HC-1). Emerging applications such as 3D gesture recognition and automotive navigation demand fast and high quality stereo vision. We describe a custom hardware-accelerated Markov Random Field inference system for this task. Starting from a core architecture for streaming tree-reweighted message passing (TRW-S) inference, we describe the end-to-end system engineering needed to move from this single frame message update to full stereo video. We partition the stereo matching procedure across the CPU and the FPGAs, and apply both function-level pipelining and frame-level parallelism to achieve the required speed. Experimental results show that our system achieves a speed of 12 frames per second for challenging video stereo matching tasks. We note that this appears to be the first implementation of TRW-S inference at video rates, and that our system is also significantly faster than several recent GPU implementations of similar stereo inference methods based on belief propagation (BP).
International Conference on Field-Programmable Logic and Applications (FPL) | 2015
Skand Hurkat; Jungwook Choi; Eriko Nurvitadhi; Jose F. Martinez; Rob A. Rutenbar
Maximum a posteriori probability (MAP) inference on Markov random fields (MRFs) is the basis of many computer vision applications. Sequential tree-reweighted message passing (TRW-S) has been shown to provide very good inference quality and strong convergence properties. However, software TRW-S solvers are slow due to the algorithm's high computational requirements. A state-of-the-art FPGA implementation has been developed recently, which delivers substantial speedup over software. In this paper, we improve upon the TRW-S algorithm by using a multi-level hierarchical MRF formulation. We demonstrate the benefits of Hierarchical-TRW-S over TRW-S, and incorporate the proposed improvements on a Convey HC-1 CPU-FPGA hybrid platform. Results using four Middlebury stereo vision benchmarks show a 21% to 53% reduction in inference time compared with the state-of-the-art TRW-S FPGA implementation. To the best of our knowledge, this is the fastest hardware implementation of TRW-S reported so far.
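The essential idea of a multi-level hierarchical MRF formulation is coarse-to-fine initialization: messages converged at a downsampled level are upsampled to warm-start the next-finer level, so the expensive fine-level solve needs far fewer sweeps. A minimal NumPy sketch of the upsampling step (nearest-neighbor replication is an illustrative choice, not necessarily the paper's exact transfer operator):

```python
import numpy as np

def coarse_to_fine_init(coarse_msgs, scale=2):
    """Upsample converged coarse-level messages to initialize the next
    finer MRF level (illustrative sketch of the hierarchical idea).

    coarse_msgs: (H, W, L) array of per-pixel message/cost vectors over
        L labels. Each coarse value is replicated into a scale x scale
        block of fine-level pixels as a warm start.
    """
    return np.repeat(np.repeat(coarse_msgs, scale, axis=0), scale, axis=1)
```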
IEEE Transactions on Circuits and Systems for Video Technology | 2016
Jungwook Choi; Rob A. Rutenbar
We demonstrate a video-rate stereo matching system implemented on a hybrid CPU+field-programmable gate array (FPGA) platform (Convey HC-1). Stereo matching is a fundamental problem of computer vision, and emerging applications, such as 3-D gesture recognition and automotive navigation, demand fast and high-quality stereo matching. Markov random field (MRF)-based approaches are widely used, but conventional software solvers are slow. Belief propagation (BP) solvers, which use patterns of local message passing on MRFs, have been studied in hardware, but their performance is unreliable. We show how a superior method, sequential tree-reweighted message passing (TRW-S), can be rendered in hardware. TRW-S has reliable convergence, guaranteed by its so-called sequential computation. Analysis reveals many opportunities for TRW-S hardware acceleration. Starting from the core architecture for streaming TRW-S, we describe the end-to-end system engineering for full video stereo matching. We partition the stereo matching procedure across the CPU and the FPGAs and apply frame-level optimizations, such as message reuse based on scene change detection, frame-level parallelization, and function-level pipelining. The experimental results show that our system achieves a speed of 22.8 frames/s for a challenging QVGA video stereo matching task. We note that our system is significantly faster than several recent GPU/ASIC implementations of similar stereo inference methods based on BP.
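Message reuse based on scene change detection exploits temporal coherence: if the new frame closely resembles the previous one, the converged messages from the last frame serve as a warm start, and only a scene cut forces a cold restart. A small NumPy sketch of that decision (the mean-absolute-difference detector and its threshold are illustrative assumptions, not the paper's exact detector):

```python
import numpy as np

def reuse_or_reset(prev_msgs, prev_frame, frame, threshold=5.0):
    """Frame-level message reuse with scene change detection (sketch).

    prev_msgs: messages converged on the previous frame.
    prev_frame, frame: grayscale images as 2-D arrays.
    If mean absolute pixel difference is below the threshold, reuse the
    old messages as a warm start; otherwise reset them for a cold start.
    """
    diff = np.mean(np.abs(frame.astype(float) - prev_frame.astype(float)))
    if diff < threshold:
        return prev_msgs               # no scene change: warm start
    return np.zeros_like(prev_msgs)    # scene change: start from scratch
```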
2016 IEEE International Conference on Rebooting Computing (ICRC) | 2016
Ankur Agrawal; Jungwook Choi; Kailash Gopalakrishnan; Suyog Gupta; Ravi Nair; Jinwook Oh; Daniel A. Prener; Sunil Shukla; Vijayalakshmi Srinivasan; Zehra Sura
Approximate computing is gaining traction as a computing paradigm for data analytics and cognitive applications that aim to extract deep insight from vast quantities of data. In this paper, we demonstrate that multiple approximation techniques can be applied to applications in these domains and can be combined to compound their benefits. In assessing the potential of approximation in these applications, we took the liberty of changing multiple layers of the system stack: architecture, programming model, and algorithms. Across a set of applications spanning the domains of DSP, robotics, and machine learning, we show that hot loops in the applications can be perforated by an average of 50% with a proportional reduction in execution time, while still producing acceptable quality of results. In addition, the width of the data used in the computation can be reduced to 10-16 bits from the currently common 32/64 bits, with potential for significant performance and energy benefits. For parallel applications, we reduced execution time by 50% using relaxed synchronization mechanisms. Finally, our results demonstrate that these benefits compound when the techniques are applied concurrently, making approximate computing a widely applicable paradigm. To exploit these benefits, it is essential to rethink multiple layers of the system stack to embrace approximation from the ground up and to design tightly integrated approximate accelerators. Doing so will enable moving the applications into a world in which the architecture, programming model, and even the algorithms used to implement the application are all fundamentally designed for approximate computing.
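Loop perforation, the first technique mentioned above, skips a fraction of a hot loop's iterations and compensates in the result, trading bounded accuracy for a proportional drop in work. A toy Python sketch of perforating a reduction loop (the rescaling compensation is an illustrative choice; real perforation operates on the application's own loops):

```python
def perforated_sum(data, skip=2):
    """Loop perforation applied to a summation loop (toy sketch).

    Executes only every `skip`-th iteration and rescales the partial
    result to approximate the full sum. With skip=2, roughly half the
    iterations run, mirroring the ~50% perforation rate reported for
    hot loops while output quality stays acceptable.
    """
    partial = sum(data[::skip])   # visit every skip-th element only
    return partial * skip         # compensate for the skipped work
```

On smooth data the rescaled result tracks the exact sum closely; on adversarial data the error grows, which is why perforation is gated by an acceptable-quality check.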