Jike Chong
University of California, Berkeley
Publication
Featured research published by Jike Chong.
International Conference on Multimedia and Expo | 2007
Jike Chong; Nadathur Satish; Bryan Catanzaro; Kaushik Ravindran; Kurt Keutzer
The H.264 decoder has a sequential, control-intensive front end that makes it difficult to leverage the potential performance of emerging manycore processors. Preparsing is a functional parallelization technique to resolve this front-end bottleneck. However, the resulting parallel macroblock (MB) rendering tasks have highly input-dependent execution times and precedence constraints, which make them difficult to schedule efficiently on manycore processors. To address these issues, we propose a two-step approach: (i) a custom preparsing technique to resolve control dependencies in the input stream and expose MB-level data parallelism, and (ii) an MB-level scheduling technique to allocate and load-balance MB rendering tasks. The run-time MB-level scheduling increases the efficiency of parallel execution in the rest of the H.264 decoder, providing a 60% speedup over greedy dynamic scheduling and a 9-15% speedup over static compile-time scheduling for more than four processors. The preparsing technique coupled with run-time MB-level scheduling enables a potential 7× speedup for H.264 decoding.
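To make the scheduling problem concrete, here is a minimal sketch of dependency-aware list scheduling for MB rendering tasks. It is not the paper's algorithm: the dependency rule (each MB waits on its left and upper-right neighbors, the usual H.264 wavefront) is standard, but the greedy earliest-free-worker policy and the `duration` cost model are illustrative assumptions.

```python
import heapq

def schedule_macroblocks(width, height, duration, num_procs):
    """Greedy list scheduling of macroblock (MB) rendering tasks.

    duration(x, y) models the input-dependent cost of MB (x, y). Each MB
    depends on its left and upper-right neighbors, which produces the
    diagonal wavefront of ready MBs familiar from H.264 decoding.
    """
    def deps(x, y):
        d = []
        if x > 0:
            d.append((x - 1, y))        # left neighbor
        if y > 0 and x < width - 1:
            d.append((x + 1, y - 1))    # upper-right neighbor
        return d

    remaining = {(x, y): len(deps(x, y))
                 for x in range(width) for y in range(height)}
    ready = [mb for mb, n in remaining.items() if n == 0]   # just (0, 0)
    workers = [(0.0, p) for p in range(num_procs)]          # (free time, id)
    heapq.heapify(workers)
    finish = {}

    while ready:
        mb = ready.pop()
        free_at, proc = heapq.heappop(workers)
        start = max([free_at] + [finish[d] for d in deps(*mb)])
        finish[mb] = start + duration(*mb)
        heapq.heappush(workers, (finish[mb], proc))
        # Release successors whose dependencies are now all satisfied.
        for s in ((mb[0] + 1, mb[1]), (mb[0] - 1, mb[1] + 1)):
            if s in remaining:
                remaining[s] -= 1
                if remaining[s] == 0:
                    ready.append(s)
    return max(finish.values())  # makespan of the simulated schedule
```

Calling `schedule_macroblocks(120, 68, lambda x, y: 1.0, 8)` simulates a 1080p frame on eight workers; with input-dependent durations, schedules of this greedy flavor and static compile-time schedules are the baselines the paper's run-time MB-level scheduler is measured against.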
IEEE Signal Processing Magazine | 2009
Kisun You; Jike Chong; Youngmin Yi; Ekaterina Gonina; Christopher J. Hughes; Yen-Kuang Chen; Wonyong Sung; Kurt Keutzer
We propose four application-level implementation alternatives called algorithm styles and construct highly optimized implementations on two parallel platforms: an Intel Core i7 multicore processor and an NVIDIA GTX280 manycore processor. The highest-performing algorithm style varies with the implementation platform. On a 44-minute speech data set, we demonstrate substantial speedups of 3.4× on the Core i7 and 10.5× on the GTX280 compared to a highly optimized sequential implementation on the Core i7, without sacrificing accuracy. The parallel implementations contain less than 2.5% sequential overhead, promising scalability and significant potential for further speedup on future platforms.
International Conference on Multimedia and Expo | 2009
Jike Chong; Kisun You; Youngmin Yi; Ekaterina Gonina; Christopher J. Hughes; Wonyong Sung; Kurt Keutzer
Parallel scalability allows an application to efficiently utilize an increasing number of processing elements. In this paper we explore a design space for application scalability for an inference engine in large vocabulary continuous speech recognition (LVCSR). Our implementation of the inference engine involves a parallel graph traversal through an irregular graph-based knowledge network with millions of states and arcs. The challenge is not only to define a software architecture that exposes sufficient fine-grained application concurrency, but also to efficiently synchronize between an increasing number of concurrent tasks and to effectively utilize the parallelism opportunities in today's highly parallel processors. We propose four application-level implementation alternatives we call “algorithm styles”, and construct highly optimized implementations on two parallel platforms: an Intel Core i7 multicore processor and an NVIDIA GTX280 manycore processor. The highest-performing algorithm style varies with the implementation platform. On a 44-minute speech data set, we demonstrate substantial speedups of 3.4× on the Core i7 and 10.5× on the GTX280 compared to a highly optimized sequential implementation on the Core i7, without sacrificing accuracy. The parallel implementations contain less than 2.5% sequential overhead, promising scalability and significant potential for further speedup on future platforms.
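The core of such an inference engine is a frame-synchronous traversal that expands an active set of states through the knowledge network and prunes with a beam. The sketch below is a sequential reference under assumed data structures (a dict-based active set, per-state arc lists, and an `observation_score` callback); the paper's algorithm styles differ precisely in how the per-state max-reduction and active-set management are parallelized.

```python
from collections import defaultdict

def traversal_step(active, arcs, observation_score, beam=10.0):
    """One frame-synchronous step of graph traversal with beam pruning.

    active: dict state -> accumulated log score for the current frame.
    arcs:   dict state -> list of (next_state, transition_logp, label).
    In a parallel implementation, the per-state max below is the
    conflicting write that must be resolved with atomics or privatization.
    """
    next_active = defaultdict(lambda: float("-inf"))
    for state, score in active.items():          # parallel over active states
        for nxt, trans, label in arcs.get(state, ()):
            cand = score + trans + observation_score(label)
            if cand > next_active[nxt]:
                next_active[nxt] = cand
    if not next_active:
        return {}
    best = max(next_active.values())
    return {s: v for s, v in next_active.items() if v > best - beam}
```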
High Performance Computational Finance | 2009
Matthew Francis Dixon; Jike Chong; Kurt Keutzer
The proliferation of algorithmic trading, derivative usage, and highly leveraged hedge funds necessitates the acceleration of market Value-at-Risk (VaR) estimation to measure the severity of portfolio losses. This paper demonstrates how relying solely on advances in computer hardware to accelerate market VaR estimation overlooks significant opportunities for acceleration. We use a simulation-based delta-gamma VaR estimate and compute the loss function using basic linear algebra subroutines (BLAS). Our NVIDIA GeForce GTX280 graphics processing unit (GPU) based baseline implementation is a straightforward port from the CPU implementation and had only an 8.21x speed advantage over a quad-core Intel Core2 Q9300 central processing unit (CPU) based implementation. We demonstrate three approaches to gain additional speedup over the baseline GPU implementation. First, we reformulated the loss function to reduce the amount of necessary computation, achieving a 60.3x speedup. Second, we selected functionally equivalent distribution conversion modules to give the best convergence rate, providing an additional 2x speedup. Third, we merged data-parallel computational kernels to remove redundant load/store operations, leading to an additional 1.85x speedup. Overall, we achieved a speedup of 148x against the baseline GPU implementation, reducing the time of a VaR estimation with a standard error of 0.1% from minutes to less than one second.
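For reference, here is a minimal NumPy sketch of the simulation-based delta-gamma VaR estimate described above. The delta-gamma approximation of the portfolio loss for a risk-factor move Δx is L = -(δᵀΔx + ½ΔxᵀΓΔx); everything else (names, the Cholesky-based distribution conversion, the quantile estimator) is an illustrative assumption, not the paper's BLAS formulation.

```python
import numpy as np

def delta_gamma_var(delta, gamma, cov, alpha=0.99, n_paths=100_000, seed=0):
    """Monte Carlo delta-gamma Value-at-Risk sketch.

    delta: (d,) first-order sensitivities; gamma: (d, d) second-order
    sensitivities; cov: (d, d) covariance of risk-factor moves over the
    horizon. Returns the alpha-quantile of simulated portfolio losses.
    """
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(cov)                   # distribution conversion:
    z = rng.standard_normal((n_paths, len(delta)))   # N(0, I) -> N(0, cov)
    dx = z @ chol.T
    quad = np.einsum("ij,jk,ik->i", dx, gamma, dx)   # per-path quadratic form
    losses = -(dx @ delta + 0.5 * quad)
    return np.quantile(losses, alpha)

# e.g. delta_gamma_var(np.array([1.0, -0.5]),
#                      np.array([[0.2, 0.0], [0.0, 0.1]]),
#                      np.array([[0.04, 0.01], [0.01, 0.09]]))
```

The per-path quadratic form is where the paper's loss-function reformulation saves work, and the choice of distribution conversion module (here, a Cholesky factor applied to standard normals) is the knob behind the reported 2x convergence gain.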
International Symposium on Multimedia | 2010
Gerald Friedland; Jike Chong; Adam Janin
The following article presents an application for browsing meeting recordings by speaker and keyword, which we call the Meeting Diarist. The goal of the system is to enable browsing of the content with rich metadata in a graphical user interface shortly after the end of a meeting, even when the application runs on a contemporary laptop. We therefore developed novel parallel methods for speaker diarization and multi-hypothesis speech recognition that are optimized to run on multicore and manycore architectures. This paper presents the underlying parallel speaker diarization and speech recognition realizations, a comparison of results based on NIST RT07 evaluation data, and a description of the final application.
GPU Computing Gems Emerald Edition | 2011
Jike Chong; Ekaterina Gonina; Kurt Keutzer
This chapter provides an understanding of specific implementation challenges when working with the speech inference process, weighted finite state transducer (WFST) based methods, and the Viterbi algorithm. It illustrates an efficient reference implementation on the GPU that can be productively customized to meet the needs of specific usage scenarios. Automatic speech recognition (ASR) allows multimedia content to be transcribed from acoustic waveforms to word sequences. This technology is emerging as a critical component in data analytics for the wealth of media data being generated every day. ASR is a challenging application to parallelize: on the GPU, an efficient implementation of ASR involves resolving a series of implementation challenges specific to the data-parallel architecture of the platform. There are efficient solutions for resolving the implementation challenges of speech recognition on the GPU that achieve more than an order of magnitude speedup compared to sequential execution. This chapter identifies and resolves four types of algorithmic challenges encountered in the implementation of speech recognition on GPUs. The techniques presented here, when used together, are capable of delivering a 10.6× speedup for this challenging application compared to an optimized sequential implementation on the CPU. Such an application framework provides an optimized infrastructure that incorporates all the techniques discussed in this chapter to allow efficient execution of the speech inference process on the GPU.
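As background for the WFST and Viterbi discussion, here is a compact sequential Viterbi decoder over a WFST-style arc table. It is a reference sketch, not the chapter's GPU implementation: epsilon arcs are omitted, and the data layout (a dict of arc lists, an `emit_logp` scoring callback) is an assumption for illustration. On the GPU, the inner loop over active arcs is evaluated in parallel and the per-destination max is resolved with atomic operations.

```python
import math

def viterbi(frames, arcs, start_state, emit_logp):
    """Viterbi decoding over a WFST given as arcs: state ->
    [(next_state, input_label, output_label, weight)].

    emit_logp(frame, input_label) scores the acoustic observation.
    Returns the best-path output (word) sequence.
    """
    scores = {start_state: 0.0}
    back = []                                    # per-frame backpointers
    for frame in frames:
        nxt, choice = {}, {}
        for state, s in scores.items():
            for dst, ilabel, olabel, w in arcs.get(state, ()):
                cand = s + w + emit_logp(frame, ilabel)
                if cand > nxt.get(dst, -math.inf):
                    nxt[dst] = cand
                    choice[dst] = (state, olabel)
        back.append(choice)
        scores = nxt
    if not scores:
        return []
    state = max(scores, key=scores.get)          # best final state
    words = []
    for choice in reversed(back):                # backtrace
        state, olabel = choice[state]
        if olabel is not None:
            words.append(olabel)
    return list(reversed(words))
```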
Proceedings of the 2010 International Workshop on Searching Spontaneous Conversational Speech | 2010
Gerald Friedland; Jike Chong; Adam Janin
The following article presents an application for browsing meeting recordings by speaker, keyword, and pre-defined acoustic events (e.g., laughter), which we call the Meeting Diarist. The goal of the system is to enable browsing of the content with rich metadata in a graphical user interface (GUI) shortly after the end of a meeting, even when the application runs on a contemporary laptop. We therefore developed novel parallel methods for speaker diarization and speech recognition that are optimized to run on multicore and manycore architectures. This paper presents the application and the underlying parallel speaker diarization and speech recognition realizations.
Concurrency and Computation: Practice and Experience | 2012
Matthew Francis Dixon; Jike Chong; Kurt Keutzer
Values of portfolios in modern financial markets may change precipitously with changing market conditions. The utility of financial risk management tools depends on whether they can estimate the Value-at-Risk (VaR) of portfolios on demand when key decisions need to be made. However, VaR estimation of portfolios uses the Monte Carlo method, a computationally intensive approach often run as an overnight batch job. With the proliferation of highly parallel computing platforms such as multicore CPUs and manycore graphics processing units (GPUs), teraFLOPS of computational capability are now available on a desktop computer, enabling the VaR of large portfolios with thousands of risk factors to be computed within a fraction of a second.
Multiprocessor System-on-Chip | 2011
Michael J. Anderson; Bryan Catanzaro; Jike Chong; Ekaterina Gonina; Kurt Keutzer; Chao-Yue Lai; Mark Murphy; Bor-Yiing Su; Narayanan Sundaram
Parallel programming using the current state of the art in software engineering techniques is hard. Expertise in parallel programming is necessary to deliver good performance in applications; however, domain experts commonly lack this expertise. In order to drive computer science research toward effectively using the available parallel hardware platforms, it is important to make parallel programming systematic and productive. We believe that the key to designing parallel programs in a systematic way is software architecture, and the key to improving the productivity of developing parallel programs is software frameworks. The basis of both is design patterns and a pattern language.
Proceedings of the 2010 Workshop on Parallel Programming Patterns | 2010
Jike Chong; Ekaterina Gonina; Kurt Keutzer
Monte Carlo methods are an important class of algorithms in computer science. They estimate results by statistically sampling a parameter space with thousands to millions of experiments. The algorithm requires a small set of parameters as input, from which it generates a large amount of computation, and outputs a concise set of aggregated results. The large amount of computation has many independent components with obvious boundaries for parallelization. While the algorithm is well suited to execution on a highly parallel computing platform, many challenges remain, such as: selecting a random number generator with the appropriate statistical and computational properties, selecting a distribution conversion method that preserves the statistical properties of the random sequences, leveraging the right abstraction for the computation in the experiments, and designing efficient data structures for a particular working set. This paper presents the Monte Carlo Methods software programming pattern and focuses on the numerical, task, and data perspectives to guide software developers in constructing efficient implementations of applications based on Monte Carlo methods.
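A minimal sketch of the pattern's shape, with the stages the paper names called out in comments: random number generation, distribution conversion (Box-Muller here, one of several options), independent experiments, and aggregation. The example estimates a European call option price; the payoff and parameter names are illustrative assumptions, not from the paper.

```python
import math
import random

def box_muller(u1, u2):
    """Distribution conversion: two uniforms -> two independent N(0, 1) draws."""
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

def mc_call_price(s0, strike, rate, vol, horizon, n_experiments, seed=42):
    """Monte Carlo pattern skeleton: a small parameter set in, many
    independent experiments, one aggregated result (a European call price).
    n_experiments is assumed even (two normals per Box-Muller call)."""
    rng = random.Random(seed)                  # stage 1: random number generation
    total = 0.0
    for _ in range(0, n_experiments, 2):
        # 1 - random() keeps u1 in (0, 1], avoiding log(0) in Box-Muller.
        z1, z2 = box_muller(1.0 - rng.random(), rng.random())  # stage 2: conversion
        for z in (z1, z2):                     # stage 3: independent experiments
            st = s0 * math.exp((rate - 0.5 * vol**2) * horizon
                               + vol * math.sqrt(horizon) * z)
            total += max(st - strike, 0.0)
    # stage 4: aggregation (discounted mean payoff)
    return math.exp(-rate * horizon) * total / n_experiments

# e.g. mc_call_price(s0=100, strike=105, rate=0.02, vol=0.3, horizon=1.0,
#                    n_experiments=200_000)
```

Because each experiment is independent, stage 3 is the natural parallelization boundary; the RNG and conversion choices in stages 1 and 2 are exactly the statistical-quality concerns the pattern highlights.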