Kyuyeon Hwang
Seoul National University
Publications
Featured research published by Kyuyeon Hwang.
Signal Processing Systems | 2014
Kyuyeon Hwang; Wonyong Sung
Feedforward deep neural networks that employ multiple hidden layers show high performance in many applications, but they demand complex hardware for implementation. The hardware complexity can be lowered substantially by minimizing the word-length of weights and signals, but direct quantization for fixed-point network design does not yield good results. We optimize the fixed-point design by employing backpropagation-based retraining. The designed fixed-point networks with ternary weights (+1, 0, and -1) and 3-bit signals show only negligible performance loss when compared to the floating-point counterparts. The backpropagation for retraining uses quantized weights and fixed-point signals to compute the output, but utilizes high-precision values for adapting the networks. Character recognition and phoneme recognition examples are presented.
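The retraining trick is easy to picture in code: the forward pass uses the ternarized weights, while the gradient update is applied to a high-precision master copy. Below is a minimal NumPy sketch of this idea on a toy linear layer; the layer sizes, data, learning rate, quantization step, and the delta/2 threshold are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def ternarize(w, delta):
    """Quantize weights to {-delta, 0, +delta} by thresholding at delta/2."""
    q = np.zeros_like(w)
    q[w > delta / 2] = delta
    q[w < -delta / 2] = -delta
    return q

rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (16, 8))     # high-precision "master" weights
x = rng.normal(0, 1, (32, 16))      # a toy batch of inputs
t = rng.normal(0, 1, (32, 8))       # toy regression targets
lr, delta = 0.01, 0.05              # illustrative step size and quantization step

for step in range(100):
    Wq = ternarize(W, delta)        # forward pass uses quantized weights...
    y = x @ Wq
    grad = x.T @ (y - t) / len(x)   # gradient of the mean-squared error
    W -= lr * grad                  # ...but the update adapts the
                                    # high-precision master weights
```

Keeping the high-precision copy lets many small gradient steps accumulate before a weight crosses a quantization threshold, which is why retraining recovers most of the accuracy that direct quantization loses.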
International Conference on Acoustics, Speech, and Signal Processing | 2015
Sajid Anwar; Kyuyeon Hwang; Wonyong Sung
Deep convolutional neural networks have shown promising results in image and speech recognition applications. The learning capability of the network improves with increasing depth and size of each layer. However, this capability comes at the cost of increased computational complexity. Thus, reductions in hardware complexity and faster classification are highly desired. This work proposes an optimization method for fixed-point deep convolutional neural networks. The parameters of a pretrained high-precision network are first directly quantized using L2 error minimization. We quantize each layer one by one, while the other layers keep computing in high precision, to determine the layer-wise sensitivity to word-length reduction. Then the network is retrained with quantized weights. Two object recognition examples, MNIST and CIFAR-10, are presented. Our results indicate that quantization induces sparsity in the network, which reduces the effective number of network parameters and improves generalization. This work reduces the required memory storage to one tenth and achieves better classification results than the high-precision networks.
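As a rough illustration of the direct quantization step, one can pick the uniform quantizer step size that minimizes the L2 error against the pretrained weights by a simple grid search. The sketch below is a hedged stand-in, assuming a symmetric uniform quantizer and an arbitrary search grid; the paper's exact minimization procedure may differ.

```python
import numpy as np

def quantize(w, step, levels):
    """Symmetric uniform quantizer with the given step size."""
    half = (levels - 1) // 2
    return np.clip(np.round(w / step), -half, half) * step

def best_step(w, levels, candidates):
    """Pick the step size that minimizes the L2 quantization error."""
    errs = [np.sum((w - quantize(w, s, levels)) ** 2) for s in candidates]
    return candidates[int(np.argmin(errs))]

rng = np.random.default_rng(1)
w = rng.normal(0, 0.1, 4096)                    # pretrained weights of one layer
grid = np.linspace(1e-3, 0.2, 200)              # assumed search grid for the step
step = best_step(w, levels=7, candidates=grid)  # 7 levels, e.g. 3-bit signed
print(step, np.sum((w - quantize(w, step, 7)) ** 2))
```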
ACM Journal on Emerging Technologies in Computing Systems | 2017
Sajid Anwar; Kyuyeon Hwang; Wonyong Sung
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation effort but also do not map well onto parallel computation. We introduce structured sparsity at various scales for convolutional neural networks: feature-map-wise, kernel-wise, and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, in parallel computing environments, and in hardware-based systems. To determine the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by assessing the misclassification rate with the corresponding connectivity pattern. The pruned network is retrained to compensate for the losses due to pruning. When convolutions are implemented as matrix products, we show in particular that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of the kernel and feature map tensors. The proposed work shows that when pruning granularities are applied in combination, we can prune the CIFAR-10 network by more than 70% with less than a 1% loss in accuracy.
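Of the three granularities, feature-map-wise sparsity is the easiest to see paying off: removing whole output maps shrinks the weight tensor itself, with no index bookkeeping. The NumPy sketch below uses a simple L2-norm saliency as a stand-in scoring rule; the paper instead evaluates connectivity patterns with a particle filter, so the criterion here is only illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
# Conv weights laid out as (out_maps, in_maps, kh, kw).
W = rng.normal(0, 0.1, (64, 32, 3, 3))

# Feature-map-wise pruning: zero out whole output maps at once, so the
# surviving tensor stays dense and maps directly onto smaller matrices.
saliency = np.sum(W ** 2, axis=(1, 2, 3))      # stand-in criterion (L2 norm);
keep = saliency >= np.quantile(saliency, 0.7)  # the paper scores connectivity
W_pruned = W[keep]                             # patterns with a particle filter
print(W.shape, "->", W_pruned.shape)           # (64, 32, 3, 3) -> (~19, 32, 3, 3)
```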
International Conference on Acoustics, Speech, and Signal Processing | 2014
Jonghong Kim; Kyuyeon Hwang; Wonyong Sung
Deep neural networks show very good performance in phoneme and speech recognition applications compared to the previously used GMM (Gaussian Mixture Model)-based ones. However, efficient implementation of deep neural networks is difficult because the network size needs to be very large when high recognition accuracy is demanded. In this work, we develop a digital VLSI circuit for phoneme recognition using deep neural networks and assess the design in terms of throughput, chip size, and power consumption. The developed VLSI employs a fixed-point optimization method that uses only +Δ, 0, and -Δ for representing each of the weights. The design employs 1,024 simple processing units in each layer, which can easily be scaled according to the needed throughput; the throughput of the architecture ranges from 62.5 to 1,000 times the real-time processing speed.
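The appeal of the +Δ, 0, -Δ representation for VLSI is that each multiply-accumulate degenerates into a conditional add/subtract followed by a single scaling by Δ, so the processing units need no multipliers. A toy NumPy check of this equivalence, with made-up values:

```python
import numpy as np

def ternary_mac(x, codes, delta):
    """Dot product with ternary weights: adds/subtracts plus one final scale."""
    acc = x[codes == 1].sum() - x[codes == -1].sum()  # no multipliers needed
    return delta * acc                                # single scaling by delta

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 1024)          # toy input activations
codes = rng.integers(-1, 2, 1024)   # weights stored as 2-bit {-1, 0, +1} codes
delta = 0.05                        # shared magnitude for the layer
print(ternary_mac(x, codes, delta), np.dot(x, delta * codes))  # agree up to rounding
```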
IEEE Transactions on Signal Processing | 2013
Kyuyeon Hwang; Wonyong Sung
The application of particle filters to real-time systems is often limited by their computational complexity, and hence the use of graphics processing units (GPUs), which contain hundreds of processing elements on a chip, is very promising. However, parallel implementations of particle filters with state-of-the-art systematic resampling on a GPU suffer from a severe workload imbalance problem, which causes the computation speed to fluctuate and hinders their application to real-time systems. We analyze the computational load imbalance of the systematic resampling method in conventional implementations, and show that the workload imbalance is proportional to the variance of the weights in particle filters. We then propose a load-balanced particle replication (LBPR) algorithm for systematic resampling, which shows almost constant execution speed and outperforms the conventional algorithm in terms of worst-case computation time. The proposed algorithm has been implemented on an NVIDIA GTX580 GPU.
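To see where the imbalance comes from, consider standard systematic resampling: each particle's replication count depends on its weight, so a few heavy particles force their owning threads to write many copies while most threads write none. The sketch below computes replication counts for a deliberately high-variance weight set; it reproduces only the imbalance that LBPR targets, not the LBPR algorithm itself.

```python
import numpy as np

def systematic_resample(weights, rng):
    """Standard systematic resampling: returns each particle's replication
    count (how many copies it contributes to the next generation)."""
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n   # n evenly spaced probes
    cumsum = np.cumsum(weights)
    idx = np.searchsorted(cumsum, positions)        # which particle each probe hits
    return np.bincount(idx, minlength=n)

rng = np.random.default_rng(4)
w = rng.exponential(1.0, 1024) ** 3   # deliberately high-variance weights
w /= w.sum()
counts = systematic_resample(w, rng)
# On a GPU, the thread owning the heaviest particle must write counts.max()
# copies while most threads write zero; the gap grows with weight variance.
print(counts.max(), counts.mean())
```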
Signal Processing Systems | 2016
Minjae Lee; Kyuyeon Hwang; Jinhwan Park; Sungwook Choi; Sungho Shin; Wonyong Sung
In this paper, a neural network based real-time speech recognition (SR) system is developed using an FPGA for very low-power operation. The implemented system employs two recurrent neural networks (RNNs): one is a speech-to-character RNN for acoustic modeling (AM), and the other is for character-level language modeling (LM). The system also employs a statistical word-level LM to improve the recognition accuracy. The results of the AM, the character-level LM, and the word-level LM are combined using a fairly simple N-best search algorithm instead of a hidden Markov model (HMM) based network. The RNNs are implemented using massively parallel processing elements (PEs) for low latency and high throughput. The weights are quantized to 6 bits so that all of them can be stored in the on-chip memory of an FPGA. The proposed algorithm is implemented on a Xilinx XC7Z045, and the system can operate much faster than real time.
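As a loose illustration of how the three scores can be combined, the toy function below rescores an N-best list by interpolating an acoustic-model log-probability with a language-model score and a length bonus. The hypotheses, the stand-in LM, and the interpolation weights alpha and beta are all hypothetical; the paper's actual search is more involved.

```python
def nbest_rescore(hypotheses, lm_score, alpha=0.5, beta=0.1):
    """Toy N-best rescoring: combine each hypothesis's acoustic-model score
    with a language-model score and a length bonus, then take the best.
    alpha and beta are illustrative interpolation weights."""
    rescored = [(am + alpha * lm_score(text) + beta * len(text), text)
                for am, text in hypotheses]
    return max(rescored)

# Hypothetical AM outputs: (log-probability, transcript) pairs.
hyps = [(-12.3, "hello word"), (-12.9, "hello world")]
lm = lambda s: 0.0 if "world" in s else -5.0   # stand-in character-LM score
print(nbest_rescore(hyps, lm))                 # the LM rescues "hello world"
```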
International Conference on Acoustics, Speech, and Signal Processing | 2016
Sungho Shin; Kyuyeon Hwang; Wonyong Sung
Recurrent neural networks have shown excellent performance in many applications; however, they require high complexity in hardware- or software-based implementations. The hardware complexity can be lowered substantially by minimizing the word-length of weights and signals. This work analyzes the fixed-point performance of recurrent neural networks using a retrain-based quantization method. The quantization sensitivity of each layer in RNNs is studied, and overall fixed-point optimization results that minimize the weight capacity without sacrificing performance are presented. Language modeling and phoneme recognition examples are used.
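A direct way to probe the per-layer quantization sensitivity mentioned above is to quantize one weight matrix at a time, keep every other layer in floating point, and record the resulting loss. A hedged NumPy sketch, where the layers dict and the evaluate function are toy placeholders for a real RNN's weights and validation routine:

```python
import numpy as np

def quantize(w, bits):
    """Symmetric uniform quantization of one weight matrix to `bits` bits."""
    step = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / step) * step

def layer_sensitivity(layers, evaluate, bits):
    """Quantize one layer at a time, leaving the rest in float, and report
    the resulting loss: a direct probe of per-layer sensitivity."""
    results = {}
    for name in layers:
        trial = dict(layers)               # shallow copy; swap one layer only
        trial[name] = quantize(layers[name], bits)
        results[name] = evaluate(trial)
    return results

rng = np.random.default_rng(5)
layers = {"W_xh": rng.normal(0, 0.1, (64, 32)),    # toy stand-ins for the
          "W_hh": rng.normal(0, 0.1, (64, 64))}    # RNN's weight matrices
evaluate = lambda ls: float(sum(np.sum(w ** 2) for w in ls.values()))
print(layer_sensitivity(layers, evaluate, bits=3))
```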
International Conference on Acoustics, Speech, and Signal Processing | 2015
Kyuyeon Hwang; Wonyong Sung
Recurrent neural networks (RNNs) have shown outstanding performance on processing sequence data. However, they suffer from long training times, which demand parallel implementations of the training procedure. Parallelizing the training algorithms for RNNs is very challenging because internal recurrent paths form dependencies between two different time frames. In this paper, we first propose a generalized graph-based RNN structure that covers the most popular long short-term memory (LSTM) network. Then, we present a parallelization approach that automatically explores the parallelisms of arbitrary RNNs by analyzing the graph structure. The experimental results show that the proposed approach achieves a significant speed-up even with a single training stream, and further accelerates the training when combined with multiple parallel training streams.
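One concrete parallelism that such a graph analysis can expose: in a recurrent layer, only the hidden-to-hidden term depends on the previous time frame, so the input projections for all frames can be batched into a single large matrix product. The sketch below shows this by hand for a plain tanh RNN; the paper's method discovers such opportunities automatically for arbitrary RNN graphs, including LSTMs.

```python
import numpy as np

rng = np.random.default_rng(6)
T, D, H = 100, 32, 64
x = rng.normal(0, 1, (T, D))          # one input sequence, T frames
W_x = rng.normal(0, 0.1, (D, H))      # input-to-hidden weights
W_h = rng.normal(0, 0.1, (H, H))      # hidden-to-hidden (recurrent) weights

# The input projection has no time dependency, so all T frames can be
# computed in one large, parallel-friendly matrix product...
z = x @ W_x

# ...while only the recurrent term must run frame by frame, because h[t]
# depends on h[t-1] through the recurrent path.
h = np.zeros(H)
for t in range(T):
    h = np.tanh(z[t] + h @ W_h)
```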
International Conference on Acoustics, Speech, and Signal Processing | 2014
Minjae Lee; Kyuyeon Hwang; Wonyong Sung
As the homeostasis of nervous systems suggests, artificial neural networks are considered to be robust to variations in circuit components and interconnection faults. However, the tolerance of neural networks depends on many factors, such as the fault model, the network size, and the training method. In this study, we analyze the fault tolerance of fixed-point feedforward deep neural networks for implementation in CMOS digital VLSI. Circuit errors caused by the interconnections as well as the processing units are considered. In addition to the conventional and dropout training methods, we develop a new technique that randomly disconnects weights during training to increase the error resiliency. Feedforward deep neural networks for phoneme recognition are employed for the experiments.
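The random-disconnection training amounts to masking a fresh random subset of weights at every update step, in the spirit of DropConnect. A minimal NumPy sketch on a toy linear layer, with an assumed disconnection probability; the paper's exact fault model and training schedule are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(7)
W = rng.normal(0, 0.1, (16, 8))   # toy linear layer
x = rng.normal(0, 1, (32, 16))    # toy batch of inputs
t = rng.normal(0, 1, (32, 8))     # toy regression targets
lr, p_drop = 0.01, 0.2            # illustrative disconnection probability

for step in range(100):
    mask = rng.random(W.shape) >= p_drop   # fresh random mask per step:
    y = x @ (W * mask)                     # some weights are "disconnected"
    grad = x.T @ (y - t) / len(x)
    W -= lr * grad * mask                  # only the surviving weights adapt
```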
International Conference on Acoustics, Speech, and Signal Processing | 2017
Kyuyeon Hwang; Wonyong Sung
Recurrent neural network (RNN) based character-level language models (CLMs) are inherently well suited to modeling out-of-vocabulary words. However, their performance is generally much worse than that of word-level language models (WLMs), since CLMs need to consider a longer history of tokens to properly predict the next one. We address this problem by proposing hierarchical RNN architectures, which consist of multiple modules with different timescales. Despite the multi-timescale structure, the input and output layers operate with the character-level clock, which allows existing RNN CLM training approaches to be applied directly without any modification. Our CLMs show better perplexity than Kneser-Ney (KN) 5-gram WLMs on the One Billion Word Benchmark with only 2% of the parameters. We also present real-time character-level end-to-end speech recognition examples on the Wall Street Journal (WSJ) corpus, where replacing traditional mono-clock RNN CLMs with the proposed models yields better recognition accuracies even though the number of parameters is reduced to 30%.
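A minimal way to picture the multi-timescale idea: a fast module ticks at every character, while a slow module updates only once every N characters and feeds its state back into the fast one, carrying longer-range context without lengthening the fast recurrence. The wiring, sizes, and clock period N below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(8)
D, Hf, Hs, N = 32, 64, 64, 8          # N: assumed slow-module clock period
W_xf = rng.normal(0, 0.1, (D, Hf))    # fast (character-clock) module input
W_ff = rng.normal(0, 0.1, (Hf, Hf))   # fast recurrence
W_fs = rng.normal(0, 0.1, (Hf, Hs))   # fast-to-slow connection
W_ss = rng.normal(0, 0.1, (Hs, Hs))   # slow (coarse-clock) recurrence
W_sf = rng.normal(0, 0.1, (Hs, Hf))   # slow state feeds the fast module

h_f, h_s = np.zeros(Hf), np.zeros(Hs)
x = rng.normal(0, 1, (100, D))        # toy character embeddings
for t in range(len(x)):
    # The fast module ticks every character and sees the slow state.
    h_f = np.tanh(x[t] @ W_xf + h_f @ W_ff + h_s @ W_sf)
    # The slow module ticks once every N characters, so it can carry
    # longer-range context at a fraction of the update cost.
    if (t + 1) % N == 0:
        h_s = np.tanh(h_f @ W_fs + h_s @ W_ss)
```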