Eric Q. Li
Intel
Publications
Featured research published by Eric Q. Li.
Journal of Visual Communication and Image Representation | 2006
Yen-Kuang Chen; Eric Q. Li; Xiaosong Zhou; Steven Ge
H.264 is an emerging video coding standard that aims at compressing high-quality video content at low bit rates. While its encoding and decoding processes are similar to those of many previous standards, the new standard includes a number of new features and thus requires much more computation than most existing standards do. The complexity of the H.264 standard poses significant challenges to implementing the encoder/decoder in real time via software on personal computers. This work analyzes software implementations of the H.264 encoder and decoder on general-purpose processors with media instructions and multi-threading capabilities. Specifically, we discuss how to optimize the algorithms of H.264 encoders and decoders on Intel Pentium 4 processors. We first analyze the reference implementation to identify the time-consuming modules, and present optimization methods using media instructions to improve the speed of these modules. After appropriate optimizations, the speed of the codec improves by more than 3×. Nonetheless, the H.264 encoder is still too complicated to be implemented in real time on a single processor. Thus, we also study how to partition the H.264 encoder into multiple threads, which can then be run on systems with multiple processors or multi-threading capabilities. We analyze several multi-threading schemes with different quality/performance trade-offs, and propose a scheme with both good scalability (i.e., speed) and good quality. Our encoder obtains another 3.8× speedup on a four-processor system, or 4.6× on a four-processor system with Hyper-Threading Technology. This work demonstrates that hardware-specific algorithm modifications can speed up the H.264 decoder and encoder substantially. The performance improvement techniques on modern microprocessors demonstrated in this work apply not only to H.264 but also to other video and multimedia processing applications.
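As a flavor of the media-instruction optimizations the abstract refers to, the sketch below computes the sum of absolute differences (SAD) for one 16x16 macroblock, a core motion-estimation kernel, using SSE2 intrinsics available on the Pentium 4. This is an illustrative sketch, not the paper's code; the function name and layout assumptions (8-bit luma samples, row stride in bytes) are ours.

    #include <emmintrin.h>
    #include <stdint.h>

    // Illustrative SSE2 SAD kernel for one 16x16 macroblock.
    static int sad16x16_sse2(const uint8_t *cur, const uint8_t *ref, int stride)
    {
        __m128i acc = _mm_setzero_si128();
        for (int y = 0; y < 16; ++y) {
            __m128i c = _mm_loadu_si128((const __m128i *)(cur + y * stride));
            __m128i r = _mm_loadu_si128((const __m128i *)(ref + y * stride));
            // PSADBW yields two partial sums in the low/high 64-bit halves.
            acc = _mm_add_epi64(acc, _mm_sad_epu8(c, r));
        }
        // Fold the two 64-bit partial sums into a single scalar SAD.
        acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
        return _mm_cvtsi128_si32(acc);
    }

One such call replaces 256 scalar subtract/absolute/accumulate operations, which is where multi-x speedups in kernels of this kind typically come from.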
Conference on Image and Video Communications and Processing | 2003
Xiaosong Zhou; Eric Q. Li; Yen-Kuang Chen
As emerging video coding standards, e.g. H.264, aim at high-quality video content at low bit rates, the encoding and decoding processes require much more computation than most existing standards do. This paper analyzes a software implementation of a real-time H.264 decoder on general-purpose processors with media instructions. Specifically, we discuss how to optimize the speed of H.264 decoders on Intel Pentium 4 processors. The paper first analyzes the reference implementation to identify the time-consuming modules. Our study shows that a number of components, e.g., motion compensation and the inverse integer transform, are the most time-consuming modules in the H.264 decoder. Second, we present a list of performance optimization methods using media instructions to improve the efficiency of these modules. After appropriate optimizations, the decoder speed improved by more than 3x: it can decode a 720×480 resolution video sequence at 48 frames per second on a 2.4 GHz Intel Pentium 4 processor, compared to the reference software's 12 frames per second. The optimization techniques demonstrated in this paper can also be applied to other video/image processing applications. Additionally, after presenting detailed application behavior on general-purpose processors, the paper offers a few recommendations on how to design efficient and powerful future video/image applications and standards given these hardware implications.
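To make the inverse-integer-transform hot spot concrete, here is the standard H.264 4x4 inverse transform butterfly in plain C++. The butterfly itself is defined by the standard; the function name and in-place layout are our choices for illustration, and an optimized decoder would express the same butterfly with media instructions across several blocks at once.

    #include <stdint.h>

    // H.264 4x4 inverse integer transform (standard butterfly), in place.
    static void idct4x4_h264(int16_t blk[4][4])
    {
        // Horizontal pass over each row.
        for (int i = 0; i < 4; ++i) {
            int e0 = blk[i][0] + blk[i][2];
            int e1 = blk[i][0] - blk[i][2];
            int e2 = (blk[i][1] >> 1) - blk[i][3];
            int e3 = blk[i][1] + (blk[i][3] >> 1);
            blk[i][0] = (int16_t)(e0 + e3);
            blk[i][1] = (int16_t)(e1 + e2);
            blk[i][2] = (int16_t)(e1 - e2);
            blk[i][3] = (int16_t)(e0 - e3);
        }
        // Vertical pass over each column, with the final rounding shift.
        for (int j = 0; j < 4; ++j) {
            int e0 = blk[0][j] + blk[2][j];
            int e1 = blk[0][j] - blk[2][j];
            int e2 = (blk[1][j] >> 1) - blk[3][j];
            int e3 = blk[1][j] + (blk[3][j] >> 1);
            blk[0][j] = (int16_t)((e0 + e3 + 32) >> 6);
            blk[1][j] = (int16_t)((e1 + e2 + 32) >> 6);
            blk[2][j] = (int16_t)((e1 - e2 + 32) >> 6);
            blk[3][j] = (int16_t)((e0 - e3 + 32) >> 6);
        }
    }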
International Conference on Acoustics, Speech, and Signal Processing | 2004
Xiang Li; Eric Q. Li; Yen-Kuang Chen
In the new H.264/AVC video coding standard, motion estimation takes up a significant share of encoding time, especially when using the straightforward full search algorithm (FS). A fast, flexible multi-frame motion estimation algorithm with adaptive search strategies (FMASS) is presented. With special consideration of the multiple reference frames and block modes, several techniques, i.e., adaptive search strategies for a single frame and flexible multi-frame selection, are employed to significantly improve encoding speed in H.264. Extensive simulations show that it reduces the number of matching points by a factor of more than 1190 compared with FS, while the output quality of the encoded sequences drops by only 0.053 dB in PSNR on average. This large speedup with unnoticeable quality loss makes the proposed algorithm outperform most other well-known recently proposed algorithms, such as ARPS3, MVFAST, and UMHexagonS, the latter two of which have already been adopted by MPEG-4 and JVT, respectively.
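FMASS itself is not reproduced here, but the sketch below shows the general shape of a fast block-matching search with early termination, the family of techniques such algorithms refine. The small-diamond pattern, the threshold, and all names are illustrative assumptions, not the paper's algorithm.

    #include <cstdlib>

    struct MV { int x, y; };

    // Plain-C++ SAD at motion vector `mv`; a real encoder would use a SIMD
    // version such as the one sketched earlier in this list.
    static int sad16x16(const unsigned char *cur, const unsigned char *ref,
                        int stride, MV mv)
    {
        int sum = 0;
        for (int y = 0; y < 16; ++y)
            for (int x = 0; x < 16; ++x)
                sum += std::abs(cur[y * stride + x] -
                                ref[(y + mv.y) * stride + (x + mv.x)]);
        return sum;
    }

    // Refine a predicted motion vector with a small diamond pattern, stopping
    // early once the cost falls below a threshold: the basic idea behind
    // adaptive search strategies (bounds checking omitted for brevity).
    static MV refine_mv(const unsigned char *cur, const unsigned char *ref,
                        int stride, MV pred, int early_exit_thresh)
    {
        static const MV kDiamond[4] = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
        MV best = pred;
        int best_cost = sad16x16(cur, ref, stride, best);
        bool improved = true;
        while (improved && best_cost > early_exit_thresh) {
            improved = false;
            for (const MV &d : kDiamond) {
                MV cand = {best.x + d.x, best.y + d.y};
                int cost = sad16x16(cur, ref, stride, cand);
                if (cost < best_cost) {
                    best_cost = cost; best = cand; improved = true;
                }
            }
        }
        return best;
    }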
Computer Vision and Pattern Recognition | 2010
Jianguo Li; Eric Q. Li; Yurong Chen; Lin Xu; Yimin Zhang
Depth-map merging is one typical technique category for multi-view stereo (MVS) reconstruction. To guarantee accuracy, existing algorithms usually require either sub-pixel-level stereo matching precision or continuous depth-map estimation; merging inaccurate depth-maps remains a challenging problem. This paper introduces a bundle optimization method for robust and accurate depth-map merging. In this method, depth-maps are generated using the DAISY feature, followed by two stages of bundle optimization: the first stage optimizes the tracks of connected stereo matches to generate initial 3D points, and the second stage optimizes the positions and normals of the 3D points. The resulting high-quality point cloud is then meshed into geometric models. The proposed method is easily parallelized on multi-core processors. The Middlebury evaluation shows that it is one of the most efficient methods among non-GPU algorithms while maintaining very high accuracy. We also demonstrate the effectiveness of the proposed algorithm on various real-world, high-resolution, self-calibrated data sets, including objects with complex details, objects with large highlighted areas, and objects with non-Lambertian surfaces.
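One common ingredient of depth-map merging, shown below as an illustrative sketch rather than the paper's bundle optimization, is a cross-view consistency test: a 3D point is kept only if a neighboring view's own depth map agrees with the point's depth when it is reprojected into that view. The pinhole-camera struct, names, and tolerance are all assumptions for illustration.

    #include <cmath>

    struct Camera {
        double R[3][3], t[3];   // world-to-camera rotation and translation
        double fx, fy, cx, cy;  // pinhole intrinsics
        int width, height;
        const float *depth;     // row-major depth map, one value per pixel
    };

    // Keep a world point only if this neighboring view's depth map agrees
    // with the point's reprojected depth within a relative tolerance.
    static bool depth_consistent(const double p[3], const Camera &cam,
                                 double rel_tol = 0.01)
    {
        double q[3];  // point in the neighbor camera's frame
        for (int i = 0; i < 3; ++i)
            q[i] = cam.R[i][0] * p[0] + cam.R[i][1] * p[1] +
                   cam.R[i][2] * p[2] + cam.t[i];
        if (q[2] <= 0.0) return false;  // behind the camera
        int u = (int)std::lround(cam.fx * q[0] / q[2] + cam.cx);
        int v = (int)std::lround(cam.fy * q[1] / q[2] + cam.cy);
        if (u < 0 || u >= cam.width || v < 0 || v >= cam.height) return false;
        return std::fabs(cam.depth[v * cam.width + u] - q[2]) < rel_tol * q[2];
    }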
IEEE International Symposium on Workload Characterization | 2008
Hao Feng; Eric Q. Li; Yurong Chen; Yimin Zhang
This paper parallelizes and characterizes an important computer vision application, the Scale Invariant Feature Transform (SIFT), on both a Symmetric Multiprocessor (SMP) platform and a large-scale Chip Multiprocessor (CMP) simulator. SIFT is a widely applied approach for extracting distinctive invariant features from images. Many computer vision problems require real-time or even super-real-time SIFT processing. To meet this computational demand, we optimize and parallelize SIFT to accelerate its execution on multi-core systems. Our study shows that SIFT can achieve a 9.7x~11x speedup on a 16-core SMP system. Furthermore, Single Instruction Multiple Data (SIMD) and cache-conscious optimizations bring up to another 85% performance gain. Even so, it is still three times slower than the real-time requirement for High-Definition Television (HDTV) images. We then study the performance of SIFT on a 64-core CMP simulator. The results show that for HDTV images, SIFT achieves an excellent 52x speedup and finally runs in real time. Beyond the parallelization and optimization work, we also conduct a detailed performance analysis of SIFT on the two platforms. We find that load imbalance significantly limits scalability and that SIFT suffers from bursty memory bandwidth demands on the 16-core SMP system. On the 64-core CMP simulator, however, memory pressure is not high, because the shared last-level cache (LLC) accommodates the tremendous read-write sharing in SIFT, so it does not affect scaling performance. In short, understanding the characteristics of SIFT helps identify program bottlenecks and gives further insight into designing better systems.
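The load imbalance the paper identifies is the classic motivation for dynamic work distribution; the sketch below shows the idea with OpenMP over SIFT keypoints, whose per-item cost varies. The types and the placeholder descriptor routine are our own illustrative stand-ins, not the paper's code.

    #include <vector>

    struct Keypoint   { float x, y, scale, orientation; };
    struct Descriptor { float v[128]; };

    // Placeholder for SIFT's gradient-histogram step; the real routine's
    // cost varies per keypoint, which is what causes load imbalance.
    static Descriptor compute_descriptor(const Keypoint &kp)
    {
        Descriptor d = {};
        d.v[0] = kp.scale;
        return d;
    }

    static std::vector<Descriptor> describe_all(const std::vector<Keypoint> &kps)
    {
        std::vector<Descriptor> out(kps.size());
        // Dynamic scheduling hands out small chunks at runtime, so threads
        // that finish cheap keypoints pick up more work instead of idling.
        #pragma omp parallel for schedule(dynamic, 16)
        for (long i = 0; i < (long)kps.size(); ++i)
            out[i] = compute_descriptor(kps[i]);
        return out;
    }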
Visual Communications and Image Processing | 2004
Eric Q. Li; Yen-Kuang Chen
H.264 is an emerging video coding standard that aims at compressing high-quality video content at low bit rates. While its encoding and decoding processes are similar to those of many previous standards, the new standard includes a number of new features and thus requires much more computation than most existing standards do. The complexity of the H.264 standard poses significant challenges to implementing the encoder/decoder in real time via software on personal computers. Even after a 2~3x performance improvement from media instructions on modern general-purpose processors and another 2~4x improvement from algorithmic optimization, the H.264 encoder is still too complicated to be implemented in real time on a single processor. Based on a detailed analysis of the opportunities for parallelism in the H.264 encoder, we propose an efficient multithreaded implementation of the H.264 video encoder. To guarantee sufficient concurrency across the whole system, an elaborate macroblock- and inter-frame-parallel scheduling scheme is presented. In addition, our macroblock-based multithreading scheme incurs almost no video quality loss, in contrast to other parallelization schemes. Our results show that the multithreaded encoder can obtain another 3.96x speedup on a four-processor system, or 4.6x on a four-processor system with Hyper-Threading Technology. The techniques demonstrated in this work can be applied not only to H.264 but also to other video/image coding and decoding applications on personal computers.
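The macroblock-level parallelism described above rests on H.264's dependency pattern: a macroblock needs its left, top, top-left, and top-right neighbors, so every macroblock on the anti-diagonal x + 2y = k is independent of the others on that diagonal. The sketch below shows this well-known 2D-wave schedule with OpenMP; encode_mb() is a placeholder for the real per-macroblock work, and the paper's actual scheduler (which also overlaps frames) is more elaborate.

    #include <cstdio>

    static void encode_mb(int x, int y)
    {
        // Stand-in for intra/inter prediction, transform, and entropy coding.
        std::printf("MB(%d,%d)\n", x, y);
    }

    static void encode_frame_wavefront(int mb_cols, int mb_rows)
    {
        // Each value of k = x + 2y defines one wave of independent macroblocks:
        // all of MB(x, y)'s neighbors lie on earlier waves (k-1, k-2, or k-3).
        for (int k = 0; k <= (mb_cols - 1) + 2 * (mb_rows - 1); ++k) {
            #pragma omp parallel for
            for (int y = 0; y <= k / 2; ++y) {
                int x = k - 2 * y;
                if (y < mb_rows && x >= 0 && x < mb_cols)
                    encode_mb(x, y);
            }
        }
    }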
International Conference on Image Processing | 2010
Lin Xu; Eric Q. Li; Jianguo Li; Yurong Chen; Yimin Zhang
This paper presents a general texture mapping framework for image-based 3D modeling. It aims to generate a seamless texture map for a 3D model created from real-world photos taken in uncontrolled environments. Our proposed method addresses two challenging problems: 1) texture discontinuity due to systematic error in 3D modeling from self-calibration; and 2) color/lighting differences among images due to the uncontrolled real-world environments. The framework contains two stages to resolve these problems. The first stage globally optimizes the registration of texture patches and triangle faces with a Markov Random Field (MRF) to produce an optimal texture mosaic. The second stage performs local radiometric correction to adjust color differences between texture patches, and then blends texture boundaries to improve color continuity. The proposed method is evaluated on several 3D models produced by image-based 3D modeling and demonstrates promising results.
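As a sketch of what a local radiometric correction can look like (the paper does not spell out its exact model, so the linear gain/bias fit below is an assumption for illustration): each patch is fitted with a gain and a bias by least squares over boundary samples it shares with an already-corrected neighbor, and the fit is applied before blending.

    #include <cstddef>
    #include <vector>

    struct GainBias { double gain, bias; };

    // src[i] and dst[i] are intensities of the same boundary sample, seen in
    // the patch being corrected and in its already-corrected neighbor.
    static GainBias fit_gain_bias(const std::vector<double> &src,
                                  const std::vector<double> &dst)
    {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        const std::size_t n = src.size();
        for (std::size_t i = 0; i < n; ++i) {
            sx += src[i]; sy += dst[i];
            sxx += src[i] * src[i]; sxy += src[i] * dst[i];
        }
        double denom = n * sxx - sx * sx;
        if (denom == 0.0) return {1.0, 0.0};   // degenerate: leave unchanged
        double g = (n * sxy - sx * sy) / denom;  // least-squares slope (gain)
        return {g, (sy - g * sx) / n};           // and intercept (bias)
    }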
International Parallel and Distributed Processing Symposium | 2006
Xi Deng; Eric Q. Li; Jiulong Shan; Wenguang Chen
Multiple sequence alignment is a fundamental and computationally intensive task in molecular biology. MUSCLE, a new algorithm for creating multiple alignments of protein sequences, ranks highest in accuracy and is the fastest compared with ClustalW and T-Coffee, two widely used multiple sequence alignment tools. To further accelerate the computation, we present a parallel implementation of MUSCLE in this paper. The algorithm is decomposed into several independent modules, which are parallelized with different OpenMP paradigms. We also conduct a detailed performance characterization on symmetric multiprocessor systems. The experiments show that MUSCLE scales well as processors are added, achieving up to a 15x speedup on a 16-way shared-memory multiprocessor system.
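For a flavor of the OpenMP decomposition, the sketch below parallelizes the kind of all-pairs distance stage that progressive aligners like MUSCLE begin with; every pair is independent, so the loop parallelizes directly. The distance routine is a trivial stand-in, not MUSCLE's k-mer distance.

    #include <string>
    #include <vector>

    // Placeholder distance; a real implementation counts shared k-mers.
    static double kmer_distance(const std::string &a, const std::string &b)
    {
        return a.size() > b.size() ? double(a.size() - b.size())
                                   : double(b.size() - a.size());
    }

    static std::vector<double> pairwise_distances(const std::vector<std::string> &seqs)
    {
        const long n = (long)seqs.size();
        std::vector<double> dist(n * n, 0.0);
        // Flatten the (i, j) pair loop so OpenMP can balance the triangular
        // workload; each unordered pair is written by exactly one thread.
        #pragma omp parallel for schedule(dynamic)
        for (long p = 0; p < n * n; ++p) {
            long i = p / n, j = p % n;
            if (i < j)
                dist[i * n + j] = dist[j * n + i] = kmer_distance(seqs[i], seqs[j]);
        }
        return dist;
    }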
IEEE International Symposium on Workload Characterization | 2005
Uma Srinivasan; Peng-Sheng Chen; Qian Diao; Chu-Cheow Lim; Eric Q. Li; Yongjian Chen; Roy Ju; Yimin Zhang
Bioinformatics applications constitute an emerging data-intensive, high-performance computing (HPC) domain. While there is much research on algorithmic improvements, the actual performance of an application also depends on how well the program maps to the target hardware. This paper presents a performance study of two parallel bioinformatics applications, HMMER (sequence alignment) and SVM-RFE (gene expression analysis), on Intel x86-based, hyperthread-capable, shared-memory multiprocessor systems. The performance characteristics varied with the application and the target hardware. For instance, HMMER is compute intensive and showed better scalability on a 3.0 GHz system than on a 2.2 GHz system, whereas SVM-RFE is memory intensive and showed better absolute performance on the 2.2 GHz machine, which has higher memory bandwidth. Performance is also affected by processor features such as hyperthreading (HT) and prefetching. With HMMER, enabling HT delivered ~75% of the benefit of doubling the number of CPUs. While load-balancing optimizations can provide a ~30% speedup for HMMER on a hyperthreading-enabled system, the load balancing has to adapt to the target number of processors and threads; SVM-RFE benefits differently from the same load-balancing and thread-scheduling tuning. We conclude that compiler and runtime optimizations play an important role in achieving the best performance for a given bioinformatics algorithm.
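The observation that load balancing must adapt to the processor and thread count can be made concrete with a small OpenMP sketch: derive the chunk size from the runtime thread count, so the same loop stays balanced whether or not Hyper-Threading doubles the available contexts. The granularity heuristic and the scoring stub are illustrative assumptions, not the paper's code.

    #include <omp.h>
    #include <cmath>
    #include <vector>

    // Stand-in for per-sequence scoring work (e.g., an HMM evaluation).
    static double score_one(long i) { return std::sqrt((double)i); }

    static void score_sequences(std::vector<double> &scores)
    {
        const long n = (long)scores.size();
        #pragma omp parallel
        {
            // Chunk size shrinks as more threads (or HT contexts) appear,
            // keeping every context busy near the end of the loop.
            long chunk = n / (8 * omp_get_num_threads()) + 1;
            #pragma omp for schedule(dynamic, chunk)
            for (long i = 0; i < n; ++i)
                scores[i] = score_one(i);
        }
    }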
International Conference on Multimedia and Expo | 2012
Eric Q. Li; Bin Wang; Liu Yang; Ya-Ti Peng; Yangzhou Du; Yimin Zhang; Yi-Jen Chiu
With the inclusion of GPU cores on the same die as the CPU, the performance of Intel's processor graphics has improved significantly over earlier generations of integrated graphics, and the need to efficiently harness the computational power of this on-die GPU is greater than ever. This paper presents a highly optimized Haar-based face detector that efficiently exploits both CPU and GPU computing power on the latest Sandy Bridge processor. The classification procedure of the Haar-based cascade detector is partitioned into two phases in order to leverage both thread-level and data-level parallelism on the GPU, while image downscaling and integral image calculation run on the CPU cores in parallel with the GPU. Compared to a CPU-only implementation, our experiments show that the proposed GPU-accelerated implementation achieves a 3.07x speedup with more than 50% power reduction on the latest Sandy Bridge processor. Our implementation is also more efficient than a CUDA implementation on the NVIDIA GT430 card in terms of both performance and power. In addition, our proposed method presents a general approach for partitioning tasks between CPU and GPU, and is thus beneficial not only for face detection but also for other multimedia and computer vision techniques.
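The CPU-side integral image mentioned above is what makes Haar features cheap: once ii(x, y) holds the sum of all pixels above and to the left, any rectangle sum costs four lookups. A plain C++ sketch follows; it illustrates the standard technique, not the paper's implementation.

    #include <cstdint>
    #include <vector>

    // Build an inclusive integral image: ii[y*w + x] = sum of img over
    // the rectangle [0..x] x [0..y].
    static std::vector<uint32_t> integral_image(const uint8_t *img, int w, int h)
    {
        std::vector<uint32_t> ii(w * h);
        for (int y = 0; y < h; ++y) {
            uint32_t row_sum = 0;
            for (int x = 0; x < w; ++x) {
                row_sum += img[y * w + x];
                ii[y * w + x] = row_sum + (y > 0 ? ii[(y - 1) * w + x] : 0);
            }
        }
        return ii;
    }

    static uint32_t at(const std::vector<uint32_t> &ii, int w, int x, int y)
    {
        return (x < 0 || y < 0) ? 0u : ii[y * w + x];
    }

    // Sum over the inclusive rectangle [x0..x1] x [y0..y1] via four lookups,
    // which is all a Haar feature evaluation needs per rectangle.
    static uint32_t rect_sum(const std::vector<uint32_t> &ii, int w,
                             int x0, int y0, int x1, int y1)
    {
        return at(ii, w, x1, y1) - at(ii, w, x0 - 1, y1)
             - at(ii, w, x1, y0 - 1) + at(ii, w, x0 - 1, y0 - 1);
    }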