Network


Latest external collaborations at the country level.

Hotspot


Research topics in which Ce Guo is active.

Publications


Featured research published by Ce Guo.


Field-Programmable Technology | 2012

A fully-pipelined expectation-maximization engine for Gaussian Mixture Models

Ce Guo; Haohuan Fu; Wayne Luk

Gaussian Mixture Models (GMMs) are powerful tools for probability density modeling and soft clustering. They are widely used in data mining, signal processing and computer vision. In many applications, we need to estimate the parameters of a GMM from data before working with it. This task can be handled by the Expectation-Maximization algorithm for Gaussian Mixture Models (EM-GMM), which is computationally demanding. In this paper we present our FPGA-based solution for the EM-GMM algorithm. We propose a pipeline-friendly EM-GMM algorithm, a variant of the original EM-GMM algorithm that can be converted to a fully-pipelined hardware architecture. To further improve the performance, we design a Gaussian probability density function evaluation unit that works with fixed-point arithmetic. In the experiments, our FPGA-based solution generates fairly accurate results while achieving a maximum of 517 times speedup over a CPU-based solution, and 28 times speedup over a GPU-based solution.
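The computation being accelerated is the standard EM procedure for GMMs. As a point of reference, a minimal NumPy sketch of one EM iteration is given below; it is a plain software illustration, not the paper's pipelined fixed-point design, and all names in it are chosen for illustration.

```python
import numpy as np

def em_gmm_step(X, weights, means, covs):
    """One EM iteration for a Gaussian Mixture Model (illustrative sketch).

    X: (n, d) data; weights: (k,); means: (k, d); covs: (k, d, d).
    """
    n, d = X.shape
    k = len(weights)

    # E-step: responsibilities resp[i, j] = P(component j | x_i)
    resp = np.zeros((n, k))
    for j in range(k):
        diff = X - means[j]
        inv = np.linalg.inv(covs[j])
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(covs[j]))
        expo = -0.5 * np.einsum('ni,ij,nj->n', diff, inv, diff)
        resp[:, j] = weights[j] * norm * np.exp(expo)
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: re-estimate mixture parameters from the responsibilities
    nk = resp.sum(axis=0)
    weights = nk / n
    means = (resp.T @ X) / nk[:, None]
    covs = np.stack([
        ((resp[:, j, None] * (X - means[j])).T @ (X - means[j])) / nk[j]
        for j in range(k)
    ])
    return weights, means, covs
```

The per-data-point work inside the E-step is the part that the paper's pipeline-friendly variant restructures so it can be converted into a fully-pipelined hardware architecture.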


Field-Programmable Custom Computing Machines | 2015

Pipelined Genetic Propagation

Liucheng Guo; Ce Guo; David B. Thomas; Wayne Luk

Genetic Algorithms (GAs) are a class of numerical and combinatorial optimisers which are especially useful for solving complex non-linear and non-convex problems. However, the required execution time often limits their application to small-scale or latency-insensitive problems, so techniques to increase the computational efficiency of GAs are needed. FPGA-based acceleration has significant potential for speeding up genetic algorithms, but existing FPGA GAs are limited by the generational approaches inherited from software GAs. Many parts of the generational approach do not map well to hardware, such as the large shared population memory and intrinsic loop-carried dependency. To address this problem, this paper proposes a new hardware-oriented approach to GAs, called Pipelined Genetic Propagation (PGP), which is intrinsically distributed and pipelined. PGP represents a GA solver as a graph of loosely coupled genetic operators, which allows the solution to be scaled to the available resources, and also to dynamically change topology at run-time to explore different solution strategies. Experiments show that pipelined genetic propagation is effective in solving seven different applications. Our PGP design is 5 times faster than a recent FPGA-based GA system, and 90 times faster than a CPU-based GA system.
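The departure from the generational loop is the key idea: a GA is expressed as a graph of genetic operator nodes through which individuals stream continuously. The toy Python sketch below emulates such a loosely coupled operator ring in software; it is illustrative only (the paper's design is a hardware pipeline), and the fitness function and constants are placeholders.

```python
import random
from collections import deque

def fitness(x):
    # Placeholder objective: maximise the negated sphere function
    return -sum(v * v for v in x)

def select(pool, queue):
    # Selection node: tournament between a streamed individual and a pool member
    cand = queue.popleft()
    rival = random.choice(pool)
    return cand if fitness(cand) >= fitness(rival) else rival

def crossover(a, b):
    # Crossover node: single-point crossover of two parents
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(x, rate=0.1):
    # Mutation node: perturb each gene with a small probability
    return [v + random.gauss(0, 0.1) if random.random() < rate else v for v in x]

# Operators form a ring; individuals circulate through the queue instead of
# being processed generation by generation from a shared population memory.
dim = 4
pool = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(16)]
queue = deque(pool)
for _ in range(2000):
    parent_a = select(pool, queue)
    parent_b = random.choice(pool)
    child = mutate(crossover(parent_a, parent_b))
    queue.append(child)                          # child re-enters the pipeline
    pool[random.randrange(len(pool))] = child    # and replaces a pool member

print(max(pool, key=fitness))
```

In PGP the analogous operator nodes run concurrently in hardware, and the graph topology itself can be reconfigured at run-time.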


Field Programmable Logic and Applications | 2014

Automated framework for FPGA-based parallel genetic algorithms

Liucheng Guo; David B. Thomas; Ce Guo; Wayne Luk

Parallel genetic algorithms (pGAs) are a variant of genetic algorithms that promise substantial gains in both execution efficiency and quality of results. pGAs have attracted researchers to implement them in FPGAs, but such implementations typically require large manual effort. To simplify the implementation process and make hardware pGA designs accessible to non-expert users, this paper proposes a general-purpose framework which takes a high-level description of the optimisation target and automatically generates pGA designs for FPGAs. Our pGA system exploits the two levels of parallelism found in GA instances and genetic operations, allowing users to tailor the architecture to resource constraints at compile time. The framework also enables users to tune a subset of parameters at run time without time-consuming recompilation. Our pGA design is more flexible than previous ones, and achieves an average speedup of 26 times over multi-core counterparts across five combinatorial and numerical optimisation problems. Compared with a GPU, it also shows a 6.8 times speedup on a combinatorial application.


Field-Programmable Gate Arrays | 2014

Accelerating parameter estimation for multivariate self-exciting point processes

Ce Guo; Wayne Luk

Self-exciting point processes are stochastic processes that capture occurrence patterns of random events. They offer powerful tools to describe and predict the temporal distribution of random events such as stock trading and neuronal spiking. A critical calculation in self-exciting point process models is parameter estimation, which fits a model to a data set. This calculation is computationally demanding when the number of data points is large and the data dimension is high. This paper proposes the first reconfigurable computing solution to accelerate this calculation. We derive an acceleration strategy from a mathematical specification by eliminating complex data dependencies, reducing hardware resource requirements, and parallelising arithmetic operations. In our experimental evaluation, an FPGA-based implementation of the proposed solution is up to 79 times faster than one CPU core, and 13 times faster than the same CPU with eight cores.
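The abstract does not reproduce the model, but a standard parameterisation of a multivariate self-exciting (Hawkes) process with exponential kernels, which is the kind of model whose fitting is accelerated here, is shown below; the symbols are the usual textbook ones rather than the paper's notation.

```latex
% Conditional intensity of dimension m (exponential-kernel parameterisation):
%   \mu_m : baseline rate;  \alpha_{mn}, \beta_{mn} : excitation parameters;
%   t_k^n : the k-th event time observed in dimension n.
\lambda_m(t) = \mu_m
  + \sum_{n=1}^{D} \sum_{t_k^n < t} \alpha_{mn}\, e^{-\beta_{mn} (t - t_k^n)}

% Log-likelihood over an observation window [0, T]; its repeated evaluation
% dominates the cost of parameter estimation:
\log L(\theta) = \sum_{m=1}^{D} \Bigl[
    \sum_{k} \log \lambda_m\bigl(t_k^m\bigr) - \int_0^T \lambda_m(t)\, \mathrm{d}t
  \Bigr]
```

The nested sum over all past events is the source of the complex data dependency referred to in the abstract.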


Field Programmable Logic and Applications | 2013

Accelerating maximum likelihood estimation for Hawkes point processes

Ce Guo; Wayne Luk

Hawkes processes are point processes that can be used to build probabilistic models to describe and predict occurrence patterns of random events. They are widely used in high-frequency trading, seismic analysis and neuroscience. A critical numerical calculation in Hawkes process models is parameter estimation, which is used to fit a Hawkes process model to a data set. The parameter estimation problem can be solved by searching for a parameter set that maximises the log-likelihood. A core operation of this search process, the log-likelihood evaluation, is computationally demanding if the number of data points is large. To accelerate the computation, we present a log-likelihood evaluation strategy which is suitable for hardware acceleration. We then design and optimise a pipelined engine based on our proposed strategy. In the experiments, an FPGA-based implementation of the proposed engine is shown to be up to 72 times faster than a single-core CPU, and 10 times faster than an 8-core CPU.
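The paper's hardware-oriented evaluation strategy is not reproduced here, but the standard recursive evaluation of the exponential-kernel Hawkes log-likelihood, which reduces the naive quadratic cost to linear and is what makes this computation a good pipelining target, can be sketched as follows (illustrative; parameter names are generic).

```python
import math

def hawkes_loglik(times, mu, alpha, beta, T):
    """Log-likelihood of a univariate Hawkes process with exponential kernel.

    Uses the classic recursion A_i = exp(-beta * dt) * (1 + A_{i-1}), so each
    event is processed in O(1) time instead of summing over all past events.
    Illustrative sketch only.
    """
    loglik = 0.0
    A = 0.0
    prev_t = None
    for t in times:
        if prev_t is not None:
            A = math.exp(-beta * (t - prev_t)) * (1.0 + A)
        loglik += math.log(mu + alpha * A)   # log-intensity at each event
        prev_t = t
    # Compensator: integral of the intensity over the observation window [0, T]
    loglik -= mu * T
    loglik -= (alpha / beta) * sum(1.0 - math.exp(-beta * (T - t)) for t in times)
    return loglik
```

A maximum-likelihood fit then searches over (mu, alpha, beta) for the parameter set that maximises this value, which is the search process the abstract refers to.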


Field-Programmable Technology | 2014

Accelerating transfer entropy computation

Shengjia Shao; Ce Guo; Wayne Luk; Stephen Weston

Transfer entropy is a measure of information transfer between two time series. It is an asymmetric measure based on entropy change which only takes into account the statistical dependency originating in the source series, excluding dependency on a common external factor. Transfer entropy can capture system dynamics that traditional measures cannot, and has been successfully applied to areas such as neuroscience, bioinformatics, data mining and finance. As time series become longer and their resolution higher, computing transfer entropy becomes demanding. This paper presents the first reconfigurable computing solution to accelerate transfer entropy computation. The novel aspects of our approach include a new technique based on Laplace's Rule of Succession for probability estimation; a novel architecture with optimised memory allocation, bit-width narrowing and mixed-precision optimisation; and its implementation targeting a Xilinx Virtex-6 SX475T FPGA. In our experiments, the proposed FPGA-based solution is up to 111.47 times faster than one Xeon CPU core, and 18.69 times faster than a 6-core Xeon CPU.
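As a reference for the quantity being computed, a minimal sketch of transfer entropy for discretised series with first-order history, using Laplace's Rule of Succession (add-one smoothing) for the probability estimates, is given below. The function and variable names are illustrative, and the paper's hardware-oriented estimator is not reproduced here.

```python
from collections import Counter
from math import log2

def transfer_entropy(x, y, n_bins):
    """Transfer entropy T_{Y->X} for discretised series x, y (order-1 history).

    Probabilities use Laplace's Rule of Succession:
    p = (count + 1) / (total + number_of_outcomes). Illustrative sketch.
    """
    triples, pairs_xy, pairs_xx, singles = Counter(), Counter(), Counter(), Counter()
    n = len(x) - 1
    for t in range(n):
        triples[(x[t + 1], x[t], y[t])] += 1   # (x_{t+1}, x_t, y_t)
        pairs_xy[(x[t], y[t])] += 1
        pairs_xx[(x[t + 1], x[t])] += 1
        singles[x[t]] += 1

    te = 0.0
    for (x1, x0, y0), c in triples.items():
        p_joint = (c + 1) / (n + n_bins ** 3)
        p_x1_given_xy = (c + 1) / (pairs_xy[(x0, y0)] + n_bins)
        p_x1_given_x = (pairs_xx[(x1, x0)] + 1) / (singles[x0] + n_bins)
        te += p_joint * log2(p_x1_given_xy / p_x1_given_x)
    return te
```

The asymmetry is visible in the formula: swapping x and y changes which series conditions the prediction of the other.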


Application-Specific Systems, Architectures, and Processors | 2014

Pipelined reconfigurable accelerator for ordinal pattern encoding

Ce Guo; Wayne Luk; Stephen Weston

Ordinal analysis is a statistical method for analysing the complexity of time series. It has been used to characterise dynamic changes in time series, with applications such as financial risk modelling and biomedical signal processing. Ordinal pattern encoding is a fundamental calculation in ordinal analysis, and it is computationally demanding, particularly for high query orders and large time series. This paper presents the first reconfigurable accelerator for this encoding calculation, with four main contributions. First, we propose a two-level hardware-oriented ordinal pattern encoding scheme that avoids sequence sorting operations in the accelerator, enabling theoretically optimal code compactness. Second, we develop a hardware mapping method that promotes data reuse, parallelises arithmetic operations, and pipelines the data path. Third, we conduct an experimental implementation of the proposed system, showing promising acceleration compared to software solutions. Finally, we apply the proposed accelerator to the computation of permutation entropy, demonstrating significant potential for accelerating such computations.
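The paper's two-level, sorting-free encoding scheme is not reproduced here, but a plain software reference for ordinal pattern encoding and the permutation entropy computed from it is sketched below; this version does sort each window, which is exactly the cost the hardware scheme avoids.

```python
from collections import Counter
from math import log2, factorial

def ordinal_code(window):
    """Encode a window into a compact ordinal-pattern code in [0, m! - 1].

    Uses the Lehmer code of the argsort permutation, so every pattern of
    order m maps to a distinct integer (plain software reference).
    """
    m = len(window)
    perm = sorted(range(m), key=lambda i: window[i])   # argsort of the window
    code = 0
    for i, p in enumerate(perm):
        smaller = sum(1 for q in perm[i + 1:] if q < p)
        code = code * (m - i) + smaller                # mixed-radix accumulation
    return code

def permutation_entropy(series, m):
    """Normalised permutation entropy of a time series for query order m."""
    counts = Counter(ordinal_code(series[i:i + m])
                     for i in range(len(series) - m + 1))
    total = sum(counts.values())
    h = -sum((c / total) * log2(c / total) for c in counts.values())
    return h / log2(factorial(m))
```

Because the codes are dense in [0, m! - 1], the pattern histogram needed for permutation entropy fits in a table of exactly m! entries, which is one reason compact codes matter.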


Application-Specific Systems, Architectures, and Processors | 2013

Accelerating HAC estimation for multivariate time series

Ce Guo; Wayne Luk

Heteroskedasticity and autocorrelation consistent (HAC) covariance matrix estimation, or HAC estimation for short, is one of the most important techniques in time series analysis and forecasting. It serves as a powerful analytical tool for hypothesis testing and model verification. However, HAC estimation for long and high-dimensional time series is computationally expensive. This paper describes a novel pipeline-friendly HAC estimation algorithm, derived from a mathematical specification by applying transformations that eliminate conditionals, parallelise arithmetic, and promote data reuse. We then develop a fully-pipelined hardware architecture based on the proposed algorithm. This architecture is shown to be efficient and scalable from both theoretical and empirical perspectives. Experimental results show that an FPGA-based implementation of the proposed architecture is up to 111 times faster than an optimised CPU implementation on one core, and 14 times faster than a CPU with eight cores.
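For reference, a common concrete form of HAC estimation, the Newey-West estimator with a Bartlett kernel, is sketched below. This is a generic software reference and an assumption about the estimator family, not the paper's pipeline-friendly reformulation.

```python
import numpy as np

def hac_newey_west(U, max_lag):
    """Newey-West HAC covariance estimate with a Bartlett kernel.

    U: (n, d) matrix of mean-zero time-series observations or residuals.
    Returns the (d, d) HAC covariance matrix (generic software reference).
    """
    n, d = U.shape
    S = (U.T @ U) / n                        # lag-0 autocovariance
    for j in range(1, max_lag + 1):
        w = 1.0 - j / (max_lag + 1.0)        # Bartlett kernel weight
        gamma = (U[j:].T @ U[:-j]) / n       # lag-j autocovariance
        S += w * (gamma + gamma.T)
    return S
```

Every lagged autocovariance depends on the full series, which is what makes the computation expensive for long, high-dimensional series and attractive for a fully-pipelined architecture.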


Field Programmable Logic and Applications | 2015

Recursive pipelined genetic propagation for bilevel optimisation

Shengjia Shao; Liucheng Guo; Ce Guo; Thomas C. P. Chau; David B. Thomas; Wayne Luk; Stephen Weston

The bilevel optimisation problem (BLP) is a class of optimisation problems in which one of the constraints is itself another optimisation problem. BLP is widely used to model hierarchical decision making, where the leader and the follower correspond to the upper-level and lower-level optimisation problems, respectively. In BLP, the optimal solutions to the lower-level problem are the feasible solutions to the upper-level problem, which makes BLP particularly difficult to solve. This paper proposes a novel hardware architecture, Recursive Pipelined Genetic Propagation (RPGP), to solve BLP efficiently on FPGAs. RPGP features a graph of genetic operation nodes which can be scaled to exploit available hardware resources. In addition, the topology of the RPGP graph can be changed at run time to escape from local optima. We evaluate the proposed architecture on an Altera Stratix-V FPGA using a benchmark set of bilevel optimisation problems. Our experimental results show that RPGP achieves a significant speedup over previous work.
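The nested structure that makes BLP expensive is visible in how such problems are evaluated in software: scoring every upper-level candidate requires solving the lower-level problem first. The toy sketch below uses placeholder objectives and a simple random search at both levels purely to show that nesting; it is not the paper's benchmark set or search strategy.

```python
import random

def lower_objective(x, y):
    # Placeholder follower objective, parameterised by the leader's choice x
    return (y - x) ** 2

def upper_objective(x, y):
    # Placeholder leader objective, evaluated at the follower's best response
    return (x - 1) ** 2 + y ** 2

def solve_lower(x, iters=200):
    """Lower-level solve: the follower's best response to x (toy random search)."""
    best_y = random.uniform(-5, 5)
    for _ in range(iters):
        cand = best_y + random.gauss(0, 0.5)
        if lower_objective(x, cand) < lower_objective(x, best_y):
            best_y = cand
    return best_y

def solve_bilevel(iters=200):
    """Upper-level solve: every candidate x is scored via a nested lower-level solve."""
    best_x = random.uniform(-5, 5)
    best_val = upper_objective(best_x, solve_lower(best_x))
    for _ in range(iters):
        cand = best_x + random.gauss(0, 0.5)
        val = upper_objective(cand, solve_lower(cand))  # nested evaluation
        if val < best_val:
            best_x, best_val = cand, val
    return best_x, best_val

print(solve_bilevel())
```

RPGP realises this nesting with a graph of genetic operation nodes in hardware rather than random search in software.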


Signal Processing Systems | 2014

Pipelined HAC Estimation Engines for Multivariate Time Series

Ce Guo; Wayne Luk

Heteroskedasticity and autocorrelation consistent (HAC) covariance matrix estimation, or HAC estimation for short, is one of the most important techniques in time series analysis and forecasting. It serves as a powerful analytical tool for hypothesis testing and model verification. However, HAC estimation for long and high-dimensional time series is computationally expensive. This paper describes a pipeline-friendly HAC estimation algorithm, derived from a mathematical specification by applying transformations that eliminate conditionals, parallelise arithmetic, and promote data reuse. We discuss an initial hardware architecture for the proposed algorithm, and propose two optimised architectures to improve worst-case performance. Experimental systems based on the proposed architectures demonstrate high performance, especially for long time series. One experimental system achieves up to 12 times speedup over an optimised software system running on 12 CPU cores.

Collaboration


Ce Guo's collaborations.

Top Co-Authors

Wayne Luk, Imperial College London
Liucheng Guo, Imperial College London
Jinzhe Yang, Imperial College London
Rama Cont, Imperial College London