Conghui He
Tsinghua University
Publication
Featured research published by Conghui He.
Remote Sensing | 2014
Nicholas Clinton; Le Yu; Haohuan Fu; Conghui He; Peng Gong
Phenology response to climatic variables is a vital indicator for understanding changes in biosphere processes as related to possible climate change. We investigated global phenology relationships to precipitation and land surface temperature (LST) at high spatial and temporal resolution for calendar years 2008–2011. We used cross-correlation between MODIS Enhanced Vegetation Index (EVI), MODIS LST and Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks (PERSIANN) gridded rainfall to map phenology relationships at 1-km spatial resolution and weekly temporal resolution. We show these data to be rich in spatiotemporal information, illustrating distinct phenology patterns as a result of complex overlapping gradients of climate, ecosystem and land use/land cover. The data are consistent with broad-scale, coarse-resolution modeled ecosystem limitations to moisture, temperature and irradiance. We suggest that high-resolution phenology data are useful as both an input and complement to land use/land cover classifiers and for understanding climate change vulnerability in natural and anthropogenic landscapes.
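The core computation behind these maps is a per-pixel lagged cross-correlation between the weekly EVI series and each climate driver (LST or PERSIANN rainfall). Below is a minimal sketch of that per-pixel step, assuming equal-length weekly series and a small window of candidate lags; the function and variable names (pearson, bestLag) are illustrative and not taken from the authors' code.

```cpp
#include <cmath>
#include <vector>

// Pearson correlation between two equal-length series.
double pearson(const std::vector<double>& a, const std::vector<double>& b) {
    const std::size_t n = a.size();
    double ma = 0.0, mb = 0.0;
    for (std::size_t i = 0; i < n; ++i) { ma += a[i]; mb += b[i]; }
    ma /= n; mb /= n;
    double num = 0.0, da = 0.0, db = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        num += (a[i] - ma) * (b[i] - mb);
        da  += (a[i] - ma) * (a[i] - ma);
        db  += (b[i] - mb) * (b[i] - mb);
    }
    return num / std::sqrt(da * db);
}

// For one pixel, find the lag (in weeks) that maximizes the correlation
// between the vegetation index series (evi) and one climate driver series
// (LST or rainfall). Assumes both series have the same length > maxLag.
int bestLag(const std::vector<double>& evi, const std::vector<double>& driver,
            int maxLag, double* bestR) {
    int lagStar = 0;
    double rStar = -2.0;
    for (int lag = 0; lag <= maxLag; ++lag) {
        // Correlate EVI at week t with the driver at week t - lag.
        std::vector<double> x(evi.begin() + lag, evi.end());
        std::vector<double> y(driver.begin(), driver.end() - lag);
        double r = pearson(x, y);
        if (r > rStar) { rStar = r; lagStar = lag; }
    }
    if (bestR) *bestR = rStar;
    return lagStar;
}
```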
IEEE International Conference on High Performance Computing, Data and Analytics | 2016
Haohuan Fu; Junfeng Liao; Wei Xue; Lanning Wang; Dexun Chen; Long Gu; Jinxiu Xu; Nan Ding; Xinliang Wang; Conghui He; Shizhen Xu; Yishuang Liang; Jiarui Fang; Yuanchao Xu; Weijie Zheng; Jingheng Xu; Zhen Zheng; Wanjing Wei; Xu Ji; He Zhang; Bingwei Chen; Kaiwei Li; Xiaomeng Huang; Wenguang Chen; Guangwen Yang
This paper reports our efforts on refactoring and optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight supercomputer, which uses a many-core processor that consists of management processing elements (MPEs) and clusters of computing processing elements (CPEs). To map the large code base of CAM to the millions of cores on the Sunway system, we take OpenACC-based refactoring as the major approach, and apply source-to-source translator tools to exploit the most suitable parallelism for the CPE cluster and to fit the intermediate variables into the limited on-chip fast buffer. For individual kernels, when comparing the original ported version using only MPEs and the refactored version using both the MPEs and the CPE clusters, we achieve up to 22× speedup for the compute-intensive kernels. For the 25-km-resolution CAM global model, we manage to scale to 24,000 MPEs and 1,536,000 CPEs, and achieve a simulation speed of 2.81 model years per day.
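As an illustration of the refactoring style, the sketch below shows a generic column-style loop annotated with standard OpenACC directives of the kind a source-to-source translator can then map to the CPE clusters. The loop body and names are placeholders, not CAM code, and the actual Sunway toolchain provides its own customized OpenACC support.

```cpp
// A minimal OpenACC sketch: expose the outer (column) loop to the accelerator
// and bound the per-iteration working set so intermediate variables fit in the
// limited on-chip fast buffer. The physics in the loop body is a placeholder.
void column_physics(int ncol, int nlev,
                    const double* t, const double* q, double* tend) {
    #pragma acc parallel loop copyin(t[0:ncol*nlev], q[0:ncol*nlev]) \
                              copyout(tend[0:ncol*nlev])
    for (int i = 0; i < ncol; ++i) {       // column-level parallelism for CPEs
        #pragma acc loop seq
        for (int k = 0; k < nlev; ++k) {   // vertical loop stays sequential
            const int idx = i * nlev + k;
            tend[idx] = 0.5 * (t[idx] + q[idx]);   // placeholder physics
        }
    }
}
```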
IEEE International Conference on High Performance Computing, Data and Analytics | 2017
Haohuan Fu; Conghui He; Bingwei Chen; Zekun Yin; Zhenguo Zhang; Wenqiang Zhang; Tingjian Zhang; Wei Xue; Weiguo Liu; Wanwang Yin; Guangwen Yang; Xiaofei Chen
This paper reports our large-scale nonlinear earthquake simulation software on Sunway TaihuLight. Our innovations include: (1) a customized parallelization scheme that employs the 10 million cores efficiently at both the process and the thread levels; (2) an elaborate memory scheme that integrates on-chip halo exchange through register communication, an optimized blocking configuration guided by an analytic model, and coalesced DMA access with array fusion; (3) on-the-fly compression that doubles the maximum problem size and further improves the performance by 24%. With these innovations removing the memory constraints of Sunway TaihuLight, our software achieves over 15% of the system's peak, better than the 11.8% efficiency achieved by similar software running on Titan, whose byte-to-flop ratio is 5 times better than that of TaihuLight. The extreme-scale cases demonstrate a sustained performance of over 18.9 Pflops, enabling the simulation of the Tangshan earthquake as an 18-Hz scenario with an 8-meter resolution.
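As a rough illustration of the on-the-fly compression idea (a software sketch assuming simple per-block 16-bit linear quantization; the actual Sunway implementation differs), compressing wavefield blocks halves their memory footprint, which is what allows the maximum problem size to double.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// One block of 32-bit wavefield values stored as 16-bit integers plus a
// per-block scale factor (illustrative scheme, not the paper's exact one).
struct CompressedBlock {
    float scale;                  // maximum |value| in the block
    std::vector<int16_t> data;    // quantized samples
};

CompressedBlock compress(const std::vector<float>& block) {
    CompressedBlock c;
    c.scale = 0.0f;
    for (float v : block) c.scale = std::max(c.scale, std::fabs(v));
    if (c.scale == 0.0f) c.scale = 1.0f;   // avoid division by zero
    c.data.reserve(block.size());
    for (float v : block)
        c.data.push_back(static_cast<int16_t>(std::lround(v / c.scale * 32767.0f)));
    return c;
}

std::vector<float> decompress(const CompressedBlock& c) {
    std::vector<float> out;
    out.reserve(c.data.size());
    for (int16_t q : c.data) out.push_back(q / 32767.0f * c.scale);
    return out;
}
```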
IEEE Transactions on Computers | 2017
Conghui He; Haohuan Fu; Ce Guo; Wayne Luk; Guangwen Yang
Gaussian Mixture Models (GMMs) are widely used in many applications such as data mining, signal processing and computer vision, for probability density modeling and soft clustering. However, the parameters of a GMM need to be estimated from data by, for example, the Expectation-Maximization algorithm for Gaussian Mixture Models (EM-GMM), which is computationally demanding. This paper presents a novel design for the EM-GMM algorithm targeting reconfigurable platforms, with five main contributions. First, a pipeline-friendly EM-GMM with diagonal covariance matrices that can easily be mapped to hardware architectures. Second, a function evaluation unit for the Gaussian probability density based on fixed-point arithmetic. Third, an extension of our approach that supports a wide range of dimensions and/or components by fitting multiple pieces of smaller dimensions onto an FPGA chip. Fourth, a cost and performance model that estimates logic resources. Fifth, a dataflow design targeting the Maxeler MPC-X2000 with a Stratix-5SGSD8 FPGA that can run over 200 times faster than a 6-core Xeon E5645 processor, and over 39 times faster than a Pascal TITAN-X GPU. Our design provides a practical solution for training GMMs, and for exploring better GMM parameters, with hundreds of millions of high-dimensional input instances in low-latency and high-performance applications.
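For reference, the sketch below shows one EM iteration for a GMM with diagonal covariances in plain C++ doubles; the FPGA design pipelines this computation with fixed-point arithmetic, but the update equations are the same. All names are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// One EM iteration for a diagonal-covariance GMM.
// X: n x d data (row-major); pi: K weights; mu, var: K x d (row-major).
void emStepDiagonal(const std::vector<double>& X, int n, int d, int K,
                    std::vector<double>& pi,
                    std::vector<double>& mu,
                    std::vector<double>& var) {
    const double kTwoPi = 6.283185307179586;
    std::vector<double> resp(static_cast<std::size_t>(n) * K);
    // E-step: responsibilities from log densities, with max-shift for stability.
    for (int i = 0; i < n; ++i) {
        std::vector<double> logp(K);
        double maxLog = -1e300;
        for (int k = 0; k < K; ++k) {
            double lp = std::log(pi[k]);
            for (int j = 0; j < d; ++j) {
                double diff = X[i * d + j] - mu[k * d + j];
                lp += -0.5 * (std::log(kTwoPi * var[k * d + j])
                              + diff * diff / var[k * d + j]);
            }
            logp[k] = lp;
            maxLog = std::max(maxLog, lp);
        }
        double sum = 0.0;
        for (int k = 0; k < K; ++k) sum += std::exp(logp[k] - maxLog);
        for (int k = 0; k < K; ++k)
            resp[i * K + k] = std::exp(logp[k] - maxLog) / sum;
    }
    // M-step: re-estimate weights, means and diagonal variances.
    for (int k = 0; k < K; ++k) {
        double nk = 1e-12;
        for (int i = 0; i < n; ++i) nk += resp[i * K + k];
        pi[k] = nk / n;
        for (int j = 0; j < d; ++j) {
            double m = 0.0;
            for (int i = 0; i < n; ++i) m += resp[i * K + k] * X[i * d + j];
            mu[k * d + j] = m / nk;
            double v = 1e-6;   // small floor keeps variances positive
            for (int i = 0; i < n; ++i) {
                double diff = X[i * d + j] - mu[k * d + j];
                v += resp[i * K + k] * diff * diff;
            }
            var[k * d + j] = v / nk;
        }
    }
}
```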
77th EAGE Conference and Exhibition | 2015
Conghui He; Yanlei Chen; Haohuan Fu; Guangwen Yang
Full wave inversion (FWI) suffers from convergence toward local minima because of inaccuracy in the initial model and the lack of low-frequency data. Noise in seismograms further deteriorates the imaging quality. To relax the dependency on high-quality low-frequency data, we present an ensemble full wave inversion method with source encoding (EnFWI), which is an ensemble approximation of the total inversion proposed by Tarantola. The method refines the velocity model iteratively by incorporating the observations, while the nonlinear evolution of the covariance is approximated by the ensemble covariance. Encoded simultaneous-source FWI (ESSFWI) is applied to improve the representation of the low-rank ensemble approximation and to increase the rate of convergence. Experiments show that EnFWI achieves a larger convergence range and better tolerance to data noise, at a lower computational cost than traditional FWI methods.
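As a small illustration of the source-encoding step (a sketch with illustrative names, not the authors' code), individual shots are summed into one super-shot with random polarities; the same polarities must also be applied to the corresponding observed records before the misfit is computed.

```cpp
#include <random>
#include <vector>

// Combine nshot source wavelets (each of length nt) into one encoded
// super-shot using random +/-1 polarity codes. Assumes shots is non-empty
// and all wavelets have equal length.
std::vector<double> encodeSources(const std::vector<std::vector<double>>& shots,
                                  std::mt19937& rng) {
    std::bernoulli_distribution coin(0.5);
    const std::size_t nt = shots.front().size();
    std::vector<double> encoded(nt, 0.0);
    for (const auto& wavelet : shots) {
        const double polarity = coin(rng) ? 1.0 : -1.0;   // encoding code
        for (std::size_t t = 0; t < nt; ++t) encoded[t] += polarity * wavelet[t];
    }
    return encoded;
}
```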
Field Programmable Logic and Applications | 2017
Conghui He; Haohuan Fu; Wayne Luk; Weijia Li; Guangwen Yang
The order book update (OBU) algorithm is widely used in financial exchanges for rebuilding order books. The number of messages produced has increased drastically over time, and software solutions find it increasingly difficult to scale with the growing message rate while meeting low-latency requirements. This paper explores the potential of reconfigurable platforms to revolutionize the order book architecture, and proposes a novel order book update algorithm optimized for maximal throughput and minimal latency. Our approach makes three main contributions. First, we derive a fixed-tick data structure for the order book that is easier to map to hardware. Second, we design a customized cache storing the top five levels of the order book to further reduce latency. Third, we propose a hardware-friendly order book update algorithm based on the proposed data structures. In our experiments, the FPGA-based solution processes 1.2–1.5 million messages per second at a throughput of 10 Gb/s and a latency of 132–288 nanoseconds, which is 90–157 times faster than a CPU-based solution and 5.2–6.6 times faster than an existing FPGA-based solution.
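A minimal software model of the fixed-tick idea, assuming a bid-side book with illustrative names (the actual design implements this in FPGA logic): because prices move on a fixed tick grid, each price maps directly to an array index, so an update touches one slot in O(1), and the top five levels can be mirrored in a small cache.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

struct FixedTickBook {
    int64_t basePrice;                     // lowest representable price
    int64_t tickSize;
    std::vector<int64_t> qtyAtLevel;       // aggregate quantity per price level
    std::array<int64_t, 5> topPrices{};    // cached best five bid prices

    FixedTickBook(int64_t base, int64_t tick, std::size_t levels)
        : basePrice(base), tickSize(tick), qtyAtLevel(levels, 0) {}

    std::size_t index(int64_t price) const {
        return static_cast<std::size_t>((price - basePrice) / tickSize);
    }

    // Apply an add/reduce/cancel message as a signed quantity delta.
    void update(int64_t price, int64_t qtyDelta) {
        qtyAtLevel[index(price)] += qtyDelta;
        refreshTop();
    }

    // Rebuild the top-five cache by scanning down from the best price.
    void refreshTop() {
        std::size_t found = 0;
        for (std::size_t i = qtyAtLevel.size(); i-- > 0 && found < topPrices.size();)
            if (qtyAtLevel[i] > 0)
                topPrices[found++] = basePrice + static_cast<int64_t>(i) * tickSize;
    }
};
```

In this sketch the rescan in refreshTop just keeps the software model short; a hardware version would maintain the cached levels incrementally.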
Field Programmable Gate Arrays | 2017
Haohuan Fu; Conghui He; Huabin Ruan; Itay Greenspon; Wayne Luk; Yongkang Zheng; Junfeng Liao; Qing Zhang; Guangwen Yang
The financial market server in an exchange maintains the order books and provides real-time market data feeds to traders. Low-latency processing is in great demand in financial trading. Although software solutions provide the flexibility to express algorithms in high-level programming models and to recompile quickly, they are becoming increasingly uncompetitive due to long and unpredictable response times. Field Programmable Gate Arrays (FPGAs) have proved to be an established technology for achieving low and constant latency when processing streaming packets in a hardware-accelerated way. However, maintaining order books on FPGAs involves organizing packets into gigabytes of structured data as well as complicated routines (sort, insertion, deletion, etc.), which is extremely challenging for FPGA designs in both design methodology and memory capacity. Existing FPGA designs therefore often leave the post-processing part to the CPU, which largely cancels out the latency gain of the network packet processing part. This paper proposes a CPU-FPGA hybrid list design that accelerates financial market servers to microsecond-level latencies. The paper makes four main contributions. First, we design a CPU-FPGA hybrid list with two levels: a small cache list on the FPGA and a large master list on the CPU host. Both lists are sorted with different schemes, where a bitonic sort is applied to the cache list while a balanced tree maintains the master list. Second, to effectively update the hybrid sorted list, we derive a complete set of low-latency routines, including insertion, deletion, selection and sorting, each taking only a few cycles. Third, we propose a non-blocking on-demand synchronization strategy for the cache list and the master list to communicate with each other. Lastly, we integrate the hybrid list with other components, such as packet splitting, parsing and processing, to form an industry-level financial market server. Our design is deployed in the environment of the China Financial Futures Exchange (CFFEX), demonstrating its functionality and stability by running for 600+ hours with hundreds of millions of packets per day. Compared with the existing CPU-based solution at CFFEX, our system supports identical functionality while significantly reducing the latency from 100+ microseconds to 2 microseconds, a speedup of 50x.
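A compact software model of the two-level structure described above (illustrative only; in the real design the cache list lives in FPGA memory and is re-sorted by a bitonic network, and the cache/master synchronization is non-blocking):

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <map>
#include <vector>

struct Order { int64_t price; int64_t qty; };

class HybridList {
public:
    explicit HybridList(std::size_t cacheCapacity) : capacity_(cacheCapacity) {}

    // Insert into the small cache; if it overflows, demote the worst-priced
    // entry to the master list (a balanced tree on the CPU side).
    void insert(const Order& o) {
        cache_.push_back(o);
        sortCache();                                 // stand-in for bitonic sort
        if (cache_.size() > capacity_) {
            Order demoted = cache_.back();
            cache_.pop_back();
            master_.emplace(demoted.price, demoted);
        }
    }

    // On-demand refill: if the cache runs low after deletions, pull the best
    // remaining orders back from the master list.
    void refill() {
        while (cache_.size() < capacity_ && !master_.empty()) {
            auto best = std::prev(master_.end());    // highest price in the tree
            cache_.push_back(best->second);
            master_.erase(best);
        }
        sortCache();
    }

private:
    void sortCache() {
        std::sort(cache_.begin(), cache_.end(),
                  [](const Order& a, const Order& b) { return a.price > b.price; });
    }
    std::size_t capacity_;
    std::vector<Order> cache_;               // models the on-FPGA cache list
    std::multimap<int64_t, Order> master_;   // models the CPU-side master list
};
```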
Field Programmable Custom Computing Machines | 2017
Haohuan Fu; Conghui He; Wayne Luk; Weijia Li; Guangwen Yang
This paper proposes a hybrid sorted table design for minimizing electronic trading latency, with three main contributions. First, a hierarchical sorted table with two levels: a fast cache table in reconfigurable hardware storing megabytes of data items, and a master table in software storing gigabytes of data items. Second, a full set of operations, including insertion, deletion, selection and sorting, for the hybrid table, each with a latency of a few cycles. Third, an on-demand synchronization scheme between the cache table and the master table. An implementation targeting an FPGA-based network card in the environment of the China Financial Futures Exchange (CFFEX) sustains 1–10 Gb/s bandwidth with a latency of 400 to 700 nanoseconds, providing an 80- to 125-fold latency reduction compared to a fully optimized CPU-based solution, and a 2.2-fold reduction over an existing FPGA-based solution.
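The constant-latency sorting of the hardware-resident cache table is typically done with a fixed sorting network; the companion hybrid-list design above uses a bitonic sort. The sketch below is a minimal software model of the bitonic compare-and-swap schedule (not the paper's hardware implementation); in hardware, each inner stage corresponds to one parallel layer of comparators, so a fixed-size table is re-sorted in a constant number of cycles.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

// Bitonic sort of a fixed-size, power-of-two array (ascending order).
template <std::size_t N>
void bitonicSort(std::array<int64_t, N>& a) {
    static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");
    for (std::size_t k = 2; k <= N; k <<= 1) {          // bitonic sequence size
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {  // compare distance
            for (std::size_t i = 0; i < N; ++i) {       // one comparator layer
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    const bool ascending = ((i & k) == 0);
                    if ((a[i] > a[partner]) == ascending)
                        std::swap(a[i], a[partner]);
                }
            }
        }
    }
}
```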
Computer Vision and Pattern Recognition | 2018
Weijia Li; Conghui He; Jiarui Fang; Haohuan Fu
Field Programmable Technology | 2017
Weijia Li; Conghui He; Haohuan Fu; Wayne Luk