Haohuan Fu
Tsinghua University
Publications
Featured research published by Haohuan Fu.
International Journal of Remote Sensing | 2013
Peng Gong; Jie Wang; Le Yu; Yongchao Zhao; Yuanyuan Zhao; Lu Liang; Z. C. Niu; Xiaomeng Huang; Haohuan Fu; Shuang Liu; Congcong Li; Xueyan Li; Wei Fu; Caixia Liu; Yue Xu; Xiaoyi Wang; Qu Cheng; Luanyun Hu; Wenbo Yao; Han Zhang; Peng Zhu; Ziying Zhao; Haiying Zhang; Yaomin Zheng; Luyan Ji; Yawen Zhang; Han Chen; An Yan; Jianhong Guo; Liang Yu
We have produced the first 30 m resolution global land-cover maps using Landsat Thematic Mapper (TM) and Enhanced Thematic Mapper Plus (ETM+) data. We have classified over 6600 scenes of Landsat TM data after 2006, and over 2300 scenes of Landsat TM and ETM+ data before 2006, all selected from the green season. These images cover most of the world's land surface except Antarctica and Greenland. Most of these images came from the United States Geological Survey in level L1T (orthorectified). Four freely available classifiers were employed: the conventional maximum likelihood classifier (MLC), the J4.8 decision tree classifier, the Random Forest (RF) classifier, and the support vector machine (SVM) classifier. A total of 91,433 training samples were collected by traversing each scene and finding the most representative and homogeneous samples. A total of 38,664 test samples were collected at preset, fixed locations based on a globally systematic unaligned sampling strategy. Two software tools, Global Analyst and Global Mapper, developed by extending the functionality of Google Earth, were used in building the training and test sample databases by referencing the Moderate Resolution Imaging Spectroradiometer enhanced vegetation index (MODIS EVI) time series for 2010 and high-resolution images from Google Earth. A unique land-cover classification system was developed that can be crosswalked to the existing United Nations Food and Agriculture Organization (FAO) land-cover classification system as well as the International Geosphere-Biosphere Programme (IGBP) system. Using the four classification algorithms, we obtained the initial set of global land-cover maps. The SVM produced the highest overall classification accuracy (OCA) of 64.9% assessed with our test samples, with RF (59.8%), J4.8 (57.9%), and MLC (53.9%) ranked second to fourth. We also estimated the OCAs using a subset of our test samples (8,629), each of which represented a homogeneous area greater than 500 m × 500 m; using this subset, the OCA for the SVM rose to 71.5%. Used as a consistent source for estimating the coverage of global land-cover types, the test samples indicate that only 6.90% of the world's land surface is planted for agricultural production; the total area of cropland reaches 11.51% if unplanted croplands are included. Forests, grasslands, and shrublands cover 28.35%, 13.37%, and 11.49% of the world, respectively. Impervious surfaces cover only 0.66%, while inland waterbodies, barren lands, and snow and ice cover 3.56%, 16.51%, and 12.81% of the world, respectively.
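The classifier comparison at the heart of this paper is easy to illustrate. Below is a minimal sketch, assuming synthetic "spectral" samples in place of the real Landsat training and test databases; the class counts, band counts, and scikit-learn stand-ins (QDA as a rough proxy for the Gaussian MLC) are illustrative only, not the paper's actual pipeline.

```python
# Hypothetical mini-version of the four-classifier comparison.
# Synthetic data only; not the real 91,433/38,664 sample databases.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_classes, n_bands = 10, 6                    # e.g., 10 land-cover classes, 6 TM bands
X = rng.normal(size=(5000, n_bands))
y = rng.integers(0, n_classes, size=5000)
X += y[:, None] * 0.5                         # give classes some spectral separation

X_train, y_train = X[:4000], y[:4000]         # stand-in for the training samples
X_test, y_test = X[4000:], y[4000:]           # stand-in for the test samples

classifiers = [("SVM", SVC(kernel="rbf")),
               ("RF", RandomForestClassifier(n_estimators=100)),
               ("DT (J4.8-like)", DecisionTreeClassifier()),
               ("MLC-like (QDA)", QuadraticDiscriminantAnalysis())]
for name, clf in classifiers:
    clf.fit(X_train, y_train)
    oca = accuracy_score(y_test, clf.predict(X_test))  # overall classification accuracy
    print(f"{name}: OCA = {oca:.3f}")
```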
Science China Information Sciences | 2016
Haohuan Fu; Junfeng Liao; Jinzhe Yang; Lanning Wang; Zhenya Song; Xiaomeng Huang; Chao Yang; Wei Xue; Fangfang Liu; Fangli Qiao; Wei Zhao; Xunqiang Yin; Chaofeng Hou; Chenglong Zhang; Wei Ge; Jian Zhang; Yangang Wang; Chunbo Zhou; Guangwen Yang
The Sunway TaihuLight supercomputer is the world’s first system with a peak performance greater than 100 PFlops. In this paper, we provide a detailed introduction to the TaihuLight system. In contrast with other existing heterogeneous supercomputers, which include both CPU processors and PCIe-connected many-core accelerators (NVIDIA GPU or Intel Xeon Phi), the computing power of TaihuLight is provided by a homegrown many-core SW26010 CPU that includes both the management processing elements (MPEs) and computing processing elements (CPEs) in one chip. With 260 processing elements in one CPU, a single SW26010 provides a peak performance of over three TFlops. To alleviate the memory bandwidth bottleneck in most applications, each CPE comes with a scratchpad memory, which serves as a user-controlled cache. To support the parallelization of programs on the new many-core architecture, in addition to the basic C/C++ and Fortran compilers, the system provides a customized Sunway OpenACC tool that supports the OpenACC 2.0 syntax. This paper also reports our preliminary efforts on developing and optimizing applications on the TaihuLight system, focusing on key application domains, such as earth system modeling, ocean surface wave modeling, atomistic simulation, and phase-field simulation.
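The peak figures quoted above can be sanity-checked with simple arithmetic. In the sketch below, the clock rate, flops-per-cycle, and node count are assumptions drawn from public descriptions of the system rather than from this abstract.

```python
# Back-of-the-envelope check of the TaihuLight peak numbers.
# GHZ, FLOPS_PER_CYCLE, and CHIPS are assumed values, not from the abstract.
GHZ = 1.45                # assumed SW26010 clock (GHz)
FLOPS_PER_CYCLE = 8       # assumed double-precision flops/cycle per element
ELEMENTS_PER_CHIP = 260   # 4 core groups x (1 MPE + 64 CPEs), as stated above
CHIPS = 40_960            # assumed number of SW26010 nodes in the system

peak_per_chip = GHZ * FLOPS_PER_CYCLE * ELEMENTS_PER_CHIP / 1e3  # TFlops
peak_system = peak_per_chip * CHIPS / 1e3                        # PFlops
print(f"per chip: {peak_per_chip:.2f} TFlops")   # ~3 TFlops, matching the abstract
print(f"system:   {peak_system:.0f} PFlops")     # >100 PFlops, matching the abstract
```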
IEEE Micro | 2011
Olav Lindtjorn; Robert G. Clapp; Oliver Pell; Michael J. Flynn; Haohuan Fu
The oil and gas industry is a major user of high-performance computing, and its geoscience computational cycles are dominated by a relatively small number of well-defined kernels. This project explores accelerating geoscience applications using FPGA-based hardware, optimizing both the algorithm and the hardware to achieve maximum performance. The approach can deliver speedups of 20 to 70 times over a conventional HPC node.
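The "relatively few and well defined" kernels in question are typically finite-difference stencils from seismic wave propagation. A minimal sketch of such a kernel is shown below; the grid size, coefficient, and loop depth are illustrative, chosen only to show the deep, regular structure that maps well onto a fully pipelined FPGA datapath.

```python
# Illustrative 2D finite-difference stencil, the kind of geoscience
# kernel these FPGA designs accelerate. All parameters are made up.
import numpy as np

def laplacian_step(p, c=0.25):
    """One explicit stencil sweep over pressure field p (interior points)."""
    out = p.copy()
    out[1:-1, 1:-1] = p[1:-1, 1:-1] + c * (
        p[2:, 1:-1] + p[:-2, 1:-1] + p[1:-1, 2:] + p[1:-1, :-2]
        - 4.0 * p[1:-1, 1:-1])
    return out

p = np.zeros((512, 512))
p[256, 256] = 1.0                 # point source
for _ in range(100):              # deep, regular time loop: each iteration is the
    p = laplacian_step(p)         # same dataflow, ideal for a pipelined FPGA kernel
```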
Geophysics | 2010
Robert G. Clapp; Haohuan Fu; Olav Lindtjorn
Over the last couple of years, it has become more difficult to assess which hardware can deliver the most cost-optimal solution for demanding imaging tasks. The days of ever-faster CPUs are over: a clear choice of hardware has been replaced by many-core technologies and a proliferation of alternatives. In particular, accelerators like GPGPUs and field-programmable gate arrays (FPGAs) have emerged as strong contenders for the title of hardware platform of choice. With the radical differences in hardware architectures, it has also become more and more difficult to evaluate which platform is optimal for the application in question; an apples-to-apples comparison is no longer possible. Through the example of reverse time migration (RTM), we demonstrate that only through careful optimization for each platform, with the involvement of hardware, computer science, and algorithm specialists, can we arrive at a reasonable assessment of the alternatives available today.
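For readers unfamiliar with RTM, its computational core is two wave-propagation passes plus a cross-correlation. The heavily simplified sketch below (second-order acoustic propagation, zero-lag imaging condition, synthetic "recorded" data) shows that structure; every grid size, constant, and receiver layout here is illustrative, not from the paper.

```python
# Skeleton of reverse time migration: forward-propagate the source
# wavefield, back-propagate the receiver data, cross-correlate.
import numpy as np

def step(p_prev, p_cur, vel2_dt2):
    """One acoustic time step: p_next = 2p - p_prev + v^2 dt^2 laplacian(p)."""
    lap = np.zeros_like(p_cur)
    lap[1:-1, 1:-1] = (p_cur[2:, 1:-1] + p_cur[:-2, 1:-1] +
                       p_cur[1:-1, 2:] + p_cur[1:-1, :-2] -
                       4.0 * p_cur[1:-1, 1:-1])
    return 2.0 * p_cur - p_prev + vel2_dt2 * lap

nz = nx = 128; nt = 150
vel2_dt2 = 0.1                           # (v*dt/dx)^2, illustrative constant
src = np.zeros((nt, nz, nx))
src[:, 10, nx // 2] = np.sin(np.arange(nt) * 0.3)   # toy source wavelet

# Forward pass: propagate and store the source wavefield.
fw = np.zeros((nt, nz, nx))
p_prev = p_cur = np.zeros((nz, nx))
for t in range(nt):
    p_prev, p_cur = p_cur, step(p_prev, p_cur, vel2_dt2) + src[t]
    fw[t] = p_cur

# Backward pass: inject "recorded" data in reverse time and correlate.
image = np.zeros((nz, nx))
q_prev = q_cur = np.zeros((nz, nx))
rec = fw[:, 10, :]                       # pretend surface recording
for t in reversed(range(nt)):
    inj = np.zeros((nz, nx)); inj[10, :] = rec[t]
    q_prev, q_cur = q_cur, step(q_prev, q_cur, vel2_dt2) + inj
    image += fw[t] * q_cur               # zero-lag imaging condition
```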
acm sigplan symposium on principles and practice of parallel programming | 2013
Chao Yang; Wei Xue; Haohuan Fu; Lin Gan; Linfeng Li; Yangtong Xu; Yutong Lu; Jiachang Sun; Guangwen Yang; Weimin Zheng
Developing highly scalable algorithms for global atmospheric modeling is becoming increasingly important as scientists seek to understand the behavior of the global atmosphere at extreme scales. Heterogeneous architectures based on both processors and accelerators have become an important option for large-scale computing. However, large-scale simulation of the global atmosphere poses a severe challenge for highly scalable algorithms that fit well into state-of-the-art heterogeneous systems. Although GPU-accelerated computing has succeeded in some top-level applications, studies that fully exploit heterogeneous architectures in global atmospheric modeling remain rare, due in large part to both the computational difficulty of the mathematical models and the accuracy requirements of long-term simulations. In this paper, we propose a peta-scalable hybrid algorithm that is successfully applied in a cubed-sphere shallow-water model for global atmospheric simulations. We employ an adjustable partition between CPUs and GPUs to achieve a balanced utilization of the entire hybrid system, and present a pipe-flow scheme that conducts conflict-free inter-node communication on the cubed-sphere geometry and maximizes communication-computation overlap. Systematic multithreading optimizations on both the GPU and CPU sides enhance computing throughput and improve memory efficiency. Our experiments demonstrate nearly ideal strong and weak scalability on up to 3,750 nodes of Tianhe-1A. The largest run sustains a performance of 0.8 Pflops in double precision (32% of the peak performance), using 45,000 CPU cores and 3,750 GPUs.
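The "adjustable partition between CPUs and GPUs" can be pictured as a tunable work split that is nudged toward whichever side finishes first. The sketch below simulates that feedback loop with made-up per-line costs (a GPU assumed 4x faster than the CPU); it is a toy model of the idea, not the paper's scheme.

```python
# Toy model of an adjustable CPU/GPU partition: split grid lines by a
# ratio, then move work toward the faster side. Costs are simulated.
def run_partition(n_lines, ratio, cpu_cost=4e-6, gpu_cost=1e-6):
    gpu_lines = int(n_lines * ratio)          # share of lines sent to the GPU
    cpu_lines = n_lines - gpu_lines
    return cpu_lines * cpu_cost, gpu_lines * gpu_cost   # stand-in kernel times

ratio = 0.5                                   # start with an even split
for _ in range(20):
    t_cpu, t_gpu = run_partition(100_000, ratio)
    ratio += 0.05 if t_cpu > t_gpu else -0.05 # shift work to the faster side
    ratio = min(max(ratio, 0.0), 1.0)
# Oscillates near the balance point: ~0.8 GPU share for a 4x faster GPU.
print(f"balanced GPU share ~ {ratio:.2f}")
```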
IEEE Transactions on Computers | 2015
Wei Xue; Chao Yang; Haohuan Fu; Xinliang Wang; Yangtong Xu; Junfeng Liao; Lin Gan; Yutong Lu; Rajiv Ranjan; Lizhe Wang
In this work, an ultra-scalable algorithm is designed and optimized to accelerate a 3D compressible Euler atmospheric model on the CPU-MIC hybrid system of Tianhe-2. We first reformulate the mesoscale model to avoid long-latency operations, and then employ carefully designed inter-node and intra-node domain decomposition algorithms to achieve a balanced utilization of the different computing units. Proper communication-computation overlap and concurrent data transfer methods are used to reduce the cost of data movement at scale, and a variety of optimization techniques on both the CPU side and the accelerator side are exploited to enhance in-socket performance. The proposed hybrid algorithm successfully scales to 6,144 Tianhe-2 nodes with a nearly ideal weak-scaling efficiency, and achieves over 8 percent of the peak performance in double precision. This ultra-scalable hybrid algorithm may be of interest to the community for accelerating atmospheric models on the heterogeneous supercomputers that increasingly dominate the field.
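Communication-computation overlap, the technique emphasized above, has a simple shape: start the halo exchange, update the interior while it is in flight, then finish the boundary. A minimal self-contained sketch follows, with MPI replaced by a sleep so the code runs anywhere; all names and the update rule are illustrative.

```python
# Overlap sketch: a background thread plays the role of the halo exchange
# while the interior stencil update proceeds concurrently.
import threading, time
import numpy as np

def exchange_halos(u, done):
    time.sleep(0.01)                          # stand-in for MPI halo latency
    u[0, :], u[-1, :] = u[1, :], u[-2, :]     # fake neighbor data arriving
    done.set()

def update_interior(u):
    # Interior rows only: does not touch the halo rows being exchanged.
    u[2:-2, 1:-1] += 0.1 * (u[3:-1, 1:-1] + u[1:-3, 1:-1] - 2 * u[2:-2, 1:-1])

u = np.random.rand(1024, 1024)
done = threading.Event()
threading.Thread(target=exchange_halos, args=(u, done)).start()
update_interior(u)                            # overlaps the "communication"
done.wait()                                   # halos ready: finish the boundary
u[1, 1:-1] += 0.1 * (u[2, 1:-1] + u[0, 1:-1] - 2 * u[1, 1:-1])
u[-2, 1:-1] += 0.1 * (u[-1, 1:-1] + u[-3, 1:-1] - 2 * u[-2, 1:-1])
```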
international parallel and distributed processing symposium | 2014
Wei Xue; Chao Yang; Haohuan Fu; Xinliang Wang; Yangtong Xu; Lin Gan; Yutong Lu; Xiaoqian Zhu
This paper presents a hybrid algorithm for the petascale global simulation of atmospheric dynamics on Tianhe-2, the world's current top-ranked supercomputer, developed by China's National University of Defense Technology (NUDT). Tianhe-2 is equipped with both Intel Xeon CPUs and Intel Xeon Phi accelerators. A key idea of the hybrid algorithm is to enable flexible domain partitioning between an arbitrary number of processors and accelerators, so as to achieve a balanced and efficient utilization of the entire system. We also present an asynchronous and concurrent data transfer scheme to reduce the communication overhead between the CPUs and accelerators, and further optimize the global atmospheric model to make better use of the Intel MIC architecture. In single-node tests on Tianhe-2 against two Intel Ivy Bridge CPUs (24 cores), we achieve 2.07×, 3.18×, and 4.35× speedups when using one, two, and three Intel Xeon Phi accelerators, respectively. The average performance gain from SIMD vectorization on the Intel Xeon Phi processors is around 5× (out of the theoretical 8×). Thanks to successful computation-communication overlapping, large-scale tests show a nearly ideal weak-scaling efficiency of 93.5% as the number of nodes grows from 6 to 8,664 (nearly 1.7 million cores). In the strong-scaling test, the parallel efficiency is about 77% when the number of nodes increases from 1,536 to 8,664 for a fixed 65,664 × 5,664 × 6 mesh with 77.6 billion unknowns.
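For readers who want to reproduce the scaling arithmetic, the two efficiency notions used above are computed as follows. The timings in the sketch are illustrative placeholders chosen to reproduce the abstract's 93.5% and ~77% figures, not measured values from the paper.

```python
# Weak scaling: time per step should stay flat as nodes and problem size
# grow together. Strong scaling: time should drop in proportion to nodes.
def weak_efficiency(t_ref, t_big):
    return t_ref / t_big

def strong_efficiency(t_ref, n_ref, t_big, n_big):
    return (t_ref * n_ref) / (t_big * n_big)

# Illustrative timings (seconds per step), chosen to match the abstract.
print(f"weak:   {weak_efficiency(1.000, 1.070):.1%}")               # 6 -> 8,664 nodes
print(f"strong: {strong_efficiency(5.64, 1_536, 1.30, 8_664):.1%}") # 1,536 -> 8,664 nodes
```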
international parallel and distributed processing symposium | 2014
Yang You; Shuaiwen Leon Song; Haohuan Fu; Andres Marquez; Maryam Mehri Dehnavi; Kevin J. Barker; Kirk W. Cameron; Amanda Randles; Guangwen Yang
Support Vector Machine (SVM) has been widely used in data mining and Big Data applications as modern commercial databases attach increasing importance to analytic capabilities. In recent years, SVM has also been adapted to high-performance computing for power/performance prediction, auto-tuning, and runtime scheduling. However, even at the risk of losing prediction accuracy due to insufficient runtime information, researchers can only afford offline model training, to avoid significant runtime training overhead. Advanced multi- and many-core architectures offer massive parallelism with complex memory hierarchies, which can make runtime training possible but also forms a barrier to efficient parallel SVM design. To address these challenges, we designed and implemented MIC-SVM, a highly efficient parallel SVM for x86-based multi-core and many-core architectures, such as Intel Ivy Bridge CPUs and the Intel Xeon Phi co-processor (MIC). We propose several novel analysis methods and optimization techniques that fully utilize the multilevel parallelism provided by these architectures and can serve as general optimization methods for other machine learning tools. MIC-SVM achieves 4.4-84× and 18-47× speedups over the popular LIBSVM on MIC and Ivy Bridge CPUs, respectively, for several real-world data-mining datasets, and its performance is competitive even with GPUSVM running on a top-of-the-line NVIDIA K20x GPU. We also conduct a cross-platform performance comparison, focusing on Ivy Bridge CPUs, MIC, and GPUs, and provide insights on how to select the most suitable architecture for specific algorithms and input data patterns.
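The hot loop in SVM training is the kernel evaluation, and the data-level parallelism such implementations exploit looks roughly like the vectorized RBF kernel-block computation below. This is a generic sketch of the technique, not the MIC-SVM code; the sizes and gamma value are arbitrary.

```python
# Vectorized RBF kernel block: the matrix multiply inside maps well onto
# the wide SIMD units of MIC-class hardware. Illustrative sketch only.
import numpy as np

def rbf_kernel_block(X, Y, gamma):
    """K[i, j] = exp(-gamma * ||x_i - y_j||^2), with no Python-level loops."""
    sq = (np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)                    # expand the squared distance
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clamp tiny negatives

X = np.random.rand(2048, 64)
K = rbf_kernel_block(X, X[:256], gamma=0.1)   # one block of the kernel matrix
print(K.shape)                                # (2048, 256)
```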
ieee/acm international symposium cluster, cloud and grid computing | 2013
Songbin Liu; Xiaomeng Huang; Haohuan Fu; Guangwen Yang
Understanding inherent system characteristics is crucial to the design and optimization of cloud storage systems, yet few studies have systematically investigated their data characteristics and access patterns. This paper presents an analysis of a file system snapshot and a five-month access trace of a campus cloud storage system that has been deployed on the Tsinghua campus for three years. The system provides online storage and data sharing services for more than 19,000 students and 500 student groups. We report several data characteristics, including file size and file type, as well as access patterns, including read/write ratio, read-write dependency, and daily traffic. We find many differences between the cloud storage system and traditional file systems: our cloud storage system has larger file sizes, a lower read/write ratio, and a smaller set of active files than a typical traditional file system. With a trace-driven simulation, we find that cache efficiency can be improved fivefold using the guidance from our observations.
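A trace-driven cache simulation of the kind mentioned in the last sentence can be written in a few lines: replay an access trace against an LRU cache and report the hit ratio. The trace below is synthetic, with a skewed hot-file pattern standing in for the real five-month trace; the capacity and distribution parameters are arbitrary.

```python
# Minimal trace-driven LRU cache simulation with a synthetic skewed trace.
import random
from collections import OrderedDict

def lru_hit_ratio(trace, capacity):
    cache, hits = OrderedDict(), 0
    for f in trace:
        if f in cache:
            hits += 1
            cache.move_to_end(f)              # mark as most recently used
        else:
            cache[f] = True
            if len(cache) > capacity:
                cache.popitem(last=False)     # evict least recently used
    return hits / len(trace)

random.seed(0)
# Skewed synthetic trace: a small set of active files gets most accesses.
trace = [f"file{int(random.paretovariate(1.2))}" for _ in range(100_000)]
print(f"hit ratio: {lru_hit_ratio(trace, capacity=500):.2%}")
```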
IEEE Transactions on Computers | 2010
Haohuan Fu; Oskar Mencer; Wayne Luk
Using a general polynomial approximation approach, we present an arithmetic library generator for the logarithmic number system (LNS). The generator produces optimized LNS arithmetic libraries that improve significantly over previous LNS designs in area and latency. We also provide area cost estimation and bit-accurate simulation tools that facilitate comparison between LNS and floating-point designs.
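The core LNS idea can be stated in a few lines: represent x as (sign, log2|x|), so multiplication becomes exponent addition, while addition requires the nonlinear function log2(1 + 2^d) that the generated libraries approximate with polynomials. The toy sketch below shows the math only (same-sign addition, nonzero inputs), not the generated hardware library.

```python
# Toy logarithmic number system: value = sign * 2**exponent.
import math

def to_lns(x):
    """Encode a nonzero float as (sign, log2|x|)."""
    return (1 if x >= 0 else -1, math.log2(abs(x)))

def from_lns(s, e):
    return s * 2.0 ** e

def lns_mul(a, b):
    return (a[0] * b[0], a[1] + b[1])         # multiply = add exponents

def lns_add(a, b):
    """Same-sign addition: 2^ea + 2^eb = 2^(ea + log2(1 + 2^(eb-ea)))."""
    (sa, ea), (sb, eb) = a, b
    assert sa == sb, "toy version handles same-sign operands only"
    if ea < eb:
        ea, eb = eb, ea
    # log2(1 + 2^d) is the function the library approximates in hardware.
    return (sa, ea + math.log2(1.0 + 2.0 ** (eb - ea)))

x, y = to_lns(3.0), to_lns(4.0)
print(from_lns(*lns_mul(x, y)))   # 12.0
print(from_lns(*lns_add(x, y)))   # 7.0
```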