
Publication


Featured research published by Melin Huang.


IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing | 2014

GPU-Accelerated Longwave Radiation Scheme of the Rapid Radiative Transfer Model for General Circulation Models (RRTMG)

Erik Price; Jarno Mielikainen; Melin Huang; Bormin Huang; Hung-Lung Allen Huang; Tsengdar Lee

Atmospheric radiative transfer models calculate the transfer of electromagnetic radiation through a planetary atmosphere. One such model is the rapid radiative transfer model (RRTM), which evaluates longwave and shortwave atmospheric radiative fluxes and heating rates. The RRTM for general circulation models (GCMs), RRTMG, is an accelerated version based on the single-column reference RRTM. The longwave radiation scheme of RRTMG (RRTMG_LW) utilizes the correlated-k approach to calculate longwave fluxes and heating rates for application to GCMs. In this paper, the feasibility of using graphics processing units (GPUs) to accelerate RRTMG_LW in the weather research and forecasting (WRF) model is examined. GPUs allow a substantial performance improvement in RRTMG_LW, offering a large number of parallel compute cores at low cost and power. Our GPU version of RRTMG_LW yields bit-exact outputs compared with the original Fortran code. Our results show that an NVIDIA Tesla K40 GPU achieves a speedup of x as compared to its CPU counterpart running on one CPU core of an Intel Xeon E5-2603, whereas the speedup for one CPU socket (4 cores) of the Xeon E5-2603 with respect to one CPU core is only 3.2×.
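The column-independent structure that makes a radiation scheme like RRTMG_LW a good fit for GPUs can be sketched in C. This is an illustrative simplification, not code from the paper: `heating_rate`, its arguments, and the flux-divergence formula are hypothetical stand-ins, and the OpenMP pragma stands in for the GPU's thread mapping over columns.

```c
#include <stddef.h>

/* Simplified stand-in for a longwave heating-rate computation: each
 * horizontal column i is independent, so the outer loop can be mapped
 * to GPU threads or, as here, to CPU threads via OpenMP. The per-layer
 * work (index k) stays inside a column, as in a radiative transfer
 * sweep. Names and physics are illustrative only. */
void heating_rate(const double *flux_up, const double *flux_dn,
                  const double *dp, double *hr,
                  size_t ncol, size_t nlay)
{
    #pragma omp parallel for /* ignored without OpenMP; code stays correct serially */
    for (size_t i = 0; i < ncol; i++) {
        for (size_t k = 0; k < nlay; k++) {
            size_t lo = i * (nlay + 1) + k; /* level below layer k */
            size_t hi = lo + 1;             /* level above layer k */
            double net = (flux_up[lo] - flux_dn[lo])
                       - (flux_up[hi] - flux_dn[hi]);
            hr[i * nlay + k] = net / dp[i * nlay + k];
        }
    }
}
```

Because no column reads another column's data, the loop parallelizes without synchronization, which is the property the paper exploits.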


Computers & Geosciences | 2014

Comments on the paper by Huadong Xiao, Jing Sun, Xiaofeng Bian and Zhijun Dai, "GPU acceleration of the WSM6 cloud microphysics scheme in GRAPES model"

Jarno Mielikainen; Melin Huang; Bormin Huang; Allen Huang

The authors of the paper (Xiao et al., 2013) presented a speedup of 140× for the WSM6 microphysics module running on a low-cost NVIDIA GeForce 605 with 48 CUDA cores. Unfortunately, the presented speedup cannot be achieved using that hardware. In this communication, we comment on several implementation mistakes pertaining to that paper. Their actual speedup is only about 12×. A failure in their CUDA kernel launch also explains why the potential temperature differences between their CPU and GPU versions could be as large as 1.61.


IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing | 2015

Massive Parallelization of the WRF GCE Model Toward a GPU-Based End-to-End Satellite Data Simulator Unit

Melin Huang; Bormin Huang; Xiaojie Li; Allen Huang; Mitchell D. Goldberg; Ajay Mehta

Modern weather satellites provide more detailed observations of cloud and precipitation processes. To harness these observations for better satellite data assimilation, a cloud-resolving model, known as the Goddard Cumulus Ensemble (GCE) model, was developed and used by the Goddard Satellite Data Simulator Unit (G-SDSU). The GCE model has also been incorporated as part of the widely used weather research and forecasting (WRF) model. The computation of the cloud-resolving GCE model is time-consuming. This paper details our massively parallel design of the GPU-based WRF GCE scheme. With one NVIDIA Tesla K40 GPU, the GPU-based GCE scheme achieves a speedup of 361× as compared to its original Fortran counterpart running on one CPU core, whereas the speedup for one CPU socket (four cores) with respect to one CPU core is only 3.9×.


IEEE International Conference on High Performance Computing, Data and Analytics | 2014

Application of Intel Many Integrated Core (MIC) architecture to the Yonsei University planetary boundary layer scheme in Weather Research and Forecasting model

Melin Huang; Bormin Huang; Allen Huang

The Weather Research and Forecasting (WRF) model provides operational services worldwide and is linked to our daily activities, in particular during severe weather events. The Yonsei University (YSU) scheme is one of the planetary boundary layer (PBL) models in WRF. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transports in the whole atmospheric column; it determines the flux profiles within the well-mixed boundary layer and the stable layer, and thus provides atmospheric tendencies of temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. The YSU scheme is well suited to massively parallel computation as there are no interactions among horizontal grid points. To accelerate the computation of the YSU scheme, we employ the Intel Many Integrated Core (MIC) architecture, a many-core coprocessor design that supports efficient parallelization and vectorization. Our results show that MIC-based optimization improved the performance of the first version of the multi-threaded code on a Xeon Phi 5110P by a factor of 2.4x. Furthermore, the same CPU-based optimizations improved the performance on an Intel Xeon E5-2603 by a factor of 1.6x as compared to the first version of the multi-threaded code.
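The "no interactions among horizontal grid points" property the YSU discussion relies on can be illustrated with a toy column-wise vertical-diffusion tendency in C. The names, the diffusivity array, and the explicit update are hypothetical simplifications, not YSU code; the OpenMP pragma plays the role of the MIC thread parallelism.

```c
#include <stddef.h>

/* Illustrative explicit vertical-diffusion tendency with the
 * column-independent structure of a PBL scheme: no data is exchanged
 * between horizontal points i, so the outer loop parallelizes freely.
 * The inner loop over levels k works only within one column. */
void pbl_tendency(const double *theta, const double *K,
                  double *tend, size_t npts, size_t nlev, double dz)
{
    #pragma omp parallel for
    for (size_t i = 0; i < npts; i++) {
        const double *th = theta + i * nlev;
        double *td = tend + i * nlev;
        td[0] = 0.0;        /* boundary levels handled by other schemes */
        td[nlev - 1] = 0.0;
        for (size_t k = 1; k + 1 < nlev; k++) {
            double flux_up = K[i * nlev + k]     * (th[k + 1] - th[k]) / dz;
            double flux_dn = K[i * nlev + k - 1] * (th[k] - th[k - 1]) / dz;
            td[k] = (flux_up - flux_dn) / dz;   /* flux divergence */
        }
    }
}
```

The inner k loop over contiguous memory is also the natural target for the vectorization the abstract mentions.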


IEEE International Conference on High Performance Computing, Data and Analytics | 2014

Implementation of 5-layer thermal diffusion scheme in weather research and forecasting model with Intel Many Integrated Cores

Melin Huang; Bormin Huang; Allen Huang

For weather forecasting and research, the Weather Research and Forecasting (WRF) model has been developed, consisting of several components such as dynamic solvers and physical simulation modules. WRF includes several land-surface models (LSMs). The LSMs use atmospheric information from the surface layer scheme, radiative forcing from the radiation scheme, and precipitation forcing from the microphysics and convective schemes, together with the land's state variables and land-surface properties, to provide heat and moisture fluxes over land and sea-ice points. The WRF 5-layer thermal diffusion simulation is an LSM based on the MM5 5-layer soil temperature model with an energy budget that includes radiation, sensible, and latent heat flux. The WRF LSMs are well suited to massively parallel computation as there are no interactions among horizontal grid points. The efficient parallelization and vectorization features of the Intel Many Integrated Core (MIC) architecture allow us to optimize the WRF 5-layer thermal diffusion scheme. In this work, we present the computing performance of this scheme on the Intel MIC architecture. Our results show that MIC-based optimization improved the performance of the first version of the multi-threaded code on a Xeon Phi 5110P by a factor of 2.1x. Accordingly, the same CPU-based optimizations improved the performance on an Intel Xeon E5-2603 by a factor of 1.6x as compared to the first version of the multi-threaded code.
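A minimal sketch of the kind of layered soil update a 5-layer scheme performs, assuming a simple explicit diffusion step with the surface energy budget collapsed into one flux term. This is illustrative only, not the MM5/WRF formulation; the time step, diffusivity, and boundary treatment are assumed values.

```c
#include <math.h>

#define NSOIL 5  /* five soil layers, as in the scheme's name */

/* One explicit time step of soil-column thermal diffusion. The top
 * layer is forced by a combined surface energy flux (radiation +
 * sensible + latent, folded into one number here); the bottom layer
 * is held at a fixed deep-soil temperature. Illustrative only. */
void soil_step(double T[NSOIL], double dt, double alpha, double dz,
               double surface_flux)
{
    double Tn[NSOIL];
    Tn[0] = T[0] + dt * (surface_flux + alpha * (T[1] - T[0]) / (dz * dz));
    for (int k = 1; k < NSOIL - 1; k++)
        Tn[k] = T[k] + dt * alpha * (T[k + 1] - 2.0 * T[k] + T[k - 1]) / (dz * dz);
    Tn[NSOIL - 1] = T[NSOIL - 1];   /* fixed deep-soil boundary */
    for (int k = 0; k < NSOIL; k++) T[k] = Tn[k];
}
```

Each soil column updates independently of its horizontal neighbors, which is why the abstract calls the LSMs "well suited to massively parallel computation".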


International Conference on Parallel and Distributed Systems | 2013

Further Improvement on GPU-Based Parallel Implementation of WRF 5-Layer Thermal Diffusion Scheme

Melin Huang; Bormin Huang; Jarno Mielikainen; H.-L. Allen Huang; Mitchell D. Goldberg; Ajay Mehta

The Weather Research and Forecasting (WRF) model has been widely employed for weather prediction and atmospheric simulation, serving dual purposes in forecasting and research. Land-surface models (LSMs) are part of the WRF model and provide heat and moisture fluxes over land and sea-ice points. The 5-layer thermal diffusion simulation is an LSM based on the MM5 soil temperature model with an energy budget made up of sensible, latent, and radiative heat fluxes. Because there are no interactions among horizontal grid points, the LSMs are very favorable for massively parallel processing. The study presented in this article demonstrates our parallel computing efforts on the WRF 5-layer thermal diffusion scheme using graphics processing units (GPUs). Since this scheme is only one intermediate module of the entire WRF model, no I/O transfer is involved in the intermediate process. By employing one NVIDIA GTX 680 GPU in the case without I/O transfer, our optimization of the GPU-based 5-layer thermal diffusion scheme reaches a speedup as high as 247.5x with respect to one CPU core, whereas the speedup for one CPU socket with respect to one CPU core is only 3.1x. We can even boost the speedup to 332x with respect to one CPU core when three GPUs are applied.
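The three-GPU result above rests on partitioning the independent horizontal points across devices. A minimal even-partition sketch in C, assuming a simple block decomposition; this is an illustration of the idea, not the authors' actual partitioning code.

```c
#include <stddef.h>

/* Split npoints independent grid points across ndev devices as evenly
 * as possible: the first (npoints % ndev) devices get one extra point.
 * Each device then processes [*start, *start + *count) with no
 * inter-device communication, since the points are independent. */
void split_domain(size_t npoints, size_t ndev, size_t dev,
                  size_t *start, size_t *count)
{
    size_t base = npoints / ndev;
    size_t rem  = npoints % ndev;
    *count = base + (dev < rem ? 1 : 0);
    *start = dev * base + (dev < rem ? dev : rem);
}
```

With such a split, per-device host-to-device transfers can also overlap with computation on other devices, which is one reason multi-GPU speedup (332x) does not triple the single-GPU figure (247.5x).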


IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing | 2017

Parallel Construction of the WRF Pleim-Xiu Land Surface Scheme With Intel Many Integrated Core (MIC) Architecture

Melin Huang; Bormin Huang; Hung-Lung Allen Huang

The weather research and forecasting (WRF) model is a simulation model built for the needs of both operational weather forecasting and atmospheric research. The land-surface model (LSM), which describes one physical process in the atmosphere, supplies the heat and moisture fluxes over land points and sea-ice points. Among the several LSM schemes that have been developed and incorporated into WRF, the Pleim-Xiu (PX) scheme is a popular choice. WRF simulation codes for weather prediction have benefited from dramatically increasing computing power with the advent of large-scale parallelism. The efficient parallelization, vectorization, and many-core structure of the Intel Many Integrated Core (MIC) architecture allow us to accelerate the modeling code of the PX scheme. This paper demonstrates that the optimized MIC-based PX scheme executing on a Xeon Phi 7120P coprocessor enhances computing performance by 2.3× and 11.7×, respectively, in comparison to the initial CPU-based code running on one CPU socket (eight cores) and on one single CPU core of an Intel Xeon E5-2670.


IEEE International Conference on High Performance Computing, Data and Analytics | 2015

Application of Intel Many Integrated Core (MIC) accelerators to the Pleim-Xiu land surface scheme

Melin Huang; Bormin Huang; Allen Huang

The land-surface model (LSM) is one physics process in the weather research and forecasting (WRF) model. The LSM takes atmospheric information from the surface layer scheme, radiative forcing from the radiation scheme, and precipitation forcing from the microphysics and convective schemes, together with internal information on the land's state variables and land-surface properties. The LSM provides heat and moisture fluxes over land points and sea-ice points. The Pleim-Xiu (PX) scheme is one such LSM. The PX LSM features three pathways for moisture fluxes: evapotranspiration, soil evaporation, and evaporation from wet canopies. To accelerate the computation of this scheme, we employ the Intel Xeon Phi Many Integrated Core (MIC) architecture, a many-core coprocessor design that supports efficient parallelization and vectorization. Our results show that the MIC-based optimization of this scheme running on a Xeon Phi 7120P coprocessor improves the performance by 2.3x and 11.7x as compared to the original code running on one CPU socket (eight cores) and on one CPU core of an Intel Xeon E5-2670, respectively.
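The three PX moisture pathways can be pictured as contributions combined into a single surface moisture flux. The weighting below by vegetation fraction and wet-canopy fraction is an assumed, illustrative scheme, not the PX formulation; the function and parameter names are hypothetical.

```c
/* Combine the three moisture pathways into one surface flux:
 * - vegetated, dry-canopy area contributes evapotranspiration,
 * - vegetated, wet-canopy area contributes wet-canopy evaporation,
 * - bare ground contributes soil evaporation.
 * veg = vegetation fraction [0,1]; wet = wet-canopy fraction [0,1].
 * All of this weighting is an illustrative assumption. */
double px_latent_flux(double evapotranspiration, double soil_evap,
                      double canopy_evap, double veg, double wet)
{
    double from_canopy = veg * (wet * canopy_evap
                              + (1.0 - wet) * evapotranspiration);
    double from_soil = (1.0 - veg) * soil_evap;
    return from_canopy + from_soil;
}
```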


Data Compression, Communications and Processing | 2015

Parallel implementation of WRF double moment 5-class cloud microphysics scheme on multiple GPUs

Melin Huang; Bormin Huang; Allen Huang

The Weather Research and Forecasting (WRF) Double Moment 5-class (WDM5) mixed-ice microphysics scheme predicts the mixing ratios of hydrometeors and their number concentrations for warm rain species, including clouds and rain. WDM5 can be computed in parallel in the horizontal domain using many-core GPUs. To obtain better GPU performance, we manually rewrite the original WDM5 Fortran module as a highly parallel CUDA C program. We explore the use of coalesced memory access and asynchronous data transfer. Our GPU-based WDM5 module is scalable to multiple GPUs. By employing one NVIDIA Tesla K40 GPU, our GPU optimization of this scheme achieves a speedup of 252x with respect to its CPU counterpart Fortran code running on one CPU core of an Intel Xeon E5-2603, whereas the speedup for one CPU socket (4 cores) with respect to one CPU core is only 4.2x. We can even boost the speedup of this scheme to 468x with respect to one CPU core when two NVIDIA Tesla K40 GPUs are applied.
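Coalesced memory access, one of the optimizations mentioned above, depends on data layout: neighboring GPU threads must touch neighboring addresses. The index helpers below contrast a structure-of-arrays (SoA) layout, which is coalescing-friendly, with an array-of-structures (AoS) layout, which produces strided access. The field names are illustrative stand-ins, not the WDM5 variables.

```c
#include <stddef.h>

enum { QCLOUD, QRAIN, NRAIN, NFIELDS };  /* hypothetical field names */

/* SoA (field-major): the point index i varies fastest, so consecutive
 * threads reading the same field load consecutive addresses, and a
 * warp's accesses coalesce into few memory transactions. */
static inline size_t soa_index(size_t field, size_t i, size_t npoints)
{
    return field * npoints + i;
}

/* AoS (point-major): all fields of one point are adjacent, so
 * consecutive threads are NFIELDS elements apart -- strided,
 * uncoalesced access on a GPU. */
static inline size_t aos_index(size_t field, size_t i)
{
    return i * NFIELDS + field;
}
```

Rewriting a Fortran module's data accesses into the SoA pattern is one typical part of turning column-wise physics code into an efficient CUDA kernel.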


IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing | 2017

Acceleration of the WRF Monin–Obukhov–Janjic Surface Layer Parameterization Scheme on an MIC-Based Platform for Weather Forecast

Melin Huang; Bormin Huang; Hung-Lung Allen Huang

A state-of-the-art numerical weather prediction (NWP) model, comprising the weather research and forecasting (WRF) model and analysis techniques, has been extensively exercised for weather prediction all over the world. The WRF model, the core of NWP, comprises dynamic solvers and elaborate physical components for simulating fluid behavior, all of which are designed for both atmospheric research analyses and operational weather forecasting. One salient physical ingredient in WRF is the surface layer simulation, which provides surface heat and moisture fluxes through calculation of surface friction velocities and exchange coefficients. The Monin–Obukhov–Janjic (MOJ) scheme is one popular surface layer option in WRF, and one of the schemes we chose to accelerate toward an end-to-end accelerated weather model. One advantageous aspect of WRF is the independence among grid points, which facilitates parallel implementation. We present a parallel construction of the MOJ module using the vectorization elements and efficient parallelization furnished by the Intel Many Integrated Core (MIC) architecture. To achieve high computing performance, beyond the fundamental usage of the Intel MIC architecture, this paper offers some new approaches to code structure and optimization. In comparison with the original code executing on one CPU core and on one CPU socket (eight cores) of an Intel Xeon E5-2670, the optimized MIC-based MOJ module running on a Xeon Phi 7120P coprocessor improves the computing performance by 9.6× and 1.5×, respectively.

Collaboration


Dive into Melin Huang's collaborations.

Top Co-Authors

Bormin Huang (University of Wisconsin-Madison)
Allen Huang (University of Wisconsin-Madison)
Jarno Mielikainen (University of Wisconsin-Madison)
Mitchell D. Goldberg (National Oceanic and Atmospheric Administration)
Hung-Lung Allen Huang (University of Wisconsin-Madison)
H.-L. Allen Huang (University of Wisconsin-Madison)
Ajay Mehta (Silver Spring Networks)
Erik Price (University of Wisconsin-Madison)
H. Chen (University of Wisconsin-Madison)