
Publication


Featured research published by Heshan Lin.


High Performance Distributed Computing | 2010

MOON: MapReduce On Opportunistic eNvironments

Heshan Lin; Xiaosong Ma; Jeremy S. Archuleta; Wu-chun Feng; Mark K. Gardner; Zhe Zhang

MapReduce offers an ease-of-use programming paradigm for processing large data sets, making it an attractive model for distributed volunteer computing systems. However, unlike on dedicated resources, where MapReduce has mostly been deployed, such volunteer computing systems have significantly higher rates of node unavailability. Furthermore, nodes are not fully controlled by the MapReduce framework. Consequently, we found the data and task replication scheme adopted by existing MapReduce implementations woefully inadequate for resources with high unavailability. To address this, we propose MOON, short for MapReduce On Opportunistic eNvironments. MOON extends Hadoop, an open-source implementation of MapReduce, with adaptive task and data scheduling algorithms in order to offer reliable MapReduce services on a hybrid resource architecture, where volunteer computing systems are supplemented by a small set of dedicated nodes. Our tests on an emulated volunteer computing system, which uses a 60-node cluster where each node possesses a similar hardware configuration to a typical computer in a student lab, demonstrate that MOON can deliver a three-fold performance improvement over Hadoop in volatile, volunteer computing environments.
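
The replication problem MOON targets can be seen with a simple availability calculation. The sketch below is purely illustrative (it is not MOON's actual scheduling algorithm): assuming volunteer nodes fail independently and each is available with probability p, the chance that at least one of k replicas is reachable is 1 - (1 - p)^k.

```python
# Illustrative sketch: why a fixed replication factor breaks down on
# volatile nodes. Not MOON's actual scheduler -- just the availability
# arithmetic that motivates adding a small set of dedicated nodes.

def replica_availability(p, k):
    """P(at least one of k independent replicas is on an available node)."""
    return 1 - (1 - p) ** k

def min_replicas(p, target):
    """Smallest replication factor reaching the target availability."""
    k = 1
    while replica_availability(p, k) < target:
        k += 1
    return k

# On highly available dedicated nodes (p = 0.99), few replicas suffice:
print(min_replicas(0.99, 0.999))  # 2
# On volatile volunteer nodes (p = 0.6), far more copies are needed:
print(min_replicas(0.60, 0.999))  # 8
```

The rapid growth of the required replication factor as availability drops is why blindly replicating on volunteers is wasteful, and why placing one copy on a dedicated node can be cheaper than many copies on volatile ones.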


IEEE Transactions on Parallel and Distributed Systems | 2011

Coordinating Computation and I/O in Massively Parallel Sequence Search

Heshan Lin; Xiaosong Ma; Wu-chun Feng; Nagiza F. Samatova

With the explosive growth of genomic information, the searching of sequence databases has emerged as one of the most computation- and data-intensive scientific applications. Our previous studies suggested that parallel genomic sequence-search possesses highly irregular computation and I/O patterns. Effectively addressing these runtime irregularities is thus the key to designing scalable sequence-search tools on massively parallel computers. While the computation scheduling for irregular scientific applications and the optimization of noncontiguous file accesses have been well-studied independently, little attention has been paid to the interplay between the two. In this paper, we systematically investigate the computation and I/O scheduling for data-intensive, irregular scientific applications within the context of genomic sequence search. Our study reveals that the lack of coordination between computation scheduling and I/O optimization can result in severe performance issues. We then propose an integrated scheduling approach that effectively improves sequence-search throughput by gracefully coordinating the dynamic load balancing of computation and high-performance noncontiguous I/O.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2008

Massively parallel genomic sequence search on the Blue Gene/P architecture

Heshan Lin; Pavan Balaji; Ruth J. Poole; Carlos P. Sosa; Xiaosong Ma; Wu-chun Feng

This paper presents our first experiences in mapping and optimizing genomic sequence search onto the massively parallel IBM Blue Gene/P (BG/P) platform. Specifically, we performed our work on mpiBLAST, a parallel sequence-search code that has been optimized on numerous supercomputing environments. In doing so, we identify several critical performance issues. Consequently, we propose and study different approaches for mapping sequence-search and parallel I/O tasks on such massively parallel architectures. We demonstrate that our optimizations can deliver nearly linear scaling (93% efficiency) on up to 32,768 cores of BG/P. In addition, we show that such scalability enables us to complete a large-scale bioinformatics problem --- sequence searching a microbial genome database against itself to support the discovery of missing genes in genomes --- in only a few hours on BG/P. Previously, this problem was viewed as computationally intractable in practice.


International Conference on Parallel and Distributed Systems | 2011

StreamMR: An Optimized MapReduce Framework for AMD GPUs

Marwa Elteir; Heshan Lin; Wu-chun Feng; Thomas R. W. Scogland

MapReduce is a programming model from Google that facilitates parallel processing on a cluster of thousands of commodity computers. The success of MapReduce in cluster environments has motivated several studies of implementing MapReduce on a graphics processing unit (GPU), though these generally focus on NVIDIA GPUs. Our investigation reveals that the design and mapping of the MapReduce framework need to be revisited for AMD GPUs due to their notable architectural differences from NVIDIA GPUs. For instance, current state-of-the-art MapReduce implementations employ atomic operations to coordinate the execution of different threads. However, atomic operations can implicitly cause inefficient memory access and, in turn, severely impact performance. In this paper, we propose StreamMR, an OpenCL MapReduce framework optimized for AMD GPUs. With efficient atomic-free algorithms for output handling and intermediate result shuffling, StreamMR is superior to atomic-based MapReduce designs and can outperform existing atomic-free MapReduce implementations by nearly five-fold on an AMD Radeon HD 5870.
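
A common way to avoid atomics when parallel workers emit variable-length output is a two-pass, prefix-sum scheme: each worker first counts its outputs, an exclusive prefix sum converts counts into disjoint write offsets, and a second pass writes without contention. The sketch below illustrates this standard pattern serially; StreamMR's actual GPU algorithms may differ in the details.

```python
# Sketch of atomic-free output handling via an exclusive prefix sum.
# A standard two-pass pattern, shown serially for clarity; on a GPU
# each "record" below would be handled by a separate thread.
from itertools import accumulate

def emit_counts(records, mapper):
    """Pass 1: each worker counts how many pairs it will emit."""
    return [len(mapper(r)) for r in records]

def scatter(records, mapper):
    counts = emit_counts(records, mapper)
    # Exclusive prefix sum turns per-worker counts into disjoint offsets.
    offsets = [0] + list(accumulate(counts))[:-1]
    out = [None] * sum(counts)
    # Pass 2: each worker writes into its own slice -- no atomics needed.
    for rec, off in zip(records, offsets):
        for i, pair in enumerate(mapper(rec)):
            out[off + i] = pair
    return out

words = lambda line: [(w, 1) for w in line.split()]
print(scatter(["a b", "c"], words))  # [('a', 1), ('b', 1), ('c', 1)]
```

The trade-off is doing the map computation (or at least the counting) twice, in exchange for contention-free writes, which is typically a net win when atomics serialize memory traffic.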


International Conference on Parallel and Distributed Systems | 2010

Enhancing MapReduce via Asynchronous Data Processing

Marwa Elteir; Heshan Lin; Wu-chun Feng

The MapReduce programming model simplifies large-scale data processing on commodity clusters by having users specify a map function that processes input key/value pairs to generate intermediate key/value pairs, and a reduce function that merges and converts intermediate key/value pairs into final results. Typical MapReduce implementations such as Hadoop enforce barrier synchronization between the map and reduce phases, i.e., the reduce phase does not start until all map tasks are finished. In turn, this synchronization requirement can cause inefficient utilization of computing resources and can adversely impact performance. Thus, we present and evaluate two approaches to cope with the synchronization drawback of existing MapReduce implementations. The first approach, hierarchical reduction, starts a reduce task as soon as a predefined number of map tasks completes; it then aggregates the results of different reduce tasks following a tree structure. The second approach, incremental reduction, starts a predefined number of reduce tasks from the beginning and has each reduce task incrementally reduce records collected from map tasks. Together with our performance modeling, we evaluate the different reduction approaches with two real applications on a 32-node cluster. The experimental results show that incremental reduction generally outperforms hierarchical reduction. Moreover, incremental reduction can speed up the original Hadoop implementation by up to 35.33% for the word-count application and 57.98% for the grep application. In addition, incremental reduction outperforms the original Hadoop in an emulated cloud environment with heterogeneous compute nodes.
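
The map and reduce functions described above can be illustrated with the canonical word-count example. This is a minimal serial sketch of the programming model, not Hadoop's implementation; note that the reduce phase here begins only after all map work finishes, which is exactly the barrier that incremental reduction relaxes.

```python
# Minimal word-count sketch of the MapReduce programming model.
from collections import defaultdict

def map_fn(line):
    """Map: one input line -> intermediate (word, 1) pairs."""
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: merge all intermediate values for one key."""
    return (word, sum(counts))

def word_count(lines):
    groups = defaultdict(list)
    for line in lines:                  # map phase
        for word, one in map_fn(line):
            groups[word].append(one)    # shuffle: group by key
    # reduce phase: runs only after every map task is done (the barrier)
    return dict(reduce_fn(w, c) for w, c in groups.items())

print(word_count(["the cat", "the dog"]))  # {'the': 2, 'cat': 1, 'dog': 1}
```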


Computer Science - Research and Development | 2010

A first look at integrated GPUs for green high-performance computing

Thomas R. W. Scogland; Heshan Lin; Wu-chun Feng

The graphics processing unit (GPU) has evolved from a single-purpose graphics accelerator to a tool that can greatly accelerate the performance of high-performance computing (HPC) applications. Previous studies have shown that discrete GPUs, while energy efficient for compute-intensive scientific applications, consume very high power. In fact, a compute-capable discrete GPU can draw more than 200 watts by itself, which can be as much as an entire compute node (without a GPU) consumes. This massive power draw presents a serious roadblock to the adoption of GPUs in low-power environments, such as embedded systems. Even when GPUs are considered for data centers, their power draw presents a problem, as it increases the demand placed on support infrastructure such as cooling and available supplies of power, driving up cost. With the advent of compute-capable integrated GPUs with power consumption in the tens of watts, we believe it is time to re-evaluate the notion of GPUs being power-hungry.

In this paper, we present the first evaluation of the energy efficiency of integrated GPUs for green HPC. We make use of four specific workloads, each representative of a different computational dwarf, and evaluate them across three different platforms: a multicore system, a high-performance discrete GPU, and a low-power integrated GPU. We find that the integrated GPU delivers superior energy savings and a comparable energy-delay product (EDP) when compared to its discrete counterpart, and it can still outperform the CPUs of a multicore system at a fraction of the power.
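
The energy-delay product mentioned above weighs efficiency against performance: EDP = energy × runtime = power × runtime². The quick sketch below uses hypothetical numbers (not the paper's measurements) to show how a slower, low-power device can still achieve a comparable EDP to a fast, power-hungry one.

```python
# Energy-delay product sketch with hypothetical power/runtime figures.
def edp(power_watts, runtime_s):
    """EDP = energy * delay = (power * t) * t = power * t^2 (J*s)."""
    energy_joules = power_watts * runtime_s
    return energy_joules * runtime_s

# Hypothetical illustration: a discrete GPU that is 3x faster but draws
# 8x the power ends up with a comparable EDP, while the integrated GPU
# uses far less total energy (750 J vs. 2000 J).
print(edp(200.0, 10.0))  # discrete GPU:   20000.0 J*s
print(edp(25.0, 30.0))   # integrated GPU: 22500.0 J*s
```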


Cluster Computing and the Grid | 2012

Transparent Accelerator Migration in a Virtualized GPU Environment

Shucai Xiao; Pavan Balaji; James Dinan; Qian Zhu; Rajeev Thakur; Susan Coghlan; Heshan Lin; Gaojin Wen; Jue Hong; Wu-chun Feng

This paper presents a framework to support transparent, live migration of virtual GPU accelerators in a virtualized execution environment. Migration is a critical capability in such environments because it provides support for fault tolerance, on-demand system maintenance, resource management, and load balancing in the mapping of virtual to physical GPUs. Techniques to increase responsiveness and reduce migration overhead are explored. The system is evaluated by using four application kernels and is demonstrated to provide low migration overheads. Through transparent load balancing, our system provides a speedup of 1.7 to 1.9 for three of the four application kernels.


Military Communications Conference | 2010

Cognitive Radio Rides on the Cloud

Feng Ge; Heshan Lin; Amin Khajeh; C. Jason Chiang; Ahmed M. Eltawil; Charles W. Bostian; Wu-chun Feng; Ritu Chadha

Cognitive Radio (CR) is capable of adaptive learning and reconfiguration, promising consistent communications performance for C4ISR systems even in dynamic and hostile battlefield environments. As such, the vision of Network-Centric Operations becomes feasible. However, enabling adaptation and learning in CRs may require both storing a vast volume of data and processing it quickly. Because a CR usually has limited computing and storage capacity, determined by its size and battery, it may not be able to achieve its full capability. The cloud can provide its computing and storage utility for CRs to overcome such challenges. On the other hand, the cloud can also store and process the enormous amounts of data needed by C4ISR systems. However, today's wireless technologies have difficulty moving various types of data reliably and promptly on the battlefield. CR networks promise reliable and timely data communications for accessing the cloud. Overall, connecting CRs and the cloud overcomes the performance bottlenecks of each. This paper explores the opportunities of this confluence and describes our prototype system.


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2013

Optimizing Burrows-Wheeler Transform-Based Sequence Alignment on Multicore Architectures

Jing Zhang; Heshan Lin; Pavan Balaji; Wu-chun Feng

Computational biology sequence alignment tools using the Burrows-Wheeler Transform (BWT) are widely used in next-generation sequencing (NGS) analysis. However, despite extensive optimization efforts, the performance of these tools still cannot keep up with the explosive growth of sequencing data. Through an in-depth performance analysis of BWA, a popular BWT-based aligner, on multicore architectures, we demonstrate that such tools are limited by memory bandwidth due to their irregular memory access patterns. We then propose a locality-aware implementation of BWA that aims at optimizing its performance by better exploiting the caching mechanisms of modern multicore processors. Experimental results show that our improved BWA implementation can reduce last-level cache (LLC) misses by 30% and translation lookaside buffer (TLB) misses by 20%, resulting in up to 2.6-fold speedup over the original BWA implementation.
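
The Burrows-Wheeler Transform underlying these aligners can be sketched in a few lines. The naive construction below sorts all rotations of the input, which is far less efficient than the suffix-array-based methods real aligners like BWA use, but it shows the transform itself: the output clusters similar characters together, which is what enables the compressed FM-index searches whose scattered lookups cause the irregular memory accesses analyzed in the paper.

```python
# Naive Burrows-Wheeler Transform: last column of the sorted rotations.
# For illustration only -- O(n^2 log n); production aligners build the
# BWT from a suffix array instead.
def bwt(text):
    s = text + "$"                      # unique end-of-string sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

print(bwt("banana"))  # annb$aa
```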


Cluster Computing | 2012

Reliable MapReduce computing on opportunistic resources

Heshan Lin; Xiaosong Ma; Wu-chun Feng

MapReduce offers an ease-of-use programming paradigm for processing large data sets, making it an attractive model for opportunistic compute resources. However, unlike dedicated resources, where MapReduce has mostly been deployed, opportunistic resources have significantly higher rates of node volatility. As a consequence, the data and task replication scheme adopted by existing MapReduce implementations is woefully inadequate on such volatile resources.

In this paper, we propose MOON, short for MapReduce On Opportunistic eNvironments, which is designed to offer reliable MapReduce service for opportunistic computing. MOON adopts a hybrid resource architecture by supplementing opportunistic compute resources with a small set of dedicated resources, and it extends Hadoop, an open-source implementation of MapReduce, with adaptive task and data scheduling algorithms to take advantage of the hybrid resource architecture. Our results on an emulated opportunistic computing system running atop a 60-node cluster demonstrate that MOON can deliver significant performance improvements over Hadoop on volatile compute resources and can even finish jobs that are unable to complete in Hadoop.

Collaboration


Dive into Heshan Lin's collaborations.

Top Co-Authors

Pavan Balaji (Argonne National Laboratory)
Xiaosong Ma (North Carolina State University)
Nagiza F. Samatova (North Carolina State University)
Rajeev Thakur (Argonne National Laboratory)
Susan Coghlan (Argonne National Laboratory)