Verdi March

National University of Singapore

Publication

Featured research published by Verdi March.

network and parallel computing | 2008

Survey on Parallel Programming Model

Henry Kasim; Verdi March; Rita Zhang; Simon See

Microprocessor design has been shifting to multi-core architectures. Therefore, it is expected that parallelism will play a significant role in future generations of applications. Over the years, a myriad of parallel programming models have been proposed. In choosing a parallel programming model, not only is the performance aspect important, but also the qualitative aspect of how well parallelism is abstracted for developers. A model with a good abstraction of parallelism leads to higher application-development productivity. In this paper, we propose seven criteria to qualitatively evaluate parallel programming models. Our focus is on how parallelism is abstracted and presented to application developers. As a case study, we use these criteria to investigate six well-known parallel programming models in the HPC community.

international parallel and distributed processing symposium | 2009

An approach for matching communication patterns in parallel applications

Chao Ma; Yong Meng Teo; Verdi March; Naixue Xiong; Ioana Romelia Pop; Yan Xiang He; Simon See

Interprocessor communication is an important factor in determining the performance scalability of parallel systems. The communication requirements of a parallel application can be quantified to understand its communication pattern, and communication pattern similarities among applications can be determined. This is essential for the efficient mapping of applications onto parallel systems and leads to, among other benefits, better interprocessor communication implementations. This paper proposes a methodology to compare the communication patterns of distributed-memory programs. The communication correlation coefficient quantifies the degree of similarity between two applications based on the communication metrics selected to characterize the applications. To capture the network topology requirements, we extract the communication graph of each application and quantify this similarity. We apply this methodology to four applications in the NAS parallel benchmark suite and evaluate the communication patterns by studying the effects of varying problem size and the number of logical processes (LPs).

grid computing | 2011

A Read-Only Distributed Hash Table

Verdi March; Yong Meng Teo

A distributed hash table (DHT) is an infrastructure to support resource discovery in large distributed systems. In a DHT, data items such as resources, indexes of resources or resource metadata, are distributed across an overlay network based on a hash function. However, this may not be desirable in commercial applications such as Grid and cloud computing whereby the presence of multiple administrative domains leads to the issues of data ownership and self-economic interests. In this paper, we present R-DHT (Read-only DHT), a DHT-based resource discovery scheme without distributing data items. To map each data item back onto its resource owner, a physical host, we virtualize each host into virtual nodes. Nodes are further organized as a segment-based overlay network which increases node failure resiliency without replicating data items. We demonstrate the feasibility of our proposed scheme by presenting R-Chord, an implementation of R-DHT using Chord as the underlying overlay graph, with lookup and maintenance optimizations. Through analytical and simulation analyses, we evaluate the performance of R-DHT and compare it with traditional DHTs in terms of lookup path length, resiliency to node failures, and maintenance overhead. Overall, we found that R-DHT is effective and efficient for resource indexing and discovery in large distributed systems with a strong commercial requirement.
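
As a point of comparison for the abstract above, the following is a minimal sketch of how a conventional Chord-style DHT assigns a data item to a node: the item's key is hashed onto an identifier ring and stored on the first node at or after that identifier. This illustrates only the conventional placement that R-DHT avoids, not the R-DHT scheme itself. The function names, the use of SHA-1, and the 16-bit ring size are assumptions chosen for illustration.

```python
import hashlib
from bisect import bisect_left


def ring_id(name, bits=16):
    """Hash a string onto an identifier ring of size 2**bits (illustrative choice)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


def successor(node_ids, key_id):
    """Return the first node ID >= key_id, wrapping around the ring.

    node_ids must be a non-empty list sorted in ascending order.
    """
    i = bisect_left(node_ids, key_id)
    return node_ids[i % len(node_ids)]


# Hypothetical hosts; in a conventional DHT the item is stored on `owner`,
# which is generally not the host that originally held the resource.
nodes = sorted(ring_id(h) for h in ["host-a", "host-b", "host-c"])
owner = successor(nodes, ring_id("some-resource"))
```

Because placement depends only on the hash, the storing node is effectively arbitrary with respect to ownership, which is the data-ownership concern the abstract raises.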

local computer networks | 2005

Collision Detection and Resolution in Hierarchical Peer-to-Peer Systems

Verdi March; Yong Meng Teo; Hock Beng Lim; Peter Eriksson; Rassul Ayani

Structured peer-to-peer systems can be organized hierarchically as two-level overlay networks. The top-level overlay consists of groups of nodes, where each group is identified by a group identifier. In each group, one or more nodes are designated as supernodes and act as gateways to the nodes at the second level. A collision occurs during join operations, when two or more groups with the same group identifier are created at the top-level overlay. Collisions increase the lookup path length and the stabilization overhead, and reduce the scalability of hierarchical peer-to-peer systems. We propose a new scheme to detect and resolve collisions, and we study the impact of the collision problem on the performance of peer-to-peer systems. Our simulation results show the effectiveness of our scheme in reducing collisions and maintaining the size of the top-level overlay close to the ideal size.

chinagrid annual conference | 2009

Research on the Performance of xVM Virtual Machine Based on HPCC

Tiezhu Zhao; Yilong Ding; Verdi March; Shoubin Dong; Simon See

Virtual machine (VM) technology has received increasing interest in both industry and the research community. Although the potential advantages of virtualization for HPC workloads have been documented, its impact on application performance in HPC environments is not clearly understood. This paper presents a performance evaluation of virtual HPC systems using the High Performance Computing Challenge (HPCC) benchmark suite as the representative workload and xVM as the VM technology. We propose an efficient performance evaluation model based on the extended AHP (Analytic Hierarchy Process) method, analyze the results, and quantify the performance overhead of xVM in terms of compute, memory, and network overhead. Our analysis shows that, compared to paravirtualization, the computational and network performance in HVM is slightly better and the memory performance is significantly better.

parallel and distributed computing applications and technologies | 2010

On Parallel Stiff ODEs Solver for Hybrid CPU-GPU Architecture

Shuming Miao; Verdi March; Xinhua Lin; Simon See; Hong Liu

Hybrid architecture based on multi-core CPU and many-core GPU is an emerging computing platform for solving complex numerical problems. However, to effectively exploit its performance potential, applications need to be optimized for efficient execution on both CPU and GPU. This paper presents our experience in parallelizing and optimizing BiM for the hybrid architecture. BiM is an open-source solver for stiff ODE problems using the blended implicit method. In our proposed parallel BiM, the Newton iteration, which is highly compute intensive and embarrassingly parallel, is executed on the GPU. To achieve further performance, we also concurrently solve on the multi-core CPU multiple upper and lower triangular matrices that are independent of each other. Thus, parallelization can still be achieved even though solving each of the matrices is inherently sequential. Our experimental results validate the effectiveness of hybrid OpenMP-CUDA for BiM, where it achieves a speedup of 4.5x over the existing sequential implementation. This strategy is also shown to be more efficient than a pure OpenMP or CUDA implementation of BiM, achieving a speedup of 1.7–2.1x.

international symposium on distributed computing | 2010

Batch Scheduler for Personal Multi-Core Systems

Prakhar Gupta; Tarun Atrey; Manjari Garg; Verdi March; Simon See

A multi-core personal computer can run many compute-intensive programs concurrently. Over-subscription occurs when the demands of user programs exceed the available CPU cores or memory resources, such that these resources are time-shared by several programs. Traditionally, over-subscription is an approach to improve system utilization when user programs are not compute intensive. However, with compute-intensive programs, peak system utilization is achieved even without over-subscription. In addition, over-subscription in such a scenario prolongs the completion time of each program and risks thrashing the memory resources. To prevent over-subscription, we propose a batch scheduler for personal multi-core systems. It imposes a job queuing policy to ensure that CPU cores and memory resources are not time-shared by multiple programs. To demonstrate our idea, we present a simple implementation of a personal batch scheduler by extending a batch scheduler designed for HPC (high performance computing) clusters, with virtualization technologies.
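
The queuing policy described above can be illustrated with a minimal sketch: a job starts only if its requested cores and memory fit in what is currently free, and otherwise waits in a FIFO queue. This is not the paper's implementation (which extends an HPC cluster scheduler with virtualization); the class names, the strict-FIFO rule, and the resource figures are all assumptions for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int
    mem_gb: float


class PersonalBatchScheduler:
    """Hypothetical admission policy: never over-subscribe cores or memory."""

    def __init__(self, total_cores, total_mem_gb):
        self.free_cores = total_cores
        self.free_mem_gb = total_mem_gb
        self.queue = deque()
        self.running = {}

    def submit(self, job):
        """Enqueue a job; return the names of jobs started as a result."""
        self.queue.append(job)
        return self._dispatch()

    def finish(self, name):
        """Release a finished job's resources; return names of jobs started."""
        job = self.running.pop(name)
        self.free_cores += job.cores
        self.free_mem_gb += job.mem_gb
        return self._dispatch()

    def _fits(self, job):
        return job.cores <= self.free_cores and job.mem_gb <= self.free_mem_gb

    def _dispatch(self):
        started = []
        # Strict FIFO: stop at the first queued job that does not fit.
        while self.queue and self._fits(self.queue[0]):
            job = self.queue.popleft()
            self.free_cores -= job.cores
            self.free_mem_gb -= job.mem_gb
            self.running[job.name] = job
            started.append(job.name)
        return started
```

On a hypothetical 4-core, 8 GB machine, two jobs requesting 2 cores and 4 GB each start immediately, while a third job stays queued until one of them finishes. Strict FIFO is the simplest choice; a real scheduler might backfill smaller jobs instead.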

high performance computing and communications | 2009

Towards Predictive Modeling of Message-Passing Communication

Verdi March; Vijayaraghavan Murali; Yong Meng Teo; Simon See; James T. Himer

Communication has been shown to be a performance bottleneck and a limiting factor of many large parallel applications. As such, predicting the application scalability necessitates a communication performance model. This paper investigates the LogGP communication performance model for predicting message-passing communications when the system configuration (i.e., number of nodes) is varied. The cost functions for the message-passing operations are based on MVAPICH2 1.0, and the experiments are conducted on the Ranger system using up to 256 nodes connected with an InfiniBand network. For point-to-point communications, we observe that the LogGP model accurately predicts the communication performance. However, the results for three collective operations, i.e., MPI_Barrier, MPI_Alltoall, and MPI_Bcast, vary. For MPI_Bcast, the LogGP model is able to predict its scalability up to 256 nodes, and the prediction error is at most a factor of two on 256 nodes. For the remaining collectives, the scalability (bar that of MPI_Alltoall on small messages, m = 2 bytes) is predicted by LogGP, but the prediction error for 256 nodes is 3.5–12 times the measured performance.
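
For context, the standard LogGP estimate for delivering a single k-byte point-to-point message is o + (k - 1)G + L + o, where L is the network latency, o the per-message processor overhead, and G the gap per byte for long messages. A minimal sketch follows; the function name is hypothetical and the parameter values are illustrative placeholders, not measurements from the paper, whose MVAPICH2-specific cost functions for collectives are not reproduced here.

```python
def loggp_point_to_point(k_bytes, L, o, G):
    """Estimated time to deliver one k-byte message under the LogGP model.

    L: network latency
    o: per-message overhead, paid once by the sender and once by the receiver
    G: gap per byte for long messages
    All parameters share the same time unit.
    """
    if k_bytes < 1:
        raise ValueError("message must contain at least one byte")
    return o + (k_bytes - 1) * G + L + o


# Illustrative parameter values only (arbitrary time units).
t_small = loggp_point_to_point(2, L=2.0, o=0.5, G=0.001)
t_large = loggp_point_to_point(1_048_576, L=2.0, o=0.5, G=0.001)
```

For small messages the estimate is dominated by L + 2o, while for large messages the (k - 1)G term dominates.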

computational science and engineering | 2009

Performance Comparison of Four-Socket Server Architecture on HPC Workload

Henry Kasim; Verdi March; Simon See

Recent server architectures embrace a common technology feature: on-chip parallelism via multi-core and CMT (Chip Multi-Threading) technologies. However, they also differ significantly in a number of key aspects, including clock speed, micro-architecture, cache hierarchy, and memory sub-system. Such differences may lead to different levels of application performance. This paper presents a performance comparison of recent four-socket server architectures on various high performance computing (HPC) workloads. Our analysis is based on two benchmark suites from the Standard Performance Evaluation Corporation (SPEC): SPEC CPU2006 and SPEC OMP2001. Our analysis shows that no single architecture is the best for all types of workload. In addition, we found that the CPU clock speed, which is often used as the sole performance indicator, does not always reflect application performance.

Archive | 2005