
Publication


Featured research published by Hai-Xiang Lin.


Archive | 2009

Euro-Par 2009 Parallel Processing

Henk J. Sips; Dick H. J. Epema; Hai-Xiang Lin

Abstracts of Invited Talks: Multicore Programming Challenges; Ibis: A Programming System for Real-World Distributed Computing; What Is in a Namespace?
Topic 1 (Support Tools and Environments): Atune-IL: An Instrumentation Language for Auto-tuning Parallel Applications; Assigning Blame: Mapping Performance to High Level Parallel Programming Abstractions; A Holistic Approach towards Automated Performance Analysis and Tuning; Pattern Matching and I/O Replay for POSIX I/O in Parallel Programs; An Extensible I/O Performance Analysis Framework for Distributed Environments; Grouping MPI Processes for Partial Checkpoint and Co-migration; Process Mapping for MPI Collective Communications.
Topic 2 (Performance Prediction and Evaluation): Stochastic Analysis of Hierarchical Publish/Subscribe Systems; Characterizing and Understanding the Bandwidth Behavior of Workloads on Multi-core Processors; Hybrid Techniques for Fast Multicore Simulation; PSINS: An Open Source Event Tracer and Execution Simulator for MPI Applications; A Methodology to Characterize Critical Section Bottlenecks in DSM Multiprocessors.
Topic 3 (Scheduling and Load Balancing): Dynamic Load Balancing of Matrix-Vector Multiplications on Roadrunner Compute Nodes; A Unified Framework for Load Distribution and Fault-Tolerance of Application Servers; On the Feasibility of Dynamically Scheduling DAG Applications on Shared Heterogeneous Systems; Steady-State for Batches of Identical Task Trees; A Buffer Space Optimal Solution for Re-establishing the Packet Order in a MPSoC Network Processor; Using Multicast Transfers in the Replica Migration Problem: Formulation and Scheduling Heuristics; A New Genetic Algorithm for Scheduling for Large Communication Delays; Comparison of Access Policies for Replica Placement in Tree Networks; Scheduling Recurrent Precedence-Constrained Task Graphs on a Symmetric Shared-Memory Multiprocessor; Energy-Aware Scheduling of Flow Applications on Master-Worker Platforms.
Topic 4 (High Performance Architectures and Compilers): Last Bank: Dealing with Address Reuse in Non-Uniform Cache Architecture for CMPs; Paired ROBs: A Cost-Effective Reorder Buffer Sharing Strategy for SMT Processors; REPAS: Reliable Execution for Parallel ApplicationS in Tiled-CMPs; Impact of Quad-Core Cray XT4 System and Software Stack on Scientific Computation.
Topic 5 (Parallel and Distributed Databases): Unifying Memory and Database Transactions; A DHT Key-Value Storage System with Carrier Grade Performance; Selective Replicated Declustering for Arbitrary Queries.
Topic 6 (Grid, Cluster, and Cloud Computing): POGGI: Puzzle-Based Online Games on Grid Infrastructures; Enabling High Data Throughput in Desktop Grids through Decentralized Data and Metadata Management: The BlobSeer Approach; MapReduce Programming Model for .NET-Based Cloud Computing; The Architecture of the XtreemOS Grid Checkpointing Service; Scalable Transactions for Web Applications in the Cloud; Provider-Independent Use of the Cloud; MPI Applications on Grids: A Topology Aware Approach.
Topic 7 (Peer-to-Peer Computing): A Least-Resistance Path in Reasoning about Unstructured Overlay Networks; SiMPSON: Efficient Similarity Search in Metric Spaces over P2P Structured Overlay Networks; Uniform Sampling for Directed P2P Networks; Adaptive Peer Sampling with Newscast; Exploring the Feasibility of Reputation Models for Improving P2P Routing under Churn; Selfish Neighbor Selection in Peer-to-Peer Backup and Storage Applications; Zero-Day Reconciliation of BitTorrent Users with Their ISPs; Surfing Peer-to-Peer IPTV: Distributed Channel Switching.
Topic 8 (Distributed Systems and Algorithms): Distributed Individual-Based Simulation; A Self-stabilizing K-Clustering Algorithm Using an Arbitrary Metric; Active Optimistic Message Logging for Reliable Execution of MPI Applications.
Topic 9 (Parallel and Distributed Programming): A Parallel Numerical Library for UPC; A Multilevel Parallelization Framework for High-Order Stencil Computations; Using OpenMP vs. Threading Building Blocks for Medical Imaging on Multi-cores; Parallel Skeletons for Variable-Length Lists in SkeTo Skeleton Library; Stkm on Sca: A Unified Framework with Components, Workflows and Algorithmic Skeletons; Grid-Enabling SPMD Applications through Hierarchical Partitioning and a Component-Based Runtime; Reducing Rollbacks of Transactional Memory Using Ordered Shared Locks.
Topic 10 (Parallel Numerical Algorithms): Wavelet-Based Adaptive Solvers on Multi-core Architectures for the Simulation of Complex Systems; Localized Parallel Algorithm for Bubble Coalescence in Free Surface Lattice-Boltzmann Method; Fast Implicit Simulation of Oscillatory Flow in Human Abdominal Bifurcation Using a Schur Complement Preconditioner; A Parallel Rigid Body Dynamics Algorithm; Optimized Stencil Computation Using In-Place Calculation on Modern Multicore Systems; Parallel Implementation of Runge-Kutta Integrators with Low Storage Requirements; PSPIKE: A Parallel Hybrid Sparse Linear System Solver; Out-of-Core Computation of the QR Factorization on Multi-core Processors; Adaptive Parallel Householder Bidiagonalization.
Topic 11 (Multicore and Manycore Programming): Tile Percolation: An OpenMP Tile Aware Parallelization Technique for the Cyclops-64 Multicore Processor; An Extension of the StarSs Programming Model for Platforms with Multiple GPUs; StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures; XJava: Exploiting Parallelism with Object-Oriented Stream Programming; JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA; Fast and Efficient Synchronization and Communication Collective Primitives for Dual Cell-Based Blades; Searching for Concurrent Design Patterns in Video Games; Parallelization of a Video Segmentation Algorithm on CUDA-Enabled Graphics Processing Units; A Parallel Point Matching Algorithm for Landmark Based Image Registration Using Multicore Platform; High Performance Matrix Multiplication on Many Cores; Parallel Lattice Basis Reduction Using a Multi-threaded Schnorr-Euchner LLL Algorithm; Efficient Parallel Implementation of Evolutionary Algorithms on GPGPU Cards.
Topic 12 (Theory and Algorithms for Parallel Computation): Implementing Parallel Google Map-Reduce in Eden; A Lower Bound for Oblivious Dimensional Routing.
Topic 13 (High-Performance Networks): A Case Study of Communication Optimizations on 3D Mesh Interconnects; Implementing a Change Assimilation Mechanism for Source Routing Interconnects; Dependability Analysis of a Fault-Tolerant Network Reconfiguring Strategy; RecTOR: A New and Efficient Method for Dynamic Network Reconfiguration; NIC-Assisted Cache-Efficient Receive Stack for Message Passing over Ethernet; A Multipath Fault-Tolerant Routing Method for High-Speed Interconnection Networks; Hardware Implementation Study of the SCFQ-CA and DRR-CA Scheduling Algorithms.
Topic 14 (Mobile and Ubiquitous Computing): Optimal and Near-Optimal Energy-Efficient Broadcasting in Wireless Networks.


IEEE International Conference on High Performance Computing Data and Analytics | 1997

The Improved Quasi-minimal Residual Method on Massively Distributed Memory Computers

Tianruo Yang; Hai-Xiang Lin

For the solutions of linear systems of equations with unsymmetric coefficient matrices, we propose an improved version of the quasi-minimal residual (IQMR) method that uses the Lanczos process as a major component, combining elements of numerical stability and parallel algorithm design. For the Lanczos process, stability is obtained by a coupled two-term procedure that generates Lanczos vectors normalized to unit length. The algorithm is derived in such a way that all inner products and matrix-vector multiplications of a single iteration step are independent, so that the communication time required for the inner products can be overlapped efficiently with computation time. Therefore, the cost of global communication on parallel distributed-memory computers is significantly reduced. The resulting IQMR algorithm preserves the favorable properties of the Lanczos process without increasing computational costs. The efficiency of this method is demonstrated by numerical experiments carried out on a massively parallel distributed-memory computer, the Parsytec GC/PowerPlus.
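The key structural idea of the abstract, making the inner products and the matrix-vector product of one iteration independent so communication can hide behind computation, can be sketched in a few lines. This is not the authors' code: the thread standing in for the global reduction, and the dense example operands, are illustrative assumptions only.

```python
# Sketch: overlap a (communication-bound) inner product with local
# matrix-vector work, as the IQMR derivation makes possible.
# The background thread plays the role of the global reduction.
from concurrent.futures import ThreadPoolExecutor

def local_dot(x, y):
    # inner product; on a real machine this would end in a global reduce
    return sum(a * b for a, b in zip(x, y))

def matvec(rows, x):
    # dense row-wise matrix-vector product (stand-in for the sparse one)
    return [sum(r[j] * x[j] for j in range(len(x))) for r in rows]

def overlapped_step(rows, x, y):
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(local_dot, x, y)  # start the "communication"
        w = matvec(rows, x)                 # compute while it is in flight
        return fut.result(), w              # both results of the iteration
```

Because the two operands of each operation are fixed at the start of the step, neither task waits on the other, which is exactly the independence property the derivation provides.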


International Parallel Processing Symposium | 1999

LLB: A fast and effective scheduling algorithm for distributed-memory systems

A. Radulescu; A.J.C. van Gemund; Hai-Xiang Lin

This paper presents a new algorithm called List-based Load Balancing (LLB) for compile-time task scheduling on distributed-memory machines. LLB is intended as a cluster-mapping and task-ordering step in the multi-step class of scheduling algorithms. Unlike current multi-step approaches, LLB integrates cluster-mapping and task-ordering in a single step. The benefits of this integration are twofold. First, it allows dynamic load balancing in time, because only the ready tasks are considered in the mapping process. Second, communication is also considered, as opposed to algorithms like WCM and GLB. The algorithm has a low time complexity of O(E + V(log V + log P)), where E is the number of dependences, V the number of tasks, and P the number of processors. Experimental results show that LLB outperforms known cluster-mapping algorithms of comparable complexity, improving schedule lengths by up to 42%. Furthermore, compared with LCA, a much higher-complexity algorithm, LLB obtains comparable results for fine-grain graphs and yields improvements of up to 16% for coarse-grain graphs.
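The "only ready tasks are considered" idea can be illustrated with a minimal list scheduler. This is a simplified sketch, not LLB itself: it omits communication costs and priorities, which LLB does account for, and schedules each ready task on the processor that becomes free earliest.

```python
import heapq
from collections import deque

def list_schedule(succ, cost, nprocs):
    """Ready-list scheduling sketch: mapping and ordering in one pass
    over ready tasks only. succ maps task -> successors, cost maps
    task -> execution time."""
    indeg = {t: 0 for t in cost}
    for ss in succ.values():
        for s in ss:
            indeg[s] += 1
    ready = deque(t for t, d in indeg.items() if d == 0)
    procs = [(0, p) for p in range(nprocs)]   # (time processor is free, id)
    heapq.heapify(procs)
    est = {t: 0 for t in cost}                # earliest start: data ready
    finish = {}
    while ready:
        t = ready.popleft()
        free, p = heapq.heappop(procs)
        start = max(free, est[t])             # wait for proc AND predecessors
        finish[t] = (p, start + cost[t])
        heapq.heappush(procs, (start + cost[t], p))
        for s in succ.get(t, ()):
            est[s] = max(est[s], finish[t][1])
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)               # becomes ready only now
    makespan = max(end for _, end in finish.values())
    return finish, makespan
```

The heap over processor free-times is what gives the log P factor in the stated complexity; the per-edge indegree updates give the E term.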


Journal of Parallel and Distributed Computing | 2008

Parallel and distributed simulation of sediment dynamics in shallow water using particle decomposition approach

W. M. Charles; E. van den Berg; Hai-Xiang Lin; A.W. Heemink; Martin Verlaan

This paper describes the parallel simulation of sediment dynamics in shallow water. By using a Lagrangian model, the problem is transformed into one in which a large number of independent particles must be tracked. This results in a technique that can be parallelised with high efficiency. We have developed a sediment transport model using three different sediment suspension methods. The first method uses a modified mean for the Poisson distribution function to determine the expected number of suspended particles in each grid cell of the domain over all available processors. The second method determines the number of particles to suspend, with the aid of the Poisson distribution function, only in those grid cells which are assigned to that processor. The third method uses a synchronised pseudo-random-number generator to generate identical numbers of suspended particles in all valid grid cells on each processor. Parallel simulation experiments are performed to investigate the efficiency of these three methods, and the parallel performance of the implementations is analysed. We conclude that the second method is the best on distributed computing systems (e.g., a Beowulf cluster), whereas the third maintains the best load distribution.
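The third method (synchronised generators) can be sketched directly: if every processor seeds an identical pseudo-random-number generator, they all draw the same Poisson counts per grid cell without communicating. The Knuth sampling routine and the per-cell rate list below are illustrative assumptions, not the paper's implementation.

```python
import math
import random

def poisson_knuth(rng, lam):
    # Knuth's algorithm for one Poisson(lam) draw
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def suspended_counts(seed, lam_per_cell):
    """Every processor calls this with the same seed, so all of them
    generate identical suspension counts for every cell; each then keeps
    only the particles in its own cells."""
    rng = random.Random(seed)
    return [poisson_knuth(rng, lam) for lam in lam_per_cell]
```

Determinism across "processors" is the whole point: two independent calls with the same seed agree cell by cell.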


International Symposium on Distributed Computing | 2010

Sparse Matrix-Vector Multiplication Optimizations based on Matrix Bandwidth Reduction using NVIDIA CUDA

Shiming Xu; Hai-Xiang Lin; Wei Xue

In this paper we propose the optimization of sparse matrix-vector multiplication (SpMV) with CUDA based on matrix bandwidth/profile reduction techniques. The computational time required to access the dense vector is decoupled from the SpMV computation. By reducing the matrix profile, the time required to access the dense vector is reduced by 17% (SP) and 24% (DP). The reduced matrix bandwidth enables compression of the column index information into shorter formats, resulting in a 17% (SP) and 10% (DP) reduction in the execution time for accessing matrix data in ELLPACK format. The overall SpMV speedup is 16% and 12.6% over the whole matrix test suite. The optimization proposed in this paper can be combined with other SpMV optimizations such as register blocking.

Keywords: SpMV, GP-GPU, NVIDIA CUDA, RCM

I. Introduction

SpMV is an intensively used computational kernel in many scientific applications, such as iterative Krylov subspace solvers [3] and preconditioners [4]. There is a large body of work on optimizing SpMV on various parallel platforms [12,13,6,5,7]. Due to its low ratio of computation to memory accesses, SpMV is mainly memory-bandwidth bound. General-purpose computing on graphics processing units (GPGPU [2]) is the technique of using the GPU's throughput-oriented architecture for computations other than graphics. NVIDIA CUDA [1] is the first widely used platform for GPGPU. In CUDA, NVIDIA GPUs are abstracted as a platform with massively parallel threads. Threads are organized in a hierarchy of thread blocks and grids. Each thread block is bound to one Streaming Multiprocessor. Threads are scheduled at the granularity of warps, each with 32 threads. On NVIDIA GT200-series GPUs, 30 Streaming Multiprocessors are present, each with hardware resources such as the Texture Cache (TC) and Shared Memory (ShM). The TC can be used to cache read-only data and exploits spatial locality.

Recent papers have optimized SpMV on GPU platforms [6,5,7], achieving higher performance than conventional CPU architectures through bandwidth-aware matrix formats such as ELLPACK [9] and use of the TC for the dense vector in SpMV. In this paper we propose the use of matrix bandwidth reduction techniques to optimize the SpMV kernel on the CUDA platform. We use a decoupled framework to evaluate SpMV by dividing it into two parts: matrix-centric memory accesses, which are deterministic, and source-vector-centric memory accesses, which have non-deterministic access patterns. We show that both parts can be enhanced through matrix bandwidth reduction. In total, speedups of 16% and 12.6% are achieved over the matrix test suite for single and double precision, respectively.

The paper is organized as follows. Section II introduces the SpMV implementation in CUDA and the related matrix formats. Section III covers SpMV optimization based on matrix bandwidth reduction algorithms. Experiments and analysis are in Section IV. Section V concludes the paper.
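A reference version of the ELLPACK-format SpMV discussed above can be written in a few lines. This is an illustrative sequential sketch, not the paper's CUDA kernel: ELLPACK pads every row to a fixed width, and a narrow matrix bandwidth keeps the column indices small, which is what allows the index compression the paper measures.

```python
def ell_spmv(values, col_idx, x):
    """y = A @ x with A in ELLPACK layout: values[i] and col_idx[i]
    hold the padded (value, column) slots of row i; padded slots carry
    value 0.0 so they contribute nothing to the sum."""
    y = []
    for vals, cols in zip(values, col_idx):
        y.append(sum(v * x[c] for v, c in zip(vals, cols)))
    return y
```

On a GPU one thread handles one row, and because every row has the same slot count, consecutive threads read consecutive memory locations (the format's point).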


Parallel Computing | 2001

A unifying graph model for designing parallel algorithms for tridiagonal systems

Hai-Xiang Lin

A framework based on graph-theoretic notions is described for the design and analysis of a wide range of parallel tridiagonal matrix algorithms. It comprises basic graph transformation operations: partition, selection, elimination, and update. We use the framework to present a unified description of many known parallel algorithms for the solution of tridiagonal systems. We also discuss the use of this framework to design parallel algorithms.
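One concrete algorithm expressible in such a framework is the classical Thomas algorithm, where the graph operations reduce to "eliminate node i, update node i+1" in sequence; parallel variants differ mainly in how nodes are partitioned and selected. The sketch below is the standard sequential method, shown only to make the elimination/update steps concrete.

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system: a = sub-diagonal (a[0] unused),
    b = main diagonal, c = super-diagonal (c[-1] unused), d = rhs.
    Forward sweep = eliminate/update each node; back sweep = recover x."""
    n = len(b)
    bp, cp, dp = b[:], c[:], d[:]
    for i in range(1, n):
        m = a[i] / bp[i - 1]          # eliminate the sub-diagonal entry
        bp[i] -= m * cp[i - 1]        # update the next node's pivot
        dp[i] -= m * dp[i - 1]        # ... and its right-hand side
    x = [0.0] * n
    x[-1] = dp[-1] / bp[-1]
    for i in range(n - 2, -1, -1):    # back substitution
        x[i] = (dp[i] - cp[i] * x[i + 1]) / bp[i]
    return x
```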


PLOS ONE | 2017

A Method for Finding Metabolic Pathways Using Atomic Group Tracking

Yiran Huang; Cheng Zhong; Hai-Xiang Lin; Jianyi Wang

A fundamental computational problem in metabolic engineering is to find pathways between compounds. Pathfinding methods using atom tracking have been widely used to find biochemically relevant pathways. However, these methods require the user to define the atoms to be tracked. This may lead to failing to predict the pathways that do not conserve the user-defined atoms. In this work, we propose a pathfinding method called AGPathFinder to find biochemically relevant metabolic pathways between two given compounds. In AGPathFinder, we find alternative pathways by tracking the movement of atomic groups through metabolic networks and use combined information of reaction thermodynamics and compound similarity to guide the search towards more feasible pathways and better performance. The experimental results show that atomic group tracking enables our method to find pathways without the need of defining the atoms to be tracked, avoid hub metabolites, and obtain biochemically meaningful pathways. Our results also demonstrate that atomic group tracking, when incorporated with combined information of reaction thermodynamics and compound similarity, improves the quality of the found pathways. In most cases, the average compound inclusion accuracy and reaction inclusion accuracy for the top resulting pathways of our method are around 0.90 and 0.70, respectively, which are better than those of the existing methods. Additionally, AGPathFinder provides the information of thermodynamic feasibility and compound similarity for the resulting pathways.
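The search component of such a pathfinding method can be sketched as a weighted shortest-path search over a compound graph. This is a generic Dijkstra sketch, not AGPathFinder: the graph, and the edge-weight function standing in for the combined thermodynamics/similarity score, are hypothetical.

```python
import heapq

def find_pathway(graph, weight, src, dst):
    """graph maps a compound to compounds reachable by one reaction;
    weight(u, v) scores that reaction edge (lower = more feasible).
    Returns the minimum-total-weight compound sequence src -> dst."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue                       # stale heap entry
        for v in graph.get(u, ()):
            nd = d + weight(u, v)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [], dst
    while node != src:                     # walk predecessors back to src
        path.append(node)
        node = prev[node]
    return [src] + path[::-1]
```

A weight that penalises thermodynamically unfavourable reactions or dissimilar compounds steers the search toward feasible routes, which is the guidance role the abstract describes.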


Mathematical and Computer Modelling | 2009

Adaptive stochastic numerical scheme in parallel random walk models for transport problems in shallow water

W. M. Charles; E. van den Berg; Hai-Xiang Lin; A.W. Heemink

This paper deals with the simulation of the transport of pollutants in shallow water using random walk models and develops several computational techniques to speed up the numerical integration of the stochastic differential equations (SDEs). This is achieved by using both random time stepping and parallel processing. We start by considering a basic stochastic Euler scheme for integration of the diffusion and drift terms of the SDEs, of order 1 in the strong sense. The error of this scheme depends on the location of the pollutant: it is dominated by the diffusion term near boundaries, and by the deterministic drift further away from the boundaries. Using a pair of integration schemes, one of strong order 1.5 near the boundary and one of strong order 2.0 elsewhere, we can estimate the error and approximate an optimal step size for a given error tolerance. The resulting algorithm is developed such that it allows complete flexibility of the step size, while guaranteeing the correct Brownian behaviour. Modelling pollutants by non-interacting particles enables the use of parallel processing in the simulation. We take advantage of this by implementing the algorithm using the MPI library. The inherently asynchronous nature of the particle simulation, in addition to the parallel processing, makes it difficult to get a coherent picture of the results at any given point in time. However, by inserting internal synchronisation points in the temporal discretisation, the code allows pollution snapshots and particle counts to be made at times specified by the user.
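The adaptive-step idea can be sketched with a simpler error estimator than the paper's order-1.5/order-2.0 pair: compare one full Euler-Maruyama step against two half steps, and note that summing the two half-step Brownian increments gives the full-step increment, which preserves correct Brownian behaviour across step-size changes. All numerical choices below (tolerance handling, growth factor) are illustrative assumptions.

```python
import math
import random

def euler_maruyama_adaptive(x0, t_end, drift, diff, tol, seed=0):
    """Adaptive-step Euler-Maruyama sketch for dx = drift(x)dt + diff(x)dW.
    Local error is estimated as |two half steps - one full step|."""
    rng = random.Random(seed)
    t, x, dt = 0.0, x0, 0.01
    while t < t_end:
        dt = min(dt, t_end - t)               # land exactly on t_end
        h = dt / 2.0
        dw1 = rng.gauss(0.0, math.sqrt(h))
        dw2 = rng.gauss(0.0, math.sqrt(h))
        xh = x + drift(x) * h + diff(x) * dw1             # half step 1
        xh = xh + drift(xh) * h + diff(xh) * dw2          # half step 2
        xf = x + drift(x) * dt + diff(x) * (dw1 + dw2)    # full step,
        err = abs(xh - xf)                                # same Brownian path
        if err <= tol or dt <= 1e-6:
            t, x = t + dt, xh                 # accept the finer solution
            if err < tol / 4:
                dt *= 2.0                     # grow when comfortably accurate
        else:
            dt /= 2.0                         # reject, retry smaller
    return x
```

With the diffusion coefficient set to zero this reduces to an adaptive ODE integrator, which makes its accuracy easy to check against an exponential decay.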


IEEE Transactions on Computers | 1991

An improved vector-reduction method

Henk J. Sips; Hai-Xiang Lin

A pipelined vector-reduction method based on L.M. Ni and K. Hwang's (1985) symmetric and asymmetric reduction methods is discussed. It is shown that the proposed method is the fastest among known pipelined vector-reduction methods.
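The general shape of pipelined vector reduction can be illustrated with a recursive-halving sketch. This is not Sips and Lin's exact scheme: it only shows why such methods suit a pipeline, since the n/2 pair-combinations of each pass are mutually independent and can stream through an arithmetic pipeline back to back, with the whole reduction finishing in about log2(n) passes.

```python
def pairwise_reduce(v, op=lambda a, b: a + b):
    """Reduce vector v with associative op by symmetric pairing:
    each pass combines element i with element n-1-i, halving the
    active length, until one value remains."""
    n = len(v)
    v = list(v)
    while n > 1:
        half = n // 2
        for i in range(half):
            v[i] = op(v[i], v[n - 1 - i])  # independent pair operations
        n = n - half                       # odd middle element carries over
    return v[0]
```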


Journal of Parallel and Distributed Computing | 1990

A new model for on-line arithmetic with an application to the reciprocal calculation

Henk J. Sips; Hai-Xiang Lin

A new model for on-line arithmetic functions is presented. It differs from current models in that in each iteration step the exact function value is used to generate the on-line digits. From this model, the on-line properties of a large class of arithmetic functions can be determined. In general, the on-line properties of an arithmetic function derived from the model cannot be improved by any iterative approximation algorithm for that function. The on-line properties of a number of standard arithmetic functions are given. It is shown that the above model of on-line computation can easily be implemented by means of a table look-up system. Furthermore, a table implementation can be used to start the on-line computation of an iterative approximation method. This is shown by an example, the reciprocal calculation, where the combination of a seed table and an adapted Newton-Raphson iteration method leads to a system with a low on-line delay and fast cycle times. The algorithm works for normalized, quasi-normalized, and pseudo-normalized numbers and can therefore be applied to chained on-line computations.
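The seed-table-plus-Newton-Raphson reciprocal the abstract mentions can be sketched in floating point. The table size and iteration count below are arbitrary choices for illustration, not the paper's parameters; the point is that a small table indexed by the operand's leading bits supplies a good initial guess, and each Newton step x <- x(2 - d*x) roughly doubles the number of correct digits.

```python
def reciprocal(d, iters=3):
    """Approximate 1/d for a normalized operand d in [1, 2):
    16-entry seed table (midpoint reciprocals of the 16 sub-intervals),
    then `iters` Newton-Raphson refinement steps."""
    assert 1.0 <= d < 2.0, "expects a normalized operand"
    table = [1.0 / (1.0 + (k + 0.5) / 16.0) for k in range(16)]
    x = table[int((d - 1.0) * 16.0)]   # seed ~ 1/d from top bits of d
    for _ in range(iters):
        x = x * (2.0 - d * x)          # quadratic convergence to 1/d
    return x
```

With a seed accurate to about 2%, three steps drive the relative error below 1e-12, which is why a small look-up table suffices to start the iteration.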

Collaboration


Dive into Hai-Xiang Lin's collaborations.

Top Co-Authors (all at Delft University of Technology):

A.W. Heemink
Henk J. Sips
Guangliang Fu
Sha Lu
Johan L. A. Dubbeldam
Arjan J. C. van Gemund
Edwin A. H. Vollebregt
H. H. ten Cate