Guang R. Gao | Researchain

Publication

Featured research published by Guang R. Gao.

Bioinformatics | 2002

TROLL—Tandem Repeat Occurrence Locator

Adalberto T. Castelo; Wellington Santos Martins; Guang R. Gao

SUMMARY: Tandem Repeat Occurrence Locator (TROLL) is a lightweight Simple Sequence Repeat (SSR) finder based on a slight modification of the Aho-Corasick algorithm. It is fast and requires only a standard personal computer (PC) to operate. We report running times of 127 s to find all SSRs of length 20 bp or more on the complete Arabidopsis genome (approx. 130 Mbases divided into five chromosomes) using an Athlon 650 MHz PC with 256 MB of RAM. AVAILABILITY: TROLL is an open source project and is available at http://finder.sourceforge.net.
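
To make the task concrete, the following is a minimal sketch of what an SSR finder reports. It is a naive left-to-right scan, not TROLL's Aho-Corasick-based method; the function name `find_ssrs` and the parameters `max_motif` and `min_len` are hypothetical, with `min_len=20` chosen only to mirror the 20 bp threshold mentioned above.

```python
def find_ssrs(seq, max_motif=6, min_len=20):
    """Return (start, motif, run_length) for each tandem repeat run.

    Naive illustration only: at each position, try every motif size
    from 1 to max_motif and keep the longest run of back-to-back copies.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    i = 0
    while i < n:
        best = None
        for k in range(1, max_motif + 1):
            motif = seq[i:i + k]
            if len(motif) < k:
                break  # not enough sequence left for a motif of size k
            j = i + k
            while seq[j:j + k] == motif:
                j += k  # extend the run by one more full copy
            run = j - i
            # Require at least two copies; on ties prefer the shorter motif.
            if run >= min_len and run >= 2 * k and (best is None or run > best[2]):
                best = (i, motif, run)
        if best:
            hits.append(best)
            i += best[2]  # skip past the reported run
        else:
            i += 1
    return hits
```

For example, `find_ssrs("GGG" + "AC" * 12 + "TTT")` returns `[(3, "AC", 24)]`: a 24 bp run of the dinucleotide `AC` starting at offset 3.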

Bioinformatics | 2005

An improved hidden Markov model for transmembrane protein detection and topology prediction and its applications to complete genomes

Robel Y. Kahsay; Guang R. Gao; Li Liao

MOTIVATION: Knowledge of the transmembrane helical topology can help identify binding sites and infer functions for membrane proteins. However, because membrane proteins are hard to solubilize and purify, only a very small number of membrane proteins have an experimentally determined structure and topology. This has motivated various computational methods for predicting the topology of membrane proteins. RESULTS: We present an improved hidden Markov model, TMMOD, for the identification and topology prediction of transmembrane proteins. Our model uses TMHMM as a prototype, but differs from TMHMM in the architecture of the submodels for loops on both sides of the membrane and in the model training procedure. In cross-validation experiments using a set of 83 transmembrane proteins with known topology, TMMOD outperformed TMHMM and other existing methods, with an accuracy of 89% for both topology and locations. In another experiment using a separate set of 160 transmembrane proteins, TMMOD achieved 84% accuracy for topology and 89% for locations. When used to distinguish transmembrane proteins from non-transmembrane proteins, particularly signal peptides, TMMOD consistently produces fewer false positives than TMHMM. Application of TMMOD to a collection of complete genomes shows that the number of predicted membrane proteins accounts for approximately 20-30% of all genes in those genomes, and that the topology where both the N- and C-termini are in the cytoplasm is dominant in these organisms except for Caenorhabditis elegans. AVAILABILITY: http://liao.cis.udel.edu/website/servers/TMMOD/

International Parallel and Distributed Processing Symposium | 2010

Dynamic load balancing on single- and multi-GPU systems

Long Chen; Oreste Villa; Sriram Krishnamoorthy; Guang R. Gao

The computational power provided by many-core graphics processing units (GPUs) has been exploited in many applications. The programming techniques currently employed on these GPUs are not sufficient to address problems exhibiting irregular and unbalanced workloads. The problem is exacerbated when trying to effectively exploit multiple GPUs concurrently, which are commonly available in many modern systems. In this paper, we propose a task-based dynamic load-balancing solution for single- and multi-GPU systems. The solution allows load balancing at a finer granularity than what is supported in current GPU programming APIs, such as NVIDIA's CUDA. We evaluate our approach using both micro-benchmarks and a molecular dynamics application that exhibits significant load imbalance. Experimental results with a single-GPU configuration show that our fine-grained task solution can utilize the hardware more efficiently than the CUDA scheduler for unbalanced workloads. On multi-GPU systems, our solution achieves near-linear speedup, load balance, and significant performance improvement over techniques based on standard CUDA APIs.

Languages and Compilers for Parallel Computing | 1992

Collective Loop Fusion for Array Contraction

Guang R. Gao; R. Olsen; Vivek Sarkar; Radhika Thekkath

In this paper we propose a loop fusion algorithm specifically designed to increase opportunities for array contraction. Array contraction is an optimization that transforms array variables into scalar variables within a loop nest. In contrast to array elements, scalar variables have better cache behavior and can be allocated to registers. In past work we investigated loop interchange and loop reversal as optimizations that increase opportunities for array contraction [13]. This paper extends this work by including the loop fusion optimization. The fusion method discussed in this paper uses the maxflow-mincut algorithm to do loop clustering. Our collective loop fusion algorithm is efficient, and we demonstrate its usefulness for array contraction with a simple example.
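
As a small illustration of why fusion matters for array contraction, consider the sketch below. It is not taken from the paper: the function names and the computation are hypothetical, and the fusion is done by hand rather than by the maxflow-mincut clustering the abstract describes. In the unfused version the temporary `t` must be a full array because it is written in one loop and read in another; once the two loops are fused, each element of `t` is produced and consumed in the same iteration, so it can be contracted to a scalar.

```python
def unfused(a, b):
    """Two separate loops: the temporary t must be a full array."""
    n = len(a)
    t = [0.0] * n
    for i in range(n):
        t[i] = a[i] + b[i]      # producer loop
    c = [0.0] * n
    for i in range(n):
        c[i] = t[i] * 2.0       # consumer loop
    return c


def fused_contracted(a, b):
    """After fusion, t[i] lives only within one iteration and is
    contracted to the scalar t."""
    n = len(a)
    c = [0.0] * n
    for i in range(n):
        t = a[i] + b[i]         # scalar replaces the array element
        c[i] = t * 2.0
    return c
```

Both functions compute the same result; the second avoids allocating and traversing the intermediate array.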

International Workshop on High Performance Reconfigurable Computing Technology and Applications | 2007

Implementation of the Smith-Waterman algorithm on a reconfigurable supercomputing platform

Peiheng Zhang; Guangming Tan; Guang R. Gao

The XD1000, an innovative reconfigurable supercomputing platform, was developed by XtremeData Inc. to exploit the rapid progress of FPGA technology and the high performance of the HyperTransport interconnect. In this paper, we present implementations of the Smith-Waterman algorithm for both DNA and protein sequences on the platform. The main features include: (1) a multistage PE (processing element) design that significantly reduces FPGA resource usage and hence allows more parallelism to be exploited; (2) a pipelined control mechanism with uneven stage latencies, which is key to minimizing the overall PE pipeline cycle time; (3) a compressed substitution matrix storage structure, resulting in a substantial decrease in on-chip SRAM usage. Finally, we implement a 384-PE systolic array running at 66.7 MHz, which can achieve 25.6 GCUPS peak performance. Compared with the 2.2 GHz AMD Opteron host processor, the FPGA coprocessor achieves speedups of 185X and 250X, respectively.
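
For reference, the recurrence that each processing element evaluates can be sketched in software as follows. This is a plain sequential version with a linear gap penalty, not the paper's hardware design; the scoring values (`match`, `mismatch`, `gap`) are hypothetical. The helper `peak_gcups` checks the reported peak figure under the assumption that every PE completes one cell update per clock cycle.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)   # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]               # column 0 is always 0
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


def peak_gcups(num_pes, clock_mhz):
    """Peak giga cell updates per second, assuming one update per PE per cycle."""
    return num_pes * clock_mhz * 1e6 / 1e9
```

With these assumptions, `peak_gcups(384, 66.7)` evaluates to about 25.61, consistent with the 25.6 GCUPS peak quoted above.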

IEEE Transactions on Computers | 2000

Location consistency-a new memory model and cache consistency protocol

Guang R. Gao; Vivek Sarkar

Existing memory models and cache consistency protocols assume the memory coherence property, which requires that all processors observe the same ordering of write operations to the same location. In this paper, we address the problem of defining a memory model that does not rely on the memory coherence assumption and also the problem of designing a cache consistency protocol based on such a memory model. We define a new memory consistency model, called Location Consistency (LC), in which the state of a memory location is modeled as a partially ordered multiset (pomset) of write and synchronization operations. We prove that LC is strictly weaker than existing memory models, but is still equivalent to stronger models for the common case of parallel programs that have no data races. We also describe a new multiprocessor cache consistency protocol based on the LC memory model. We prove that this LC protocol obeys the LC memory model. The LC protocol does not need to enforce single write ownership of memory blocks. As a result, the LC protocol is simpler and more scalable than existing snooping and directory-based cache consistency protocols.

Programming Language Design and Implementation | 1996

Software pipelining showdown: optimal vs. heuristic methods in a production compiler

John C. Ruttenberg; Guang R. Gao; A. Stoutchinin; Woody Lichtenstein

This paper is a scientific comparison of two code generation techniques with identical goals: generation of the best possible software pipelined code for computers with instruction level parallelism. Both are variants of modulo scheduling, a framework for generation of software pipelines pioneered by Rau and Glaser [RaGl81], but are otherwise quite dissimilar. One technique was developed at Silicon Graphics and is used in the MIPSpro compiler. This is the production compiler for SGI's systems, which are based on the MIPS R8000 processor [Hsu94]. It is essentially a branch-and-bound enumeration of possible schedules with extensive pruning. This method is heuristic because of the way it prunes and also because of the interaction between register allocation and scheduling. The second technique aims to produce optimal results by formulating the scheduling and register allocation problem as an integrated integer linear programming (ILP) problem. This idea has received much recent exposure in the literature [AlGoGa95, Feautrier94, GoAlGa94a, GoAlGa94b, Eichenberger95], but to our knowledge all previous implementations have been too preliminary for detailed measurement and evaluation. In particular, we believe this to be the first published measurement of runtime performance for ILP-based generation of software pipelines. A particularly valuable result of this study was evaluation of the heuristic pipelining technology in the SGI compiler. One of the motivations behind the McGill research was the hope that optimal software pipelining, while not in itself practical for use in production compilers, would be useful for their evaluation and validation. Our comparison has indeed provided a quantitative validation of the SGI compiler's pipeliner, leading us to increased confidence in both techniques.

ACM Transactions on Programming Languages and Systems | 1996

Identifying loops using DJ graphs

Vugranam C. Sreedhar; Guang R. Gao; Yong-Fong Lee

Loop identification is a necessary step in loop transformations for high-performance architectures. One classical technique for detecting loops is Tarjan's interval-finding algorithm. The intervals identified by Tarjan's method are single-entry, strongly connected subgraphs that closely reflect a program's loop structure. We present a simple algorithm for identifying both reducible and irreducible loops using DJ graphs. Our method is a generalization of Tarjan's method, as it identifies nested intervals (or loops) even in the presence of irreducibility.

Symposium on Principles of Programming Languages | 1993

A novel framework of register allocation for software pipelining

Qi Ning; Guang R. Gao

Although software pipelining has been proposed as one of the most important loop scheduling methods, simultaneous scheduling and register allocation is less understood and remains an open problem [28]. The objective of this paper is to develop a unified algorithmic framework for concurrent scheduling and register allocation to support time-optimal software pipelining. A key intuition leading to this surprisingly simple formulation and its efficient solution is the association of the maximum computation rate of a program graph with its critical cycles, due to Reiter's pioneering work on Karp-Miller computation graphs [29]. In particular, our method generalizes the work by Callahan, Carr and Kennedy on scalar expansion [6], the work by Lam on modular variable expansion for software pipelined loops [20], and the work by Rau et al. on register allocation for modulo scheduled loops [28].
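
The link between computation rate and critical cycles can be illustrated with the standard recurrence bound from the software pipelining literature: for a loop's dependence graph, the period between successive iterations cannot be smaller than the largest ratio of total latency to total iteration distance over all dependence cycles. The sketch below is an assumption-laden illustration, not the paper's formulation: the function name, the edge encoding `(src, dst, latency, distance)`, and the brute-force enumeration of simple cycles (suitable only for tiny graphs) are all hypothetical choices.

```python
from fractions import Fraction


def critical_cycle_ratio(num_nodes, edges):
    """Max over simple cycles of sum(latency) / sum(distance).

    edges: list of (src, dst, latency, distance) with nodes 0..num_nodes-1.
    Returns a Fraction, or None if the graph has no cycle.
    """
    adj = {u: [] for u in range(num_nodes)}
    for u, v, lat, dist in edges:
        adj[u].append((v, lat, dist))
    best = None

    def dfs(start, node, lat_sum, dist_sum, visited):
        nonlocal best
        for nxt, lat, dist in adj[node]:
            l, d = lat_sum + lat, dist_sum + dist
            if nxt == start:
                if d == 0:
                    raise ValueError("zero-distance cycle: no valid schedule")
                ratio = Fraction(l, d)
                if best is None or ratio > best:
                    best = ratio
            # Only visit nodes larger than start so each simple cycle
            # is enumerated once, from its smallest node.
            elif nxt > start and nxt not in visited:
                dfs(start, nxt, l, d, visited | {nxt})

    for s in range(num_nodes):
        dfs(s, s, 0, 0, {s})
    return best
```

A cycle attaining the maximum ratio is a critical cycle; the reciprocal of that ratio bounds the achievable iteration rate.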

International Symposium on Microarchitecture | 1994

Minimizing register requirements under resource-constrained rate-optimal software pipelining

R. Govindarajan; Erik R. Altman; Guang R. Gao

In this paper we address the following software pipelining problem: given a loop and a machine architecture with a fixed number of processor resources (e.g. function units), how can one construct a software-pipelined schedule which runs on the given architecture at the maximum possible iteration rate (a la rate-optimal) while minimizing the number of registers? The main contributions of this paper are as follows. First, we demonstrate that such a problem can be described by a simple mathematical formulation with precise optimization objectives under a periodic linear scheduling framework. The mathematical formulation provides a clear picture which permits one to visualize the overall solution space (for rate-optimal schedules) under different sets of constraints. Second, we show that a precise mathematical formulation and its solution do make a significant performance difference. We evaluated the performance of our method against three other leading contemporary heuristic methods. Experimental results show that the method described in this paper performed significantly better than these methods.
