Weijia Shang | Researchain

Publication

Featured researches published by Weijia Shang.

IEEE Transactions on Computers | 1991

Time optimal linear schedules for algorithms with uniform dependencies

Weijia Shang; José A. B. Fortes

The authors address the problem of identifying optimal linear schedules for uniform dependence algorithms so that their execution time is minimized. Procedures are proposed to solve this problem based on the mathematical solution of a nonlinear optimization problem. The complexity of these procedures is independent of the size of the algorithm. Actually, the complexity is exponential in the dimension of the index set of the algorithm, and for all practical purposes, very small due to the limited dimension of the index set of algorithms of practical interest. A particular class of algorithms for which the proposed solution is greatly simplified is considered, and the corresponding simpler organization procedure is provided.
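
To make the notion of a linear schedule concrete, here is a minimal sketch that brute-forces small integer schedule vectors for a rectangular index set. It is not the authors' procedure (whose cost is independent of the algorithm size); the function names, the search bound, the example dependence vectors, and the assumption of one time step per unit of the schedule value are all illustrative.

```python
from itertools import product

def is_valid(pi, deps):
    """A linear schedule pi is valid when every dependence advances time by >= 1."""
    return all(sum(p * d for p, d in zip(pi, dep)) >= 1 for dep in deps)

def total_time(pi, index_set):
    """Execution time: span of pi . j over the index set, plus one step."""
    times = [sum(p * j for p, j in zip(pi, point)) for point in index_set]
    return max(times) - min(times) + 1

def best_linear_schedule(index_set, deps, bound=3):
    """Exhaustively try integer vectors with entries in [-bound, bound] (hypothetical bound)."""
    dim = len(deps[0])
    best = None
    for pi in product(range(-bound, bound + 1), repeat=dim):
        if is_valid(pi, deps):
            t = total_time(pi, index_set)
            if best is None or t < best[1]:
                best = (pi, t)
    return best

# Hypothetical example: a 4 x 4 index set with three uniform dependence vectors.
index_set = list(product(range(4), range(4)))
deps = [(1, 0), (0, 1), (1, 1)]
pi, t = best_linear_schedule(index_set, deps)
```

For this example the search settles on the schedule vector (1, 1), which finishes the 16 computations in 7 time steps.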

IEEE Transactions on Parallel and Distributed Systems | 1992

On time mapping of uniform dependence algorithms into lower dimensional processor arrays

Weijia Shang; José A. B. Fortes

Most existing methods of mapping algorithms into processor arrays are restricted to the case where n-dimensional algorithms, or algorithms with n nested loops, are mapped into (n-1)-dimensional arrays. However, in practice, it is interesting to map n-dimensional algorithms into (k-1)-dimensional arrays where k >

IEEE Transactions on Computers | 1992

Independent partitioning of algorithms with uniform dependencies

Weijia Shang; José A. B. Fortes

Uniform dependence algorithms with arbitrary index sets are considered, and two computationally inexpensive methods to find their independent partitions are proposed. Each method has advantages over the other one for certain kinds of applications, and they both outperform previously proposed approaches in terms of computational complexity and/or optimality. Also, lower and upper bounds are given for the cardinality of maximal independent partitions. In multiple instruction multiple data (MIMD) systems, if different blocks of an independent partition are assigned to different processors, communications between processors will be minimized to zero. This is significant because the communications usually dominate the overhead in MIMD machines.
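
As an illustration of what an independent partition is, the sketch below groups index points into blocks such that no dependence crosses a block boundary. It uses a plain union-find over every index point, so its cost grows with the index set; it is an assumption-laden stand-in, not either of the inexpensive methods the paper proposes. The function name and the example dependence vectors are hypothetical.

```python
from itertools import product

def independent_partition(index_set, deps):
    """Group index points into blocks with no dependence crossing blocks."""
    points = set(index_set)
    parent = {p: p for p in points}

    def find(p):
        # Follow parent links to the block representative, halving paths on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for p in points:
        for d in deps:
            q = tuple(a + b for a, b in zip(p, d))
            if q in points:
                # p and q are linked by a dependence, so they share a block.
                parent[find(p)] = find(q)

    blocks = {}
    for p in points:
        blocks.setdefault(find(p), []).append(p)
    return [sorted(b) for b in blocks.values()]

# Hypothetical example: 4 x 4 index set, dependences of stride 2 in each dimension.
index_set = list(product(range(4), range(4)))
blocks = independent_partition(index_set, deps=[(2, 0), (0, 2)])
```

Here the 16 points fall into four blocks of four points each, one per parity class, so four processors could run them with no communication.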

IEEE Transactions on Parallel and Distributed Systems | 2002

On time optimal supernode shape

Edin Hodzic; Weijia Shang

With the objective of minimizing the total execution time of a parallel program on a distributed memory parallel computer, this paper discusses the selection of an optimal supernode shape of a supernode transformation (also known as tiling). We identify three parameters of a supernode transformation: supernode size, relative side lengths, and cutting hyperplane directions. For supernode transformations on algorithms with perfectly nested loops and uniform dependencies, we prove the optimality of a constant linear schedule vector and give a necessary and sufficient condition for optimal relative side lengths. We also prove that the total running time is minimized by a cutting hyperplane direction matrix from a particular subset of all valid directions and we discuss the cases where this subset is unique. The results are derived in continuous space and should be considered approximate. Our model does not include cache effects and assumes an unbounded number of available processors, the communication cost approximated by a constant, uniform dependences, and loop bounds known at compile time. A comprehensive example is discussed with an application of the results to the Jacobi algorithm.

IEEE Transactions on Parallel and Distributed Systems | 1994

On loop transformations for generalized cycle shrinking

Weijia Shang; Matthew T. O'Keefe; José A. B. Fortes

This paper describes several loop transformation techniques for extracting parallelism from nested loop structures. Nested loops can then be scheduled to run in parallel so that execution time is minimized. One technique is called selective cycle shrinking, and the other is called true dependence cycle shrinking. It is shown how selective shrinking is related to linear scheduling of nested loops and how true dependence shrinking is related to conflict-free mappings of higher dimensional algorithms into lower dimensional processor arrays. Methods are proposed in this paper to find the selective and true dependence shrinkings with minimum total execution time by applying the techniques of finding optimal linear schedules and optimal and conflict-free mappings proposed by W. Shang and J.A.B. Fortes.

IEEE Transactions on Circuits and Systems for Video Technology | 2010

Context Adaptive Lagrange Multiplier (CALM) for Rate-Distortion Optimal Motion Estimation in Video Coding

Jun Zhang; Xiaoquan Yi; Nam Ling; Weijia Shang

In this paper, we propose an efficient and practical algorithm to dynamically adapt the Lagrange multipliers for each macroblock based on the context of the neighboring or upper layer blocks to improve rate-distortion performance. Our method improves the accuracy for the detection of true motion vectors as well as the most efficient encoding modes for luma, which are used for deriving the motion vectors, and modes for chroma. Simulation results for H.264/advanced video coding video demonstrate that our method reduces bit rate significantly and achieves peak signal-to-noise ratio gain over those of the joint model (JM) software for all sequences tested, with negligible extra computational cost. The improvement is particularly significant for high motion high-resolution videos. This paper describes our work that led to our Joint Video Team adopted contribution (included in software JM 12.0 onward), collectively known as context adaptive Lagrange multiplier (CALM).
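
For readers unfamiliar with the role of the Lagrange multiplier in motion estimation, the sketch below shows the standard rate-distortion cost J = D + λR and how the multiplier shifts the chosen motion vector. It does not reproduce the context-adaptive rule of CALM; the function names and all candidate values are made up for illustration.

```python
def rd_cost(distortion, rate_bits, lam):
    """Standard Lagrangian rate-distortion cost J = D + lambda * R."""
    return distortion + lam * rate_bits

def pick_motion_vector(candidates, lam):
    """candidates: list of (motion_vector, distortion, rate_bits) tuples."""
    return min(candidates, key=lambda c: rd_cost(c[1], c[2], lam))[0]

# Hypothetical candidates: a cheap zero vector and a more accurate, costlier one.
candidates = [((0, 0), 120.0, 2), ((3, -1), 100.0, 10)]

low_lambda_choice = pick_motion_vector(candidates, lam=1.0)   # favors low distortion
high_lambda_choice = pick_motion_vector(candidates, lam=4.0)  # favors low rate
```

With the small multiplier the more accurate vector (3, -1) wins; with the larger one the cheaper vector (0, 0) wins, which is why the value of the multiplier matters per block.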

Field Programmable Gate Arrays | 2002

A faster distributed arithmetic architecture for FPGAs

Radhika S. Grover; Weijia Shang; Qiang Li

Distributed Arithmetic (DA) is an important technique to implement digital signal processing (DSP) functions in FPGAs. However, traditional lookup table (LUT) based DA architectures contain one or more carry propagation chains in the critical path that dictates the fastest time at which an entire design can run. In this paper, we describe a novel technique that can reduce or eliminate the carry-propagate chain from the critical path in LUT based DA architectures on FPGAs. In the proposed scheme, the individual bits of a word do not have to be processed as a unit. Instead, the current iteration can start as soon as the least significant bit (LSB) of the previous iteration is available, without waiting for the entire word from the previous iteration to be fully computed. This technique has great potential in speeding up DSP applications based on DA. Designs are described for serial and parallel DALUT and accumulator structures in which an n-bit carry chain, where n is the word length, is broken into smaller r-bit chains, 1 ≤ r < n. A cost-performance analysis of the designs is presented. The analysis shows that the designs proposed in this paper have a lower cost-performance ratio (indicating better performance) than traditional DA designs. We also show that the 8-bit (r = 8) designs offer a good compromise between cost and performance. The implementation is on a Xilinx chip XC4028XL-3-BG256 using Xilinx Foundation tools v 3.1i. The results show that the proposed designs can achieve speedup by a factor of at least 1.5 over traditional DA designs in some cases.
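
As background, the sketch below models the traditional LUT-based, bit-serial form of distributed arithmetic in software. It does not model the carry-chain splitting proposed in the paper; the function names are hypothetical, and unsigned inputs are assumed to keep the example short.

```python
def build_lut(coeffs):
    """LUT entry for address a is the sum of the coefficients whose bit is set in a."""
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (a >> k) & 1)
            for a in range(1 << n)]

def da_inner_product(coeffs, xs, word_bits):
    """Bit-serial inner product of coeffs with unsigned word_bits-bit inputs xs."""
    lut = build_lut(coeffs)
    acc = 0
    for b in range(word_bits):  # process one bit plane per iteration, LSB first
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> b) & 1) << k  # bit b of input k selects LUT address bit k
        acc += lut[addr] << b  # weighted accumulation (the shift-add step)
    return acc

# Hypothetical example: three coefficients, 3-bit unsigned inputs.
result = da_inner_product([3, -1, 4], [5, 7, 2], word_bits=3)
```

The accumulation in the last line of the loop is where a hardware design would carry across the full word, which is the path the abstract says the proposed designs shorten.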

International Parallel and Distributed Processing Symposium | 1992

On uniformization of affine dependence algorithms

Zhigang Chen; Weijia Shang

The authors consider the problem of transforming irregular data dependence structures of algorithms with nested loops into more regular ones. Algorithms under consideration are n-dimensional algorithms (algorithms with n nested loops) with affine dependences where dependences are linear functions of index variables of the loop. Methods are proposed to transform these algorithms into uniform dependence algorithms where dependences are independent of the index variables (constant). Some parallelism might be lost due to making them uniform. The parallelism preserved by the uniformity is measured by (1) the total execution time by the optimal linear schedule which assigns each computation in the algorithm an execution time according to a linear function of the index of the computation and (2) the size of the cone spanned by the dependence vectors after achieving uniformity. The objective of making the dependence uniform is to maximize parallelism preserved by the uniformity or to minimize the number of dependences after uniformity.

Signal Processing Systems | 1989

On the optimality of linear schedules

Weijia Shang; José A. B. Fortes

An algorithm can be modeled as a set of indexed computations, and a schedule is a mapping of the algorithm index space into time. Linear schedules are a special class of schedules that are described by a linear mapping and are commonly used in many systolic algorithms. Free schedules cause computations of an algorithm to execute as soon as their operands are available. If one computation uses data generated by another computation, then a data dependence exists between these two computations which can be represented by the difference of their indices (called dependence vector). Many important algorithms are characterized by the fact that data dependencies are uniform, i.e., the values of the dependence vectors are independent of the indices of computations. There are applications where it is of interest to find an optimal linear schedule with respect to the time of execution of a specific computation of the given algorithm. This paper addresses the problem of identifying optimal linear schedules for uniform dependence algorithms so that the execution time of a specific computation of the algorithm is minimized and proposes a procedure to solve this problem based on the mathematical solution of a linear optimization problem. Also, linear schedules are compared with free schedules. The comparison indicates that optimal linear schedules can be as efficient as free schedules, the best schedules possible, and identifies a class of algorithms for which this is always true.
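
The comparison between free and linear schedules can be illustrated with a small sketch. It computes a free schedule by letting each computation start one step after its latest operand, then checks it against a linear schedule. The function name, the example index set, and the schedule vector (1, 1) are assumptions for illustration, and the recursion assumes the dependences are acyclic.

```python
from functools import lru_cache
from itertools import product

def free_schedule(index_set, deps):
    """Earliest start time of each computation, counting from step 0."""
    points = set(index_set)

    @lru_cache(maxsize=None)
    def time(j):
        # Predecessors are the points j - d that lie inside the index set.
        preds = [tuple(a - b for a, b in zip(j, d)) for d in deps]
        ready = [time(p) for p in preds if p in points]
        # With no predecessors the computation runs at step 0.
        return 1 + max(ready, default=-1)

    return {j: time(j) for j in points}

# Hypothetical example: 4 x 4 index set with two uniform dependence vectors.
index_set = list(product(range(4), range(4)))
free = free_schedule(index_set, deps=[(1, 0), (0, 1)])

# Linear schedule pi = (1, 1): computation (i, j) runs at step i + j.
linear = {j: j[0] + j[1] for j in index_set}
```

In this example the two schedules coincide at every index point, a small instance of the case where a linear schedule is as efficient as the free schedule.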

Field-Programmable Custom Computing Machines | 2010

ShapeUp: A High-Level Design Approach to Simplify Module Interconnection on FPGAs

Christopher E. Neely; Gordon J. Brebner; Weijia Shang

The latest generation of FPGA devices offers huge resource counts that provide the headroom to implement large-scale and complex systems. However, this poses increasing challenges for the designer, not just because of pure size and complexity, but also to harness effectively the flexibility and programmability of the FPGA. A central issue is the need to integrate modules (IP blocks) from diverse sources to promote modular design and reuse. In this paper, we introduce ShapeUp: a high-level approach for designing systems by interconnecting modules, which gives a ‘plug and play’ look and feel to the designer and is supported by tools that carry out implementation and verification functions. The emphasis is on the inter-module connections and abstracting the communication patterns that are typical between modules – for example, the streaming of data that is common in many FPGA based DSP or networking systems, or the reading and writing of data to and from memory modules. The details of wiring and signaling are hidden from view, via metadata associated with individual modules. The ShapeUp tool suite includes an implementation capability that automatically generates wiring between blocks, possibly including additional bridging blocks, and a simulation capability that allows multi-level verification of systems of interconnected modules. The methodology and tools have been validated on Xilinx customer design projects.
