Network


Latest external collaborations at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Hong Diep Nguyen is active.

Publication


Featured research published by Hong Diep Nguyen.


IEEE International Conference on High Performance Computing, Data and Analytics | 2013

Precimonious: tuning assistant for floating-point precision

Cindy Rubio-González; Cuong Nguyen; Hong Diep Nguyen; James Demmel; William Kahan; Koushik Sen; David H. Bailey; Costin Iancu; David Hough

Given the variety of numerical errors that can occur, floating-point programs are difficult to write, test and debug. One common practice employed by developers without an advanced background in numerical analysis is using the highest available precision. While more robust, this can degrade program performance significantly. In this paper we present Precimonious, a dynamic program analysis tool to assist developers in tuning the precision of floating-point programs. Precimonious performs a search on the types of the floating-point program variables trying to lower their precision subject to accuracy constraints and performance goals. Our tool recommends a type instantiation that uses lower precision while producing an accurate enough answer without causing exceptions. We evaluate Precimonious on several widely used functions from the GNU Scientific Library, two NAS Parallel Benchmarks, and three other numerical programs. For most of the programs analyzed, Precimonious reduces precision, which results in performance improvements as high as 41%.
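The search idea behind such a tuner can be sketched in a few lines. The following is a minimal illustration of the approach, not the Precimonious implementation: a hypothetical `run`/`tune` pair greedily demotes the variables of a toy dot-product kernel from double to (simulated) single precision, keeping a demotion only if the result stays within an accuracy threshold relative to the all-double reference.

```python
import struct

def as_single(x):
    """Round a binary64 value to binary32 and back (simulates a float variable)."""
    return struct.unpack('f', struct.pack('f', x))[0]

def run(precision):
    """Toy kernel: dot product of two small vectors.
    `precision` maps a variable name to 'double' or 'single'."""
    rnd = {'double': lambda x: x, 'single': as_single}
    xs = [0.1 * i for i in range(100)]
    ys = [0.2 * i for i in range(100)]
    acc = 0.0
    for x, y in zip(xs, ys):
        p = rnd[precision['prod']](x * y)      # precision of the product variable
        acc = rnd[precision['acc']](acc + p)   # precision of the accumulator
    return acc

def tune(variables, tol):
    """Greedily lower each variable to single precision, keeping the change
    only if the result stays within `tol` of the all-double reference."""
    config = {v: 'double' for v in variables}
    ref = run(config)
    for v in variables:
        trial = dict(config, **{v: 'single'})
        if abs(run(trial) - ref) <= tol * abs(ref):
            config = trial
    return config

config = tune(['prod', 'acc'], tol=1e-6)
```

The real tool searches over the variables of an actual program (via instrumentation) and uses a delta-debugging-style search rather than this one-pass greedy loop, but the accept/reject structure under an accuracy constraint is the same.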


Symposium on Computer Arithmetic | 2013

Fast Reproducible Floating-Point Summation

James Demmel; Hong Diep Nguyen

Reproducibility, i.e. getting bitwise identical floating-point results from multiple runs of the same program, is a property that many users depend on either for debugging or correctness checking in many codes [1]. However, the combination of dynamic scheduling of parallel computing resources and floating-point nonassociativity makes attaining reproducibility a challenge even for simple reduction operations like computing the sum of a vector of numbers in parallel. We propose a technique for floating-point summation that is reproducible independent of the order of summation. Our technique uses Rump's algorithm for error-free vector transformation [2], and is much more efficient than using (possibly very) high precision arithmetic. Our algorithm trades off efficiency and accuracy: we reproducibly attain reasonably accurate results (with an absolute error bound c · n^2 · macheps · max_i |v_i| for a small constant c) with just 2n + O(1) floating-point operations, and quite accurate results (with an absolute error bound c · n^3 · macheps^2 · max_i |v_i|) with 5n + O(1) floating-point operations, both with just two reduction operations. Higher accuracies are also possible by increasing the number of error-free transformations. As long as the same rounding mode is used, results computed by the proposed algorithms are reproducible for any run on any platform.
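A stripped-down sketch of the splitting idea follows (one extraction level only, round-to-nearest assumed; the function name is illustrative, not the paper's). Each summand's high part is snapped to a common power-of-two grid, so the high parts add without any rounding error, in any order:

```python
import math, random

def reproducible_sum(v):
    """Order-independent floating-point sum via one error-free splitting level."""
    n = len(v)
    mx = max(abs(x) for x in v)    # first reduction: max |v_i|
    if mx == 0.0:
        return 0.0
    # Pick M = 2^k with 2^k >= 2*n*mx: the high parts are multiples of
    # ulp(M) and every partial sum stays exactly representable.
    k = math.ceil(math.log2(n * mx)) + 1
    M = 2.0 ** k
    total = 0.0
    for x in v:
        q = (M + x) - M            # high part of x (error-free extraction)
        total += q                 # exact: all q lie on the ulp(M) grid
    return total                   # second reduction: the sum itself

# Bitwise identical result regardless of summation order:
random.seed(42)
v = [random.uniform(-1.0, 1.0) for _ in range(1000)]
w = v[:]
random.shuffle(w)
```

The discarded low parts bound the error by roughly n · ulp(M)/2; the paper's algorithms instead feed them into further extraction levels to reach the stated accuracies while keeping reproducibility.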


IEEE Transactions on Computers | 2015

Parallel Reproducible Summation

James Demmel; Hong Diep Nguyen

Reproducibility, i.e. getting bitwise identical floating-point results from multiple runs of the same program, is a property that many users depend on either for debugging or correctness checking in many codes [10]. However, the combination of dynamic scheduling of parallel computing resources and floating-point nonassociativity makes attaining reproducibility a challenge even for simple reduction operations like computing the sum of a vector of numbers in parallel. We propose a technique for floating-point summation that is reproducible independent of the order of summation. Our technique uses Rump's algorithm for error-free vector transformation [7], and is much more efficient than using (possibly very) high precision arithmetic. Our algorithm reproducibly computes highly accurate results with an absolute error bound of n · 2^-28 · macheps · max_i |v_i| at a cost of 7n FLOPs and a small constant amount of extra memory usage. Higher accuracies are also possible by increasing the number of error-free transformations. As long as all operations are performed in round-to-nearest mode, results computed by the proposed algorithms are reproducible for any run on any platform. In particular, our algorithm requires the minimum number of reductions, i.e. one reduction of an array of six double-precision floating-point numbers per sum, and hence is well suited for massively parallel environments.


Field-Programmable Logic and Applications | 2010

Pipelined FPGA Adders

Florent de Dinechin; Hong Diep Nguyen; Bogdan Pasca

Integer addition is a universal building block, and applications such as quad-precision floating-point or elliptic curve cryptography now demand precisions well beyond 64 bits. This study explores the trade-offs between size, latency and frequency for pipelined large-precision adders on FPGA. It compares three pipelined adder architectures: the classical pipelined ripple-carry adder, a variation that reduces register count, and an FPGA-specific implementation of the carry-select adder capable of providing lower latency additions at a comparable price. For each of these architectures, resource estimation models are defined, and used in an adder generator that selects the best architecture considering the target FPGA, the target operating frequency, and the addition bit width.
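The carry-select scheme the study compares against can be illustrated in software. This is a behavioral sketch of the arithmetic, not the FPGA implementation: each block's sum is computed for both possible carry-ins (in hardware, speculatively and in parallel), and the incoming carry merely selects one of the two precomputed results.

```python
def ripple_block(a, b, cin, width):
    """Add two `width`-bit integers with a carry-in; return (sum, carry-out)."""
    s = a + b + cin
    return s & ((1 << width) - 1), s >> width

def carry_select_add(a, b, nbits, block=16):
    """Carry-select addition over `block`-bit chunks: both candidate sums of
    each chunk are formed, then the real carry picks one, so the critical
    path is one block addition plus a chain of selections."""
    mask = (1 << block) - 1
    result, carry, shift = 0, 0, 0
    while shift < nbits:
        a_blk, b_blk = (a >> shift) & mask, (b >> shift) & mask
        s0, c0 = ripple_block(a_blk, b_blk, 0, block)  # assume carry-in = 0
        s1, c1 = ripple_block(a_blk, b_blk, 1, block)  # assume carry-in = 1
        s, carry = (s1, c1) if carry else (s0, c0)     # select with real carry
        result |= s << shift
        shift += block
    return result, carry
```

On an FPGA the trade-off the paper models is different from this software view: the ripple within a block rides the dedicated carry chain, while the selection multiplexers and duplicated blocks cost extra LUTs and registers.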


Symposium on Computer Arithmetic | 2013

Numerical Reproducibility and Accuracy at ExaScale

James Demmel; Hong Diep Nguyen

Given current hardware trends, ExaScale computing (10^18 floating-point operations per second) is projected to be available in less than a decade, achieved by using a huge number of processors, of order 10^9. Given the likely hardware heterogeneity in both platform and network, and the possibility of intermittent failures, dynamic scheduling will be needed to adapt to changing resources and loads. This will make it likely that repeated runs of a program will not execute operations like reductions in exactly the same order. This in turn will make reproducibility, i.e. getting bitwise identical results from run to run, difficult to achieve, because floating-point operations like addition are not associative, so computing sums in different orders often leads to different results. Indeed, this is already a challenge on today's platforms.
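The non-associativity at the heart of the problem fits in three lines of Python:

```python
a, b, c = 1.0, 1e16, -1e16

left = (a + b) + c    # 1e16 + 1.0 rounds back to 1e16, so the 1.0 is lost
right = a + (b + c)   # b and c cancel exactly first, so the 1.0 survives

# Both are correctly rounded evaluations of the same mathematical sum,
# yet left == 0.0 while right == 1.0.
```

When a dynamic scheduler changes the reduction tree from run to run, it is effectively choosing between orderings like these, which is why bitwise reproducibility does not come for free.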


Field-Programmable Logic and Applications | 2011

FPGA-Specific Arithmetic Optimizations of Short-Latency Adders

Hong Diep Nguyen; Bogdan Pasca; Thomas B. Preußer

Integer addition is a pervasive operation in FPGA designs. The need for fast wide adders grows with the demand for large precisions as, for example, required for the implementation of IEEE-754 quadruple precision and elliptic-curve cryptography. The FPGA realization of fast and compact binary adders relies on hardware carry chains. These provide a natural implementation environment for the ripple-carry addition (RCA) scheme. As its latency grows linearly with operand width, wide additions call for acceleration, which is quite reasonably achieved by addition schemes built from parallel RCA blocks. This study presents FPGA-specific arithmetic optimizations for the mapping of carry-select and carry-increment adders targeting the hardware carry chains of modern FPGAs. Different trade-offs between latency and area are explored. The proposed architectures can be successfully used in the context of latency-critical systems or as attractive alternatives to deeply pipelined RCA schemes.


International Parallel and Distributed Processing Symposium | 2014

Reconstructing Householder Vectors from Tall-Skinny QR

Grey Ballard; James Demmel; Laura Grigori; Mathias Jacquelin; Hong Diep Nguyen; Edgar Solomonik

The Tall-Skinny QR (TSQR) algorithm is more communication efficient than the standard Householder algorithm for QR decomposition of matrices with many more rows than columns. However, TSQR produces a different representation of the orthogonal factor and therefore requires more software development to support the new representation. Further, implicitly applying the orthogonal factor to the trailing matrix in the context of factoring a square matrix is more complicated and costly than with the Householder representation. We show how to perform TSQR and then reconstruct the Householder vector representation with the same asymptotic communication efficiency and little extra computational cost. We demonstrate the high performance and numerical stability of this algorithm both theoretically and empirically. The new Householder reconstruction algorithm allows us to design more efficient parallel QR algorithms, with significantly lower latency cost compared with Householder QR and lower bandwidth and latency costs compared with the Communication-Avoiding QR (CAQR) algorithm. As a result, our final parallel QR algorithm outperforms ScaLAPACK and Elemental implementations of Householder QR and our implementation of CAQR on the Hopper Cray XE6 NERSC system. We also provide algorithmic improvements to the ScaLAPACK and CAQR algorithms.


Journal of Parallel and Distributed Computing | 2015

Reconstructing Householder vectors from Tall-Skinny QR

Grey Ballard; James Demmel; Laura Grigori; Mathias Jacquelin; Nicholas Knight; Hong Diep Nguyen

The Tall-Skinny QR (TSQR) algorithm is more communication efficient than the standard Householder algorithm for QR decomposition of matrices with many more rows than columns. However, TSQR produces a different representation of the orthogonal factor and therefore requires more software development to support the new representation. Further, implicitly applying the orthogonal factor to the trailing matrix in the context of factoring a square matrix is more complicated and costly than with the Householder representation. We show how to perform TSQR and then reconstruct the Householder vector representation with the same asymptotic communication efficiency and little extra computational cost. We demonstrate the high performance and numerical stability of this algorithm both theoretically and empirically. The new Householder reconstruction algorithm allows us to design more efficient parallel QR algorithms, with significantly lower latency cost compared with Householder QR and lower bandwidth and latency costs compared with the Communication-Avoiding QR (CAQR) algorithm. As a result, our final parallel QR algorithm outperforms ScaLAPACK and Elemental implementations of Householder QR and our implementation of CAQR on the Hopper Cray XE6 NERSC system. We also provide algorithmic improvements to the ScaLAPACK and CAQR algorithms.


International Journal of Reliability and Safety | 2009

Extended precision with a rounding mode toward zero environment. Application to the Cell processor

Hong Diep Nguyen; Stef Graillat; Jean Luc Lamotte

In scientific computing, the accuracy of a calculation is of prime importance, which motivates efforts to increase the precision of floating-point algorithms. One approach is to extend the working precision to double or quadruple the native precision. The building block of these efforts is the Error-Free Transformation (EFT). In this paper, we develop EFT operations in truncation (round-toward-zero) mode, optimised for the Cell processor. They have been implemented and used in a double-precision library built using only single-precision numbers. We compare the performance of our library on vector operations with that of the native double-precision implementation. In the best case, the performance of our library is very close to that of the standard double-precision implementation. The work could easily be extended to obtain quadruple precision.
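The standard round-to-nearest EFT for addition (Knuth's TwoSum) illustrates the building block. Note this is the textbook version, not the round-toward-zero variants developed in the paper, and `ds_add` is an illustrative name for accumulating into an extended-precision pair:

```python
def two_sum(a, b):
    """Knuth's branch-free error-free transformation:
    returns (s, e) with s = fl(a + b) and a + b == s + e exactly
    (valid in round-to-nearest mode)."""
    s = a + b
    bb = s - a                      # what b became inside the addition
    e = (a - (s - bb)) + (b - bb)   # the rounding error, recovered exactly
    return s, e

def ds_add(hi, lo, x):
    """Accumulate x into the unevaluated sum hi + lo (double-word style)."""
    s, e = two_sum(hi, x)
    return s, lo + e

hi, lo = 0.0, 0.0
for _ in range(10):
    hi, lo = ds_add(hi, lo, 0.1)
# hi + lo recovers 1.0, while the naive running sum of ten 0.1's does not
```

Pairing two single-precision numbers this way, with EFTs adapted to the Cell's truncation rounding, is what lets the paper emulate double precision out of single-precision hardware.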


Symposium on Computer Arithmetic | 2015

Reproducible Tall-Skinny QR

Hong Diep Nguyen; James Demmel

Reproducibility is the ability to obtain bitwise identical results from different runs of the same program on the same input data, regardless of the available computing resources, or how they are scheduled. Recently, techniques have been proposed to attain reproducibility for BLAS operations, all of which rely on reproducibly computing the floating-point sum and dot product. Nonetheless, a reproducible BLAS library does not automatically translate into a reproducible higher-level linear algebra library, especially when communication is optimized. For instance, for the QR factorization, conventional algorithms such as the Householder transformation or the Gram-Schmidt process can be used to reproducibly factorize a floating-point matrix by fixing the high-level order of computation, for example column-by-column from left to right, and by using reproducible versions of level-1 BLAS operations such as dot product and 2-norm. In a massively parallel environment, those algorithms have high communication cost due to the need for synchronization after each step. The Tall-Skinny QR algorithm obtains much better performance in massively parallel environments by reducing the number of messages by a factor of n to O(log(P)), where P is the processor count, and by reducing the number of reduction operations to O(1). Those reduction operations, however, are highly dependent on the network topology, in particular the number of computing nodes, and are therefore difficult to implement reproducibly and with reasonable performance. In this paper we present a new technique to reproducibly compute a QR factorization for a tall skinny matrix, which is based on the Cholesky QR algorithm to attain reproducibility as well as to improve communication cost, and on iterative refinement to guarantee the accuracy of the computed results. Our technique exhibits strong scalability in massively parallel environments, and at the same time can provide results of almost the same accuracy as the conventional Householder QR algorithm unless the matrix is extremely badly conditioned, in which case a warning can be given. Initial experimental results in Matlab show that for not too ill-conditioned matrices, whose condition number is smaller than sqrt(1/e) where e is the machine epsilon, our technique runs less than 4 times slower than the built-in Matlab qr() function, and always computes numerically stable results in terms of column-wise relative error.
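The CholeskyQR kernel at the core of the approach is easy to sketch. The illustration below (plain Python, hypothetical function name) shows only the single-reduction structure; it omits the reproducible summation used for the Gram-matrix reduction and the iterative refinement step that the paper adds to recover accuracy:

```python
import math

def cholesky_qr(A):
    """QR of a tall-skinny matrix (list of rows) via the Gram matrix:
    G = A^T A needs only ONE global reduction; then R = chol(G), Q = A R^{-1}."""
    m, n = len(A), len(A[0])
    # Gram matrix: the single communication-heavy reduction
    G = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
         for i in range(n)]
    # Cholesky factorization G = R^T R, with R upper triangular
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = G[i][j] - sum(R[k][i] * R[k][j] for k in range(i))
            R[i][j] = math.sqrt(s) if i == j else s / R[i][i]
    # Q = A R^{-1}, computed row by row (embarrassingly parallel, no messages)
    Q = []
    for row in A:
        q = [0.0] * n
        for j in range(n):
            q[j] = (row[j] - sum(q[i] * R[i][j] for i in range(j))) / R[j][j]
        Q.append(q)
    return Q, R
```

Because the only reduction is over the small n-by-n Gram matrix, making that one sum reproducible makes the whole factorization reproducible; the price is that forming A^T A squares the condition number, which is exactly what the paper's iterative refinement and conditioning warning address.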

Collaboration


Dive into Hong Diep Nguyen's collaboration.

Top Co-Authors

James Demmel, University of California
Nathalie Revol, École normale supérieure de Lyon
Bogdan Pasca, École normale supérieure de Lyon
Grey Ballard, Sandia National Laboratories
Mathias Jacquelin, Lawrence Berkeley National Laboratory
Florent de Dinechin, École normale supérieure de Lyon
Thomas B. Preußer, Dresden University of Technology
Cindy Rubio-González, University of Wisconsin-Madison