Donald Nguyen
University of Texas at Austin
Publications
Featured research published by Donald Nguyen.
Symposium on Operating Systems Principles | 2013
Donald Nguyen; Andrew Lenharth; Keshav Pingali
Several domain-specific languages (DSLs) for parallel graph analytics have been proposed recently. In this paper, we argue that existing DSLs can be implemented on top of a general-purpose infrastructure that (i) supports very fine-grain tasks, (ii) implements autonomous, speculative execution of these tasks, and (iii) allows application-specific control of task scheduling policies. To support this claim, we describe such an implementation called the Galois system. We demonstrate the capabilities of this infrastructure in three ways. First, we implement more sophisticated algorithms for some of the graph analytics problems tackled by previous DSLs and show that end-to-end performance can be improved by orders of magnitude even on power-law graphs, thanks to the better algorithms facilitated by a more general programming model. Second, we show that, even when an algorithm can be expressed in existing DSLs, the implementation of that algorithm in the more general system can be orders of magnitude faster when the input graphs are road networks and similar graphs with high diameter, thanks to more sophisticated scheduling. Third, we implement the APIs of three existing graph DSLs on top of the common infrastructure in a few hundred lines of code and show that even for power-law graphs, the performance of the resulting implementations often exceeds that of the original DSL systems, thanks to the lightweight infrastructure.
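The claim about very fine-grain tasks with application-controlled scheduling can be made concrete with a small sketch. The C++ below is not the Galois API; Graph, FifoWorklist, LifoWorklist, and sssp are invented for illustration. It shows a shortest-paths operator whose worklist policy is a template parameter, so the same relaxation code runs under different schedules, which is the kind of knob behind the scheduling-driven performance differences the abstract describes.

```cpp
#include <deque>
#include <limits>
#include <utility>
#include <vector>

// A graph in adjacency-list form: edges[u] holds (neighbor, weight) pairs.
struct Graph {
  std::vector<std::vector<std::pair<int, int>>> edges;
};

// Two interchangeable scheduling policies for the worklist of pending tasks.
struct FifoWorklist {
  std::deque<int> q;
  void push(int x) { q.push_back(x); }
  bool empty() const { return q.empty(); }
  int pop() { int x = q.front(); q.pop_front(); return x; }
};

struct LifoWorklist {
  std::vector<int> q;
  void push(int x) { q.push_back(x); }
  bool empty() const { return q.empty(); }
  int pop() { int x = q.back(); q.pop_back(); return x; }
};

// Single-source shortest paths written as a fine-grain task loop: each task
// relaxes the out-edges of one node and may create new tasks. The scheduling
// policy is a plug-in; the operator itself does not change.
template <typename Worklist>
std::vector<long long> sssp(const Graph& g, int source) {
  std::vector<long long> dist(g.edges.size(), std::numeric_limits<long long>::max());
  dist[source] = 0;
  Worklist wl;
  wl.push(source);
  while (!wl.empty()) {
    int u = wl.pop();                    // the policy decides which task runs next
    for (const auto& e : g.edges[u]) {   // operator: relax out-edges of u
      int v = e.first, w = e.second;
      if (dist[u] + w < dist[v]) {
        dist[v] = dist[u] + w;
        wl.push(v);                      // newly created task
      }
    }
  }
  return dist;
}
```

Calling sssp&lt;FifoWorklist&gt;(g, 0) versus sssp&lt;LifoWorklist&gt;(g, 0) changes only the schedule; plugging in an approximately priority-ordered worklist changes work efficiency further, which is the kind of control the infrastructure exposes.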
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2010
Mario Méndez-Lojo; Donald Nguyen; Dimitrios Prountzos; Xin Sui; M. Amber Hassaan; Milind Kulkarni; Martin Burtscher; Keshav Pingali
Irregular algorithms are organized around pointer-based data structures such as graphs and trees, and they are ubiquitous in applications. Recent work by the Galois project has provided a systematic approach for parallelizing irregular applications based on the idea of optimistic or speculative execution of programs. However, the overhead of optimistic parallel execution can be substantial. In this paper, we show that many irregular algorithms have structure that can be exploited and present three key optimizations that take advantage of algorithmic structure to reduce speculative overheads. We describe the implementation of these optimizations in the Galois system and present experimental results to demonstrate their benefits. To the best of our knowledge, this is the first system to exploit algorithmic structure to optimize the execution of irregular programs.
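One structural property this line of work exploits is the cautious operator: an operator that reads its entire neighborhood before writing any of it, so speculative execution needs no undo log. The sketch below is a minimal illustration of that pattern under the assumption of simple per-node locks; Node, tryLockAll, and runCautiousTask are invented names, not the Galois implementation.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// A graph node with a simple per-node lock.
struct Node {
  std::atomic<bool> locked{false};
  int value = 0;
};

// Try to lock the whole neighborhood. On a conflict, release what is held and
// abort: nothing has been written yet, so no undo log or rollback is needed.
bool tryLockAll(const std::vector<Node*>& neighborhood) {
  for (std::size_t i = 0; i < neighborhood.size(); ++i) {
    if (neighborhood[i]->locked.exchange(true, std::memory_order_acquire)) {
      for (std::size_t j = 0; j < i; ++j)
        neighborhood[j]->locked.store(false, std::memory_order_release);
      return false;
    }
  }
  return true;
}

void unlockAll(const std::vector<Node*>& neighborhood) {
  for (Node* n : neighborhood)
    n->locked.store(false, std::memory_order_release);
}

// A cautious task: the read phase (here, just gathering locks) completes
// before the write phase touches any shared state.
bool runCautiousTask(const std::vector<Node*>& neighborhood) {
  if (!tryLockAll(neighborhood)) return false;  // abort cheaply; retry later
  for (Node* n : neighborhood) n->value += 1;   // write phase: no rollback needed
  unlockAll(neighborhood);
  return true;
}
```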
IEEE International Conference on High Performance Computing, Data, and Analytics | 2009
Shih-Wei Liao; Tzu Han Hung; Donald Nguyen; Chin-Yen Chou; Chia Heng Tu; Hucheng Zhou
Performance tuning for data centers is essential and complicated. It is important since a data center comprises thousands of machines and thus a single-digit performance improvement can significantly reduce cost and power consumption. Unfortunately, it is extremely difficult as data centers are dynamic environments where applications are frequently released and servers are continually upgraded. In this paper, we study the effectiveness of different processor prefetch configurations, which can greatly influence the performance of the memory system and the overall data center. We observe a wide performance gap when comparing the worst and best configurations, from 1.4% to 75.1%, for 11 important data center applications. We then develop a tuning framework which attempts to predict the optimal configuration based on hardware performance counters. The framework achieves performance within 1% of the best performance of any single configuration for the same set of applications.
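As a rough illustration of the framework's idea of mapping counter readings to a prefetch configuration, here is a toy nearest-centroid predictor. The counter set, the configurations, and the model itself are assumptions made for the sketch; the paper's framework uses its own features and prediction method.

```cpp
#include <array>
#include <cstdio>
#include <vector>

// One sample: normalized counter readings, e.g. {LLC miss rate, DTLB miss rate, IPC}.
using Counters = std::array<double, 3>;

struct Config {
  const char* name;   // e.g. which hardware prefetchers are enabled
  Counters centroid;  // mean counter profile of training apps that preferred it
};

// Pick the configuration whose training centroid is closest to the observed counters.
const Config& predict(const Counters& observed, const std::vector<Config>& configs) {
  const Config* best = &configs.front();
  double bestDist = 1e300;
  for (const Config& c : configs) {
    double d = 0;
    for (std::size_t i = 0; i < observed.size(); ++i)
      d += (observed[i] - c.centroid[i]) * (observed[i] - c.centroid[i]);
    if (d < bestDist) { bestDist = d; best = &c; }
  }
  return *best;
}

int main() {
  std::vector<Config> configs = {
      {"all prefetchers on", {0.05, 0.01, 1.8}},
      {"adjacent-line only", {0.20, 0.02, 1.1}},
      {"all prefetchers off", {0.45, 0.05, 0.6}},
  };
  Counters observed = {0.22, 0.02, 1.0};  // measured on the running application
  std::printf("predicted: %s\n", predict(observed, configs).name);
}
```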
Programming Language Design and Implementation | 2011
Milind Kulkarni; Donald Nguyen; Dimitrios Prountzos; Xin Sui; Keshav Pingali
Speculative execution is a promising approach for exploiting parallelism in many programs, but it requires efficient schemes for detecting conflicts between concurrently executing threads. Prior work has argued that checking semantic commutativity of method invocations is the right way to detect conflicts for complex data structures such as kd-trees. Several ad hoc ways of checking commutativity have been proposed in the literature, but there is no systematic approach for producing implementations. In this paper, we describe a novel framework for reasoning about commutativity conditions: the commutativity lattice. We show how commutativity specifications from this lattice can be systematically implemented in one of three different schemes: abstract locking, forward gatekeeping and general gatekeeping. We also discuss a disciplined approach to exploiting the lattice to find different implementations that trade off precision in conflict detection for performance. Finally, we show that our novel conflict detection schemes are practical and can deliver speedup on three real-world applications.
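A concrete instance of semantic commutativity helps ground the framework: for a set, operations on different elements commute, and on the same element only reads commute. The sketch below uses invented names and is a deliberately coarse point in the paper's lattice of conditions; it shows the commutativity condition and a gatekeeping-style conflict check. Abstract locking would enforce roughly the same condition by locking on the element value instead of scanning a log of in-flight invocations.

```cpp
#include <vector>

enum class Op { Add, Remove, Contains };

struct Invocation {
  Op op;
  int key;  // the element the operation is applied to
};

// A simple commutativity condition for a set ADT: invocations on different
// elements always commute; on the same element, only two contains() calls do.
// (add(x)/add(x) commute on state but not on return values, so we say no.)
bool commutes(const Invocation& a, const Invocation& b) {
  if (a.key != b.key) return true;
  return a.op == Op::Contains && b.op == Op::Contains;
}

// Gatekeeping-style conflict check: a new invocation may proceed only if it
// commutes with every invocation of other in-flight speculative tasks.
bool conflictsWithRunning(const Invocation& inv,
                          const std::vector<Invocation>& running) {
  for (const Invocation& r : running)
    if (!commutes(inv, r)) return true;
  return false;
}
```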
Architectural Support for Programming Languages and Operating Systems | 2011
Donald Nguyen; Keshav Pingali
Scheduling is the assignment of tasks or activities to processors for execution, and it is an important concern in parallel programming. Most prior work on scheduling has focused either on static scheduling of applications in which the dependence graph is known at compile-time or on dynamic scheduling of independent loop iterations such as in OpenMP. In irregular algorithms, dependences between activities are complex functions of runtime values so these algorithms are not amenable to compile-time analysis nor do they consist of independent activities. Moreover, the amount of work can vary dramatically with the scheduling policy. To handle these complexities, implementations of irregular algorithms employ carefully handcrafted, algorithm-specific schedulers but these schedulers are themselves parallel programs, complicating the parallel programming problem further. In this paper, we present a flexible and efficient approach for specifying and synthesizing scheduling policies for irregular algorithms. We develop a simple compositional specification language and show how it can concisely encode scheduling policies in the literature. Then, we show how to synthesize efficient parallel schedulers from these specifications. We evaluate our approach for five irregular algorithms on three multicore architectures and show that (1) the performance of some algorithms can improve by orders of magnitude with the right scheduling policy, and (2) for the same policy, the overheads of our synthesized schedulers are comparable to those of fixed-function schedulers.
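To give a flavor of compositional scheduling specifications, the sketch below composes an "ordered by metric" policy with an inner tie-breaking policy. The combinator names mimic the paper's style but are invented here, and the code is sequential; the actual system synthesizes concurrent schedulers from such specifications.

```cpp
#include <deque>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Inner (tie-breaking) policies.
struct FIFO {
  std::deque<int> q;
  void push(int t) { q.push_back(t); }
  bool empty() const { return q.empty(); }
  int pop() { int t = q.front(); q.pop_front(); return t; }
};

struct LIFO {
  std::vector<int> q;
  void push(int t) { q.push_back(t); }
  bool empty() const { return q.empty(); }
  int pop() { int t = q.back(); q.pop_back(); return t; }
};

// OrderedByMetric<Inner>: tasks with a lower metric value are scheduled first;
// tasks with equal metric are ordered by the inner policy.
template <typename Inner>
struct OrderedByMetric {
  std::function<int(int)> metric;
  std::map<int, Inner> buckets;
  explicit OrderedByMetric(std::function<int(int)> m) : metric(std::move(m)) {}
  void push(int t) { buckets[metric(t)].push(t); }
  bool empty() const { return buckets.empty(); }
  int pop() {
    auto it = buckets.begin();
    int t = it->second.pop();
    if (it->second.empty()) buckets.erase(it);
    return t;
  }
};

// Example composition: order tasks by priority / 8, break ties FIFO.
// OrderedByMetric<FIFO> sched([](int task) { return task / 8; });
```

A delta-stepping-like schedule for shortest paths, for instance, is roughly "ordered by dist/Δ, ties in FIFO order", which is the kind of composition such a specification language expresses.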
European Conference on Parallel Processing | 2015
Andrew Lenharth; Donald Nguyen; Keshav Pingali
The need for priority scheduling arises in many algorithms. In these algorithms, there is a dynamic pool of lightweight, unordered tasks, and some execution orders are more efficient than others. Therefore, each task is given an application-specific priority that is a heuristic measure of its importance for early scheduling, and the runtime system schedules these tasks roughly in this order. Concurrent priority queues are not suitable for this purpose. We show that by exploiting the fact that algorithms amenable to priority scheduling are often robust to small deviations from a strict priority order, and by optimizing the scheduler for the cache hierarchy of current multicore and NUMA processors, we can implement concurrent priority schedulers that improve the end-to-end performance of complex irregular benchmarks by orders of magnitude compared to using state-of-the-art concurrent priority queues.
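The key relaxation can be sketched in a few lines: coarsen priorities into buckets, keep strict order only across buckets, and let tasks within a bucket run in any order. The sequential skeleton below is only an illustration of that trade-off; the paper's schedulers add per-thread bucket state and cache- and NUMA-aware distribution of work.

```cpp
#include <map>
#include <vector>

// Priorities are coarsened (priority >> kShift); strict order is kept only
// across buckets, and tasks inside a bucket run in arbitrary order, trading a
// small amount of priority inversion for much cheaper synchronization.
class RelaxedPriorityWorklist {
  static constexpr int kShift = 4;          // 16 priority levels per bucket
  std::map<int, std::vector<int>> buckets;  // coarse priority -> tasks

 public:
  void push(int task, int priority) { buckets[priority >> kShift].push_back(task); }
  bool empty() const { return buckets.empty(); }
  int pop() {
    auto it = buckets.begin();    // best (lowest) coarse priority
    int task = it->second.back(); // any task in that bucket is acceptable
    it->second.pop_back();
    if (it->second.empty()) buckets.erase(it);
    return task;
  }
};
```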
Architectural Support for Programming Languages and Operating Systems | 2014
Donald Nguyen; Andrew Lenharth; Keshav Pingali
Non-determinism in program execution can make program development and debugging difficult. In this paper, we argue that solutions to this problem should be on-demand, portable and parameterless. On-demand means that the programming model should permit the writing of non-deterministic programs since these programs often perform better than deterministic ones for the same problem. Portable means that the program should produce the same answer even if it is run on different machines. Parameterless means that if there are machine-dependent scheduling parameters that must be tuned for good performance, they must not affect the output. Although many solutions for deterministic program execution have been proposed in the literature, they fall short along one or more of these dimensions. To remedy this, we propose a new approach, based on the Galois programming model, in which (i) the programming model permits the writing of non-deterministic programs and (ii) the runtime system executes these programs deterministically if needed. Evaluation of this approach on a collection of benchmarks from the PARSEC, PBBS, and Lonestar suites shows that it delivers deterministic execution with substantially less overhead than other systems in the literature.
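One standard way to make an unordered, speculative model deterministic on demand is round-based execution with a machine-independent conflict-resolution rule. The sketch below is an illustrative skeleton, not the Galois runtime: tasks claim the items they touch, every clash goes to the lowest task id, and the losers are deferred, so the committed schedule depends only on task ids rather than on thread timing.

```cpp
#include <unordered_map>
#include <vector>

struct Task {
  int id;                         // deterministic identity (e.g., creation order)
  std::vector<int> neighborhood;  // ids of the data items the task will touch
};

// Run one round: returns the tasks that may execute conflict-free this round;
// the rest are appended to `deferred` and retried in the next round.
std::vector<Task> runRound(const std::vector<Task>& tasks, std::vector<Task>& deferred) {
  std::unordered_map<int, int> owner;  // data item -> lowest claiming task id
  for (const Task& t : tasks)
    for (int item : t.neighborhood) {
      auto it = owner.find(item);
      if (it == owner.end() || t.id < it->second) owner[item] = t.id;
    }
  std::vector<Task> winners;
  for (const Task& t : tasks) {
    bool ownsAll = true;
    for (int item : t.neighborhood)
      if (owner[item] != t.id) { ownsAll = false; break; }
    (ownsAll ? winners : deferred).push_back(t);
  }
  return winners;  // winners can run in parallel; the outcome is schedule-independent
}
```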
Languages and Compilers for Parallel Computing | 2010
Xin Sui; Donald Nguyen; Martin Burtscher; Keshav Pingali
Graph partitioning is a common and frequent preprocessing step in many high-performance parallel applications on distributed- and shared-memory architectures. It is used to distribute graphs across memory and to improve spatial locality. There are several parallel implementations of graph partitioning for distributed-memory architectures. In this paper, we present a parallel graph partitioner that implements a variation of the Metis partitioner for shared-memory, multicore architectures. We show that (1) the parallelism in this algorithm is an instance of the general amorphous data-parallelism pattern, and (2) a parallel implementation can be derived systematically from a sequential specification of the algorithm. The resulting program can be executed in parallel using the Galois system for optimistic parallelization. The scalability of this parallel implementation compares favorably with that of a publicly available, hand-parallelized C implementation of the algorithm, ParMetis, but absolute performance is lower because of missing sequential optimizations in our system. On a set of 15 large, publicly available graphs, we achieve an average scalability of 2.98X on 8 cores with our implementation, compared with 1.77X for ParMetis, and we achieve an average speedup of 2.80X over Metis, compared with 3.60X for ParMetis. These results show that our systematic approach for parallelizing irregular algorithms on multicore architectures is promising.
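The amorphous data-parallelism in this algorithm is easiest to see in the coarsening phase, where "match one node" is a fine-grain task whose neighborhood is the node plus its neighbors. The sequential sketch below shows a simplified (unweighted, random) matching step; edge weights, contraction, and the refinement phases are elided.

```cpp
#include <vector>

// adjacency[u] = neighbors of u; returns match[u] = partner of u, or -1 if
// u remains unmatched. Matched pairs are contracted into coarse nodes by the
// next phase of the multilevel partitioner (omitted here).
std::vector<int> greedyMatching(const std::vector<std::vector<int>>& adjacency) {
  int n = static_cast<int>(adjacency.size());
  std::vector<int> match(n, -1);
  for (int u = 0; u < n; ++u) {
    if (match[u] != -1) continue;
    for (int v : adjacency[u]) {
      if (v != u && match[v] == -1) {  // first unmatched neighbor (random matching)
        match[u] = v;
        match[v] = u;
        break;
      }
    }
  }
  return match;
}
```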
Communications of the ACM | 2016
Andrew Lenharth; Donald Nguyen; Keshav Pingali
Data-centric abstractions and execution strategies are needed to exploit parallelism in large-scale graph analytics.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2014
Konstantinos I. Karantasis; Andrew Lenharth; Donald Nguyen; María Jesús Garzarán; Keshav Pingali
Many sparse matrix computations can be sped up if the matrix is first reordered. Reordering was originally developed for direct methods but it has recently become popular for improving the cache locality of parallel iterative solvers, since reordering the matrix to reduce bandwidth and wavefront can improve the locality of reference of sparse matrix-vector multiplication (SpMV), the key kernel in iterative solvers. In this paper, we present the first parallel implementations of two widely used reordering algorithms: Reverse Cuthill-McKee (RCM) and Sloan. On 16 cores of the Stampede supercomputer, our parallel RCM is 5.56 times faster on average than a state-of-the-art sequential implementation of RCM in the HSL library. Sloan is significantly more constrained than RCM, but our parallel implementation achieves a speedup of 2.88X on average over sequential HSL-Sloan. Reordering the matrix using our parallel RCM and then performing 100 SpMV iterations is twice as fast as using HSL-RCM and then performing the SpMV iterations; it is also 1.5 times faster than performing the SpMV iterations without reordering the matrix.
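For reference, the sequential RCM ordering that the paper parallelizes is a breadth-first traversal that visits each node's unvisited neighbors in order of increasing degree and then reverses the resulting permutation. The sketch below covers a single connected component and omits pseudo-peripheral start-node selection and, of course, the parallel formulation.

```cpp
#include <algorithm>
#include <queue>
#include <vector>

// Sequential Reverse Cuthill-McKee ordering of the component containing `start`.
std::vector<int> rcmOrder(const std::vector<std::vector<int>>& adj, int start) {
  int n = static_cast<int>(adj.size());
  std::vector<int> order;
  std::vector<bool> visited(n, false);
  std::queue<int> q;
  q.push(start);
  visited[start] = true;
  while (!q.empty()) {
    int u = q.front();
    q.pop();
    order.push_back(u);
    std::vector<int> next;
    for (int v : adj[u])
      if (!visited[v]) { visited[v] = true; next.push_back(v); }
    std::sort(next.begin(), next.end(), [&](int a, int b) {
      return adj[a].size() < adj[b].size();  // lower-degree neighbors first
    });
    for (int v : next) q.push(v);
  }
  std::reverse(order.begin(), order.end());  // the "reverse" in RCM
  return order;
}
```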