Costin Iancu
Lawrence Berkeley National Laboratory
Publications
Featured research published by Costin Iancu.
international conference on supercomputing | 2003
Wei-Yu Chen; Dan Bonachea; Jason Duell; Parry Husbands; Costin Iancu; Katherine A. Yelick
Unified Parallel C (UPC) is a parallel language that uses a Single Program Multiple Data (SPMD) model of parallelism within a global address space. The global address space is used to simplify programming, especially for applications with irregular data structures that lead to fine-grained sharing between threads. Recent results have shown that the performance of UPC using a commercial compiler is comparable to that of MPI [7]. In this paper we describe a portable open source compiler for UPC. Our goal is to achieve similar performance while enabling easy porting of the compiler and runtime, and to provide a framework that allows for extensive optimizations. We identify some of the challenges in compiling UPC and use a combination of micro-benchmarks and application kernels to show that our compiler has low overhead for basic operations on shared data and is competitive with, and sometimes faster than, the commercial HP compiler. We also investigate several communication optimizations, and show significant benefits from hand-optimizing the generated code.
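The SPMD, global-address-space style the abstract refers to can be seen in a few lines of UPC (a small set of extensions to C). This is a generic sketch for illustration, not code from the compiler described in the paper; the array name and sizes are arbitrary.

```c
#include <upc.h>      /* UPC extensions to C */
#include <stdio.h>

#define N 4           /* elements per thread (arbitrary) */

/* A shared array spread across the threads' partitions of the global
   address space; each thread has affinity to one block of N elements. */
shared [N] double a[N * THREADS];

int main(void) {
    int i;

    /* Each iteration runs on the thread with affinity to a[i],
       so all writes below are local. */
    upc_forall (i = 0; i < N * THREADS; i++; &a[i])
        a[i] = MYTHREAD;

    upc_barrier;

    /* Fine-grained remote reads: thread 0 touches every element,
       including those owned by other threads. */
    if (MYTHREAD == 0) {
        double sum = 0.0;
        for (i = 0; i < N * THREADS; i++)
            sum += a[i];
        printf("sum = %f\n", sum);
    }
    return 0;
}
```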
ieee international conference on high performance computing data and analytics | 2013
Cindy Rubio-González; Cuong Nguyen; Hong Diep Nguyen; James Demmel; William Kahan; Koushik Sen; David H. Bailey; Costin Iancu; David Hough
Given the variety of numerical errors that can occur, floating-point programs are difficult to write, test and debug. One common practice employed by developers without an advanced background in numerical analysis is using the highest available precision. While more robust, this can degrade program performance significantly. In this paper we present Precimonious, a dynamic program analysis tool to assist developers in tuning the precision of floating-point programs. Precimonious performs a search on the types of the floating-point program variables trying to lower their precision subject to accuracy constraints and performance goals. Our tool recommends a type instantiation that uses lower precision while producing an accurate enough answer without causing exceptions. We evaluate Precimonious on several widely used functions from the GNU Scientific Library, two NAS Parallel Benchmarks, and three other numerical programs. For most of the programs analyzed, Precimonious reduces precision, which results in performance improvements as high as 41%.
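The trade-off Precimonious automates can be sketched in a few lines of C: try a lower-precision type for a computation and keep it only if the result stays within an error tolerance. This is a hand-written illustration of the acceptance criterion, not the tool's search algorithm; the function names and the tolerance are invented for the example.

```c
#include <stdio.h>
#include <math.h>

/* Reference computation in the highest available precision. */
static double sum_double(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i];
    return s;
}

/* Candidate lower-precision version of the same computation. */
static float sum_float(const double *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += (float)x[i];
    return s;
}

int main(void) {
    double x[1000];
    for (int i = 0; i < 1000; i++) x[i] = 1.0 / (i + 1);

    double ref  = sum_double(x, 1000);
    double cand = (double)sum_float(x, 1000);
    double err  = fabs(cand - ref) / fabs(ref);

    /* Accept lower precision only if it meets the accuracy bound,
       mirroring the "accurate enough answer" constraint. */
    double tol = 1e-4;   /* hypothetical accuracy requirement */
    if (err < tol)
        printf("float is acceptable (rel. error %.2e)\n", err);
    else
        printf("keep double (rel. error %.2e)\n", err);
    return 0;
}
```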
international conference on parallel architectures and compilation techniques | 2005
Wei-Yu Chen; Costin Iancu; Katherine A. Yelick
Global address space languages like UPC exhibit high performance and portability on a broad class of shared and distributed memory parallel architectures. The most scalable applications use bulk memory copies rather than individual reads and writes to the shared space, but finer-grained sharing can be useful for scenarios such as dynamic load balancing, event signaling, and distributed hash tables. In this paper we present three optimization techniques for global address space programs with fine-grained communication: redundancy elimination, use of split-phase communication, and communication coalescing. Parallel UPC programs are analyzed using static single assignment form and a dataflow graph, which are extended to handle the various shared and private pointer types that are available in UPC. The optimizations also take advantage of UPC's relaxed memory consistency model, which reduces the need for cross thread analysis. We demonstrate the effectiveness of the analysis and optimizations using several benchmarks, which were chosen to reflect the kinds of fine-grained, communication-intensive phases that exist in some larger applications. The optimizations show speedups of up to 70% on three parallel systems, which represent three different types of cluster network technologies.
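Of the three optimizations, communication coalescing is the easiest to illustrate outside a compiler. The sketch below is written against MPI rather than UPC's shared pointers and only shows the effect of the transformation (many small transfers packed into one); in the paper, the compiler performs the equivalent rewriting automatically on UPC code.

```c
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv) {
    int rank;
    double buf[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < N; i++) buf[i] = i;
        /* Un-coalesced version: N tiny messages, each paying the full
         * per-message overhead:
         *   for (int i = 0; i < N; i++)
         *       MPI_Send(&buf[i], 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
         * Coalesced version: one message carrying all N elements. */
        MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d coalesced elements\n", N);
    }

    MPI_Finalize();
    return 0;
}
```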
international parallel and distributed processing symposium | 2010
Costin Iancu; Steven A. Hofmeyr; Filip Blagojevic; Yili Zheng
Existing multicore systems already provide deep levels of thread parallelism; hybrid programming models and composability of parallel libraries are very active areas of research within the scientific programming community. As more applications and libraries become parallel, scenarios where multiple threads compete for a core are unavoidable. In this paper we evaluate the impact of task oversubscription on the performance of MPI, OpenMP and UPC implementations of the NAS Parallel Benchmarks on UMA and NUMA multi-socket architectures. We evaluate explicit thread affinity management against the default Linux load balancing and discuss sharing and partitioning system management techniques. Our results indicate that oversubscription provides beneficial effects for applications running in competitive environments. Sharing all the available cores between applications provides better throughput than explicit partitioning. Modest levels of oversubscription improve system throughput by 27% and provide better performance isolation of applications from their co-runners: best overall throughput is always observed when applications share cores and each is executed with multiple threads per core. Rather than "resource" symbiosis, our results indicate that the determining behavioral factor when applications share a system is the granularity of the synchronization operations.
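The explicit affinity management compared against the default Linux load balancer boils down to pinning threads to cores. A minimal sketch using the Linux sched_setaffinity interface, assuming a Linux system and a pin-to-core-0 policy chosen purely for illustration:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);   /* pin the calling thread to core 0 */

    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("now running on cpu %d\n", sched_getcpu());
    return 0;
}
```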
ieee international conference on high performance computing data and analytics | 2011
Khaled Z. Ibrahim; Steven A. Hofmeyr; Costin Iancu; Eric Roman
Live migration is a widely used technique for resource consolidation and fault tolerance. KVM and Xen use iterative pre-copy approaches which work well in practice for commercial applications. In this paper, we study pre-copy live migration of MPI and OpenMP scientific applications running on KVM and present a detailed performance analysis of the migration process. We show that due to a high rate of memory changes, the current KVM rate control and target downtime heuristics do not cope well with HPC applications: statically choosing rate limits and downtimes is infeasible and current mechanisms sometimes provide sub-optimal performance. We present a novel on-line algorithm able to provide minimal downtime and minimal impact on end-to-end application performance. At the core of this algorithm is controlling migration based on the application memory rate of change.
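The control idea, choosing a transfer rate from the observed rate of memory change so that the final stop-and-copy phase fits a downtime target, can be sketched as below. The function name, the 1.5x convergence margin, and the sample numbers are all hypothetical; this is not the KVM/QEMU code path or the paper's exact algorithm.

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical on-line rate controller: choose a pre-copy transfer rate R
 * so that (a) the dirty set keeps shrinking (R above the dirtying rate) and
 * (b) the residual copied during stop-and-copy fits the downtime budget. */
static double choose_rate(double dirty_rate,      /* bytes/s dirtied by the app    */
                          double dirty_bytes,     /* current size of the dirty set */
                          double target_downtime) /* seconds                       */
{
    /* One more pass at rate R takes dirty_bytes / R seconds, during which
     * roughly dirty_rate * (dirty_bytes / R) bytes are re-dirtied; copying
     * that residual at rate R must fit in target_downtime.  Solving
     * dirty_rate * dirty_bytes / (R * R) <= target_downtime for R: */
    double r_downtime = sqrt(dirty_rate * dirty_bytes / target_downtime);

    /* Also stay comfortably above the dirtying rate so pre-copy converges. */
    double r_converge = 1.5 * dirty_rate;

    return r_downtime > r_converge ? r_downtime : r_converge;
}

int main(void) {
    /* Example: 2 GB dirty, app dirties 200 MB/s, want <= 0.3 s downtime. */
    double r = choose_rate(200e6, 2e9, 0.3);
    printf("suggested transfer rate: %.1f MB/s\n", r / 1e6);
    return 0;
}
```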
Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model | 2010
Filip Blagojevic; Paul Hargrove; Costin Iancu; Katherine A. Yelick
With multicore processors as the standard building block for high performance systems, parallel runtime systems need to provide excellent performance on shared memory, distributed memory, and hybrids. Conventional wisdom suggests that threads should be used as the runtime mechanism within shared memory, and two runtime versions for shared and distributed memory are often designed and implemented separately, retrofitting after the fact for hybrid systems. In this paper we consider the problem of implementing a runtime layer for Partitioned Global Address Space (PGAS) languages, which offer a uniform programming abstraction for hybrid machines. We present a new process-based shared memory runtime and compare it to our previous pthreads implementation. Both are integrated with the GASNet communication layer, and they can co-exist with one another. We evaluate the shared memory runtime approaches, showing that they interact in important and sometimes surprising ways with the communication layer. Using a set of microbenchmarks and application level benchmarks on an IBM BG/P, Cray XT, and InfiniBand cluster, we show that threads, processes and combinations of both are needed for maximum performance. Our new runtime shows speedups of over 60% for application benchmarks and 100% for collective communication benchmarks, when compared to the previous implementation. Our work primarily targets PGAS languages, but some of the lessons are relevant to other parallel runtime systems and libraries.
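A process-based shared memory runtime of this kind can be built on POSIX shared memory objects rather than pthreads. A minimal sketch with a made-up segment name, unrelated to the actual GASNet integration described in the paper:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

#define SHM_NAME "/pgas_segment_demo"   /* hypothetical segment name */
#define SHM_SIZE 4096

int main(void) {
    /* Create a named shared memory object that multiple processes can map,
       playing the role of one node's slice of the partitioned address space. */
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, SHM_SIZE) != 0) { perror("ftruncate"); return 1; }

    char *seg = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (seg == MAP_FAILED) { perror("mmap"); return 1; }

    if (fork() == 0) {
        /* Child process writes into the shared segment. */
        strcpy(seg, "hello from another process");
        return 0;
    }
    wait(NULL);
    printf("parent read: %s\n", seg);

    munmap(seg, SHM_SIZE);
    shm_unlink(SHM_NAME);
    return 0;
}
```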
ieee international conference on high performance computing data and analytics | 2011
Chang-Seo Park; Koushik Sen; Paul Hargrove; Costin Iancu
In this paper we present a precise data race detection technique for distributed memory parallel programs. Our technique, which we call Active Testing, builds on our previous work on race detection for shared memory Java and C programs and it handles programs written using shared memory approaches as well as bulk communication. Active testing works in two phases: in the first phase, it performs an imprecise dynamic analysis of an execution of the program and finds potential data races that could happen if the program is executed with a different thread schedule. In the second phase, active testing re-executes the program by actively controlling the thread schedule so that the data races reported in the first phase can be confirmed. A key highlight of our technique is that it can scalably handle distributed programs with bulk communication and single- and split-phase barriers. Another key feature of our technique is that it is precise — a data race confirmed by active testing is an actual data race present in the program; however, being a testing approach, our technique can miss actual data races. We implement the framework for the UPC programming language and demonstrate scalability up to a thousand cores for programs with both fine-grained and bulk (MPI style) communication. The tool confirms previously known bugs and uncovers several unknown ones. Our extensions capture constructs proposed in several modern programming languages for High Performance Computing, most notably non-blocking barriers and collectives.
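The class of bug Active Testing targets can be seen in a few lines of C with pthreads (the tool itself analyzes UPC programs): two threads update shared state without synchronization, and whether the lost update manifests depends on the schedule, which is exactly what the second phase forces.

```c
#include <pthread.h>
#include <stdio.h>

static long counter;                 /* shared, unsynchronized */

static void *bump(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++)
        counter++;                   /* racy read-modify-write */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, NULL);
    pthread_create(&t2, NULL, bump, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Phase one of Active Testing would flag the conflicting accesses to
     * counter; phase two would control the schedule so the lost update
     * actually occurs, confirming the race rather than just reporting it. */
    printf("counter = %ld (expected 200000)\n", counter);
    return 0;
}
```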
international conference on parallel architectures and compilation techniques | 2005
Costin Iancu; Parry Husbands; Paul Hargrove
Hiding communication latency is an important optimization for parallel programs. Programmers or compilers achieve this by using non-blocking communication primitives and overlapping communication with computation or other communication operations. Using non-blocking communication raises two issues: performance and programmability. In terms of performance, optimizers need to find a good communication schedule and are sometimes constrained by lack of full application knowledge. In terms of programmability, efficiently managing non-blocking communication can prove cumbersome for complex applications. In this paper we present the design principles of HUNT, a runtime system designed to search for and exploit some of the overlap available at execution time in UPC programs. Using virtual memory support, our runtime implements demand-driven synchronization for data involved in communication operations. It also employs message decomposition and scheduling heuristics to transparently improve the non-blocking behavior of applications. We provide a user level implementation of HUNT on a variety of modern high performance computing systems. Results indicate that our approach is successful in finding some of the overlap available at execution time. While system and application characteristics influence performance, perhaps the determining factor is the time taken by the CPU to execute a signal handler. Demand driven synchronization at execution time eliminates the need for the explicit management of non-blocking communication. Besides increasing programmer productivity, this feature also simplifies compiler analysis for communication optimizations.
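The virtual memory mechanism alluded to here, protecting pages involved in an outstanding transfer and synchronizing only when they are actually touched, can be sketched with mprotect and a SIGSEGV handler. This is a simplified single-process illustration in which "completing the transfer" is just a memcpy from a staging buffer; a real runtime such as HUNT must also handle signal-safety and concurrent transfers.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *buf;            /* page under protection            */
static char staged[4096];    /* data "arriving" from the network */
static long page;

/* Fault handler: the application touched data whose transfer has not
 * been synchronized yet, so "complete the transfer" and unprotect. */
static void on_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    char *p = (char *)((uintptr_t)si->si_addr & ~(uintptr_t)(page - 1));
    memcpy(p, staged, sizeof staged);           /* stand-in for a wait/sync   */
    mprotect(p, page, PROT_READ | PROT_WRITE);  /* faulting access now retries */
}

int main(void) {
    page = sysconf(_SC_PAGESIZE);
    buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset(staged, 'x', sizeof staged);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = on_fault;
    sigaction(SIGSEGV, &sa, NULL);

    /* "Issue" a non-blocking transfer into buf and protect the pages
       until the data is actually needed. */
    mprotect(buf, page, PROT_NONE);

    /* The first touch triggers the handler, which synchronizes on demand. */
    printf("first byte after demand-driven sync: %c\n", buf[0]);
    return 0;
}
```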
high performance distributed computing | 2016
Nicholas Chaimov; Allen D. Malony; Shane Canon; Costin Iancu; Khaled Z. Ibrahim; Jay Srinivasan
We report our experiences porting Spark to large production HPC systems. While Spark performance in a data center installation (with local disks) is dominated by the network, our results show that file system metadata access latency can dominate in an HPC installation using Lustre: it can make single-node performance up to 4x slower than that of a typical workstation. We evaluate a combination of software techniques and hardware configurations designed to address this problem. For example, on the software side we develop a file pooling layer able to improve per-node performance up to 2.8x. On the hardware side we evaluate a system with a large NVRAM buffer between compute nodes and the backend Lustre file system: this improves scaling at the expense of per-node performance. Overall, our results indicate that scalability is currently limited to O(10^2) cores in an HPC installation with Lustre and default Spark. After careful configuration combined with our pooling we can scale up to O(10^4) cores. As our analysis indicates, it is feasible to observe much higher scalability in the near future.
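The file pooling idea, keeping descriptors open across accesses so that repeated open/close calls do not each pay a Lustre metadata round trip, can be sketched in C. This is a toy cache for illustration (pooled_open and the pool size are invented here); the paper's layer sits under the JVM/Spark I/O path.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define POOL_SIZE 16

/* Toy file-descriptor pool: reuse an already-open descriptor for a path
 * instead of paying another metadata round trip for open(). */
static struct { char path[256]; int fd; } pool[POOL_SIZE];
static int pool_used;

static int pooled_open(const char *path) {
    for (int i = 0; i < pool_used; i++)
        if (strcmp(pool[i].path, path) == 0) {
            lseek(pool[i].fd, 0, SEEK_SET);   /* cache hit: rewind and reuse */
            return pool[i].fd;
        }
    int fd = open(path, O_RDONLY);            /* cache miss: one real open() */
    if (fd >= 0 && pool_used < POOL_SIZE) {
        snprintf(pool[pool_used].path, sizeof pool[pool_used].path, "%s", path);
        pool[pool_used].fd = fd;
        pool_used++;
    }
    return fd;
}

int main(void) {
    char line[128];
    for (int i = 0; i < 3; i++) {             /* only the first call opens */
        int fd = pooled_open("/etc/hostname");  /* any readable file */
        ssize_t n = read(fd, line, sizeof line - 1);
        if (n > 0) { line[n] = '\0'; printf("%s", line); }
    }
    return 0;
}
```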
ieee international conference on high performance computing data and analytics | 2004
Costin Iancu; Parry Husbands; Wei Chen
In this work we investigate how the compiler technique of message strip-mining performs in practice on contemporary high performance networks. Message strip-mining attempts to reduce the overall cost of communication in parallel programs by breaking up large message transfers into smaller ones that can be overlapped with computation. In practice, however, network resource constraints may negate the expected performance gains. By deriving a performance model and synthetic benchmarks we determine how network and application characteristics influence the applicability of this optimization. We use these findings to determine heuristics to follow when performing this optimization on parallel programs. We propose strip-mining with variable block size as an alternative strategy that performs almost as well as a highly tuned fixed block strategy and has the advantage of being performance portable across systems and application input sets. We evaluate both techniques using synthetic benchmarks and an application from the NAS Parallel Benchmark suite.
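Message strip-mining itself is straightforward to sketch with non-blocking MPI: a large transfer is broken into blocks, and each block's communication overlaps with computation on the following block. The fixed block size below is arbitrary; the paper's contribution is choosing such sizes (including variable ones) well.

```c
#include <mpi.h>
#include <stdio.h>

#define N     (1 << 20)   /* total elements                */
#define BLOCK (1 << 16)   /* fixed strip size (arbitrary)  */

static void compute(double *x, int n) {        /* work to overlap with */
    for (int i = 0; i < n; i++) x[i] = x[i] * 2.0 + 1.0;
}

int main(int argc, char **argv) {
    static double data[N];
    MPI_Request req[N / BLOCK];
    int rank, nblocks = N / BLOCK;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Strip-mined send: start each block's transfer as soon as it is
         * ready, so the computation on block b+1 overlaps with the
         * in-flight send of block b. */
        for (int b = 0; b < nblocks; b++) {
            compute(&data[b * BLOCK], BLOCK);
            MPI_Isend(&data[b * BLOCK], BLOCK, MPI_DOUBLE, 1, b,
                      MPI_COMM_WORLD, &req[b]);
        }
        MPI_Waitall(nblocks, req, MPI_STATUSES_IGNORE);
    } else if (rank == 1) {
        for (int b = 0; b < nblocks; b++)
            MPI_Irecv(&data[b * BLOCK], BLOCK, MPI_DOUBLE, 0, b,
                      MPI_COMM_WORLD, &req[b]);
        MPI_Waitall(nblocks, req, MPI_STATUSES_IGNORE);
        printf("received %d strips\n", nblocks);
    }

    MPI_Finalize();
    return 0;
}
```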