
Publication


Featured research published by Richard Tran Mills.


Journal of Physics: Conference Series | 2007

Simulating Subsurface Flow and Transport on Ultrascale Computers using PFLOTRAN

Richard Tran Mills; Chuan Lu; Peter C. Lichtner; Glenn E. Hammond

We describe PFLOTRAN, a recently developed code for modeling multi-phase, multi-component subsurface flow and reactive transport using massively parallel computers. PFLOTRAN is built on top of PETSc, the Portable, Extensible Toolkit for Scientific Computation. Leveraging PETSc has allowed us to develop, with a relatively modest investment in development effort, a code that exhibits excellent performance on the largest-scale supercomputers. Very significant enhancements to the code are planned during our SciDAC-2 project. Here we describe the current state of the code, present an example of its use on Jaguar, the Cray XT3/4 system at Oak Ridge National Laboratory consisting of 11,706 dual-core Opteron processor nodes, and briefly outline our future plans for the code.
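
The build-on-PETSc pattern the abstract describes is compact enough to sketch. Below is a minimal, illustrative example in the modern PETSc C API, not code from PFLOTRAN itself; a 1-D Laplacian stands in for a real flow/transport Jacobian, and the solver is configured entirely at runtime.

```c
/* Minimal sketch of the PETSc usage pattern a code like PFLOTRAN builds
 * on: assemble a distributed sparse system, then solve it with a Krylov
 * solver whose type and preconditioner are chosen at runtime via options
 * such as -ksp_type and -pc_type. Illustrative only. */
#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat      A;               /* distributed sparse operator        */
  Vec      x, b;            /* solution and right-hand side       */
  KSP      ksp;             /* Krylov solver context              */
  PetscInt n = 100, i, Istart, Iend;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

  /* Assemble a 1-D Laplacian as a stand-in for a real Jacobian. */
  PetscCall(MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE,
                         n, n, 3, NULL, 1, NULL, &A));
  PetscCall(MatGetOwnershipRange(A, &Istart, &Iend));
  for (i = Istart; i < Iend; i++) {
    if (i > 0)   PetscCall(MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES));
    if (i < n-1) PetscCall(MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES));
    PetscCall(MatSetValue(A, i, i, 2.0, INSERT_VALUES));
  }
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

  PetscCall(MatCreateVecs(A, &x, &b));
  PetscCall(VecSet(b, 1.0));

  PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPSetFromOptions(ksp));   /* runtime-configurable solver */
  PetscCall(KSPSolve(ksp, b, x));

  PetscCall(KSPDestroy(&ksp));
  PetscCall(MatDestroy(&A));
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&b));
  PetscCall(PetscFinalize());
  return 0;
}
```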


International Conference on Computational Science | 2005

Vectorized sparse matrix multiply for compressed row storage format

Eduardo F. D'Azevedo; Mark R. Fahey; Richard Tran Mills

The innovation of this work is a simple vectorizable algorithm for performing sparse matrix-vector multiply in compressed sparse row (CSR) storage format. Unlike the vectorizable jagged diagonal format (JAD), this algorithm requires no data rearrangement and can be easily adapted to a sophisticated library framework such as PETSc. Numerical experiments on the Cray X1 show an order of magnitude improvement over the non-vectorized algorithm.
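
For reference, here is a plain-C sketch of the baseline (non-vectorized) CSR kernel the paper improves on; this is the standard algorithm, not the paper's vectorized variant. The short, variable-length inner loop is what defeats straightforward vectorization; the paper's remedy, roughly, is to iterate over groups of rows with equal nonzero counts (via a permutation, leaving the CSR arrays untouched) so the hardware can vectorize across the rows of each group.

```c
/* Baseline CSR sparse matrix-vector product y = A*x. The inner loop's
 * trip count varies per row and is typically short, which is what makes
 * this form vectorize poorly on machines like the Cray X1. */
void csr_matvec(int n, const int *rowptr, const int *colidx,
                const double *val, const double *x, double *y)
{
  for (int i = 0; i < n; i++) {
    double sum = 0.0;
    for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
      sum += val[k] * x[colidx[k]];   /* indirect gather through colidx */
    y[i] = sum;
  }
}
```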


IEEE International Conference on High Performance Computing, Data, and Analytics | 2008

An Evaluation of the Oak Ridge National Laboratory Cray XT3

Sadaf R. Alam; Richard Frederick Barrett; Mark R. Fahey; Jeffery A. Kuehn; O. E. Bronson Messer; Richard Tran Mills; Philip C. Roth; Jeffrey S. Vetter; Patrick H. Worley

In 2005, Oak Ridge National Laboratory (ORNL) received delivery of a 5,294-processor Cray XT3. The XT3 is Cray's third-generation massively parallel processing system. The ORNL system uses a single-processor node built around the AMD Opteron and uses a custom chip, called SeaStar, for interprocessor communication. The system uses a lightweight operating system called Catamount on its compute nodes. This paper provides a performance evaluation of the Cray XT3, including measurements for micro-benchmark, kernel, and application benchmarks. In particular, we provide performance results for strategic Department of Energy application areas including climate, biology, astrophysics, combustion, and fusion. Our results, on up to 4,096 processors, demonstrate that the Cray XT3 provides competitive processor performance, high interconnect bandwidth, and high parallel efficiency on a diverse application workload typical of the DOE Office of Science.


Parallel Computing | 2014

Hierarchical Krylov and nested Krylov methods for extreme-scale computing

Lois Curfman McInnes; Barry F. Smith; Hong Zhang; Richard Tran Mills

The solution of large, sparse linear systems is often a dominant phase of computation for simulations based on partial differential equations, which are ubiquitous in scientific and engineering applications. While preconditioned Krylov methods are widely used and offer many advantages for solving sparse linear systems that do not have highly convergent geometric multigrid solvers or specialized fast solvers, Krylov methods encounter well-known scaling difficulties beyond 10,000 processor cores because each iteration requires at least one vector inner product, which in turn requires a global synchronization that scales poorly because of internode latency. To help overcome these difficulties, we have developed hierarchical Krylov methods and nested Krylov methods in the PETSc library that reduce the number of global inner products required across the entire system (where they are expensive), while freely allowing vector inner products across smaller subsets of the entire system (where they are inexpensive) or using inner iterations that invoke no vector inner products at all. Nested Krylov methods are a generalization of inner-outer iterative methods with two or more layers. Hierarchical Krylov methods are a generalization of block Jacobi and overlapping additive Schwarz methods, where each block is itself solved by Krylov methods on smaller blocks. Conceptually, the hierarchy can continue recursively to an arbitrary number of levels of smaller and smaller blocks. As a specific case, we introduce the hierarchical FGMRES method, or h-FGMRES, and we demonstrate the impact of two-level h-FGMRES with a variable preconditioner on the PFLOTRAN subsurface flow application. We also demonstrate the impact of nested FGMRES, BiCGStab, and Chebyshev methods. These hierarchical and nested Krylov methods significantly reduced overall PFLOTRAN simulation time on the Cray XK6 when using 10,000 through 224,000 cores, through the combined effects of reduced global synchronization (fewer global inner products) and stronger inner hierarchical or nested preconditioners.
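
The general flavor of a two-level Krylov hierarchy can be expressed with standard PETSc runtime options. The sketch below shows the pattern (outer flexible GMRES, a block Jacobi partitioning, and an inner GMRES per block), not the exact h-FGMRES configuration used in the paper:

```sh
mpiexec -n 1024 ./app -ksp_type fgmres \
    -pc_type bjacobi -pc_bjacobi_blocks 64 \
    -sub_ksp_type gmres -sub_ksp_max_it 10 -sub_pc_type ilu
```

Only the outer FGMRES iteration pays for global inner products; the inner GMRES sweeps synchronize only within their blocks, which is precisely the trade the abstract describes.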


Journal of Physics: Conference Series | 2008

Toward petascale computing in geosciences: application to the Hanford 300 area

Glenn E. Hammond; Peter C. Lichtner; Richard Tran Mills; Chuan Lu

Modeling uranium transport at the Hanford 300 Area presents new challenges for high-performance computing. A field-scale three-dimensional domain with an hourly fluctuating Columbia River stage coupled to flow in highly permeable sediments results in fast groundwater flow rates requiring small time steps. In this work, high-performance computing has been applied to simulate variably saturated groundwater flow and tracer transport at the 300 Area using PFLOTRAN. Simulation results are presented for discretizations of up to 10.8 million degrees of freedom, while PFLOTRAN performance was assessed on up to one billion degrees of freedom and 12,000 processor cores on Jaguar, the Cray XT4 supercomputer at ORNL.
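
The link between fast flow and small time steps can be made concrete with the usual advective (CFL-type) constraint, a generic sketch rather than a formula quoted from the paper:

$$\Delta t \lesssim \frac{\Delta x}{v},$$

where $v$ is the pore-water velocity and $\Delta x$ the grid spacing: highly permeable sediments and an hourly fluctuating river stage push $v$ up and force $\Delta t$ down.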


Journal of Physics: Conference Series | 2009

Modeling subsurface reactive flows using leadership-class computing

Richard Tran Mills; Glenn E. Hammond; Peter C. Lichtner; Vamsi K Sripathi; G. Mahinthakumar; Barry F. Smith

We describe our experiences running PFLOTRAN, a code for simulation of coupled hydro-thermal-chemical processes in variably saturated, non-isothermal porous media, on leadership-class supercomputers, including initial experiences running on the petaflop incarnation of Jaguar, the Cray XT5 at the National Center for Computational Sciences at Oak Ridge National Laboratory. PFLOTRAN utilizes fully implicit time-stepping and is built on top of the Portable, Extensible Toolkit for Scientific Computation (PETSc). We discuss some of the hurdles to at-scale performance with PFLOTRAN and the progress we have made in overcoming them on leadership-class computer architectures.
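
"Fully implicit time-stepping" here means each step solves a nonlinear system with Newton's method. A minimal sketch of that pattern using PETSc's SNES follows; the toy scalar reaction term is a stand-in for PFLOTRAN's coupled physics, and one would run with -snes_mf or -snes_fd so PETSc approximates the Jacobian.

```c
/* Sketch (toy problem, not PFLOTRAN code) of one fully implicit time
 * step: solve the backward-Euler residual F(u) = (u - u_old)/dt - f(u)
 * = 0 with PETSc's Newton solver, SNES. */
#include <petscsnes.h>

typedef struct { Vec uold; PetscReal dt; } StepCtx;

static PetscErrorCode Residual(SNES snes, Vec u, Vec F, void *ptr)
{
  StepCtx           *ctx = (StepCtx *)ptr;
  const PetscScalar *uu, *uo;
  PetscScalar       *ff;
  PetscInt           i, n;

  PetscFunctionBeginUser;
  PetscCall(VecGetLocalSize(u, &n));
  PetscCall(VecGetArrayRead(u, &uu));
  PetscCall(VecGetArrayRead(ctx->uold, &uo));
  PetscCall(VecGetArray(F, &ff));
  for (i = 0; i < n; i++)                 /* backward Euler residual   */
    ff[i] = (uu[i] - uo[i]) / ctx->dt + uu[i] * uu[i]; /* toy f(u)=-u^2 */
  PetscCall(VecRestoreArrayRead(u, &uu));
  PetscCall(VecRestoreArrayRead(ctx->uold, &uo));
  PetscCall(VecRestoreArray(F, &ff));
  PetscFunctionReturn(PETSC_SUCCESS);
}

int main(int argc, char **argv)
{
  SNES    snes;
  Vec     u;
  StepCtx ctx;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, 1000, &u));
  PetscCall(VecDuplicate(u, &ctx.uold));
  PetscCall(VecSet(ctx.uold, 1.0));  /* state at the previous time level */
  PetscCall(VecCopy(ctx.uold, u));   /* initial Newton guess             */
  ctx.dt = 0.1;

  PetscCall(SNESCreate(PETSC_COMM_WORLD, &snes));
  PetscCall(SNESSetFunction(snes, NULL, Residual, &ctx));
  PetscCall(SNESSetFromOptions(snes)); /* e.g. -snes_mf for the Jacobian */
  PetscCall(SNESSolve(snes, NULL, u)); /* one implicit time step         */

  PetscCall(SNESDestroy(&snes));
  PetscCall(VecDestroy(&u));
  PetscCall(VecDestroy(&ctx.uold));
  PetscCall(PetscFinalize());
  return 0;
}
```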


International Conference on Supercomputing | 2001

Algorithmic modifications to the Jacobi-Davidson parallel eigensolver to dynamically balance external CPU and memory load

Richard Tran Mills; Andreas Stathopoulos; Evgenia Smirni

Clusters of workstations (COWs) and SMPs have become popular and cost-effective means of solving scientific problems. Because such environments may be heterogeneous and/or time-shared, dynamic load balancing is central to achieving high performance. Our thesis is that new levels of sophistication are required in parallel algorithm design and in the interaction of the algorithms with the runtime system. To support this thesis, we illustrate a novel approach for application-level balancing of external CPU and memory load in parallel iterative methods that employ some form of local preconditioning on each node. There are two key ideas. First, because all nodes need not perform their portion of the preconditioning phase to the same accuracy, the code can achieve perfect load balance, dynamically adapting to external CPU load, if we stop the preconditioning phase on all processors after a fixed amount of time. Second, if the program detects memory thrashing on a node, it recedes its preconditioning phase from that node, hopefully speeding the completion of competing jobs and hence the relinquishing of their resources. We have implemented our load-balancing approach in a state-of-the-art, coarse-grain parallel Jacobi-Davidson eigensolver. Experimental results show that the new method adapts its algorithm based on runtime system information, without compromising the overall convergence behavior. We demonstrate the effectiveness of the new algorithm in a COW environment under (a) variable CPU load and (b) variable memory availability caused by competing applications.
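
A hypothetical sketch of the paper's first idea (the function names are illustrative, not an API from the paper): run the local preconditioning phase until a fixed wall-clock budget expires rather than to a fixed accuracy. Lightly loaded nodes do more inner work, heavily loaded nodes do less, and all ranks reach the next global synchronization point at roughly the same time.

```c
#include <mpi.h>

/* Time-capped local preconditioning: apply_inner_step is a placeholder
 * for one step of whatever local preconditioner the node runs. */
void precondition_with_time_cap(double budget_seconds, int max_steps,
                                void (*apply_inner_step)(void *state),
                                void *state)
{
  double t0 = MPI_Wtime();
  for (int step = 0; step < max_steps; step++) {
    apply_inner_step(state);                 /* one local inner iteration */
    if (MPI_Wtime() - t0 >= budget_seconds)  /* budget spent: stop early  */
      break;
  }
  /* All ranks now proceed to the global phase together. Flexible outer
   * methods (e.g. FGMRES or Jacobi-Davidson) tolerate the nonuniform
   * preconditioning accuracy this produces. */
}
```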


International Geoscience and Remote Sensing Symposium | 2010

Geospatiotemporal data mining in an early warning system for forest threats in the United States

Forrest M. Hoffman; Richard Tran Mills; Jitendra Kumar; Srinivasa S. Vulli; William W. Hargrove

We investigate the potential of geospatiotemporal data mining of multi-year land surface phenology data (in this study, 250 m Normalized Difference Vegetation Index (NDVI) values derived from the Moderate Resolution Imaging Spectroradiometer (MODIS)) for the conterminous United States as part of an early warning system to identify threats to forest ecosystems. Cluster analysis of this massive data set, using high-performance computing, provides a basis for several possible approaches to defining the bounds of “normal” phenological patterns, indicating healthy vegetation in a given geographic location. We demonstrate the applicability of such an approach by using it to identify areas in Colorado, USA, where an ongoing mountain pine beetle outbreak has caused significant tree mortality.
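
To illustrate the kind of cluster analysis involved, here is a minimal plain-C k-means (Lloyd's algorithm) over per-pixel NDVI time series; the authors' high-performance parallel implementation is far more elaborate, and this sketch makes no claim to match it.

```c
#include <float.h>
#include <stdlib.h>
#include <string.h>

/* Minimal k-means: each of n observations is a d-dimensional vector
 * (one NDVI value per time step); centroids are seeded by the caller. */
void kmeans(int n, int d, int k, int iters,
            const double *x,   /* n x d observations, row-major         */
            double *cent,      /* k x d centroids, updated in place     */
            int *assign)       /* out: cluster index per observation    */
{
  double *sum   = malloc((size_t)k * d * sizeof *sum);
  int    *count = malloc((size_t)k * sizeof *count);

  for (int it = 0; it < iters; it++) {
    memset(sum, 0, (size_t)k * d * sizeof *sum);
    memset(count, 0, (size_t)k * sizeof *count);

    /* Assignment: nearest centroid by squared Euclidean distance. */
    for (int i = 0; i < n; i++) {
      int    best  = 0;
      double bestd = DBL_MAX;
      for (int c = 0; c < k; c++) {
        double dist = 0.0;
        for (int j = 0; j < d; j++) {
          double diff = x[i * d + j] - cent[c * d + j];
          dist += diff * diff;
        }
        if (dist < bestd) { bestd = dist; best = c; }
      }
      assign[i] = best;
      count[best]++;
      for (int j = 0; j < d; j++) sum[best * d + j] += x[i * d + j];
    }

    /* Update: move each centroid to the mean of its members. */
    for (int c = 0; c < k; c++)
      if (count[c] > 0)
        for (int j = 0; j < d; j++)
          cent[c * d + j] = sum[c * d + j] / count[c];
  }
  free(sum);
  free(count);
}
```

Pixels whose observed trajectory lies far from every centroid learned from healthy years are natural candidates for an early warning flag.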


Journal of Grid Computing | 2007

Runtime and Programming Support for Memory Adaptation in Scientific Applications via Local Disk and Remote Memory

Richard Tran Mills; Chuan Yue; Andreas Stathopoulos; Dimitrios S. Nikolopoulos

The ever-increasing memory demands of many scientific applications and the complexity of today’s shared computational resources still require the occasional use of virtual memory, network memory, or even out-of-core implementations, with well-known drawbacks in performance and usability. In Mills et al. (Adapting to memory pressure from within scientific applications on multiprogrammed COWs. In: International Parallel and Distributed Processing Symposium, IPDPS, Santa Fe, NM, 2004), we introduced a basic framework for a runtime, user-level library, MMlib, in which DRAM is treated as a dynamically sized cache for large memory objects residing on local disk. Application developers can specify and access these objects through MMlib, enabling their applications to execute optimally under variable memory availability, using as much DRAM as fluctuating memory levels will allow. In this paper, we first extend our earlier MMlib prototype from a proof of concept to a usable, robust, and flexible library. We present a general framework that enables fully customizable memory malleability in a wide variety of scientific applications. We provide several necessary enhancements to the environment-sensing capabilities of MMlib, and introduce a remote memory capability based on MPI communication of cached memory blocks between ‘compute nodes’ and designated memory servers. The increasing speed of interconnection networks makes a remote memory approach attractive, especially at the large granularity present in large scientific applications. We show experimental results from three important scientific applications that require the general MMlib framework. The memory-adaptive versions perform nearly optimally under constant memory pressure and execute harmoniously with other applications competing for memory, without thrashing the memory system. Under constant memory pressure, we observe execution-time improvements of factors between three and five over relying solely on the virtual memory system. With remote memory employed, these factors are even larger and significantly better than those of other, system-level remote memory implementations.
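
A hypothetical sketch of the core mechanism (the type and function names are illustrative, not MMlib's actual API): the object lives in a file on local disk and is touched one fixed-size "panel" at a time through a cache whose capacity tracks the DRAM currently available. Read-only access is assumed for brevity; a real library must write dirty panels back before evicting them.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct {
  FILE  *backing;       /* disk file holding the full object             */
  size_t panel_bytes;   /* caching granularity                           */
  int    npanels;       /* object size in panels                         */
  int    max_resident;  /* cache capacity; lowered under memory pressure */
  int    nresident;     /* panels currently cached                       */
  char **panel;         /* panel[i] non-NULL iff panel i is in DRAM      */
  int    clock;         /* trivial round-robin eviction cursor           */
} mmobj;

/* Return a DRAM pointer to panel i, fetching from disk on a miss and
 * evicting panels whenever the cache exceeds its current budget. */
char *mm_get_panel(mmobj *o, int i)
{
  if (o->panel[i]) return o->panel[i];          /* cache hit        */

  while (o->nresident >= o->max_resident && o->nresident > 0) {
    int v = o->clock++ % o->npanels;            /* shrink to budget */
    if (o->panel[v]) {
      free(o->panel[v]);
      o->panel[v] = NULL;
      o->nresident--;
    }
  }
  o->panel[i] = malloc(o->panel_bytes);         /* fetch on miss    */
  fseek(o->backing, (long)i * (long)o->panel_bytes, SEEK_SET);
  if (fread(o->panel[i], 1, o->panel_bytes, o->backing) != o->panel_bytes) {
    /* short read: error handling elided in this sketch */
  }
  o->nresident++;
  return o->panel[i];
}
```

Shrinking max_resident when the library senses memory pressure lets the application shed cached panels gracefully instead of letting the virtual memory system page arbitrary data to swap.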


International Parallel and Distributed Processing Symposium | 2003

Dynamic load balancing of an iterative eigensolver on networks of heterogeneous clusters

James R. McCombs; Richard Tran Mills; Andreas Stathopoulos

Clusters of homogeneous workstations built around fast networks have become a popular means of solving scientific problems, and users often have access to several such clusters. Harnessing the collective power of these clusters to solve a single, challenging problem is desirable, but is often impeded by large inter-cluster network latencies and the heterogeneity of different clusters. The complexity of these environments requires commensurate advances in parallel algorithm design. We support this thesis by utilizing two techniques: 1) multigrain, a novel algorithmic technique that induces coarse granularity in parallel iterative methods, providing tolerance for large communication latencies, and 2) an application-level load balancing technique applicable to a specific but important class of iterative methods. We implement both algorithmic techniques in the popular Jacobi-Davidson eigenvalue iterative solver. Our experiments in a cluster environment show that the combination of the two techniques enables an effective use of heterogeneous, possibly distributed resources that cannot be achieved by traditional implementations of the method.

Collaboration


Top co-authors of Richard Tran Mills:

Forrest M. Hoffman (Oak Ridge National Laboratory)
Glenn E. Hammond (Sandia National Laboratories)
Peter C. Lichtner (Los Alamos National Laboratory)
Jitendra Kumar (Oak Ridge National Laboratory)
William W. Hargrove (United States Forest Service)
Barry F. Smith (Argonne National Laboratory)
Mark R. Fahey (Oak Ridge National Laboratory)
S. R. Kawa (Goddard Space Flight Center)