Christopher D. Krieger
Colorado State University
Publication
Featured research published by Christopher D. Krieger.
Design, Automation and Test in Europe | 2017
Jiachen Mao; Xiang Chen; Kent W. Nixon; Christopher D. Krieger; Yiran Chen
Although Deep Neural Networks (DNN) are ubiquitously utilized in many applications, it is generally difficult to deploy DNNs on resource-constrained devices, e.g., mobile platforms. Some existing attempts mainly focus on the client-server computing paradigm or DNN model compression, which require either infrastructure support or special training phases, respectively. In this work, we propose MoDNN — a local distributed mobile computing system for DNN applications. MoDNN can partition already trained DNN models onto several mobile devices to accelerate DNN computations by alleviating device-level computing cost and memory usage. Two model partition schemes are also designed to minimize non-parallel data delivery time, including both wakeup time and transmission time. Experimental results show that when the number of worker nodes increases from 2 to 4, MoDNN can accelerate the DNN computation by 2.17–4.28×. Besides the parallel execution, the performance speedup also partially comes from the reduction of the data delivery time, e.g., 30.02% compared with conventional 2D-grid partitioning.
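The partitioning idea can be illustrated with a minimal sketch: split the output neurons of one fully connected layer across worker devices, let each worker compute its slice of y = Wx, then concatenate the partial results. The function names here (`partition_rows`, `fc_layer_distributed`) are illustrative, not MoDNN's API, and the "workers" run sequentially in this sketch.

```python
def partition_rows(n_rows, n_workers):
    """Split row indices [0, n_rows) into n_workers contiguous chunks."""
    base, extra = divmod(n_rows, n_workers)
    chunks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

def fc_layer_distributed(W, x, n_workers):
    """Compute y[i] = sum_j W[i][j] * x[j], one row chunk per worker."""
    results = {}
    for w, rows in enumerate(partition_rows(len(W), n_workers)):
        # On a real deployment, this loop body would run on device w.
        results[w] = [sum(W[i][j] * x[j] for j in range(len(x)))
                      for i in rows]
    # Concatenate the partial outputs in worker order.
    return [v for w in sorted(results) for v in results[w]]
```

The paper's partition schemes go further, balancing wakeup and transmission time; the sketch only shows why per-device compute and memory shrink as workers are added.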
IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013
Christopher D. Krieger; Michelle Mills Strout; Catherine Olschanowsky; Andrew Stone; Stephen M. Guzik; Xinfeng Gao; Carlo Bertolli; Paul H. J. Kelly; Gihan R. Mudalige; Brian Van Straalen; Samuel Williams
There is a significant, established code base in the scientific computing community. Some of these codes have been parallelized already but are now encountering scalability issues due to poor data locality, inefficient data distributions, or load imbalance. In this work, we introduce a new abstraction called loop chaining in which a sequence of parallel and/or reduction loops that explicitly share data are grouped together into a chain. Once specified, a chain of loops can be viewed as a set of iterations under a partial ordering. This partial ordering is dictated by data dependencies that, as part of the abstraction, are exposed, thereby avoiding inter-procedural program analysis. Thus a loop chain is a partially ordered set of iterations that makes scheduling and determining data distributions across loops possible for a compiler and/or run-time system. The flexibility of being able to schedule across loops enables better management of the data locality and parallelism tradeoff. In this paper, we define the loop chaining concept and present three case studies using loop chains in scientific codes: the sparse matrix Jacobi benchmark; OP2, a domain-specific library used in full applications with unstructured grids; and Chombo, a domain-specific library used in full applications with structured grids. Preliminary results for the Jacobi benchmark show that a loop-chain-enabled optimization, full sparse tiling, results in a speedup of as much as 2.68x over a parallelized, blocked implementation on a multicore system with 40 cores.
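The partial ordering at the heart of the abstraction can be sketched concretely: given a loop that writes A[i] followed by a loop that reads A[B[j]], the cross-loop dependences are exactly the pairs where a loop-2 iteration reads what a loop-1 iteration wrote. This is a toy representation, not the paper's implementation.

```python
def chain_dependences(n_loop1, reads_loop2):
    """Return edges (1, i) -> (2, j) whenever loop-2 iteration j reads
    the element A[reads_loop2[j]] that loop-1 iteration i wrote.

    The resulting edge set is the partial order a scheduler must
    respect when reordering iterations across the two chained loops."""
    deps = []
    for j, idx in enumerate(reads_loop2):
        if 0 <= idx < n_loop1:
            deps.append(((1, idx), (2, j)))
    return deps
```

Any iteration order that honors these edges is legal, which is what lets a run-time system interleave the two loops for locality instead of running them back to back.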
IEEE International Conference on High Performance Computing, Data and Analytics | 2012
Christopher D. Krieger; Michelle Mills Strout; Jonathan Roelofs; Amanreet Bajwa
Many sparse or irregular scientific computations are memory bound and benefit from locality improving optimizations such as blocking or tiling. These optimizations result in asynchronous parallelism that can be represented by arbitrary task graphs. Unfortunately, most popular parallel programming models with the exception of Threading Building Blocks (TBB) do not directly execute arbitrary task graphs. In this paper, we compare the programming and execution of arbitrary task graphs qualitatively and quantitatively in TBB, the OpenMP doall model, the OpenMP 3.0 task model, and Cilk Plus. We present performance and scalability results for 8 and 40 core shared memory systems on a sparse matrix iterative solver and a molecular dynamics benchmark.
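The core mechanism every such runtime implements can be sketched as follows, assuming the task graph is given as a dict mapping each task to its predecessors. A task becomes ready once all predecessors have run; in TBB, OpenMP tasks, or Cilk Plus, the ready tasks would execute concurrently rather than one at a time as in this serial sketch.

```python
from collections import deque

def execute_task_graph(preds, action):
    """Run action(task) for every task, respecting the partial order
    given by preds: {task: [tasks that must finish first]}."""
    succs = {t: [] for t in preds}
    indeg = {t: len(p) for t, p in preds.items()}
    for t, ps in preds.items():
        for p in ps:
            succs[p].append(t)
    ready = deque(t for t, d in indeg.items() if d == 0)
    order = []
    while ready:
        t = ready.popleft()   # a parallel runtime would dispatch all
        action(t)             # ready tasks to worker threads here
        order.append(t)
        for s in succs[t]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    if len(order) != len(preds):
        raise ValueError("cycle in task graph")
    return order
```

The qualitative comparison in the paper comes down to how directly each model expresses the `preds` relation: TBB accepts an explicit graph, while the doall and task models must encode it through barriers or nested spawns.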
Proceedings of the 2010 Workshop on Parallel Programming Patterns | 2010
Christopher D. Krieger; Andrew Stone; Michelle Mills Strout
Parallel programming implementation details often obfuscate the original algorithm and make later algorithm maintenance difficult. Although parallel programming patterns help guide the structured development of parallel programs, they do not necessarily avoid the code obfuscation problem. In this paper, we observe how emerging and existing programming models realized as programming languages, preprocessor directives, and/or libraries are able to support the Implementation Strategy Patterns that were proposed as part of the Our Pattern Language. We posit that these emerging programming models are in some cases able to avoid code obfuscation through features that prevent tangling of algorithm and implementation details for parallelization and performance optimizations. We qualitatively evaluate these features in terms of how much they prevent tangling and how well they provide programmer-control over implementation details. We conclude with remarks on potential research directions for studying how to produce efficient and maintainable parallel programs by separating algorithm from implementation.
International Parallel and Distributed Processing Symposium | 2014
Michelle Mills Strout; Fabio Luporini; Christopher D. Krieger; Carlo Bertolli; Gheorghe-Teodor Bercea; Catherine Olschanowsky; J. Ramanujam; Paul H. J. Kelly
Many scientific applications are organized in a data parallel way: as sequences of parallel and/or reduction loops. This exposes parallelism well, but does not convert data reuse between loops into data locality. This paper focuses on this issue in parallel loops whose loop-to-loop dependence structure is data-dependent due to indirect references such as A[B[i]]. Such references are a common occurrence in sparse matrix computations, molecular dynamics simulations, and unstructured-mesh computational fluid dynamics (CFD). Previously, sparse tiling approaches were developed for individual benchmarks to group iterations across such loops to improve data locality. These approaches were shown to benefit applications such as moldyn, Gauss-Seidel, and the sparse matrix powers kernel; however, the run-time routines for performing sparse tiling were hand-coded per application. In this paper, we present a generalized full sparse tiling algorithm that uses the newly developed loop chain abstraction as input, improves inter-loop data locality, and creates a task graph to expose shared-memory parallelism at runtime. We evaluate the overhead and performance impact of the generalized full sparse tiling algorithm on two codes: a sparse Jacobi iterative solver and the Airfoil CFD benchmark.
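A heavily simplified inspector in the spirit of sparse tiling is sketched below; the names and the contiguous seed-partition strategy are illustrative, not the paper's generalized algorithm. Loop 1 writes A[i] and loop 2 reads A[B[j]]; each loop-2 iteration joins the tile of the loop-1 iteration that produced its input, so a tile can execute its slices of both loops back to back while the shared elements of A are still in cache.

```python
def sparse_tile(n_loop1, B, n_tiles):
    """Seed-partition loop 1 into contiguous tiles, then place each
    loop-2 iteration j in the tile that wrote A[B[j]]."""
    size = -(-n_loop1 // n_tiles)               # ceiling division
    tile_of_loop1 = [i // size for i in range(n_loop1)]
    tile_of_loop2 = [tile_of_loop1[B[j]] for j in range(len(B))]
    return tile_of_loop1, tile_of_loop2
```

The generalized algorithm must additionally handle iterations that touch data from several tiles and arbitrary chain lengths, which is where the loop chain abstraction and the run-time task graph come in.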
Irregular Applications: Architectures and Algorithms | 2016
Thomas B. Rolinger; Tyler A. Simon; Christopher D. Krieger
Tensor decomposition, the higher-order analogue to singular value decomposition, has emerged as a useful tool for finding relationships in large, sparse, multidimensional data sets. As this technique matures and is applied to increasingly larger data sets, the need for high performance implementations becomes critical. In this work, we perform an objective empirical evaluation of three popular parallel implementations of the CANDECOMP/PARAFAC Alternating Least Squares (CP-ALS) tensor decomposition algorithm, namely SPLATT, DFacTo, and ENSIGN. We conduct performance studies across a variety of data sets, comparing the total memory required, the runtime, and the parallel scalability of each implementation. We find that the approach taken by SPLATT results in the fastest runtimes across the data sets, performing 5–22.64 times faster than the other tools. Additionally, SPLATT consumes 1.16–8.62 times less memory than the other tools. When tested on up to 20 cores or nodes, SPLATT using distributed memory parallelism exhibits the best strong scaling.
International Conference on Parallel Processing | 2010
Christopher D. Krieger; Michelle Mills Strout
Irregular scientific applications are difficult to parallelize in an efficient and scalable fashion due to indirect memory references (i.e. A[B[i]]), irregular communication patterns, and load balancing issues. In this paper, we present our experience parallelizing an irregular scientific application written in Java. The application is an N-Body molecular dynamics simulation that is the main component of a Java application called the Molecular Workbench (MW). We parallelized MW to run on multicore hardware using Java's java.util.concurrent library. Speedup was found to vary greatly depending on what type of force computation dominated the simulation. In order to understand the cause of this appreciable difference in scalability, various performance analysis tools were deployed. These tools include Intel's VTune, Apple's Shark, the Java Application Monitor (JaMON), and Sun's VisualVM. Virtual machine instrumentation as well as hardware performance monitors were used. To our knowledge this is the first such performance analysis of an irregular scientific application parallelized using Java threads. In the course of this investigation, a number of challenges were encountered. These difficulties in general stemmed from a mismatch between the nature of our application and either Java itself or the performance tools we used. This paper aims to share our real world experience with Java threading and today's parallel performance tools in an effort to influence future directions for the Java virtual machine, for the Java concurrency library, and for tools for multicore parallel software development.
IEEE High Performance Extreme Computing Conference | 2017
Thomas B. Rolinger; Tyler A. Simon; Christopher D. Krieger
Tensor decompositions, which are factorizations of multi-dimensional arrays, are becoming increasingly important in large-scale data analytics. A popular tensor decomposition algorithm is Canonical Decomposition/Parallel Factorization using alternating least squares fitting (CP-ALS). Tensors that model real-world applications are often very large and sparse, driving the need for high performance implementations of decomposition algorithms, such as CP-ALS, that can take advantage of many types of compute resources. In this work we present ReFacTo, a heterogeneous distributed tensor decomposition implementation based on DFacTo, an existing distributed memory approach to CP-ALS. DFacTo reduces the critical routine of CP-ALS to a series of sparse matrix-vector multiplications (SpMVs). ReFacTo leverages GPUs within a cluster via MPI to perform these SpMVs and uses OpenMP threads to parallelize other routines. We evaluate the performance of ReFacTo when using NVIDIA's GPU-based cuSPARSE library and compare it to an alternative implementation that uses Intel's CPU-based Math Kernel Library (MKL) for the SpMV. Furthermore, we provide a discussion of the performance challenges of heterogeneous distributed tensor decompositions based on the results we observed. We find that on up to 32 nodes, the SpMV of ReFacTo when using MKL is up to 6.8× faster than ReFacTo when using cuSPARSE.
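Since the critical routine is an SpMV over a sparse matrix in compressed sparse row (CSR) form, a serial reference version makes the operation concrete; this is a generic sketch of what each node would hand off to cuSPARSE or MKL, not ReFacTo's code.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x for a matrix A stored in compressed sparse row form:
    row r's nonzeros occupy vals[row_ptr[r]:row_ptr[r+1]], with their
    column indices in the matching slice of col_idx."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

The heterogeneous-performance question in the paper is essentially whether shipping this kernel's operands to a GPU pays for itself, given the highly irregular sparsity of real tensors.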
Languages and Compilers for Parallel Computing | 2012
Christopher D. Krieger; Michelle Mills Strout
Graph partitioners play an important role in many parallel work distribution and locality optimization approaches. Surprisingly, however, to our knowledge there is no freely available parallel graph partitioner designed for execution on a shared memory multicore system. This paper presents a shared memory parallel graph partitioner, ParCubed, for use in the context of sparse tiling run-time data and computation reordering. Sparse tiling is a run-time scheduling technique that schedules groups of iterations across loops together when they access the same data and one or more of the loops contains indirect array accesses. For sparse tiling, which is implemented with an inspector/executor strategy, the inspector needs to find an initial seed partitioning of adequate quality very quickly. We compare our presented hierarchical clustering partitioner, ParCubed, with GPart and METIS in terms of partitioning speed, partitioning quality, and the effect the generated seed partitions have on executor speed. We find that the presented partitioner is 25 to 100 times faster than METIS on a 16 core machine. The total edge cut of the partitioning generated by ParCubed was found not to exceed 1.27x that of the partitioning found by METIS.
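Two ingredients of the comparison above can be sketched generically: the edge-cut quality metric, and one level of the kind of hierarchical clustering a fast seed partitioner performs. Both functions below are illustrative toys under the assumption that the graph is an edge list; they are not ParCubed's algorithm.

```python
def edge_cut(edges, part):
    """Count edges whose endpoints land in different partitions --
    the quality metric used to compare partitioners."""
    return sum(1 for u, v in edges if part[u] != part[v])

def greedy_cluster(edges, n_vertices, max_size):
    """One level of hierarchical clustering: walk the edge list and
    merge endpoints into a shared cluster while clusters stay small."""
    part = list(range(n_vertices))          # each vertex starts alone
    size = {v: 1 for v in range(n_vertices)}
    for u, v in edges:
        pu, pv = part[u], part[v]
        if pu != pv and size[pu] + size[pv] <= max_size:
            for w in range(n_vertices):     # relabel pv's cluster
                if part[w] == pv:
                    part[w] = pu
            size[pu] += size.pop(pv)
    return part
```

A single greedy pass like this trades some edge-cut quality for speed, which mirrors the paper's finding: a much faster partitioner whose cut stays within a small factor of METIS's.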
International Parallel and Distributed Processing Symposium | 2018
Thomas B. Rolinger; Tyler A. Simon; Christopher D. Krieger