Christos Kartsaklis | Researchain

Publication

Featured researches published by Christos Kartsaklis.

international parallel and distributed processing symposium | 2012

HERCULES: A Pattern Driven Code Transformation System

Christos Kartsaklis; Oscar R. Hernandez; Chung-Hsing Hsu; Thomas Ilsche; Wayne Joubert; Richard L. Graham

New parallel computers are emerging, but developing efficient scientific code for them remains difficult. A scientist must manage not only the science-domain complexity but also the performance-optimization complexity. HERCULES is a code transformation system designed to help the scientist separate the two concerns, which improves code maintenance and facilitates performance optimization. The system combines three technologies (code patterns, transformation scripts, and compiler plugins) to provide the scientist with an environment to quickly implement code transformations that suit their needs. Unlike existing code optimization tools, HERCULES focuses on user-level accessibility. In this paper we discuss the design, implementation, and an initial evaluation of HERCULES.

Facing the Multicore-Challenge II | 2012

Experiences with high-level programming directives for porting applications to GPUs

Oscar R. Hernandez; Wei Ding; Barbara M. Chapman; Christos Kartsaklis; Ramanan Sankaran; Richard L. Graham

HPC systems now exploit GPUs within their compute nodes to accelerate program performance. As a result, high-end application development has become extremely complex at the node level. In addition to restructuring the node code to exploit the cores and specialized devices, the programmer may need to choose a programming model such as OpenMP or CPU threads in conjunction with an accelerator programming model to share and manage the different node resources. This comes at a time when programmer productivity and the ability to produce portable code have been recognized as major concerns. In order to offset the high development cost of creating CUDA or OpenCL kernels, directives have been proposed for programming accelerator devices, but their implications are not well known. In this paper, we evaluate the state-of-the-art accelerator directives to program several application kernels, explore transformations to achieve good performance, and examine the expressivity and performance penalty of using high-level directives versus CUDA. We also compare our results to OpenMP implementations to understand the benefits of running the kernels on the accelerator versus CPU cores.

international conference on conceptual structures | 2014

Toward Better Understanding of the Community Land Model within the Earth System Modeling Framework

Dali Wang; Joseph Schuchart; Tomislav Janjusic; Frank Winkler; Yang Xu; Christos Kartsaklis

One key factor in the improved understanding of earth system science is the development and improvement of high fidelity models. Along with the deeper understanding of biogeophysical and biogeochemical processes, the software complexity of those earth system models becomes a barrier for further rapid model improvements and validation. In this paper, we present our experience on better understanding the Community Land Model (CLM) within an earth system modelling framework. First, we give an overview of the software system of the global offline CLM simulation. Second, we present our approach to better understand the CLM software structure and data structure using advanced software tools. After that, we focus on the practical issues related to CLM computational performance and individual ecosystem function. Since better software engineering practices are much needed for general scientific software systems, we hope those considerations can be beneficial to many other modeling research programs involving multiscale system dynamics.

design automation conference | 2016

A model-driven approach to warp/thread-block level GPU cache bypassing

Hongwen Dai; Chao Li; Huiyang Zhou; Saurabh Gupta; Christos Kartsaklis; Mike Mantor

The high amount of memory requests from massive threads may easily cause cache contention and cache-miss-related resource congestion on GPUs. This paper proposes a simple yet effective performance model to estimate the impact of cache contention and resource congestion as a function of the number of warps/thread blocks (TBs) to bypass the cache. Then we design a hardware-based dynamic warp/thread-block level GPU cache bypassing scheme, which achieves 1.68x speedup on average on a set of memory-intensive benchmarks over the baseline. Compared to prior works, our scheme achieves 21.6% performance improvement over SWL-best [29] and 11.9% over CBWT-best [4] on average.

international conference on parallel processing | 2014

HERCULES: Strong Patterns towards More Intelligent Predictive Modeling

Eunjung Park; Christos Kartsaklis; John Cavazos

Recent work has shown that program analysis techniques to select meaningful code features of programs are important in the task of deciding the best compiler optimizations. Although there are many successful state-of-the-art program analysis techniques, they often do not provide a simple method to extract the most expressive information about loops, especially when a target program is computationally intensive with complex loops and data dependencies. In this paper, we introduce a static technique to characterize a program using a pattern-driven system named HERCULES. This characterization technique not only helps a user to understand programs by searching for patterns of interest, but can also be used for a predictive model that effectively selects the proper compiler optimizations. We formulated 35 loop patterns, then evaluated our characterization technique by comparing the predictive models constructed using HERCULES to three other state-of-the-art characterization methods. We show that our models outperform three state-of-the-art program characterization techniques on two multicore systems in selecting the best optimization combination from a given loop transformation space. We achieved up to 67% of the best possible speedup achievable with the optimization search space we evaluated.

Proceedings of the 1st Workshop on Programming Language Evolution | 2014

HERCULES/PL: the pattern language of HERCULES

Christos Kartsaklis; Oscar R. Hernandez

Interrogating the structure of a program for patterns of interest is attractive to the broader spectrum of software engineering. The very approach by which a pattern is constructed remains a concern for the source code mining community. This paper presents a pattern programming model, for the C and Fortran programming languages, using a compiler directives approach. We discuss our specification, called HERCULES/PL, through a number of examples, show how different patterns can be constructed, and present some preliminary results.

ieee international conference on high performance computing data and analytics | 2012

Trace Driven Data Structure Transformations

Tomislav Janjusic; Krishna M. Kavi; Christos Kartsaklis

As the complexity of scientific codes and computational hardware increases, it is increasingly important to study the effects of data-structure layouts on program memory behavior. Program structure layouts affect memory performance differently; therefore, we need the capability to effectively study such transformations without the need to rewrite application codes. Trace-driven simulations are an effective and convenient mechanism to simulate program behavior at various granularities. During an application's execution, a tool known as a tracer or profiler collects program flow data and records program instructions. The trace file consists of tuples that associate each program instruction with program internal variables. In this paper we outline a proof-of-concept mechanism to apply data-structure transformations during trace simulation and observe effects on memory without the need to manually transform an application's code.

international conference on conceptual structures | 2015

Glprof: A Gprof Inspired, Callgraph-oriented Per-object Disseminating Memory Access Multi-cache Profiler

Tomislav Janjusic; Christos Kartsaklis

Application analysis is facilitated through a number of program profiling tools. The tools vary in their complexity, ease of deployment, design, and profiling detail. Specifically, understanding, analyzing, and optimizing are of particular importance for scientific applications, where minor changes in code paths and data-structure layout can have profound effects. Understanding how intricate data-structures are accessed and how a given memory system responds is a complex task. In this paper we describe a trace profiling tool, Glprof, specifically aimed at lessening the burden on the programmer of pinpointing heavily involved data-structures during an application's run-time and understanding data-structure run-time usage. Moreover, we showcase the tool's modularity using additional cache simulation components. We elaborate on the tool's design and features. Finally, we demonstrate the application of our tool in the context of SPEC benchmarks using the Glprof profiler and two concurrently running cache simulators, PPC440 and AMD Interlagos.

ieee international conference on high performance computing data and analytics | 2014

HSLOT: the HERCULES scriptable loop transformations engine

Christos Kartsaklis; Eunjung Park; John Cavazos

HSLOT arms users with a rich set of configurable transformation directives, to be used as they are or to be specialized and combined into powerful custom transformations. We offer a plethora of loop transformations, which includes both the classic set (unroll, fuse, fission, tile, and so on) and unique ones (specialize, swap nest, split, fork, and so on) that are not found in other state-of-the-art systems. We show how HSLOT enables more transformations, such as merging two loops that cannot be fused because of data dependencies, and how HSLOT can be used in a simple and systematic fashion to improve memory accesses and expose better parallelism. To use our system, users simply annotate loops with the transformation sequence and compile with our Open64-based, HSLOT-implementing Fortran compiler, HSLF90, which produces object files and, optionally, source. We describe our experimental results using a set of scientific kernels written in Fortran with HSLOT directives on an AMD 32-core system.

Proceedings of the Second Workshop on Optimizing Stencil Computations | 2014

Trace-Driven Memory Access Pattern Recognition in Computational Kernels

Eunjung Park; Christos Kartsaklis; Tomislav Janjusic; John Cavazos

Classifying memory access patterns is paramount to the selection of the right set of optimizations and determination of the parallelization strategy. Static analyses suffer from ambiguities present in source code, which modern compilation techniques, such as profile-guided optimization, alleviate by observing runtime behavior and feeding back into the compilation flow. This paper discusses a dynamic analysis technique for recognizing memory access patterns, with application to the stencils domain, and presents our design and C++ implementation using the memory-tracing tool Gleipnir. Finally, we evaluate and discuss the performance and matching capability of our classifiers in the context of the Polybench scientific benchmark suite, which includes both stencil and matrix computations.
