Gokcen Kestor
Pacific Northwest National Laboratory
Publications
Featured research published by Gokcen Kestor.
ieee international symposium on workload characterization | 2013
Gokcen Kestor; Roberto Gioiosa; Darren J. Kerbyson; Adolfy Hoisie
In the exascale era, the energy cost of moving data across the memory hierarchy is expected to be two orders of magnitude higher than the cost of performing a double-precision floating point operation. Despite its importance, the energy cost of data movement in scientific applications has not been quantitatively evaluated even for current systems.
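To put the two-orders-of-magnitude claim in perspective, the gap can be expressed with a simple per-level cost model; the per-access energies below are rough, commonly cited estimates assumed for illustration, not measurements from this paper.

```latex
% Illustrative cost model (assumed numbers, not from the paper):
% total data-movement energy is the sum over memory levels of
% access counts times per-access energy.
\[
  E_{\mathrm{move}} \;=\; \sum_{\ell \in \{\mathrm{L1},\,\mathrm{L2},\,\mathrm{L3},\,\mathrm{DRAM}\}} N_{\ell}\, e_{\ell}
\]
% With roughly 10 pJ per double-precision operation and on the order of
% 1 nJ per 64-bit word fetched from DRAM, the ratio is about
\[
  \frac{e_{\mathrm{DRAM}}}{e_{\mathrm{flop}}} \approx \frac{1000\ \mathrm{pJ}}{10\ \mathrm{pJ}} = 100,
\]
% i.e., the two orders of magnitude referred to in the abstract.
```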
ieee international conference on high performance computing data and analytics | 2015
Rizwan A. Ashraf; Roberto Gioiosa; Gokcen Kestor; Ronald F. DeMara; Chen-Yong Cher; Pradip Bose
Resiliency of exascale systems has quickly become an important concern for the scientific community. Despite its importance, much remains to be determined about how faults disseminate and at what rate they impact HPC applications. Understanding where and how fast faults propagate could lead to more efficient implementations of application-driven error detection and recovery. In this work, we propose a fault propagation framework to analyze how faults propagate in MPI applications and to understand their vulnerability to faults. We employ a combination of compiler-level code transformation and instrumentation, along with a runtime checker. Using the information provided by our framework, we apply machine learning techniques to derive application fault propagation models that can be used to estimate the number of corrupted memory locations at runtime.
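As a rough illustration of the runtime-checker side of such a framework (names and API are hypothetical, not the paper's implementation), compiler-inserted calls could record which memory locations hold values derived from a corrupted one, yielding the corrupted-location count that the learned models estimate:

```cpp
// Minimal sketch of a taint-style fault-propagation tracker (hypothetical API,
// not the paper's actual framework). Compiler-inserted calls would invoke
// on_store() after each instrumented store instruction.
#include <cstdint>
#include <cstdio>
#include <unordered_set>

static std::unordered_set<std::uintptr_t> corrupted;   // addresses holding corrupted data

// Called once when a fault is injected (or detected) at address 'addr'.
void mark_corrupted(void* addr) {
    corrupted.insert(reinterpret_cast<std::uintptr_t>(addr));
}

// Called by instrumentation after 'dst = f(src1, src2, ...)':
// the destination becomes corrupted if any source operand was.
void on_store(void* dst, const void* const* srcs, int nsrcs) {
    bool tainted = false;
    for (int i = 0; i < nsrcs; ++i)
        tainted |= corrupted.count(reinterpret_cast<std::uintptr_t>(srcs[i])) != 0;
    if (tainted) corrupted.insert(reinterpret_cast<std::uintptr_t>(dst));
    else         corrupted.erase(reinterpret_cast<std::uintptr_t>(dst));  // overwritten with clean data
}

// Periodically sampled counter: an input feature for a fault-propagation model.
size_t corrupted_locations() { return corrupted.size(); }

int main() {
    double a = 1.0, b = 2.0, c = 0.0;
    mark_corrupted(&a);                       // simulate a bit flip in 'a'
    const void* srcs[] = { &a, &b };
    c = a + b;                                // application computation
    on_store(&c, srcs, 2);                    // instrumentation call
    std::printf("corrupted locations: %zu\n", corrupted_locations());
}
```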
international conference on performance engineering | 2011
Gokcen Kestor; Vasileios Karakostas; Osman S. Unsal; Adrian Cristal; Ibrahim Hur; Mateo Valero
Transactional Memory (TM) has been proposed as an alternative concurrency mechanism for the shared memory parallel programming model. Its main goal is to make parallel programming for Chip Multiprocessors (CMPs) easier than using traditional lock synchronization constructs, without compromising performance and scalability. This topic has received substantial research attention and several TM designs have been proposed using various TM benchmarks. We believe that the evaluation of TM proposals would be more solid if it included realistic applications that address ongoing TM research issues and provide the potential for straightforward comparison against locks. In this paper, we introduce RMS-TM, a Transactional Memory benchmark suite composed of seven real-world applications from the Recognition, Mining and Synthesis (RMS) domain. In addition to featuring current TM research issues such as nesting and I/O and system calls inside transactions, the RMS-TM applications also provide a mix of short and long transactions with small/large read and write sets and low/medium/high contention rates. These characteristics, together with the lock-based versions of the applications we provide, make RMS-TM a useful TM tool; current TM benchmarks do not cover all these features. In our evaluation with selected STM and HTM systems, we find that our benchmark suite is also scalable, which is useful for evaluating TM designs on high core counts.
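The kind of lock/transaction pairing such a suite provides can be illustrated with a toy critical section; this is not RMS-TM code, and the transactional variant assumes GCC's experimental TM support (compile with -fgnu-tm):

```cpp
// Illustrative pairing of a lock-based and a transactional critical section,
// in the spirit of the dual versions RMS-TM provides (not actual RMS-TM code).
#include <mutex>

static long histogram[256];
static std::mutex hist_lock;

// Lock-based version: a traditional critical section.
void add_sample_locked(unsigned char bucket) {
    std::lock_guard<std::mutex> guard(hist_lock);
    ++histogram[bucket];
}

// Transactional version: the TM runtime detects and resolves conflicts,
// so updates to independent buckets can proceed without a global lock.
void add_sample_tm(unsigned char bucket) {
    __transaction_atomic {
        ++histogram[bucket];
    }
}
```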
international conference on parallel architectures and compilation techniques | 2011
Gokcen Kestor; Roberto Gioiosa; Tim Harris; Osman S. Unsal; Adrian Cristal; Ibrahim Hur; Mateo Valero
Extracting high performance from modern chip multithreading (CMT) processors is a complex task, especially for large CMT systems. Programmers must efficiently parallelize performance-critical software while avoiding deadlocks and race conditions. Transactional memory (TM) is a promising programming model that allows programmers to focus on parallelism rather than on maintaining correctness and avoiding deadlock. Software-only implementations (STMs) are especially compelling because they run on commodity hardware and therefore provide high portability. Unfortunately, STM systems usually suffer from high overheads, which may limit their usage, especially at scale. In this paper we present STM2, a novel parallel STM designed for high-performance, aggressively multithreaded systems. STM2 significantly lowers runtime overhead by offloading read-set validation, bookkeeping and conflict detection to auxiliary threads running on sibling hardware threads. Auxiliary threads perform STM operations in parallel with their paired application threads and absorb STM overhead, significantly improving performance. We exploit the fact that, on modern multi-core processors, sets of cores can share L1 or L2 caches. This lets us achieve closer coupling between the application thread and the auxiliary thread than on traditional multi-processor systems. Our experiments on an IBM POWER7 machine, a state-of-the-art, aggressively multithreaded system, show that our approach outperforms several well-known STM implementations. In particular, STM2 shows speedups between 1.8x and 5.2x over the tested STM systems, on average, with peaks up to 12.8x.
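A heavily simplified sketch of the offloading idea, assuming a hypothetical read-set log shared between an application thread and an auxiliary validator thread (not STM2's actual data structures):

```cpp
// Sketch of the offloading idea behind STM2 (hypothetical, heavily simplified):
// the application thread logs read-set entries; an auxiliary thread running on a
// sibling hardware thread validates them in parallel and raises a conflict flag.
#include <atomic>
#include <thread>
#include <vector>

struct ReadEntry { const std::atomic<int>* addr; int observed; };

static std::vector<ReadEntry> read_set;        // filled only by the application thread
static std::atomic<size_t> published{0};       // entries visible to the auxiliary thread
static std::atomic<bool> conflict{false};
static std::atomic<bool> tx_running{true};

// Application thread: record each transactional read, then continue computing.
int tm_read(const std::atomic<int>& loc) {
    int v = loc.load(std::memory_order_acquire);
    read_set.push_back({&loc, v});
    published.store(read_set.size(), std::memory_order_release);
    return v;
}

// Auxiliary thread: re-check logged reads while the transaction keeps running.
void validator() {
    size_t next = 0;
    while (tx_running.load(std::memory_order_acquire)) {
        size_t avail = published.load(std::memory_order_acquire);
        for (; next < avail; ++next) {
            const ReadEntry& e = read_set[next];
            if (e.addr->load(std::memory_order_acquire) != e.observed)
                conflict.store(true, std::memory_order_release);  // app thread aborts/retries
        }
    }
}

int main() {
    read_set.reserve(1024);            // fixed capacity so the log never reallocates under the validator
    std::atomic<int> x{42};
    std::thread aux(validator);
    int v = tm_read(x);                // transactional read, logged for asynchronous validation
    (void)v;
    tx_running.store(false, std::memory_order_release);
    aux.join();
    return conflict.load() ? 1 : 0;
}
```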
international parallel and distributed processing symposium | 2017
Ivy Bo Peng; Roberto Gioiosa; Gokcen Kestor; Pietro Cicotti; Erwin Laure; Stefano Markidis
Hardware accelerators have become a de-facto standard for achieving high performance on current supercomputers, and there are indications that this trend will continue. Modern accelerators feature high-bandwidth memory next to the computing cores. For example, the Intel Knights Landing (KNL) processor is equipped with 16 GB of high-bandwidth memory (HBM) that works alongside conventional DRAM. Theoretically, HBM can provide ∼4× higher bandwidth than conventional DRAM. However, many factors impact the effective performance achieved by applications, including the application memory access pattern, the problem size, the threading level and the actual memory configuration. In this paper, we analyze the Intel KNL system and quantify the impact of the most important factors on application performance using a set of applications that are representative of scientific and data-analytics workloads. Our results show that applications with regular memory access benefit from MCDRAM, achieving up to 3× the performance obtained using only DRAM. In contrast, applications with random memory access patterns are latency-bound and may suffer performance degradation when using only MCDRAM. For those applications, the use of additional hardware threads may help hide latency and achieve higher aggregated bandwidth when using HBM.
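As an illustration of explicit MCDRAM placement on such a system, a bandwidth-bound array could be allocated through the memkind library's hbwmalloc interface (assuming memkind is installed and the node exposes MCDRAM; the array and size are made up):

```cpp
// Sketch: placing a bandwidth-critical array in KNL's MCDRAM via the memkind
// library's hbwmalloc interface (link with -lmemkind). Falls back to DRAM when
// no high-bandwidth memory is available. Sizes and names are illustrative.
#include <hbwmalloc.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const size_t n = 1u << 20;                 // illustrative problem size
    double* a = nullptr;

    const bool have_hbm = (hbw_check_available() == 0);  // 0 means HBM is usable
    if (have_hbm) {
        a = static_cast<double*>(hbw_malloc(n * sizeof(double)));
        std::puts("streaming array placed in MCDRAM");
    } else {
        a = static_cast<double*>(std::malloc(n * sizeof(double)));
        std::puts("MCDRAM not available, using DRAM");
    }
    if (!a) return 1;

    for (size_t i = 0; i < n; ++i) a[i] = 1.0; // regular, bandwidth-bound access pattern

    if (have_hbm) hbw_free(a); else std::free(a);
    return 0;
}
```

Alternatively, on a flat-mode KNL node where MCDRAM typically appears as a separate NUMA node, an unmodified application can be bound to it wholesale (for example with numactl --membind), at the cost of losing per-object control.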
international symposium on memory management | 2017
Ivy Bo Peng; Roberto Gioiosa; Gokcen Kestor; Pietro Cicotti; Erwin Laure; Stefano Markidis
Traditional scientific and emerging data-analytics applications require fast, power-efficient, large, and persistent memories. Combining all these characteristics within a single memory technology is expensive, and hence future supercomputers will feature different memory technologies side by side. However, programming hybrid-memory systems and identifying the best object-to-memory mapping is a complex task. We envision that programmers will likely resort to default configurations that require only minimal interventions on the application code or system settings. In this work, we argue that intelligent, fine-grained data placement can achieve higher performance than default setups. We present an algorithm for data placement on hybrid-memory systems. Our algorithm is based on a set of single-object allocation rules and global data placement decisions. We also present RTHMS, a tool that implements our algorithm and provides recommendations about the object-to-memory mapping. Our experiments on a hybrid-memory system, an Intel Knights Landing processor with DRAM and HBM, show that RTHMS is able to achieve higher performance than the default configuration. We believe that RTHMS will be a valuable tool for programmers working on complex hybrid-memory systems.
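A toy rule set in the spirit of per-object placement decisions (purely illustrative, not the actual RTHMS algorithm or its thresholds):

```cpp
// Toy per-object placement rules for a KNL-like node with scarce HBM and large
// DRAM (illustrative only; not RTHMS's rules or thresholds).
#include <cstdio>

enum class Memory { HBM, DRAM };

struct ObjectProfile {
    const char* name;
    double size_gb;            // allocation size
    double accesses_per_byte;  // how hot the object is
    bool   streaming;          // regular (bandwidth-bound) vs random (latency-bound) access
};

Memory recommend(const ObjectProfile& o, double hbm_free_gb) {
    if (o.size_gb > hbm_free_gb) return Memory::DRAM;  // rule 1: must fit in the HBM budget
    if (!o.streaming)            return Memory::DRAM;  // rule 2: latency-bound objects gain little
    if (o.accesses_per_byte < 1) return Memory::DRAM;  // rule 3: cold data wastes scarce HBM
    return Memory::HBM;                                // hot, bandwidth-bound, and it fits
}

int main() {
    ObjectProfile objs[] = {
        {"field_grid",   4.0, 12.0, true},
        {"hash_index",   2.0,  6.0, false},
        {"checkpoint", 200.0,  0.1, true},
    };
    double hbm_free = 16.0;                            // KNL MCDRAM capacity in GB
    for (const ObjectProfile& o : objs) {
        Memory m = recommend(o, hbm_free);
        if (m == Memory::HBM) hbm_free -= o.size_gb;
        std::printf("%-12s -> %s\n", o.name, m == Memory::HBM ? "HBM" : "DRAM");
    }
}
```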
european conference on computer systems | 2014
Gokcen Kestor; Osman S. Unsal; Adrián Cristal; Serdar Tasiran
Transactional memory (TM) has reached a level of maturity, and programmers have started using this programming model to parallelize their applications. However, although much effort has been put into the development of TM systems, there is still a lack of debugging and development tools for TM applications, such as race detection tools. Previous definitions of transactional data race often impose constraints on the TM implementation or the programming language and cannot be widely applied to current STM designs. We propose a new definition of transactional data race that follows programmers' intuition of racy accesses, is independent of thread interleaving, can accommodate popular STM systems, and allows common programming idioms. Based on this definition, we design and implement T-Rex, a precise dynamic race detection tool for C/C++ TM programs. Using T-Rex, we discover transactional data races in STAMP applications that, to the best of our knowledge, have not been previously reported. Our experiments also show that T-Rex's runtime overhead is comparable to that of state-of-the-art lock-based race detection tools, despite the extra work required to handle transactional memory semantics.
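The kind of transactional data race targeted here can be illustrated with a contrived example (not taken from STAMP): one thread updates a location inside a transaction while another reads it with no synchronization at all; as in the earlier sketch, the transactional block assumes GCC's -fgnu-tm support.

```cpp
// Illustration of a transactional data race (hypothetical example):
// 'balance' is updated inside a transaction by one thread and read with no
// synchronization by another. Compile with -fgnu-tm -pthread.
#include <cstdio>
#include <thread>

static long balance = 0;

void deposit(long amount) {
    __transaction_atomic {          // conflicts with other transactions are handled by the TM runtime
        balance += amount;
    }
}

void audit() {
    long seen = balance;            // RACY: non-transactional read, invisible to the TM runtime
    std::printf("audited balance: %ld\n", seen);
}

int main() {
    std::thread t1(deposit, 100);
    std::thread t2(audit);          // may observe a stale or inconsistent value
    t1.join();
    t2.join();
    return 0;
}
```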
international parallel and distributed processing symposium | 2017
Gokcen Kestor; Sriram Krishnamoorthy; Wenjing Ma
Nested fork-join programs scheduled using work stealing can automatically balance load and adapt to changes in the execution environment. In this paper, we design an approach to efficiently recover from faults encountered by these programs. Specifically, we focus on localized recovery of the task space in the presence of fail-stop failures. We present an approach to efficiently track, under work stealing, the relationships between the work executed by various threads. This information is used to identify and schedule the tasks to be re-executed without interfering with normal task execution. The algorithm precisely computes the work lost, incurs minimal re-execution overhead, and can recover from an arbitrary number of failures. Experimental evaluation demonstrates low overheads in the absence of failures, recovery overheads on the same order as the lost work, and much lower recovery costs than alternative strategies.
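A minimal sketch of how steal relationships might be logged so that a failed thread's unfinished subtrees can be identified for re-execution (simplified, not the paper's tracking algorithm):

```cpp
// Minimal sketch of tracking steal relationships for localized recovery
// (simplified; not the paper's algorithm). Each steal records which thread took
// which subtree of the fork-join task space, so a failed thread's outstanding
// work can be identified and rescheduled on the survivors.
#include <cstdio>
#include <vector>

struct StealRecord {
    int victim;        // thread the work was taken from
    int thief;         // thread that now owns the work
    int task_id;       // root of the stolen subtree in the task tree
    bool completed;    // set when the thief finishes the subtree
};

static std::vector<StealRecord> steal_log;

void on_steal(int victim, int thief, int task_id) {
    steal_log.push_back({victim, thief, task_id, false});
}

void on_subtree_done(int task_id) {
    for (auto& r : steal_log) if (r.task_id == task_id) r.completed = true;
}

// On a fail-stop failure of thread 'failed', collect the stolen subtrees it had
// not finished; these are exactly the tasks that must be re-executed.
std::vector<int> lost_work(int failed) {
    std::vector<int> to_redo;
    for (const auto& r : steal_log)
        if (r.thief == failed && !r.completed) to_redo.push_back(r.task_id);
    return to_redo;
}

int main() {
    on_steal(0, 1, 7);             // thread 1 steals subtree 7 from thread 0
    on_steal(1, 2, 9);             // thread 2 steals subtree 9 from thread 1
    on_subtree_done(9);
    for (int t : lost_work(1)) std::printf("re-execute subtree %d\n", t);  // prints 7
}
```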
international conference on conceptual structures | 2016
Stefano Markidis; Ivy Bo Peng; Roman Iakymchuk; Erwin Laure; Gokcen Kestor; Roberto Gioiosa
Streaming computing models allow for on-the-fly processing of large data sets. With the increased demand for processing large amounts of data in a reasonable period of time, streaming models are more ...
Parallel Processing Letters | 2014
Roberto Gioiosa; Gokcen Kestor; Darren J. Kerbyson
To achieve exaFLOPS performance within a constrained power budget, next-generation supercomputers will feature hundreds of millions of components operating at low- and near-threshold voltage. As the probability that at least one of these components fails during the execution of an application approaches certainty, it seems unrealistic to expect that any run of a scientific application will be free of performance faults. We believe there is a need for a new generation of lightweight performance and debugging tools that can be used online, even during production runs of parallel applications, and that can identify performance anomalies during application execution. In this work we propose the design and implementation of a monitoring system that continuously inspects the evolution of running applications and the health of the system. To achieve minimal runtime overhead while maintaining the desired level of flexibility, we propose a decoupled approach in which accurate monitoring is performed at kernel level while performance-anomaly disambiguation and corrective actions are performed at user level. We evaluate our monitoring system on a 128-core AMD Interlagos system. First, we show that the runtime overhead of the monitoring system is negligible (0-2%). Then we show how our system can be used to precisely identify performance faults in three different scenarios: OS noise, application co-scheduling, and dynamic power capping.
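A user-level approximation of the decoupled design (the paper performs the accurate monitoring at kernel level; the counters and threshold here are assumptions for illustration):

```cpp
// User-level sketch of the decoupled monitoring idea: worker threads publish
// cheap progress counters, and a separate monitor thread samples them and flags
// suspicious slowdowns. The anomaly threshold is an assumed, crude heuristic.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

static std::atomic<unsigned long> heartbeat{0};
static std::atomic<bool> running{true};

void worker() {                                    // stands in for an application thread
    while (running.load(std::memory_order_relaxed))
        heartbeat.fetch_add(1, std::memory_order_relaxed);   // one unit of progress
}

void monitor() {
    unsigned long prev = 0, baseline = 0;
    while (running.load(std::memory_order_relaxed)) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        unsigned long cur = heartbeat.load(std::memory_order_relaxed);
        unsigned long rate = cur - prev;
        prev = cur;
        if (baseline == 0) baseline = rate;        // first sample sets the expected rate
        else if (rate < baseline / 2)              // crude anomaly threshold (assumed)
            std::printf("performance anomaly: rate %lu vs baseline %lu\n", rate, baseline);
    }
}

int main() {
    std::thread w(worker), m(monitor);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    running.store(false);
    w.join();
    m.join();
}
```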