Konstantinos Krommydas
Virginia Tech
Publication
Featured research published by Konstantinos Krommydas.
signal processing systems | 2016
Konstantinos Krommydas; Wu-chun Feng; Christos D. Antonopoulos; Nikolaos Bellas
The proliferation of heterogeneous computing platforms presents the parallel computing community with new challenges. One such challenge entails evaluating the efficacy of such parallel architectures and identifying the architectural innovations that ultimately benefit applications. To address this challenge, we need benchmarks that capture the execution patterns (i.e., dwarfs or motifs) of applications, both present and future, in order to guide future hardware design. Furthermore, we desire a common programming model for the benchmarks that facilitates code portability across a wide variety of different processors (e.g., CPU, APU, GPU, FPGA, DSP) and computing environments (e.g., embedded, mobile, desktop, server). As such, we present the latest release of OpenDwarfs, a benchmark suite that currently realizes the Berkeley dwarfs in OpenCL, a vendor-agnostic and open-standard computing language for parallel computing. Using OpenDwarfs, we characterize a diverse set of modern fixed and reconfigurable parallel platforms: multi-core CPUs, discrete and integrated GPUs, the Intel Xeon Phi co-processor, as well as an FPGA. We describe the computation and communication patterns exposed by a representative set of dwarfs, obtain relevant profiling data and execution information, and draw conclusions that highlight the complex interplay between dwarfs’ patterns and the underlying hardware architecture of modern parallel platforms.
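To make the idea of a "dwarf" concrete, here is a sketch of one of the Berkeley dwarfs, sparse linear algebra, as a compressed-sparse-row (CSR) sparse-matrix-vector multiply. This is illustrative only; the actual OpenDwarfs kernels are written in OpenCL, not Python.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix A stored in compressed sparse row (CSR) form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):                          # rows are independent --
        for j in range(row_ptr[i], row_ptr[i + 1]):  # the natural parallel axis
            y[i] += values[j] * x[col_idx[j]]
    return y

# A = [[10, 0, 2],
#      [ 0, 3, 0],
#      [ 1, 0, 4]]
values  = [10.0, 2.0, 3.0, 1.0, 4.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [12.0, 3.0, 5.0]
```

The irregular, input-dependent memory access through `col_idx` is precisely what makes this dwarf's behavior differ so much across CPUs, GPUs, and FPGAs.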
application-specific systems, architectures, and processors | 2016
Konstantinos Krommydas; Ruchira Sasanka; Wu-chun Feng
Programming FPGAs has been an arduous task that requires extensive knowledge of hardware design languages (HDLs), such as Verilog or VHDL, and low-level hardware details. With OpenCL support for FPGAs, the design, prototyping, and implementation of an FPGA are increasingly moving towards a much higher level of abstraction, when compared to the intrinsically low-level nature of HDLs. On the other hand, in the context of traditional (i.e., CPU) software development, OpenCL is still considered to be low-level and complex because the programmer needs to manually expose parallelism in the code. In this work, we present our approach to enhancing FPGA programmability via GLAF, a visual programming framework, to automatically generate synthesizable OpenCL code with an array of FPGA-specific optimizations. We find that our tool facilitates the development process and produces functionally correct and well-performing code on the FPGA for our molecular modeling, gene sequence search, and filtering algorithms.
international conference on parallel and distributed systems | 2013
Konstantinos Krommydas; Thomas R. W. Scogland; Wu-chun Feng
General-purpose computing on an ever-broadening array of parallel devices has led to an increasingly complex and multi-dimensional landscape with respect to programmability and performance optimization. The growing diversity of parallel architectures presents many challenges to the domain scientist, including device selection, programming model, and level of investment in optimization. All of these choices influence the balance between programmability and performance. In this paper, we characterize the performance achievable across a range of optimizations, along with their programmability, for multi- and many-core platforms - specifically, an Intel Sandy Bridge CPU, Intel Xeon Phi co-processor, and NVIDIA Kepler K20 GPU - in the context of an n-body, molecular-modeling application called GEM. Our systematic approach to optimization delivers implementations with speed-ups of 194.98×, 885.18×, and 1020.88× on the CPU, Xeon Phi, and GPU, respectively, over the naive serial version. Beyond the speed-ups, we characterize the incremental optimization of the code from naive serial to fully hand-tuned on each platform through four distinct phases of increasing complexity to expose the strengths and weaknesses of the programming models offered by each platform.
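The naive serial baseline that the speed-ups above are measured against has the shape of a pairwise n-body sum. A minimal sketch in that spirit, assuming a simplified Coulomb potential (the actual GEM electrostatic model is more elaborate): the potential at each surface vertex accumulates contributions from every atomic charge, an O(V×A) doubly nested loop.

```python
import math

def potential(vertices, atoms):
    """atoms: list of (x, y, z, charge); returns the potential at each vertex."""
    result = []
    for vx, vy, vz in vertices:
        phi = 0.0
        for ax, ay, az, q in atoms:          # every vertex visits every atom:
            r = math.dist((vx, vy, vz), (ax, ay, az))
            phi += q / r                     # simplified 1/r Coulomb term
        result.append(phi)
    return result

# One vertex at the origin, one unit charge at distance 5:
print(potential([(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0, 1.0)]))  # [0.2]
```

Because each vertex's sum is independent, this loop nest parallelizes naturally across vertices, which is what the CPU, Xeon Phi, and GPU optimizations in the paper progressively exploit.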
field programmable custom computing machines | 2016
Konstantinos Krommydas; Ahmed E. Helal; Anshuman Verma; Wu-chun Feng
For decades, the streaming architecture of FPGAs has delivered accelerated performance across many application domains, such as option pricing solvers in finance, computational fluid dynamics in oil and gas, and packet processing in network routers and firewalls. However, this performance has come at the significant expense of programmability, i.e., the performance-programmability gap. In particular, FPGA developers use a hardware design language (HDL) to implement the application data path and to design hardware modules for computation pipelines, memory management, synchronization, and communication. This process requires extensive low-level knowledge of the target FPGA architecture and consumes significant development time and effort. To address this lack of programmability of FPGAs, OpenCL provides an easy-to-use and portable programming model for CPUs, GPUs, APUs, and now, FPGAs. However, this significantly improved programmability can come at the expense of performance, that is, there still remains a performance-programmability gap. To improve the performance of OpenCL kernels on FPGAs, and thus, bridge the performance-programmability gap, we apply and evaluate the effect of various optimization techniques on GEM, an N-body method from the OpenDwarfs benchmark suite.
international conference on parallel processing | 2015
Konstantinos Krommydas; Ruchira Sasanka; Wu-chun Feng
The computing revolution of the past decades has delivered parallel hardware to the masses. However, the ability to exploit its capabilities and ignite scientific breakthroughs at a proportionate level remains a challenge due to the lack of parallel programming expertise. Although different solutions have been proposed to facilitate harvesting the seeds of parallel computing, most target seasoned programmers and ignore the special nature of a target audience like domain experts. This paper addresses the challenge of realizing a programming abstraction and implementing an integrated development framework for this audience. We present GLAF -- a grid-based language and auto-parallelizing, auto-tuning framework. Its key elements are its intuitive visual programming interface, which attempts to render expressing and validating an algorithm easier for domain experts, and its ability to automatically generate efficient serial and parallel Fortran and C code, including potentially beneficial code modifications (e.g., with respect to data layout). We find that the above features help novice programmers avoid common programming pitfalls and provide fast implementations.
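A sketch of the kind of data-layout modification mentioned above: the classic array-of-structs (AoS) to struct-of-arrays (SoA) transformation. SoA keeps each field contiguous in memory, which typically vectorizes and caches better in generated C or Fortran; the Python below only illustrates the reshaping itself, not GLAF's actual code generator.

```python
def aos_to_soa(records):
    """[{'x': .., 'y': ..}, ...] -> {'x': [...], 'y': [...]}"""
    if not records:
        return {}
    # One contiguous list per field, instead of one record per element:
    return {key: [r[key] for r in records] for key in records[0]}

aos = [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}]
print(aos_to_soa(aos))  # {'x': [1.0, 3.0], 'y': [2.0, 4.0]}
```

A loop that reads only `x` then touches a dense `x` array rather than striding past unused `y` values, which is why an auto-parallelizing framework may emit this layout automatically.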
field-programmable custom computing machines | 2013
Konstantinos Krommydas; Muhsen Owaida; Christos D. Antonopoulos; Nikolaos Bellas; Wu-chun Feng
We present a hardware architecture for heapsort, as employed in the subband-coding block of a wavelet-based image coder known as the Oktem image coder. Although this coder provides good image quality, the sorting is time consuming and application specific: it is invoked repeatedly on different volumes of data within the subband coding, so a simple hardware implementation with a fixed sorting capacity is difficult to scale at runtime. To tackle this problem, both time/power efficiency and sorting-size flexibility must be taken into account. We propose an improved FPGA heapsort architecture, based on Zabolotny's work, as an IP accelerator for the image coder. The architecture is configurable: adaptive layer-enable elements allow the sorting capacity to be adjusted at runtime to efficiently sort different amounts of data. With adaptive memory shutdown, our improved architecture provides up to 20.9% power reduction in the memories compared to the baseline implementation. Moreover, it delivers a 13x speedup over an ARM Cortex-A9.
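For reference, the algorithm at the core of the architecture above is standard heapsort. A compact software version (the FPGA design maps the heap's levels onto pipelined memory layers; this sketch shows only the algorithm itself):

```python
def sift_down(a, start, end):
    """Restore the max-heap property for the subtree rooted at `start`."""
    root = start
    while 2 * root + 1 <= end:
        child = 2 * root + 1
        if child + 1 <= end and a[child] < a[child + 1]:
            child += 1                     # pick the larger child
        if a[root] < a[child]:
            a[root], a[child] = a[child], a[root]
            root = child
        else:
            return

def heapsort(a):
    n = len(a)
    for start in range(n // 2 - 1, -1, -1):   # build a max-heap in place
        sift_down(a, start, n - 1)
    for end in range(n - 1, 0, -1):           # repeatedly extract the maximum
        a[0], a[end] = a[end], a[0]
        sift_down(a, 0, end - 1)
    return a

print(heapsort([5, 1, 4, 2, 3]))  # [1, 2, 3, 4, 5]
```

Each heap level touches a disjoint slice of the array, which is what makes the level-per-memory-layer hardware mapping, and the per-layer shutdown when sorting small inputs, natural.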
international conference on multimedia and expo | 2010
Konstantinos Krommydas; George Tsoublekas; Christos D. Antonopoulos; Nikolaos Bellas
Modern multimedia workloads provide increased levels of quality and compression efficiency at the expense of substantially increased computational complexity. It is important to leverage the off-the-shelf emerging multi-core processor architectures and exploit all levels of parallelism of such workloads in order to achieve real-time functionality at a reasonable cost. This paper presents the implementation, optimization, and characterization of the AVS video decoder on the Intel Core i7, a quad-core, hyper-threaded chip multiprocessor (CMP). AVS (Audio Video Standard), a new compression standard from China, is competing with H.264 to potentially replace MPEG-2, mainly in the Chinese market. We show that it is necessary to perform a series of software optimizations and exploit parallelism at different levels in order to achieve FullHD real-time functionality. The input-dependent variability of execution time per work chunk is addressed using dynamic scheduling to allocate work to each thread. Moreover, we evaluate the interaction of the application with the i7 CMP architecture using both high- and low-level performance metrics. Finally, we evaluate a new feature of Intel's i7 micro-architecture called Turbo Boost, which dynamically varies the frequencies of non-idling cores to optimize performance.
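The dynamic-scheduling idea above can be sketched as a shared work queue from which threads pull chunks as they finish, so slow chunks do not stall an otherwise idle thread. This is a generic illustration of the pattern, not the decoder's actual code; the squaring step is just a stand-in for variable-cost decode work.

```python
import queue
import threading

def worker(work, results, lock):
    """Pull chunks until the queue is drained; no static assignment."""
    while True:
        try:
            chunk = work.get_nowait()
        except queue.Empty:
            return
        processed = chunk * chunk          # stand-in for decoding the chunk
        with lock:
            results.append(processed)

work = queue.Queue()
for chunk in range(1, 9):                  # e.g., eight macroblock rows
    work.put(chunk)

results, lock = [], threading.Lock()
threads = [threading.Thread(target=worker, args=(work, results, lock))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [1, 4, 9, 16, 25, 36, 49, 64]
```

With a static split, a thread that drew the expensive chunks would become the critical path; the shared queue lets whichever thread is free take the next chunk, which is the load-balancing property the paper relies on.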
international conference on multimedia and expo | 2011
Konstantinos Krommydas; Christos D. Antonopoulos; Nikolaos Bellas; Wu-chun Feng
Newer video compression standards provide higher video quality and greater compression efficiency than their predecessors. Their increased complexity can be outbalanced by leveraging all the levels of available parallelism, task- and data-level, using available off-the-shelf hardware, such as current-generation chip multiprocessors. As we move to more cores, though, scalability issues arise and need to be tackled in order to take advantage of the abundant computational power. In this paper we evaluate a previously implemented parallel version of the AVS video decoder on the experimental 32-core Intel Manycore Testing Lab. We examine this previous version's performance bottlenecks and scalability issues and introduce a distributed queue implementation as the proposed solution. Finally, we provide insight on separate optimizations regarding inter macroblocks and investigate performance variations and tradeoffs when combined with a distributed queue scheme.
international conference on parallel processing | 2018
Konstantinos Krommydas; Paul Sathre; Ruchira Sasanka; Wu-chun Feng
GLAF, short for Grid-based Language and Auto-parallelization Framework, is a programming framework that seeks to democratize parallel programming by facilitating better productivity in parallel computing via an intuitive graphical programming interface (GPI) that automatically parallelizes and generates code in many languages. Originally, GLAF addressed program development from scratch via the GPI, but this unduly restricted GLAF's utility to creating new codes only. Thus, this paper extends GLAF by enabling program development from pre-existing kernels of interest, which can then be easily and transparently integrated into existing legacy codes. Specifically, we address the theoretical and practical limitations of integration and interoperability of auto-generated parallel code within existing FORTRAN codes; enhance GLAF to overcome these limitations; and present an integrative case study and evaluation of the enhanced GLAF via the implementation of important kernels in two NASA codes: (1) the Synoptic Surface & Atmospheric Radiation Budget (SARB), part of the Clouds and the Earth's Radiant Energy System (CERES), and (2) the Fully Unstructured Navier-Stokes (FUN3D) suite for computational fluid dynamics.
international parallel and distributed processing symposium | 2015
Rubasri Kalidas; Mayank Daga; Konstantinos Krommydas; Wu-chun Feng
Graphics processing units (GPUs) have delivered promising speedups in data-parallel applications. A discrete GPU resides on the PCIe interface and has traditionally required data to be moved from the host memory to the GPU memory via PCIe. In certain applications, the overhead of these data transfers between memory spaces can nullify any performance gains achieved from faster computation on the GPU. Recent advances allow GPUs to directly access data from the host memory across the PCIe bus, thereby alleviating the data-transfer bottlenecks. Another class of accelerators called accelerated processing units (APUs) mitigate data-transfer overhead by placing CPU and GPU cores on the same physical die. However, APUs in the current form provide several different data paths between the CPU and GPU, all of which can differently affect application performance. In this paper, we explore the effects of different available data paths on both GPUs and APUs in the context of a broader set of computation and communication patterns commonly referred to as dwarfs.