Sam Skalicky
Rochester Institute of Technology
Publications
Featured research published by Sam Skalicky.
Reconfigurable Computing and FPGAs | 2013
Sam Skalicky; Christopher A. Wood; Marcin Łukowiak; Matthew Ryan
One of the pitfalls of FPGA design is the relatively long implementation time compared to alternative architectures such as CPUs, GPUs, or DSPs. This time can be greatly reduced, however, by using tools that generate hardware systems in the form of a hardware description language (HDL) from high-level languages such as C, C++, or Python. Such implementations can be optimized by applying special directives that focus the high-level synthesis (HLS) effort on particular objectives, such as performance, area, throughput, or power consumption. In this paper we examine the benefits of this approach by comparing the performance and design times of HLS-generated systems versus custom systems for matrix multiplication. We investigate matrix multiplication using a standard algorithm, the Strassen algorithm, and a sparse algorithm to provide a comprehensive analysis of the capabilities and usability of the Xilinx Vivado HLS tool. In our experience, a hardware-oriented electrical engineering student can achieve up to 61% of the performance of custom designs with one third of the effort, thus enabling faster hardware acceleration of many compute-bound algorithms.
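The directive-driven flow the abstract describes takes ordinary C/C++ loop nests annotated with synthesis pragmas. As a rough, hypothetical sketch (the kernel size, pragma choices, and memory layout below are assumptions for illustration, not the paper's actual designs), a standard matrix multiply written for Vivado HLS might look like this:

```cpp
// Illustrative Vivado HLS matrix-multiply kernel with optimization
// directives. The fixed size N and the specific pragmas are invented
// for this example; the paper's designs are not reproduced here.
#define N 64

void matmul(const float A[N][N], const float B[N][N], float C[N][N]) {
    // Partition arrays across block RAMs so the unrolled inner loop
    // can read one element per partition each cycle.
    #pragma HLS ARRAY_PARTITION variable=A complete dim=2
    #pragma HLS ARRAY_PARTITION variable=B complete dim=1
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            // Pipelining here directs HLS to optimize for throughput;
            // other directives trade area or power instead.
            #pragma HLS PIPELINE II=1
            float acc = 0;
            for (int k = 0; k < N; k++) {
                #pragma HLS UNROLL
                acc += A[i][k] * B[k][j];
            }
            C[i][j] = acc;
        }
    }
}
```

Changing the directives, for instance removing the unroll to save area, retargets the same source at a different objective without rewriting any HDL.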
Applied Reconfigurable Computing | 2013
Sam Skalicky; Sonia Martín López; Marcin Łukowiak; James Letendre; Matthew Ryan
The potential design space of FPGA accelerators is very large. The factors that define the performance of a particular implementation include the architecture design, the number of pipelines, and the memory bandwidth. In this paper we present a mathematical model that, based on these factors, predicts the computation time of pipelined FPGA accelerators. This model can be used to quickly explore the design space without any implementation or simulation. We evaluate the model, its usefulness, and its ability to identify bottlenecks and improve performance. Linear algebra computations lie at the core of many compute-intensive applications and are the main contributors to their total execution time. Hence, five relevant linear algebra computations are selected and analyzed, and the accuracy of the model is validated against implemented designs.
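The abstract does not reproduce the model, but the general shape of such analytical models can be sketched: a pipelined accelerator is bound either by compute throughput or by memory bandwidth, whichever is worse, plus pipeline fill latency. The function below is a hedged illustration of that style of model, not the paper's formulation, and every value in main is invented:

```cpp
#include <algorithm>
#include <cstdio>

// Illustrative analytical model of a pipelined accelerator: predicted
// time is the larger of the compute bound and the memory bound, plus
// the time to fill the pipeline. Not the paper's actual model.
double predict_seconds(double ops,         // total operations
                       double bytes,       // total data moved
                       int    pipelines,   // parallel pipelines
                       double clock_hz,    // pipeline clock
                       double bw_bytes_s,  // memory bandwidth
                       int    depth)       // pipeline depth (cycles)
{
    double compute = ops / (pipelines * clock_hz); // one result/cycle/pipe
    double memory  = bytes / bw_bytes_s;
    return std::max(compute, memory) + depth / clock_hz;
}

int main() {
    // Hypothetical 1024x1024 matrix-vector multiply on 8 pipelines.
    double n = 1024.0;
    printf("predicted: %.6f s\n",
           predict_seconds(2 * n * n, 4 * (n * n + 2 * n),
                           8, 200e6, 12.8e9, 32));
}
```

Because the inputs are just counts and rates, a model of this kind can sweep pipeline counts or bandwidths across the design space in microseconds, which is the exploration-without-implementation use the paper targets.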
Application-Specific Systems, Architectures, and Processors | 2013
Sam Skalicky; Sonia Martín López; Marcin Łukowiak; James Letendre; David Gasser
One of the main challenges of using heterogeneous systems is finding the computation-to-hardware assignments that maximize overall application performance. The important computational factors that must be taken into account include algorithmic complexity, exploitable parallelism, memory bandwidth requirements, and data size. To achieve high performance, a hardware platform is chosen to satisfy the needs of a computation with corresponding architectural features such as clock speed, number of parallel computational units, and memory bandwidth. In this paper five linear algebra computations that are commonly found in compute-intensive applications are selected and evaluated in terms of performance on CPU, GPU, and FPGA platforms across a wide range of data sizes. The results are used to provide guidelines for selecting the best-performing hardware platform based on these computational factors. Using a cutting-edge signal processing application as a case study, we demonstrate the importance of making intelligent computation assignments for improved performance. Our experimental results show that a properly implemented heterogeneous system achieves speedups of up to 39× and 3.8× compared to CPU-only and GPU-only systems, respectively.
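As a purely hypothetical illustration of such guidelines (the thresholds and the rule structure are invented, not the paper's), a per-computation dispatch might encode the crossover points measured in the benchmarks:

```cpp
#include <cstddef>

enum class Platform { CPU, GPU, FPGA };

// Invented dispatch rule in the spirit of the paper's guidelines: pick
// a platform per computation from measured crossover points. The paper
// derives real thresholds by benchmarking each kernel across data sizes.
Platform pick_platform(std::size_t n, bool highly_parallel) {
    if (n < 256)         return Platform::CPU;  // transfer overhead dominates
    if (highly_parallel) return Platform::GPU;  // wide data parallelism wins
    return Platform::FPGA;                      // deep custom pipelines win
}
```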
Reconfigurable Computing and FPGAs | 2013
Sam Skalicky; Sonia Martín López; Marcin Łukowiak
One of the main challenges of using cutting-edge medical imaging applications in the clinical setting is the large amount of data processing required. Many of these applications are based on linear algebra computations operating on large data sizes, and their execution may require days on a standard CPU. Distributed heterogeneous systems are capable of improving the performance of applications by using the right computation-to-hardware mapping. To achieve high performance, hardware platforms are chosen to satisfy the needs of each computation with corresponding architectural features such as clock speed, number of parallel computational units, and memory bandwidth. In this paper we evaluate the performance benefits of using different hardware platforms to accelerate the execution of a transmural electrophysiological imaging algorithm, targeting a standard CPU with GPU and FPGA accelerators. Using this cutting-edge medical imaging application as a case study, we demonstrate the importance of making intelligent computation assignments for improved performance. We show that, depending on the size of the data structures the application works with, using an FPGA to run certain computations can make a big difference: a heterogeneous system with all three hardware platforms (CPU+GPU+FPGA) can cut the execution time in half compared to the best result using a single accelerator (CPU+GPU). In addition, our experimental results show that combining CPU, GPU, and FPGA platforms in a single system achieves speedups of up to 62×, 2×, and 1605× compared to systems with a single CPU, GPU, or FPGA platform, respectively.
Reconfigurable Computing and FPGAs | 2015
Sam Skalicky; Tejaswini Ananthanarayana; Sonia Martín López; Marcin Łukowiak
In this paper we propose a new degree of flexibility for soft processor design in which only the instructions relevant to the task at hand are implemented, as a subset of the Instruction Set Architecture (ISA). These customized processors execute software kernels in the usual way, yet can be implemented with a fraction of the hardware resources used by other full-ISA soft processor cores. We present a design methodology for such customized-ISA processors in which the functional description of the processor is written in a high-level language and the hardware implementation is obtained using high-level synthesis (HLS) tools. We investigate the potential and limitations of HLS tools to take processor simulator-like implementations in C/C++ and produce functional hardware processor architectures. We apply this methodology to two relevant applications, a basic linear algebra application and a cryptographic application, chosen for their distinct features. Our results demonstrate that these customized processors utilize significantly fewer hardware resources than the Xilinx MicroBlaze requires, with reasonable performance degradation, using HLS-generated hardware exclusively.
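To make the "simulator-like implementation" idea concrete, here is a hedged sketch of the style of C++ input involved; the three-instruction subset, the encoding, and all sizes are invented for illustration and are not the paper's processors:

```cpp
#include <cstdint>

// Invented example of a simulator-style processor description that HLS
// tools could turn into hardware. Only the opcodes a given kernel needs
// are implemented, so unused ISA logic never consumes FPGA resources.
enum Op : uint8_t { ADD = 0, MUL = 1, HALT = 2 };

void tiny_cpu(const uint32_t imem[256], int32_t regs[16]) {
    uint32_t pc = 0;
    while (true) {
        #pragma HLS PIPELINE
        uint32_t insn = imem[pc++];        // fetch
        uint8_t  op = insn >> 24;          // decode: op | rd | ra | rb
        uint8_t  rd = (insn >> 16) & 0xF;
        uint8_t  ra = (insn >> 8)  & 0xF;
        uint8_t  rb =  insn        & 0xF;
        switch (op) {                      // execute the subset ISA
            case ADD:  regs[rd] = regs[ra] + regs[rb]; break;
            case MUL:  regs[rd] = regs[ra] * regs[rb]; break;
            case HALT: return;
            default:   return;             // unimplemented opcode
        }
    }
}
```

Each case in the switch becomes datapath logic, so dropping an instruction from the enum directly shrinks the synthesized core, which is the resource saving the results describe.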
High Performance Computing and Communications | 2015
Sam Skalicky; Sonia Martín López; Marcin Łukowiak; Andrew G. Schmidt
Compute-intensive applications place ever-increasing data processing demands on hardware systems. Many of these applications have only recently become feasible thanks to the increasing computing power of modern processors. The Matlab language is uniquely situated to support the description of these compute-intensive scientific applications, and consequently has been continuously improved to provide increasing computational support in the form of multithreading for CPUs and the use of accelerators such as GPUs and FPGAs. However, taking advantage of the computational support in these heterogeneous systems requires a wide breadth of knowledge, spanning from the problem domain to the computer architecture. In this work, we present a framework for the development of compute-intensive scientific applications in Matlab using heterogeneous processor systems. We investigate systems containing CPUs, GPUs, and FPGAs. We leverage the capabilities of Matlab and supplement them by automating the mapping, scheduling, and parallel code generation. Our experimental results on a set of benchmarks show speedups of 20× to 60× compared to the standard Matlab CPU environment, with minimal effort required on the part of the user.
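A minimal, hypothetical sketch of the mapping and scheduling step such a framework automates (the greedy heuristic, kernel count, and cost table here are invented; a real scheduler also models data transfers and dependencies):

```cpp
#include <array>
#include <cstdio>

// Invented greedy mapper: assign each kernel to whichever platform
// would finish it earliest, given per-platform cost estimates.
int main() {
    const char* names[3] = {"CPU", "GPU", "FPGA"};
    // cost[k][p]: estimated seconds for kernel k on platform p (made up)
    double cost[4][3] = {{1.0, 0.2, 0.4},
                         {0.5, 0.6, 0.1},
                         {2.0, 0.3, 0.9},
                         {0.8, 0.9, 0.2}};
    std::array<double, 3> busy{};  // time each platform is occupied
    for (int k = 0; k < 4; ++k) {
        int best = 0;
        for (int p = 1; p < 3; ++p)
            if (busy[p] + cost[k][p] < busy[best] + cost[k][best]) best = p;
        busy[best] += cost[k][best];
        printf("kernel %d -> %s\n", k, names[best]);
    }
}
```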
Reconfigurable Computing and FPGAs | 2014
Sam Skalicky; Sonia Martín López; Marcin Łukowiak; Christopher A. Wood
The performance of a pipelined architecture is often limited by incorrectly designed or poorly implemented control logic. Once a design is implemented and meets timing constraints, the task is to evaluate whether it is achieving optimal performance. At this stage, the number of pipelines and functional units is fixed and the amount of resources and memory bandwidth is finalized. If a design is performing suboptimally, the only recourse is to improve the control logic. In this paper we present a metric to quantify the achievable performance of a design and use it to analyze performance degradation due to control logic. We analyze the control logic of existing architectures and present improvements that achieve speedups of up to 10.7×.
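The paper's metric is not reproduced here; as an illustrative stand-in, one can compare the cycles a pipeline would need if it issued back-to-back against the cycles it actually took, so that any shortfall is attributable to stalls inserted by the control logic:

```cpp
// Invented efficiency metric in the spirit of the paper's analysis:
// ratio of ideal pipeline cycles to measured cycles. A value near 1.0
// means the control logic keeps the pipeline issuing back-to-back;
// lower values point at stalls the control logic inserts.
double pipeline_efficiency(long results,        // results produced
                           int  depth,          // pipeline depth
                           int  ii,             // initiation interval
                           long actual_cycles)  // measured cycles
{
    long ideal = depth + (results - 1) * ii;    // back-to-back issue
    return static_cast<double>(ideal) / actual_cycles;
}
```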
Reconfigurable Computing and FPGAs | 2014
Sam Skalicky; Tyler Kwolek; Sonia Martín López; Marcin Łukowiak
FPGAs have been shown to provide orders-of-magnitude improvements over CPUs and GPUs in terms of absolute performance and energy efficiency for various kernels, such as Cholesky decomposition, matrix inversion, and the FFT, among others. Despite this, the overall performance of many applications suffers when they are implemented entirely in FPGAs. Combining FPGAs with CPUs and GPUs provides the range of capabilities needed to support the diverse computational requirements of applications. Integrating FPGAs into these systems challenges application developers with constructing hardware kernel implementations and with interfacing the low-level hardware logic in the FPGA to the high-speed networks that connect processors in the system. In this work we extend the compute capabilities of Matlab by incorporating support for FPGAs and automating the parallel code generation. We characterize the system and evaluate the performance gains that can be achieved by adding the FPGA for two compute-intensive applications. We present performance results for medical imaging and fluid dynamics applications implemented on a CPU+GPU+FPGA system, achieving up to a 40× improvement compared to the standard Matlab CPU+GPU environment.
Design, Automation, and Test in Europe | 2015
Sam Skalicky; Andrew G. Schmidt; Sonia Martín López; Matthew French
With the continual enhancement of heterogeneous resources in FPGA devices, utilizing these resources becomes a challenging burden for developers. Especially with the inclusion of sophisticated multiprocessor systems-on-chip, the skill set needed to effectively leverage these resources spans both hardware and software expertise. Maturing high-level synthesis tools and programming languages aim to alleviate these complexities, yet systematic gaps remain that must be bridged to provide a more cohesive hardware/software development environment. High-level MPSoC design initiatives such as Redsharc have reduced the costs of entry, simplifying application implementation. We propose a unified hardware/software framework for system construction, leveraging Redsharc's APIs, efficient on-chip interconnects, and run-time controllers. We present system-level abstractions that enable the compilation and implementation tools for hardware and software to be merged into a single configurable system development environment. Finally, we demonstrate our proposed framework with Redsharc, using AES encryption/decryption spanning software implementations on ARM and MicroBlaze processors as well as hardware kernels.
arXiv: Software Engineering | 2014
Sam Skalicky; Andrew G. Schmidt; Matthew French