Oren Segal
University of Massachusetts Lowell
Publication
Featured research published by Oren Segal.
field programmable logic and applications | 2014
Oren Segal; Martin Margala; Sai Rahul Chalamalasetti; Mitch Wright
Heterogeneous computing offers a promising solution for energy-efficient computing in the data center. FPGA-based heterogeneous computing is an especially promising direction since it allows for the creation of custom hardware solutions for data-centric parallel applications. One of the main issues delaying widespread adoption of FPGAs as mainstream high-performance computing devices is the difficulty of programming them. OpenCL was meant to address the difficulties and the non-uniformity related to programming heterogeneous devices. Unfortunately, because of its complexity, it sets the bar high for many software programmers, preventing them from directly benefiting from the computing power and energy efficiency that OpenCL and heterogeneous computing have to offer. This work presents an effort to bridge the gap by extending an existing Java programming framework (APARAPI), based on OpenCL, so that it can be used to program FPGAs at a high level of abstraction and with increased ease of programmability. We run several real-world algorithms to assess the performance of the APARAPI framework on both a low-end and a high-end system. On the low-end and high-end systems respectively, we find up to 78-80 percent power reduction and 4.8X-5.3X speed increase running an NBody simulation, as well as up to 65-80 percent power reduction and 6.2X-7X speed increase for a K-Means MapReduce algorithm running on top of the Hadoop framework and APARAPI.
field programmable logic and applications | 2014
Zhuo Qian; Nasibeh Nasiri; Oren Segal; Martin Margala
The Fast Fourier Transform (FFT) is one of the fundamental operations in digital signal processing. Among all FFT algorithms, the Split-Radix Fast Fourier Transform (SRFFT) comes closest to the theoretical minimum number of multiplications, so it is a good candidate for the implementation of a low-power FFT processor. In this PhD work, we aim to implement a novel low-power Split-Radix FFT processor using a shared-memory architecture and to extend this work to a parallel structure based on FPGA. We started by designing a new radix-2 butterfly unit that uses a clock-gating approach to block unnecessary switching activity in the multiplier. Compared to existing SRFFT processors, which are based on the “L”-shaped butterfly, our implementation simplifies the address generation process for FFT data. Furthermore, because the number of multiplications required by the SRFFT algorithm decreases significantly as the FFT size increases, it is reasonable to assume that the proposed architecture will save more power for larger FFT sizes.
ieee high performance extreme computing conference | 2014
Oren Segal; Nasibeh Nasiri; Martin Margala; Wim Vanderbauwhede
Heterogeneous computing offers a promising solution for high-performance and energy-efficient computing. Until recently the high-performance heterogeneous computing arena was dominated by discrete GPUs, but in recent years new solutions based on devices such as APUs and FPGAs have emerged. These new solutions show promise for further improvements in energy efficiency. FPGA-based heterogeneous computing is an especially promising direction since it allows for the creation of custom hardware solutions for data-centric parallel applications. One of the main issues delaying widespread adoption of FPGAs as mainstream high-performance computing devices is the difficulty of programming them. Altera's OpenCL implementation for FPGAs provides a high level of abstraction and increased ease of programmability of FPGAs. Two high-performance computing applications (Lava Molecular Dynamics and Nearest-Neighbours) and a data-centric application (Document Classification) were compiled using Altera's OpenCL compiler and programmed on a Nallatech FPGA board. Hardware utilization, kernel execution time and total execution time are reported. Speedups of up to 5.3x, 4.3x and 1.3x over the dual Xeon processor implementations were achieved for LavaMD, Nearest-Neighbours and Document Classification, respectively.
ieee high performance extreme computing conference | 2015
Oren Segal; Philip Colangelo; Nasibeh Nasiri; Zhuo Qian; Martin Margala
Combining several types of devices and architectures is at the heart of heterogeneous computing's power efficiency advantage, but the strength of heterogeneous systems is also their Achilles heel: the diversity of the devices and ecosystems needed to maintain them presents major technological challenges. Some of the biggest challenges are in the realm of system programming. We believe that for heterogeneous systems computing to become a mainstream system design choice, high-level and standard system design flows need to be adopted in order to achieve transparency when dealing with diverse devices and architectures. In this paper we present an open source high-level framework and design flow that allows working with any type of device that supports OpenCL. In addition, we test our design flow and framework on an N-body simulation across multiple device types and show how such a high-level framework and heterogeneous system design can deliver a more power-efficient solution when compared to a single general-purpose device and a dual CPU+GPU device type approach.
field programmable gate arrays | 2015
Nasibeh Nasiri; Oren Segal; Martin Margala; Wim Vanderbauwhede; Sai Rahul Chalamalasetti
Document classification is at the heart of several of the applications that have been driving the proliferation of the internet in our daily lives. The ever-growing amounts of data and the need for higher-throughput, more energy-efficient document classification solutions motivated us to investigate alternatives to the traditional homogeneous CPU-based implementations. We investigate a heterogeneous system where CPUs are combined with FPGAs as system accelerators. Incorporating FPGAs as accelerators in a heterogeneous computing environment allows for the creation of flexible custom hardware solutions that can potentially offer increased power efficiency and performance gains. One of the main issues delaying widespread adoption of FPGAs as standard heterogeneous system accelerators is the difficulty of programming them. The OpenCL standard offers a unified C programming model for any device that adheres to its standards. An Altera OpenCL FPGA-based implementation of a document classification system is investigated in which a stream of HTML documents is scored according to a profile on a document-by-document basis. The results show that the throughput of the document classification application with and without Bloom filters is 312MB/s and 343MB/s respectively when running on a CPU, and 354MB/s and 452MB/s respectively when running on an FPGA. Our results also show up to 32% power efficiency improvement for the FPGA implementation over the CPU implementation. We would like to thank Davor Capalija from Altera for his invaluable advice during our work on the FPGA version of the algorithm.
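As a rough illustration of scoring a document against a profile with an optional Bloom filter pre-check, here is a minimal Python sketch. It is not the paper's OpenCL kernel: the abstract does not describe the profile format, the hash functions, or the filter size, so `BloomFilter`, `score_document`, the term weights, and the SHA-256 based hashing are all hypothetical choices made for this example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; size and hash count are illustrative assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, term):
        # Derive several bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits[pos] = 1

    def might_contain(self, term):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(term))


def score_document(text, profile, bloom=None):
    """Sum the profile weights of the terms that occur in the document."""
    score = 0.0
    for term in text.lower().split():
        # The filter lets most non-profile terms skip the profile lookup.
        if bloom is not None and not bloom.might_contain(term):
            continue
        score += profile.get(term, 0.0)
    return score


# Hypothetical profile: term -> weight.
profile = {"fpga": 2.0, "opencl": 1.5, "kernel": 0.5}
bloom = BloomFilter()
for term in profile:
    bloom.add(term)
```

Because a Bloom filter never reports a false negative, the score is the same with or without the filter in this sketch; a false positive only costs one extra profile lookup.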
midwest symposium on circuits and systems | 2014
Nasibeh Nasiri; Oren Segal; Martin Margala
Fused multiply-add (FMA) units can reduce latency and increase energy efficiency in arithmetic operations. A modified architecture of a multiply-accumulation chained unit (MFMA) is described in this paper. The add/sub pipelined datapath of a traditional fused multiply-add unit is modified to save hardware resources, conserve energy and reduce latency in DSP applications. The proposed datapath for add/sub is flexible and generic, and can be used in any IEEE-754 compatible floating point architecture as a replacement for the traditional multiply-accumulation chained unit. FMA and MFMA are both implemented in a nine-stage pipelined design. The clock-limiting stage for both architectures is the normalization stage, which remains unchanged in the proposed architecture. An FPGA implementation of the proposed three-input add/sub and an ASIC implementation of the MFMA are performed. In the FPGA implementation of the proposed add/sub datapath, the area reduction is 19.56%, the power reduction is 20.67%, and the latency is halved compared to two cascaded two-input add/sub datapaths. In ASIC implementations of the classic FMA and MFMA, the overall area reduction is 7.16% and the power saving is 5.69%.
power and timing modeling optimization and simulation | 2016
Nasibeh Nasiri; Philip Colangelo; Oren Segal; Martin Margala; Wim Vanderbauwhede
Datacenter workloads demand high-throughput, low-cost and power-efficient solutions. In most data centers the operating costs dominate the infrastructure cost. The ever-growing amounts of data and the critical need for higher-throughput, more energy-efficient document classification solutions motivated us to investigate alternatives to the traditional homogeneous CPU-based implementations of document classification systems. Several heterogeneous systems were investigated in the past where CPUs were combined with GPUs and FPGAs as system accelerators. The increasing complexity of FPGAs has made them an interesting device in heterogeneous computing environments but, on the other hand, difficult to program using hardware description languages. We explore the trade-offs between high-level synthesis (HLS) and low-level synthesis when programming FPGAs. Low-level synthesis results in less hardware resource usage on FPGAs and also offers higher throughput than an HLS tool, while an HLS tool allows different heterogeneous computing devices, such as multicore CPUs and GPUs, to be targeted. Through our implementation experience and empirical results for data-centric applications, we conclude that we can achieve power-efficient results for this set of applications by using either low-level synthesis or high-level synthesis for programming FPGAs.
international conference on high performance computing and simulation | 2016
Oren Segal; Martin Margala
In this paper we evaluate the potential of running a compute-intensive simulation on a heterogeneous cluster built from CPU, GPU and FPGA devices. We do so by augmenting a commercially available cluster of CPUs and GPUs with an FPGA device and running a distributed n-body simulation on top of Spark for unconventional cores (SparkCL) on the three different types of computing architectures. We show that given an algorithm with a sufficiently high compute intensity, such as pairwise additive n-body, we can significantly increase performance and performance per watt in comparison to running the same algorithm on a homogeneous CPU based cluster. In addition, we show the potential of using FPGAs in future commodity heterogeneous clusters alongside CPUs and GPUs.
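To make the "pairwise additive n-body" workload concrete, the following Python sketch computes gravitational accelerations with the direct all-pairs method, which performs on the order of n² interactions per step and is what gives the algorithm its high compute intensity. It is an illustration only and not the SparkCL kernel from the paper; the function name `accelerations`, the unit gravitational constant, and the `softening` parameter are assumptions introduced here.

```python
import math


def accelerations(positions, masses, softening=1e-3):
    """Direct all-pairs gravitational accelerations (G = 1 assumed).

    positions: list of (x, y, z) tuples; masses: list of floats.
    softening avoids division by zero for very close bodies.
    """
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        xi, yi, zi = positions[i]
        for j in range(n):
            if i == j:
                continue  # a body exerts no force on itself
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            dz = positions[j][2] - zi
            dist_sq = dx * dx + dy * dy + dz * dz + softening * softening
            inv_dist_cubed = 1.0 / (dist_sq * math.sqrt(dist_sq))
            # Each pair contributes independently, so contributions add up.
            acc[i][0] += masses[j] * dx * inv_dist_cubed
            acc[i][1] += masses[j] * dy * inv_dist_cubed
            acc[i][2] += masses[j] * dz * inv_dist_cubed
    return acc
```

The outer loop over `i` has no dependencies between iterations, which is why this kind of computation maps naturally onto data-parallel devices such as GPUs and FPGAs.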
field programmable gate arrays | 2014
Yosi Ben Asher; Jacob Gendel; Gadi Haber; Oren Segal; Yousef Shajrawi
Manycore shared memory architectures hold significant promise to speed up and simplify SOCs. Using many small homogeneous cores will allow replacing the hardware accelerators of SOCs with parallel algorithms communicating through shared memory. Currently shared memory is realized by maintaining cache consistency across the cores, caching all the connected cores to one main memory module. This approach, though used today, is not likely to be scalable enough to support the high number of cores needed for highly parallel SOCs. Therefore we consider a theoretical scheme for shared memory wherein the shared address space is divided between a set of memory modules, and a communication network allows each core to access every such module in parallel. Load balancing between the memory modules is obtained by rehashing the memory address space. We have designed a simple generic shared memory architecture, synthesized it for 2, 4, 8, ..., 1024 cores on a Virtex-7 FPGA, and evaluated it on several parallel programs. The synthesis results and the execution measurements show that, for the FPGA, all problematic aspects of this construction can be resolved. For example, unlike in ASICs, the growing complexity of the communication network is absorbed by the FPGA's routing grid and by its routing mechanism. This makes this type of architecture particularly suitable for FPGAs. We used modified 32-bit PACOBLAZE cores and tested different parameters of this architecture, verifying its ability to achieve high speedups. The results suggest that rehashing is not essential and that one hash function suffices (compared to the family of universal hash functions that is needed by the theoretical construction).
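The idea of spreading a shared address space over several memory modules by hashing can be sketched in a few lines of Python. The abstract does not say which hash function the design uses, so the multiplicative hash below (Knuth's 32-bit constant), the module count of 8, and the helper names are assumptions chosen purely for illustration.

```python
from collections import Counter

NUM_MODULES = 8          # assumed module count (a power of two)
MODULE_BITS = 3          # log2(NUM_MODULES)
KNUTH = 2654435761       # multiplicative hashing constant for 32-bit words


def interleaved_module(address):
    """Baseline: pick the module from the low address bits."""
    return address % NUM_MODULES


def hashed_module(address):
    """Pick the module from the top bits of a 32-bit multiplicative hash."""
    return ((address * KNUTH) & 0xFFFFFFFF) >> (32 - MODULE_BITS)


def max_load(addresses, mapping):
    """Largest number of accesses that land on a single module."""
    return max(Counter(mapping(a) for a in addresses).values())


# A strided access pattern: 64 addresses, 8 words apart.
strided = [8 * k for k in range(64)]
```

With the plain low-bit mapping, every address in `strided` lands on module 0, so `max_load(strided, interleaved_module)` is 64, whereas the hashed mapping distributes the same accesses over several modules. This is the load-balancing effect the abstract attributes to hashing the address space.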
arXiv: Distributed, Parallel, and Cluster Computing | 2015
Oren Segal; Philip Colangelo; Nasibeh Nasiri; Zhuo Qian; Martin Margala