Farnoud Farahmand | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Farnoud Farahmand is active.

Explore More

Publication

Featured researches published by Farnoud Farahmand.

international symposium on circuits and systems | 2016

Big biomedical image processing hardware acceleration: A case study for K-means and image filtering

Katayoun Neshatpour; Arezou Koohi; Farnoud Farahmand; Rajiv V. Joshi; Setareh Rafatirad; Avesta Sasan; Houman Homayoun

Most hospitals today are dealing with the big data problem, as they generate and store petabytes of patient records most of which in form of medical imaging, such as pathological images, CT scans and X-rays in their datacenters. Analyzing such large amounts of biomedical imaging data to enable discovery and guide physicians in personalized care is becoming an important focus of data mining and machine learning algorithms developed for biomedical Informatics (BMI). Algorithms that are developed for BMI heavily rely on complex and computationally intensive machine learning and data mining methods to learn from large data. The high processing demand of big biomedical imaging data has given rise to their implementation in high-end server platforms running software ecosystems that are optimized for dealing with large amount of data including Apache Hadoop and Apache Spark. However, efficient processing of such large amount of imaging data running computational intensive learning methods is becoming a challenging problem using state-of-the-art high performance computing server architectures. To address this challenge, in this paper, we introduce a scalable and efficient hardware acceleration method using low cost commodity FPGAs that is interfaced with a server architecture through a high speed interface. In this work we present a full end-to-end implementation of big data image processing and machine learning applications in a heterogeneous CPU+FPGA architecture. We develop the MapReduce implementation of K-means and Laplacian Filtering in Hadoop Streaming environment that allows developing mapper functions in non-Java based languages suited for interfacing with FPGA-based hardware accelerating environment. We accelerate the mapper functions through hardware+software (HW+SW) co-design. We do a full implementation of the HW+SW mappers on the Zynq FPGA platform. The results show promising kernel speedup of up to 27× for large image data sets. This translate to 7.8× and 1.8× speedup in an end-to-end Hadoop MapReduce implementation of K-mean s and Laplacian Filtering algorithm, respectively.

reconfigurable computing and fpgas | 2015

A universal hardware API for authenticated ciphers

Ekawat Homsirikamol; William Diehl; Ahmed Ferozpuri; Farnoud Farahmand; Malik Umar Sharif; Kris Gaj

In this paper, we propose a universal hardware Application Programming Interface (API) for authenticated ciphers. In particular, our API is intended to meet the requirements of all algorithms submitted to the CAESAR competition. Two major parts of the API, the interface and the communication protocol, were developed with the goal of reducing any potential biases in benchmarking of authenticated ciphers in hardware. Our high-speed implementation of the proposed hardware API includes universal, open-source pre-processing and post-processing units, common for all CAESAR candidates and the current standards, such as AES-GCM and AES-CCM. Apart from the full documentation, examples, and the source code of the pre-processing and post-processing units, we have made available in public domain a) a universal testbench to verify the functionality of any CAESAR candidate implemented using our hardware API, b) a Python script used to automatically generate test vectors for this testbench, c) VHDL wrappers used to determine the maximum clock frequency and the resource utilization of all implementations, and d) RTL VHDL source codes of highspeed implementations of AES and the Keccak Permutation F, which may be used as building blocks in implementations of related ciphers. We hope that the existence of these resources will substantially reduce the time necessary to develop hardware implementations of all CAESAR candidates for the purpose of evaluation, comparison, and future deployment in real products.

ieee computer society annual symposium on vlsi | 2016

Architecture Exploration for Energy-Efficient Embedded Vision Applications: From General Purpose Processor to Domain Specific Accelerator

Maria Malik; Farnoud Farahmand; Paul Otto; Nima Akhlaghi; Tinoosh Mohsenin; Siddhartha Sikdar; Houman Homayoun

OpenCV applications are computationally intensive tasks among computer vision algorithms. The demand for low power yet high performance real-time processing of OpenCV embedded vision applications have led to developing their customized implementations on state-of-the-art embedded processing platforms. Given the industry move to heterogeneous platforms which integrates single core or multicore CPU with on-chip FPGA accelerators and GPU accelerators, the question of what platform and what implementation, whether hardware or software, is best suited for energy-efficient processing of this class of applications is becoming important. In this paper, we seek to answer this question through a detailed hardware and software implementation of OpenCV applications and methodically measurement and comprehensive analysis of their power and performance on state-of-the-art heterogeneous embedded processing platforms. The results show that in addition to application behavior, the size of image is an important factor in deciding the efficient platform in terms of highest energy-efficiency (EDP) among hardware accelerators on FPGA and software accelerators on GPU and multicore CPUs. While hardware implementation on ZYNQ shown to be the most performance and energy-efficient for image size of 500x500 or less, software GPU implementation found to be the most efficient and achieves highest speedup for larger image sizes. In addition, while for compute intensive vision applications the gap between FPGA, CPU and GPU reduces as the size of image increases, for non-intensive applications, a large performance and EDP gap is observed between the studied platforms, as the size of the image increases.

hardware oriented security and trust | 2018

Comparison of cost of protection against differential power analysis of selected authenticated ciphers

William Diehl; Abubakr Abdulgadir; Farnoud Farahmand; Jens-Peter Kaps; Kris Gaj

Authenticated ciphers are vulnerable to side-channel attacks, including differential power analysis (DPA). Test Vector Leakage Assessment (TVLA) using Welchs t-test has been used to verify improved resistance of block ciphers to DPA after application of countermeasures. However, extension of this methodology to authenticated ciphers is non-trivial, since this requires additional input and output conditions, complex interfaces, and long test vectors interlaced with protocol necessary to describe authenticated cipher operations. In this research we augment an existing side-channel analysis architecture (FOBOS) with TVLA for authenticated ciphers. We use this capability to show that implementations in the Spartan-6 FPGA of the CAESAR Round 3 candidates ACORN, ASCON, CLOC (AES and TWINE), SILC (AES, PRESENT, and LED), JAMBU (AES and SIMON), and Ketje Jr., as well as AES-GCM, are potentially vulnerable to 1st order DPA. We then implement versions of the above ciphers, protected against 1st order DPA, using threshold implementations. TVLA is used to verify improved resistance to 1st order DPA of the protected cipher implementations. Finally, we benchmark unprotected and protected cipher implementations in the Spartan-6 FPGA, and compare the costs of 1st order DPA protection in terms of area, frequency, throughput, throughput-to-area (TP/A) ratio, power, and energy per bit. Our results show that ACORN is the most energy efficient, has the lowest area (in LUTs), and has the highest TP/A ratio of DPA-resistant implementations. However, Ketje Jr. has the highest throughput.

reconfigurable computing and fpgas | 2016

A Zynq-based testbed for the experimental benchmarking of algorithms competing in cryptographic contests

Farnoud Farahmand; Ekawat Homsirikamol; Kris Gaj

Hardware performance evaluation of candidates competing in cryptographic contests, such as SHA-3 and CAE-SAR, is very important for ranking their suitability for standardization. One of the most essential performance metrics is the throughput, which highly depends on the algorithm, hardware implementation architecture, coding style, and options of tools. The maximum throughput is calculated based on the maximum clock frequency supported by each algorithm. A common way of determining the maximum clock frequency is static timing analysis provided by the CAD toolsets such as Xilinx ISE, Xilinx Vivado, and Altera Quartus Prime. In this project, we have developed a universal testbed, which is capable of measuring the maximum clock frequency experimentally, using a prototyping board. We are targeting cryptographic hardware cores, such as implementations of SHA-3 candidates. Our testbed is designed using a Zynq platform and takes advantage of software/hardware co-design. It supports two separate clock domains, one for a hardware module under test, and the other for the communication between an ARM core and hardware accelerator. We measured the maximum clock frequency and the execution time of 12 Round 2 SHA-3 candidates experimentally on ZedBoard and compared the results with the frequencies reported by Xilinx Vivado. Our results indicate that depending on the characteristics of each algorithm, we may achieve either much higher or the same experimental frequency than the results reported by the tools using static timing analysis. This behavior is then further analyzed, and the relevant conclusions drawn.

Proceedings of the 2018 Workshop on Attacks and Solutions in Hardware Security - ASHES '18 | 2018

Fixing the CLOC with Fine-grain Leakage Analysis

William Diehl; Farnoud Farahmand; Abubakr Abdulgadir; Jens-Peter Kaps; Kris Gaj

Authenticated ciphers offer the promise of improved security for resource-constrained devices. Recent cryptographic contests and standardization efforts are evaluating authenticated ciphers for performance and security, including resistance to Differential Power Analysis (DPA). In this research, we study the CLOC-AES authenticated cipher in terms of vulnerability to DPA and cost of implementation of countermeasures against DPA. Using the FOBOS test architecture, we first show that an FPGA implementation of CLOC is vulnerable to DPA through Test Vector Leakage Assessment methodology (i.e., t-tests). After applying DPA countermeasures, we show that protected CLOC implementations pass t-tests, except for discrete leakage corresponding to a data-dependent branch condition in the CLOC specification. Using an enhanced tool called FOBOS Profiler, we analyze the source of t-test failure down to the exact clock cycle and device state, to confirm the source of leakage. We introduce a new protected non-linear transformation into the datapath, remove all data-dependent decision criteria from the device controller, and verify that the updated protected implementations pass t-tests. We show that the cost of including the protected non-linear transformation leads to 3.8 factor growth in area, 48 percent reduction in throughput, and 86 percent reduction in throughput-to-area ratio, compared to the unprotected implementation. Our analysis shows the high cost of DPA-protected non-linear transformations in authenticated ciphers above the cryptographic primitive layer.

field programmable logic and applications | 2017

Comparison of hardware and software implementations of selected lightweight block ciphers

William Diehl; Farnoud Farahmand; Panasayya Yalla; Jens-Peter Kaps; Kris Gaj

Lightweight block ciphers are an important topic of research in the context of the Internet of Things (IoT). Current cryptographic contests and standardization efforts seek to benchmark lightweight ciphers in both hardware and software. Although there have been several benchmarking studies of both hardware and software implementations of lightweight ciphers, direct comparison of hardware and software implementations is difficult due to differences in metrics, measures of effectiveness, and implementation platforms. In this research, we facilitate this comparison by use of a custom lightweight reconfigurable processor. We implement six ciphers, AES, SIMON, SPECK, PRESENT, LED and TWINE, in hardware using register transfer level (RTL) design, and in software using the custom reconfigurable processor. Both hardware and software implementations are instantiated in identical Xilinx Kintex-7 FPGAs, which enables direct comparison of throughput, area, throughput-to-area (TP/A) ratio, power, and energy. Results show that TWINE and AES have the highest TP/A ratios for hardware and software implementations, respectively, assuming an area target of 300–450 LUTs. In terms of direct comparison, software implementations on tailored reconfigurable processers generally use less power — especially where reconfigurable instruction set extensions are permitted. However, custom hardware implementations have higher throughput and energy-efficiency than software implementations on the same platform.

IACR Cryptology ePrint Archive | 2015