Katayoun Neshatpour
George Mason University
Publications
Featured research published by Katayoun Neshatpour.
international conference on big data | 2015
Katayoun Neshatpour; Maria Malik; Mohammad Ali Ghodrat; Avesta Sasan; Houman Homayoun
A recent trend in big data analytics is to provide heterogeneous architectures that support hardware specialization. Considering the time dedicated to creating such hardware implementations, an analysis that estimates how much benefit we gain in speed and energy efficiency by offloading various functions to hardware is necessary. This work analyzes data mining and machine learning algorithms, which are used extensively in big data applications, on a heterogeneous CPU+FPGA platform. We select and offload the computationally intensive kernels to the hardware accelerator to achieve the highest speedup and best energy efficiency. We use the latest Xilinx Zynq boards for implementation and result analysis. We also perform a first-order comprehensive analysis of communication and computation overheads to understand how the speedup of each application contributes to its overall execution in an end-to-end Hadoop MapReduce environment. Moreover, we study how other system parameters, such as the choice of CPU (big vs. little) and the number of mapper slots, affect the performance and power-efficiency benefits of hardware acceleration. The results show that a kernel speedup of up to 321.5× can be achieved with hardware+software co-design. This translates to a 2.72× speedup, 2.13× power reduction, and 15.21× energy-delay product (EDP) improvement in an end-to-end Hadoop MapReduce environment.
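The gap between the 321.5× kernel speedup and the 2.72× end-to-end result is what the paper's first-order overhead analysis captures: Amdahl's law applied over the non-accelerated Hadoop stages. A minimal sketch, with the accelerated-time fraction back-solved as an illustrative value rather than taken from the paper:

```python
# First-order (Amdahl-style) estimate of end-to-end speedup when only the
# mapper kernel is accelerated. The fraction f is illustrative, back-solved
# to match the reported numbers, not a value taken from the paper.

def end_to_end_speedup(f: float, kernel_speedup: float) -> float:
    """f: fraction of end-to-end runtime spent in the accelerated kernel."""
    return 1.0 / ((1.0 - f) + f / kernel_speedup)

# Even a 321.5x kernel speedup yields only ~2.7x overall if the kernel
# accounts for roughly 63% of the end-to-end Hadoop runtime; the rest
# (I/O, shuffle, sort, communication) is untouched by the accelerator.
print(end_to_end_speedup(0.634, 321.5))  # ~2.72
```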
ieee/acm international symposium cluster, cloud and grid computing | 2015
Katayoun Neshatpour; Maria Malik; Houman Homayoun
Big data applications share inherent characteristics that are fundamentally different from traditional desktop CPU, parallel, and web service applications. They rely on deep machine learning and data mining methods. A recent trend in big data analytics is to provide heterogeneous architectures that support hardware specialization, to construct the right processing engine for analytics applications. However, these specialized heterogeneous architectures require extensive exploration of design aspects to find the optimal architecture in terms of performance and cost. Considering the time dedicated to creating such specialized architectures, a model that estimates the potential speedup achievable by offloading various parts of the algorithm to specialized hardware is necessary. This paper analyzes how offloading the computationally intensive kernels of machine learning algorithms to a heterogeneous CPU+FPGA platform enhances performance. We use the latest Xilinx Zynq boards for implementation and result analysis. Furthermore, we perform a comprehensive analysis of communication and computation overheads, such as data I/O movements and calls to several standard libraries that cannot be offloaded to the accelerator, to understand how the speedup of each application contributes to its overall execution in an end-to-end Hadoop MapReduce environment.
field-programmable custom computing machines | 2015
Katayoun Neshatpour; Maria Malik; Mohammad Ali Ghodrat; Houman Homayoun
Emerging big data analytics applications require a significant amount of server computational power. As chips hit power limits, computing systems are moving away from general-purpose designs and toward greater specialization. Hardware acceleration through specialization has received renewed interest in recent years, mainly due to the dark silicon challenge. To address the computing requirements of big data, and based on benchmarking and characterization results, we envision a data-driven heterogeneous architecture for next-generation big data server platforms that leverages field-programmable gate arrays (FPGAs) to build custom accelerators in a Hadoop MapReduce framework. Unlike a full and dedicated implementation of the Hadoop MapReduce algorithm on the FPGA, we propose a hardware/software (HW/SW) co-design of the algorithm, which trades some speedup for the benefit of less hardware. Considering the communication overhead with the FPGA and the other overheads in the Hadoop MapReduce environment, such as compression and decompression, shuffling, and sorting, our experimental results show significant potential for accelerating Hadoop MapReduce machine learning kernels using a HW/SW co-design methodology.
design, automation, and test in europe | 2017
Maria Malik; Katayoun Neshatpour; Tinoosh Mohsenin; Avesta Sasan; Houman Homayoun
The rapid growth of data makes it challenging to process data efficiently using current high-performance server architectures such as big Xeon cores. Furthermore, physical design constraints, such as power and density, have become the dominant limiting factor for scaling out servers. Heterogeneous architectures that combine big Xeon cores with little Atom cores have emerged as a promising solution for enhancing energy efficiency by allowing each application to run on an architecture that matches its resource needs more closely than a one-size-fits-all architecture. Therefore, the question of whether to map an application to the big Xeon or the little Atom in a heterogeneous server architecture becomes important. In this paper, we characterize Hadoop-based applications and their corresponding MapReduce tasks on big Xeon- and little Atom-based server architectures to understand how the choice of big vs. little cores is affected by various parameters at the application, system, and architecture levels, and by the interplay among these parameters. Furthermore, we evaluate the operational and capital costs to understand how performance, power, and area constraints for big data analytics affect the choice of big- vs. little-core servers as the more cost- and energy-efficient architecture.
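The mapping question the characterization answers can be stated as a simple per-application comparison. A minimal sketch, using energy-delay product as one possible objective; the runtime and power numbers below are hypothetical placeholders, not the paper's measurements:

```python
# Sketch of the big-vs-little mapping decision the characterization supports:
# pick the core type that minimizes energy-delay product (EDP) per application.
# All runtime/power measurements below are hypothetical placeholders.

measurements = {
    # application: {core: (runtime_s, avg_power_w)}
    "wordcount": {"xeon": (120.0, 95.0), "atom": (200.0, 20.0)},
    "kmeans":    {"xeon": (240.0, 110.0), "atom": (900.0, 30.0)},
}

def edp(runtime_s: float, power_w: float) -> float:
    # EDP = energy * delay = (power * time) * time
    return power_w * runtime_s ** 2

for app, cores in measurements.items():
    best = min(cores, key=lambda c: edp(*cores[c]))
    print(f"{app}: map to {best}")
```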
international symposium on circuits and systems | 2016
Katayoun Neshatpour; Arezou Koohi; Farnoud Farahmand; Rajiv V. Joshi; Setareh Rafatirad; Avesta Sasan; Houman Homayoun
Most hospitals today are dealing with the big data problem, as they generate and store petabytes of patient records, most of which are in the form of medical imaging, such as pathological images, CT scans, and X-rays, in their data centers. Analyzing such large amounts of biomedical imaging data to enable discovery and guide physicians in personalized care is becoming an important focus of the data mining and machine learning algorithms developed for biomedical informatics (BMI). Algorithms developed for BMI rely heavily on complex and computationally intensive machine learning and data mining methods to learn from large data. The high processing demand of big biomedical imaging data has given rise to implementations on high-end server platforms running software ecosystems optimized for dealing with large amounts of data, including Apache Hadoop and Apache Spark. However, efficiently processing such large amounts of imaging data with computationally intensive learning methods is becoming a challenging problem for state-of-the-art high-performance computing server architectures. To address this challenge, in this paper we introduce a scalable and efficient hardware acceleration method using low-cost commodity FPGAs interfaced with a server architecture through a high-speed interface. We present a full end-to-end implementation of big data image processing and machine learning applications on a heterogeneous CPU+FPGA architecture. We develop MapReduce implementations of K-means and Laplacian filtering in a Hadoop Streaming environment, which allows developing mapper functions in non-Java languages suited for interfacing with an FPGA-based hardware acceleration environment. We accelerate the mapper functions through hardware+software (HW+SW) co-design and fully implement the HW+SW mappers on the Zynq FPGA platform. The results show a promising kernel speedup of up to 27× for large image data sets. This translates to 7.8× and 1.8× speedups in end-to-end Hadoop MapReduce implementations of the K-means and Laplacian filtering algorithms, respectively.
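Hadoop Streaming is what makes the non-Java, FPGA-friendly mappers possible: the framework pipes input splits to any executable over stdin and collects tab-separated key/value pairs from stdout. A minimal sketch of such a mapper for the K-means assignment step, with the centroids hard-coded and the FPGA offload represented by a plain software kernel (both are illustrative assumptions; the papers do not specify this exact interface):

```python
#!/usr/bin/env python3
# Sketch of a Hadoop Streaming mapper for the K-means assignment step.
# Hadoop Streaming pipes input records to stdin and reads tab-separated
# key/value pairs from stdout, which is what lets the mapper be written
# in a non-Java language. The FPGA offload itself is not shown; the
# distance kernel below is the part a HW+SW co-design would accelerate.
import sys

# In practice the current centroids would be distributed to each mapper
# (e.g., via Hadoop's distributed cache); fixed here for illustration.
CENTROIDS = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]

def nearest_centroid(point):
    # Compute-intensive kernel: squared Euclidean distance to each centroid.
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in CENTROIDS]
    return min(range(len(CENTROIDS)), key=dists.__getitem__)

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    point = tuple(float(x) for x in line.split(","))
    # Emit "cluster-id <TAB> point"; reducers then recompute the centroids.
    print(f"{nearest_centroid(point)}\t{line}")
```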
international conference on hardware/software codesign and system synthesis | 2016
Katayoun Neshatpour; Avesta Sasan; Houman Homayoun
In this paper, we present the implementation of big data analytics applications on a heterogeneous CPU+FPGA accelerator architecture. We develop MapReduce implementations of K-means, K-nearest neighbors, support vector machine, and naive Bayes in a Hadoop Streaming environment, which allows developing mapper/reducer functions in a non-Java language suited for interfacing with an FPGA-based hardware acceleration environment. We present a full implementation of the HW+SW mappers on the Zynq FPGA platform. Promising speedup and energy-efficiency gains of up to 4.5× and 22×, respectively, are achieved in an end-to-end Hadoop implementation.
IEEE Transactions on Circuits and Systems | 2015
Katayoun Neshatpour; Mahdi Shabany; P. Glenn Gulak
This paper introduces a novel low-complexity multiple-input multiple-output (MIMO) detector tailored for single-carrier frequency-division multiple access (SC-FDMA) systems and suitable for efficient hardware implementation. The proposed detector starts with an initial estimate of the transmitted signal based on a minimum mean square error (MMSE) detector. It then identifies the less reliable symbols, for which more candidates in the constellation are browsed to improve the initial estimate. An efficient high-throughput VLSI architecture is also introduced, achieving superior performance compared to conventional MMSE detectors with less than 28% added complexity. The performance of the proposed design is close to that of the existing maximum-likelihood post-detection processing (ML-PDP) scheme at significantly lower complexity, i.e., 4.5×10² and 7×10⁴ times fewer Euclidean distance (ED) calculations for the 16-QAM and 64-QAM schemes, respectively. The proposed design for the 16-QAM scheme is fabricated in a 0.13 μm CMOS technology and fully tested, achieving a 1.332 Gbps throughput and making it the first fabricated SC-FDMA MIMO detector reported to date. A soft version of the proposed architecture, customized for coded systems, is also introduced.
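The initial MMSE estimate the detector builds on has a standard closed form, x̂ = (HᴴH + σ²I)⁻¹Hᴴy. A minimal numpy sketch of that stage under a flat-fading simplification (SC-FDMA would apply it per subcarrier in the frequency domain); the paper's reliability-based candidate refinement is only indicated by a comment:

```python
# Sketch of the linear MMSE estimate the detector starts from (standard
# textbook form with per-stream hard slicing; the paper's reliability-based
# candidate search that refines weak symbols is not reproduced here).
import numpy as np

def mmse_detect(H, y, noise_var, constellation):
    nt = H.shape[1]
    # x_hat = (H^H H + sigma^2 I)^(-1) H^H y
    G = np.linalg.solve(H.conj().T @ H + noise_var * np.eye(nt), H.conj().T)
    x_soft = G @ y
    # Slice each stream to the nearest constellation point.
    idx = np.argmin(np.abs(x_soft[:, None] - constellation[None, :]), axis=1)
    return constellation[idx]

# Example: 2x2 MIMO with QPSK (flat-fading simplification for illustration).
rng = np.random.default_rng(0)
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
H = (rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))) / np.sqrt(2)
x = qpsk[rng.integers(4, size=2)]
y = H @ x + 0.05 * (rng.standard_normal(2) + 1j * rng.standard_normal(2))
print(x, mmse_detect(H, y, noise_var=0.005, constellation=qpsk))
```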
Journal of Parallel and Distributed Computing | 2018
Katayoun Neshatpour; Maria Malik; Avesta Sasan; Setareh Rafatirad; Tinoush Mohsenin; Hassan Ghasemzadeh; Houman Homayoun
In this paper, we present a full end-to-end implementation of big data analytics applications on a heterogeneous CPU+FPGA architecture. Selecting the optimal architecture that yields the highest acceleration for big data applications requires an in-depth analysis of each application. Thus, we develop MapReduce implementations of K-means, K-nearest neighbors, support vector machine, and naive Bayes in a Hadoop Streaming environment, which allows developing mapper functions in a non-Java language suited for interfacing with an FPGA-based hardware acceleration environment. We further profile various components of Hadoop MapReduce to identify candidates for hardware acceleration. We accelerate the mapper functions through hardware+software (HW+SW) co-design. Moreover, we study how various parameters at the application (size of input data), system (number of mappers running simultaneously per node and data split size), and architecture (choice of CPU core, big vs. little, e.g., Xeon vs. Atom) levels affect the performance and power-efficiency benefits of Hadoop Streaming hardware acceleration and the overall performance and energy efficiency of the system. Promising speedup and energy-efficiency gains of up to 8.3× and 15×, respectively, are achieved in an end-to-end Hadoop implementation. Our results show that HW+SW acceleration yields significantly higher speedup on the Atom server, reducing the performance gap between little and big cores after acceleration. On the other hand, HW+SW acceleration reduces the power consumption of the Xeon server more significantly, reducing the power gap between little and big cores. Our cost analysis shows that the FPGA-accelerated Atom server yields execution times close to or even lower than those of the stand-alone Xeon server for the studied applications, while reducing the server cost by more than 3×. We confirm the scalability of FPGA acceleration of MapReduce by increasing the data size on a 12-node Xeon cluster and show that FPGA acceleration maintains its benefit for larger data sizes on a cluster.
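The operational-vs-capital cost trade-off behind the "more than 3×" claim reduces to simple per-job arithmetic. A back-of-the-envelope sketch with entirely hypothetical prices, runtimes, and power draws:

```python
# Back-of-the-envelope version of the paper's cost comparison: an
# FPGA-accelerated little-core (Atom) server vs a stand-alone big-core
# (Xeon) server. All prices, runtimes, and powers are hypothetical.

servers = {
    # name: (capital_cost_usd, job_runtime_s, avg_power_w)
    "xeon":      (4000.0, 100.0, 150.0),
    "atom+fpga": (1300.0, 110.0, 40.0),
}

DOLLARS_PER_JOULE = 0.12 / 3.6e6  # at $0.12 per kWh

for name, (capex, runtime, power) in servers.items():
    energy_cost = runtime * power * DOLLARS_PER_JOULE  # operational cost/job
    print(f"{name}: capex ${capex:.0f}, runtime {runtime:.0f} s, "
          f"energy cost/job ${energy_cost:.5f}")
```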
great lakes symposium on vlsi | 2015
Katayoun Neshatpour; Houman Homayoun; Amin Khajeh; Wayne Burleson
As CMOS technology scales down toward the nanometer regime and the supply voltage approaches the threshold voltage, an increase in operating temperature results in increased circuit current, which in turn reduces circuit propagation delay. This paper exploits this new phenomenon, known as inverse thermal dependence (ITD), for power, performance, and temperature optimization in processor architecture. ITD changes the maximum achievable operating frequency of the processor at high temperatures. Dynamic thermal management techniques such as activity migration, dynamic voltage and frequency scaling, and throttling are revisited in this paper with a focus on the effect of ITD. Results are obtained using the predictive technology models of the 7 nm, 10 nm, 14 nm, and 20 nm technology nodes and extensive architectural and circuit simulations. The results show that, depending on the design goals, various design corners should be re-investigated for power, performance, and energy-efficiency optimization. Architectural simulations for a multi-core processor across standard benchmarks show that ITD-aware thermal management schemes improve processor speed and energy-delay product by 8.55% and 4.4%, respectively.
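ITD arises because, near the threshold voltage, the temperature-induced drop in threshold voltage outweighs the loss of carrier mobility, so gates get faster as they heat up. A first-order illustration using the alpha-power delay model; all coefficients are illustrative assumptions, not values from the paper's predictive technology models:

```python
# First-order illustration of inverse thermal dependence (ITD) with the
# alpha-power delay model: delay ~ Vdd / (mobility(T) * (Vdd - Vth(T))^alpha).
# Mobility falls with temperature (slowing the gate) while Vth also falls
# with temperature (speeding it up); near threshold the Vth term dominates.
# All constants below are illustrative, not predictive-technology values.

ALPHA = 1.3     # velocity-saturation exponent
VTH0 = 0.30     # threshold voltage at 300 K (V)
K_VTH = 1e-3    # Vth drop per kelvin (V/K)
MU_EXP = 1.5    # mobility ~ T^-MU_EXP

def relative_delay(vdd: float, temp_k: float) -> float:
    vth = VTH0 - K_VTH * (temp_k - 300.0)
    mobility = (temp_k / 300.0) ** -MU_EXP
    return vdd / (mobility * (vdd - vth) ** ALPHA)

for vdd in (1.0, 0.45):  # nominal vs near-threshold supply
    d_cold, d_hot = relative_delay(vdd, 300.0), relative_delay(vdd, 380.0)
    trend = "faster when hot (ITD)" if d_hot < d_cold else "slower when hot"
    print(f"Vdd={vdd:.2f} V: {trend}")
```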
IEEE Transactions on Multi-Scale Computing Systems | 2018
Maria Malik; Katayoun Neshatpour; Setareh Rafatirad; Houman Homayoun