Maria Malik
George Mason University
Publications
Featured research published by Maria Malik.
international conference on big data | 2015
Maria Malik; Setareh Rafatirah; Avesta Sasan; Houman Homayoun
Emerging big data applications require a significant amount of server computational power. Big data analytics applications rely heavily on specific deep machine learning and data mining algorithms, and exhibit high computational intensity, memory intensity, I/O intensity and control intensity. Big data applications require computing resources that can efficiently scale to manage massive amounts of diverse data. However, the rapid growth of data makes it challenging to process data efficiently using current server architectures such as big Xeon cores. Furthermore, physical design constraints, such as power and density, have become the dominant limiting factor for scaling out servers. Therefore, recent work advocates the use of low-power embedded cores in servers, such as little Atom cores, to address these challenges. In this work, through methodical investigation of power and performance measurements, and comprehensive system-level and micro-architectural analysis, we characterize emerging big data applications on big Xeon- and little Atom-based server architectures. The characterization results across a wide range of real-world big data applications and various software stacks demonstrate how the choice of a big- vs little-core-based server for energy efficiency is significantly influenced by the size of data, performance constraints, and the presence of an accelerator. Furthermore, the microarchitecture-level analysis highlights where improvement is needed in the big- and little-core microarchitectures.
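As a rough illustration of the kind of characterization described above, the sketch below derives common metrics (IPC, last-level-cache MPKI, and energy) from raw performance counters. All counter names and numbers are hypothetical placeholders, not measurements from the paper.

```python
# Hypothetical sketch: deriving characterization metrics from raw hardware
# performance counters. The counter values below are illustrative only.

def derived_metrics(counters: dict) -> dict:
    """Compute IPC, LLC MPKI, and energy from raw counts and power samples."""
    ipc = counters["instructions"] / counters["cycles"]
    llc_mpki = counters["llc_misses"] / (counters["instructions"] / 1e3)
    energy_j = counters["avg_power_w"] * counters["runtime_s"]
    return {"IPC": ipc, "LLC_MPKI": llc_mpki, "energy_J": energy_j}

# Illustrative numbers for a big Xeon-class vs a little Atom-class run.
xeon = derived_metrics({"instructions": 4.0e11, "cycles": 2.5e11,
                        "llc_misses": 3.2e9, "avg_power_w": 95.0,
                        "runtime_s": 100.0})
atom = derived_metrics({"instructions": 4.0e11, "cycles": 6.0e11,
                        "llc_misses": 4.1e9, "avg_power_w": 8.0,
                        "runtime_s": 320.0})
print("Xeon:", xeon)
print("Atom:", atom)
```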
international conference on big data | 2015
Katayoun Neshatpour; Maria Malik; Mohammad Ali Ghodrat; Avesta Sasan; Houman Homayoun
A recent trend for big data analytics is to provide heterogeneous architectures that allow support for hardware specialization. Considering the time dedicated to creating such hardware implementations, an analysis that estimates how much benefit we gain in terms of speed and energy efficiency through offloading various functions to hardware is necessary. This work analyzes data mining and machine learning algorithms, which are utilized extensively in big data applications, on a heterogeneous CPU+FPGA platform. We select and offload the computationally intensive kernels to the hardware accelerator to achieve the highest speedup and best energy efficiency. We use the latest Xilinx Zynq boards for implementation and result analysis. We also perform a first-order comprehensive analysis of communication and computation overheads to understand how the speedup of each application contributes to its overall execution in an end-to-end Hadoop MapReduce environment. Moreover, we study how other system parameters, such as the choice of CPU (big vs little) and the number of mapper slots, affect the performance and power-efficiency benefits of hardware acceleration. The results show that a kernel speedup of up to 321.5× can be achieved with hardware+software co-design. This results in a 2.72× speedup, a 2.13× power reduction, and a 15.21× improvement in energy efficiency (EDP) in an end-to-end Hadoop MapReduce environment.
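The gap between the 321.5× kernel speedup and the 2.72× end-to-end speedup follows Amdahl's law: only part of the Hadoop job is offloaded. The sketch below back-solves an approximate kernel fraction from the reported figures; that fraction is an inference, not a number given in the paper.

```python
# Minimal Amdahl's-law sketch: end-to-end speedup when only a fraction of
# the runtime is accelerated. The 0.634 kernel fraction is back-solved from
# the reported 321.5x kernel and ~2.72x overall speedups, not paper data.

def end_to_end_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)

print(end_to_end_speedup(0.634, 321.5))  # ~2.72
```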
international conference on computer design | 2015
Maria Malik; Houman Homayoun
Traditional low-power embedded processors such as ARM and Atom are entering the high-performance server market. At the same time, big data analytics are emerging and dramatically changing the landscape of data center workloads. Thus, the question of whether low-power embedded architectures are suited to process big data applications efficiently is becoming important. In this work, through methodical investigation of power and performance measurements, and comprehensive system-level analysis, we demonstrate that low-power embedded architectures can provide significant energy efficiency for processing big data analytics applications.
ieee/acm international symposium cluster, cloud and grid computing | 2015
Katayoun Neshatpour; Maria Malik; Houman Homayoun
Big data applications share inherent characteristics that are fundamentally different from traditional desktop CPU, parallel and web service applications. They rely on deep machine learning and data mining algorithms. A recent trend for big data analytics is to provide heterogeneous architectures that allow support for hardware specialization, to construct the right processing engine for analytics applications. However, these specialized heterogeneous architectures require extensive exploration of design aspects to find the optimal architecture in terms of performance and cost. Considering the time dedicated to creating such specialized architectures, a model that estimates the potential speedup achievable through offloading various parts of the algorithm to specialized hardware is necessary. This paper analyzes how offloading computationally intensive kernels of machine learning algorithms to a heterogeneous CPU+FPGA platform enhances performance. We use the latest Xilinx Zynq boards for implementation and result analysis. Furthermore, we perform a comprehensive analysis of communication and computation overheads, such as data I/O movement and calls to standard libraries that cannot be offloaded to the accelerator, to understand how the speedup of each application contributes to its overall execution in an end-to-end Hadoop MapReduce environment.
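A back-of-the-envelope version of this communication/computation accounting might look like the following: an offload only pays off when the compute saving exceeds the CPU-FPGA data-movement cost. The bandwidth, speedup and data-size figures are assumptions for illustration.

```python
# Hypothetical offload model: end-to-end kernel time including CPU<->FPGA
# transfers. All numbers below are illustrative placeholders.

def offloaded_time(sw_kernel_s: float, kernel_speedup: float,
                   bytes_moved: float, link_bw_bytes_per_s: float) -> float:
    transfer_s = bytes_moved / link_bw_bytes_per_s
    return transfer_s + sw_kernel_s / kernel_speedup

sw_time = 2.0                                          # software kernel time (s)
hw_time = offloaded_time(sw_time, kernel_speedup=50.0,
                         bytes_moved=512e6,            # 512 MB moved in+out
                         link_bw_bytes_per_s=1.2e9)    # assumed DMA-class link
print("offload wins" if hw_time < sw_time else "stay in software", hw_time)
```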
field-programmable custom computing machines | 2015
Katayoun Neshatpour; Maria Malik; Mohammad Ali Ghodrat; Houman Homayoun
Emerging big data analytics applications require a significant amount of server computational power. As chips are hitting power limits, computing systems are moving away from general-purpose designs and toward greater specialization. Hardware acceleration through specialization has received renewed interest in recent years, mainly due to the dark silicon challenge. To address the computing requirements of big data, and based on the benchmarking and characterization results, we envision a data-driven heterogeneous architecture for next-generation big data server platforms that leverages the power of field-programmable gate arrays (FPGAs) to build custom accelerators in a Hadoop MapReduce framework. Unlike a full and dedicated implementation of the Hadoop MapReduce algorithm on the FPGA, we propose hardware/software (HW/SW) co-design of the algorithm, which trades some speedup for the benefit of requiring less hardware. Accounting for the communication overhead with the FPGA and other overheads involved in the Hadoop MapReduce environment, such as compression and decompression, shuffling and sorting, our experimental results show significant potential for accelerating Hadoop MapReduce machine learning kernels using the HW/SW co-design methodology.
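To make the co-design boundary concrete, here is a minimal Hadoop Streaming-style mapper in which the hot inner kernel (a k-means distance computation, one of the machine learning kernels commonly accelerated) is the part a HW/SW co-design would map to the FPGA, while parsing, shuffling, sorting and (de)compression stay in the software Hadoop stack. The data layout and centroids are hypothetical.

```python
# Sketch of a Hadoop Streaming mapper; the nearest_centroid() kernel is the
# compute-intensive candidate for FPGA offload. Input format is assumed to
# be one "x y" point per line.

import sys

CENTROIDS = [(0.0, 0.0), (5.0, 5.0)]    # illustrative fixed centroids

def nearest_centroid(x: float, y: float) -> int:
    """Hot kernel: squared-distance search over all centroids."""
    dists = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in CENTROIDS]
    return dists.index(min(dists))

for line in sys.stdin:                   # Hadoop feeds input records on stdin
    x, y = map(float, line.split())
    print(f"{nearest_centroid(x, y)}\t{x},{y}")   # emit: cluster-id, point
```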
design, automation, and test in europe | 2017
Maria Malik; Katayoun Neshatpour; Tinoosh Mohsenin; Avesta Sasan; Houman Homayoun
The rapid growth of data makes it challenging to process data efficiently using current high-performance server architectures such as big Xeon cores. Furthermore, physical design constraints, such as power and density, have become the dominant limiting factor for scaling out servers. Heterogeneous architectures that combine big Xeon cores with little Atom cores have emerged as a promising solution to enhance energy efficiency by allowing each application to run on an architecture that matches its resource needs more closely than a one-size-fits-all architecture. Therefore, the question of whether to map an application to big Xeon or little Atom cores in a heterogeneous server architecture becomes important. In this paper, we characterize Hadoop-based applications and their corresponding MapReduce tasks on big Xeon- and little Atom-based server architectures to understand how the choice of big vs little cores is affected by various parameters at the application, system and architecture levels, and by the interplay among these parameters. Furthermore, we evaluate the operational and capital costs to understand how performance, power and area constraints for big data analytics affect the choice of a big- vs little-core server as the more cost- and energy-efficient architecture.
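A simple form of this capital-vs-operational cost comparison might be sketched as follows; all prices, power draws and lifetimes are placeholders, not the paper's data.

```python
# Hypothetical total-cost-of-ownership sketch: capital cost plus energy cost
# over the deployment lifetime, at an assumed electricity price.

def total_cost_usd(capex_usd: float, avg_power_w: float, years: float,
                   usd_per_kwh: float = 0.10) -> float:
    kwh = avg_power_w / 1000.0 * 24 * 365 * years
    return capex_usd + kwh * usd_per_kwh

# Illustrative: one big-core node vs four little-core nodes assumed to
# deliver the same throughput on a given Hadoop workload.
print("big   :", total_cost_usd(capex_usd=2500, avg_power_w=250, years=3))
print("little:", total_cost_usd(capex_usd=4 * 400, avg_power_w=4 * 30, years=3))
```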
international symposium on performance analysis of systems and software | 2016
Maria Malik; Avesta Sasan; Rajiv V. Joshi; Setareh Rafatirah; Houman Homayoun
Traditional low-power embedded processors such as Atom and ARM are entering the high-performance server market. At the same time, as the size of data grows, emerging big data applications require more and more server computational power, which makes it challenging to process data energy-efficiently using current high-performance server architectures. Furthermore, physical design constraints, such as power and density, have become the dominant limiting factor for scaling out servers. Numerous big data applications rely on the Hadoop MapReduce framework to perform their analysis on large-scale datasets. Since Hadoop configuration parameters as well as architecture parameters directly affect MapReduce job performance and energy efficiency, tuning system- and architecture-level parameters is vital to maximize energy efficiency. In this work, through methodical investigation of performance and power measurements, we demonstrate how the interplay among various Hadoop configurations and system- and architecture-level parameters affects the performance and energy efficiency of various Hadoop applications.
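The tuning loop motivated above can be pictured as a small sweep over Hadoop and system knobs that keeps the most energy-efficient configuration. The `run_job` model and the knob values below are stand-ins; a real setup would launch the Hadoop job and read a power meter.

```python
# Hypothetical configuration sweep: pick the (mappers, HDFS block size,
# CPU frequency) setting with the lowest energy = power * runtime. The
# analytic run_job() model is a placeholder for real measurements.

import itertools

def run_job(mappers: int, hdfs_block_mb: int, freq_ghz: float):
    runtime_s = 600.0 / (mappers * freq_ghz) + 2048 / hdfs_block_mb
    avg_power_w = 20.0 + 8.0 * mappers * freq_ghz
    return runtime_s, avg_power_w

def energy_j(cfg):
    t, p = run_job(*cfg)
    return p * t

best = min(itertools.product([4, 8, 16],        # mapper slots
                             [64, 128, 256],    # HDFS block size (MB)
                             [1.2, 1.8, 2.4]),  # frequency (GHz)
           key=energy_j)
print("most energy-efficient (mappers, block MB, GHz):", best)
```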
international conference on computer design | 2015
Mohammad Khavari Tavana; Divya Pathak; Mohammad Hossein Hajkazemi; Maria Malik; Ioannis Savidis; Houman Homayoun
In recent years, many-core platforms have emerged to boost performance while meeting tight power constraints. Per-core dynamic voltage and frequency scaling (DVFS) maximizes energy savings while meeting the performance requirements of a given workload. Given the limited number of I/O pins and the need for finer control of voltage and frequency settings per core, there is a substantial cost in using off-chip voltage regulators. Consequently, there has been increased attention on the use of on-chip voltage regulators (OCVRs) in many-core systems. However, integrating OCVRs comes at the cost of reduced power conversion efficiency (PCE) and increased complexity in the power delivery network and in the management of the OCVRs. In this paper, the effect of PCE on the thread-to-core mapping algorithm is investigated, and the importance of a PCE-aware mapping scheme for optimizing energy efficiency is highlighted. Based on the results, up to 38% more energy savings are achieved as compared to PCE-agnostic algorithms. Moreover, the impact of core clustering granularity and process variation on the total efficiency of the system is explored. When the energy constraints are relaxed by just 10%, an effective mapping reduces the complexity of the power delivery system by allowing the use of a significantly smaller number of voltage regulators as compared to per-core OCVRs. The results provided in the paper indicate an important opportunity for system and circuit co-design to implement energy-efficient and complexity-effective platforms for a target workload.
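The PCE effect can be sketched with a textbook CMOS power model: the energy a core consumes at a DVFS point must be divided by the regulator's conversion efficiency, so a mapping that ignores PCE can prefer the wrong operating point. The power model, PCE values and workload below are illustrative, not the paper's measured data.

```python
# Hypothetical PCE-aware energy model. Dynamic power ~ C * V^2 * f;
# leakage is omitted for brevity, and all numbers are illustrative.

def core_power_w(v: float, f_ghz: float, c_eff_f: float = 1.0e-9) -> float:
    return c_eff_f * v * v * f_ghz * 1e9

def wall_energy_j(work_gcycles: float, v: float, f_ghz: float,
                  pce: float) -> float:
    """Energy drawn from the power delivery network, PCE included."""
    runtime_s = work_gcycles / f_ghz
    return core_power_w(v, f_ghz) * runtime_s / pce

# The low-voltage point wins at the core (64 J vs 121 J) but loses at the
# wall once a poor light-load regulator efficiency is applied.
print(wall_energy_j(100, v=1.1, f_ghz=2.0, pce=0.90))  # ~134 J
print(wall_energy_j(100, v=0.8, f_ghz=1.0, pce=0.45))  # ~142 J
```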
international conference on computer design | 2015
Paul Otto; Maria Malik; Nima Akhlaghi; Rebel Sequeira; Houman Homayoun; Siddhartha Sikdar
The de facto standards for embedded platforms with medium-to-low computing demands are multicore ARM processors with the Thumb ISA and Intel Atom processors with the x86 ISA. Operating these architectures in the milliwatt range while running real-time computer vision corner detection algorithms is a challenging problem. We present an analysis of power, performance and energy-efficiency measurements of Harris corner detection across a wide range of voltage and frequency settings, multicore/multithreading strategies, and compiler and application optimization parameters, to find how the interplay of these parameters affects power, performance and energy efficiency. Our measurement results on state-of-the-art embedded platforms demonstrate that a systematic cross-layer optimization at the application level (Sobel filter type, aperture size, number of image tiles), compiler level (branch prediction, function inlining) and system level (voltage and frequency setting, single-core vs multicore implementation) significantly improves the energy efficiency of corner detection while meeting its real-time performance constraints. This cross-layer optimization improves the energy efficiency of Harris corner detection on Atom and ARM by 89.5% and 87.2%, respectively.
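For reference, the application-level knobs named above (Sobel aperture size, corner neighborhood size) map directly onto OpenCV's standard Harris corner API; the sketch below shows them in Python. The input file name, knob values and threshold are placeholders, and tiling is omitted for brevity.

```python
# Minimal OpenCV Harris corner sketch exposing the application-level knobs:
# blockSize (corner neighborhood) and ksize (Sobel aperture) trade detection
# quality against compute cost. File name and values are hypothetical.

import cv2
import numpy as np

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # placeholder input
gray = np.float32(img)

response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)
corners = response > 0.01 * response.max()            # simple threshold
print("corners found:", int(corners.sum()))
```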
ieee computer society annual symposium on vlsi | 2016
Maria Malik; Farnoud Farahmand; Paul Otto; Nima Akhlaghi; Tinoosh Mohsenin; Siddhartha Sikdar; Houman Homayoun
OpenCV applications are computationally intensive computer vision tasks. The demand for low-power yet high-performance real-time processing of embedded OpenCV vision applications has led to customized implementations on state-of-the-art embedded processing platforms. Given the industry's move to heterogeneous platforms that integrate single-core or multicore CPUs with on-chip FPGA accelerators and GPU accelerators, the question of which platform and which implementation, hardware or software, is best suited for energy-efficient processing of this class of applications is becoming important. In this paper, we seek to answer this question through a detailed hardware and software implementation of OpenCV applications, methodical measurement, and comprehensive analysis of their power and performance on state-of-the-art heterogeneous embedded processing platforms. The results show that, in addition to application behavior, image size is an important factor in deciding which platform, among hardware accelerators on FPGA and software implementations on GPU and multicore CPUs, achieves the highest energy efficiency (EDP). While the hardware implementation on Zynq is shown to be the most performant and energy-efficient for image sizes of 500x500 or less, the software GPU implementation is found to be the most efficient and achieves the highest speedup for larger image sizes. In addition, while for compute-intensive vision applications the gap between FPGA, CPU and GPU narrows as image size increases, for non-intensive applications a large performance and EDP gap is observed between the studied platforms as image size increases.
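The ranking metric used here, EDP, is standard: energy multiplied by delay, so lower is better. A minimal sketch with made-up per-platform runtime/power pairs for one small-image kernel:

```python
# Energy-delay product: EDP = (P * t) * t; lower is better. The runtime and
# power pairs below are illustrative placeholders, not the paper's data.

def edp_joule_s(runtime_s: float, avg_power_w: float) -> float:
    return avg_power_w * runtime_s * runtime_s

platforms = {            # (runtime_s, avg_power_w) for one 500x500 kernel
    "Zynq FPGA": (0.020, 3.0),
    "GPU":       (0.012, 10.0),
    "CPU":       (0.050, 15.0),
}
for name, (t, p) in sorted(platforms.items(),
                           key=lambda kv: edp_joule_s(*kv[1])):
    print(f"{name}: EDP = {edp_joule_s(t, p):.2e} J*s")
```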