Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Neena Imam is active.

Publication


Featured research published by Neena Imam.


High Performance Computing and Communications | 2016

Experimental Analysis of File Transfer Rates over Wide-Area Dedicated Connections

Nageswara S. V. Rao; Qiang Liu; Satyabrata Sen; Greg Hinkel; Neena Imam; Ian T. Foster; Rajkumar Kettimuthu; Bradley W. Settlemyer; Chase Q. Wu; Daqing Yun

File transfers over dedicated connections, supported by large parallel file systems, have become increasingly important in high-performance computing and big data workflows. It remains a challenge to achieve peak rates for such transfers due to the complexities of file I/O, host, and network transport subsystems, and equally importantly, their interactions. We present extensive measurements of disk-to-disk file transfers using Lustre and XFS file systems mounted on multi-core servers over a suite of 10 Gbps emulated connections with 0–366 ms round trip times. Our results indicate that large buffer sizes and many parallel flows do not always guarantee high transfer rates. Furthermore, large variations in the measured rates necessitate repeated measurements to ensure confidence in inferences based on them. We propose a new method to efficiently identify the optimal joint file I/O and network transport parameters using a small number of measurements. We show that for XFS and Lustre with direct I/O, this method identifies configurations achieving 97% of the peak transfer rate while probing only 12% of the parameter space.
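The parameter-search idea lends itself to a small illustration. Below is a minimal Python sketch (not the paper's method) of probing only a fraction of the joint parameter space: a hypothetical measure_rate probe stands in for an actual disk-to-disk transfer, measurements are repeated to tame run-to-run variation, and a coordinate-wise sweep alternates between buffer size and flow count instead of exhaustively testing every combination.

```python
import random
import statistics

random.seed(0)

# Hypothetical probe: in a real setting this would run a disk-to-disk transfer
# over the dedicated connection and report the observed rate in Gbps.
def measure_rate(buffer_mb, flows, repeats=3):
    base = min(10.0, flows + buffer_mb / 64.0) - 0.3 * max(0, flows - 8)
    samples = [random.gauss(base, 0.5) for _ in range(repeats)]
    return statistics.mean(samples)  # repeated measurements reduce variance

buffer_sizes = [1, 4, 16, 64, 128, 256]   # MB
flow_counts  = [1, 2, 4, 8, 16, 32]       # parallel flows

# Coordinate-wise search: sweep one parameter while holding the other fixed,
# then alternate.  This probes far fewer points than the full cross product.
best_buf, best_flows = buffer_sizes[0], flow_counts[0]
for _ in range(2):
    best_buf   = max(buffer_sizes, key=lambda b: measure_rate(b, best_flows))
    best_flows = max(flow_counts,  key=lambda f: measure_rate(best_buf, f))

print(f"selected buffer={best_buf} MB, flows={best_flows}, "
      f"rate={measure_rate(best_buf, best_flows):.2f} Gbps")
```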


ACM Computing Surveys | 2016

Understanding GPU Power: A Survey of Profiling, Modeling, and Simulation Methods

Robert A. Bridges; Neena Imam; Tiffany M. Mintz

Modern graphics processing units (GPUs) have complex architectures that admit exceptional performance and energy efficiency for high-throughput applications. Although GPUs consume large amounts of power, their use for high-throughput applications facilitates state-of-the-art energy efficiency and performance. Consequently, continued development relies on understanding their power consumption. This work is a survey of GPU power modeling and profiling methods with increased detail on noteworthy efforts. As direct measurement of GPU power is necessary for model evaluation and parameter initiation, internal and external power sensors are discussed. Hardware counters, which are low-level tallies of hardware events, correlate strongly with power use and performance. Statistical correlation between power and performance counters has yielded worthwhile GPU power models, yet the complexity inherent to GPU architectures presents new hurdles for power modeling. Developments and challenges of counter-based GPU power modeling are discussed. Often building on the counter-based models, research efforts in GPU power simulation, which make power predictions from input code and hardware knowledge, provide opportunities for optimization in programming or architectural design. Noteworthy strides in power simulation for GPUs are included along with their performance or functional simulator counterparts when appropriate. Finally, possible directions for future research are discussed.
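As an illustration of the counter-based statistical modeling surveyed here, the following Python sketch fits a linear model from hardware-counter rates to measured power by least squares. The counter names and data are synthetic placeholders, not measurements from any particular GPU or from the survey itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "profiling" runs: each row is a kernel execution, each column a
# normalized hardware-counter rate (names are illustrative, not a real GPU's).
counters = ["inst_executed", "dram_reads", "dram_writes", "l2_accesses"]
X = rng.uniform(0.0, 1.0, size=(200, len(counters)))

# Synthetic measured power (W): idle power plus counter contributions plus noise.
true_coeffs = np.array([60.0, 35.0, 30.0, 15.0])
power = 45.0 + X @ true_coeffs + rng.normal(0.0, 2.0, size=200)

# Fit power ~ beta0 + sum_i beta_i * counter_i by ordinary least squares.
A = np.hstack([np.ones((X.shape[0], 1)), X])
beta, *_ = np.linalg.lstsq(A, power, rcond=None)

print(f"estimated idle power: {beta[0]:.1f} W")
for name, coef in zip(counters, beta[1:]):
    print(f"  {name}: {coef:.1f} W per unit rate")
```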


International Parallel and Distributed Processing Symposium | 2016

Dynamic Resource Management for Parallel Tasks in an Oversubscribed Energy-Constrained Heterogeneous Environment

Dylan Machovec; Bhavesh Khemka; Sudeep Pasricha; Anthony A. Maciejewski; Howard Jay Siegel; Gregory A. Koenig; Michael Wright; Marcia Hilton; Rajendra Rambharos; Neena Imam

The worth of completing parallel tasks is modeled using utility functions, which monotonically decrease with time and represent the importance and urgency of a task. These functions define the utility earned by a task at the time of its completion. The performance of such a system is measured as the total utility earned by all completed tasks over some interval of time (e.g., 24 hours). To maximize system performance when scheduling dynamically arriving parallel tasks onto a high performance computing (HPC) system that is oversubscribed and energy-constrained, we have designed, analyzed, and compared different heuristic techniques. Four utility-aware heuristics (i.e., Max Utility, Max Utility-per-Time, Max Utility-per-Resource, and Max Utility-per-Energy), three FCFS-based heuristics (Conservative Backfilling, EASY Backfilling, and FCFS with Multiple Queues), and a Random heuristic were examined in this study. A technique that is often used with the FCFS-based heuristics is the concept of a permanent reservation. We compare the performance of permanent reservations with temporary place-holders to demonstrate the advantages that place-holders can provide. We also present a novel energy filtering technique that constrains the maximum energy-per-resource used by each task. We conducted a simulation study to evaluate the performance of these heuristics and techniques in an energy-constrained oversubscribed HPC environment. With place-holders, energy filtering, and dropping tasks with low potential utility, our utility-aware heuristics are able to significantly outperform the existing FCFS-based techniques.
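A minimal Python sketch of the utility-aware scoring described above: each task's utility decays monotonically with completion time, and heuristics such as Max Utility-per-Resource and Max Utility-per-Energy rank queued tasks by a ratio. The task fields, the exponential decay, and the numbers are illustrative assumptions rather than the paper's exact model.

```python
from dataclasses import dataclass
import math

@dataclass
class Task:
    name: str
    nodes: int          # resources requested
    runtime: float      # hours
    energy: float       # kWh, illustrative
    max_utility: float  # utility if completed immediately
    decay: float        # decay rate of the utility function (1/hour)

def utility_at(task, completion_time):
    # Monotonically decreasing utility of completing the task at a given time;
    # an exponential decay is used here purely for illustration.
    return task.max_utility * math.exp(-task.decay * completion_time)

def pick_max_utility_per_resource(queue, now):
    # Greedy choice: highest utility per requested node if started right now.
    return max(queue, key=lambda t: utility_at(t, now + t.runtime) / t.nodes)

def pick_max_utility_per_energy(queue, now):
    # Same idea, but normalized by the task's energy use.
    return max(queue, key=lambda t: utility_at(t, now + t.runtime) / t.energy)

queue = [
    Task("A", nodes=64,  runtime=2.0, energy=50.0,  max_utility=100.0, decay=0.2),
    Task("B", nodes=16,  runtime=1.0, energy=20.0,  max_utility=40.0,  decay=0.5),
    Task("C", nodes=256, runtime=4.0, energy=150.0, max_utility=500.0, decay=0.1),
]

print("Max Utility-per-Resource picks:", pick_max_utility_per_resource(queue, 0.0).name)
print("Max Utility-per-Energy picks:  ", pick_max_utility_per_energy(queue, 0.0).name)
```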


Journal of Sensors | 2009

Acoustic Source Localization via Time Difference of Arrival Estimation for Distributed Sensor Networks Using Tera-Scale Optical Core Devices

Neena Imam; Jacob Barhen

For real-time acoustic source localization applications, one of the primary challenges is the considerable growth in computational complexity associated with the emergence of ever larger, active or passive, distributed sensor networks. These sensors rely heavily on battery-operated system components to achieve highly functional automation in signal and information processing. In order to keep communication requirements minimal, it is desirable to perform as much processing on the receiver platforms as possible. However, the complexity of the calculations needed to achieve accurate source localization increases dramatically with the size of sensor arrays, resulting in substantial growth of computational requirements that cannot be readily met with standard hardware. One option to meet this challenge builds upon the emergence of digital optical-core devices. The objective of this work was to explore the implementation of key building block algorithms used in underwater source localization on the optical-core digital processing platform recently introduced by Lenslet Inc. This demonstration of considerably faster signal processing capability should be of substantial significance to the design and innovation of future generations of distributed sensor networks.
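One of the building-block algorithms in question is time-difference-of-arrival (TDOA) estimation. The following Python sketch estimates a TDOA by FFT-based cross-correlation of two noisy sensor signals; the waveform, sampling rate, and delay are illustrative, and the sketch says nothing about the optical-core implementation itself.

```python
import numpy as np

fs = 48_000                              # sampling rate in Hz (illustrative)
t = np.arange(0, 0.1, 1 / fs)
rng = np.random.default_rng(1)

# A short acoustic pulse; sensor 2 receives it 60 samples (1.25 ms) after sensor 1.
pulse = np.exp(-((t - 0.02) ** 2) / 1e-6) * np.sin(2 * np.pi * 2000 * t)
true_delay = 60
s1 = pulse + 0.05 * rng.standard_normal(t.size)
s2 = np.roll(pulse, true_delay) + 0.05 * rng.standard_normal(t.size)

# Cross-correlation via FFT; the lag of the correlation peak estimates the TDOA.
n = 2 * t.size
xcorr = np.fft.ifft(np.fft.fft(s1, n) * np.conj(np.fft.fft(s2, n))).real
lag = int(np.argmax(xcorr))
if lag > n // 2:                         # map the circular index to a negative lag
    lag -= n

print(f"estimated delay of s2 relative to s1: {-lag / fs * 1e3:.2f} ms (true 1.25 ms)")
```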


2017 Annual IEEE International Systems Conference (SysCon) | 2017

On defense strategies for system of systems using aggregated correlations

Nageswara S. V. Rao; Neena Imam; Chris Y. T. Ma; Kjell Hausken; Fei He; Jun Zhuang

We consider a System of Systems (SoS) wherein each system S_i, i = 1, 2, ..., N, is composed of discrete cyber and physical components which can be attacked and reinforced. We characterize the disruptions using aggregate failure correlation functions given by the conditional failure probability of SoS given the failure of an individual system. We formulate the problem of ensuring the survival of SoS as a game between an attacker and a provider, each with a utility function composed of a survival probability term and a cost term, both expressed in terms of the number of components attacked and reinforced. The survival probabilities of systems satisfy simple product-form, first-order differential conditions, which simplify the Nash Equilibrium (NE) conditions. We derive the sensitivity functions that highlight the dependence of SoS survival probability at NE on cost terms, correlation functions, and individual system survival probabilities. We apply these results to a simplified model of distributed cloud computing infrastructure.
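To make the game structure concrete, here is a toy Python sketch (not the paper's model) of an attacker-provider game in which each utility combines a survival-probability term and a cost term, SoS survival is a product of per-system survival probabilities, and pure-strategy Nash equilibria are found by exhaustively checking mutual best responses. The survival function and cost values are placeholders.

```python
import itertools
from math import prod

# Toy two-system model, not the paper's: x_i components attacked and y_i
# reinforced on system i, with a placeholder per-system survival probability
# that decreases in x_i and increases in y_i.
def system_survival(x, y):
    return (1.0 + y) / (1.0 + y + 2.0 * x)

def sos_survival(xs, ys):
    # Systems treated as independent here; the paper additionally models
    # aggregate failure correlations between system failures and SoS failure.
    return prod(system_survival(x, y) for x, y in zip(xs, ys))

COST_ATTACK, COST_DEFEND = 0.15, 0.05    # cost per component attacked / reinforced
choices = list(itertools.product(range(4), repeat=2))   # 0..3 components per system

def attacker_utility(xs, ys):
    return (1.0 - sos_survival(xs, ys)) - COST_ATTACK * sum(xs)

def provider_utility(xs, ys):
    return sos_survival(xs, ys) - COST_DEFEND * sum(ys)

# Pure-strategy Nash equilibria: strategy pairs that are mutual best responses.
for xs in choices:
    for ys in choices:
        if attacker_utility(xs, ys) >= max(attacker_utility(c, ys) for c in choices) and \
           provider_utility(xs, ys) >= max(provider_utility(xs, c) for c in choices):
            print(f"NE: attack {xs}, defense {ys}, SoS survival {sos_survival(xs, ys):.3f}")
```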


Archive | 2009

Sensor Data Processing for Tracking Underwater Threats Using Tera-Scale Optical Core Devices

Jacob Barhen; Neena Imam

A critical aspect of littoral surveillance (including port protection) involves the localization and tracking of underwater threats such as manned or unmanned autonomous underwater vehicles. In this article, we present a methodology for locating underwater threat sources from uncertain sensor network data, and illustrate the threat tracking aspects using active sonars in a matched filter framework. The novelty of the latter paradigm lies in its implementation on a tera-scale optical core processor, EnLight™, recently introduced by Lenslet Laboratories. This processor is optimized for array operations, which it performs in a fixed point arithmetic architecture at tera-scale throughput. Using the EnLight™ 64α prototype processor, our results (i) illustrate the ability to achieve robust tracking accuracy, and (ii) demonstrate that a considerable speed-up (a factor of over 13,000) can be achieved when compared to an Intel Xeon™ processor in the computation of sets of 80K-sample complex Fourier transforms that are associated with our matched filter techniques.
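The core array operation here is FFT-based matched filtering. A minimal NumPy sketch follows: it correlates a received signal against a chirp template in the frequency domain and reads the echo delay off the correlation peak. The waveform and sizes are illustrative (the article works with 80K-sample complex transforms on the optical processor), and the sketch does not touch the optical-core implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 8192                     # illustrative; the article uses 80K-sample complex transforms

# Transmitted waveform: a linear FM chirp, a common active-sonar choice.
t = np.arange(N) / N
template = np.exp(1j * 2 * np.pi * (200 * t + 400 * t ** 2))

# Received signal: a weak echo delayed by 1500 samples, buried in complex noise.
received = 0.2 * np.roll(template, 1500) \
    + rng.standard_normal(N) + 1j * rng.standard_normal(N)

# Matched filter in the frequency domain: correlate the signal with the template.
mf_output = np.fft.ifft(np.fft.fft(received) * np.conj(np.fft.fft(template)))
print("estimated echo delay:", int(np.argmax(np.abs(mf_output))), "samples (true 1500)")
```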


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2017

High-Performance Key-Value Store on OpenSHMEM

Huansong Fu; Manjunath Gorentla Venkata; Ahana Roy Choudhury; Neena Imam; Weikuan Yu

Recently, there has been a growing interest in enabling fast data analytics by leveraging system capabilities from large-scale high-performance computing (HPC) systems. OpenSHMEM is a popular run-time system on HPC systems that has been used for large-scale compute-intensive scientific applications. In this paper, we propose to leverage OpenSHMEM to design a distributed in-memory key-value store for fast data analytics. Accordingly, we have developed SHMEMCache on top of OpenSHMEM to leverage its symmetric global memory, efficient one-sided communication operations and general portability. We have also evaluated SHMEMCache through extensive experimental studies. Our results show that SHMEMCache has accomplished significant performance improvements over the original Memcached in terms of latency and throughput. Our evaluation on the Titan supercomputer has also demonstrated that SHMEMCache can scale to 1024 nodes.
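A conceptual sketch of the partitioned key-value idea, in plain Python rather than the OpenSHMEM API: keys are hash-partitioned across processing elements, and set/get act directly on the owner's region, analogous to the one-sided put/get operations SHMEMCache relies on. The data structures and function names below are illustrative stand-ins, not SHMEMCache's design.

```python
import zlib
from typing import Optional

NPES = 4   # number of processing elements (nodes) in this toy model

# One "symmetric region" per PE; here just a dict.  In SHMEMCache this is
# symmetric heap memory that any PE can reach with one-sided put/get.
symmetric_heaps = [dict() for _ in range(NPES)]

def owner(key: str) -> int:
    # Hash-partition the key space across PEs.
    return zlib.crc32(key.encode()) % NPES

def kv_set(key: str, value: bytes) -> None:
    # Stands in for a one-sided put into the owning PE's region.
    symmetric_heaps[owner(key)][key] = value

def kv_get(key: str) -> Optional[bytes]:
    # Stands in for a one-sided get from the owning PE's region.
    return symmetric_heaps[owner(key)].get(key)

kv_set("user:42", b"neena")
print(kv_get("user:42"), "stored on PE", owner("user:42"))
```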


Second Workshop on OpenSHMEM and Related Technologies: Experiences, Implementations, and Technologies (OpenSHMEM 2015), Revised Selected Papers, Volume 9397 | 2015

Graph 500 in OpenSHMEM

Eduardo F. D'Azevedo; Neena Imam

This document describes the effort to implement the Graph 500 benchmark using OpenSHMEM, based on the MPI-2 one-sided version. The Graph 500 benchmark performs a breadth-first search in parallel on a large randomly generated undirected graph and can be implemented using basic MPI-1 and MPI-2 one-sided communication. Graph 500 requires atomic bit-wise operations on unsigned long integers, but neither atomic bit-wise operations nor support for unsigned long integers is available in OpenSHMEM. The needed bit-wise atomic operations and unsigned long support are implemented using the atomic conditional swap (CSWAP) operation on signed long integers. Preliminary results comparing the OpenSHMEM and MPI-2 one-sided implementations on a Silicon Graphics, Inc. (SGI) cluster and the Cray XK7 are presented.
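The workaround described, building atomic bit-wise operations out of a conditional swap, follows a standard retry pattern. The Python sketch below illustrates it with a lock-based cswap standing in for OpenSHMEM's CSWAP; only the read-modify-retry loop is the point of the example.

```python
import threading

class Cell:
    """A shared word exposing only a conditional-swap primitive."""
    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def cswap(self, expected, new):
        # Atomically: if the word equals `expected`, store `new`; return the old value.
        with self._lock:
            old = self.value
            if old == expected:
                self.value = new
            return old

def atomic_set_bit(cell, bit):
    # Bit-wise atomic OR built from CSWAP: read, compute the new word, and
    # retry until no other thread has modified the word in between.
    while True:
        old = cell.value
        new = old | (1 << bit)
        if cell.cswap(old, new) == old:
            return

word = Cell()
threads = [threading.Thread(target=atomic_set_bit, args=(word, b)) for b in range(16)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(bin(word.value))   # expect 0b1111111111111111
```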


international parallel and distributed processing symposium | 2016

Performance Models for Split-Execution Computing Systems

Travis S. Humble; Alexander J. McCaskey; Jonathan Schrock; Hadayat Seddiqi; Keith A. Britt; Neena Imam

Split-execution computing leverages the capabilities of multiple computational models to solve problems, but splitting program execution across different computational models incurs costs associated with the translation between domains. We analyze the performance of a split-execution computing system developed from conventional and quantum processing units (QPUs) by using behavioral models that track resource usage. We focus on asymmetric processing models built using conventional CPUs and a family of special-purpose QPUs that employ quantum computing principles. Our performance models account for the translation of a classical optimization problem into the physical representation required by the quantum processor while also accounting for hardware limitations and conventional processor speed and memory. We conclude that the bottleneck in this split-execution computing system lies at the quantum-classical interface and that the primary time cost is independent of quantum processor behavior.
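A schematic timing model in the spirit of this analysis (the cost breakdown and numbers are invented for illustration, not taken from the paper): total time splits into interface costs that grow with problem size and quantum execution costs, and under these assumptions the interface term dominates.

```python
def split_execution_time(n_vars,
                         translate_per_var=2e-4,  # classical problem -> QPU form (s per variable)
                         embed_per_var=1e-3,      # mapping onto the hardware graph (s per variable)
                         program_time=5e-3,       # loading the problem onto the QPU (s)
                         anneal_time=2e-5,        # one anneal (s)
                         num_reads=1000,
                         readout_per_read=1e-4):  # reading back one sample (s)
    """Additive toy model: classical-quantum interface cost vs. quantum execution cost."""
    interface = n_vars * (translate_per_var + embed_per_var) + program_time
    quantum = num_reads * (anneal_time + readout_per_read)
    return interface, quantum

interface, quantum = split_execution_time(n_vars=1000)
print(f"interface: {interface:.2f} s   quantum execution: {quantum:.2f} s")
```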


Superconductor Science and Technology | 2016

Memory cell operation based on small Josephson junction arrays

Yehuda Braiman; N. Nair; J. Rezac; Neena Imam

In this paper we analyze a cryogenic memory cell circuit based on a small coupled array of Josephson junctions. All the basic memory operations (e.g., write, read, and reset) are implemented on the same circuit, and different junctions in the array can in principle be utilized for these operations. The presented memory operation paradigm is fundamentally different from conventional single flux quantum (SFQ) logic. As an example, we demonstrate memory operation driven by an SFQ pulse employing an inductively coupled array of three Josephson junctions. We have chosen realistic Josephson junction parameters based on state-of-the-art fabrication capabilities and have calculated access times and access energies for basic memory cell operations. We also implemented an optimization procedure based on the simulated annealing algorithm to calculate the optimized and typical values of access times and access energies.
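The optimization step mentioned at the end can be illustrated with a generic simulated-annealing loop, sketched below in Python. The objective function and parameter ranges are placeholders standing in for a circuit simulation that returns access time or energy for a given set of junction parameters; this is not the paper's circuit model.

```python
import math
import random

random.seed(0)

def objective(params):
    # Placeholder cost standing in for a circuit simulation that would return
    # an access time or access energy for a given set of junction parameters.
    x, y = params
    return (x - 0.7) ** 2 + (y - 1.3) ** 2 + 0.05 * math.sin(20 * x) ** 2

def simulated_annealing(start, steps=5000, t0=1.0, cooling=0.999, step_size=0.05):
    current, best = list(start), list(start)
    current_cost = best_cost = objective(start)
    temperature = t0
    for _ in range(steps):
        candidate = [p + random.gauss(0, step_size) for p in current]
        cost = objective(candidate)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if cost < current_cost or random.random() < math.exp((current_cost - cost) / temperature):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = list(candidate), cost
        temperature *= cooling
    return best, best_cost

best, cost = simulated_annealing([0.0, 0.0])
print("optimized parameters:", [round(p, 3) for p in best], " cost:", round(cost, 4))
```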

Collaboration


Dive into Neena Imam's collaborations.

Top Co-Authors

Jacob Barhen (Oak Ridge National Laboratory)
Travis S. Humble (Oak Ridge National Laboratory)
Nageswara S. V. Rao (Oak Ridge National Laboratory)
Yehuda Braiman (Oak Ridge National Laboratory)
Charlotte Kotas (Oak Ridge National Laboratory)
Dylan Machovec (Colorado State University)
Gregory A. Koenig (Oak Ridge National Laboratory)