Publication


Featured research published by Gregory B. Newby.


Southern Conference on Programmable Logic | 2007

Comparative Analysis of High Level Programming for Reconfigurable Computers: Methodology and Empirical Study

Esam El-Araby; Mohamed Taher; Mohamed Abouellail; Tarek A. El-Ghazawi; Gregory B. Newby

Most application developers are willing to give up some performance and chip utilization in exchange for productivity. High-level tools for developing reconfigurable computing applications trade performance for ease of use. However, it is hard to know, in a general sense, how much performance and utilization is given up and how much ease of use is gained. More importantly, given the lack of standards and the uncertainty generated by sales literature, it is very hard to know the real differences among different high-level programming paradigms. Establishing those differences requires a formal methodology and/or a framework that applies a common set of metrics and common experiments to a number of representative tools. In this work, we consider three representative high-level tools, Impulse-C, Mitrion-C, and DSPLogic, in the Cray XD1 environment. These tools were selected to represent imperative programming, functional programming, and graphical programming, and thereby demonstrate the applicability of our methodology. It will be shown that, in spite of the disparity in concepts behind these tools, our methodology is able to formally uncover the basic differences among them and analytically assess their comparative performance, utilization, and ease of use.
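
As a rough illustration of the common-metric idea, the sketch below tabulates hypothetical speedup, utilization, and ease-of-use figures for the three tools and derives a simple performance-per-development-hour ratio; all numbers and the productivity proxy are invented for illustration and are not results from the study.

    # Hypothetical comparison table: speedup vs. a software baseline,
    # FPGA slice utilization, and an ease-of-use proxy (development time,
    # lines of code). All values are placeholders for illustration only.
    tools = {
        #            speedup  utilization  dev_hours  lines_of_code
        "Impulse-C": (12.0,    0.45,        40,        600),
        "Mitrion-C": (10.0,    0.55,        30,        400),
        "DSPLogic":  ( 8.0,    0.35,        20,        150),
    }

    def productivity(speedup, dev_hours):
        # One possible ease-of-use metric: performance gained per hour spent.
        return speedup / dev_hours

    for name, (speedup, util, hours, loc) in tools.items():
        print(f"{name:10s} speedup={speedup:5.1f}x util={util:4.0%} "
              f"dev={hours:3d}h loc={loc:4d} perf/hour={productivity(speedup, hours):.2f}")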


International Parallel and Distributed Processing Symposium | 2007

Experimental Evaluation of Emerging Multi-core Architectures

Abdullah Kayi; Yiyi Yao; Tarek A. El-Ghazawi; Gregory B. Newby

The trend of increasing speed and complexity in single-core processors, as described by Moore's law, is facing practical challenges. As a result, the multi-core processor architecture has emerged as the dominant architecture for both desktop and high-performance systems. Multi-core systems introduce many challenges that need to be addressed to achieve the best performance. Therefore, a new set of benchmarking techniques to study the impact of multi-core technologies is necessary. In this paper, multi-core specific performance metrics for cache coherency and memory bandwidth/latency/contention are investigated. This study also proposes a new benchmarking suite which includes cases extended from the High Performance Computing Challenge (HPCC) benchmark suite. Performance results are measured on a Sun Fire T1000 server with six cores and an AMD Opteron dual-core system. Experimental analysis and observations in this paper provide for a better understanding of the emerging multi-core architectures.
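
The sketch below illustrates, in Python, the shape of one such micro-benchmark: a STREAM-triad-style bandwidth probe run with one and with several processes so that memory contention becomes visible. It is only a rough host-side approximation, not the HPCC-derived suite used in the paper, and the array sizes are arbitrary.

    # Rough sketch of a STREAM-triad-style probe: measure aggregate memory
    # bandwidth with 1 process and with N processes to expose contention on
    # shared memory paths. Not the HPCC benchmark itself.
    import time
    import numpy as np
    from multiprocessing import Pool

    N = 10_000_000          # elements per array (~80 MB per float64 array)
    REPS = 5

    def triad_bandwidth(_idx):
        a = np.zeros(N)
        b = np.random.rand(N)
        c = np.random.rand(N)
        t0 = time.perf_counter()
        for _ in range(REPS):
            a[:] = b + 2.0 * c          # triad: a = b + scalar * c
        dt = time.perf_counter() - t0
        bytes_moved = 3 * 8 * N * REPS  # read b, read c, write a (ignores the temporary)
        return bytes_moved / dt / 1e9   # GB/s per process

    if __name__ == "__main__":
        for procs in (1, 2, 4):
            with Pool(procs) as pool:
                rates = pool.map(triad_bandwidth, range(procs))
            print(f"{procs} proc(s): {sum(rates):6.1f} GB/s aggregate")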


Simulation Modelling Practice and Theory | 2009

Performance issues in emerging homogeneous multi-core architectures

Abdullah Kayi; Tarek A. El-Ghazawi; Gregory B. Newby

Multi-core architectures have emerged as the dominant architecture for both desktop and high-performance systems. Multi-core systems introduce many challenges that need to be addressed to achieve the best performance. Therefore, benchmarking of these processors is necessary to identify the possible performance issues. In this paper, a broad range of homogeneous multi-core architectures is investigated in terms of essential performance metrics. To measure performance, we used micro-benchmarks from the High-Performance Computing Challenge (HPCC) suite, the NAS Parallel Benchmarks (NPB), LMbench, and an FFT benchmark. Performance analysis is conducted on multi-core systems from the UltraSPARC and x86 architectures, including systems based on Conroe, Kentsfield, Clovertown, Santa Rosa, Barcelona, Niagara, and Victoria Falls processors. The effect of multi-core architectures on cluster performance is also examined using a Clovertown-based cluster. Finally, cache coherence overhead is analyzed using a full-system simulator. Experimental analysis and observations in this study provide for a better understanding of the emerging homogeneous multi-core systems.
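
As a small illustration of the kind of micro-benchmark listed above, the Python sketch below times 1-D FFTs over a sweep of problem sizes and converts the timings to an approximate flop rate; it is not the FFT benchmark used in the study, and the size range is arbitrary.

    # Illustrative FFT timing sweep across problem sizes, in the spirit of the
    # FFT micro-benchmark mentioned above (not the actual benchmark used).
    import time
    import numpy as np

    for exp in range(16, 23):            # 64 Ki ... 4 Mi complex points
        n = 1 << exp
        x = np.random.rand(n) + 1j * np.random.rand(n)
        t0 = time.perf_counter()
        np.fft.fft(x)
        dt = time.perf_counter() - t0
        # 5 * n * log2(n) is the usual flop-count model for a complex FFT
        gflops = 5 * n * exp / dt / 1e9
        print(f"n=2^{exp:2d}  {dt*1e3:8.2f} ms  {gflops:6.2f} GFLOP/s")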


Computational Science and Engineering | 2008

Application Performance Tuning for Clusters with ccNUMA Nodes

Abdullah Kayi; Edward Kornkven; Tarek A. El-Ghazawi; Gregory B. Newby

With the increasing trend of putting more cores inside a single chip, more clusters adopt multicore multiprocessor nodes for high-performance computing (HPC). Cache coherent non-uniform memory access (ccNUMA) architectures are becoming an increasingly popular choice for such systems. In this paper, application performance analysis is provided using a 2312-core Opteron system based on Sun Fire servers. Performance bottlenecks are identified and some potential solutions are proposed. With the proposed performance tunings, up to 30% application performance improvement was observed. In addition, the experimental analysis provided can be utilized by HPC application developers to better understand clusters with ccNUMA nodes, and also as a guideline for the usage of such architectures for scientific computing.
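
One tuning in this spirit is to keep memory accesses node-local. The Linux-only Python sketch below pins the process to one node's cores before the data is first written, so that first-touch page placement lands on local memory; the core IDs are placeholders, and the real node-to-core mapping must be taken from the system.

    # Linux-only sketch: pin the process to one node's cores *before* the data
    # is first written, so first-touch page placement lands on local memory.
    # The core set below is a placeholder; query the real topology first
    # (e.g. numactl --hardware or /sys/devices/system/node).
    import os
    import numpy as np

    NODE0_CORES = {0, 1, 2, 3}           # hypothetical cores of NUMA node 0

    os.sched_setaffinity(0, NODE0_CORES) # restrict this process to node 0

    data = np.empty(50_000_000)          # allocation alone does not place pages
    data[:] = 0.0                        # first touch happens here, on node 0

    # ... compute on `data` from the same node to avoid remote-memory accesses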


Computing Frontiers | 2010

Efficient cache design for solid-state drives

Miaoqing Huang; Olivier Serres; Vikram K. Narayana; Tarek A. El-Ghazawi; Gregory B. Newby

Solid-State Drives (SSDs) are data storage devices that use solid-state memory to store persistent data. Flash memory is the de facto nonvolatile technology used in most SSDs. It is well known that the writing performance of flash-based SSDs is much lower than the reading performance due to the fact that a flash page can be written only after it is erased. In this work, we present an SSD cache architecture designed to provide a balanced read/write performance for flash memory. An efficient automatic updating technique is proposed to provide a more responsive SSD architecture by writing back stable but dirty flash pages according to a predetermined set of policies during the SSD device idle time. Those automatic updating policies are also tested and compared. Simulation results demonstrate that both reading and writing performance are improved significantly by incorporating the proposed cache with automatic updating feature into SSDs.
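
The toy Python model below sketches the automatic-updating idea under simplified assumptions: a write-back page cache whose idle handler flushes pages that are dirty but have not been written for a while. The class names, the age-based stability test, and the threshold are invented for illustration and are not the policies evaluated in the paper.

    # Toy model of idle-time automatic updating: a write-back cache that,
    # when the drive is idle, flushes pages that are dirty but "stable"
    # (not written for a while). Policy and thresholds are illustrative only.
    import time

    class CachedPage:
        def __init__(self, data):
            self.data = data
            self.dirty = False
            self.last_write = 0.0

    class SSDCache:
        STABLE_AGE = 0.5                     # seconds since last write

        def __init__(self, flash_writer):
            self.pages = {}                  # page number -> CachedPage
            self.flash_writer = flash_writer # callback that programs flash

        def write(self, page_no, data):
            page = self.pages.setdefault(page_no, CachedPage(data))
            page.data = data
            page.dirty = True
            page.last_write = time.monotonic()

        def on_idle(self):
            # Called when no host I/O is pending: write back stable dirty pages
            now = time.monotonic()
            for page_no, page in self.pages.items():
                if page.dirty and now - page.last_write >= self.STABLE_AGE:
                    self.flash_writer(page_no, page.data)
                    page.dirty = False

    cache = SSDCache(lambda n, d: print(f"flash program page {n}"))
    cache.write(7, b"hello")
    time.sleep(0.6)
    cache.on_idle()                          # page 7 is stable and dirty -> flushed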


High Performance Computing and Communications | 2008

Performance Evaluation of Clusters with ccNUMA Nodes - A Case Study

Abdullah Kayi; Edward Kornkven; Tarek A. El-Ghazawi; Samy Al-Bahra; Gregory B. Newby

In the quest for higher performance, and with the increasing availability of multicore chips, many systems are currently packing more processors per node. Adopting a ccNUMA node architecture in these cases has the promise of achieving a balance between cost and performance. In this paper, a 2312-core Opteron system based on Sun Fire servers is considered as a case study to examine the performance issues associated with such architectures. In this work, we characterize the performance behavior of the system with a focus on the node level using different configurations. It will be shown that the benefits from larger nodes can be severely limited for many reasons. These reasons were isolated and the associated performance losses were assessed. The results revealed that such problems were mainly caused by topological imbalances, limitations of the cache coherency protocol used, the distribution of operating system services, and the lack of intelligent management of memory affinity.
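
On Linux-based ccNUMA systems, the kind of topological imbalance mentioned above can be made visible by inspecting the node distance matrix, as the short sketch below does; this is only a generic illustration using the standard /sys interface, not the methodology of the case study.

    # Linux-only sketch: print the NUMA node distance matrix, which makes
    # topological imbalances between nodes visible (larger distance = more
    # remote memory). Uses the standard /sys interface.
    import glob

    for path in sorted(glob.glob("/sys/devices/system/node/node*/distance")):
        node = path.split("/")[-2]
        with open(path) as f:
            print(node, f.read().split())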


Intelligence and Security Informatics | 2003

Secure information sharing and information retrieval infrastructure with GridIR

Gregory B. Newby; Kevin Gamiel

This poster describes the emerging standard for information retrieval on computational grids, GridIR. GridIR is based on the work of the Global Grid Forum (GGF). GridIR implements a multi-tiered security model at the collection, query, and datum levels. Unlike large monolithic search engines and customized small-scale systems, GridIR provides a standard method for federating data sets with multiple data types. Three main components make up GridIR. The components can exist on any computer on the computational grid with sufficient computational resources, software, and access permissions to join a Virtual Organization (VO).
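
The hypothetical sketch below illustrates the multi-tiered idea: access is checked at the collection, query, and datum levels before any result is released. All class and function names are invented for illustration; GridIR itself is defined at the protocol level, not by this API.

    # Hypothetical illustration of a multi-tiered security check in the spirit
    # of GridIR's collection / query / datum levels. All names are invented;
    # GridIR itself defines protocols, not this API.
    from dataclasses import dataclass

    @dataclass
    class Datum:
        text: str
        allowed_roles: set

    @dataclass
    class Collection:
        name: str
        allowed_roles: set
        data: list

    def search(collection, query, user_roles, query_allowed_roles):
        if not collection.allowed_roles & user_roles:          # collection level
            raise PermissionError("no access to collection")
        if not query_allowed_roles & user_roles:               # query level
            raise PermissionError("query type not permitted")
        return [d.text for d in collection.data
                if query in d.text and d.allowed_roles & user_roles]  # datum level

    docs = Collection("physics", {"member"}, [Datum("ionosphere models", {"member"})])
    print(search(docs, "ionosphere", {"member"}, {"member"}))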


International Journal of Reconfigurable Computing | 2010

Parameterized hardware design on reconfigurable computers: an image processing case study

Miaoqing Huang; Olivier Serres; Tarek A. El-Ghazawi; Gregory B. Newby

Reconfigurable Computers (RCs) with hardware (FPGA) co-processors can achieve significant performance improvement compared with traditional microprocessor (µP)-based computers for many scientific applications. The potential amount of speedup depends on the intrinsic parallelism of the target application as well as the characteristics of the target platform. In this work, we use image processing applications as a case study to demonstrate how hardware designs are parameterized by the co-processor architecture, particularly the data I/O, i.e., the local memory of the FPGA device and the interconnect between the FPGA and the µP. The local memory has to be used by applications that access data randomly. A typical case belonging to this category is image registration. On the other hand, an application such as edge detection can directly read data through the interconnect in a sequential fashion. Two different algorithms of image registration, the exhaustive search algorithm and the Discrete Wavelet Transform (DWT)-based search algorithm, are implemented on hardware, i.e., the Xilinx Virtex-II Pro 50 on the Cray XD1 reconfigurable computer. The performance improvements of the hardware implementations are 10× and 2×, respectively. Regarding the category of applications that directly access the interconnect, the hardware implementation of Canny edge detection can achieve 544× speedup.
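
The CPU-side sketch below shows why exhaustive-search registration falls into the random-access category: every candidate shift re-reads a different window of the reference image. It is a plain software illustration of the access pattern, not the hardware design, and the image sizes and shift range are arbitrary.

    # CPU-side sketch of exhaustive-search registration: score every candidate
    # shift with a sum of absolute differences. The repeated, shift-dependent
    # reads of the reference image are the random-access pattern that pushes
    # such kernels into the FPGA's local memory. Illustration only.
    import numpy as np

    def register(reference, template, max_shift=8):
        best, best_shift = float("inf"), (0, 0)
        h, w = template.shape
        for dy in range(-max_shift, max_shift + 1):
            for dx in range(-max_shift, max_shift + 1):
                window = reference[max_shift + dy:max_shift + dy + h,
                                   max_shift + dx:max_shift + dx + w]
                score = np.abs(window - template).sum()
                if score < best:
                    best, best_shift = score, (dy, dx)
        return best_shift

    rng = np.random.default_rng(0)
    ref = rng.random((64, 64))
    tmpl = ref[8 + 3:8 + 3 + 48, 8 - 2:8 - 2 + 48]   # 48x48 patch shifted by (3, -2)
    print(register(ref, tmpl))                        # expect (3, -2)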


ACS/IEEE International Conference on Computer Systems and Applications | 2009

Hardware acceleration prospects and challenges for high performance computing

Gregory B. Newby

High performance computing (HPC) has often benefited from special-purpose hardware. This paper examines the potential roles for several different approaches to hardware acceleration that are currently being deployed in HPC systems. Because each technology has different performance characteristics, as well as practical considerations (such as electrical consumption, physical interface, and cost), a match of these characteristics to the desired HPC workload is desirable. Technologies discussed include multicore processors, chip multithreading, graphics processing units, field-programmable gate arrays, Cell processors, and vector processors.


Hawaii International Conference on System Sciences | 2010

Instability of Relevance-Ranked Results Using Latent Semantic Indexing for Web Search

Houssain Kettani; Gregory B. Newby

The latent semantic indexing (LSI) methodology for information retrieval applies the singular value decomposition to identify an eigensystem for a large matrix, in which cells represent the occurrence of terms (words) within documents. This methodology is used to rank text documents, such as Web pages or abstracts, based on their relevance to a topic. LSI was introduced to address the issues of synonymy (different words with the same meaning) and polysemy (the same word with multiple meanings), thus addressing the ambiguity in human language by utilizing the statistical context of words. Rather than keeping all k possible eigenvectors and eigenvalues from the singular value decomposition, which approximates the original term-by-document matrix, a smaller number is used, essentially allowing a fuzzy match of a topic to the original term-by-document matrix. In this paper, we show that the choice of k impacts the resultant ranking and that there is no value of k that results in stability of ranked results for similarity of the topic to documents. This is a surprising result, because prior literature indicates that eigensystems based on successively larger values of k should approximate the complete (max k) eigensystem. The finding that document-query similarity rankings with larger values of k do not, in fact, maintain consistency makes it difficult to assert that any particular value of k is optimal. This in turn renders LSI potentially untrustworthy for use in ranking text documents, even for values that differ by only 10% of the max k.
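
The Python sketch below reproduces the core of the experiment on a made-up corpus: truncate the SVD at several values of k, rank documents by cosine similarity to a query in the reduced space, and compare the resulting orderings. The matrix and query are random placeholders; only the procedure mirrors the paper.

    # Sketch of the core LSI experiment: truncate the SVD at several k and
    # compare the document rankings produced for the same query. The tiny
    # term-document matrix below is made up purely for illustration.
    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.integers(0, 3, size=(50, 12)).astype(float)   # terms x documents
    query = rng.integers(0, 2, size=50).astype(float)     # query term vector

    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    def rank_documents(k):
        # Project documents and the query into the k-dimensional latent space.
        docs_k = np.diag(s[:k]) @ Vt[:k, :]                # k x n_docs
        query_k = U[:, :k].T @ query                       # k
        sims = (query_k @ docs_k) / (
            np.linalg.norm(query_k) * np.linalg.norm(docs_k, axis=0) + 1e-12)
        return np.argsort(-sims)                           # best-first document order

    for k in (2, 4, 8, 12):
        print(f"k={k:2d}  ranking: {rank_documents(k)}")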

Collaboration


Dive into Gregory B. Newby's collaborations.

Top Co-Authors

Tarek A. El-Ghazawi, George Washington University
Abdullah Kayi, George Washington University
Christopher T. Fallen, University of Alaska Fairbanks
Olivier Serres, George Washington University
Samy Al-Bahra, George Washington University
Houssain Kettani, Polytechnic University of Puerto Rico
Esam El-Araby, George Washington University
Maria Malik, George Washington University
Mohamed Abouellail, George Washington University