Publication


Featured research published by Matthew Poremba.


IEEE Computer Society Annual Symposium on VLSI | 2012

NVMain: An Architectural-Level Main Memory Simulator for Emerging Non-volatile Memories

Matthew Poremba; Yuan Xie

Emerging non-volatile memory (NVM) technologies, such as PCRAM and STT-RAM, have demonstrated great potential as candidates to replace DRAM-based main memory in computer systems. It is important for computer architects to model such emerging memory technologies at the architecture level, to understand their benefits and limitations and better utilize them to improve the performance, energy, and reliability of future computing systems. In this paper, we introduce an architecture-level simulator called NVMain, which can model main memory designs with both DRAM and emerging non-volatile memory technologies, and which helps designers perform design space explorations with these emerging memory technologies. We discuss design points of the simulator and provide validation of the model, along with case studies on using the tool for design space exploration.
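
The core of any architecture-level memory model is a set of per-technology timing parameters that determine access latency. The sketch below illustrates that idea only; the class, parameter names, and numbers are hypothetical and are not NVMain's actual API or validated timings.

```python
# Hypothetical sketch of an architecture-level memory timing model in the
# spirit of NVMain: per-technology timing parameters drive access latency.
from dataclasses import dataclass

@dataclass
class MemTiming:
    tRCD: int  # row-to-column delay (cycles)
    tCAS: int  # column access latency (cycles)
    tWR: int   # write recovery (cycles)

# Illustrative numbers only; real values come from device datasheets.
DRAM = MemTiming(tRCD=14, tCAS=14, tWR=15)
PCRAM = MemTiming(tRCD=48, tCAS=14, tWR=180)  # slow, asymmetric writes

def access_latency(t: MemTiming, is_write: bool, row_hit: bool) -> int:
    """Cycles for one access: row activation is skipped on a row-buffer hit."""
    lat = 0 if row_hit else t.tRCD
    return lat + (t.tWR if is_write else t.tCAS)
```

Swapping in a different `MemTiming` is enough to compare technologies at this level of abstraction, e.g. `access_latency(PCRAM, True, False)` exposes the large write penalty of PCRAM relative to DRAM.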


IEEE Computer Architecture Letters | 2015

NVMain 2.0: A User-Friendly Memory Simulator to Model (Non-)Volatile Memory Systems

Matthew Poremba; Tao Zhang; Yuan Xie

In this letter, a flexible memory simulator, NVMain 2.0, is introduced to help the community model not only commodity DRAM but also emerging memory technologies, such as die-stacked DRAM caches, non-volatile memories (e.g., STT-RAM, PCRAM, and ReRAM) including multi-level cells (MLC), and hybrid non-volatile plus DRAM memory systems. Compared to existing memory simulators, NVMain 2.0 features a flexible user interface with compelling simulation speed and the capability of modeling sub-array-level parallelism, fine-grained refresh, MLC and data encoders, and distributed energy profiling.


International Conference on Computer-Aided Design | 2010

Evaluation of using inductive/capacitive-coupling vertical interconnects in 3D network-on-chip

Jin Ouyang; Jing Xie; Matthew Poremba; Yuan Xie

In recent 3D IC studies, through-silicon vias (TSVs) are usually employed as the vertical interconnects in the 3D stack. Despite their benefits of short latency and low power, forming TSVs adds complexity to the fabrication process. Recently, inductive/capacitive-coupling links have been proposed to replace TSVs in 3D stacking because their fabrication complexity is lower. Although state-of-the-art inductive/capacitive-coupling links show bandwidth and power comparable to TSVs, the relatively large footprints of these links compromise their area efficiency. In this work, we study the design of 3D network-on-chip (NoC) using inductive/capacitive-coupling links. We propose three techniques to mitigate the area overhead introduced by these links: (a) serialization, (b) in-transceiver data compression, and (c) high-speed asynchronous transmission. With the combination of these three techniques, evaluation results show that the overheads caused by using inductive/capacitive-coupling vertical links can be bounded below 10% in all aspects.


High-Performance Computer Architecture | 2014

CREAM: A Concurrent-Refresh-Aware DRAM Memory architecture

Tao Zhang; Matthew Poremba; Cong Xu; Guangyu Sun; Yuan Xie

As DRAM density keeps increasing, more rows need to be refreshed by a single refresh command while the number of refresh commands remains constant. Since no memory access is allowed during a refresh, the refresh penalty is no longer trivial and can result in significant performance degradation. To mitigate the refresh penalty, a Concurrent-REfresh-Aware Memory system (CREAM) is proposed in this work so that memory accesses and refreshes can be served in parallel. The proposed CREAM architecture distinguishes itself with the following key contributions: (1) under a given DRAM power budget, sub-rank-level refresh (SRLR) is developed to reduce refresh power, and the saved power is used to enable concurrent memory access; (2) sub-array-level refresh (SALR) is devised to effectively lower the probability of conflict between memory accesses and refreshes; (3) novel sub-array-level refresh scheduling schemes, such as sub-array round-robin and dynamic scheduling, are designed to further improve performance. A quasi-ROR interface protocol is proposed so that CREAM is fully compatible with the JEDEC DDR standard with negligible hardware overhead and no extra pin-out. The experimental results show that CREAM improves performance by 12.9% and 7.1% over conventional DRAM and Elastic-Refresh DRAM memory, respectively.
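
The intuition behind sub-array-level refresh with round-robin scheduling can be sketched in a few lines: only accesses that target the sub-array currently under refresh are blocked, so the rest of the bank stays available. This is a toy illustration, not CREAM's actual microarchitecture; all structures and names here are hypothetical.

```python
# Illustrative sketch of sub-array-level refresh (SALR) with a round-robin
# scheduler: an access conflicts with a refresh only when it targets the
# sub-array currently being refreshed, so most accesses proceed in parallel.
class SubArrayRefresher:
    def __init__(self, num_subarrays: int):
        self.n = num_subarrays
        self.next_sa = 0        # round-robin pointer
        self.refreshing = None  # sub-array under refresh, if any

    def start_refresh(self):
        """Refresh the next sub-array in round-robin order."""
        self.refreshing = self.next_sa
        self.next_sa = (self.next_sa + 1) % self.n

    def end_refresh(self):
        self.refreshing = None

    def access_allowed(self, subarray: int) -> bool:
        """Concurrent access is blocked only on the refreshing sub-array."""
        return subarray != self.refreshing
```

With four sub-arrays, a refresh blocks only a quarter of the bank at any moment, which is why the conflict probability drops as the refresh granularity gets finer.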


High-Performance Computer Architecture | 2013

TS-Router: On maximizing the Quality-of-Allocation in the On-Chip Network

Yuan-Ying Chang; Yoshi Shih-Chieh Huang; Matthew Poremba; Vijaykrishnan Narayanan; Yuan Xie; Chung-Ta King

Switch allocation is a critical pipeline stage in the router of a Network-on-Chip (NoC), in which flits at the input ports of the router are assigned to output ports for forwarding. This allocation is in essence a matching between input requests and output port resources, and efficient router designs strive to maximize the matching. Previous research considers the allocation decision at each cycle either independently or depending only on prior allocations. In this paper, we demonstrate that the matching decisions made in a router over time actually form a time series, and the Quality-of-Allocation (QoA) can be maximized if the matching decision is made across the time series, from past history to future requests. Based on this observation, a novel router design, TS-Router, is proposed. TS-Router predicts future requests arriving at a router and tries to maximize the matching across cycles. It can be extended easily from most state-of-the-art routers in a lightweight fashion. Our evaluation of TS-Router uses synthetic traffic as well as real benchmark programs in a full-system simulator. The results show that TS-Router achieves a higher number of matchings and lower latency. In addition, a prototype of TS-Router is implemented in Verilog, so that power consumption and area overhead are also evaluated.
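
The idea of folding predicted future requests into the current matching can be sketched as a greedy allocator whose service order favors inputs expected to contend again. The prediction input and priority rule below are purely illustrative and are not TS-Router's actual allocation logic.

```python
# Hypothetical sketch of history-aware switch allocation: a greedy maximal
# matching between input ports and output ports, where inputs with more
# predicted future requests are served first.
def allocate(requests, predicted_next):
    """requests: {input_port: set of requested output ports} this cycle.
    predicted_next: {input_port: set of outputs expected next cycle}.
    Returns {input_port: granted output port}."""
    # Serve inputs with more predicted future contention first.
    order = sorted(requests, key=lambda i: -len(predicted_next.get(i, ())))
    taken, grants = set(), {}
    for i in order:
        for out in sorted(requests[i]):
            if out not in taken:
                grants[i] = out
                taken.add(out)
                break
    return grants
```

A real allocator would also have to bound the cost of this priority computation so it fits in the router's pipeline stage, which is why the paper emphasizes a lightweight extension of existing designs.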


Signal Processing Systems | 2010

Accelerating adaptive background subtraction with GPU and CBEA architecture

Matthew Poremba; Yuan Xie; Marilyn Wolf

Background subtraction is an important problem in computer vision and is a fundamental task for many applications. In the past, background subtraction has been limited by the amount of computing power available. The task was performed on small frames and, in the case of adaptive algorithms, with relatively small models to achieve real-time performance. With the introduction of multi- and many-core chip multiprocessors (CMPs), more computing resources are available to handle this important task. The advent of specialized CMPs, such as NVIDIA's Compute Unified Device Architecture (CUDA) and IBM's Cell Broadband Engine Architecture (CBEA), provides new opportunities to accelerate real-time video applications. In this paper, we evaluate the acceleration of background subtraction with these two CMP architectures (CUDA and CBEA), such that larger image frames can be processed with more models while still achieving real-time performance. Our analysis shows impressive performance improvement over a baseline implementation that uses a multi-threaded dual-core CPU. Specifically, the CUDA implementation and CBEA implementation achieve up to 17.82X and 2.77X improvement, respectively.
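
For readers unfamiliar with the kernel being accelerated, a minimal adaptive background subtraction step can be written as a per-pixel running average; the models used in the paper's GPU and CBEA implementations are more elaborate, so treat this as an illustration of the core idea only, with the learning rate and threshold chosen arbitrarily.

```python
# Minimal sketch of adaptive background subtraction with a running-average
# background model over greyscale frames.
import numpy as np

def update_and_subtract(frame, background, alpha=0.05, threshold=30):
    """Return (foreground mask, updated background) for one frame."""
    diff = np.abs(frame.astype(np.float64) - background)
    mask = diff > threshold  # foreground where the change is large
    # Adapt the background slowly toward the current frame.
    background = (1 - alpha) * background + alpha * frame
    return mask, background
```

The per-pixel independence of both the differencing and the model update is what makes this workload map so naturally onto data-parallel hardware like CUDA and the CBEA's SPEs.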


High-Performance Computer Architecture | 2017

Design and Analysis of an APU for Exascale Computing

Thiruvengadam Vijayaraghavan; Yasuko Eckert; Gabriel H. Loh; Michael J. Schulte; Mike Ignatowski; Bradford M. Beckmann; William C. Brantley; Joseph L. Greathouse; Wei Huang; Arun Karunanithi; Onur Kayiran; Mitesh R. Meswani; Indrani Paul; Matthew Poremba; Steven E. Raasch; Steven K. Reinhardt; Greg Sadowski; Vilas Sridharan

Pushing computing to exaflop levels is difficult given the desired targets for memory capacity, memory bandwidth, power efficiency, reliability, and cost. This paper presents a vision for an architecture that can be used to construct exascale systems. We describe a conceptual Exascale Node Architecture (ENA), which is the computational building block for an exascale supercomputer. The ENA consists of an Exascale Heterogeneous Processor (EHP) coupled with an advanced memory system. The EHP provides a high-performance accelerated processing unit (CPU+GPU), in-package high-bandwidth 3D memory, and aggressive use of die-stacking and chiplet technologies to meet the requirements for exascale computing in a balanced manner. We present initial experimental analysis to demonstrate the promise of our approach, and we discuss remaining open research challenges for the community.


International Symposium on Low Power Electronics and Design | 2014

EECache: exploiting design choices in energy-efficient last-level caches for chip multiprocessors

Hsiang-Yun Cheng; Matthew Poremba; Narges Shahidi; Ivan Stalev; Mary Jane Irwin; Mahmut T. Kandemir; Jack Sampson; Yuan Xie

Power management for large last-level caches (LLCs) is important in chip-multiprocessors (CMPs), as the leakage power of LLCs accounts for a significant fraction of the limited on-chip power budget. Since not all workloads need the entire cache, portions of a shared LLC can be disabled to save energy. In this paper, we explore different design choices, from circuit-level cache organization to micro-architectural management policies, to propose a low-overhead run-time mechanism for energy reduction in the shared LLC. Results show that our design (EECache) provides 14.1% energy saving at only 1.2% performance degradation on average, with negligible hardware overhead.
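The basic trade-off behind disabling portions of a shared LLC can be sketched as a simple policy that scales the powered fraction of the cache with measured utilization. This is an illustration of the general idea only; EECache's actual circuit-level organization and management policy are more involved, and the metric, thresholds, and function names here are hypothetical.

```python
# Illustrative sketch of utilization-driven cache way disabling: idle ways
# are power-gated to cut leakage, at the cost of a smaller effective cache.
def choose_active_ways(total_ways, utilization, min_ways=2):
    """Scale the number of powered ways with measured cache utilization
    (fraction of lines recently reused), never going below min_ways."""
    wanted = max(min_ways, round(total_ways * utilization))
    return min(total_ways, wanted)

def leakage_saving(total_ways, active_ways):
    """Fraction of LLC leakage power saved by power-gating idle ways."""
    return (total_ways - active_ways) / total_ways
```

A run-time mechanism like the one the paper proposes would re-evaluate this decision periodically and fall back to the full cache when the miss rate indicates the workload needs it.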


Asia and South Pacific Design Automation Conference | 2014

NoΔ: Leveraging delta compression for end-to-end memory access in NoC based multicores

Jia Zhan; Matthew Poremba; Yi Xu; Yuan Xie

As the number of on-chip processing elements increases, the interconnection backbone bears bursty traffic from memory and cache accesses. In this paper, we propose a compression technique called NoΔ, which leverages delta compression to compress network traffic. Specifically, it conducts data encoding prior to packet injection and decoding before ejection in the network interface. The key idea of NoΔ is to store a data packet in the Network-on-Chip as a common base value plus an array of relative differences (Δ). It can improve the overall network performance and achieve energy savings because of the decreased network load. Moreover, this scheme does not require modifications of the cache storage design and can be seamlessly integrated with any optimization techniques for the on-chip interconnect. Our experiments reveal that the proposed NoΔ incurs negligible hardware overhead and outperforms state-of-the-art zero-content compression and frequent-value compression.
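
The base-plus-delta encoding at the heart of this scheme is easy to sketch: a block of words is stored as one full-width base value plus an array of narrow differences. The sketch below illustrates the encoding only; the field widths and functions are hypothetical and do not reflect NoΔ's actual packet format.

```python
# Sketch of base-plus-delta compression: a data block becomes one base
# value plus small differences, shrinking the payload when values cluster.
def delta_compress(words):
    """Encode a list of integers as (base, deltas)."""
    base = words[0]
    return base, [w - base for w in words]

def delta_decompress(base, deltas):
    return [base + d for d in deltas]

def compressed_bits(base, deltas, word_bits=32):
    """Base stored full-width; each delta in the narrowest signed width
    that fits the largest delta."""
    width = max((abs(d).bit_length() + 1 for d in deltas), default=1)
    return word_bits + width * len(deltas)
```

Because the encoder and decoder sit in the network interface, the cache and memory see uncompressed data, which is why the scheme composes with other interconnect optimizations.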


Asia and South Pacific Design Automation Conference | 2015

Heterogeneous architecture design with emerging 3D and non-volatile memory technologies

Qiaosha Zou; Matthew Poremba; Rui He; Wei Yang; Junfeng Zhao; Yuan Xie

Energy has become the primary concern in today's multi-core architecture designs. Moore's law predicts that an exponentially increasing number of cores can be packed into a single chip every two years; however, increasing power density is the obstacle to continuous performance gains. Recent studies show that heterogeneous multi-core is a promising solution to optimize performance per watt. In this paper, different types of heterogeneous architecture are discussed. For each type, current challenges and the latest solutions are briefly introduced. Preliminary analyses are performed to illustrate the scalability of heterogeneous systems and the potential benefits for future application requirements. Moreover, we demonstrate the advantages of leveraging three-dimensional (3D) integration in heterogeneous architectures. With 3D die stacking, disparate technologies can be integrated on the same chip, such as CMOS logic and emerging non-volatile memory, enabling a new paradigm of architecture design.

Collaboration


Dive into Matthew Poremba's collaborations.

Top Co-Authors

Yuan Xie (University of California)
Tao Zhang (Pennsylvania State University)
Cong Xu (Pennsylvania State University)
Gabriel H. Loh (Georgia Institute of Technology)
Hsiang-Yun Cheng (Pennsylvania State University)