Publications


Featured research published by Krishna T. Malladi.


IEEE Computer Architecture Letters | 2017

LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory

Amirali Boroumand; Saugata Ghose; Minesh Patel; Hasan Hassan; Brandon Lucia; Kevin Hsieh; Krishna T. Malladi; Hongzhong Zheng; Onur Mutlu

Processing-in-memory (PIM) architectures cannot use traditional approaches to cache coherence due to the high off-chip traffic consumed by coherence messages. We propose LazyPIM, a new hardware cache coherence mechanism designed specifically for PIM. LazyPIM uses a combination of speculative cache coherence and compressed coherence signatures to greatly reduce the overhead of keeping PIM coherent with the processor. We find that LazyPIM improves average performance across a range of PIM applications by 49.1 percent over the best prior approach, coming within 5.5 percent of an ideal PIM mechanism.
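The "compressed coherence signatures" mentioned above can be pictured as Bloom-filter-style set summaries: the addresses a PIM kernel touches are recorded in a small bit field that admits false positives but never false negatives, so conflict checks stay cheap yet conservative. A minimal illustrative sketch of that idea (class and parameter names are hypothetical, not from the paper):

```python
import hashlib

class CoherenceSignature:
    """Toy Bloom-filter-style signature that over-approximates the set of
    cache-line addresses touched. Illustrative only; sizes are made up."""

    def __init__(self, bits=256, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.field = 0  # the compressed signature itself

    def _positions(self, addr):
        # Derive `hashes` bit positions from the address deterministically.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{addr}:{i}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.bits

    def add(self, addr):
        for p in self._positions(addr):
            self.field |= 1 << p

    def may_contain(self, addr):
        # No false negatives; occasional false positives only force an
        # unnecessary (but still correct) conflict resolution.
        return all((self.field >> p) & 1 for p in self._positions(addr))

sig = CoherenceSignature()
sig.add(0x1000)
sig.add(0x2040)
assert sig.may_contain(0x1000)  # a recorded address is always found
```

The asymmetry is the point: a false positive costs only a redundant check, while a false negative would break coherence, which is why a one-sided summary structure fits this use case.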


International Symposium on Microarchitecture | 2017

DRISA: a DRAM-based Reconfigurable In-Situ Accelerator

Shuangchen Li; Dimin Niu; Krishna T. Malladi; Hongzhong Zheng; Bob Brennan; Yuan Xie

Data movement between the processing units and the memory in the traditional von Neumann architecture is creating the “memory wall” problem. To bridge the gap, two approaches have been studied: the memory-rich processor (more on-chip memory) and the compute-capable memory (processing-in-memory). However, the first has strong computing capability but limited memory capacity/bandwidth, whereas the second is exactly the opposite. To address the challenge, we propose DRISA, a DRAM-based Reconfigurable In-Situ Accelerator architecture, to provide both powerful computing capability and large memory capacity/bandwidth. DRISA is primarily composed of DRAM memory arrays, in which every memory bitline can perform bitwise Boolean logic operations (such as NOR). DRISA can be reconfigured to compute various functions by combining the functionally complete Boolean logic operations with the proposed hierarchical internal data movement designs. We further optimize DRISA for high performance by simultaneously activating multiple rows and subarrays to provide massive parallelism, unblocking the internal data movement bottlenecks, and optimizing activation latency and energy. We explore four design options and present a comprehensive case study to demonstrate significant acceleration of convolutional neural networks. The experimental results show that DRISA achieves 8.8× speedup and 1.2× better energy efficiency compared with ASICs, and 7.7× speedup and 15× better energy efficiency over GPUs with integer operations.
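The abstract's reliance on NOR rests on NOR being functionally complete: NOT, OR, AND, and therefore any Boolean function can be composed from it alone. A small sketch of that composition on integers used as bit vectors (illustrative only; DRISA performs these operations on DRAM bitlines, not in software, and the helper names are mine):

```python
MASK = 0xFF  # model an 8-bit memory row

def nor(a, b):
    """The single primitive: bitwise NOR on 8-bit rows."""
    return ~(a | b) & MASK

# Everything else is derived purely from NOR.
def not_(a):    return nor(a, a)
def or_(a, b):  return nor(nor(a, b), nor(a, b))
def and_(a, b): return nor(not_(a), not_(b))
def xor_(a, b): return and_(or_(a, b), not_(and_(a, b)))

assert xor_(0b1100, 0b1010) == 0b0110
```

Because every bitline computes the primitive in parallel across a row, one such operation processes an entire row's worth of bits at once, which is where the claimed parallelism comes from.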


International Symposium on Computer Architecture | 2016

DRAF: a low-power DRAM-based reconfigurable acceleration fabric

Mingyu Gao; Christina Delimitrou; Dimin Niu; Krishna T. Malladi; Hongzhong Zheng; Bob Brennan; Christos Kozyrakis

FPGAs are a popular target for application-specific accelerators because they lead to a good balance between flexibility and energy efficiency. However, FPGA lookup tables introduce significant area and power overheads, making it difficult to use FPGA devices in environments with tight cost and power constraints. This is the case for datacenter servers, where a modestly-sized FPGA cannot accommodate the large number of diverse accelerators that datacenter applications need. This paper introduces DRAF, an architecture for bit-level reconfigurable logic that uses DRAM subarrays to implement dense lookup tables. DRAF overlaps DRAM operations like bitline precharge and charge restoration with routing within the reconfigurable routing fabric to minimize the impact of DRAM latency. It also supports multiple configuration contexts that can be used to quickly switch between different accelerators with minimal latency. Overall, DRAF trades off some of the performance of FPGAs for significant gains in area and power. DRAF improves area density by 10x over FPGAs and power consumption by more than 3x, enabling DRAF to satisfy demanding applications within strict power and cost constraints. While accelerators mapped to DRAF are 2-3x slower than those in FPGAs, they still deliver a 13x speedup and an 11x reduction in power consumption over a Xeon core for a wide range of datacenter tasks, including analytics and interactive services like speech recognition.
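The lookup tables at the heart of both FPGAs and DRAF work the same way conceptually: a k-input LUT stores the 2^k-entry truth table of a Boolean function, and evaluation is a single indexed read (in DRAF, a read from a DRAM subarray). A minimal sketch, with hypothetical helper names:

```python
def make_lut(func, k):
    """Precompute the truth table of `func` over k inputs as a bitmask."""
    table = 0
    for idx in range(1 << k):
        bits = [(idx >> i) & 1 for i in range(k)]
        if func(*bits):
            table |= 1 << idx
    return table

def lut_eval(table, *inputs):
    """Evaluate: pack the inputs into an index and read one table bit."""
    idx = sum(b << i for i, b in enumerate(inputs))
    return (table >> idx) & 1

# Example: a 3-input majority gate realized as a LUT.
maj = make_lut(lambda a, b, c: (a + b + c) >= 2, 3)
assert lut_eval(maj, 1, 1, 0) == 1
assert lut_eval(maj, 1, 0, 0) == 0
```

Since evaluation is just a table read, a denser storage medium for the table (DRAM instead of SRAM cells) trades access latency for area and power, which is the core trade-off the paper describes.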


International Performance Computing and Communications Conference | 2016

KOVA: A tool for kernel visualization and analysis

Manu Awasthi; Krishna T. Malladi

The time spent by an application can broadly be classified into two categories: user mode and kernel mode. To optimize application performance, it is critical to know the code regions where the bulk of that time is spent. With datacenter applications becoming more I/O intensive and storage devices attaining higher performance with each generation, the contribution of the Linux kernel stack to overall performance is at an all-time high. These trends make it imperative to observe and visualize kernel behavior and performance in order to effectively optimize it for specific use cases. To that end, in this paper we present KOVA, a Kernel Overhead Visualization and Analysis framework that builds on existing kernel tracers to provide comprehensive insights into Linux kernel behavior.
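The user/kernel split described above can be observed coarsely even without a tracer: POSIX getrusage() reports user and system CPU time separately, and syscall-heavy code shows up in the latter. A rough, Unix-only illustration (KOVA itself builds on kernel tracers, not on getrusage; the workload below is arbitrary):

```python
import os
import resource

def cpu_times():
    """Return (user_seconds, system_seconds) consumed by this process."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return ru.ru_utime, ru.ru_stime

u0, s0 = cpu_times()
# Burn user-mode CPU with pure computation...
sum(i * i for i in range(2_000_000))
# ...then exercise the kernel via many small syscalls.
for _ in range(20_000):
    os.getpid()
u1, s1 = cpu_times()
print(f"user: {u1 - u0:.3f}s  system: {s1 - s0:.3f}s")
```

Per-syscall and per-function attribution, which getrusage cannot provide, is exactly the gap that tracer-based tools like the one in the paper fill.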


ACM International Conference on Systems and Storage | 2016

Software-Defined Emulation Infrastructure for High Speed Storage

Krishna T. Malladi; Manu Awasthi; Hongzhong Zheng

As a new I/O communication protocol, NVMe lacks tools for evaluating storage solutions built on the standard. In this paper, we present the design and analysis of a comprehensive, fully customizable emulation infrastructure built on the NVMe protocol. It provides a number of knobs that allow system architects to quickly evaluate the performance implications of a wide variety of storage solutions while natively executing workloads.


Proceedings of the Second International Symposium on Memory Systems | 2016

DRAMScale: Mechanisms to Increase DRAM Capacity

Krishna T. Malladi; Uk-Song Kang; Manu Awasthi; Hongzhong Zheng

New resistive memory technologies promise scalability and non-volatility but suffer from longer, asymmetric read-write latencies and lower endurance, placing the burden of system design on architects. In order to avoid such pitfalls and still provision for exascale data requirements using a much faster DRAM technology, we introduce DRAMScale. It features three novel mechanisms to increase DRAM density while complementing technology scaling and creating a new capacity-optimized DRAM system. Such optimizations enable us to build a two-tier memory system that meets memory latency and capacity requirements.


Proceedings of the Second International Symposium on Memory Systems | 2016

DRAMPersist: Making DRAM Systems Persistent

Krishna T. Malladi; Manu Awasthi; Hongzhong Zheng

Modern applications exercise main memory systems in different ways. Many scale-out, in-memory applications exploit desirable properties of DRAM such as high capacity, low latency, and high bandwidth. Although DRAM technology continues to scale aggressively, new resistive memory technologies are on the horizon, promising scalability, density, and non-volatility. However, they still suffer from longer, asymmetric read-write latencies and lower endurance compared to DRAM. Considering these factors, scale-out, distributed applications will benefit greatly from main memory architectures that provide the non-volatility of new memory technologies but still deliver DRAM-like latencies. To that end, we introduce DRAMPersist, a novel mechanism that makes main memory persistent and complements existing high-speed storage, specifically geared for scale-out systems.


International Conference on Networking, Architecture, and Storage | 2017

FlashStorageSim: Performance Modeling for SSD Architectures

Krishna T. Malladi; Mu-Tien Chang; Dimin Niu; Hongzhong Zheng

We present FlashStorageSim, an SSD architecture performance model for data center servers, validated with an enterprise SSD. In addition to the SSD controller, SSD organization, and flash devices, FlashStorageSim models the host interface (e.g., SATA, PCIe, DDR). This allows users to explore non-traditional SSD use cases. We also implement mechanisms to improve simulation speed, which is shown to reduce simulation time by more than 7X. We show how FlashStorageSim can help researchers understand SSD design decisions.


International Conference on Networking, Architecture, and Storage | 2017

Rack Level Scheduling for Containerized Workloads

Qiumin Xu; Krishna T. Malladi; Manu Awasthi

High-performance SSDs have become ubiquitous in warehouse-scale computing, an adoption driven by their high bandwidth, low latency, and excellent random I/O performance. Owing to this performance, multiple I/O-intensive services can now be co-located on the same server. However, SSDs introduce periodic latency spikes due to garbage collection. Combined with multi-tenancy, this increases latency unpredictability, since co-located applications compete for CPU, memory, and disk bandwidth. Together, these latency spikes and this unpredictability lead to long tail latencies that can significantly degrade system performance at scale. In this paper, we present a rack-level scheduling algorithm that dynamically detects workloads with long tail latencies and shifts them between servers in the same rack. Unlike global resource management methods, rack-level scheduling utilizes lightweight containers to minimize data movement and message-passing overheads, leading to a much more efficient solution for reducing tail latency. With the algorithms implemented in the storage driver of the containerization infrastructure, applications can be deployed and migrated in existing server racks without extensive modifications to storage, the OS, or other subsystems.
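"Tail latency" above refers to the high percentiles of the latency distribution: a handful of garbage-collection spikes barely moves the mean but can dominate p99. A toy illustration with synthetic numbers (nearest-rank percentile; all values are made up for the example):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile, p in (0, 100]."""
    s = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(s)))
    return s[rank - 1]

# 985 fast requests plus 15 garbage-collection spikes (microseconds).
latencies_us = [100] * 985 + [5000] * 15

mean = sum(latencies_us) / len(latencies_us)  # barely moved by the spikes
p50 = percentile(latencies_us, 50)            # median: unaffected
p99 = percentile(latencies_us, 99)            # tail: dominated by the spikes
```

At scale the effect compounds: a request fanned out to many servers is as slow as its slowest shard, so even rare per-server spikes hit most user requests, which is why the paper targets the tail rather than the average.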


International Symposium on Performance Analysis of Systems and Software | 2017

Docker characterization on high performance SSDs

Qiumin Xu; Manu Awasthi; Krishna T. Malladi; Janki Bhimani; Jingpei Yang; Murali Annavaram

Docker containers [2] have become the mainstay for deploying applications on cloud platforms, offering many desirable features such as ease of deployment, developer friendliness, and lightweight virtualization. Meanwhile, solid state disks (SSDs) have seen tremendous performance gains through recent industry innovations such as the Non-Volatile Memory Express (NVMe) standards [3], [4]. However, the performance of containerized applications on these high-speed contemporary SSDs has not yet been investigated. In this paper, we characterize the performance impact of the wide variety of available storage options for deploying Docker containers and provide the configuration options that best utilize high-performance SSDs.

Collaboration


Dive into Krishna T. Malladi's collaborations.

Top Co-Authors
Qiumin Xu

University of Southern California


Amirali Boroumand

Carnegie Mellon University
