Ram Huggahalli
Intel
Publications
Featured research published by Ram Huggahalli.
international symposium on computer architecture | 2005
Ram Huggahalli; Ravi R. Iyer; Scott Tetrick
Recent I/O technologies such as PCI-Express and 10 Gb Ethernet enable unprecedented levels of I/O bandwidth in mainstream platforms. However, in traditional architectures, memory latency alone can prevent processors from matching 10 Gb inbound network I/O traffic. We propose a platform-wide method called Direct Cache Access (DCA) to deliver inbound I/O data directly into processor caches. We demonstrate that DCA provides a significant reduction in memory latency and memory bandwidth for receive-intensive network I/O applications. Analysis of benchmarks such as SPECweb99, TPC-W, and TPC-C shows that the overall benefit depends on the relative volume of I/O to memory traffic as well as the spatial and temporal relationship between processor and I/O memory accesses. A system-level perspective for the efficient implementation of DCA is presented.
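As a rough illustration of the memory-latency argument in this abstract, the C sketch below computes how often a 64-byte cache line's worth of data arrives at 10 Gb/s and compares that interval with a typical DRAM access latency. The line rate, cache-line size, and latency figure are illustrative assumptions, not numbers from the paper; the point is only that the data arrival interval is comparable to DRAM latency, which is the gap DCA closes by injecting inbound data directly into the cache.

```c
/* Back-of-the-envelope sketch: why DRAM latency alone can throttle
 * 10 Gb/s inbound network I/O. All constants are illustrative
 * assumptions, not measurements from the paper. */
#include <stdio.h>

int main(void) {
    const double line_rate_bps   = 10e9;   /* 10 Gb/s inbound traffic     */
    const double cache_line_B    = 64.0;   /* bytes per cache line        */
    const double dram_latency_ns = 100.0;  /* assumed DRAM access latency */

    double lines_per_sec = line_rate_bps / 8.0 / cache_line_B;
    double arrival_ns    = 1e9 / lines_per_sec;   /* ns between cache lines */

    printf("Cache lines/s    : %.1f M\n", lines_per_sec / 1e6);
    printf("Arrival interval : %.1f ns per line\n", arrival_ns);
    printf("Misses in flight : %.1f needed to hide %.0f ns of DRAM latency\n",
           dram_latency_ns / arrival_ns, dram_latency_ns);
    /* With ~51 ns between lines and ~100 ns to memory, a core taking a
     * compulsory miss per line needs ~2 outstanding misses just for the
     * packet data; DCA avoids the miss by placing the data in cache. */
    return 0;
}
```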
high performance interconnects | 2015
Mark S. Birrittella; Mark Debbage; Ram Huggahalli; James A. Kunz; Tom Lovett; Todd M. Rimmer; Keith D. Underwood; Robert C. Zak
The Intel® Omni-Path Architecture (Intel® OPA) is designed to enable a broad class of computations requiring scalable, tightly coupled CPU, memory, and storage resources. Integration between devices in the Intel® OPA family and Intel® CPUs enables improvements in system-level packaging and network efficiency. When coupled with the new user-focused open standard APIs developed by the OpenFabrics Alliance (OFA) Open Fabrics Initiative (OFI), host fabric interfaces (HFIs) and switches in the Intel® OPA family are optimized to provide low latency, high bandwidth, and a high message rate. Intel® OPA provides important innovations to enable a multi-generation, scalable fabric, including link-layer reliability, extended fabric addressing, and optimizations for high-core-count CPUs. Datacenter needs are also a core focus for Intel® OPA, which includes link-level traffic flow optimization to minimize datacenter jitter for high-priority packets, robust partitioning support, quality-of-service support, and a centralized fabric management system. Basic performance metrics from first-generation HFI and switch implementations demonstrate the potential of the new fabric architecture.
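Since the abstract highlights the OFA/OFI APIs, the following minimal C sketch shows the generic libfabric discovery-and-endpoint sequence an application might run on top of such a fabric. The reliable-datagram endpoint type and the idea of pinning a particular provider are assumptions for illustration, not details taken from the paper, and error handling is reduced to a single check.

```c
/* Minimal libfabric (OFI) bring-up sketch: discover a provider and open
 * an endpoint. Endpoint type and provider choice are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

#define CHK(call) do { int _r = (call); if (_r) { \
    fprintf(stderr, "%s: %s\n", #call, fi_strerror(-_r)); exit(1); } } while (0)

int main(void) {
    struct fi_info *hints = fi_allocinfo(), *info = NULL;
    struct fid_fabric *fabric = NULL;
    struct fid_domain *domain = NULL;
    struct fid_ep *ep = NULL;

    hints->ep_attr->type = FI_EP_RDM;   /* reliable datagram endpoint */
    hints->caps          = FI_MSG;      /* two-sided messaging        */

    CHK(fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info));
    printf("provider: %s\n", info->fabric_attr->prov_name);

    CHK(fi_fabric(info->fabric_attr, &fabric, NULL));
    CHK(fi_domain(fabric, info, &domain, NULL));
    CHK(fi_endpoint(domain, info, &ep, NULL));
    /* A real application would now bind completion queues and an address
     * vector, enable the endpoint, and post sends and receives. */

    fi_close(&ep->fid);
    fi_close(&domain->fid);
    fi_close(&fabric->fid);
    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```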
high-performance computer architecture | 2009
Amit Kumar; Ram Huggahalli; Srihari Makineni
10GbE connectivity is expected to be a standard feature of server platforms in the near future. Among the numerous methods and features proposed to improve the network performance of such platforms is Direct Cache Access (DCA), which routes incoming I/O directly to CPU caches. While this feature has been shown to be promising, there can be significant challenges when dealing with high rates of traffic in a multiprocessor and multi-core environment. In this paper, we focus on two practical considerations with DCA. In the first case, we show that the performance benefit from DCA can be limited when the network traffic processing rate cannot match the I/O rate. In the second case, we show that affinitizing both stack and application contexts to cores that share a cache is critical. With proper distribution and affinity, we show that a standard Linux network stack runs 32% faster for 2KB to 64KB I/O sizes.
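As a hedged illustration of the affinity point above, the sketch below pins the application thread and routes the NIC receive interrupt to two cores that share a cache on Linux. The specific core IDs and the IRQ number are assumptions for the example, not values from the paper.

```c
/* Sketch: pin the application and the NIC RX interrupt to cores that share
 * a cache, as the paper argues is critical. Core IDs and the IRQ number
 * are illustrative assumptions. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_self_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* 0 = calling thread */
}

static int route_irq_to_core(int irq, int core) {
    char path[64];
    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "%x\n", 1u << core);   /* CPU bitmask, hex */
    fclose(f);
    return 0;
}

int main(void) {
    /* Assume cores 2 and 3 share a cache, and the NIC RX queue uses IRQ 42. */
    if (pin_self_to_core(2))      perror("sched_setaffinity");
    if (route_irq_to_core(42, 3)) perror("smp_affinity");
    /* The receive path (driver + stack) now runs on core 3 while the
     * application consumes the data on core 2, sharing a cache. */
    return 0;
}
```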
modeling, analysis, and simulation on computer and telecommunication systems | 2004
Jaideep Moses; Ramesh Illikkal; Ravi R. Iyer; Ram Huggahalli; Donald Newell
As platforms evolve from employing single-threaded, single-core CPUs to multi-threaded, multi-core CPUs and embedded hardware-assist engines, the simulation infrastructure required for performance analysis of these platforms becomes extremely complex. While investigating hardware/software solutions for server network acceleration (SNA), we encountered limitations of existing simulators for some of these solutions. For example, lightweight threading and asynchronous memory copy solutions for SNA could not be modeled accurately and efficiently, and hence we developed a flexible trace-driven simulation framework called ASPEN (architectural simulator for parallel engines). ASPEN is based on rich workload traces (RWT), which capture the major events of interest during the execution of a workload on a single-threaded CPU and platform, and on replaying them on a multi-threaded architecture with hardware-assist engines. We introduce the overall ASPEN framework and describe its usage in the context of SNA. We believe that ASPEN is a useful performance tool for future platform architects and performance analysts.
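As a toy illustration of trace-driven replay onto hardware-assist engines, the sketch below dispatches events captured from a hypothetical single-threaded run onto a CPU and a copy engine. The event types, costs, and two-resource split are invented for the example; this is not ASPEN or its trace format.

```c
/* Toy trace-driven replay: events from a single-threaded run are dispatched
 * to simulated resources (a CPU thread and a copy engine). Event types and
 * latencies are invented for illustration; this is not ASPEN. */
#include <stdio.h>

typedef enum { EV_COMPUTE, EV_MEMCPY } ev_type;
typedef struct { ev_type type; unsigned cost; } trace_event;  /* cost in cycles */

int main(void) {
    /* A tiny stand-in for a "rich workload trace". */
    trace_event trace[] = {
        { EV_COMPUTE, 400 }, { EV_MEMCPY, 900 },
        { EV_COMPUTE, 300 }, { EV_MEMCPY, 600 }, { EV_COMPUTE, 200 },
    };
    unsigned cpu_busy = 0, engine_busy = 0;

    for (unsigned i = 0; i < sizeof(trace) / sizeof(trace[0]); i++) {
        if (trace[i].type == EV_MEMCPY)
            engine_busy += trace[i].cost;   /* offload copies to the assist engine */
        else
            cpu_busy += trace[i].cost;
    }
    /* If copies overlap with compute, replayed time is bounded by the busier
     * of the two resources rather than by their sum. */
    unsigned serial   = cpu_busy + engine_busy;
    unsigned parallel = cpu_busy > engine_busy ? cpu_busy : engine_busy;
    printf("serial: %u cycles, with copy-engine overlap: %u cycles\n",
           serial, parallel);
    return 0;
}
```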
high performance interconnects | 2010
Guangdeng Liao; Xia Zhu; Steen Larsen; Laxmi N. Bhuyan; Ram Huggahalli
With the rapid evolution of network speed from 1Gbps to 10Gbps, a wide spectrum of research has been done on TCP/IP to improve its processing efficiency on general-purpose processors. However, most of these studies considered only performance and ignored power efficiency. As power has become a major concern in data centers, where servers are often interconnected with 10GbE, it becomes critical to understand the power efficiency of TCP/IP packet processing over 10GbE. In this paper, we extensively examine the power consumption of TCP/IP packet processing over 10GbE on Intel Nehalem platforms across a range of I/O sizes by using a power analyzer. To understand the power consumption, we use an external Data Acquisition System (DAQ) to obtain a breakdown of power consumption for individual hardware components such as the CPU, memory, and NIC. In addition, as integrated NIC architectures are gaining more attention in high-end servers, we also study the power consumption of TCP/IP packet processing on an integrated NIC by using a Sun Niagara 2 processor with two integrated 10GbE NICs. We carefully compare the power efficiency of an integrated NIC with that of a PCI-E-based discrete NIC. We make several new observations: 1) Unlike 1GbE NICs, 10GbE NICs have high idle power dissipation, and TCP/IP packet processing over 10GbE consumes significant dynamic power. 2) Our power breakdown reveals that the CPU is the major source of dynamic power consumption, followed by memory. As the I/O size increases, CPU power consumption falls but memory power consumption grows. Compared to the CPU and memory, the NIC has low dynamic power consumption. 3) Large I/O sizes are much more power efficient than small I/O sizes. 4) While integrating a 10GbE NIC slightly increases CPU power consumption, it not only reduces system idle power dissipation due to the elimination of the PCI-E interface in NICs, but also achieves dynamic power savings due to better processing efficiency. These findings motivate us to design a more power-efficient server architecture for next-generation data centers.
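The paper measured power with an external DAQ; as a software-only approximation on newer Intel platforms, one could instead sample the RAPL energy counters exposed through Linux powercap, as in the sketch below. The sysfs paths exist on recent kernels, but the domain numbering and which subdomain corresponds to DRAM vary by platform (check each domain's `name` file), so treat the paths as assumptions.

```c
/* Approximate a per-component power breakdown from Linux RAPL powercap
 * counters. This is not the paper's DAQ methodology; sysfs paths and
 * domain meanings vary by platform. */
#include <stdio.h>
#include <unistd.h>

static long long read_uj(const char *path) {
    long long uj = -1;
    FILE *f = fopen(path, "r");
    if (f) { fscanf(f, "%lld", &uj); fclose(f); }
    return uj;
}

int main(void) {
    const char *pkg = "/sys/class/powercap/intel-rapl:0/energy_uj";     /* package 0 */
    const char *sub = "/sys/class/powercap/intel-rapl:0:0/energy_uj";   /* core or DRAM; check its 'name' */

    long long p0 = read_uj(pkg), s0 = read_uj(sub);
    sleep(1);                                   /* sample over a 1 s window */
    long long p1 = read_uj(pkg), s1 = read_uj(sub);

    if (p0 >= 0 && p1 >= 0)
        printf("package power  : %.2f W\n", (p1 - p0) / 1e6);
    if (s0 >= 0 && s1 >= 0)
        printf("subdomain power: %.2f W\n", (s1 - s0) / 1e6);
    /* Counters wrap at max_energy_range_uj; a robust tool would handle that. */
    return 0;
}
```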
ieee international conference on high performance computing, data, and analytics | 2008
Priya Govindarajan; Srihari Makineni; Donald Newell; Ravi R. Iyer; Ram Huggahalli; Amit Kumar
Scaling TCP/IP receive side processing to 10Gbps speeds on commercial server platforms has been a major challenge. This led to the development of two key techniques: Large Receive Offload (LRO) and Direct Cache Access (DCA). Only recently have systems supporting these two techniques become available. So, we want to evaluate these two techniques using 10 Gigabit NICs to find out if we can finally get 10Gbps rates. We evaluate these two techniques in detail to understand the performance benefit they offer and the remaining major overheads. Our measurements showed that LRO and DCA together improve TCP/IP receive performance by more than 50% over the base case (no LRO and DCA). These two techniques, combined with the improvements in the CPU architecture and the rest of the platform over the last 3-4 years, have more than doubled the TCP/IP receive processing throughput to 7Gbps. Our detailed architectural characterization of TCP/IP processing, with these two features enabled, has revealed that buffer management and copy operations still take up a significant amount of processing time. We also analyze the scaling behavior of TCP/IP to figure out how multi-core architectures improve network processing. This part of our analysis has highlighted some limiting factors that need to be addressed to achieve scaling beyond 10Gbps.
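A minimal sketch of the kind of receive-side measurement discussed above: a TCP sink that reads with a configurable I/O size and reports goodput until the sender closes the connection. The port number and default I/O size are arbitrary choices for the example, not parameters from the paper.

```c
/* Minimal TCP receive sink: accept one connection, read with a configurable
 * I/O size, and report goodput. Port and defaults are arbitrary. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv) {
    size_t io_size = argc > 1 ? (size_t)atoi(argv[1]) : 16384;  /* e.g. 2048..65536 */
    char *buf = malloc(io_size);

    int ls = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port   = htons(5001),
                                .sin_addr   = { .s_addr = htonl(INADDR_ANY) } };
    int one = 1;
    setsockopt(ls, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    bind(ls, (struct sockaddr *)&addr, sizeof(addr));
    listen(ls, 1);

    int cs = accept(ls, NULL, NULL);
    long long total = 0;
    time_t start = time(NULL);
    ssize_t n;
    while ((n = recv(cs, buf, io_size, 0)) > 0)   /* read until sender closes */
        total += n;
    double secs = difftime(time(NULL), start);

    printf("I/O size %zu B: %.2f Gb/s over %.0f s\n",
           io_size, secs > 0 ? total * 8 / secs / 1e9 : 0.0, secs);
    close(cs); close(ls); free(buf);
    return 0;
}
```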
IEEE Micro | 2016
Mark S. Birrittella; Mark Debbage; Ram Huggahalli; James A. Kunz; Tom Lovett; Todd M. Rimmer; Keith D. Underwood; Robert C. Zak
The Intel Omni-Path Architecture (Intel OPA) is designed to enable a broad class of computations requiring scalable, tightly coupled CPU, memory, and storage resources. Integration between the Intel OPA family and Intel CPUs enables improvements in system-level packaging and network efficiency. When coupled with the new open standard APIs developed by the OpenFabrics Alliance (OFA) Open Fabrics Initiative (OFI), the Intel OPA family is optimized to provide low latency, high bandwidth, and a high message rate. Intel OPA enables a multigeneration, scalable fabric through innovations including link-layer reliability, extended fabric addressing, and optimizations for high-core-count CPUs. Intel OPA also provides optimizations to address datacenter needs, including link-level traffic flow optimization to minimize jitter for high-priority packets, partitioning support, quality-of-service support, and a centralized fabric management system. Basic performance metrics from first-generation host fabric interface and switch implementations demonstrate the new fabric architecture's potential.
international symposium on microarchitecture | 2007
Amit Kumar; Ram Huggahalli
symposium on computer architecture and high performance computing | 2007
Steen Larsen; Parthasarathy Sarangam; Ram Huggahalli
Archive | 2007
Ram Huggahalli; Raymond S. Tetrick