Suvinay Subramanian

Massachusetts Institute of Technology

Publication

Featured research published by Suvinay Subramanian.

International Symposium on Computer Architecture | 2014

SCORPIO: a 36-core research chip demonstrating snoopy coherence on a scalable mesh NoC with in-network ordering

Bhavya K. Daya; Chia-Hsin Chen; Suvinay Subramanian; Woo Cheol Kwon; Sunghyun Park; Tushar Krishna; Jim Holt; Anantha P. Chandrakasan; Li-Shiuan Peh

In the many-core era, scalable coherence and on-chip interconnects are crucial for shared memory processors. While snoopy coherence is common in small multicore systems, directory-based coherence is the de facto choice for scalability to many cores, as snoopy relies on ordered interconnects which do not scale. However, directory-based coherence does not scale beyond tens of cores due to excessive directory area overhead or inaccurate sharer tracking. Prior techniques supporting ordering on arbitrary unordered networks are impractical for full multicore chip designs. We present SCORPIO, an ordered mesh Network-on-Chip (NoC) architecture with a separate fixed-latency, bufferless network to achieve distributed global ordering. Message delivery is decoupled from the ordering, allowing messages to arrive in any order and at any time, and still be correctly ordered. The architecture is designed to plug-and-play with existing multicore IP and with practicality, timing, area, and power as top concerns. Full-system 36- and 64-core simulations on SPLASH-2 and PARSEC benchmarks show an average application runtime reduction of 24.1% and 12.9%, in comparison to distributed directory and AMD HyperTransport coherence protocols, respectively. The SCORPIO architecture is incorporated in an 11 mm-by-13 mm chip prototype, fabricated in IBM 45 nm SOI technology, comprising 36 Freescale e200 Power Architecture™ cores with private L1 and L2 caches interfacing with the NoC via ARM AMBA, along with two Cadence on-chip DDR2 controllers. The chip prototype achieves a post-synthesis operating frequency of 1 GHz (833 MHz post-layout) with an estimated power of 28.8 W (768 mW per tile), while the network consumes only 10% of tile area and 19% of tile power.

Design, Automation, and Test in Europe | 2013

SMART: a single-cycle reconfigurable NoC for SoC applications

Chia-Hsin Owen Chen; Sunghyun Park; Tushar Krishna; Suvinay Subramanian; Anantha P. Chandrakasan; Li-Shiuan Peh

As technology scales, SoCs are increasing in core counts, leading to the need for scalable NoCs to interconnect the multiple cores on the chip. Given aggressive SoC design targets, NoCs have to deliver low latency, high bandwidth, at low power and area overheads. In this paper, we propose Single-cycle Multi-hop Asynchronous Repeated Traversal (SMART) NoC, a NoC that reconfigures and tailors a generic mesh topology for SoC applications at runtime. The heart of our SMART NoC is a novel low-swing clockless repeated link circuit embedded within the router crossbars, that allows packets to potentially bypass all the way from source to destination core within a single clock cycle, without being latched at any intermediate router. Our clockless repeater link has been proven in silicon in 45nm SOI. Results show that at 2GHz, we can traverse 8mm within a single cycle, i.e. 8 hops with 1mm cores. We implement the SMART NoC to layout and show that SMART NoC gives 60% latency savings, and 2.2X power savings compared to a baseline mesh NoC.

ACM Special Interest Group on Data Communication | 2016

Programmable Packet Scheduling at Line Rate

Anirudh Sivaraman; Suvinay Subramanian; Mohammad Alizadeh; Sharad Chole; Shang Tse Chuang; Anurag Agrawal; Hari Balakrishnan; Tom Edsall; Sachin Katti; Nick McKeown

Switches today provide a small menu of scheduling algorithms. While we can tweak scheduling parameters, we cannot modify algorithmic logic, or add a completely new algorithm, after the switch has been designed. This paper presents a design for a programmable packet scheduler, which allows scheduling algorithms, potentially algorithms that are unknown today, to be programmed into a switch without requiring hardware redesign. Our design uses the property that scheduling algorithms make two decisions: in what order to schedule packets and when to schedule them. Further, we observe that in many scheduling algorithms, definitive decisions on these two questions can be made when packets are enqueued. We use these observations to build a programmable scheduler using a single abstraction: the push-in first-out queue (PIFO), a priority queue that maintains the scheduling order or time. We show that a PIFO-based scheduler lets us program a wide variety of scheduling algorithms. We present a hardware design for this scheduler for a 64-port 10 Gbit/s shared-memory (output-queued) switch. Our design costs an additional 4% in chip area. In return, it lets us program many sophisticated algorithms, such as a 5-level hierarchical scheduler with programmable decisions at each level.
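The PIFO abstraction described in this abstract can be illustrated with a small software model. This is a minimal sketch, not the paper's hardware design: the `PIFO` class, the `strict_priority_rank` function, and the packet dictionaries are hypothetical names chosen for illustration. It assumes a rank is computed once at enqueue time and that packets with equal rank leave in arrival order.

```python
import heapq
import itertools


class PIFO:
    """Software model of a push-in first-out queue.

    A packet is pushed into the position given by its rank and is
    always popped from the head (lowest rank first).
    """

    def __init__(self):
        self._heap = []
        # Sequence numbers break ties so equal-rank packets stay in FIFO order.
        self._seq = itertools.count()

    def push(self, rank, packet):
        heapq.heappush(self._heap, (rank, next(self._seq), packet))

    def pop(self):
        _rank, _seq, packet = heapq.heappop(self._heap)
        return packet

    def __len__(self):
        return len(self._heap)


def strict_priority_rank(packet):
    # Hypothetical rank function: a lower "prio" value departs earlier.
    return packet["prio"]


pifo = PIFO()
arrivals = [
    {"id": "a", "prio": 2},
    {"id": "b", "prio": 0},
    {"id": "c", "prio": 2},
    {"id": "d", "prio": 1},
]
for pkt in arrivals:
    # The scheduling decision is made here, at enqueue time.
    pifo.push(strict_priority_rank(pkt), pkt)

order = [pifo.pop()["id"] for _ in range(len(pifo))]
print(order)
```

Swapping in a different rank function changes the scheduling policy without touching the queue itself, which is the sense in which a single abstraction can express many algorithms.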

IEEE Computer | 2013

Single-Cycle Multihop Asynchronous Repeated Traversal: A SMART Future for Reconfigurable On-Chip Networks

Tushar Krishna; Chia-Hsin Owen Chen; Sunghyun Park; Woo-Cheol Kwon; Suvinay Subramanian; Anantha P. Chandrakasan; Li-Shiuan Peh

Future scalability for kilo-core architectures requires solutions beyond the capabilities of protocol and software design. Single-cycle multihop asynchronous repeated traversal (SMART) creates virtual single-cycle paths across the shared network between cores, potentially offering significant reductions in runtime latency and energy expenditure.

Hot Topics in Networks | 2015

Towards Programmable Packet Scheduling

Anirudh Sivaraman; Suvinay Subramanian; Anurag Agrawal; Sharad Chole; Shang Tse Chuang; Tom Edsall; Mohammad Alizadeh; Sachin Katti; Nick McKeown; Hari Balakrishnan

Packet scheduling in switches is not programmable; operators only choose among a handful of scheduling algorithms implemented by the manufacturer. In contrast, other switch functions such as packet parsing and header processing are becoming programmable [10, 3, 6]. This paper presents a programmable packet scheduler that allows operators to program a variety of scheduling algorithms. Our design exploits the insight that any scheduling algorithm can be deconstructed into two decisions: in what order packets depart and when they depart. The algorithms only differ in how the order and departure times are computed. We show how these decisions map to two well-understood abstractions: priority and calendar queues [11]. Priority and calendar queues can then be composed together to realize a broad range of sophisticated scheduling algorithms. Further, both abstractions can be realized using the same mechanism: a programmable push-in first-out queue (PIFO) that allows a packet to push itself into an arbitrary location in a queue by programming a packet field. A PIFO is feasible in hardware. Preliminary synthesis indicates that an unoptimized hardware design meets timing at 1 GHz on a 16 nm technology node and occupies only 5% additional die area relative to existing merchant-silicon switching chips.
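The "when they depart" decision can be sketched with a queue ordered by departure time. This is an illustrative model only: `TimedQueue`, `pace`, and the byte-rate pacing rule are assumptions made for the example, not the design from the paper. Each packet receives its earliest departure time at enqueue, and the head is released only once that time has arrived.

```python
import heapq


class TimedQueue:
    """Hypothetical calendar-style queue ordered by earliest departure time."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker for packets with the same departure time

    def push(self, send_time, packet):
        heapq.heappush(self._heap, (send_time, self._seq, packet))
        self._seq += 1

    def pop_ready(self, now):
        """Return the head packet if its departure time has arrived, else None."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[2]
        return None


def pace(arrivals, rate_bytes_per_s):
    """Assign departure times at enqueue so the flow stays within the rate."""
    queue = TimedQueue()
    next_free = 0.0
    for arrival_time, name, size_bytes in arrivals:
        send_time = max(arrival_time, next_free)
        queue.push(send_time, name)
        next_free = send_time + size_bytes / rate_bytes_per_s
    return queue


# Three 1000-byte packets arrive together; at 1000 B/s they leave 1 s apart.
q = pace([(0.0, "p1", 1000), (0.0, "p2", 1000), (0.0, "p3", 1000)], 1000)
first = q.pop_ready(now=0.5)    # p1 is eligible immediately
blocked = q.pop_ready(now=0.5)  # p2 is not eligible until t = 1.0
```

Unlike a pure priority queue, this queue may hold packets back even when the link is idle, which is what makes rate limiting expressible.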

International Symposium on Microarchitecture | 2015

A scalable architecture for ordered parallelism

Mark C. Jeffrey; Suvinay Subramanian; Cong Yan; Joel S. Emer; Daniel Sanchez

We present Swarm, a novel architecture that exploits ordered irregular parallelism, which is abundant but hard to mine with current software and hardware techniques. In this architecture, programs consist of short tasks with programmer-specified timestamps. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover ordered parallelism. Swarm builds on prior TLS and HTM schemes, and contributes several new techniques that allow it to scale to large core counts and speculation windows, including a new execution model, speculation-aware hardware task management, selective aborts, and scalable ordered commits. We evaluate Swarm on graph analytics, simulation, and database benchmarks. At 64 cores, Swarm achieves 51–122× speedups over a single-core system, and outperforms software-only parallel algorithms by 3–18×.
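The programming model in this abstract, short tasks that carry timestamps, can be sketched as a sequential reference in software. The sketch below models only the ordering semantics; it does none of the speculation or hardware task management that the abstract attributes to Swarm. The names `run_ordered_tasks`, `enqueue`, and `sssp`, and the choice of shortest paths as the example workload, are assumptions made for illustration.

```python
import heapq


def run_ordered_tasks(initial_tasks):
    """Run tasks in timestamp order; a running task may enqueue more tasks."""
    queue = []
    seq = 0  # tie-breaker so task functions are never compared

    def enqueue(timestamp, fn, *args):
        nonlocal seq
        heapq.heappush(queue, (timestamp, seq, fn, args))
        seq += 1

    for timestamp, fn, args in initial_tasks:
        enqueue(timestamp, fn, *args)
    while queue:
        timestamp, _, fn, args = heapq.heappop(queue)
        fn(timestamp, enqueue, *args)


def sssp(graph, source):
    """Single-source shortest paths written as timestamped tasks."""
    dist = {}

    def visit(timestamp, enqueue, node):
        if node in dist:
            return  # an earlier task already settled this node
        dist[node] = timestamp
        for neighbor, weight in graph[node]:
            # Each child task is ordered by its tentative distance.
            enqueue(timestamp + weight, visit, neighbor)

    run_ordered_tasks([(0, visit, (source,))])
    return dist


graph = {
    "a": [("b", 4), ("c", 1)],
    "b": [("d", 1)],
    "c": [("b", 2), ("d", 5)],
    "d": [],
}
distances = sssp(graph, "a")
print(distances)
```

Each task is only a few operations long and the order constraint is expressed entirely through timestamps, with no explicit locks or barriers.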

IEEE Micro | 2016

Unlocking Ordered Parallelism with the Swarm Architecture

Mark C. Jeffrey; Suvinay Subramanian; Cong Yan; Joel S. Emer; Daniel Sanchez

The authors present Swarm, a parallel architecture that exploits ordered parallelism, which is abundant but hard to mine with current software and hardware techniques. Swarm programs consist of short tasks, as small as tens of instructions each, with programmer-specified order constraints. Swarm executes tasks speculatively and out of order and efficiently speculates thousands of tasks ahead of the earliest active task to uncover enough parallelism. Several techniques allow Swarm to scale to large core counts and speculation windows. The authors evaluate Swarm on graph analytics, simulation, and database benchmarks. At 64 cores, Swarm outperforms sequential implementations of these algorithms by 43 to 117 times and state-of-the-art software-only parallel algorithms by 3 to 18 times. Besides achieving near-linear scalability, Swarm programs are almost as simple as their sequential counterparts, because they do not use explicit synchronization.

International Symposium on Microarchitecture | 2016

Data-centric execution of speculative parallel programs

Mark C. Jeffrey; Suvinay Subramanian; Maleen Abeydeera; Joel S. Emer; Daniel Sanchez

Multicore systems must exploit locality to scale, scheduling tasks to minimize data movement. While locality-aware parallelism is well studied in non-speculative systems, it has received little attention in speculative systems (e.g., HTM or TLS), which hinders their scalability. We present spatial hints, a technique that leverages program knowledge to reveal and exploit locality in speculative parallel programs. A hint is an abstract integer, given when a speculative task is created, that denotes the data that the task is likely to access. We show it is easy to modify programs to convey locality through hints. We design simple hardware techniques that allow a state-of-the-art, tiled speculative architecture to exploit hints by: (i) running tasks likely to access the same data on the same tile, (ii) serializing tasks likely to conflict, and (iii) balancing tasks across tiles in a locality-aware fashion. We also show that programs can often be restructured to make hints more effective. Together, these techniques make speculative parallelism practical on large-scale systems: at 256 cores, hints achieve near-linear scalability on nine challenging applications, improving performance over hint-oblivious scheduling by 3.3× gmean and by up to 16×. Hints also make speculation far more efficient, reducing wasted work by 6.4× and traffic by 3.5× on average.
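The idea of a hint as an abstract integer that steers tasks to tiles can be sketched as follows. This is a hypothetical software illustration, not the hardware mechanism from the paper: the multiplicative hash, the function names, and the task tuples are all assumptions. It shows only the first property listed in the abstract, that tasks with the same hint land on the same tile.

```python
from collections import defaultdict


def tile_for_hint(hint, num_tiles):
    # Hypothetical mapping: a 32-bit multiplicative hash of the hint,
    # reduced to a tile index. Equal hints always map to the same tile.
    return ((hint * 2654435761) & 0xFFFFFFFF) % num_tiles


def assign_tasks(tasks, num_tiles):
    """Group (task_name, hint) pairs into per-tile queues."""
    tiles = defaultdict(list)
    for name, hint in tasks:
        tiles[tile_for_hint(hint, num_tiles)].append(name)
    return dict(tiles)


# Here the hint is assumed to be the ID of the vertex a task will touch.
tasks = [("update_v7", 7), ("read_v7", 7), ("update_v8", 8), ("read_v9", 9)]
placement = assign_tasks(tasks, num_tiles=4)
print(placement)
```

Because both tasks on vertex 7 share a queue, a tile that drains its queue in order would also run them one after the other rather than concurrently.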

International Symposium on Computer Architecture | 2017

Fractal: An Execution Model for Fine-Grain Nested Speculative Parallelism

Suvinay Subramanian; Mark C. Jeffrey; Maleen Abeydeera; Hyun Ryong Lee; Victor A. Ying; Joel S. Emer; Daniel Sanchez

Most systems that support speculative parallelization, like hardware transactional memory (HTM), do not support nested parallelism. This sacrifices substantial parallelism and precludes composing parallel algorithms. And the few HTMs that do support nested parallelism focus on parallelizing at the coarsest (shallowest) levels, incurring large overheads that squander most of their potential. We present FRACTAL, a new execution model that supports unordered and timestamp-ordered nested parallelism. FRACTAL lets programmers seamlessly compose speculative parallel algorithms, and lets the architecture exploit parallelism at all levels. FRACTAL can parallelize a broader range of applications than prior speculative execution models. We design a FRACTAL implementation that extends the Swarm architecture and focuses on parallelizing at the finest (deepest) levels. Our approach sidesteps the issues of nested parallel HTMs and uncovers abundant fine-grain parallelism. As a result, FRACTAL outperforms prior speculative architectures by up to 88× at 256 cores.

International Conference on Parallel Architectures and Compilation Techniques | 2017

SAM: Optimizing Multithreaded Cores for Speculative Parallelism

Maleen Abeydeera; Suvinay Subramanian; Mark C. Jeffrey; Joel S. Emer; Daniel Sanchez

This work studies the interplay between multithreaded cores and speculative parallelism (e.g., transactional memory or thread-level speculation). These techniques are often used together, yet they have been developed independently. This disconnect causes major performance pathologies: increasing the number of threads per core adds conflicts and wasted work, and puts pressure on speculative execution resources. These pathologies often squander the benefits of multithreading. We present speculation-aware multithreading (SAM), a simple policy that addresses these pathologies. By coordinating instruction dispatch and conflict resolution priorities, SAM focuses execution resources on work that is more likely to commit, avoiding aborts and using speculation resources more efficiently. We design SAM variants for in-order and out-of-order cores. SAM is cheap to implement and makes multithreaded cores much more beneficial on speculative parallel programs. We evaluate SAM on systems with up to 64 SMT cores. With SAM, 8-threaded cores outperform single-threaded cores by 2.33x on average, while a speculation-oblivious policy yields a 1.85x speedup. SAM also reduces wasted work by 52%.