Publication


Featured research published by John Demme.


International Symposium on Computer Architecture | 2014

A reconfigurable fabric for accelerating large-scale datacenter services

Andrew Putnam; Adrian M. Caulfield; Eric S. Chung; Derek Chiou; Kypros Constantinides; John Demme; Hadi Esmaeilzadeh; Jeremy Fowers; Gopi Prashanth Gopal; Jan Gray; Michael Haselman; Scott Hauck; Stephen Heil; Amir Hormati; Joo-Young Kim; Sitaram Lanka; James R. Larus; Eric C. Peterson; Simon Pope; Aaron Smith; Jason Thong; Phillip Yi Xiao; Doug Burger

Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed and built a composable, reconfigurable fabric to accelerate portions of large-scale software services. Each instantiation of the fabric consists of a 6×8 2-D torus of high-end Stratix V FPGAs embedded into a half-rack of 48 machines. One FPGA is placed into each server, accessible through PCIe, and wired directly to other FPGAs with pairs of 10 Gb SAS cables. In this paper, we describe a medium-scale deployment of this fabric on a bed of 1,632 servers, and measure its efficacy in accelerating the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system when ranking candidate documents. Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by a factor of 95% for a fixed latency distribution, or, while maintaining equivalent throughput, reduces the tail latency by 29%.
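
Illustrative sketch (not from the paper): the fabric described above is a 6×8 2-D torus, so each FPGA has exactly four wrap-around neighbors, and the reported 95% throughput gain corresponds to roughly 1.95× the baseline ranking throughput. The short Python model below only makes that topology and arithmetic concrete; it is not the deployed system's routing logic.

```python
# Illustrative model of the 6x8 2-D torus described in the abstract.
# Each FPGA is addressed by (row, col); links wrap around in both
# dimensions, so every node has exactly four neighbors.

ROWS, COLS = 6, 8  # one half-rack of 48 servers, one FPGA each

def torus_neighbors(row: int, col: int) -> list[tuple[int, int]]:
    """Return the four wrap-around neighbors of the FPGA at (row, col)."""
    return [
        ((row - 1) % ROWS, col),  # up
        ((row + 1) % ROWS, col),  # down
        (row, (col - 1) % COLS),  # left
        (row, (col + 1) % COLS),  # right
    ]

if __name__ == "__main__":
    # Corner node (0, 0) wraps to the opposite edges of the torus.
    print(torus_neighbors(0, 0))   # [(5, 0), (1, 0), (0, 7), (0, 1)]

    # A 95% throughput improvement at fixed latency is ~1.95x baseline.
    baseline = 1.0
    print(f"accelerated throughput: {baseline * 1.95:.2f}x")
```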


International Symposium on Computer Architecture | 2013

On the feasibility of online malware detection with performance counters

John Demme; Matthew Maycock; Jared Schmitz; Adrian Tang; Adam Waksman; Simha Sethumadhavan; Salvatore J. Stolfo

The proliferation of computers in any domain is followed by the proliferation of malware in that domain. Systems, including the latest mobile platforms, are laden with viruses, rootkits, spyware, adware and other classes of malware. Despite the existence of anti-virus software, malware threats persist and are growing as there exist a myriad of ways to subvert anti-virus (AV) software. In fact, attackers today exploit bugs in the AV software to break into systems. In this paper, we examine the feasibility of building a malware detector in hardware using existing performance counters. We find that data from performance counters can be used to identify malware and that our detection techniques are robust to minor variations in malware programs. As a result, after examining a small set of variations within a family of malware on Android ARM and Intel Linux platforms, we can detect many variations within that family. Further, our proposed hardware modifications allow the malware detector to run securely beneath the system software, thus setting the stage for AV implementations that are simpler and less buggy than software AV. Combined, the robustness and security of hardware AV techniques have the potential to advance state-of-the-art online malware detection.
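
Illustrative sketch (not from the paper): the general approach is to treat sampled performance-counter values as feature vectors and train an off-the-shelf classifier on labeled benign and malicious runs. The counter set, numbers, and model below are hypothetical stand-ins, not the authors' actual features or detection pipeline.

```python
# Minimal sketch: classify programs as benign or malicious from
# performance-counter feature vectors. The counter set and the data
# are hypothetical; this shows the general approach only.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Each row: [branch_misses, cache_misses, retired_loads, retired_stores]
# sampled over a fixed interval and normalized per 10k instructions.
X_train = np.array([
    [120,  40, 2100, 900],   # benign sample
    [115,  38, 2050, 880],   # benign sample
    [480, 310, 3900, 350],   # malware sample
    [470, 305, 3850, 360],   # malware sample
])
y_train = np.array([0, 0, 1, 1])  # 0 = benign, 1 = malware

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

X_new = np.array([[475, 300, 3870, 355]])
print("malware" if clf.predict(X_new)[0] == 1 else "benign")
```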


International Symposium on Computer Architecture | 2012

TimeWarp: rethinking timekeeping and performance monitoring mechanisms to mitigate side-channel attacks

Robert Martin; John Demme; Simha Sethumadhavan

Over the past two decades, several microarchitectural side channels have been exploited to create sophisticated security attacks. Solutions to this problem have mainly focused on fixing the source of leaks, either by limiting the flow of information through the side channel by modifying hardware, or by refactoring vulnerable software to protect sensitive data from leaking. These solutions are reactive and not preventative: while the modifications may protect against a single attack, they do nothing to prevent future side channel attacks that exploit other microarchitectural side channels or exploit the same side channel in a novel way. In this paper, we present a general mitigation strategy that focuses on the infrastructure used to measure side channel leaks rather than the source of leaks, and thus applies to all known and unknown microarchitectural side channel leaks. Our approach is to limit the fidelity of fine-grained timekeeping and performance counters, making it difficult for an attacker to distinguish between different microarchitectural events, thus thwarting attacks. We demonstrate the strength of our proposed security modifications, and validate that our changes do not break existing software. Our proposed changes require minor (or, in some cases, no) hardware modifications and do not result in any substantial performance degradation, yet offer the most comprehensive protection against microarchitectural side channels to date.
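
Illustrative sketch (not from the paper): the key idea is to reduce the fidelity of fine-grained timers so an attacker cannot resolve, say, a cache hit from a miss. The Python sketch below simply quantizes a high-resolution timestamp to a coarse epoch; the epoch size is an invented parameter, and the actual TimeWarp proposal additionally randomizes how time appears to advance within an epoch.

```python
# Illustrative timer coarsening: drop the low-order bits of a
# fine-grained timestamp so nearby events become indistinguishable.
# EPOCH_NS is a made-up parameter, not a value from the paper.
import time

EPOCH_NS = 8192  # quantization granularity (illustrative)

def coarse_time_ns() -> int:
    real = time.perf_counter_ns()
    return (real // EPOCH_NS) * EPOCH_NS  # quantize to the epoch start

if __name__ == "__main__":
    # Two closely spaced reads usually fall in the same epoch and thus
    # return identical values, hiding the ~100 ns-scale differences
    # that microarchitectural side channels depend on.
    a, b = coarse_time_ns(), coarse_time_ns()
    print(a, b, "indistinguishable" if a == b else "distinguishable")
```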


International Symposium on Computer Architecture | 2011

Rapid identification of architectural bottlenecks via precise event counting

John Demme; Simha Sethumadhavan

On-chip performance counters play a vital role in computer architecture research due to their ability to quickly provide insights into application behaviors that are time consuming to characterize with traditional methods. The usefulness of modern performance counters, however, is limited by the inefficient techniques used today to access them. Current access techniques rely on imprecise sampling or heavyweight kernel interaction, forcing users to choose between precision and speed and thus restricting the use of performance counter hardware. In this paper, we describe new methods that enable precise, lightweight interfacing to on-chip performance counters. These low-overhead techniques allow precise reading of virtualized counters in low tens of nanoseconds, which is one to two orders of magnitude faster than current access techniques. Further, these tools provide several fresh insights on the behavior of modern parallel programs such as MySQL and Firefox, which were previously obscured (or impossible to obtain) by existing methods for characterization. Based on case studies with our new access methods, we discuss seven implications for computer architects in the cloud era and three methods for enhancing hardware counters further. Taken together, these observations have the potential to open up new avenues for architecture research.
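
Illustrative sketch (not from the paper): the "low tens of nanoseconds" figure is a per-read cost, and a quick way to appreciate why read overhead matters is to measure the cost of a timing primitive itself. The snippet below times Python's perf_counter_ns rather than a hardware performance-counter read, so the absolute numbers will differ from the paper's, but the measurement methodology is the same.

```python
# Rough measurement of per-read overhead for a timing primitive.
# This times Python's perf_counter_ns, not a hardware performance
# counter read (e.g., RDPMC), so it only illustrates the methodology.
import time

N = 1_000_000
start = time.perf_counter_ns()
for _ in range(N):
    time.perf_counter_ns()
end = time.perf_counter_ns()

print(f"~{(end - start) / N:.0f} ns per read (including loop overhead)")
```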


High Performance Embedded Architectures and Compilers | 2012

Approximate graph clustering for program characterization

John Demme; Simha Sethumadhavan

An important aspect of system optimization research is the discovery of program traits or behaviors. In this paper, we present an automated method of program characterization which is able to examine and cluster program graphs, i.e., dynamic data graphs or control flow graphs. Our novel approximate graph clustering technology allows users to find groups of program fragments which contain similar code idioms or patterns in data reuse, control flow, and context. Patterns of this nature have several potential applications including development of new static or dynamic optimizations to be implemented in software or in hardware. For the SPEC CPU 2006 suite of benchmarks, our results show that approximate graph clustering is effective at grouping behaviorally similar functions. Graph based clustering also produces clusters that are more homogeneous than previously proposed non-graph based clustering methods. Further qualitative analysis of the clustered functions shows that our approach is also able to identify some frequent unexploited program behaviors. These results suggest that our approximate graph clustering methods could be very useful for program characterization.
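
Illustrative sketch (not from the paper): one crude way to group program graphs approximately is to fingerprint each graph by the multiset of its node labels, compare fingerprints with Jaccard similarity, and greedily merge graphs above a threshold. The fingerprints, labels, and threshold below are invented for illustration and are far simpler than the paper's approximate graph clustering technology.

```python
# Toy approximate clustering of program graphs: fingerprint each graph
# by the multiset of its node labels, compare fingerprints with Jaccard
# similarity, and greedily merge graphs above a threshold. This is a
# stand-in for illustration, not the paper's algorithm.
from collections import Counter

def jaccard(a: Counter, b: Counter) -> float:
    inter = sum((a & b).values())
    union = sum((a | b).values())
    return inter / union if union else 0.0

def cluster(graphs: dict[str, Counter], threshold: float = 0.6):
    clusters: list[list[str]] = []
    for name, fp in graphs.items():
        for group in clusters:
            if jaccard(fp, graphs[group[0]]) >= threshold:
                group.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Hypothetical control-flow-graph fingerprints (opcode label counts).
graphs = {
    "func_a": Counter({"load": 4, "add": 3, "branch": 2}),
    "func_b": Counter({"load": 4, "add": 2, "branch": 2}),
    "func_c": Counter({"call": 5, "store": 3, "branch": 1}),
}
print(cluster(graphs))  # [['func_a', 'func_b'], ['func_c']]
```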


IEEE Micro | 2015

A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services

Andrew Putnam; Adrian M. Caulfield; Eric S. Chung; Derek Chiou; Kypros Constantinides; John Demme; Hadi Esmaeilzadeh; Jeremy Fowers; Gopi Prashanth Gopal; Jan Gray; Michael Haselman; Scott Hauck; Stephen Heil; Amir Hormati; Joo-Young Kim; Sitaram Lanka; James R. Larus; Eric C. Peterson; Simon Pope; Aaron Smith; Jason Thong; Phillip Yi Xiao; Doug Burger

To advance datacenter capabilities beyond what commodity server designs can provide, the authors designed and built a composable, reconfigurable fabric to accelerate large-scale software services. Each instantiation of the fabric consists of a 6×8 2-D torus of high-end field-programmable gate arrays (FPGAs) embedded into a half-rack of 48 servers. The authors deployed the reconfigurable fabric in a bed of 1,632 servers and FPGAs in a production datacenter and successfully used it to accelerate the ranking portion of the Bing Web search engine by nearly a factor of two.


Grid and Cloud Database Management | 2011

Distributed Storage of Large-Scale Multidimensional Electroencephalogram Data Using Hadoop and HBase

Haimonti Dutta; Alex Kamil; Manoj Pooleery; Simha Sethumadhavan; John Demme

Huge volumes of data are being accumulated from a variety of sources in engineering and scientific disciplines; this has been referred to as the “Data Avalanche”. Cloud computing infrastructures (such as Amazon Elastic Compute Cloud (EC2)) are specifically designed to combine high compute performance with high-performance networking to meet the needs of data-intensive science. Reliable, scalable, and distributed computing is used extensively on the cloud. Apache Hadoop is one such open-source project: it provides a distributed file system that creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable and rapid computations. Column-oriented databases built on Hadoop (such as HBase), together with the MapReduce programming paradigm, allow large-scale distributed computing applications to be developed with ease. In this chapter, benchmarking results on a small in-house Hadoop cluster composed of 29 nodes, each with an 8-core processor, are presented along with a case study on distributed storage of electroencephalogram (EEG) data. Our results indicate that the Hadoop/HBase projects are still in their nascent stages but provide promising performance characteristics with regard to latency and throughput. In future work, we will explore the development of novel machine learning algorithms on this infrastructure.
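
Illustrative sketch (not from the chapter): a common HBase pattern for multidimensional time-series data such as EEG is to encode the dimensions into the row key so that per-channel time ranges become contiguous scans. The table name, column family, and key layout below are assumptions made for illustration, not the chapter's actual schema; the happybase client is used only as a convenient way to show puts and scans.

```python
# Illustrative HBase layout for EEG samples: encode subject, channel,
# and timestamp into the row key so per-channel time ranges are
# contiguous scans. Table/column names and the key layout are
# assumptions, not the chapter's actual schema.
import happybase  # Thrift-based HBase client

def eeg_row_key(subject_id: str, channel: int, ts_ms: int) -> bytes:
    # Zero-padding keeps lexicographic order == chronological order.
    return f"{subject_id}#{channel:03d}#{ts_ms:013d}".encode()

connection = happybase.Connection("localhost")   # HBase Thrift server
table = connection.table("eeg")                  # column family: 'd'

# Write one sample.
table.put(eeg_row_key("subj42", 7, 1_300_000_000_000),
          {b"d:uV": b"12.5"})

# Scan one channel's samples over a one-minute window.
start = eeg_row_key("subj42", 7, 1_300_000_000_000)
stop = eeg_row_key("subj42", 7, 1_300_000_060_000)
for key, data in table.scan(row_start=start, row_stop=stop):
    print(key, data[b"d:uV"])
```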


Proceedings of the 2009 ICSE Workshop on Multicore Software Engineering | 2009

COMPASS: A Community-driven Parallelization Advisor for Sequential Software

Simha Sethumadhavan; Nipun Arora; Ravindra Babu Ganapathi; John Demme; Gail E. Kaiser

The widespread adoption of multicores has renewed the emphasis on the use of parallelism to improve performance. The present and growing diversity in hardware architectures and software environments, however, continues to pose difficulties in the effective use of parallelism, thus delaying a quick and smooth transition to the concurrency era. In this paper, we describe the research being conducted at Columbia University on a system called COMPASS that aims to simplify this transition by providing advice to programmers while they reengineer their code for parallelism. The advice proffered to the programmer is based on the wisdom collected from programmers who have already parallelized similar code. The utility of COMPASS rests not only on its ability to collect this wisdom unintrusively, but also on its ability to automatically seek, find, and synthesize it into advice that is tailored to the task at hand, i.e., the code the user is considering parallelizing and the environment in which the optimized program is planned to execute. COMPASS provides a platform and an extensible framework for sharing human expertise about code parallelization widely, and on diverse hardware and software. By leveraging the “wisdom of crowds” model [30], which has been conjectured to scale exponentially and which has worked successfully for wikis, COMPASS aims to enable rapid propagation of knowledge about code parallelization in the context of actual parallelization reengineering, and thus continue to extend the benefits of Moore's law scaling to science and society.


International Conference on Computer Design | 2015

Increasing reconfigurability with memristive interconnects

John Demme; Bipin Rajendran; Steven M. Nowick; Simha Sethumadhavan

The design of on-chip interconnects is largely governed by the size and power of the devices being connected. While large components like memory controllers, video decode accelerators, and cores can afford the overhead of a large packet-switching NoC router, smaller components like adders or other ALUs cannot. Instead, they are typically connected via simple wires, limiting their runtime reconfigurability. The notable exception, the FPGA, uses an interconnect that allows extreme reconfigurability, but pays for it in area, power, and latency. Less costly reconfigurable interconnects, therefore, could allow hardware designers to expose more reconfigurability while limiting area and power costs. This paper presents the design of a high-radix circuit-switched crossbar built from memristors. The design utilizes Phase Change Memory (PCM) while working around some of its limitations, such as leakage power and low-voltage operation. The very small size of memristors shrinks the area, power, and latency of crossbars by up to 16x, 4.4x, and 2.4x, respectively, leaving little interconnect overhead beyond the wiring itself. As a tool for designers, memristive interconnects offer significant potential to increase runtime design flexibility.
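
Illustrative sketch (not from the paper): a circuit-switched crossbar holds a configured input-to-output connection until it is explicitly torn down, rather than arbitrating per packet. The tiny Python model below only makes that switching semantics concrete and restates the abstract's headline savings; it does not model PCM devices or the paper's circuits.

```python
# Tiny model of a circuit-switched crossbar: a connection is set up by
# closing one cross-point per output column and held until torn down.
# This only illustrates the switching semantics, not PCM devices or
# the paper's circuit design.

class Crossbar:
    def __init__(self, n_inputs: int, n_outputs: int):
        self.n_inputs, self.n_outputs = n_inputs, n_outputs
        self.route = {}  # output port -> input port currently driving it

    def connect(self, inp: int, out: int) -> None:
        if out in self.route:
            raise ValueError(f"output {out} already in use")
        self.route[out] = inp  # close cross-point (inp, out)

    def disconnect(self, out: int) -> None:
        self.route.pop(out, None)  # open the cross-point again

    def drives(self, out: int):
        return self.route.get(out)

xbar = Crossbar(n_inputs=16, n_outputs=16)
xbar.connect(3, 11)
print(xbar.drives(11))  # 3

# Headline savings quoted in the abstract (area, power, latency).
for metric, factor in [("area", 16), ("power", 4.4), ("latency", 2.4)]:
    print(f"{metric}: 1/{factor} = {1 / factor:.3f}x of a conventional crossbar")
```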


Archive | 2015

Anti-Virus in Silicon

Beng Chiew Tang; John Demme; Simha Sethumadhavan; Salvatore J. Stolfo

Anti-virus (AV) software is fundamentally broken. AV systems today rely on the correct functioning of not only the AV software but also the underlying OS and VMM. Thus, proper functioning of software AV requires millions of lines of complex code, which houses thousands of bugs, to work correctly. Needless to say, and as evidenced by numerous attacks on software AV, effective software AV systems have been difficult to build. At the same time, malware incidents are increasing and there is strong demand for good anti-virus solutions; the software anti-virus market is estimated at close to 8B dollars annually. In this work we present a new class of robust AV systems called Silicon anti-virus systems. Unlike software AV systems, these systems are lean and mostly implemented in hardware to avoid reliance on complex software, but, like software AV systems, they are updatable in the field when new malware is encountered. We describe the first generation of silicon AV, which uses simple machine learning techniques with the existing performance counter infrastructure. Our published and unpublished work shows that common malware such as viruses and adware, and even zero-day exploits, can be detected accurately [1, 2]. These systems form a very effective first-line, energy-efficient defense against malware.

Collaboration


Dive into John Demme's collaborations.

Top Co-Authors

Derek Chiou

University of Texas at Austin
