Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Ilia A. Lebedev is active.

Publication


Featured research published by Ilia A. Lebedev.


Reconfigurable Computing and FPGAs | 2010

MARC: A Many-Core Approach to Reconfigurable Computing

Ilia A. Lebedev; Shaoyi Cheng; Austin Doupnik; James B. Martin; Christopher W. Fletcher; Daniel Burke; Mingjie Lin; John Wawrzynek

We present a Many-core Approach to Reconfigurable Computing (MARC), enabling efficient high-performance computing for applications expressed using parallel programming models such as OpenCL. The MARC system exploits abundant special FPGA resources such as distributed block memories and DSP blocks to implement complete single-chip, high-efficiency many-core microarchitectures. The key benefits of MARC are that it (i) allows programmers to easily express parallelism through the API defined in a high-level programming language, (ii) supports coarse-grain multithreading and dataflow-style fine-grain threading while permitting bit-level resource control, and (iii) greatly reduces the effort required to re-purpose the hardware system for different algorithms or different applications. A MARC prototype machine with 48 processing nodes was implemented using a Virtex-5 (XCV5LX155T-2) FPGA for a well-known Bayesian network inference problem. We compare the runtime of the MARC machine against a manually optimized implementation. With fully synthesized application-specific processing cores, our MARC machine comes within a factor of 3 of the performance of a fully optimized FPGA solution, but with a considerable reduction in development effort and a significant increase in retargetability.
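The OpenCL-style programming model that MARC targets can be pictured with a toy emulation: a "kernel" runs once per work-item, each identified by a global ID. The names below (run_kernel, vec_add) are illustrative, not MARC's actual API; a minimal sketch of the paradigm only.

```python
# Toy emulation of the OpenCL work-item model: the runtime invokes the
# kernel once per work-item, and each work-item handles one element.
# (Illustrative names; not MARC's API.)

def run_kernel(kernel, global_size, *buffers):
    """Invoke `kernel` once per work-item, as an OpenCL runtime would."""
    for gid in range(global_size):  # a real device runs these in parallel
        kernel(gid, *buffers)

def vec_add(gid, a, b, out):
    """Kernel body: each work-item computes one output element."""
    out[gid] = a[gid] + b[gid]

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
out = [0] * 4
run_kernel(vec_add, 4, a, b, out)
print(out)  # [11, 22, 33, 44]
```

Because the kernel only sees its own work-item ID, the same source maps equally well onto GPGPU threads or MARC-style processing nodes.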


Field-Programmable Gate Arrays | 2010

High-throughput Bayesian computing machine with reconfigurable hardware

Mingjie Lin; Ilia A. Lebedev; John Wawrzynek

We use reconfigurable hardware to construct a high-throughput Bayesian computing machine (BCM) capable of evaluating probabilistic networks with arbitrary DAG (directed acyclic graph) topology. Our BCM achieves high throughput by exploiting the FPGA's distributed memories and abundant hardware structures (such as long carry-chains and registers), which enables us to 1) develop an innovative memory allocation scheme based on a maximal matching algorithm that completely avoids memory stalls, 2) optimize and deeply pipeline the logic design of each processing node, and 3) optimally schedule them. The BCM architecture we present not only can be applied to many important algorithms in artificial intelligence, signal processing, and digital communications, but also has high reusability, i.e., a new application need not change a BCM's hardware design; only new task graph processing and code compilation are necessary. Moreover, the throughput of a BCM scales almost linearly with the size of the FPGA on which it is implemented. A prototype of a Bayesian computing machine with 16 processing nodes was implemented with a Virtex-5 FPGA (XCV5LX155T-2) on a BEE3 (Berkeley Emulation Engine) platform. For a wide variety of sample Bayesian problems, compared to running the same network evaluation algorithm on a 2.4 GHz Intel Core 2 Duo processor and on a GeForce 9400M using the CUDA software package, the BCM demonstrates 80x and 15x speedups, respectively, with a peak throughput of 20.4 GFLOPS (Giga Floating-Point Operations per Second).
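The stall-free memory allocation idea can be sketched as bipartite maximum matching: each value needed in a cycle must be served by a distinct memory bank, and an augmenting-path search finds such an assignment when one exists. This is an illustrative sketch under assumed data (the `wants` placement is invented), not the paper's exact algorithm.

```python
# Conflict-free bank assignment via Kuhn's augmenting-path maximum
# matching: values on the left, memory banks on the right; an edge means
# "a copy of this value is stored in that bank".

def max_matching(wants):
    """wants[v] = banks holding value v; returns value->bank, or None."""
    match = {}  # bank -> value currently assigned to it

    def try_assign(v, seen):
        for bank in wants[v]:
            if bank in seen:
                continue
            seen.add(bank)
            # Take a free bank, or evict and re-seat the current holder.
            if bank not in match or try_assign(match[bank], seen):
                match[bank] = v
                return True
        return False

    for v in wants:
        if not try_assign(v, set()):
            return None  # some value gets no private bank: a memory stall
    return {v: b for b, v in match.items()}

# Three values, three banks, with replicated placement (assumed data):
wants = {"x": [0, 1], "y": [1], "z": [0, 2]}
print(max_matching(wants))  # {'x': 0, 'y': 1, 'z': 2}
```

If the matching is perfect for every cycle's read set, all reads proceed in parallel and no processing node ever stalls on memory.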


Field-Programmable Logic and Applications | 2010

OpenRCL: Low-Power High-Performance Computing with Reconfigurable Devices

Mingjie Lin; Ilia A. Lebedev; John Wawrzynek

This work presents the Open Reconfigurable Computing Language (OpenRCL) system, designed to enable low-power high-performance reconfigurable computing with imperative programming languages such as C/C++. The key idea is to expose the FPGA platform as a compiler target for applications expressed in the OpenCL paradigm. To this end, we present a combination of a low-level virtual machine instruction set, an execution model, a many-core architecture, and an associated compiler to achieve high performance and power efficiency by exploiting the FPGA's distributed memories and abundant hardware structures (such as DSP blocks, long carry-chains, and registers). The resulting OpenRCL system not only allows programmers to easily express parallelism through the API defined in the OpenCL standard, but also supports coarse-grain multithreading and dataflow-style fine-grain threading while permitting bit-level resource control. An OpenRCL prototype machine with 30 processing nodes was implemented using a Virtex-5 (XCV5LX155T-2) FPGA. For the well-known Parallel Prefix Sum (Scan) problem, compared against the runtime of the same problem on a GeForce 9400M using the OpenCL SDK from Apple Inc., the OpenRCL machine demonstrates comparable performance with a 5x reduction in core power consumption.
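The Parallel Prefix Sum (Scan) benchmark mentioned above is usually implemented with the work-efficient Blelloch algorithm. The sketch below runs its two phases sequentially so the tree-structured up-sweep and down-sweep a parallel device would execute are explicit; it assumes a power-of-two input length for simplicity.

```python
# Work-efficient (Blelloch) exclusive prefix sum, written sequentially:
# each inner loop's iterations are independent and would run as parallel
# work-items on an OpenCL device.

def exclusive_scan(data):
    n = len(data)  # assumed to be a power of two for simplicity
    a = list(data)
    # Up-sweep (reduce): build partial sums up a binary tree.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]
        d *= 2
    # Down-sweep: clear the root, then push prefixes back down the tree.
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            a[i - d], a[i] = a[i], a[i] + a[i - d]
        d //= 2
    return a

print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [0, 3, 4, 11, 11, 15, 16, 22]
```

The algorithm performs O(n) additions in O(log n) parallel steps, which is why it maps well onto both GPGPU warps and many-core FPGA fabrics.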


Field-Programmable Gate Arrays | 2011

Bridging the GPGPU-FPGA efficiency gap

Christopher W. Fletcher; Ilia A. Lebedev; Narges Bani Asadi; Daniel Burke; John Wawrzynek

This paper compares an implementation of a Bayesian inference algorithm across several FPGAs and GPGPUs, while embracing both the execution model and high-level architecture of a GPGPU. Our study is motivated by recent work in template-based programming and architectural models for FPGA computing. The comparison we present is meant to demonstrate the FPGA's potential while constraining the design to follow the microarchitectural template of more programmable devices such as GPGPUs. The FPGA implementation proves capable of matching the performance of a high-end Nvidia Fermi-based GPU, the most advanced GPGPU available to us at the time of this study. Further investigation shows that each FPGA core outperforms workstation GPGPU cores by a factor of ~3.14x, and mobile GPGPU cores by ~4.25x, despite a ~4x reduction in core clock frequency. Using these observations, we discuss the efficiency gap between these two platforms, and the challenges associated with template-based programming models.
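The per-core figures above imply a larger per-cycle gap, since the FPGA cores win while clocked roughly 4x slower; a quick back-of-the-envelope check (using only the approximate ratios quoted in the abstract):

```python
# Per-cycle efficiency implied by the quoted per-core speedups: an FPGA
# core that is ~3.14x faster at ~4x lower clock does ~3.14 * 4 ≈ 12.6x
# more useful work per cycle.

fpga_speedup_vs_workstation = 3.14
fpga_speedup_vs_mobile = 4.25
clock_ratio = 4.0  # GPGPU clock / FPGA clock (approximate)

per_cycle_vs_workstation = fpga_speedup_vs_workstation * clock_ratio
per_cycle_vs_mobile = fpga_speedup_vs_mobile * clock_ratio
print(round(per_cycle_vs_workstation, 1))  # 12.6
print(round(per_cycle_vs_mobile, 1))       # 17.0
```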


ACM Symposium on Parallel Algorithms and Architectures | 2011

Brief announcement: distributed shared memory based on computation migration

Mieszko Lis; Keun Sup Shim; Myong Hyon Cho; Christopher W. Fletcher; Michel A. Kinsy; Ilia A. Lebedev; Omer Khan; Srinivas Devadas

Driven by increasingly unbalanced technology scaling and power dissipation limits, microprocessor designers have resorted to increasing the number of cores on a single chip, and pundits expect 1000-core designs to materialize in the next few years [1]. But how will memory architectures scale and how will these next-generation multicores be programmed? One barrier to scaling current memory architectures is the off-chip memory bandwidth wall [1,2]: off-chip bandwidth grows with package pin density, which scales much more slowly than on-die transistor density [3]. To reduce reliance on external memories and keep data on-chip, today's multicores integrate very large shared last-level caches on chip [4]; interconnects used with such shared caches, however, do not scale beyond relatively few cores, and the power requirements and access latencies of large caches exclude their use in chips on a 1000-core scale. For massive-scale multicores, then, we are left with relatively small per-core caches. Per-core caches on a 1000-core scale, in turn, raise the question of memory coherence. On the one hand, a shared memory abstraction is a practical necessity for general-purpose programming, and most programmers prefer a shared memory model [5]. On the other hand, ensuring coherence among private caches is an expensive proposition: bus-based and snoopy protocols do not scale beyond relatively few cores, and directory sizes needed in cache-coherence protocols must equal a significant portion of the combined size of the per-core caches, as otherwise directory evictions will limit performance [6]. Moreover, directory-based coherence protocols are notoriously difficult to implement and verify [7].


IEEE Hot Chips Symposium | 2013

Hardware-level thread migration in a 110-core shared-memory multiprocessor

Mieszko Lis; Keun Sup Shim; Brandon Cho; Ilia A. Lebedev; Srinivas Devadas

Advantages:
- Significantly reduces traffic on high-locality workloads: up to 14x reduction in traffic in some benchmarks.
- Simple to implement and verify (independent of core count, no transient states).
- Decentralized and trivially scalable (only core-ID bits and an address-to-core mapping).

Challenges:
- Workloads should be optimized with the memory model in mind (like allocating data on cache-line boundaries, but more coarse-grained).
- Automatically mapping allocation over cores is not a trivial problem.

Opportunities:
- Fine-grained migration is an enabling technology: since it is cheap and responsive, it can be used for almost anything, e.g., if only some cores have FPUs, migrate to access an FPU.
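The trade-off the slides describe, migrating a thread to the data versus issuing remote accesses, can be sketched as a toy cost model. The latency numbers below are assumptions for illustration, not figures from the talk, and this is not EM2's actual hardware policy.

```python
# Toy cost model for hardware-level thread migration: a one-time
# migration pays off once it beats the total cost of round-trip remote
# accesses to the same remote core's cache.

def cheaper_to_migrate(n_accesses, remote_cost, migrate_cost):
    """Migrate when one migration beats n remote round-trips."""
    return migrate_cost < n_accesses * remote_cost

# Illustrative latencies in cycles (assumed, not measured):
print(cheaper_to_migrate(1, remote_cost=40, migrate_cost=100))   # False
print(cheaper_to_migrate(10, remote_cost=40, migrate_cost=100))  # True
```

The "cheap and responsive" property in the slides is what makes the break-even point small enough for fine-grained use, such as migrating just to reach a core that has an FPU.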


Reconfigurable Computing and FPGAs | 2012

Exploring many-core design templates for FPGAs and ASICs

Ilia A. Lebedev; Christopher W. Fletcher; Shaoyi Cheng; James B. Martin; Austin Doupnik; Daniel Burke; Mingjie Lin; John Wawrzynek

We present a highly productive approach to hardware design based on a many-core microarchitectural template used to implement compute-bound applications expressed in a high-level data-parallel language such as OpenCL. The template is customized on a per-application basis via a range of high-level parameters such as the interconnect topology or processing element architecture. The key benefits of this approach are that it (i) allows programmers to express parallelism through an API defined in a high-level programming language, (ii) supports coarse-grained multithreading and fine-grained threading while permitting bit-level resource control, and (iii) reduces the effort required to repurpose the system for different algorithms or different applications. We compare template-driven design to both full-custom and programmable approaches by studying implementations of a compute-bound data-parallel Bayesian graph inference algorithm across several candidate platforms. Specifically, we examine a range of template-based implementations on both FPGA and ASIC platforms and compare each against full-custom designs. Throughout this study, we use a general-purpose graphics processing unit (GPGPU) implementation as a performance and area baseline. We show that our approach, similar in productivity to programmable approaches such as GPGPU applications, yields implementations with performance approaching that of full-custom designs on both FPGA and ASIC platforms.


International Conference on Computer Design | 2013

Design tradeoffs for simplicity and efficient verification in the Execution Migration Machine

Keun Sup Shim; Mieszko Lis; Myong Hyon Cho; Ilia A. Lebedev; Srinivas Devadas

As transistor technology continues to scale, the architecture community has experienced exponential growth in design complexity and significantly increasing implementation and verification costs. Moreover, Moore's law has led to a ubiquitous trend of an increasing number of cores on a single chip. Often, these large-core-count chips provide a shared memory abstraction via directories and coherence protocols, which have become notoriously error-prone and difficult to verify because of subtle data races and state space explosion. Although a very simple hardware shared memory implementation can be achieved by simply not allowing ad-hoc data replication and relying on remote accesses for remotely cached data (i.e., requiring no directories or coherence protocols), such remote-access-based directoryless architectures cannot take advantage of any data locality, and therefore suffer in both performance and energy. Our recently taped-out 110-core shared-memory processor, the Execution Migration Machine (EM2), establishes a new design point. On the one hand, EM2 supports shared memory but does not automatically replicate data, and thus preserves the simplicity of directoryless architectures. On the other hand, it significantly improves performance and energy over remote-access-only designs by exploiting data locality at remote cores via fast hardware-level thread migration. In this paper, we describe the design choices made in the EM2 chip as well as our choice of design methodology, and discuss how they combine to achieve design simplicity and verification efficiency. Even though EM2 is a fairly large design (110 cores using a total of 357 million transistors), the entire chip design and implementation process (RTL, verification, physical design, tapeout) took only 18 man-months.


Foundations and Trends in Electronic Design Automation | 2017

Secure Processors Part II: Intel SGX Security Analysis and MIT Sanctum Architecture

Victor Costan; Ilia A. Lebedev; Srinivas Devadas

This manuscript is the second in a two-part survey and analysis of the state of the art in secure processor systems, with a specific focus on remote software attestation and software isolation. The first part established the taxonomy and prerequisite concepts relevant to an examination of the state of the art in trusted remote computation: attested software isolation containers (enclaves). This second part extends Part I's description of Intel's Software Guard Extensions (SGX), an available and documented enclave-capable system, with a rigorous security analysis of SGX as a system for trusted remote computation. This part documents the authors' concerns over the shortcomings of SGX as a secure system and introduces the MIT Sanctum processor developed by the authors: a system designed to offer stronger security guarantees, lend itself better to analysis and formal verification, and offer a more straightforward and complete threat model than the Intel system, all with an equivalent programming model. This two-part work advocates a principled, transparent, and well-scrutinized approach to system design, and argues that practical guarantees of privacy and integrity for remote computation are achievable at a reasonable design cost and performance overhead.


Computers & Security | 2017

PriviPK: Certificate-less and secure email communication

Mashael AlSabah; Alin Tomescu; Ilia A. Lebedev; Dimitrios N. Serpanos; Srinivas Devadas

We introduce PriviPK, an infrastructure based on a novel combination of certificateless (CL) cryptography and key transparency techniques to enable e2e email encryption. Our design avoids (1) the key escrow and deployment problems of previous IBC systems, (2) certificate management, as in S/MIME, or participation in a complicated Web of Trust, as in PGP, and (3) impersonation attacks, because it relies on key transparency approaches where end users verify their identity and key bindings. PriviPK uses a new CL key agreement protocol with the unique property that it allows users to update their public keys without the need to contact a third party (such as a CA) for a recertification process, which allows for cheap forward secrecy and key revocation operations. Furthermore, PriviPK uniquely combines important privacy properties such as forward secrecy, deniability (or non-deniability if desired), and user transparency while avoiding the administrative overhead of certificates for asynchronous communication. PriviPK enables quick bootstrapping of shared keys among participating users, allowing them to encrypt and authenticate each other transparently. We describe an implementation of PriviPK and provide performance measurements that show its minimal computational overhead. We also describe our PriviPK-enabled e2e secure email client, a modification of the Nylas Mail email client.
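The key-update property can be illustrated with a toy Diffie-Hellman exchange. This is NOT PriviPK's certificateless protocol (which additionally binds keys through key-transparency logs); it only shows why publishing a fresh public key lets both parties derive a new shared secret with no CA recertification, which is the mechanism behind cheap forward secrecy. The group parameters here are toy values.

```python
# Toy Diffie-Hellman sketch: updating a public key yields a new shared
# secret without any third party. Real systems use vetted groups or
# elliptic curves, never ad-hoc parameters like these.

P = 0xFFFFFFFFFFFFFFC5  # 2**64 - 59, a small toy prime
G = 5

def keypair(secret):
    return secret, pow(G, secret, P)

def shared(my_secret, their_public):
    return pow(their_public, my_secret, P)

a_sk, a_pk = keypair(123456789)
b_sk, b_pk = keypair(987654321)
assert shared(a_sk, b_pk) == shared(b_sk, a_pk)  # both derive the same key

# Key update: Alice publishes a fresh public key; no CA is involved.
a_sk2, a_pk2 = keypair(555555555)
assert shared(a_sk2, b_pk) == shared(b_sk, a_pk2)
print(shared(a_sk, b_pk) != shared(a_sk2, b_pk))  # True: old key is dead
```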

Collaboration


Dive into Ilia A. Lebedev's collaborations.

Top Co-Authors:

- Srinivas Devadas (Massachusetts Institute of Technology)
- John Wawrzynek (University of California)
- Christopher W. Fletcher (Massachusetts Institute of Technology)
- Mingjie Lin (University of Central Florida)
- Victor Costan (Massachusetts Institute of Technology)
- Daniel Burke (University of California)
- Keun Sup Shim (Massachusetts Institute of Technology)
- Mieszko Lis (Massachusetts Institute of Technology)
- Austin Doupnik (University of California)