Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Xiangyao Yu is active.

Publication


Featured research published by Xiangyao Yu.


computer and communications security | 2013

Path ORAM: an extremely simple oblivious RAM protocol

Emil Stefanov; Marten van Dijk; Elaine Shi; Christopher W. Fletcher; Ling Ren; Xiangyao Yu; Srinivas Devadas

We present Path ORAM, an extremely simple Oblivious RAM protocol with a small amount of client storage. Partly due to its simplicity, Path ORAM is the most practical ORAM scheme for small client storage known to date. We formally prove that Path ORAM requires log^2 N / log X bandwidth overhead for block size B = X log N. For block sizes bigger than Ω(log^2 N), Path ORAM is asymptotically better than the best known ORAM scheme with small client storage. Due to its practicality, Path ORAM has been adopted in the design of secure processors since its proposal.
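
As a companion to the abstract above, here is a minimal, illustrative Python sketch of the Path ORAM access procedure (read the path to the block's assigned leaf into a client-side stash, remap the block to a fresh random leaf, then greedily write the path back). Tree depth, bucket size, and helper names are mine, chosen for readability rather than taken from the paper.

```python
# Minimal sketch of the Path ORAM access procedure, assuming a binary tree
# of buckets stored in heap layout with Z slots per bucket. Illustrative
# only; parameters and helper names are not the paper's.
import random

L = 4          # tree depth (number of leaves = 2**L)
Z = 4          # blocks per bucket
N = 2 ** L     # number of logical blocks

tree = {}                                                    # bucket index -> {block id: data}
position = {a: random.randrange(2 ** L) for a in range(N)}   # block -> assigned leaf
stash = {}                                                   # client-side stash: block id -> data

def path(leaf):
    """Bucket indices from the root down to the given leaf (heap layout)."""
    node = (2 ** L - 1) + leaf
    buckets = []
    while True:
        buckets.append(node)
        if node == 0:
            break
        node = (node - 1) // 2
    return list(reversed(buckets))            # root .. leaf

def access(op, a, new_data=None):
    leaf = position[a]
    position[a] = random.randrange(2 ** L)    # remap before the path is read

    # 1. Read the whole path into the stash.
    for b in path(leaf):
        stash.update(tree.get(b, {}))
        tree[b] = {}

    data = stash.get(a)
    if op == "write":
        stash[a] = new_data

    # 2. Write the path back, pushing each block as deep as its (new)
    #    assigned leaf allows, with at most Z blocks per bucket.
    for depth, b in reversed(list(enumerate(path(leaf)))):
        fits = [blk for blk in stash
                if path(position[blk])[depth] == b][:Z]
        tree[b] = {blk: stash.pop(blk) for blk in fits}
    return data
```

Every access reads and rewrites exactly one root-to-leaf path of Z-block buckets, which is where the logarithmic bandwidth overhead quoted in the abstract comes from.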


very large data bases | 2014

Staring into the abyss: an evaluation of concurrency control with one thousand cores

Xiangyao Yu; George Bezerra; Andrew Pavlo; Srinivas Devadas; Michael Stonebraker

Computer architectures are moving towards an era dominated by many-core machines with dozens or even hundreds of cores on a single chip. This unprecedented level of on-chip parallelism introduces a new dimension to scalability that current database management systems (DBMSs) were not designed for. In particular, as the number of cores increases, the problem of concurrency control becomes extremely challenging. With hundreds of threads running in parallel, the complexity of coordinating competing accesses to data will likely diminish the gains from increased core counts. To better understand just how unprepared current DBMSs are for future CPU architectures, we performed an evaluation of concurrency control for on-line transaction processing (OLTP) workloads on many-core chips. We implemented seven concurrency control algorithms in a main-memory DBMS and, using computer simulations, scaled our system to 1024 cores. Our analysis shows that all algorithms fail to scale to this magnitude, but for different reasons. In each case, we identify fundamental bottlenecks that are independent of the particular database implementation and argue that even state-of-the-art DBMSs suffer from these limitations. We conclude that rather than pursuing incremental solutions, many-core chips may require a completely redesigned DBMS architecture that is built from the ground up and tightly coupled with the hardware.
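
One concrete example of the bottlenecks discussed above is timestamp allocation in timestamp-ordering-based algorithms: a single shared counter serializes every transaction no matter how many cores are available. The sketch below is only an illustration of that serialization point, with batching shown as a partial mitigation; the class names and batch size are illustrative and not the paper's implementation.

```python
# Toy illustration of a centralized timestamp allocator (a global
# serialization point) and a batched variant that touches the shared
# counter less often but does not remove the serialization.
import threading

class CentralAllocator:
    def __init__(self):
        self._next = 0
        self._lock = threading.Lock()

    def allocate(self):
        with self._lock:              # every transaction contends here
            ts = self._next
            self._next += 1
            return ts

class BatchedAllocator:
    """Each worker grabs a batch of timestamps, amortizing the contention."""
    def __init__(self, central, batch=64):
        self.central, self.batch = central, batch
        self.local = threading.local()

    def allocate(self):
        state = getattr(self.local, "range", None)
        if state is None or state[0] >= state[1]:
            with self.central._lock:                 # refill from the shared counter
                start = self.central._next
                self.central._next += self.batch
            state = [start, start + self.batch]
            self.local.range = state
        ts = state[0]
        state[0] += 1
        return ts
```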


symposium on computer architecture and high performance computing | 2014

Design and Evaluation of Hierarchical Rings with Deflection Routing

Rachata Ausavarungnirun; Chris Fallin; Xiangyao Yu; Kevin Kai-Wei Chang; Greg Nazario; Reetuparna Das; Gabriel H. Loh; Onur Mutlu

Hierarchical ring networks, which hierarchically connect multiple levels of rings, have been proposed in the past to improve the scalability of ring interconnects, but past hierarchical ring designs sacrifice some of the key benefits of rings by reintroducing more complex in-ring buffering and buffered flow control. Our goal in this paper is to design a new hierarchical ring interconnect that can maintain most of the simplicity of traditional ring designs (i.e., no in-ring buffering or buffered flow control) while achieving scalability comparable to that of more complex buffered hierarchical ring designs. To this end, we revisit the concept of a hierarchical-ring network-on-chip. Our design, called HiRD (Hierarchical Rings with Deflection), includes critical features that enable us to mostly maintain the simplicity of traditional simple ring topologies while providing higher energy efficiency and scalability. First, HiRD does not have any buffering or buffered flow control within individual rings, and requires only a small amount of buffering between the ring hierarchy levels. When inter-ring buffers are full, our design simply deflects flits so that they circle the ring and try again, which eliminates the need for in-ring buffering. Second, we introduce two simple mechanisms that together provide an end-to-end delivery guarantee within the entire network (despite any deflections that occur) without impacting the critical path or latency of the vast majority of network traffic. Our experimental evaluations on a wide variety of multiprogrammed and multithreaded workloads and synthetic traffic patterns show that HiRD attains equal or better performance at better energy efficiency than multiple versions of both a previous hierarchical ring design and a traditional single ring design. We also extensively analyze our design's characteristics and its injection and delivery guarantees. We conclude that HiRD can be a compelling design point that allows higher energy efficiency and scalability while retaining the simplicity and appeal of conventional ring-based designs.
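
The deflection idea described above can be summarized in a few lines: a flit that cannot move between ring levels because the small inter-ring buffer is full is neither dropped nor blocked in place; it simply keeps circling its current ring and retries on the next pass. The sketch below is a behavioral illustration with invented class names and buffer sizes, not HiRD's actual microarchitecture.

```python
# Behavioral sketch of deflection at a bridge between two ring levels:
# if the small transfer buffer is full, the flit stays on its ring.
from collections import deque

class BridgeRouter:
    def __init__(self, transfer_slots=4):
        self.transfer_buffer = deque(maxlen=transfer_slots)

    def try_transfer(self, flit):
        """Return True if the flit moved toward the other ring level."""
        if len(self.transfer_buffer) < self.transfer_buffer.maxlen:
            self.transfer_buffer.append(flit)
            return True
        return False                      # buffer full: caller deflects the flit

def ring_hop(ring, bridge):
    """One step of a ring: the flit at the bridge either crosses or deflects."""
    if not ring:
        return
    flit = ring.popleft()
    if flit["wants_other_level"] and bridge.try_transfer(flit):
        return                            # flit left this ring
    ring.append(flit)                     # deflected: circle the ring and retry
```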


international conference on management of data | 2016

TicToc: Time Traveling Optimistic Concurrency Control

Xiangyao Yu; Andrew Pavlo; Daniel Sanchez; Srinivas Devadas

Concurrency control for on-line transaction processing (OLTP) database management systems (DBMSs) is a nasty game. Achieving higher performance on emerging many-core systems is difficult. Previous research has shown that timestamp management is the key scalability bottleneck in concurrency control algorithms. This prevents the system from scaling to large numbers of cores. In this paper we present TicToc, a new optimistic concurrency control algorithm that avoids the scalability and concurrency bottlenecks of prior T/O schemes. TicToc relies on a novel and provably correct data-driven timestamp management protocol. Instead of assigning timestamps to transactions, this protocol assigns read and write timestamps to data items and uses them to lazily compute a valid commit timestamp for each transaction. TicToc removes the need for centralized timestamp allocation, and commits transactions that would be aborted by conventional T/O schemes. We implemented TicToc along with four other concurrency control algorithms in an in-memory, shared-everything OLTP DBMS and compared their performance on different workloads. Our results show that TicToc achieves up to 92% better throughput while reducing the abort rate by 3.3x over these previous algorithms.
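
The data-driven timestamp idea above can be sketched concretely: each tuple carries a write timestamp (wts) and a read timestamp (rts), and the commit timestamp is computed from the timestamps of the data a transaction actually touched. The single-threaded Python sketch below shows the validation and write phases in that spirit; locking, latching, and concurrent interleavings are deliberately omitted, and the field names are illustrative.

```python
# Simplified, single-threaded sketch of TicToc-style validation: the commit
# timestamp is derived from per-tuple (wts, rts) pairs rather than from a
# central allocator. Write-set locking and concurrency are omitted.

def validate_and_commit(read_set, write_set):
    """read_set: list of (tuple, wts_seen, rts_seen);
    write_set: list of (tuple, new_value). Returns True on commit."""
    # 1. Compute the candidate commit timestamp.
    commit_ts = 0
    for tup, new_value in write_set:
        commit_ts = max(commit_ts, tup.rts + 1)
    for tup, wts_seen, rts_seen in read_set:
        commit_ts = max(commit_ts, wts_seen)

    # 2. Validate reads: every read must still be valid at commit_ts.
    for tup, wts_seen, rts_seen in read_set:
        if commit_ts > rts_seen:
            if tup.wts != wts_seen:               # tuple changed since we read it
                return False                      # abort
            tup.rts = max(tup.rts, commit_ts)     # extend the tuple's read timestamp

    # 3. Apply writes at commit_ts.
    for tup, new_value in write_set:
        tup.value = new_value
        tup.wts = tup.rts = commit_ts
    return True
```

Note that no global timestamp allocator appears anywhere: the commit timestamp is derived entirely from the tuples the transaction read and wrote.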


ieee high performance extreme computing conference | 2013

Integrity verification for path Oblivious-RAM

Ling Ren; Christopher W. Fletcher; Xiangyao Yu; Marten van Dijk; Srinivas Devadas

Oblivious RAMs (ORAMs) are used to hide memory access patterns. Path ORAM has gained popularity due to its efficiency and simplicity. In this paper, we propose an efficient integrity verification layer for Path ORAM, which imposes only a 17% latency overhead. We also show that integrity verification is vital to maintaining privacy for recursive Path ORAMs under active adversaries.
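
A common way to get integrity on a tree-structured ORAM, and (as I understand it) the direction this paper takes, is to overlay a Merkle tree on the ORAM tree itself, so each access only needs to check and update the hashes along the accessed path plus their siblings. The sketch below shows that generic path verification; the paper's exact construction of what is hashed, and how it composes with recursion, is more refined than this.

```python
# Generic Merkle-over-tree path verification: each node's hash covers its
# bucket contents plus the hashes of its two children, so verifying one
# ORAM path touches only that path and its sibling hashes.
import hashlib

def node_hash(bucket_bytes, left_hash, right_hash):
    return hashlib.sha256(bucket_bytes + left_hash + right_hash).digest()

def verify_path(leaf_index, buckets_leaf_to_root, sibling_hashes, trusted_root):
    """buckets_leaf_to_root: serialized buckets from the accessed leaf up to
    the root; sibling_hashes[i] is the stored hash of the off-path child at
    height i+1. Returns True iff the recomputed root matches the trusted root."""
    h = node_hash(buckets_leaf_to_root[0], b"", b"")      # leaf has no children
    idx = leaf_index
    for bucket, sibling in zip(buckets_leaf_to_root[1:], sibling_hashes):
        if idx % 2 == 0:                  # on-path child is the left child
            h = node_hash(bucket, h, sibling)
        else:                             # on-path child is the right child
            h = node_hash(bucket, sibling, h)
        idx //= 2
    return h == trusted_root
```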


cloud computing security workshop | 2013

Generalized external interaction with tamper-resistant hardware with bounded information leakage

Xiangyao Yu; Christopher W. Fletcher; Ling Ren; Marten van Dijk; Srinivas Devadas

This paper investigates secure ways to interact with tamper-resistant hardware leaking a strictly bounded amount of information. Architectural support for the interaction mechanisms is studied and performance implications are evaluated. The interaction mechanisms are built on top of a recently proposed secure processor, Ascend. Ascend is chosen because unlike other tamper-resistant hardware systems, Ascend completely obfuscates pin traffic through the use of Oblivious RAM (ORAM) and periodic ORAM accesses. However, the original Ascend proposal, with the exception of main memory, can only communicate with the outside world at the beginning or end of program execution; no intermediate information transfer is allowed. Our system, Stream-Ascend, is an extension of Ascend that enables intermediate interaction with the outside world. Stream-Ascend significantly improves the generality and efficiency of Ascend in supporting many applications that fit into a streaming model, while maintaining the same security level. Simulation results show that with smart scheduling algorithms, the performance overhead of Stream-Ascend relative to an insecure and idealized baseline processor is only 24.5%, 0.7%, and 3.9% for a set of streaming benchmarks in a large dataset processing application. Stream-Ascend is able to achieve a very high security level with small overheads for a large class of applications.
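
To give a flavor of what bounding leakage on an interaction channel can mean, the sketch below sends fixed-size messages at a fixed, data-independent rate, padding with dummies when no real output is ready, so the observable traffic is essentially independent of the data. This is only my illustration of the general principle; Stream-Ascend's actual scheduling and buffering mechanisms inside the secure processor are considerably more involved.

```python
# Sketch of fixed-rate, fixed-size, padded output: timing and volume of
# traffic do not depend on the data being processed.
import time
from collections import deque

MSG_BYTES = 4096
PERIOD_S = 0.01          # one message every 10 ms, regardless of the data

pending = deque()        # real output produced by the secure computation

def next_outgoing_message():
    if pending:
        real = pending.popleft()
        return real.ljust(MSG_BYTES, b"\0")      # pad to a fixed size
    return b"\0" * MSG_BYTES                     # dummy message when idle

def output_loop(send):
    while True:
        send(next_outgoing_message())            # schedule independent of data
        time.sleep(PERIOD_S)
```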


international symposium on microarchitecture | 2015

IMP: indirect memory prefetcher

Xiangyao Yu; Christopher J. Hughes; Nadathur Satish; Srinivas Devadas

Machine learning, graph analytics and sparse linear algebra-based applications are dominated by irregular memory accesses resulting from following edges in a graph or non-zero elements in a sparse matrix. These accesses have little temporal or spatial locality, and thus incur long memory stalls and large bandwidth requirements. A traditional streaming or striding prefetcher cannot capture these irregular access patterns. A majority of these irregular accesses come from indirect patterns of the form A[B[j]]. We propose an efficient hardware indirect memory prefetcher (IMP) to capture this access pattern and hide latency. We also propose a partial cacheline accessing mechanism for these prefetches to reduce the network and DRAM bandwidth pressure from the lack of spatial locality. Evaluated on 7 applications, IMP shows 56% speedup on average (up to 2.3×) compared to a baseline 64 core system with streaming prefetchers. This is within 23% of an idealized system. With partial cacheline accessing, we see another 9.4% speedup on average (up to 46.6%).
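
The A[B[j]] pattern and the address computation an indirect prefetcher performs once trained are easy to show in software: while a streaming prefetcher brings in B, the indirect prefetcher uses the index value B[j + D] to compute and fetch the address of the corresponding element of A ahead of time. The base address, element size, and distance D in the sketch below are illustrative placeholders, not the hardware's actual configuration.

```python
# Software-level model of the indirect access pattern and the prefetch
# address computation: prefetch &A[0] + elem_size * B[j + D].
D = 16                      # prefetch distance (illustrative)
ELEM_SIZE = 8               # bytes per element of A (illustrative)

def indirect_prefetch_addr(base_A, index_value):
    """The address an indirect prefetcher would issue for A[B[j + D]]."""
    return base_A + ELEM_SIZE * index_value

def sparse_sum(A, B, base_A=0x10000000):
    """The irregular kernel: performance is dominated by misses on A[B[j]]."""
    total = 0
    for j, idx in enumerate(B):
        if j + D < len(B):
            # In hardware this is issued by the prefetcher, not the program:
            _ = indirect_prefetch_addr(base_A, B[j + D])
        total += A[idx]
    return total
```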


international symposium on computer architecture | 2015

PrORAM: dynamic prefetcher for oblivious RAM

Xiangyao Yu; Syed Kamran Haider; Ling Ren; Christopher W. Fletcher; Albert Kwon; Marten van Dijk; Srinivas Devadas

Oblivious RAM (ORAM) is an established technique to hide the access pattern to an untrusted storage system. With ORAM, a curious adversary cannot tell what address the user is accessing when observing the bits moving between the user and the storage system. All existing ORAM schemes achieve obliviousness by adding redundancy to the storage system, i.e., each access is turned into multiple random accesses. Such redundancy incurs a large performance overhead. Although traditional data prefetching techniques successfully hide memory latency in DRAM based systems, it turns out that they do not work well for ORAM because ORAM does not have enough memory bandwidth available for issuing prefetch requests. In this paper, we exploit ORAM locality by taking advantage of the ORAM internal structures. While it might seem apparent that obliviousness and locality are two contradictory concepts, we challenge this intuition by exploiting data locality in ORAM without sacrificing security. In particular, we propose a dynamic ORAM prefetching technique called PrORAM (Dynamic Prefetcher for ORAM) and comprehensively explore its design space. PrORAM detects data locality in programs at runtime, and exploits the locality without leaking any information on the access pattern. Our simulation results show that with PrORAM, the performance of ORAM can be significantly improved. PrORAM achieves an average performance gain of 20% over the baseline ORAM for memory intensive benchmarks among Splash2 and 5.5% for SPEC06 workloads. The performance gain for YCSB and TPCC in DBMS benchmarks is 23.6% and 5% respectively. On average, PrORAM offers twice the performance gain of a static super block scheme.
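
A toy model of the kind of runtime decision described above: count how often neighboring ORAM blocks are accessed back to back, and once a small counter saturates, treat the pair as a super block that is fetched together on a single ORAM access. The threshold, pair size, and policy below are my simplifications; the paper's actual merge/split policy, super-block sizes, and the argument that the scheme leaks no access-pattern information are more involved.

```python
# Toy locality detector for dynamic super-block formation over aligned
# (even, odd) block pairs. Threshold and grouping are illustrative.
MERGE_THRESHOLD = 4

class SuperBlockDetector:
    def __init__(self):
        self.counters = {}        # aligned pair id -> locality counter
        self.merged = set()       # pairs currently fetched as one super block
        self.last = None

    def on_access(self, block_id):
        pair = block_id // 2      # aligned neighbor pair containing this block
        if self.last is not None and self.last != block_id and self.last // 2 == pair:
            self.counters[pair] = self.counters.get(pair, 0) + 1
            if self.counters[pair] >= MERGE_THRESHOLD:
                self.merged.add(pair)          # start fetching the pair together
        self.last = block_id

    def blocks_to_fetch(self, block_id):
        if block_id // 2 in self.merged:
            return [block_id, block_id ^ 1]    # fetch the neighbor too
        return [block_id]
```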


international conference on parallel architectures and compilation techniques | 2015

Tardis: Time Traveling Coherence Algorithm for Distributed Shared Memory

Xiangyao Yu; Srinivas Devadas

A new memory coherence protocol, Tardis, is proposed. Tardis uses timestamp counters representing logical time as well as physical time to order memory operations and enforce sequential consistency in any type of shared memory system. Tardis is unique in that, compared to the widely adopted directory coherence protocol and its variants, it completely avoids multicasting and requires only O(log N) storage per cache block for an N-core system rather than O(N) sharer information. Tardis is simpler and easier to reason about, yet achieves performance similar to directory protocols on a wide range of benchmarks run on 16, 64 and 256 cores.
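
The timestamp rules behind Tardis can be sketched compactly: every cache line carries a write timestamp (wts) and a read-lease end (rts), every core carries a program timestamp (pts), and ordering comes from these logical timestamps rather than from invalidations. The single-node Python model below illustrates the load and store rules only; the directory/LLC interaction, renewal traffic, and the O(log N) timestamp encoding are simplified away, and the lease length is an arbitrary illustrative constant.

```python
# Single-node model of Tardis-style logical-timestamp ordering.
LEASE = 10

class Line:
    def __init__(self, value=0):
        self.value, self.wts, self.rts = value, 0, 0

class Core:
    def __init__(self):
        self.pts = 0                      # program (logical) timestamp

    def load(self, line):
        # The read logically happens no earlier than the version's creation.
        self.pts = max(self.pts, line.wts)
        if self.pts > line.rts:           # lease expired: renew it
            line.rts = self.pts + LEASE
        return line.value

    def store(self, line, value):
        # The new version begins after every lease on the old one ends,
        # so no reader's lease ever has to be invalidated.
        self.pts = max(self.pts, line.rts + 1)
        line.value = value
        line.wts = line.rts = self.pts
```

Because a writer simply jumps logically past the end of any outstanding lease, no invalidation or multicast is ever needed.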


international conference on parallel architectures and compilation techniques | 2016

Tardis 2.0: Optimized Time Traveling Coherence for Relaxed Consistency Models

Xiangyao Yu; Hongzhe Liu; Ethan Zou; Srinivas Devadas

Cache coherence scalability is a big challenge in shared memory systems. Traditional protocols do not scale due to the storage and traffic overhead of cache invalidation. Tardis, a recently proposed coherence protocol, removes cache invalidation using logical timestamps and achieves excellent scalability. The original Tardis protocol, however, only supports the Sequential Consistency (SC) memory model, limiting its applicability. Tardis also incurs extra network traffic on some benchmarks due to renew messages, and has suboptimal performance when the program uses spinning to communicate between threads. In this paper, we address these downsides of the Tardis protocol and make it significantly more practical. Specifically, we discuss the architectural, memory system and protocol changes required to implement the TSO consistency model on Tardis, and prove that the modified protocol satisfies TSO. We also describe modifications for Partial Store Order (PSO) and Release Consistency (RC). Finally, we propose optimizations for better leasing policies and for handling program spinning. On a set of benchmarks, optimized Tardis improves on a full-map directory protocol in performance, storage and network traffic, while being simpler to implement.
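
As one loosely sketched illustration of the better-leasing-policy direction mentioned above (this is my own example, not the policy from the paper): lines that keep getting renewed without being written can be given exponentially longer leases to cut renew traffic, while a write resets the lease so subsequent readers observe new values quickly.

```python
# Illustrative adaptive lease predictor; parameters and policy are invented.
MIN_LEASE, MAX_LEASE = 8, 512

class LeasePredictor:
    def __init__(self):
        self.lease = {}               # line address -> current lease length

    def on_renew(self, addr):
        # Read-mostly line: doubling the lease halves future renew traffic.
        self.lease[addr] = min(self.lease.get(addr, MIN_LEASE) * 2, MAX_LEASE)
        return self.lease[addr]

    def on_write(self, addr):
        # A write breaks the read-mostly assumption; fall back to a short
        # lease so later readers pick up the new value sooner.
        self.lease[addr] = MIN_LEASE
        return self.lease[addr]
```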

Collaboration


Dive into Xiangyao Yu's collaborations.

Top Co-Authors

Srinivas Devadas
Massachusetts Institute of Technology

Ling Ren
Massachusetts Institute of Technology

Marten van Dijk
University of Connecticut

Christopher W. Fletcher
Massachusetts Institute of Technology

Chris Fallin
Carnegie Mellon University

Greg Nazario
Carnegie Mellon University

Albert Kwon
Massachusetts Institute of Technology