
Publication


Featured research published by Yufei Ren.


European Conference on Computer Systems | 2016

zExpander: a key-value cache with both high performance and fewer misses

Xingbo Wu; Li Zhang; Yandong Wang; Yufei Ren; Michel H. T. Hack; Song Jiang

Because a key-value (KV) cache, such as memcached, dedicates a large volume of expensive memory to holding performance-critical data, it is important to improve memory efficiency, that is, to reduce the cache miss ratio without adding more memory. Since we find that optimizing replacement algorithms has limited effect for this purpose, a promising approach is to use a compact data organization and data compression to increase the effective cache size. However, this approach risks degrading the cache's performance due to the additional computation cost, and a common perception is that a high-performance KV cache is not compatible with data-compacting techniques. In this paper, we show that, by leveraging the highly skewed data access patterns common in real-world KV cache workloads, we can both reduce the miss ratio through improved memory efficiency and maintain high performance for a KV cache. Specifically, we design and implement a KV cache system, named zExpander, which dynamically partitions the cache into two sub-caches: one serves frequently accessed data for high performance, and the other compacts data and metadata for high memory efficiency to reduce misses. Experiments show that zExpander can increase memcached's effective cache size by up to 2x and reduce the miss ratio by up to 46%. Its advantages remain when it is integrated with a higher-performance cache; for example, with 24 threads on a YCSB workload, zExpander achieves a throughput of 32 million RPS while removing 36% of its cache misses.
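The two-sub-cache split described above can be illustrated with a minimal sketch. This is not the authors' implementation; the class and parameter names (TieredKVCache, hot_capacity) are illustrative assumptions, and zlib stands in for whatever compact encoding zExpander actually uses. The point is only the routing: hot keys are served uncompressed, colder entries are compressed to stretch effective capacity.

```python
# Conceptual sketch (not the paper's code) of a two-tier KV cache:
# an uncompressed hot tier for frequently accessed keys and a
# compressed cold tier that trades CPU for effective capacity.
import zlib
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hot_capacity=1024):
        self.hot = OrderedDict()        # uncompressed, LRU-ordered hot tier
        self.cold = {}                  # zlib-compressed cold tier
        self.hot_capacity = hot_capacity

    def put(self, key, value: bytes):
        self.hot[key] = value
        self.hot.move_to_end(key)
        self.cold.pop(key, None)        # drop any stale compressed copy
        # Demote the least-recently-used entry once the hot tier is full.
        if len(self.hot) > self.hot_capacity:
            old_key, old_val = self.hot.popitem(last=False)
            self.cold[old_key] = zlib.compress(old_val)

    def get(self, key):
        if key in self.hot:             # fast path: no decompression
            self.hot.move_to_end(key)
            return self.hot[key]
        if key in self.cold:            # slow path: decompress and promote
            value = zlib.decompress(self.cold.pop(key))
            self.put(key, value)
            return value
        return None                     # miss
```

With a skewed access pattern, most gets hit the fast path, which is why the compression cost stays off the common case.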


Asia-Pacific Workshop on Systems | 2016

NVMcached: An NVM-based Key-Value Cache

Xingbo Wu; Fan Ni; Li Zhang; Yandong Wang; Yufei Ren; Michel H. T. Hack; Zili Shao; Song Jiang

As byte-addressable, high-density, non-volatile memory (NVM) will soon be deployed alongside DRAM, the issues involved in running important key-value cache services, such as memcached, on the new storage medium must be addressed. While NVM allows data in a KV cache to survive power outages and system crashes, in practice the data's integrity and accessibility depend on the consistency enforced during writes to NVM. Although techniques for enforcing consistency, such as journaling, copy-on-write (COW), and checkpointing, are available, they are often too expensive because they rely on frequent CPU cache flushes to ensure crash consistency, leading to greatly reduced performance and an excessively compromised NVM lifetime. In this paper we design and evaluate NVMcached, a KV cache for non-volatile byte-addressable memory that significantly reduces the use of flushes and minimizes data loss by leveraging consistency-friendly data structures and batched space allocation and reclamation. Experiments show that NVMcached can improve system throughput by up to 2.8x for write-intensive real-world workloads, compared to a non-volatile memcached.
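The batching idea behind the flush reduction can be shown with a toy model. This is not NVMcached's code; BatchedNVMLog and its flush() are hypothetical stand-ins for persisting a region of NVM, and the batch size is arbitrary. The sketch only demonstrates why batched space allocation amortizes persistence operations.

```python
# Conceptual sketch (not the paper's code): persist writes in batches so
# one "flush" covers many puts, instead of flushing after every write.
class BatchedNVMLog:
    def __init__(self, batch_size=64):
        self.batch_size = batch_size
        self.pending = []          # writes staged in the current batch
        self.persisted = []        # writes already made durable
        self.flush_count = 0       # how many flushes were issued

    def flush(self):
        # One flush makes the whole pending batch durable.
        self.persisted.extend(self.pending)
        self.pending.clear()
        self.flush_count += 1

    def put(self, key, value):
        self.pending.append((key, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

log = BatchedNVMLog(batch_size=64)
for i in range(1000):
    log.put(f"k{i}", f"v{i}")
log.flush()                         # persist the final partial batch
print(log.flush_count)              # 16 flushes instead of 1000
```

The trade-off, of course, is that anything still in the pending batch at crash time is lost, which is why the paper pairs batching with consistency-friendly data structures to bound that loss.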


Modeling, Analysis, and Simulation of Computer and Telecommunication Systems | 2017

Nexus: Bringing Efficient and Scalable Training to Deep Learning Frameworks

Yandong Wang; Li Zhang; Yufei Ren; Wei Zhang

Demand is mounting in industry for scalable GPU-based deep learning systems. Unfortunately, existing training applications built atop popular deep learning frameworks, including Caffe, Theano, and Torch, are incapable of conducting distributed GPU training over large-scale clusters. To remedy this situation, this paper presents Nexus, a platform that allows existing deep learning frameworks to easily scale out to multiple machines without sacrificing model accuracy. Nexus leverages a recently proposed distributed parameter-management architecture to orchestrate distributed training by a large number of learners spread across the cluster. By characterizing the run-time behavior of existing single-node applications, Nexus is equipped with a suite of optimization schemes, including hierarchical and hybrid parameter aggregation, an enhanced network and computation layer, and quality-guided communication adjustment, to strengthen the communication channels and improve resource utilization. Empirical evaluations with a diverse set of deep learning applications demonstrate that Nexus is easy to integrate and delivers efficient distributed training services to major deep learning frameworks. In addition, Nexus's optimization schemes are highly effective at shortening training time under targeted accuracy bounds.
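A minimal sketch of the hierarchical aggregation mentioned above: gradients from learners on the same machine are combined locally, so only one aggregate per node crosses the network to the parameter server. This is not Nexus's code; the function names, learner counts, and learning rate are illustrative assumptions.

```python
# Conceptual sketch (not the paper's code) of hierarchical parameter
# aggregation in a parameter-server style training setup.
import numpy as np

def local_aggregate(node_gradients):
    # Sum gradients produced by the learners that share one machine.
    return np.sum(node_gradients, axis=0)

def parameter_server_update(params, per_node_aggregates, lr=0.01, total_learners=8):
    # The parameter server only sees one pre-aggregated gradient per node.
    global_grad = np.sum(per_node_aggregates, axis=0) / total_learners
    return params - lr * global_grad

params = np.zeros(4)
node_a = [np.ones(4), np.ones(4) * 2.0, np.ones(4) * 3.0, np.ones(4) * 2.0]
node_b = [np.ones(4) * 4.0] * 4
aggregates = [local_aggregate(node_a), local_aggregate(node_b)]
params = parameter_server_update(params, aggregates, total_learners=8)
print(params)   # averaged gradient of 3.0 scaled by lr: [-0.03 -0.03 -0.03 -0.03]
```

The benefit is purely in communication volume: with eight learners on two nodes, the server receives two messages per step instead of eight.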


Modeling, Analysis, and Simulation of Computer and Telecommunication Systems | 2017

Lightweight Replication Through Remote Backup Memory Sharing for In-memory Key-Value Stores

Yandong Wang; Li Zhang; Michel H. T. Hack; Yufei Ren; Min Li

Memory prices will continue to drop over the next few years, according to Gartner. This trend makes it affordable for in-memory key-value stores (IMKVs) to maintain redundant memory-resident copies of each key-value pair to provide enhanced reliability and high-availability services. Although contemporary IMKVs have reached unprecedented performance, delivering single-digit-microsecond latency at up to tens of millions of queries per second, existing replication protocols cannot keep pace with this advancement, either incurring unbearable latency overhead or demanding intensive resource usage. Consequently, adopting those replication techniques always results in substantial performance degradation. In this paper, we propose MacR, an RDMA-based, high-performance, and lightweight replication protocol for IMKVs. The design of MacR centers on sharing the remote backup memory to enable an RDMA-based replication protocol, and it synthesizes a collection of optimizations, including memory-allocator-cooperative replication and adaptive bulk data synchronization, to control the number of network operations and to enhance recovery performance. Performance evaluations with a variety of YCSB workloads demonstrate that MacR outperforms alternative replication methods in terms of throughput while preserving sufficiently low latency overhead, and that it can also speed up the recovery process.
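The shared-backup-memory idea can be sketched as follows. This is not MacR's implementation; RemoteBackupRegion is a hypothetical stand-in for an RDMA-registered memory region, and the "one-sided write" is simulated with a plain buffer copy. The sketch shows only the structural point: the primary both allocates space in and writes replicas to backup memory, keeping the backup's CPU off the critical path.

```python
# Conceptual sketch (not the paper's code) of replication into shared,
# pre-registered backup memory via simulated one-sided RDMA writes.
class RemoteBackupRegion:
    """Pre-registered backup memory; the primary owns the allocation cursor."""
    def __init__(self, size):
        self.buf = bytearray(size)
        self.cursor = 0                       # next free offset, tracked by the primary

    def rdma_write(self, offset, data: bytes):
        # Simulated one-sided write: the backup CPU is not involved.
        self.buf[offset:offset + len(data)] = data

class Primary:
    def __init__(self, backup: RemoteBackupRegion):
        self.store = {}                       # in-memory KV store
        self.backup = backup
        self.index = {}                       # key -> (offset, length) in backup memory

    def put(self, key: str, value: bytes):
        self.store[key] = value
        record = key.encode() + b"\0" + value
        offset = self.backup.cursor           # allocator-cooperative: primary picks the slot
        self.backup.cursor += len(record)
        self.backup.rdma_write(offset, record)
        self.index[key] = (offset, len(record))

backup = RemoteBackupRegion(size=1 << 20)
primary = Primary(backup)
primary.put("user:42", b"alice")
```

Because each put issues a single remote write with no request/response round trip to the backup's software stack, the replication overhead stays close to the raw network latency.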


International Conference on Data Mining | 2017

GaDei: On Scale-Up Training as a Service for Deep Learning

Wei Zhang; Minwei Feng; Yunhui Zheng; Yufei Ren; Yandong Wang; Ji Liu; Peng Liu; Bing Xiang; Li Zhang; Bowen Zhou; Fei Wang


High Performance Computing and Communications | 2017

iRDMA: Efficient Use of RDMA in Distributed Deep Learning Systems

Yufei Ren; Xingbo Wu; Li Zhang; Yandong Wang; Wei Zhang; Zijun Wang; Michel H. T. Hack; Song Jiang


Archive | 2017

Cache Management in RDMA Distributed Key/Value Stores Based on Atomic Operations

Michel H. T. Hack; Yufei Ren; Yandong Wang; Li Zhang


Archive | 2017

Coordinated Version Control System, Method, and Recording Medium for Parameter Sensitive Applications

Michel H. T. Hack; Yufei Ren; Yandong Wang; Li Zhang


IBM Journal of Research and Development | 2017

IBM Deep Learning Service

Bishwaranjan Bhattacharjee; Scott Boag; Chandani Doshi; Parijat Dube; Ben Herta; Vatche Ishakian; K. R. Jayaram; Rania Khalaf; Avesh Krishna; Yu Bo Li; Vinod Muthusamy; Ruchir Puri; Yufei Ren; Florian Rosenberg; Seetharami R. Seelam; Yandong Wang; Jian Ming Zhang; Li Zhang


Archive | 2015

System, Method, and Recording Medium for Reducing Memory Consumption for In-Memory Data Stores

Michel H. T. Hack; Yufei Ren; Yandong Wang; Xingbo Wu; Li Zhang

