Songping Yu

National University of Defense Technology

Publication

Featured research published by Songping Yu.

International Performance Computing and Communications Conference | 2015

WAlloc: An efficient wear-aware allocator for non-volatile main memory

Songping Yu; Nong Xiao; Mingzhu Deng; Yuxuan Xing; Fang Liu; Zhiping Cai; Wei Chen

Non-volatile memory (NVM) offers byte-addressability, high speed, persistency, and low power consumption, which make it attractive for use as main memory. User processes commonly acquire memory dynamically through memory allocators. However, traditional memory allocators designed around in-place data writes are not appropriate for non-volatile main memory (NVRAM) because of its limited endurance. For instance, a PCM cell endures only about 10^8 write operations. In this paper, we quantitatively analyze the wear-obliviousness of glibc malloc, an allocator designed for DRAM, and the inefficiency of NVMalloc, a wear-conscious allocator. For example, the average imbalance factor (the maximum divided by the average) of memory allocation is about 7.5 and 3, respectively. Based on our observations, we propose WAlloc, an efficient wear-aware manual memory allocator designed for NVRAM that decouples metadata and data, uses a Less Allocated First Out allocation policy, and redirects data writes. Experimental results show that the wear-leveling of WAlloc outperforms that of NVMalloc by about 30% and 60% under random workloads and well-distributed workloads, respectively. In addition, considering the trade-off between space and wear-leveling, WAlloc reduces average data memory writes in 64-byte blocks by 1.5X on average compared with malloc, at almost 8% extra space overhead.
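The imbalance factor and the idea behind a Less Allocated First Out policy can be illustrated with a toy model. This is only a sketch under stated assumptions, not WAlloc's implementation: the class `LeastAllocatedFirstPool`, the LIFO baseline `lifo_counts`, and the 8-block workload are hypothetical, and allocation counts stand in for wear.

```python
import heapq

def imbalance_factor(counts):
    """Maximum per-block count divided by the average, as defined in the abstract."""
    average = sum(counts) / len(counts)
    return max(counts) / average

class LeastAllocatedFirstPool:
    """Toy pool (hypothetical) that hands out the free block with the fewest past allocations."""

    def __init__(self, num_blocks):
        self.alloc_counts = [0] * num_blocks
        # Heap entries are (allocation count, block id); all blocks start free.
        self.free_heap = [(0, block) for block in range(num_blocks)]
        heapq.heapify(self.free_heap)

    def allocate(self):
        if not self.free_heap:
            raise MemoryError("no free blocks")
        _, block = heapq.heappop(self.free_heap)
        self.alloc_counts[block] += 1
        return block

    def free(self, block):
        # Re-insert the block keyed by how often it has been allocated so far.
        heapq.heappush(self.free_heap, (self.alloc_counts[block], block))

def lifo_counts(num_blocks, rounds):
    """Baseline (assumption): a LIFO free list that reuses the most recently freed block."""
    counts = [0] * num_blocks
    free_list = list(range(num_blocks))
    for _ in range(rounds):
        block = free_list.pop()
        counts[block] += 1
        free_list.append(block)
    return counts

# Workload: 80 allocate-then-free cycles over 8 blocks.
pool = LeastAllocatedFirstPool(8)
for _ in range(80):
    pool.free(pool.allocate())

lafo_imbalance = imbalance_factor(pool.alloc_counts)
lifo_imbalance = imbalance_factor(lifo_counts(8, 80))
```

In this toy workload the least-allocated-first pool spreads allocations evenly across blocks, while the LIFO baseline concentrates them all on a single block.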

ACM Journal on Emerging Technologies in Computing Systems | 2017

Redesign the Memory Allocator for Non-Volatile Main Memory

Songping Yu; Nong Xiao; Mingzhu Deng; Fang Liu; Wei Chen

Non-volatile memory (NVM) offers byte-addressability, high speed, persistency, and low power consumption, which make it attractive for use as main memory. User processes commonly acquire memory dynamically through memory allocators. However, traditional memory allocators designed around in-place data writes are not appropriate for non-volatile main memory (NVRAM) because of its limited endurance. In this article, we first quantitatively analyze the wear-obliviousness of glibc malloc, an allocator designed for DRAM, and the inefficiency of NVMalloc, a wear-conscious allocator. Then, we propose WAlloc, an efficient wear-aware manual memory allocator designed for NVRAM, which (1) decouples metadata and data management; (2) distinguishes metadata by volatility; (3) redirects data writes to achieve wear-leveling; and (4) redesigns an efficient and effective NVM copy mechanism that partially bypasses the CPU cache and prefetches data explicitly. Finally, experimental results show that the wear-leveling of WAlloc outperforms that of NVMalloc by about 30% and 60% under random workloads and well-distributed workloads, respectively. In addition, WAlloc reduces the average data memory writes in 64-byte blocks by 1.5 times compared with glibc malloc. While still fulfilling data persistency, the cache-bypassing NVM copy outperforms the cache-line-flushing NVM copy by about 14%.

Networking Architecture and Storages | 2017

Megalloc: Fast Distributed Memory Allocator for NVM-Based Cluster

Songping Yu; Nong Xiao; Mingzhu Deng; Yuxuan Xing; Fang Liu; Wei Chen

As emerging Non-Volatile Memory (NVM) technologies, such as 3D XPoint, enter production, there has been a recent push in the big data processing community from storage-centric towards memory-centric designs. In large-scale systems, distributed memory management over a traditional network with the TCP/IP protocol exposes a performance bottleneck: a CPU-centric network involves context switching, memory copies, and so on. Remote Direct Memory Access (RDMA) technology shows a tremendous performance advantage over TCP/IP by allowing direct access to remote memory, bypassing the OS kernel. In this paper, we propose Megalloc, a distributed NVM allocator that exposes NVMs as a shared address space across a cluster of machines, based on RDMA. First, it makes memory allocation metadata directly accessible to each machine and allocates NVM in a coarse-grained way; second, it adopts fine-grained memory chunks for applications to read or store data; finally, it guarantees high distributed memory allocation performance.

Mobile Information Systems | 2017

RAID-6Plus: A Comprised Methodology for Extending RAID-6 Codes

Mingzhu Deng; Nong Xiao; Songping Yu; Fang Liu; Lingyu Zhu; Zhiguang Chen

Existing RAID-6 code extensions assume that failures are independent and instantaneous, overlooking the underlying mechanism of multifailure occurrences. Also, the effect of reconstruction window is ignored. Additionally, these coding extensions have not been adapted to occurrence patterns of failure in real-world applications. As a result, the third parity drive is set to handle the triple-failure scenario; however, the lower level failure situations have been left unattended. Therefore, a new methodology of extending RAID-6 codes named RAID-6Plus with better compromise has been studied in this paper. RAID-6Plus (Deng et al., 2015) employs short combinations which can greatly reuse overlapped elements during reconstruction to remake the third parity drive. A sample extension code called RDP

International Conference on Security, Privacy and Anonymity in Computation, Communication and Storage | 2017

Fast Truss Decomposition in Memory

Yuxuan Xing; Nong Xiao; Yutong Lu; Ronghua Li; Songping Yu; Siqi Gao

The k-truss is a type of cohesive subgraph proposed for the analysis of massive networks. Existing in-memory algorithms for computing the k-truss are inefficient for searching and parallelization. We propose a novel traversal algorithm for truss decomposition: it effectively reduces computational complexity, fully exploits parallelism thanks to this optimization, and overlaps I/O and computation for better performance. Our experiments on real datasets verify that it is 2x–5x faster than the fastest existing in-memory algorithm.
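For readers unfamiliar with truss decomposition, the following is a minimal sketch of the classic support-peeling baseline, not the traversal algorithm proposed in the paper. It assumes the usual definition (every edge of a k-truss lies in at least k - 2 triangles within the subgraph); the function name `truss_decomposition` and the example graph are illustrative.

```python
from collections import defaultdict

def truss_decomposition(edges):
    """Return {edge: trussness} for an undirected graph using support peeling."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    def key(u, v):
        # Canonical form so each undirected edge has a single dictionary key.
        return (u, v) if u <= v else (v, u)

    # Support of an edge = number of triangles that contain it.
    support = {key(u, v): len(adj[u] & adj[v])
               for u in list(adj) for v in adj[u] if u < v}

    trussness = {}
    k = 2
    while support:
        # Edges with support <= k - 2 cannot belong to a (k + 1)-truss.
        queue = [e for e, s in support.items() if s <= k - 2]
        while queue:
            e = queue.pop()
            if e not in support:
                continue  # already peeled via a duplicate queue entry
            u, v = e
            # Removing (u, v) destroys one triangle for each common neighbour.
            for w in adj[u] & adj[v]:
                for f in (key(u, w), key(v, w)):
                    support[f] -= 1
                    if support[f] <= k - 2:
                        queue.append(f)
            adj[u].discard(v)
            adj[v].discard(u)
            del support[e]
            trussness[e] = k
        k += 1
    return trussness

# Example: a 4-clique on nodes 0..3 plus a pendant edge (3, 4).
example = truss_decomposition(
    [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
)
```

In the example, the six clique edges form a 4-truss, while the pendant edge lies in no triangle and therefore has trussness 2.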

International Conference on Security, Privacy and Anonymity in Computation, Communication and Storage | 2017

Pyramid: Revisiting Memory Extension with Remote Accessible Non-Volatile Main Memory

Songping Yu; Mingzhu Deng; Yuxuan Xing; Nong Xiao; Fang Liu; Wei Chen

Remote Direct Memory Access (RDMA) provides the ability to directly access remote user-space memory without involving the remote CPU, shortening network latency tremendously. In addition, a new generation of fast Non-Volatile Memory (NVM) technologies, such as 3D XPoint, is in production, promising memory-like access speed and storage-like durability. Remotely accessible Non-Volatile Main Memory is therefore a reasonable design. Traditional local memory extension is bounded by slow storage media (HDD/SSD). In this paper, we first revisit local memory extension and propose a new memory extension model, Pyramid, which extends memory with remote NVM. We then discuss the mechanism of remote data consistency, which can be delivered in Pyramid with the RDMA write-with-immediate operation. We also evaluate the performance of random access to remote NVM and demonstrate the performance opportunity brought by remotely accessible NVM by comparing it with newer storage technologies, namely NVMe SSDs and PCM-based SSDs. Finally, we argue that Pyramid promises memory scalability with a good performance guarantee.

International Conference on Web Services | 2016

InnerCache: A Tactful Cache Mechanism for RDMA-Based Key-Value Store

Min Yang; Songping Yu; Rujie Yu; Nong Xiao; Fang Liu; Wei Chen

Remote Direct Memory Access (RDMA), a high-performance network technology, has shown a tremendous advantage over traditional TCP/IP. With its ultra-low latency and high bandwidth, RDMA has been extensively adopted in distributed environments, especially for in-memory key-value stores. However, although RDMA provides the ability to interact with remote user-space memory directly, an in-memory key-value store using two-sided communication semantics still copies memory between the data memory area and the communication memory area. In addition, using high-performance one-sided communication semantics exposes memory entirely, so an inadvertent operation that corrupts data could crash the system. In this paper, we propose InnerCache, a tactful cache mechanism for RDMA-based in-memory key-value stores. Our design addresses two dimensions: improving system performance with two-sided communication semantics and making the system less vulnerable. It merges the one-sided and two-sided communication models by making communication memory cacheable. Experimental results show that InnerCache can efficiently improve the performance of an RDMA-based in-memory key-value store.

High Performance Computing and Communications | 2016

An Experimental Study of Redundant Array of Independent SSDs and Filesystems

Yuxuan Xing; Ya Feng; Songping Yu; Zhengguo Chen; Fang Liu; Nong Xiao

Solid state disks (SSDs) are becoming more and more popular in personal devices and data centers. Flash chips can be packaged in hard disk drive (HDD) form factors and provide the same interface as HDDs. This characteristic lets SSDs easily replace HDDs in existing storage systems. PCIe-based SSDs can provide higher I/O performance, but they are still somewhat expensive. This paper studies the feasibility of Redundant Arrays of Independent SSDs (RAIS) with different filesystems. We comprehensively analyze the performance of RAIS built from SATA SSDs and from PCIe SSDs. We investigate different RAIS configurations (RAIS0, 5, 6) and filesystems under various I/O access patterns. Finally, we present several key findings and recommendations for building RAIS.

High Performance Computing and Communications | 2016

HPGraph: A High Parallel Graph Processing System Based on Flash Array

Yuxuan Xing; Ya Feng; Songping Yu; Zhengguo Chen; Fang Liu; Nong Xiao

Large-graph analytics is an important aspect of many big data applications, such as web search, social networks, and recommendation systems. Over the past few years, much research has focused on processing large-scale graphs using distributed systems, while a number of studies have turned to building graph processing systems on a single server-class machine for reasons of cost, usability, and maintainability. HPGraph is a highly parallel graph processing system that adopts the edge-centric model. Our contributions are as follows: (1) designing an efficient data allocation and access strategy for NUMA machines, with task scheduling to keep the load balanced; (2) introducing a fine-grained edge-block filtering mechanism to avoid accessing unnecessary edge data; (3) constructing a high-speed flash array as secondary storage. We performed a detailed evaluation on a 16-core machine using a set of popular real-world and synthetic data sets, and the results show that HPGraph always outperforms GridGraph, a state-of-the-art single-machine graph processing system. HPGraph can run 1.27X faster than GridGraph for specific applications. Our source code is available at https://github.com/xinghuan1990/HPGraph.

Asia Pacific Services Computing Conference | 2015

RAID-6Plus: A Fast and Reliable Coding Scheme Aided by Multi-failure Degradation

Mingzhu Deng; Yang Ou; Nong Xiao; Songping Yu; Wei Chen; Zhiguang Chen; Fang Liu

Existing triple-failure-tolerant codes assume that failures are independent and instantaneous. Such assumptions overlook the underlying mechanism of multi-failure occurrences and ignore the effect of the reconstruction window. These codes are not adapted to the occurrence patterns of failures in real-world applications. As a result, the third parity drive is almost idle, because it is set to handle the triple-failure scenario only, leaving lower-level failure situations unattended. Furthermore, the problem of single-failure rebuild worsens with increasing disk capacity, and the system's reliability decreases, impairing the user experience. To address these problems, a fast reconstructable coding scheme extended from RAID-6 has been developed in this study. RAID-6Plus maintains a smaller reconstruction window by recoding the third parity drive. Existing codes provide absolute reliability for triple failures via full combinations. In contrast, RAID-6Plus employs short combinations, which can greatly reuse overlapped elements during reconstruction, to remake the third parity drive. The short combinations shorten the reconstruction window of a single failure, which avoids multiple failures overlapping within the reconstruction window. The capability of multi-failure degradation provides RAID-6Plus with (1) better system performance compared to RTP and STAR and (2) enhanced reliability compared to RAID-6.