Publications


Featured research published by Zhaoguo Wang.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2011

COREMU: a scalable and portable parallel full-system emulator

Zhaoguo Wang; Ran Liu; Yufei Chen; Xi Wu; Haibo Chen; Weihua Zhang; Binyu Zang

This paper presents the open-source COREMU, a scalable and portable parallel emulation framework that decouples the complexity of parallelizing full-system emulators from building a mature sequential one. The key observation is that CPU cores and devices in current (and likely future) multiprocessors are loosely coupled and communicate through well-defined interfaces. Based on this observation, COREMU emulates multiple cores by creating multiple instances of existing sequential emulators, and uses a thin library layer to handle inter-core and device communication and synchronization so as to maintain a consistent view of system resources. COREMU also incorporates lightweight memory transactions, feedback-directed scheduling, lazy code invalidation, and adaptive signal control to provide scalable performance. To make COREMU useful in practice, we also provide some preliminary tools and APIs that can help programmers diagnose performance problems and (concurrency) bugs. A working prototype, which reuses the widely used QEMU as the sequential emulator, was built with only 2,500 lines of code (LOC) changed in QEMU. It currently supports x64 and ARM platforms, and can emulate up to 255 cores running commodity OSes with practical performance, whereas QEMU cannot scale above 32 cores. A performance evaluation against QEMU indicates that COREMU has negligible uniprocessor emulation overhead and performs and scales significantly better than QEMU. We also show how COREMU can be used to diagnose performance problems and concurrency bugs in both OS kernels and parallel applications.
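
As a rough illustration of the decoupling idea described above, the sketch below runs one sequential "emulator" instance per emulated core in its own thread and routes inter-core notifications through a thin, mutex-protected mailbox layer. This is not COREMU's actual code; the names (CoreEmulator, Mailbox) and the mailbox protocol are hypothetical.

```cpp
// Hypothetical sketch of COREMU's structure (names are illustrative, not from
// COREMU): one sequential emulator instance per emulated core, each running in
// its own thread, plus a thin library layer for inter-core communication.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Mailbox {                 // thin inter-core communication layer
    std::mutex m;
    std::queue<int> pending;     // e.g., inter-processor interrupt vectors
    void post(int irq) { std::lock_guard<std::mutex> g(m); pending.push(irq); }
    bool poll(int& irq) {
        std::lock_guard<std::mutex> g(m);
        if (pending.empty()) return false;
        irq = pending.front(); pending.pop(); return true;
    }
};

struct CoreEmulator {            // stands in for one sequential emulator instance
    int id = 0;
    Mailbox inbox;
    void run(std::atomic<bool>& stop) {
        while (!stop.load()) {
            // ... execute a chunk of translated guest code for this core ...
            int irq;
            if (inbox.poll(irq)) // check for cross-core events between chunks
                std::printf("core %d handles IPI %d\n", id, irq);
            std::this_thread::yield();
        }
    }
};

int main() {
    const int ncores = 4;                       // scale by adding more instances
    std::vector<CoreEmulator> cores(ncores);
    std::atomic<bool> stop{false};
    std::vector<std::thread> threads;
    for (int i = 0; i < ncores; ++i) {
        cores[i].id = i;
        threads.emplace_back([&cores, &stop, i] { cores[i].run(stop); });
    }
    cores[2].inbox.post(42);                    // deliver an IPI to core 2
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    stop = true;
    for (auto& t : threads) t.join();
}
```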


European Conference on Computer Systems | 2014

Using restricted transactional memory to build a scalable in-memory database

Zhaoguo Wang; Hao Qian; Jinyang Li; Haibo Chen

The recent availability of Intel Haswell processors marks the transition of hardware transactional memory from research toys to mainstream reality. DBX is an in-memory database that uses Intel's restricted transactional memory (RTM) to achieve high performance and good scalability across multi-core machines. The main limitation (and also the key to practicality) of RTM is its constrained working set size: an RTM region that reads or writes too much data will always be aborted. The design of DBX addresses this challenge in several ways. First, DBX builds a database transaction layer on top of an underlying shared-memory store. The two layers use separate RTM regions to synchronize shared-memory access. Second, DBX uses optimistic concurrency control to separate transaction execution from its commit. Only the commit stage uses RTM for synchronization. As a result, the working set of the RTM regions scales with the metadata of the reads and writes in a database transaction rather than with the amount of data read or written. Our evaluation using the TPC-C workload mix shows that DBX achieves 506,817 transactions per second on a 4-core machine.
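
A minimal sketch of the commit-only RTM idea is shown below: the transaction runs optimistically, and only the commit step enters a small RTM region that validates read versions and applies writes, so the RTM working set contains only version words and write metadata. This is an illustration of OCC-style commit under RTM, not DBX's actual code, and it assumes an RTM-capable x86 CPU (compile with -mrtm) plus a hypothetical global fallback lock.

```cpp
// Illustrative OCC commit protected by a small RTM region (not DBX's code).
// Assumes an RTM-capable x86 CPU; compile with: g++ -std=c++17 -mrtm occ.cpp
#include <immintrin.h>
#include <atomic>
#include <vector>

struct Record { long version = 0; long value = 0; };
struct ReadEntry  { Record* rec; long seen_version; };
struct WriteEntry { Record* rec; long new_value; };

static std::atomic<bool> fallback_lock{false};   // taken when RTM keeps aborting

// Validate read versions and apply writes. The RTM working set is only the
// version words and write metadata, not all data touched during execution.
bool commit(const std::vector<ReadEntry>& reads, std::vector<WriteEntry>& writes) {
    for (int attempt = 0; attempt < 8; ++attempt) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (fallback_lock.load(std::memory_order_relaxed))
                _xabort(0x01);                   // subscribe to the fallback lock
            for (const auto& r : reads)
                if (r.rec->version != r.seen_version)
                    _xabort(0xff);               // stale read: validation fails
            for (auto& w : writes) { w.rec->value = w.new_value; ++w.rec->version; }
            _xend();
            return true;
        }
        if ((status & _XABORT_EXPLICIT) && _XABORT_CODE(status) == 0xff)
            return false;                        // validation failed: abort txn
        // Otherwise: conflict or capacity abort; retry, then fall back.
    }
    while (fallback_lock.exchange(true)) { }     // non-transactional fallback path
    bool ok = true;
    for (const auto& r : reads)
        if (r.rec->version != r.seen_version) { ok = false; break; }
    if (ok)
        for (auto& w : writes) { w.rec->value = w.new_value; ++w.rec->version; }
    fallback_lock.store(false);
    return ok;
}
```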


ACM Transactions on Storage | 2017

Efficient and Available In-Memory KV-Store with Hybrid Erasure Coding and Replication

Haibo Chen; Heng Zhang; Mingkai Dong; Zhaoguo Wang; Yubin Xia; Haibing Guan; Binyu Zang

An in-memory key/value store (KV-store) is a key building block for many systems such as databases and large websites. Two key requirements for such systems are efficiency and availability, which demand that a KV-store continuously handle millions of requests per second. A common approach to availability is replication, such as primary-backup replication (PBR), which, however, requires M+1 times the memory to tolerate M failures. This leaves less of the scarce memory for useful user jobs. This article makes the first case for building a highly available in-memory KV-store by integrating erasure coding to achieve memory efficiency, while not notably degrading performance. A main challenge is that an in-memory KV-store has highly scattered metadata: a single KV put may cause excessive coding operations and parity updates due to many small updates to metadata. Our approach, namely Cocytus, addresses this challenge with a hybrid scheme that leverages PBR for small-sized and scattered data (e.g., metadata and keys), while applying erasure coding only to relatively large data (e.g., values). To mitigate well-known issues such as the lengthy recovery of erasure coding, Cocytus uses an online recovery scheme that leverages the replicated metadata to continuously serve KV requests. To further demonstrate the usefulness of Cocytus, we have built a transaction layer that uses Cocytus as a fast and reliable storage layer for database records and transaction logs. We have integrated the design of Cocytus into Memcached and extended it to support in-memory transactions. Evaluation using YCSB with different KV configurations shows that Cocytus incurs low overhead in latency and throughput and can tolerate node failures with fast online recovery, while saving 33% to 46% of memory compared to PBR when tolerating two failures. A further evaluation using the SmallBank OLTP benchmark shows that in-memory transactions can run atop Cocytus with high throughput, low latency, and a low abort rate, and can recover quickly from consecutive failures.
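
The hybrid scheme can be illustrated with a toy put path that replicates the small, scattered part (key and metadata) to backups while folding only the value into parity via erasure coding. The sketch below uses a single XOR parity as a stand-in for a real (k, m) erasure code and hypothetical replicate()/apply_delta() helpers; it is not Cocytus's actual protocol.

```cpp
// Toy sketch of a hybrid replication/erasure-coding put (not Cocytus's code).
// Small metadata and the key go to backups via primary-backup replication;
// only the (larger) value is erasure coded. A single XOR parity stands in
// for a real (k, m) code such as Reed-Solomon.
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct Meta { uint64_t version; uint32_t value_len; };

struct BackupNode {                       // holds replicated metadata/keys
    void replicate(const std::string& /*key*/, const Meta& /*meta*/) {
        // ... send the small record over the network to this backup ...
    }
};

struct ParityNode {                       // holds coded parity of values
    std::vector<uint8_t> parity;
    void apply_delta(size_t offset, const std::vector<uint8_t>& delta) {
        if (parity.size() < offset + delta.size()) parity.resize(offset + delta.size());
        for (size_t i = 0; i < delta.size(); ++i) parity[offset + i] ^= delta[i];
    }
};

void put(const std::string& key, const std::vector<uint8_t>& old_value,
         const std::vector<uint8_t>& new_value, uint64_t version,
         std::vector<BackupNode>& backups, ParityNode& parity) {
    // 1. Replicate the small, scattered part (metadata and key) to backups.
    Meta m{version, static_cast<uint32_t>(new_value.size())};
    for (auto& b : backups) b.replicate(key, m);

    // 2. Erasure-code only the value: ship the XOR delta of old vs. new bytes,
    //    so one put triggers one parity update rather than many small ones.
    size_t n = std::max(old_value.size(), new_value.size());
    std::vector<uint8_t> delta(n, 0);
    for (size_t i = 0; i < n; ++i) {
        uint8_t o = i < old_value.size() ? old_value[i] : 0;
        uint8_t v = i < new_value.size() ? new_value[i] : 0;
        delta[i] = o ^ v;
    }
    parity.apply_delta(/*offset=*/0, delta);
}
```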


IEEE Computer Architecture Letters | 2015

Persistent Transactional Memory

Zhaoguo Wang; Han Yi; Ran Liu; Mingkai Dong; Haibo Chen

This paper proposes persistent transactional memory (PTM), a new design that adds durability to transactional memory (TM) by incorporating emerging non-volatile memory (NVM). PTM dynamically tracks transactional updates to cache lines to ensure the ACI (atomicity, consistency, and isolation) properties during cache flushes, and leverages an undo log in NVM so that PTM can always consistently recover transactional data structures after a machine crash. This paper describes a PTM design based on Intel's restricted transactional memory. A preliminary evaluation using a concurrent key/value store and a database on a cache-based simulator shows that the cost of the additional cache-line flushes is small.
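
The undo-log idea can be sketched as: persist the old value of each updated location to an NVM-resident log before the in-place update, so recovery can roll back a partially persisted transaction. The snippet below is a highly simplified, hypothetical illustration using clflush/sfence intrinsics on a region that is assumed to be NVM-backed; it is not PTM's actual hardware mechanism.

```cpp
// Simplified undo-log sketch for durable transactions on NVM (not PTM itself).
// Assumes `log` points to NVM-backed memory; on x86, _mm_clflush + _mm_sfence
// force the logged old values to reach persistence before the in-place update.
#include <immintrin.h>
#include <cstdint>

struct UndoEntry { void* addr; uint64_t old_val; };

struct UndoLog {
    UndoEntry entries[64];       // toy: fixed capacity, no bounds checking
    uint64_t  count = 0;
    uint64_t  committed = 0;     // 0 = in flight, 1 = committed
};

static void persist(const void* p, size_t len) {
    for (size_t off = 0; off < len; off += 64)          // flush each cache line
        _mm_clflush(reinterpret_cast<const char*>(p) + off);
    _mm_sfence();                                        // order the flushes
}

// Durable write of one 64-bit word inside a transaction.
void tx_write(UndoLog* log, uint64_t* addr, uint64_t new_val) {
    UndoEntry& e = log->entries[log->count];
    e.addr = addr;
    e.old_val = *addr;
    persist(&e, sizeof(e));                  // 1. old value durable in NVM log
    log->count++;
    persist(&log->count, sizeof(log->count));
    *addr = new_val;                         // 2. apply the update in place
}

void tx_commit(UndoLog* log) {
    for (uint64_t i = 0; i < log->count; ++i)            // 3. flush new data
        persist(log->entries[i].addr, sizeof(uint64_t));
    log->committed = 1;                                   // 4. mark log committed
    persist(&log->committed, sizeof(log->committed));
    log->count = 0;                                       // log can be reused
}
```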


Asia-Pacific Workshop on Systems | 2013

Opportunities and pitfalls of multi-core scaling using hardware transaction memory

Zhaoguo Wang; Hao Qian; Haibo Chen; Jinyang Li

Hardware transactional memory, which holds the promise to simplify and scale up multicore synchronization, has recently become available in mainstream processors in the form of Intel's restricted transactional memory (RTM). Will RTM be a panacea for multi-core scaling? This paper tries to shed some light on this question by studying the performance scalability of a concurrent skip list using competing synchronization techniques, including fine-grained locking, lock-free designs, and RTM (using both Intel's RTM emulator and a real RTM machine). Our experience suggests that RTM indeed simplifies the implementation; however, considerable care must be taken to get good performance. Specifically, to avoid excessive aborts due to RTM capacity misses or conflicts, programmers should move memory allocation/deallocation out of the RTM region, tune the fallback functions, and use compiler optimizations.
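
The allocation advice in particular is easy to demonstrate: allocate the new node before entering the RTM region so the allocator's bookkeeping does not inflate the transaction's working set or cause spurious conflicts, and keep a lock-based fallback for repeated aborts. The sketch below shows this pattern on a toy sorted linked list (not the paper's skip list); identifiers are illustrative and it assumes an RTM-capable CPU (compile with -mrtm).

```cpp
// Pattern sketch: pre-allocate outside the RTM region, keep the region small,
// and fall back to a lock after repeated aborts. Toy sorted list, not the
// paper's skip list; assumes an RTM-capable x86 CPU (compile with -mrtm).
#include <immintrin.h>
#include <atomic>
#include <climits>

struct Node { int key; Node* next; };
static Node head{INT_MIN, nullptr};
static std::atomic<bool> fallback{false};

bool insert(int key) {
    Node* n = new Node{key, nullptr};     // allocation kept OUTSIDE the RTM region
    for (int attempt = 0; attempt < 10; ++attempt) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (fallback.load(std::memory_order_relaxed)) _xabort(1);
            Node* prev = &head;
            while (prev->next && prev->next->key < key) prev = prev->next;
            if (prev->next && prev->next->key == key) { _xend(); delete n; return false; }
            n->next = prev->next;
            prev->next = n;
            _xend();
            return true;
        }
        // Capacity miss or conflict: retry a few times before giving up on RTM.
    }
    while (fallback.exchange(true)) { }   // tuned fallback path: global lock
    Node* prev = &head;
    while (prev->next && prev->next->key < key) prev = prev->next;
    bool inserted = !(prev->next && prev->next->key == key);
    if (inserted) { n->next = prev->next; prev->next = n; } else delete n;
    fallback.store(false);
    return inserted;
}
```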


International Conference on Management of Data | 2016

Scaling Multicore Databases via Constrained Parallel Execution

Zhaoguo Wang; Shuai Mu; Yang Cui; Han Yi; Haibo Chen; Jinyang Li

Multicore in-memory databases often rely on traditional concurrency control schemes such as two-phase locking (2PL) or optimistic concurrency control (OCC). Unfortunately, when the workload exhibits a non-trivial amount of contention, both 2PL and OCC sacrifice much parallel execution opportunity. In this paper, we describe a new concurrency control scheme, interleaving constrained concurrency control (IC3), which provides serializability while allowing parallel execution of certain conflicting transactions. IC3 combines static analysis of the transaction workload with runtime techniques that track and enforce dependencies among concurrent transactions. The use of static analysis simplifies IC3's runtime design, allowing it to scale to many cores. Evaluations on a 64-core machine using the TPC-C benchmark show that IC3 outperforms traditional concurrency control schemes under contention. It achieves a throughput of 434K transactions/sec on the TPC-C benchmark configured with only one warehouse. It also scales better than several recent concurrency control schemes that also target contended workloads.
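
A toy illustration of the runtime side of this idea follows: transactions are split into pieces, a piece records itself on the records it touches, and a later conflicting piece waits only until the specific earlier piece it depends on has finished, rather than waiting for the whole transaction. The conflict relation between piece types is assumed to come from static analysis. All names are hypothetical and the sketch omits the correctness machinery of IC3 itself.

```cpp
// Toy illustration of piece-level dependency tracking and enforcement
// (not IC3 itself; the static-analysis side is assumed, not modeled).
#include <atomic>
#include <thread>

struct PieceTag {                  // who touched this record last, and which piece
    std::atomic<long> txn_id{-1};
    std::atomic<int>  piece_no{-1};
};

struct TxnState {
    long id = 0;
    std::atomic<int> pieces_done{-1};   // highest piece number already finished
};

// Called by transaction `me` before running its piece `piece_no` on `rec`:
// if another in-flight transaction's conflicting piece touched the record,
// wait until that piece (not the whole transaction) has finished.
void enforce_dependency(TxnState& me, int piece_no, PieceTag& rec,
                        TxnState* (*lookup)(long txn_id)) {
    long dep = rec.txn_id.load();
    int  dep_piece = rec.piece_no.load();
    if (dep >= 0 && dep != me.id) {
        TxnState* other = lookup(dep);
        while (other && other->pieces_done.load() < dep_piece)
            std::this_thread::yield();   // wait for the piece, then interleave
    }
    rec.txn_id.store(me.id);             // register this access for successors
    rec.piece_no.store(piece_no);
}

// After running a piece's logic, publish its completion so dependents proceed.
void finish_piece(TxnState& me, int piece_no) {
    me.pieces_done.store(piece_no);
}
```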


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2017

Eunomia: Scaling Concurrent Search Trees under Contention Using HTM

Xin Wang; Weihua Zhang; Zhaoguo Wang; Ziyun Wei; Haibo Chen; Wenyun Zhao

While hardware transactional memory (HTM) has recently been adopted to construct efficient concurrent search tree structures, such designs fail to deliver scalable performance under contention. In this paper, we first conduct a detailed analysis of an HTM-based concurrent B+Tree, which uncovers several reasons for excessive HTM aborts induced by both false and true conflicts under contention. Based on the analysis, we advocate Eunomia, a design pattern for search trees that contains several principles to reduce HTM aborts, including splitting HTM regions with version-based concurrency control to reduce HTM working sets, partitioning the data layout to reduce false conflicts, proactively detecting and avoiding true conflicts, and adapting concurrency control. To validate their effectiveness, we apply these principles to construct a scalable concurrent B+Tree using HTM. Evaluation using key-value store benchmarks on a 20-core HTM-capable multi-core machine shows that Eunomia leads to a 5X-11X speedup under high contention, while incurring small overhead under low contention.
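
One of the listed principles, splitting a large HTM region using version-based concurrency control, can be sketched as: traverse the structure outside any transaction while recording node versions, then perform only the final update in a small HTM region that first re-validates those versions. The code below is a schematic of that pattern on a deliberately tiny two-level "tree" with hypothetical fields, not Eunomia's implementation; it assumes an RTM-capable CPU (compile with -mrtm).

```cpp
// Schematic of "split the HTM region and validate node versions" (not
// Eunomia's code). The traversal runs outside HTM; only the leaf update runs
// inside a small RTM region, which re-checks the versions seen during traversal.
#include <immintrin.h>
#include <cstdint>

constexpr int FANOUT   = 16;
constexpr int LEAF_CAP = 32;

struct Leaf {
    uint64_t version = 0;
    int n = 0;
    uint64_t keys[LEAF_CAP];
    uint64_t vals[LEAF_CAP];
};
struct Root {
    uint64_t version = 0;
    Leaf* leaves[FANOUT] = {};   // toy routing: leaf i covers keys with key % FANOUT == i
};

bool tree_insert(Root* root, uint64_t key, uint64_t value) {
    for (;;) {
        // Phase 1: traversal outside HTM, recording versions along the path.
        uint64_t root_v = root->version;
        Leaf* leaf = root->leaves[key % FANOUT];
        if (!leaf) return false;                    // toy tree: leaves pre-built
        uint64_t leaf_v = leaf->version;

        // Phase 2: small HTM region that validates versions and does the update.
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (root->version != root_v || leaf->version != leaf_v) _xabort(1);
            if (leaf->n == LEAF_CAP) { _xend(); return false; }   // toy: no splits
            leaf->keys[leaf->n] = key;
            leaf->vals[leaf->n] = value;
            leaf->n++;
            leaf->version++;                         // publish the modification
            _xend();
            return true;
        }
        // Aborted (stale version, conflict, or capacity miss): retry traversal.
    }
}
```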


IEEE Transactions on Parallel and Distributed Systems | 2017

Scaling Concurrent Index Structures under Contention Using HTM

Weihua Zhang; Xin Wang; Shiyu Ji; Ziyun Wei; Zhaoguo Wang; Haibo Chen

Hardware transactional memory (HTM) is an emerging hardware feature that simplifies the programming model of concurrent programs while preserving high and scalable performance. With the commercial availability of HTM-capable processors, HTM has recently been adopted to construct efficient concurrent index structures. However, with growing data volumes and numbers of users, data management systems must process workloads that exhibit high contention; meanwhile, according to our experiments, conventional HTM-based concurrent index structures fail to provide scalable performance under highly contended workloads. Such a performance pathology severely constrains the use of HTM in data management systems. In this paper, we first conduct a thorough analysis of HTM-based concurrent index structures and uncover several reasons for excessive HTM aborts incurred by both false and true conflicts under contention. Based on the analysis, we advocate Eunomia, a design pattern for HTM-based concurrent index structures that contains several principles to improve HTM performance, including splitting HTM regions with version-based concurrency control to reduce HTM working sets, partitioning the data layout to reduce false conflicts, proactively detecting and avoiding conflicting requests, and an adaptive concurrency control strategy. To validate their effectiveness, we apply these design principles to construct a scalable concurrent B+Tree and a skip list using HTM. Evaluation using key-value store and database benchmarks on a 20-core HTM-capable multi-core machine shows that Eunomia leads to substantial speedups under high contention, while incurring small overhead under low contention.
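
Another of the listed principles, a partitioned data layout to reduce false conflicts, amounts to keeping state that different threads update concurrently on separate cache lines, since HTM detects conflicts at cache-line granularity and will otherwise abort logically independent transactions. The snippet below is a minimal, hypothetical illustration of that layout choice, not Eunomia's data structures.

```cpp
// Minimal illustration of a partitioned, cache-line-aligned layout to avoid
// false conflicts under HTM's cache-line-granularity conflict detection.
// Hypothetical example, not Eunomia's data structures.
#include <cstdint>

// Problematic: counters packed into one cache line. Two HTM transactions
// updating different counters still conflict because they touch the same line.
struct PackedCounters {
    uint64_t per_thread[8];                 // 64 bytes: all on one cache line
};

// Partitioned: each slot padded and aligned to its own cache line, so
// transactions that touch different partitions have disjoint HTM write sets.
struct alignas(64) Partition {
    uint64_t counter = 0;
    char pad[64 - sizeof(uint64_t)];        // keep neighbors off this line
};

struct PartitionedCounters {
    Partition per_thread[8];                // 8 lines: no false conflicts
};
```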


Asia-Pacific Workshop on Systems | 2017

Extracting More Intra-transaction Parallelism with Work Stealing for OLTP Workloads

Xiaozhou Zhou; Zhaoguo Wang; Rong Chen; Haibo Chen; Jinyang Li

Online transaction processing systems use two-phase locking (2PL) to guarantee serializability. However, traditional 2PL does not perform well under high contention, because a transaction is blocked whenever it fails to acquire a lock. This paper proposes a scalable work-stealing algorithm for 2PL that leverages intra-transaction parallelism. The key idea is to parallelize the lock holder's work among lock waiters. Compared to traditional 2PL, our approach achieves up to a 2.8X throughput improvement for TPC-C new-order transactions under high contention.
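
The core idea, letting lock waiters execute pieces of the lock holder's remaining work instead of blocking, can be sketched with a per-transaction task queue: the holder pushes its independent sub-tasks, and a waiter that fails to acquire the lock pops and runs one of them. The snippet below is a simplified, hypothetical illustration of that pattern, not the paper's algorithm.

```cpp
// Toy sketch of intra-transaction work stealing under 2PL (not the paper's
// algorithm). A waiter that cannot acquire a lock helps the holder by
// stealing one of the holder's queued, independent sub-tasks.
#include <chrono>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

struct TxnWorker {
    std::mutex q_mutex;
    std::deque<std::function<void()>> tasks;   // independent pieces of this txn

    void push_task(std::function<void()> t) {
        std::lock_guard<std::mutex> g(q_mutex);
        tasks.push_back(std::move(t));
    }
    bool steal_and_run() {                     // called by a blocked waiter
        std::function<void()> t;
        {
            std::lock_guard<std::mutex> g(q_mutex);
            if (tasks.empty()) return false;
            t = std::move(tasks.front());
            tasks.pop_front();
        }
        t();                                   // run the holder's work
        return true;
    }
};

// Instead of spinning or blocking on a record lock, help the current holder.
void acquire_or_help(std::timed_mutex& record_lock, TxnWorker& holder) {
    while (!record_lock.try_lock_for(std::chrono::microseconds(10))) {
        if (!holder.steal_and_run())
            std::this_thread::yield();         // nothing to steal: back off
    }
    // ... lock acquired: continue this transaction's own work ...
    record_lock.unlock();
}
```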


USENIX Annual Technical Conference | 2015

Spartan: a distributed array framework with smart tiling

Chien-chin Huang; Qi Chen; Zhaoguo Wang; Russell Power; Jorge Ortiz; Jinyang Li; Zhen Xiao

Collaboration


Dive into Zhaoguo Wang's collaborations.

Top Co-Authors

Haibo Chen (Shanghai Jiao Tong University)

Binyu Zang (Shanghai Jiao Tong University)

Haibing Guan (Shanghai Jiao Tong University)

Han Yi (Shanghai Jiao Tong University)

Hao Qian (Shanghai Jiao Tong University)

Mingkai Dong (Shanghai Jiao Tong University)