Jianchen Shan
New Jersey Institute of Technology
Publication
Featured research published by Jianchen Shan.
IEEE International Conference on Cloud Computing Technology and Science | 2010
Jinshi Zhu; Yongmei Lei; Jianchen Shan
This paper constructs and implements a parallel computational model and algorithm based on space decomposition, which supports dynamic resource allocation in a cluster environment. The main aim is to explore a new space decomposition scheme for solving computation-intensive problems. The fast multipole method (FMM) is an algorithm for rapidly evaluating the potential and force fields in systems involving large numbers of particles. Based on the serial FMM algorithm, a parallel implementation named SDPFMM is presented in the paper. The proposed algorithm is characterized by scalability and flexibility. We carried out experiments on the high-performance computer ZQ3000, with Intel Trace Analyzer and Collector integrated into SDPFMM, and analyzed the experimental data and the MPI performance of SDPFMM. The results demonstrate that the proposed algorithm performs well in both efficiency and solution quality.
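The space-decomposition idea can be illustrated without the full FMM machinery: particles are binned into uniform cells, and cells are then distributed across processes. The sketch below shows only that decomposition step; the grid size, rank count, and block assignment are illustrative assumptions, not the SDPFMM design.

```c
#include <stdio.h>

/* Minimal sketch of a space-decomposition step (not SDPFMM itself):
 * particles are binned into a uniform grid of cells, and cells are
 * assigned to ranks in contiguous blocks. Domain size, grid resolution,
 * and the assignment policy are illustrative assumptions. */

#define GRID 8            /* 8x8 cells over a unit square (assumption) */
#define NUM_RANKS 4       /* hypothetical number of MPI ranks */

typedef struct { double x, y; } Particle;

/* Map a particle to its cell index in row-major order. */
static int cell_of(const Particle *p) {
    int cx = (int)(p->x * GRID); if (cx == GRID) cx = GRID - 1;
    int cy = (int)(p->y * GRID); if (cy == GRID) cy = GRID - 1;
    return cy * GRID + cx;
}

/* Assign each cell to a rank by contiguous blocks of cells. */
static int rank_of_cell(int cell) {
    int cells_per_rank = (GRID * GRID + NUM_RANKS - 1) / NUM_RANKS;
    return cell / cells_per_rank;
}

int main(void) {
    Particle pts[5] = {{0.1,0.2},{0.9,0.9},{0.5,0.5},{0.3,0.8},{0.7,0.1}};
    for (int i = 0; i < 5; i++) {
        int c = cell_of(&pts[i]);
        printf("particle %d -> cell %d -> rank %d\n", i, c, rank_of_cell(c));
    }
    return 0;
}
```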
IEEE International Conference on Cloud Computing Technology and Science | 2015
Jianchen Shan; Xiaoning Ding; Narain Gehani
On virtualized platforms, Lock Holder Preemption (LHP) is known as a serious problem: it makes virtual CPUs (VCPUs) spin excessively while waiting for locks and seriously degrades performance. To address this problem, hardware facilities, such as Intel PLE and AMD PF, are provided on processors to preempt spinning VCPUs. Though these facilities are predominantly used in mainstream virtualization systems, using them in a manner that achieves the highest performance is still a challenging issue. The core issue in dealing with the LHP problem is to determine the best time to preempt spinning VCPUs (i.e., spinning thresholds). Due to the semantic gap between different software layers, the virtual machine monitor (VMM) does not have information about whether a VCPU is spinning normally (i.e., waiting for a lock to be released quickly) or spinning excessively (i.e., waiting for a lock which is currently held by a preempted VCPU and cannot be released quickly). Thus, it cannot determine adequate thresholds for preempting spinning VCPUs to achieve high performance. Preempting spinning VCPUs late wastes system resources; preempting them prematurely incurs costly context switches between VCPUs and delays lock acquisition. The paper addresses the issue of preempting spinning VCPUs with an end-to-end approach named Adaptive PLE (APLE). APLE monitors the execution efficiency of each VM by collecting the overhead incurred by wasteful spinning and wasteful VCPU switches. Then, it periodically adjusts the spinning threshold to reduce the overhead and increase the execution efficiency of the VM. The implementation of APLE incurs only minimal changes to existing systems (about 80 lines of code in KVM). The experiments with multicore workloads show that APLE can improve throughput by up to 68%.
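The feedback loop the abstract describes can be sketched compactly: per interval, compare the time wasted by excessive spinning with the time wasted by premature VCPU switches, and move the PLE window (spinning threshold) in the direction that shrinks the larger overhead. All constants and the two "measured" inputs below are illustrative assumptions, not the actual KVM changes.

```c
#include <stdio.h>

/* Illustrative feedback loop in the spirit of APLE (not the KVM patch):
 * each interval, compare spinning overhead with VCPU-switch overhead and
 * adjust the PLE window accordingly. Constants and inputs are assumptions. */

#define PLE_WINDOW_MIN   1024
#define PLE_WINDOW_MAX  65536
#define PLE_WINDOW_STEP  1024

static unsigned ple_window = 4096;   /* current spinning threshold (cycles) */

/* Placeholder inputs a real implementation would collect inside the VMM. */
static unsigned long wasted_spin_ns(void)   { return 120000; }
static unsigned long wasted_switch_ns(void) { return  45000; }

static void adjust_ple_window(void) {
    unsigned long spin = wasted_spin_ns();
    unsigned long sw   = wasted_switch_ns();

    if (spin > sw && ple_window > PLE_WINDOW_MIN)
        ple_window -= PLE_WINDOW_STEP;   /* preempt spinners sooner */
    else if (sw > spin && ple_window < PLE_WINDOW_MAX)
        ple_window += PLE_WINDOW_STEP;   /* tolerate short spins longer */
}

int main(void) {
    for (int interval = 0; interval < 5; interval++) {
        adjust_ple_window();
        printf("interval %d: ple_window = %u\n", interval, ple_window);
    }
    return 0;
}
```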
IEEE Transactions on Parallel and Distributed Systems | 2017
Jianchen Shan; Xiaoning Ding; Narain Gehani
Spin-locks are widely used in software for efficient synchronization. However, they cause serious performance degradation on virtualized platforms, such as the Lock Holder Preemption (LHP) problem and the Lock Waiter Preemption (LWP) problem, due to excessive spinning by virtual CPUs (VCPUs). The excessive spinning occurs when a VCPU waits to acquire a spin-lock. To address the performance degradation, hardware facilities, such as Intel PLE and AMD PF, are provided on processors to preempt VCPUs when they spin excessively. Although these facilities are predominantly used in mainstream virtualization systems, using them in a manner that achieves the highest performance is still a challenging issue. There are two core problems in using these hardware facilities to reduce excessive spinning. One is to determine the best time to preempt a spinning VCPU (i.e., the selection of spinning thresholds). The other is to decide which VCPU should be scheduled to run after the spinning VCPU is descheduled. Due to the semantic gap between different software layers, the virtual machine monitor (VMM) does not have information about the computation characteristics on VCPUs, which is needed to address the above problems. This makes the problems inherently challenging. We propose a framework named Adaptive Pause-Loop Exiting and Scheduling (APPLES) to address these problems. APPLES monitors the overhead caused by excessive spinning and by preempting spinning VCPUs, and periodically adjusts spinning thresholds to reduce the overhead. APPLES also evaluates and schedules "ready" VCPUs in a VM by their potential to reduce the spinning incurred by spin-lock synchronization. The evaluation is based on the causality and timing of VCPU preemptions. The implementation of APPLES incurs only minimal changes to existing systems (about 100 lines of code in KVM). Experiments show that APPLES can improve performance by 3~49 percent (14 percent on average) for workloads with frequent spin-lock operations.
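Beyond adjusting the threshold, APPLES also chooses which ready VCPU to run after a spinning VCPU is descheduled. One plausible heuristic along the lines the abstract describes is to favor the earliest-preempted ready VCPU, on the assumption that it is the most likely holder of the contended lock; the sketch below encodes only that ranking idea with hypothetical fields, not the actual APPLES scheduler.

```c
#include <stdio.h>

/* Hypothetical sketch of candidate-VCPU ranking: among ready-but-preempted
 * VCPUs of a VM, prefer the one preempted earliest, assuming it is the most
 * likely holder of the lock others spin on. Fields and scoring are
 * illustrative assumptions, not the paper's algorithm. */

typedef struct {
    int           id;
    int           ready;          /* 1 if runnable but currently descheduled */
    unsigned long preempted_at;   /* timestamp of the last preemption */
} VCpu;

static int pick_next_vcpu(const VCpu *v, int n) {
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (!v[i].ready)
            continue;
        if (best < 0 || v[i].preempted_at < v[best].preempted_at)
            best = i;             /* earliest-preempted ready VCPU wins */
    }
    return best;
}

int main(void) {
    VCpu vcpus[3] = { {0, 1, 900}, {1, 0, 100}, {2, 1, 300} };
    int next = pick_next_vcpu(vcpus, 3);
    printf("schedule VCPU %d next\n", next >= 0 ? vcpus[next].id : -1);
    return 0;
}
```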
IEEE Transactions on Cloud Computing | 2017
Nafize R. Paiker; Jianchen Shan; Cristian Borcea; Narain Gehani; Reza Curtmola; Xiaoning Ding
IEEE Transactions on Parallel and Distributed Systems | 2016
Xiaoning Ding; Jianchen Shan; Song Jiang
IEEE International Conference on Cloud Computing Technology and Science | 2015
Xiaoning Ding; Jianchen Shan
With cloud assistance, mobile apps can offload their resource-demanding computation tasks to the cloud. This leads to a scenario where computation tasks in the same program run concurrently on both the mobile device and the cloud. An important challenge is to ensure that the tasks are able to access and share files on both the mobile device and the cloud in a manner that is efficient, consistent, and transparent to location. Existing distributed file systems and network file systems do not satisfy these requirements, and current offloading systems either do not support file access for offloaded tasks or do not offload tasks with file access. The paper addresses this issue by designing and implementing an application-level file system called Overlay File System (OFS). To improve efficiency, OFS maintains and buffers local copies of data sets on both the cloud and the mobile device. OFS ensures consistency and guarantees that all reads get the latest data. It combines write-invalidate and write-update policies to effectively reduce the network traffic incurred by invalidating/updating stale data copies and to reduce the execution delay when the latest data cannot be accessed locally. To guarantee location transparency, OFS creates a unified view of the data that is location independent and is accessible as local storage. We overcome the challenges that the special features of mobile systems pose to an application-level file system, such as the lack of root privilege and the loss of state when an application is killed due to resource shortage, and implement an easy-to-deploy prototype of OFS. The paper tests the OFS prototype on Android OS with a real mobile app and real mobile user traces. Extensive experiments show that OFS can effectively support consistent file accesses from computation tasks, no matter whether they run on a mobile device or are offloaded to the cloud. In addition, OFS reduces both file access latency and the network traffic incurred by file accesses.
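The consistency mechanism described above hinges on choosing, per write, between invalidating the remote copy and pushing the update to it. The decision sketch below uses a made-up cost model (immediate shipping cost versus expected on-demand re-fetch cost) that stands in for OFS's actual policy; every field and constant is an assumption for illustration.

```c
#include <stdio.h>
#include <stddef.h>

/* Minimal sketch of a hybrid write-invalidate / write-update choice of the
 * kind the OFS abstract describes. The cost model below is a made-up
 * stand-in for the real policy. */

typedef enum { WRITE_INVALIDATE, WRITE_UPDATE } Policy;

typedef struct {
    size_t dirty_bytes;        /* size of the modified region */
    size_t file_bytes;         /* size of the whole cached file */
    double remote_read_prob;   /* estimated chance the other copy is read soon */
    double bandwidth_bps;      /* current bandwidth estimate (bytes/s) */
    double rtt_s;              /* round-trip time in seconds */
} WriteInfo;

static Policy choose_policy(const WriteInfo *w) {
    /* Write-update: pay the cost of shipping the dirty region now. */
    double update_cost = (double)w->dirty_bytes / w->bandwidth_bps;
    /* Write-invalidate: the remote side may later re-fetch the file on demand. */
    double invalidate_cost = w->remote_read_prob *
        ((double)w->file_bytes / w->bandwidth_bps + w->rtt_s);
    return (update_cost <= invalidate_cost) ? WRITE_UPDATE : WRITE_INVALIDATE;
}

int main(void) {
    WriteInfo w = { 64 * 1024, 1024 * 1024, 0.2, 1e6, 0.05 };
    printf("%s\n", choose_policy(&w) == WRITE_UPDATE ? "write-update"
                                                     : "write-invalidate");
    return 0;
}
```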
USENIX Annual Technical Conference | 2014
Xiaoning Ding; Phillip B. Gibbons; Michael Kozuch; Jianchen Shan
In high-end data processing systems, such as databases, the level of execution concurrency has risen continuously since the introduction of multicore processors, both on premises and in the cloud. For these systems, highly scalable buffer pool management plays an important role in overall system performance. The scalability of buffer pool management is largely determined by its data replacement algorithm, a major component of buffer pool management, which can seriously degrade scalability if not designed and implemented properly. The root cause is its use of lock-protected data structures, which incur high contention under concurrent accesses. A common practice is to modify the replacement algorithm to reduce contention on the lock(s), such as approximating LRU replacement with the CLOCK algorithm or partitioning the data structures and using distributed locks. Unfortunately, such modifications usually compromise the algorithm's hit ratio, a major performance goal, and may also involve significant effort in overhauling the original algorithm's design and implementation. This paper provides a general solution for improving the scalability of buffer pool management with any replacement algorithm, for data processing systems on physical on-premises machines and on virtual machines in the cloud. Instead of making a difficult trade-off between the high hit ratio of a replacement algorithm and the low lock contention of its approximation, we design a system framework, called BP-Wrapper, that eliminates almost all lock contention without requiring any changes to an existing algorithm. In BP-Wrapper, we use a dynamic batching technique and a prefetching technique to reduce lock contention while retaining a high hit ratio. The implementation of BP-Wrapper in PostgreSQL adds only about 300 lines of C code. It can increase throughput by up to a factor of two, compared with replacement algorithms suffering lock contention, when running TPC-C-like and TPC-W-like workloads.
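The dynamic batching idea is easy to illustrate: each thread logs page accesses into a private batch and only takes the global replacement-algorithm lock when the batch fills, applying all logged accesses in one critical section. The skeleton below shows just that batching structure with a placeholder replacement routine; it is a sketch, not the PostgreSQL patch.

```c
#include <pthread.h>
#include <stdio.h>

/* Skeleton of the dynamic-batching technique described for BP-Wrapper:
 * record accesses in a per-thread batch and acquire the replacement
 * algorithm's lock only once per full batch. The replacement routine
 * itself is a placeholder. Compile with -pthread. */

#define BATCH_SIZE 64

static pthread_mutex_t repl_lock = PTHREAD_MUTEX_INITIALIZER;

typedef struct {
    int pages[BATCH_SIZE];
    int count;
} AccessBatch;

/* Placeholder: a real system would update LRU/CLOCK/etc. metadata here. */
static void replacement_record_access(int page_id) {
    (void)page_id;
}

/* Called on every buffer access; the lock is touched only when the batch fills. */
static void record_access(AccessBatch *batch, int page_id) {
    batch->pages[batch->count++] = page_id;
    if (batch->count == BATCH_SIZE) {
        pthread_mutex_lock(&repl_lock);
        for (int i = 0; i < batch->count; i++)
            replacement_record_access(batch->pages[i]);
        pthread_mutex_unlock(&repl_lock);
        batch->count = 0;
    }
}

int main(void) {
    AccessBatch batch = { .count = 0 };
    for (int i = 0; i < 1000; i++)
        record_access(&batch, i % 128);
    printf("done; %d accesses left unflushed in the batch\n", batch.count);
    return 0;
}
```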
International Conference on Body Area Networks | 2016
Pradyumna Neog; Hillol Debnath; Jianchen Shan; Nafize R. Paiker; Narain Gehani; Reza Curtmola; Xiaoning Ding; Cristian Borcea
Hardware-assisted virtualization, an effective approach to lowering virtualization overhead, is now used predominantly. However, existing hardware assistance mainly focuses on single-thread performance. Much less attention has been paid to facilitating efficient interaction between threads, which is critical to the execution of multi-threaded computation on virtualized multicore platforms. This paper aims to answer two questions: 1) what is the performance impact of virtualization on multi-threaded computation, and 2) what factors impede multi-threaded computation from gaining full speed on virtualized platforms. Targeting the first question, the paper measures the virtualization overhead for computation-intensive applications designed for multicore processors. We show that some multicore applications still suffer significant performance losses in virtual machines. Even with hardware assistance for reducing virtualization overhead fully enabled, execution time may increase by more than 150% when the system is not over-committed, and system throughput can be reduced by 6x when the system is over-committed. To answer the second question, the paper diagnoses, through experiments, the main causes of the performance losses. Focusing on the interaction between threads and between VCPUs, the paper identifies and examines several performance factors, including the intervention of the virtual machine monitor (VMM) to schedule/switch virtual CPUs (VCPUs) and to handle the interrupts required by inter-core communication, excessive spinning in user space, and cache-unaware data sharing.
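One of the identified factors, excessive spinning in user space, is easy to reproduce with a naive spin barrier: on bare metal each thread arrives quickly, but on an over-committed host a preempted VCPU can keep its peers burning cycles for a whole scheduling slice. The toy example below only illustrates that pattern; it is not code from the paper.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Toy illustration of "excessive spinning in user space": a naive spin
 * barrier whose busy-waiting becomes pure waste whenever a peer thread's
 * VCPU is descheduled by the VMM. Compile with -pthread. */

#define NTHREADS 4

static atomic_int arrived = 0;

static void spin_barrier(void) {
    atomic_fetch_add(&arrived, 1);
    while (atomic_load(&arrived) < NTHREADS)
        ;  /* burns CPU; wasted entirely if a peer's VCPU is preempted */
}

static void *worker(void *arg) {
    (void)arg;
    spin_barrier();
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("all %d threads passed the spin barrier\n", NTHREADS);
    return 0;
}
```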
USENIX Annual Technical Conference | 2018
Weiwei Jia; Cheng Wang; Xusheng Chen; Jianchen Shan; Xiaowei Shang; Heming Cui; Xiaoning Ding; Luwei Cheng; Francis C. M. Lau; Yuexuan Wang; Yuangang Wang
International Conference on Parallel and Distributed Systems | 2017
Jianchen Shan; Weiwei Jia; Xiaoning Ding