Ye Cai | Researchain

Publication

Featured research published by Ye Cai.

international symposium on parallel architectures, algorithms and programming | 2014

Characteristic Analysis of Operating Systems for Large Scale Hierarchical NUMA System

Qiuming Luo; Yuanyuan Zhou; Mei Wang; Ye Cai

LH-NUMAs are essentially clusters of NUMA nodes with a globally shared memory abstraction, implemented either in hardware or in software. This abstract architecture behaves much like an expensive mainframe while keeping the price as low as that of commercial server clusters. It is a promising candidate architecture for cloud computing in today's era of big data. This article studies the special requirements that LH-NUMA places on the OS, and pushes the OS design principles of Hive, Fos, Multikernel, etc., a little further by embracing the characteristics of LH-NUMA. The contributions include: 1) analyzing the architecture of LH-NUMA, a distinctive architecture located between cluster and mainframe; 2) analyzing the challenges that the characteristics of LH-NUMA pose to the OS; 3) analyzing the inefficiency of running current many-core OSes on LH-NUMA; 4) applying the design principles of Hive, Fos, Multikernel and other many-core OSes to LH-NUMA and analyzing their advantages and shortcomings there; 5) listing the requirements and outlining the framework principles for an OS on LH-NUMA.

parallel and distributed computing: applications and technologies | 2012

Quantitatively Measuring the Memory Locality Leakage on NUMA Systems Based on Instruction-Based-Sampling

Qiuming Luo; Chengjian Liu; Chang Kong; Ye Cai

Sustaining memory locality is critical for obtaining high performance on NUMA systems. But how to identify a locality leakage problem and how to measure the leakage are still open issues. This paper provides an algorithm to quantitatively measure the locality leakage based on the memory trace produced by IBS (Instruction-Based Sampling). A “perfect matrix” PM, which represents the highest locality pattern, is generated from the virtual memory address trace. A “communication matrix” CM is obtained from the physical memory address trace to describe the actual memory access pattern. The penalty factors are calculated from PM and CM, taking the hardware NUMA factor into account. The leakage is measured by the difference between the penalty factors of PM and those of CM, which can be used to estimate the performance loss and guide optimization. Experimental results are shown to demonstrate the effectiveness and accuracy of the quantitative measurement.
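The abstract does not give the exact definitions of the matrices or the penalty factor, so the following is only a minimal sketch under stated assumptions: each matrix counts accesses from the node running a thread to the node holding the memory, the penalty is the access-weighted mean of a hypothetical NUMA factor table, and the leakage is the penalty of CM minus the penalty of PM. The names `access_matrix`, `penalty`, `leakage`, and `numa_factor` are illustrative and do not come from the paper.

```python
def access_matrix(trace, n_nodes):
    """Count accesses per (cpu_node, mem_node) pair from a sampled trace."""
    matrix = [[0] * n_nodes for _ in range(n_nodes)]
    for cpu_node, mem_node in trace:
        matrix[cpu_node][mem_node] += 1
    return matrix


def penalty(matrix, numa_factor):
    """Access-weighted mean NUMA factor (assumed form of the penalty factor)."""
    total = sum(sum(row) for row in matrix)
    weighted = sum(
        matrix[i][j] * numa_factor[i][j]
        for i in range(len(matrix))
        for j in range(len(matrix))
    )
    return weighted / total


def leakage(pm, cm, numa_factor):
    """Difference between the actual (CM) and ideal (PM) penalties."""
    return penalty(cm, numa_factor) - penalty(pm, numa_factor)


# Hypothetical two-node machine: remote access costs 1.5x a local access.
numa_factor = [[1.0, 1.5], [1.5, 1.0]]

# Ideal pattern: every access is local.
pm = access_matrix([(0, 0)] * 4 + [(1, 1)] * 4, 2)

# Observed pattern: three of the eight accesses go to the remote node.
cm = access_matrix([(0, 0)] * 3 + [(0, 1)] + [(1, 1)] * 2 + [(1, 0)] * 2, 2)

print(leakage(pm, cm, numa_factor))  # 0.1875
```

A leakage of zero would mean the observed pattern is as local as the ideal one; larger values indicate more remote traffic relative to the ideal placement.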

parallel and distributed computing: applications and technologies | 2011

Performance Evaluation of OpenMP Constructs and Kernel Benchmarks on a Loongson-3A Quad-Core SMP System

Qiuming Luo; Chang Kong; Ye Cai; Gang Liu

As a competitor and alternative to mainstream general-purpose CPUs (Intel, AMD, etc.), Loongson is a family of general-purpose MIPS-compatible CPUs developed at the ICT of CAS in China. The quad-core Loongson 3A is evaluated in this paper. The performance of the basic OpenMP constructs on a Loongson-3A quad-core SMP is obtained by applying the EPCC microbenchmarks, and the performance of NAS kernel codes is then obtained by applying the NAS Parallel Benchmarks (NPB). The benchmarks are carried out for three different OpenMP compilers (and their runtime systems): GCC, OMPi-pth (OMPi with the pthread library) and OMPi-psth (OMPi with the psthread library). The results show that OMPi-pth's performance is the best and OMPi-psth's performance is the worst. These test results might help in writing OpenMP code as well as in selecting the appropriate compiler and runtime system. An Intel Core i5 quad-core platform running NPB is used for comparison, which implies that the Loongson 3A's performance is nearly one tenth of the i5's. The NPB results can help to define the scale of a Loongson system when replacing an Intel i5 system for a given problem size.

network and parallel computing | 2014

Optimization of Uncore Data Flow on NUMA Platform

Qiuming Luo; Yuanyuan Zhou; Chang Kong; Mei Wang; Ye Cai

The Uncore part of the processor has a profound effect on performance, especially in NUMA systems, since it connects the cores, last level caches (LLC), multiple on-chip memory controllers (MCs) and high-speed interconnects. In our previous study, we investigated the data flow of several benchmarks in the Uncore of the Intel Westmere microarchitecture and found that the data flow of the Global Queue (GQ) and the QuickPath Home Logic (QHL) suffers from serious imbalance and congestion. In this paper, targeting the low efficiency of entries in GQ and QHL, we set up an M/M/3 queue model for the data flow of the three trackers of GQ and QHL, and then design a Dynamic Entries Management (DEM) mechanism that could improve the efficiency of the entries dramatically. The model is implemented in Matlab to simulate two different data flow patterns. Experimental results show that the DEM mechanism reduces the stall cycles of the trackers significantly: it reduces stall cycles by almost 60% under smooth request sequences and by roughly 20-30% under burst request sequences.
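For readers unfamiliar with the queueing model named above, the standard steady-state quantities of an M/M/c queue can be computed with the Erlang C formula. The sketch below does not reproduce the paper's Matlab model or the DEM mechanism; the function name `mmc_metrics` and the arrival and service rates are hypothetical, chosen only for illustration.

```python
from math import factorial


def mmc_metrics(arrival_rate, service_rate, servers=3):
    """Steady-state metrics of an M/M/c queue (Erlang C formula)."""
    offered = arrival_rate / service_rate   # offered load a = lambda / mu
    rho = offered / servers                 # per-server utilization
    if rho >= 1:
        raise ValueError("unstable queue: utilization must be below 1")

    # Term for the states in which all servers are busy.
    tail = offered**servers / (factorial(servers) * (1 - rho))
    # Probability that the system is empty.
    p0 = 1.0 / (sum(offered**k / factorial(k) for k in range(servers)) + tail)

    p_wait = tail * p0                           # probability a request must wait
    mean_queue_len = p_wait * rho / (1 - rho)    # mean number of waiting requests
    mean_wait = mean_queue_len / arrival_rate    # mean waiting time (Little's law)
    return {
        "utilization": rho,
        "p_wait": p_wait,
        "mean_queue_len": mean_queue_len,
        "mean_wait": mean_wait,
    }


# Hypothetical rates: 2 requests per time unit, each server handles 1 per time unit.
metrics = mmc_metrics(arrival_rate=2.0, service_rate=1.0, servers=3)
print(metrics)  # p_wait = 4/9, mean_queue_len = 8/9 for these rates
```

In such a model, a waiting request corresponds loosely to a stalled tracker request, so `p_wait` and `mean_wait` give a rough feel for how stall behaviour changes with load.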

international symposium on parallel architectures, algorithms and programming | 2011

The Design and Implementation of OMPit: An OpenMP Compiler Characterized by Logs for Parallel and Work-Sharing

Qiuming Luo; Ye Cai; Chengjian Liu; Chang Kong

There are many tools for OpenMP benchmarking that measure various aspects of performance, such as the overheads of OpenMP directives and the characteristics of the whole system. But we lack tools that show the work-sharing details once an OpenMP program has finished running. OMPit (OMPi for tutoring) is designed to record work-sharing information during the run, which can be used for tutoring and might help with debugging or tuning. The work-sharing log includes the work assignment and the timestamps for three different work-sharing behaviors. The logging information can be output as a text file or visualized as figures. The design of OMPit is presented, and the details of how the logging code is inserted into the OMPi compiler are discussed.
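To illustrate what a work-assignment log might contain, the sketch below models how OpenMP's `schedule(static, chunk)` hands out loop iterations: chunks are assigned to threads round-robin in thread-number order. This is not OMPit's actual log format, which the abstract does not describe; the function name `static_chunks` and the record fields `thread`, `first`, and `last` are hypothetical, and timestamps are left out.

```python
def static_chunks(n_iters, n_threads, chunk):
    """Model schedule(static, chunk): chunks go to threads round-robin."""
    log = []
    for index, start in enumerate(range(0, n_iters, chunk)):
        end = min(start + chunk, n_iters)
        log.append({
            "thread": index % n_threads,  # thread that executes this chunk
            "first": start,               # first iteration in the chunk
            "last": end - 1,              # last iteration in the chunk
        })
    return log


# 10 iterations, 2 threads, chunk size 3.
for record in static_chunks(10, 2, 3):
    print(record)
```

Printing the records per thread makes load imbalance visible at a glance; here thread 1 ends up with the short final chunk.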

network and parallel computing | 2013