Jingling Xue | Researchain

Publication

Featured research published by Jingling Xue.

measurement and modeling of computer systems | 2003

Data cache locking for higher program predictability

Xavier Vera; Björn Lisper; Jingling Xue

Caches have become increasingly important with the widening gap between main memory and processor speeds. However, they are a source of unpredictability due to their characteristics, resulting in programs behaving in a different way than expected. Cache locking mechanisms adapt caches to the needs of real-time systems. Locking the cache is a solution that trades performance for predictability: at a cost of generally lower performance, the time of accessing the memory becomes predictable. This paper combines compile-time cache analysis with data cache locking to estimate the worst-case memory performance (WCMP) in a safe, tight and fast way. In order to get predictable cache behavior, we first lock the cache for those parts of the code where the static analysis fails. To minimize the performance degradation, our method loads the cache, if necessary, with data likely to be accessed. Experimental results show that this scheme is fully predictable, without compromising the performance of the transformed program. When compared to an algorithm that assumes compulsory misses when the state of the cache is unknown, our approach eliminates all overestimation for the set of benchmarks, giving an exact WCMP of the transformed program without any significant decrease in performance.

Parallel Processing Letters | 1997

On Tiling as a Loop Transformation

Jingling Xue

This paper is a follow-up to Irigoin and Triolet's earlier work and our recent work on tiling. In this paper, tiling is discussed in terms of its effects on the dependences between tiles, the dependences within a tile and the required dependence test for legality. A necessary and sufficient condition is given for enforcing the data dependences of the program, while Irigoin and Triolet's atomic tile constraint is only sufficient. A condition is identified under which both Irigoin and Triolet's and our constraints are equivalent. The results of this paper are discussed in terms of their impact on dependence abstractions suitable for legality tests and on tiling to optimise a certain given goal.

Journal of Parallel and Distributed Computing | 1997

Communication-Minimal Tiling of Uniform Dependence Loops

Jingling Xue

Tiling is a loop transformation that a compiler uses to automatically create blocked algorithms in order to exploit the memory hierarchy better and reduce the communication overhead between processors. Motivated by existing results, this paper presents a conceptually simple approach to finding tilings with a minimal amount of communication between tiles. The development of almost all results is based primarily on the inequality of arithmetic and geometric means and the concept of extremal rays from convex cones. The key insight is that a tiling that is communication-minimal must induce the same amount of communication through all faces of a tile, which restricts the search space for optimal tilings to those tiling matrices whose rows are all extremal rays in a cone. For nested loops with several special forms of dependences, closed-form optimal tilings are derived. In the general case, a procedure is given that always returns optimal tilings. An efficient implementation of the procedure, along with experimental results, is presented. A detailed comparison of this work with some existing results is provided.
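
As a rough illustration of tiling as a loop transformation (not the communication-minimal procedure from the paper), the sketch below runs a small uniform-dependence loop nest in its original order and in rectangular tiles. The function names and the recurrence are assumptions chosen for the example. Both dependence vectors, (1, 0) and (0, 1), have non-negative components, which is why executing whole rectangular tiles one after another is legal here.

```python
def untiled(n):
    """Reference loop nest: a[i][j] depends on a[i-1][j] and a[i][j-1]."""
    a = [[1] * n for _ in range(n)]
    for i in range(1, n):
        for j in range(1, n):
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a


def tiled(n, tile):
    """Same computation, executed one tile-by-tile block at a time."""
    a = [[1] * n for _ in range(n)]
    for ii in range(1, n, tile):                    # loops over tiles
        for jj in range(1, n, tile):
            for i in range(ii, min(ii + tile, n)):  # loops within one tile
                for j in range(jj, min(jj + tile, n)):
                    a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a


if __name__ == "__main__":
    # The tiled order must produce exactly the same values as the original.
    print(tiled(8, 3) == untiled(8))
```

The tile size only changes the execution order, so any positive value should give the same result; choosing it well is the optimisation problem the abstract describes.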

real-time systems symposium | 2003

Data caches in multitasking hard real-time systems

Xavier Vera; Björn Lisper; Jingling Xue

Data caches are essential in modern processors, bridging the widening gap between main memory and processor speeds. However, they yield very complex performance models, which make it hard to bound execution times tightly. This paper contributes a new technique to obtain predictability in preemptive multitasking systems in the presence of data caches. We explore the use of cache partitioning, dynamic cache locking, and static cache analysis to provide worst-case performance estimates in a safe and tight way. Cache partitioning divides the cache among tasks to eliminate inter-task cache interferences. We combine static cache analysis and cache locking mechanisms to ensure that all intra-task conflicts, and consequently, memory access times, are exactly predictable. To minimize the performance degradation due to cache partitioning and locking, two strategies are employed. First, the cache is loaded with data likely to be accessed so that their cache utilization is maximized. Second, compiler optimizations such as tiling and padding are applied in order to reduce cache replacement misses. Experimental results show that this scheme is fully predictable, without compromising the performance of the transformed programs. Our method outperforms static cache locking for all analyzed task sets under various cache architectures, with a CPU utilization reduction ranging between 3.8 and 20.0 times for a high performance system.

symposium on code generation and optimization | 2010

Level by level: making flow- and context-sensitive pointer analysis scalable for millions of lines of code

Hongtao Yu; Jingling Xue; Wei Huo; Xiaobing Feng; Zhaoqing Zhang

We present a practical and scalable method for flow- and context-sensitive (FSCS) pointer analysis for C programs. Our method analyzes the pointers in a program level by level in terms of their points-to levels, allowing the points-to relations of the pointers at a particular level to be discovered based on the points-to relations of the pointers at this level and higher levels. This level-by-level strategy can enhance the scalability of the FSCS pointer analysis in two fundamental ways, by enabling (1) fast and accurate flow-sensitive analysis on full sparse SSA form using a flow-insensitive algorithm and (2) fast and accurate context-sensitive analysis using a full transfer function and a meet function for each procedure. Our level-by-level algorithm, LevPA, gives rise to (1) a precise and compact SSA representation for subsequent program analysis and optimization tasks and (2) a flow- and context-sensitive MAY/MUST mod (modification) set and read set for each procedure. Our preliminary results show that LevPA can analyze some programs with over a million lines of C code in minutes, faster than the state-of-the-art FSCS methods.

high-performance computer architecture | 2002

Let's study whole-program cache behaviour analytically

Xavier Vera; Jingling Xue

Based on a new characterisation of data reuse across multiple loop nests, we present a method, a prototype implementation and some experimental results for analysing the cache behaviour of whole programs with regular computations. Validation against cache simulation using real codes shows the efficiency and accuracy of our method. The largest program we have analysed, Applu from SPECfP95, has 3868 lines, 16 subroutines and 2565 references. In the case of a 32KB cache with a 32B line size, our method obtains the miss ratio with an absolute error of about 0.80% in about 128 seconds while the simulator used runs for nearly 5 hours on a 933MHz Pentium III PC. Our method can be used to guide compiler locality optimisations and improve cache simulation performance.
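
For context on what the analytical method is validated against, here is a minimal sketch of trace-driven cache simulation, the slow baseline, and not the paper's analysis. It assumes a direct-mapped cache (the abstract does not state the associativity) with the 32KB capacity and 32B line size mentioned above; the function name and the address traces are made up for the example.

```python
def miss_ratio(addresses, cache_bytes=32 * 1024, line_bytes=32):
    """Simulate a direct-mapped cache over a trace of byte addresses."""
    num_lines = cache_bytes // line_bytes
    resident = [None] * num_lines   # memory block currently held by each line
    misses = 0
    total = 0
    for addr in addresses:
        block = addr // line_bytes  # which memory block the address falls in
        index = block % num_lines   # which cache line that block maps to
        if resident[index] != block:
            resident[index] = block
            misses += 1
        total += 1
    return misses / total if total else 0.0


if __name__ == "__main__":
    # Sequential 8-byte accesses: one miss per 32B line, i.e. 1 in 4.
    print(miss_ratio(range(0, 4096, 8)))
    # Two addresses exactly one cache size apart evict each other every time.
    print(miss_ratio([0, 32 * 1024, 0, 32 * 1024]))
```

A simulator like this must touch every access in the trace, which is why an analytical estimate of the miss ratio can be so much faster for long-running programs.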

ACM Transactions on Embedded Computing Systems | 2007

Data cache locking for tight timing calculations

Xavier Vera; Björn Lisper; Jingling Xue

Caches have become increasingly important with the widening gap between main memory and processor speeds. Small and fast cache memories are designed to bridge this discrepancy. However, they are only effective when programs exhibit sufficient data locality. In addition, caches are a source of unpredictability, resulting in programs sometimes behaving in a different way than expected. Detailed information about the number of cache misses and their causes allows us to predict cache behavior and to detect bottlenecks. Small modifications in the source code may change memory patterns, thereby altering the cache behavior. Code transformations that take the cache behavior into account might result in a large improvement in cache performance. However, cache memory behavior is very hard to predict, thus making the task of optimizing and timing cache behavior very difficult. This article proposes and evaluates a new compiler framework that times cache behavior for multitasking systems. Our method explores the use of cache partitioning and dynamic cache locking to provide worst-case performance estimates in a safe and tight way for multitasking systems. We use cache partitioning, which divides the cache among tasks to eliminate intertask cache interferences. We combine static cache analysis and cache-locking mechanisms to ensure that all intratask conflicts, and consequently, memory access times, are exactly predictable. The results of our experiments demonstrate the capability of our framework to describe cache behavior at compile time. We compare our timing approach with a system equipped with a nonpartitioned but statically locked data cache. Our method outperforms static cache locking for all analyzed task sets under various cache architectures, demonstrating that our fully predictable scheme does not compromise the performance of the transformed programs.

international symposium on software testing and analysis | 2012

Static memory leak detection using full-sparse value-flow analysis

Yulei Sui; Ding Ye; Jingling Xue

We introduce a static detector, Saber, for detecting memory leaks in C programs. Leveraging recent advances in sparse pointer analysis, Saber is the first to use a full-sparse value-flow analysis for leak detection. Saber tracks the flow of values from allocation to free sites using a sparse value-flow graph (SVFG) that captures def-use chains and value flows via assignments for all memory locations represented by both top-level and address-taken pointers. By exploiting field-, flow- and context-sensitivity during different phases of the analysis, Saber detects leaks in a program by solving a graph reachability problem on its SVFG. Saber, which is fully implemented in Open64, is effective at detecting 211 leaks in the 15 SPEC2000 C programs and five applications, while keeping the false positive rate at 18.5%. We have also compared Saber with Fastcheck (which analyzes allocated objects flowing only into top-level pointers) and Sparrow (which handles all allocated objects using abstract interpretation) using the 15 SPEC2000 C programs. Saber is as accurate as Sparrow but is 14.2X faster, and it reports 40.7% more bugs than Fastcheck at a slightly higher false positive rate while being only 3.7X slower.
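
The idea of treating leak detection as graph reachability can be sketched on a toy value-flow graph. This is only an illustration under simplifying assumptions: the graph is a plain dictionary, the node names are invented, and the check reports allocations whose value never reaches any free site. It does not model Saber's SVFG construction, its field-, flow- or context-sensitivity, or leaks that occur only on some paths.

```python
from collections import deque


def reachable(graph, source):
    """Return every node reachable from source via value-flow edges (BFS)."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for succ in graph.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return seen


def never_freed(graph, alloc_sites, free_sites):
    """Allocation sites whose value flows to no free site at all."""
    free_sites = set(free_sites)
    return [a for a in alloc_sites if not (reachable(graph, a) & free_sites)]


if __name__ == "__main__":
    # Hypothetical graph: an edge x -> y means the value of x may flow into y.
    value_flow = {
        "malloc@1": ["p"],
        "p": ["q"],
        "q": ["free@9"],
        "malloc@2": ["r"],
        "r": [],
    }
    print(never_freed(value_flow, ["malloc@1", "malloc@2"], {"free@9"}))
```

In this toy graph the first allocation reaches free@9 through p and q, while the second allocation reaches no free site and is reported.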

symposium on code generation and optimization | 2011

Acculock: Accurate and efficient detection of data races

Xinwei Xie; Jingling Xue

Happens-before detectors are precise but can be too conservative to detect certain data races in repeated test runs as they are sensitive to thread interleaving. By making the opposite tradeoffs, lockset detectors can detect more races but are not precise (they report false positives). Of the two types of detectors, happens-before detectors run more slowly as they use expensive vector clocks. Existing hybrid race detectors (combining lockset and happens-before) alleviate some of the limitations in both analysis techniques at the cost of additional analysis overhead. Recently, due to FastTrack, epoch-based happens-before and lockset detectors now exhibit comparable performance. It is time to rethink how to design a hybrid race detector to balance precision and coverage, by leveraging the lightweightness of epoch clocks. Acculock is the first such solution. Acculock analyzes a program by reasoning about the subset of the happens-before relation observed with lock acquires and releases excluded, thereby reducing its sensitivity to thread interleaving. When such a weaker happens-before relation is violated, Acculock applies a new efficient lockset algorithm to enforce a lock-based synchronization discipline by distinguishing the locks protecting reads and writes. The key motivation is to ensure that Acculock can improve on happens-before detectors by also discovering data races in alternate thread interleavings when analyzing one program execution, while limiting the false warnings thus incurred in a controlled manner. In addition, Acculock achieves these objectives while maintaining performance comparable to FastTrack, the fastest happens-before detector. All these properties of Acculock are validated and confirmed by comparing it against six other detectors, all implemented in Jikes RVM, using 11 benchmark programs.
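
To make the lockset side of this tradeoff concrete, the following is a simplified sketch of a classic lockset check, not Acculock's algorithm: it has no happens-before component, no epochs, and does not distinguish the locks protecting reads from those protecting writes. The event format and function name are assumptions for the example. A variable is flagged when it is accessed by more than one thread and no single lock was held on every access.

```python
def lockset_races(events):
    """events: (thread, op, target) tuples, op in {"acq", "rel", "rd", "wr"}.

    Returns the variables touched by several threads with no common lock.
    """
    held = {}        # thread -> locks it currently holds
    candidates = {}  # variable -> locks held on every access seen so far
    accessors = {}   # variable -> threads that have accessed it
    for thread, op, target in events:
        locks = held.setdefault(thread, set())
        if op == "acq":
            locks.add(target)
        elif op == "rel":
            locks.discard(target)
        else:  # "rd" or "wr": intersect with the locks held right now
            if target in candidates:
                candidates[target] &= locks
            else:
                candidates[target] = set(locks)
            accessors.setdefault(target, set()).add(thread)
    return {v for v, ls in candidates.items()
            if not ls and len(accessors[v]) > 1}


if __name__ == "__main__":
    trace = [
        ("T1", "acq", "m"), ("T1", "wr", "x"), ("T1", "rel", "m"),
        ("T2", "wr", "x"),                      # x written without holding m
        ("T2", "acq", "m"), ("T2", "wr", "y"), ("T2", "rel", "m"),
        ("T1", "acq", "m"), ("T1", "rd", "y"), ("T1", "rel", "m"),
    ]
    print(lockset_races(trace))
```

Because the check looks only at which locks are held, it reports x regardless of how the two threads happened to interleave in this run, which is the source of both the extra coverage and the false positives the abstract mentions.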

symposium on code generation and optimization | 2012

On-demand dynamic summary-based points-to analysis

Lei Shang; Xinwei Xie; Jingling Xue

Static analyses can typically be accelerated by reducing redundancies. Modern demand-driven points-to or alias analysis techniques rest on the foundation of Context-Free Language (CFL) reachability. These techniques achieve high precision efficiently for a small number of queries raised in small programs but may still be too slow in answering many queries for large programs in a context-sensitive manner. We present an approach, called DynSum, to perform context-sensitive demand-driven points-to analysis fully on-demand by means of computing CFL-reachability summaries without any precision loss. The novelty lies in initially performing a Partial Points-To Analysis (PPTA) within a method, which is field-sensitive but context-independent, to summarize its local points-to relations encountered during a query and reusing this information later in the same or different calling contexts. We have compared DynSum with RefinePTS, a refinement-based analysis, using three clients (safe casting, null dereferencing and factory methods) for a suite of nine Java programs. DynSum's average speedups are 1.95x, 2.28x and 1.37x, respectively. We have also compared DynSum with a static approach, which is referred to as StaSum here, to show its improved scalability for the same three clients.
