Publication


Featured research published by Ke Bai.


International Conference on Hardware/Software Codesign and System Synthesis | 2010

Heap data management for limited local memory (LLM) multi-core processors

Ke Bai; Aviral Shrivastava

This paper presents a scheme to manage heap data in the local memory present in each core of a limited local memory (LLM) multi-core processor. While it is possible to manage heap data semi-automatically using a software cache, managing the heap data of a core through a software cache may require changing the code of the other threads. Cross-thread modifications are difficult to code and debug, and only become more difficult as we scale the number of cores. We propose a semi-automatic, scalable scheme for heap data management that hides this complexity in a library with a much more natural programming interface. Furthermore, for embedded applications, where the maximum heap size can be known at compile time, we propose optimizations on the heap management that significantly improve application performance. Experiments on several benchmarks from MiBench executing on the Sony PlayStation 3 show that our scheme is easier to use, and if the maximum size of heap data is known, our optimizations can improve application performance by an average of 14%.
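
To make the flavor of such a library concrete, here is a minimal sketch under invented assumptions: heap objects live in a simulated main-memory pool and are referenced by global handles, while a handful of fixed-size slots stand in for the core's local memory. All names (heap_alloc, g2l, the slot sizes) are illustrative, not the paper's actual API, and memcpy stands in for the DMA transfers a real LLM core would issue.

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// Illustrative sketch only: a simulated main-memory heap plus a tiny
// local-memory buffer of slots. memcpy models DMA transfers.
static uint8_t global_heap[1 << 20];          // simulated main-memory heap
static size_t  global_top = 0;

constexpr size_t LOCAL_SLOTS = 4;             // very small local memory
constexpr size_t SLOT_SIZE   = 64;

struct Slot { size_t gaddr; bool valid; uint8_t data[SLOT_SIZE]; };
static Slot   local_buf[LOCAL_SLOTS];
static size_t next_victim = 0;                // round-robin eviction

// Allocate in global memory; return a global "handle", not a raw pointer.
size_t heap_alloc(size_t bytes) {
    size_t h = global_top;
    global_top += (bytes + SLOT_SIZE - 1) / SLOT_SIZE * SLOT_SIZE;
    return h;
}

// Translate a global handle to a usable local pointer, fetching on a miss.
void* g2l(size_t gaddr) {
    size_t base = gaddr / SLOT_SIZE * SLOT_SIZE;
    for (auto& s : local_buf)
        if (s.valid && s.gaddr == base) return s.data + (gaddr - base);
    Slot& v = local_buf[next_victim];
    next_victim = (next_victim + 1) % LOCAL_SLOTS;
    if (v.valid) memcpy(global_heap + v.gaddr, v.data, SLOT_SIZE); // write back
    memcpy(v.data, global_heap + base, SLOT_SIZE);                 // fetch
    v.gaddr = base; v.valid = true;
    return v.data + (gaddr - base);
}

int main() {
    size_t node = heap_alloc(sizeof(int));
    *(int*)g2l(node) = 42;                    // access through translation
    printf("%d\n", *(int*)g2l(node));
}
```

The key property is that application code touches heap data only through the translation call, which is what lets a library hide eviction and write-back behind a near-natural pointer-style interface.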


Application-Specific Systems, Architectures and Processors | 2010

Dynamic code mapping for limited local memory systems

Seung Chul Jung; Aviral Shrivastava; Ke Bai

This paper presents heuristics for the dynamic management of application code on the limited local memories present in high-performance multi-core processors. Previous techniques formulate the problem using call graphs, which do not capture the temporal ordering of functions. In addition, they use only a conservative estimate of the interference cost between functions to obtain a mapping. As a result, previous techniques are unable to achieve efficient code mapping. The techniques proposed in this paper overcome both of these limitations and achieve superior code mapping. Experimental results from executing benchmarks from MiBench on the Cell processor in the Sony PlayStation 3 demonstrate up to 29% and an average 12% performance improvement, at tolerable compile-time overhead.
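
The limitation can be seen in a toy cost model. A call-graph-based mapper knows only that two functions may share a local-memory region, so it must charge conservatively for every call, whereas the temporal order of calls reveals how often the functions actually displace one another. A small sketch, with all names and costs invented:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Toy illustration: estimate the cost of mapping two functions to the same
// local-memory region. A conservative (call-graph) estimate charges a reload
// for every call; a trace-based estimate charges one only when the other
// function actually occupied the region since the last call.
int trace_based_cost(const std::vector<std::string>& call_seq,
                     const std::string& a, const std::string& b,
                     int reload_cost) {
    std::string resident;          // which of a/b currently holds the region
    int cost = 0;
    for (const auto& f : call_seq) {
        if (f != a && f != b) continue;
        if (f != resident) cost += reload_cost;   // region must be reloaded
        resident = f;
    }
    return cost;
}

int main() {
    // F1 is called many times in a row: it interferes with F2 far less than
    // a per-call estimate suggests, because temporal ordering matters.
    std::vector<std::string> seq = {"F1","F1","F1","F1","F2","F1","F1","F1"};
    int conservative = (int)seq.size() * 10;      // reload on every call
    int actual = trace_based_cost(seq, "F1", "F2", 10);
    printf("conservative=%d, trace-based=%d\n", conservative, actual);
}
```

In the example, long runs of F1 between F2 calls mean far fewer reloads than the conservative estimate assumes, which is exactly the information a call graph discards.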


Design Automation Conference | 2013

SSDM: smart stack data management for software managed multicores (SMMs)

Jing Lu; Ke Bai; Aviral Shrivastava

Software Managed Multicore (SMM) architectures have been proposed as a solution for scaling the memory architecture. In an SMM architecture, there are no caches, and each core has only a local scratchpad memory. If the code and data of the task to be executed on an SMM core cannot all fit in the local memory, then data must be managed explicitly in the program through DMA instructions. While all code and data need to be managed, an efficient technique to manage stack data is of utmost importance, since an average of 64% of all accesses may be to stack variables [16]. In this paper, we formulate the problem of stack data management optimization on an SMM core. We then develop both an ILP formulation and a heuristic, SSDM (Smart Stack Data Management), to find where to insert stack data management calls in the program. Experimental results demonstrate that SSDM can reduce the overhead by 13X over the state-of-the-art stack data management technique [10].
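
A back-of-the-envelope sketch of why the placement of management calls matters (all numbers invented): checking before every call inside a hot loop pays the management overhead on each iteration, whereas proving that the loop body's frames fit together allows one consolidated check outside the loop.

```cpp
#include <cstdio>

// Toy illustration of management-call placement. Each management call costs
// a fixed overhead (DMA setup, bookkeeping). Naive placement checks before
// every call inside a loop; a smarter placement hoists one consolidated
// check out of the loop. All numbers are invented.
int main() {
    int iterations     = 1000;
    int calls_per_iter = 3;        // loop body calls three small functions
    int call_cost      = 40;       // cycles per management call

    long naive   = (long)iterations * calls_per_iter * call_cost;
    long hoisted = call_cost;      // one consolidated check before the loop
    printf("naive placement:   %ld cycles\n", naive);
    printf("hoisted placement: %ld cycles\n", hoisted);
}
```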


Design, Automation, and Test in Europe | 2013

Automatic and efficient heap data management for limited local memory multicore architectures

Ke Bai; Aviral Shrivastava

Limited Local Memory (LLM) multi-core architectures substitute caches with scratchpad memories (SPMs), and therefore have much lower power consumption. Because they lack automatic memory management, programming such architectures is challenging: the programmer or compiler must efficiently manage the limited local memory. Managing the heap data of the tasks executing on the cores of an LLM multi-core is an important problem. This paper presents a fully automated and efficient scheme for heap data management. Specifically, we propose i) a code transformation that automates heap management, with seamless support for multi-level pointers, and ii) improved data structures to manage unlimited heap data more efficiently. Experimental results on several benchmarks from MiBench demonstrate an average 43% performance improvement over the previous approach [1].
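
The flavor of such a code transformation can be shown on a two-level pointer chase: each level of dereference is rewritten to go through a translation that fetches the pointed-to object into local memory first. The node layout, handle encoding, and the g2l_node name below are all invented; memcpy again simulates DMA.

```cpp
#include <cstdio>
#include <cstdint>
#include <cstring>

// Illustrative only: a linked-list node whose 'next' field stores a global
// handle instead of a raw pointer, so multi-level chains can be walked with
// one translation per dereference.
struct Node { int val; size_t next; };        // next == 0 terminates

static uint8_t global_mem[4096];              // simulated main memory
static Node    local_slot;                    // one-node local buffer

Node* g2l_node(size_t h) {                    // fetch node h into local memory
    memcpy(&local_slot, global_mem + h, sizeof(Node));
    return &local_slot;
}

int main() {
    // Build a two-node list directly in simulated global memory.
    Node a{10, 128}, b{32, 0};
    memcpy(global_mem + 64,  &a, sizeof a);
    memcpy(global_mem + 128, &b, sizeof b);

    // Transformed form of "v = head->next->val": translate at each level.
    Node*  n1 = g2l_node(64);
    size_t h2 = n1->next;
    Node*  n2 = g2l_node(h2);
    printf("%d\n", n2->val);                  // prints 32
}
```

Storing global handles rather than raw pointers in the next field is what makes arbitrary-depth chains (multi-level pointers) translatable one hop at a time.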


International Conference on Hardware/Software Codesign and System Synthesis | 2013

CMSM: an efficient and effective code management for software managed multicores

Ke Bai; Jing Lu; Aviral Shrivastava; Bryce Holton

As we scale the number of cores in a multicore processor, scaling the memory hierarchy is a major challenge. Software Managed Multicore (SMM) architectures are one of the promising solutions. In an SMM architecture, there are no caches, and each core has only a local scratchpad memory. If the code and data of the task mapped to a core do not all fit in its local scratchpad memory, then explicit code and data management is required. In this paper, we solve the problem of efficiently managing code on an SMM architecture. We extend the state of the art by i) correctly calculating the code management overhead, even in the presence of branches in the task, and ii) developing a heuristic, CMSM (Code Mapping for Software Managed multicores), that results in efficient code management on the local scratchpad memory. Our experimental results, collected after executing applications from the MiBench suite [1] on the Cell SPEs (the Cell is an SMM architecture) [2], demonstrate that correct management cost calculation and branch consideration can improve performance by 12%. Our heuristic CMSM can reduce runtime in more than 80% of the cases, and by up to 20% on our set of benchmarks.
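
The effect of branch awareness on the cost model can be illustrated with invented numbers: if only one path of a branch in a function F calls a function that shares F's region, charging a reload on every execution of F overstates the true overhead.

```cpp
#include <cstdio>

// Toy model of why branches matter when costing a code mapping. Suppose F's
// body has a branch; only the taken path calls G, which shares F's region.
// A branch-unaware model charges a reload of F after *every* execution of F,
// while a branch-aware one weights the reload by the path frequency.
// All numbers are invented for illustration.
int main() {
    int    executions  = 1000;
    double p_calls_G   = 0.1;      // fraction of executions taking the G path
    int    reload_cost = 50;       // cycles to DMA F back into its region

    double unaware = executions * reload_cost;
    double aware   = executions * p_calls_G * reload_cost;
    printf("branch-unaware cost: %.0f cycles\n", unaware);
    printf("branch-aware cost:   %.0f cycles\n", aware);
}
```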


Compilers, Architecture, and Synthesis for Embedded Systems | 2011

Vector class on limited local memory (LLM) multi-core processors

Ke Bai; Di Lu; Aviral Shrivastava

The Limited Local Memory (LLM) multi-core architecture is a promising solution for a scalable memory hierarchy. An LLM architecture, e.g., the IBM Cell/B.E., is a purely distributed memory architecture in which each core can directly access only its small local memory, which is why it is extremely power-efficient. Vector is a popular container class in the C++ Standard Template Library (STL) that provides functionality similar to a dynamic array. Due to the small, non-virtualized memory of the LLM architecture, the standard vector implementation cannot be used as-is. In this paper, we propose and implement a scheme to manage the vector class in the local memory present in each core of an LLM multi-core architecture. Our scalable solution transparently maintains vector data between the shared global memory and the local memories. In addition, our vector class provides different data transfer granularities to achieve better performance. We also propose a mechanism to ensure the validity of pointers to elements when vector elements are moved into the global memory. Experimental results show that our vector class can improve programmability significantly while the overhead is contained within 7%.
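
A miniature of the idea, with invented names and sizes: the elements live in simulated global memory and are pulled into a one-block local window on demand, where the block size is the tunable transfer granularity and memcpy models DMA.

```cpp
#include <cstdio>
#include <cstring>
#include <cstdint>

// Miniature illustration of an LLM-friendly vector. Invented names
// throughout; one cached block stands in for the local-memory buffers.
template <typename T, size_t BLOCK = 16>
class llm_vector {
    uint8_t* global_;                 // backing store in "main memory"
    size_t   size_ = 0, cached_ = SIZE_MAX;
    T        window_[BLOCK];          // local-memory window
public:
    explicit llm_vector(size_t cap)
        : global_(new uint8_t[cap * sizeof(T)]()) {}
    ~llm_vector() { flush(); delete[] global_; }

    void flush() {                    // write the window back to global memory
        if (cached_ != SIZE_MAX)
            memcpy(global_ + cached_ * BLOCK * sizeof(T), window_,
                   sizeof(window_));
    }
    T& operator[](size_t i) {         // fetch the containing block on a miss
        size_t blk = i / BLOCK;
        if (blk != cached_) {
            flush();
            memcpy(window_, global_ + blk * BLOCK * sizeof(T),
                   sizeof(window_));
            cached_ = blk;
        }
        return window_[i % BLOCK];
    }
    void push_back(const T& v) { (*this)[size_++] = v; }
    size_t size() const { return size_; }
};

int main() {
    llm_vector<int> v(1024);
    for (int i = 0; i < 100; ++i) v.push_back(i);
    printf("%d %d\n", v[3], v[99]);   // elements come back through the window
}
```

Note that operator[] returns a reference into the window, which becomes stale as soon as another block is fetched; this is precisely the pointer-to-element validity hazard the paper's mechanism addresses.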


Application-Specific Systems, Architectures and Processors | 2011

Stack data management for Limited Local Memory (LLM) multi-core processors

Ke Bai; Aviral Shrivastava; Saleel Kudchadker

Limited Local Memory (LLM) architectures are power-efficient, scalable multi-core memory architectures in which the cores have a scratchpad-like local memory that is software controlled. Any data transfer between the main memory and the local memory must be explicitly present as Direct Memory Access (DMA) commands in the application. Stack data management for the cores is an important problem in LLM architectures, and our previous work outlined a promising scheme for it [1]. In this paper, we improve on the previous approach so that we can i) manage unlimited stack data, ii) increase the applicability of stack management, and iii) perform stack management with a smaller footprint in the local memory. We demonstrate these improvements by executing benchmarks from the MiBench suite on the IBM Cell processor.
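
A rough sketch of the shape of such management, with invented names (_check_in/_check_out) and with eviction modeled as a byte counter rather than actual DMA: calls are bracketed so that frames that no longer fit in the local stack region are spilled to global memory and fetched back on return, which is what makes the manageable stack depth effectively unlimited.

```cpp
#include <cstdio>

// Hypothetical sketch of compiler-inserted stack management. Frame sizes are
// tracked explicitly; "eviction" here just counts bytes moved, standing in
// for the DMA a real manager would issue.
constexpr int LOCAL_STACK = 256;   // bytes of local memory reserved for stack
static int  local_used = 0;
static long bytes_dma  = 0;        // management overhead we want to minimize

void _check_in(int frame_size) {   // inserted before a call
    if (local_used + frame_size > LOCAL_STACK) {
        bytes_dma += local_used;   // spill resident frames to global memory
        local_used = 0;
    }
    local_used += frame_size;
}
void _check_out(int frame_size) {  // inserted after the call returns
    local_used -= frame_size;
    if (local_used < 0) {          // caller frames were spilled: fetch back
        bytes_dma += -local_used;
        local_used = 0;
    }
}

int rec(int n) {                   // deep recursion exceeds the local stack
    _check_in(64);
    int r = (n <= 1) ? 1 : n + rec(n - 1);
    _check_out(64);
    return r;
}

int main() {
    rec(20);
    printf("simulated DMA traffic for stack management: %ld bytes\n",
           bytes_dma);
}
```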


ACM Transactions on Embedded Computing Systems | 2013

A software-only scheme for managing heap data on limited local memory (LLM) multicore processors

Ke Bai; Aviral Shrivastava

This article presents a scheme for managing heap data in the local memory present in each core of a limited local memory (LLM) multicore architecture. Although managing heap data semi-automatically with a software cache is feasible, it may require modifications to the code of other threads. Cross-thread modifications are very difficult to code and debug, and become more complex and challenging as we increase the number of cores. In this article, we propose an intuitive programming interface: an automatic and scalable scheme for heap data management. In addition, for embedded applications, where the maximum heap size can be profiled, we propose several optimizations on our heap management that significantly decrease the library overheads. Our experiments on several benchmarks from MiBench executing on the Sony PlayStation 3 show that our scheme is natural to use, and if the maximum size of heap data is known, our optimizations can improve application performance by an average of 14%.


ACM Transactions on Embedded Computing Systems | 2015

Efficient Code Assignment Techniques for Local Memory on Software Managed Multicores

Jing Lu; Ke Bai; Aviral Shrivastava

Scaling the memory hierarchy is a major challenge when we scale the number of cores in a multicore processor. Software Managed Multicore (SMM) architectures have emerged as one of the promising solutions. In an SMM architecture, there are no caches, and each core has only a local scratchpad memory [Banakar et al. 2002]. As the local memory is usually small, large applications cannot be executed on it directly; the code and data of the task mapped to each core must be managed between global memory and local memory. This article solves the problem of efficiently managing code on an SMM architecture. The primary requirement for generating efficient code assignments is a correct management cost model. In this article, we address this problem by proposing a cost calculation graph. In addition, we develop two heuristics, CMSM (Code Mapping for Software Managed multicores) and CMSM_advanced, that result in efficient code management on the local scratchpad memory. Experimental results collected after executing applications from the MiBench suite [Guthaus et al. 2001] demonstrate that merely by adopting the correct management cost calculation, even with previous code assignment schemes, we can improve performance by an average of 12%. Combining the correct management cost model with a more optimized code mapping algorithm, our heuristics can reduce runtime in more than 80% of the cases, and by up to 20% on our set of benchmarks, compared to the state-of-the-art code assignment approach [Jung et al. 2010]. Compared with optimal Integer Linear Programming (ILP) results, CMSM_advanced performs an average of 5% worse. We also simulate the benchmarks on a cache-based system and find that the code management overhead on an SMM core with our code management is much lower than the memory latency of a cache-based system.


Compilers, Architecture, and Synthesis for Embedded Systems | 2014

Construction of GCCFG for inter-procedural optimizations in software managed manycore (SMM) architectures

Bryce Holton; Ke Bai; Aviral Shrivastava; Harini Ramaprasad

Software Managed Manycore (SMM) architectures, in which each core has only a scratchpad memory (instead of caches), are a promising solution for scaling the memory hierarchy to hundreds of cores. However, in these architectures, the code and data of the tasks mapped to the cores must be explicitly managed in software by the compiler. State-of-the-art compiler techniques for SMM architectures require inter-procedural information and analysis. A call graph of the program does not have enough information, while a global CFG, i.e., the combination of all the control flow graphs of the program, has too much information and becomes too big. As a result, most new techniques have informally defined and used the GCCFG (Global Call Control Flow Graph), a whole-program representation that captures control flow as well as function call information in a succinct way, to perform inter-procedural analysis. However, how to construct it has not been shown. We find that for several simple call and control flow graphs, constructing the GCCFG is relatively straightforward, but there are several cases in common applications where unique graph transformations are needed to formally and correctly construct the GCCFG. This paper fills this gap and develops graph transformations that allow the construction of the GCCFG in (almost) all cases. Our experiments show that by using the succinct representation (GCCFG) rather than the elaborate representation (global CFG), the compilation time of a state-of-the-art code management technique [4] can be improved by an average of 5X, and that of stack management [20] by an average of 4X.
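
The representation the paper formalizes can be pictured with a tiny, hypothetical data structure: one node per function, plus only the loop and condition nodes that affect call order, with child order encoding execution order. Everything below (node kinds, layout) is invented for illustration.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical miniature of a GCCFG: unlike a call graph, it keeps the loop
// and branch structure that orders calls; unlike a global CFG, it drops all
// basic blocks that contain no calls or control structure.
struct Node {
    enum Kind { FUNC, LOOP, COND, CALL } kind;
    const char* name;
    std::vector<int> children;        // ordered: child order = execution order
};

void dump(const std::vector<Node>& g, int n, int depth) {
    static const char* k[] = {"FUNC", "LOOP", "COND", "CALL"};
    printf("%*s%s %s\n", depth * 2, "", k[g[n].kind], g[n].name);
    for (int c : g[n].children) dump(g, c, depth + 1);
}

int main() {
    // main() { for(...) { f(); if(...) g(); } }  -- as a GCCFG:
    std::vector<Node> g = {
        {Node::FUNC, "main", {1}},
        {Node::LOOP, "L1",   {2, 3}},
        {Node::CALL, "f",    {}},
        {Node::COND, "if1",  {4}},
        {Node::CALL, "g",    {}},
    };
    dump(g, 0, 0);
}
```

A plain call graph would keep only the FUNC and CALL nodes and lose the fact that f is called inside a loop and g only conditionally, while a global CFG would keep every basic block; the GCCFG sits between the two.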

Collaboration


Ke Bai's frequent co-authors and their affiliations.

Top Co-Authors

Jing Lu (Arizona State University)
Bryce Holton (Arizona State University)
Di Lu (Arizona State University)
Harini Ramaprasad (Southern Illinois University Carbondale)