Maolin Guan
National University of Defense Technology
Publications
Featured research published by Maolin Guan.
international symposium on microarchitecture | 2008
Mei Wen; Nan Wu; Chunyuan Zhang; Qianming Yang; Jun Ren; Yi He; Wei Wu; Jun Chai; Maolin Guan; Changqing Xun
With the extension of application domains, hardware-managed memory structures such as caches are drawing attention for dealing with irregular stream applications. However, since a real application usually has both regular and irregular stream characteristics, conventional stream register files, caches, or combinations thereof have shortcomings. This article focuses on combining software- and hardware-managed memory structures and presents a new syncretic memory system based on the FT64 stream accelerator.
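The abstract gives no code; as a rough, hypothetical illustration of pairing software- and hardware-managed storage, the Python sketch below routes a stream to a software-managed stream register file when its access pattern is regular and falls back to a cache otherwise. All names and the regularity test are assumptions, not details from the paper.

```python
# Hypothetical sketch: routing stream accesses between a software-managed
# stream register file (SRF) and a hardware-managed cache, in the spirit of
# a syncretic memory system. Names and the regularity test are illustrative.

def is_regular(access_pattern):
    """Treat sequential or constant-stride index sequences as regular."""
    if len(access_pattern) < 2:
        return True
    stride = access_pattern[1] - access_pattern[0]
    return all(b - a == stride for a, b in zip(access_pattern, access_pattern[1:]))

def route_stream(name, access_pattern):
    """Regular streams go to the SRF (bulk, DMA-style transfers);
    irregular streams fall back to the cache hierarchy."""
    if is_regular(access_pattern):
        return (name, "SRF")
    return (name, "cache")

if __name__ == "__main__":
    print(route_stream("pixels", [0, 4, 8, 12, 16]))   # regular  -> SRF
    print(route_stream("edges",  [3, 17, 2, 40, 11]))  # irregular -> cache
```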
asia and south pacific design automation conference | 2008
Mei Wen; Nan Wu; Maolin Guan; Chunyuan Zhang
In this paper we describe load scheduling, a novel method that balances load among register files according to residual resources. Load scheduling reduces register pressure for clustered VLIW processors with distributed register files without increasing the VLIW schedule length. We have implemented load scheduling in the compilers for the Imagine and FT64 stream processors. The results show that the proposed technique effectively reduces the number of variables spilled to memory and can even eliminate spilling entirely. The algorithm presented in this paper is particularly effective for embedded processors with limited register resources because it improves register utilization rather than increasing the number of registers required.
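As a hedged sketch of the central idea (balancing register pressure across distributed register files by residual capacity), the toy Python pass below greedily places values in the cluster with the most free registers and reports what would spill. The data structures and the greedy policy are illustrative assumptions; the actual compiler pass operates on VLIW schedules for Imagine and FT64.

```python
# Hypothetical sketch of load scheduling's core idea: place each value in the
# cluster register file with the most residual capacity so pressure stays
# balanced and spills are avoided. Illustrative only, not the paper's pass.

def balance_by_residual(values, cluster_capacity):
    """values: dict value -> register demand; cluster_capacity: free slots per cluster."""
    residual = list(cluster_capacity)
    placement, spilled = {}, []
    # Place the most demanding values first.
    for v, demand in sorted(values.items(), key=lambda kv: -kv[1]):
        best = max(range(len(residual)), key=lambda c: residual[c])
        if residual[best] >= demand:
            placement[v] = best
            residual[best] -= demand
        else:
            spilled.append(v)          # would be spilled to memory
    return placement, spilled

if __name__ == "__main__":
    demands = {"a": 4, "b": 3, "c": 3, "d": 2}
    print(balance_by_residual(demands, cluster_capacity=[6, 6]))
```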
high performance embedded architectures and compilers | 2011
Nan Wu; Qianming Yang; Mei Wen; Yi He; Ju Ren; Maolin Guan; Chunyuan Zhang
Conventional stream architectures focus on exploiting ILP and DLP in applications, although the stream model also exposes abundant TLP at kernel granularity. At the same time, with the development of modern VLSI technology, increasing application demands and scalability requirements challenge conventional stream architectures. In this paper, we present a novel Tiled Multi-Core Stream Architecture called TiSA. TiSA introduces the tile, consisting of multiple stream cores, as a new category of architectural resource, and provides an on-chip network to support stream transfers among tiles. In TiSA, multiple levels of parallelism are exploited at different granularities of processing elements. Besides the hardware modules, this paper also discusses other key issues of the TiSA architecture, including the programming model, various execution patterns, and resource allocation. We then evaluate the hardware scalability of TiSA by scaling it to 10s~1000s of ALUs and estimating its area and delay cost. We also evaluate the software scalability of TiSA by simulating 6 stream applications and comparing sustained performance with other stream processors, general-purpose processors, and different configurations of TiSA. A 256-ALU TiSA with 4 tiles and 4 stream cores per tile is shown to be feasible with 45-nanometer technology, sustaining 100~350 GFLOP/s on most stream benchmarks and providing ~10x speedup over a 16-ALU TiSA with a 5% degradation in area per ALU. The results show that TiSA is a VLSI- and performance-efficient architecture for the billion-transistor era.
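As a hedged illustration of the tile/stream-core hierarchy and kernel-granularity TLP described above, the Python sketch below models a TiSA-like chip and round-robins kernels over (tile, core) slots. The sizes and the dispatch policy are assumptions for illustration, not the paper's design.

```python
# Hypothetical model of a tiled stream chip (tiles of stream cores, each core
# holding some ALUs) plus a simple round-robin kernel dispatch, to show how
# TLP at kernel granularity maps onto tiles. Sizes and policy are illustrative.

from dataclasses import dataclass
from itertools import cycle

@dataclass
class StreamCore:
    alus: int

@dataclass
class Tile:
    cores: list

@dataclass
class TiledChip:
    tiles: list
    def total_alus(self):
        return sum(c.alus for t in self.tiles for c in t.cores)

def dispatch(kernels, chip):
    """Assign each kernel to a (tile, core) pair in round-robin order."""
    slots = cycle((ti, ci) for ti, t in enumerate(chip.tiles)
                           for ci, _ in enumerate(t.cores))
    return {k: next(slots) for k in kernels}

if __name__ == "__main__":
    chip = TiledChip([Tile([StreamCore(alus=16) for _ in range(4)])
                      for _ in range(4)])
    print(chip.total_alus())                      # 256 ALUs: 4 tiles x 4 cores x 16
    print(dispatch(["fft", "conv", "dct"], chip))
```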
international conference on computer science and service system | 2012
Qianming Yang; Nan Wu; Maolin Guan; Chunyuan Zhang; Jun Cai
With the evolution of more sophisticated communication standards and algorithms, embedded applications exhibit demanding performance and efficiency requirements. Massively parallel computing based on many simple cores and a few powerful cores is becoming the mainstream method of building high-performance, low-power processors. Aimed at the design of the simple cores, this paper proposes an energy-efficient processor architecture named Smart Core. Following the idea of explicitly parallel and accurate computing, Smart Core uses an exposed, shallow pipeline to eliminate pipeline registers and reduce the cost of executing instructions. A multi-level data memory organization, consisting of a streaming memory, a multi-mode register file, and fully distributed tiny operand register files, captures various forms of data reuse and locality to reduce the cost of delivering data. To reduce the cost of delivering instructions, an asymmetric, fully distributed instruction register file is used to capture the locality and reuse of instructions in a loop. Preliminary results show that Smart Core achieves an energy efficiency 25x greater than that of a traditional embedded RISC processor. When scaled to a 40 nm CMOS technology, a single-chip multiprocessor consisting of many cores like Smart Core is capable of providing more than 1 TOPS of performance while achieving an efficiency of 100 GOPS/W or more.
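The sketch below is a purely illustrative model of the multi-level data memory organization named in the abstract, with made-up relative access costs, to show why capturing reuse in the nearer levels reduces the cost of delivering data. The capacities and costs are assumptions, not figures from the paper.

```python
# Hypothetical model of a Smart Core-like storage hierarchy with illustrative
# relative access costs (not figures from the paper). Accesses served by the
# nearer, cheaper levels dominate when operand reuse is captured locally.

from dataclasses import dataclass

@dataclass
class Level:
    name: str
    capacity_words: int
    relative_cost: float   # illustrative energy per access; lower = cheaper

HIERARCHY = [
    Level("tiny operand register file", 8, 1.0),
    Level("multi-mode register file", 256, 3.0),
    Level("streaming memory", 64 * 1024, 10.0),
]

def delivery_cost(accesses):
    """accesses: dict level-name -> access count; sum of illustrative costs."""
    cost_of = {lvl.name: lvl.relative_cost for lvl in HIERARCHY}
    return sum(cost_of[name] * n for name, n in accesses.items())

if __name__ == "__main__":
    # Most accesses stay in the cheapest level when reuse is captured nearby.
    print(delivery_cost({"tiny operand register file": 900,
                         "multi-mode register file": 90,
                         "streaming memory": 10}))
```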
ieee international conference on high performance computing data and analytics | 2012
Yi He; Maolin Guan; Chunyuan Zhang; Tian Tian; Qianming Yang
Huge code size and poor code density have always been serious problems in VLIW processors. To deal with this problem and its influence on the instruction memory of stream architectures, this paper proposes a novel method called field-divided VLIW compression, which analyzes the code characteristics of stream programs across a wide range of typical stream application domains and divides mutually unrelated parts of the instruction code into different subfields. Based on field-divided VLIW compression, this paper designs a fully distributed on-chip instruction memory (FDIM) for stream architectures. Experiments on the MASA stream processor demonstrate that field-divided VLIW compression can reduce off-chip instruction code by about 38% and on-chip instruction memory space demand by about 66% with little influence on program performance; FDIM reduces the area of the on-chip instruction memory by about 37%, which in turn reduces the area of the MASA stream processor by about 8.92%. In addition, the energy consumption of the instruction memory is decreased by about 61%.
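As a hedged sketch of the field-divided idea, the Python code below splits each VLIW word into independent subfields, builds a small dictionary per subfield, and stores dictionary indices in place of the raw fields. The field widths, encoding, and example words are assumptions; the paper's actual compression format may differ.

```python
# Hypothetical sketch of field-divided compression: split VLIW words into
# independent subfields, keep one small dictionary per subfield, and encode
# each word as a tuple of dictionary indices. Widths and words are toy values.

def split_fields(word, widths):
    """Split an integer VLIW word into subfields of the given bit widths (LSB first)."""
    fields = []
    for w in widths:
        fields.append(word & ((1 << w) - 1))
        word >>= w
    return tuple(fields)

def compress(words, widths):
    """Return per-subfield dictionaries plus the index stream for each word."""
    tables = [{} for _ in widths]
    encoded = []
    for word in words:
        idxs = []
        for i, f in enumerate(split_fields(word, widths)):
            idxs.append(tables[i].setdefault(f, len(tables[i])))
        encoded.append(tuple(idxs))
    return tables, encoded

if __name__ == "__main__":
    program = [0x1A2B3C, 0x1A2B3D, 0x1F2B3C, 0x1A2B3C]   # toy 24-bit VLIW words
    tables, encoded = compress(program, widths=[8, 8, 8])
    print([len(t) for t in tables], encoded)
```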
international conference on computer sciences and convergence information technology | 2010
Maolin Guan; Nan Wu; Mei Wen; Chunyuan Zhang
Based on the characteristics of data-level-parallel (DLP) multi-threaded programs found in practical applications, this paper proposes a new method that integrates identical DLP threads in software, via compilation, for VLIW processors. The method translates DLP into ILP by merging the operations of corresponding basic blocks from different threads into a single basic block, extending the instruction window that the compiler can schedule, and then optimizes the control flow of the program after thread integration to ensure correctness. Experimental results show that this technique accelerates program execution well without placing any extra burden on the programmer, while the hardware overhead is negligible. In general, integrating 2∼4 threads yields a speedup of 1.34∼2.07.
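A minimal sketch of the thread-integration idea, assuming a toy three-address instruction format: copies of a basic block from identical threads are merged into one block, with registers renamed per thread so the merged operations stay independent and expose more ILP to the VLIW scheduler. The instruction format and renaming scheme are illustrative only.

```python
# Hypothetical sketch of DLP thread integration: interleave renamed copies of
# a basic block, one copy per thread, so the merged block carries independent
# operations for the VLIW scheduler. Format and renaming are illustrative.

def rename(op, tid):
    """Suffix every register operand with the thread id."""
    opcode, *regs = op
    return (opcode, *[f"{r}.t{tid}" for r in regs])

def integrate_threads(basic_block, num_threads):
    """Interleave renamed copies of the block, one copy per thread."""
    merged = []
    for op in basic_block:
        for tid in range(num_threads):
            merged.append(rename(op, tid))
    return merged

if __name__ == "__main__":
    block = [("load",  "r1", "r0"),        # r1 <- mem[r0]
             ("add",   "r2", "r1", "r1"),
             ("store", "r0", "r2")]
    for op in integrate_threads(block, num_threads=2):
        print(op)
```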
Archive | 2011
Yi He; Chunyuan Zhang; Mei Wen; Nan Wu; Qianming Yang; Ju Ren; Maolin Guan; Changqing Xun; Wei Wu; Chai Jun; Jingxu Li
Archive | 2008
Qianming Yang; Nan Wu; Mei Wen; Changqing Xun; Ju Ren; Yi He; Wei Wu; Chai Jun; Maolin Guan; Chunyuan Zhang; Jingxu Li
Archive | 2008
Maolin Guan; Mei Wen; Nan Wu; Chunyuan Zhang; Ju Ren; Yi He; Changqing Xun; Qianming Yang; Wei Wu
International Journal of Advancements in Computing Technology | 2013
Maolin Guan; Chunyuan Zhang; Nan Wu