Junlin Lu
Peking University
Publications
Featured research published by Junlin Lu.
International Conference on Parallel Architectures and Compilation Techniques | 2012
Lingda Li; Dong Tong; Zichao Xie; Junlin Lu; Xu Cheng
In the last-level cache, large numbers of blocks have reuse distances greater than the available cache capacity. Cache performance and efficiency can be improved if some subset of these distant-reuse blocks can reside in the cache longer. The bypass technique is an effective and attractive solution: it prevents harmful blocks from being inserted into the cache in the first place.
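As an illustration of the bypass idea, here is a minimal sketch of a PC-indexed bypass predictor, assuming a table of saturating counters trained on whether evicted blocks were ever reused while cached; all names and thresholds are hypothetical, not the paper's mechanism.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical PC-indexed bypass predictor: blocks brought in by a PC whose
// past blocks showed distant reuse are inserted bypassing the LLC.
struct BypassPredictor {
    std::unordered_map<uint64_t, int> distant_reuse_ctr;  // 2-bit counters

    bool should_bypass(uint64_t pc) const {
        auto it = distant_reuse_ctr.find(pc);
        return it != distant_reuse_ctr.end() && it->second >= 2;
    }

    // Train on eviction: "reused" means the block was hit while cached,
    // i.e., its reuse distance fit within the cache capacity.
    void train(uint64_t pc, bool reused) {
        int &c = distant_reuse_ctr[pc];
        if (reused) { if (c > 0) --c; }   // reuse fit: keep inserting
        else        { if (c < 3) ++c; }   // distant reuse: favor bypass
    }
};
```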
Field Programmable Gate Arrays | 2010
Kan Huang; Junlin Lu; Jiufeng Pang; Hao Li; Dong Tong; Xu Cheng
For the growing market of smartphones, mobile internet devices, and ultra-mobile PCs, mainstream vendors propose two approaches: one based on ARM SoCs, the other on power-efficient x86 processors. However, each approach has its own limitation: the ARM-based approach lacks application software, while the x86-based approach does not support flexible SoC extension. To overcome these limitations, we propose the PKUnity86 SoC architecture, which is based on the AMBA bus architecture to support fast IP integration. Furthermore, it contains a reduced AMD Geode GX2 processor and several specific designs to support Microsoft Windows and exploit the massive PC software ecosystem. This paper presents two FPGA prototypes of PKUnity86: P86-Core and P86-Min. For P86-Core, which verifies the core of PKUnity86, we modify the RTL code of the reduced Geode GX2 to make it FPGA-synthesizable and implement it on a Xilinx Virtex-4 LX200 FPGA device. We connect the FPGA board to a Geode SP4GX22 motherboard so that we can perform full-system emulation. For P86-Min, which verifies the minimum set of PKUnity86, we implement the RTL code on two Xilinx Virtex-4 LX200 FPGA devices and emulate the full system on a single FPGA board. In addition, we adopt a hardware-software co-development methodology and employ various debug tools to facilitate building P86-Min. Both prototypes reach their respective compatibility goals: P86-Core supports Windows XP and earlier versions, and P86-Min supports Windows 98 and earlier versions. The evaluation results show that PKUnity86 achieves Windows compatibility with small hardware overhead and no performance loss.
Journal of Computer Science and Technology | 2010
Xu Cheng; Xiaoyin Wang; Junlin Lu; Jiangfang Yi; Dong Tong; Xuetao Guan; Feng Liu; Xian-Hua Liu; Chun Yang; Yi Feng
CPU and System-on-Chip (SoC) are two key technologies of the IT industry. Over the course of ten years of research, we have defined the UniCore instruction set architecture and designed the UniCore CPU and the PKUnity SoC family. This cross-disciplinary practice has also fostered many innovations in microprocessor architecture, optimizing compilers, low-power design, functional verification, physical design, and so on. In the meantime, we have put technology transfer on the list of our top priorities. This effort has led to several marketable products, such as ultra-mobile personal computers, secure micro-workstations, and 3C-converged consumer electronics. The development of the next-generation products, the 64-bit multi-core CPU and SoC, is also underway. They will find applications in secure and adaptable computers for mobile and desktop use, as well as in personal digital multimedia devices. Consistent with this philosophy and our long-term plan, and by leveraging cutting-edge process technology, we will continue to innovate in CPUs and SoCs and strengthen our commitment to technology transfer.
International Conference on Supercomputing | 2014
Lingda Li; Junlin Lu; Xu Cheng
Last-level cache performance has proved crucial to overall system performance. Essentially, any cache management policy improves performance by preferentially retaining the blocks it believes to have higher value. Most cache management policies use the access time or reuse distance of a block as its value in order to minimize the total miss count. However, cache miss penalty is variable in modern systems due to i) variable memory access latency and ii) the disparity in latency-toleration ability across different misses. Some recently proposed policies therefore take miss penalty into account as the block value. However, considering only the miss penalty is not enough. In fact, the value of a block includes not only the penalty of its misses, but also the reduction of processor stall cycles on its hits, i.e., the hit benefit. Therefore, we propose a method to compute both miss penalty and hit benefit. The value of a block is then calculated by accumulating the miss penalties and hit benefits of all its requests. Using our notion of block value, we propose the Value based Insertion Policy (VIP), which aims to reserve more high-value blocks in the cache. VIP keeps track of a small number of incoming and victim block pairs to learn the relationship between the value of the incoming block and that of the victim. On a miss, if the value of the incoming block has been learned to be lower than that of the victim block, VIP predicts that the incoming block is valueless and inserts it with a high eviction priority. The evaluation shows that VIP improves cache performance significantly in both single-core and multi-core environments while requiring low storage overhead.
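A minimal sketch of the insertion decision described above, assuming a per-PC confidence counter trained on sampled incoming/victim value comparisons; the value computation itself (miss penalty plus hit benefit per request) is abstracted into the arguments, and all names are illustrative rather than the paper's implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>

// Illustrative VIP-style learner: sampled (incoming, victim) pairs teach
// whether blocks from a given PC tend to be worth less than their victims.
struct VIPSketch {
    std::unordered_map<uint64_t, int> low_value_ctr;  // per-PC confidence

    // Called on sampled misses with the accumulated values (miss penalty
    // plus hit benefits) of the incoming block and the would-be victim.
    void observe(uint64_t pc, double incoming_value, double victim_value) {
        int &c = low_value_ctr[pc];
        if (incoming_value < victim_value) c = std::min(c + 1, 3);
        else                               c = std::max(c - 1, 0);
    }

    // Insertion decision: a predicted-valueless block is inserted with a
    // high eviction priority instead of being protected.
    bool insert_with_high_eviction_priority(uint64_t pc) const {
        auto it = low_value_ctr.find(pc);
        return it != low_value_ctr.end() && it->second >= 2;
    }
};
```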
International Conference on Computer Design | 2012
Lingda Li; Dong Tong; Zichao Xie; Junlin Lu; Xu Cheng
Inclusive cache hierarchies are widely adopted in modern processors because they simplify the implementation of cache coherence. However, guaranteeing inclusion sacrifices some performance. Many recent intelligent management policies have been proposed to improve last-level cache (LLC) performance by evicting blocks with poor locality earlier, but unfortunately they are inapplicable to inclusive LLCs. In this paper, we propose the Two-level Eviction Priority (TEP) policy. Besides the eviction priority provided by the baseline replacement policy, TEP appends an additional high level of eviction priority to LLC blocks, which is decided at insertion time and cannot be changed during their lifetime in the LLC. When blocks with high eviction priority are no longer present in the inner caches, they are evicted from the LLC preferentially. Thus, the LLC can retain more useful blocks to improve performance. TEP cooperates well with various baseline replacement policies. Our evaluation shows that TEP with NRU improves the performance of inclusive LLCs significantly while requiring negligible extra storage. It also outperforms other recent proposals, including QBS, DIP, and DRRIP.
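The victim-selection order can be sketched as follows, here on top of an NRU baseline; the per-block fields and the way inner-cache presence is tracked are simplifications of what the paper's hardware would maintain.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct LLCBlock {
    uint64_t tag = 0;
    bool nru_bit = true;         // baseline NRU state (true = not recently used)
    bool high_priority = false;  // TEP bit, fixed when the block is inserted
    bool in_inner_cache = false; // approximate presence in L1/L2
};

// Prefer TEP-marked blocks that have already left the inner caches, so that
// inclusion victims do not hurt the inner levels; otherwise fall back to NRU.
size_t pick_victim(const std::vector<LLCBlock> &set) {
    for (size_t i = 0; i < set.size(); ++i)
        if (set[i].high_priority && !set[i].in_inner_cache)
            return i;
    for (size_t i = 0; i < set.size(); ++i)  // baseline NRU choice
        if (set[i].nru_bit)
            return i;
    return 0;  // real NRU would reset the bits and retry; elided here
}
```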
Design, Automation, and Test in Europe | 2012
Xianglei Dang; Xiaoyin Wang; Dong Tong; Junlin Lu; Jiangfang Yi; Keyi Wang
Energy efficiency is becoming a major constraint in processor design. Every component of the processor should be reconsidered to reduce wasted energy and area. Prefetching is an important technique for tolerating memory latency, and prefetcher design has an important impact on the energy efficiency of the memory hierarchy. Stride prefetchers require little storage but cannot handle irregular access patterns. Delta correlation (DC) prefetchers can handle complicated access patterns, but they waste storage by recording multiple miss addresses for a single stride pattern. Moreover, DC prefetchers waste the bandwidth and energy of the memory hierarchy because they cannot identify whether an address has already been prefetched, and thus generate a large number of redundant prefetches. In this paper, we propose a storage- and energy-efficient data prefetcher called stride/DC (S/DC), which combines the advantages of stride and DC prefetchers. S/DC uses a pattern prediction table (PPT) that stores two recent miss addresses in each entry to capture stride patterns. The PPT avoids recording multiple miss addresses for a stride pattern, and thus improves storage efficiency. When handling stride patterns, each PPT entry maintains a counter for recovering the last prefetched address, which avoids generating redundant prefetches. When handling other patterns, S/DC compares each newly predicted address with earlier generated addresses in the prefetch queue and filters out the redundant ones. In addition, to expand the filtering scope, S/DC uses a prefetch filter to store addresses evicted from the prefetch queue. In this way, S/DC reduces the bandwidth requirements and energy consumption of prefetching. Experimental results demonstrate that S/DC achieves performance comparable to the CZone/DC prefetcher with only 24% of the storage, while reducing L2 cache energy by 11.46%.
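A sketch of the stride-handling side, assuming a PC-indexed PPT with two addresses and a counter per entry, plus a small queue-based redundancy filter; the prefetch degree is invented for illustration, and the delta-correlation path for non-stride patterns is omitted.

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>

struct PPTEntry {
    uint64_t last_miss = 0, prev_miss = 0;  // only two addresses per entry
    int ahead = 0;  // lines already prefetched beyond the current miss
};

struct SDCSketch {
    std::unordered_map<uint64_t, PPTEntry> ppt;  // indexed by load PC
    std::deque<uint64_t> prefetch_queue;         // recently issued addresses

    void on_miss(uint64_t pc, uint64_t addr) {
        PPTEntry &e = ppt[pc];
        int64_t stride = int64_t(addr - e.last_miss);
        int64_t prev_stride = int64_t(e.last_miss - e.prev_miss);
        if (e.prev_miss != 0 && stride != 0 && stride == prev_stride) {
            if (e.ahead > 0) --e.ahead;  // the stream advanced one stride
            const int kDegree = 2;       // hypothetical prefetch degree
            while (e.ahead < kDegree) {  // only issue past what was covered
                ++e.ahead;
                issue(addr + uint64_t(stride * e.ahead));
            }
        } else {
            e.ahead = 0;                 // pattern broke; start over
        }
        e.prev_miss = e.last_miss;
        e.last_miss = addr;
    }

    void issue(uint64_t addr) {
        for (uint64_t a : prefetch_queue)  // redundant-prefetch filter
            if (a == addr) return;
        prefetch_queue.push_back(addr);
        if (prefetch_queue.size() > 32) prefetch_queue.pop_front();
        // (the paper also keeps a filter of addresses evicted from this
        // queue; handing addr to the memory hierarchy is elided)
    }
};
```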
Application-Specific Systems, Architectures, and Processors | 2017
Yangguo Liu; Junlin Lu; Dong Tong; Xu Cheng
Memory interference is a critical impediment to system performance in CMP systems. To address this problem, we first propose a Dynamically Proportional Bandwidth Throttling (DPBT) policy, which dynamically throttles back memory-intensive applications based on their memory access behavior. DPBT achieves a more balanced memory bandwidth partitioning. Moreover, we improve a previous memory channel partitioning scheme by integrating it with bank partitioning. We further integrate DPBT with the improved memory channel partitioning scheme and a memory scheduling policy to leverage the advantages of the architecture, and present a Stage Memory Resource Management Method (SRM). Experimental results show that DPBT improves system throughput by 13.5% and fairness by 31.1%, while SRM provides 27.1% better system throughput and 34.8% better system fairness.
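The abstract gives few mechanism details, so the following is only a loose sketch of what dynamically proportional throttling could look like: per-interval memory intensity is measured, and the request quota of applications exceeding their fair share is scaled down. Every name and formula here is an assumption.

```cpp
#include <vector>

struct AppStats {
    double misses_per_kinstr = 0;  // measured memory intensity, per interval
    double request_quota = 1.0;    // fraction of requests allowed through
};

// At the end of each interval, throttle memory-intensive applications back
// in proportion to how far they exceed an equal share of the bandwidth.
void rebalance(std::vector<AppStats> &apps) {
    double total = 0;
    for (const AppStats &a : apps) total += a.misses_per_kinstr;
    if (total == 0 || apps.empty()) return;
    double fair_share = 1.0 / double(apps.size());
    for (AppStats &a : apps) {
        double share = a.misses_per_kinstr / total;
        a.request_quota = (share > fair_share) ? fair_share / share : 1.0;
    }
}
```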
Journal of Computer Science and Technology | 2014
Lingda Li; Junlin Lu; Xu Cheng
The performance loss resulting from different cache misses varies in modern systems for two reasons: 1) memory access latency is not uniform, and 2) the latency-toleration ability of processor cores varies across different misses. Compared with parallel misses and store misses, isolated fetch and load misses are more costly. This variation in cache miss penalty suggests that the cache replacement policy should take it into account. To that end, we first propose the notion of retention benefit. Retention benefits can evaluate not only the increase in processor stall cycles on cache misses, but also the reduction in processor stall cycles due to cache hits. We then propose Retention Benefit Based Replacement (RBR), which aims to maximize the aggregate retention benefit of the blocks reserved in the cache. RBR keeps track of the total retention benefit of each block in the cache and, on replacement, preferentially evicts the block with the minimum total retention benefit. The evaluation shows that RBR improves cache performance significantly in both single-core and multi-core environments while requiring low storage overhead. It also outperforms other state-of-the-art techniques.
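A minimal sketch of the replacement decision, with placeholder benefit values reflecting the observation that isolated fetch/load misses are the most costly; the paper derives benefits from measured stall cycles, not from constants like these.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class AccessKind { IsolatedLoad, ParallelLoad, Store };

// Placeholder benefit per hit: a hit that would otherwise have been an
// isolated load miss saves the most stall cycles (values are illustrative).
double hit_benefit(AccessKind k) {
    switch (k) {
        case AccessKind::IsolatedLoad: return 200.0;  // ~full miss latency
        case AccessKind::ParallelLoad: return 60.0;   // overlapped with others
        case AccessKind::Store:        return 10.0;   // mostly hidden
    }
    return 0.0;
}

struct RBRBlock {
    uint64_t tag = 0;
    double benefit = 0;  // accumulated retention benefit of this block
};

void on_hit(RBRBlock &b, AccessKind k) { b.benefit += hit_benefit(k); }

// On replacement, evict the resident block whose retention paid off least.
size_t pick_victim(const std::vector<RBRBlock> &set) {
    size_t victim = 0;
    for (size_t i = 1; i < set.size(); ++i)
        if (set[i].benefit < set[victim].benefit) victim = i;
    return victim;
}
```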
Journal of Computer Science and Technology | 2012
Zhenhao Zhang; Xiaoyin Wang; Dong Tong; Jiangfang Yi; Junlin Lu; Keyi Wang
Conventional dynamically scheduled processors often use a fully associative structure named the load/store queue (LSQ) to implement value communication between loads and older in-flight stores and to detect store-load order violations. But this in-flight forwarding accounts for only about 15% of all store-load communication, which makes the CAM-based micro-architecture the major bottleneck to scaling store-load communication further. This paper presents a new micro-architecture named ASW (short for active store window). It provides a new structure, the speculative active store window, to implement more aggressive speculative store-load forwarding than the conventional LSQ. This structure can forward the data of committed stores to executing loads without accessing the L1 data cache, which we refer to as far forwarding. At the back end of the pipeline, ASW uses in-order load re-execution, filtered by a tagged SSBF (short for store sequence bloom filter), to verify the correctness of the store-load forwarding. The speculative active store window and the tagged store sequence bloom filter are both set-associative structures, which are more efficient and scalable than fully associative ones. Experiments show that this simpler and faster design outperforms a conventional load/store queue based design and the NoSQ design on most benchmarks, by 10.22% and 8.71% respectively.
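The filtering idea behind the tagged SSBF can be sketched as follows: each entry remembers the sequence number of the youngest store hashing to it, and a load re-executes only if a sufficiently young store may alias its address. The hash function and table size are assumptions for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Sketch of a tagged store sequence Bloom filter (SSBF): loads are filtered
// out of re-execution when no young-enough store can alias their address.
struct SSBF {
    std::array<uint64_t, 1024> last_store_seq{};  // youngest store per entry

    static size_t index(uint64_t addr) {
        return (addr >> 3) & 1023;  // simple hash, 8-byte granularity
    }

    void on_store_commit(uint64_t addr, uint64_t seq) {
        last_store_seq[index(addr)] = seq;
    }

    // A load that forwarded from (or executed after) store `forwarded_seq`
    // must re-execute only if a younger store may have written its address.
    bool must_reexecute(uint64_t addr, uint64_t forwarded_seq) const {
        return last_store_seq[index(addr)] > forwarded_seq;
    }
};
```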
Archive | 2012
Xu Cheng; Zhenxue Zhang; Xiaoyin Wang; Dong Tong; Jiangfang Yi; Junlin Lu; Keyi Wang