Dong Tong
Peking University
Publication
Featured research published by Dong Tong.
International Conference on Parallel Architectures and Compilation Techniques | 2012
Lingda Li; Dong Tong; Zichao Xie; Junlin Lu; Xu Cheng
In the last-level cache, a large number of blocks have reuse distances greater than the available cache capacity. Cache performance and efficiency can be improved if some subset of these distant-reuse blocks can reside in the cache longer. The bypass technique is an effective and attractive solution that prevents the insertion of harmful blocks.
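The effect described above can be illustrated with a minimal sketch. It assumes a fully associative LRU cache and a hand-picked bypass set; the function names and the way the bypassed block is chosen are hypothetical, since the abstract does not describe the paper's actual bypass policy. Reuse distance is taken here as the number of distinct other blocks touched between two accesses to the same block, so an LRU hit requires a distance smaller than the capacity.

```python
from collections import OrderedDict


def reuse_distances(trace):
    """Distinct other blocks touched since the previous access to each block.

    The first access to a block has no reuse distance and yields None.
    """
    last_pos = {}
    distances = []
    for i, block in enumerate(trace):
        if block in last_pos:
            distances.append(len(set(trace[last_pos[block] + 1:i])))
        else:
            distances.append(None)
        last_pos[block] = i
    return distances


def lru_hits(trace, capacity, bypass=frozenset()):
    """Count hits in a fully associative LRU cache.

    Blocks listed in `bypass` are never inserted (hypothetical policy).
    """
    cache = OrderedDict()
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        elif block not in bypass:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return hits


trace = list("ABCABCABC")                   # cyclic pattern, reuse distance 2
print(reuse_distances(trace))
print(lru_hits(trace, capacity=2))          # every access misses
print(lru_hits(trace, capacity=2, bypass={"C"}))
```

With a capacity of two blocks, plain LRU thrashes on this cyclic pattern, while bypassing one block lets the other two stay resident.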
Design, Automation, and Test in Europe | 2007
Yi Feng; Zheng Zhou; Dong Tong; Xu Cheng
Multiple asynchronous clock domains have been increasingly employed in System-on-Chip (SoC) designs for different I/O interfaces. Functional validation is one of the most expensive tasks in the SoC design process. Simulation at the register transfer level (RTL) is still the most widely used method. It is important to quantitatively measure the validation confidence and progress for clock domain crossing (CDC) designs. In this paper, we propose an efficient method for defining CDC coverage, which can be used in RTL simulation of a multi-clock-domain SoC design. First, we develop a CDC fault model to represent the actual effect of metastability. Second, we use a temporal data flow graph (TDFG) to propagate the CDC faults to observable variables. Finally, CDC coverage is defined based on the CDC faults and their observability. Our experiments on a commercial IP demonstrate that this method is useful for finding CDC errors early in the design cycle.
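The three steps can be sketched roughly as follows. This is a simplified illustration, not the paper's method: plain graph reachability stands in for the temporal data flow graph, coverage is assumed to be the fraction of fault sites that can reach an observable variable, and all signal and function names are hypothetical.

```python
from collections import deque


def observable_faults(graph, fault_sites, observables):
    """Return the fault sites from which some observable variable is reachable."""
    covered = set()
    for site in fault_sites:
        seen = {site}
        queue = deque([site])
        while queue:
            node = queue.popleft()
            if node in observables:
                covered.add(site)
                break
            for successor in graph.get(node, ()):
                if successor not in seen:
                    seen.add(successor)
                    queue.append(successor)
    return covered


def cdc_coverage(graph, fault_sites, observables):
    """Assumed metric: fraction of distinct CDC fault sites that are observable."""
    sites = set(fault_sites)
    if not sites:
        return 0.0
    return len(observable_faults(graph, sites, observables)) / len(sites)


# Hypothetical data flow: sync_a feeds an output, sync_b is never observed.
graph = {"sync_a": ["reg1"], "reg1": ["out"], "sync_b": ["reg2"], "reg2": []}
print(cdc_coverage(graph, ["sync_a", "sync_b"], {"out"}))
```

A real flow would also need to know which faults were actually activated during simulation and when; the static graph here ignores timing entirely.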
International Conference on Computer Design | 2011
Zichao Xie; Dong Tong; Mingkai Huang; Xiaoyin Wang; Qinqing Shi; Xu Cheng
Indirect-branch prediction is becoming more important for modern processors as more programs are written in object-oriented languages. Previous hardware-based indirect-branch predictors generally require significant hardware storage or use aggressive algorithms that make the processor front-end more complex. In this paper, we propose a fast and cost-efficient indirect-branch prediction strategy called Target Address Pointer (TAP) Prediction. TAP Prediction reuses the history-based branch direction predictor to detect occurrences of indirect branches, and then stores indirect-branch targets in the Branch Target Buffer (BTB). The key idea of TAP Prediction is to predict the Target Address Pointers, which generate virtual addresses to index the targets stored in the BTB, rather than to predict the indirect-branch targets directly. TAP Prediction also reuses the branch direction predictor to construct several small predictors. When fetching an indirect branch, these small predictors work in parallel to generate the target address pointer. TAP Prediction then accesses the BTB to fetch the predicted indirect-branch target using the generated virtual address. This mechanism could achieve a time cost comparable to that of dedicated-storage predictors, without requiring large amounts of additional storage. Our evaluation shows that for three representative direction predictors (Hybrid, Perceptrons, and O-GEHL), the TAP schemes improve performance by 18.19%, 21.52%, and 20.59%, respectively, over the baseline processor with the most commonly used BTB prediction. Compared with previous hardware-based indirect-branch predictors, the TAP-Perceptrons scheme achieves a performance improvement equivalent to that provided by a 48KB TTC predictor, and it also outperforms the VPC predictor by 14.02%.
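The pointer-then-lookup idea can be sketched as below. Everything concrete here is an assumption made for illustration: the class name, the two-bit pointer width, the way the virtual address is formed from the branch address and the pointer, and the use of dictionaries in place of the reused direction-predictor tables. The abstract does not specify any of these details.

```python
class TapSketch:
    """Toy model: predict a small pointer, then use it to index targets in a BTB."""

    def __init__(self, pointer_bits=2):
        self.pointer_bits = pointer_bits
        self.btb = {}  # virtual address -> stored indirect-branch target
        # One small table per pointer bit, keyed by (pc, history).
        self.bit_tables = [dict() for _ in range(pointer_bits)]

    def _virtual_address(self, pc, pointer):
        # Hypothetical encoding: branch address concatenated with the pointer.
        return (pc << self.pointer_bits) | pointer

    def predict(self, pc, history):
        pointer = 0
        for i, table in enumerate(self.bit_tables):
            pointer |= table.get((pc, history), 0) << i
        return self.btb.get(self._virtual_address(pc, pointer))

    def update(self, pc, history, target):
        pointers = range(1 << self.pointer_bits)
        # Reuse the pointer that already maps to this target, if any.
        chosen = next((p for p in pointers
                       if self.btb.get(self._virtual_address(pc, p)) == target), None)
        if chosen is None:
            # Otherwise take the first free pointer, or overwrite pointer 0.
            chosen = next((p for p in pointers
                           if self._virtual_address(pc, p) not in self.btb), 0)
        self.btb[self._virtual_address(pc, chosen)] = target
        for i, table in enumerate(self.bit_tables):
            table[(pc, history)] = (chosen >> i) & 1


tap = TapSketch()
tap.update(0x40, "TN", 0x1000)
tap.update(0x40, "NT", 0x2000)
print(hex(tap.predict(0x40, "TN")), hex(tap.predict(0x40, "NT")))
```

The point of the sketch is that the predictor state only has to encode a few pointer bits per branch context, while the full targets live in the BTB.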
International Conference on Hardware/Software Codesign and System Synthesis | 2010
Hao Li; Dong Tong; Kan Huang; Xu Cheng
Full-system emulation on FPGA (Field-Programmable Gate Array) with real-world workloads can enhance confidence in an SoC (System-on-Chip) design. However, since FPGA emulation requires complete implementation of key modules and provides weak visibility, it is time-consuming. This paper proposes FEMU, a hybrid firmware/hardware emulation framework for SoC verification. The core of FEMU is implemented by transplanting QEMU, a full-system emulator, from the OS level to the BIOS level, so that devices can be emulated directly on top of the hardware. Moreover, FEMU provides programming interfaces to simplify device modeling in firmware. Based on an auxiliary set of hardware modules, FEMU allows hybrid full-system emulation that combines real hardware with emulated firmware models. FEMU therefore facilitates full-system emulation in three ways. First, FEMU enables full-system emulation with the minimum hardware implementation, so the DUT (Design Under Test) module can be verified under the target application as early as possible. Second, by comparing the execution traces generated using the real hardware and the emulated firmware model, FEMU helps locate and fix bugs that occur in full-system emulation. Third, by replacing unverified hardware modules with emulated firmware models, FEMU helps isolate design issues in multiple modules. In a practical SoC project, FEMU helped us identify several design issues in full-system emulation. In addition, the evaluation results show that the emulation speed of FEMU is comparable with that of QEMU after transplantation.
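The trace comparison mentioned as the second benefit can be sketched in a few lines. This is only an illustration: the trace format (a list of hashable events such as register writes) and the function name are assumptions, not part of FEMU.

```python
def first_divergence(hw_trace, fw_trace):
    """Return the index of the first differing event, or None if the traces match.

    If one trace is a strict prefix of the other, the divergence is reported
    at the end of the shorter trace.
    """
    for index, (hw_event, fw_event) in enumerate(zip(hw_trace, fw_trace)):
        if hw_event != fw_event:
            return index
    if len(hw_trace) != len(fw_trace):
        return min(len(hw_trace), len(fw_trace))
    return None


# Hypothetical events: (register name, value written).
hw = [("ctrl", 0x1), ("status", 0x0), ("data", 0xAB)]
fw = [("ctrl", 0x1), ("status", 0x4), ("data", 0xAB)]
print(first_divergence(hw, fw))
```

The first mismatching event is a natural starting point for debugging, because later differences are often just consequences of it.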
International Symposium on Low Power Electronics and Design | 2013
Zichao Xie; Dong Tong; Xu Cheng
Accurate branch prediction can improve processor performance while reducing energy waste. Though some existing branch predictors have proved effective, they usually require a large amount of storage or complicate the processor front-end. This paper proposes a novel branch prediction technique called History Artificially Selected (HAS) prediction. It is a hardware technique that builds on existing branch predictors to detect history noise and avoid noise interference when predicting branches. It separates the original branch predictor into sub-predictors, each of which updates the branch history differently. With the help of history stacks, one sub-predictor saves and restores the branch history at the entry and exit of loops and program subroutines, where history noise usually exists. Using a tournament mechanism, HAS prediction selectively uses the modified branch history to eliminate history noise interference while retaining useful history correlations. Our experimental results show that for three representative branch predictors, gshare, perceptron, and TAGE, it reduces the MPKI by 1.49, 2.85, and 1.10, respectively, resulting in 4.55%, 10.16%, and 4.45% performance improvement. It also reduces energy consumption by 4.02%, 7.78%, and 3.91%, respectively.
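A tournament mechanism of the kind mentioned above is commonly built from a saturating counter that tracks which of two predictors has been more accurate. The sketch below assumes a single two-bit counter choosing between a sub-predictor using the original history and one using the modified history; the class name, counter width, and initial value are illustrative assumptions, not details from the paper.

```python
class TournamentChooser:
    """Two-bit saturating counter that selects between two sub-predictors."""

    def __init__(self):
        # Values 0-1 favor the original-history predictor,
        # values 2-3 favor the modified-history predictor.
        self.counter = 1

    def choose(self, pred_original, pred_modified):
        return pred_modified if self.counter >= 2 else pred_original

    def update(self, pred_original, pred_modified, outcome):
        if pred_original == pred_modified:
            return  # no information about which predictor is better
        if pred_modified == outcome:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


chooser = TournamentChooser()
print(chooser.choose(True, False))   # starts out trusting the original history
chooser.update(True, False, False)   # the modified history was right
print(chooser.choose(True, False))   # now trusts the modified history
```

Real tournament predictors usually keep a table of such counters indexed by branch address or history rather than a single global one.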
Design, Automation, and Test in Europe | 2012
Mingxing Tan; Xianhua Liu; Zichao Xie; Dong Tong; Xu Cheng
Branch prediction is critical in exploring instruction-level parallelism for modern processors. Previous aggressive branch predictors generally require a significant amount of hardware storage and complexity to pursue high prediction accuracy. This paper proposes the Compiler-guided History Stack (CHS), an energy-efficient compiler-microarchitecture cooperative technique for branch prediction. The key idea is to track very-long-distance branch correlation using a low-cost compiler-guided history stack. It relies on the compiler to identify branch correlation based on two program substructures, loops and procedures, and to feed the information to the predictor by inserting guiding instructions. At runtime, the processor dynamically saves and restores the global history using a low-cost history stack structure according to the compiler-guided information. The modification of the global history enables the predictor to track very-long-distance branch correlation and thus improves the prediction accuracy. We show that CHS can be combined with most existing branch predictors and that it is especially effective with small and simple predictors. Our evaluations show that the CHS technique can reduce the average branch mispredictions by 28.7% over the gshare predictor, resulting in an average performance improvement of 10.4%. Furthermore, it can also improve the aggressive perceptron, O-GEHL, and TAGE predictors.
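The save-and-restore behavior can be sketched as a small bounded stack around a global history register. The class name, the 8-bit history, the stack depth, and the drop-oldest overflow policy are assumptions for illustration; in the described technique the `save` and `restore` calls would be triggered by compiler-inserted guiding instructions.

```python
class HistoryStack:
    """Global branch history register with a small save/restore stack."""

    def __init__(self, history_bits=8, depth=4):
        self.mask = (1 << history_bits) - 1
        self.depth = depth
        self.history = 0
        self.stack = []

    def record_branch(self, taken):
        # Shift the newest outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.mask

    def save(self):
        # Hypothetical overflow policy: drop the oldest saved history.
        if len(self.stack) == self.depth:
            self.stack.pop(0)
        self.stack.append(self.history)

    def restore(self):
        if self.stack:
            self.history = self.stack.pop()


h = HistoryStack()
for taken in (1, 0, 1):
    h.record_branch(taken)
h.save()                        # e.g. on loop entry
for taken in (1, 1, 1, 0):      # loop branches pollute the history
    h.record_branch(taken)
h.restore()                     # e.g. on loop exit
print(bin(h.history))
```

After the restore, a branch following the loop sees the same history it would have seen if the loop had not executed, which is what lets the predictor correlate across the loop.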
Field Programmable Gate Arrays | 2010
Kan Huang; Junlin Lu; Jiufeng Pang; Hao Li; Dong Tong; Xu Cheng
For the growing market of smartphones, mobile internet devices, and ultra-mobile PCs, mainstream vendors propose two approaches: one is based on ARM SoCs, and the other is based on power-efficient x86 processors. However, each approach has its own limitations. The ARM-based approach lacks application software, while the x86-based approach does not support flexible SoC extension. To overcome these limitations, we propose the PKUnity86 SoC architecture, which is based on the AMBA bus architecture to support fast IP integration. Furthermore, it contains a reduced AMD Geode GX2 processor and several specific designs to support Microsoft Windows and exploit the massive PC software resources. This paper presents two FPGA prototypes of PKUnity86: P86-Core and P86-Min. For P86-Core, which verifies the core of PKUnity86, we change the RTL code of the reduced Geode GX2 to make it FPGA-synthesizable and implement it on a Xilinx Virtex-4 LX200 FPGA device. We connect the FPGA board to a Geode SP4GX22 motherboard so that we can perform full-system emulation. For P86-Min, which verifies the minimum set of PKUnity86, we implement the RTL code on two Xilinx Virtex-4 LX200 FPGA devices and emulate the full system on a single FPGA board. In addition, we adopt a hardware-software co-development methodology and employ various debug tools to facilitate building P86-Min. Each prototype reaches its own compatibility goal: P86-Core supports Windows XP and previous versions, and P86-Min supports Windows 98 and previous versions. The evaluation results show that PKUnity86 achieves Windows compatibility with small hardware overheads and no performance loss.
ACM Symposium on Applied Computing | 2010
Shu Liu; Xu Cheng; Xuetao Guan; Dong Tong
Flash memory is widely used because of its shock-resistance and power-efficient features. However, it cannot replace hard disks as secondary storage devices due to its greater cost per unit of storage and low capacity. In this paper, we propose an energy-efficient heterogeneous secondary storage system management scheme for mobile systems. We employ a flash memory device as a file cache for the hard disk and extend existing data cache management algorithms to distribute files between the two devices, taking file-level cache restrictions into account. As a result, most file accesses are served from the flash memory device and the disk is spun down to save energy. We develop a trace-driven simulator to evaluate our scheme in comparison with other alternatives. Results demonstrate that with the help of our scheme, the energy consumption of the secondary storage system can be reduced by up to 90% and I/O access time is improved. Furthermore, the file cache management algorithms can result in high hit ratios.
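A file-level cache of this kind can be sketched as follows. This is a minimal illustration that assumes plain LRU replacement over whole files and a single restriction (a file larger than the flash capacity is never cached); the class name and these policies are assumptions, not the algorithms evaluated in the paper.

```python
from collections import OrderedDict


class FlashFileCache:
    """Whole-file LRU cache in flash, backed by a hard disk."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.files = OrderedDict()  # file name -> size in bytes, in LRU order
        self.flash_hits = 0
        self.disk_accesses = 0

    def access(self, name, size):
        """Return which device serves the access: 'flash' or 'disk'."""
        if name in self.files:
            self.files.move_to_end(name)
            self.flash_hits += 1
            return "flash"
        self.disk_accesses += 1      # the disk must be spun up for this access
        if size <= self.capacity:    # file-level restriction: the file must fit
            while self.used + size > self.capacity:
                _, evicted_size = self.files.popitem(last=False)
                self.used -= evicted_size
            self.files[name] = size
            self.used += size
        return "disk"


cache = FlashFileCache(capacity_bytes=100)
for name in ("a", "b", "a", "c", "b"):
    print(name, cache.access(name, 40))
```

The fewer accesses that fall through to the disk, the longer the disk can remain spun down, which is where the energy saving would come from.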
Journal of Computer Science and Technology | 2010
Xu Cheng; Xiaoyin Wang; Junlin Lu; Jiangfang Yi; Dong Tong; Xuetao Guan; Feng Liu; Xian-Hua Liu; Chun Yang; Yi Feng
CPU and System-on-Chip (SoC) are two key technologies of the IT industry. Over ten years of research, we have defined the UniCore instruction set architecture and designed the UniCore CPU and the PKUnity SoC family. This cross-disciplinary practice has also fostered many innovations in microprocessor architecture, optimizing compilers, low-power design, functional verification, physical design, and so on. In the meantime, we have put technology transfer on the list of our top priorities. This effort has led to several marketable products, such as ultra-mobile personal computers, secure micro-workstations, and 3C-converged consumer electronics. The development of the next-generation products, the 64-bit multi-core CPU and SoC, is also underway. They will find applications in secure and adaptable mobile and desktop computers, as well as in personal digital multimedia devices. Consistent with our philosophy and long-term plan, and by leveraging cutting-edge process technology, we will continue to innovate in CPUs and SoCs and strengthen our commitment to technology transfer.
Journal of Computer Science and Technology | 2007
Yu-Lai Zhao; Xianfeng Li; Dong Tong; Xu Cheng
Mainstream processors implement the instruction scheduler using a monolithic CAM-based issue queue (IQ), which consumes increasingly high energy as its size scales. In particular, its instruction wakeup logic accounts for a major portion of the consumed energy. Our study shows that instructions with 2 non-ready operands (called 2OP instructions) make up a small percentage of instructions but tend to spend long latencies in the IQ. They can be effectively shelved in a small RAM-based waiting instruction buffer (WIB) and steered into the IQ at the appropriate time. With this two-level shelving ability, half of the CAM tag comparators in the IQ are eliminated, which significantly reduces the energy of the wakeup operation. In addition, we propose an adaptive banking scheme to downsize the IQ and reduce the bit-width of the tag comparators. Experiments indicate that for an 8-wide-issue superscalar or SMT processor, the energy consumption of the instruction scheduler can be reduced by 67%. Furthermore, the new design has a potentially faster scheduler clock speed while maintaining an IPC close to that of the monolithic scheduler design. Compared with previous work on eliminating tags through prediction, our design is superior in terms of both energy reduction and SMT support.
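The two-level shelving idea can be sketched as a steering rule plus a promotion step. The sketch assumes instructions are given as a name with a tuple of source registers and that a WIB entry moves to the IQ as soon as fewer than two of its operands are non-ready; the function names and this exact promotion condition are illustrative assumptions.

```python
def count_non_ready(sources, ready_regs):
    """Number of source registers whose values are not yet available."""
    return sum(1 for reg in sources if reg not in ready_regs)


def steer(instructions, ready_regs):
    """Send 2OP instructions to the WIB and everything else to the IQ."""
    iq, wib = [], []
    for name, sources in instructions:
        if count_non_ready(sources, ready_regs) >= 2:
            wib.append((name, sources))
        else:
            iq.append((name, sources))
    return iq, wib


def promote(iq, wib, ready_regs):
    """Move WIB entries into the IQ once at most one operand is non-ready."""
    still_waiting = []
    for entry in wib:
        if count_non_ready(entry[1], ready_regs) < 2:
            iq.append(entry)
        else:
            still_waiting.append(entry)
    return iq, still_waiting


program = [("add", ("r1", "r2")), ("mul", ("r3", "r4")), ("ld", ("r1",))]
iq, wib = steer(program, ready_regs={"r1"})
print([n for n, _ in iq], [n for n, _ in wib])
iq, wib = promote(iq, wib, ready_regs={"r1", "r3"})
print([n for n, _ in iq], [n for n, _ in wib])
```

Because every entry in the IQ then waits on at most one operand, each entry needs only one tag comparator, which is consistent with the halving of comparators described in the abstract.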