Rafael Kioji Vivas Maeda

Hong Kong University of Science and Technology

Publication

Featured research published by Rafael Kioji Vivas Maeda.

High Performance Embedded Architectures and Compilers | 2016

JADE: a Heterogeneous Multiprocessor System Simulation Platform Using Recorded and Statistical Application Models

Rafael Kioji Vivas Maeda; Peng Yang; Xiaowen Wu; Zhe Wang; Jiang Xu; Zhehui Wang; Haoran Li; Luan H. K. Duong; Zhifei Wang

Recent advances in the computing industry towards multiprocessor technologies have shifted the dominant method of performance increase from frequency scaling to parallelism. Because the design space is huge, evaluating candidate multicore architectures in early design stages, when the number of variables is at its maximum, is challenging. Simulation plays an important role in estimating architecture performance, but evaluating how the system would perform on average, as well as in boundary cases, requires many iterations to cover various cases in the application input domain. Because sufficiently detailed simulation of heterogeneous systems is naturally slow, exhaustively evaluating the system for all possible inputs requires a tremendous amount of time and resources. Although quite a few multiprocessor simulators are available, they often rely on individual input specification, demanding extensive input enumeration and simulation runs, which diminishes their effectiveness for evaluating complex systems. To fill this gap, we publicly release a heterogeneous multiprocessor system simulation platform called JADE, targeting fast initial architecture exploration. Unlike most simulators, JADE uses statistical models that follow distributions extracted from internal structures of the application, providing a more convenient and systematic exploration approach to evaluate system performance. JADE simulation features include detailed electrical and optical interconnections, detailed memory hierarchy infrastructure, and built-in energy analysis, allowing studies of a broad spectrum of systems.

IEEE Transactions on Very Large Scale Integration Systems | 2016

Coherent and Incoherent Crosstalk Noise Analyses in Interchip/Intrachip Optical Interconnection Networks

Luan H. K. Duong; Zhehui Wang; Mahdi Nikdast; Jiang Xu; Peng Yang; Zhifei Wang; Zhe Wang; Rafael Kioji Vivas Maeda; Haoran Li; Xuan Wang; Sébastien Le Beux; Yvain Thonnart

Recently, interchip/intrachip optical interconnection networks have been proposed for ultrahigh-bandwidth and low-latency communications. These networks employ microresonators (MRs) to modulate, direct, or detect the optical signal. However, the MRs used suffer from intrinsic crosstalk noise and signal power loss, which degrade network efficiency by lowering the signal-to-noise ratio (SNR). The amount of crosstalk noise and signal power loss may differ from network to network. Hence, there is a need to systematically analyze the effects of crosstalk noise and power loss. In this paper, we develop analytical models that consider both coherent and incoherent crosstalk for both interchip and intrachip optical networks. The interchip/intrachip optical interconnection networks (I2CON) are analyzed as a case study. The quantitative results on the individual networks demonstrate that the architectural design determines the impact of crosstalk on the SNR. We also demonstrate that optical interconnection networks with interchip/intrachip interconnects achieve a better bit error rate (BER) than networks with only intrachip interconnects. Our worst-case analyses can be used as a platform to compare the realistic performance of different optical interconnection networks in terms of SNR/BER degradation and data bandwidth.

IEEE Transactions on Very Large Scale Integration Systems | 2016

A Holistic Modeling and Analysis of Optical–Electrical Interfaces for Inter/Intra-chip Interconnects

Zhehui Wang; Jiang Xu; Peng Yang; Luan H. K. Duong; Zhifei Wang; Xuan Wang; Zhe Wang; Haoran Li; Rafael Kioji Vivas Maeda

With the fast development of inter/intra-chip optical interconnects, the gap between the data rates of electrical interconnects and optical interconnects is continuously increasing. Electrical-optical (E-O) interfaces and optical-electrical (O-E) interfaces are a pair of components that convert data between parallel electrical interconnects and serial optical interconnects. This paper holistically models and analyzes E-O and O-E interfaces in terms of energy consumption, area, and latency. Traditional interfaces, where data are converted between parallel and serial ports by serializers and deserializers (SerDes), are studied. A new type of E-O and O-E interface, which serializes and deserializes data by optical weaving technologies, is also proposed. Traditional interfaces will become a bottleneck for the further development of optical interconnects in the near future because of the high energy consumption and large area of SerDes, which necessitates new technologies. Our analysis shows that optical weaving interfaces have a better overall performance than traditional interfaces. For example, if there are 64 parallel electrical interconnects and four optical wavelengths, optical weaving interfaces can achieve an 81.6% improvement in energy consumption and a 40.8% improvement in area, compared with traditional interfaces.

High-Performance Computer Architecture | 2017

Fast and Accurate Exploration of Multi-level Caches Using Hierarchical Reuse Distance

Rafael Kioji Vivas Maeda; Qiong Cai; Jiang Xu; Zhe Wang; Zhongyuan Tian

Exploring the design space of the memory hierarchy requires the use of effective methodologies, tools, and models to evaluate different parameter values. Reuse distance is one of the locality models used in design exploration and permits analytical cache miss estimation, program characterization, and synthetic trace generation. Unfortunately, the reuse distance is limited to a single locality granularity. Hence, it is not a suitable model for caches with hybrid line sizes, such as sectored caches, an increasingly popular choice for large caches. In this work, we introduce a generalization of the reuse distance, which is able to capture locality seen at multiple granularities. We refer to it as Hierarchical Reuse Distance (HRD). The proposed model has the same profiling and synthesis complexity as the traditional reuse distance, and our results show that HRD reduces the average miss rate error on sectored caches by more than a factor of three. In addition, it has superior characteristics in exploring multi-level caches with a conventional single line size. For instance, our method increases the accuracy on L2 and L3 by a factor of 4 and converges three orders of magnitude faster.
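
As a rough illustration of the traditional reuse distance that this abstract builds on (not of HRD itself, whose definition is not given here), the sketch below computes the LRU stack distance of each access in a trace at one fixed line granularity and uses it to count misses in a fully associative LRU cache. The function names, the `line_size` parameter, and the example trace are assumptions made for illustration only.

```python
import math
from collections import OrderedDict


def reuse_distances(trace, line_size=1):
    """Return the reuse (LRU stack) distance of every access in `trace`.

    The distance is the number of distinct lines touched since the previous
    access to the same line; first-time accesses get math.inf.
    `line_size` fixes the single granularity at which locality is observed.
    """
    stack = OrderedDict()  # keys ordered from least to most recently used
    distances = []
    for addr in trace:
        line = addr // line_size
        if line in stack:
            keys = list(stack.keys())
            # Lines that were used more recently than `line`.
            distances.append(len(keys) - 1 - keys.index(line))
            stack.move_to_end(line)
        else:
            distances.append(math.inf)
            stack[line] = True
    return distances


def lru_misses(distances, capacity):
    """Misses of a fully associative LRU cache holding `capacity` lines."""
    return sum(1 for d in distances if d >= capacity)


if __name__ == "__main__":
    trace = [0, 1, 2, 0, 1, 3, 0]  # hypothetical line addresses
    dists = reuse_distances(trace)
    print(dists)
    print(lru_misses(dists, capacity=3))
```

Because `line_size` is a single fixed value per profile, each run captures locality at only one granularity, which is the limitation the abstract attributes to the traditional model.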

Asia and South Pacific Design Automation Conference | 2017

Modular reinforcement learning for self-adaptive energy efficiency optimization in multicore system

Zhe Wang; Zhongyuan Tian; Jiang Xu; Rafael Kioji Vivas Maeda; Haoran Li; Peng Yang; Zhehui Wang; Luan H. K. Duong; Zhifei Wang; Xuanqi Chen

Energy efficiency is becoming increasingly important to modern computing systems with multi-/many-core architectures. Dynamic Voltage and Frequency Scaling (DVFS), as an effective low-power technique, has been widely applied to improve energy efficiency in commercial multi-core systems. However, due to the large number of cores and the growing complexity of emerging applications, it is difficult to efficiently find a globally optimized voltage/frequency assignment at runtime. To improve the energy efficiency of the overall multicore system, we propose an online DVFS control strategy based on core-level Modular Reinforcement Learning (MRL) to adaptively select an appropriate operating frequency for each individual core. Instead of focusing solely on local core conditions, MRL is able to make comprehensive decisions by considering the running states of multiple cores without incurring the exponential memory cost required by traditional monolithic Reinforcement Learning (RL). Experimental results on various realistic applications and different system scales show that the proposed approach improves energy efficiency by up to 28% compared to the recent individual-RL approach.
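
The memory-cost contrast mentioned in this abstract can be illustrated with a rough count of value-table entries. This is only a sketch under assumed table layouts (one joint table over all considered cores versus one small table per considered core); the abstract does not give the paper's actual MRL formulation, and every name and number below is hypothetical.

```python
def monolithic_entries(states_per_core, considered_cores, actions):
    """Entries in one joint table indexed by the states of all considered cores.

    Assumption: the joint state is the Cartesian product of per-core states,
    so the table grows exponentially with the number of cores considered.
    """
    return (states_per_core ** considered_cores) * actions


def modular_entries(states_per_core, considered_cores, actions):
    """Entries when each considered core gets its own small table (one module).

    Assumption: modules are independent tables whose outputs are combined
    at decision time, so the total grows linearly with the number of cores.
    """
    return considered_cores * states_per_core * actions


if __name__ == "__main__":
    # Hypothetical sizes: 10 states per core, 5 frequency levels.
    for cores in (1, 2, 4, 8):
        print(cores, monolithic_entries(10, cores, 5), modular_entries(10, cores, 5))
```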

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2017

Energy-Efficient Power Delivery System Paradigms for Many-Core Processors

Haoran Li; Xuan Wang; Jiang Xu; Zhe Wang; Rafael Kioji Vivas Maeda; Zhehui Wang; Peng Yang; Luan H. K. Duong; Zhifei Wang

The design of the power delivery system plays a crucial role in guaranteeing the proper functionality of many-core processor systems. The power lost in power delivery has become a salient part of total power consumption, and the energy efficiency of a highly dynamic system has been significantly challenged. Being able to achieve a fast response time and multiple voltage domain control, on-chip voltage regulators (VRs) have become popular choices to enable fine-grain power management, which also enlarges the design space of power delivery systems. This paper analytically studies different power delivery system paradigms and power management schemes in terms of energy efficiency, area overhead, and power pin occupation. The analysis shows that compared to the conventional paradigm with off-chip VRs, hybrid paradigms with both on-chip and off-chip VRs are able to maintain high efficiency over a larger range of workloads, though they suffer from low efficiency at light workload. Combined with the quantized power management scheme, the hybrid paradigm can improve the system energy efficiency at light workload by a maximum of 136% compared to the traditional load-balanced scheme. In addition, the in-package (iP) hybrid paradigm further shows its advantage in reducing the physical overheads. The results reveal that at a 120 W workload, it occupies only 10.94% of the total footprint area and 39.07% of the power pins of the off-chip paradigm. We conclude that the iP hybrid paradigm achieves the best tradeoffs between efficiency, physical overhead, and realization of fine-grain power management.

Journal of Lightwave Technology | 2016

Low-Loss High-Radix Integrated Optical Switch Networks for Software-Defined Servers

Zhifei Wang; Zhehui Wang; Jiang Xu; Peng Yang; Luan H. K. Duong; Zhe Wang; Haoran Li; Rafael Kioji Vivas Maeda

Software-defined servers provide high flexibility and customizability with low power consumption. To satisfy the ultrahigh bandwidth requirement of the interconnection of these servers, integrated optical switch networks, based on the recent development of silicon photonics, are promising candidates. In this study, we present a family of floorplan optimized delta optical networks (FODONs) with the proposed stage switches. Both an analytical approximation and a loss model based on an exhaustive search approach are developed to evaluate the loss parameters in the networks. The optimization of the stage switch radix is conducted as well. Results show that when 32 WDM channels are employed, the worst-case loss of the 1024 × 1024 FODON with 4 × 4 stage switches is only 26 dB, which is 95, 63, and 37 dB less than that of Benes, Fat-tree, and Baseline networks of the same size, respectively. Furthermore, the average loss and the cost of hardware resources of FODONs are much lower than those of other networks.
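
To put the reported dB gaps in perspective, the short sketch below derives the worst-case losses implied for the other three networks from the figures quoted in the abstract and converts each gap into a linear optical power ratio using the standard relation ratio = 10^(dB/10). The variable and function names are illustrative, and the derived totals are simple arithmetic on the quoted numbers rather than values stated in the abstract.

```python
# Figures quoted in the abstract (1024 x 1024 networks, 32 WDM channels).
fodon_loss_db = 26
gaps_db = {"Benes": 95, "Fat-tree": 63, "Baseline": 37}

# Worst-case loss implied for each reference network: FODON loss plus the gap.
implied_loss_db = {name: fodon_loss_db + gap for name, gap in gaps_db.items()}


def db_to_power_ratio(db):
    """Convert a decibel difference into a linear power ratio."""
    return 10 ** (db / 10)


if __name__ == "__main__":
    for name, gap in gaps_db.items():
        print(f"{name}: implied worst-case loss {implied_loss_db[name]} dB, "
              f"power ratio relative to FODON {db_to_power_ratio(gap):.3g}")
```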

IEEE Transactions on Very Large Scale Integration Systems | 2016

An Adaptive Process-Variation-Aware Technique for Power-Gating-Induced Power/Ground Noise Mitigation in MPSoC

Zhe Wang; Xuan Wang; Jiang Xu; Haoran Li; Rafael Kioji Vivas Maeda; Zhehui Wang; Peng Yang; Luan H. K. Duong; Zhifei Wang

Power gating (PG) is one of the most effective techniques to reduce the leakage power in multiprocessor system-on-chips (MPSoCs). However, the power-mode transition during the PG period of an individual processing unit (PU) will introduce serious power/ground (P/G) noise to the neighboring PUs. As technology scales, the P/G noise problem becomes a severe reliability threat to MPSoCs. At the same time, the increasing manufacturing process variations (PVs) also bring uncertainties to the P/G noise problem and make it difficult to predict and mitigate. To tackle this problem, in this paper, we analyze the PG-induced P/G noise in the presence of PVs and propose a hardware-software collaborated runtime technique to adaptively protect PUs from P/G noise. A sensor network-on-chip is used to gather noise information and coordinate different system components. An online PV-aware algorithm is developed to effectively decide the noise impact range and arrange protections for affected PUs based on the collected noise information. We evaluate the proposed technique through cycle-level Monte Carlo simulations of NoC-based MPSoCs at different scales. The experimental results on various realistic applications show that our technique achieves reliability comparable to the most reliable static technique while improving system energy efficiency by 3.78%-29.5% on average and reducing the performance penalty by 15.7%-70.4% across different MPSoC scales.

Design, Automation, and Test in Europe | 2015

Adaptively tolerate power-gating-induced power/ground noise under process variations

Zhe Wang; Xuan Wang; Jiang Xu; Xiaowen Wu; Zhehui Wang; Peng Yang; Luan H. K. Duong; Haoran Li; Rafael Kioji Vivas Maeda; Zhifei Wang

Power gating is one of the most effective techniques to reduce the leakage power in multiprocessor system-on-chips (MPSoCs). However, the power-mode transition during the power gating period of an individual processing unit will introduce serious power/ground (P/G) noise to the neighboring processing units. As technology scales, the P/G noise problem becomes a severe reliability threat to MPSoCs. At the same time, the increasing manufacturing process variations also bring uncertainties to the P/G noise problem and make it difficult to predict and deal with. To address this problem, for the first time, this paper analyzes the power-gating-induced P/G noise in the presence of process variations and proposes a hardware-software collaborated online method to adaptively protect processing units from P/G noise. A sensor network-on-chip (SENoC) is used to gather noise information and coordinate different system components. Meanwhile, an online software-based algorithm is developed to effectively decide the noise impact range and arrange protections for affected processing units based on the collected information. We evaluate the proposed method through Monte Carlo simulations on a NoC-based MPSoC platform. The experimental results show that for a set of real applications, our method achieves on average a 13.2% overall performance improvement and a 13.3% system energy reduction compared with the traditional stop-go method.

Asia and South Pacific Design Automation Conference | 2015

Alleviate chip I/O pin constraints for multicore processors through optical interconnects

Zhehui Wang; Jiang Xu; Peng Yang; Xuan Wang; Zhe Wang; Luan H. K. Duong; Zhifei Wang; Haoran Li; Rafael Kioji Vivas Maeda; Xiaowen Wu; Yaoyao Ye; Qinfen Hao

Chip I/O pins are an increasingly limited resource and significantly affect the performance, power, and cost of multicore processors. Optical interconnects promise low power and high bandwidth, and are potential alternatives to electrical interconnects. This work systematically developed a set of analytical models for electrical and optical interconnects to study their structures, receiver sensitivities, crosstalk noise, and attenuation. We verified the models against published implementation results. The analytical models quantitatively identified the advantages of optical interconnects in terms of bandwidth, energy consumption, and transmission distance. We showed that optical interconnects can significantly reduce chip pin counts. For example, compared to electrical interconnects, optical interconnects can save at least 92% of signal pins when connecting chips more than 25 cm (10 inches) apart.
