Victor Goulart

Egypt-Japan University of Science and Technology

Publication

Featured research published by Victor Goulart.

Computers & Mathematics with Applications | 2012

Flexible router architecture for network-on-chip

Mostafa S. Sayed; Ahmed Shalaby; Mohamed El-Sayed; Victor Goulart

The growing complexity of systems-on-chip (SoCs) has pushed researchers to propose replacing the bus architecture with Networks-on-Chip (NoCs). The key advantages of NoCs are their performance and scalability. NoCs are now a well-established research topic, and several implementations have been proposed. Some techniques aim to improve NoC performance in terms of latency and throughput, while others aim to improve area utilization and power consumption. An important research question in NoC design is the tradeoff between area/power and performance. To improve performance, some techniques increase the number of buffers; however, this increases area and power consumption. This paper introduces a new router architecture, called the Flexible Router, which improves the performance of the overall network using the same amount of available buffers but in a more efficient way. Therefore, there is no need to increase the size of the buffers or to use extra virtual channels (VCs), which cause high power consumption, area overheads, and complex logic. The Flexible Router provides a way for requests to a busy buffer to be handled by other buffers in the router. The Flexible Router achieves better performance by increasing the saturation rate for Hotspot, Uniform, and Nearest-Neighbor traffic patterns, most notably Hotspot, with an 11.4% increase. A discussion of the area overhead compared to the Base router and an analysis of out-of-order packet arrival (a side effect) are also provided.

Digital Systems Design | 2011

Faster Processor Allocation Algorithms for Mesh-Connected CMPs

Luka B. Daoud; M. El Sayed Ragab; Victor Goulart

Designing efficient processor allocation algorithms is one of the major issues in building large, high-performance Chip Multiprocessors (CMPs). The task of the processor allocator (PA) is to assign one or more processors to an incoming job. In this paper, we propose two new contiguous processor allocation algorithms, Better First Fit (BFF) and Improved Better First Fit (IBFF), for CMPs based on 2-D mesh networks. Our proposed algorithms outperform existing allocation strategies based on the busy-array approach, such as First Fit (FF) and Improved First Fit (IFF), by scanning the bit-map matrix faster. BFF and IBFF were evaluated with different sets of jobs over different network sizes and job dispatching approaches. Depending on the job size, BFF is faster than IFF by up to 63.1%, or 18.5% on average over all job sizes evaluated. IBFF, when set with the best parameters, is 54.8% faster than IFF and 4% faster than BFF for a random mix of job sizes.
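To make the idea of contiguous allocation on a busy array concrete, the following is a minimal sketch of a naive first-fit scan over a bit-map matrix. It is not the BFF or IBFF algorithm from the paper (the abstract does not describe their internals); the function names `first_fit` and `allocate` and the list-of-lists representation are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Mesh = List[List[int]]  # busy array: 1 = processor busy, 0 = processor free


def first_fit(busy: Mesh, w: int, h: int) -> Optional[Tuple[int, int]]:
    """Return (row, col) of the first free w x h submesh, scanning row-major.

    Naive illustration only: every candidate base is checked cell by cell.
    """
    rows, cols = len(busy), len(busy[0])
    for r in range(rows - h + 1):
        for c in range(cols - w + 1):
            if all(busy[r + dr][c + dc] == 0
                   for dr in range(h) for dc in range(w)):
                return (r, c)
    return None  # no contiguous free submesh of the requested size


def allocate(busy: Mesh, w: int, h: int) -> Optional[Tuple[int, int]]:
    """Find a free w x h submesh and mark its processors as busy."""
    base = first_fit(busy, w, h)
    if base is not None:
        r, c = base
        for dr in range(h):
            for dc in range(w):
                busy[r + dr][c + dc] = 1
    return base
```

The cost of this naive scan grows with both the mesh size and the job size, which is the kind of overhead that faster bit-map scanning strategies aim to reduce.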

IET Image Processing | 2017

Diagonal-based fast intra-mode decision algorithm for HEVC

Maher Abdelrasoul; Mohammed S. Sayed; Victor Goulart

Intra-mode decision plays an important role in the high-efficiency video coding (HEVC) video compression standard. The larger number of intra-modes in HEVC significantly increases computational complexity and encoding time. To reduce the encoding time with as little effect as possible on coding quality, a new algorithm based on the texture of the block diagonals is proposed. The proposed algorithm reduces the number of calculations relative to the original standard. The results show that the proposed algorithm offers a variety of solutions with a wide range of encoding time reductions and BD-rate increments. The solution with the highest time saving within an acceptable BD-rate increment shows a 40.85% time saving and a 1.75% BD-rate increment. With different parameter settings, BD-rate increments as low as 0.78% can be achieved with a 33.38% time saving.

Latin American Symposium on Circuits and Systems | 2012

Processor allocation algorithm based on Frame Combing with Memorization for 2D mesh CMPs

Luka B. Daoud; M. El Sayed Ragab; Victor Goulart

The Processor Allocator (PA) is one of the main components in achieving high-performance Chip Multiprocessors (CMPs). The task of the PA is to assign a set of processors to execute an incoming job scheduled by the Operating System (OS). An efficient PA is one that allocates an incoming job, if a suitable free submesh exists, with minimum overhead. In this paper, we propose a new contiguous processor allocation algorithm, Frame Combing with Memorization (FCM), for 2D mesh CMPs, which is fast, has complete submesh recognition, and assigns a set of processors without creating coverage areas for the incoming job. Our proposed algorithm outperforms existing allocation strategies based on the busy-array approach, such as Improved First Fit (IFF) and Better First Fit (BFF), and even Right of Busy Submeshes (RBS), a fast busy-list-based PA algorithm. Performance was evaluated with different job sets over different network sizes and at different network occupations. Our proposed PA is on average 3 to 5 times faster than IFF, BFF, and RBS when allocating a small job (size 4×2) in a 10×10 network. For large networks (30×30), it is up to 60, 79, and 48 times faster than IFF, BFF, and RBS, respectively.

2012 Japan-Egypt Conference on Electronics, Communications and Computers | 2012

Congestion mitigation using flexible router architecture for Network-on-Chip

Mostafa S. Sayed; Ahmed Shalaby; M. El-Sayed Ragab; Victor Goulart

An important topic in Network-on-Chip (NoC) design is the tradeoff between area and performance. Some techniques increase the number of buffers to improve performance; however, this increases both chip area and power consumption. In this paper, we introduce a new flexible router architecture that improves the performance of the overall network using the same amount of available buffering, but more efficiently. Therefore, there is no need to increase the size of the buffers or to use extra virtual channels (VCs), which have high power and area overheads or complex logic. If there is a request to a busy buffer, the router stores the incoming packet in any other suitable free buffer in the router. The Flexible router improves performance by increasing the saturation rate for Hotspot, Uniform, and Nearest-Neighbor traffic patterns, most notably Hotspot, with an 11.4% increase. A discussion of the area overhead over a standard Base router and an analysis of out-of-order packet arrival (a side effect) are also presented.
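The buffer-redirection idea described above can be sketched as a toy selection function. This is only an illustration under stated assumptions: the abstract does not say which buffers count as "suitable", so the sketch treats any free buffer as suitable, and the name `select_buffer` and the port labels are hypothetical.

```python
from typing import Dict, Optional


def select_buffer(requested: str, busy: Dict[str, bool]) -> Optional[str]:
    """Pick an input buffer for an incoming packet.

    Assumption: any free buffer is "suitable". A real router would apply
    further constraints (for example, to avoid deadlock) that are not
    modelled here.
    """
    if not busy[requested]:
        return requested          # normal case: the requested buffer is free
    for port, is_busy in busy.items():
        if not is_busy:
            return port           # redirect the packet to another free buffer
    return None                   # all buffers busy: the packet must wait
```

Because packets of the same flow may now pass through different buffers, they can leave the router in a different order than they arrived, which is the side effect the abstract mentions.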

International Conference on Electronics, Circuits, and Systems | 2013

Hardware implementation and evaluation of the Flexible router architecture for NoCs

Hossam El-Sayed; Mohamed El-Sayed Ragab; Mohammed S. Sayed; Victor Goulart

The router is the core of a Network-on-Chip; routers must therefore be designed to meet performance, area, and power requirements. To improve performance, some techniques increase the number of buffers, but buffers are responsible for a large portion of the router area and power. The Flexible router architecture improves router performance by increasing the saturation rate while keeping the same number of buffers as the Base router. Moreover, it was found that at high injection rates the Flexible router outperforms the Base router by nearly 14.3% in throughput and 27.6% in latency, with a 15% increase in the saturation point, for uniform random traffic. This paper focuses on the hardware implementation and evaluation of the Flexible Router to verify its functionality and evaluate its performance compared to the Base router. We use a cycle-accurate NoC simulation system implemented in Verilog HDL under uniform, neighbor, and hotspot traffic patterns.

IEEE Computer Society Annual Symposium on VLSI | 2016

Scalable Integer DCT Architecture for HEVC Encoder

Maher Abdelrasoul; Mohammed S. Sayed; Victor Goulart

The HEVC (H.265) standard was proposed as a means to increase the compression rate with no loss in video quality. Large integer DCTs, with sizes 16x16 and 32x32, are one of the key new features of the H.265 standard. In this paper, we propose a new scalable architecture for the integer DCT in an HEVC encoder. The proposed architecture is fully pipelined, with optimized adder bit-widths. It was prototyped on TSMC 65 nm CMOS technology. The prototyping results show the high performance of the proposed architecture: its gate count is 130K and it achieves a throughput of 9.26 Gsps. The proposed architecture can encode an 8K @ 120 fps video sequence in real time at a working frequency of 373.25 MHz.
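The quoted working frequency can be sanity-checked with a short calculation. The sketch below assumes an 8K frame of 7680×4320 pixels, 4:2:0 chroma subsampling (1.5 samples per pixel), and 16 samples processed per clock cycle; none of these three values is stated in the abstract, and they are chosen here only because they reproduce a figure close to 373.25 MHz.

```python
def required_frequency_mhz(width: int, height: int, fps: int,
                           samples_per_cycle: int) -> float:
    """Clock frequency (MHz) needed to process every sample in real time.

    Assumes 4:2:0 subsampling: one luma sample per pixel plus two chroma
    planes at quarter resolution, i.e. 1.5 samples per pixel.
    """
    luma_per_second = width * height * fps
    samples_per_second = luma_per_second * 3 // 2
    return samples_per_second / samples_per_cycle / 1e6


# Assumed parameters: 8K (7680x4320) at 120 fps, 16 samples per cycle.
freq = required_frequency_mhz(7680, 4320, 120, samples_per_cycle=16)
```

Under these assumptions the required sample rate is about 5.97 Gsps, which is below the 9.26 Gsps peak throughput reported for the architecture.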

Parallel, Distributed and Network-Based Processing | 2014

Hierarchical Network Coding for Collective Communication on HPC Interconnects

Ahmed Shalaby; Mohamed El-Sayed Ragab; Victor Goulart; Ikki Fujiwara; Michihiro Koibuchi

Network bandwidth is a performance concern, especially for collective communication, because the bisection bandwidth of recent supercomputers is far less than their full bisection bandwidth. In this context, we propose to exploit a network coding technique to reduce the number of unicasts and the size of the transferred data generated by latency-sensitive collective communication in supercomputers. Our proposed network coding scheme has a hierarchical multicasting structure with intra-group and inter-group unicasts. Quantitative analysis shows that the aggregate path hop counts under our hierarchical network coding decrease by as much as 94% compared to conventional unicast-based multicasts. We validate these results with cycle-accurate network simulations. In 1,024-switch networks, the network reduces the execution time of collective communication by as much as 64%. We also show that our hierarchical network coding is beneficial for any packet size.

Digital Systems Design | 2013

High Performance Bitwise OR Based Submesh Allocation for 2D Mesh-Connected CMPs

Luka B. Daoud; Victor Goulart

Chip Multiprocessors (CMPs) are widely used across many application domains. The processor allocator (PA) assigns one processor or a set of processors to execute an application's job. To be efficient, the allocation of jobs to processors should be fast and low-overhead, and should reduce fragmentation or increase the number of allocated jobs. In this paper, we propose a new contiguous processor allocation algorithm based on the bitwise OR operation for 2D mesh CMPs, which assigns a set of processors without creating coverage areas for the incoming job. Our PA outperforms other state-of-the-art PAs based on the busy-array or busy-list approaches. Compared to other PAs, the hardware implementation of the algorithm showed not only lower area consumption but also higher working frequencies.
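One way a bitwise OR can be used for submesh search is sketched below: each mesh row is stored as an integer bitmask, the rows spanned by the job are ORed together, and a run of free columns is then located in the merged mask. This is a minimal sketch of the general idea, not the algorithm from the paper; the function name `find_submesh` and the row-bitmask encoding (bit c set means the processor in column c is busy) are assumptions.

```python
from typing import List, Optional, Tuple


def find_submesh(rows: List[int], cols: int, w: int, h: int
                 ) -> Optional[Tuple[int, int]]:
    """Return (row, col) of a free w x h submesh, or None if none exists.

    rows[r] is a bitmask of row r: bit c is 1 when processor (r, c) is busy.
    """
    window = (1 << w) - 1                 # w consecutive set bits
    for r in range(len(rows) - h + 1):
        merged = 0
        for dr in range(h):
            merged |= rows[r + dr]        # a column is busy if busy in any row
        for c in range(cols - w + 1):
            if merged & (window << c) == 0:
                return (r, c)             # w free columns across all h rows
    return None
```

ORing the rows collapses the two-dimensional check into a one-dimensional search, so each candidate base costs one AND and one comparison instead of a cell-by-cell test.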

NORCHIP | 2012

Intermediate nodes selection schemes for Network Coding in Network-on-Chips

Ahmed Shalaby; M. El Sayed Ragab; Victor Goulart

Network Coding (NC) is a novel technique that maximizes information flow inside networks by combining packets to be sent in common communication links concurrently. This paper discusses the feasibility of network coding for Network-on-Chip (NoC) and addresses the selection of intermediate nodes (where packets are merged or forked) and its impact on the NoC performance. A number of intermediate node selection algorithms are proposed and evaluated over different NoC sizes.
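The merging and forking of packets at an intermediate node can be illustrated with the classic XOR form of network coding. The sketch below is a generic example and makes no claim about the coding scheme or node selection algorithms used in the paper; the function names `merge` and `fork` are hypothetical, and equal-length packets are assumed.

```python
def merge(packet_a: bytes, packet_b: bytes) -> bytes:
    """Combine two equal-length packets into one coded packet using XOR."""
    if len(packet_a) != len(packet_b):
        raise ValueError("this sketch assumes equal-length packets")
    return bytes(x ^ y for x, y in zip(packet_a, packet_b))


def fork(coded: bytes, known: bytes) -> bytes:
    """Recover the unknown packet from the coded packet and a known one.

    XOR is its own inverse, so decoding is the same operation as encoding.
    """
    return merge(coded, known)
```

A destination that already holds one of the two original packets needs only the single coded packet to recover the other, so the shared link carries one packet instead of two.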