Binzhang Fu
Chinese Academy of Sciences
Publication
Featured research published by Binzhang Fu.
IEEE Transactions on Very Large Scale Integration Systems | 2015
Ying Wang; Yinhe Han; Lei Zhang; Binzhang Fu; Cheng Liu; Huawei Li; Xiaowei Li
The confluence of 3-D integration and network-on-chip (NoC) provides an effective solution to the scalability problem of on-chip interconnects. In 3-D integration, the through-silicon via (TSV) is considered the most promising bonding technology. However, TSVs are also precious link resources because they consume significant chip area and may lead to routing congestion in the physical design stage. In addition, TSVs suffer from serious yield losses that shrink the effective TSV density. Thus, it is necessary to implement a TSV-economical 3-D NoC architecture for cost-effective design. For symmetric 3-D mesh NoCs, we observe that the bandwidth utilization of TSVs is low and that, unlike planar links, they rarely become contention spots in the network. Based on this observation, we propose the TSV sharing (TS) scheme to save TSVs in 3-D NoCs by enabling neighboring routers to share the vertical channels in a time-division-multiplexed manner. We also investigate different TS implementation alternatives and show, through a design space exploration, how TS improves TSV-effectiveness (TE) in multicore processors. In our experiments, we comprehensively evaluate TS's influence on all layers of the system. The results show that the proposed method significantly improves TE with negligible performance overhead.
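The core idea of the TS scheme, neighboring routers taking turns on one shared vertical channel, can be sketched as a simple round-robin time-division arbiter. This is a minimal illustrative model (the class and method names are invented for exposition, not taken from the paper's hardware design):

```python
# Minimal sketch of time-division-multiplexed TSV sharing: several
# neighboring routers share one vertical channel, and only the router
# that owns the current time slot may send.

class SharedTSV:
    def __init__(self, sharers):
        self.sharers = sharers          # routers sharing this vertical link
        self.slot = 0                   # current time slot

    def owner(self):
        """Router granted the TSV in the current slot (round-robin TDM)."""
        return self.sharers[self.slot % len(self.sharers)]

    def try_send(self, router, flit, log):
        """A send succeeds only for the slot owner; others must wait."""
        granted = (router == self.owner())
        if granted:
            log.append((self.slot, router, flit))
        self.slot += 1                  # advance to the next time slot
        return granted
```

For example, with four sharers, router `R0` can send in slot 0 but is refused in slot 1, which belongs to `R1`; over four slots each router gets exactly one grant.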
computing frontiers | 2015
Long Li; Nongda Hu; Ke Liu; Binzhang Fu; Mingyu Chen; Lixin Zhang
Enabling multiple paths in datacenter networks is a common practice to improve performance and robustness. Multi-path TCP (MPTCP) exploits multiple paths by splitting a single flow into multiple subflows. The number of subflows in MPTCP is determined before a connection is established and usually remains unchanged during the lifetime of that connection. While MPTCP improves both bandwidth efficiency and network reliability, more subflows incur additional overhead, especially for small (so-called mice) subflows. Additionally, it is difficult to choose an appropriate number of subflows for each TCP connection that achieves good performance without incurring significant overhead. To address this problem, we propose an adaptive multi-path transmission control protocol, AMTCP, which dynamically adjusts the number of subflows according to application workloads. Specifically, AMTCP divides time into small intervals and measures the throughput of each subflow over the latest interval, then adjusts the number of subflows dynamically, with the goal of reducing resource and scheduling overheads for mice flows while achieving higher throughput for elephant flows. Our evaluations show that AMTCP increases throughput by over 30% compared to conventional TCP. Meanwhile, AMTCP decreases the average number of subflows by more than 37.5% while achieving throughput similar to MPTCP's.
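The measure-then-adjust loop described above can be sketched as a small control function. The thresholds and bounds here are illustrative assumptions, not values from the paper:

```python
# Sketch of AMTCP's adaptive idea: per measurement interval, shrink the
# subflow count for mice flows and grow it for elephant flows.
# All thresholds below are hypothetical, chosen only for illustration.

def adjust_subflows(n_subflows, interval_bytes, interval_sec,
                    mice_thresh=1e5, elephant_thresh=1e7,
                    n_min=1, n_max=8):
    """Return the subflow count for the next interval, based on the
    throughput (bytes/sec) measured over the latest interval."""
    throughput = interval_bytes / interval_sec
    if throughput < mice_thresh and n_subflows > n_min:
        return n_subflows - 1        # mice flow: cut scheduling overhead
    if throughput > elephant_thresh and n_subflows < n_max:
        return n_subflows + 1        # elephant flow: chase more bandwidth
    return n_subflows                # in between: keep the current count
```

Calling this once per interval gives the dynamic behavior the abstract describes: a connection that measures only 10 KB/s drops a subflow, while one measuring 100 MB/s adds one, up to the cap.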
high performance computing and communications | 2013
Wentao Bao; Binzhang Fu; Mingyu Chen; Lixin Zhang
High-density servers feature low power, small volume, and high computational density. With the rising use of high-density servers in data-intensive and large-scale web applications, a high-performance and cost-efficient intra-server interconnection network is required. Most state-of-the-art high-density servers adopt a fully connected intra-server network to attain high network performance. Unfortunately, this solution is costly due to its high node degree. In this paper, we exploit the theoretically optimal Moore graph to interconnect the chips within a server. Given the typical scale of target applications, a 50-node Moore graph, the Hoffman-Singleton graph, is adopted. In practice, multiple chips are integrated onto one processor board, which means the original graph must be partitioned into homogeneous connected subgraphs. Existing partitioning schemes do not consider this requirement and thus generate heterogeneous subgraphs. To address this problem, we propose two equivalent-partition schemes for the Hoffman-Singleton graph. In addition, we propose a logic-based minimal routing mechanism that is both time- and area-efficient. Finally, we compare the proposed network architecture with its counterparts, namely the fully connected, Kautz, and Torus networks. The results show that our proposed network achieves performance competitive with the fully connected network at a cost close to that of the Torus.
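The choice of a 50-node graph follows from the Moore bound: for maximum degree d and diameter k, a graph can have at most 1 + d * sum((d-1)^i, i=0..k-1) nodes, and the Hoffman-Singleton graph attains this bound for degree 7, diameter 2. A quick check:

```python
# The Moore bound: the maximum node count of a graph with maximum
# degree `degree` and diameter `diameter`.

def moore_bound(degree, diameter):
    """1 + d * sum((d-1)^i for i in 0..k-1)."""
    return 1 + degree * sum((degree - 1) ** i for i in range(diameter))

# The Hoffman-Singleton graph attains the bound for degree 7, diameter 2:
print(moore_bound(7, 2))   # -> 50
```

This is why the topology is "theoretically optimal": no degree-7, diameter-2 graph can connect more than 50 chips, so every chip reaches every other in at most two hops with only 7 links per chip.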
design, automation, and test in europe | 2010
Binzhang Fu; Yinhe Han; Huawei Li; Xiaowei Li
In this paper, we propose a binary-tree-waveguide-connected Optical Network-on-Chip (ONoC) to accelerate lightpath establishment. By broadcasting control data over the proposed power-efficient binary-tree waveguide, the maximum number of hops for establishing a lightpath is reduced to two. Through extensive simulations and analysis, we demonstrate that the proposed ONoC significantly reduces the setup time and, consequently, the packet latency.
design automation conference | 2013
Hang Lu; Guihai Yan; Yinhe Han; Binzhang Fu; Xiaowei Li
Cloud service providers use workload consolidation in many-core cloud processors to optimize system utilization and augment performance for ever-expanding scale-out workloads. Performance isolation usually has to be enforced for consolidated workloads sharing the same many-core resources. The network-on-chip (NoC), as a major shared resource, also needs to be isolated to avoid violating performance isolation. Prior work uses strict network isolation to fulfill performance isolation. However, strict network isolation results in either low consolidation density or complex routing mechanisms with prohibitively high hardware cost and large latency. In view of this limitation, we propose a novel NoC isolation strategy for many-core cloud processors, called relaxed isolation (RISO). It permits underutilized links to be shared by multiple applications while keeping the aggregate traffic in check to enforce performance isolation. The experimental results show that consolidation density improves by more than 12% compared with the previous strict isolation scheme, while network latency is reduced by 38.4% on average.
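The "keep aggregate traffic in check" condition amounts to a per-link admission test. A minimal sketch, with a function name and interface invented for illustration (the paper's actual mechanism is a hardware routing strategy, not this code):

```python
# Sketch of a relaxed-isolation admission test: a link may be shared by
# several applications as long as their aggregate traffic stays within
# the link's capacity, so isolation is enforced without reserving whole
# links per application.

def can_share_link(link_capacity, existing_loads, new_load):
    """Admit a new application's traffic onto a link only if the
    aggregate load (existing + new) stays within capacity."""
    return sum(existing_loads) + new_load <= link_capacity
```

Strict isolation would reject any second application outright; the relaxed test admits it whenever the link is underutilized, which is where the higher consolidation density comes from.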
pacific rim international symposium on dependable computing | 2009
Binzhang Fu; Yinhe Han; Huawei Li; Xiaowei Li
Network-on-Chip (NoC) meshes are limited by reliability constraints, which motivates fault-tolerant routing. In particular, one of the main design issues is minimizing the loss of non-faulty routers in the presence of faults. To address this problem, we propose a new fault-tolerant routing algorithm with two distinct advantages. First, it keeps the network deadlock-free by utilizing restricted intermediate nodes rather than adding virtual channels (VCs), which leads to an area-efficient router. Second, in the proposed routing algorithm, the number of rounds of dimension-order routing (DOR) is no longer limited by the number of VCs. As a consequence, the number of sacrificed non-faulty routers is significantly reduced. We demonstrate these advantages through extensive simulations. The experimental results show that, under a limited number of VCs, the proposed routing algorithm always sacrifices the minimal number of non-faulty routers compared to previous solutions.
computing frontiers | 2016
Long Li; Ke Liu; Binzhang Fu; Mingyu Chen; Lixin Zhang
This paper proposes a guarantee-aware, cost-effective virtual machine placement algorithm for the cloud. The placement is first formulated as a nonlinear programming problem whose objective is to minimize the number of physical machines used. Specifically, apart from constraints on computing resources, we add a constraint for each network component to ensure that the sum of offered bandwidth guarantees on each link does not exceed the link capacity. We then devise a heuristic algorithm for the nonlinear programming problem. Results show that our approach reduces the number of physical machines used by 32.5% compared to the most recent approach.
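The structure of such a heuristic can be illustrated with a first-fit sketch that respects both a CPU constraint and a per-uplink bandwidth-guarantee constraint. This is not the paper's heuristic, only a simplified stand-in showing how the two constraint types interact (all names and the single-uplink model are assumptions):

```python
# Illustrative first-fit placement under two constraints: each physical
# machine (PM) has a CPU capacity and an uplink bandwidth capacity, and
# the sum of guarantees placed on a PM's uplink must not exceed it.
# A new PM is opened only when no existing PM can host the VM, which
# approximates the objective of minimizing the PM count.

def place_vms(vms, pm_cpu, link_capacity):
    """vms: list of (cpu_demand, bw_guarantee) pairs.
    Returns a list of PMs, each a list of VM indices."""
    pms = []                                   # [free_cpu, free_bw, [vm ids]]
    for i, (cpu, bw) in enumerate(vms):
        for pm in pms:
            if pm[0] >= cpu and pm[1] >= bw:   # CPU and link constraints
                pm[0] -= cpu
                pm[1] -= bw
                pm[2].append(i)
                break
        else:
            pms.append([pm_cpu - cpu, link_capacity - bw, [i]])
    return [pm[2] for pm in pms]
```

Note how the bandwidth constraint can force a new PM open even when CPU would still fit: a VM with a large guarantee cannot join a PM whose uplink is already fully promised.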
international conference on asic | 2009
Binzhang Fu; Yinhe Han; Huawei Li; Xiaowei Li
Reusing the Network-on-Chip (NoC) as a Test Access Mechanism (TAM) has been adopted to transfer test data to embedded cores. However, compared to NoC-reuse TAMs, some bus-based TAMs achieve shorter test times thanks to their fine-grained scheduling units. This paper proposes a new TAM named Test Tree (T2). The T2 TAM is built by reusing the hardware resources of routers instead of reusing the packet-based NoC. By implementing DFT logic on the routers, the T2 TAM achieves high wire utilization and adopts fine-grained scheduling. In addition, to address the problem of testing a large number of homogeneous cores, the T2 TAM facilitates multicasting stimuli to homogeneous cores to save test time. Experimental results show that test cycles can be reduced by up to 38% compared with reusing the NoC as a TAM, with only 0.3% DFT overhead.
high performance computing and communications | 2016
Sheng Xu; Binzhang Fu; Mingyu Chen; Lixin Zhang
Congestion-aware adaptive routing can effectively improve the performance of Networks-on-Chip (NoC) thanks to its ability to accurately predict network congestion and make optimal routing decisions. Because transporting quantitative congestion information over current Congestion Propagation Networks (CPNs) is cost-prohibitive, state-of-the-art adaptive routing algorithms tend to exploit qualitative congestion information. Unfortunately, qualitative congestion information cannot provide a precise view of the network congestion level and hence mispredicts congestion in some cases, which easily leads to suboptimal routing decisions. To address this problem, this paper proposes the Quantitative Congestion Awareness (QCA) technique, which collects non-local quantitative congestion information by transferring the difference, rather than the absolute value, of the desired congestion metrics, such as the number of free virtual channels. With QCA, the cost of the CPN is minimized and fixed, since only one wire per destination is required regardless of the network size and the number of virtual channels per physical channel. A novel adaptive routing algorithm combining a congestion-avoidance scheme with a comprehensive evaluation scheme is proposed to fully exploit the properties of quantitative congestion information and make optimal routing decisions. Extensive simulations show that throughput can be improved by up to 17.13% compared with state-of-the-art routing algorithms.
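The difference-transfer trick can be sketched as a sender that emits only the change in its free-VC count each cycle and a receiver that integrates those deltas. This sketch ignores the one-wire serialization of the actual CPN and uses invented class names; it only shows why transmitting deltas suffices to reconstruct the absolute metric:

```python
# Sketch of difference-based congestion propagation: instead of sending
# the absolute number of free virtual channels, a router sends only the
# signed change since the last cycle, and the receiver accumulates the
# deltas to track the remote value.

class CongestionSender:
    def __init__(self):
        self.last_sent = 0              # both sides start from zero

    def encode(self, free_vcs):
        """Emit the signed change in free VCs since the last cycle."""
        delta = free_vcs - self.last_sent
        self.last_sent = free_vcs
        return delta

class CongestionReceiver:
    def __init__(self):
        self.estimate = 0

    def update(self, delta):
        """Accumulate deltas to reconstruct the remote free-VC count."""
        self.estimate += delta
        return self.estimate
```

As long as both sides start from the same value and no delta is lost, the receiver's running sum always equals the sender's current count, with far less information on the wire than the absolute values.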
international conference on computer communications and networks | 2015
Sheng Xu; Binzhang Fu; Mingyu Chen; Lixin Zhang
Torus networks, which are simple and incrementally expandable, fit the scale-out model of current large-scale computing systems, such as data centers. On the downside, however, the torus suffers from a long network diameter. One way to address this problem is to add random shortcuts, but this approach does not account for the variety of data center traffic and leads to severely non-uniform network performance. To address this problem, we propose Flyover, which exploits the flexibility of optical circuit switching to add on-demand shortcuts, as a cost-efficient and scale-out network architecture for DCNs. Three features give Flyover good performance. First, a newly defined serpent flow, rather than the elephant flow, is prioritized: unlike the elephant flow, which is big in size, the serpent flow is big in both size and distance. In this way, the electrical torus network is maximally relieved and overall network performance is optimized. Second, Flyover generates region-to-region rather than point-to-point shortcuts, so the valuable optical shortcuts can be fully utilized. Third, a semi-random heuristic algorithm is proposed that both reduces computation time and improves network performance. Furthermore, several ways to expand Flyover are discussed and evaluated to ensure that Flyover is highly scalable. Finally, Flyover is extensively analyzed and compared with its counterparts using both simulations and prototypes. The results show that Flyover improves network throughput by up to 135% and latency by up to 277%.
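The flow taxonomy above can be captured in a tiny classifier. The thresholds here are purely hypothetical (the paper does not state them in this abstract); the point is only the two-axis definition, size and distance:

```python
# Illustrative classifier for the abstract's flow taxonomy: an elephant
# flow is big in size, while a serpent flow, the one Flyover routes over
# optical shortcuts, is big in both size and hop distance on the torus.
# Both thresholds are assumptions made for this sketch.

def classify_flow(size_bytes, hop_distance,
                  size_thresh=10 * 1024 * 1024, hop_thresh=4):
    if size_bytes >= size_thresh and hop_distance >= hop_thresh:
        return "serpent"     # big and long-haul: best served by a shortcut
    if size_bytes >= size_thresh:
        return "elephant"    # big but short-haul: leave it on the torus
    return "mice"
```

Prioritizing serpent flows targets exactly the traffic that occupies the most link-hops on the electrical torus, so diverting it onto an optical shortcut relieves the most congestion per circuit.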