Zvonko G. Vranesic
University of Toronto
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Zvonko G. Vranesic.
international symposium on microarchitecture | 1997
Keith I. Farkas; Paul Chow; Norman P. Jouppi; Zvonko G. Vranesic
The multicluster architecture that we introduce offers a decentralized, dynamically scheduled architecture, in which the register files, dispatch queue, and functional units of the architecture are distributed across multiple clusters, and each cluster is assigned a subset of the architectural registers. The motivation for the multicluster architecture is to reduce the clock cycle time, relative to a single-cluster architecture with the same number of hardware resources, by reducing the size and complexity of components on critical timing paths. Resource partitioning, however, introduces instruction-execution overhead and may reduce the number of concurrently executing instructions. To counter these two negative by-products of partitioning, we developed a static instruction scheduling algorithm. We describe this algorithm, and using trace-driven simulations of SPEC92 benchmarks, evaluate its effectiveness. This evaluation indicates that for the configurations considered the multicluster architecture may have significant performance advantages at feature sizes below 0.35 /spl mu/m, and warrants further investigation.
design automation conference | 1991
Robert J. Francis; Jonathan Rose; Zvonko G. Vranesic
A new technology mapping algorithm for lookup tablebased Field Programmable Gate Arrays (FPGA) is presented. The major innovation is a method for choosing gate-level decompositions based on bin packing. This approach is up to 28 times faster than a previous exhaustive approach. The algorithm also exploits reconvergent paths and replication of logic at fanout nodes to reduce the number of lookup tables in the circuit. The new algorithm is implemented in the Chortle-crf program. In an experimental comparison Chortle-crf requires 14 YO fewer lookup tables than Chortle [Fran90] and 10 ~o fewer lookup tables than mis-pga [Murg90a] to implement a set of benchmark networks. Chortle-crf can also implement a network as a circuit of Xilinx 3000 series Configurable Logic Blocks (CLBS). To implement the benchmark networks as circuits of CLBS Chortle-crf requires 12 70 fewer CLBS than mis-pga and 22 % fewer CLBS than XNFOPT [Xili89]. In these experiments Chortle-crf waa an average of 68 times faster than mis-pga and 30 times faster than XNFOPT. 1
international conference on computer aided design | 1990
Stephen Dean Brown; Jonathan Rose; Zvonko G. Vranesic
A detailed routing algorithm, called the coarse graph expander (CGE), that has been designed specifically for field-programmable gate arrays (FPGAs) is described. The algorithm approaches this problem in a general way, allowing it to be used over a wide range of different FPGA routing architectures. It addresses the issue of scarce routing resources by considering the side effects that the routing of one connection has on another, and also has the ability to optimize the routing delays of time-critical connections. CGE has been used to obtain excellent routing results for several industrial circuits implemented in FPGAs with various routing architectures. The results show that CGE can route relatively large FPGAs in very close to the minimum number of tracks as determined by global routing, and it can successfully optimize the routing delays of time-critical connections. CGE has a linear run time over circuit size. >
international symposium on computer architecture | 1997
Keith I. Farkas; Paul Chow; Norman P. Jouppi; Zvonko G. Vranesic
In this paper, we identify performance trends and design relationships between the following components of the data memory hierarchy in a dynamically-scheduled processor: the register file, the lockup-free data cache, the stream buffers, and the interface between these components and the lower levels of the memory hierarchy. Similar performance was obtained from all systems having support for fewer than four in-flight misses, irrespective of the register-file size, the issue width of the processor, and the memory bandwidth. While providing support for more than four in-flight misses did increase system performance, the improvement was less than that obtained by increasing the number of registers. The addition of stream buffers to the investigated systems led to a significant performance increase, with the larger increases for systems having less in-flight-miss support, greater memory bandwidth, or more instruction issue capability. The performance of these systems was not significantly affected by the inclusion of traffic filters, dynamic-stride calculators, or the inclusion of the per-load non-unity stride-predictor and the incremental-prefetching techniques, which we introduce. However, the incremental prefetching technique reduces the bandwidth consumed by stream buffers by 50% without a significant impact on performance.
IEEE Computer | 1991
Zvonko G. Vranesic; Michael Stumm; David Lewis; Ron White
The architecture of the Hector multiprocessor, which exploits current microprocessor technology to produce a machine with a good cost/performance tradeoff, is described. A key design feature of Hector is its interconnection backplane, which can accommodate future technology because it uses simple hardware with short critical paths in logic circuits and short lines in the interconnection network. The system is reliable and flexible and can be realized at a relatively low cost. The hierarchical structure results in a fast backplane and a bandwidth that increases linearly with the number of processors. Hector scales efficiently to larger sizes and faster processors.<<ETX>>
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2006
Valavan Manohararajah; Stephen Dean Brown; Zvonko G. Vranesic
In this paper, an iterative technology-mapping tool called IMap is presented. It supports depth-oriented (area is a secondary objective), area-oriented (depth is a secondary objective), and duplication-free mapping modes. The edge-delay model (as opposed to the more commonly used unit-delay model) is used throughout. Two new heuristics are used to obtain area reductions over previously published methods. The first heuristic predicts the effects of various mapping decisions on the area of the final solution, and the second heuristic bounds the depth of the mapping solution at each node. In depth-oriented mode, when targeting five lookup tables (LUTs), IMap obtains depth optimal solutions that are 44.4%, 19.4%, and 5% smaller than those produced by FlowMap, CutMap, and DAOMap, respectively. Targeting the same LUT size in area-oriented mode, IMap obtains solutions that are 17.5% and 9.4% smaller than those produced by duplication-free mapping and ZMap, respectively. IMap is also shown to be highly efficient. Runtime improvements of between 2.3times and 82times are obtained over existing algorithms when targeting five LUTs. Area and runtime results comparing IMap to the other mappers when targeting four and six LUTs are also presented
international conference on computer aided design | 1991
Robert J. Francis; Jonathan Rose; Zvonko G. Vranesic
A novel technology mapping algorithm that reduces the delay of combinational circuits implemented with lookup-table-based field-programmable gate arrays (FPGAs) is presented. The algorithm reduces the contribution of logic block delays to the critical path delay by reducing the number of lookup tables on the critical path. The key feature of the algorithm is the use of bin packing to determine the gate-level decomposition of every node in the network. In addition, reconvergent paths and the replication of logic at fanout nodes are exploited to further reduce the depth of the lookup table circuit. For fanout-free trees the algorithm will construct the optimal depth K-input table circuit when K is less than or equal to 6.<<ETX>>
design automation conference | 1983
Kuang-Wei Chiang; Zvonko G. Vranesic
This paper considers the problem of detecting faults in CMOS combinational networks. Effects of open and short faults in CMOS networks are analyzed. It is shown that the test sequence must be properly organized if the effects of all open faults are to be observable at the network output terminal. A simple and efficient heuristic method for organizing the test sequence to detect all single faults in a CMOS network is suggested.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 1988
Jonathan Rose; W.M. Snelgrove; Zvonko G. Vranesic
An algorithm called heuristic spanning creates parallelism by simultaneously investigating different areas of the plausible combinatorial search space. It is used to replace the high-temperature portion of simulated annealing. The low-temperature portion of simulated annealing is sped up by a technique called section annealing, in which placement is geographically divided and the pieces are assigned to separate processors. Each processor generates simulated-annealing-style moves for the cells in its area and communicates the moves to other processors as necessary. Heuristic spanning and section annealing are shown experimentally to converge to the same final cost function as regular simulated annealing. These approaches achieve significant speedup over uniprocessor simulated annealing, giving high-quality VLSI placement of standard cells in a short period of time. >
IEEE Transactions on Computers | 1970
Zvonko G. Vranesic; E.S. Lee; Kenneth C. Smith
Many-valued switching systems have been of considerable academic interest, despite the apparent inability to use them in practical applications. The main drawbacks, as pointed out by a number of authors, are the difficulties associated with the implementation of the functional basic set and the lack of adequate simplification techniques.