Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Hans Mulder is active.

Publication


Featured research published by Hans Mulder.


international symposium on microarchitecture | 2000

Introducing the IA-64 architecture

Jerome C. Huck; Dale C. Morris; Jonathan K. Ross; Allan Knies; Hans Mulder; Rumi Zahir

Microprocessors continue on the relentless path to provide more performance. Every new innovation in computing (distributed computing on the Internet, data mining, Java programming, and multimedia data streams) requires more cycles and computing power. Even traditional applications such as databases and numerically intensive codes present increasing problem sizes that drive demand for higher performance. Design innovations, compiler technology, manufacturing process improvements, and integrated circuit advances have been driving exponential performance increases in microprocessors. To continue this growth in the future, Hewlett-Packard and Intel architects examined barriers in contemporary designs and found that instruction-level parallelism (ILP) can be exploited for further performance increases. This article examines the motivation, operation, and benefits of the major features of IA-64. Intel's IA-64 manual provides a complete specification of the IA-64 architecture.
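One of the major IA-64 features through which the architects exposed ILP is predication. The sketch below is an illustrative Python model, not taken from the article: it shows how a branch can be rewritten into predicated operations, making the two arms independent of control flow and therefore schedulable in the same cycle by the hardware.

```python
def branchy(x):
    # Conventional control flow: the branch serializes instruction issue,
    # since neither arm can start before the condition resolves.
    if x >= 0:
        return x + 1
    else:
        return x - 1

def predicated(x):
    # Predicated form: compute the predicate once, then issue both updates.
    # On a predicated machine the two adds execute "under" complementary
    # predicates and the false one is squashed; here we model the final
    # select explicitly.
    p = x >= 0
    t_true = x + 1    # executes under p
    t_false = x - 1   # executes under not p
    return t_true if p else t_false
```

Both forms compute the same result; the predicated form simply removes the control dependency that blocks parallel issue.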


conference on high performance computing (supercomputing) | 1991

MOVE: a framework for high-performance processor design

Henk Corporaal; Hans Mulder

No abstract available


international symposium on microarchitecture | 1991

Software pipelining for transport-triggered architectures

Jan Hoogerbrugge; Henk Corporaal; Hans Mulder

This paper discusses software pipelining for a new class of architectures that we call transport-triggered. These architectures reduce the interconnection requirements between function units. They also exhibit code scheduling possibilities which are not available in traditional operation-triggered architectures. In addition, the scheduling freedom is extended by the use of so-called hybrid-pipelined function units. In order to exploit this freedom, existing scheduling techniques need to be extended. We present a software pipelining technique, based on Lam's algorithm, which exploits the potential of transport-triggered architectures. Performance results are presented for several benchmark loops. Depending on the available transport capacity, MFLOP rates may increase significantly as compared to scheduling without the extra degrees of freedom. As stated in [5], transport-triggered MOVE architectures have extra instruction scheduling degrees of freedom. This paper investigates if and how those extra degrees influence the software pipelining iteration initiation interval. It therefore adapts the existing algorithms for software pipelining as developed by Lam [2]. It is shown that transport-triggering may lead to a significant reduction of the iteration initiation interval and therefore to an increase of the MIPS and/or MFLOPS rate. The remainder of this paper starts with an introduction of the MOVE class of architectures; it clarifies the idea of transport-triggered architectures. Section 3 formulates the software pipelining problem and its algorithmic solution for transport-triggered architectures. Section 4 describes the architecture characteristics and benchmarks used for the measurements. In order to research the influence of the extra scheduling freedom, the algorithm has been applied to the benchmarks under different scheduling disciplines. Section 5 compares and analyzes the measurements. Finally, section 6 gives several conclusions and indicates further research to be done.
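The iteration initiation interval (II) that the paper studies has a well-known lower bound in modulo-scheduling approaches of Lam's kind: the larger of a resource-constrained bound and a recurrence-constrained bound. The Python sketch below computes only this lower bound; it is a rough illustration under invented inputs, not the paper's scheduling algorithm.

```python
import math

def res_mii(op_counts, unit_counts):
    """Resource-constrained bound: for each resource class, the number of
    operations needed per iteration divided by the units available."""
    return max(math.ceil(op_counts[r] / unit_counts[r]) for r in op_counts)

def rec_mii(cycles):
    """Recurrence-constrained bound: for each dependence cycle, the total
    latency around the cycle divided by its iteration distance."""
    return max(math.ceil(lat / dist) for lat, dist in cycles)

def mii(op_counts, unit_counts, cycles):
    """Minimum initiation interval: no schedule can repeat faster than this."""
    return max(res_mii(op_counts, unit_counts), rec_mii(cycles))
```

For example, a loop with 4 ALU ops and 2 memory ops per iteration on a machine with 2 ALUs and 1 memory port, plus one recurrence of latency 3 and distance 1, cannot initiate iterations faster than every 3 cycles. The extra transport-level scheduling freedom the paper describes helps the scheduler approach this bound.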


international symposium on microarchitecture | 1989

Cost-effective design of application specific VLIW processors using the SCARCE framework

Hans Mulder; R. J. Portier

Increasing the performance of application-specific processors by exploiting application-resident parallelism is often prohibited by cost, especially in the case of low-volume production. The flexibility of horizontal-microcoded machines allows these costs to be reduced, but the flexibility often reduces efficiency. VLIW is a new and promising concept for the design of low-cost, high-performance parallel computer systems. We suggest that the VLIW concept can also be used as a basis for the cost-effective design of application-specific processors which must exploit application-resident parallelism. The SCARCE (SCalable ARChitecture Experiment) framework, an approach for cost-effective design of application-specific processors, provides features which allow the design of retargetable VLIW architectures. However, a retargetable VLIW architecture is only effective if there is a retargetable VLIW compiler. Since a VLIW compiler is an essential part of a VLIW architecture, tradeoffs must be made between the variety of VLIW architectures and the compiler complexity. We suggest that limiting the flexibility of the retargetable VLIW architecture does not necessarily reduce the application space. This paper discusses the issues related to the design of a retargetable VLIW processor architecture and compiler within the SCARCE framework.
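The core task a retargetable VLIW compiler performs for each target is packing independent operations into long instruction words of the machine's issue width. The following is a minimal, hypothetical sketch in Python (not SCARCE's actual compiler): a greedy list scheduler over a dependence DAG, assuming unit-latency operations.

```python
def pack_vliw(ops, deps, width):
    """Greedily pack operations into VLIW bundles (long instruction words)
    of at most `width` slots. `deps[op]` lists operations that must complete
    in an earlier bundle; the dependence graph is assumed acyclic."""
    placed = {}                      # op -> index of the bundle holding it
    bundles = []
    remaining = list(ops)
    while remaining:
        bundle = []
        for op in list(remaining):
            preds = deps.get(op, ())
            # Ready only if every predecessor sits in a strictly earlier bundle.
            if all(p in placed and placed[p] < len(bundles) for p in preds):
                if len(bundle) < width:
                    bundle.append(op)
                    placed[op] = len(bundles)
                    remaining.remove(op)
        bundles.append(bundle)
    return bundles
```

Narrowing the machine (smaller `width`) lengthens the schedule; this is one axis of the architecture/compiler tradeoff the paper discusses.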


architectural support for programming languages and operating systems | 1989

Data buffering: run-time versus compile-time support

Hans Mulder

Data-dependency, branch, and memory-access penalties are the main constraints on the performance of high-speed microprocessors. The memory-access penalties concern both penalties imposed by external memory (e.g., cache) and penalties from underutilization of the local processor memory (e.g., registers). This paper focuses solely on methods of increasing the utilization of data memory local to the processor (registers or register-oriented buffers). A utilization increase of local processor memory is possible by means of compile-time software, run-time hardware, or a combination of both. This paper looks at data buffers which perform solely because of compile-time software (single register sets); those which operate mainly through hardware but with possible software assistance (multiple register sets); and those intended to operate transparently with main memory, implying no software assistance whatsoever (stack buffers). This paper shows that hardware buffering schemes cannot replace compile-time effort, but at most can reduce the complexity of this effort. It shows the utilization increase of applying register allocation to multiple register sets. The paper also shows a potential utilization decrease inherent to stack buffers. The observation that a single register set, allocated by means of interprocedural allocation, performs competitively with both multiple register sets and stack buffers emphasizes the significance of these conclusions.
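The run-time side of this tradeoff can be pictured with a toy model of multiple register sets managed as a window stack. The Python sketch below is illustrative only (invented events and policy, not the paper's methodology): it counts the memory traffic a window scheme incurs on a trace of calls and returns, traffic that compile-time allocation of a single set tries to avoid altogether.

```python
def window_traffic(trace, num_windows):
    """Count spills and fills for a register-window scheme on a trace of
    '+' (call) and '-' (return) events. When call depth exceeds the number
    of on-chip windows, the oldest window spills to memory; returning into
    a spilled window triggers a fill."""
    depth = 0      # live call frames
    resident = 0   # frames whose windows are currently on chip
    spills = fills = 0
    for ev in trace:
        if ev == '+':                        # procedure call
            depth += 1
            if resident == num_windows:      # all windows busy: evict oldest
                spills += 1
            else:
                resident += 1
        else:                                # procedure return
            depth -= 1
            resident -= 1
            if resident == 0 and depth > 0:  # caller's window was spilled
                fills += 1
                resident = 1
    return spills, fills
```

With enough windows for the call depth the traffic is zero; with fewer, every deep call/return pair pays a spill and a fill, which is the hardware cost the single, interprocedurally allocated register set sidesteps.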


IEEE Transactions on Computers | 1992

Processor architecture and data buffering

Hans Mulder; Michael J. Flynn

The tradeoff between visualizing or hiding the highest levels of the memory hierarchy, which impacts both performance and scalability, is examined by comparing a set of architectures from three major architecture families: stack, register, and memory-to-memory. The stack architecture is used as the reference. It is shown that scalable architectures require at least 32 words of local memory and therefore are not applicable for low-density technologies. It is also shown that software support can bridge the performance gap between scalable and nonscalable architectures. A register architecture with 32 words of local storage allocated interprocedurally outperforms scalable architectures with equal-sized local memories and even some with larger-sized local memories. When a small cache is added to an unscalable architecture, its performance advantage becomes significant.


international symposium on microarchitecture | 1989

A flexible VLSI core for an adaptable architecture

Hans Mulder; P. Stravers

Two major limitations concerning the design of cost-effective application-specific architectures are the recurrent costs of system-software development and hardware implementation, in particular VLSI implementation, for each architecture. The SCalable ARChitecture Experiment (SCARCE) aims to provide a framework for application-specific processor design. The framework allows scaling of functionality, implementation complexity, and performance. The SCARCE framework consists (and will consist) of: an architecture framework defining the constraints for the design of application-specific architectures; tools for synthesizing architectures from an application or application area; VLSI cell libraries and tools for quick generation of application-specific processors; and a system-software platform which can be retargeted quickly to fit the application-specific architecture. This paper concentrates on the micro-architecture framework of SCARCE and outlines the process of generating VLSI processors.


international conference on lightning protection | 1990

Sequential architecture models for Prolog: a performance comparison

Mark Korsloot; Hans Mulder

In this paper we investigate the relation between architectural support for Prolog and performance. We show that partial support for tags performs as well as full support, but that it reduces the execution time by only approximately 10%. With respect to special addressing modes, auto address modification (post/pre increment and decrement on loads and stores) yields a cycle reduction of only approximately 6%, and the introduction of a single shadow register set yields around 8%. Combining these optimizations, a performance gain of 20 to 25% can be achieved, depending on the memory system. Using VLIW techniques, which exploit instruction-level parallelism, the performance can be doubled using three processing elements. Two processing elements already provide a significant speedup, but the use of four processing elements is not justified if we compare the gain in performance with the cost of the extra hardware. In general we observe only a small performance improvement (around 20%) when moving from RISC to special-purpose RISC architectures, an improvement which can also be achieved by applying advanced compiler technology, such as compiler optimization, optimizations for the WAM, and optimal scheduling techniques for VLIW architectures. Unfortunately these hardware and software effects do not add up, as a better compiler reduces the effect of hardware support. Finally, the cycle time is essential for comparing the performance of different (micro-)architectures, but it is not always clear what the effects of the different tradeoffs are on the maximum achievable cycle time.
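Tag support, the first optimization discussed, concerns words whose low bits carry a type tag. The Python sketch below is a hypothetical model (the tag width and encoding are invented for illustration): without architectural support, each tag extraction or insertion costs explicit shift and mask instructions, which is the overhead the hardware support eliminates.

```python
# Invented 2-bit tag encoding for a tagged-word model of Prolog terms.
TAG_BITS = 2
TAG_INT, TAG_ATOM, TAG_REF, TAG_STRUCT = range(4)

def make_word(value, tag):
    # Tag insertion: shift the value up and OR in the tag bits.
    return (value << TAG_BITS) | tag

def tag_of(word):
    # Tag extraction: mask off the low tag bits (a dispatch on this tag
    # is the hot operation that tagged architectures accelerate).
    return word & ((1 << TAG_BITS) - 1)

def value_of(word):
    # Value extraction: strip the tag by shifting down.
    return word >> TAG_BITS
```

On a plain RISC, each of these helpers is a couple of ALU instructions per term touched; folding them into loads, stores, and branches is what "full support for tags" buys, and the paper finds this is worth only about 10% overall.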


Archive | 1996

Instruction prefetch mechanism utilizing a branch predict instruction

Tse-Yu Yeh; Mircea Poplingher; Kent Fielden; Hans Mulder; Rajiv Gupta; Dale C. Morris; Michael S. Schlansker


Archive | 1997

Instruction template for efficient processing clustered branch instructions

Harshvardhan Sharangpani; Michael Paul Corwin; Dale C. Morris; Kent Fielden; Tse-Yu Yeh; Hans Mulder; James M. Hull

Collaboration


Dive into Hans Mulder's collaborations.

Top Co-Authors


Henk Corporaal

Eindhoven University of Technology


Jan Hoogerbrugge

Delft University of Technology


Mark Korsloot

Delft University of Technology
