Wolfgang K. Giloi
Technical University of Berlin
Publication
Featured research published by Wolfgang K. Giloi.
parallel computing | 1988
Wolfgang K. Giloi
Abstract The designer of a numerical supercomputer is confronted with fundamental design decisions stemming from some basic dichotomies in supercomputer technology and architecture. On the side of the hardware technology there exists the dichotomy between the use of very high-speed circuitry or very large-scale integrated circuitry. On the side of the architecture there exists the dichotomy between the SIMD vector machine and the MIMD multiprocessor architecture. In the latter case, the ‘nodes’ of the system may communicate through shared memory, or each node has only private memory, and communication takes place through the exchange of messages. All these design decisions have implications with respect to performance, cost-effectiveness, software complexity, and fault-tolerance. In the paper the various dichotomies are discussed and a rationale is provided for the decision to realize the SUPRENUM supercomputer, a large ‘number cruncher’ with 5 Gflops peak performance, in the form of a massively parallel MIMD/SIMD multicomputer architecture. In its present incarnation, SUPRENUM is configurable to up to 256 nodes, where each node is a pipeline vector machine with 20 Mflops peak performance, IEEE double precision. The crucial issues of such an architecture, which we consider the trendsetter for future numerical supercomputer architecture in general, are on the hardware side the need for a bottleneck-free interconnection structure as well as the highest possible node performance obtained with the highest possible packaging density, in order to accommodate a node on a single circuit board. On the side of the system software the design goal is to obtain an adequately high degree of operational safety and data security with minimum software overhead. On the side of the user an appropriate program development environment must be provided.
Last but not least, the system must exhibit a high degree of fault tolerance, if for nothing else but for the sake of obtaining a sufficiently high MTBF. In the paper a detailed discussion of the hardware and software architecture of the SUPRENUM supercomputer, whose design is based upon the considerations discussed, is presented. A largely bottleneck-free interconnection structure is accomplished in a hierarchical manner: the machine consists of up to 16 ‘clusters’, and each cluster consists of 16 working ‘nodes’ plus some organisational nodes. The node is accommodated on a single circuit board; its architecture is based on the principle of data structure architecture explained in the paper. SUPRENUM is strictly a message-based system; consequently, the local node operating system has been designed to handle a secured message exchange with a considerable degree of hardware support and with the lowest possible software overhead. SUPRENUM is organized as a distributed system—a prerequisite for the high degree of fault tolerance required; therefore, there exists no centralized global operating system. The paper concludes with an outlook on the performance limits of a future supercomputer architecture of the SUPRENUM type.
distributed memory computing conference | 1991
Wolfgang K. Giloi; C. Hastedt; Friedrich Schön; Wolfgang Schröder-Preikschat
A virtual shared memory architecture (VSMA) is a distributed memory architecture that looks to the application software as if it were a shared memory system. The major problem with such a system is to maintain the coherence of the distributed data entities. Shared virtual memory means that the shared data entities are pages of local virtual memories with demand paging. Memory coherence may be strong or weak. Strong coherence is a scheme where all the shared data entities look from the outside as if they were stored in one coherent memory. This simplifies programming of a distributed memory system at the cost of a high message traffic in the system, needed to maintain the strong coherence. The efficiency of the system can be increased by adding a weak coherence scheme that allows for multiple writes by different threads of control into the same page. The price of the weak coherence scheme is the need for explicit program synchronization to re-establish, at the end, the strong coherence of the result. For the computer architect, the challenging question is how to implement a VSMA most efficiently and, specifically, by what architectural means to support the implementation. In the paper a new solution to this question is presented based upon an innovative distributed memory architecture in which communication is conducted by a dedicated communication processor in each node rather than by the node CPU. This makes the exchange of short, fixed-size messages, e.g., invalidation notices, very efficient. Therefore, it becomes more appropriate to minimize the overall administrative overhead, even at the cost of more message traffic. On that rationale, a novel, capability-based mechanism for both strong and weak coherence of shared virtual memory is presented. The weak coherence scheme is built on top of the strong coherence, utilizing its mechanisms. The proposed implementation is totally distributed and based on a strict need-to-know philosophy.
Consequently, the elaborate pointer lists and their runtime handling typical of other solutions are not needed.
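The contrast between strong and weak coherence described in this abstract can be illustrated with a toy model. The sketch below is an assumption-laden simplification (the class and method names are illustrative, not SUPRENUM's actual capability-based protocol): strong coherence invalidates all other copies on a write, while weak coherence lets several nodes write private copies and merges their changes only at an explicit synchronization point.

```python
# Toy model of strong vs. weak page coherence. Names and structure are
# illustrative assumptions, not the paper's actual mechanism.

class Page:
    def __init__(self, data):
        self.data = dict(data)

class StrongCoherence:
    """Single-writer/multiple-reader: a write invalidates all other copies."""
    def __init__(self, page):
        self.master = page
        self.copies = {}                  # node -> cached copy
    def read(self, node):
        self.copies[node] = Page(self.master.data)
        return self.copies[node]
    def write(self, node, key, value):
        # Invalidation notices to every other holder: the message-traffic cost.
        invalidated = [n for n in self.copies if n != node]
        for n in invalidated:
            del self.copies[n]
        self.master.data[key] = value
        return invalidated

class WeakCoherence:
    """Multiple writers modify private copies; strong coherence of the
    result is re-established only at explicit synchronization."""
    def __init__(self, page):
        self.master = page
        self.twins = {}                   # node -> (pristine twin, working copy)
    def acquire(self, node):
        twin, work = Page(self.master.data), Page(self.master.data)
        self.twins[node] = (twin, work)
        return work
    def synchronize(self):
        # Merge each node's diff against its twin back into the master page.
        for twin, work in self.twins.values():
            for k, v in work.data.items():
                if twin.data.get(k) != v:
                    self.master.data[k] = v
        self.twins.clear()
```

In this model the strong scheme pays in message traffic (one invalidation per copy holder per write), while the weak scheme pays in explicit synchronization, mirroring the trade-off the abstract describes.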
international symposium on computer architecture | 1978
Wolfgang K. Giloi; Helmut K. Berg
Computer architectures may be characterized by their operational principle and their physical structure. The paper defines these two characteristics for the novel concept of data structure architectures (DSAs). The representation and processing of arbitrary data structures in such a DSA is demonstrated by examples. It is shown how the functional requirements of a DSA can be satisfied by the specific information structure and the physical structure of the STARLET architecture introduced in preceding publications.
international conference on supercomputing | 1989
Wolfgang K. Giloi; W. Schroeder-Preikschat
The next generation of supercomputers will be largely parallel MIMD architectures, ranging in peak performance from 10 to 100 GFLOPS in the mid nineties to 1000 GFLOPS in the late nineties. Largely parallel means that such a system will consist of hundreds or thousands of processing nodes (PN), and each PN will have a peak performance of several hundred MFLOPS. Obtaining such an extremely high performance is not only an issue of appropriate node architecture but also requires a very high-bandwidth interconnection network and an extremely fast implementation of the interprocess communication (IPC) protocol. The paper deals with an IPC protocol implementation that reduces the communication startup time to approximately 20 microseconds, by combining highly efficient software solutions, given in the form of lightweight processes, with dedicated hardware, given in the form of a specific communication processor in each PN, to perform the rendezvous required between sender and receiver processes.
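The rendezvous between sender and receiver that this abstract mentions can be sketched in miniature. This is a minimal model, assuming ordinary threads stand in for lightweight processes and the dedicated communication processor; the class below is an illustration of the rendezvous discipline itself, not of the paper's hardware-supported implementation.

```python
# Minimal rendezvous sketch: the message is handed over only when both
# sender and receiver have arrived. Illustrative, not SUPRENUM's protocol.
import threading

class Rendezvous:
    def __init__(self):
        self._cond = threading.Condition()
        self._msg = None
        self._ready = False

    def send(self, msg):
        with self._cond:
            while self._ready:            # previous message not yet taken
                self._cond.wait()
            self._msg, self._ready = msg, True
            self._cond.notify_all()
            while self._ready:            # block until the receiver takes it
                self._cond.wait()

    def receive(self):
        with self._cond:
            while not self._ready:        # block until a sender arrives
                self._cond.wait()
            msg, self._msg, self._ready = self._msg, None, False
            self._cond.notify_all()
            return msg
```

The point of the hardware support described in the paper is to make exactly this meeting of sender and receiver cheap; in software alone, each blocking wait would cost a context switch.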
national computer conference | 1981
Wolfgang K. Giloi; Reinhold Gueth; Bruce D. Shriver
Microprogramming has become the means of implementing the machine language instructions of a conventional computer. In the future, the vertical migration of functions from the software levels of a system to the microprogramming level may become equally important. The vertical migration of functions of a computer is undertaken to realize architectures having improved performance, functionality, reliability, or data security. The increased volume of microcode brought about by vertical migration tends to increase the complexity of the firmware development process and calls for a firmware engineering discipline that provides tools for the design and specification, implementation, validation, and maintenance of firmware. We present a rationale for the specification and procedural design of firmware based on the use of an appropriately defined specification language. The features of such a language and the supporting software system are outlined and demonstrated by the example of an existing APL-based firmware development system.
parallel computing | 1994
Wolfgang K. Giloi
Abstract The paper presents a taxonomy of the existing forms of parallel computer architectures, based on the characteristics of the hardware architecture and the abstract machine layered upon it. The abstract machine reflects the programming models provided. The main classes of hardware architectures are: physically shared memory systems and distributed memory systems. Distributed memory systems may be remote memory access architectures or message passing architectures. The major forms of abstract machine architecture are: message passing systems and logically shared memory architectures. Three solutions for logically shared memory architectures are known: (1) distributed shared memory architectures, (2) multi-threaded architectures, and (3) virtual shared memory architectures. All three types are discussed in detail under the aspects of performance, programmability, and scalability, and their corresponding programming paradigms are characterized. The implications of the three concepts on node architecture and the requirements of latency minimization or latency hiding are discussed and illustrated by examples taken from pioneering realizations of the three kinds of architecture such as DASH, ∗T, and MANNA.
Future Generation Computer Systems | 1985
Wolfgang K. Giloi
Abstract The principle of having a processor manipulate the states of single memory words in a word-at-a-time fashion causes an ‘intellectual bottleneck’ for the programmer as well as a physical, performance limiting bottleneck for the executing machine. Functional programming mitigates the intellectual but not the physical bottleneck. The intellectual bottleneck problem can also be resolved within the framework of procedural programming by introducing into an appropriate programming language objects of data structure types as the entities to be manipulated. In order to also resolve the physical bottleneck problem, appropriate data structure objects should exist already at the hardware level as objects of machine data structure types that are recognized and manipulated by the machine. This approach allows the computer to deal efficiently with the parallelism inherent in data structure objects. In the paper, appropriate, generic machine data structure types are introduced, to provide the basis for the efficient representation and processing of arbitrary, application-oriented data types. The hardware representation of the machine data structure type is based upon the use of descriptor information. The resulting computer architecture is free of the “von Neumann bottleneck”, conceptually as well as physically.
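The descriptor-based representation this abstract argues for can be hinted at with a small sketch. All names here are assumptions made for illustration: the idea is only that each machine object carries a descriptor (type and structure information), so the machine can check compatibility and operate on the whole structure at once instead of word-at-a-time.

```python
# Illustrative sketch of a descriptor-tagged data structure object that a
# machine could manipulate as a whole. Names are assumptions, not the
# paper's actual machine data structure types.
from dataclasses import dataclass

@dataclass
class Descriptor:
    dtype: str       # element type tag
    shape: tuple     # structure information a DSA would hold in hardware

class DataObject:
    def __init__(self, dtype, shape, values):
        self.desc = Descriptor(dtype, shape)
        self.values = list(values)

def elementwise_add(a, b):
    # The descriptors make the whole-structure operation well defined and
    # expose the parallelism inherent in the data structure object.
    assert a.desc == b.desc, "incompatible data structure objects"
    return DataObject(a.desc.dtype, a.desc.shape,
                      [x + y for x, y in zip(a.values, b.values)])
```

In a real data structure architecture the descriptor check and the element-parallel operation would be performed by hardware; the sketch only shows how descriptor information removes the word-at-a-time view from the programming model.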
international symposium on computer architecture | 1983
Wolfgang K. Giloi; Peter M. Behr
An abstract view of a computer system is provided by a hierarchy of functions, ranging from the high-level operating system functions down to the primitive functions of the hardware. Vertical migration of high-level functions into the microcode of a CPU or horizontal migration of hardware functions out of the CPU into dedicated processors alone is not an adequate realization method for innovative computer architectures with complex functionality. In the paper, a new design principle called hierarchical function distribution is introduced to cope with the task of designing innovative multicomputer systems with complex functionality. The design rules of hierarchical function distribution are presented, and the advantages of the approach are discussed and illustrated by examples.
parallel computing | 1994
Wolfgang K. Giloi
Abstract SUPRENUM is a highly parallel supercomputer for numerical applications. The 5-GFLOPS peak performance of the 256-node system made it the most powerful MIMD architecture of the ‘first generation.’ Each node is a complete, single-board vector machine with 20 Mflops peak performance (IEEE double precision). SUPRENUM is a distributed memory architecture, resulting in a highly scalable system that can be made fault-tolerant. Message passing is accelerated by dedicated communication hardware in each node. Array access is performed by an ‘intelligent’ DMA address generator. The SUPRENUM architecture was the first to be based on a two-level interconnection structure, consisting of a number of clusters with each cluster consisting of a number of nodes. At the cluster level the nodes are interconnected by two very fast parallel buses. At the system level the clusters are interconnected by a torus structure consisting of serial ring buses. The nodes run under the proprietary, distributed PEACE operating system. Significant efforts were undertaken to make the system programmable, by providing a host of software tools, libraries, and application software packages. The paper discusses the rationale for the SUPRENUM architecture, the goals achieved, and the lessons learned.
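The two-level interconnection structure described above can be summarized in a few lines. The routing decision shown is an illustrative assumption (the abstract only states the topology, not the routing logic); the cluster and node counts are taken from the earlier SUPRENUM abstract in this listing.

```python
# Toy sketch of the two-level SUPRENUM-style topology: 16 clusters of
# 16 working nodes each. The routing rule is an illustrative assumption.
CLUSTERS, NODES_PER_CLUSTER = 16, 16

def route(src, dst):
    """Which interconnect level carries a message from src to dst,
    where an address is a (cluster, node) pair."""
    if src[0] == dst[0]:
        return "cluster bus"   # fast parallel bus within a cluster
    return "ring torus"        # serial ring buses between clusters
```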
Future Generation Computer Systems | 1987
J. Beer; Wolfgang K. Giloi
Abstract An architecture is presented for the parallel execution of sequential Prolog. The architecture is based on a pipeline of unification processors and designed to work as a co-processor to a more conventional, UNIX-based workstation. The unification processors execute highly optimized compiled Prolog code; however, the basic concept of the architecture could also increase the performance of interpreter-based systems. It will be shown that even programs that do not exhibit any of the ‘classical’ forms of parallelism (i.e. AND-, OR-parallelism, etc.) can be effectively mapped onto the proposed architecture. The presented architecture should also prove very effective as a multi-user Prolog machine executing several independent Prolog programs in parallel. In contrast to other attempts to execute sequential Prolog in parallel, we do not restrict the use of any of the standard Prolog language features such as dynamic assert/retract, CUT, etc. Simulation results show that peak execution rates of over 1000 KLIPS can be obtained.