Kemal Ebcioglu
IBM
Publications
Featured research published by Kemal Ebcioglu.
conference on object-oriented programming systems, languages, and applications | 2005
Philippe Charles; Christian Grothoff; Vijay A. Saraswat; Christopher Michael Donawa; Allan Kielstra; Kemal Ebcioglu; Christoph von Praun; Vivek Sarkar
It is now well established that the device scaling predicted by Moore's Law is no longer a viable option for increasing the clock frequency of future uniprocessor systems at the rate that had been sustained during the last two decades. As a result, future systems are rapidly moving from uniprocessor to multiprocessor configurations, so as to use parallelism instead of frequency scaling as the foundation for increased compute capacity. The dominant emerging multiprocessor structure for the future is a Non-Uniform Cluster Computing (NUCC) system with nodes that are built out of multi-core SMP chips with non-uniform memory hierarchies, and interconnected in horizontally scalable cluster configurations such as blade servers. Unlike previous generations of hardware evolution, this shift will have a major impact on existing software. Current OO language facilities for concurrent and distributed programming are inadequate for addressing the needs of NUCC systems because they do not support the notions of non-uniform data access within a node, or of tight coupling of distributed nodes. We have designed a modern object-oriented programming language, X10, for high performance, high productivity programming of NUCC systems. A member of the partitioned global address space family of languages, X10 highlights the explicit reification of locality in the form of places; lightweight activities embodied in async, future, foreach, and ateach constructs; a construct for termination detection (finish); the use of lock-free synchronization (atomic blocks); and the manipulation of cluster-wide global data structures. We present an overview of the X10 programming model and language, experience with our reference implementation, and results from some initial productivity comparisons between the X10 and Java™ languages.
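No X10 code appears in this summary; as a rough analogy only (Python with threads, not X10 and not the reference implementation), the finish/async/atomic pattern described above can be mimicked like this:

```python
import threading

class Finish:
    """Analogy to X10's `finish`: wait for all activities spawned in scope."""
    def __init__(self):
        self._threads = []

    def async_(self, fn, *args):
        # Analogy to X10's `async`: spawn a lightweight activity.
        t = threading.Thread(target=fn, args=args)
        t.start()
        self._threads.append(t)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        # Termination detection: block until every spawned activity ends.
        for t in self._threads:
            t.join()

counter = 0
lock = threading.Lock()  # stands in for X10's atomic blocks

def bump():
    global counter
    with lock:
        counter += 1

with Finish() as f:
    for _ in range(8):
        f.async_(bump)
# All 8 activities are guaranteed complete here, as after an X10 finish.
```

The analogy is loose: X10 activities are far lighter than OS threads, and X10 atomic blocks are a language construct rather than an explicit lock, but the scoping of termination detection is the same.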
international symposium on computer architecture | 1997
Kemal Ebcioglu; Erik R. Altman
Although VLIW architectures offer the advantages of simplicity of design and high issue rates, a major impediment to their use is that they are not compatible with the existing software base. We describe new simple hardware features for a VLIW machine we call DAISY (Dynamically Architected Instruction Set from Yorktown). DAISY is specifically intended to emulate existing architectures, so that all existing software for an old architecture (including operating system kernel code) runs without changes on the VLIW. Each time a new fragment of code is executed for the first time, the code is translated to VLIW primitives, parallelized and saved in a portion of main memory not visible to the old architecture, by a Virtual Machine Monitor (software) residing in read only memory. Subsequent executions of the same fragment do not require a translation (unless cast out). We discuss the architectural requirements for such a VLIW, to deal with issues including self-modifying code, precise exceptions, and aggressive reordering of memory references in the presence of strong MP consistency and memory mapped I/O. We have implemented the dynamic parallelization algorithms for the PowerPC architecture. The initial results show high degrees of instruction level parallelism with reasonable translation overhead and memory usage.
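The translate-once, reuse-thereafter flow can be sketched as follows (hypothetical fragment addresses and opcodes; this is not DAISY's actual translator, which parallelizes into VLIW primitives):

```python
# Translation cache: fragment address -> saved "VLIW" translation,
# standing in for the memory region hidden from the old architecture.
translation_cache = {}

def translate(fragment):
    # Stand-in for DAISY's binary translation and parallelization.
    return [("vliw_op", insn) for insn in fragment]

def execute(addr, fragment):
    if addr not in translation_cache:      # first execution: translate
        translation_cache[addr] = translate(fragment)
    return translation_cache[addr]         # later executions: reuse

prog = {0x100: ["lwz", "add", "stw"]}      # made-up PowerPC-like fragment
first = execute(0x100, prog[0x100])
second = execute(0x100, prog[0x100])
# The second execution returns the cached translation with no retranslation.
```

A real system must also invalidate cached translations, e.g. when a fragment is cast out or modified by self-modifying code.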
IEEE Transactions on Computers | 2001
Kemal Ebcioglu; Erik R. Altman; Michael Karl Gschwind; Sumedh W. Sathaye
We describe a VLIW architecture designed specifically as a target for dynamic compilation of an existing instruction set architecture. This design approach offers the simplicity and high performance of statically scheduled architectures, achieves compatibility with an established architecture, and makes use of dynamic adaptation. Thus, the original architecture is implemented using dynamic compilation, a process we refer to as DAISY (Dynamically Architected Instruction Set from Yorktown). The dynamic compiler exploits runtime profile information to optimize translations so as to extract instruction level parallelism. This paper reports different design trade-offs in the DAISY system and their impact on final system performance. The results show high degrees of instruction parallelism with reasonable translation overhead and memory usage.
international symposium on microarchitecture | 1987
Kemal Ebcioglu
We describe a compilation algorithm for efficient software pipelining of general inner loops, where the number of iterations and the time taken by each iteration may be unpredictable, due to arbitrary if-then-else statements and conditional exit statements within the loop. As our target machine, we assume a wide instruction word architecture that allows multi-way branching in the form of if-then-else trees, and that allows conditional register transfers depending on where the microinstruction branches to (a hardware implementation proposal for such a machine is briefly described in the paper). Our compilation algorithm, which we call the pipeline scheduling technique, produces a software-pipelined version of a given inner loop, which allows a new iteration of the loop to begin on every cycle whenever dependencies and resources permit. The correctness and termination properties of the algorithm are studied in the paper.
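The core idea of overlapping iterations can be illustrated with a minimal sketch (this is plain fixed-interval pipelining under assumed latencies, not the paper's pipeline scheduling technique, which handles conditionals and exits):

```python
# Each iteration is a dependence chain of 4 single-cycle ops; the machine
# issues 2 ops per cycle, so a new iteration can start every 2 cycles.
chain_len = 4
issue_width = 2
II = -(-chain_len // issue_width)  # resource-bound initiation interval = 2

def pipelined_schedule(n_iters):
    """Map (iteration, op) pairs to issue cycles: iter i starts at i*II."""
    sched = {}
    for i in range(n_iters):
        for op in range(chain_len):
            sched.setdefault(i * II + op, []).append((i, op))
    return sched

s = pipelined_schedule(3)
# Cycle 2 issues op 2 of iteration 0 alongside op 0 of iteration 1:
# the iterations overlap, and no cycle exceeds the issue width.
```

With unpredictable control flow inside the loop body, as in the paper, the schedule cannot be a fixed table like this; that is precisely what the pipeline scheduling technique addresses.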
IEEE Computer | 1993
Gabriel M. Silberman; Kemal Ebcioglu
An architectural framework that allows software applications and operating system code written for a given instruction set to migrate to different, higher performance architectures is described. The framework provides a hardware mechanism that enhances application performance while keeping the same program behavior from a user perspective. The framework is designed to accommodate program exceptions, self-modifying code, tracing, and debugging. Examples are given for IBM System/390 operating-system code and AIX utilities, showing the performance potential of the scheme using a very long instruction word (VLIW) machine as the high-performance target architecture.
international conference on parallel architectures and compilation techniques | 1999
Byung-Sun Yang; Soo-Mook Moon; Seong-Bae Park; Junpyo Lee; Seungil Lee; Jinpyo Park; Yoo C. Chung; Suhyun Kim; Kemal Ebcioglu; Erik R. Altman
For network computing on desktop machines, fast execution of Java bytecode programs is essential because these machines are expected to run substantial application programs written in Java. Higher Java performance can be achieved by just-in-time (JIT) compilers which translate the stack-based bytecode into register-based machine code on demand. One crucial problem in Java JIT compilation is how to map and allocate stack entries and local variables into registers efficiently and quickly, so as to improve the Java performance. This paper introduces LaTTe, a Java JIT compiler that performs fast and efficient register mapping and allocation for RISC machines. LaTTe first translates the bytecode into pseudo RISC code with symbolic registers, which is then register allocated while coalescing those copies corresponding to pushes and pops between local variables and the stack. The LaTTe JVM also includes an enhanced object model, a lightweight monitor, a fast mark-and-sweep garbage collector, and an on-demand exception handling mechanism, all of which are closely coordinated with LaTTe's JIT compilation.
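The effect of coalescing push/pop copies can be sketched with a toy translator (a simplified illustration over a tiny assumed bytecode subset, not LaTTe's actual algorithm): instead of emitting a copy for each `iload` push, the local's symbolic register is placed directly on the simulated stack.

```python
def translate(bytecode):
    """Translate stack bytecode to pseudo-RISC, coalescing push copies."""
    stack, code = [], []
    for insn, *arg in bytecode:
        if insn == "iload":
            # Coalesced: push the local's symbolic register, emit no copy.
            stack.append(f"l{arg[0]}")
        elif insn == "iadd":
            b, a = stack.pop(), stack.pop()
            stack.append("t0")
            code.append(f"add t0, {a}, {b}")
        elif insn == "istore":
            code.append(f"mov l{arg[0]}, {stack.pop()}")
    return code

risc = translate([("iload", 1), ("iload", 2), ("iadd",), ("istore", 3)])
# Four bytecodes become two RISC ops; the two loads cost no instructions.
```

A production JIT must of course also handle control-flow merges, where stack states from different predecessors have to be reconciled; this sketch ignores that.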
high performance computer architecture | 2001
Krishnan K. Kailas; Kemal Ebcioglu; Ashok K. Agrawala
Clustered ILP processors are characterized by a large number of non-centralized on-chip resources grouped into clusters. Traditional code generation schemes for these processors consist of multiple phases for cluster assignment, register allocation and instruction scheduling. Most of these approaches need additional re-scheduling phases because they often do not impose finite resource constraints in all phases of code generation. These phase-ordered solutions have several drawbacks, resulting in the generation of poor performance code. Moreover, the iterative/back-tracking algorithms used in some of these schemes have long running times. In this paper we present CARS, a code generation framework for Clustered ILP processors, which combines the cluster assignment, register allocation, and instruction scheduling phases into a single code generation phase, thereby eliminating the problems associated with phase-ordered solutions. The CARS algorithm explicitly takes into account all the resource constraints at each cluster scheduling step to reduce spilling and to avoid iterative re-scheduling steps. We also present a new on-the-fly register allocation scheme developed for CARS. We describe an implementation of the proposed code generation framework and the results of a performance evaluation study using the SPEC95/2000 and MediaBench benchmarks.
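A single-pass scheme of this flavor can be sketched greedily (a much-simplified illustration, not the CARS algorithm: no dependences, register pressure, or inter-cluster copies are modeled): each operation is assigned to a cluster and a cycle in one step, respecting per-cluster issue slots.

```python
def schedule(ops, n_clusters=2, slots_per_cluster=1):
    """Assign each op a (cluster, cycle) in one pass under slot limits."""
    cycle, out = 0, []
    used = [0] * n_clusters  # slots consumed per cluster this cycle
    for op in ops:
        # Pick any cluster with a free slot at the current cycle.
        c = next((i for i in range(n_clusters)
                  if used[i] < slots_per_cluster), None)
        if c is None:                 # every cluster is full: advance cycle
            cycle += 1
            used = [0] * n_clusters
            c = 0
        used[c] += 1
        out.append((op, c, cycle))
    return out

result = schedule(["a", "b", "c"])
# "a" and "b" fill clusters 0 and 1 in cycle 0; "c" opens cycle 1.
```

Because the resource check happens at the moment of assignment, no later re-scheduling pass is needed; that is the property the unified-phase approach is after.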
international conference on supercomputing | 1989
Kemal Ebcioglu; Alexandru Nicolau
This paper presents a new approach to resource-constrained compiler extraction of fine-grain parallelism, targeted towards VLIW supercomputers, and in particular, the IBM VLIW (Very Large Instruction Word) processor. The algorithms described integrate resource limitations into Percolation Scheduling—a global parallelization technique—to deal with resource constraints, without sacrificing the generality and completeness of Percolation Scheduling in the process. This is in sharp contrast with previous approaches which either applied only to conditional-free code, or drastically limited the parallelization process by imposing relatively local heuristic resource constraints early in the scheduling process.
Computer Music Journal | 1992
Mira Balaban; Kemal Ebcioglu; Otto E. Laske
Part 1, Two views on cognitive musicology: artificial intelligence and music - cornerstone of cognitive musicology (O. Laske); beyond computational musicology (P. Kugel).
Part 2, General problems in modeling musical activities: representing listening behaviour - problems and prospects (S.W. Smoliar); symbolic and sonic representations of sound-object structures (B. Bel); music structures - interleaving the temporal and hierarchical aspects in music (B. Balaban); on designing a typed music language (E.B. Blevis, et al); logical representation and induction for computer assisted composition (F. Courtot).
Part 3, Music composition: cybernetic composer - an overview (C. Ames and M. Domino); Wolfgang - a system using emoting potentials to manage musical design (R.D. Riecken); on the application of problem reduction search to automated composition (S.C. Marsella and C.F. Schmidt); the observer tradition of knowledge acquisition (O. Laske).
Part 4, Analysis: an expert system for harmonizing chorales in the style of J.S. Bach (K. Ebcioglu); an expert system for harmonic analysis of tonal music (H.J. Maxwell); on the algorithmic representation of musical style (D. Cope).
Part 5, Performance: Bol processor grammars (B. Bel and J. Kippen); a new approach to music through vision (S. Ohteru and S. Hashimoto).
Part 6, Perception: analyzing and representing musical rhythm (C. Linster); on the perception of metre (B.O. Miller, et al); the quantization problem - traditional and connectionist approaches (P. Desain and H. Honing).
Part 7, Learning and tutoring: an architecture for an intelligent tutoring system (M.J. Baker); a knowledge intensive approach to machine learning in tonal music (G. Widmer).
international conference on computer design | 1998
Kemal Ebcioglu; Jason E. Fritts; Stephen V. Kosonocky; Michael Karl Gschwind; Erik R. Altman; Krishnan K. Kailas; Terry Bright
Presented is an 8-issue tree-VLIW processor designed for efficient support of dynamic binary translation. This processor confronts two primary problems faced by VLIW architectures: binary compatibility and branch performance. Binary compatibility with existing architectures is achieved through dynamic binary translation which translates and schedules PowerPC instructions to take advantage of the available instruction level parallelism. Efficient branch performance is achieved through tree instructions that support multi-way path and branch selection within a single VLIW instruction. The processor architecture is described, along with design details of the branch unit, pipeline, register file and memory hierarchy for a 0.25 micron standard-cell design. Performance simulations show that the simplicity of a VLIW architecture allows a wide-issue processor to operate at high frequencies.
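The multi-way selection that tree instructions provide can be sketched as evaluating a small condition tree in one step (an assumed encoding for illustration; the actual hardware branch unit is described in the paper):

```python
def select_path(tree, regs):
    """Walk a condition tree to one of several branch targets.

    tree is either a target label, or a tuple
    ("cond", reg_name, taken_subtree, fallthrough_subtree).
    """
    while isinstance(tree, tuple):
        _, reg, taken, fall = tree
        tree = taken if regs[reg] else fall
    return tree

# Two conditions and three possible targets, all resolved by one
# tree-instruction evaluation rather than two sequential branches.
tree = ("cond", "c0", "L1", ("cond", "c1", "L2", "L3"))
target = select_path(tree, {"c0": 0, "c1": 1})  # takes the L2 path
```

In hardware, all paths of the tree are examined in parallel within the cycle; the sequential walk here only models the selected outcome.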