
Publications


Featured research published by Constantine D. Polychronopoulos.


IEEE Transactions on Computers | 1987

Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers

Constantine D. Polychronopoulos; David J. Kuck

This paper proposes guided self-scheduling, a new approach for scheduling arbitrarily nested parallel program loops on shared memory multiprocessor systems. Utilizing loop parallelism is clearly most crucial in achieving high system and program performance. Because of its simplicity, guided self-scheduling is particularly suited for implementation on real parallel machines. This method achieves simultaneously the two most important objectives: load balancing and very low synchronization overhead. For certain types of loops we show analytically that guided self-scheduling uses minimal overhead and achieves optimal schedules. Two other interesting properties of this method are its insensitivity to the initial processor configuration (in time) and its parameterized nature which allows us to tune it for different systems. Finally we discuss experimental results that clearly show the advantage of guided self-scheduling over the most widely known dynamic methods.
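The heart of guided self-scheduling is its chunk-size rule: each time a processor becomes idle it claims roughly ceil(R/P) of the R remaining iterations, so chunks start large and shrink to single iterations near the end of the loop. A minimal Python sketch of that rule follows; the function and the 100-iteration example are illustrative only, not taken from the paper.

import math

def guided_chunks(total_iterations: int, num_processors: int):
    """Yield the chunk sizes guided self-scheduling would hand out.

    Each time an idle processor requests work it receives
    ceil(R / P) iterations, where R is the number of iterations still
    unassigned and P is the number of processors.  Chunks therefore
    start large and shrink toward 1, which balances load while keeping
    the number of scheduling operations low.
    """
    remaining = total_iterations
    while remaining > 0:
        chunk = max(1, math.ceil(remaining / num_processors))
        yield chunk
        remaining -= chunk

if __name__ == "__main__":
    # 100 iterations on 4 processors: chunks start at 25 and taper to 1,
    # and they sum to exactly 100.
    print(list(guided_chunks(100, 4)))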


International Journal of High Speed Computing | 1989

Parafrase-2: an environment for parallelizing, partitioning, synchronizing, and scheduling programs on multiprocessors

Constantine D. Polychronopoulos; Milind Girkar; Mohammad R. Haghighat; Chia Ling Lee; Bruce Leung; Dale Schouten

Parafrase-2 is a vectorizing/parallelizing compiler implemented as a source-to-source code restructurer. This paper discusses the organization of Parafrase-2 and the goals of the project. Specific topics discussed are: dependence analysis, timing and overhead analysis, interprocedural analysis, automatic scheduling, and the graphical user interface.


IEEE Transactions on Parallel and Distributed Systems | 1992

Automatic extraction of functional parallelism from ordinary programs

Milind Girkar; Constantine D. Polychronopoulos

This paper presents the hierarchical task graph (HTG) as an intermediate parallel program representation which encapsulates minimal data and control dependences, and which can be used for the extraction and exploitation of functional, or task-level, parallelism. The hierarchical nature of the HTG facilitates efficient task-granularity control during code generation, and thus applicability to a variety of parallel architectures. The construction of the HTG at a given hierarchy level, the derivation of the execution conditions of tasks which maximize task-level parallelism, and the optimization of these conditions to reduce the synchronization overhead imposed by data and control dependences are emphasized. Algorithms for the formation of tasks and their execution conditions based on data and control dependence constraints are presented. The issue of optimizing such conditions is discussed, and optimization algorithms are proposed. The HTG is used as the intermediate representation of parallel Fortran and C programs for generating parallel source code as well as parallel machine code.
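As a rough illustration of the task-level parallelism such a representation exposes, the sketch below flattens the idea into a plain task graph and groups tasks into waves that may run concurrently; this simplification is mine and is not the paper's HTG construction or condition-optimization algorithm.

from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    preds: set = field(default_factory=set)  # tasks this one depends on (data/control)

def parallel_waves(tasks):
    """Group tasks into waves; all tasks in a wave can execute in parallel."""
    done, waves = set(), []
    pending = {t.name: t for t in tasks}
    while pending:
        ready = [n for n, t in pending.items() if t.preds <= done]
        if not ready:
            raise ValueError("cyclic dependences")
        waves.append(sorted(ready))
        done |= set(ready)
        for n in ready:
            del pending[n]
    return waves

if __name__ == "__main__":
    g = [Task("read"), Task("fft", {"read"}), Task("stats", {"read"}),
         Task("write", {"fft", "stats"})]
    print(parallel_waves(g))  # [['read'], ['fft', 'stats'], ['write']]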


Architectural Support for Programming Languages and Operating Systems | 1988

Compiler optimizations for enhancing parallelism and their impact on architecture design

Constantine D. Polychronopoulos

By examining the structure and characteristics of parallel programs, the author isolates potential sources of overhead. The first compiler optimization considered is cycle shrinking, which can be used to parallelize certain types of serial loops. Run-time dependence analysis is then considered, along with how it can be performed through compiler-inserted bookkeeping and control statements. Loops with unstructured parallelism, which cannot benefit from existing optimizations, can be parallelized through run-time dependence checking. Finally, barrier synchronization is discussed as one of the most serious sources of run-time overhead in parallel programs. To reduce the impact of barriers, the author briefly discusses the implementation of distributed barriers through the use of a set of shared registers.
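Cycle shrinking exploits a constant cross-iteration dependence distance d: the loop is split into a serial outer loop of stride d and inner blocks of d mutually independent iterations that can run in parallel. The toy sketch below shows the standard textbook form of the transformation; it is not code from the paper.

def serial(a, d, c):
    # original loop: a cross-iteration dependence of constant distance d
    for i in range(d, len(a)):
        a[i] = a[i - d] + c
    return a

def cycle_shrunk(a, d, c):
    n = len(a)
    for j in range(d, n, d):                 # outer loop stays serial
        # these d iterations are mutually independent: each reads a[i - d],
        # which lies outside the current block, so they could run as a doall
        for i in range(j, min(j + d, n)):
            a[i] = a[i - d] + c
    return a

if __name__ == "__main__":
    base = list(range(10))
    assert serial(base[:], 3, 100) == cycle_shrunk(base[:], 3, 100)
    print("results match; inner blocks of 3 iterations are parallelizable")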


International Symposium on Low Power Electronics and Design | 1999

Using dynamic cache management techniques to reduce energy in a high-performance processor

Nikolaos Bellas; Ibrahim N. Hajj; Constantine D. Polychronopoulos

In this paper, we propose a technique that uses an additional mini cache, the L0-Cache, located between the instruction cache (I-Cache) and the CPU core. This mechanism can provide the instruction stream to the data path and, when managed properly, it can effectively eliminate the need for high utilization of the more expensive I-Cache. In this work, we propose, implement, and evaluate a series of run-time techniques for dynamic analysis of the program's instruction access behavior, which are then used to proactively guide the accesses to the L0-Cache. The basic idea is that only the most frequently executed portions of the code should be stored in the L0-Cache, since this is where the program spends most of its time. We present experimental results to evaluate the effectiveness of our scheme in terms of performance and energy dissipation for a series of SPEC95 benchmarks. We also discuss the performance and energy tradeoffs that are involved in these dynamic schemes.
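The policy can be caricatured in a few lines: watch how often each basic block executes and serve only the proven hot blocks from the small cache. The fixed frequency threshold below is my own stand-in for the paper's run-time selection heuristics, and basic blocks are reduced to labels.

from collections import Counter

def route_fetches(trace, hot_threshold=8):
    """Return the fraction of instruction fetches served by the L0-Cache."""
    counts = Counter()
    served_by_l0 = 0
    for block in trace:
        counts[block] += 1
        if counts[block] > hot_threshold:  # block has proven itself hot
            served_by_l0 += 1              # fetch comes from the small L0-Cache
        # otherwise the fetch goes to the larger, more expensive I-Cache
    return served_by_l0 / len(trace)

if __name__ == "__main__":
    # A loop-dominated trace: block "B2" is the hot loop body.
    trace = ["B0", "B1"] + ["B2"] * 200 + ["B3"]
    print(f"{route_fetches(trace):.0%} of fetches served by the L0-Cache")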


Conference on High Performance Computing (Supercomputing) | 1990

Fast barrier synchronization hardware

Carl J. Beckmann; Constantine D. Polychronopoulos

A special-purpose hardware scheme uniquely tailored to barrier synchronization is presented. It allows barrier synchronization to be performed within a single instruction cycle for moderately sized systems, and is scalable with logarithmic increase in synchronization time. It supports a large number of concurrent barriers, and can also be used to support a number of different barrier synchronization schemes. The hardware is relatively simple and inexpensive. Simulation results have shown that, under reasonable assumptions, it can decrease typical parallel-loop execution time significantly, especially for fine-grained and statically scheduled loops.
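For contrast, a conventional software barrier makes visible the shared-counter updates and waiting that the proposed hardware collapses into a single instruction cycle. The sense-reversing barrier below is a generic textbook construction, not anything described in the paper.

import threading

class SenseBarrier:
    """Centralized sense-reversing barrier built on a lock and condition."""

    def __init__(self, n):
        self.n = n
        self.count = 0
        self.sense = False
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            my_sense = not self.sense
            self.count += 1
            if self.count == self.n:       # last arrival releases everyone
                self.count = 0
                self.sense = my_sense
                self.cond.notify_all()
            else:                          # earlier arrivals wait on the condition
                self.cond.wait_for(lambda: self.sense == my_sense)

if __name__ == "__main__":
    barrier = SenseBarrier(4)

    def worker(i):
        print(f"thread {i} before barrier")
        barrier.wait()
        print(f"thread {i} after barrier")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()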


International Conference on Computer Design | 1999

Energy and performance improvements in microprocessor design using a loop cache

Nikolaos Bellas; Ibrahim N. Hajj; Constantine D. Polychronopoulos; George Stamoulis

Energy dissipated in on-chip caches represents a substantial portion of the energy budget of today's processors. Extrapolating current trends, this portion is likely to increase in the near future, since the devices devoted to the caches occupy an increasingly large percentage of the total chip area. We extend the work proposed by J. Kin et al. (1997), in which an extra, small cache (called a filter cache) is inserted between the CPU data path and the L1 cache and serves to filter most of the references initiated by the CPU. In our scheme, the compiler is used to generate code that exploits the new memory hierarchy and reduces the likelihood of a miss in the extra cache. Experimental results across a wide range of SPEC95 benchmarks show that this cache, which we call the L-Cache, has a small performance overhead with respect to the scheme without any extra caches, and provides substantial energy savings. The L-Cache is placed between the CPU and the I-Cache; the D-Cache subsystem is not modified. Since the L-Cache is much smaller, and thus has a smaller access time, than the I-Cache, this scheme can also be used for performance improvement provided that the hit rate in the L-Cache is very high. Our experimental results show that the L-Cache does indeed improve performance in some cases.
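The compiler's task can be pictured as deciding which basic blocks deserve the scarce L-Cache space. The greedy, profile-driven selection below is purely hypothetical, intended only to convey the shape of the decision; it is not the selection algorithm described in the paper.

def pick_lcache_blocks(blocks, cache_bytes):
    """blocks: list of (name, exec_count, size_bytes); returns the chosen names."""
    chosen, used = [], 0
    # Favor blocks with the most executions per byte of cache they would occupy.
    for name, count, size in sorted(blocks, key=lambda b: b[1] / b[2], reverse=True):
        if used + size <= cache_bytes:
            chosen.append(name)
            used += size
    return chosen

if __name__ == "__main__":
    # Hypothetical profile: (basic block, execution count, size in bytes).
    profile = [("inner_kernel", 200_000, 192), ("loop_body", 90_000, 128),
               ("error_path", 3, 64), ("init", 1, 256)]
    print(pick_lcache_blocks(profile, cache_bytes=256))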


International Journal of Parallel Programming | 1994

The hierarchical task graph as a universal intermediate representation

Milind Girkar; Constantine D. Polychronopoulos

This paper presents an intermediate program representation called the Hierarchical Task Graph (HTG), and argues that it is not only suitable as the basis for program optimization and code generation, but that it fully encapsulates program parallelism at all levels of granularity. As such, the HTG can be used as the basis for a variety of restructuring and optimization techniques, and hence as the target for front-end compilers as well as the input to source and code generators. Our implementation and testing of the HTG in the Parafrase-2 compiler has demonstrated its suitability and versatility as a potentially universal intermediate representation. In addition to encapsulating semantic information, data and control dependences, the HTG provides more information vital to efficient code generation and optimizations related to parallel code generation. In particular, we introduce the notion of precedence between nodes of the structure, whose grain size can range from atomic operations to entire subprograms.


Conference on High Performance Computing (Supercomputing) | 2000

Is Data Distribution Necessary in OpenMP?

Dimitrios S. Nikolopoulos; Theodore S. Papatheodorou; Constantine D. Polychronopoulos; Jesús Labarta; Eduard Ayguadé

This paper investigates the performance implications of data placement in OpenMP programs running on modern ccNUMA multiprocessors. Data locality and minimization of the rate of remote memory accesses are critical for sustaining high performance on these systems. We show that, due to the low remote-to-local memory access latency ratio of state-of-the-art ccNUMA architectures, reasonably balanced page placement schemes, such as round-robin or random distribution of pages, incur modest performance losses. We also show that performance leaks stemming from suboptimal page placement schemes can be remedied with a smart user-level page migration engine. The main body of the paper describes how the OpenMP runtime environment can use page migration to implement implicit data distribution and redistribution schemes without programmer intervention. Our experimental results support the effectiveness of these mechanisms and provide a proof of concept that there is no need to introduce data distribution directives in OpenMP, thus preserving the portability of the programming model.
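The placement schemes in question are simple mappings from virtual pages to memory nodes. The sketch below pairs round-robin placement with a BLOCK-style distribution for contrast; pages and nodes are plain integers, no real NUMA interface is called, and the block variant is my own addition rather than something evaluated in the paper.

def round_robin_home(page: int, nodes: int) -> int:
    """Spread consecutive pages across nodes, balancing memory load."""
    return page % nodes

def block_home(page: int, total_pages: int, nodes: int) -> int:
    """Give each node one contiguous range of pages, mimicking a BLOCK distribution."""
    pages_per_node = -(-total_pages // nodes)  # ceiling division
    return page // pages_per_node

if __name__ == "__main__":
    total_pages, nodes = 16, 4
    print("round-robin:", [round_robin_home(p, nodes) for p in range(total_pages)])
    print("block      :", [block_home(p, total_pages, nodes) for p in range(total_pages)])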


International Conference on Parallel Processing | 2000

User-level dynamic page migration for multiprogrammed shared-memory multiprocessors

Dimitrios S. Nikolopoulos; Theodore S. Papatheodorou; Constantine D. Polychronopoulos; Jesús Labarta; Eduard Ayguadé

This paper presents algorithms for improving the performance of parallel programs on multiprogrammed shared-memory NUMA multiprocessors, via the use of user-level dynamic page migration. The idea that drives the algorithms is that a page migration engine can perform accurate and timely page migrations in a multiprogrammed system if it can correlate page reference information with scheduling information obtained from the operating system. The necessary page migrations can be performed as a response to scheduling events that break the implicit association between threads and their memory affinity sets. We present two algorithms that use feedback from the kernel scheduler to aggressively migrate pages upon thread migrations. The first algorithm exploits the iterative nature of parallel programs, while the second targets generic codes without making assumptions on their structure. Performance evaluation on an SGI Origin2000 shows that our page migration algorithms provide substantial improvements in throughput of up to 264% compared to the native IRIX 6.5.5 page placement and migration schemes.
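A counter-based sketch of the idea, under assumptions of my own rather than the paper's implementation: keep per-node reference counters for every page and migrate a page to the node that dominates its references once the imbalance crosses a threshold, the kind of decision that becomes necessary when the scheduler moves a thread away from its memory affinity set.

from collections import defaultdict

class PageMigrationEngine:
    def __init__(self, threshold=2.0):
        self.refs = defaultdict(lambda: defaultdict(int))  # page -> node -> count
        self.home = {}                                     # page -> current home node
        self.threshold = threshold

    def record_access(self, page, node):
        self.refs[page][node] += 1
        self.home.setdefault(page, node)   # first touch sets the home node
        self._maybe_migrate(page)

    def _maybe_migrate(self, page):
        counts = self.refs[page]
        hottest = max(counts, key=counts.get)
        home = self.home[page]
        if hottest != home and counts[hottest] >= self.threshold * counts[home]:
            self.home[page] = hottest      # move the page next to its heaviest user
            print(f"migrate page {page}: node {home} -> node {hottest}")

if __name__ == "__main__":
    engine = PageMigrationEngine()
    engine.record_access(7, node=0)        # first touch places page 7 on node 0
    for _ in range(5):                     # a thread now running on node 2 dominates
        engine.record_access(7, node=2)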

Collaboration


Dive into Constantine D. Polychronopoulos's collaborations.

Top Co-Authors

Eduard Ayguadé, Barcelona Supercomputing Center
Jesús Labarta, Barcelona Supercomputing Center