Grigorios Magklis | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Grigorios Magklis is active.

Explore More

Publication

Featured researches published by Grigorios Magklis.

high-performance computer architecture | 2002

Energy-efficient processor design using multiple clock domains with dynamic voltage and frequency scaling

Greg Semeraro; Grigorios Magklis; Rajeev Balasubramonian; David H. Albonesi; Sandhya Dwarkadas; Michael L. Scott

As clock frequency increases and feature size decreases, clock distribution and wire delays present a growing challenge to the designers of singly-clocked, globally synchronous systems. We describe an alternative approach, which we call a multiple clock domain (MCD) processor, in which the chip is divided into several clock domains, within which independent voltage and frequency scaling can be performed. Boundaries between domains are chosen to exploit existing queues, thereby minimizing inter-domain synchronization costs. We propose four clock domains, corresponding to the front end , integer units, floating point units, and load-store units. We evaluate this design using a simulation infrastructure based on SimpleScalar and Wattch. In an attempt to quantify potential energy savings independent of any particular on-line control strategy, we use off-line analysis of traces from a single-speed run of each of our benchmark applications to identify profitable reconfiguration points for a subsequent dynamic scaling run. Using applications from the MediaBench, Olden, and SPEC2000 benchmark suites, we obtain an average energy-delay product improvement of 20% with MCD compared to a modest 3% savings from voltage scaling a single clock and voltage system.

international symposium on microarchitecture | 2002

Dynamic frequency and voltage control for a multiple clock domain microarchitecture

Greg Semeraro; David H. Albonesi; Steven G. Dropsho; Grigorios Magklis; Sandhya Dwarkadas; Michael L. Scott

We describe the design, analysis, and performance of an on-line algorithm to dynamically control the frequency/voltage of a Multiple Clock Domain (MCD) microarchitecture. The MCD microarchitecture allows the frequency/voltage of microprocessor regions to be adjusted independently and dynamically, allowing energy savings when the frequency of some regions can be reduced without significantly impacting performance. Our algorithm achieves on average a 19.0% reduction in Energy Per Instruction (EPI), a 3.2% increase in Cycles Per Instruction (CPI), a 16.7% improvement in Energy-Delay Product, and a Power Savings to Performance Degradation ratio of 4.6. Traditional frequency/voltage scaling techniques which apply reductions globally to a fully synchronous processor achieve a Power Savings to Performance Degradation ratio of only 2-3. Our Energy-Delay Product improvement is 85.5% of what has been achieved using an off-line algorithm. These results were achieved using a broad range of applications from the MediaBench, Olden, and Spec2000 benchmark suites using an algorithm we show to require minimal hardware resources.

IEEE Computer | 2003

Dynamically tuning processor resources with adaptive processing

David H. Albonesi; Rajeev Balasubramonian; S.G. Dropsbo; Sandhya Dwarkadas; Eby G. Friedman; Michael C. Huang; Volkan Kursun; Grigorios Magklis; Michael L. Scott; Greg Semeraro; Pradip Bose; Alper Buyuktosunoglu; Peter W. Cook; Stanley E. Schuster

By using adaptive processing to dynamically tune major microprocessor resources, developers can achieve greater energy efficiency with reasonable hardware and software overhead while avoiding undue performance loss. Adaptive processors require few additional transistors. Further, because adaptation occurs only in response to infrequent trigger events, the decision logic can be placed into a low-leakage state until such events occur.

international conference on parallel architectures and compilation techniques | 2002

Integrating adaptive on-chip storage structures for reduced dynamic power

Steven G. Dropsho; Alper Buyuktosunoglu; Rajeev Balasubramonian; David H. Albonesi; Sandhya Dwarkadas; Greg Semeraro; Grigorios Magklis; M.L. Scottt

Energy efficiency in microarchitectures has become a necessity. Significant dynamic energy savings can be realized for adaptive storage structures such as caches, issue queues, and register files by disabling unnecessary storage resources. Prior studies have analyzed individual structures and their control. A common theme to these studies is exploration of the configuration space and use of system IPC as feedback to guide reconfiguration. However when multiple structures adapt in concert, the number of possible configurations increases dramatically, and assigning causal effects to IPC change becomes problematic. To overcome this issue, we introduce designs that are reconfigured solely on local behavior. We introduce a novel cache design that permits direct calculation of efficient configurations. For buffer and queue structures, limited histogramming permits precise resizing control. When applying these techniques we show energy savings of up to 70% on the individual structures, and savings averaging 30% overall for the portion of energy attributed to these structures with an average of 2.1% performance degradation.

IEEE Transactions on Parallel and Distributed Systems | 2007

Understanding the Thermal Implications of Multi-Core Architectures

Pedro Chaparro; José González; Grigorios Magklis; Cai Qiong; Antonio González

Multicore architectures are becoming the main design paradigm for current and future processors. The main reason is that multicore designs provide an effective way of overcoming instruction-level parallelism (ILP) limitations by exploiting thread-level parallelism (TLP). In addition, it is a power and complexity-effective way of taking advantage of the huge number of transistors that can be integrated on a chip. On the other hand, todays higher than ever power densities have made temperature one of the main limitations of microprocessor evolution. Thermal management in multicore architectures is a fairly new area. Some works have addressed dynamic thermal management in bi/quad-core architectures. This work provides insight and explores different alternatives for thermal management in multicore architectures with 16 cores. Schemes employing both energy reduction and activity migration are explored and improvements for thread migration schemes are proposed.

symposium on asynchronous circuits and systems | 2004

Hiding synchronization delays in a GALS processor microarchitecture

Greg Semeraro; David H. Albonesi; Grigorios Magklis; Michael L. Scott; Steven G. Dropsho; Sandhya Dwarkadas

We analyze an Alpha 21264-like globally-asynchronous, locally-synchronous (GALS) processor organized as multiple clock domain (MCD) microarchitecture and identify the architectural features of the processor that influence the limited performance degradation measured. We show that the out-of-order superscalar execution features of a processor, which allow traditional instruction execution latency to be hidden, are the same features that reduce the performance degradation impact of the synchronization costs of an MCD processor. In the case of our Alpha 21264-like processor, up to 94% of the MCD synchronization delays are hidden and do not impact overall performance. In addition, we show that by adding out-of-order superscalar execution capabilities to a simpler microarchitecture, such as an Intel StrongARM-like processor, as much as 62% of the performance degradation caused by synchronization delays can be eliminated.

international symposium on microarchitecture | 2003

Dynamic frequency and voltage scaling for a multiple-clock-domain microprocessor

Grigorios Magklis; Greg Semeraro; David H. Albonesi; Steven G. Dropsho; Sandhya Dwarkadas; Michael L. Scott

Multiple clock domains is one solution to the increasing problem of propagating the clock signal across increasingly larger and faster chips. The ability to independently scale frequency and voltage in each domain creates a powerful means of reducing power dissipation. A multiple clock domain (MCD) microarchitecture, which uses a globally asynchronous, locally synchronous (GALS) clocking style, permits future aggressive frequency increases, maintains a synchronous design methodology, and exploits the trend of making functional blocks more autonomous. In MCD, each processor domain is internally synchronous, but domains operate asynchronously with respect to one another. Designers still apply existing synchronous design techniques to each domain, but global clock skew is no longer a constraint. Moreover, domains can have independent voltage and frequency control, enabling dynamic voltage scaling at the domain level.

international conference on parallel architectures and compilation techniques | 2008

Meeting points: using thread criticality to adapt multicore hardware to parallel regions

Qiong Cai; Jose Gonzalez; Ryan N. Rakvic; Grigorios Magklis; Pedro Chaparro; Antonio González

We present a novel mechanism, called meeting point thread characterization, to dynamically detect critical threads in a parallel region. We define the critical thread the one with the longest completion time in the parallel region. Knowing the criticality of each thread has many potential applications. In this work, we propose two applications: thread delaying for multi-core systems and thread balancing for simultaneous multi-threaded (SMT) cores. Thread delaying saves energy consumptions by running the core containing the critical thread at maximum frequency while scaling down the frequency and voltage of the cores containing non-critical threads. Thread balancing improves overall performance by giving higher priority to the critical thread in the issue queue of an SMT core. Our experiments on a detailed microprocessor simulator with the Recognition, Mining, and Synthesis applications from Intel research laboratory reveal that thread delaying can achieve energy savings up to more than 40% with negligible performance loss. Thread balancing can improve performance from 1% to 20%.

international conference on parallel architectures and compilation techniques | 2009

FASTM: A Log-based Hardware Transactional Memory with Fast Abort Recovery

Marc Lupon; Grigorios Magklis; Antonio González

Version management, one of the key design dimensions of Hardware Transactional Memory (HTM) systems, defines where and how transactional modifications are stored. Current HTM systems use either eager or lazy version management. Eager systems that keep new values in-place while they hold old values in a software log, suffer long delays when aborts are frequent because the pre-transactional state is recovered by software. Lazy systems that buffer new values in specialized hardware offer complex and inefficient solutions to handle hardware overflows, which are common in applications with coarse-grain transactions. In this paper, we present FASTM, an eager log-based HTM that takes advantage of the processor’s cache hierarchy to provide fast abort recovery. FASTM uses a novel coherence protocol to buffer the transactional modifications in the first level cache and to keep the non-speculative values in the higher levels of the memory hierarchy. This mechanism allows fast abort recovery of transactions that do not overflow the first level cache resources. Contrary to lazy HTM systems, committing transactions do not have to perform any actions in order to make their results visible to the rest of the system. FASTM keeps the pre-transactional state in a software-managed log as well, which permits the eviction of speculative values and enables transparent execution even in the case of cache overflow. This approach simplifies eviction policies without degrading performance, because it only falls back to a software abort recovery for transactions whose modified state has overflowed the cache. Simulation results show that FASTM achieves a speed-up of 43% compared to LogTM-SE, improving the scalability of applications with coarse-grain transactions and obtaining similar performance to an ideal eager HTM with zero-cost abort recovery.

high-performance computer architecture | 2005

Distributing the frontend for temperature reduction

Pedro Chaparro; Grigorios Magklis; Jose Gonzalez; Antonio González

Due to increasing power densities, both on-chip average and peak temperatures are fast becoming a serious bottleneck in processor design. This is due to the cost of removing the heat generated, and the performance impact of dealing with thermal emergencies. So far microarchitectural techniques to control temperature have mainly focused on the processor backend (in particular the execution units), whereas the frontend has not received much attention. However, as the temperature of the backend remains controlled and the processor throughput increases, the heat dissipated by the frontend becomes more significant, and one of the major contributors to the total average temperature. This paper proposes and evaluates a distributed frontend for clustered microarchitectures that is able to reduce power density and temperature. First, a distributed mechanism for renaming and committing instructions is proposed. Second, a sub-banked trace cache with a bank hopping mechanism is presented. Finally, a method to improve the sub-banking is proposed based on a biased mapping function to distribute bank accesses to balance temperature.

Explore More