John Kalamatianos | Researchain

Publication

Featured research published by John Kalamatianos.

IEEE Transactions on Computers | 2000

High-speed parallel-prefix modulo 2^n-1 adders

Lampros Kalampoukas; Dimitris Nikolos; Costas Efstathiou; Haridimos T. Vergos; John Kalamatianos

A novel parallel-prefix architecture for high-speed modulo 2^n-1 adders is presented. The proposed architecture is based on the idea of recirculating the generate and propagate signals, instead of the traditional end-around carry approach. Static CMOS implementations verify that the proposed architecture compares favorably with the already known parallel-prefix or carry look-ahead structures.
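
For readers unfamiliar with the arithmetic, the following is a minimal sketch of modulo 2^n-1 addition using the traditional end-around carry idea that the abstract contrasts against. It is a software illustration only, not the parallel-prefix architecture of the paper; the function name and the `normalize` flag are hypothetical.

```python
def add_mod_2n_minus_1(a: int, b: int, n: int, normalize: bool = True) -> int:
    """Add two n-bit operands modulo 2^n - 1 with an end-around carry.

    Illustrative only: the name and the `normalize` flag are assumptions,
    not taken from the paper.
    """
    mask = (1 << n) - 1
    if not (0 <= a <= mask and 0 <= b <= mask):
        raise ValueError("operands must fit in n bits")
    total = a + b
    # End-around carry: the carry out of bit n-1 is added back at bit 0,
    # because 2^n is congruent to 1 modulo 2^n - 1.
    result = (total & mask) + (total >> n)
    # The all-ones pattern is a second representation of zero.
    if normalize and result == mask:
        result = 0
    return result


print(add_mod_2n_minus_1(9, 8, 4))  # 17 mod 15
```

With `normalize=False` the all-ones pattern is returned unchanged, which mirrors adders that tolerate a double representation of zero.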

International Symposium on Microarchitecture | 1998

Predicting indirect branches via data compression

John Kalamatianos; David R. Kaeli

Branch prediction is a key mechanism used to achieve high performance on multiple issue, deeply pipelined processors. By predicting the branch outcome at the instruction fetch stage of the pipeline, superscalar processors become able to exploit Instruction Level Parallelism (ILP) by providing a larger window of instructions. However, when a branch is mispredicted, instructions from the mispredicted path must be discarded. Therefore, branch prediction accuracy is critical to achieve high performance. Existing branch prediction schemes can accurately predict the direction of conditional branches, but have difficulties predicting the correct targets of indirect branches. Indirect branches occur frequently in Object-Oriented Languages (OOL), as well as in Dynamically-Linked Libraries (DLLs), two programming environments rapidly increasing in popularity. In addition, certain language constructs such as multi-way control transfers (e.g., switches), and architectural features such as 64-bit address spaces, utilize indirect branching. In this paper, we describe a new algorithm for predicting unconditional indirect branches called Prediction by Partial Matching (PPM). We base our approach on techniques proven to work optimally in the field of data compression. We combine a viable implementation of the PPM algorithm with dynamic per-branch selection of path-based correlation and compare its prediction accuracy against a variety of predictors. Our results show that, for approximately the same hardware budget, the combined predictor can achieve a misprediction ratio of 9.47%, as compared to 11.48% for the previously published most accurate indirect branch predictor.
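
The core PPM idea, using the longest matching history and falling back to shorter ones, can be sketched in software as below. This is an assumption-laden illustration: the class name, the use of unbounded dictionaries, and the use of recent targets as history are hypothetical choices, and the per-branch selection of path-based correlation described in the abstract is not modeled.

```python
from collections import Counter, defaultdict, deque


class PPMTargetPredictor:
    """Toy PPM-style target predictor (hypothetical, not the paper's design)."""

    def __init__(self, max_order: int = 3):
        self.max_order = max_order
        # One Markov table per order: context tuple -> counts of next target.
        self.tables = [defaultdict(Counter) for _ in range(max_order + 1)]
        self.history = deque(maxlen=max_order)

    def _context(self, order: int) -> tuple:
        if order == 0:
            return ()
        return tuple(self.history)[-order:]

    def predict(self):
        """Return the most frequent target for the longest matching context."""
        for order in range(min(self.max_order, len(self.history)), -1, -1):
            counts = self.tables[order].get(self._context(order))
            if counts:
                return counts.most_common(1)[0][0]
        return None  # nothing has been observed yet

    def update(self, target) -> None:
        """Record the resolved target under every available context order."""
        for order in range(min(self.max_order, len(self.history)) + 1):
            self.tables[order][self._context(order)][target] += 1
        self.history.append(target)


predictor = PPMTargetPredictor(max_order=3)
for target in ["A", "B", "C", "A", "B", "C", "A", "B"]:
    predictor.update(target)
print(predictor.predict())  # the order-3 context (C, A, B) was followed by C
```

A hardware predictor would replace the dictionaries with fixed-size, hashed tables, which is where the hardware budget mentioned in the abstract comes into play.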

International Conference on Computer Design | 2013

Assessing the impact of hard faults in performance components of modern microprocessors

Nikos Foutris; Dimitris Gizopoulos; John Kalamatianos; Vilas Sridharan

A growing portion of the silicon area of modern high-performance microprocessors is dedicated to components that increase performance but do not determine functional correctness. Permanent hardware faults in these components can lead to performance fluctuation (not necessarily degradation) and do not produce functional errors. Although this fact has been identified previously, extensive research has not yet been conducted to accurately classify and quantify permanent faults in these components over a set of CPU benchmarks or measure the magnitude of the performance impact. Depending on the results of such studies, performance-related components of microprocessors can be disabled in fine or coarse granularities, salvaging microprocessor functionality at different performance levels. This paper analyzes the impact of permanent faults in the arrays and control logic of key microprocessor performance components such as the branch predictor, branch target buffer, return address stack, and data and instruction prefetchers. We apply a statistically safe fault injection campaign for single faults in performance components on a modified version of the cycle-accurate x86 architectural simulator PTLsim running the SPEC CPU2006 suite. Our evaluation reveals significant differences in the effect of faults and their performance impacts across the components as well as within each component (different fields). We classify faults for all components and analyze their IPC impact in the arrays and control logic. Our analysis shows that a very large fraction (44% to 96%) of permanent faults in these components leads only to performance fluctuation. Observation confirms the intuition that there are no functionality errors; however, many cases of a single fault in a performance component can significantly degrade microprocessor performance (2-20% average IPC reduction for SPEC CPU2006).

International Conference on Acoustics, Speech, and Signal Processing | 1995

Parallel computation of higher order moments on the MasPar-1 machine

John Kalamatianos; Elias S. Manolakos

The design of efficient parallel processing implementations for speeding up the computationally intensive estimation of higher order statistics (HOS) has been recognized as an important task by the signal processing community. We report on the synthesis of minimum running time (latency) data-parallel algorithms that can be employed to compute all moment lags, up to the 3rd or 4th-order, on the MasPar-1 single instruction multiple data (SIMD) parallel system. By construction the synthesized SIMD algorithms require constant memory per processing element (PE), thus allowing the processing of 1-D input data sequences with as many as M=2^10 data samples. Simulation results are presented showing the gain in speedup and execution times, as compared to optimized versions of the serial estimation algorithm running on powerful workstations.
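
As a point of reference for what is being parallelized, here is a minimal serial sketch of a third-order moment estimate at a single pair of lags. The function name, the restriction to nonnegative lags, and the 1/N (biased) normalization are assumptions made for illustration; the abstract does not specify the estimator, and this is not the SIMD algorithm of the paper.

```python
def third_order_moment(x, tau1: int, tau2: int) -> float:
    """Estimate m3(tau1, tau2) = (1/N) * sum_k x[k] * x[k+tau1] * x[k+tau2].

    Assumes nonnegative lags and a biased 1/N normalization (both are
    illustrative choices, not taken from the paper).
    """
    if tau1 < 0 or tau2 < 0:
        raise ValueError("this sketch handles nonnegative lags only")
    n = len(x)
    # Only indices for which all three samples exist contribute to the sum.
    last = n - max(tau1, tau2)
    total = sum(x[k] * x[k + tau1] * x[k + tau2] for k in range(last))
    return total / n


print(third_order_moment([1, 2, 3, 4], 0, 1))  # (2 + 12 + 36) / 4
```

Computing all lags up to a maximum lag repeats this sum for every (tau1, tau2) pair, which is the cost that motivates a data-parallel formulation.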

International Symposium on Performance Analysis of Systems and Software | 2000

Accurate simulation and evaluation of code reordering

John Kalamatianos; David R. Kaeli

The need for bridging the ever growing gap between memory and processor performance has motivated research for exploiting the memory hierarchy effectively. An important software solution called code reordering produces a new program layout to better utilize the available memory hierarchy. Many algorithms have been proposed. They differ based on: 1) the code granularity assumed by the reordering algorithm, and 2) the models used to guide code placement. In this paper we present a framework that provides accurate simulation and evaluation of code reordering algorithms on an out-of-order superscalar processor. Our approach allows both profile-guided and compile-time approaches to be simulated. Using a single simulation pass, different graph models are constructed and utilized during code placement. Various combinations of basic block/procedure reordering algorithms can be employed. We discuss the necessary modifications made to a detailed simulator of a processor in order to accurately simulate the optimized code layout.

ACM SIGARCH Computer Architecture News | 1999

Improving the accuracy of indirect branch prediction via branch classification

John Kalamatianos; David R. Kaeli

Providing accurate branch prediction is critical to effectively exploit superscalar execution. While most modern processors employ speculative execution to overcome the branch hazard problem, some number of the instructions will have to be discarded when a branch misprediction occurs. Even though existing branch prediction schemes can accurately predict the direction of conditional branches, they still have difficulty predicting the correct targets of indirect branches. This type of branch occurs more frequently in languages used in Object-Oriented Programming (OOP), as well as in Dynamically-Linked Libraries (DLLs), two programming environments rapidly increasing in popularity. In this paper, we investigate the performance of several predictors used to predict the targets of indirect branches. We present indirect branch classification as a mechanism to characterize the behavior of indirect branches. We then propose hybrid predictors utilizing static and profile-guided branch classification to improve the accuracy of indirect branches when compared to conventional predictors. Our results based on C and C++ applications show that static and profile-based hybrid predictors can improve prediction accuracy by 13.8-14.3% over the best non-hybrid branch predictor.

IEEE International Symposium on Workload Characterization | 1998

Parameter value characterization of Windows NT-based applications

John Kalamatianos; David R. Kaeli; Ronnie Chaiken

Compiler optimizations such as code specialization and partial evaluation can be used to effectively exploit identifiable invariance of variable values. To identify the invariant variables that the compiler misses at compile time, value profiling can provide valuable information. We focus on the invariance of procedure parameters for a set of desktop applications run on MS Windows NT 4.0. Most of those applications are non-scientific and execute interactively through a rich GUI. Due to the dynamic nature of this workload, one would expect that parameter values would exhibit unpredictable behavior. Our work attempts to address this question by measuring the invariance and temporal locality of parameter values. We also measure the invariance of parameter values for four benchmarks from the SPECINT95 suite for comparison.
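
To make the two measured quantities concrete, the sketch below computes them for one parameter's observed values. The definitions used here (invariance as the share of calls that pass the most frequent value, temporal locality as the share of calls that repeat the previous value) are assumptions chosen for illustration; the abstract does not define the metrics, and both function names are hypothetical.

```python
from collections import Counter


def invariance(values) -> float:
    """Fraction of calls in which the parameter held its most frequent value.

    Assumed definition, not taken from the paper.
    """
    if not values:
        raise ValueError("need at least one observed value")
    top_count = Counter(values).most_common(1)[0][1]
    return top_count / len(values)


def last_value_locality(values) -> float:
    """Fraction of calls (after the first) that repeat the previous value.

    Assumed definition, not taken from the paper.
    """
    if len(values) < 2:
        raise ValueError("need at least two observed values")
    repeats = sum(1 for prev, cur in zip(values, values[1:]) if prev == cur)
    return repeats / (len(values) - 1)


observed = [4, 4, 4, 7]  # hypothetical values of one procedure parameter
print(invariance(observed), last_value_locality(observed))
```

A parameter with high invariance is a candidate for code specialization, since a specialized copy of the procedure would be valid for most calls.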

Design Automation Conference | 2017

On Characterizing Near-Threshold SRAM Failures in FinFET Technology

Shrikanth Ganapathy; John Kalamatianos; Keith Kasprak; Steven Raasch

Adoption of near-threshold voltage (NTV) operation in SRAM-based memories has been limited by reduced robustness resulting from marginal transistor operation that results in bit failures. Using silicon measurements from a large sample of 14nm FinFET test chips, we show that our cells operate at frequencies of up to 1GHz with a minimum 15% voltage guardband, below which the cells begin to fail. We find that when operated at 32.5% below nominal voltage, >95% of the lines experience fewer than 2 failures, which can be corrected with SECDED ECC. Our results indicate that for frequencies of up to 1GHz, NTV can help maximize power savings potential while requiring minimal protection.

Design Automation Conference | 2017

Compiler Techniques to Reduce the Synchronization Overhead of GPU Redundant Multithreading

Manish Gupta; Daniel Lowell; John Kalamatianos; Steven Raasch; Vilas Sridharan; Dean M. Tullsen; Rajesh K. Gupta

Redundant Multi-Threading (RMT) provides a potentially low cost mechanism to increase GPU reliability by replicating computation at the thread level. Prior work has shown that RMT's high performance overhead stems not only from executing redundant threads, but also from the synchronization overhead between the original and redundant threads. The overhead of inter-thread synchronization can be especially significant if the synchronization is implemented using global memory. This work presents novel compiler techniques using fingerprinting and cross-lane operations to reduce synchronization overhead for RMT on GPUs. Fingerprinting combines multiple synchronization events into one event by hashing, and cross-lane operations enable thread-level synchronization via register-level communication. This work shows that fingerprinting yields a 73.5% reduction in GPU RMT overhead while cross-lane operations reduce the overhead by 43% when compared to the state-of-the-art GPU RMT solutions on real hardware.
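
The fingerprinting idea, hashing many values locally and comparing once, can be illustrated with a small host-side sketch. CRC32 is an assumed stand-in hash, since the abstract does not name the hash function, and the function names and event counting are hypothetical; this is not the compiler transformation or the GPU code of the paper.

```python
import zlib


def fingerprint(values) -> int:
    """Fold a sequence of 64-bit signed values into one CRC32 fingerprint.

    CRC32 is an assumption made for illustration.
    """
    fp = 0
    for v in values:
        fp = zlib.crc32(v.to_bytes(8, "little", signed=True), fp)
    return fp


def compare_per_value(original, redundant):
    """Baseline: one synchronization event per produced value."""
    events = 0
    for a, b in zip(original, redundant):
        events += 1
        if a != b:
            return False, events
    return True, events


def compare_fingerprints(original, redundant):
    """Fingerprinting: each thread hashes locally, then they synchronize once."""
    return fingerprint(original) == fingerprint(redundant), 1


good = [1, 2, 3]
print(compare_per_value(good, good))     # three events for three values
print(compare_fingerprints(good, good))  # one event regardless of length
```

The trade-off is that a mismatch is detected only when the fingerprints are compared, and a hash collision could in principle mask an error.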

VLSI Test Symposium | 2016

Faults in data prefetchers: Performance degradation and variability

Nikos Foutris; Athanasios Chatzidimitriou; Dimitris Gizopoulos; John Kalamatianos; Vilas Sridharan

High-performance microprocessors employ data prefetchers to mitigate the ever-growing gap between CPU computing rates and memory latency. Technology scaling along with low voltage operation exacerbates the likelihood and rate of hard (permanent) faults in technologies used by prefetchers such as SRAM and flip-flop arrays. Faulty prefetch behavior does not affect correctness but can be detrimental to performance. Hard faults in data prefetchers (unlike their soft counterparts, which are rare) can cause significant single-thread performance degradation and lead to large performance variability across otherwise identical cores. In this paper, we characterize in depth both of these aspects in microprocessors suffering from multiple hard faults in their data prefetcher components. Our study reveals fault scenarios in the prefetcher table that can degrade IPC by more than 17%, while faults in the prefetch input and request queues can reduce IPC by up to 24% and 26%, respectively, compared to fault-free operation. Moreover, we find that a faulty data prefetcher can substantially increase the performance variability across identical cores: the standard deviation of IPC loss for different benchmarks can be more than 4.5%.
