Publication


Featured research published by Thomas M. Conte.


International Symposium on Microarchitecture | 1998

Unified assign and schedule: a new approach to scheduling for clustered register file microarchitectures

Emre Özer; Sanjeev Banerjia; Thomas M. Conte

Recently, there has been a trend towards clustered microarchitectures to reduce the cycle time for wide issue microprocessors. In such processors, the register file and functional units are partitioned and grouped into clusters. Instruction scheduling for a clustered machine requires assignment and scheduling of operations to the clusters. In this paper, a new scheduling algorithm named unified-assign-and-schedule (UAS) is proposed for clustered, statically-scheduled architectures. UAS merges the cluster assignment and instruction scheduling phases in a natural and straightforward fashion. We compared the performance of UAS with various heuristics to the well-known Bottom-up Greedy (BUG) algorithm and to an optimal cluster scheduling algorithm, measuring the schedule lengths produced by all of the schedulers. Our results show that UAS gives better performance than the BUG algorithm and is quite close to optimal.
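The core idea of UAS, folding cluster assignment into the scheduling loop so each operation is placed on whichever cluster gives it the earliest slot, can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: the `deps` map, the one-cycle inter-cluster copy latency, and the one-issue-slot-per-cluster default are all assumptions.

```python
COPY_LATENCY = 1  # assumed cost of moving a value between clusters

def uas_schedule(ops, deps, n_clusters, slots_per_cluster=1):
    """Greedy unified assign-and-schedule sketch.

    ops:  operation ids in priority (topological) order
    deps: {op: [predecessor ops]}
    Returns {op: (cluster, cycle)}.
    """
    placement = {}   # op -> (cluster, cycle)
    usage = {}       # (cluster, cycle) -> ops already issued there

    for op in ops:
        best = None
        for c in range(n_clusters):
            # earliest cycle at which all operands are available on cluster c,
            # charging a copy cycle for operands living on another cluster
            ready = 0
            for pred in deps.get(op, []):
                pc, pcycle = placement[pred]
                avail = pcycle + 1 + (COPY_LATENCY if pc != c else 0)
                ready = max(ready, avail)
            cycle = ready
            while usage.get((c, cycle), 0) >= slots_per_cluster:
                cycle += 1  # cluster's issue slots full, slide down
            if best is None or cycle < best[1]:
                best = (c, cycle)
        placement[op] = best
        usage[best] = usage.get(best, 0) + 1
    return placement
```

Because assignment and scheduling happen in one pass, the copy cost is visible when the cluster is chosen, which is the advantage UAS claims over assign-then-schedule approaches such as BUG.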


International Conference on Computer Design | 1996

Reducing state loss for effective trace sampling of superscalar processors

Thomas M. Conte; Mary Ann Hirsch; Kishore N. Menezes

There is a wealth of technological alternatives that can be incorporated into a processor design. These include reservation station designs, functional unit duplication, and processor branch handling strategies. The performance of a given design is measured through the execution of application programs and other workloads. Presently, trace driven simulation is the most popular method of processor performance analysis in the development stage of system design. Current techniques of trace driven simulation, however, are extremely slow and expensive. A fast and accurate method for statistical trace sampling of superscalar processors is proposed.
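Statistical trace sampling of the kind described here simulates short measured clusters of the trace, preceded by a warm-up prefix whose only job is to rebuild processor state (caches, predictors) and so reduce cold-start error. A toy sketch in that spirit; the parameter names and fixed sampling period are assumptions, not the paper's method:

```python
def sample_trace(trace, cluster_len, warmup_len, period):
    """Pick measurement clusters at a fixed period.

    Each sample is (warmup, measured): the warmup slice is simulated only
    to warm state and is excluded from statistics, reducing state loss.
    """
    samples = []
    for start in range(0, len(trace), period):
        warm = trace[max(0, start - warmup_len):start]
        measured = trace[start:start + cluster_len]
        samples.append((warm, measured))
    return samples
```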


ACM SIGARCH Computer Architecture News | 2005

Configurable string matching hardware for speeding up intrusion detection

Monther Aldwairi; Thomas M. Conte; Paul D. Franzon

Signature-based Intrusion Detection Systems (IDSs) monitor network traffic for security threats by scanning packet payloads for attack signatures. IDSs have to run at wire speed and need to be configurable to protect against emerging attacks. In this paper we consider the problem of string matching, which is the most computationally intensive task in IDSs. A configurable string matching accelerator is developed with the focus on increasing throughput while maintaining the configurability provided by software IDSs. Our preliminary results suggest that the hardware accelerator offers an overall system performance of up to 14 Gbps.
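Multi-pattern payload scanning of this kind is classically done with an Aho-Corasick automaton, which matches every signature in a single pass over the packet. A software sketch of that standard algorithm follows; it illustrates the matching problem only and does not reproduce the paper's hardware design.

```python
from collections import deque

def build_automaton(patterns):
    """Build Aho-Corasick goto/fail/output tables for a signature set."""
    goto = [{}]          # state -> {char: next state}
    out = [set()]        # state -> signatures ending here
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(pat)
    fail = [0] * len(goto)
    q = deque(goto[0].values())
    while q:                       # BFS: fill failure links level by level
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]  # inherit matches from the suffix state
    return goto, fail, out

def scan(payload, automaton):
    """Return (offset, signature) for every signature found in payload."""
    goto, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(payload):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

Each input character causes a bounded amount of work regardless of how many signatures are loaded, which is why this family of automata lends itself to wire-speed hardware implementations.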


International Symposium on Microarchitecture | 2009

A Benchmark Characterization of the EEMBC Benchmark Suite

Jason A. Poovey; Thomas M. Conte; Markus Levy; Shay Gal-On

Benchmark consumers expect benchmark suites to be complete, accurate, and consistent, and benchmark scores serve as relative measures of performance. However, it is important to understand how benchmarks stress the processors that they aim to test. This study explores the stress points of the EEMBC embedded benchmark suite using the benchmark characterization technique.


International Symposium on Computer Architecture | 1995

Optimization of instruction fetch mechanisms for high issue rates

Thomas M. Conte; Kishore N. Menezes; Patrick Mills; Burzin A. Patel

Recent superscalar processors issue four instructions per cycle. These processors are also powered by highly-parallel superscalar cores. The potential performance can only be exploited when fed by high instruction bandwidth. This task is the responsibility of the instruction fetch unit. Accurate branch prediction and low I-cache miss ratios are essential for the efficient operation of the fetch unit. Several studies on cache design and branch prediction address this problem. However, these techniques are not sufficient. Even in the presence of efficient cache designs and branch prediction, the fetch unit must continuously extract multiple, non-sequential instructions from the instruction cache, realign these in the proper order, and supply them to the decoder. This paper explores solutions to this problem and presents several schemes with varying degrees of performance and cost. The most general scheme, the collapsing buffer, achieves near-perfect performance and consistently aligns instructions in excess of 90% of the time, over a wide range of issue rates. The performance boost provided by compiler optimization techniques is also investigated. Results show that compiler optimization can significantly enhance performance across all schemes. The collapsing buffer supplemented by compiler techniques remains the best-performing mechanism. The paper closes with recommendations and suggestions for future work.
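The fetch-alignment problem the collapsing buffer addresses can be seen in a small functional model: following the predicted path, instructions from non-sequential cache lines are gathered and collapsed into one in-order fetch packet. A behavioral sketch only; the line size, the `taken_targets` predictor interface, and the dictionary instruction memory are assumptions.

```python
LINE_SIZE = 4  # instructions per cache line (assumed)

def fetch_packet(imem, pc, width, taken_targets):
    """Collapse up to `width` instructions along the predicted path.

    imem:          {pc: instruction}
    taken_targets: {branch pc predicted taken: target pc}
    Returns the in-order packet and the set of cache lines touched,
    which may be non-sequential when a taken branch is on the path.
    """
    packet = []
    lines_touched = set()
    while len(packet) < width and pc in imem:
        lines_touched.add(pc // LINE_SIZE)
        packet.append(imem[pc])
        pc = taken_targets.get(pc, pc + 1)  # redirect at predicted branches
    return packet, lines_touched
```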


International Conference on Parallel Architectures and Compilation Techniques | 2001

Adaptive Mode Control: A Static-Power-Efficient Cache Design

Huiyang Zhou; Mark C. Toburen; Eric Rotenberg; Thomas M. Conte

Lower threshold voltages in deep sub-micron technologies increase subthreshold leakage current, raising static power dissipation. This trend, combined with the trend of larger/more cache memories dominating die area, has prompted circuit designers to develop SRAM cells with low-leakage operating modes (e.g., sleep mode). Sleep mode reduces static power dissipation, but data stored in a sleeping cell is unreliable or lost. So, at the architecture level, there is interest in exploiting sleep mode to reduce static power dissipation while maintaining high performance. Current approaches dynamically control the operating mode of large groups of cache lines or even individual cache lines. However, the performance monitoring mechanism that controls the percentage of sleep-mode lines, and identifies particular lines for sleep mode, is somewhat arbitrary. There is no way to know what the performance could be with all cache lines active, so arbitrary miss rate targets are set (perhaps on a per-benchmark basis using profile information) and the control mechanism tracks these targets. We propose applying sleep mode only to the data store and not the tag store. By keeping the entire tag store active, the hardware knows what the hypothetical miss rate would be if all data lines were active, and the actual miss rate can be made to precisely track it. Simulations show an average of 73% of I-cache lines and 54% of D-cache lines are put in sleep mode with an average IPC impact of only 1.7%, for 64KB caches.


International Symposium on Microarchitecture | 1994

Using branch handling hardware to support profile-driven optimization

Thomas M. Conte; Burzin A. Patel; J.S. Cox

Profile-based optimizations can be used for instruction scheduling, loop scheduling, data preloading, function in-lining, and instruction cache performance enhancement. However, these techniques have not been embraced by software vendors because programs instrumented for profiling run 2-30 times slower, an awkward compile-run-recompile sequence is required, and a test input suite must be collected and validated for each program. This paper proposes using existing branch handling hardware to generate profile information in real time. Techniques are presented for both one-level and two-level branch hardware organizations. The approach produces high accuracy with small slowdown in execution (0.4%-4.6%). This allows a program to be profiled while it is used, eliminating the need for a test input suite. This practically removes the inconvenience of profiling. With contemporary processors driven increasingly by compiler support, hardware-based profiling is important for high-performance systems.


International Symposium on Microarchitecture | 1996

Instruction fetch mechanisms for VLIW architectures with compressed encodings

Thomas M. Conte; Sanjeev Banerjia; Sergei Y. Larin; Kishore N. Menezes; Sumedh W. Sathaye

VLIW architectures use very wide instruction words in conjunction with high bandwidth to the instruction cache to achieve multiple instruction issue. This report uses the TINKER experimental testbed to examine instruction fetch and instruction cache mechanisms for VLIWs. A compressed instruction encoding for VLIWs is defined and a classification scheme for i-fetch hardware for such an encoding is introduced. Several interesting cache and i-fetch organizations are described and evaluated through trace-driven simulations. A new i-fetch mechanism using a silo cache is found to have the best performance.


International Symposium on Microarchitecture | 1996

Accurate and practical profile-driven compilation using the profile buffer

Thomas M. Conte; Kishore N. Menezes; Mary Ann Hirsch

Profiling is a technique of gathering program statistics in order to aid program optimization. In particular, it is an essential component of compiler optimization for the extraction of instruction-level parallelism. Code instrumentation has been the most popular method of profiling. However, real-time, interactive, and transaction processing applications suffer from the high execution-time overhead imposed by software instrumentation. This paper suggests the use of hardware dedicated to the task of profiling. The hardware proposed consists of a set of counters, the profile buffer. A profile collection method that combines the use of hardware, the compiler, and operating system support is described. Three methods for profile buffer indexing (address mapping, selective indexing, and compiler indexing) are presented that allow this approach to produce accurate profiling information with very little execution slowdown. The profile information obtained is applied to a prominent compiler optimization, namely superblock scheduling. The resulting instruction-level parallelism approaches that obtained through the use of perfect profile information.
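Address-mapped indexing, the simplest of the three schemes, can be pictured as a small bank of counters selected by low PC bits, drained by software on overflow. A behavioral sketch only, and a simplification: the buffer size, the 8-bit saturating counters, the taken-branch-only counting, and the drain-on-overflow policy are assumptions, not the paper's design.

```python
N_COUNTERS = 16    # assumed profile buffer size
COUNTER_MAX = 255  # assumed 8-bit counters

class ProfileBuffer:
    """Hardware counters incremented at branch retire, drained by the OS."""

    def __init__(self):
        self.counters = [0] * N_COUNTERS
        self.profile = {}  # software-side accumulated profile, per index

    def retire_branch(self, pc, taken):
        if not taken:
            return
        idx = pc % N_COUNTERS        # address mapping: low PC bits index
        self.counters[idx] += 1
        if self.counters[idx] >= COUNTER_MAX:
            self.drain()             # overflow interrupt: OS reads buffer

    def drain(self):
        for i, c in enumerate(self.counters):
            if c:
                self.profile[i] = self.profile.get(i, 0) + c
        self.counters = [0] * N_COUNTERS
```

Because counting happens in hardware at retire time, the profiled program runs essentially at full speed, which is the point of the approach over software instrumentation.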


IEEE Computer | 1998

Performance analysis and its impact on design

Pradip Bose; Thomas M. Conte

Methods for designing new computer systems have changed rapidly. Consider general purpose microprocessors: gone are the days when one or two expert architects would use hunches, experience, and rules of thumb to determine a processor's features. Marketplace competition has long since forced companies to replace this ad hoc process with a targeted and highly systematic process that focuses new designs on specific workloads. Although the process differs from company to company, there are common elements. The main advantage of a systematic process is that it produces a finely tuned design targeted at a particular market. At its core are models of the processor's performance and its workloads. Developing and verifying these models is the domain now called performance analysis. We cover some of the advances in dealing with modern problems in performance analysis. Our focus is on architectural performance, typically measured in cycles per instruction.

Collaboration


Dive into Thomas M. Conte's collaborations.

Top Co-Authors

Erik P. DeBenedictis (Sandia National Laboratories)
Huiyang Zhou (North Carolina State University)
Kishore N. Menezes (North Carolina State University)
Matthew D. Jennings (North Carolina State University)
Eric R. Hein (Georgia Institute of Technology)
Paul D. Franzon (North Carolina State University)
Sorel Reisman (California State University)
Sriseshan Srikanth (Georgia Institute of Technology)