Sotirios G. Ziavras | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Sotirios G. Ziavras is active.

Explore More

Publication

Featured researches published by Sotirios G. Ziavras.

IEEE Transactions on Parallel and Distributed Systems | 1994

RH: a versatile family of reduced hypercube interconnection networks

Sotirios G. Ziavras

The binary hypercube has been one of the most frequently chosen interconnection networks for parallel computers because it provides low diameter and is so robust that it can very efficiently emulate a wide variety of other frequently used networks. However, the major drawback of the hypercube is the increase in the number of communication channels for each processor with an increase in the total number of processors in the system. This drawback has a direct effect on the very large scale integration complexity of the hypercube network. This short note proposes a new topology that is produced from the hypercube by a uniform reduction in the number of edges for each node. This edge reduction technique produces networks with lower complexity than hypercubes while maintaining, to a high extent, the powerful hypercube properties. An extensive comparison of the proposed reduced hypercube (RH) topology with the conventional hypercube is included. It is also shown that several copies of the popular cube-connected cycles network can be emulated simultaneously by an RH with dilation 1. >

dependable systems and networks | 2006

In-Register Duplication: Exploiting Narrow-Width Value for Improving Register File Reliability

Jie S. Hu; Shuai Wang; Sotirios G. Ziavras

Protecting the register value and its data buses is crucial to reliable computing in high-performance microprocessors due to the increasing susceptibility of CMOS circuitry to soft errors induced by high-energy particle strikes. Since the register file is in the critical path of the processor pipeline, any reliable design that increases either the pressure on the register file or the register file access latency is not desirable. In this paper, we propose to exploit narrow-width register values, which present the majority of the generated values, for duplicating a copy of the value within the same data item, called in-register duplication (IRD), eliminating the requirement of additional copy registers. The datapath pipeline is augmented to efficiently incorporate parity encoding and parity checking such that error recovery is seamlessly supported in IRD and the parity checking is overlapped with the execution stage to avoid increasing the critical path. Our experimental evaluation using the SPEC CINT2000 benchmark suite shows that IRD provides superior read-with-duplicate (RWD) and error detection/recovery rates under heavy error injection as compared to previous reliability schemes

IEEE Transactions on Computers | 2009

On the Characterization and Optimization of On-Chip Cache Reliability against Soft Errors

Shuai Wang; Jie S. Hu; Sotirios G. Ziavras

Soft errors induced by energetic particle strikes in on-chip cache memories have become an increasing challenge in designing new generation reliable microprocessors. Previous efforts have exploited information redundancy via parity/ECC codings or cacheline duplication for information integrity in on-chip cache memories. Due to various performance, area/size, and energy constraints in various target systems, many existing unoptimized protection schemes may eventually prove significantly inadequate and ineffective. In this paper, we propose a new framework for conducting comprehensive studies and characterization on the reliability behavior of cache memories, in order to provide insight into cache vulnerability to soft errors as well as design guidance to architects for highly efficient reliable on-chip cache memory design. Our work is based on the development of new lifetime models for data and tag arrays residing in both the data and instruction caches. Those models facilitate the characterization of cache vulnerability of stored items at various lifetime phases. We then exemplify this design methodology by proposing reliability schemes targeting at specific vulnerable phases. Benchmarking is carried out to showcase the effectiveness of our approach.

Concurrency and Computation: Practice and Experience | 2004

Parallel LU factorization of sparse matrices on FPGA‐based configurable computing engines

Xiaofang Wang; Sotirios G. Ziavras

Configurable computing, where hardware resources are configured appropriately to match specific hardware designs, has recently demonstrated its ability to significantly improve performance for a wide range of computation‐intensive applications. With steady advances in silicon technology, as predicted by Moores Law, Field‐Programmable Gate Array (FPGA) technologies have enabled the implementation of System‐on‐a‐Programmable‐Chip (SOPC or SOC) computing platforms, which, in turn, have given a significant boost to the field of configurable computing. It is possible to implement various specialized parallel machines in a single silicon chip. In this paper, we describe our design and implementation of a parallel machine on an SOPC development board, using multiple instances of a soft IP configurable processor; we use this machine for LU factorization. LU factorization is widely used in engineering and science to solve efficiently large systems of linear equations. Our implementation facilitates the efficient solution of linear equations at a cost much lower than that of supercomputers and networks of workstations. The intricacies of our FPGA‐based design are presented along with tradeoff choices made for the purpose of illustration. Performance results prove the viability of our approach. Copyright

international parallel and distributed processing symposium | 2003

Parallel direct solution of linear equations on FPGA-based machines

Xiaofang Wang; Sotirios G. Ziavras

The efficient solution of large systems of linear equations represented by sparse matrices appears in many tasks. LU factorization followed by backward and forward substitutions is widely used for this purpose. Parallel implementations of this computation-intensive process are limited primarily to supercomputers. New generations of field-programmable gate array (FPGA) technologies enable the implementation of system-on-a-programmable-chip (SOPC) computing platforms that provide many opportunities for configurable computing. We present the design and implementation of a parallel machine for LU factorization on an SOPC board, using multiple instances of a soft processor. A highly parallel block-diagonal-bordered (BDB) algorithm for LU factorization is mapped to our multiprocessor. Our results prove the viability of our FPGA-based approach.

Journal of Parallel and Distributed Computing | 1992

On the problem of expanding hypercube-based systems☆

Sotirios G. Ziavras

Abstract Several topologies with important features have been proposed for the interconnection of resources resident in parallel computing systems. The hypercube is one of the most widely used topologies because it provides small diameter and is so robust that it can very efficiently emulate a wide variety of other frequently used structures. Nevertheless, the major drawback of the standard hypercube is that it cannot be expanded in practice. This paper proposes a methodology that modifies hypercube networks in order to support incremental growth techniques. The proposed methodology accomplishes this goal with minimal modifications of individual hypercubes and, contrary to other existing techniques, without any need for extra resources. The effectiveness of the proposed methodology is shown analytically.

IEEE Transactions on Parallel and Distributed Systems | 2004

A super-programming approach for mining association rules in parallel on PC clusters

Dejiang Jin; Sotirios G. Ziavras

PC clusters have become popular in parallel processing. They do not involve specialized interprocessor networks, so the latency of data communications is rather long. The programming models for PC clusters are often different than those for parallel machines or supercomputers containing sophisticated interprocessor communication networks. For PC clusters, load balancing among the nodes becomes a more critical issue in attempts to yield high performance. We introduce a new model for program development on PC clusters, namely, the super-programming model (SPM). The workload is modeled as a collection of super-instructions (SIs). We propose that a set of SIs be designed for each application domain. They should constitute an orthogonal set of frequently used high-level operations in the corresponding application domain. Each SI should normally be implemented as a high-level language routine that can execute on any PC. Application programs are modeled as super-programs (SPs), which are coded using SIs. SIs are dynamically assigned to available PCs at runtime. Because of the known granularity of SIs, an upper bound on their execution time can be estimated at static time. Therefore, dynamic load balancing becomes an easier task. Our motivation is to support dynamic load balancing and code porting, especially for applications with diverse sets of inputs such as data mining. We apply here SPM to the implementation of an a priori-like algorithm for mining association rules. Our experiments show that the average idle time per node is kept very low.

international conference on embedded computer systems: architectures, modeling, and simulation | 2006

On the Characterization of Data Cache Vulnerability in High-Performance Embedded Microprocessors

Shuai Wang; Jie S. Hu; Sotirios G. Ziavras

Energetic-particle induced soft errors in on-chip cache memories have become a major challenge in designing new generation reliable microprocessors. Uniformly applying conventional protection schemes such as error correcting codes (ECC) to SRAM caches may not be practical where performance, power, and die area are highly constrained, especially for embedded systems. In this paper, we propose to analyze the lifetime behavior of the data cache to identify its temporal vulnerability. For this vulnerability analysis, we develop a new lifetime model. Based on the new lifetime model, we evaluate the effectiveness of several existing schemes in reducing the vulnerability of the data cache. Furthermore, we propose to periodically invalidate clean cache lines to reduce the probability of errors being read in by the CPU. Combined with previously proposed early writeback strategies, our schemes achieve a substantially low vulnerability in the data cache, which indicate the necessity of different protection schemes for data items during various phases in their lifetime

Integration | 2007

Coprocessor design to support MPI primitives in configurable multiprocessors

Sotirios G. Ziavras; Alexandros V. Gerbessiotis; Rohan Bafna

The Message Passing Interface (MPI) is a widely used standard for interprocessor communications in parallel computers and PC clusters. Its functions are normally implemented in software due to their enormity and complexity, thus resulting in large communication latencies. Limited hardware support for MPI is sometimes available in expensive systems. Reconfigurable computing has recently reached rewarding levels that enable the embedding of programmable parallel systems of respectable size inside one or more Field-Programmable Gate Arrays (FPGAs). Nevertheless, specialized components must be built to support interprocessor communications in these FPGA-based designs, and the resulting code may be difficult to port to other reconfigurable platforms. In addition, performance comparison with conventional parallel computers and PC clusters is very cumbersome or impossible since the latter often employ MPI or similar communication libraries. The introduction of a hardware design to implement directly MPI primitives in configurable multiprocessor computing creates a framework for efficient parallel code development involving data exchanges independently of the underlying hardware implementation. This process also supports the portability of MPI-based code developed for more conventional platforms. This paper takes advantage of the effectiveness and efficiency of one-sided Remote Memory Access (RMA) communications, and presents the design and evaluation of a coprocessor that implements a set of MPI primitives for RMA. These primitives form a universal and orthogonal set that can be used to implement any other MPI function. To evaluate the coprocessor, a router of low latency was designed as well to enable the direct interconnection of several coprocessors in cluster-on-a-chip systems. Experimental results justify the implementation of the MPI primitives in hardware to support parallel programming in reconfigurable computing. Under continuous traffic, results for a Xilinx XC2V6000 FPGA show that the average transmission time per 32-bit word is about 1.35 clock cycles. Although other computing platforms, such as PC clusters, could benefit as well from our design methodology, our focus is exclusively reconfigurable multiprocessing that has recently received tremendous attention in academia and industry.

Computers & Security | 2010

Efficient hardware support for pattern matching in network intrusion detection

Nitesh B. Guinde; Sotirios G. Ziavras

Deep packet inspection forms the backbone of any Network Intrusion Detection (NID) system. It involves matching known malicious patterns against the incoming traffic payload. Pattern matching in software is prohibitively slow in comparison to current network speeds. Due to the high complexity of matching, only FPGA (Field-Programmable Gate Array) or ASIC (Application-Specific Integrated Circuit) platforms can provide efficient solutions. FPGAs facilitate target architecture specialization due to their field programmability. Costly ASIC designs, on the other hand, are normally resilient to pattern updates. Our FPGA-based solution performs high-speed pattern matching while permitting pattern updates without resource reconfiguration. To its advantage, our solution can be adopted by software and ASIC realizations, however at the expense of much lower performance and higher price, respectively. Our solution permits the NID system to function while pattern updates occur. An off-line optimization method first finds common sub-patterns across all the patterns in the SNORT database of signatures. A novel technique then compresses each pattern into a bit vector, where each bit represents such a sub-pattern. This approach reduces drastically the required on-chip storage as well as the complexity of pattern matching. The bit vectors for newly discovered patterns can be generated easily using a simple high-level language program before storing them into the on-chip RAM. Compared to earlier approaches, not only is our strategy very efficient while supporting runtime updates but it also results in impressive area savings; it utilizes just 0.052 logic cells for processing and 17.77 bits for storage per character in the current SNORT database of 6455 patterns. Also, the total number of logic cells for processing the traffic payload does not change with pattern updates.

Explore More