Paul LeMahieu | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Paul LeMahieu is active.

Explore More

Publication

Featured researches published by Paul LeMahieu.

IEEE Transactions on Parallel and Distributed Systems | 2001

Computing in the RAIN: a reliable array of independent nodes

Vasken Bohossian; Chenggong Charles Fan; Paul LeMahieu; Marc D. Riedel; Lihao Xu; Jehoshua Bruck

The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data-storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN-technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly-available video server, a highly-available Web server, and a distributed checkpointing system. Also, we describe a commercial product, Rainwall, built with the RAIN technology.

international parallel and distributed processing symposium | 2000

Computing in the RAIN: A Reliable Array of Independent Nodes

Vasken Bohossian; Chenggong Charles Fan; Paul LeMahieu; Marc D. Riedel; Lihao Xu; Jehoshua Bruck

The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple in terfacesto networks configured in fault-tolerant topologies. The RAIN softw arecomponents run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiplenode, link, and switch failures, with no single point of failure. The RAIN technology has been transfered to RAINfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures; 2) fault management techniques based on group membership; and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: highly available video and web servers, and a distributed checkpointing system.

merged international parallel processing symposium and symposium on parallel and distributed processing | 1998

Fault-tolerant switched local area networks

Paul LeMahieu; Vasken Bohossian; Jehoshua Bruck

The RAIN (Reliable Array of Independent Nodes) project at Caltech is focusing on creating highly reliable distributed systems by leveraging commercially available personal computers, workstations and interconnect technologies. In particular the issue of reliable communication is addressed by introducing redundancy in the form of multiple network interfaces per compute node. When using compute nodes with multiple network connections the question of how to best connect these nodes to a given network of switches arises. We examine networks of switches (e.g. based on Myrinet technology) and focus on degree-two compute nodes (two network adaptor cards per node). Our primary goal is to create networks that are as resistant as possible to partitioning. Our main contributions are: (i) a construction for degree-2 compute nodes connected by a ring network of switches of degree 4 that can tolerate any 3 switch failures without partitioning the nodes into disjoint sets; (ii) a proof that this construction is optimal in the sense that no construction can tolerate more switch failures while avoiding partitioning; and (ii) generalizations of this construction to arbitrary switch and node degrees and to other switch networks, in particular to a fully-connected network of switches.

Archive | 1999