Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication
Gabriele D'Angelo∗, Stefano Ferretti, Moreno Marzolla
Department of Computer Science and Engineering, University of Bologna, Italy
Abstract
This paper presents FT-GAIA, a software-based fault-tolerant parallel and distributed simulation middleware. FT-GAIA has been designed to reliably handle Parallel And Distributed Simulation (PADS) models, which are needed to properly simulate and analyze complex systems arising in any kind of scientific or engineering field. PADS takes advantage of multiple execution units running on multicore processors, clusters of workstations or HPC systems. However, large computing systems, such as HPC systems that include hundreds of thousands of computing nodes, have to handle frequent failures of some components. To cope with this issue, FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash failures of computing nodes. Moreover, FT-GAIA offers some protection against Byzantine failures, since interaction messages among the simulated entities are replicated as well, so that the receiving entity can identify and discard corrupted messages. Results from an analytical model and from an experimental evaluation show that FT-GAIA provides a high degree of fault tolerance, at the cost of a moderate increase in the computational load of the execution units.
Keywords:
Simulation, Parallel and Distributed Simulation, Fault Tolerance, Adaptive Systems, Middleware, Agent-Based Simulation
1. Introduction
Computer simulation is an important tool to model, analyze and understand physical, biological and social phenomena.

An early version of this work appeared in [1]. This paper is an extensively revised and extended version of the previous work, in which more than 30% is new material. The publisher version of this paper is available at https://doi.org/10.1016/j.simpat.2018.09.012. Please cite this paper as: "Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication. Simulation Modelling Practice and Theory, vol. 93 (May 2019), Elsevier".

∗ Corresponding Author. Address: Department of Computer Science and Engineering, University of Bologna, Mura Anteo Zamboni 7, I-40127, Bologna, Italy. Phone +39 0547 338886, Fax +39 051 2094510.

Among the different methodologies, Discrete Event Simulation (DES) is of particular
Email addresses: [email protected] (Gabriele D’Angelo), [email protected] (Stefano Ferretti), [email protected] (Moreno Marzolla)
Preprint submitted to Simulation Modelling Practice and Theory, March 28, 2019

interest, since it is frequently employed to model and analyze many types of systems, including computer architectures, communication networks, street traffic and others.

In a DES, the system is modeled as a set of entities that interact. The simulation has a state which evolves through the generation of events issued by simulated entities or by a (human or synthetic) supervisor of the simulation. Events occur at discrete points in time. The overall structure of a sequential event-based simulator is relatively simple: the simulator engine maintains a list, called Future Event List (FEL), of all pending events, sorted in non-decreasing time of occurrence. The execution of the simulation consists of a loop: at each iteration, the event with the lowest timestamp t is removed from the FEL, and the simulation time is advanced to t. Then, the event is executed, possibly triggering the generation of new events to be scheduled for execution at some future time.

Continuous advances in our understanding of complex systems, combined with the need for higher model accuracy, demand an increasing amount of computational power. The simulation of complex systems might generate a huge amount of events, due to the enormous number of entities to be simulated and the high rate of events they trigger. Just as an example, think of the Internet of Things (IoT), the network of physical devices, vehicles, home appliances and other items embedded with computational and communication capabilities, that nowadays is considered the most prominent infrastructure on top of which novel smart services will be implemented. Simulating such a kind of system is very demanding and imposes the use of sophisticated simulation techniques [2]. In this kind of scenario, sequential DES techniques become inappropriate for analyzing large or detailed models.
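The sequential event loop just described can be sketched as follows. This is a generic illustration of the FEL mechanism, not code from any specific simulator; the event names and the handler interface are invented for the example:

```python
import heapq

def run_simulation(initial_events, handlers, end_time):
    """Minimal sequential discrete-event loop.

    The Future Event List (FEL) is a heap of (timestamp, sequence, event)
    tuples, so events are always extracted in non-decreasing timestamp
    order; `sequence` breaks ties between simultaneous events."""
    fel = []
    seq = 0
    for ts, ev in initial_events:
        heapq.heappush(fel, (ts, seq, ev))
        seq += 1
    now = 0.0
    while fel:
        ts, _, ev = heapq.heappop(fel)      # event with the lowest timestamp
        if ts > end_time:
            break
        now = ts                            # advance the simulation time to t
        # executing the event may schedule new events at future times
        for new_ts, new_ev in handlers[ev](now):
            heapq.heappush(fel, (new_ts, seq, new_ev))
            seq += 1
    return now
```

For instance, a single self-rescheduling "tick" event (`handlers = {"tick": lambda now: [(now + 1.0, "tick")]}`) drives the clock forward one time unit per iteration until `end_time` is reached.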
DES must thus evolve into something that is able to handle simulations at larger scales.

An alternative approach, called Parallel Discrete Event Simulation (PDES), refers to the execution of a single discrete event simulation program on a parallel computer [3]. The goal is to parallelize the execution of the simulation events for better scalability.

Parallel And Distributed Simulation (PADS) is concerned with the execution of a simulation program on computing platforms containing multiple processors [4]. PADS takes advantage of multiple execution units to efficiently handle large simulation models. These execution units can be distributed across the Internet, or grouped as massively parallel computers or multicore processors. While PADS has been used for concurrent execution of many different simulation paradigms (e.g. continuous simulation, concurrent replication), this paper focuses on the distributed execution of discrete event simulations, i.e. we use PADS techniques for implementing DES models.

More in detail, in PADS the simulation model is partitioned into submodels, called Logical Processes (LPs), which can be evaluated concurrently by different Processing Elements (PEs). More precisely, the simulation model is described in terms of multiple interacting Simulated Entities (SEs) which are assigned to different LPs. Each LP runs on a different PE, where a PE is an execution unit acting as a container of a set of entities. The simulation execution consists of the exchange of timestamped messages, representing simulation events, between entities. Each LP has an incoming queue where messages are inserted before being dispatched to the appropriate entities. Without loss of generality, throughout this paper we will assume that a PE is a single core of a multicore processor.

Figure 1: Structure of a Parallel And Distributed Simulation that implements a Discrete Event Simulation model.
Figure 1 shows the general structure of a parallel and distributed simulator.

Clearly enough, PADS can strongly benefit from the use of cloud computing infrastructures. Cloud computing allows instantiating and dynamically maintaining computing (virtual) machines that meet arbitrarily varying resource requirements. Service level agreements can be employed in order to understand if the cloud provides the Quality-of-Service the user is expecting [5]. QoS guarantees, together with the possibility of arbitrarily adding or removing resources on demand, provide the simulationist with a very useful computing environment to execute complex simulations, without having to manage the computing infrastructure [6]. However, as in every distributed system, cloud virtual machines can fail. Thus, fault tolerance schemes are required [7].

Execution of long-running applications on increasingly larger parallel machines is likely to hit the reliability wall [8]. This means that, as the system size (number of components) increases, so does the probability that at least one of those components fails, therefore reducing the system Mean Time To Failure (MTTF). At some point the execution time of the parallel application may become larger than the MTTF of its execution environment, so that the application has little chance to terminate normally.

As a purely illustrative example, let us consider a PADS with L LPs. Let X_i be the stochastic variable representing the duration of uninterrupted operation of the i-th LP, 1 ≤ i ≤ L, taking into account both hardware and software failures. For the sake of simplicity, we assume that each LP resides on a different PE, so that each hardware failure (i.e. a PE crash) affects one LP only. Assuming that all X_i are independent and exponentially distributed (this assumption is somewhat unrealistic but widely used [9]), we have that
the probability P(X_i > t) that LP i operates without failures for at least t time units is P(X_i > t) = e^(−λt), where λ is the failure rate. The joint probability that all L LPs operate without failures for at least t time units is therefore R(L, t) = ∏_i P(X_i > t) = e^(−Lλt); this is the formula for the reliability of L components connected in series, where each component fails independently, and a single failure brings down the whole system.

Figure 2: System reliability of parallel and distributed simulation with different numbers of LPs, assuming that the MTTF for each LP is one year; higher is better, log scale on the horizontal axis.

Figure 2 shows the value of R(L, t) (the probability of no failures for at least t consecutive time units) for systems with L = 10, 100 and 1000 LPs, assuming that the MTTF of each LP is one year (λ ≈ 3.17 × 10⁻⁸ s⁻¹). We can see that the system reliability quickly drops as the number of LPs increases: a simulation involving L = 1000 LPs and requiring one day to complete is very unlikely to terminate successfully.

Although the model above is overly simplified, and is not intended to provide an accurate estimate of the reliability of actual PADS, it does show that building a reliable system out of a large number of unreliable parts is challenging.

To put the numbers above more in context, we report in Table 1 the number of cores in the top ten High Performance Computing (HPC) systems that appear on the June 2018 edition of the Top500 Supercomputer list (https://top500.org/lists/2018/06/, accessed August, 2018).

Name                    N. of cores   R_max (TFlop/s)   R_peak (TFlop/s)
Summit                    2,282,544         122,300.0          187,659.3
Sunway TaihuLight        10,649,600          93,014.6          125,435.9
Sierra                    1,572,480          71,610.0          119,193.6
Tianhe-2A                 4,981,760          61,444.5          100,678.7
ABCI                        391,680          19,880.0           32,576.6
Piz Daint                   361,760          19,590.0           25,326.3
Titan                       560,640          17,590.0           27,112.5
Sequoia                   1,572,864          17,173.2           20,132.7
Trinity                     979,968          14,137.3           43,902.6
Cori                        622,336          14,014.7           27,880.7

Table 1: The top ten HPC systems in the June 2018 Top500 Supercomputer list. R_max and R_peak are the maximum and theoretical peak LINPACK performance, respectively.

Five systems (Summit, Sunway TaihuLight, Sierra, Tianhe-2A, and Sequoia) have more than one million cores, while the others are in the range of hundreds of thousands. As the size of HPC systems grows, reliability issues become more and more relevant [10].

The reliability of HPC systems has been investigated, among others, in [11, 12]. In [11], the authors report failure rates observed over several years on production HPC systems. In general, it is well understood that no matter how reliable the basic components are, the future generation of supercomputers will experience an ever increasing stream of failures and must cope with them [8].

This paper describes a novel approach to deal with fault tolerance in PADS. The proposed solution, termed FT-GAIA, is a fault tolerant extension of the GAIA/ARTÌS parallel and distributed simulation middleware [14, 15]. FT-GAIA deals with crash errors and Byzantine faults by resorting to server groups [16]: simulation entities are replicated in the cloud / distributed computing system, so that the model can be executed even if some of them fail. This functional replication is implemented by adding a dedicated software layer in the GAIA/ARTÌS stack. The replication of all the simulated entities is transparent at the user level. Thus, FT-GAIA can be used as a drop-in replacement for GAIA/ARTÌS when fault tolerance is a major concern. Needless to say, fault tolerance increases the computational and communication loads at the LPs, thus causing a moderate performance overhead in the simulator.

The remainder of this paper is organized as follows. In Section 2 we review the state of the art related to fault tolerance in parallel and distributed simulation.
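The series-reliability formula R(L, t) = e^(−Lλt) introduced above is easy to evaluate numerically. The sketch below is our own illustration, using an MTTF of one year per LP as in the Figure 2 example:

```python
import math

def reliability(num_lps, t, mttf=365.25 * 24 * 3600):
    """R(L, t) = e^(-L * lambda * t): probability that none of the L LPs
    fails for at least t seconds, where each LP fails independently with
    exponential rate lambda = 1 / MTTF (MTTF in seconds, default one year)."""
    lam = 1.0 / mttf
    return math.exp(-num_lps * lam * t)

one_day = 24 * 3600
for L in (10, 100, 1000):
    print(L, reliability(L, one_day))
```

With these parameters a 10-LP simulation lasting one day completes without failures with probability above 97%, while at 1000 LPs the probability drops below 7%, matching the trend shown in Figure 2.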
2. Background and Related Work
In distributed systems, two typical approaches used to cope with hardware-related reliability are checkpointing and functional replication.

The checkpoint-restore paradigm requires the running application to periodically save its state on non-volatile storage (e.g. disk) so that it can resume execution from the last saved snapshot in case of failure. It should be observed that saving a snapshot may require considerable time; therefore, the interval between checkpoints must be carefully tuned to minimize the overhead.

Functional replication consists of replicating parts of the application on different execution nodes, so that failures can be tolerated if there is some minimum number of running instances of each component. Note that each component must be modified so that it is made aware that multiple copies of its peers exist, and can interact with all instances appropriately.

It is important to remark that functional replication is not effective against logical errors, i.e., bugs in the running applications, since the bug can be triggered at the same time on all instances. A prominent – and frequently mentioned – example is the failure of the Ariane 5 rocket, which was caused by a software error on its Inertial Reference Platforms (IRPs). There were two IRPs, providing hardware fault tolerance, but both used the same software. When the two software instances were fed with the same (correct) input from the hardware, the bug (an uncaught data conversion exception) caused both programs to crash, leaving the rocket without guidance [17]. The N-version programming technique [18] can be used to protect against software errors; it requires running several functionally equivalent programs that have been independently developed from the same specifications.

Although fault tolerance is an important and widely discussed topic in the context of distributed systems research, it has received comparatively little attention by the PADS community.
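The checkpoint-restore paradigm described above can be sketched in a few lines. This is a generic illustration (file names and state layout are invented), not code from any of the middlewares discussed in this section:

```python
import os
import pickle

def save_checkpoint(path, state):
    """Periodically snapshot the application state to non-volatile storage.
    Write to a temporary file and then atomically rename it, so a crash
    during the write cannot corrupt the last good checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def restore_checkpoint(path, initial_state):
    """On (re)start, resume from the last snapshot if one exists,
    otherwise start from the initial state."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return initial_state
```

The checkpoint interval is the key tuning knob: frequent snapshots shorten the recomputation after a failure but increase the overhead of serializing state, which is exactly the trade-off mentioned above.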
In what follows, we describe related work on simulation that deals with this issue.

In [19] the authors propose a rollback-based optimistic recovery scheme in which checkpoints are periodically saved on stable storage. The distributed simulation uses an optimistic synchronization scheme in which out-of-order (i.e. "straggler") events are handled according to the Time Warp protocol [20]. The novel idea of this approach is to model failures as straggler events with a timestamp equal to the last saved checkpoint. In this way, the authors can leverage the Time Warp protocol to handle failures.

In [21, 22] the authors propose a framework called Distributed Resource Management System (DRMS) to implement reliable IEEE 1516 federations [23]. The DRMS handles crash failures using checkpoints saved to stable storage, which are then used to migrate federates from a faulty host to a new host when necessary. The simulation engine is again based on an optimistic synchronization scheme, and the migration of LPs (the so-called "federates" in the IEEE 1516 terminology) is implemented through Web services.

In [24] the authors propose a decoupled federate architecture in which each IEEE 1516 federate is separated into a virtual federate process and a physical federate process. The former executes the simulation model and the latter provides middleware services at the back-end. This solution enables the implementation of fault tolerant distributed simulation schemes through migration of virtual federates.

The CUMULVS middleware [25] introduces support for fault tolerance and migration of simulations based on checkpointing. The middleware is not designed to support PADS, but it allows the migration of running tasks for load balancing and to improve a task's locality with a required resource.

A slightly different approach is proposed in [26], in which the authors introduce the Fault Tolerant Resource Sharing System (FT-RSS) framework.
The goal of FT-RSS is to build fault tolerant IEEE 1516 federations using an architecture in which a separate FTP server is used as a persistent storage system. The persistent storage is used to implement the migration of federates from one node to another. The FT-RSS middleware supports replication of federates, partial failures and fail-stop failures.

Recently, in [27] the authors proposed a transparent middleware for dealing with Byzantine faults in HLA-based parallel and distributed simulations. In this case, the solution is based on the use of replication, checkpointing and message logging technologies.

Finally, an approach based on virtualization techniques is described in [28]. The authors introduce a fault resilient framework that dynamically handles virtual machine failures inside the cloud environment. The proposed framework is based on state saving and snapshots of the processed event list that are implemented in each LP.
In [29] the authors propose the use of functional replication in Time Warp simulations with the aim of increasing the simulator performance and adding fault tolerance. Specifically, the idea is to keep copies of the most frequently used simulation entities at multiple sites with the aim of reducing message traffic and communication delay. This approach is used to build an optimistic fault tolerance scheme in which it is assumed that the objects are fault-free most of the time. The rollback capabilities of Time Warp are then used to correct intermittent and permanent faults.

In [30] the authors describe DARX, an adaptive replication mechanism for building reliable multi-agent systems. Being targeted to multi-agent systems, rather than PADS, DARX is mostly concerned with adaptability: agents may change their behavior at any time, and new agents may join or leave the system. Therefore, DARX tries to dynamically identify which agents are more "important", and what degree of replication should be used for those agents in order to achieve the desired level of fault tolerance. It should be observed that DARX only handles crash failures, while FT-GAIA also deals with Byzantine faults.
3. The GAIA/ARTÌS Middleware
To make this paper self-contained, we provide in this section a brief introduction to the GAIA/ARTÌS parallel and distributed simulation middleware; the interested reader is referred to [14] and the software homepage [31]. The
Advanced RTI System (ARTÌS) is a parallel and distributed simulation middleware loosely inspired by the Runtime Infrastructure described in the IEEE 1516 standard "High Level Architecture" (HLA) [32]. ARTÌS implements a parallel/distributed architecture where the simulation model is partitioned into a set of LPs [4]. As described in Section 1, the execution architecture in charge of running the simulation is composed of interconnected PEs, and each PE runs one or more LPs (usually, a PE hosts one LP).

In a PADS, the interactions between the model components are driven by message exchanges. The low computation/communication ratio makes PADS communication-bound, so that the wall-clock execution time of distributed simulations is highly dependent on the performance of the communication network (i.e. latency, bandwidth and jitter). Reducing the communication overhead can be crucial to speed up the event processing rate of PADS. This can be achieved by clustering interacting entities on the same physical host, so that communications can happen through shared memory.

Among the various services provided by ARTÌS, time management (i.e., synchronization) is fundamental for obtaining correct simulation runs that respect the causality dependencies of events. ARTÌS supports both conservative (Chandy-Misra-Bryant [33]) and optimistic (Time Warp [20]) synchronization algorithms. Moreover, a distributed implementation of time-stepped synchronization is included. The
Generic Adaptive Interaction Architecture (GAIA) [15, 31, 34] is a software layer built on top of ARTÌS. In GAIA, each LP acts as the container of some SEs: the simulation model is partitioned into its basic components (the SEs), which are allocated among the LPs. The system behavior is modeled by the interactions among the SEs; such interactions take the form of timestamped messages that are exchanged among the entities. From the user's point of view, a simulation model based on ARTÌS follows a Multi Agent System (MAS) approach. In fact, each SE is an autonomous agent that performs some actions (individual behavior) and interacts with other agents in the simulation.

In most cases, the interactions between the SEs of a PADS are not completely uniform, meaning that there are clusters of SEs whose internal interactions are more frequent. The structure of these clusters of highly interacting entities may change over time, as the simulation model evolves. The identification of such clusters is important to improve the performance of a PADS: indeed, by putting heavily-interacting entities on as few LPs as possible, we may replace most of the expensive LAN/WAN communications with more efficient shared memory messages.

In GAIA, the analysis of the communication pattern is based on a set of simple self-clustering heuristics [15] that are provided by the framework. All the provided heuristics are generic and not model dependent. For example, in the default heuristic, every few timesteps it is determined, for each SE, which LP is the destination of the largest percentage of its interactions. If it is not the LP in which the SE is contained, then a migration is triggered.
The migration of SEs among LPs is transparent to the simulation model developer; entity migration is useful not only to reduce the communication overhead, but also to achieve better load-balancing among the LPs, especially on heterogeneous execution platforms where execution units are not identical. In these cases, GAIA can migrate entities away from less powerful PEs, towards more capable processors if available.
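A toy version of this kind of self-clustering heuristic can be sketched as follows. The function and parameter names are invented for illustration; this is not GAIA's actual implementation, only the idea of the default heuristic described above (migrate a SE towards the LP that receives most of its recent traffic):

```python
from collections import Counter

def propose_migration(local_lp, recent_msg_dest_lps, threshold=0.5):
    """Return the LP this entity should migrate to, or None.

    recent_msg_dest_lps: list of LP identifiers that received this SE's
    messages during the last few timesteps. A migration is proposed only
    when some remote LP receives more than `threshold` of the traffic."""
    if not recent_msg_dest_lps:
        return None
    dest, count = Counter(recent_msg_dest_lps).most_common(1)[0]
    if dest != local_lp and count / len(recent_msg_dest_lps) > threshold:
        return dest
    return None
```

For example, a SE on LP 0 that sent three of its last four messages to entities hosted on LP 1 would be proposed for migration to LP 1, turning most of its remote messages into shared-memory ones.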
4. Fault Tolerant Simulation
FT-GAIA is a fault tolerant extension of the GAIA/ARTÌS distributed simulation middleware. As will be explained below, FT-GAIA uses functional replication of simulation entities to achieve tolerance against crashes and Byzantine failures of the PEs.

FT-GAIA is implemented as a software layer on top of GAIA and provides the same functionality as GAIA with only minor additions. Therefore, FT-GAIA is mostly transparent to the user, meaning that any simulation model built for GAIA can be easily ported to FT-GAIA. The FT-GAIA extension will be integrated in the next release of the GAIA/ARTÌS simulation middleware and will be available from the official GAIA/ARTÌS Web site [31].

FT-GAIA works by replicating simulation entities (see Fig. 3) to tolerate crash failures and Byzantine faults of the LPs. A crash may be caused by a failure of the hardware – including the network connection – or of the operating system. A Byzantine failure refers to an arbitrary behavior of an LP that causes the LP to crash, terminate abnormally, or send arbitrary messages (including no messages at all) to other LPs.

Figure 3: Layered structure of the FT-GAIA simulation engine. The user-defined simulation model defines a set of entities {A, B, C, D, E, F}; FT-GAIA creates multiple (in this example, 3) instances of each entity, that are handled by GAIA.

Replication is based on the following principle. If a conventional, non-fault tolerant distributed simulation is composed of N distinct simulation entities, FT-GAIA generates N × M entities, by creating M independent instances of each simulation entity. All instances A_1, …, A_M of the same entity A perform the same computation: if no fault occurs, they produce the same result.

Replication comes with a cost, both in terms of the additional processing power needed to execute all instances, and in terms of an increased communication load between the LPs. Indeed, if two entities A
and B communicate by sending a message from A to B, then after replication each instance A_i must send the same message to all instances B_j, 1 ≤ i, j ≤ M, resulting in M² (redundant) messages. Therefore, the level of replication M must be chosen wisely in order to achieve a good balance between overhead and fault tolerance, also depending on the types of failures (crash failures or Byzantine faults) that the user wants to address.

Handling crash failures.
A crash failure happens when an LP crashes, but operates correctly until it halts. When an LP terminates, all simulation entities running on that LP stop their execution and the local state of the computation is lost. From the theory of distributed systems, it is known that M instances of each simulation entity are required to tolerate up to (M − 1) crash failures. Each instance must be executed on a different LP, so that the failure of an LP only affects one instance of all entities executed there. This is equivalent to running M copies of a monolithic (sequential) simulation, with the difference that a sequential simulation does not incur communication and synchronization overhead. However, unlike sequential simulations, FT-GAIA can take advantage of more than M LPs, by distributing all the N × M entities on the available execution units. This reduces the workload on the LPs, reducing the wall-clock execution time of the simulation model.

Handling Byzantine Failures.
Byzantine failures include all types of abnormal behaviors of a PE. Examples are: the crash of a component of the distributed simulator (e.g., an LP or an entity); the transmission of erroneous/corrupted data from an entity to other entities; computation errors that lead to erroneous results. In this case M instances of each SE are necessary to tolerate up to ⌊(M − 1)/2⌋ Byzantine faults using the majority rule: a SE instance B_i can process an incoming message m from A_j when it receives one copy of m from the (strict) majority of the instances of sender A (the strict majority of M instances is ⌈(M + 1)/2⌉). This applies to synchronous systems where the message delay is bounded and faulty nodes cannot forge messages (i.e., messages are in some sense authenticated). Again, all M instances of each SE must be located on different LPs.

It is worth noting that GAIA (and therefore FT-GAIA) is based on a time-stepped approach, leading to a synchronous system. Moreover, the presence of a specific end-of-step synchronization message that needs to be received by all LPs represents a bound on the possible latency for correct messages. Thus, we can conclude that FT-GAIA works in a synchronous scenario.

The majority rule, as implemented in FT-GAIA, requires that the sequences of messages produced by each working instance of the same simulation entity are equal, i.e. the payload of the i-th message of each sequence is exactly the same. This comes from the fact that many simulation models require reproducibility of the results, irrespective of implementation details such as the number of LPs used, or how entities are mapped to the LPs. In turn, reproducibility requires that, once started, the behavior of the simulation as a whole is fully deterministic. However, there might be scenarios where strict determinism is not required, e.g. in mixed simulations relying on Monte Carlo methods [35] where different execution paths are actually required.
For such scenarios, Byzantine failures are difficult if not impossible to identify, because the messages produced by the instances of the same SE could be different yet correct. In these situations, deciding whether a message is correct or not would require some model-specific knowledge, if such knowledge exists at all. Extending FT-GAIA to allow the modeler to specify such knowledge is relatively straightforward, but so far we have not encountered any use case demanding it.

Allocation of Simulation Entities.
Once the level of replication M has been set, it is necessary to decide where to create the M instances of each SE, so that the constraint that each instance is located on a different LP is met. In FT-GAIA the deployment of instances is performed during the setup of the simulation model. In the current implementation, a centralized service keeps track of the initial location of all SE instances. When a new SE is created, the service creates the appropriate number of instances according to the redundancy model to be employed, and assigns them to the LPs so that all instances are located on different LPs. Note that all instances of the same SE receive the same initial seed for their internal pseudo-random number generators; this guarantees that their execution traces are the same, regardless of the LP where execution occurs and the degree of replication. At the cost of some extra coordination among the LPs, even the initial SE deployment could be decentralized. This is not challenging from the design viewpoint, but it would require a more complex implementation; it has therefore been left as future work.

Message Handling.
We have already stated that fault tolerance through functional replication has a cost in terms of increased message load among SEs. Indeed, for a replication level M (i.e., there are M instances of each SE) the number of messages exchanged between entities grows by a factor of M².

A consequence of message redundancy is that message filtering must be performed to avoid multiple copies of the same message being processed more than once by the same SE instance. FT-GAIA takes care of automatically filtering the excess messages according to the fault model adopted; filtering is done outside of the SEs, which are therefore totally unaware of this step. In the case of crash failures, only the first copy of each message that is received by a SE is processed; all further copies are dropped by the receiver. In the case of Byzantine failures with replication level M = 2f + 1, each entity must wait for at least (f + 1) copies of the same message before it can handle it. Once a strict majority has been reached, the message can be processed, and all further copies of the same message that might arrive later on can be dropped.

Entities Migration.
PADS can benefit from the migration of SEs to balance the computation/communication load and reduce the communication cost, by placing SEs that interact frequently "next" to each other (e.g. on the same LP) [15]. In FT-GAIA, entity migration is subject to a new constraint: the instances of the same SE can never reside on the same LP. More specifically, SE migration is handled by the underlying GAIA/ARTÌS middleware: each LP runs a clustering mechanism based on a heuristic function that tries to put together (on the same LP) the SEs that interact frequently through message exchanges. Special care is taken to avoid putting too many entities on the same LP, which would become a bottleneck. Once a new feasible allocation is found, the migration of a SE is implemented by moving its state variables to the destination LP. In other terms, our design choice has been to keep GAIA and FT-GAIA as separate as possible. In fact, the clustering heuristics used by GAIA are totally unaware of the functional replication of SEs. This has simplified the development of FT-GAIA as a separate software module, at the cost of using the generic self-clustering heuristics provided by GAIA. Most likely, specifically tailored heuristics would be able to obtain a better clustering of SEs when considering the presence of copies of the same SEs.
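The receiver-side filtering rules described in this section (first copy wins under crash failures; strict majority ⌈(M + 1)/2⌉ = M//2 + 1 under Byzantine failures) can be sketched as follows. This is a simplified illustration with an invented API, not FT-GAIA's code; in particular, a full Byzantine filter would also compare the payloads of the copies, which we omit here by identifying a message with (sender, msg_id):

```python
from collections import defaultdict

class MessageFilter:
    """Per-receiver filter for replicated messages. Each of the M instances
    of a sender entity transmits its own copy of every message."""

    def __init__(self, m, byzantine=False):
        # Strict majority of M instances is ceil((M+1)/2) = M // 2 + 1.
        # With crash failures only, the first copy received is enough.
        self.quorum = m // 2 + 1 if byzantine else 1
        self.copies = defaultdict(set)   # (sender, msg_id) -> instances seen
        self.delivered = set()

    def on_copy(self, sender, msg_id, instance):
        """Return True exactly once per message: when the quorum of distinct
        copies has been reached. All excess copies are dropped."""
        key = (sender, msg_id)
        if key in self.delivered:
            return False
        self.copies[key].add(instance)
        if len(self.copies[key]) >= self.quorum:
            self.delivered.add(key)
            return True
        return False
```

With M = 3 and Byzantine filtering, a message is handed to the SE only when copies from 2 distinct sender instances have arrived; under the crash-only model the very first copy is delivered and the remaining two are discarded.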
5. Experimental Performance Evaluation
In this section we evaluate a prototype implementation of FT-GAIA by implementing a simple simulation model of a Peer-to-Peer (P2P) communication system. The simulation model built on top of FT-GAIA is executed under different workload parameters that will be described in the following. The Wall Clock Time (WCT) of the simulation runs is recorded (excluding the time to set up the simulation), as well as other metrics of interest. The tests were performed on a cluster of workstations, each host being equipped with an Intel Core i5-4590 3.30 GHz processor with 4 physical cores and 8 GB of RAM. The operating system was Debian Jessie. The workstations are connected through a Fast Ethernet LAN.
We simulate a simple P2P communication protocol over randomly generated directed overlay graphs. Nodes of the graphs are peers, while links represent communication connections [36, 37]. In these overlays, all nodes have the same out-degree, which has been set to 5 in our experiments. During the simulation, each node periodically updates its neighbor set. Latencies for message transmission over overlay links are generated using a lognormal distribution [38].

The simulated communication protocol works as follows. Periodically, nodes send PING messages to other nodes, which in turn reply with a PONG message that is used by the sender to estimate the average latencies of the links (note that communication links are, in fact, bidirectional). The destination of a PING is randomly selected to be a neighbor (with probability p), or a non-neighbor (with probability 1 − p). A neighbor is a node that can be reached through an outgoing link in the directed overlay graph.

Each node of the P2P overlay is represented by a SE within some LP. Unless stated otherwise, each LP was executed on a different PE, so that no two LPs shared the same CPU core. Three different scenarios are considered: a no fault scenario, where no faults occur; a crash scenario, where crash failures occur; and finally a Byzantine scenario, where Byzantine faults occur.

We executed 15 independent replications of each simulation run. In most of the charts in this section, mean values are reported with a 99.5% confidence interval.
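As an illustration of the workload just described, the following Python sketch builds a random directed overlay with fixed out-degree and selects PING destinations. The function names, the seed, and the lognormal parameters are our own placeholders, not taken from the actual simulator:

```python
import random

def build_overlay(num_nodes, out_degree=5, seed=42):
    """Random directed overlay: every node gets `out_degree` distinct
    neighbors (illustrative; the paper's generator may differ)."""
    rng = random.Random(seed)
    return {n: rng.sample([m for m in range(num_nodes) if m != n], out_degree)
            for n in range(num_nodes)}

def ping_target(node, overlay, p, rng):
    """With probability p ping a neighbor, otherwise a random non-neighbor."""
    neighbors = overlay[node]
    if rng.random() < p:
        return rng.choice(neighbors)
    non_neighbors = [m for m in overlay if m != node and m not in neighbors]
    return rng.choice(non_neighbors)

def link_latency(rng, mu=4.0, sigma=0.5):
    """Lognormal link latency, as in [38]; the parameters are placeholders."""
    return rng.lognormvariate(mu, sigma)
```

Each PING/PONG round-trip simulated this way contributes two messages per probe, which is what drives the communication overhead discussed below.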
Figure 4 shows the WCT of the simulation, which was executed for 10000 timesteps with a varying number of SEs; recall that the number of SEs is equal to the number of nodes in the P2P overlay graph. The number of LPs was set to 3, 4, and 5; the number of hosts is equal to the number of LPs, so that each LP is executed on a different physical machine. The WCT for the three failure scenarios is shown (i.e., no failure, single crash and single Byzantine failure). In all cases, the adaptive migration heuristic provided by GAIA is disabled.

Results with 3 and 4 LPs are similar, with a slight improvement with 4 LPs. Conversely, a higher WCT is observed when 5 LPs are used. As expected, the higher the number of SEs, the higher the WCT. This happens because the simulation incurs a higher communication overhead. All curves show a similar trend; in particular, it is worth noting that the increment due to the fault management schemes is mainly caused by the higher number of messages that are exchanged among nodes.

Figure 4: Wall Clock Time as a function of the number of SEs, for a varying number of LPs. The number of hosts is equal to the number of LPs; migration is disabled. Lower is better.

Figures 5 and 6 show the WCT with 8000 and 16000 SEs with a varying number of LPs; again, each LP has been executed on a different physical host. The two charts emphasize the increase of the time required to complete the simulations with 5 LPs and in the presence of Byzantine faults. This is due to the increased number of messages exchanged among the LPs: each message needs to be sent to three (2M + 1) different destinations in order to guarantee the expected fault tolerance.

In the previous experiments, each LP has been allocated to a different host. Figure 7 shows the WCT when more than one LP is run on each host. In particular, the following setups are considered: (i) 4 LPs placed over 4 hosts (1 LP per host), (ii) 8 LPs placed over 8 hosts (1 LP per host), (iii) 8 LPs placed over 4 hosts (2 LPs per host), and (iv) 16 LPs over 4 hosts (4 LPs per host). Note that, in any case, the number of LPs per host never exceeds the number of cores per host, so that every LP runs on a separate processor core. For each setup, the three failure scenarios already mentioned (no failures, crash, Byzantine failures) are considered. Again, the migration heuristic provided by GAIA is disabled. Each curve in the figure is related to one of those scenarios, when varying the number of SEs. It is worth noting that, when two or more LPs run on the same host, they can communicate using shared memory rather than through the LAN; in this case the inter-LP communication is more efficient. For better readability, in this experiment the confidence intervals have been calculated but are not reported in the figure.

We observe that the scenario with 4 LPs over 4 hosts is influenced by the number of SEs and the failure scenario, while in the other cases it is the number of LPs that mainly determines the simulator performance. When 8 LPs are executed on 4 hosts, the performance is slightly better than the case where 8 LPs are
executed on 8 hosts. This is due to the better communication efficiency provided by shared memory with respect to the LAN interface.

The worst performance is measured when 16 LPs are executed on 4 hosts. This is due to the fact that the amount of computation in the simulation model is quite limited. Therefore, partitioning the SEs in 16 LPs has the effect of increasing the communication cost without any benefit from the computational point of view (i.e., in the model there is not enough computation to be parallelized).

Figure 5: Wall Clock Time as a function of the number of LPs, with 8000 SEs; migration is disabled. Lower is better.
The impact of the number of faults on the simulation WCT is now studied. Two different setups are considered, one with 5 LPs over 5 hosts (Figure 8), and one with 8 LPs over 4 hosts (Figure 9). The choice of 5 LPs is motivated by the fact that this is the minimum number of LPs that allows us to tolerate up to two Byzantine faults. Furthermore, the P2P simulation model used in this performance evaluation shows a significant degradation of performance when the number of LPs is larger than 8. As described before, this is due to the specific characteristics of the simulation model, in which there is a limited amount of computation that can be parallelized. On the other hand, partitioning the model on a large number of LPs sharply increases the communication cost. In more detail, the setup with 8 LPs on 4 hosts allows testing 3 Byzantine faults with 2 LPs per host in a setup with a limited communication overhead.
Figure 6: WCT as a function of the number of LPs, with 16000 SEs. Migration is disabled. Lower is better.
Figure 8 shows the WCTs measured with 0, 1 and 2 faults. Each curve refers to a scenario with 2000 or 6000 SEs with crash or Byzantine failures. As expected, the higher the number of faults, the higher the WCT, especially when Byzantine faults are considered. Indeed, in this case a higher number of communication messages is required among SEs in order to properly handle faults.

A higher WCT is measured with 8 LPs, as shown in Figure 9. In this case, the number of faults has a limited influence on the simulation performance. As before, the computational load of this simulation model is too low to gain from the partitioning in 8 LPs. In other words, the latency introduced by network communications is so high that both the number of SEs and the number of faults have a negligible impact on performance.
Finally, Figure 10 shows the WCT of a simulation composed of 4 LPs (in which each LP was executed on a different host) with different failure schemes, when the adaptive migration of SEs provided by the GAIA framework is enabled/disabled. Also in this case, for better readability, the confidence intervals are not reported in the figure.

The trend obtained with SE migration is similar to that obtained when no migration is performed, but the overall performance is better when the migration is turned off. This is due to the overhead introduced by the self-clustering heuristics and by the state of the SEs that is transferred between the LPs. In other words, the adaptive clustering of SEs, which in many other simulation models has provided a significant gain, is unable to give a speedup in this case.

The main motivation behind this result is that, in this prototype, we have decided to use the very general clustering heuristics that are already implemented in GAIA/ARTÌS. These heuristics assume that the simulation model is composed of a set of agents, each one with its specific behavior and communication pattern. In the case of FT-GAIA, this is not true: all the copies of a given SE share exactly the same behavior and interactions. Moreover, as described before, FT-GAIA adds the constraint that the instances of the same SE can never reside on the same LP. This constraint affects the free flow of SEs among the LPs and consequently reduces the clustering efficiency.

For these reasons, we think that more specific replication-aware clustering heuristics need to be designed to improve the clustering performance while balancing the overhead introduced by the fault tolerance mechanism.

Figure 7: WCT as a function of the number of SEs, with different numbers of LPs for each host; migration is disabled. Lower is better.

Figure 8: WCT as a function of the number of faults; 10000 timesteps with 5 LPs; migration is disabled. Lower is better.
6. Analytical Reliability Evaluation
In Section 4 we have seen that the FT-GAIA extension of the GAIA/ARTÌS middleware works by making M copies of each SE, ensuring that each copy resides on a different LP. This requirement, which we call the FT-GAIA constraint from now on, guarantees that FT-GAIA can tolerate up to M − 1 crash failures and up to ⌊(M − 1)/2⌋ Byzantine failures.

In this section we perform a reliability analysis of an FT-GAIA simulation to complement the experimental performance evaluation from Section 5. The goal of this analysis is to estimate the reliability of FT-GAIA when the number of failures is higher than the thresholds above; also, we want to study what happens if the FT-GAIA constraint is not enforced, that is, what happens if more than one instance of the same simulation entity is allowed to reside on the same LP. These kinds of analyses would be complex and time-consuming if performed through actual experiments as in the previous section, so we resort to a simpler probabilistic evaluation. We remark that the analysis below is only concerned with the system reliability, and does not consider any performance metric. Indeed, the content of this section is orthogonal to the performance analysis described in Section 5. Analytical performance models for distributed simulations have been proposed in the past [39], but their extension to FT-GAIA would be non-trivial and is outside the scope of this work.

We analyze the system reliability of FT-GAIA under crash or Byzantine failures of the LPs, since LPs are the basic components that can fail in FT-GAIA. Indeed, a crash of a whole host implies a crash of all the LPs running on it, and a crash of a SE implies a crash of the whole LP where the SE is executed.

The analysis presented below relies on the following assumptions:

• All crashes are permanent: a crashed LP is never brought back to a functioning state.
• Every LP has the same probability of crashing.
• All instances of each simulation entity are randomly and uniformly placed on the available LPs, either respecting or not respecting the FT-GAIA constraint (we will analyze both scenarios).
• SEs are never migrated from one LP to another.

While some of the assumptions above are quite limiting, they simplify the analysis considerably and still provide useful qualitative information.

Figure 9: WCT as a function of the number of faults; 2000 timesteps over 8 LPs; migration is disabled. Lower is better.
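The two fault-tolerance thresholds recalled at the beginning of this section can be written as one-line helpers (a Python sketch; the function names are ours, not part of FT-GAIA):

```python
def max_crash_faults(m: int) -> int:
    """With m copies of each SE on m distinct LPs, the simulation survives
    as long as at least one copy of every SE is alive: up to m - 1 crashes."""
    return m - 1

def max_byzantine_faults(m: int) -> int:
    """Majority voting over replicated messages needs ceil((m+1)/2) correct
    copies, so up to floor((m-1)/2) Byzantine LPs can be tolerated."""
    return (m - 1) // 2
```

For example, the M = 21 replicas used in the numerical examples below tolerate 20 crashes but only 10 Byzantine faults.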
Figure 10: WCT with SEs migration ON/OFF, as a function of the number of SEs. Lower is better.
Given a simulation with L LPs and N simulation entities, with M instances of each entity (1 ≤ M ≤ L), we assume that X randomly chosen LPs crash during the simulation (0 ≤ X ≤ L). We want to compute the system reliability, that is, the probability that a sufficient number of instances of each entity survives to ensure that the simulation produces the intended results. In the crash failure model, the reliability R_C is the probability that at least one instance of each entity resides on a LP that does not crash; in case of Byzantine failures, the reliability R_B is the probability that at least ⌈(M + 1)/2⌉ instances of each entity (the majority) reside on LPs that do not crash.

For each SE i, let N_i be the random variable denoting the number of instances of i that reside on LPs that did not crash. The pmf (probability mass function) Pr(N_i = k), 0 ≤ k ≤ M, can be derived easily by casting the original problem into an "urn problem". If k is greater than L − X, then Pr(N_i = k) is zero, since fewer than k LPs survived through the end of the simulation. If 0 ≤ k ≤ L − X, then Pr(N_i = k) is the probability of getting k white balls out of M extracted without replacement from an urn containing X black balls (representing crashed LPs) and L − X white balls (representing LPs that did not crash). Therefore we have:

\[
\Pr(N_i = k) = \begin{cases} \binom{X}{M-k}\binom{L-X}{k} \Big/ \binom{L}{M} & \text{if } 0 \le k \le L - X \\ 0 & \text{if } L - X < k \le M \end{cases} \tag{1}
\]

The system reliability R_C under the crash failure model is the probability that the simulation terminates successfully. This is the joint probability that N_i ≥ 1 for every entity i. If there are more instances of each SE than crashed LPs, then R_C = 1, since the FT-GAIA constraint ensures that there is at least one live instance of each entity. On the other hand, if M ≤ X ≤ L it may happen that all instances of the same entity fail, and the system reliability can then be computed in this case as:

\[
\prod_{i=1}^{N} \Pr(N_i \ge 1) = \prod_{i=1}^{N} \bigl(1 - \Pr(N_i = 0)\bigr) = \left[1 - \binom{X}{M} \Big/ \binom{L}{M}\right]^N
\]

Therefore, R_C is defined as:

\[
R_C = \begin{cases} 1 & \text{if } 0 \le X < M \\ \left[1 - \binom{X}{M} \big/ \binom{L}{M}\right]^N & \text{if } M \le X \le L \end{cases} \tag{2}
\]

Note that if X = L (all LPs failed) then R_C is zero, as expected. Also, observe that R_C tends to zero as the number of entities N approaches infinity.

The reliability R_B under the Byzantine failure model can be computed in a similar way. The minimum number of working instances of each SE that are required to guarantee that the simulation terminates is ⌈(M + 1)/2⌉. If the number of failures X is strictly lower than ⌈(M + 1)/2⌉, then R_B = 1. If the number of failed LPs is greater than or equal to ⌈(M + 1)/2⌉, the reliability becomes strictly less than 1 and can be computed as the joint probability that the majority of the instances of each entity i are active:

\[
\prod_{i=1}^{N} \Pr\bigl(N_i \ge \lceil (M+1)/2 \rceil\bigr) = \prod_{i=1}^{N} \sum_{k=\lceil (M+1)/2 \rceil}^{M} \Pr(N_i = k) = \left[\sum_{k=\lceil (M+1)/2 \rceil}^{M} \Pr(N_i = k)\right]^N
\]

Hence we have:

\[
R_B = \begin{cases} 1 & \text{if } 0 \le X < \lceil (M+1)/2 \rceil \\ \left[\sum_{k=\lceil (M+1)/2 \rceil}^{M} \Pr(N_i = k)\right]^N & \text{if } \lceil (M+1)/2 \rceil \le X \le L \end{cases} \tag{3}
\]

Figure 11 shows the reliability of FT-GAIA using L = 100 LPs with M = 21 instances of each entity, as a function of the number of crashes X. Under the crash failure model (top figure) the system tolerates up to M − 1 failures, while under the Byzantine failure model (bottom figure) it tolerates up to ⌈(M + 1)/2⌉ − 1 failures. When X exceeds these thresholds, the reliability drops; in fact, R_B drops faster than R_C, because the Byzantine failure model requires a higher number of active instances to guarantee that the simulation terminates successfully.

Figure 12 shows the reliability of FT-GAIA as a function of the number of entities N for different numbers of faults X (note that the values of X differ for the crash and Byzantine failure models); we assume L = 100 LPs and M = 21 instances of each entity.
Protecting the simulation against Byzantine faults requires a higher number of active instances for each SE, since the model is more general than the crash failure model. The drawback is that the reliability R_B drops very quickly as N increases, even when the number of faults X only slightly exceeds the threshold. Therefore, the user must be aware that Byzantine faults are much more sensitive to the choice of the "correct" value of M than crash failures.

We now study what would happen if the FT-GAIA constraint were not applied, i.e., if FT-GAIA were allowed to put more than one instance of the same entity on the same LP. Given a simulation with L LPs, N entities that are replicated M times, and X LPs that crash during the simulation, let N*_i be the number of surviving instances of entity i under the assumption that the FT-GAIA constraint does not apply. This scenario can again be analyzed as an urn problem, in this case with the balls extracted with replacement. The random variables N*_i follow a binomial distribution B(M, (L − X)/L), so we have:

\[
\Pr(N^*_i = k) = \binom{M}{k} \left(\frac{L-X}{L}\right)^{k} \left(\frac{X}{L}\right)^{M-k}
\]

As above, the system reliability R*_C under the crash failure model can be expressed as:

\[
R^*_C = \prod_{i=1}^{N} \Pr(N^*_i \ge 1) = \prod_{i=1}^{N} \bigl(1 - \Pr(N^*_i = 0)\bigr) = \left[1 - \left(\frac{X}{L}\right)^{M}\right]^N \tag{4}
\]

Eq. (4) tells us that the system reliability R*_C is strictly less than 1 even in the presence of a single crash failure. Indeed, if the instances of each SE are randomly placed on the LPs, there is a small but non-negligible probability that all instances of, say, entity i are placed on the same LP that will crash, aborting the whole simulation. This cannot happen if the FT-GAIA constraint is enforced.

Figure 13 compares the system reliability with and without the FT-GAIA constraint. We consider a system with L = 100 LPs and N = 10 simulation entities that are replicated M = 21 times. The FT-GAIA constraint allows the system to sustain up to M − 1 crash failures: when X < M, the reliability R_C computed using Eq. (2) is 1. Without the constraint, when X < M the reliability R*_C computed using Eq. (4) is slightly less than 1; however, the difference is so tiny as to be almost negligible. Indeed, Eq. (4) shows that the probability that all instances of one SE reside on the same (crashed) LP gets smaller as the number of replicas M increases. However, it is important to remember that this is true only if the SE instances are randomly placed on the LPs. In practice, the placement is not random, at least when the automatic clustering and migration facilities of GAIA/ARTÌS are enabled. Indeed, GAIA/ARTÌS monitors the communication pattern of the SEs, and migrates those that exhibit a high level of interaction onto the same LP, to reduce the number of remote communications [15]. If the placement of SEs is not random, the FT-GAIA constraint becomes essential to limit the probability that too many instances of the same SE fail at the same time.

We can use the results above to provide some guidelines on how the replication level M can be chosen in practice.
Note that choosing the "best" value of M is a difficult problem, since the answer depends on the simulation model that is executed, on the execution environment, and on the failure model that is considered.

If the user requests a strong guarantee that the simulation run is completed without failures, then it is necessary to choose a value of M that produces a system reliability equal to 1. Assuming that the FT-GAIA constraint is enforced, Eqs. (2) and (3) tell us that the system reliability is 1 if the number of expected failures X is strictly less than M for the crash failure model, and strictly less than ⌈(M + 1)/2⌉ for the Byzantine failure model.

The number of expected failures X can be expressed as

\[
X = L\lambda t \tag{5}
\]

where λ is the failure rate of each LP, and t is the duration of the simulation run. Both parameters can be estimated empirically; in particular, λ can be computed as the inverse of the MTTF, a quantity that can be easily observed from the operational history of the system.

Therefore, the simulation can be completed with probability 1 in the crash failure model if X < M; taking into account Eq. (5) we get:
\[
M > L\lambda t \tag{6}
\]

Similarly, the simulation can be completed with probability 1 in the Byzantine failure model if X < ⌈(M + 1)/2⌉; again, taking into account Eq. (5) we get:

\[
M > 2L\lambda t - 1 \tag{7}
\]

A value of M satisfying (6) or (7) is the replication level that provides the strongest guarantee to complete the simulation, under the simplifying assumptions stated at the beginning of this section.

The experimental evaluation illustrated in Section 5 shows that providing protection against Byzantine failures is more costly in terms of wall clock time; however, Byzantine failures are more general than crash failures. If the user trusts the computation and assumes that a running SE will always compute the correct result, the more lax crash failure model can be considered, allowing a lower replication level M to be chosen.
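A small helper (illustrative Python, our own code) turns these bounds into the smallest integer replication level, with λ expressed as 1/MTTF:

```python
import math

def replication_level(L, lam, t, byzantine=False):
    """Smallest integer M with reliability 1 under this section's
    assumptions: x = L * lam * t expected LP failures (Eq. 5); the crash
    model requires M > x, the Byzantine model M > 2x - 1 (derived from
    X < ceil((M+1)/2)). Illustrative helper, not part of FT-GAIA."""
    x = L * lam * t                       # expected number of failures
    bound = 2 * x - 1 if byzantine else x
    return max(1, math.floor(bound) + 1)  # smallest integer strictly > bound
```

For instance, with L = 100 LPs, an MTTF of 1000 hours per LP and a 10-hour run (one expected failure), both models yield M = 2; the result must of course also satisfy M ≤ L.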
7. Conclusions and Future Work
In this paper we described an approach to provide fault tolerance through functional replication in parallel and distributed simulations. Our solution, called FT-GAIA, is an extension of the GAIA/ARTÌS simulation middleware that acts transparently to the user who creates and manages the simulation. Fault tolerance is provided by replicating simulation entities and distributing them on multiple execution nodes. This is a particularly important issue to cope with, especially if we expect to have execution nodes running complex simulations over virtual machines hosted by public or private cloud systems. Replication of their execution guarantees tolerance to crash failures and Byzantine faults of computing nodes. In order to mitigate the cost of communication among simulation entities, the middleware exploits an automatic migration of simulated entities among execution nodes, with the aim of balancing the computational load and minimizing the communication overhead.

A preliminary performance evaluation of FT-GAIA has been presented, based on a prototype implementation. Results show that a high degree of fault tolerance can be achieved, at the cost of a moderate increase in the computational load of the execution units. Moreover, a probabilistic model that drives an analytical evaluation of the proposed scheme has been introduced.

As future work, we aim at improving the efficiency of FT-GAIA by leveraging ad hoc clustering heuristics that are aware of the fault tolerance mechanism implemented by FT-GAIA; for example, by evaluating the impact on the clustering of all the copies of a given simulation entity instead of considering each entity by itself. Indeed, we believe that specifically tuned clustering and load balancing mechanisms can significantly reduce the overhead introduced by the replication of the simulated entities.
Another aspect that needs to be investigated is the impact of the functional replication on the different synchronization algorithms used in distributed simulations, e.g. the Chandy-Misra-Bryant (CMB) conservative approach based on NULL messages [33], or the Time Warp optimistic protocol [20] based on rollbacks, which are the most commonly used in practice.
Symbols

L := Number of Logical Processes (LPs)
N := Number of Simulation Entities (SEs)
M := Number of copies of each SE (M ∈ {1, . . . , L})
X := Number of crashed LPs (X ∈ {0, . . . , L})
N_i := Number of instances of SE i that do not crash
R_C := System reliability under the crash failure model
R*_C := System reliability under the crash failure model (without the FT-GAIA constraint)
R_B := System reliability under the Byzantine failure model

Acronyms

DES Discrete Event Simulation
FEL Future Event List
GVT Global Virtual Time
HPC High Performance Computing
IRP Inertial Reference Platform
LVT Local Virtual Time
LP Logical Process
MTTF Mean Time To Failure
PADS Parallel And Distributed Simulation
PDES Parallel Discrete Event Simulation
PE Processing Element
SE Simulated Entity
WCT Wall Clock Time

References

[1] G. D'Angelo, S. Ferretti, M. Marzolla, L. Armaroli, Fault-tolerant adaptive parallel and distributed simulation, in: Proceedings of the 20th ACM/IEEE International Symposium on Distributed Simulation and Real Time Applications (DS-RT), DS-RT '16, IEEE Computer Society, Washington, DC, USA, 2016, pp. 37–44. doi:10.1109/DS-RT.2016.11.
[2] G. D'Angelo, S. Ferretti, V. Ghini, Multi-level simulation of Internet of Things on smart territories, Simulation Modelling Practice and Theory (SIMPAT) 73 (2017) 3–21. doi:10.1016/j.simpat.2016.10.008.
[3] R. M. Fujimoto, Parallel discrete event simulation, Commun. ACM 33 (10) (1990) 30–53. doi:10.1145/84537.84545.
[4] R. M. Fujimoto, Parallel and distributed simulation systems, Wiley series on parallel and distributed computing, Wiley, 2000.
[5] S. Ferretti, V. Ghini, F. Panzieri, M. Pellegrini, E. Turrini, QoS-aware clouds, in: Proc. 2010 IEEE 3rd Int. Conf. on Cloud Computing, CLOUD '10, IEEE Computer Society, 2010, pp. 321–328. doi:10.1109/CLOUD.2010.17.
[6] M. Marzolla, S. Ferretti, G. D'Angelo, Dynamic resource provisioning for cloud-based gaming infrastructures, Comput. Entertain. 10 (1) (2012) 4:1–4:20. doi:10.1145/2381876.2381880.
[7] R. M. Fujimoto, Research challenges in parallel and distributed simulation, ACM Trans. Model. Comput. Simul. 26 (4) (2016) 22:1–22:29. doi:10.1145/2866577.
[8] X. Yang, Z. Wang, J. Xue, Y. Zhou, The reliability wall for exascale supercomputing, IEEE Transactions on Computers 61 (6) (2012) 767–779. doi:10.1109/TC.2011.106.
[9] G. Bolch, S. Greiner, H. de Meer, K. Trivedi, Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications, Wiley, 1998.
[10] X. Yang, Z. Wang, J. Xue, Y. Zhou, The reliability wall for exascale supercomputing, IEEE Transactions on Computers 61 (6) (2012) 767–779. doi:10.1109/TC.2011.106.
[11] B. Schroeder, G. Gibson, A large-scale study of failures in high-performance computing systems, IEEE Transactions on Dependable and Secure Computing 7 (4) (2010) 337–350. doi:10.1109/TDSC.2009.4.
[12] N. El-Sayed, B. Schroeder, Reading between the lines of failure logs: Understanding how HPC systems fail, in: 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2013, pp. 1–12. doi:10.1109/DSN.2013.6575356.
[13] I. P. Egwutuoha, D. Levy, B. Selic, S. Chen, A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems, The Journal of Supercomputing 65 (3) (2013) 1302–1326. doi:10.1007/s11227-013-0884-0.
[14] L. Bononi, M. Bracuto, G. D'Angelo, L. Donatiello, Scalable and efficient parallel and distributed simulation of complex, dynamic and mobile systems, in: Proceedings of the 2005 Workshop on Techniques, Methodologies and Tools for Performance Evaluation of Complex Systems, IEEE Computer Society, Washington, DC, USA, 2005. doi:10.1109/FIRB-PERF.2005.17.
[15] G. D'Angelo, The simulation model partitioning problem: an adaptive solution based on self-clustering, Simulation Modelling Practice and Theory (SIMPAT) 70 (2017) 1–20. doi:10.1016/j.simpat.2016.10.001.
[16] F. Cristian, Understanding fault-tolerant distributed systems, Commun. ACM 34 (2) (1991) 56–78. doi:10.1145/102792.102801.
[17] M. Dowson, The Ariane 5 software failure, SIGSOFT Softw. Eng. Notes 22 (2) (1997) 84. doi:10.1145/251880.251992.
[18] A. Avizienis, The N-version approach to fault-tolerant software, IEEE Trans. Softw. Eng. 11 (12) (1985) 1491–1501. doi:10.1109/TSE.1985.231893.
[19] O. P. Damani, V. K. Garg, Fault-tolerant distributed simulation, in: Proceedings of the Twelfth Workshop on Parallel and Distributed Simulation, PADS '98, IEEE Computer Society, Washington, DC, USA, 1998, pp. 38–45. doi:10.1145/278008.278014.
[20] D. R. Jefferson, Virtual time, ACM Trans. Program. Lang. Syst. 7 (3) (1985) 404–425. doi:10.1145/3916.3988.
[21] M. Eklöf, F. Moradi, R. Ayani, A framework for fault-tolerance in HLA-based distributed simulations, in: Proceedings of the 37th Conference on Winter Simulation, WSC '05, Winter Simulation Conference, 2005, pp. 1182–1189.
[22] M. Eklöf, R. Ayani, F. Moradi, Evaluation of a fault-tolerance mechanism for HLA-based distributed simulations, in: Proceedings of the 20th Workshop on Principles of Advanced and Distributed Simulation, PADS '06, IEEE Computer Society, Washington, DC, USA, 2006, pp. 175–182. doi:10.1109/PADS.2006.18.
[23] IEEE Standard for Modeling and Simulation (M&S) High Level Architecture (HLA)–Framework and Rules, IEEE Std 1516-2010 (Revision of IEEE Std 1516-2000) (2010). doi:10.1109/IEEESTD.2010.5553440.
[24] D. Chen, S. J. Turner, W. Cai, M. Xiong, A decoupled federate architecture for high level architecture-based distributed simulation, Journal of Parallel and Distributed Computing 68 (11) (2008) 1487–1503. doi:10.1016/j.jpdc.2008.07.010.
[25] J. A. Kohl, P. M. Papadopoulas, Efficient and flexible fault tolerance and migration of scientific simulations using CUMULVS, in: Proceedings of the SIGMETRICS Symposium on Parallel and Distributed Tools, SPDT '98, ACM, New York, NY, USA, 1998, pp. 60–71. doi:10.1145/281035.281042.
[26] J. Lüthi, S. Großmann, FT-RSS: A flexible framework for fault tolerant HLA federations, in: Computational Science - ICCS 2004: 4th International Conference, Kraków, Poland, June 6-9, 2004, Proceedings, Part III, Springer Berlin Heidelberg, Berlin, Heidelberg, 2004, pp. 865–872. doi:10.1007/978-3-540-24688-6_111.
[27] Z. Li, W. Cai, S. J. Turner, Z. Qin, R. S. M. Goh, Transparent three-phase Byzantine fault tolerance for parallel and distributed simulations, Simulation Modelling Practice and Theory 60 (2016) 90–107. doi:10.1016/j.simpat.2015.09.012.
[28] A. W. Malik, I. Mahmood, Crash me inside the cloud: A fault resilient framework for parallel and discrete event simulation, in: Proceedings of the Summer Simulation Multi-Conference, SummerSim '17, Society for Computer Simulation International, San Diego, CA, USA, 2017, pp. 1:1–1:10. URL http://dl.acm.org/citation.cfm?id=3140065.3140066
[29] D. Agrawal, J. R. Agre, Replicated objects in time warp simulations, in: Proceedings of the 24th Conference on Winter Simulation, WSC '92, ACM, New York, NY, USA, 1992, pp. 657–664. doi:10.1145/167293.167662.
[30] Z. Guessoum, J.-P. Briot, N. Faci, O. Marin, Towards reliable multi-agent systems: An adaptive replication mechanism, International Journal of MultiAgent and Grid Systems 6 (1). doi:10.3233/MGS-2010-0139.
[31] Parallel And Distributed Simulation (PADS) research group, http://pads.cs.unibo.it (2018).
[32] IEEE 1516 Standard, Modeling and Simulation (M&S) High Level Architecture (HLA) (2000).
[33] K. M. Chandy, J. Misra, Asynchronous distributed simulation via a sequence of parallel computations, Commun. ACM 24 (4) (1981) 198–206. doi:10.1145/358598.358613.
[34] G. D'Angelo, M. Marzolla, New trends in parallel and distributed simulation: From many-cores to cloud computing, Simulation Modelling Practice and Theory (SIMPAT). doi:10.1016/j.simpat.2014.06.007.
[35] R. Y. Rubinstein, D. P. Kroese, Simulation and the Monte Carlo method, 3rd edition, Wiley, 2016.
[36] G. D'Angelo, S. Ferretti, Simulation of scale-free networks, in: Proc. of International Conference on Simulation Tools and Techniques, Simutools '09, 2009, pp. 20:1–20:10. doi:10.4108/ICST.SIMUTOOLS2009.5672.
[37] G. D'Angelo, S. Ferretti, Highly intensive data dissemination in complex networks, Journal of Parallel and Distributed Computing 99 (2017) 28–50. doi:10.1016/j.jpdc.2016.08.004.
[38] J. Färber, Network game traffic modelling, in: Proceedings of the 1st Workshop on Network and System Support for Games, NetGames '02, ACM, New York, NY, USA, 2002, pp. 53–57. doi:10.1145/566500.566508.
[39] F. Quaglia, V. Cortellessa, B. Ciciani, Trade-off between sequential and time warp-based parallel simulation, IEEE Trans. Parallel Distrib. Syst. 10 (8) (1999) 781–794. doi:10.1109/71.790597.
Figure 11: Reliability of FT-GAIA to crash (top) and Byzantine (bottom) failures, as a function of the number of failures X; we assume L = 100 LPs and M = 21 instances of each entity. The vertical line is at M.
Figure 12: Reliability of FT-GAIA for crash (top) and Byzantine (bottom) failures, as a function of the number of simulation entities N, with L = 100 LPs and M = 21 instances of each entity.
Figure 13: Reliability with and without the FT-GAIA constraint as a function of the number of failed LPs X, with L = 100 LPs, N = 10 entities and M = 21 instances of each entity (vertical line).