SPECI-2: An open-source framework for predictive simulation of cloud-scale data-centres
Ilango Leonardo Sriram, Dave Cliff
Department of Computer Science, University of Bristol, United Kingdom
{ilango, dc}@cs.bris.ac.uk

Keywords: Data Centre, Cloud Computing, Middleware, Normal Failure, Discrete Event Simulation.

Abstract: We introduce Version 2 of SPECI, a system for predictive simulation modelling of large-scale data-centres, i.e. warehouse-sized facilities containing hundreds of thousands of servers, as used to provide cloud services.
INTRODUCTION

We introduce Version 2 of SPECI (Simulation Program for Elastic Computing Infrastructure), a system for predictive simulation modelling of ultra-large-scale data-centres (DCs), i.e. warehouse-sized facilities containing hundreds of thousands of servers, as used to provide cloud computing services.

The move toward cloud computing is driving the construction of ever bigger DCs. For example, Microsoft's latest cloud-computing DC in Chicago has an estimated budget of US$500m and capacity for 224,000 blade-servers (Miller, 2009). The scale of such facilities means that their designers have to work with data from development and testing set-ups that are often several orders of magnitude smaller than the final product. But architectures and management policies that work on a few hundred servers may not scale well to facilities housing hundreds of thousands (Jogalekar and Woodside, 2000). However, although predictive simulation models have become commonplace (Hey et al., 2009), there is no well-established simulator for evaluating DC designs. A realistic simulator is difficult to achieve, as it needs to accommodate many models, such as network connectivity or disk access models, and even heterogeneity; moreover, many of these models lack a uniform definition: e.g. although many clouds use virtualisation, some use MapReduce. We believe a set of simulation tools will be required, each modelling different aspects of the cloud, and we present SPECI-2 for modelling middleware policy distribution in virtualised cloud DCs.

This paper explains the SPECI model, the changes over the previous version and the reasons for these changes, and details of the implementation.
THE SPECI-2 MODEL

SPECI-2's goal remains to answer the same questions and requirements brought to SPECI-1 and described in (Sriram, 2009): consistency in middleware policy distribution. Among practitioners there is an understanding that middleware for ultra-large DCs can only operate at such scale if it is broken into policies which are distributed to the managed components and executed locally, as opposed to using centralised control components. Because the middleware's settings and available resources change very frequently, and changes can originate at arbitrary locations, new policies need to be continuously communicated to the nodes. Core to this problem are communication protocols, which allow components in the DC to communicate with other components, and the component-subscription network topology, in which services follow the status changes of a subset of other components, where dependencies exist, in the form of subscriptions.

The SPECI simulation models a DC hosting a number n of cloud services, which are connected through the subscription network. Each of these services has a state that can change at a rate f. Depending on f and the update protocol in place, some services' subscriptions will become inconsistent with the current state, and inconsistencies might be propagated through the network before the system returns to a consistent state. At every unit of time, SPECI provides a monitoring probe of the current number of inconsistencies and of the number of network packets dealt with by every component in the DC.

CHANGES FROM SPECI-1

SPECI-2 is the first public release, revised from our experiences with earlier experimental versions of SPECI, which were the source of several previous peer-reviewed publications (Sriram, 2009; Sriram and Cliff, 2010; Sriram and Cliff, 2011). In our initial work with SPECI we used the simplification of single-hop connectivity between components. Since then, Barroso and Hölzle (2009) have published details about the hierarchies in Google's DCs and about the relations between network connection costs for interconnects. A DC is now modelled to contain aisles or clusters; each aisle contains racks; each rack contains chasses; each chassis contains blades; and each blade contains or runs cloud services. The quantification of this hierarchy can be specified in the configuration of the simulation run. When a cloud service communicates with another cloud service, the communication now follows the component tree.

Second, SPECI-1 used to poll a Boolean aliveness status, to see whether a subscribed component was alive or had failed. This was a simplification of the polling of policies that can be changed by any service, where the simulator measures the consistency of the middleware policies in place. SPECI-2 now has an Integer representing the version number of the policy currently in place, and this version number can be incremented over time. All subscribed nodes need to learn of such an update, as it could potentially mean new security settings or other new behaviour. To continue to accommodate component failure, version number 0 represents a failed service. Cloud services with policy version 0 thus no longer participate in updates and no longer generate any load.
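To make the model concrete, the following minimal Java sketch shows a leaf cloud service under assumptions of our own: all class, field, and method names here are hypothetical illustrations, not SPECI-2's actual API, and the hop-cost values are merely indicative. It renders the integer policy version (with 0 marking a failed service), a hop-cost helper that follows the component tree, and the consistency probe that counts nodes holding a stale view of any of their subscriptions.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names and cost values are assumptions, not SPECI-2 code.
final class CloudService {
    final int aisle, rack, chassis, blade;      // position in the DC hierarchy
    int policyVersion = 1;                      // version 0 represents a failed service
    final List<CloudService> subscriptions = new ArrayList<>();

    CloudService(int aisle, int rack, int chassis, int blade) {
        this.aisle = aisle; this.rack = rack; this.chassis = chassis; this.blade = blade;
    }

    boolean hasFailed() { return policyVersion == 0; }

    // Communication follows the component tree: the assumed cost grows with the
    // highest hierarchy level at which the two services' positions differ.
    static int hopCost(CloudService a, CloudService b) {
        if (a.aisle != b.aisle) return 4;       // route via the aisle level
        if (a.rack != b.rack) return 3;         // route via the rack level
        if (a.chassis != b.chassis) return 2;   // route via the chassis level
        return 1;                               // same blade: unit cost, as in the flat model
    }

    // The monitoring probe: a node is inconsistent if the policy version it last
    // received for any subscription differs from that service's current version.
    // knownVersions[i][j] is the version node i last saw for its j-th subscription.
    static int countInconsistent(CloudService[] services, int[][] knownVersions) {
        int inconsistent = 0;
        for (int i = 0; i < services.length; i++) {
            if (services[i].hasFailed()) continue;          // failed nodes generate no load
            List<CloudService> subs = services[i].subscriptions;
            for (int j = 0; j < subs.size(); j++) {
                if (knownVersions[i][j] != subs.get(j).policyVersion) {
                    inconsistent++;
                    break;                                  // one stale subscription suffices
                }
            }
        }
        return inconsistent;
    }
}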
A further new requirement for SPECI-2 is non-functional: SPECI-1 suffered from weak performance and in particular from a heavy memory footprint. At runtime, this type of simulation depends more on the system memory requirements than on the available CPU cycles, as it is designed to run in-memory, and even in current HPC centres it is not common to find nodes with more than 10 GB per core. For this reason, SPECI-2 no longer maintains a Java object for every component and every network link, but only for the cloud services, which represent the components at the leaves of the DC hierarchy tree. It also no longer uses a generic DES simulation engine with heavyweight multi-purpose queues, but a customised event queue that is more efficient because it exploits knowledge of the character of the set-up for resource optimisation: only the nearest events in time are kept in sorted order. This saves memory and computational requirements for the queue, as in this simulation set-up most insert operations join the unsorted queue. Further, the one-to-one relationship between components in the DC and Java objects was removed and aggregated into singleton classes containing the behavioural logic. To save further memory, monitoring data is no longer kept in memory at run-time, but stored in persistent files and analysed in post-processing. The SPECI-2 simulator shows a JVM memory footprint of 5.5 GB RAM when modelling a DC with 10^5 cloud services, each with 316 subscriptions. This is a remarkably smaller footprint than that of SPECI-1, which required 25 GB for this configuration.
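The two-list queue idea can be sketched roughly as follows, again as an illustration under assumed names rather than the actual implementation: events falling within a near-time window are kept in a small sorted list, while all later events are appended unsorted and are only sorted when the window advances.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative two-list event queue: only the nearest events in time are sorted.
final class EventQueue {
    record Event(double time, int id) {}                // time stamp and event id

    private final List<Event> near = new ArrayList<>(); // sorted; next event first
    private final List<Event> far = new ArrayList<>();  // unsorted bulk of events
    private final double window;                        // width of the near window
    private double horizon;                             // time boundary of 'near'

    EventQueue(double window) { this.window = window; this.horizon = window; }

    void schedule(Event e) {
        if (e.time() < horizon) {
            // A costly sorted insert, but rare: most newly arriving events lie
            // further in the future than the events already queued.
            int i = 0;
            while (i < near.size() && near.get(i).time() <= e.time()) i++;
            near.add(i, e);
        } else {
            far.add(e);                                 // cheap O(1) append
        }
    }

    Event next() {
        while (near.isEmpty() && !far.isEmpty()) refill();
        return near.isEmpty() ? null : near.remove(0);
    }

    private void refill() {                             // advance the near window
        horizon += window;
        for (int i = far.size() - 1; i >= 0; i--) {
            if (far.get(i).time() < horizon) near.add(far.remove(i));
        }
        near.sort(Comparator.comparingDouble(Event::time));
    }
}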
SIMULATION WORKFLOW

A typical simulation run involves three scripts. The first generates a set of properties files, one for each combination of configuration parameters that is to be simulated. The SPECI-2 simulator takes such a configuration file as input, runs the simulation, and writes the monitoring probes to a comma-separated values (CSV) file. The output of these runs consists of the monitoring probes, which record the current simulation time, the numbers of consistent and inconsistent services, the load, and the maximum local loads. For easy portability, the output is written to files and not to a database. Hence, for the post-processing after the simulation runs, the third script is used to merge the content of the many output files into one, analyse the data statistically, calculate means and confidence intervals, and finally generate graphs. There are a few Python scripts that create the graphs using matplotlib; which one to choose depends on the desired output.
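As an illustration, one generated configuration file might look like the following. All key names and values here are hypothetical, chosen to echo the parameters discussed in this paper rather than to reproduce SPECI-2's actual property names.

# Hypothetical SPECI-2 run configuration; key names are illustrative only.
# Total number of cloud services (n) in the modelled DC
numCloudServices = 100000
# DC hierarchy, as in the hierarchical experiment reported below
servicesPerBlade = 4
bladesPerChassis = 16
chassesPerRack = 4
racksPerAisle = 16
# Subscription wiring: Random, BarabasiAlbert, WattsStrogatz, or Regular
subscriptionTopology = WattsStrogatz
# Update protocol, e.g. TransitiveP2P
protocol = TransitiveP2P
# Average fraction of nodes failing or changing over the runtime
changeRatePercent = 1.0
# Monitoring probes are written to this CSV file
outputFile = run-001.csv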
IMPLEMENTATION

The SPECI-2 Java simulator is started with an argument that passes the location of a configuration file. The entry class is SimulationRunner. This class first reads the configuration file and sets the configuration parameters in a static class. It then creates the utility objects required, e.g. those used for the relevant random draws to wire the subscriptions. It then creates a structure object that contains the DC set-up; that is, it generates the layout and components for the DC based on the configuration file. This includes arrays for every type of physical object, with the elements of the array keeping track of the load, in the form of access counts, generated by every individual component of that type. There are arrays containing elements for every aisle, and likewise for components at rack level, chassis level, and blade level. For the cloud-service level, the structure object creates an object for every cloud service, which holds both integers for the monitored load and a pointer to the object of the relevant service.

There are two utility classes with public static methods, SubscriptionGenerator and Protocol. Only the Persistence class contains both static methods and variables, and Configuration has static variables that are read from file once initialised; this reduces the amount of file access required. Finally, there is one singleton class that stores the arrays and access counts. To continue the initialisation phase, the SimulationRunner entry class then calls a utility function that wires the subscriptions to each of the cloud services depending on the current configuration parameters, and then initialises the queue. Once the data-centre is initialised, the execution of the simulation is entirely driven by the queue. After the execution of an event, the event schedules itself for its next update.

The simulation queue is a custom queue which holds tuples of time and int, with positive integers referring to the id of a cloud service and negative integers denoting predefined events other than updates, such as events for changes to occur, or monitoring probes to be taken. For performance reasons, the queue is divided into two array lists: a sorted list for those events to be executed shortly, which always has the next event in time at its head, and an unsorted list for events further away. This promises performance advantages, as the nature of the experiment is such that most newly arriving events will be further away in time than the mean of the events in the queue; thus, for most insert operations, costly sorted inserts can be avoided.
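A sketch of the queue-driven main loop, under the same illustrative assumptions as the earlier EventQueue sketch, might look as follows; the helper methods and the negative-id encodings are hypothetical placeholders for the behaviour described above.

// Illustrative main loop; helper methods and id encodings are assumptions.
final class SimulationLoop {
    static final int MONITORING_PROBE = -1;   // hypothetical negative-id encodings
    static final int CHANGE_EVENT = -2;

    private final EventQueue queue;
    private final double updateInterval;      // time between a service's updates

    SimulationLoop(EventQueue queue, double updateInterval) {
        this.queue = queue;
        this.updateInterval = updateInterval;
    }

    void run(double endTime) {
        EventQueue.Event e;
        while ((e = queue.next()) != null && e.time() <= endTime) {
            if (e.id() >= 0) {
                executeUpdate(e.id(), e.time());    // run the service's update protocol
                // after execution, the event schedules itself for its next update
                queue.schedule(new EventQueue.Event(e.time() + updateInterval, e.id()));
            } else if (e.id() == MONITORING_PROBE) {
                writeProbe(e.time());               // persist inconsistency and load counts
                queue.schedule(new EventQueue.Event(e.time() + 1.0, MONITORING_PROBE));
            } else if (e.id() == CHANGE_EVENT) {
                applyChange(e.time());              // e.g. increment a policy version
            }
        }
    }

    void executeUpdate(int serviceId, double now) { /* placeholder */ }
    void writeProbe(double now) { /* placeholder */ }
    void applyChange(double now) { /* placeholder */ }
}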
In summary, SPECI-2 gained performance, readability, and extensibility, at the cost of style: some components have centralised knowledge although the simulator models a decentralised DC. The use of singletons helps performance, but on the other hand it makes integration testing very difficult.

EXPERIMENTAL VERIFICATION

To confirm that SPECI-2's outcomes are in line with SPECI-1, we have constructed experiments mimicking the flat model used in SPECI-1. To achieve such a set-up, without any hierarchies and providing one-hop connectivity, in this section we model all cloud services of the entire DC to fit on a single blade, and set the unit cost of communicating with another cloud service on the same blade to 1 per access count. This way we could reproduce some of the results published from SPECI-1, verifying that the simulator is compatible with SPECI-1.

The model observes the number of nodes that have an inconsistent view of the system. A node has an inconsistent view if any of the subscriptions that node holds contains incorrect aliveness information. The number of inconsistent nodes is measured over time and observed once every ∆t (= 1 sec).

For the graphs shown here, we assume that the number of subscriptions grows more slowly than the total size of the DC, and so we set the average number of subscriptions per node to √n. For each of these sizes, a failure or change rate distribution f was chosen such that on average over the runtime 0.01%, 0.1%, 1%, and 10% of the nodes would fail. The graphs contain the half-width of the 95% confidence intervals, which for the load graphs, however, are small and barely visible.

Figure 1 and Figure 2 show the effect of the subscription graph topology on the levels of inconsistencies; see (Sriram and Cliff, 2010) for an explanation of the topology networks and the original figures from SPECI-1. If the subscription graph has the structure of a Random or Barabási-Albert graph, the distribution is more resilient towards transitive passing-on of inconsistencies than with Watts-Strogatz or Regular graphs. On the other hand, it also generates a significantly higher load. This shows that the nature of the subscription graph, which is intrinsic to the jobs that reside on the DC, needs to be taken into account when tuning the middleware. Similar replications have been made to reconfirm other previously published graphs, but they are not shown here.

Figure 1: Depending on the nature of the subscription graph, the middleware exhibits variations in the number of inconsistencies.

Figure 2: If the distribution of the subscriptions is that of a Regular graph or Watts-Strogatz network, they require a higher load, which offsets their advantages in terms of inconsistencies.

RESULTS ON A HIERARCHICAL DC

This section shows further results from simulations of the previous scenario on a hierarchically wired DC. As an exploratory hierarchy, we compare the previous results with a DC set-up of 4 cloud services per blade, 16 blades per chassis, 4 chasses per rack, and 16 racks per aisle, and we leave it to future work to investigate the effect of varying these hierarchies.

The consistency graph of the TransitiveP2P protocol for the various subscription topologies is essentially identical to the one in Figure 1, which shows a "flat" DC. This is because the logical layer that deals with the communication protocol is not affected by the physical layout: it still continues to communicate with the same other services as in the flat scenario. For this experiment, the placement and choice of subscriptions depends on the network subscription topology graphs, and is not correlated with the geographical distance in the DC. On the other hand, Figure 3 shows a much higher load count than Figure 2, as potentially multiple hops are required for every communication, and as costs are introduced.
Note that, compared to the experiments reported in (Sriram and Cliff, 2010), here only those services that do not experience failure over the simulation time are counted towards the average load and average inconsistencies. This makes the load entirely independent of the change rate. In Figure 3 one can observe that, unlike in the flat DC, the load does not increase by a constant factor: the step from 10^4 to 10^5 services is more than an order of magnitude bigger than the step from 10^3 to 10^4. This is a direct effect of the scale requiring longer communication paths, and of more subscriptions being further away. This type of observation can allow us to model the communication and management costs of placement strategies in future work.

Figure 3: The load on the hierarchical DC is much higher than the load in Figure 2, as potentially multiple hops are required for every communication, and as costs are introduced. This feature allows us to model the communication and management costs of placement strategies.
CONCLUSIONS

In this paper we have introduced SPECI-2. It has benefits in performance and extensibility, and it models a hierarchical DC. We have demonstrated both the need for such tools and a simulation architecture suitable for a hierarchical DC layout. With the release of SPECI-2 we hope to attract a community of researchers interested in modelling aspects of DCs. We have further shown that SPECI-2 is compatible with the results published using SPECI-1, and have indicated areas of investigation that can be followed with SPECI-2 in the future.
REFERENCES
Barroso, L. A. and Hölzle, U. (2009). The datacenter as a computer: An introduction to the design of warehouse-scale machines. Synthesis Lectures on Computer Architecture, 4(1):1–108.

Hey, T., Tansley, S., and Tolle, K. (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research.

Jogalekar, P. and Woodside, M. (2000). Evaluating the scalability of distributed systems. IEEE Transactions on Parallel and Distributed Systems, 11(6):589–603.

Miller, R. (2009). Microsoft unveils its container-powered cloud.

Sriram, I. (2009). SPECI, a simulation tool exploring cloud-scale data centres. In Cloud Computing: First International Conference, CloudCom 2009, LNCS 5931, pages 381–392, Beijing, China. Springer-Verlag Berlin Heidelberg.

Sriram, I. and Cliff, D. (2010). Effects of component-subscription network topology on large-scale data centre performance scaling. In Proceedings of the 15th IEEE International Conference on Engineering of Complex Computer Systems (ICECCS 2010), pages 72–81, Oxford, UK. IEEE Computer Society.

Sriram, I. and Cliff, D. (2011). Hybrid complex network topologies are preferred for component-subscription in large-scale data-centres. In da F. Costa, L. et al., editor, CompleNet 2010, volume 116 of Communications in Computer and Information Science. Springer.