Performance modeling of a distributed file-system
Sandeep Kumar∗, Indian Institute of Technology Delhi, New Delhi, India. [email protected]
ABSTRACT
Data centers have become the center of big data processing. Most programs running in a data center process big data, and the storage requirements of such programs cannot be fulfilled by a single node in the data center. Hence a distributed file system is used, where the storage resources of more than one node are pooled together and presented to the outside world as a unified view. Optimal performance of these distributed file systems for a given workload is of paramount importance, the disk being the slowest component in the framework. Owing to this fact, many big data processing frameworks implement their own file system to get optimal performance by fine-tuning it for their specific workloads. However, fine-tuning a file system for a particular workload results in poor performance for workloads that do not match the desired profile. Hence, these file systems cannot be used for general purpose storage, where the workload characteristics show high variation. In this paper we model the performance of a general purpose file system and analyse the impact of tuning the file system on its performance. The performance of these parallel file systems is not easy to model, because it depends on many configuration parameters: the network, the disks, the underlying file system, the number of servers, the number of clients, the parallel file-system configuration, and so on. We present a multiple linear regression model that captures the relationship between the configuration parameters of the file system, the hardware configuration, and the workload configuration (collectively called features) on one side and the performance metrics on the other. We use this model to rank the features according to their importance in deciding the performance of the file system.
KEYWORDS
Filesystem, Performance Modelling
Due to the data explosion of recent years, computer programs often involve processing large amounts of data. For example, Facebook processes about 500 PB of data daily [1]. The storage capacity of a single machine is generally not enough to store the complete data, so machines in a data center pool their storage resources together to support data sets in the range of petabytes. Using a single node to store a large data set also creates issues with the availability and reliability of the data: the single machine is a single point of failure, and two processes running in parallel on isolated parts of the data cannot access them in parallel, hurting performance.

∗ Work done by the author as a Master's student at IISc.
Figure 1: Typical working of a parallel file-system.
One way to achieve this is to use a parallel file system. Figure 1 shows the general architecture of a parallel file system. Machines contribute storage resources to a pool, which is then used to create a virtual disk volume. A file stored on the file system is striped and stored part-wise on different machines. This allows write, and later read, operations to happen in parallel, boosting performance significantly. Most parallel file systems can auto load-balance the data across the servers (the nodes that make up the parallel file system), which makes sure that no single node becomes a bottleneck. Using replication (storing a number of copies of a file on different servers), we can achieve high availability of the data. The Hadoop Distributed File System (HDFS), LustreFS, and GlusterFS [2–4, 9] are some of the common parallel file systems. All of them serve the same purpose but differ significantly in their architecture. Some of them are used for general purpose storage (Lustre, Gluster), whereas some are optimized for very specific kinds of workloads (HDFS).
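To make the striping idea concrete, here is a minimal sketch of round-robin stripe placement. The function name, the stripe size, and the node names are illustrative assumptions; real file systems use more elaborate layout policies, and this only shows why N servers allow up to N-way parallel reads and writes.

import math

def stripe_layout(file_size: int, stripe_size: int, nodes: list) -> list:
    """Return (node, offset, length) for each stripe of a file, round-robin."""
    layout = []
    offset, i = 0, 0
    while offset < file_size:
        length = min(stripe_size, file_size - offset)      # last stripe may be short
        layout.append((nodes[i % len(nodes)], offset, length))
        offset += length
        i += 1
    return layout

# A 10 MB file striped in 4 MB units over 4 nodes lands on 3 different nodes,
# so up to 3 transfers can proceed in parallel.
print(stripe_layout(10 * 2**20, 4 * 2**20, ["node1", "node2", "node3", "node4"]))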
Using a parallel file system effectively requires an understanding of the architecture and the working of the file system, plus knowledge of its expected workload. A major problem with any given parallel file system is that it comes with many configurable options, which require in-depth knowledge of the design and working of the file system and of the relationship of each option with performance. Together with the hardware configuration, these options overwhelm the user with the number of ways the cluster can be configured. Consequently, a user generally ends up running the file system with the default options, which may result in under-utilization of the resources in the cluster and below-average performance. After experiencing low performance, the user tries to upgrade the system without really knowing the cause of the performance bottleneck, and often ends up making a wrong decision, like buying more costly hard drives for the nodes when the real cause of the bottleneck is the network.

Through our model we have identified some of the key configuration parameters in the Gluster file system, the hardware configuration, and the workload configuration (features) that significantly affect the performance of the parallel file system. After analysing their relationships with each other, we were able to rank the features according to their importance in deciding the performance. We were also able to build a prediction model which can be used to determine the performance of the file system in a particular scenario, i.e. when we know the hardware and file-system configuration.

Given the importance of each feature in a cluster, we can weigh the positive and negative points of a given design for a cluster system and choose a design that satisfies our requirements. The results obtained from the analysis can be used to evaluate the available design options and select the best among them, given some initial requirements. For example, suppose the requirement is that the cluster should be dense and low power, i.e. it should take less space on a rack and use less power than a traditional cluster system. The network connectivity option is then limited to Gigabit (InfiniBand connectors take a lot of space, violating the density condition). To make the system use less power, we have to choose a CPU that uses less power than a conventional CPU yet is not a bottleneck in the system. The list of possible CPUs is shown in Table 1.
CPU                      Points   Power Usage
Low power Atom            916      8.5 Watt
Low power Xeon E3-1265   8306       45 Watt
Low power Xeon E3-1280   9909       95 Watt
Xeon E5430               3975       80 Watt
Xeon E5650               7480       95 Watt

Table 1: Choice of CPUs along with their processing power (given as points, as calculated by the PassMark benchmark) and power usage [5].
From the results in Table 8 we can see that InfiniBand is a big plus for the performance of a distributed program, and with it the processing power of the CPU is the actual bottleneck. However, if we move to a Gigabit connection and towards a denser server cluster (12 or 24 nodes in a 3U rack space), then the network is the limiting factor, and CPUs with low processing power are well capable of handling the load. Atom CPUs have the lowest power consumption, but their processing power is so low that even on Gigabit the CPU would be the bottleneck, so they were not a suitable option. After weighing the processing capability and the power usage of all the CPUs, low power Xeon E3-1265 servers with Gigabit connectivity were chosen.
There are a number of parallel file systems available for use in a cluster. Some of them can be used for general purpose storage, whereas others are optimized for a specific type of usage. While selecting the file system for our experiment, along with the general requirements of a file system like consistency and reliability, we also focused on the following:
• Usage scenarios: We checked in which scenarios the file system can be used and whether it is adaptable to the different workload demands of the user or is designed for only one specific type of workload.
• Ease of installation: It should be easy to install; e.g. what information it requires during installation and whether a normal user will have access to it.
• Ease of management: How easy it is to manage the file system, so that it can be managed without special training.
• Ease of configuration: A file system will require many tweaks during its lifetime, so it must be very easy to configure, and if this can be done online (while the virtual volume is in use), that is a huge plus point.
• Ease of monitoring: An admin monitoring the usage of the file system should be able to make better decisions about load balancing and about whether the cluster needs a hardware upgrade.
• Features like redundancy, striping, etc.: Parallel file systems are generally installed on commodity hardware, and as the size of the cluster grows, the chance of a node failing also increases. So the file system must support features like replication, which ensures that even if some of the nodes fail there is little chance of data corruption. Striping helps in reading a file in parallel, which results in a huge performance gain.
• Points of failure and bottlenecks: We looked at the architecture of the file system to figure out how many points of failure it has and what can become a performance bottleneck.

We studied the features of some of the most common file systems in use today. The first was the Lustre file system, which is used in some of the world's fastest supercomputers [3, 4]. It improves disk operations by separating the metadata from the data: a dedicated metadata server handles the metadata, whereas the data is stored on separate servers. The metadata server is used only while resolving a file path and in permission checks; during the actual data transmission, the client directly contacts the data storage server. Using a single metadata server increases performance, but the server also becomes a performance bottleneck, as every request for any file operation has to go through it. It is also a single point of failure, because if the metadata server is down, technically the whole file system is down (this can be tackled by configuring a machine as a backup of the metadata server, which kicks in when the main metadata server fails). Installing the Lustre file system is not an easy process, as it involves applying several patches to the kernel.

The next was the Hadoop Distributed File System (HDFS), which is generally used along with Hadoop, an open source framework for distributed computing.
Figure 2: Architecture of the Gluster File system [2].
Hadoop is generally used to process large amounts of data, so HDFS is optimized to read and write large files in a sequential manner. It relaxes some of the POSIX specifications to improve performance. The architecture is very similar to that of the Lustre file system: it also separates the metadata from the data and configures a single machine as the metadata server, and it therefore suffers from the same problems of a performance bottleneck and a single point of failure. Lastly, we looked into the Gluster file system, an evolving distributed file system whose design eliminates the need for a single metadata server. Gluster can be easily installed on a cluster and can be configured according to the expected workload. For these reasons we chose the Gluster file system for further analysis.
The Gluster file system is a distributed file system capable of scaling up to petabytes of storage while providing high performance disk access for various types of workloads. It is designed to run on commodity hardware and ensures data integrity by replication of data (the user can choose to opt out of this). Gluster differs from all of the previous file systems in its architectural design. The previous file systems increase I/O performance by separating the data from the metadata, with a dedicated node handling the metadata. This achieves the purpose, but under high load this single metadata server can become a performance bottleneck, and it is also a single point of failure. In the Gluster file system there is no such single point of failure, as the metadata is handled by the underlying file system on top of which Gluster is installed, such as ext3, ext4, or XFS. The main components of the Gluster architecture are:
• Gluster server: The Gluster server needs to be installed on all nodes that take part in the creation of the parallel file system.
• Bricks: Bricks are the locations on the Gluster servers that take part in the storage pool, i.e. the locations used to store data.
• Translators: Translators are a stackable set of options for a virtual volume, like Quick Read, Write Behind, etc.
• Clients: Clients are the nodes that access the virtual volume and use it for storage. The virtual volume can be mounted in a number of ways, most popularly via the Gluster Native Client, NFS, CIFS, HTTP, or FTP.

Gluster has an inbuilt mechanism for file replication and striping. It can use Gigabit as well as InfiniBand for communication and data transfer among the servers and with the clients. It also provides options to profile a volume, through which we can analyse the load on the volume. Newer versions of Gluster support geo-replication, which replicates the data in one cluster to another cluster located far away geographically. Because of these features, we chose to do further analysis on the Gluster file system.
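As an illustration of how these components fit together, the sketch below assembles a striped and replicated volume from bricks using the gluster command-line tool, driven from Python. The volume name, node names, and brick paths are made-up examples, and the exact volume-create syntax (in particular the stripe option) varies across Gluster versions, so treat this as a sketch rather than a definitive recipe.

import subprocess

# One brick per Gluster server; these paths are hypothetical examples.
bricks = [f"node{i}:/data/brick1" for i in range(1, 5)]

# stripe 2, replica 2 over 4 bricks: each file is split into 2 parts and
# each part is stored twice, combining parallelism with availability.
subprocess.run(["gluster", "volume", "create", "testvol",
                "stripe", "2", "replica", "2", *bricks], check=True)
subprocess.run(["gluster", "volume", "start", "testvol"], check=True)

# A client would then mount the virtual volume, e.g. with the native client:
#   mount -t glusterfs node1:/testvol /mnt/glusterfs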
The performance of a parallel file system depends on the underlying hardware, the file-system configuration, and the type of workload for which it is used.
Hardware configurable features are shown in Table 2.
Parameter               Description
Network Raw Bandwidth   Network speed
Disk Read Speed         Sequential and random
Disk Write Speed        Sequential and random
Number of Servers       File-system servers
Number of Clients       Clients accessing the servers
Replication             No. of copies of a file block
Striping                No. of blocks into which a file is split

Table 2: Hardware and software configurable parameters.
GlusterFS configurable parameters (a configuration sketch follows the workload list below):
(1) Striping factor: the striping factor for a particular Gluster volume. This determines the number of parts into which a file is divided.
(2) Replication factor: the replication factor for a particular Gluster volume. This determines the number of copies of a given part of a file.
(3) Performance cache: the size of the read cache.
(4) Performance quick read: a Gluster translator used to improve read performance on small files. On a POSIX interface, OPEN, READ and CLOSE commands are issued to read a file, and on a network file system the round-trip overhead of these calls can be significant. Quick Read uses Gluster's internal get interface to implement the POSIX open/read/close abstraction, thereby reducing the number of calls over the network from n to 1, where n = number of read calls + 1 (open) + 1 (close).
(5) Performance read ahead: a translator that prefetches a file upon a read of that file.
(6) Performance write behind: in general, write operations are slower than reads. The Write Behind translator improves write performance significantly by using an "aggregated background write" technique.
(7) Performance io-cache: a translator that enables caching of file data. On the client side it can be used to store files that the client accesses frequently; on the server side it can be used to store the files accessed frequently by the clients.
(8) Performance md-cache: a translator that caches metadata such as stats and certain extended attributes of files.

Workload features:
• Workload size: the total size of the file to be read from or written to the disk (the virtual volume, in the case of Gluster).
• Block size: how much data is read or written in a single read or write operation.
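The sketch below shows how the translator options above could be toggled on a volume via the gluster CLI from Python. The volume name is a made-up example, and the option names are assumptions based on common Gluster releases (for instance, the md-cache translator is often controlled through performance.stat-prefetch); they may differ in other versions.

import subprocess

def set_option(volume: str, option: str, value: str) -> None:
    """Set one volume option with `gluster volume set`."""
    subprocess.run(["gluster", "volume", "set", volume, option, value],
                   check=True)

set_option("testvol", "performance.cache-size", "256MB")    # read cache size
set_option("testvol", "performance.quick-read", "on")       # small-file read path
set_option("testvol", "performance.read-ahead", "on")       # prefetch on read
set_option("testvol", "performance.write-behind", "on")     # aggregated background write
set_option("testvol", "performance.io-cache", "on")         # file-data cache
set_option("testvol", "performance.stat-prefetch", "on")    # md-cache (metadata cache)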
The performance metrics for the Gluster file system under observation are:
• Maximum write speed
• Maximum read speed
• Maximum random read speed
• Maximum random write speed

These performance metrics were chosen because they capture most of the requirements of applications.
We used various tools to measure different aspects of the file systems and the cluster, whose results were later used to generate the file-system model. The tools used were as follows:
• Iozone: Iozone is an open source file-system benchmark tool that can be used to produce and measure a variety of file operations. It is capable of measuring various kinds of performance metrics for a given file system, like read, write, re-read, re-write, read backwards, read strided, fread, fwrite, random read, pread, mmap, etc. It is suitable for benchmarking a parallel file system because it supports a distributed mode in which it can spawn multiple threads on different machines, each of which writes or reads data at a given location on that particular node. Distributed mode requires a config file which tells Iozone the nodes on which to spawn threads, where Iozone is located on each node, and the path at which Iozone should be executed. Example of a config file (say iozone.config):

node1 /usr/bin/iozone /mnt/glusterfs
node2 /usr/bin/iozone /mnt/glusterfs
node3 /usr/bin/iozone /mnt/glusterfs
node4 /usr/bin/iozone /mnt/glusterfs
This config file can be used to spawn 4 threads, one each on node1, node2, node3 and node4, and execute Iozone at the given path (on which the parallel file system, or virtual volume, is mounted). Some of the options of the Iozone tool that were used:
– -+m : run Iozone in distributed mode
– -t : number of threads to spawn in distributed mode
– -r : block size to be used during the benchmarking (this is not the block size of the file system)
– -s : total size of the data to be written to the file system
– -i : select the test from the test suite for which the benchmark is run
– -I : enable the O_SYNC option, which forces every read and write to come from the disk (this feature is still in beta and does not work properly on all clusters)
– -l : lower limit on the number of threads to be spawned
– -u : upper limit on the number of threads to be spawned
Example: iozone -+m iozone.config -t
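To sweep the parameter space programmatically, one can wrap the Iozone invocation described above; a minimal sketch follows. The helper name and the example parameter values are ours, and parsing the throughput numbers out of Iozone's textual report is left out.

import subprocess

def run_iozone(config: str, threads: int, block_size: str, file_size: str,
               tests: list, o_sync: bool = False) -> str:
    """Assemble and run a distributed Iozone benchmark; return its raw output."""
    cmd = ["iozone", "-+m", config,          # distributed mode with config file
           "-t", str(threads),               # number of client threads
           "-r", block_size,                 # record (block) size, e.g. "64k"
           "-s", file_size]                  # data size per thread, e.g. "2g"
    for t in tests:                          # e.g. 0 = write/rewrite, 1 = read, 2 = random
        cmd += ["-i", str(t)]
    if o_sync:
        cmd.append("-I")                     # force every operation to hit the disk
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout

# 4 clients, 2 GB each in 64 KB records, write + read + random tests:
out = run_iozone("iozone.config", 4, "64k", "2g", [0, 1, 2], o_sync=True)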
The experiments were run on two clusters. The first cluster comprises 36 nodes in total, of which 1 node is the login node and is used to store all the user data. The remaining 35 nodes are used for computing purposes. The configuration of the 35 nodes is: CPU: 8 core Xeon processor @ 2.66 GHz; RAM: 16 GB; Connectivity: Gigabit (993 Mb/sec) and InfiniBand (5.8 Gb/sec); Disk: 3 TB 7200 RPM (5 nodes), 250 GB 7200 RPM (all of them). Out of these 35 nodes, we reconfigured the 5 with 3 TB hard disks and used them as Gluster servers for our experiments. The rest of the nodes were chosen to act as clients. This setup made sure that whenever a client reads or writes data on the virtual volume (mounted via NFS or the Gluster Native Client from the servers), the data has to leave the machine; writing data to a local disk (which would inflate the measured speed) is thus avoided.
The Atom cluster has 4 nodes. The configuration of each Atom node is as follows: CPU: Atom, 4 core processor @ 2 GHz; RAM: 8 GB; Connectivity: Gigabit (993 Mb/sec); Disk: 1 TB, 7200 RPM. Clients connect to the cluster from outside via a Gigabit connection.
Generating the data for the analysis was a challenge, since some settings of some features vastly degrade the performance of the file system. For example, when the block size is set to 2 KB, the speed of writing a 100 MB file (to disk and not to the cache, by turning on O_SYNC) was 1.5 MB/sec. The largest workload size in our experiment is 20 GB, so the time taken to write 20 GB to the disk would be around 30000 seconds, i.e. 8.3 hours. In the worst case, when the number of clients is 5 and the replication factor is also 5, the total data written to the disk is 500 GB (20 × 5 × 5). The time taken to write 500 GB of data would be around 8.6 days; including the time taken for reading, random reading and random writing the data, the total comes to 8.6 + 3.34 + 3.34 + 8.6 = 23.88 days. The read and write speeds achieved above are maximal when the cache size is equal to or more than 256 KB, as shown in Figure 4. If we try to model everything together (using normal values for the Gluster configuration), the Gluster configuration is neglected, as it has a negligible effect on the performance compared to the hardware and workload configuration. But it becomes important once the hardware and the type of workload are fixed: the Gluster configuration is then used to configure the file system in the best way possible. For this reason we decided to create two models, one for the hardware and workload configuration and another for the Gluster configuration.

For the first model, the parameters varied are listed in Table 3. The choice of workload sizes was driven by the system RAM, which is 16 GB. If we write a file smaller than 16 GB, the file can be held in RAM and we will measure artificially high read and write speeds; to avoid this, a workload size of 20 GB was included. But the cache plays an important role when we read or write small files, and we capture that behavior of the file system by also making the workload size less than the RAM. For the second model, the workload size was fixed at 100 MB, and Table 4 contains the list of parameters and the values they take during the benchmark process. The disk was unmounted and remounted before each test to avoid caching effects.
Parameter           Values
Network             Gigabit and InfiniBand
Disk Read Speed     117 MB/sec, 81 MB/sec
Disk Write Speed    148 MB/sec, 100 MB/sec
Base File System    Ext3, XFS
Number of Servers   1-5
Number of Clients   1-5
Striping            1-5
Replication         1-5
Workload Size       100 MB, 200 MB, 500 MB, 700 MB, 1 GB, 10 GB, 20 GB

Table 3: Hardware configurable parameters varied on the first cluster.
As stated earlier, our goal is to figure out the relationship and the impact of the hardware, Gluster and workload features on the performance features. Multiple linear regression fits our criteria well, as it can be used for:
• Prediction: predicting the value of Y (the performance metric) given the cluster environment and the Gluster configuration (X).
• Description: studying the relationship between X and Y; in our case, the impact of the various configuration features on the performance, for example which of them affects the performance the most and which the least.

Feature                   Values
Block Size                2 KB, 4 KB, 8 KB, ..., 8192 KB
Performance Cache Size    2 MB, 4 MB, 8 MB, ..., 256 MB
Write Behind Translator   On/Off
Read Ahead Translator     On/Off
IO Cache Translator       On/Off
MD Cache Translator       On/Off

Table 4: Gluster parameters varied on the first cluster and the Atom cluster.

Figure 3: Approximately linear relationship between the number of Gluster servers, striping and the number of clients and the performance metrics (write speed, read speed, random read speed, random write speed).

Assumptions. Before applying multiple linear regression, the condition of linearity must be satisfied, i.e. there should be a linear relationship between the dependent variable and the independent variables to get a good fit. The relationship between the configuration features and the performance features can be seen in Figure 3.
To study the relationship between more than one independent variable (the features) and one dependent variable (the performance), the multiple regression technique is used. The output of the analysis is a set of coefficients for the independent variables, which are used to write an equation of the form:

y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + ... + β_p x_{ip} + ϵ_i = x_i^T β + ϵ_i,   i = 1, ..., n    (1)

where
y_i : i-th sample of the dependent variable (performance metric)
x_i : i-th sample of the independent variables (configuration parameters)
ϵ_i : error term
β : coefficients (the output of the analysis)
n : number of data samples
p : number of independent variables

In vector form, Y = (y_1, ..., y_n)^T, ϵ = (ϵ_1, ..., ϵ_n)^T, β = (β_0, β_1, ..., β_p)^T, and X is the n × (p+1) design matrix whose i-th row is x_i^T. β can be calculated by the formula:

β = (X^T X)^{-1} X^T Y    (2)

The accuracy of the model can be checked using R², which represents the proportion of the variance in the dependent variable that can be explained by the independent variables; for a good model we want this value to be as high as possible. R² can be calculated in the following way:

TSS = Σ_{i=1}^{n} (y_i − ȳ)²,   SSE = Σ_{i=1}^{n} (y_i − ŷ_i)²,   SSR = Σ_{i=1}^{n} (ŷ_i − ȳ)²    (3)

where
• TSS: total sum of squares
• SSE: sum of squared errors (residual variation)
• SSR: sum of squares due to regression (explained variation)
• y : dependent variable
• ȳ : mean of y
• ŷ : predicted value of y
• n : total number of data samples

R² = 1 − SSE/TSS = SSR/TSS    (4)

The value of R² lies in [0, 1]. R² is generally a positively biased estimate of the proportion of variance in the dependent variable accounted for by the independent variables, as it is computed on the training data itself. The adjusted R² corrects this by giving a value lower than R², which is what should be expected on typical data:

R̂² = 1 − ((n − 1) / (n − k − 1)) (1 − R²)    (5)

where
• n : total number of samples
• k : total number of features (independent variables)

Predictors are ranked according to the sensitivity measure, which is defined as follows:

S_i = V_i / V(Y) = V(E(Y | X_i)) / V(Y)    (6)

where V(Y) is the unconditional output variance. In the numerator, the expectation E is taken over X_{−i}, i.e. over all features except X_i, and the outer V is a further variance operation over X_i. S_i measures the sensitivity of feature i and is the proper measure for ranking the predictors in order of importance [8]. Predictor importance is then calculated as the normalized sensitivity:

VI_i = S_i / Σ_{j=1}^{k} S_j    (7)

where VI_i is the predictor importance. The predictor importance can be calculated directly from the data.
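A minimal NumPy sketch of Equations (2)-(5), together with a binned estimate of the sensitivity measure in Equations (6)-(7), is given below. The function names and the quantile-binning used to approximate V(E(Y | X_i)) are our own choices, not part of the paper's tooling.

import numpy as np

def fit_ols(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Eq. (2): beta = (X^T X)^{-1} X^T Y, with an intercept column prepended."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

def r_squared(X: np.ndarray, y: np.ndarray, beta: np.ndarray) -> float:
    """Eqs. (3)-(4): R^2 = 1 - SSE/TSS."""
    Xb = np.column_stack([np.ones(len(X)), X])
    sse = np.sum((y - Xb @ beta) ** 2)          # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)           # total sum of squares
    return 1.0 - sse / tss

def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Eq. (5): penalize R^2 for the number of features k."""
    return 1.0 - ((n - 1) / (n - k - 1)) * (1.0 - r2)

def predictor_importance(X: np.ndarray, y: np.ndarray, bins: int = 10) -> np.ndarray:
    """Eqs. (6)-(7): S_i = V(E[Y|X_i]) / V(Y), normalized to sum to 1.
    E[Y|X_i] is approximated by averaging y within quantile bins of X_i."""
    s = np.empty(X.shape[1])
    for i in range(X.shape[1]):
        edges = np.unique(np.quantile(X[:, i], np.linspace(0, 1, bins + 1)[1:-1]))
        groups = np.digitize(X[:, i], edges)
        labels = np.unique(groups)
        cond_means = np.array([y[groups == g].mean() for g in labels])
        weights = np.array([np.mean(groups == g) for g in labels])
        s[i] = np.sum(weights * (cond_means - y.mean()) ** 2) / y.var()
    return s / s.sum()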
The R² test measures the goodness of fit on the whole of the training data. A cross-validation test is used to check the accuracy of the model when it is applied to unseen data. For this, the data set is divided into two sets, called the training set and the validation set, whose sizes can be chosen as per requirements. We set the size of the training set to 75% of the data set and that of the validation set to 25%; the assignment of data samples to these sets was completely random.

Two separate analyses were done, one for the hardware and workload configuration and another for the Gluster configuration. For the first model the assumption of linearity holds, and hence the multiple linear regression model gives an accuracy of 75%-80%. However, due to the huge impact of the cache and block size in the Gluster configuration, the assumption of linearity is no longer valid there, so we used predictor-importance analysis to rank those features according to their importance.
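The 75/25 random split can be expressed as follows, reusing the hypothetical fit_ols and r_squared helpers from the previous sketch.

import numpy as np

def validate(X: np.ndarray, y: np.ndarray,
             train_frac: float = 0.75, seed: int = 0) -> float:
    """Train on a random 75% of the samples, return R^2 on the held-out 25%."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))               # random assignment of samples
    cut = int(train_frac * len(y))
    train, test = idx[:cut], idx[cut:]
    beta = fit_ols(X[train], y[train])
    return r_squared(X[test], y[test], beta)    # accuracy on unseen data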
The output of the multiple linear regression analysis, i.e. the coefficients corresponding to each of the configurable features, is shown in Table 5. From the coefficient table we can see that the write performance is affected most by the network, followed by replication, the number of clients, the number of servers and striping. The signs tell us the direction of the relationship: for example, increasing the network bandwidth increases the write performance, while increasing the replication factor decreases the write speed.

The read performance is affected most by the network, followed by replication, the number of clients and the number of servers (Table 5). Replication has a negative effect on read performance, instead of no effect or a positive one, because of the replication policy of Gluster: if we have specified a replication factor of n, then when a client tries to read a file, it has to go to every server holding a replica to ensure that the replicas of that file are consistent everywhere, and if they are not consistent, read the newest version and start the replication process. Because of this overhead, the read performance of the file system drops as the replication factor increases.

The random read performance is affected most by the network, then by replication, followed by the base file system, the number of clients and the number of servers (Table 5). Similarly, the random write performance is affected most by the network, followed by replication, the number of clients, the number of servers and striping (Table 5).

Data samples for the second model were generated with a 100 MB file. Since the size of the file is much smaller than the available RAM (8 GB), cache effects are present during the benchmarks. To avoid this, we enabled the O_SYNC option in the Iozone benchmark, which ensures that every read and write operation goes directly to and from the disk. But we also want to see the effect of the cache when we turn on translators that depend on it, so we ran every benchmark test twice, once with O_SYNC on and once with O_SYNC off.
When the O_SYNC option is on, i.e. every read or write operation goes to the disk, the dominating feature is the block size. This is to be expected, as increasing the cache size and turning on translators cannot help when we force every operation to come from the disk. We observed this for a 100 MB file, but the same kind of behaviour can be seen when dealing with large files (size greater than the RAM size).
With the O_SYNC option off, the cache effect can be seen in the read performance of the system, whereas the block size is still the dominant feature.

Figure 4: Relationship of the performance metrics with the block size, with O_SYNC on.

Figure 5: Relationship of the performance metrics with the block size, with O_SYNC off.

The size of the workload is only 100 MB, and the block size is still the dominating factor, but the effect of the cache can be seen in the performance of read speed and random read speed, as Gluster has translators (Read Ahead and Quick Read) that use the cache to optimize the performance of file reads, especially for small files. The performance of write and random write follows the same pattern as before. We can see that the speed of the read operation now remains the same as the block size changes, in contrast to the earlier case when the O_SYNC option was on and everything had to come from disk.

Feature           Write      Read       Random Read   Random Write
Constant          -120.485   -142.931   -39.676       -114.066
Network            271.936    237.502    162.003       230.091
Disk Read Speed   …          …          …             …
…

Table 5: The coefficients of the multiple linear regression models for the write, read, random read and random write performance of the parallel file system.
Feature       Write   Read   Rnd Read   Rnd Write
Adjusted R²   …       …      …          …

Table 6: Cross validation was done by training the model on 75% of the data and then running validation on the remaining 25% of the data.

Feature      Sync. Rnd. Write   Read   Rnd. Read
Block Size   …
Cache        …

Table 7: Predictor importance with O_SYNC on and off.
To verify the results, we compared the performance of an ocean modelling code in different scenarios. The Ocean code is used to model ocean behaviour given some initial observations. It uses the ROMS modelling tool to do the calculations and MPI to run on a cluster by spawning 48 threads.
The NTIMES factor controls the number of iterations inside the code and is the major factor in deciding its run time. The optimal value of NTIMES is 68400 (it changes depending upon the requirement). During its execution the code processes more than 500 GB of data. We tested its performance in the situations listed in Table 8. From Table 8 we can see that the network is indeed the most dominant factor in deciding the performance of the code. On a Gigabit network, the performance with a larger number of servers falls behind the performance with fewer servers on InfiniBand, and as we increase the number of servers on InfiniBand, the time taken to complete the code decreases. We can also see from the run times that when the Gigabit network is used, the performance of the Atom CPU is very close to that of the Xeon, because the network is the bottleneck. This fact was helpful in deciding which low power servers should be bought for the CASL lab.
Configuration                     Time Taken
… (Gigabit)                       ≈ 31 hrs
2 Gluster servers, InfiniBand     ≈ 478 min
… Gluster servers, InfiniBand     ≈ 282 min
… Gluster servers, InfiniBand     ≈ 144 min

Table 8: Performance of the Ocean code in different scenarios.
For a first-order model, multiple linear regression is good enough, as the performance is decided mainly by only one or two features (the network, in the case of the Gluster file system). For a much more detailed model that can explain the interactions between the various features, a more sophisticated tool is required. The cache of the system also plays a very important role in deciding the performance, but exploiting it requires support from the file system (the Gluster file system optimizes the read speed using the system cache). As future work we see the following directions:
• Unified model: Because of the huge difference between the impact of the hardware configuration features and the Gluster configuration features, separate models were required. A unified model could be developed by adding some bias to the Gluster configuration features, which would increase their importance and let us incorporate all of the configuration features in one model.
• Automate the whole process: Build a tool that can generate the data, run the appropriate analysis on it automatically, and report the desired results.
The machine learning algorithm used in [6], Kernel Canonical Correlation Analysis (KCCA), is used to predict the performance of a file system. Apart from the features of the hardware and the file system, the authors also include some features of the application itself, since KCCA can reveal the relationship between the features of the application and the features of the file system and the hardware cluster. From some initial benchmark tests we observed that only one or two of the configuration features determine the performance of the parallel file system, so there is no point in building a very detailed model using complex methods. For the sake of completeness we use more features (and not only those which have a major impact on the performance). A more detailed model would be required to analyse the interactions between these features. [7] fine-tunes the performance of MapReduce in a specific scenario.

REFERENCES
Predicting and Optimizing System Utilization and Performance via Statistical Machine Learning (2017), 22–31.
[8] Andrea Saltelli, Stefano Tarantola, Francesca Campolongo, and Marco Ratto. 2004. Sensitivity Analysis in Practice: A Guide to Assessing Scientific Models. Halsted Press, New York, NY, USA.
[9] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. 2010. The Hadoop Distributed File System. In Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST). 1–10. https://doi.org/10.1109/MSST.2010.5496972