AutoTiering: Automatic Data Placement Manager in Multi-Tier All-Flash Datacenter
Zhengyu Yang*, Morteza Hoseinzadeh‡, Allen Andrews†, Clay Mayers†, David (Thomas) Evans†, Rory (Thomas) Bolt†, Janki Bhimani*, Ningfang Mi* and Steven Swanson‡

* Dept. of Electrical and Computer Engineering, Northeastern University, Boston, MA 02115
‡ Dept. of Computer Science and Engineering, University of California San Diego, San Diego, CA 92093
† Samsung Semiconductor Inc., Memory Solution Research Lab, Software Group, San Diego, CA 92121

This work was completed during Zhengyu Yang and Morteza Hoseinzadeh's internship at Samsung Semiconductor Inc. This project is partially supported by NSF grant CNS-1452751.
Abstract – In 2017, the price of Flash-based Solid State Drives (SSDs) keeps declining and the storage capacity of SSDs keeps increasing. As a result, the "selling point" of traditional spinning Hard Disk Drives (HDDs) as backend storage (low cost and large capacity) is no longer unique, and eventually they will be replaced by low-end SSDs, which have large capacity but perform orders of magnitude better than HDDs. Thus, it is widely believed that all-flash multi-tier storage systems will be adopted in enterprise datacenters in the near future. However, existing caching or tiering solutions for SSD-HDD hybrid storage systems are not suitable for all-flash storage systems, because all-flash storage systems do not have a large speed difference (e.g., 10x) between tiers. Instead, the different specialties (such as high performance, high capacity, etc.) of each tier should be taken into consideration. Motivated by this, we develop an automatic data placement manager called "AutoTiering" to handle virtual machine disk file (VMDK) allocation and migration in an all-flash multi-tier datacenter to best utilize the storage resource, optimize the performance, and reduce the migration overhead. AutoTiering is based on an optimization framework whose core technique is to predict a VM's performance change on tiers with different specialties without conducting real migration. As far as we know, AutoTiering is the first optimization solution designed for all-flash multi-tier datacenters. We implement AutoTiering on VMware ESXi [1], and experimental results show that it can significantly improve I/O performance compared to existing solutions.
Index Terms – All-Flash Datacenter Storage, Caching & Tiering Algorithm, NVMe SSD, Big Data, Cloud Computing, Resource Management, I/O Workload Evaluation & Prediction
I. INTRODUCTION
A basic credendum of cloud computing can be summarized as: user devices are light terminals that assign jobs and gather results, while all heavy computation is conducted on remote distributed server clusters. This light-terminal-heavy-server structure makes high availability no longer an option, but a requirement in today's datacenters. Furthermore, when we bring compute, network, and storage capabilities into balance, the biggest challenge is closing the gap between compute and storage performance to shift storage's curve back towards Moore's law [2]. In other words, storage I/O is the biggest bottleneck in large-scale datacenters. As shown in study [3], the time spent waiting for I/Os is the main cause of idling and wasted CPU resources, since many popular cloud applications are I/O intensive, such as video streaming, file sync and backup, data iteration for machine learning, etc.
To solve the problem caused by I/O bottlenecks, issuing parallel I/O to multiple HDDs in a Redundant Array of Independent Disks (RAID) has become a common approach. However, the performance improvement from RAID is still limited; therefore, many big data applications, such as Apache Spark, strive to store as much intermediate data in memory as possible. Unfortunately, memory is too expensive, and its capacity is very limited (e.g., 64∼ GB).

Motivated by this, we develop an automatic data placement manager called "AutoTiering" to handle VMDK allocation and migration in an all-flash multi-tier datacenter. AutoTiering is based on an optimization framework that targets the global (i.e., for all VMs in the datacenter) migration and allocation solution over runtime. AutoTiering's approximation approach further solves the simplified problem in polynomial time, considering both historical and predicted performance factors, as well as estimated migration cost. This comprehensive methodology prevents VMs from being frequently migrated back and forth between tiers due to I/O spikes. Specifically, AutoTiering uses a micro-benchmark-based sensitivity calibration and regression session to predict each VM's performance change on different tiers without performing real migration, since different VMs may benefit differently from being upgraded to a high-end tier. We implement AutoTiering on VMware ESXi [1] and evaluate its performance with a set of representative applications. The experimental results show that AutoTiering can significantly improve I/O performance by increasing I/O throughput and bandwidth as well as reducing I/O latency.

The rest of this paper is organized as follows. Sec. II presents the background and literature review. Sec. III formulates the problem and introduces an optimization framework. Sec. IV proposes our approximation algorithm to solve the problem in polynomial time. Experimental evaluation results and analysis are discussed in Sec. V. We present the conclusions in Sec. VI.

II. BACKGROUND AND LITERATURE REVIEW
Substantial work has been done to improve I/O operations in datacenters at both the hardware and software levels. In this section, we discuss some of this work as well as the evolutionary trend towards AutoTiering.
SSD as Cache/Tier of HDD:
In recent years, Flash-based SSDs have been commonly used in datacenters. An SSD can be used either as a cache for an HDD or as a distinct storage tier. The main difference is that the cache approach keeps two copies of hot data, one in the SSD and one in the HDD (the two copies are synced under the write-through policy, and are not synced under the write-back policy), while the tiering approach simply migrates data between tiers and only keeps one version of the dataset. Many caching and tiering mechanisms [5]-[13] have been developed for cloud storage systems.
Data Placement in a Multi-Tier All-Flash Datacenter:
SSD-HDD based solutions may work for a limited number of users (VMs) with moderate I/O intensity and small working set sizes, but in the era of super-scale clusters (e.g., cloud computing, IoT in 5G networks), the I/O bottleneck gets mitigated, but not resolved [14]. The main reason is that in both SSD-HDD caching and tiering approaches, there still exists a huge performance gap between SSDs and HDDs. With the decreasing price and increasing capacity of SSDs, a promising solution to close this gap is to set up an all-flash datacenter, which is becoming reasonable in the near future. With the aim of further reducing the overall capacity, studies [15], [16] proposed to periodically recompute VM assignments. Study [17] introduced mathematical model formulations for big data application performance and for migrating VMs among tiers, with the aim of minimizing the overhead of data migration. Study [18] proposed a non-volatile memory based cache policy for SSDs by splitting the cache into four partitions and tuning them to their desired sizes according to a page's status. We summarize the specs of SSDs of different classes available on the market by July 2017 in Table I. As we can see, an all-flash multi-tier solution can be built from SSDs with different specialties, e.g., a super performance tier with 3D XPoint SSDs, a high performance tier with NVMe and SLC SSDs, and a large capacity tier with MLC and TLC SSDs.
TABLE I: Performance and cost of different SSDs in July 2017.
SSD Type | Cost ($/GB) | Max Size (Bytes) | Read Time (µs) | Write Time (µs)
3D XPoint | 4.50 | 375G | 10 | 10
NVMe | 0.57 | 3.2T | 20 | 20
SLC SATA3 | 0.64 | 480G | 25 | 250
MLC SATA3 | 0.30 | 2T | 50 | 750
TLC SATA3 | 0.28 | 3.84T | 75 | 1125

III. PROBLEM FORMULATION
In this section, we first formulate the problem of VM allocation and migration in an all-flash multi-tier datacenter, and then develop an optimization framework to best utilize SSD resources among VMs in the datacenter.
A. System Architecture
Fig. 1 illustrates the system architecture of AutoTiering, which has the following components:
• IO Filter: Attached to each VMDK (virtual machine disk file) being managed on each host, it is responsible for collecting I/O-related statistics as well as running special latency tests on every VMDK. The data is collected and sent to the AutoTiering Daemon on the host [19].
• Daemon: Running on the VM hypervisor of each host, it tracks the workload change (i.e., I/O access pattern change) of each VM, collects the results of latency injection tests from the IO Filter, and sends them to the Controller.
• Controller: Running on a dedicated server, it is responsible for making decisions to trigger migration based on the predicted VM performance (if it were migrated to other tiers) and the corresponding migration overhead.
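To make the data path concrete, the following Python sketch (ours, not the VMware VAIO API or the AutoTiering source; all class, field, and function names are illustrative assumptions) shows the kind of per-VMDK record an IO Filter could collect and how a Daemon might forward one epoch of samples to the Controller.

```python
# Minimal sketch of the IO Filter -> Daemon -> Controller pipeline.
# All names are assumptions for illustration, not VMware VAIO APIs.
from dataclasses import dataclass

@dataclass
class VMDKStats:
    vm_id: int
    vmdk_id: int
    tier_id: int
    avg_latency_us: float       # measured average I/O latency
    injected_latency_us: float  # synthetic latency added by the latency test
    iops: float                 # observed throughput
    bandwidth_mbps: float       # observed bandwidth
    size_bytes: int             # current VMDK storage footprint

def daemon_tick(io_filters, controller):
    """Collect one epoch of per-VMDK statistics and ship them to the Controller."""
    samples = [f.poll() for f in io_filters]  # each IO Filter reports a VMDKStats
    controller.submit(samples)                # Daemon forwards to the Controller
```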
B. Optimization Framework
To develop an optimization framework aimed at minimizing the total amount of server hours by determining a VM migration schedule, we formulate the problem by investigating the following factors. First, from each VM's point of view, the reason for a certain VM to be migrated from one tier to another is that the VM can perform better (e.g., lower average I/O latency, higher IOPS, higher throughput, etc.) after the migration. Second, the corresponding migration cost needs to be considered, because migration is relatively expensive (it consumes resources and time) and is not negligible. Third, from the global optimization's point of view, it is hard to satisfy all VMs by migrating them to their favorite tiers at the same time, due to resource constraints and their corresponding SLAs (i.e., Service Level Agreements). Fourth, the global optimization should consider changes of all VMs over time as well as the after-effects of migration; for example, the currently best allocation may lead to a bad situation in the future, since VMs change their behavior at runtime. Based on these factors, our optimization framework needs to consider potential benefits and penalties, migration overhead, historical and predicted performance of VMs on each tier, and SLAs, as shown in Eqs. 1 to 5. Table II lists the notations used in this paper.
Fig. 1: Architecture of AutoTiering in a multi-tier all-flash storage system. (VM Servers 1..N, each running a VM hypervisor with the AutoTiering Daemon and per-VMDK IO Filters; the AutoTiering Controller; and an all-flash storage pool of SSD Tiers 1..M connected via Fibre Channels.)
Maximize:
$$\sum_{\forall v, \forall t, \forall \tau} w_{v,\tau} \cdot \Big[ \sum_{\forall k} \alpha_k \cdot r(v, t_{v,\tau}, \tau, k) \;-\; \beta \cdot g(v, t_{v,\tau-1}, t_{v,\tau}) \Big], \quad (1)$$

Subject to:
$$\text{for } \forall v, \forall \tau: \quad t_{v,\tau} \neq \emptyset, \text{ and } |t_{v,\tau}| \geq 1, \quad (2)$$
$$\text{for } \forall \tau_1, \tau_2 \in [0, +\infty): \quad r(v, t_{v,\tau_1}, \tau_1, k_s) \equiv r(v, t_{v,\tau_2}, \tau_2, k_s) \geq 0, \quad (3)$$
$$\text{for } \forall v, \forall t, \forall \tau: \quad r(v, t_{v,\tau}, \tau, k) = r^{\mathrm{Prd}}(v, t_{v,\tau-1}, t_{v,\tau}, \tau-1, k) \geq 0, \quad (4)$$
$$\sum_{\forall v} r(v, t_{v,\tau}, \tau, k) \leq \Gamma_k \cdot R(t_{v,\tau}, k). \quad (5)$$
The main idea is to maximize the "Profit", which is the entire performance gain minus penalty, i.e., "Performance Gain - Performance Penalty", as shown in the objective function Eq. 1. The inner "sum" operator conducts a weighted sum of the usage of all types of resources (such as throughput, bandwidth, storage size, etc.) of each VM, assuming VM v is migrated from tier t_{v,τ-1} to t_{v,τ}. The outer "sum" operator further iterates over all possible migration cases, where the weight parameter w_{v,τ} reflects the SLA of VM v in epoch τ. Notice that the term "migration" in this paper does not mean migrating a VM from one host server to another. Instead, only backend VMDK files are migrated from one SSD tier to another. As a result, non-disk-I/O-related resources (e.g., host-side CPU and memory) are not considered in this paper. Eq. 2 guarantees that each VM is hosted by at least one disk tier. In fact, each VM can have multiple VMDKs, and each of them can be located on a different tier. Unlike previous SSD-HDD tiering work [12], [20], which operates on fine-grained data blocks due to the HDD speed bottleneck, the minimal migration unit in this paper is the entire VMDK of each VM. Eq. 3 ensures that the storage size (i.e., VMDK size) does not change before and after migrations, where k_s is the type index of the storage resource. Eq. 4 shows that a prediction model function is utilized to predict the performance gain (for details, see Sec. IV-2).
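To make the objective concrete, here is a minimal Python sketch (our illustration; the data structures and function signatures are assumptions that mirror the notation above) that evaluates the Eq. 1 profit of one candidate placement for a single epoch.

```python
# Sketch of the per-epoch "Profit" of Eq. 1: SLA-weighted resource gain
# minus the weighted migration penalty. Names are illustrative assumptions.
def epoch_profit(vms, placement, prev_placement, r, g, w, alpha, beta):
    """
    vms:            iterable of VM ids
    placement:      dict vm -> tier hosting it this epoch (t_{v,tau})
    prev_placement: dict vm -> tier of the previous epoch (t_{v,tau-1})
    r(v, t, k):     predicted usage of resource type k for VM v on tier t
    g(v, t0, t1):   migration cost of moving VM v's VMDK from tier t0 to t1
    w[v]:           SLA weight; alpha[k]: resource weight; beta: cost weight
    """
    profit = 0.0
    for v in vms:
        t_now, t_prev = placement[v], prev_placement[v]
        gain = sum(alpha[k] * r(v, t_now, k) for k in alpha)
        penalty = beta * g(v, t_prev, t_now)
        profit += w[v] * (gain - penalty)
    return profit
```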
TABLE II: Notations.

Notation | Meaning
v, v_i | VM v, the i-th VM; v ∈ [1, v_max], where v_max is the last VM.
t, t_i | Tier t, the i-th tier; t ∈ [1, t_max], where t_max is the last tier.
t_{v,τ} | VM v's hosting tier during epoch τ.
k | Resource type index, k ∈ [1, k_max]; e.g., throughput, bandwidth, storage size, etc.
τ | Temporal epoch ID, where τ ∈ [0, +∞).
α_k, β | The k-th resource's weight; the migration cost weight.
r(v, t_{v,τ}, τ, k) | Predicted usage of the k-th resource type by VM v running on tier t_{v,τ}.
g(v, t_{v,τ-1}, t_{v,τ}) | Migration cost of VM v during epoch τ from tier t_{v,τ-1} to tier t_{v,τ}.
k_s | The "storage" resource's index.
µ(v, τ-1) | Estimated migration speed of VM v at epoch τ-1.
P_r(v, t, τ), P_w(v, t, τ) | Read/write resource usage of VM v on tier t during epoch τ.
P_r(Λ, t, τ), P_w(Λ, t, τ) | All remaining available read/write resource of tier t during epoch τ.
Γ_k | Upper bound (in percentage) of each resource type that can be used.
R(t_{v,τ}, k) | Total capacity of the k-th resource type on tier t_{v,τ}.
L_t | Original average I/O latency (without injected latency) of tier t.
b_v, m_v | Parameters of the TSSCS linear regression model (y = mx + b).
s_v | Average I/O size of VM v.
S_v, VM[v].size | Storage size of VM v.
w_{v,τ}, wetP[t], wetB[t], wetS[t] | Weight of VM v; weights of each resource type of tier t.
maxP[t], maxB[t], maxS[t], spc[t].P, spc[t].B, spc[t].S | Preset available resource caps and specialties of tier t; P = throughput, B = bandwidth, S = storage size.

Eq. 5 lists the resource constraints, where Γ_k is a preset upper bound (in percentage) on each (i.e., the k-th) type of resource that can be used. Finally, the temporal migration overhead is the size of the VM to be migrated, divided by the bottleneck of the migrate-out read speed and the migrate-in write speed, i.e., $g(v, t_{v,\tau-1}, t_{v,\tau}) = r(v, t_{v,\tau-1}, \tau-1, k_s) / \mu(v, \tau-1)$. The migration speed is $\mu(v, \tau-1) = \min\big(P_r(\Lambda, t_{v,\tau-1}, \tau-1) + P_r(v, t_{v,\tau-1}, \tau-1),\; P_w(\Lambda, t_{v,\tau}, \tau-1)\big)$, where $P_r(\Lambda, t_{v,\tau-1}, \tau-1)$ is the available remaining read throughput of the source tier. Since we live migrate VM v, the read throughput used by this VM, $P_r(v, t_{v,\tau-1}, \tau-1)$, is also available and is added back here. $P_w(\Lambda, t_{v,\tau}, \tau-1)$ gives the migrate-in write throughput of the target tier.

Since the system has no information on future workload I/O patterns, it is impossible to conduct the global optimization over all τ time periods at runtime. Moreover, the decision making process for each migration epoch in the global optimal solution depends on the past and future epochs, which means that Eq. 1 cannot be solved by traditional sub-optimal-based dynamic programming techniques. Lastly, depending on the complexity of the performance prediction model (e.g., Eq. 4), the optimization problem can easily become NP-hard.
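The migration-overhead term above reduces to a few lines of code. The helper below is our own sketch, not the authors' implementation; parameter names are assumptions.

```python
# Sketch of the migration speed and temporal migration overhead defined above:
# speed is the bottleneck of migrate-out read and migrate-in write throughput.
def migration_speed(rem_read_src, own_read, rem_write_dst):
    # The VM's own read throughput is freed during live migration,
    # so it is added back to the source tier's remaining read throughput.
    return min(rem_read_src + own_read, rem_write_dst)

def migration_overhead(vmdk_size_bytes, rem_read_src, own_read, rem_write_dst):
    speed = migration_speed(rem_read_src, own_read, rem_write_dst)
    return float('inf') if speed <= 0 else vmdk_size_bytes / speed
```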
IV. APPROXIMATION ALGORITHM DESIGN

To obtain a result close to the optimal solution in polynomial time, we have to relax some constraints. In detail, we first downgrade the goal from "globally optimizing for all time" to "only optimizing within each epoch" (i.e., runtime greedy). Furthermore, since we have foreknowledge of each tier's performance "specialty" (such as high throughput, high bandwidth, large space, small write amplification factor, large over-provisioning ratio, large program/erase cycle count, etc.), we can make migration decisions based on the ranking of the estimated performance of each VM on each tier, considering each tier's specialties and the corresponding estimate of the migration overhead. Details of our approximation algorithm are discussed in the following subsections.
1) Main Procedure:
Alg. 1 lines 1-8 show the main procedure of AutoTiering, which periodically monitors performance and checks whether a VM needs to be migrated. Specifically, monitorEpoch is the frequency of evaluating and regressing the performance estimation model, and migrationEpoch is the frequency of triggering VM migration from one tier to another. The migration decision is made at the beginning of each migrationEpoch, which is greater than monitorEpoch. Apparently, the smaller the temporal window sizes, the more frequent the monitoring, measurement, and migration. The system administrator can balance the tradeoff between accuracy and migration cost by conducting a sensitivity analysis before deployment. As shown in line 4 of Alg. 1, the procedure tierSpdSenCalibration estimates each VM's performance on each tier based on the regression model. Line 5 then calculates the performance capacity matrices, and line 6 calculates the score by considering the historical and current performance of each VM and estimating the corresponding migration overhead. Finally, VM migrations are triggered in line 8. We describe their details in the following subsections.
2) Tier Speed Sensitivity Calibration:
In order to estimate a VM's performance on other tiers without conducting actual migration, we first "emulate" the speed of tiers by manually injecting synthetic latencies into each VM's I/Os, and measure the resulting effect on total I/O latency by calling IOFilter APIs. Our preliminary experiments show that the performance variation can be modeled as a linear function. VMs running different types of applications have varying performance sensitivity to the tier. Motivated by these observations, we introduce a micro-benchmark session, called the "Tier Speed Sensitivity Calibration Session (TSSCS)", to predict (i.e., without conducting actual migration) how much performance benefit (resp. performance penalty) each VM would see if we migrated it to a faster (resp. slower) tier. In detail, TSSCS has the following properties:
• Lightweight: Running inside the IOFilter, TSSCS injects a tiny synthetic latency into each VM at a very low frequency, without affecting the performance of the currently hosted workload.
• Multi-Samples per Latency: TSSCS improves the accuracy of emulating each VM's performance on each tier by averaging over multiple samples obtained with the same injected latency for each tier.
• Multi-Latencies per Session: TSSCS takes multiple latencies per session to refine the regression.
• Multi-Sessions during Runtime: TSSCS is periodically triggered to update the regression model by calling tierSpdSenCalibration in Alg. 1 line 4.

Algorithm 1: AutoTiering Procedure Part I.
1: Procedure autoTiering()
2:   while True do
3:     if currTime MOD monitorEpoch = 0 then
4:       tierSpdSenCalibration();
5:       calCapacityMatrices();
6:       calScore();
7:     if currTime MOD migrationEpoch = 0 then
8:       triggerMigration();
9: Procedure tierSpdSenCalibration()
10:   for t ∈ tierSet do
11:     for v ∈ VMSet[t] do
12:       for samplesWithLatency ∈ sampleSet[t][v] do
13:         CV += calCV(samplesWithLatency);
14:         samplesWithLatency = avg(samplesWithLatency);
15:       CV /= len(sampleSet[t][v]);
16:       if CV ≥ 1 then
17:         VM[v].conf = 0;
18:       else
19:         VM[v].conf = 1 - CV;
20:       (VM[v].m, VM[v].b) = regress(sampleSet[t][v]);
21:   return;
22: Procedure calCapacityMatrices()
23:   for t ∈ tierSet do
24:     for v ∈ VMSet[t] do
25:       Lat = estimateAvgLat(v, t);
26:       if Lat > 0 then
27:         IOPS = 10^6 / Lat;
28:       else
29:         IOPS = 0;
30:       VMCapMat[t][v].P = IOPS;
31:       VMCapMat[t][v].B = IOPS × VM[v].avgIOSize;
32:       VMCapMat[t][v].S = VM[v].size;
33:   return;
34: Procedure estimateAvgLat(v, t)
35:   return VM[v].m × (tierLatency[t] - tierLatency[VM[v].tier]) + VM[v].b;

Fig. 2: Example of average I/O latency estimation.

Fig. 2 depicts an example of three VMs running on three different tiers: VM v1 on tier t1, VM v2 on tier t2, and VM v3 on tier t3. Assume t1 is 2,000 µs faster than t2, and t2 is 2,000 µs faster than t3. We run TSSCS on each VM on its hosting tier, and obtain the latency curves shown in the three plots of Fig. 2. Since all injected latencies are additional to the original bare latency (i.e., without injected latency), we have to align them according to the absolute latency values (i.e., bare latency + injected latency). Notice that since we cannot inject negative latencies (i.e., obviously we can only slow down a tier), the dashed lines in the subfigures
"VM2 on tier2" and "VM3 on tier3" are regressed based on the solid lines. After that, we can draw three (colored) lines for each tier based on their absolute latency values (i.e., red for tier1, green for tier2, and blue for tier3). Then, we can easily predict the average I/O latency values of each VM on each tier (i.e., red points for t1, green points for t2, and blue points for t3). We see that VM1 is the most sensitive to tier speed changes (i.e., it has the greatest gradient), while VM3 is the least sensitive (i.e., relatively flat). Therefore, intuitively, we should assign VM1 to tier1 (the fastest tier) and VM3 to tier3 (the slowest tier). Alg. 1 lines 9-21 describe the procedure of the tier speed sensitivity calibration session (TSSCS). In addition, AutoTiering calculates the coefficient of variation (CV) of the sampling results to decide the estimation confidence, see lines 13 and 16-19.
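The regression and prediction machinery of TSSCS can be sketched in a few lines of Python. This is our rendering under assumed data layouts, not the ESXi implementation: a least-squares fit over (absolute latency, observed average latency) samples, a CV-based confidence in the spirit of Alg. 1 lines 13 and 16-19, and the latency shift of estimateAvgLat (Alg. 1 lines 34-35).

```python
# Sketch of the TSSCS regression and cross-tier latency prediction.
from statistics import mean, pstdev

def regress(samples):
    """Least-squares fit y = m*x + b over (absolute_latency_us, avg_io_latency_us)."""
    xs = [s[0] for s in samples]
    ys = [s[1] for s in samples]
    mx, my = mean(xs), mean(ys)
    m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return m, my - m * mx                       # slope m, intercept b

def confidence(replicates):
    """Coefficient of variation over repeated samples at one injected latency."""
    cv = pstdev(replicates) / mean(replicates)
    return 0.0 if cv >= 1 else 1.0 - cv         # mirrors Alg. 1 lines 16-19

def estimate_avg_lat(m, b, target_tier_lat_us, current_tier_lat_us):
    # Shift the fitted line by the latency gap between target and current tier.
    return m * (target_tier_lat_us - current_tier_lat_us) + b
```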
3) Performance Capacity Matrices:
Once we have the average latency vs. injected latency curves of each VM at the current moment, we calculate the corresponding performance estimation of throughput (denoted as P, in IOPS), bandwidth (denoted as B, in MBPS), and storage size (denoted as S, in bytes), and record them into three two-dimensional matrices, i.e., VMCapMat[t][v].P, VMCapMat[t][v].B, and VMCapMat[t][v].S, where t and v are the IDs of tier and VM, respectively. Compared to the first two matrices, the last one is relatively straightforward to obtain by calling the hypervisor APIs to measure the storage size that each VM is occupying. As shown in Alg. 1 lines 22-33, AutoTiering updates the VM capacity matrices (i.e., VMCapMat) by iterating over all tiers and VMs. It estimates the "new" latency under other tiers by calling the estimateAvgLat(v, t) function in Alg. 1 line 25. Lines 34-35 show the detail of the estimateAvgLat function, whose input parameters are VM v and target tier t, and which returns an estimate based on the linear regression parameters m and b. Once the estimated average I/O latency results are obtained, we calculate the throughput and bandwidth in Alg. 1 lines 30 and 31. Lastly, the storage size is also updated into VMCapMat (i.e., Alg. 1 line 32). Furthermore, since it is hard to evaluate the demands of different resource types together (because they have different units), AutoTiering normalizes each VM's estimated/measured throughput, bandwidth, and storage utilization values according to the total available resource capacity of each tier, producing the normalized capacity utilization rate matrix VMCapRatMat, as shown in Alg. 2 lines 4-13.

Algorithm 2: AutoTiering Procedure Part II.
1: Procedure calScore()
2:   for t ∈ tierSet do
3:     for v ∈ VMSet[t] do
4:       if maxP[t] < VMCapMat[t][v].P or maxB[t] < VMCapMat[t][v].B or maxS[t] < VMCapMat[t][v].S then
5:         VMCapRatMat[t][v].P = 0;
6:         VMCapRatMat[t][v].B = 0;
7:         VMCapRatMat[t][v].S = 0;
8:         tierVMPerfScore[t][v] = -1;
9:         continue;
10:      /* Convert to percent capacities */
11:      VMCapRatMat[t][v].P = VMCapMat[t][v].P / maxP[t];
12:      VMCapRatMat[t][v].B = VMCapMat[t][v].B / maxB[t];
13:      VMCapRatMat[t][v].S = VMCapMat[t][v].S / maxS[t];
14:      tierVMPerfScore[t][v] = agingFactor × histTierVMPerfScore[t][v] + currCapScore(t, v) - wetMig[t] × migCost(t, v);
15:   return;
16: Procedure migCost(t, v)
17:   migSpd = min(remReadThrput(VM[v].tier) + VM[v].currReadThrpt, remWriteThrput(t));
18:   return VM[v].size / migSpd;
19: Procedure triggerMigration()
20:   for t ∈ tierSet do
21:     for v ∈ descendingSortByScore(tierVMPerfScore[t]) do
22:       if VM[v].isAssigned = False and tierVMPerfScore[t][v] ≠ -1 and tierHasCapacityForVM(t, v) then
23:         assignVMToTier(v, t);
24:         VM[v].isAssigned = True;
25:   return;
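The capacity-matrix step is essentially a unit conversion followed by normalization. The sketch below (ours; names and units are assumptions, with latency in µs so that IOPS = 10^6/latency) mirrors Alg. 1 lines 25-32 and the cap check of Alg. 2 lines 4-13.

```python
# Sketch of building one VMCapMat entry and normalizing it into VMCapRatMat.
def capacity_entry(pred_lat_us, avg_io_size_bytes, vmdk_size_bytes):
    iops = 1e6 / pred_lat_us if pred_lat_us > 0 else 0.0
    return {'P': iops,                            # throughput (IOPS)
            'B': iops * avg_io_size_bytes / 1e6,  # bandwidth (MBPS)
            'S': vmdk_size_bytes}                 # storage size (bytes)

def normalize(cap, max_p, max_b, max_s):
    # A VM whose demand exceeds any cap is ineligible for this tier.
    if cap['P'] > max_p or cap['B'] > max_b or cap['S'] > max_s:
        return None
    return {'P': cap['P'] / max_p,
            'B': cap['B'] / max_b,
            'S': cap['S'] / max_s}
```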
4) Performance Score Calculation:
AutoTiering takes three steps to calculate the performance score, reflecting the following factors. First, Characteristics of both tier and VM: the score should reflect each tier's specialty and each VM's workload characteristics on each tier; thus, our solution is to calculate each VM's score on each tier separately. Second, SLA weights: VMs are not equal, since they have different SLA weights, as shown in Eq. 1. Third, Confidence of estimation: we use the coefficient of variation calculated in the performance matrices to reflect the confidence of the estimation. Fourth, History and migration costs: a convolutional aging factor for historical scores and the estimated migration cost are considered during the score calculation.
[Step 1] Tier Specialty Matrix: To reflect the specialty, we introduce a two-dimensional tier-specialty matrix spc. For example, "spc[t].P = 1, spc[t].B = 1 and spc[t].S = 0" reflects that tier t is good at throughput and bandwidth, but bad at storage capacity. In fact, this matrix can be extended to a finer granularity to further control the specialty degree, and more types of resources can be included in this matrix if needed. Moreover, tiers are sorted in high-to-low-end order (e.g., most-to-least-expensive tier) in the matrix, and this order is regarded as a priority order during migration decision making.

[Step 2] Orthogonal Match between VM Demands and Tier Specialties: The next question is "how to reflect each VM's performance on each tier AND how well VMs can utilize each tier's specialty?". Our solution is to introduce a process called "orthogonal match" (denoted as "Ω") to score the "matchness". This process is a per-VM-per-tier multiplication of the "specialty" matrix and the "VMCapRatMat" matrix, i.e.,

$$\mathrm{currCapScore}(t, v) = \Omega(t, v) = \big[\, spc[t].P \times wetP[t],\ spc[t].B \times wetB[t],\ spc[t].S \times wetS[t] \,\big] \times \begin{bmatrix} VMCapRatMat[t][v].P \\ VMCapRatMat[t][v].B \\ VMCapRatMat[t][v].S \end{bmatrix} \times VM[v].SLA \times VM[v].conf \div \big( wetP[t] + wetB[t] + wetS[t] \big), \quad (6)$$

where currCapScore gives the current capacity score, and VMCapRatMat is the VM capacity utilization rate matrix.

[Step 3] Convolutional Performance Score:
The final performance score is a convolutional sum of the historical score, the current epoch capacity score, and the penalty of the corresponding migration cost, i.e.,

$$tierVMPerfScore = agingFactor \times histTierVMPerfScore + currCapScore - wetMig \times migCost \quad (7)$$

This process is also shown in Alg. 2 line 14. Specifically, to avoid the case where some VMs are frequently migrated back and forth between tiers (due to making decisions based only on the most recent epoch, which may contain I/O spikes or bursts), AutoTiering convolutionally considers history scores, with a preset agingFactor to fade out outdated scores. The current capacity score currCapScore is calculated by the orthogonal match procedure. Additionally, Alg. 2 lines 16-18 show the procedure of the migration cost calculation.
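Eqs. 6 and 7 combine into a short scoring routine. The Python sketch below is our paraphrase under assumed data layouts (dictionaries keyed by 'P', 'B', 'S'); it is not the authors' code.

```python
# Sketch of the orthogonal match (Eq. 6) and convolutional score (Eq. 7).
def curr_cap_score(spc, wet, cap_rat, sla, conf):
    """Weighted dot product of tier specialty and VM utilization vectors."""
    match = sum(spc[k] * wet[k] * cap_rat[k] for k in ('P', 'B', 'S'))
    return match * sla * conf / (wet['P'] + wet['B'] + wet['S'])

def perf_score(hist_score, aging_factor, cap_score, wet_mig, mig_cost):
    """Fade out outdated scores, reward current fit, charge migration cost."""
    return aging_factor * hist_score + cap_score - wet_mig * mig_cost
```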
V. PERFORMANCE EVALUATION
A. Evaluation Methodology

1) Implementation Details:
We build AutoTiering on the VMware ESXi hypervisor 6.0.0 [1]. Table III summarizes the server configuration of our implementation. Table IV further shows the specs of each tier (each tier has multiple SSDs). We set the specialty matrix such that tier 1 is good for throughput and bandwidth performance, tier 2 is the secondary performance tier but with larger capacity, and tier 3 is the capacity tier to replace HDDs.
2) Workloads:
To evaluate performance under different algorithms, we use IOMeter [21] and FIO [22] to generate I/O workloads that represent real-world use cases. Table V shows statistics of the 14 workloads used. Each VM has multiple VMDKs with different sizes, such as a system disk, a datastore disk, etc.
TABLE III: Host server configuration.

Component | Specs
Host Server | HPE ProLiant DL380 G9
Host Processor | Intel Xeon CPU E5-2360 v3
Host Processor Speed | 2.40GHz
Host Processor Cores | 12 Cores
Host Memory Capacity | 64GB DIMM DDR4
Host Memory Data Rate | 2133 MHz
Host Hypervisor | VMware ESXi 6.0.0
TABLE IV: Multi-tier flash drive configuration.

Tier | Model | Protocol | IOPS (R/W) | MBPS (R/W) | Per-Disk Size (GB)
3) Comparison Candidates:
We compare AutoTiering (AT) with two other solutions [23]: (1) IDT: IOPS Dynamic Tiering, which implements dynamic configuration and placement using a greedy IOPS-only criterion, where extents with higher IOPS move to higher-IOPS tiers; and (2) EDT: Extent-based Dynamic Tiering, which updates the VM-tier assignment every epoch based on both VM capacity and IOPS requirements. To fully utilize the high-speed all-flash datacenter, we slightly modified IDT and EDT to support per-VMDK-based operation.
TABLE V: Resource demands of selected workloads.

Load | Workload | Represented Scenarios | Thrupt. (IOPS) | BW (BPS)
Heavy | BasicVerify | SQL database server | 95.5K | 373M
Heavy | SSDSteady | System development | 116K | 453M
Heavy | Zipf IOs | Web apps | 1942K | 7585M
Heavy | AsyncRead | Read intensive apps | 88.3K | 345M
Heavy | AsyncWrite | Write intensive apps | 6.65K | 25M
Middle | Flow | Big data frameworks | 19.2K | 150M
Middle | IOmeter | File server | 47K | 205M
Middle | JESD | High endurance apps | 18.3K | 136M
Middle | LatencyProfile | Cloud system manager | 39.6K | 155M
Middle | SSDTest | Hardware development | 47K | 205M
Light | RandZone | Multi-user database | 7.75K | 30.3M
Light | SurfaceScan | Enterprise backup server | 6.98K | 436M
Light | SyncRead | Read intensive sync apps | 6.65K | 25M
Light | SyncWrite | Metadata sync server | 4 | 16K
B. Study on Throughput, Bandwidth and Latency Changes
Fig. 3 illustrates the average throughput, bandwidth, and normalized latency of all tiers over time for both read (Rd) and write (Wt) I/Os. AutoTiering achieves up to 44.74% and 38.78% higher IOPS than IDT and EDT, respectively. Similar results are obtained for bandwidth and latency, as shown in Figs. 3(b) and (c). Fig. 4 depicts per-tier results to further show the performance improvement brought by AutoTiering. We observe that AutoTiering performs the best in terms of both read and write throughput, bandwidth, and latency on tier 1, because the specialty matrix sets tier 1 to optimize performance-sensitive workloads. On the other hand, we also see that AutoTiering sometimes achieves lower throughput and bandwidth on tiers 2 and 3 compared with IDT and EDT. This is because IDT is an IOPS-only algorithm, which migrates high-IOPS-demand (especially write I/O) workloads to tier 1, such that the write IOPS is optimized. Similarly, EDT considers both IOPS and capacity, and thus has slightly better write IOPS compared to AutoTiering in the capacity tier 3. It is worth mentioning that AutoTiering achieves the lowest latencies in all cases except the write latency on tier 2 (as shown in Fig. 4(c), 5th column), because AutoTiering migrates many VMDKs that have large average I/O sizes (high write bandwidth), and thus, as a tradeoff, the latency is increased. Moreover, Fig. 5 depicts the distribution of total throughput and bandwidth of all tiers for different algorithms. From Fig. 5(a), we observe that under AutoTiering (red curve), the majority of I/Os achieve more than 100K IOPS and even half of them achieve more than 125K IOPS. In contrast, 90% of I/Os are less than 100K IOPS under EDT (blue curve), and almost all I/Os from IDT are less than 100K IOPS (green curve). Similarly, from Fig. 5(b), we can see that the majority (around 90%) of IDT and EDT I/Os are less than 1,200 MBPS, while more than half of AutoTiering's I/Os achieve more than 1,200 MBPS bandwidth.

Fig. 3: Average throughput, bandwidth, and latency of all tiers.

Fig. 4: Average throughput, bandwidth, and latency of each tier.

Fig. 5: CDF of throughput and bandwidth of all tiers.
C. Study on Runtime Distribution of Resource Utilization
Fig. 6 shows the runtime changes of the throughput, bandwidth, and latency distribution across tiers over time. We observe that the areas in (c) and (f) are larger than those in (a)-(b) and (d)-(e), respectively. The area in (i) is also much smaller than those in (g)-(h). This verifies our observations in Sec. V-B that AutoTiering achieves better throughput and bandwidth performance than IDT and EDT. We also observe in (a) to (f) that the area of each tier under AutoTiering is "thicker" than that under IDT and EDT (after AutoTiering's warm-up period from epochs 0-3). This indicates that AutoTiering can better utilize the throughput and bandwidth resources of each tier through proper and fewer migrations. In (i), we see that the majority of AutoTiering's latency is located in tier 3 (the read latency "T3 Rd Lat"), which is because tier 3 is regarded as the capacity tier to replace HDDs. As a result, AutoTiering migrates read-intensive VMs with large VMDKs to leverage tier 3, and leaves tiers 1 and 2 for other write-intensive workloads.
D. Study on Migration Overhead
To investigate the migration overhead of the three algorithms, we show the normalized temporal migration cost results in Fig. 7. The blue bars show the normalized total migrated data size, and the green bars show the normalized number of VMs that were migrated (multiple migrations of a single VM are counted as one). The former reflects the "working volume size" and the latter reflects the "working set size". AutoTiering performs best among the three, since it migrates less data and interrupts fewer VMs, saving a significant amount of system resources. In summary, AutoTiering chooses the best tier for each VM for better performance and prevents unnecessary migrations due to I/O spikes, owing to its comprehensive decisions based on a more accurate performance estimation method.
VI. CONCLUSION
We present a novel data placement manager, "AutoTiering", to optimize virtual machine performance by allocating and migrating VMDKs across multiple SSD tiers in an all-flash datacenter. AutoTiering is based on an optimization framework to provide the globally best migration and allocation solution over runtime. We further proposed an approximation algorithm to solve the problem in polynomial time, which considers both historical and predicted performance factors, as well as the estimated migration cost. Experimental results show that AutoTiering can significantly improve system performance.
Fig. 6: Runtime changes of throughput, bandwidth, and latency of each tier under different algorithms.
Fig. 7: Normalized migration cost results.

REFERENCES

[2] T. N. Theis and H.-S. P. Wong, "The end of Moore's law: A new beginning for information technology," Computing in Science & Engineering, vol. 19, no. 2, pp. 41–50, 2017.
[3] D. G. Andersen and S. Swanson, "Rethinking flash in the data center," IEEE Micro.
[5] E. J. O'Neil, P. E. O'Neil, and G. Weikum, "The LRU-K page replacement algorithm for database disk buffering," in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, DC, 1993, pp. 297–306.
[6] M. Kampe, P. Stenstrom, and M. Dubois, "Self-correcting LRU replacement policies," in Proceedings of the 1st Conference on Computing Frontiers, Ischia, Italy, 2004, pp. 181–191.
[7] T. Johnson and D. Shasha, "2Q: A low overhead high performance buffer management replacement algorithm," in Proceedings of the 20th International Conference on Very Large Data Bases, San Francisco, CA, 1994, pp. 439–450.
[8] Y. Zhou, J. Philbin, and K. Li, "The multi-queue replacement algorithm for second level buffer caches," in Proceedings of the 2001 USENIX Annual Technical Conference, Boston, MA, 2001, pp. 91–104.
[9] D. Lee, J. Choi, J.-H. Kim, S. Noh, S. L. Min, Y. Cho, and C. S. Kim, "LRFU: A spectrum of policies that subsumes the least recently used and least frequently used policies," IEEE Transactions on Computers, vol. 50, no. 12, pp. 1352–1361, 2001.
[10] L. A. Belady, "A study of replacement algorithms for a virtual-storage computer," IBM Systems Journal, vol. 5, no. 2, pp. 78–101, 1966.
[11] ACM SIGARCH Computer Architecture News, vol. 42, no. 3, pp. 277–288, 2014.
[12] F. Meng, L. Zhou, X. Ma, S. Uttamchandani, and D. Liu, "vCacheShare: Automated server flash cache space management in a virtualization environment," in USENIX ATC, 2014.
[13] K. Krish, A. Anwar, and A. R. Butt, "HATS: A heterogeneity-aware tiered storage for Hadoop," in Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on, 2014.
[14] T. Pritchett and M. Thottethodi, "SieveStore: A highly-selective, ensemble-level disk cache for cost-performance," in Proceedings of the 37th Annual International Symposium on Computer Architecture, Saint-Malo, France, 2010, pp. 163–174.
[15] F. Xu, F. Liu, L. Liu, H. Jin, B. Li, and B. Li, "iAware: Making live migration of virtual machines interference-aware in the cloud," IEEE Transactions on Computers, vol. 63, no. 12, pp. 3012–3025, 2014.
[16] D. Gmach, J. Rolia, L. Cherkasova, and A. Kemper, "Resource pool management: Reactive versus proactive or let's be friends," Computer Networks, vol. 53, no. 17, pp. 2905–2922, 2009.
[17] T. Setzer and A. Wolke, "Virtual machine re-assignment considering migration overhead," in Network Operations and Management Symposium (NOMS), 2012 IEEE, 2012, pp. 631–634.
[18] Z. Fan, D. H. Du, and D. Voigt, "H-ARC: A non-volatile memory based cache policy for solid state drives," in Mass Storage Systems and Technologies (MSST), 2014 30th Symposium on, 2014, pp. 1–11.
[19] "vSphere APIs for I/O Filtering (VAIO) program," https://code.vmware.com/programs/vsphere-apis-for-io-filtering.
[20] T. Luo, S. Ma, R. Lee, X. Zhang, D. Liu, and L. Zhou, "S-CAVE: Effective SSD caching to improve virtual machine storage performance," in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.