
UNEDITED: ACCEPTED FOR PUBLICATION IN ELSEVIER RELIABILITY AND SYSTEM SAFETY JOURNAL, OCT 2019

A Data-Assisted Reliability Model for Carrier-Assisted Cold Data Storage Systems

Suayb S. Arslan, James Peng, and Turguy Goker

Abstract: Cold data storage systems are used to allow long-term digital preservation for institutions' archives. The common functionality among cold and warm/hot data storage is that the data is stored on some physical medium for read-back at a later time. However, in cold storage, write and read operations are not necessarily done in the same geographical location. Hence, third-party assistance is typically utilized to bring together the medium and the drive. On the other hand, the reliability modeling of such a decomposed system poses a few challenges that do not necessarily exist in warm/hot storage alternatives, such as fault detection and absence of the carrier, all of which add up to data unavailability issues. In this paper, we propose a generalized non-homogeneous Markov model that encompasses the aging of the carriers in order to address the requirements of today's cold data storage systems, in which the data is encoded and spread across multiple nodes for long-term data retention. We have derived useful lower/upper bounds on the overall system availability. Furthermore, the collected field data is used to estimate the parameters of a Weibull distribution to accurately predict the lifetime of the carriers in an example scale-out setting. In this study, we numerically demonstrate the significance of carriers' presence and the key role that their timely maintenance plays in the long-term reliability and availability of the stored content.

Index Terms: Cold data storage, Non-homogeneous Markov model, Reliability, Availability, Simulation, Aging, Archive.

NOMENCLATURE

$n$ / $\tilde{n}$: The number of nodes present in the system / the block length of an MDS code.
$k$ / $\tilde{k}$: The number of data nodes in the system / the payload length of an MDS code.
$s$: Number of node states.
$\lambda$: Rate of node failure process.
$\phi$: Rate of robot (carrier) repair process.
$\mu$: Rate of data repair process.
$\theta$: Rate of failure detection process.
$N_s$: Total number of Markov states when the number of node states is $s$.
ADF: Availability, Detection, Failure.
RAID: Redundant Array of Inexpensive/Independent Disks.
MTTDU: Mean time to data unavailability.
MTTDL: Mean time to data loss.
MDS: Maximum Distance Separable.
UCER: Uncorrectable error rate.
MTBF: Mean time between failures.
CT: Continuous Time.
SBF: Swaps/exchanges before failure.
TBS: Time between swaps/exchanges.
TPM: Transition Probability Matrix.
$\kappa$: Tape cartridge damage ratio.
$M$: Fundamental matrix of an absorbing Markov chain.
$I$: Identity matrix.
$a_{ij}$: The entry in the $i$th row and $j$th column of the matrix $A$.
$\epsilon$: Drive read hard error probability.
$\Gamma(.)$: Complete Gamma function.
$r$: The rate of the MDS code given by $\tilde{k}/\tilde{n}$.
$\eta$: Drive read hard error rate.
$B(.;.,.)$: Incomplete Beta function.
CDF: Cumulative Distribution Function.
$F_P(.)$: CDF of Poisson binomial distribution.
$\Re\{.\}$: The real part of the argument.
$g$: The shape parameter of the Weibull distribution.
$y$: The scale parameter of the Weibull distribution.
$\beta_i$: Survival probability of the node $i$.
$hs(.)$: Harmonic sum function.
LB: Performance Lower Bound.
UB: Performance Upper Bound.

Suayb S. Arslan is with the Department of Computer Engineering, MEF University, Istanbul, Turkey. e-mail: [email protected]
James Peng and Turguy Goker are with the Advanced Channel Team, Quantum Corp., Irvine, CA, USA. e-mails: {james.peng, turguy.goker}@quantum.com.

I. INTRODUCTION

The physical medium on which infrequently accessed data, a.k.a. cold data, is stored is quite susceptible to loss and corruption due to exposure to heat, humidity, dust-like contaminants and, in particular, faulty and unpredictable read/write hardware behaviors. Typical examples include drives and tape/optical media where the data is laid on a physical medium for long-term storage by utilizing the magnetic field directions [1]. On the other hand, one of the most popular cold storage options in today's world is Deoxyribonucleic acid (DNA) based media, in which the data is embedded inside DNA strands and write/read operations are performed by special hardware doing specific chemical operations, known as synthesizers and sequencers, respectively [2]. Strands are carried inside bottles and protected in special environments, such as the most recent molecular hopper technology [3], for long-term preservation. Similarly, data carriers (such as mechanical robots) are used to carry the tape cartridges or optical discs to drives in order to successfully complete read/write operations.

In accurate reliability modeling for cold data storage systems (such as tape), media failure, faulty drive behaviour, failure detection and data spreading (that is, how much the data is spread across different independent cold storage nodes) should all be taken into account [4]. In particular, the detection of faulty system behaviour and lost data can be done by periodic system/data check operations, known as scrubbing in disk-based systems [5]. Such a scrubbing process can be incorporated into reliability models [6], [7], [8]. In contrast, in case of carrier failures, data is not lost but is temporarily unavailable until the carrier is replaced or fixed by the maintenance team.
Therefore, the actual reliability model should treat data durability and availability individually and be able to neatly merge these important data-related performance indicators.

Reliability studies for disk arrays have been at the center of research interest in our modern age to help build sustainable and efficient data centers. One of the most comprehensive and general studies to date is given in [9], where hard errors and generic erasure coding are taken into account. The incorporation of both uncorrelated and correlated hard errors with Markov modelling is considered in a later study [10]. Furthermore, some special cases of the model proposed in [9] are considered and analyzed in [11]. One of the most general Markov models, with horizontal and vertical data allocation policies, was recently studied for disk systems [12], although the failure detection process is not taken into account there. Furthermore, the analysis is extended to study the effect of multidimensional redundancy on aging [13]. A special Markov chain is also considered to measure RAID-6 systems [14] without much generalization. Due to analytical complexity, software tools were later introduced for accurate prediction of large-scale data storage architectures [15]. With regard to cold storage systems, several past studies examined application-specific back-up systems in which keeping copies of the data (replication) was the primary means of providing durability [16]. Although they paid attention to the failure detection problem, they did not configure their model for true cold storage environment requirements, i.e., the necessity of carrying media from one location to another, the detrimental effects of mechanical components, the unavailability of carriers and drive-related hard errors. In [17], the work is extended to cover the 4-copy case, where the backup system consists of both tapes and hard drives with different failure and repair rates, i.e., a heterogeneous storage network.
Since the storage media and internal mechanics are different for tape systems and hard drives, the proposed model quickly gets complicated as the number of copies increases. Therefore, extending it beyond 4 copies seems to be quite challenging, and no systematic extension is proposed in the same line of work. In [18], a two-dimensional Markov process is proposed for modeling explicit and latent errors in disk-based distributed storage systems, in which failure detection is assumed to take almost no time. Since only disk-based systems are considered, the study is not extended to take into account the presence of carriers.

Note that the overall scale-out system reliability relies on the durability of the constituent cold storage units such as tapes. Several studies conducted life expectancy and media stability tests for magnetic tape, and the results revealed that a theoretical lifespan of 50-100 years is possible for magnetic tapes [19], [20]. Similar studies exist for optical disks [21] as well.

On the other hand, continuously failing storage media, conventionally neglected unavailability, and unexpected and time-dependent hard errors (mostly due to aging) make these predicted numbers obsolete. In addition, in all kinds of digital data storage, data is usually erasure coded and spread across different storage nodes through intelligent allocation schemes [22] for better durability, space overhead and data loss characteristics [23]. Data access patterns and system-level details are collected by most companies to better understand their real workload and to help develop better business strategies [24]. This type of data can further be used with machine learning techniques to improve the overall storage reliability [25].
Hence, accurately estimating the data durability requires a data-assisted reliability model which takes all the aforementioned detrimental effects into account.

Due to the complexities of data-dependent density estimation and the time-dependent carrier aging phenomenon, we formulated the proposed Markov model as a simulation platform to estimate the distributions of time to data loss and data unavailability. Simple statistics such as mean time to data loss (MTTDL) and mean time to data unavailability (MTTDU) will be derived from the simulation data and compared for various choices of system parameters, as well as against a few theoretical results known for a limited parameter space. For a better analysis, under reasonable assumptions, lower and upper bounds on the system performance will be derived and presented together with the numerical results.

The rest of the paper is structured as follows. In Section II, the general multi-dimensional Markov model, which accounts for carrier availability, is introduced. Moreover, its complexity is analyzed based on the number of states. In Section III, the most common problems in cold storage, i.e., hard read errors and carrier unavailabilities, are precisely modeled, transition rate and probability matrices are defined, and lower and upper bounds on the performance are derived. Since the lifetime of the carrier depends on the usage pattern, data assistance is utilized in Section IV to model the aging phenomenon. Finally, we provide a few numerical results in Section V to demonstrate the significance of carrier availability, and Section VI concludes the paper.

II. A CARRIER-PRESENT RELIABILITY MODEL

In this section, we describe the general model that incorporates data as well as carrier presence into a Markov model. We also consider a specific special case that is most common in realistic cold storage applications.

A. General Model Description

The proposed reliability model incorporates data-driven density estimation (a known distribution fit) and a continuous-time Markov process to accurately estimate the density of the system data loss and associated first-order statistics such as the MTTDL and MTTDU metrics. The model has two types of states, node states and system states, all combined and characterized by the Markov states. Although the proposed model is generally applicable to any type of cold storage, we specifically consider a scale-out tape library system.

Fig. 1. Generalized Markov model for $s = 3$ node states, i.e., represented by Availability (A), Detection (D) and Failure (F). The state F denotes the total failure/unavailability. Note that since $s = 3$, the corresponding Markov model can be represented on a two-dimensional plane. In general, the Markov model is $(s-1)$-dimensional.

Although the paper treats the number of node states as an arbitrary number, for simplicity we give our examples using three different node states that are the most commonly assumed in the cold data storage community. These are defined to be Available (A), Failed (F) and Detected (D) for a given node, and we automatically generate the associated Markov states of the overall system. We would like to remind the reader that continuous-time Markov processes are heavily used for warm/hot storage devices consisting of hard disk or solid state device arrays (such as given in [9]). Previous work focused on Markov processes with typically two node states, namely Available and Failed. Indeed, our treatment makes our model a generalized version of the previous Markov models, with configurable parameters. Although it becomes impossible to derive closed-form expressions, we shall use approximations to derive analytical results for performance metrics, such as lower and upper bounds on the mean statistics.

As assumed by many past research studies such as [26], [27] and [28], the cold storage system is protected by an $(\tilde{n}, \tilde{k})$ Maximum Distance Separable (MDS) erasure code, where $\tilde{n}$ is the codeword (block) length and $\tilde{k}$ is the payload length. Note that with this setting, the conventional replication (copy) system corresponds to $\tilde{k} = 1$. The quantity $r = \tilde{k}/\tilde{n} = k/n$ is termed the rate of the code. In other words, since we use MDS codes, we assume $\tilde{n}$ and $\tilde{k}$ to be multiples of $n$ and $k$, respectively. Due to encode/decode complexity and without loss of generality, we shall assume $n = \tilde{n}$ and $k = \tilde{k}$ throughout the document to convey the main idea.

We realize that node states and Markov states are not the same; in fact, a Markov state is described by a triple of node-state counts. We can visualize each Markov state as consisting of three buckets counting the number of nodes in each of the A, F and D node states. Furthermore, we assume nodes are exactly the same type and fail with the same rate $\lambda$, i.e., a homogeneous storage network. A node failure means that, due to a physical and irreversible error, data cannot be read from the tape that resides in that node.
In addition, we have three more processes running in the system; two of them are the concurrent and identical data and carrier repairer processes, and the third one is the concurrent and identical failure detector process. In this study, we assume that an error detection process is run on tapes and that tapes are repaired whenever an error is detected. In addition, media carriers such as robots can also be repaired, since without their availability all data operations will cease. The robot repairer, data repairer and failure detector processes are assumed to be exponentially distributed with rates $\phi$, $\mu$ and $\theta$, respectively.

The complexity of the proposed continuous-time Markov process is strongly tied to the total number of system states, which is a function of the number of node states. For $s > 1$, i.e., the number of node states being greater than one, suppose we have $i$ available nodes with $k \le i \le n$ (each containing a single data chunk; remember this is a requirement for perfect data reconstruction). Then we shall have $s-1$ node states to share a total of $n-i$ data chunks. In this case, the total number of decompositions is given by

$$\binom{n-i+s-2}{s-2}. \tag{1}$$

Footnote: In our formulation, we assume that node states A and F are naturally present in any reliability model. In further extensions of the model proposed in this study, different processes can be incorporated such as detection, participation and aging.

For instance, Fig. 1 shows all of the system states (Markov states) as a function of $k$ and $n$ if the node states are $s = 3$, represented by A, D and F. In yet another case, the state "Queued for Service: QS" can be added to make the number of node states four, i.e., $s = 4$. The closed-form expression to calculate the total number of Markov states (for general $s$) is given by (including the total failure state)

$$N_s = \sum_{i=k}^{n} \binom{n-i+s-2}{s-2} + 1 = \sum_{i=0}^{n-k} \binom{i+s-2}{s-2} + 1 = \binom{n-k+s-1}{n-k} + 1. \tag{2}$$

Note that for the general case, we can further express $N_s$ as follows:

$$N_s = 1 + \binom{n-k+s-1}{n-k} \tag{3}$$
$$= 1 + \binom{n-k+1}{1} + \sum_{i=2}^{s-1} \binom{n-k+i-1}{i} \tag{4}$$
$$\ge \sum_{i=0}^{s-1} \binom{n-k+1}{i}, \tag{5}$$

where we clearly see that equality in equation (5) holds only for $s = 2, 3$. For small $s$, equation (5) can be used as an accurate approximation. Note that going from (3) to (4), we have used induction. In Appendix A, we provide tighter upper and lower bounds for $N_s$ and asymptotically analyze the complexity of the final Markov reliability model.

B. A special case: 3-node-state Markov Model

As introduced earlier, let us assume we have three node states: A, D and F. Therefore, there are only three (in general $s$) destinations that a next state change could result in. For instance, for a given state index $(iA, jF, zD)$, Table I summarizes all the state indexes that are possible destinations. In the table, DS stands for destination state, and $i, j, z$ should satisfy

$$k \le i \le n, \quad 0 \le j \le n-k, \quad 0 \le z \le n-k, \quad i+j+z = n.$$

TABLE I
STATE TRANSITIONS AND RATES

Curr. State        Destination State (DS)          Transition Rate
$(iA, jF, zD)$     $((i-1)A, (j+1)F, zD)$          $i\lambda$
$(iA, jF, zD)$     $(iA, (j-1)F, (z+1)D)$          $j\theta$
$(iA, jF, zD)$     $((i+1)A, jF, (z-1)D)$          $z\mu$

One detail that the simulation model still needs is a transformation from the three-index state name to a single index that runs between 0 and $N-1$, for simulation convenience. One straightforward method is to let the common index $c_{ind}$ be

$$c_{ind} = \frac{1}{2}(j+z)(j+z+1) + z + 1 = \frac{(n-i+1)!}{2\,(n-i-1)!} + z + 1 = \binom{n-i+1}{2} + z + 1, \tag{6}$$

where $j + z = n - i$ is substituted to obtain the second expression. Note that there is a one-to-one relationship between $c_{ind}$ and $(i, j, z)$, so we can similarly find the inverse transform of the index $c_{ind}$. To find $(i, j, z)$ from a given $c_{ind}$, we first need to find the largest $d = n - i \in \{0, 1, \dots, n-k\}$ such that $c_{ind} > d(d+1)/2$, which gives $i = n - d$. From equation (6), we can then find $z$ since we know $n$ and $i$. Finally, we use the fact that $i + j + z = n$ to determine $j$. Also, the maximum of $c_{ind}$ is achieved with the state index $(k, 0, n-k)$. Since $N_s = \max\{c_{ind}\} + 1$, the following can be shown to be true:

$$N = \binom{n-k+1}{2} + n - k + 2 \tag{7}$$
$$= \frac{1}{2}\big((n-k+1)(n-k) + 2(n-k+2)\big) \tag{8}$$
$$= \frac{1}{2}(n-k+2)(n-k+1) + 1 \tag{9}$$
$$= \sum_{i=1}^{n-k+1} i + 1 = \sum_{i=0}^{n-k} (i+1) + 1. \tag{10}$$

Note that the final equality is the same as the result given by Eqn. (2) when $s = 3$. In order to derive performance expressions and apply the general model to a particular practical application, we shall assume $s = 3$ for the rest of our discussions.

III. FAILURE TYPES AND CARRIER UNAVAILABILITY IN COLD STORAGE

Hard error scenarios are well understood in warm/hot storage realms, i.e., when the drive and the storage medium are tightly coupled. In addition, there is no separate carrier availability problem due to this coupling. However, modeling and incorporating the hard errors as well as the carrier availability at the same time into a reliability model is much more challenging in a cold data storage context.

Let us consider tape library systems as an example use case. One of the fundamental challenges is that the robots (carrier devices) can make only a given number of exchanges or swaps (round-trips) before failure. Since such a constraint depends on time and the frequency of use (load of the system), this adds non-homogeneity to the Markov model at hand. Also, there are two driving forces for aging in the same system: (1) the user data access pattern, which is usually less dominant in a cold storage setting, and (2) the internally generated access requests due to system/data repair operations, which lead to extra robot exchanges, drive load/unload cycles, tape positioning etc., in order to meet the system reliability goals. Similar observations can be made for other popular cold storage alternatives.

In this study, we constrained our set and assumed three dominant factors, two of which directly affect the data durability while the other only changes the data unavailability. These factors, which need to be identified carefully, are explained in detail in the following subsections.

A. Drive Read Failure

This type of hard error is also quite prevalent in warm/hot storage, where drives become unable to read the data due to an uncorrectable error, i.e., a failure of the drive's internal error correction decoding. The uncorrectable error rate (UCER) is usually given in terms of errors per number of bytes or bits read, and such errors are usually due to random noise effects. This hard error mechanism is usually assumed to be time-independent, and the probability of seeing at least one read error can be calculated under an independence assumption as

$$\epsilon = 1 - (1 - \mathrm{UCER})^{\text{tape capacity}}, \tag{11}$$

where we assume the worst-case scenario of bulk reads, i.e., the entire tape is read, and tape capacity is the number of bytes a tape can store. Note that in cold storage, this worst-case scenario is quite common.

Transition rate matrix $Q$ of the generalized model (discussed in Section III-D):

$$Q = \begin{bmatrix}
-n\lambda & n\lambda\Delta_n & 0 & 0 & \cdots & n\lambda(1-\Delta_n) \\
0 & -(\theta + (n-1)\lambda) & \theta & (n-1)\lambda\Delta_{n-1} & \cdots & (n-1)\lambda(1-\Delta_{n-1}) \\
\mu & 0 & -(\mu + (n-1)\lambda) & 0 & \cdots & (n-1)\lambda(1-\Delta_{n-1}) \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \cdots & -(k\lambda + (n-k)\mu) & k\lambda \\
0 & 0 & \cdots & \cdots & 0 & 0
\end{bmatrix}$$

B. Storage Medium (Tape) Damage

In our study, we model the probability of tape damage due to manufacturing reasons (in the infant mortality period) at the onset, or due to external factors such as humidity, pressure and stringent temperature conditions later on in the tape's lifetime. Such factors are assumed to be static and to persist after failure detection and correction throughout the lifespan of the data stored in the cold storage. For this reason, we use $\kappa > 0$ to model this damage probability. More specifically, we take the damaged tapes to be $\kappa \times 100$ percent of all the tapes contained in a given library.

C. Carrier (Robot) Failure

Suppose that library robots (carriers) can make $m$ exchanges (round-trips) before they fail and eventually become unable to complete tasks initiated by the libraries, including the detection and repair processes. We assume the robot types and qualities are identical in all of the libraries. Based on the available data extracted from our local library systems, we shall show that $m$ can be modelled as a Weibull distributed random variable with some shape ($g$) and scale ($y$) parameters. It is typical to assume the time between exchanges to be exponentially distributed with rate $\omega$. The rate $\omega$ depends on a number of parameters such as the number of users using the system, the total number of libraries in a scale-out setting, the time of the year etc. The time to robot failure (characterized by the random variable $Y$) is therefore the sum of $m$ exponential random variables, each with the same rate $\omega$, i.e., it is Gamma distributed with the following pdf:

$$f_Y(y; \omega, m) = \frac{\omega^m y^{m-1} e^{-\omega y}}{\Gamma(m)} = \frac{\omega^m y^{m-1} e^{-\omega y}}{(m-1)!}, \quad \omega, m > 0, \tag{12}$$

where $\Gamma(.)$ is the complete Gamma function. Note that since the Gamma distribution is NOT memoryless, aging must be taken care of by inserting the time-dependent conditional CDF (carrier survival probabilities) given by

$$\beta(t) = P(Y > t \mid l \text{ exchanges made}) = 1 - P(Y < t \mid l \text{ exchanges made}) = \frac{\Gamma(l, \omega t)}{\Gamma(l)} = \frac{\Gamma(l, \omega t)}{(l-1)!}, \quad t > 0, \tag{13}$$

where $l < m$ exchanges are assumed to be made by the robot. In that case, the latter conditional probability is also Gamma distributed with the pdf $f_Y(y; \omega, l)$. Also, the included upper incomplete Gamma function is given by

$$\Gamma(l, \omega t) = \int_{\omega t}^{\infty} x^{l-1} e^{-x}\, dx. \tag{14}$$

On the other hand, the hard error rate is given by $\eta = 1 - (1-\epsilon)(1-\kappa)$. Note that we need to modify the model to compensate for the hard errors. Hard errors split the state transition from the current state $(iA, jF, zD)$ to the destination state $((i-1)A, (j+1)F, zD)$, which originally happens with rate $i\lambda$.
Similar to the observations in previous studies, for each state with $i > k$, we need a transition to the total failure (F) state to be able to incorporate the hard errors. With $i$ available nodes, we can tolerate up to $i-k-1$ concurrent hard errors to successfully make it to the state $((i-1)A, (j+1)F, zD)$. Assuming independence, this happens with probability

$$\Delta_i := \sum_{l=0}^{i-k-1} \binom{i}{l} \eta^l (1-\eta)^{i-l} = 1 - I_\eta(i-k, k+1), \tag{15}$$

where $l$ is the number of hard errors that occur at the same time while rebuilding or during regular data checks. Also, we have used the regularized beta function $I_x(a, b) = B(x; a, b)/B(a, b)$ instead, to avoid the instability and precision issues of the binomial CDF. Here $B(x; a, b)$ is called the incomplete beta function and is given by

$$B(x; a, b) = \int_0^x v^{a-1} (1-v)^{b-1}\, dv, \tag{16}$$

and its complete version is $B(a, b) = B(1; a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$. Note that we have the limit $\lim_{\eta \to 1} \Delta_i = 0$, in which case the proposed model reduces to a simple transition from $(nA, 0F, 0D)$ to F with rate $n\lambda$. Since the holding times are assumed to be exponential, the mean time to stay in that state is $1/(n\lambda)$, which can be thought of as the lower bound on the durability of the system. On the other hand, for small $\kappa$ and $\epsilon$, the upper bound can closely be approximated by the reliability of the system presented in Fig. 1. Finally, we summarize the new state transition table in Table II, with indexes satisfying

$$k+1 \le i \le n, \quad 0 \le j \le n-k-1, \quad 0 \le z \le n-k-1, \quad i+j+z = n.$$

TABLE II
STATE TRANSITIONS AND CORRESPONDING TRANSITION RATES

Curr. State        Destination State               Transition Rate
$(iA, jF, zD)$     $((i-1)A, (j+1)F, zD)$          $i\lambda\Delta_i$
$(iA, jF, zD)$     F                               $i\lambda(1-\Delta_i)$
$(iA, jF, zD)$     $(iA, (j-1)F, (z+1)D)$          $j\theta$
$(iA, jF, zD)$     $((i+1)A, jF, (z-1)D)$          $z\mu$

Note that if we redraw the overall Markov system given in Fig. 1 to incorporate hard errors, it will look more complicated. To reach such a transition table, we have made a few assumptions, which can be listed as follows.
• In a typical repair process, only $k$ tapes are selected for the repair process. In case of locally repairable codes [29], this number can be reduced. Alternatively, more than $k$ tapes can be requested and only the earliest $k$ reads can be used to improve performance.
• Hard errors are assumed to be independent, i.e., data cannot have more than one segment (data chunk) in the same library node.
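The hard-error quantities of Eqs. (11) and (15) and the four rates of Table II can be evaluated numerically. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, $\Delta_i$ is computed with the plain binomial sum rather than the regularized beta function, and any parameter values passed in are illustrative assumptions.

```python
import math


def read_error_probability(ucer: float, tape_capacity: float) -> float:
    """Eq. (11): probability of at least one uncorrectable error in a full-tape read."""
    # 1 - (1 - ucer)**capacity, written with log1p/expm1 so tiny rates do not round to 0.
    return -math.expm1(tape_capacity * math.log1p(-ucer))


def hard_error_rate(eps: float, kappa: float) -> float:
    """eta = 1 - (1 - eps)(1 - kappa), combining read errors and tape damage."""
    return 1.0 - (1.0 - eps) * (1.0 - kappa)


def delta(i: int, k: int, eta: float) -> float:
    """Eq. (15): probability of at most i-k-1 hard errors among i available nodes."""
    # range(i - k) covers l = 0, ..., i-k-1; the sum is empty (zero) when i == k.
    return sum(math.comb(i, l) * eta**l * (1.0 - eta) ** (i - l) for l in range(i - k))


def table_ii_rates(i, j, z, k, lam, theta, mu, eta):
    """Outgoing rates of state (iA, jF, zD) as listed in Table II."""
    d = delta(i, k, eta)
    return {
        "node_failure": i * lam * d,           # to ((i-1)A, (j+1)F, zD)
        "total_failure": i * lam * (1.0 - d),  # to the absorbing state F
        "detection": j * theta,                # to (iA, (j-1)F, (z+1)D)
        "repair": z * mu,                      # to ((i+1)A, jF, (z-1)D)
    }
```

The two failure rates always add up to $i\lambda$, so hard errors only redistribute the original failure transition and do not change the total rate of leaving a state.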

D. Transition Rate and Probability Matrices

The transition rate matrix (TRM) $Q$ is an $N_s \times N_s$ real-valued matrix whose entries $q_{ij} \ge 0$ for $i \ne j$ represent the rate of departing from state $i$ and arriving in state $j$. The transition rate matrix for our generalized model is the matrix $Q$ displayed earlier, in Section III-A. Note that we have included the total failure state (F) as part of the matrix, and hence the last row is an all-zero vector. We notice that the diagonal entries satisfy

$$q_{ii} = -\sum_{j \ne i} q_{ij} \;\Rightarrow\; \sum_j q_{ij} = 0 \quad \forall i \in \{0, 1, \dots, N_s - 1\}, \tag{17}$$

which means the rows of the matrix must sum to zero. Furthermore, let us define $\bar{Q}$, whose entries are defined as $\bar{q}_{ij} := q_{ij}/|q_{ii}|$ for all $i$ and $j$. Then, the transition probability matrix (TPM) is given by $P = I + \bar{Q}$. Note that this is the precise version of the uniformization technique, which computes transient solutions of finite-state continuous-time Markov chains by approximating the process using a discrete-time Markov chain. This formulation will be useful when we derive the upper bound on the performance.

Footnote: In that uniformization technique, $q_{ii}$ is replaced with $\gamma \ge \max |q_{ii}|$.

E. Modeling the Carrier Availability

The treatment of the previous section did not include the availability of the carriers in the state transition matrix. One of the observations is that although the node failures (e.g., tape failures) are independent of the robots' availability, the node repair and failure detection mechanisms are highly dependent on the availability of carriers (robots), i.e., the rates that describe failure detection and node repair must be time-dependent as well. As time passes, detection and repair rates will go down unless carriers are updated sufficiently fast.

When nodes fail for various reasons, the failure detection process immediately commences. Similarly, when these failures are detected, the associated repair process starts immediately. So for a given operating time $t$, the system robots will not be of the same age and quality (due to potential replacements etc.). This leads to unequal treatment of storage nodes, and our simulation setup must keep track of the indexes for which robots are replaced in order to model the aging phenomenon.
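Before the rates are made time dependent, the time-homogeneous matrix $Q$ of Section III-D can be assembled directly from Table II. The sketch below is one possible construction under stated assumptions, not the authors' simulator: it uses a 0-based variant of the index in Eq. (6), places the total failure state F last, routes the whole failure rate $k\lambda$ to F when $i = k$ (since $\Delta_k = 0$), and treats F as absorbing in the jump chain $P = I + \bar{Q}$. All function names are hypothetical.

```python
import math


def state_index(i: int, z: int, n: int) -> int:
    """0-based variant of Eq. (6) for state (iA, jF, zD), with j = n - i - z."""
    d = n - i  # number of nodes that are failed or detected
    return d * (d + 1) // 2 + z


def delta(i: int, k: int, eta: float) -> float:
    """Eq. (15) evaluated with the plain binomial sum."""
    return sum(math.comb(i, l) * eta**l * (1 - eta) ** (i - l) for l in range(i - k))


def build_rate_matrix(n, k, lam, theta, mu, eta):
    """Transition rate matrix Q for s = 3; the last row/column is the state F."""
    num_live = (n - k + 2) * (n - k + 1) // 2  # states with i >= k, cf. Eq. (9)
    fail = num_live
    q = [[0.0] * (num_live + 1) for _ in range(num_live + 1)]
    for i in range(k, n + 1):
        for z in range(n - i + 1):
            j = n - i - z
            src = state_index(i, z, n)
            d_i = delta(i, k, eta)
            if i > k:  # one more node fails and the data survives
                q[src][state_index(i - 1, z, n)] += i * lam * d_i
            q[src][fail] += i * lam * (1 - d_i)  # failure leading to F
            if j > 0:  # a failed node gets detected
                q[src][state_index(i, z + 1, n)] += j * theta
            if z > 0:  # a detected node gets repaired
                q[src][state_index(i + 1, z - 1, n)] += z * mu
            q[src][src] = -sum(q[src])  # diagonal makes the row sum to zero, Eq. (17)
    return q


def jump_chain(q):
    """P = I + Q-bar with q-bar_ij = q_ij / |q_ii|; zero rows are made absorbing."""
    size = len(q)
    p = [[0.0] * size for _ in range(size)]
    for r in range(size):
        out = -q[r][r]
        if out == 0.0:
            p[r][r] = 1.0  # assumption: the all-zero row (state F) stays put
            continue
        for c in range(size):
            if c != r:
                p[r][c] = q[r][c] / out
    return p
```

For example, `build_rate_matrix(4, 2, 1e-3, 0.5, 0.2, 0.01)` yields a $7 \times 7$ matrix, in agreement with Eq. (9) for $n - k = 2$.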

1) Time-dependent Failure Detection:

For simplicity, let us assume each node has a single carrier (through averaging arguments, it can be generalized to multiple carriers without changing the following discussion) and let $\psi_o(t)$ be the probability of $o \in \{0, 1, \dots, i\}$ available robots (conditioned on a specific set of $i$ nodes) in the system at time $t$ with survival probabilities $\beta_{s_1}, \beta_{s_2}, \dots, \beta_{s_i}$, where $s_m \in \{1, 2, \dots, n\}$. It can be shown that

$$\psi_o(t) = \Re\{F_P(o)\} - \Re\{F_P(o-1)\}, \tag{18}$$

where $\Re\{.\}$ denotes the real part and $F_P(.)$ is the CDF of the Poisson binomial distribution given by

$$F_P(o) = \frac{1}{n+1} \sum_{l=0}^{i} e^{-\sqrt{-1}\, 2\pi l o/(n+1)} \prod_{m=1}^{i} \left(1 + \left(e^{\sqrt{-1}\, 2\pi l/(n+1)} - 1\right) \beta_{s_m}(t)\right), \tag{19}$$

where $\sqrt{-1}$ is the complex number that is a solution of the equation $x^2 = -1$. On the other hand, since in our study we assume failures, detections and repairs all to be exponentially distributed, and detection and carrier repairs can only happen consecutively, sums of multiple independent exponential random variables arise naturally. However, to keep our later analysis analytically tractable, we will use a first-order approximation in this subsection. More specifically, we will assume the sum of $x+1$ exponential random variables with rates $\phi, \theta_1, \dots, \theta_x$ to be approximately exponentially distributed with rate $R_{\boldsymbol{\theta}}(\phi)$ given by

$$R_{\boldsymbol{\theta}}(\phi) = \frac{1}{\frac{1}{\phi} + \sum_{c=1}^{x} \frac{1}{\theta_c}}, \tag{20}$$

where $\boldsymbol{\theta} = [\theta_1, \dots, \theta_x]$. When a failure event is detected by the system, a state transition happens from the originator state $(iA, jF, zD)$ to the destination state $(iA, (j-1)F, (z+1)D)$ for $j > 0$. While performing the detection, we need $j$ robots to complete the process; if fewer are found, say $b < j$, then we need to repair $j - b$ robots to have a total of $j$ robots to work on the detection process.
Suppose that at time t , we conditionon having l failed robots satisfying ≤ l ≤ j ≤ n − k .Then the conditional repair rate i.e., the rate of making thedetection transition in the Markov model is given by ( j − l ) θ + lR θ ( φ ) . Thus, summing over all possibilities of l , we Although in numerical result section, we will show that this assumptionis a good approximation by simulating the actual distributions.

NEDITTED: ACCEPTED FOR PUBLICATION IN ELSEVIER RELIABILITY AND SYSTEM SAFETY JOURNAL, OCT 2019 7 get the unconditional node failure detection rate given by θ j ( t ; φ ) = j (cid:88) l =0 (( j − l ) θ + lR θ ( φ )) ψ j − l ( t ) (21) = jθ − ( θ − R θ ( φ )) j (cid:88) l =0 lψ j − l ( t ) (22) = jθ − ( θ − R θ ( φ )) j (cid:88) m =1 (1 − β s m ( t )) (23) = jθ − θ θ + φ j (cid:88) m =1 (1 − β s m ( t )) (24)where s m ∈ { , . . . , n } . Notice that we have the inequalityfor any t , jθ − jθ θ + φ ≤ θ j ( t ; φ ) ≤ jθ (25)which implies that as φ → ∞ i.e., robot repairs beinginstantaneous, the detection rate would be jθ which is thesame as that of without any robot failures as given in Fig. 1.
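As a numerical sanity check of (21) to (25), the following minimal sketch evaluates the detection rate both as the explicit sum over failed robots and in closed form. It is only an illustration under stated assumptions: the Poisson binomial probabilities are obtained by direct convolution instead of the DFT expression in (19), and the function names and the survival probabilities used in the example are hypothetical.

```python
import math


def poisson_binomial_pmf(betas):
    """PMF of the number of available robots for independent survival
    probabilities `betas`. Computed by direct convolution (dynamic
    programming), which is a substitute for the DFT form of (19)."""
    pmf = [1.0]
    for b in betas:
        nxt = [0.0] * (len(pmf) + 1)
        for o, p in enumerate(pmf):
            nxt[o] += p * (1.0 - b)   # this robot has failed
            nxt[o + 1] += p * b       # this robot is available
        pmf = nxt
    return pmf


def detection_rate(theta, phi, betas):
    """Return the detection rate via the sum (21) and the closed form (24)."""
    j = len(betas)
    r = 1.0 / (1.0 / phi + 1.0 / theta)      # R_theta(phi), eq. (20) with x = 1
    psi = poisson_binomial_pmf(betas)        # psi[o] = P(o robots available)
    rate_sum = sum(((j - l) * theta + l * r) * psi[j - l] for l in range(j + 1))
    rate_closed = j * theta - theta ** 2 / (theta + phi) * sum(1.0 - b for b in betas)
    return rate_sum, rate_closed


# Hypothetical example values: three failed nodes, yearly detection, monthly robot repair.
theta, phi = 1.0 / 8760, 1.0 / 720
betas = [0.9, 0.8, 0.95]
rate_sum, rate_closed = detection_rate(theta, phi, betas)
lower = len(betas) * theta - len(betas) * theta ** 2 / (theta + phi)
upper = len(betas) * theta
```

Both expressions should agree up to floating-point error and fall inside the interval given by (25).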

2) Time-dependent Carrier Repair:

After a node failure is detected, our system immediately begins the repair process, and the completion of the repair process implies a state transition from the originator state $(iA, jF, zD)$ to the destination state $((i+1)A, jF, (z-1)D)$ for all originator states having $z > 0$. Let us suppose we are in state $(iA, jF, zD)$ at time $t$ and $l$ of the $i$ available nodes have their carrier robot already failed. Note that for classical MDS codes, we need $k$ helper nodes to be able to complete the data request successfully (see the first footnote below). Suppose further that $x$ of these requests are from the failed set, and $k - x$ are from the available and operational ones. Due to sampling without replacement, the probability of that happening is given by the hypergeometric distribution (see the second footnote below). In this particular condition, we need to wait for the $x$ failed carriers to be repaired first; this waiting time is the maximum of the repair times and is typically not exponentially distributed. In fact, this distribution can be shown to be equal to that of a sum of exponential distributions, which in this subsection is assumed to be close to another exponential distribution with mean $\frac{1}{\phi}\sum_{m=1}^{x} \frac{1}{m}$ (i.e., rate $\phi / hs(x)$), where the harmonic sum can be approximated closely by

$$hs(x) := \sum_{m=1}^{x} \frac{1}{m} \approx \log(x) + \zeta + \frac{1}{2x} - \frac{1}{12x^2} + \frac{1}{120x^4} \tag{26}$$

where $\zeta \approx 0.5772$ is known as the Euler–Mascheroni constant.

After all the necessary repair information is collected by any of the $z$ detected nodes, each begins the computation needed for the repair process and writes the repaired data to the corresponding storage unit. But the write process needs at least one carrier/robot available. The availability analysis is quite similar to that of the detection process (each node uses its own robot for detecting the failure), and thus the corresponding rate, denoted by $\mu_z(t; \phi)$, is expressed as

$$\mu_z(t; \phi) = z\mu - \frac{\mu^2}{\mu + \phi} \sum_{m=1}^{j} (1 - \beta_{s_m}(t)) \tag{27}$$

On the other hand, the conditional repair rate (conditioned on $x$ and $l$) can be expressed as

$$\mu_z^i(t; \phi, k \mid x, l) = \frac{\binom{i-l}{k-x}\binom{l}{x}}{\binom{i}{k}} \left(\frac{1}{\mu_z(t; \phi)} + \frac{hs(x)}{\phi}\right)^{-1} \tag{28}$$

where $s_m \in \{1, \ldots, n\}$. Finally, the unconditional repair rate can be obtained by summing over all $x$ and $l$ as follows:

$$\mu_z^i(t; \phi, k) = \sum_{l=0}^{i} \psi_{i-l}(t) \sum_{x=0}^{l} \mu_z^i(t; \phi, k \mid x, l) \tag{29}$$

$$= \sum_{l=0}^{i} \psi_{i-l}(t)\, \mu_z(t; \phi) \tag{30}$$

$$\quad + \sum_{l=0}^{i} \psi_{i-l}(t) \sum_{x=1}^{l} \mu_z^i(t; \phi, k \mid x, l) \tag{31}$$

Note that if $\phi \to \infty$, i.e., we assume immediate robot repairs, we shall have

$$\mu_z^i(t; \infty, k) = \sum_{l=0}^{i} \psi_{i-l}(t) \sum_{x=0}^{l} \frac{\binom{i-l}{k-x}\binom{l}{x}}{\binom{i}{k}}\, z\mu = z\mu \tag{32}$$

meaning that, with robot repairs being instantaneous, the node repair rate would be $z\mu$, which is the same as the rate without any robot failures. Finally, we summarize the new state transitions in Table III with indexes satisfying the following constraints: $k \leq i \leq n$, $0 \leq j \leq n - k$, $0 \leq z \leq n - k$, $i + j + z = n$.

Footnote: Various network codes exist that may require access to more or fewer than $k$ helper nodes, with partial node content accesses, for full recovery [30]. The present discussion changes only slightly if such a class of codes is used instead.

Footnote: Sampling with replacement would lead to binomially distributed statistics instead.

F. Lower/Upper bounds on the Performance

For a given finite carrier repair rate $\phi < \infty$, if we let the exchange rate tend to large values, the carrier repairs will not be able to catch up, eventually resulting in total carrier unavailability. In that particular case, it is of interest to derive the lower bound on performance in closed form. We note that in the case of total carrier unavailability there is no failure detection, and therefore no data repair in a cold storage context; hence, the survival time depends on which state the system is in and whether a hard error leads to unrecoverable state transitions. In light of this observation, the lower bound ($LB$) is approximated in Appendix B in terms of the $\Delta_i$'s as follows:

$$LB \approx \sum_{i=k}^{n} (1 - \Delta_i) \prod_{j=i+1}^{n} \Delta_j \left( \frac{\log\left(\frac{n}{i-1}\right)}{\lambda} + \frac{1}{\lambda} \sum_{l=0}^{1} (-1)^l \left( \frac{1}{2n_l} - \frac{1}{12n_l^2} + \frac{1}{120n_l^4} \right) \right) \tag{33}$$

with $\Delta_k = 0$ and $n_l = n - l(n - i + 1)$.

[Figure 2: block diagram. Data sets (SBF: swaps before failure; TBS: time between swaps) feed time series analysis and density estimation, followed by a Weibull fit (parameter estimation) that yields the estimated shape and scale parameters, with MTBF = $s\Gamma(1 + 1/g)$. A Weibull random number generator drives the non-homogeneous CT-Markov process simulation platform, whose other inputs are the failure/repair/detection rates, the $(n, k)$ codeword/payload lengths, UCER, tape age failure (percentage) and tape capacity.]

Fig. 2. Data-inspired overall modeling and simulation platform implemented in MATLAB. The density estimation problem is reduced to parameter estimation by assuming a Weibull distribution for the robot exchange performance data. In this figure, $s$ represents the scale and $g$ represents the shape parameter. MTBF: Mean Time Before Failure.

On the other hand, if we let the exchange rate tend to zero, there will be no need for carrier repairs, resulting in total carrier availability. In that particular case, it is of interest to derive the upper bound on performance in closed form. We realize that there is only one absorbing state in our model (the Failure state) and hence the TPM is already in its canonical form,

$$P_{N_s \times N_s} = \begin{bmatrix} L_{(N_s-1) \times (N_s-1)} & R_{(N_s-1) \times 1} \\ 0 \ \cdots \ 0 & 1 \end{bmatrix}$$

For an absorbing Markov chain, the inverse of the matrix $I - L$ is called the fundamental matrix (denoted by $M$), and it can be expressed as

$$M = (I - L)^{-1} = I + \sum_{i=1}^{\infty} L^i \tag{34}$$

in which the entry $m_{ij}$ provides the expected number of times that the Markov process visits the transient state $s_j$ when it is initialized in the transient state $s_i$. Since we assume all nodes to be available in the beginning, i.e., the system is in state $nA$ at the beginning of the operation, we are interested in the $m_{1j}$'s. For $s_j$, all outgoing transitions happen according to exponential distributions, so the hold time is given by their minimum, which is also exponentially distributed with rate $-q_{jj}$. This implies that the average hold time in each visit to $s_j$ is $-1/q_{jj}$. Finally, the upper bound can be approximated by

$$UB \approx -\sum_{j=1}^{N_s - 1} \frac{m_{1j}}{q_{jj}} \tag{35}$$

Note that this is only an approximation, since the TPM is an approximation to the continuous-time Markov model. Also, we can analytically assess the upper bound on the time-dependent performance, including the robot failure and repair processes, by considering the rate matrix given by Table III instead of Table II. This is possible because we have approximated the distributions as exponential to keep Markovianity intact.
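The fundamental-matrix computation in (34) and (35) can be illustrated with a small, self-contained sketch. It is not the paper's simulation platform: the three-state birth-death chain, the function names and the repair rate below are hypothetical, and the matrix $L$ is taken to be the embedded jump chain of the generator (an assumption made here so that the hold time $-1/q_{jj}$ applies to each visit).

```python
import math


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[r][n] / m[r][r] for r in range(n)]


def birth_death_generator(n, k, lam, mu):
    """Generator restricted to the transient states with n, ..., k available
    nodes. Failures occur at rate i*lam, repairs at rate mu (hypothetical toy
    chain); leaving the last state by a failure leads to the absorbing state."""
    states = list(range(n, k - 1, -1))
    size = len(states)
    q = [[0.0] * size for _ in range(size)]
    for idx, i in enumerate(states):
        if idx + 1 < size:
            q[idx][idx + 1] = i * lam          # one more node fails
        if idx > 0:
            q[idx][idx - 1] = mu               # one node is repaired
        q[idx][idx] = -(i * lam + (mu if idx > 0 else 0.0))
    return q


def upper_bound(q):
    """Evaluate (35): minus the sum of m_1j / q_jj over transient states."""
    n = len(q)
    # Embedded jump chain among transient states: L[i][j] = q_ij / (-q_ii).
    L = [[0.0 if i == j else q[i][j] / -q[i][i] for j in range(n)] for i in range(n)]
    # First row of M = (I - L)^{-1}, obtained by solving (I - L)^T x = e_1.
    a = [[(1.0 if i == j else 0.0) - L[j][i] for j in range(n)] for i in range(n)]
    m1 = solve(a, [1.0] + [0.0] * (n - 1))
    return -sum(m1[j] / q[j][j] for j in range(n))


lam = 1.0 / 50000
no_repair = upper_bound(birth_death_generator(4, 2, lam, 0.0))
with_repair = upper_bound(birth_death_generator(4, 2, lam, 1.0 / 24))
```

Without repairs every transient state is visited exactly once, so the result reduces to $\sum_{j=k}^{n} 1/(j\lambda)$; adding repairs can only increase the expected time to absorption.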
| Current State | Destination State | Transition Rate |
|---|---|---|
| $(iA, jF, zD)$ | $((i-1)A, (j+1)F, zD)$ | $i\lambda\Delta_i$ |
| $(iA, jF, zD)$ | $F$ | $i\lambda(1 - \Delta_i)$ |
| $(iA, jF, zD)$ | $(iA, (j-1)F, (z+1)D)$ | $\theta_j(t; \phi)$ |
| $(iA, jF, zD)$ | $((i+1)A, jF, (z-1)D)$ | $\mu_z^i(t; \phi, k)$ |

TABLE III. State transitions and corresponding transition rates.

This approximation will be verified later, in the numerical results section, to be sufficiently accurate for the range of parameters of interest.

IV. DATA-ASSISTED MODELING FRAMEWORK

In our modeling framework, we utilize a data-inspired approach for estimating the number of round-trips (exchanges in our context) that a carrier makes before a critical failure happens. A critical failure takes place when the robot is no longer able to operate within the library system, for various reasons, until it is replaced with a new one. In our tape application, the total number of robot exchanges before failure (SBF) is assumed to be Weibull distributed, which shall be validated by the field data collected using enterprise Quantum libraries. The Weibull distribution is completely characterized by two independent parameters called the shape ($g$) and scale ($y$). The reason we choose Weibull is twofold. First, it is a generalization of the exponential distribution (single parameter) most commonly assumed in the literature; in other words, setting the shape parameter to $g = 1$ reduces the Weibull to an exponential distribution. Secondly, it is heavy-tailed (for $g < 1$) and closely characterizes the observed field data. We realize that heavy-tailed distributions characterize various types of data more accurately as the number of parameters of the distribution increases. For instance, it is reported in various studies that the data object size tends to possess a heavy-tailed distribution such as Pareto [31]. In addition, several studies show that heavy-tailed distributions might well characterize local file system dynamics and file sizes [32], archival data [33] and the data stored and communicated over the world wide web [34].

We note that, based on the available field data and the Weibull assumption, the challenging density estimation problem is transformed into a parameter estimation problem. More precisely, we estimate the shape ($g$) and scale ($y$) parameters of the distribution through simple linear regression. Secondly, we obtain an estimate of the distribution of the time between exchanges/swaps.
Using the same data set, this distribution is observed to have an exponential tail, and hence a single parameter (the rate) has to be estimated. An exponential assumption is also convenient because the corresponding counting process becomes an analytically tractable Poisson process. Since the estimated parameter is a function of the utilization rate of the system, and hence is time-dependent, we shall test a range of values in our simulations to illustrate the overall picture. A summary of the modeling framework is depicted in Fig. 2. In this framework, the data-based parameter estimates ($\hat{g}$ and $\hat{y}$) are fed into the proposed non-homogeneous Markov process as estimated inputs. In addition to these inputs, we also set the rest of the simulation parameters $\lambda, \mu, \theta, n, k, \kappa, \epsilon$, as well as the number of simulation instances, to appropriate values based on the field data and our experience with 6TB tapes. The system is protected with an $(n, k)$ MDS code. The random number generator draws a random SBF value according to the estimated Weibull distribution, so that each iteration of the simulation uses a unique realization. We typically simulate over 10000 times to obtain reliable values. The main purpose of the simulation platform is to estimate the distribution of the overall data loss and/or unavailability (whichever degrades the performance first) and, at the same time, to demonstrate the implicit relationship between these two important performance metrics. In other words, we can numerically estimate the MTTDL and MTTDU metrics with confidence. In addition, the mean value of the number of exchanges is given by $\hat{y}\,\Gamma(1 + 1/\hat{g})$, which shall be used as the guideline for robot performance in the numerical results section.

Note that there is more than one way to estimate the Weibull distribution parameters $g$ and $y$. We adopt simple linear regression in this study.
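A minimal sketch of this estimation step is given below, using synthetic data in place of the field data. It relies on the standard linearization of the Weibull CDF, $\ln(-\ln(1 - W(t))) = g\ln(t) - g\ln(y)$, so that the regression slope estimates the shape and the intercept yields the scale. The median-rank plotting positions used for the empirical CDF, the seed, and the "true" parameters are assumptions made for illustration only; the source does not state how the empirical CDF was formed.

```python
import math
import random


def fit_weibull(samples):
    """Estimate (shape, scale) by ordinary least squares on the
    linearized Weibull CDF."""
    t = sorted(samples)
    n = len(t)
    xs, ys = [], []
    for rank, value in enumerate(t, start=1):
        w = (rank - 0.3) / (n + 0.4)            # median-rank estimate of W(t) (assumed)
        xs.append(math.log(value))
        ys.append(math.log(-math.log(1.0 - w)))
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    shape = slope                               # shape = slope of the fitted line
    scale = math.exp(-intercept / shape)        # scale = exp(-intercept / shape)
    return shape, scale


# Synthetic "swaps before failure" data drawn from a known Weibull distribution.
rng = random.Random(0)
true_shape, true_scale = 0.67, 525985.0
sbf = [rng.weibullvariate(true_scale, true_shape) for _ in range(10000)]

shape_hat, scale_hat = fit_weibull(sbf)
mean_exchanges = scale_hat * math.gamma(1.0 + 1.0 / shape_hat)
```

With 10000 samples the recovered parameters should land close to the ones used for sampling, and `mean_exchanges` gives the corresponding mean number of exchanges before failure.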
However, a few algebraic manipulations are needed to put the CDF of the Weibull distribution in an appropriate form. Accordingly, let us recall the Weibull CDF $W(t)$, given by

$$W(t) = 1 - e^{-(t/y)^g} \tag{36}$$

which can be rearranged and expressed as the following linear equation:

$$\ln(-\ln(1 - W(t))) = g \ln(t/y) = g \ln(t) - g \ln(y) \tag{37}$$

If we set the ordinate to the left-hand side, $\ln(-\ln(1 - W(t)))$, and the abscissa to $\ln(t)$, and apply a linear regression, we obtain a linear function with an intercept ($\mathcal{I}$) and a slope ($\mathcal{S}$). Using these estimates, we can generate the estimates of the shape parameter ($\hat{g}$) and the scale parameter ($\hat{y}$) as follows:

$$\hat{g} = \mathcal{S}, \qquad \hat{y} = \exp(-\mathcal{I}/\hat{g}). \tag{38}$$

In other words, the slope of the line is the shape parameter, whereas the scale parameter is calculated from the intercept and the estimate of the shape parameter according to equation (38).

[Figure 3: scatter plot of $\ln(-\ln(1 - W(t)))$ versus $\ln(t)$; legend: Data, Linear Regression.]

Fig. 3. Robot exchange data and Weibull parameter estimations using a linear regression.

To demonstrate the accuracy of the Weibull assumption, we recorded around 40000 robot exchanges before they cease operation. These equal-quality robots operate inside Quantum Scalar i6K enterprise libraries, which can house up to 12000 cartridges and are optimized for high density. This data is plotted in Fig. 3 based on the formulation given in equation (37), where the intercept and slope can easily be found and used to calculate the shape parameter, $\hat{g}$ = 0. and the scale parameter $\hat{y} = 491669$. The accumulation in the data for $t$ values satisfying ≤ $\ln(t)$ ≤ is due to the fact that most robots have a logarithmic lifetime in that range. Based on the estimated parameters of the Weibull distribution, the average number of exchanges can be calculated to be $\hat{y}\,\Gamma(1 + 1/\hat{g}) = 580747$ exchanges before a critical robot failure happens. In the numerical results section, we shall choose our parameters within the ballpark of these figures to make our results and conclusions realistic. We finally note that distributions with more parameters could approximate the data better; however, this would yield only a minor accuracy advantage at the expense of increased estimation complexity.

V. NUMERICAL RESULTS

As is usually the case with cold (and archival) storage platforms, we primarily focus on the read-back, or in other words the data retrieval, performance in this section. We present a few numerical results for the proposed simulation and modeling platform. The intention is to illustrate reliability as a function of exchange rates (in exchanges per hour, xph) and carrier repair rates ($\phi$). The former typically changes based on the system utilization rate, whereas the latter is under the control of the system maintenance team. Another reason for choosing these parameters to vary is that they directly affect the unavailability of the system; i.e., in the absence of the carrier (failed carrier), the overall data access time increases until carrier repair takes over. We particularly note that most carrier devices (e.g., robots) are shipped with a maximum exchange/swap rate for reliable operation (such as 840 exchanges per hour (xph) [35]). We therefore vary the abscissa from some small exchange rate to somewhere above the reported maximums and present the results in a log-log plot. Similarly, when we plot the MTTDU in terms of $\phi$, we fix the exchange/swap rate and vary the carrier repair rate to see the effect of repair frequency on the unavailability performance. Please note that the system could be operating at any point on these performance curves at a given time $t$.

TABLE IV. Simulation parameters.

| Parameter | Value |
|---|---|
| $\lambda$ (per hour) | 1/50000 |
| $\mu$ (per hour) | 1/24 |
| $\theta$ (per hour) | 1/8760 |
| $(n, k)$ | Variable |
| $\epsilon$ (UCER) | − tape capacity |
| $\hat{g}$ (shape, Weibull) | 0.37 and 0.67 |
| $\hat{y}$ (scale, Weibull) | 525985 |
| $\kappa$ | |
| Number of simulations | > 10000 |

The parameters of the simulation are briefly summarized in Table IV. As can be seen, we have assumed a day-long mean data/tape repair time and a year-long mean failure detection time as a starting point. These numbers are again application specific and can be changed per use case. We have selected a few example half-rate code parameters, such as (4,2) and (6,3), with varying reliability guarantees. The scale-out system consists of $n$ identical libraries, where each library stores and handles only one chunk of data when requested. Each library has its own unique robot and contains multiple storage units such as tapes. Tapes and robots are assumed to be of equal quality and type.
The direct effect of the inner details of the scale-out library system, such as the total number of libraries $M$, the number of tapes per library, the geometry of the tape shelf locations, etc., is accounted for by the Weibull parameter estimations (shape and scale) and the data-assisted modeling framework. This data analysis saves us from getting into the inner complexities of library systems and provides us with the statistical nature of the number of exchanges per library. This is later used as an input for the proposed generalized Markov model introduced in the previous section. Finally, note that UCER is assumed to be − , which is far lower than − , the UCER of the known disk systems. This is due to the high data durability guarantees of the next generation tape technology [10].

[Figure 4: MTTDU (hours) versus robot (carrier) repair rate ($\phi$); legend: Actual, Exponential Approx.]

Fig. 4. The accuracy of the exponential-tail approximation with respect to MTTDU as a function of robot repair rate ($\phi$) using a (4,2) MDS code for two different exchange rates, 10xph and 100xph.

[Figure 5: MTTDL/MTTDU (hours) versus exchange rate (xph); legend: ~0, 1/8640, 1/720, 1/48, LB.]

Fig. 5. MTTDL/MTTDU (hours) as a function of exchange rate (round trips or swaps per hour) for a (4,2) MDS code with the shape parameter 0.67 for different repair rates $\phi$. LB stands for the derived lower bound.

Our first simulation is presented in Fig. 4, where we demonstrate the validity of the exponential-tail assumption made earlier (such as expression (20)) for approximating the non-exponential distributions that appear in various stages of the proposed Markov model. We have used a (4,2) MDS code and three exchange rates, namely 10xph, 100xph and 1000xph, and plotted MTTDU results using both the actual distributions and the exponential-tail approximation (used in this simulation) as a function of the robot repair rate. One of the initial observations is that our exponential-tail assumption leads to a lower bound on the actual MTTDU values. Furthermore, the worst-case difference between the actual and approximate MTTDU values does not affect the number of nines, a metric typically used in the industry to express system reliability with regard to the MTTDL performance metric.
For the rest of this subsection, we shall present our results using the exponential-tail approximation due to its simpler formulation and analytical tractability.

In Fig. 5, we present MTTDL/MTTDU in hours as a function of the exchange rate for a (4,2) MDS code. In light of our data observations and equations (37) and (38), derived earlier for estimating the scale and shape parameters, we have found that the two shape parameters 0.67 and 0.37, along with the scale parameter 525985 (see Table IV), are most common. Note that if a robot lasts only one year, the corresponding average total numbers of exchanges before a robot failure would indicate an average of 79.4 xph and 251.2 xph, respectively. We have also included the availability lower bound (as given by equation (33)) in our plots, which does not change with the growing exchange rate.

As can be seen from Fig. 5, as the exchange rate tends to zero (going from right to left on the abscissa), the reliability closely converges to the durability of the system model introduced in Fig. 1, where availability no longer poses a risk to data access. On the other hand, as the exchange rate tends to large values (going from left to right on the abscissa), MTTDU converges to the durability lower bound. Depending on the operating point of the library system, our model shows how the unavailability changes as a function of the exchange rate if we do not have sufficiently frequent robot repair in place. Also, it can be observed from the same simulation data that a robot repair rate of $\phi$ = 1/ seems to be sufficiently frequent for the system maintenance, and hence we do not see any notable reduction in the availability for this particular repair rate.

On the other hand, we observe from Fig. 5 (and, for that matter, from Figs. 6 and 7) that as we increase the robot repair rate $\phi$, i.e., we perform more frequent robot repairs, we improve the availability.
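The exchange-rate figures quoted above can be reproduced with a short calculation. The sketch below assumes the Table IV values (shape parameters 0.67 and 0.37, scale 525985) and a one-year robot lifetime of 8760 hours; the helper name is hypothetical.

```python
import math

HOURS_PER_YEAR = 8760
SCALE = 525985.0  # Weibull scale from Table IV


def mean_exchanges(shape, scale):
    """Mean of a Weibull distribution: scale * Gamma(1 + 1/shape)."""
    return scale * math.gamma(1.0 + 1.0 / shape)


# Average exchanges per hour if the mean number of exchanges is spread over one year.
xph = {shape: mean_exchanges(shape, SCALE) / HOURS_PER_YEAR for shape in (0.67, 0.37)}
```

Under these assumptions `xph[0.67]` and `xph[0.37]` come out at roughly 79.4 and 251.2 exchanges per hour, matching the values stated in the text.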
However, at some point, increasing $\phi$ does not help much; i.e., the system's robots are repaired fast enough that unavailability no longer leads to a dramatic performance loss. In order to find the optimal robot repair rate, we also need to plot MTTDU as a function of $\phi$ for a range of exchange/swap rates. These performance plots are shown in Fig. 6 and Fig. 7 for the half-rate MDS codes (4,2) and (6,3), respectively. Also included in the same plots are the corresponding lower and upper bounds computed using equations (33) and (35), respectively.

[Figure 6: MTTDU (hours) versus robot (carrier) repair rate ($\phi$); legend: (4,2) exr=1/50, (4,2) exr=1/100, (4,2) exr=1/500, (4,2) exr=1/1000, LB, UB.]

Fig. 6. MTTDU (hours) as a function of robot repair rate ($\phi$) and a few example exchange rates (exr) shown for the (4,2) MDS code. LB stands for the lower bound.

[Figure 7: MTTDU (hours) versus robot (carrier) repair rate ($\phi$); legend: (6,3) exr=1/50, (6,3) exr=1/100, (6,3) exr=1/500, (6,3) exr=1/1000, LB, UB.]

Fig. 7. MTTDU (hours) as a function of robot repair rate ($\phi$) and a few example exchange rates (exr) shown for the (6,3) MDS code. LB stands for the lower bound.

There are two interesting observations common to both plots. The first is that MTTDU performances converge after certain exchange rates. For example, in Fig. 6, $\phi$ = 1/ and $\phi$ = 1/ do not provide dramatically different MTTDU performances. The exact same trend can be observed for the (6,3) MDS code in Fig. 7 as well. Therefore, since lowering the exchange rate improves the availability, we can talk about an optimal repair rate beyond which we do not experience any unavailability for all possible exchange rates of interest. Having determined the optimal exchange rate for a given system is important from both the user satisfaction and the energy savings points of view. The second observation with respect to these plots is that, keeping the code rate fixed, as the blocklength of the MDS code gets larger, the associated lower bound gets worse.
This is because the number of parities does not scale as much as needed to compensate for the increased blocklength. However, using the expression derived for the lower bound (equation (33)), one can immediately notice that the performance difference between MDS codes of different sizes and the same rate disappears as $n$ grows large. From Fig. 6 and Fig. 7, we can quantify this difference for both half-rate codes, namely the (4,2) and (6,3) MDS codes, respectively.
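The blocklength trend can be illustrated on the simplified bound of Appendix B, $\sum_{j=k}^{n} 1/(j\lambda)$, which applies when there are no hard errors ($\Delta_i = 1$) and upper-bounds $LB$ according to (47). The sketch below is only an illustration of that simplified expression, not of the curves in Figs. 6 and 7; it also checks the harmonic-sum approximation (26). The function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def hs_exact(x):
    """Harmonic sum hs(x) = sum_{m=1}^{x} 1/m."""
    return sum(1.0 / m for m in range(1, x + 1))


def hs_approx(x):
    """Asymptotic approximation of hs(x) as in (26)."""
    return (math.log(x) + EULER_GAMMA + 1.0 / (2 * x)
            - 1.0 / (12 * x ** 2) + 1.0 / (120 * x ** 4))


def no_hard_error_bound(n, k, lam):
    """Simplified bound sum_{j=k}^{n} 1/(j*lam), valid when Delta_i = 1."""
    return sum(1.0 / (j * lam) for j in range(k, n + 1))


lam = 1.0 / 50000  # node failure rate per hour, as in Table IV
bounds = {(n, n // 2): no_hard_error_bound(n, n // 2, lam) for n in (4, 6, 20, 200)}
limit = math.log(2.0) / lam  # limiting value for half-rate codes as n grows
```

For half-rate codes the bound decreases with the blocklength ((4,2) gives about 54167 hours and (6,3) gives 47500 hours for this $\lambda$) and approaches $\ln(2)/\lambda$, so the gap between code sizes shrinks as $n$ grows.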

VI. CONCLUSION

The reliability modeling of cold data storage involves a set of challenges due to its specific functional requirements. Many external and data-unrelated factors play significant roles in the data-loss and data-unavailability guarantees provided to the end user. In this study, we have pointed out several of these important factors and proposed a data-assisted general Markov model for cold data storage equipped with carrier assistance. Furthermore, we have proposed a data-assisted simulation platform for a $(\tilde{n}, \tilde{k})$-coded scale-out cold data storage system in which we have accounted for different node states, hard errors and data unavailability all at the same time. A tape library system is considered as a special use case of this reliability model. We have demonstrated the effects of choosing different carrier repair rates on the reliability and availability of the system based on the operating exchange rate. In addition, useful upper and lower bounds on the availability performance are derived. Finally, we have investigated the critical choice of the blocklength of the underlying fixed-rate MDS code and its effect on the system availability. Key features of the system are its data-driven distribution estimation framework, used to model aging, and its straightforward applicability to replication-based systems for both reliability engineers and system designers. As future work, we shall extend our model to encompass more than three node states, the possibility of having multiple carriers (robots) per node, and real-time parameter estimation for an adaptive reliability model.

ACKNOWLEDGEMENTS

The authors would like to acknowledge the anonymous reviewers for their valuable input and suggestions, which tremendously improved the quality of the presentation. This work is a joint collaboration with Quantum Corporation, LTO Advanced Development Team and Dr. Arslan.

APPENDIX A
UPPER/LOWER BOUND ON $N_s$

Observe that, using the Vandermonde convolution on the expression given for $N_s$, we can rewrite it for $s > 2$ as

$$N_s = \binom{n - k + s - 1}{s - 1} + 1 \tag{39}$$

$$= \sum_{j=1}^{s-2} \binom{s-2}{j} \binom{n-k+1}{j+1} + n - k + 2 \tag{40}$$

$$\geq \sum_{j=2}^{s-1} \left(\frac{s-2}{j-1}\right)^{j-1} \binom{n-k+1}{j} + \sum_{j=0}^{1} \binom{n-k+1}{j} \tag{41}$$

from which we can deduce that

$$N_s \geq \sum_{j=0}^{s-1} \binom{n-k+1}{j}. \tag{42}$$

Note that the lower bound in (41) is tighter than the one in (42). For the upper bound, we observe that

$$N_s = \sum_{j=1}^{s-2} \binom{s-2}{j} \binom{n-k+1}{j+1} + n - k + 2$$

$$= \sum_{j=2}^{s-1} \binom{s-2}{j-1} \binom{n-k+1}{j} + \sum_{j=0}^{1} \binom{n-k+1}{j}$$

$$\leq \sum_{j=0}^{s-1} \binom{s-1}{j} \binom{n-k+1}{j}$$

$$= \sum_{j=0}^{s-1} \binom{s-1}{j} \binom{n-k+1}{n-k+1-j} = \binom{s+n-k}{s-1} \tag{43}$$

where the inequality in (43) follows from Pascal's rule, which for any positive $m \geq c$ is given by

$$\binom{m}{c} = \binom{m-1}{c-1} + \binom{m-1}{c} \tag{44}$$

Since $\binom{s+n-k}{s-1} = \binom{s+n-k}{n-k+1}$, we can deduce that for a fixed $s \ll n$ and a scaling of $k$ that is linear in some large $n$, i.e., $k = \alpha n$ for any $\{\alpha : 0 < \alpha < 1\}$, the total number of system (Markov) states will scale with $O(n^{\min\{s-1,\, n-k\}}) = O(n^{\min\{s-1,\, (1-\alpha)n\}}) = O(n^{s-1})$. In other words, the complexity of our simulation framework (and of the corresponding Markov chain) grows exponentially in the number of node states unless the rate of the code goes to unity ($r \to 1$), i.e., $k$ becomes sublinear in $n$ with constant $n - k < s$. In that case, the total number of system states would scale with $O(n^{n-k})$, with no dependence on the number of node states $s$.

APPENDIX B
DERIVATION OF PERFORMANCE LOWER BOUND

Note that we can rewrite $\Delta_i$, the conditional probability that we do not end up in the total data loss (failure) state when there are $i$ available nodes, in an explicit integral form:

$$\Delta_i = \frac{\Gamma(i+1)}{\Gamma(i-k)\,\Gamma(k+1)} \int_0^{\eta(t)} v^{a-1} (1-v)^{b-1}\, dv = \frac{i!}{(i-k-1)!\, k!} \int_0^{\eta} v^{i-k-1} (1-v)^{k}\, dv = (i-k) \binom{i}{k} \int_0^{\epsilon + \kappa - \epsilon\kappa} v^{i-k-1} (1-v)^{k}\, dv \tag{45}$$

where $a = i - k$, $b = k + 1$, and $\eta = 1 - (1-\epsilon)(1-\kappa)$ is the hard error rate. To derive the lower bound, we consider the case of total carrier unavailability. This results in no failure detection, and hence no data repair process is initiated. This leads to the simplified version of the Markov model shown in Fig. 8. Note that the hard error that leads to total failure when there are $i$ available nodes happens with probability $(1 - \Delta_i) \prod_{j=i+1}^{n} \Delta_j$. The total time before failure is the sum of the average times spent in each of the visited system states, i.e., $1/n\lambda, 1/(n-1)\lambda, \ldots, 1/i\lambda$.

[Figure 8: chain of states $nA \to ((n-1)A, 1F) \to \cdots \to ((k+1)A, (n-k-1)F) \to (kA, (n-k)F)$, where the state with $i$ available nodes moves forward at rate $i\lambda\Delta_i$ and moves to the failure state $F$ at rate $i\lambda(1 - \Delta_i)$.]

Fig. 8. The general CT Markov model reduces to a single-dimensional one with the following transition rates.

Finally, by summing over all possible $i$, we can estimate the lower bound on the total average time spent before failure as

$$LB = \sum_{i=k}^{n} (1 - \Delta_i) \prod_{j=i+1}^{n} \Delta_j \sum_{j=i}^{n} \frac{1}{j\lambda} = \sum_{i=k}^{n} (1 - \Delta_i) \prod_{j=i+1}^{n} \Delta_j\, \frac{hs(n) - hs(i-1)}{\lambda}$$

$$\approx \sum_{i=k}^{n} (1 - \Delta_i) \prod_{j=i+1}^{n} \Delta_j \left( \frac{\log\left(\frac{n}{i-1}\right)}{\lambda} + \frac{1}{\lambda} \sum_{l=0}^{1} (-1)^l \left( \frac{1}{2n_l} - \frac{1}{12n_l^2} + \frac{1}{120n_l^4} \right) \right) \tag{46}$$

with $\Delta_k = 0$ and $n_l = n - l(n - i + 1)$. Note that if there are no hard errors, i.e., $\Delta_i = 1$ for $i = n, \ldots, k+1$, we have the simplified lower bound with the closed-form expression $\sum_{j=k}^{n} 1/j\lambda$. In general we have

$$LB \leq \sum_{j=k}^{n} \frac{1}{j\lambda} = \frac{hs(n) - hs(k-1)}{\lambda} \tag{47}$$

Note that the approximation in (46) follows from (26).

REFERENCES

[1] J. Allen-Robertson, “The Materiality of Digital Media: The Hard Disk Drive, Phonograph, Magnetic Tape and Optical Media in Technical Close-up,”

New Media & Society, vol. 19, no. 3, pp. 455–470, 2017.
[2] G. M. Church, Y. Gao, and S. Kosuri, “Next-Generation Digital Information Storage in DNA,” Science, vol. 337, no. 6102, pp. 1628–1628, 2012.
[3] Y. Qing, S. A. Ionescu, G. S. Pulcu, and H. Bayley, “Directional Control of a Processive Molecular Hopper,” Science, vol. 361, no. 6405, pp. 908–912, 2018.
[4] S. S. Arslan, J. Lee, J. Hodges, J. Peng, H. Le, and T. Goker, “MDS Product Code Performance Estimations under Header CRC Check Failures and Missing Syncs,” IEEE Transactions on Device and Materials Reliability, vol. 14, no. 3, pp. 921–930, 2014.
[5] A. Oprea and A. Juels, “A Clean-Slate Look at Disk Scrubbing,” in FAST, 2010, pp. 57–70.
[6] J. Ryu and C. Park, “Effects of Data Scrubbing on Reliability in Storage Systems,” IEICE Transactions on Information and Systems, vol. 92, no. 9, pp. 1639–1649, 2009.
[7] T. J. Schwarz, Q. Xin, E. L. Miller, D. D. Long, A. Hospodor, and S. Ng, “Disk Scrubbing in Large Archival Storage Systems,” in The IEEE Computer Society’s 12th Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems (MASCOTS 2004), Proceedings. IEEE, 2004, pp. 409–418.
[8] I. Iliadis, R. Haas, X.-Y. Hu, and E. Eleftheriou, “Disk Scrubbing versus Intra-Disk Redundancy for High-Reliability RAID Storage Systems,” in ACM SIGMETRICS Performance Evaluation Review, vol. 36, no. 1. ACM, 2008, pp. 241–252.
[9] J. L. Hafner and K. Rao, “Notes on Reliability Models for Non-MDS Erasure Codes,” IBM Res. rep. RJ–10391, 2006.
[10] A. Dholakia, E. Eleftheriou, X.-Y. Hu, I. Iliadis, J. Menon, and K. Rao, “A New Intra-Disk Redundancy Scheme for High-Reliability RAID Storage Systems in the Presence of Unrecoverable Errors,” ACM Transactions on Storage (TOS), vol. 4, no. 1, p. 1, 2008.
[11] J.-F. Pâris, S. T. J. Schwarz, A. Amer, and D. D. Long, “Highly Reliable Two-Dimensional RAID Arrays for Archival Storage,” in . IEEE, 2012, pp. 324–331.
[12] S. S. Arslan, “A Reliability Model for Dependent and Distributed MDS Disk Array Units,” IEEE Transactions on Reliability, vol. 68, no. 1, pp. 133–148, 2019.
[13] S. S. Arslan, “Redundancy and Aging of Efficient Multidimensional MDS Parity-Protected Distributed Storage Systems,” IEEE Transactions on Device and Materials Reliability, vol. 14, no. 1, pp. 275–285, 2014.
[14] P. Rahman and G. D. N. F. Shavier, “Analysis of Mean Time to Data Loss of Fault-Tolerant Disk Arrays RAID-6 based on Specialized Markov Chain,” in IOP Conference Series: Materials Science and Engineering, vol. 327, no. 2. IOP Publishing, 2018, pp. 022–086.
[15] R. J. Hall, “Tools for predicting the reliability of large-scale storage systems,” ACM Transactions on Storage (TOS), vol. 12, no. 4, p. 24, 2016.
[16] P. Constantopoulos, M. Doerr, and M. Petraki, “Reliability Modeling for Long Term Digital Preservation,” in , 2005.
[17] Y. Han and C. P. Chan, “Modeling System Reliability for Digital Preservation: Model Modification and Four-Copy Model Study,” in Proceedings of the Fifth International Conference on Preservation of Digital Objects (iPRES). The British Library, London, 2008.
[18] L. Ivanichkina and A. Neporada, “The Reliability Model of a Distributed Data Storage in Case of Explicit and Latent Disk Faults,” ARPN Journal of Engineering and Applied Sciences, vol. 10, no. 20, pp. 9150–9158, 2015.
[19] J. Judge, R. Schmidt, R. Weiss, and G. Miller, “Media Stability and Life Expectancies of Magnetic Tape for Use with IBM 3590 and Digital Linear Tape systems,” in IEEE, 2003, pp. 97–100.
[20] R. D. Weiss, “Environmental Stability Study and Life Expectancies of Magnetic Media for Use with IBM 3590 and Quantum Digital Linear Tape Systems,” Report to National Archives and Records Administration, 2002.
[21] C. J. Shahani, B. Manns, and M. Youket, “Longevity of CD media: research at the library of congress,” Preservation Research and Testing Division, Washington, DC, 2003.
[22] R. P. Jernigan IV, A. Tracht, and P. F. Corbett, “Data Allocation within A Storage System Architecture,” Nov. 10 2009, US Patent 7,617,370.
[23] A. Cidon, S. Rumble, R. Stutsman, S. Katti, J. Ousterhout, and M. Rosenblum, “Copysets: Reducing the frequency of data loss in cloud storage,” in , 2013, pp. 37–48.
[24] M. M. Botezatu, I. Giurgiu, J. Bogojeska, and D. Wiesmann, “Predicting Disk Replacement towards Reliable Data Centers,” in

Proceedings of the22nd ACM SIGKDD International Conference on Knowledge Discoveryand Data Mining . ACM, 2016, pp. 39–48.[25] F. Mahdisoltani, I. Stefanovici, and B. Schroeder, “Proactive errorprediction to improve storage system reliability,” in , 2017, pp. 391–402.

[26] W. A. Burkhard and J. Menon, “Disk array storage system reliability,” in FTCS-23 The Twenty-Third International Symposium on Fault-Tolerant Computing. IEEE, 1993, pp. 432–441.
[27] R. B. Wideman, S. S. Arslan, J. Lee, and T. Goker, “Data Deduplication with Adaptive Erasure Code Redundancy,” Nov. 22 2016, US Patent 9,503,127.
[28] S. S. Arslan, T. Goker, and R. B. Wideman, “Joint De-Duplication-Erasure Coded Distributed Storage,” Jun. 11 2019, US Patent 10,318,389.
[29] D. S. Papailiopoulos and A. G. Dimakis, “Locally Repairable Codes,” IEEE Transactions on Information Theory, vol. 60, no. 10, pp. 5843–5855, 2014.
[30] A. G. Dimakis, K. Ramchandran, Y. Wu, and C. Suh, “A Survey on Network Codes for Distributed Storage,” Proceedings of the IEEE, vol. 99, no. 3, pp. 476–489, 2011.
[31] M. Satyanarayanan, “A Study of File Sizes and Functional Lifetimes,” in ACM SIGOPS Operating Systems Review, vol. 15, no. 5. ACM, 1981, pp. 96–108.
[32] A. B. Downey, “The Structural Cause of File Size Distributions,” in MASCOTS 2001, Proceedings Ninth International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems. IEEE, 2001, pp. 361–370.
[33] V. Ramaswami, K. Jain, R. Jana, and V. Aggarwal, “Modeling Heavy Tails in Traffic Sources for Network Performance Evaluation,” in Computational Intelligence, Cyber Security and Computational Models. Springer, 2014, pp. 23–44.
[34] W. Gong, Y. Liu, V. Misra, and D. Towsley, “On the Tails of Web File Size Distributions,” in Proceedings of the Annual Allerton Conference on Communication Control and Computing, vol. 39, no. 1. The University; 1998, 2001, pp. 192–201.
[35] S. Richards, “Maintaining a Large Scale, Very Active Tape Archive,” in 34th International Conference on Massive Storage Systems and Technology