DEAP Cache: Deep Eviction Admission and Prefetching for Cache

Ayush Mangal,∗ Jitesh Jain,∗ Keerat Kaur Guliani,∗ Omkar Bhalerao∗
Vision & Language Group, IIT Roorkee
[email protected], jitesh [email protected], [email protected], [email protected]
Abstract
Recent approaches for learning policies to improve caching target just one of the prefetching, admission, and eviction processes. In contrast, we propose an end-to-end pipeline to learn all three policies using machine learning. We also take inspiration from the success of pretraining on large corpora to learn specialized embeddings for the task. We model prefetching as a sequence prediction task based on past misses. Following previous works suggesting that frequency and recency are the two orthogonal fundamental attributes for caching, we use an online reinforcement learning technique to learn the optimal policy distribution between two orthogonal eviction strategies based on them. While previous approaches used the past as an indicator of the future, we instead explicitly model the future frequency and recency in a multi-task fashion with prefetching, leveraging the ability of deep networks to capture future trends and using them for learning eviction and admission. We also model the distribution of the data in an online fashion using Kernel Density Estimation in our approach, to deal with the problem of caching non-stationary data. We present our approach as a "proof of concept" of learning all three components of cache strategies using machine learning and leave improving practical deployment for future work.

Introduction
Caches have low latency but limited space, which must be utilized effectively. Since the problem of accessing such data from the main memory is predictive in nature, various efforts have previously been directed at applying machine learning techniques to the task of cache optimisation. [2] modelled the task of prefetching as a sequence prediction problem based on past misses, which we adopt as well. [5] demonstrated frequency and recency as two orthogonal attributes for cache eviction decisions and learned an optimal policy distribution between two approaches based on past estimates, namely LRU and LFU. [3] used an imitation learning-based approach for cache replacement, wherein they used a byte-level representation to deal with the exponential size of the address vocabulary. They also observed that learning both prefetching and replacement had not been appropriately addressed in any previous work. Improving on these approaches, we state the main contributions of this paper as:
• We propose a machine learning method to learn all three components of caching strategies, i.e., prefetching, admission and replacement.
• We enhance the byte-level representations using recent advances in natural language processing.
• We tackle the problem of non-stationary data by modelling the data distribution explicitly using Kernel Density Estimation (KDE).
• We explicitly model the future estimates of two orthogonal attributes, namely frequency and recency, for learning the optimal replacement and admission policies, instead of using the past as an indicator of the future.

∗ All authors contributed equally; the names are listed in alphabetical order.
Methodology
Training Mode (Offline)
Pretrained Byte Embeddings using Word2Vec:
Following [3], we use byte-level embeddings of the two features used to represent cache misses: the missed address and the corresponding Program Counter (PC). However, we take it one step further, drawing on recent advances in pretraining on text corpora to train Word2Vec [4] based specialized byte embeddings.
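As a concrete illustration, here is a minimal NumPy sketch of byte-level encoding followed by a learnable combiner, in the spirit described above. All dimensions, weight matrices, and the example address are hypothetical stand-ins, not the trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, HIDDEN, ADDR_DIM = 20, 64, 20               # assumed sizes

# One pretrained embedding table per byte position (values here are random
# placeholders for the word2vec-trained matrices W_j).
W = [rng.normal(size=(256, EMB_DIM)) for _ in range(4)]
W1 = rng.normal(size=(4 * EMB_DIM, HIDDEN))          # MLP layer 1
W2 = rng.normal(size=(HIDDEN, ADDR_DIM))             # MLP layer 2

def address_embedding(addr: int) -> np.ndarray:
    """Split a 32-bit address into 4 bytes, look up each byte's embedding,
    concatenate, and pass through a small MLP to get an address embedding."""
    bytes_ = [(addr >> (8 * j)) & 0xFF for j in range(4)]
    b = np.concatenate([W[j][bytes_[j]] for j in range(4)])  # b1 ⊕ b2 ⊕ b3 ⊕ b4
    h = np.tanh(b @ W1)                                      # MLP f(·)
    return h @ W2

a = address_embedding(0xDEADBEEF)
print(a.shape)  # (20,)
```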
Sequence Modelling for Prefetching Candidates:
The sequence of the obtained "miss" embeddings is passed through a Long Short-Term Memory (LSTM) network to get a probability-wise prediction of the expected (subsequent) cache misses to be prefetched.
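The distinctive part of this prediction (detailed in the supplementary) is that each byte of the next miss address gets its own softmax head on top of the LSTM's final hidden state. A minimal NumPy sketch of these output heads and their cross-entropy loss, with randomly initialized placeholder weights and assumed sizes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
HIDDEN = 40                                           # LSTM hidden size (assumed)
G = [rng.normal(size=(HIDDEN, 256)) for _ in range(4)]  # dense heads g_1..g_4

def predict_bytes(h_k):
    """Per-byte 256-way distributions over the next miss address."""
    return [softmax(h_k @ G[j]) for j in range(4)]

def prefetch_ce_loss(h_k, target_bytes):
    """Summed cross-entropy over the 4 byte heads for one target address."""
    probs = predict_bytes(h_k)
    return -sum(np.log(probs[j][target_bytes[j]] + 1e-12) for j in range(4))

h_k = rng.normal(size=HIDDEN)                         # stand-in for the LSTM state
loss = prefetch_ce_loss(h_k, target_bytes=[0xDE, 0xAD, 0xBE, 0xEF])
```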
Sequence Distribution Estimation:
We deal with the problem of non-stationarity of the data to be cached by explicitly modelling the current distribution of the sequence using a non-parametric method called Kernel Density Estimation, and feed the resultant distribution vector into the pipeline.
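One simple way to realize this idea is to fit a Gaussian KDE over the last k embeddings and evaluate the density back at those same points, giving a fixed-length distribution vector. This is a minimal sketch under assumed bandwidth and sizes, not the paper's exact implementation:

```python
import numpy as np

def kde_distribution_vector(E: np.ndarray, bandwidth: float = 0.5) -> np.ndarray:
    """Gaussian KDE over the rows of E (shape (k, d)); returns the estimated
    density at each of the k embeddings, used as a distribution vector."""
    k, d = E.shape
    # pairwise squared distances between all embeddings
    sq = ((E[:, None, :] - E[None, :, :]) ** 2).sum(-1)
    kernel = np.exp(-sq / (2 * bandwidth ** 2))
    norm = (2 * np.pi * bandwidth ** 2) ** (d / 2)
    return kernel.sum(axis=1) / (k * norm)

rng = np.random.default_rng(2)
E = rng.normal(size=(30, 8))          # last k=30 embeddings of dim 8 (assumed)
d_vec = kde_distribution_vector(E)
```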
Multitasking Frequency and Reuse Distance Prediction with Prefetching:
Unlike previous works, we model future frequency and reuse distance (timesteps till next occurrence) by applying a learnable decoder to the embedding of the address in question, along with the current estimate of the distribution, in a multi-task fashion with prefetching prediction.

Figure 1: Schematic diagram of our approach. We feed specialised embeddings extracted from the input address sequence into our DEAP Cache model to make admission, prefetching and eviction decisions.

Testing Mode (Online)
Admission Policy:
We use the decoder mentioned in the previous section to predict an estimate of the future recency/frequency of the address and then use a threshold to decide whether to admit the address or not.
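The thresholded decision described above (spelled out in the supplementary) can be sketched in a few lines; the default thresholds below are the optimal choices reported in Table 2, and the function names are illustrative:

```python
def admit(f_t: float, r_t: float, alpha: float = 3000.0, beta: float = 7000.0) -> int:
    """Admission decision y_t: admit (1) if the predicted future frequency
    f_t exceeds alpha, or the predicted reuse distance r_t is below beta;
    otherwise do not admit (0)."""
    return 1 if (f_t > alpha or r_t < beta) else 0
```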
Prefetching Policy:
We maintain an online buffer of the past k misses and, every T timesteps, pass samples from it to the LSTM model to get candidates for prefetching.

Eviction Policy:
We modify the LeCaR approach of [5] and use the concept of regret minimization to learn the optimal probability distribution between two eviction policies, one based on future recency and the other on future frequency. Note that [5] instead used LRU and LFU, which model the future based on past metrics. We refer the reader to the supplementary for a detailed description of our approach.
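The regret-minimization machinery can be sketched as a two-expert multiplicative-weights update, in the spirit of LeCaR: each expert (future-frequency eviction, future-recency eviction) holds a weight that is decayed whenever a miss is attributed to that expert's earlier eviction. The learning rate and class shape here are assumptions for illustration, not the paper's exact update:

```python
import numpy as np

class RegretMinimizer:
    """Learns a probability distribution over two eviction experts:
    index 0 = evict lowest predicted future frequency,
    index 1 = evict highest predicted future reuse distance."""

    def __init__(self, lr: float = 0.45):
        self.w = np.array([0.5, 0.5])
        self.lr = lr

    def policy_distribution(self) -> np.ndarray:
        return self.w / self.w.sum()

    def penalize(self, expert: int) -> None:
        """Called when a cache miss is blamed on `expert`'s past eviction:
        multiplicatively decay that expert's weight, then renormalize."""
        self.w[expert] *= np.exp(-self.lr)
        self.w = self.w / self.w.sum()

rm = RegretMinimizer()
rm.penalize(0)                   # the frequency expert caused a regretted miss
p = rm.policy_distribution()     # probability mass shifts toward the recency expert
```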
Experiments and Results
To test the validity of our approach, we considered five baseline approaches for evaluation: LRU, LFU, FIFO, LIFO, and Oracle. We used a freely available public dataset due to financial constraints as students. As can be seen in Table 1, our approach surpasses the Mean Hit Rate obtained by all previous classical approaches and comes closest in performance to the optimal figure obtained from Belady's algorithm (Oracle) [1], thus demonstrating the validity of our approach. We open-sourced the code and provide a detailed account for reproducibility in the supplementary.

Conclusion & Future Work
In this work, we proposed an end-to-end pipeline for learning all three components of caching strategies using machine learning and demonstrated the superiority of our approach over classical baselines. Improving our approach's practical deployment and evaluating on large-scale real-time benchmarks is an interesting future direction. We derived our dataset from the dataset found here. The codebase and dataset used can be found here.
Method | Mean Hit Rate
LRU | 0.42
LFU | 0.43
FIFO | 0.36
LIFO | 0.03
BELADY (Oracle) | 0.54
Ours | 0.48

Table 1: Different approaches and their mean hit rates
References

[1] L. A. Belady. “A study of replacement algorithms for a virtual-storage computer”. In: IBM Systems Journal 5.2 (1966), pp. 78–101.
[2] Milad Hashemi et al. “Learning Memory Access Patterns”. In: Proceedings of the 35th International Conference on Machine Learning. International Machine Learning Society (IMLS), 2018, pp. 3062–3076. ISBN: 9781510867963. arXiv: 1803.02329.
[3] Evan Zheran Liu et al. “An Imitation Learning Approach for Cache Replacement”. In: arXiv preprint arXiv:2006.16239 (2020).
[4] Tomas Mikolov et al. “Distributed representations of words and phrases and their compositionality”. In: Advances in Neural Information Processing Systems. 2013, pp. 3111–3119.
[5] Giuseppe Vietri et al. “Driving cache replacement with ML-based LeCaR”. In: USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 18). USENIX Association, 2018.

Supplementary: Implementation Details
We will now describe the implementation of our approach in detail. We first describe the training phase, and then the online testing simulation.
Training
The training phase happens in an offline fashion, where we sample sequences from our dataset and carry out simulations to train our model. It is mainly concerned with:
• Pretraining of specialised byte embeddings for the domain using a word2vec-based approach.
• Training of an LSTM model for predicting future miss addresses for prefetching based on the sequence of past misses.
• Estimating the current distribution of the address sequence in a quick online fashion.
• Training a decoder to predict the future reuse distance (timesteps till next occurrence) and frequency estimates in a multi-task fashion with prefetching candidate prediction.
Pretrained Byte Embeddings using Word2Vec
Following [7], to deal with the exponential size of the vocabulary of all possible unique addresses (for example, 2^32 possible unique addresses in a 32-bit address system), we model the addresses and PCs using byte-level embeddings. Going one step further, we take inspiration from recent advances in pretraining word embeddings and use a variant of word2vec [8] to get pretrained byte-level embeddings using a large corpus of addresses and PCs generated via multiple program simulations. While using these byte-level embeddings in our model, we apply a learnable multi-layer perceptron (MLP) on top of the concatenated byte embeddings, which combines the byte-level embeddings to learn an "address level" embedding in a manner that is specially optimized for the downstream tasks. For example, for a 4-byte address A, represented in bytes as (B_1, B_2, B_3, B_4), we get byte embeddings b_j and an address-level embedding a as:

b_j = W_j^T B_j,  j ∈ {1, · · · , 4}
a = f(b_1 ⊕ b_2 ⊕ b_3 ⊕ b_4)

where W_j are the corresponding embedding matrices trained using word2vec, ⊕ represents the concatenation of the byte embeddings, and f(·) represents the MLP used to convert the byte-level embeddings to an address-level embedding.

Sequence Modelling for Prefetching Candidates
Following [3], we model the problem of prefetching as a sequence prediction task using the past history of cache miss addresses A^m and their corresponding program counters P^m. Given an input sequence of miss addresses [A^m_1, A^m_2, . . . , A^m_k] and corresponding PC addresses [P^m_1, P^m_2, . . . , P^m_k], the aim is to predict a set of the n most probable prefetching candidate addresses [C_1, C_2, . . . , C_n]. We generate the miss address embeddings [a_1, a_2, . . . , a_k] and PC embeddings [p_1, p_2, . . . , p_k] from the input sequence using the method described in the previous section. We then concatenate the missed addresses' embeddings and missed PCs' embeddings to get the input embeddings [e_1, e_2, . . . , e_k]:

e_i = a_i ⊕ p_i

The input embeddings are then fed into a Long Short-Term Memory (LSTM) model [5], which captures both the short- and long-term dependencies across the sequence to better predict future miss addresses:

h_i = LSTM(e_i, h_{i−1})

The hidden state from the last step h_k is then fed into dense layers g_j (j ∈ {1, · · · , 4}) with a softmax applied on top, to get the probability distribution over the future miss address predictions. Note that to deal with the exponential size of the addresses, we predict each byte b̂_j separately:

b̂_j = softmax(g_j(h_k)),  j ∈ {1, · · · , 4}

where g_j is the dense layer used to predict byte b_j. We use the cross-entropy loss to train our prefetching candidate prediction model, which is given by:

L_prefetching = −(1/n) Σ_{i=1}^{n} Σ_{j=1}^{4} Σ_{c=0}^{255} b_{i,j,c} log(b̂_{i,j,c})

Sequence Distribution Estimation
The input address sequence distribution is bound to be non-stationary in the real-world online setting where caches are usually deployed. To deal with this, we propose explicitly modelling the current distribution of the sequence, providing an inductive bias that helps our model deal with the problem of a moving address distribution. To make the approach quick and online, we refrain from using deep learning methods to estimate the distribution and instead use a classical non-parametric method of probability density estimation, i.e.,
Multivariate Gaussian Kernel Density Estimation [6]. Given an input sequence of embeddings [e_1, e_2, . . . , e_k], we obtain the distribution vector d_i corresponding to the sequence by applying Kernel Density Estimation (KDE) on it:

d_i = KDE(e_1, e_2, . . . , e_k)

Multitasking Frequency and Reuse Distance Prediction Decoder with Prefetching Candidate Prediction
Following recent advances in multi-task learning [2] indicating that positive transfer from various related tasks improves performance, we apply multi-task learning to the problem of predicting an estimate of the "future" frequency and reuse distance of an address in question along with the task of prefetching candidate prediction. Reuse distance is defined as the number of timesteps until the next occurrence of the address, and represents the future "recency" of the address. A similar approach was followed in [7], but it was limited to predicting only the reuse distance. We extend their approach to predicting both an estimate of the future frequency and the reuse distance in a multi-task fashion along with the prefetching predictions. Note that the distribution vector representing an estimate of the current distribution plays a vital role in this prediction, and thus we concatenate the distribution vector d_i with the address embedding a_i and then feed it to an MLP-based decoder with multiple heads. This gives us an estimate of the future frequency f_i and reuse distance r_i:

z_i = a_i ⊕ d_i
f_i = F(z_i)
r_i = R(z_i)

where F(·) and R(·) are the frequency and reuse distance prediction MLP heads, respectively. To train this decoder in an end-to-end fashion along with our prefetching module, we would need to use the addresses predicted by that module. However, this would be problematic, since the prefetching module outputs a probability distribution instead of a particular address, and taking an argmax() to get the most probable prediction would introduce non-differentiability (which is undesirable for an end-to-end pipeline that relies on backpropagating through the prefetching model as well). To bypass the non-differentiability of argmax, we instead use a temperature-based approach to convert the soft probability into an approximate argmax, which is then utilised to calculate the address embeddings during training.
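The temperature trick can be illustrated in a few lines: dividing the logits by a small temperature before the softmax pushes the distribution toward a one-hot argmax while keeping every step differentiable. The temperature values here are illustrative (Table 2 reports 1e-3 as the training temperature):

```python
import numpy as np

def softmax(x, temperature: float = 1.0):
    """Numerically stable softmax with a temperature parameter."""
    z = x / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([1.0, 3.0, 2.0])
soft = softmax(logits, temperature=1.0)    # smooth distribution
hard = softmax(logits, temperature=1e-3)   # ≈ one-hot at the argmax (index 1)
```

The near-one-hot `hard` vector can then be used as differentiable weights over the candidate embeddings, so gradients still flow into the prefetching module.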
However, note that this is not required during testing, since we already have access to the eviction/admission address candidates for which we need to make frequency-recency estimates. We use the mean squared error (MSE) loss between the predicted frequency/reuse distance and the ground truth to train the decoder, which is given by:

L_frequency = (1/n) Σ_{i=1}^{n} (f̂_i − f_i)^2
L_recency = (1/n) Σ_{i=1}^{n} (r̂_i − r_i)^2

where f̂_i, f_i and r̂_i, r_i are the predicted and ground-truth frequency and reuse distance, respectively. The total loss for training is given by a weighted average of the three losses:

L_total = w_1 L_prefetching + w_2 L_frequency + w_3 L_recency

Testing Simulation Pipeline
Consider an input sequence of addresses [A_1, A_2, . . . , A_T] and corresponding PC addresses [P_1, P_2, . . . , P_T]. We need a caching strategy to cache them efficiently in a cache C of size s, so as to maximise the number of hits (hit rate). Any caching strategy consists of three main components:
• Admission Policy - Deciding whether to cache an address after it causes a miss.
• Prefetching Policy - Predicting the addresses which will lead to misses in the future and "prefetching" them.
• Eviction Policy - Deciding which address to evict to make space for new addresses to be admitted into the cache.
We now describe all three components of our strategy:
Admission Policy
Given an input PC (P_t) and address (A_t), we first embed them into an embedding e_t as described in this section. We then predict the estimated frequency f_t and reuse distance r_t using the decoder trained as described in this section. We admit only those addresses which have their frequency above a threshold α (to admit more frequent addresses) or their reuse distance below a threshold β (to admit addresses which will occur soon because of a shorter reuse distance). Thus the admission decision y_t is given by:

y_t = 1 if f_t > α or r_t < β, and y_t = 0 otherwise.

Prefetching Policy
We maintain an online buffer consisting of the past k miss addresses and their corresponding PC addresses. During the simulation, after every T timesteps, we sample both of them from the "miss buffer" and pass them into the LSTM model as described in this section. The model outputs a probability distribution over the candidates to be prefetched. We then prefetch the n most probable candidates by appropriately sampling from this distribution.

Online Learning of Eviction using Modified LeCaR
As mentioned in [9], there are two fundamental and orthogonal characteristics of elements for caching, namely frequency and recency. Correspondingly, there are two fundamental eviction strategies: LRU (Least Recently Used) and LFU (Least Frequently Used). Hence [9] proposed an online method of Learning Cache Replacement (LeCaR) using the concept of regret minimisation [10] to learn the optimal probability distribution between the two orthogonal policies of LRU and LFU. However, both these eviction strategies, as used in LeCaR, use the past frequency/recency as an estimate of the corresponding future values, which is based on the assumption that the past is a good indicator of the future. We feel this assumption can prove to be questionable. We propose to improve this model by explicitly predicting the future frequency and recency (via reuse distance) using the decoders F(·) and R(·) as described in this section. We then apply regret minimization to choose the optimal probability distribution between the policies of evicting the address with the least predicted future frequency and the address with the highest predicted future reuse distance. We claim that using the predicted future values from our DL model serves as a better estimate of the future than the past. We refer the reader to the original paper [9] for more details about the original approach.

Reproducibility Index
In this section, we provide a detailed description of our experimentation settings, the baselines used, and dataset details to improve our method's reproducibility. As students, we have limited access to computational resources and commercial hardware benchmarks, and hope that making our approach reproducible will allow evaluating and improving our approach's practical deployment. We have open-sourced our codebase and the dataset used at (https://github.com/vlgiitr/deep cache replacement).

Hyperparameter | Search Space | Optimal Choice
Number of Epochs | uniform-integer[1, 20] | 20
Training Batch Size | choice[32, 64, 128, 256, 512] | 256
Optimizer | choice[Adam, SGD] | Adam
Learning Rate | loguniform-float[1e-5, 1e-1] | 1e-3
Training Temperature | choice[1e-3, 1e-2] | 1e-3
LSTM Hidden Cell Size | uniform-integer[20, 40] | 40
Decoder Hidden Size | 10 | 10
Prefetching Input Sequence Length | choice[20, 30] | 30
Address Embedding Size | uniform-integer[5, 25] | 20
Weight for Cross Entropy Loss | 0.33 | 0.33
Weight for Frequency MSE-Loss | 0.33 | 0.33
Weight for Reuse Distance MSE-Loss | 0.33 | 0.33
Word2Vec Number of Epochs | uniform-integer[20, 500] | 120
Word2Vec Learning Rate | loguniform-float[1e-5, 1e-2] | 3e-3
Word2Vec Weight Decay | loguniform-float[1e-6, 10] | 1e-3
Word2Vec Optimizer | choice[Adam, SGD] | Adam
Word2Vec Encoder Hidden Layer Size | uniform-integer[50, 200] | 128
Word2Vec Byte Embedding Dimension | uniform-integer[5, 25] | 20
Word2Vec Context Size | uniform-integer[2, 10] | 4
Admission Frequency Threshold (α) | choice[50, 300, 500, 1000, 3000] | 3000
Admission Reuse Distance Threshold (β) | choice[500, 3000, 5000, 7000, 8000] | 7000
Miss Buffer Size | choice[30, 50, 70, 100] | 50
Test Simulation Prefetching Interval | choice[10, 20, 30, 50] | 30
Cache Size | choice[32, 64] | 32
Test Simulation Batch Size | choice[5000, 10000] | 10000

Table 2: Hyperparameter search space for our model

Baselines
We now describe the baselines against which our model was compared:
• Least Recently Used (LRU): A fundamental technique based on recency, in which we evict the candidate that has been least recently used.
• Least Frequently Used (LFU): A fundamental technique based on frequency, in which we evict the candidate that has been least frequently used.
• First In First Out (FIFO): A heuristic-based technique in which we evict the candidate which entered the cache the earliest.
• Last In First Out (LIFO): A heuristic-based technique in which we evict the candidate which entered the cache most recently.
• BELADY (Oracle): A theoretically optimal oracle [1] which has access to the entire future sequence and takes an optimal decision based on it.
Hyperparameters
We provide a detailed description of the hyperparameter search space used to train our approach in Table 2 to enhance reproducibility. We describe the following hyperparameters:
• Number of Epochs: The number of epochs for which the training was carried out.
• Training Batch Size: The batch size for the training.
• Optimizer: The optimizer used to improve the model through backpropagation.
• Learning Rate: The learning rate used to train the model.
• Training Temperature: The temperature used to perform argmax in a differentiable manner.
• LSTM Hidden Cell Size: The dimension of the hidden state of the LSTM.
• Decoder Hidden Size: The dimension of the hidden layer of the decoder used to get the frequency and reuse distance of an address.
• Prefetching Input Sequence Length: The number of past miss addresses and PC addresses in the sequence which is fed into the LSTM model as input.
• Address Embedding Size: The dimension of the encoded address embeddings produced by the MLP on top of the byte embeddings.
• Weight for Cross Entropy Loss: The relative weight of the cross-entropy loss for prefetching candidate prediction in the total loss.
• Weight for Frequency MSE-Loss: The relative weight of the frequency MSE-loss in the total loss.
• Weight for Reuse Distance MSE-Loss: The relative weight of the reuse distance MSE-loss in the total loss.
• Word2Vec Number of Epochs: The number of epochs for which the word2vec-based pretraining of the byte embeddings was carried out.
• Word2Vec Learning Rate: The learning rate used in word2vec.
• Word2Vec Weight Decay: The weight decay used to provide regularization during word2vec training.
• Word2Vec Optimizer: The optimizer used in word2vec.
• Word2Vec Encoder Hidden Layer Size: The dimension of the hidden layer of the encoder used to get the byte embeddings.
• Word2Vec Byte Embedding Dimension: The dimension of the byte embeddings learned by word2vec.
• Word2Vec Context Size: The number of surrounding bytes used as input while training with word2vec.
• Admission Frequency Threshold (α): The frequency threshold value above which admission occurs.
• Admission Reuse Distance Threshold (β): The reuse distance threshold value below which admission occurs.
• Miss Buffer Size: The number of past address misses and program counters used as input during prefetching.
• Test Simulation Prefetching Interval: The number of timesteps after which we run one instance of prefetching.
• Cache Size: The size of the cache used for the test simulation.
• Test Simulation Batch Size: The batch size used for carrying out the test simulation.
Dataset
The standard benchmark dataset for training and evaluation of cache simulations is the SPEC CPU2000 benchmark [4], which costs $2000 and hence was out of the financial scope of the student authors. Instead, we used a proxy dataset available at (https://cseweb.ucsd.edu/classes/sp14/cse240A-a/project1.html) as a reference for preparing our datasets. We created five files with sequence lengths ranging from 180000 to 500000 using the preceding URL. Testing the validity of our approach on the SPEC CPU2000 benchmark is an interesting future direction.
References

[1] L. A. Belady. “A study of replacement algorithms for a virtual-storage computer”. In: IBM Systems Journal 5.2 (1966), pp. 78–101.
[2] In: Machine Learning.
[3] Milad Hashemi et al. “Learning Memory Access Patterns”. In: Proceedings of the 35th International Conference on Machine Learning. International Machine Learning Society (IMLS), 2018, pp. 3062–3076. ISBN: 9781510867963. arXiv: 1803.02329.
[4] John L. Henning. “SPEC CPU2000: Measuring CPU performance in the new millennium”. In: Computer (2000).
[5] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory”. In: Neural Computation 9.8 (1997), pp. 1735–1780.
[6] In: Pattern Recognition.
[7] Evan Zheran Liu et al. “An Imitation Learning Approach for Cache Replacement”. In: ICML 2020 (2020).
[8] Tomas Mikolov et al. “Distributed representations of words and phrases and their compositionality”. In: Advances in Neural Information Processing Systems. 2013, pp. 3111–3119.
[9] Giuseppe Vietri et al. “Driving cache replacement with ML-based LeCaR”. In: USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 18). 2018.
[10] Martin Zinkevich et al. “Regret minimization in games with incomplete information”. In: Advances in Neural Information Processing Systems. 2007.