Optimizing Video Caching at the Edge: A Hybrid Multi-Point Process Approach
Xianzhi Zhang, Yipeng Zhou, Member, IEEE, Di Wu, Senior Member, IEEE, Miao Hu, Member, IEEE, James Xi Zheng, Min Chen, Fellow, IEEE, and Song Guo, Fellow, IEEE
Abstract—It is always a challenging problem to deliver a huge volume of videos over the Internet. To meet the high bandwidth and stringent playback demand, one feasible solution is to cache video contents on edge servers based on predicted video popularity. Traditional caching algorithms (e.g., LRU, LFU) are too simple to capture the dynamics of video popularity, especially for long-tailed videos. Recent learning-driven caching algorithms (e.g., DeepCache) show promising performance; however, such black-box approaches lack explainability and interpretability. Moreover, their parameter tuning requires a large number of historical records, which are difficult to obtain for videos with low popularity. In this paper, we optimize video caching at the edge using a white-box approach, which is highly efficient and completely explainable. To accurately capture the evolution of video popularity, we develop a mathematical model called the HRS model, which is a combination of multiple point processes, including Hawkes' self-exciting, reactive and self-correcting processes. The key advantages of the HRS model are its explainability and its much smaller number of model parameters. In addition, all its model parameters can be learned automatically by maximizing the log-likelihood function constructed from past video request events. Next, we further design an online HRS-based video caching algorithm. To verify its effectiveness, we conduct a series of experiments using real video traces collected from Tencent Video, one of the largest online video providers in China. Experiment results demonstrate that our proposed algorithm outperforms state-of-the-art algorithms, with a 12.3% improvement on average in terms of cache hit rate under realistic settings.
Index Terms—video caching, edge servers, point process, Monte Carlo, gradient descent.
1 INTRODUCTION

Due to the fast growth of the online video market, online video streaming services have dominated Internet traffic. Cisco [1] forecasted that video streaming applications will grow from 59% of Internet traffic in 2017 to 79% in 2022. On one hand, online video providers need to stream HD (high definition) videos with stringent playback requirements. On the other hand, both the video population and the user population are growing rapidly. Thereby, edge devices have been pervasively exploited by online video providers to cache videos so as to reduce Internet traffic and improve the user Quality of Experience (QoE) [2], [3], [4], [5], [6], [7].

---
This work was supported by the National Natural Science Foundation of China under Grants U1911201 and U2001209, ARC DE180100950, and the project PCL Future Greater-Bay Area Network Facilities for Large-scale Experiments and Applications (LZC0019).
• Xianzhi Zhang, Di Wu and Miao Hu are with the Department of Computer Science, Sun Yat-sen University, Guangzhou, 510006, China, and Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou, 510006, China. D. Wu is also with Peng Cheng Laboratory, Shenzhen 518000, China. (E-mail: [email protected]; {wudi27, humiao5}@mail.sysu.edu.cn.)
• Yipeng Zhou and James Xi Zheng are with the Department of Computing, FSE, Macquarie University, Australia, 2122. Y. Zhou is also with Peng Cheng Laboratory, Shenzhen 518000, China. (E-mail: [email protected]; [email protected])
• Min Chen is with the School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China (E-mail: [email protected]).
• Song Guo is with the Department of Computing, The Polytechnic University of Hong Kong, Hong Kong (E-mail: [email protected]).
---

We consider the video caching problem on edge servers that provide video streaming services for users in a certain area (e.g., a city).
If a video is cached, edge servers can stream the video to users directly with a shorter response time. Nevertheless, requests for videos that are missed by edge servers have to be directed to remote servers, resulting in lower QoE. From video providers' perspective, the target is to maximize the cache hit rate of user video requests by properly caching videos on edge servers. Intuitively, the videos most likely to be requested in the future should be cached on edge devices. It is common to leverage historical video requests to construct statistical models in order to predict videos' future request rates.

Briefly, there are two approaches to predict video popularity so as to make video caching decisions. The first approach is to predict video popularity based on empirical formulas. Classical LRU (Least Recently Used), LFU (Least Frequently Used) and their variants [8], [9], [10] are such video caching algorithms. However, this approach is subject to the difficulty of choosing parameter values. For example, it is not easy to choose an appropriate time window size in LRU and LFU algorithms [8], [11]. The second approach is learning-driven video caching. Typically, NN (neural network) models such as LSTM (Long Short-Term Memory) [12], [13] can be leveraged to predict video popularity so as to make the right video caching decisions. The strength of this approach is that model parameters can be automatically determined by learning historical request patterns, and thus such algorithms can achieve better video caching performance than classical algorithms [13]. Yet, such models require a long training time, and lack explainability and interpretability.

In this paper, we aim at optimizing the performance of video caching on edge servers using a white-box approach. To this purpose, we develop a mathematical model called the HRS model to capture the evolution process of video popularity. The HRS model is a combination of multiple point processes, including the Hawkes process [14], the reactive process [15] and the self-correcting process [16]. The point processes enable us to exploit the timestamp information of historical video requests and time-varying kernels to predict future events.

Specifically, the intuition behind the HRS model is as follows: with the Hawkes process, we can model the positive impact of the occurrence of a past event, and link future video request rates with past video request events; with the reactive process, we can take the influence of negative events (e.g., the removal of a video from the recommendation list) into account; with the self-correcting process, we can restrict the growth of video popularity. Compared to NN (neural network)-based models, our proposed HRS model is completely explainable and interpretable. Moreover, the number of parameters of the HRS model is much smaller than that of NN-based models.

In summary, our main contributions in this paper are summarized as below:

• We develop a hybrid multi-point process model called HRS to accurately predict the evolution of video popularity (i.e., video request rate). Different from NN-based models, our HRS model is completely explainable and interpretable. The HRS model can link future video request rates with both past video request events and negative events, and take the characteristics of edge servers into account.

• We propose an online video caching algorithm for edge servers based on the HRS model. Due to its much smaller number of model parameters, the algorithm has very low computation complexity. All parameters of the HRS model can be determined automatically by maximizing the log-likelihood function. Thus, the algorithm can be executed frequently to update cached videos timely according to the dynamics of video popularity.

• We conduct extensive real trace-driven experiments to validate the effectiveness of our proposed algorithm. The video request traces are collected from Tencent Video, one of the largest online video providers in China. The experimental results show that the HRS-based online caching algorithm achieves the highest cache hit rate, with a 12.3% improvement over the best baseline algorithm. In particular, the improvement is over 24% when the caching capacity is very limited. In addition, the execution time of our algorithm is much lower than that of NN-based caching algorithms.

The rest of the paper is organized as follows. We first provide an introduction of preliminary knowledge in the next section, followed by the description of the HRS model in Sec. 3. The new online caching algorithm is proposed in Sec. 4. The experimental results are presented in Sec. 5, while the related works in this area are discussed in Sec. 6 before we finally conclude our paper.
Fig. 1. The system architecture of the video caching system with multiple edge servers.
2 PRELIMINARY
We consider a video caching system with multiple edge servers. The system architecture is illustrated in Fig. 1. Online video providers (OVP) store a complete set of videos with population $C$. A number of edge servers are deployed in the proximity of end users. Each edge server exclusively covers users in a certain area. Since edge servers are closer to end users, serving users with edge servers can not only reduce Internet traffic but also improve the user QoE [6]. The problem is to predict the future video request rates on each edge server so that the most popular videos can be cached in time by the edge server.

Our objective is to maximize the cache hit rate on each edge server, which is defined as the number of requests served by the edge server divided by the total number of requests. This objective can be transformed into maximizing the prediction accuracy of future video request rates.

Without loss of generality, we consider the video caching problem for a particular edge server, which can store $S$ videos with $S < C$. To simplify our analysis, we assume that all videos are of the same size. In real systems, videos with different sizes can be split into blocks of the same size. This simplification can significantly reduce the implementation cost in real video caching systems [4], [17]. With this simplification, it is apparent that the top $S$ most frequently requested videos should be cached on each edge server, and our main task is to predict which $S$ videos will be most popular in the future. For convenience, the major notations used in this paper are summarized in Table 1.

We first define the event sets as follows. All past video requests are denoted by an event set $E = \{e_1, e_2, \ldots, e_K\}$. Let $\varepsilon = \{\tau_1, \tau_2, \ldots, \tau_K\}$ denote the occurrence times of all past events, with $\tau_1 < \tau_2 < \cdots < \tau_K$. In other words, events are recorded according to their occurrence time points. Each event in the set $E$ is a tuple, i.e., $e = \langle \tau, i \rangle$, where $\tau$ is the request time and $i$ is the video index. Besides, we define the set $\varepsilon_i^t$ as the timestamp set of historical events of video $i$ before time $t$, and $\varepsilon^t = \cup_{\forall i}\, \varepsilon_i^t$.
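The caching objective above — cache the $S$ videos with the highest predicted request rate and measure the fraction of requests they serve — can be sketched in a few lines. This is our own minimal illustration; the function names and the dictionary-based request format are not from the paper.

```python
def select_cache(predicted_rate, S):
    """Return the S video ids with the highest predicted request rate.

    predicted_rate: dict mapping video id -> estimated intensity (illustrative).
    """
    ranked = sorted(predicted_rate, key=predicted_rate.get, reverse=True)
    return set(ranked[:S])


def hit_rate(requests, cached):
    """Fraction of requests served from the edge cache."""
    hits = sum(1 for video in requests if video in cached)
    return hits / len(requests) if requests else 0.0
```

Under this sketch, maximizing the hit rate reduces to ranking videos by predicted rate, which is why prediction accuracy is the quantity to optimize.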
TABLE 1
Notations used in the paper

$C$ — The total number of videos stored in the OVP.
$S$ — The total number of videos cached on an edge server.
$K$ — The total number of request records.
$E$ — The event set of all request records.
$\varepsilon^t$ / $\zeta^t$ — The timestamp set of all request records / negative events before time $t$.
$\varepsilon_i^t$ / $\zeta_i^t$ — The timestamp set of request records / negative events of video $i$ before time $t$.
$\tau$ / $\tau'$ — The timestamp of any request record / negative event.
$\lambda_i(t)$ — The conditional intensity function of video $i$ at time $t$.
$\tilde\lambda_i(t)$ — The estimation of $\lambda_i(t)$, defined by Eq. (7) for HRS.
$\hat\lambda_i(t)$ — The positive form of $\tilde\lambda_i(t)$ adjusted by $g(x) = s\log(1+\exp(x/s))$.
$\beta_i$ — The bias of the intensity function for HRS associated with video $i$.
$\omega_i$ / $\alpha_i$ / $\gamma_i$ — The parameter of the SE / SC / SR term in HRS associated with video $i$, respectively.
$k_1(t-\tau)$ / $k_2(t-\tau')$ — The exponential kernel function of the SE / SR term reflecting the influence of a past event, defined by Eq. (8).
$\delta_1$ / $\delta_2$ — The decay parameter of $k_1(t-\tau)$ / $k_2(t-\tau')$, determined through cross validation in experiments.
$T$ — The entire observation period for the computation of the likelihood function.
$\theta$ — The parameter matrix of a point process; $\theta = [\beta^\intercal, \omega^\intercal, \alpha^\intercal, \gamma^\intercal]$ in the HRS model.
$\theta_i^{(j)}$ — The substitute for any parameter associated with video $i$ after $j$ iterations of the gradient descent algorithm, such as $\beta_i^{(j)}$.
$ll(\theta)$ — The log-likelihood function of a point process, defined by Eq. (11) and Eq. (12) for HRS.
$\bar{ll}(\theta)$ — The evaluated log-likelihood function of HRS, estimated by the Monte Carlo method and defined by Eq. (15).
$\rho_\beta$ / $\rho_\omega$ / $\rho_\alpha$ / $\rho_\gamma$ — The regularization parameter of $\beta_i$ / $\omega_i$ / $\alpha_i$ / $\gamma_i$, determined through cross validation in experiments.
$M$ — The number of samples in the Monte Carlo method.
$t^{(m)}$ — The timestamp of the $m$-th sample in the Monte Carlo method.
$\Phi_i(t)$ / $\Psi_i(t)$ / $\Gamma_i(t)$ — The kernel functions of HRS defined by Eq. (24).
$\Delta t$ — The time interval of online kernel function updates.
$\Delta T$ — The time interval of online parameter updates.
$\Delta M$ — The number of samples for online parameter updates.
$k_{th}$ — The threshold to truncate the kernel sums in online parameter updates.

A point process is a family of models built from individual events to capture the temporal dynamics of event sequences via the conditional intensity function [18]. We employ point process models to predict the future request rate given the historical request events of a particular video. In general, the predicted request rate of video $i$ can be defined as $\lambda_i(t \mid \varepsilon_i^t)$, where $\varepsilon_i^t$ is the timestamp set of historical request records of video $i$ before time $t$. The conditional intensity function represents the expected instantaneous request rate of video $i$ at time $t$. In the rest of the paper, we use $\lambda_i(t)$ to represent $\lambda_i(t \mid \varepsilon_i^t)$.

We first introduce three typical point process models before we specify the expression of $\lambda_i(t)$.

The Hawkes process (HP) [14] is also called the self-exciting process. For HP, the occurrence of a past event positively affects the arrival rate of the same type of events in the future. Given the timestamp set of historical events $\varepsilon_i^t$, the arrival rate of such events at time $t$ can be predicted according to the following formula:

$$\lambda_i(t) = \beta_i + \sum_{\forall \tau \in \varepsilon_i^t} \phi_i(t-\tau). \qquad (1)$$

Here, $\beta_i$ is the deterministic base rate and $\phi_i(t-\tau)$ is the kernel function reflecting the influence of the past event at time $\tau$ on the arrival rate of the same type at time $t$. Moreover, $\phi_i(t-\tau)$ should be a monotonically decreasing function of $t-\tau$, in that more recent events have a higher positive influence on the future request rate.

The reactive process [15] is an extension of HP that links the future event with more than one type of events.
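The Hawkes intensity in Eq. (1) can be sketched concretely by choosing an exponential kernel $\phi_i(t-\tau) = \exp[-\delta(t-\tau)]$. This is a toy illustration of the formula only; the function name and parameters are ours, not the paper's.

```python
import math


def hawkes_intensity(t, history, beta, delta):
    """Eq. (1) with an exponential kernel: beta + sum of exp(-delta*(t - tau))
    over all past event times tau < t. Recent events contribute more."""
    return beta + sum(math.exp(-delta * (t - tau)) for tau in history if tau < t)
```

Each past request adds a bump that decays with $t-\tau$, so a burst of recent requests raises the predicted rate, matching the self-exciting intuition.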
HP only considers the influence of positive events, while the reactive process models both the exciting and restraining effects of historical events on the instantaneous intensity. The future rate can be given by:

$$\lambda_i(t) = \beta_i + \sum_{\forall \tau \in \varepsilon_i^t} \phi_i^{exc}(t-\tau) - \sum_{\forall \tau' \in \zeta_i^t} \phi_i^{res}(t-\tau'). \qquad (2)$$

Here $\varepsilon_i^t$ denotes the same timestamp set of positive events as that in HP, while $\zeta_i^t$ represents the timestamp set of negative events restraining the intensity function of video $i$ before time $t$. $\phi_i^{exc}$ and $\phi_i^{res}$ are kernel functions reflecting the influence of positive and negative events, respectively.

Compared to the Hawkes process and the reactive process, the intensity function of the self-correcting process is more stable over time [16]. Once an event occurs, the intensity function is reduced by a factor $e^{-\alpha_i}$. Here $\alpha_i$ is a parameter representing the ratio of correction. Mathematically, the intensity function can be given by

$$\lambda_i(t) = \exp(\mu_i t - \alpha_i N_i(t)), \qquad (3)$$

where $\mu_i$ is the rate of the steady increase of the intensity function and $N_i(t)$ is the number of historical requests of video $i$ until time $t$. Generally, the time series of events based on the self-correcting process are more uniform than those based on other processes such as HP.

To apply the point process models to predict video request rates, we need to determine the parameters (denoted by $\theta$) defined in each process, such as $\mu$ and $\alpha$ in the self-correcting process. An effective approach to determine these parameters is to maximize the likelihood of the occurrence of a target event set in the entire observation time $(0, T]$, i.e., $E^T$.

Given the overall intensity function $\lambda(t) = \sum_{\forall i} \lambda_i(t)$ and the occurrence time $\tau_l$ of the last event in the historical event time set $\varepsilon^t$, the probability that no event occurs in the period $[\tau_l, t)$ is $P(\text{no event in } [\tau_l, t) \mid \varepsilon^t) = \exp\left(-\int_{\tau_l}^{t} \lambda(x)\,\mathrm{d}x\right)$. Thus, the probability density that an event of video $i$ occurs at time $t$ is given by:

$$P(t, i \mid \varepsilon^t) = \lambda_i(t) \exp\left(-\int_{\tau_l}^{t} \lambda(x)\,\mathrm{d}x\right). \qquad (4)$$

The detailed derivation can be found in [19]. Given the event set $E^T = \{e_1, \ldots, e_K\}$, where $e_k = \langle \tau_k, i_k \rangle$ and $K$ is the total number of events during the time interval $(0, T]$, it is easy to derive the likelihood function for a given event set as in Eq. (5):

$$L\left(\theta \mid E^T\right) = \prod_{k=1}^{K} \lambda_{i_k}(\tau_k) \exp\left(-\int_{\tau_{k-1}}^{\tau_k} \lambda(t)\,\mathrm{d}t\right) \times \exp\left(-\int_{\tau_K}^{T} \lambda(t)\,\mathrm{d}t\right) = \prod_{\forall i} \prod_{\tau \in \varepsilon_i^T} \lambda_i(\tau) \times \exp\left(-\int_{0}^{T} \lambda(t)\,\mathrm{d}t\right). \qquad (5)$$

For convenience, we let $\tau_0 = 0$ and align all time series of different videos $i$ to the same initial point $\tau_0 = 0$. In practice, it is difficult to manipulate the likelihood function directly. Equivalently, we can optimize the log-likelihood function, which is defined as:

$$ll\left(\theta \mid \varepsilon^T\right) = \sum_{\forall i} \sum_{\forall \tau \in \varepsilon_i^T} \log\left(\lambda_i(\tau)\right) - \int_{0}^{T} \lambda(t)\,\mathrm{d}t, \qquad (6)$$

where $\varepsilon^T$ is the timestamp set of target events in the entire observation time interval $(0, T]$.

3 HRS MODEL
In this section, we describe the HRS model, which is a combination of the three point process models introduced in the last section.
First of all, we explain the intuition behind the HRS model, and the reasons why we take three types of point processes into account.

• Self-exciting: If a video has attracted a number of user requests recently, each request can impose some positive influence on the future request rate of the same video. This is the self-exciting behavior depicted by the Hawkes process. It can well capture videos that are becoming more and more popular.

• Reactive process: Different from the positive influence of historical request events, there also exist events that impose negative influence on future request rates. For example, the popularity of a video may drop sharply if the video is removed from the recommendation list. Such negative events can be modeled by the reactive process.

• Self-correcting: The user population covered by an edge server is limited. Thus, if users do not repeatedly request videos, a video cannot stay popular forever. In fact, users seldom replay videos they have watched [20]. It is expected that the popularity of a video will diminish with time after a majority of users have requested it. The restriction of the limited user population can be captured by the self-correcting process.

Based on the above discussion, we can see that the evolution of video popularity is a very complicated process. It is difficult to precisely model the video request rate by utilizing only one particular kind of point process. In view of that, we propose to construct the video request intensity function by combining the three types of point processes together.
The proposed request rate intensity function is given by:

$$\tilde\lambda_i(t) = \underbrace{\beta_i}_{\text{bias}} + \underbrace{\omega_i \sum_{\tau \in \varepsilon_i^t} k_1(t-\tau) \overbrace{\exp[-\alpha_i N_i(\tau)]}^{\text{self-correcting}}}_{\text{self-exciting}} - \underbrace{\gamma_i \sum_{\tau' \in \zeta_i^t} k_2(t-\tau')}_{\text{self-restraining}}, \qquad (7)$$

where $\lambda_i(t)$ is the true request rate of video $i$ at time $t$, and $\tilde\lambda_i(t)$ is the estimation of $\lambda_i(t)$. Each term in Eq. (7) is explained as follows.

• The first term $\beta_i$ is the bias of the intensity function for video content $i$, with a positive value ($\beta_i > 0$).

• As marked in Eq. (7), $\omega_i \sum_{\tau \in \varepsilon_i^t} k_1(t-\tau)$ is the SE (self-exciting) term. It is designed based on the Hawkes process, and $\omega_i k_1(t-\tau)$ is the positive influence imposed by the video request event at time $\tau$. $k_1(t-\tau)$ is a kernel function to be specified later and $\omega_i$ is a parameter to be learned.

• The term $\gamma_i \sum_{\tau' \in \zeta_i^t} k_2(t-\tau')$, called the SR (self-restraining) term, captures the influence of negative events such as the event that a video is removed from the recommendation list. This SR term is designed based on the reactive process. $\zeta_i^t$ is the set including the timestamps of all negative events in the period $[0, t)$. Similar to modeling the influence of positive events, $k_2(t-\tau')$ is the kernel function accounting for the influence of a negative event. $\gamma_i$ is a parameter to be learned.

• The factor $\exp[-\alpha_i N_i(\tau)]$ is the SC (self-correcting) term, where $N_i(\tau)$ is the number of historical requests of video $i$ until time $\tau$. The implication is that the influence of a positive event will be smaller if more users have already watched video $i$. For example, suppose there are two movies: A and B.
Movie A has been watched by 99% of users, but Movie B is a new one watched by only 1% of users. Then, the influence of a request event for Movie A or B should be very different.

For point process models, it is common to adopt exponential kernel functions to quantify the influence of historical events [14], [18], [19]. Thus, in the HRS model, the kernel functions $k_1(t-\tau)$ and $k_2(t-\tau')$ are set as:

$$k_1(t-\tau) = \exp[-\delta_1 (t-\tau)], \qquad (8)$$
$$k_2(t-\tau') = \exp[-\delta_2 (t-\tau')]. \qquad (9)$$

Here $\delta_1 > 0$ and $\delta_2 > 0$ are two hyper-parameters, which can be determined empirically through cross validation. From Eq. (8), we can observe that the kernel functions decay with $t-\tau$, implying that the influence gradually diminishes with time.

By considering the reality of video request rates, we need to impose restrictions on the intensity function defined in Eq. (7).

• The video request rate is non-negative. However, due to the SR term, Eq. (7) does not always yield a non-negative request rate. Besides, the log-likelihood function requires that the video request rate be positive. Thus, we define

$$\hat\lambda_i(t) = s \log\left(1 + \exp\left(\tilde\lambda_i(t)/s\right)\right). \qquad (10)$$

Here $s$ is a small positive constant. We utilize the property that the function $g(x) = s\log(1+\exp(x/s)) \approx \max\{0, x\}$, i.e., the ReLU function, as $s \to 0$ [21].

• All parameters $\beta_i$, $\omega_i$, $\alpha_i$ and $\gamma_i$ should be positive numbers to correctly quantify the influence of each term.

The intensity function $\hat\lambda_i(t)$ is the final estimation form of the request intensity of video $i$ at time $t$.

Given the event sets $\varepsilon^T$ and $\zeta^T$, and the parameters $\theta = [\beta^\intercal, \omega^\intercal, \alpha^\intercal, \gamma^\intercal]$, the log-likelihood function is defined according to Eq. (6) as:

$$ll\left(\theta \mid \varepsilon^T, \zeta^T\right) = \sum_{\forall i} \sum_{\tau \in \varepsilon_i^T} \log \hat\lambda_i(\tau) - \sum_{\forall i} \int_{0}^{T} \hat\lambda_i(t)\,\mathrm{d}t. \qquad (11)$$

In the rest of this work, we use the shorthand notation $ll(\theta) := ll\left(\theta \mid \varepsilon^T, \zeta^T\right)$ if the event sets are clear from the context. In addition, to simplify our notation, let $I_i = \int_0^T \hat\lambda_i(t)\,\mathrm{d}t$ represent the integral term in the log-likelihood function of video $i$. Then, the log-likelihood function can be rewritten as:

$$ll(\theta) = \sum_{\forall i} \sum_{\forall \tau \in \varepsilon_i^T} \log \hat\lambda_i(\tau) - \sum_{\forall i} I_i. \qquad (12)$$

The challenge in maximizing the log-likelihood function lies in the difficulty of deriving $I_i$ due to the complication of Eq. (10). Thus, we resort to the Monte Carlo estimator to derive $I_i$ approximately.

We briefly introduce the Monte Carlo estimator as follows. Given a function $f(x)$, $M$ samples of $x$ can be uniformly and randomly selected from the domain of $x$, say $(a, b)$. Then, the integral of $f(x)$ can be calculated as:

$$I = \int_a^b f(x)\,\mathrm{d}x = \mathbb{E}\left[\bar I_M\right], \quad \text{s.t.} \quad \bar I_M = \frac{1}{M}\sum_{m=1}^{M} \frac{f(x^{(m)})}{p(x^{(m)})}, \quad x^{(m)} \sim U(a, b), \quad p(x^{(m)}) = \frac{1}{b-a}, \qquad (13)$$

where $a$ and $b$ are the lower and upper limits of the integral, respectively. The expected value of the integral, i.e., $I = \mathbb{E}[\bar I_M]$, can be approximated by the average of $f(x^{(m)})/p(x^{(m)})$.

Since the integral range of $I_i$ is from $0$ to $T$, we can apply the Monte Carlo estimator to evaluate $I_i$ as follows:

$$I_i \approx \frac{T}{M} \sum_{m=1}^{M} \hat\lambda_i\left(t^{(m)}\right), \quad \text{s.t.} \quad t^{(m)} \sim U(0, T). \qquad (14)$$

The log-likelihood function can then be approximately evaluated by:

$$\bar{ll}(\theta) = \sum_{\forall i} \sum_{\forall \tau \in \varepsilon_i^T} \log \hat\lambda_i(\tau) - \sum_{\forall i} \bar I_i^M = \sum_{\forall i} \sum_{\forall \tau \in \varepsilon_i^T} \log \hat\lambda_i(\tau) - \frac{T}{M} \sum_{\forall i} \sum_{m=1}^{M} \hat\lambda_i\left(t^{(m)}\right), \quad \text{s.t.} \quad t^{(m)} \sim U(0, T). \qquad (15)$$

Equivalently, we can minimize the negated log-likelihood function.
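The uniform-sampling estimator of Eqs. (13)-(14) can be sketched in a few lines of Python. This is an illustration only; the names and the fixed seed are our choices, and `intensity` stands in for any $\hat\lambda_i$.

```python
import random


def monte_carlo_integral(intensity, T, M, seed=0):
    """Estimate I = integral_0^T intensity(t) dt as in Eq. (14):
    draw M uniform samples on (0, T) and average T * intensity(t^(m))."""
    rng = random.Random(seed)  # fixed seed for reproducibility of the sketch
    samples = [intensity(rng.uniform(0.0, T)) for _ in range(M)]
    return T / M * sum(samples)
```

For a constant intensity the estimate is exact; for a varying one, the error shrinks as $O(1/\sqrt{M})$, which is the usual Monte Carlo trade-off between accuracy and the sample count $M$.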
By involving regularization terms, our problem can be formally defined as:

$$\min_\theta \; \mathcal{L} = -\bar{ll}(\theta) + \rho_\beta \|\beta\|_2^2 + \rho_\omega \|\omega\|_2^2 + \rho_\alpha \|\alpha\|_2^2 + \rho_\gamma \|\gamma\|_2^2, \quad \text{s.t.} \quad \beta_i, \omega_i, \alpha_i, \gamma_i > 0, \; \forall i, \qquad (16)$$

where $\rho_\beta$, $\rho_\omega$, $\rho_\alpha$ and $\rho_\gamma$ are regularization parameters. Since the log-likelihood in Eq. (15) is a convex function, we can solve Eq. (16) using the Gradient Descent (GD) algorithm [18], [19]. By differentiating $\bar{ll}(\theta)$ with respect to each parameter $\theta_i \in \theta$, one can derive the following result:

$$\frac{\partial \bar{ll}(\theta)}{\partial \theta_i} = \sum_{\tau \in \varepsilon_i^T} \frac{1}{\hat\lambda_i(\tau)} \frac{\partial \hat\lambda_i(\tau)}{\partial \theta_i} - \frac{T}{M} \sum_{m=1}^{M} \frac{\partial \hat\lambda_i(t^{(m)})}{\partial \theta_i}. \qquad (17)$$

Here $\theta_i$ is a parameter associated with video $i$, such as $\beta_i$.
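The smoothed rectifier $g(x) = s\log(1+\exp(x/s))$ of Eq. (10) and its derivative $\exp(x/s)/(1+\exp(x/s))$ of Eq. (22), which enter every gradient above, can be checked numerically with a small sketch. This is our own illustration; note that a naive `exp(x/s)` overflows for large $x/s$, so a production version would branch on the sign of $x$.

```python
import math


def softplus(x, s=0.1):
    """Smoothed ReLU g(x) = s * log(1 + exp(x/s)); approaches max(0, x) as s -> 0.
    Naive form: overflows when x/s is large (illustrative sketch only)."""
    return s * math.log(1.0 + math.exp(x / s))


def softplus_grad(x, s=0.1):
    """Derivative of g: the logistic sigmoid of x/s, as in Eq. (22)."""
    e = math.exp(x / s)
    return e / (1.0 + e)
```

A finite-difference check confirms that the analytic derivative matches the numerical one, which is exactly the property the gradient expressions in Eqs. (18)-(21) rely on.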
According to Eq. (17), we need to derive $\partial \hat\lambda_i(t) / \partial \theta_i$ in order to obtain the gradient of $\bar{ll}(\theta)$. Thus, by differentiating $\hat\lambda_i(t)$ in Eq. (10) with respect to each parameter, we can obtain Eqs. (18)-(21) as follows:

$$\frac{\partial \hat\lambda_i(t)}{\partial \beta_i} = \frac{\partial \hat\lambda_i(t)}{\partial \tilde\lambda_i(t)}, \qquad (18)$$

$$\frac{\partial \hat\lambda_i(t)}{\partial \alpha_i} = -\frac{\partial \hat\lambda_i(t)}{\partial \tilde\lambda_i(t)} \, \omega_i \sum_{\tau \in \varepsilon_i^t} k_1(t-\tau)\, N_i(\tau) \exp[-\alpha_i N_i(\tau)], \qquad (19)$$

$$\frac{\partial \hat\lambda_i(t)}{\partial \omega_i} = \frac{\partial \hat\lambda_i(t)}{\partial \tilde\lambda_i(t)} \sum_{\tau \in \varepsilon_i^t} k_1(t-\tau) \exp[-\alpha_i N_i(\tau)], \qquad (20)$$

$$\frac{\partial \hat\lambda_i(t)}{\partial \gamma_i} = -\frac{\partial \hat\lambda_i(t)}{\partial \tilde\lambda_i(t)} \sum_{\tau' \in \zeta_i^t} k_2(t-\tau'). \qquad (21)$$

Here $\partial \hat\lambda_i(t) / \partial \tilde\lambda_i(t)$ can be calculated by Eq. (22), which gives

$$\frac{\partial \hat\lambda_i(t)}{\partial \tilde\lambda_i(t)} = \frac{\exp(\tilde\lambda_i(t)/s)}{1 + \exp(\tilde\lambda_i(t)/s)}. \qquad (22)$$

By combining Eqs. (18)-(22) with Eq. (17), we can obtain the gradient expression of $\bar{ll}(\theta)$. Then, let $\theta_i^{(j)}$ represent any parameter to be determined after $j$ iterations. We update $\theta_i$ according to:

$$\theta_i^{(j+1)} \leftarrow \theta_i^{(j)} - \eta \left( -\frac{\partial \bar{ll}(\theta)}{\partial \theta_i^{(j)}} + 2\rho_{\theta_i} \theta_i^{(j)} \right), \qquad (23)$$

where $\rho_{\theta_i}$ is the regularization parameter associated with the parameter $\theta_i$ and $\eta$ is the learning rate of the GD algorithm. In our work, to adhere to Eq. (23) and ensure all parameters stay within the boundaries specified in Eq. (16), we adopt the L-BFGS algorithm [22] to minimize the objective function.

The computation complexity of one iteration of determining the parameters in the HRS model is $O(|\varepsilon^T| + |\zeta^T| + CM)$, where $|\varepsilon^T|$ and $|\zeta^T|$ are the total numbers of positive and negative events respectively, $C$ is the total number of videos and $M$ is the Monte Carlo sampling number for each video.

We analyze the detailed computation complexity as follows. To minimize the objective function in Eq. (16), it is necessary to compute gradients according to Eqs.
(18)-(21). To ease our discussion, we define three functions as below:

$$\Phi_i(t) = \sum_{\tau \in \varepsilon_i^t} k_1(t-\tau) \exp[-\alpha_i N_i(\tau)], \qquad (24a)$$

$$\Psi_i(t) = \sum_{\tau' \in \zeta_i^t} k_2(t-\tau'), \qquad (24b)$$

$$\Gamma_i(t) = \sum_{\tau \in \varepsilon_i^t} k_1(t-\tau)\, N_i(\tau) \exp[-\alpha_i N_i(\tau)]. \qquad (24c)$$

We need to traverse $\varepsilon_i^T$ for the $\Phi_i(t)$'s and $\Gamma_i(t)$'s, and $\zeta_i^T$ for the $\Psi_i(t)$'s. Specifically, for a particular event with occurrence time $\tau$ and its previous event with occurrence time $\tau_l$, we can compute $\Phi_i(\tau) = \Phi_i(\tau_l) \exp[-\delta_1(\tau - \tau_l)] + \exp[-\alpha_i N_i(\tau)]$. Thus, the computation complexity is $O(1)$ to compute $\Phi_i(t)$ for each event, and the complexity is $O(|\varepsilon_i^T|)$ to complete the computation for all events of video $i$. Recalling that $\varepsilon^T = \cup_{\forall i}\, \varepsilon_i^T$, the whole computation complexity for all $\Phi_i(t)$'s is $O(|\varepsilon^T|)$. Similarly, the computation complexity is $O(|\zeta^T|)$ / $O(|\varepsilon^T|)$ to compute all $\Psi_i(t)$'s / $\Gamma_i(t)$'s.

With the above computations, the complexity to compute the first term of Eq. (17) is $O(|\varepsilon^T|)$. For the Monte Carlo estimator, it needs to sample $M$ times for each video. Thus, there are in total $CM$ samples in one iteration for $C$ videos. Besides, given a sampled time point $t^{(m)}$, the complexity is $O(1)$ to compute all gradients. Wrapping up our analysis, the overall computation complexity of each iteration is $O(|\varepsilon^T| + |\zeta^T| + CM)$.

4 ONLINE HRS-BASED VIDEO CACHING ALGORITHM
The complexity $O(|\varepsilon^T| + |\zeta^T| + CM)$ is not high if the training algorithm is executed only once. Yet, the online video system is dynamic because user interests can change rapidly over time, and fresh videos (or users) enter the online video system continuously. The computation load will be too heavy if we need to update the predicted video request rates very frequently. Thus, in this section, we propose an online video caching algorithm based on the HRS model. The online algorithm can update the predicted video request rates with minor computation over the incremental events since the last update.

The online video caching algorithm needs to cope with two kinds of changes. The first is the kernel function update. Given the HRS model parameters, the video request rates predicted according to Eq. (7) should be updated according to the latest events. User interest can be very dynamic. For example, users may prefer news videos in the morning, but movie and TV videos in the evening. Thus, the predicted request rates should be updated instantly and frequently in accordance with the latest events. The second is the parameter update. The model parameters such as $\omega_i$ and $\gamma_i$ capture the influence weight of each term in the HRS model. In the long term, due to the change of users and videos, the model parameters should be updated as well. We discuss the computation complexity of completing the above updates separately.

According to Eq. (7), if there are new events, we need to update the kernel functions, i.e., $\Phi_i(t)$ and $\Psi_i(t)$, so as to update $\tilde\lambda_i(t)$. Suppose the time point of the last update is $t$ and the current time point to update the request rates is $t + \Delta t$. Then, the computation complexity to complete the update is $O(|\varepsilon^{t+\Delta t} - \varepsilon^t| + |\zeta^{t+\Delta t} - \zeta^t|)$. In other words, the complexity is linear in the number of incremental events.
2. Here, we utilize the property that $k_1(0) = 1$.
3. In fact, this complexity is an upper bound, since we can complete the computation of Ψ_i(t) in the first iteration without the necessity to update it in subsequent iterations.
Algorithm 1: Kernel Function Online Update Algorithm
Input: Δt, {ε^{t+Δt} − ε^t}, {ζ^{t+Δt} − ζ^t}, Φ_i(t)'s, Ψ_i(t)'s, N_i(t)'s
Output: Φ_i(t+Δt)'s, Ψ_i(t+Δt)'s, N_i(t+Δt)'s
for ∀i do
    Φ_i(t+Δt) ← Φ_i(t) · exp[−δΔt]
    τ_l ← t
    for ∀τ ∈ {ε_i^{t+Δt} − ε_i^t} do
        N_i(τ) ← N_i(τ_l) + 1
        Φ_i(t+Δt) ← Φ_i(t+Δt) + k(t+Δt−τ) · exp[−α_i N_i(τ)]
        τ_l ← τ
    end
    N_i(t+Δt) ← N_i(τ_l)
    Ψ_i(t+Δt) ← Ψ_i(t) · exp[−δ̄Δt]
    for ∀τ' ∈ {ζ_i^{t+Δt} − ζ_i^t} do
        Ψ_i(t+Δt) ← Ψ_i(t+Δt) + k̄(t+Δt−τ')
    end
end
return Φ_i(t+Δt)'s, Ψ_i(t+Δt)'s, N_i(t+Δt)'s

Since we use exponential kernel functions, it can be proved that

k(t + Δt − τ) = k(t − τ) exp[−δΔt],   (25)

k̄(t + Δt − τ') = k̄(t − τ') exp[−δ̄Δt].   (26)

Note that the term exp[−α_i N_i(τ)] in Eq. (7) does not depend on t. Thus, we can complete the update of the terms ω_iΦ_i(t) and γ_iΨ_i(t) in O(1) by multiplying them by exp[−δΔt] and exp[−δ̄Δt] respectively. Then, we only need to add the contributions of the |ε_i^{t+Δt} − ε_i^t| + |ζ_i^{t+Δt} − ζ_i^t| new events for each video i. Note that it is unnecessary to update videos without any new event. Thus, the overall computation complexity is O(|ε^{t+Δt} − ε^t| + |ζ^{t+Δt} − ζ^t|). The algorithm details for updating the kernel functions are shown in Algorithm 1.

To update the parameters in accordance with new events, it is necessary to update the gradients based on Eqs. (18)-(21) and execute the GD algorithm again to obtain the updated parameters. To avoid confusion with the kernel function update, we suppose the time point of the last parameter update is T and the current time point is T + ΔT. The kernel functions Φ_i(t)'s, Ψ_i(t)'s and Γ_i(t)'s also appear in Eqs. (18)-(21); therefore, their update is introduced first.
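To make the kernel online update of Algorithm 1 concrete, the following Python function is a minimal sketch. It assumes exponential kernels k(x) = exp(−δx) and k̄(x) = exp(−δ̄x); the dictionary containers (`phi`, `psi`, `n`, `alpha`) and the convention that incremental event times are measured relative to the last update time t are our own illustrative choices, not the paper's code.

```python
import math

def update_kernels(dt, new_pos, new_neg, phi, psi, n, alpha,
                   delta=0.1, delta_bar=1.0):
    """One round of the kernel online update (sketch of Algorithm 1).

    dt       : elapsed time since the last update (t -> t + dt)
    new_pos  : {video_id: [positive-event times in (0, dt], relative to t]}
    new_neg  : {video_id: [negative-event times in (0, dt], relative to t]}
    phi, psi : {video_id: current Phi_i(t) / Psi_i(t)}
    n        : {video_id: running event count N_i}
    alpha    : {video_id: alpha_i}
    Videos without new events are left untouched; their decay can be
    applied lazily when their intensity is next queried.
    """
    for i in set(new_pos) | set(new_neg):
        # exponential decay of the accumulated sums, Eqs. (25)-(26)
        phi[i] = phi.get(i, 0.0) * math.exp(-delta * dt)
        psi[i] = psi.get(i, 0.0) * math.exp(-delta_bar * dt)
        # contributions of the new positive events
        for tau in new_pos.get(i, []):
            n[i] = n.get(i, 0) + 1
            phi[i] += math.exp(-delta * (dt - tau)) * math.exp(-alpha[i] * n[i])
        # contributions of the new negative events
        for tau in new_neg.get(i, []):
            psi[i] += math.exp(-delta_bar * (dt - tau))
    return phi, psi, n
```

Because only videos with incremental events are touched, the cost per round is linear in the number of new events, matching the O(|ε^{t+Δt} − ε^t| + |ζ^{t+Δt} − ζ^t|) bound.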
Since the Ψ_i(t)'s do not depend on the model parameters, they maintain fixed values during the learning process and can be simply updated with the incremental new events based on the discussion in the last subsection. However, as for the Φ_i(t)'s and Γ_i(t)'s, we need to scan all historical events again to recompute their values whenever the parameter α_i has been updated, which results in very high computation complexity. To reduce this additional complexity, we propose to use a threshold k_th to truncate the sums in the kernel functions. k_th is a very small
4. Note that the update frequency of the parameters is different from that of the video request rates.
Algorithm 2: Parameter Online Learning Algorithm
Input: θ^(0), T, ΔT, {ε^{T+ΔT} − ε^{T + ln(k_th)/δ}}, {ζ^{T+ΔT} − ζ^T}, Ψ_i(T)'s
Output: θ^(j)
Update Ψ_i(t), t ∈ [T, T + ΔT), once at first
j ← 0
while the termination condition is not satisfied do
    Update Φ_i(t), Γ_i(t), t ∈ [T, T + ΔT), with the set {ε^{T+ΔT} − ε^{T + ln(k_th)/δ}}
    l ← 0
    ∇ ← [0] × C
    for ∀i do
        for ΔM samples do
            t^(m) ∼ Unif(T, T + ΔT)
            Calculate λ̂_i(t^(m)) and λ̃_i(t^(m)) according to Eq. (7) and Eq. (10) with Φ_i(t^(m)) and Ψ_i(t^(m))
            l ← l − λ̂_i(t^(m))
            for θ_i ∈ {β_i, ω_i, α_i, γ_i} do
                Calculate ∂λ̂_i(t^(m))/∂θ_i by the equation among Eqs. (18)-(21) corresponding to θ_i
                ∇_{θ_i} ← ∇_{θ_i} − ∂λ̂_i(t^(m))/∂θ_i
            end
        end
        l ← (ΔT/ΔM) · l
        ∇ ← (ΔT/ΔM) · ∇
        for ∀τ ∈ {ε_i^{T+ΔT} − ε_i^T} do
            l ← l + log λ̂_i(τ)
            for θ_i ∈ {β_i, ω_i, α_i, γ_i} do
                ∇_{θ_i} ← ∇_{θ_i} + (∂λ̂_i(τ)/∂θ_i) / λ̂_i(τ)
            end
        end
    end
    Update θ^(j+1) ← θ^(j) by adopting the L-BFGS algorithm with l, ∇ and the penalty term
    j ← j + 1
end
T ← T + ΔT
return θ^(j)

number. If k(t − τ) < k_th, where t is the current time and τ is the occurrence time of a particular event, the influence of this historical event is negligible. Therefore, it is safe to ignore such events so that the computation complexity does not grow continuously with time. Given k_th and the current update time T + ΔT, it is easy to verify that only the events within [T + ln(k_th)/δ, T + ΔT) need to be involved to update the Φ_i(t)'s and Γ_i(t)'s. Thus, the upper bound of the computation complexity for the kernel renewal is O(|ε^{T+ΔT} − ε^{T + ln(k_th)/δ}| + |ζ^{T+ΔT} − ζ^T|).

Next, we introduce the update of all gradients in Eq. (17). The first term, which involves the recent new events, can be updated in O(|ε^{T+ΔT} − ε^T|). Furthermore, the Monte Carlo sampling term needs to be trimmed during the online learning process, i.e., (ΔT/ΔM) Σ_{m=1}^{ΔM} ∂λ̂_i(t^(m))/∂θ_i.
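As an illustration of how such a trimmed Monte Carlo term is computed, the sketch below estimates the integral of an intensity (or of its gradient) over the window [T, T + ΔT] from uniform samples. The function name and the callable argument are our own; any stand-in for λ̂_i or ∂λ̂_i/∂θ_i can be passed as `fn`.

```python
import random

def mc_integral_estimate(T, dT, n_samples, fn, seed=None):
    """Monte Carlo estimate of the integral of fn over [T, T + dT].

    Draws n_samples uniform time points and scales the sample mean by the
    window length, as done for the compensator term of the log-likelihood.
    """
    rng = random.Random(seed)
    total = sum(fn(rng.uniform(T, T + dT)) for _ in range(n_samples))
    return (dT / n_samples) * total
```

With ΔM/M = ΔT/T samples in the shorter window, the per-iteration cost of this term is O(C · ΔM) over all C videos.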
We suppose there are ΔM new samples during the period [T, T + ΔT), where ΔM/M = ΔT/T, since we update the parameters in a shorter time window. Therefore, we can conclude that the complexity to calculate the Monte Carlo sampling term is O(C · ΔM) with the updated kernels, where C is the total number of videos in the system.

Wrapping up our analysis, the overall computation complexity to update all gradients is O(|ζ^{T+ΔT} − ζ^T| + |ε^{T+ΔT} − ε^{T + ln(k_th)/δ}| + |ε^{T+ΔT} − ε^T| + C · ΔM) in each iteration. Note that this is also an upper bound of the computation complexity. In the end, we present the detailed online learning algorithm for training the HRS model in Algorithm 2.

Fig. 2. Modules of an edge cache system implementing the online HRS algorithm. A representative workflow is presented in order.

In Fig. 2, we describe the framework of our system, including an
Online HRS model, a Data Processor and an Edge Cache. As shown in Fig. 2, the Data Processor is responsible for preprocessing the request records, as well as for recording positive and negative events. The Online HRS model utilizes the user request records from the Data Processor to periodically update the kernel functions and the HRS model parameters. The Edge Cache periodically updates the cached videos based on the updated prediction results generated by the HRS model. Based on the updated request rates of all videos, the Edge Cache keeps the videos with the highest request rates in its cache until the cache is fully occupied. To reduce the computation overhead, the Edge Cache only needs to re-rank videos whose request rates have been updated by the Online HRS model.

The Online HRS model can be further decomposed into three parts: the HRS Trainer, the Parameter Updater and the Kernel Updater. A typical workflow through these three parts is as follows:

A) This is the first step to deploy the HRS algorithm on an edge node. The HRS model is initially trained by the HRS Trainer with long-term request records according to the method introduced in Section 3.2.

B) With the arrival of new video requests, the parameters such as α_i and β_i in the HRS model can be refined by the Parameter Updater according to Algorithm 2. Thus, these parameters reflect not only long-term but also short-term popularity trends of videos. To avoid overfitting, these parameters should not be updated too frequently; in our experiments, we update the parameters every few days.

C) User viewing interest may change very quickly over time, which is captured by renewing the predicted video request rates through frequent updates of the kernel functions. With the incremental user requests, the Kernel Updater can efficiently update the kernel functions to track the latest user request interest.
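The three-step workflow above can be condensed into a periodic control loop. The sketch below is illustrative only: `predict_rates` and `refresh_params` are hypothetical stand-ins for the Kernel Updater (plus rate prediction) and the Parameter Updater, and the hourly batching mirrors the default 1-hour kernel update interval.

```python
def edge_cache_loop(request_stream, cache_size, predict_rates, refresh_params,
                    param_every=48):
    """Illustrative control loop for the online HRS edge-cache workflow.

    request_stream : iterable of hourly request batches (step C cadence)
    predict_rates  : stand-in for the Kernel Updater; returns {video: rate}
    refresh_params : stand-in for the Parameter Updater (step B)
    param_every    : hours between parameter refreshes (e.g. every two days)
    """
    for hour, batch in enumerate(request_stream, start=1):
        rates = predict_rates(batch)            # step C: refresh kernels hourly
        if hour % param_every == 0:
            refresh_params()                    # step B: refine parameters
        # keep the videos with the highest predicted request rates
        yield sorted(rates, key=rates.get, reverse=True)[:cache_size]
```

The initial training by the HRS Trainer (step A) would run once before this loop starts; only the two periodic updates are sketched here.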
EVALUATION

We evaluate the performance of our HRS algorithm by conducting experiments with real traces collected from Tencent Video.
Tencent Video is one of the largest online video streaming platforms in China. We collected a total of 30 days of request records from Nov 01, 2014 to Nov 30, 2014. After data cleaning, encoding and masking, we randomly sampled a dataset which contains a population of C = 20K unique videos from 5 cities in Guangdong province, China. There are a total of K (over 15 million) request records in this dataset, and we make the dataset publicly available on GitHub. Each request record in our dataset is represented by the metadata ⟨VideoID, UserID, TIME, PROVINCE, CITY⟩. Given the lack of negative events, the set of negative events is generated from the request records as follows: if a video stays cold without being requested for a period, a negative event of this video is marked in the dataset. Empirically, the period is set as 12 hours.

We divided the dataset into five parts based on the date of the request records for cross-validation and hyper-parameter selection, following the forward-chaining technique [23]. Each part includes the request records of six days. In the first fold, Part I, including the request traces of the first 6 days, is used as the training set; Part II, with the records of the next 6 days, is used as the validation set; and Part III, including the request records of the following 6 days, is used as the test set. In the second fold, Parts I and II together serve as the training set, while Parts III and IV are used as the validation set and test set respectively. Finally, in the third fold, we employ Parts I-III as the training set and the remaining two parts as the validation and test sets respectively.
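The negative-event construction described above can be sketched as follows. Marking a single negative event at the end of each cold gap is our own assumption for illustration; the paper only states that a negative event is marked once a video stays cold for the 12-hour period.

```python
def mark_negative_events(request_times, cold_period=12.0):
    """Generate negative-event times for one video from its sorted request times.

    Whenever the gap between two consecutive requests exceeds cold_period
    (hours), one negative event is marked at the end of the cold period.
    """
    negatives = []
    for prev, nxt in zip(request_times, request_times[1:]):
        if nxt - prev > cold_period:
            negatives.append(prev + cold_period)
    return negatives
```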
We employ two metrics for evaluation.

• Cache Hit Rate is defined as the number of requests hit by the videos cached on the edge server divided by the total number of requests issued by users. If HRS runs independently on multiple edge servers,
5. Tencent Video: https://v.qq.com
6. https://github.com/zhangxzh9/HRS Datasets
the overall cache hit rate of the whole system is calculated as the weighted average of the hit rates of the multiple edge servers.

• Execution Time: Since an edge server may have very limited computing resources, it is desirable that the computation load of the caching algorithm is kept under control so that the cached videos can be updated in a timely manner. Thus, we use the execution time of each algorithm as the second evaluation metric.
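Since the system-wide metric is a request-weighted average over servers, it reduces to total hits over total requests; a minimal sketch:

```python
def overall_hit_rate(per_server):
    """per_server: list of (hits, total_requests) pairs, one per edge server.

    The request-weighted average of per-server hit rates equals the total
    number of hits divided by the total number of requests.
    """
    hits = sum(h for h, _ in per_server)
    total = sum(t for _, t in per_server)
    return hits / total if total else 0.0
```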
We compare the performance of HRS with the following baselines:

• LRU (Least Recently Used), which always replaces the video that was requested least recently with the newly requested one when the cache is full.

• OPLFU (Optimal Predictive Least Frequently Used) [9], which is a variant of LFU. Different from LFU, it predicts future popularity by matching and using one of the Linear, Power-Law, Exponential and Gaussian functions. The caching server maintains the cache list based on the estimated future popularity determined by the selected function. Due to the high computation complexity, we only use the Linear, Power-Law and Exponential functions in our experiments.

• POC (PopCaching) [24], which learns the relationship between the popularity of videos and the context features of requests, and stores all features in a Learning Database for video popularity prediction. Once a request arrives, POC updates the features of the requested video online and predicts video popularity by searching the Learning Database with the context features. We set the number of requests in the past 1 hour, 6 hours and 1 day as the first three features, while the number of requests in the past 10 days, 15 days and 20 days serves as the fourth feature for the three folds, respectively.

• LHD (Least Hit Density) [25], which is a rather rigorous eviction policy to determine which video should be cached. LHD predicts the potential hits-per-space-consumed (hit density) of each video using conditional probability and evicts the videos with the least contribution to the cache hit rate. A public implementation of the LHD algorithm is available on GitHub, and we reuse it in our experiments with all parameters set to their default values.

• DPC (DeepCache) [13], which predicts the video popularity trend by leveraging an LSTM Encoder-Decoder model. An M-length input sequence of d-dimensional feature vectors, representing the popularity of d videos, is required as the model input.
A K-length sequence is then output for prediction. Here, M and K are hyper-parameters of the model. All model settings in our experiments are the same as in [13].

• Optimal (Optimal Caching), which is achieved by assuming that all future requests are known, so that the edge server can always make optimal caching decisions. It is not a realistic algorithm, but can be
used to evaluate the remaining improvement space of each caching algorithm.

7. https://github.com/CMU-CORGI/LHD
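For reference, the LRU baseline admits a compact implementation with an ordered dictionary; this generic sketch is ours, not the authors' code:

```python
from collections import OrderedDict

class LRUCache:
    """Evict the least-recently-requested video when the cache is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()  # video_id -> None, ordered by recency

    def request(self, video_id):
        """Serve one request; return True on a cache hit."""
        hit = video_id in self.store
        if hit:
            self.store.move_to_end(video_id)   # refresh recency
        else:
            if len(self.store) >= self.capacity:
                self.store.popitem(last=False)  # evict the least recent video
            self.store[video_id] = None
        return hit
```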
We simulate a video caching system as shown in Fig. 2 to evaluate all caching algorithms. In the gradient descent algorithm, there are six hyper-parameters, which are empirically initialized as δ = 0. , δ̄ = 1. and ρ_β = ρ_α = ρ_ω = ρ_γ = e ; their values are then determined through validation. For all parameters θ in HRS, the initial values are set as , except that the initial values of γ are set as 0.1, referring to the settings in previous papers [15], [19], [26]. Moreover, the number of samples in the Monte Carlo estimation is set as , per day, e.g., we set M = 1, , in the first fold, and M = 2, , and M = 3, , for the second and third folds respectively. The time interval ΔT to update the online HRS model is set as two days. The iteration process in the gradient descent algorithm is terminated if the improvement of the cache hit rate on the validation set is negligible after an iteration.

By default, the time interval to update the kernel functions is set as 1 hour, and the truncating threshold k_th is set as e− for parameter online learning. Detailed experiments are conducted to study the influence of these two parameters.

Furthermore, all algorithms except LHD are programmed in Python [27] and executed in a Jupyter Notebook [28] with a single process. As for LHD, we reuse the published code and estimate the execution time according to [25]. The execution time is measured on an Intel server with a Xeon(R) CPU E5-2670 @ 2.60GHz.

We first compare the HRS algorithm with the other baseline caching algorithms by varying the caching capacity from 0.1% to 25% of the total number of videos (i.e., the cache size is varied from 20 to 5K videos). The experiment results for the 5 cities are presented in Fig. 3, with the y-axis representing the averaged cache hit rate, and Fig.
4 shows the cache hit rate of a single server at the province level. Through the comparison, we can see that:

• HRS outperforms all the other baseline algorithms in terms of the cache hit rate under all cache sizes over the test time, with an overall average improvement of 12.3%. DPC is the second best in most cases.

• Caching videos with HRS is an effective approach to reducing Internet traffic. For example, the cache hit rate is more than 45% when caching only 2.5% of the video contents, implying that the Internet traffic can be reduced by 45%.

• HRS utilizes the caching capacity more efficiently. To demonstrate this point, consider a specific case with a target cache hit rate of 10% in Fig. 3. In this case, HRS requires a cache size of only 40
(a) Cache hit rate with small cache at city level. (b) Cache hit rate with large cache at city level.

Fig. 3. Cache hit rate for HRS and the other baseline algorithms when varying the cache capacity (size) from S = 0.1% (20) to S = 25% (5000) at the city level. For clarity, Fig. 3(a) and Fig. 3(b) show the performance under small and large capacity respectively.

(a) Cache hit rate with small cache at province level. (b) Cache hit rate with large cache at province level.

Fig. 4. Cache hit rate for HRS and the other baseline algorithms when varying the cache capacity (size) from S = 0.1% (20) to S = 25% (5000) at the province level. For clarity, Fig. 4(a) and Fig. 4(b) show the performance under small and large capacity respectively.

videos. In comparison, DPC/LHD needs nearly 2-2.4 times the cache capacity to achieve the same goal. The performance improvement over the second best solution exceeds 124% with the limited caching capacity of 0.1%, showing the outstanding ability of HRS to predict the most popular videos when the resource is constrained and more accurate decisions are needed.

• Compared to the other baseline algorithms at the province level, HRS also achieves an overall average improvement of 8.4%. The HRS model performs better at the city level than at the province level because video popularity trends are more accurately reflected by leveraging the SC term on city edge servers. In fact, the HRS model performs better when the request rate is higher.

Moreover, to check the stability of each video caching algorithm, we plot the cache hit rate over time with a fixed caching capacity of S = 200 (equivalent to about 1% of the total videos). The results are presented in Fig. 5, showing the averaged cache hit rate of each algorithm versus the date; that is, each point in the figure represents the average cache hit rate over one day. From the results in Fig. 5, we can conclude that HRS always achieves the highest cache hit rate among the video caching algorithms except the Optimal one, indicating that the gain of HRS is very stable over time.

We study the sensitivity of two crucial parameters, Δt and k_th, in Tables 2 and 3 to see how these two hyper-parameters affect the video caching performance. All other hyper-parameters are kept unchanged as we vary Δt and k_th. We repeat the experiments presented in Fig. 3 by setting different values for Δt. The parameter Δt indicates how frequently the HRS model updates the kernel functions.
As we can see in Table 2, the cache hit rate is higher when Δt is smaller, because the latest user trends can be captured in time with a smaller Δt. This also confirms that user interest is highly dynamic over time. However, it is more reasonable to set Δt = 1 hour, since the improvement from using the smaller Δt = 0. hour is marginal but incurs higher time complexity.
Fig. 5. Average cache hit rate of each day over the test period. Each point in the figure shows the average cache hit rate over a day. The cache capacity is fixed at S = 1% (200).

In Table 3, we further investigate the influence of k_th on the cache hit rate. Events whose kernel values fall below k_th are regarded as negligible and are discarded to control the computation complexity. We reuse the setting of the experiment in Fig. 3 except for varying k_th. As we can see from Table 3, the overall cache hit rate is better when k_th is smaller, since more kernel terms are reserved for computation. Because the cache hit rates obtained by setting k_th equal to e− and e− are very close, we finally set k_th = e− in our experiments for lower time complexity.

We conduct experiments to evaluate the execution time of each video caching algorithm under various cache sizes in Fig. 6. Notably, both HRS (Online) and HRS are evaluated to examine the influence of the online algorithm on the computation complexity. The tests are carried out on the 5 cities, and all test periods in the three folds are considered to achieve convincing results.

As we can see from Fig. 6, the heuristically designed algorithms, i.e., LHD and LRU, achieve the lowest execution time. However, HRS is the fastest among the proactive video caching algorithms, i.e., DPC, POC and OPLFU. The execution time of the online HRS-based algorithm is very short, since the truncating threshold for kernel renewal keeps the computation complexity under control. Moreover, these experiment results indicate the feasibility of HRS for online video caching.
TABLE 2. Cache hit rate (%) under different update intervals Δt (hours); rows: update interval Δt, columns: cache capacity. The setting is the same as that in Fig. 3 except Δt.

Fig. 6. Comparison of execution time under different cache sizes. Results show the total time (s) for the 5 cities to maintain the caching list during the test period.
RELATED WORK

Caching at the edge is an effective way to alleviate the backhaul burden and reduce the response time. In recent years, increasing research interest has been devoted to investigating the caching problem on edge servers. Golrezaei et al. presented a novel architecture of distributed caching with D2D collaboration to improve the throughput of wireless networks [2]. Gregori et al. executed caching strategies on small base stations or user terminals via D2D communications [3]. Different from caching by D2D communication, caching at the edge has more potential to make precise decisions using the features of edge servers. Poularakis et al. formulated a joint routing and caching problem, solved by an approximation algorithm, to improve the percentage of requests served by small base stations (SBSs) [4]. Jiang et al. developed a cooperative cache and delivery strategy in heterogeneous 5G mobile communication networks [5]. Yang et al. devised an online algorithm which estimates future popularity with location-customized caching schemes on mobile edge servers [6]. Moreover, with the ability to learn context-specific content popularity online, a context-aware proactive caching algorithm in wireless networks was introduced by Muller et al. [7].

With the explosive growth of the video population, it is urgent to develop more intelligent video caching algorithms
TABLE 3. Cache hit rate (%) under different values of the truncating threshold k_th; rows: threshold k_th, columns: cache capacity. The setting is the same as that in Fig. 3 except k_th.

by identifying popularity patterns in historical records. It was summarized in [29] and [30] that diverse approaches for content caching have been implemented in the Internet nowadays. However, less attention has been paid to optimizing the caching methods, and most of them were deployed based on heuristic algorithms such as LRU, LFU and their variants [8], [9], [10], which are lightweight but inaccurate, and thus often fail to capture viewers' diverse and highly dynamic interests.

Some proactive models, including regression models [31], autoregressive integrated moving average [32] and classification models [33], were proposed to forecast the popularity of content. Moreover, quite a few learning-driven caching algorithms were proposed for special application scenarios. Wu et al. proposed an optimization-based approach aiming to balance the cache hit rate and the cache replacement cost [34]. Wang et al. developed a transfer learning algorithm to model the prominence of video content from social streams [35], while Roy et al. proposed a novel context-aware popularity prediction policy based on federated learning [36].

Besides, with the rapid development of deep learning, a significant amount of research effort has been devoted to predicting content popularity using neural network models. Tanzil et al. adopted a neural network model to estimate the popularity of contents and select the physical cache size as well as the place for storing contents [37]. Feng et al. proposed a simplified Bi-LSTM (bidirectional long short-term memory) neural network to predict the corresponding popularity profile for every content class [12]. LSTM was also employed for content caching in [13]. However, NN-based models typically require a large number of historical records for tuning their extensive parameters.
With only sparse request records for cold videos, it is not easy to learn an appropriate model for prediction. Further, because the popularity distribution of contents may constantly change over time [38], it is difficult to make decisions based on outdated datasets. Thus, online learning models that are more responsive to the continuously changing trends of content popularity were proposed in [6], [7], [24].
Point processes are frequently used to model a series of superficially random events in order to reveal underlying trends or predict future events. Bharath et al. considered a learning-driven approach with independent Poisson point processes in a heterogeneous caching architecture [39]. Shang et al. [40] formulated a model to capture large-scale user-item interactions by utilizing point process models. Xu et al. [26] modeled user-item interactions via superposed Hawkes processes, a kind of classic point process model, to improve recommendation performance. More applications of point processes in recommendation systems can be found in [41], [42]. Furthermore, point processes have been applied to study social networks between individual users and their neighbors [43]. Ertekin et al. [15] used reactive point processes to predict power-grid failures and provide a benefit-and-cost analysis for different proactive maintenance schemes. Mei et al. [21] proposed a novel model combining point processes and neural networks to improve the prediction accuracy for future events. The reason why point processes are suitable for predicting discrete future events lies in that the occurrence of a past event often gives a temporary boost to the occurrence probability of events of the same type in the future. Naturally, video request records can be regarded as time series of events, which can be modeled by point processes. However, there is very limited work exploring the utilization of point process models to improve video caching decisions, which is the motivation of our work.
CONCLUSION

In this work, we propose a novel HRS model to make video caching decisions for edge servers in online video systems. HRS is developed by combining the Hawkes process, reactive process and self-correcting process to model the future request rate of a video based on historical request events. The HRS model parameters can be determined by maximizing the log-likelihood of past events, and detailed iterative algorithms are provided. In view of the dynamics of user requests, an online HRS-based algorithm is further proposed, which can process request events in an incremental manner. In the end, we conduct extensive experiments on real video traces collected from Tencent Video to evaluate the performance of HRS. In comparison with other baselines, HRS not only achieves the highest cache hit rate, but also maintains low computation overhead.

REFERENCES

[1] Cisco VNI, "Global Mobile Data Traffic Forecast Update, 2017-2022," white paper, 2019.
[2] N. Golrezaei, A. Molisch, A. G. Dimakis, and G. Caire, "Femtocaching and device-to-device collaboration: A new architecture for wireless video distribution,"
IEEE Communications Magazine ,vol. 51, no. 4, pp. 142–149, 2013.[3] M. Gregori, J. G´omez-Vilardeb´o, J. Matamoros, and D. Gunduz,“Wireless content caching for small cell and D2D networks,”
IEEEJournal on Selected Areas in Communications , vol. 34, no. 5, pp. 1222–1234, May 2016.[4] K. Poularakis, G. Iosifidis, and L. Tassiulas, “Approximation al-gorithms for mobile data caching in small cell networks,”
IEEETransactions on Communications , vol. 62, no. 10, pp. 3665–3677, Oct2014.[5] W. Jiang, G. Feng, and S. Qin, “Optimal Cooperative ContentCaching and Delivery Policy for Heterogeneous Cellular Net-works,”
IEEE Transactions on Mobile Computing , vol. 16, no. 5, pp.1382–1393, May 2017.[6] P. Yang, N. Zhang, S. Zhang, L. Yu, J. Zhang, and X. Shen,“Content Popularity Prediction Towards Location-Aware MobileEdge Caching,”
IEEE Transactions on Multimedia , vol. 21, no. 4, pp.915–929, Apr 2019.[7] S. Muller, O. Atan, M. Van Der Schaar, and A. Klein, “Context-Aware Proactive Content Caching with Service Differentiation inWireless Networks,”
IEEE Transactions on Wireless Communications ,vol. 16, no. 2, pp. 1024–1036, Feb 2017.[8] A. Jaleel, K. B. Theobald, S. C. Steely, and J. Emer, “High per-formance cache replacement using re-reference interval prediction(RRIP),”
ACM SIGARCH Computer Architecture News , vol. 38, no. 3,pp. 60–71, Jun 2010.[9] J. Famaey, F. Iterbeke, T. Wauters, and F. De Turck, “Towards apredictive cache replacement strategy for multimedia content,”
Journal of Network and Computer Applications , vol. 36, no. 1, pp.219–227, jan 2013.
[10] M. Z. Shafiq, A. X. Liu, and A. R. Khakpour, "Revisiting caching in content delivery networks," in
Proceedings of the ACM SIGMET-RICS 2014 . Association for Computing Machinery, 2014, pp. 567–568.[11] M. Ahmed, S. Traverso, P. Giaccone, E. Leonardi, and S. Niccolini,“Analyzing the Performance of LRU Caches under Non-StationaryTraffic Patterns,” arXiv e-prints , p. arXiv:1301.4909, Jan. 2013.[12] H. Feng, Y. Jiang, D. Niyato, F. C. Zheng, and X. You, “Contentpopularity prediction via deep learning in cache-enabled fog radioaccess networks,” in
Proceedings of the 38th GLOBECOM . Instituteof Electrical and Electronics Engineers Inc., Dec 2019.[13] A. Narayanan, S. Verma, E. Ramadan, P. Babaie, and Z. L. Zhang,“DEEPCACHE: A deep learning based framework for contentcaching,” in
Proceedings of Workshop on Network Meets AI and ML,Part of SIGCOMM 2018 . New York, New York, USA: Associationfor Computing Machinery, Inc, Aug 2018, pp. 48–53.[14] A. G. Hawkes, “Spectra of Some Self-Exciting and Mutually Excit-ing Point Processes,”
Biometrika , vol. 58, no. 1, p. 83, Apr 1971.[15] S¸. Ertekin, C. Rudin, T. H. McCormick et al. , “Reactive pointprocesses: A new approach to predicting power failures in un-derground electrical systems,”
Annals of Applied Statistics , vol. 9,no. 1, pp. 122–144, 2015.[16] V. Isham and M. Westcott, “A self-correcting point process,”
Stochastic Processes and their Applications , vol. 8, no. 3, pp. 335–347,1979.[17] K. Shanmugam, N. Golrezaei, A. G. Dimakis, A. F. Molisch, andG. Caire, “FemtoCaching: Wireless content delivery through dis-tributed caching helpers,”
IEEE Transactions on Information Theory ,vol. 59, no. 12, pp. 8402–8413, Dec 2013.[18] D. Daley, Daryl J and Vere-Jones,
Xianzhi Zhang received the B.S. degree from Nanchang University (NCU), Nanchang, China, in 2019. He is currently working toward the M.S. degree at Sun Yat-sen University, Guangzhou, China. His current research interests include content caching, applied machine learning, edge computing, and multimedia communication.
Yipeng Zhou is a lecturer in computer science with the Department of Computing at Macquarie University, and the recipient of an ARC Discovery Early Career Researcher Award in 2018. From Aug. 2016 to Feb. 2018, he was a research fellow at the Institute for Telecommunications Research (ITR) of the University of South Australia. From Sep. 2013 to Sep. 2016, he was a lecturer with the College of Computer Science and Software Engineering, Shenzhen University. From Aug. 2012 to Aug. 2013, he was a Postdoctoral Fellow with the Institute of Network Coding (INC) of The Chinese University of Hong Kong (CUHK). He received his Ph.D. degree, supervised by Prof. Dah Ming Chiu, and his M.Phil. degree, supervised by Prof. Dah Ming Chiu and Prof. John C.S. Lui, from the Information Engineering (IE) Department of CUHK. He received his Bachelor's degree in Computer Science from the University of Science and Technology of China (USTC).
Di Wu (M'06-SM'17) received the B.S. degree from the University of Science and Technology of China, Hefei, China, in 2000, the M.S. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2003, and the Ph.D. degree in computer science and engineering from the Chinese University of Hong Kong, Hong Kong, in 2007. He was a Post-Doctoral Researcher with the Department of Computer Science and Engineering, Polytechnic Institute of New York University, Brooklyn, NY, USA, from 2007 to 2009, advised by Prof. K. W. Ross. Dr. Wu is currently a Professor and the Associate Dean of the School of Computer Science and Engineering with Sun Yat-sen University, Guangzhou, China. His research interests include edge/cloud computing, multimedia communication, Internet measurement, and network security. He was the recipient of the IEEE INFOCOM 2009 Best Paper Award and the IEEE Jack Neubauer Memorial Award, among others. He has served as an Editor of the Journal of Telecommunication Systems (Springer), the Journal of Communications and Networks, Peer-to-Peer Networking and Applications (Springer), Security and Communication Networks (Wiley), and the KSII Transactions on Internet and Information Systems, and as a Guest Editor of the IEEE Transactions on Circuits and Systems for Video Technology. He also served as the MSIG Chair of the Multimedia Communications Technical Committee in the IEEE Communications Society from 2014 to 2016, the TPC Co-Chair of the IEEE Global Communications Conference - Cloud Computing Systems, Networks, and Applications in 2014, the Chair of the CCF Young Computer Scientists and Engineers Forum - Guangzhou from 2014 to 2015, and a member of the Council of the China Computer Federation.
Miao Hu (S'13-M'17) is currently an Associate Research Fellow with the School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China. He received the B.S. degree and the Ph.D. degree in communication engineering from Beijing Jiaotong University, Beijing, China, in 2011 and 2017, respectively. From Sept. 2014 to Sept. 2015, he was a Visiting Scholar with the Pennsylvania State University, PA, USA. His research interests include edge/cloud computing, multimedia communication, and software defined networks.
James Xi Zheng received the Ph.D. degree in Software Engineering from UT Austin, the Master's degree in Computer and Information Science from UNSW, and the Bachelor's degree in Computer Information Systems from Fudan University. He was Chief Solution Architect for Menulog Australia, and is now Director of the Intelligent Systems Research Center (itseg.org), Deputy Director of Software Engineering, Global Engagement, and Assistant Professor in Software Engineering at Macquarie University. He specializes in service computing, IoT security, and reliability analysis. He has published more than 80 high-quality papers in top journals and conferences (PerCom, ICSE, IEEE Communications Surveys and Tutorials, IEEE Transactions on Cybernetics, IEEE Transactions on Industrial Informatics, IEEE Transactions on Vehicular Technology, IEEE Internet of Things Journal, and ACM Transactions on Embedded Computing Systems). He was awarded the best paper at the Australian distributed computing and doctoral conference in 2017 and the Deakin Research Outstanding Award in 2016. One of his papers is recognized as a top-20 most-read paper (2017-2018) in Concurrency and Computation: Practice and Experience, and another of his papers, on IoT network security (2018), is recognized as a highly cited paper. He has served as a Guest Editor and PC member for top journals and conferences (IEEE Transactions on Industrial Informatics, Future Generation Computer Systems, PerCom), as WiP Chair for PerCom 2020, Track Chair for CloudCom 2019, and Publication Chair for ACSW 2019, and as a reviewer for many IEEE Transactions journals and CCF A/CORE A* conferences.
Min Chen has been a full professor in the School of Computer Science and Technology at Huazhong University of Science and Technology (HUST) since Feb. 2012. He is the Director of the Embedded and Pervasive Computing (EPIC) Lab at HUST and Chair of the IEEE Computer Society (CS) Special Technical Communities (STC) on Big Data. He was an assistant professor in the School of Computer Science and Engineering at Seoul National University (SNU). He worked as a Post-Doctoral Fellow in the Department of Electrical and Computer Engineering at the University of British Columbia (UBC) for three years. Before joining UBC, he was a Post-Doctoral Fellow at SNU for one and a half years. He received Best Paper Awards from QShine 2008, IEEE ICC 2012, ICST IndustrialIoT 2016, and IEEE IWCMC 2016. He serves as an editor or associate editor for Information Sciences, Information Fusion, and IEEE Access, among others, and as a Guest Editor for IEEE Network, IEEE Wireless Communications, and IEEE Transactions on Services Computing, among others. He was Co-Chair of the IEEE ICC 2012 Communications Theory Symposium and of the IEEE ICC 2013 Wireless Networks Symposium, and General Co-Chair for IEEE CIT 2012, Tridentcom 2014, Mobimedia 2015, and Tridentcom 2017. He was a Keynote Speaker for CyberC 2012, Mobiquitous 2012, Cloudcomp 2015, IndustrialIoT 2016, and the 7th Brainstorming Workshop on 5G Wireless. He has more than 300 paper publications, including 200+ SCI papers, 80+ IEEE Transactions/journal papers, 18 ISI highly cited papers, and 8 hot papers. He has published four books with HUST Press: OPNET IoT Simulation (2015), Big Data Inspiration (2015), 5G Software Defined Networks (2016), and Introduction to Cognitive Computing (2017), as well as a book on big data, Big Data Related Technologies (2014), and a book on 5G, Cloud Based 5G Wireless Networks (2016), with the Springer Series in Computer Science. His latest book (co-authored with Prof. Kai Hwang), entitled Big Data Analytics for Cloud/IoT and Cognitive Computing (Wiley, U.K.), appeared in May 2017. His Google Scholar citations have reached 11,300+ with an h-index of 53, and his top paper has been cited 1,100+ times. He has been an IEEE Senior Member since 2009. He received the IEEE Communications Society Fred W. Ellersick Prize in 2017.