Deep Learning for Latent Events Forecasting in Twitter Aided Caching Networks

Abstract

A novel Twitter context aided content caching (TAC) framework is proposed for enhancing caching efficiency by taking advantage of the legibility and massive volume of Twitter data. To promote caching efficiency, three machine learning models are proposed to predict latent events and event popularity, utilizing collected Twitter data with geo-tags and the geographic information of the adjacent base stations (BSs). Firstly, we propose a latent Dirichlet allocation (LDA) model for latent events forecasting, taking advantage of the superiority of the LDA model in natural language processing (NLP). Then, we conceive a long short-term memory (LSTM) approach with skip-gram embedding and an LSTM approach with continuous skip-gram-Geo-aware embedding for event popularity forecasting. Lastly, we associate the predicted latent events and the popularity of the events with the caching strategy. Extensive practical experiments demonstrate that: (1) The proposed TAC framework outperforms the conventional caching framework and can be employed in practical applications thanks to its ability to associate with public interests. (2) The proposed LDA approach conserves superiority for NLP in Twitter data. (3) The perplexity of the proposed skip-gram-based LSTM is lower than that of the conventional LDA approach. (4) Evaluation of the model demonstrates that the hit rates of tweets vary from 50% to 65% and the hit rate of the caching contents is up to approximately 75% with smaller caching space compared to conventional algorithms.


Zhong Yang, Student Member, IEEE, Yuanwei Liu, Senior Member, IEEE, Yue Chen, Senior Member, IEEE, and Joey Tianyi Zhou, Senior Member, IEEE


I. INTRODUCTION

Recent advances in mobile smart devices, ubiquitous social media and applications bring a tremendous expansion of mobile data traffic. The visual networking index (VNI) report from Cisco [2] reveals that global mobile data traffic (GMDT) is expected to grow to 77 exabytes per month by 2022, a seven-fold increase over 2017. Specifically, 5G will be 3.4 percent of connections but 11.8 percent of total traffic by 2022. The extensive GMDT growth and innovative network technology compel network providers to investigate novel techniques for satisfying the network services and easing backhaul transmission.

Part of this paper has been presented in IEEE Global Communication Conference (GLOBECOM) 2019 [1]. Z. Yang, Y. Liu and Y. Chen are with the School of Electronic Engineering and Computer Science, Queen Mary University of London, London E1 4NS, UK. (email: {zhong.yang, yuanwei.liu, yue.chen}@qmul.ac.uk) Joey Tianyi Zhou is with Agency for Science Technology and Research (A*STAR), Singapore. (email: [email protected])

Content caching at the network edge is a promising approach for reducing network backhaul transmission and bringing down network delay [3, 4]. Nevertheless, caching performance is closely related to the popularity of the contents in the network. In conventional caching frameworks, the user preference is assumed to follow a generalized Zipf law [5]: the content request rate $r(i)$ for the $i$-th content in the network is proportional to $i^{-\alpha}$, where $\alpha$ is the temperature of the network and is typically less than unity. However, the Zipf law is an empirical distribution and lacks a theoretical guarantee. Specifically, the counting methods based on Zipf distribution properties only demonstrate the frequency of the words rather than the co-occurrence between the words. Therefore, the extracted topics are incomplete, which impairs the performance of text-related content prediction in wireless caching. Moreover, employing the Zipf law properties of text-related content to determine what to cache regardless of the locations of base stations (BSs) is pervasive yet wasteful [6], especially given the legibility and massive volume of social media data.
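To make the Zipf assumption concrete, the following minimal sketch computes normalized request rates $r(i) \propto i^{-\alpha}$ and the hit rate obtained when the most popular contents are cached. The catalogue size, the value of $\alpha$ and the function names are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def zipf_request_rates(num_contents: int, alpha: float) -> np.ndarray:
    """Normalized request rates r(i) proportional to i**(-alpha), for ranks i = 1..num_contents."""
    ranks = np.arange(1, num_contents + 1, dtype=float)
    weights = ranks ** (-alpha)
    return weights / weights.sum()

def ideal_hit_rate(rates: np.ndarray, cache_size: int) -> float:
    """Hit rate when the cache holds the cache_size most frequently requested contents."""
    most_popular_first = np.sort(rates)[::-1]
    return float(most_popular_first[:cache_size].sum())

# Hypothetical catalogue of 1000 contents with alpha = 0.8 (both values are assumptions).
rates = zipf_request_rates(num_contents=1000, alpha=0.8)
print(ideal_hit_rate(rates, cache_size=100))
```

Such a model only ranks contents by request frequency; it carries no information about which words co-occur, which is the limitation discussed above.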
According to [7], social media usage is one of the most popular online activities, and the number of people using social media worldwide has increased to almost 3.43 billion, 3.5 times that of 2010.

Social media motivated caching strategies have attracted attention from both academia and industry [8–11]. The authors in [8] analyse the Twitter data of the 2016 U.S. presidential election utilizing LSTM networks to reduce the service latency. A preference-aware optimization [9] considers user side adaptive streaming, coordinated bandwidth allocation, and network side caching content selection. In [10], the caching cost of the base stations (BSs) and social factors among mobile users (MUs) are considered in ultra-dense small cell networks (UDN) to obtain effective caching incentives and the optimal social group utility. Zeydan et al. [11] proposed a big data enabled caching architecture, in which a vast amount of data is harnessed for content popularity estimation and content caching. Twitter is one of the most popular social media platforms in the world and contains countless openly accessible tweets (messages) published by users from different regions. With billions of new tweets being posted every day, the freshness and chronological variation of the text contents in tweets are attracting more and more researchers to exploit tweets to gather large amounts of public data for big data related research.

As the public tends to post their preferences and interests on social media platforms, this brings us an opportunity to cache text-related content more accurately in the BS through topic/event prediction. To efficiently predict text-related caching contents among different BSs, it is plausible to associate the public preference with the Twitter topical issues. Top words in Twitter topics have been shown to vary according to different locations in London [12]. Twitter data indicate the preference of the public towards specific topics [13].
After extracting latent topics from tweets, the text-related contents to cache in the BS can be determined. However, the Zipf distribution is not sufficiently accurate, since natural language shows a statistical structure beyond Zipf distribution properties [14], which makes Zipf-based algorithms incapable of extracting latent topics/events from tweets.

Big data techniques have enabled the industry to deal with large amounts of data with high efficiency. The wireless content caching system is crucial to improving the efficiency of wireless caching at the edge of the wireless network. However, it is always nontrivial to determine what to cache at the wireless devices [15]. With the aid of the friendly application programming interfaces (APIs) offered by the social media platforms, we are able to easily filter large amounts of social media content within a constrained region. Since the APIs also enable users to filter tweets by region based on their location tags (geo-tags), relating regional preference to the tweets in that region seems to be a feasible way to determine what to cache based on the public preferences. Combining the APIs with big data techniques, an individual BS is therefore capable of automatically allocating the local public-posted information. With the aid of machine learning approaches, the preference is predicted and the text-related caching contents are associated with the preference of the regional public.

A. Prior Work

Caching at the edge of wireless networks is considered one of the most important directions due to its great potential for forwarding desired contents during rush hours, which alleviates the network burden [16]. With the increased number of mobile users in wireless networks, pushing frequently requested contents close to them has been deemed an efficient way of overcoming the bottlenecks of wireless communication systems. Furthermore, caching contents according to the preference of the MUs can improve efficiency and prevent congestion to some extent [15]. According to [17], video content caching approaches have been shown to be efficient in improving video throughput.

Machine learning approaches and social media platforms have recently attracted more attention in the field of wireless caching. With the aid of machine learning approaches, proactive allocating systems are able to better predict wireless traffic patterns [18]. By exploiting topical issues in the regions and the interactions among the public, the networks are capable of better predicting the text-related contents at the edge of the networks. To reinforce the quality of the wireless network service, caching has been shown to be effective for radio contents [19], improving the hit rate.

Besides, with the aim of establishing a proactive device-to-device (D2D) caching network, the authors of [18] linked users together to share caching contents. Aiming to extract contextual contents from users' interactions, a wireless caching framework was proposed in [20]. According to the prior works, machine learning approaches have been applied in the video caching field to reinforce the wireless caching network [19]. Another paper employs several parameters to evaluate the different candidate contents by assessing their popularity profile [21].

Related works include the effort to associate different users of social networks with the aim of establishing a proactive D2D caching network [18]. The idea is to retrieve cached contents from other users to satisfy requests from other users for the same contents. Similarly, the wireless caching framework proposed in [20] extracts contextual information from the users' interactions. The results in the above two papers demonstrate that contextual extraction wireless caching models are capable of reinforcing the accuracy of wireless caching and reducing the redundancy of the BS caching contents. As acknowledged in the prior works, machine learning approaches have been applied in the video caching field [19]. In that paper, the authors demonstrated several approaches for determining the video contents to be cached at the BS, including Least Recently Used (LRU), Least Frequently Used (LFU) and their proposed machine learning approach. The hit rates of their models vary from 80% to 90% as the caching size varies from 10 GB to 100 GB. However, their model exploits a monotonic video caching dataset, which relates poorly to text and other aspects of public preference. The work in [21] exploits several parameters to determine the popularity of the candidate contents to be cached. In that paper, the authors classify the contents based on their "popularity profile", which is determined by algorithms. After the evaluation, the data with a high popularity profile are cached at the BS while the others are downloaded from the network. The final result demonstrated that the fetch rate increases along with the popularity profile, while the fetch rates vary from 15% to 25% based on their proposed algorithm.

Therefore, based on the two prior papers mentioned above, we conclude that the prediction and determination of complex, comprehensive wireless caching contents, which include different types of data (video, images, text, etc.), are more nontrivial than wireless caching models for a single monotonic caching content. Further, deep learning approaches have been shown to be effective and efficient in determining wireless caching contents.

B. Motivation and Contribution

Our motivation is to combine different types of machine learning models to propose a wireless caching framework, which exploits social media data as a reference to determine the text-related contents at the BS. Since there are different data types in tweets, including images, videos, music, etc., the proposed model is general and feasible for multiple caching contexts. As mentioned above, social media platforms have been regarded as a major network traffic consumer due to their popularity [22]. After determining which heavy-weight caching contents (images, videos) to cache, the caching-enabled wireless BS is capable of reducing the burden of the network and the required backhaul capacity. Since allocating different topics of text-related caching content requires distinct parameters of Zipf-based models in different areas to obtain satisfactory caching predictions, the efficiency of setting up such networks is limited and the interactions among the networks are limited. Therefore, an autonomous and reliable wireless caching framework is desired. Furthermore, local BSs are capable of forming an autonomous region to reinforce multi-BS caching, which leads to lower computing costs and less waste.

One core problem is that tweets are considered challenging for topic allocation due to their colloquialism, short length (no more than 280 characters) and informal usage of language [23]. The conventional counting approaches, like latent semantic analysis (LSA) [24], are insufficient to find the topics. To enhance the accuracy, we apply the following two machine learning models, namely latent Dirichlet allocation (LDA) and long short-term memory (LSTM). LDA is a Bayesian topic modelling approach, which employs a multinomial probability over terms for topic allocation [25]. LDA is able to obtain precise outcomes from social media data such as Twitter [26]. As the preference of the public in the same region tends to vary chronologically, we employ LSTM [27] to model long-term contextual information. LSTM is a type of recurrent neural network (RNN), which is capable of allocating topics based on context [28].

The other core problem is how to associate the tweet text, which represents the public preference, with the actual caching contents. In an effort to associate Twitter topics with the corresponding BS, we propose an approach to assign the tweets to their pertinent BS in London. Previous studies on Twitter demonstrate that Twitter topics with geographic information represent the preference of the public. After obtaining the topics from the tweets, we associate the topics with actual tweet text to determine what media files to cache in order to achieve better performance. Considering the difference between the caching background of social media platforms and that of traditional content streams (i.e., video sites), we propose criteria to demonstrate the coherency between the topics and future media contents. Our contributions are summarized as follows:
• We propose a novel Twitter context aided content caching (TAC) framework in which Twitter topics with BS geography information are associated with social media platform APIs. In this practical framework, deep learning approaches are conceived in each BS to reinforce the autonomous determination of wireless text-related contents caching at the edge of wireless networks.
• We conceive a versatile LDA model for latent events forecasting utilizing practical data collected from Twitter. The proposed LDA model is capable of conserving superiority for natural language processing (NLP) in Twitter data.
• We adopt an LSTM with skip-gram embedding for content popularity forecasting. In an effort to predict latent events as well as their locations, we propose a new model that exploits LSTM and a continuous skip-gram-Geo-aware embedding approach to forecast not only the terms of the prospective Twitter events but also the location where the tweets are posted.
• Extensive practical experiments verified the performance of the proposed framework and algorithms. The proposed TAC framework is capable of generating satisfactory forecast results among different regions. Judging the caching system by the hit rates on datasets of different sizes, the hit rates of tweets vary from 50% to 65% while the hit rate of the caching contents reaches up to approximately 75%. The framework also saves caching space compared to the conventional algorithms.

C. Organization

The paper is arranged into the following sections. In Section II, related works, which employ machine learning approaches to solve wireless caching problems, as well as their performances are listed. In Section III, the structure of the framework and the preparation procedure of the datasets are demonstrated. In Section IV, the problems of topic extraction and prediction along with the solutions are demonstrated. In Section V, experiments are proposed to evaluate the model. Numerical and literal results are listed to evaluate the topic extraction procedures. In Section VI, the topics obtained from Section V are applied to determine the wireless caching contents. Certain numerical results are demonstrated to present the accuracy and efficiency of our approaches.

II. TAC FRAMEWORK AND DATASET PREPARATION

In this section, the proposed TAC framework and the process of Twitter dataset preparation are demonstrated.

A. TAC Framework

Fig. 1. The proposed Twitter aided content caching framework.

The TAC framework is illustrated in Fig. 1. The tweets are collected through the Twitter API with their geo-tags. After extracting the geo-tags from the tweets, the information of the nearest BS is obtained through a Python crawler: given the geo-tags from tweets, the crawler returns the locations of the BSs. Afterwards, the dataset is arranged into different regions as an integration of the text with the geographic information. Thereafter, three machine learning models are invoked to generate predictions according to the dataset collected from Twitter. Finally, since the Twitter API offers URLs directing to the media files (images, videos, etc.) contained in a tweet, the framework determines which of the files to cache based on the descriptions in tweets associated with the topic predictions. The dataset used throughout the paper was collected through the Twitter API with a filter that gathers the tweets with geo-tags in London. The latitude and longitude restrictions are given as follows:
• Latitude restriction from 51.7136401 to 51.3679144
• Longitude restriction from 0.285472 to -0.4488468
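The latitude and longitude restrictions above amount to a bounding-box test on each geo-tag. The sketch below is one possible way to apply that test to records that have already been collected; it does not call the Twitter API, and the record layout (`text`, `lat`, `lon` keys) and the sample tweets are hypothetical.

```python
from typing import NamedTuple

class BoundingBox(NamedTuple):
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float

# Limits quoted in the text for the London collection area.
LONDON_BOX = BoundingBox(lat_min=51.3679144, lat_max=51.7136401,
                         lon_min=-0.4488468, lon_max=0.285472)

def in_box(lat: float, lon: float, box: BoundingBox = LONDON_BOX) -> bool:
    """Return True when the geo-tag (lat, lon) falls inside the bounding box."""
    return box.lat_min <= lat <= box.lat_max and box.lon_min <= lon <= box.lon_max

# Hypothetical records; a real pipeline would read the geo-tags returned with each tweet.
tweets = [
    {"text": "hypothetical tweet near Westminster", "lat": 51.5007, "lon": -0.1246},
    {"text": "hypothetical tweet from Manchester", "lat": 53.4808, "lon": -2.2426},
]
london_tweets = [t for t in tweets if in_box(t["lat"], t["lon"])]
print(len(london_tweets))
```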

B. Dataset Preparation

The Twitter dataset was collected between 27th January 2018 and 27th February 2018 (31 days in total) and comprises approximately 70000 tweets with geo-tags. Compared with the parallel collecting experiment, which exploits the keyword “UK” to filter tweets and obtained 2 million tweets in total, the portion of tweets with geo-tags is approximately 2.89% of the tweets with the keyword “UK”. The coordinates of the tweets are restricted by the parameters demonstrated above. The time range of the collection within a day is constrained from 7:00 to 15:30 (GMT).

BS data retrieved from https://mastdata.com

1) Distribution of the data:

Due to the different distributions of population and BSs in London, the number of tweets within different districts varies during the same time period, as illustrated in the density map. We divide the entire London area equally into nine smaller areas by the latitude and longitude of BS locations.
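As an illustration of the nine-area division, the sketch below assumes a regular 3 x 3 grid of equal-width latitude and longitude bands over the collection bounding box. The grid layout, the row-major numbering of the areas and the function names are assumptions; the paper states only that the area is divided equally into nine areas.

```python
LAT_MIN, LAT_MAX = 51.3679144, 51.7136401
LON_MIN, LON_MAX = -0.4488468, 0.285472
GRID = 3  # assumed: 3 latitude bands x 3 longitude bands = 9 areas

def band(value: float, low: float, high: float, n: int = GRID) -> int:
    """Index (0..n-1) of the equal-width band containing value, clamped to the edges."""
    idx = int((value - low) / (high - low) * n)
    return min(max(idx, 0), n - 1)

def area_index(lat: float, lon: float) -> int:
    """Area number 0..8, counted row by row from the south-west corner."""
    return band(lat, LAT_MIN, LAT_MAX) * GRID + band(lon, LON_MIN, LON_MAX)

print(area_index(51.5007, -0.1246))  # area of a hypothetical geo-tag
```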

2) Data cleaning:

To satisfy the prerequisites of training machine learning models with the dataset, some characters are deleted based on the standards below:
• Characters that do not belong to English.
• Punctuation and stopwords (in the nltk.corpus package of Python).
• Numbers.
Since URLs in the tweets largely point to specific objects, such as a web page or a piece of video, they are valuable in selecting caching contents. The URLs directing to the media files in tweets are stored separately from the tweet text. Therefore, this offers us an opportunity to associate the caching media files with their tweet text.
Particularly, the twitter tags, such as “

3) Caching-contents datasets:

When a tweet contains certain text-related media files (images, videos), the URLs directing to the files are given when retrieving the tweets from the Twitter API. Since we need to determine the exact caching contents based on tweet text, it is necessary to obtain the caching contents for the text datasets mentioned above. Among the approximately 70000 tweets with geo-tags in total, 7.69% contain media files (images and videos). As the tweets with images are approximately 10 times as many as the ones with videos, images take up a larger proportion of the tweets' media files. Furthermore, since the videos in tweets are mainly short videos, the total occupied caching space of images is twice as large as that of videos. For these reasons, in this paper we focus on caching both the images and videos from the tweets rather than merely the videos.
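The cleaning rules of the data cleaning step in this section (dropping non-English characters, punctuation, stopwords and numbers, while storing URLs separately) can be sketched as follows. This is a minimal illustration under stated assumptions: the short stopword set stands in for the nltk.corpus list mentioned in the text, non-ASCII characters are used as a rough proxy for non-English characters, and the handling of Twitter tags is not covered.

```python
import re
import string

# Tiny illustrative stopword set (assumption); the paper uses the list in nltk.corpus.
STOPWORDS = {"a", "an", "the", "is", "are", "at", "in", "on", "of", "to", "and", "for", "with"}
URL_PATTERN = re.compile(r"https?://\S+")
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_tweet(text: str):
    """Return (tokens, urls): cleaned lower-case tokens and the URLs stored separately."""
    urls = URL_PATTERN.findall(text)
    text = URL_PATTERN.sub(" ", text)
    # Rough proxy for "characters that do not belong to English": drop non-ASCII.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = re.sub(r"[0-9]", " ", text)      # remove numbers
    text = text.translate(PUNCT_TO_SPACE)   # remove punctuation
    tokens = [tok for tok in text.lower().split() if tok not in STOPWORDS]
    return tokens, urls

tokens, urls = clean_tweet("Snow at Kings Cross today!! 3 trains delayed https://example.com/pic.jpg")
print(tokens)  # ['snow', 'kings', 'cross', 'today', 'trains', 'delayed']
print(urls)    # ['https://example.com/pic.jpg']
```

Keeping the URLs next to the cleaned tokens is what later allows a predicted topic to be linked back to the media files of the tweets that mention it.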

III. LDA LEARNING FOR LATENT EVENTS FORECASTING

After demonstrating the structure of the proposed TAC framework, the main concept of this section is to determine text-related caching contents based on Twitter events. In the proposed TAC framework, obtaining Twitter topics is necessary for determining the caching contents. Therefore, the methods in this section are the basis for the following caching-determination procedure. In this section, the core problem, namely how to extract or predict the topics from the existing tweets based on machine learning approaches, is taken into consideration. This section is separated into three approaches to solving the problem. In the first segment, we illustrate how the LDA approach is exploited to extract latent topics from the existing tweets. In the second segment, we demonstrate how to chronologically predict future tweets based on LSTM. In the last segment, we show how to exploit our new embedding method to chronologically predict future tweets associated with geographical locations based on LSTM.

A. Extracting Latent Events from Tweets

The introduction section has demonstrated that Twitter events are strongly associated with the public preference in prior works. How to extract the events from existing Twitter text is therefore crucial to realising our proposed wireless caching framework. In this section, we demonstrate how to exploit the LDA method to obtain satisfactory event extraction.

1) The LDA model:

LDA is an unsupervised Bayesian probabilistic model with the objective of identifying the probable topics from the documents [29]. The proposed LDA model for latent events forecasting is illustrated in Fig. 2. As in Fig. 2, the idea is to determine the corresponding topic $z_{m,n}$ of the $n$-th word in the $m$-th document based on parameter $\alpha$, and to determine the probability $w_{m,n}$ of each word under the given parameter $\beta$ and a given topic, in order to obtain every word in this given topic. Each document is assigned a distribution over the latent topics. Through setting the $K$ latent topics, the words are associated with the topics as $P(\text{topic} \mid \text{document})$ and $P(\text{topic} \mid \text{term})$ [26].

Fig. 2. The structure of LDA model for latent events forecasting.

The objective is to obtain the probabilities that each word belongs to different latent events, and therefore, we allocate events by listing the words of the highest probability in each latent event, as illustrated in Eq. (1),

$$P(Z \mid W, \alpha, \beta), \tag{1}$$

where $Z$ is the event, $W$ stands for the document, and $\alpha$, $\beta$ are parameters.

According to Fig. 2 and its approach originating from [29], the probability equation is given by Eq. (2),

$$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)} \prod_i \theta_i^{\alpha_i - 1}, \tag{2}$$

where $\theta$ is the event multinomial distribution parameter and $\alpha$ is a $k$-dimensional vector of the Dirichlet distribution; $\theta$ follows the Dirichlet distribution with parameter $\alpha$.

With given parameters $\alpha$ and $\beta$, the joint probability distribution is given in Eq. (3),

$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta), \tag{3}$$

where $z$ and $w$ stand for the event set and the document set, and $N$ is the number of words in document $w$. As $\theta$ and $z$ are latent variables, we adopt Gibbs Sampling to marginalize them.

2) Model Learning:

The learning procedure of the LDA model is based on Gibbs Sampling and is given in Algorithm 1. Here, we set the number of iterations to 100, the number of latent topics to 20, and the number of words per topic to 7.
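The Gibbs-sampling procedure can be illustrated with a standard collapsed Gibbs sampler for LDA. The sketch below is not the authors' implementation: the hyperparameter values `alpha` and `beta`, the toy corpus and all function names are assumptions, and a fixed number of sweeps is used instead of a convergence test. The defaults for the number of topics, iterations and words per topic follow the settings quoted above.

```python
import numpy as np

def lda_gibbs(docs, vocab_size, num_topics=20, alpha=0.1, beta=0.01, iterations=100, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), num_topics))    # words in document m assigned to topic k
    topic_word = np.zeros((num_topics, vocab_size))  # occurrences of word w assigned to topic k
    topic_total = np.zeros(num_topics)               # words assigned to topic k overall
    assignments = []
    # Assign every word to a random latent topic.
    for m, doc in enumerate(docs):
        z_doc = rng.integers(num_topics, size=len(doc))
        assignments.append(z_doc)
        for w, z in zip(doc, z_doc):
            doc_topic[m, z] += 1
            topic_word[z, w] += 1
            topic_total[z] += 1
    # Repeatedly scan the corpus and resample the topic of each word.
    for _ in range(iterations):
        for m, doc in enumerate(docs):
            for n, w in enumerate(doc):
                z = assignments[m][n]
                doc_topic[m, z] -= 1
                topic_word[z, w] -= 1
                topic_total[z] -= 1
                p = (doc_topic[m] + alpha) * (topic_word[:, w] + beta) / (topic_total + vocab_size * beta)
                z = rng.choice(num_topics, p=p / p.sum())
                assignments[m][n] = z
                doc_topic[m, z] += 1
                topic_word[z, w] += 1
                topic_total[z] += 1
    # Smoothed topic-word and document-topic distributions.
    phi = (topic_word + beta) / (topic_word.sum(axis=1, keepdims=True) + vocab_size * beta)
    theta = (doc_topic + alpha) / (doc_topic.sum(axis=1, keepdims=True) + num_topics * alpha)
    return theta, phi

def top_words(phi, id_to_word, words_per_topic=7):
    """List the highest-probability words of each latent topic."""
    return [[id_to_word[i] for i in np.argsort(row)[::-1][:words_per_topic]] for row in phi]

# Toy corpus (hypothetical): word ids index into vocab.
vocab = ["snow", "train", "delay", "match", "goal", "arsenal"]
docs = [[0, 1, 2, 1], [3, 4, 5, 4], [0, 2, 1], [5, 3, 4]]
theta, phi = lda_gibbs(docs, vocab_size=len(vocab), num_topics=2, iterations=50)
print(top_words(phi, vocab, words_per_topic=3))
```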

B. Forecast Events based on Tweets

LSTM is suitable for chronological data forecasting. Compared with the conventional RNN, the LSTM model has the advantage of retaining long-term memory [30]. The structure of the proposed LSTM model is illustrated in Fig. 3. In Fig. 3, LSTM cells are the basic units of the LSTM model. The two arrows surrounding the cell $\mathrm{Cell}_t$ are the vector transfers of the long-term memory $c_{t-1}$ and the short-term memory $h_{t-1}$ from the former cell $\mathrm{Cell}_{t-1}$. Inside the cell are neural network layers, including tanh layers and sigmoid layers. $X_t$ and $h_t$ are the input and output at time $t$. The algorithm flow of training the LSTM model is listed in Algorithm 2.
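The role of the long-term memory $c_{t-1}$ and the short-term memory $h_{t-1}$ inside one cell can be made concrete with the standard LSTM update equations. The NumPy sketch below uses the textbook formulation with forget, input and output gates; the weight layout, the dimensions and the random inputs are assumptions for illustration and do not reproduce the TensorFlow model used in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*H, H+D) and b has shape (4*H,), with H = hidden size."""
    H = h_prev.shape[0]
    gates = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(gates[0:H])                # forget gate: which parts of c_prev are kept
    i = sigmoid(gates[H:2 * H])            # input gate: which candidate values are written
    c_tilde = np.tanh(gates[2 * H:3 * H])  # candidate cell values
    o = sigmoid(gates[3 * H:4 * H])        # output gate: which parts of the cell are exposed
    c_t = f * c_prev + i * c_tilde         # new long-term memory
    h_t = o * np.tanh(c_t)                 # new short-term memory / output
    return h_t, c_t

# Run a short random sequence through the cell (dimensions are illustrative).
rng = np.random.default_rng(0)
D, H = 8, 4
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):
    h, c = lstm_cell_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```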

Algorithm 1 LDA Learning for Latent Events Forecasting
  For every word w in each document, assign the word to a random latent topic z;
  repeat
    Scan the corpus; for every word w in the corpus, do Gibbs Sampling and obtain its latest latent topic; update the corpus;
  until the Gibbs Sampling converges OR the maximum iteration number is reached
  Gather the topic-word (W-Z) co-occurrence matrix, which is the result of the LDA model;
  Obtain the topic distribution of the model $\vec{\theta}_{new}$;

Fig. 3. LSTM model for content popularity forecasting.

The skip-gram embedding approach is commonly employed in natural language processing (NLP). The idea is to feed the model with words in adjacent positions and therefore associate the words with their context. For example, when skip is 1 and a batch is $n+1$ words, we first feed the model with the $i$-th to $(i+n)$-th words of the text, and the second step is to feed the $(i+1)$-th to $(i+n+1)$-th words [31].
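The feeding scheme in the skip-gram example, where each step shifts the window of words by `skip` positions, can be sketched as follows. Only the windowing is shown, not the embedding training; the function name and the sample words are hypothetical.

```python
def sliding_windows(tokens, window_size, skip=1):
    """Consecutive windows of window_size tokens, each shifted by skip positions."""
    return [tokens[start:start + window_size]
            for start in range(0, len(tokens) - window_size + 1, skip)]

words = ["heavy", "snow", "delays", "trains", "across", "london"]
# With skip = 1 and a window of 4 words: words 0..3, then words 1..4, then words 2..5.
for window in sliding_windows(words, window_size=4, skip=1):
    print(window)
```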

1) Model Training:

To train the LSTM model, we exploited the Twitter dataset from 27th January 2018 to 26th February 2018 (30 days) as the training dataset, and the tweets of 27th February 2018 as the testing dataset. Here, the dictionary is limited to 60000 words. To fulfill the prerequisites, each tweet is mapped into a vector based on the indexes of its words in the dictionary. The vectors are fed into the network in chronological order. The parameters of the networks are detailed in Table I. We use a gradient descent optimizer in TensorFlow to automatically adjust the learning rate.

2) Forecasting:

The model forecasts the terms of the events through a softmax (normalized exponential) layer, which is commonly used together with the cross-entropy loss for multi-class classification [32]. The events are sorted according to the predicted probabilities and the events with higher probabilities are cached in our framework.

Algorithm 2 LSTM for Content Popularity Forecasting
  Initialize the model layers and hidden sizes with the given parameters;
  Feed the first batch of input X, calculate the memory h and cell state C;
  repeat
    For time stamp t, the forget gate layer σ determines what parts of the cell state $C_{t-1}$ are preserved;
    For time stamp t, the input gate layer σ determines which values between $h_{t-1}$ and $x_t$ are updated, and the cell state of this time stamp $C_t$, memory $h_t$ and candidate value $\tilde{C}_t$ are derived through tanh layers and operators;
    Invoke a sigmoid layer (output gate layer) to determine what parts of the cell state are the output and update the cell state;
    Evaluate perplexity through calculating the cross entropy between the forecast and $X_{t+1}$;
    Invoke the gradient descent optimizer with a given learning rate to optimize the model;
  until the maximum epoch number is reached
  Save the model;

TABLE I
PARAMETERS OF LSTM MODELS

Parameter | Medium | Large
--- | --- | ---
Initial scale of weights | 0.04 | 0.05
Initial learning rate | 0.1 | 0.2
Maximum permissible norm of the gradient | 5 | 10
Number of LSTM layers | 2 | 3
Number of unrolled steps of LSTM | 20 | 50
Hidden size | 650 | 1500
Max epoch | 65 | 55
Learning rate decline epoch | 25 | 30
Learning rate decline rate | 0.8 | /

C. Predicting Events based on Tweets and Geographical Locations

The aim of this model is to combine the geographic information with the Twitter text. To achieve this objective, we append tokens indicating the latitude ranges and longitude ranges to the tweet vectors. By feeding the LSTM model with the embedded vectors including the tokens, the text is therefore associated with the geographic information.

1) Model Training:

In this new model, we apply bag-of-words (BoW) embedding to map each tweet into a vector. Regardless of the order of the components, we merely assume the terms are ingredients of the topic of the tweet. The skip-gram approach is invoked to feed the vectors of tweets into the model. The process is illustrated in Fig. 4, which presents the inputs of the proposed LSTM for latent events forecasting. In Fig. 4, each single tweet is mapped into a vector based on its content and the geographic information of its BS. Thus, the first $n$ elements ($W_1$ to $W_n$) in a single tweet vector are the numerical representation of the words in this tweet based on the index in the above dictionary.

Fig. 4. LSTM inputs based on skip-gram embedding.

After the data cleaning procedure described in the data preparation section (removing the meaningless parts of the tweet text), approximately 97% of the tweets fit within 20 words. Since a larger vector leads to lower accuracy, we constrain the word part of the vector to 20 elements. Tweets shorter than 20 words are padded with UNK tokens. Tweets longer than 20 words are truncated to their first 20 words, which preserves the tags and most of the words. The last two elements of the tweet vector then represent the latitude range and the longitude range of the tweet. The latitude token ranges from 81112 to 81114, while the longitude token ranges from 91112 to 91114 (to avoid collisions with the indexes of the dictionary). Afterwards, every m tweet vectors are arranged into a single input vector in chronological order. Between the single input vectors, we invoke the skip-gram approach to maintain the chronological relations among the tweets, as illustrated in Fig. 4: if the first single input vector includes the tweet vectors $V_n$ to $V_{n+m}$, the second includes $V_{n+1}$ to $V_{n+m+1}$. Every k single input vectors are deemed a batch. The LSTM model itself is the basic LSTM neural network based on TensorFlow. Here, we evaluate the dataset through LSTM models of two scales. The basic parameters of the models are listed in Table II.

TABLE II
PARAMETERS OF LSTM MODELS

Parameter                                   Medium   Large
Initial scale of weights                    0.04     0.05
Initial learning rate                       0.1      0.2
Maximum permissible norm of the gradient    5        10
Number of LSTM layers                       2        3
Number of unrolled steps of LSTM            50       100
Hidden size                                 650      1000
Max epoch                                   45       55
Batch size                                  20       20
Learning rate decline epoch                 25       30
Learning rate decline rate                  0.8      /
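The construction of a single tweet vector described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the UNK index of 0, the toy dictionary, the mapping of out-of-dictionary words to UNK, the band indexes 0 to 2, and the window size of exactly m vectors are all assumptions made for the example. Only the 20-word limit and the two token ranges are taken from the text.

```python
from typing import Dict, List, Sequence

MAX_WORDS = 20                         # word slots per tweet, as stated in the text
UNK = 0                                # hypothetical dictionary index reserved for UNK
LAT_TOKENS = (81112, 81113, 81114)     # latitude-range tokens given in the text
LON_TOKENS = (91112, 91113, 91114)     # longitude-range tokens given in the text


def encode_tweet(words: Sequence[str], vocab: Dict[str, int],
                 lat_band: int, lon_band: int) -> List[int]:
    """Return MAX_WORDS word indexes followed by one latitude and one longitude token."""
    ids = [vocab.get(w, UNK) for w in words[:MAX_WORDS]]   # keep only the first 20 words
    ids += [UNK] * (MAX_WORDS - len(ids))                  # pad shorter tweets with UNK
    return ids + [LAT_TOKENS[lat_band], LON_TOKENS[lon_band]]


def sliding_windows(tweet_vectors: Sequence[List[int]], m: int) -> List[List[List[int]]]:
    """Group chronologically ordered tweet vectors into windows of m, shifted by one tweet."""
    return [list(tweet_vectors[i:i + m]) for i in range(len(tweet_vectors) - m + 1)]
```

Consecutive windows overlap in all but one tweet vector, which is how the chronological relation between neighbouring inputs is kept in this sketch.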

2) Predicting Geo-information and tweet content:

The softmax layer is utilized to predict the topic according to the descending order of the probability of the words. The prediction only contains words in the dictionary. To predict in which of the 9 areas the tweet is posted, we sum up the probabilities of each geographic element over the single tweet vectors. The output of the softmax layer is a (k · m · (n + 2)) × vocabulary size matrix, where k, m and n are illustrated in Fig. 4. The vocabulary size is set to 92000 to include all the indexes, both dictionary indexes and geographic indexes. Then, for every n + 2 rows (which represent a single tweet), we separately sum up the three columns that correspond to the latitude tokens over these n + 2 rows. The greatest of the three sums indicates the latitude area to which the tweet belongs. The longitude is determined with the same approach.

IV. PRACTICAL EXPERIMENTS FOR LATENT EVENTS FORECASTING

In this section, practical experiments are conducted to evaluate the three models using the data collected from Twitter. The three models generate different types of outcomes: 1) The LDA model predicts latent events. 2) The LSTM model with skip-gram embedding forecasts related words of events. 3) The LSTM model with skip-gram-geo-aware embedding predicts related words of the event and its location.

To demonstrate the performance of the proposed models, we adopt the perplexity value as the key performance indicator, because perplexity is a well-accepted measure for evaluating natural language processing (NLP) models. The definition of perplexity is given in Eq. (4):

$$H(p,q) = 2^{-\frac{1}{N}\sum_{i=1}^{N}\log_b q(x_i)}, \qquad (4)$$

where p is the unknown distribution of the test dataset, $x_1, x_2, x_3, \ldots, x_N$ are the samples of the test dataset, q stands for the model that we want to evaluate, and b is the base of the logarithm, which must equal the base of the exponent (here 2). Perplexity evaluates the similarity between the prediction and the ground truth. Since perplexity maintains an inverse relationship with the log-likelihood, a language model with lower perplexity achieves better performance in application. Therefore, in this section, we apply perplexity to demonstrate the accuracy of the proposed models, and the testing datasets are clarified in each sub-section. Furthermore, as a unified standard, perplexity is also capable of generalizing the results, since the standard is based on numerical results.

A. LDA model
1) Dataset division:

To demonstrate the distinct preference of topics within different regions, the dataset has been divided into 9 smaller datasets. The separation is based on the BS coordinates, which are illustrated in Fig. 5. According to Fig. 5, the areas of the City of London and Inner London are equipped with denser BSs compared with the areas of Outer London.

Fig. 5. The density distribution of the involved BSs in London (areas marked with deeper purple contain more BSs). We observe that the areas of the City of London and Inner London are equipped with denser BSs compared with the areas of Outer London.

2) Literal results:

To demonstrate the variation of the extracted topics across regions, and considering the page limit, we display the results of Location 5 and Location 4 as examples. These two locations are selected because their topics are more diversified than those of the other regions. The literal results of the LDA model in Location 5 are shown in Table III.

TABLE III
SAMPLE RESULTS IN DIFFERENT REGIONS GENERATED FROM THE PROPOSED LDA MODEL

Region      Topics

The results from Location 5 can be interpreted as follows. The first topic relates to employment advertising tweets with the tag of “…”. The 2nd and 3rd topics are associated with the equality of the “LGBT” group and the weather. According to the results in Table III, the first and the third topics possibly originate from tweets composed by tourists, as they discuss particular locations such as Heathrow Airport and Hounslow. The second topic is related to Valentine’s Day; however, as a music festival called “Kaleidoscope Festival” takes place during the same time period, the keyword “kaleidoscope” is also included here.

The perplexity tendencies of the LDA model under different datasets are illustrated in Fig. 6. To illustrate the different tendencies of perplexity under datasets of different sizes, we select two larger datasets and two smaller datasets based on their regions. The datasets of Location 6 and Location 5 are the larger datasets, while the other two (Location 1 and Location 4) are the smaller ones. The graph demonstrates the results in two aspects: the decreasing trends show that the LDA model is able to converge, and the larger datasets enable the model to achieve more accurate performance (lower perplexity).

Fig. 6. The learning perplexity of the proposed LDA model in Location 1, Location 4 (smaller datasets), Location 5 and Location 6 (larger datasets).

Here we demonstrate some keywords of the events from the LDA model, as illustrated in Fig. 7. From the graph, tweets in the 9 areas are primarily related to landmark places or district functions (sightseeing, sports), which helps to identify the different characteristics of the districts in London.

Fig. 7. Keywords distribution of the predicted events in London generated from the proposed LDA model.

Analysis of the keywords is capable of determining the text-related caching contents based on the public preference. Since we have discovered that the keywords from tweets in different regions are closely associated with the local landmarks, the text-related contents are able to be determined from them. For instance, according to Fig. 7, the keywords “wimbledon” and “stadium” appear in the south part. Different from conventional counting methods, our proposed machine learning approach is capable of demonstrating the co-occurrence among the words. Therefore, based on the two keywords mentioned above, text-related contents such as videos of the matches held in the stadium should be cached at nearby base stations. Similarly, for the base stations located near the keyword “Arsenal”, videos of the recent football games should be cached. Different from the situations mentioned above, for the palaces in the central part of London, pictures and navigation information (including coordinates and public transport information) should be cached, since there are several sightseeing spots including the famous Buckingham Palace.

B. LSTM model with skip-gram embedding
1) Datasets:

The 9 training datasets and 9 testing datasets are separated based on the regions of the BSs. The training datasets range from th January to th February. The testing datasets are the tweets on the th February.

2) Forecasting results:

For a similar reason as with the literal results of the LDA model, we demonstrate the results from Location 4 and Location 5. The top events of Location 5 and Location 4 are illustrated in Table IV. As with the LDA model, since the perplexity trend converges during testing, the generalization and the accuracy of the model are demonstrated. The interpretations are based on the keywords in the topics. With the keyword “job”, “hiring”, “

TABLE IV
SAMPLE RESULT EVENTS FROM THE PROPOSED LSTM MODEL WITH SKIP-GRAM EMBEDDING

Location      Topic
Location 5

When the model is invoked on the Location 5 dataset with networks of different scales, the variation of the training and testing perplexity is illustrated in Fig. 8.

[Figure: two panels, “Medium Scale” and “Large Scale”; x-axis: Epoch; y-axes: Perplexity and Learning rate; curves: Training perp, Testing perp, Learning rate]

Fig. 8. The training and testing perplexity under medium scale and large scale of the proposed LSTM networks.

C. LSTM model with skip-gram-geo-aware embedding
1) Datasets:

The training dataset contains all tweets from th January to th February. The testing dataset is the tweets on the th February.

2) Forecasting results:

The tendencies of the training and testing perplexity are illustrated in Fig. 9. The text prediction examples are illustrated in Table V. The topic is largely similar to the results of applying the LDA model to the Location 5 dataset. As the job advertising tweets constantly contain a URL directing to their web site, the keywords “apply” and “click” are involved in the prediction. The prediction is more unified and less noisy compared to the results of the skip-gram embedding.

D. Model Analysis

In this section, the properties of the three models are analyzed. The performances of the models under different circumstances are demonstrated. Based on the graphs mentioned in the above section, we discuss the different aspects of performance illustrated.

TABLE V
RESULT TOPIC AND LOCATION EXAMPLES

Location      Forecasting Topic
Location 5

[Figure: two panels, “Medium Scale” and “Large Scale”; x-axis: Epoch; y-axes: Perplexity and Learning rate; curves: Training perp, Testing perp, Learning rate]

Fig. 9. The training and testing perplexity under medium scale and large scale of the proposed LSTM networks (Geo-aware).

1) LDA model with different datasets:

The perplexity applied in the LDA model can be regarded as a standard to evaluate the cohesiveness between the extracted events and the ground truth, i.e. the existing tweet text. Since perplexity is tightly related to the log-likelihood, a lower perplexity indicates more cohesiveness between the extracted topics and the Twitter text, and thus more accurate results. The convergence of the perplexity can also be deemed the convergence of the model. With the different scales of documents illustrated in Fig. 6, larger documents (Location 5 / 6) tend to converge more rapidly. Furthermore, the overall perplexity of larger documents tends to be lower. Since more tweets are involved in each iteration, for a definite number K of latent topics and a larger corpus, the topic-word co-occurrence matrix is more reasonable. The final perplexity of the LDA model is satisfactory (about 300 under the large dataset). Since the literal results demonstrated in Table III contain correlative keywords, we can interpret the topics expediently. Given the advantages above, we conclude that the LDA model is feasible for predicting Twitter topics.
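The relation between perplexity and log-likelihood invoked here follows directly from Eq. (4) and can be illustrated with a short sketch. The function name and the toy probabilities are hypothetical; the code simply exponentiates the negative average log-probability that a model assigns to the test samples, using the same base for the logarithm and the exponent.

```python
import math
from typing import Sequence


def perplexity(probs: Sequence[float], base: float = 2.0) -> float:
    """Perplexity of a model, given the probability it assigns to each test sample."""
    avg_log_prob = sum(math.log(p, base) for p in probs) / len(probs)
    return base ** (-avg_log_prob)   # higher log-likelihood gives lower perplexity


# A model that assigns probability 0.25 to every sample is as uncertain as a
# uniform choice among 4 outcomes, so its perplexity is (approximately) 4.
uniform_like = perplexity([0.25, 0.25, 0.25, 0.25])
```

Because the average log-probability enters with a negative sign in the exponent, any increase in the log-likelihood of the test data lowers the perplexity, which is why the two measures rank models consistently.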

2) Different scales of LSTM with skip-gram embedding:

Different from the LDA model, a lower perplexity of the model on the testing datasets indicates less difference between the prediction and the ground truth, i.e. the testing datasets (future tweets). The convergence of the perplexity can also be deemed the convergence of the model. Under a larger network scale and learning rate, the curve in Fig. 8 converges more rapidly, while the testing perplexity fluctuates noticeably due to overfitting and the characteristics of gradient descent optimization. As mentioned above, perplexity is applied to evaluate the accuracy of the models in this paper. According to the result, the smaller network and learning rate result in better overall perplexity. The final testing perplexity of the medium-scale network is 54738.23, while that of the large-scale network is 78312.39. Due to the varied combinations of colloquial words, the perplexity of the Twitter documents is not as low as that of a formal dataset such as PTB (about 70 perplexity with a large-scale LSTM model [31]).

3) Different scales of LSTM with skip-gram-Geo-aware embedding:

When the new model converges, the topics of prediction are largely the same due to the characteristics of the LSTM model; therefore, the model itself is only capable of predicting one tweet in a specific area. The variations of the perplexity under different scales of networks are demonstrated in Fig. 9. The larger network with the greater initial learning rate tends to converge more rapidly while fluctuating noticeably. After the training and testing process, the larger network with the deeper structure achieves superior overall testing perplexity. The outcome demonstrates that the larger network tends to generate more uniform results on the testing dataset. The final testing perplexity is 1253.75 (large model) and 1752.83 (medium model).

Based on the events prediction and perplexity results of the three models, the LDA model achieves satisfactory prediction as well as multiple-topic prediction within one particular area, while the basic LSTM model with skip-gram embedding has relatively high perplexity. The novel LSTM model with skip-gram-Geo-aware embedding realizes comparatively lower perplexity as well as geographic information prediction. However, due to the characteristics of the neural network when it converges, future efforts are required to enable the model to predict multiple topics.

V. ASSOCIATING EVENTS FORECASTING WITH CACHING

Three machine learning models are proposed in the above sections to extract latent events from the existing tweets and to predict future topics based on existing tweets. The perplexity results confirm the effectiveness of the proposed models. Having proposed valid approaches to the core problems mentioned above, we next evaluate the models in the context of wireless caching. In this section, we associate the events prediction with wireless caching. To exploit the different advantages of the different machine learning models, we propose the following approach to determine the text-related caching contents. While the LSTM-based models mentioned above are capable of predicting the future Twitter text based on the chronological inputs of existing tweets, the LDA model is capable of extracting the latent topics from the predicted Twitter text. Afterwards, the text-related contents, i.e. the images and videos, which have a remarkable impact on the network backhaul load, are cached at the BS. Since the accuracy of the framework is directly related to the accuracy of the LDA model, we mainly discuss the accuracy of the LDA topic-extracting model as the machine learning (ML) approach in this section.

The hit rate is applied to evaluate the performance of the proposed framework. Since no previous algorithm exists for determining text-related contents, we take conventional algorithms (LFU, LRU) for comparison. The LFU approach is applied to extract the most frequently used keywords from the existing tweet text, while the LRU approach obtains the most recent keywords from the existing tweet text. The keywords extracted through the two conventional methods are applied to match actual media files. As an evaluation criterion, we propose the “hit portion”, which is the portion of utilized topics among all the extracted topics. Regarding the evaluation of the specific caching contents in this section, we demonstrate the difficulties of determining the caching contents of social media and propose our criteria to evaluate the model, so that the coherency between the topics and the caching contents is better demonstrated. Here, the testing datasets are the tweets on 27th February, while the training datasets are the tweets ranging from 27th January to 26th February.
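The two conventional baselines described above can be sketched as simple keyword extractors. This is an illustrative reading of the description, not the authors' code: the function names are hypothetical, tweets are assumed to be already cleaned and tokenized, and they are assumed to be supplied in chronological order.

```python
from collections import Counter
from typing import List, Sequence


def lfu_keywords(tweets: Sequence[Sequence[str]], k: int) -> List[str]:
    """Return the k most frequently used words across all tweets."""
    counts = Counter(word for tweet in tweets for word in tweet)
    return [word for word, _ in counts.most_common(k)]


def lru_keywords(tweets: Sequence[Sequence[str]], k: int) -> List[str]:
    """Return the k most recently used distinct words, newest first."""
    recent: List[str] = []
    for tweet in reversed(tweets):          # walk from the newest tweet backwards
        for word in reversed(tweet):
            if word not in recent:
                recent.append(word)
                if len(recent) == k:
                    return recent
    return recent
```

Both functions return flat keyword lists, in contrast to the ML approach, whose topics are groups of words that tend to occur together.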

A. TAC strategy

In this section, we demonstrate the strategy for caching actual media files (images and videos) according to the Twitter topics.

1) Caching objects:

Since the structure of our framework and the approaches to obtain the topics prediction have been demonstrated in the previous sections, we now introduce the caching strategy that associates the topics with actual caching contents (images, videos). As the tweet text has been cleaned in the dataset preparation procedure, the words in the tweet text are seldom meaningless and each of the words serves as an identifying symbol. The core idea is to decide through the predicted topics which tweets are valuable to cache. To judge whether a topic is associated with a tweet, we define that when a topic and the text of a tweet share 3 words, the media files associated with this tweet are worthwhile to cache. After a tweet is determined to be worth caching, i.e. hit by the topics, its media files are cached through the URLs retrieved from the JavaScript Object Notation (JSON) data structure, which is obtained from the Twitter API.
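The matching rule above can be written down in a few lines. The sketch below is only an illustration under stated assumptions: the dictionary keys "words" and "media_urls" are hypothetical stand-ins for fields parsed from the tweet JSON and are not actual Twitter API field names, and the rule is read as "at least 3 shared words".

```python
from typing import Dict, Iterable, List, Sequence


def is_hit(topic_words: Iterable[str], tweet_words: Iterable[str],
           min_shared: int = 3) -> bool:
    """A topic hits a tweet when they share at least min_shared distinct words."""
    return len(set(topic_words) & set(tweet_words)) >= min_shared


def media_urls_to_cache(tweets: Sequence[Dict], topics: Sequence[Sequence[str]],
                        min_shared: int = 3) -> List[str]:
    """Collect the media URLs of every tweet that is hit by at least one topic."""
    urls: List[str] = []
    for tweet in tweets:
        if any(is_hit(topic, tweet["words"], min_shared) for topic in topics):
            urls.extend(tweet["media_urls"])   # hypothetical field, see lead-in
    return urls
```

Comparing sets rather than lists means that a word repeated inside one tweet is counted only once toward the threshold.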

2) Prior list:

After gathering the actual media files and extracting topics from the existing tweet text, the relationship between the two parts is established. The popularity of the topics and of the actual caching objects can be determined through the HTTP requests from users. Since the tweet text generally contains a large number of topics and the topics themselves differ in popularity, the “Prior List” (PL) is applied to filter the most popular topics, in order to maximize the caching efficiency with a certain number of topics and a certain caching space for actual media files. Similarly, a PL of caching objects is a ranking list that contains the actual caching objects related to a specific topic, obtained through the Twitter API. The rankings of the caching objects are based on their frequency of usage, namely their popularity. As in the existing caching strategy that applies the popularity of caching objects [21], the efficiency of the caching algorithm increases. With the aim of precisely determining what to cache, we create a PL for each topic to preserve the media files related to it. The objective of the PL is to associate the extracted topics with the popularity of the caching objects, which improves the efficiency of the caching framework.

As the BS is capable of monitoring the wireless traffic through HTTP requests / responses, the PLs related to different topics vary dynamically after our caching framework is deployed at the BS. When the topics PL has reached the maximum number of topics and a new topic emerges, the popularity of the new topic is compared with those of the topics in the topics PL. When the popularity of the newly emerged topic exceeds that of any topic in the PL, the PL is updated. Moreover, the caching space is updated.
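The update rule of the topics PL can be sketched as a small bounded ranking structure. The class and method names are hypothetical, popularity is assumed to be a single number such as an HTTP request count, and evicting the least popular topic is one possible reading of the rule described above.

```python
from typing import Dict, List, Optional


class PriorList:
    """Bounded list of topics ranked by popularity (illustrative sketch)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.popularity: Dict[str, float] = {}

    def offer(self, topic: str, popularity: float) -> Optional[str]:
        """Insert or refresh a topic; return the evicted topic, if any."""
        if topic in self.popularity or len(self.popularity) < self.capacity:
            self.popularity[topic] = popularity
            return None
        weakest = min(self.popularity, key=self.popularity.get)
        if popularity > self.popularity[weakest]:
            del self.popularity[weakest]        # the new topic replaces the weakest one
            self.popularity[topic] = popularity
            return weakest
        return None                             # the new topic is not popular enough

    def ranked(self) -> List[str]:
        """Topics ordered from most to least popular."""
        return sorted(self.popularity, key=self.popularity.get, reverse=True)
```

The value returned by `offer` tells the caller which topic left the list, so that the media files kept for that topic can be released from the caching space.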

B. Evaluating the extracted topics

In this section, the performance of the ML approach as well as of the conventional LFU and LRU approaches is evaluated through four different numerical results: tweet hit rate, tweet hit portion, cache portion and hit cache portion. The LFU and LRU approaches extract keywords from the tweet text rather than topics as the ML approach does. The aim of this section is to compare the performance of the models as well as to evaluate the properties of the ML approach under different circumstances.
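The first two criteria can be sketched as follows. The sketch assumes that the tweet hit rate is the share of testing tweets hit by at least one topic, and that the tweet hit portion is the share of topics hit by at least one testing tweet, with a hit meaning at least 3 shared words as in the TAC strategy. The function names and the exact normalization are assumptions for illustration.

```python
from typing import Sequence


def _hit(topic: Sequence[str], tweet: Sequence[str], min_shared: int = 3) -> bool:
    """True when the topic and the tweet share at least min_shared distinct words."""
    return len(set(topic) & set(tweet)) >= min_shared


def tweet_hit_rate(topics: Sequence[Sequence[str]],
                   test_tweets: Sequence[Sequence[str]]) -> float:
    """Share of testing tweets that are hit by at least one topic."""
    if not test_tweets:
        return 0.0
    hits = sum(any(_hit(tp, tw) for tp in topics) for tw in test_tweets)
    return hits / len(test_tweets)


def tweet_hit_portion(topics: Sequence[Sequence[str]],
                      test_tweets: Sequence[Sequence[str]]) -> float:
    """Share of topics that are used, i.e. hit by at least one testing tweet."""
    if not topics:
        return 0.0
    used = sum(any(_hit(tp, tw) for tw in test_tweets) for tp in topics)
    return used / len(topics)
```

The two functions differ only in what they count. Adding topics can raise the hit rate while lowering the hit portion, because every unused topic enlarges the denominator of the latter.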

1) Tweet hit rate:

In this part, we demonstrate the hit rates of tweets in our model. The aim of considering tweet hit rates is to illustrate to what extent a model can associate the topics with the tweet text. In Fig. 10, the hit rates of three approaches are demonstrated: LFU, LRU and the machine learning (ML) approach. The X-axis of the graph is the number of topics. By fixing the number of topics extracted from the tweet text, we are able to discuss the performance under different caching spaces, since more topics lead to more media files to be cached. Rather than merely extracting keywords from the corpus like LFU and LRU, the ML approach actually extracts topics, i.e. cohesive keywords from the corpus, which helps the approach achieve high coherency between the topics and the tweet text. Therefore, the ML approach achieves satisfactory results in extracting topics of tweet text from the corpus.

Fig. 10. The hit rate of topics of different approaches.

With the aim of further exploring the properties of the machine learning approach, we discuss the performance of the method under different circumstances. To demonstrate the different trends under datasets of different sizes, we select 4 datasets from different regions. The selection of the datasets is based on their sizes (number of tweets), with the aim of investigating the model properties over different dataset ranges. Among the 4 testing datasets exploited to evaluate the model, the sizes of the training datasets are ordered as Location 5 > Location 8 > Location 9 > Location 6. In Fig. 11, the trends of the hit rates can be described as follows. 1) The hit rates of the model on all testing datasets increase when the number of topics increases. 2) The overall hit rates vary little among the different sizes of datasets, which means that the small variance among the testing datasets does not influence the overall accuracy. 3) The highest hit rates are achieved with the maximum number of topics, and they vary from 50% to 65% under different testing datasets. In conclusion, the wireless caching framework we propose is capable of generating relatively satisfactory results.

Fig. 11. The hit rate of topics in different regions.

2) Tweet hit portion:

The tweet hit portion displays the utilization ratio of the topics for the three approaches mentioned above: LFU, LRU and the machine learning (ML) approach. The tweet hit portion is the utilization ratio of the topics, namely the portion of topics hit by the future tweet text. The aim of introducing this property is also to demonstrate the effectiveness and accuracy of extracting topical information from the corpus. We define the tweet hit portion as the portion of topics hit by the testing tweet text dataset. The aim of this criterion is to illustrate the accuracy of the topics generated by the different approaches. Besides, the performance of the ML approach under different datasets is discussed.

In Fig. 12, the machine learning (ML) approach achieves the lowest hit portion, while LFU and LRU achieve higher hit portions. The apparent contradiction is that the ML approach achieves higher tweet hit rates but a lower hit portion under the same number of topics compared to the other two methods. The explanation is that LFU and LRU focus on more monotonic topics, the hot ones and the recent ones respectively, so the topics from these two approaches are more unified than those of the ML approach. As the ML approach generates a more heterogeneous topics prediction, its hit portion is lower than those of LFU and LRU.

Fig. 12. The hit portion of topics of different approaches.

To investigate the properties of the ML approach under datasets of different sizes, we choose the 4 datasets from different regions. The selection of the datasets is based on the same reason as in the tweet hit rate section. In Fig. 13, the trends of the hit portions are demonstrated. The results can be explained from the following aspects. 1) The utilization ratios of the caching topics decrease as the number of extracted topics increases. 2) The overall trends of the utilization ratios vary remarkably among different sizes of testing datasets. Moreover, larger datasets (the Location 5 and Location 8 datasets) are able to achieve higher utilization ratios. 3) The highest utilization ratios are achieved with the fewest extracted topics, and the maximum utilization ratio is approximately 60% with the largest testing dataset, that of Location 5.

Fig. 13. The hit portion of topics in different regions.

C. Problem of evaluation and proposed solution

During the collection of the datasets, we noticed an obstacle to evaluating the effectiveness and accuracy of our model. The tweets with geo-tags are largely original ones, which are created by the users themselves rather than retweeted. As a consequence, the overlap of media files between two different days is particularly small. Different from conventional video sites, the contents of social platforms are largely published by the public, which makes the determination of caching contents nontrivial. However, since the traffic of social media platforms can be divided into two aspects, viewing (downloading) and posting (uploading), we propose a unified approach to evaluate the model. Users prefer not only to go through the contents (text, images, videos) that they favor, but also to post their own contents associated with the topic, such as a video related to a popular kind of pet in that region. Therefore, when the media contents of the next day are similar to the media contents of the present day, the media contents cached today are highly likely to be viewed by the users. Based on the reasoning above, we propose two evaluation criteria, namely “cache portion” and “hit cache portion”, to evaluate the models.

In this section, the performance of the models is evaluated through concrete text-related caching contents (images, videos). The numerical results are presented from two different aspects.

1) Cache portion:

The first part is the cache portion. The cache portion is the portion of media files (text-related caching contents) that are cached after the caching-content determination procedure. This criterion represents how much of the existing caching contents is cached based on the topics, which relates tightly to the size of the occupied caching space at the BS. With a higher cache portion, the requirements for caching space are higher, since more contents are cached. Here, the portion is the size of the files that are cached divided by the total size of the media files in the training dataset.

In Fig. 14, the cache portions of the 3 approaches, LFU, LRU and the machine learning (ML) approach, are illustrated. In the figure, LFU maintains the highest cache portion, up to about 90%, which means that under our caching strategy 90% of the existing contents are cached to fulfill the future needs. While LRU achieves the lowest cache portion (70%), the ML approach maintains a medium cache portion of approximately 75% of the existing media files. The curve of the ML approach also demonstrates that when the number of topics reaches 100-150, the ML approach achieves a stable status: increasing the topic number results in little increase of the contents to be cached. These results are combined with the “hit cache portion” to obtain further conclusions.

Fig. 14. The caching portion of the training dataset (different approaches).

2) Hit cache portion:

In this section, the hit cache portion is employed as the criterion to evaluate the models. The hit cache portion is the portion of the footprint of the media files in the testing dataset that is hit by the obtained topics. As mentioned before, since the conventional “hit rate” measurement is not suitable for the Twitter caching scenario, we propose this criterion to demonstrate the coherency between the predicted topics and the future media contents in the testing dataset.

Fig. 15. The hit caching portion of the testing dataset (different approaches).
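Both size-based criteria, the cache portion and the hit cache portion, can be computed with the same routine applied to different datasets. The sketch below is an illustrative reading of the two definitions: each tweet is assumed to be a pair of its word list and the total size in bytes of its media files, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple

Tweet = Tuple[List[str], int]   # (cleaned words, total size of attached media in bytes)


def _hit(topic: Sequence[str], words: Sequence[str], min_shared: int = 3) -> bool:
    """True when the topic and the tweet share at least min_shared distinct words."""
    return len(set(topic) & set(words)) >= min_shared


def size_portion(tweets: Sequence[Tweet], topics: Sequence[Sequence[str]]) -> float:
    """Share of the total media size that belongs to tweets hit by at least one topic."""
    total = sum(size for _, size in tweets)
    if total == 0:
        return 0.0
    hit_size = sum(size for words, size in tweets
                   if any(_hit(topic, words) for topic in topics))
    return hit_size / total


# cache portion:     size_portion(training_tweets, topics)
# hit cache portion: size_portion(testing_tweets, topics)
```

Because both values are weighted by file size rather than by tweet count, a single large video contributes more to the result than many small images.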

In Fig. 15, the hit cache portions of the three approaches are demonstrated. According to the graph, LFU and ML achieve much better performance (approximately 75%) compared to LRU (less than 60%). This leads to the conclusion that the ML approach and the LFU approach are capable of associating the obtained topics with the actual caching contents. However, considering the results from the “cache portion” section, the LFU approach actually costs more footprint to achieve the satisfactory result. This leads to the conclusion that our ML approach caches less redundant content compared to the conventional LFU method. The reason for not achieving 100% caching accuracy is that some caching contents in the testing dataset are irrelevant to the existing media files, which leads to the situation that the topics cannot be associated with this part of the contents.

To conclude, the machine learning approach we propose is evaluated through several different criteria against two conventional caching algorithms. With a higher tweet hit rate, our proposed ML approach is capable of achieving satisfying topic prediction results from the tweet text aspect. Regarding the caching contents evaluation demonstrated above, the ML approach is capable of achieving high consistency between the topics and the future caching contents. The cache portion results illustrate that the ML approach achieves less caching redundancy.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, a novel Twitter aided content caching (TAC) framework, which associates tweets with BS information, was proposed. To associate Twitter events with the relevant BS, a dataset was established to map tweets to their corresponding BSs. Three machine learning approaches for associating Twitter topics with the geographic information of BSs were evaluated. Compared to the LSTM model with skip-gram embedding, the LDA and LDA-based approaches were capable of generating satisfactory prediction results in different regions. Regarding the results and the tendency of perplexity, our novel LSTM model with skip-gram-Geo-aware embedding was well suited to processing tweets with BS information. With the aid of machine learning based wireless caching techniques, the redundancy of the caching content was diminished. By combining different types of machine learning approaches, the extracted topics are capable of providing guidance on what text-related contents to cache at the BS. Given the different situations of caching for social media platforms and for traditional media content providers, such as video sites, specific evaluation criteria were proposed to demonstrate the effectiveness and accuracy of our proposed framework.

REFERENCES

[1] Y. Qi, Z. Yang, Z. Qin, Y. Liu, and Y. Chen, “Big data prediction in location-aware wireless caching: A machine learning approach,” in Proc. IEEE Global Commun. Conf. (GLOBECOM), 2019, pp. 1–6.
[2] Cisco, “Cisco visual networking index: Global mobile data traffic forecast update, 2017-2022,” White paper, 2019. [Online]. Available: https://s3.amazonaws.com/media.mediapost.com/uploads/CiscoForecast.pdf
[3] A. Ioannou and S. Weber, “A survey of caching policies and forwarding mechanisms in information-centric networking,” IEEE Commun. Surv. Tut., vol. 18, no. 4, pp. 2847–2886, Fourthquarter 2016.
[4] M. Tang, L. Gao, and J. Huang, “Communication, computation, and caching resource sharing for the internet of things,” IEEE Commun. Mag., vol. 58, no. 4, pp. 75–80, Apr. 2020.
[5] C. Wang, C. Liang, F. R. Yu, Q. Chen, and L. Tang, “Computation offloading and resource allocation in wireless cellular networks with mobile edge computing,” IEEE Trans. Wireless Commun., vol. 16, no. 8, pp. 4924–4938, Aug. 2017.
[6] L. Breslau, P. Cao, L. Fan, et al., “Web caching and Zipf-like distributions: evidence and implications,” in Proc. IEEE INFOCOM, vol. 1, Mar. 1999, pp. 126–134.
[7] Statista, “Number of social network users worldwide from 2010 to 2023,” White paper
[8] IEEE Trans. Netw. Scie. Engin., vol. 7, no. 1, pp. 193–204, Jan. 2020.
[9] A. Xiao, X. Huang, S. Wu, C. Jiang, L. Ma, and Z. Han, “User preference aware resource management for wireless communication networks,” IEEE Netw., vol. 34, no. 3, pp. 78–85, May 2020.
[10] K. Zhu, W. Zhi, X. Chen, and L. Zhang, “Socially motivated data caching in ultra-dense small cell networks,” IEEE Netw., vol. 31, no. 4, pp. 42–48, Jul. 2017.
[11] E. Zeydan, E. Bastug, M. Bennis, M. A. Kader, I. A. Karatepe, A. S. Er, and M. Debbah, “Big data caching for networking: moving from cloud to edge,” IEEE Commun. Mag., vol. 54, no. 9, pp. 36–42, Sep. 2016.
[12] G. Lansley and P. A. Longley, “The geography of twitter topics in London,” J. Comput. Environ. Urban., vol. 58, pp. 85–96, 2016.
[13] B. O’Connor et al., “From tweets to polls: Linking text sentiment to public opinion time series,” Int’l AAAI Conf. on Weblogs and Social Media (ICWSM), vol. 11, no. 122-129, pp. 1–2, May 2010.
[14] S. T. Piantadosi, “Zipf’s word frequency law in natural language: A critical review and future directions,” Psychonomic bulletin & review, vol. 21, no. 5, pp. 1112–1130, 2014.
[15] N. Golrezaei, K. Shanmugam, A. G. Dimakis, et al., “Femtocaching: Wireless video content delivery through distributed caching helpers,” in Proc. IEEE Int. Conf. on Comput. Commun. (INFOCOM), Mar. 2012, pp. 1107–1115.
[16] K. Zhang, S. Leng, Y. He, S. Maharjan, and Y. Zhang, “Cooperative content caching in 5G networks with mobile edge computing,” IEEE Wireless Commun., vol. 25, no. 3, pp. 80–87, Jun. 2018.
[17] N. Golrezaei, A. F. Molisch, A. G. Dimakis, et al., “Femtocaching and device-to-device collaboration: A new architecture for wireless video distribution,” IEEE Commun. Mag., vol. 51, no. 4, pp. 142–149, Apr. 2013.
[18] E. Bastug, M. Bennis, and M. Debbah, “Living on the edge: The role of proactive caching in 5G wireless networks,” IEEE Commun. Mag., vol. 52, no. 8, pp. 82–89, Aug. 2014.
[19] H. Pang, J. Liu, X. Fan, and L. Sun, “Toward smart and cooperative edge caching for 5G networks: A deep learning based approach,” in , Jun. 2018, pp. 1–6.
[20] E. Baştuğ, M. Bennis, and M. Debbah, “A transfer learning approach for cache-enabled wireless networks,” in Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt). IEEE, 2015, pp. 161–166.
[21] Z. Yang, Y. Liu, Y. Chen, and L. Jiao, “Learning automata based Q-learning for content placement in cooperative caching,”

IEEE Trans. Commun. , vol. 68, no. 6, pp. 3667–3680, 2020.[22] B. Yang, W. Guo, B. Chen, G. Yang, and J. Zhang, “Estimating mobile trafﬁc demand using twitter,”

IEEE WirelessCommun. Lett. , vol. 5, no. 4, pp. 380–383, 2016.[23] D. Ramage, S. T. Dumais, and D. J. Liebling, “Characterizing microblogs with topic models.”

Int’l AAAI Conf. on Weblogsand Social Media (ICWSM) , vol. 10, no. 1, p. 16, May. 2010.[24] J. Wang and M. She, “Probabilistic latent semantic analysis for multichannel biomedical signal clustering,”

IEEE SignalProcessing Lett. , vol. 23, no. 12, pp. 1821–1824, 2016.[25] M. Steyvers and T. Grifﬁths, “Probabilistic topic models,”

Handbook of latent semantic analysis , vol. 427, no. 7, pp.424–440, 2007.[26] A. Fang, C. Macdonald, I. Ounis, and P. Habel,

Topics in Tweets: A User Study of Topic Coherence Metrics for TwitterData . Springer International Publishing, 2016.[27] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”

Neural computation , vol. 9, no. 8, pp. 1735–1780, 1997.[28] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in

Proc. of Neural InformationProcessing Systems (NIPS) , 2014.[29] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,”

J. Mach. Learn. Res. , vol. 3, no. Jan, pp. 993–1022,2003.[30] K. Cho, B. Van Merri¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phraserepresentations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078 , 2014.[31] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” arXiv preprint arXiv:1409.2329 ,2014.[32] Bishop and C. M, “Pattern recognition and machine learning,”