An Iterative Refinement Approach for Social Media Headline Prediction

Abstract

In this study, we propose a novel iterative refinement approach to effectively predict the popularity score of social media posts from their metadata. With the rapid growth of social media on the Internet, adequately forecasting view count or popularity has become increasingly important. Conventionally, ensemble approaches such as random forest regression achieve high and stable performance on various prediction tasks. However, most regression methods cannot precisely predict extremely high or low values. To address this issue, we first predict the initial popularity scores and compute their residuals. We then adopt an ensemble regressor to compensate for the residuals of those extreme values and further improve prediction performance. Comprehensive experiments demonstrate that the proposed iterative refinement approach outperforms state-of-the-art regression approaches.


Chih-Chung Hsu

Department of Management Information Systems, National Pingtung University of Science and Technology (NPUST), Pingtung, [email protected]

Chia-Yen Lee

Department of Electrical Engineering, National United University (NUU), Miaoli, [email protected]

Ting-Xuan Liao

Department of Management Information Systems, [email protected]

Jun-Yi Lee

Department of Management Information Systems, [email protected]

Tsai-Yne Hou

Department of Management Information Systems, [email protected]

Ying-Chu Kuo

Department of Management Information Systems, [email protected]

Jing-Wen Lin

Department of Management Information Systems, [email protected]

Ching-Yi Hsueh

Department of Management Information Systems, [email protected]

Zhong-Xuan Zhang

Department of Electrical Engineering, [email protected]

Hsiang-Chin Chien

Department of Electrical Engineering, [email protected]


KEYWORDS

Random forest, ensemble learning, regression, iterative refinement.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

MM ’18, October 22–26, 2018, Seoul, Republic of Korea
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5665-7/18/10.
https://doi.org/10.1145/3240508.3266443

ACM Reference Format:

Chih-Chung Hsu, Chia-Yen Lee, Ting-Xuan Liao, Jun-Yi Lee, Tsai-Yne Hou, Ying-Chu Kuo, Jing-Wen Lin, Ching-Yi Hsueh, Zhong-Xuan Zhang, and Hsiang-Chin Chien. 2018. An Iterative Refinement Approach for Social Media Headline Prediction. In MM ’18, October 22–26, 2018, Seoul, Republic of Korea. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3240508.3266443

Popularity prediction for social media has become more important with the rapid growth of social networks such as Facebook, Flickr, and Pinterest. Once the headline of posts or pictures can be predicted, advertisements related to the headline can be placed alongside it. We can also synthesize a post (headline) with high popularity according to the features of the headline. Therefore, how to adequately address headline prediction for social media remains a significant challenge.

Recently, machine learning approaches have been widely used in various tasks such as popularity score prediction, object recognition, and time-series signal analysis. For example, a large-scale social media dataset, the Social Media Headline Prediction Dataset (SMHPD), is collected in [13]. It includes 305,614 posts with metadata, images, and time-zone information. With SMHPD, our goal is to learn the popularity of the posts from the metadata only, without image information, so that prediction is fast. Besides, the content of an image may not match that of the corresponding post, in which case the image does not help the prediction task.

To adequately address the social media headline prediction task, each metadata field of SMHPD should be carefully processed. As described in the social media headline prediction task in [13], each post has 15 metadata properties: a unique picture ID (pid) and user ID (uid); the posted date (date); the category and subcategory the picture belongs to (cat. and subcat.); concept; path alias for the image (alias); whether the picture is public to all users (ispublic); media status (status); title; media type (type); all tags (tags); and geographic information, namely latitude (lat.), longitude (lon.), and geo-accuracy (acc.).

In general, support vector regression (SVR) [1][14][12] and random forest regression (RFR) [6] show outstanding performance among the traditional regression models. However, the inputs to SVR must be preprocessed first so that features with large values do not bias the prediction results [10]. Moreover, the data types of social media metadata differ significantly, so the performance of SVR-based popularity prediction may be limited. Random forest regression accommodates heterogeneous data such as social media information and achieves high performance. Deep neural network regression (DNNR) also achieves excellent performance on various regression tasks [4], but model selection and training strategy remain a big challenge. Since the popularity scores in SMHPD contain some extreme values, which lower the performance of those standard regressors, we propose a novel iterative refinement approach to resolve this problem.

The main contribution of this paper is two-fold: i) we propose an iterative refinement approach to deal with the extreme value regression task, and ii) we carefully treat the social information and analyze feature importance to achieve the best performance.

The rest of this paper is organized as follows. Section 2 presents the proposed iterative refinement approach for social media popularity prediction. In Sec. 3, experimental results are demonstrated. Finally, conclusions are drawn in Sec. 4.

Given the metadata of a post as a 15-dimensional vector $X$, the predicted value is obtained by $y = h(X, \theta)$, where $\theta$ denotes the learned parameters of the random forest regression. Some features are of string data type and therefore cannot be used in numerical computation directly. One advanced approach to convert string data into an encoded vector is word embedding [5]. However, the string descriptions in the metadata are written in different languages, so a single word embedding cannot be applied across them. Instead, we first adopt a simple strategy to make the metadata $X$ computable and analyze the feature importance. The digitization of features with high importance should be treated more carefully. To identify those features, we adopt RFR to compute the feature importance.

First, we make the metadata computable as follows:
• UID and PID: Converted to integer type.
• Cat. and Subcat.: Given a unique integer value for each category or subcategory.
• Concept: Converted to the length of the text description.
• Alias: Converted to the length of the text description.
• IsPublic: Treated false as 0 and true as 1.
• Status: Converted to the length of the text description.
• Title and Tags: Converted to the count of words.
• Type: Converted to the length of the text description.
• Date: Converted to integer type.
• Lat., Acc., and Lon.: Converted to floating-point type.

Figure 1: The feature importance analysis of the metadata $x^c$.

As shown in Fig. 1, some of the social information is clearly more important than the rest. For example, UID, PID, tags, and date are significantly more important than the others. The identity-related social information, PID and UID, needs no further preprocessing because it is already computable. Consequently, we pay more attention to the social information that needs to be preprocessed.

We focus on improving the preprocessing strategy for the following five fields: category, subcategory, concept, title, and tags. Since the sets of categories and subcategories are finite, these two fields can be converted to unique identity numbers. Toward this end, we first find the repeated items and remove them to obtain the unique items of the whole field. Then, each unique item is assigned a number so that it becomes computable. The pseudo-code is given in Algorithm 1. We adopt Algorithm 1 to obtain the unique IDs of category, subcategory, and concept. To avoid the problems caused by the use of different languages, we instead convert tags and title to the length of their text description to simplify the learning task.
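The unique-ID conversion and the simple text-to-number rules described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names and the dictionary field names (`ispublic`, `title`, `tags`) are hypothetical, and a dictionary lookup is swapped in for the nested loops of the pseudo-code, which yields the same kind of mapping (one integer per distinct item) in a single pass.

```python
from typing import Dict, Hashable, List, Sequence


def unique_id_converter(values: Sequence[Hashable]) -> List[int]:
    """Map each distinct item to an integer ID (1, 2, ...) in order of first appearance."""
    ids: Dict[Hashable, int] = {}
    encoded: List[int] = []
    for item in values:
        if item not in ids:
            ids[item] = len(ids) + 1  # next unused ID
        encoded.append(ids[item])
    return encoded


def digitize_post(post: dict) -> dict:
    """Apply the simple rules to one metadata record (field names are assumptions)."""
    return {
        "ispublic": 1 if post["ispublic"] else 0,  # false -> 0, true -> 1
        "title_len": len(post["title"]),           # length of the text description
        "tags_len": len(post["tags"]),
    }


categories = ["pets", "travel", "pets", "food", "travel"]
category_ids = unique_id_converter(categories)
example = digitize_post({"ispublic": True, "title": "My cat", "tags": "cat pet cute"})
print(category_ids, example)
```

Using the text length rather than a word embedding keeps the conversion independent of the language the title or tags are written in, which is the motivation given in the text.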

Algorithm 1 Unique ID Converter
procedure Find the unique ID
    Given a feature $X = [x_1, x_2, ..., x_N]$ and count = 0
    $X_{copy} \leftarrow X$, $X_{copy} = [x^c_1, x^c_2, ..., x^c_N]$
    for i = 1 to N do
        item $\leftarrow x^c_i$
        for j = 1 to N do
            if item = $x^c_j$ and i != j then
                Remove($x^c_j$) from $X_{copy}$
        count = count + 1
    for i = 1 to count do
        $d_i \leftarrow i$
    for i = 1 to N do
        for j = 1 to count do
            if $x_i = x^c_j$ then
                $x_i = d_j$
    return X

Random forests for regression are formed by growing decision trees that depend on a random vector $\theta$, so that the predictor (i.e., a decision tree) $h(X, \theta)$ takes on numerical values as opposed to class labels. In general, the prediction task is defined by the mean-squared error

$$E_{X,Y}\,(Y - h(X))^2, \tag{1}$$

where $X$ and $Y$ are the training samples and labels (popularity scores). The popularity score of any test sample $X_t$ is then predicted by the learned random forest $h(X_t, \theta)$ obtained by minimizing the above error. In general, regression methods adopt a smoothness regularization term to avoid overfitting. However, this also smooths the prediction results, making extreme values hard to predict [9]. In our case, posts with extremely high popularity are more likely to become a headline, so it is worthwhile to predict these extreme values correctly.

Given the preprocessed social training set $X_s$, we first obtain the initial prediction $P_s$ as

$$P_s = h(X_s, \theta). \tag{2}$$

Afterward, the residual between the ground truths and the predicted values is computed as

$$R = Y - P_s. \tag{3}$$

By observing the distribution of the residuals, we find that a single regressor (i.e., a random forest) on the social information of SMHPD cannot exactly predict the higher and lower popularity scores of the posts. It is well known that a typical regression model tends to fit the data distribution smoothly to prevent overfitting.

To effectively compensate for the residuals of the predicted popularity score $P_s$, we propose an iterative refinement approach that improves prediction performance, especially for extreme values. Let the first predicted popularity score be $P_s$ and its residual be $R$. The first goal is to learn which samples have extremely high or low popularity scores. Toward this end, it is necessary to learn a classifier

$$g(X_s) = C(X_s \mid \theta_s), \tag{4}$$

where $g(X_s)$ is either $-1$ (non-extreme value) or $1$ (extreme value). There are several ways to learn the classifier with parameters $\theta_s$, such as a support vector machine (SVM), a random forest classifier (RFC), or an AdaBoost [2] classifier. In this study, we adopt AdaBoost as the classifier.
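Equations (2) to (4) can be illustrated with scikit-learn as below. This is a hedged sketch on synthetic data, not the authors' code: the data generator, the hyperparameters, and the rule used to derive the ±1 labels are assumptions. The labels are obtained by thresholding the absolute residual at a fraction of its largest value, mirroring how the paper's parameter study expresses $t_y$ as a percentage of the highest residual.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the preprocessed metadata X_s and popularity scores Y.
# Heavy-tailed noise produces a few extreme target values.
X_s = rng.normal(size=(600, 5))
Y = X_s[:, 0] + 0.5 * X_s[:, 1] ** 2 + rng.standard_t(df=2, size=600)

# Eq. (2): initial prediction P_s = h(X_s, theta).
h = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_s, Y)
P_s = h.predict(X_s)

# Eq. (3): residual R = Y - P_s.
R = Y - P_s

# Assumed labeling rule: +1 (extreme) when |R| reaches 25% of the largest |R|, else -1.
t_y = 0.25 * np.abs(R).max()
labels = np.where(np.abs(R) >= t_y, 1, -1)

# Eq. (4): classifier g(X_s) = C(X_s | theta_s), here AdaBoost.
g = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_s, labels)
pred = g.predict(X_s)
print("samples flagged as extreme:", int((labels == 1).sum()))
```

The classifier only needs the features at test time, which is why it is trained to predict the residual-derived label from $X_s$ rather than from the residual itself.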
In general, the loss function should be

$$L(X_s) = \sum_{i=1}^{N} l(C(X_s, R)), \tag{5}$$

where $l$ is the loss function defined by the learning approach and $L$ is the total loss. Directly solving the above equation is relatively hard because $R$ is not a binary class. To solve this issue, we predefine a threshold $t_y$ that separates the samples into two groups, those with larger residuals and those with regular residuals (called $R_t$). Therefore, the loss function becomes

$$L(X_s) = \sum_{i=1}^{N} l(C(X_s, R_t)). \tag{6}$$

Intuitively, larger values in $R$ indicate poor predictions, which also implies that an extreme value may be present in $Y$. Let $g_i$ denote the classifier and $h_i$ the regressor at the $i$-th iteration. We need to learn $k$ regressors and classifiers based on the prediction residuals $R_{t_i}$ and the training samples $X_{R_i}$ with larger residuals. Finally, the compensation function at iteration $i$ is defined as

$$P_{s_i} = R_i + P_{s_{i-1}} = h_i(X_{R_i}, \theta_i) + h_{i-1}(X_{R_i}, \theta_{i-1}), \tag{7}$$

where $X_{R_i}$ is $X_s$ at iteration 0 and $X_{R_i} = [X_s \mid g_i(X_s) = 1]$ otherwise. By controlling the predefined threshold $t_y$, it is easy to decide the number of samples whose predictions are compensated. If we set $t_y = 0$, the residual compensation is performed on all prediction results. The training process of the proposed iterative refinement approach is illustrated in Fig. 2. In the test phase, the test sample is simply fed to the learned RFR, and the $k$ iterative refinement steps are performed to obtain the final predicted value $P_{final}$.

In the experiments, the social media headline prediction challenge dataset (SMHPD), containing 305,614 posts [13][14][12], is used to evaluate the popularity prediction performance of the proposed method and other state-of-the-art methods. To fairly verify the performance of the proposed method, we partition SMHPD into 300,000 training samples and 5,614 test samples. We use two different partitions of SMHPD. First, we randomly split SMHPD into training and test sets without considering time order (Set-A). Second, we follow the instructions in [13][14][12] to partition SMHPD into training and test sets in time order (Set-B). We also download 275,066 images from Flickr for performance comparison. Unavailable images are replaced with a black image (i.e., all pixel values in the image are zero). The performance metrics are rank correlation (Spearman’s Rho) [7], Mean Absolute Error (MAE), and Mean Squared Error (MSE), where rank correlation is a nonparametric measure of statistical dependence between the rankings of two variables.

To compare the performance of popularity prediction, we collect six state-of-the-art regression methods: 1) the multi-model approach proposed in [3], 2) a standard random forest regressor, 3) SVR with a radial basis kernel, 4) the AdaBoost regressor [2], 5) the Naive Bayes regressor, and 6) linear regression.
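The training and test procedure of the iterative refinement approach, together with the three evaluation metrics, can be sketched end to end as follows. This is an illustrative reading of the method under stated assumptions rather than the authors' implementation: the function names `fit_refinement` and `predict_refinement`, the synthetic data, the hyperparameters, and the choice of expressing the threshold as a fraction `frac` of the largest absolute residual are all assumptions. No performance claim is attached to the numbers it prints.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import AdaBoostClassifier, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error


def fit_refinement(X, y, k=2, frac=0.25, seed=0):
    """Train an initial regressor plus k (classifier, residual regressor) stages."""
    base = RandomForestRegressor(n_estimators=50, random_state=seed).fit(X, y)
    pred = base.predict(X)
    stages = []
    for i in range(1, k + 1):
        resid = y - pred
        t_y = frac * np.abs(resid).max()
        extreme = np.abs(resid) >= t_y  # samples whose prediction gets compensated
        if extreme.all():
            g = None  # t_y = 0: every sample is compensated, no classifier needed
        else:
            g = AdaBoostClassifier(n_estimators=50, random_state=seed + i)
            g.fit(X, np.where(extreme, 1, -1))
        # Residual regressor trained only on the large-residual samples.
        h = RandomForestRegressor(n_estimators=50, random_state=seed + i)
        h.fit(X[extreme], resid[extreme])
        pred[extreme] += h.predict(X[extreme])
        stages.append((g, h))
    return base, stages


def predict_refinement(base, stages, X):
    """Initial prediction followed by compensation on samples flagged as extreme."""
    pred = base.predict(X)
    for g, h in stages:
        mask = np.ones(len(X), dtype=bool) if g is None else g.predict(X) == 1
        if mask.any():
            pred[mask] += h.predict(X[mask])
    return pred


rng = np.random.default_rng(0)
X = rng.normal(size=(800, 6))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.standard_t(df=3, size=800)
X_train, y_train, X_test, y_test = X[:600], y[:600], X[600:], y[600:]

base, stages = fit_refinement(X_train, y_train, k=2, frac=0.25)
y_hat = predict_refinement(base, stages, X_test)

rho, _ = spearmanr(y_test, y_hat)  # rank correlation (Spearman's Rho)
mse = mean_squared_error(y_test, y_hat)
mae = mean_absolute_error(y_test, y_hat)
print(f"Spearman rho={rho:.3f}  MSE={mse:.3f}  MAE={mae:.3f}")
```

Calling `fit_refinement(..., frac=0.0)` reproduces the $t_y = 0$ case described in the text, in which compensation is applied to every prediction and the classifier is skipped.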


Figure 2: The proposed iterative refinement approach for the social media headline prediction.

Table 1: Performance comparison of the different regression methods on test Set-A (partitioned randomly).

Methods                     Rank correlation   MSE     MAE
Naive Bayes Regressor       0.312              7.595   2.107
SVR                         0.351              5.411   1.846
Linear Regression           0.423              5.068   1.785
AdaBoosting Regression      0.883              1.442   0.671
Random Forest               0.886              1.415   0.662
Multi-model Approach [3]    0.901              1.283   0.630
Proposed method             0.919              1.185   0.…

Table 2: Performance comparison of the different regression methods on test Set-B (partitioned by time order).

Methods                     Rank correlation   MSE     MAE
Naive Bayes Regressor       0.417              5.196   1.814
SVR                         0.441              4.999   1.769
Linear Regression           0.424              5.186   1.803
AdaBoosting Regression      0.594              3.967   1.541
Random Forest               0.886              1.418   0.663
Multi-model Approach [3]    0.846              1.838   0.748
Proposed method             0.908              1.193   0.…

Table 1 shows the prediction performance of the proposed method and six state-of-the-art regression methods on test Set-A, which is partitioned randomly. In terms of MSE, MAE, and rank correlation, the proposed iterative refinement approach achieves the best performance. The multi-model method [3] also performs well. Since the images of the posts are usually noisy, the deep neural network in [3] may not find enough meaningful information to improve the predicted results. In contrast, the proposed refinement approach concentrates on finding the most useful clues in the metadata of SMHPD to compensate for the prediction residuals, which leads to its outstanding performance. Compared with our method, the other methods cannot achieve promising results because of the highly complex nature of the metadata.

Table 2 presents the performance comparison of the proposed method and the other methods on test Set-B. We note that the performance of the stronger methods (AdaBoost, random forest, the multi-model approach, and our iterative refinement approach) on test Set-B is lower than on Set-A. A possible reason is that none of the methods carefully models temporal information.
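To make the size of the gap concrete, the relative MSE reduction of the proposed method over the two strongest baselines can be computed directly from the MSE columns of Tables 1 and 2. The snippet below is only arithmetic on the reported numbers; the helper name `relative_reduction` is introduced here for illustration.

```python
# MSE values copied from Tables 1 and 2.
mse = {
    "Set-A": {"Random Forest": 1.415, "Multi-model [3]": 1.283, "Proposed": 1.185},
    "Set-B": {"Random Forest": 1.418, "Multi-model [3]": 1.838, "Proposed": 1.193},
}


def relative_reduction(baseline: float, proposed: float) -> float:
    """Percentage by which the proposed MSE is lower than the baseline MSE."""
    return 100.0 * (baseline - proposed) / baseline


reductions = {
    (split, name): round(relative_reduction(value, scores["Proposed"]), 1)
    for split, scores in mse.items()
    for name, value in scores.items()
    if name != "Proposed"
}
for (split, name), pct in reductions.items():
    print(f"{split}: {pct}% lower MSE than {name}")
```

On these numbers the reduction relative to the random forest is about 16% on both splits, while the reduction relative to the multi-model approach grows from about 8% on Set-A to about 35% on Set-B.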

In the proposed method, two critical parameters need to be determined. The first is the predefined threshold $t_y$ that partitions the residuals into two groups: the lower the threshold $t_y$, the more predicted results are compensated. The second is the number of iterations $k$ of the proposed refinement approach. Intuitively, the more iterations are performed, the higher the performance we can achieve. To find the best parameter setting, we conduct two experiments to determine these two parameters.

For the selection of $k$, the best performance is obtained at $k = 4$. Note that this experiment is conducted with $t_y = 0$. Remarkably, however, a large performance gain already appears at iteration 2. In practice, we suggest $k = 2$ for time-limited applications; otherwise, $k$ can be set to 3–4.

For the selection of $t_y$, we set $t_y$ to 80%, 50%, 25%, 12%, 6%, 3%, 1%, and 0% of the highest value in the residuals, with $k = 2$. The best performance was observed at $t_y =$ …; $t_y$ can be set to a lower value when resources are adequate and to 25% under resource-limited conditions.

In this study, we have proposed an effective and efficient iterative refinement approach for social media headline prediction. The main contribution of the proposed method is to address the problem of extreme value prediction by progressively refining the predicted popularity scores. Since the proposed method operates on the metadata only, its computational complexity is also relatively low. Comprehensive experiments demonstrated that the proposed method is effective and efficient.

ACKNOWLEDGMENTS

This work was supported in part by the Ministry of Science and Technology of Taiwan under grant MOST 105-2628-E-224-001-MY3.

REFERENCES

[1] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2, 3 (2011), 27.
[2] Michael Collins, Robert E Schapire, and Yoram Singer. 2002. Logistic regression, AdaBoost and Bregman distances. Machine Learning 48, 1-3 (2002), 253–285.
[3] Chih-Chung Hsu, Ying-Chin Lee, Ping-En Lu, Shian-Shin Lu, Hsiao-Ting Lai, Chihg-Chu Huang, Chun Wang, Yang-Jiun Lin, and Weng-Tai Su. 2017. Social media prediction based on residual learning and random forest. In Proceedings of the 2017 ACM on Multimedia Conference. ACM, 1865–1870.
[4] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature …
[5] … Advances in Neural Information Processing Systems. 2177–2185.
[6] Andy Liaw, Matthew Wiener, et al. 2002. Classification and regression by randomForest. R News 2, 3 (2002), 18–22.
[7] John H McDonald. 2009. Handbook of biological statistics. Vol. 2.
[8] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[9] K Murphy. 2012. Machine learning: a probabilistic approach. Massachusetts Institute of Technology (2012), 1–21.
[10] Nasser M Nasrabadi. 2007. Pattern recognition and machine learning. Journal of Electronic Imaging 16, 4 (2007), 049901.
[11] Xia Wang and Dipak K. Dey. 2011. Generalized extreme value regression for ordinal response data. Environmental and Ecological Statistics 18, 4 (01 Dec 2011), 619–634. https://doi.org/10.1007/s10651-010-0154-8
[12] Bo Wu, Wen-Huang Cheng, Yongdong Zhang, and Tao Mei. 2016. Time Matters: Multi-scale Temporalization of Social Media Popularity. In Proceedings of the 2016 ACM on Multimedia Conference (ACM MM).
[13] Bo Wu, Wen-Huang Cheng, Yongdong Zhang, Huang Qiushi, Li Jintao, and Tao Mei. 2017. Sequential Prediction of Social Media Popularity with Deep Temporal Context Networks. In International Joint Conference on Artificial Intelligence (IJCAI).
[14] Bo Wu, Tao Mei, Wen-Huang Cheng, and Yongdong Zhang. 2016. Unfolding Temporal Dynamics: Predicting Social Media Popularity Using Multi-scale Temporal Decomposition. In