A Comprehensive Pipeline for Hotel Recommendation System
J. Chen, Z. Gao
Department of Computer Science, VU University Amsterdam
[email protected]
Abstract.
This paper presents a comprehensive pipeline for building a hotel recommendation system from raw data collected by apps on users' smartphones. The pipeline consists mainly of pre-processing the raw data and training prediction models. We use two methods: a Support Vector Machine (SVM) and a Recurrent Neural Network (RNN). The results show that both methods achieve reasonable accuracy once the raw data have been pre-processed. We therefore conclude that this paper provides a comprehensive pipeline in which a hotel recommendation system is successfully built, from the raw data to a specific application.
Introduction

In this paper, we describe a comprehensive pipeline with two methods that predict the mood of a user on the next day from the data obtained from that user on the preceding days. Based on the predicted mood, we then provide hotel recommendations to the user. The methods are compared against a benchmark that simply predicts the mood on the next day to be the same as on the previous day. First, we use a Support Vector Machine (SVM) model to predict the mood of the user. The second method is a Recurrent Neural Network (RNN). Both achieve reasonable prediction accuracy.

Pre-processing the dataset is a very important step in data mining and is usually closely tied to the prediction task. For a clear description, this document describes the pre-processing of the dataset in Section 2. The experiments are implemented in R, based on libraries such as e1071 (SVM) and RNN.

Reading and Understanding Data
In this section we use R to process the dataset, owing to its many supporting libraries for data mining. Before processing the data, it is necessary to understand the meaning of the variables and values in the dataset. The dataset contains variables and their corresponding values collected from the users' smartphones. The mood of a user is related to the variables recorded on the preceding days. However, the data for some variables are not related to the mood of the users, or are unusable due to damage and/or insufficiency. The objective of this section is therefore to pre-process the dataset, turning the original data into a wrapped dataset that can be fed to the predictive model.
Pre-process the Dataset
In this section, we present how we pre-process the data. First, to make the data structure clearer, we organize the dataset as in Table 1, where the values are grouped by id, time, and variable. In addition, we can analyse the mood of a user within a day, as in Figure 1, which shows the user's mood dynamics. Knowing the mood dynamics within a day helps to predict the user's mood on the next day. We therefore transformed the dataset into the new structure shown in Fig. 3. To build the predictive model, we need to summarise the values of the variables per day so that they are in the right format for the classifiers; we therefore average the values of each variable over each day.

Fig. 1: The mood of a user within a day. The user's mood stays at a stable average value in [7, 8].

Fig. 2: The predictive model procedure.

Fig. 3: Snapshot of the data structure with values grouped by id, time, and variable. The data in the red rectangle are unusable because they contain too many NA values.

As Figure 3 shows, some variables in the dataset have very few observations. We consider these unusable and remove them. Although the dataset is then much tidier, it is still not good enough to serve as training and test data, so we also remove the days that contain only a few variable values. At this point the dataset is usable for training and testing, as shown in Fig. 4.

Pre-processing a dataset so that it is usable for data mining raises many challenges with the original data, such as missing values and outliers [1]. Pre-processing is an essential and important step in data mining because of the variety of possible defects in the original dataset. Below we show several examples of the techniques used in our experiments. A missing value in the original dataset is a common problem that has to be solved in data mining. First, we illustrate how we handle missing values in the original dataset. In Fig.
4, we choose some variables with enough samples as the usable features. The data in Fig. 4 have full values for all variables across the different dates and ids; they are tidy data that, in both format and information, can be fed to the model for training and testing.

We divide the dataset into two parts, 90% as the training sample and 10% as the test sample. To build the predictive model shown in Fig. 2, we aggregate the history to create attributes that can be used in machine learning approaches such as the SVM [2] and the RNN used in this document. We use the average mood during the last five days as a predictor. Fig. 5 clearly illustrates how this new feature, usable by the classifiers, is created.

id        time        mood
AS14.01   2014-02-26  6.25
AS14.01   2014-02-27  6.33
AS14.01   2014-03-21  6.2
AS14.01   2014-03-22  6.4
AS14.01   2014-03-23  6.8
Table 1: The restructured dataset.

Fig. 4: Snapshot of the usable data structure. The values are grouped by id, time, and variable; there are no NA values and no duplicated ids across variables.

Fig. 5: An example of temporal abstraction: we take the average of the mood value over five days.

Rationale
For the rationale behind the choice of the final attributes, we mainly consider the quality and quantity of the dataset. We have to filter out damaged data that would likely train an incorrect predictive model or decrease the accuracy of the prediction. We therefore remove the data for days on which many variables have missing values, as well as variables with outliers.
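The pre-processing described in this section — averaging each variable per day, dropping days with no mood or too few recorded variables, and the five-day temporal abstraction of the mood — can be sketched as follows. This is an illustrative Python sketch (the actual experiments were implemented in R); the record layout and the `min_vars` threshold are assumptions for the example, not the paper's exact code.

```python
from collections import defaultdict
from statistics import mean

# Raw records in the long format of Table 1: (id, timestamp, variable, value).
records = [
    ("AS14.01", "2014-02-26 09:00", "mood", 6.0),
    ("AS14.01", "2014-02-26 18:00", "mood", 6.5),
    ("AS14.01", "2014-02-26 18:00", "activity", 0.1),
    ("AS14.01", "2014-02-27 10:00", "mood", 6.33),
]

def daily_average(records):
    """Average each variable per (id, day), as done before training."""
    groups = defaultdict(lambda: defaultdict(list))
    for uid, ts, var, val in records:
        day = ts.split()[0]                       # keep only the date part
        groups[(uid, day)][var].append(val)
    return {key: {var: mean(vals) for var, vals in d.items()}
            for key, d in groups.items()}

def usable_days(days, min_vars=2):
    """Drop days without a mood value or with too few recorded variables."""
    return {k: d for k, d in days.items() if "mood" in d and len(d) >= min_vars}

def five_day_mean(moods):
    """Temporal abstraction: average mood over the trailing five days."""
    return [mean(moods[max(0, i - 4): i + 1]) for i in range(len(moods))]

days = daily_average(records)
print(days[("AS14.01", "2014-02-26")]["mood"])         # 6.25
print(five_day_mean([6.25, 6.33, 6.2, 6.4, 6.8])[-1])  # 6.396
```

In this toy run, the day with only a single recorded variable is filtered out by `usable_days`, mirroring the removal of sparse days described above.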
Predictive Models

In this section, we describe the two predictive models and the benchmark. We do not focus on the details of the models, because we use standard R libraries for the SVM and the RNN.
Support Vector Machine

First, we adopted the Support Vector Machine (SVM) as the predictive model. In R, a variety of libraries can be used to implement an SVM; we used the library e1071 because it is easy to use. The main parameter settings of the SVM are shown in Table 2.

parameter  scale  type              kernel  degree  gamma  coef0  cost  class.weights  epsilon
setting    1      C-classification  linear  3       1      0      1     1              0.1

Table 2: The main parameter settings of the SVM.

After the pre-processing of Section 2, we only need to train and test on the prepared samples. We used the variables from Section 2 as the input of the SVM model and the mood value as its output. 90% of the samples were used to train the SVM model, and the remaining 10% were used to test its accuracy. We first report the accuracy of the trained SVM model when predicting the training sample, and then verify the accuracy again on the test sample. Table 3 shows the results of predicting the user's mood on the next day when testing on the training sample: 568 samples were used to train the SVM model, actual mood values from 5 to 8 were predicted as values from 3 to 9, and 467 samples were predicted correctly. The trained SVM model thus has an accuracy of 467/568 = 82.2%.

results_train   3    4    5    6    7    8    9
5               0    1    4    1    0    0    0
6               1    2    9   71   11    4    0
7               0    0    0   23  313   31    2
8               0    0    0    0   13   79    0

Table 3: Training results. The 568 samples were used to train the SVM predictive model; actual mood values from 5 to 8 were predicted as values from 3 to 9, and 467 samples were predicted correctly. The statistical results are shown in Fig. 6.

Furthermore, we verified the accuracy of the SVM predictive model on the test sample. We set aside 100 samples as the test sample in Section 2. The results are shown in Table 4, with the statistical results in Fig. 7; 81 samples were predicted correctly. Finally, we tested the accuracy of the benchmark, which assumes that the user's mood is the same as on the previous day: 62.3%. The comparison is shown in Table 5.
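Training accuracy is simply the number of cells where the predicted mood equals the actual mood, divided by the number of training samples. A short Python sketch using the Table 3 counts (the experiments themselves were run in R):

```python
# Confusion matrix from Table 3: rows = actual mood 5..8, columns = predicted 3..9.
actual_labels = [5, 6, 7, 8]
predicted_labels = [3, 4, 5, 6, 7, 8, 9]
confusion = [
    [0, 1, 4,  1,   0,  0, 0],   # actual 5
    [1, 2, 9, 71,  11,  4, 0],   # actual 6
    [0, 0, 0, 23, 313, 31, 2],   # actual 7
    [0, 0, 0,  0,  13, 79, 0],   # actual 8
]

# Correct predictions lie where the predicted label equals the actual label.
correct = sum(row[predicted_labels.index(a)]
              for a, row in zip(actual_labels, confusion))
print(correct)   # 467, matching the reported 467/568 = 82.2%
```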
Fig. 6: Boxplot of the results on the training sample. The 568 samples were used to train the SVM predictive model; actual mood values from 5 to 8 were predicted as values from 3 to 9.

Table 4: The mood predicted by the SVM model on the test sample (81 of the 100 test samples were predicted correctly). The statistical results are shown in Fig. 7.

Fig. 7: Boxplot of the results on the test sample. Most predictions are correct.

Recurrent Neural Network

For this variant of the model, we incorporate a Recurrent Neural Network (RNN) to exploit the temporal characteristics of the dataset. To do so, we first pre-process the data further. We replace every unavailable value, that is, every value recorded as 'NA', by the value of the previous data point. In this way we can use more data points and do not have to discard any. Moreover, it seems reasonable to equate these values to their previously measured values: all variables are measured several times a day, and it is plausible that their values do not change substantially from one data point to the next.

Now that we have a full dataset with no missing values, we can aggregate the data over days. This gives us daily averages of every variable, which are needed to predict the average mood on the next day. At the same time, we keep the days on which the mood variable is measured and delete the days on which no mood is observed, for each individual separately. By handling each individual separately, we avoid throwing away data that we could in fact use for certain individuals. If, for instance, the mood is measured for individual 1 on only 10 dates but for individual 2 on 15 dates, we avoid throwing away 5 dates for individual 2. As a next step, we find, among the days on which at least the mood has been measured, those on which the most variables have been recorded, and discard the remaining days. This is again done for each individual separately, resulting in the numbers of observations in Table 6.

As a final step in preparing to fit an RNN, we convert all variables to the [0, 1] interval, which is necessary for the RNN to converge (faster).
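The two RNN pre-processing steps just described — carrying the previous measurement forward over 'NA' values, and mapping variables to the [0, 1] interval and back — can be sketched as follows. This is an illustrative Python sketch, not the original R code.

```python
def forward_fill(values, default=None):
    """Replace 'NA' entries by the most recently observed value."""
    filled, last = [], default
    for v in values:
        if v == "NA":
            filled.append(last)
        else:
            filled.append(v)
            last = v
    return filled

def minmax_scale(xs):
    """Map values to [0, 1], as required before fitting the RNN."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs], (lo, hi)

def minmax_unscale(ys, bounds):
    """Scale predictions back to the original mood scale."""
    lo, hi = bounds
    return [y * (hi - lo) + lo for y in ys]

moods = forward_fill([6.2, "NA", "NA", 6.8, "NA"])   # [6.2, 6.2, 6.2, 6.8, 6.8]
scaled, bounds = minmax_scale(moods)                  # values now in [0, 1]
restored = minmax_unscale(scaled, bounds)             # back to the mood scale
```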
In the end, we scale our predictions back to their original scale, so that we obtain predictions for the mood that we actually observe. For every individual we then train and test an RNN, where we eventually ended up with a learning rate of , hidden layers in the network, iterations, the logistic sigmoid, and the stochastic gradient descent method as updating rule. For testing the individual RNNs, we used of the available data (rounded to the nearest integer), and the other (rounded to the nearest integer) for training. As an example, we present the results of this training and testing phase

prediction   result_train   result_test   benchmark
accuracy     0.822          0.810         0.623

Table 5: Comparison of the accuracy on the training sample, on the test sample, and of the benchmark.
ID        Observations   ID        Observations
AS14.01                  AS14.19
AS14.02                  AS14.20
AS14.03                  AS14.23
AS14.05                  AS14.24
AS14.06                  AS14.25
AS14.07                  AS14.26
AS14.08                  AS14.27
AS14.09                  AS14.28
AS14.12                  AS14.29
AS14.13                  AS14.30
AS14.14                  AS14.31
AS14.15                  AS14.32
AS14.16                  AS14.33
AS14.17

Table 6: Number of observations after pre-processing the data

for individual AS14.08 below. Note that, for each individual, the random number generator in the training phase was initialized with set.seed(2204).

(a) Error for the training set of AS14.08. (b) Predictions for the test set of AS14.08.

Fig. 8: Training phase for AS14.08.

From the results in Figure 8, the errors made in classifying the data decrease rather steeply as the number of iterations grows. The corresponding prediction plot shows the actual mood values in the test set against the values predicted by the trained RNN. The error plot might raise the worry that the RNN overfits the data, since the errors become so small, but the prediction plot shows that the RNN predicts the mood for the following day reasonably well. This means that the RNN is not overfitting in this case and that it can reasonably predict the mood for the following day for unknown cases. If we train our network on the entire dataset, we also see that we adequately capture the mood of the following day for the known cases.

Fig. 10: Predictions for the entire dataset of AS14.08.

The predictions in Figure 10 show that we capture the mood of the following day with rather high accuracy. As expected, the same qualitative pattern of classification errors as in the training phase arises when we use the entire dataset, which explains why we can predict the mood of the following day quite precisely. Moreover, this pattern for both the predictions and the errors is consistent across all individuals. As a selection of our results, we show the prediction plots for three individuals below: AS14.08, AS14.16 and AS14.24.
From these plots we indeed see that our predictions match the observed values rather closely and appear quite accurate. This is confirmed when we inspect the RMSE of the predictions for each individual, given in the following table.
ID        RMSE   ID        RMSE
AS14.01          AS14.19
AS14.02          AS14.20
AS14.03          AS14.23
AS14.05          AS14.24
AS14.06          AS14.25
AS14.07          AS14.26
AS14.08          AS14.27
AS14.09          AS14.28
AS14.12          AS14.29
AS14.13          AS14.30
AS14.14          AS14.31
AS14.15          AS14.32
AS14.16          AS14.33
AS14.17

Table 7: RMSE of the RNN approach.

(a) AS14.08 (b) AS14.16 (c) AS14.24
Fig. 11: Actual values vs. predictions for each individual.

We see that these RMSEs are all pretty close to zero for each individual. All things considered, the predicted values match the actual values rather closely for every individual. The RNN method thus seems to adequately incorporate the temporal aspects of the dataset at hand on an individual level.
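The paper fits its RNNs with an R library; as a rough illustration of the ingredients named above (logistic sigmoid activation, stochastic gradient descent, [0, 1]-scaled inputs), the following Python sketch trains a single recurrent unit to predict the next scaled mood value. The architecture, learning rate, epoch count, and initial weights are illustrative assumptions, not the paper's configuration.

```python
from math import exp

def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + exp(-z))

def squared_loss(params, seq):
    """Sum of squared one-step-ahead prediction errors over the sequence."""
    w_x, w_h, b = params
    h, loss = 0.0, 0.0
    for t in range(len(seq) - 1):
        h = sigmoid(w_x * seq[t] + w_h * h + b)
        loss += (h - seq[t + 1]) ** 2
    return loss

def train(seq, lr=0.5, epochs=2000):
    """SGD on one recurrent unit: h_t = sigmoid(w_x*x_t + w_h*h_{t-1} + b).
    Gradients are truncated (no backpropagation through time)."""
    w_x, w_h, b = 0.1, 0.1, 0.0
    for _ in range(epochs):
        h = 0.0
        for t in range(len(seq) - 1):
            x, target = seq[t], seq[t + 1]
            y = sigmoid(w_x * x + w_h * h + b)
            grad = (y - target) * y * (1.0 - y)   # d(loss)/dz for this step
            w_x -= lr * grad * x
            w_h -= lr * grad * h
            b   -= lr * grad
            h = y
    return w_x, w_h, b

seq = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]     # mood values scaled to [0, 1]
before = squared_loss((0.1, 0.1, 0.0), seq)
after = squared_loss(train(seq), seq)
print(before > after)                     # training reduces the loss
```

A real setup would use a proper RNN library with full backpropagation through time; this sketch only shows how the sigmoid and the SGD update interact.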
Benchmark

In this model variant, we simply predict that the average mood on the next day is the same as on the current day. In the prediction plots below, we show these actual and predicted values for three individuals: AS14.08, AS14.16 and AS14.24. We can see that this naive approach of simply predicting that the average mood stays constant (the same as on the previous day) does not produce results as good as those of the RNNs. The following table shows the corresponding RMSE for each individual under this naive approach.
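The naive benchmark and the RMSE used to score it are straightforward; a short Python sketch (the mood values are illustrative, not from the dataset):

```python
from math import sqrt

def naive_predict(daily_mood):
    """Benchmark: tomorrow's average mood is predicted to equal today's."""
    return daily_mood[:-1]            # predictions for days 2..n

def rmse(predicted, observed):
    """Root mean squared error between predictions and observations."""
    return sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed))
                / len(predicted))

moods = [6.25, 6.33, 6.2, 6.4, 6.8]   # illustrative daily averages
print(round(rmse(naive_predict(moods), moods[1:]), 3))   # 0.236
```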
ID        RMSE   ID        RMSE
AS14.01          AS14.19
AS14.02          AS14.20
AS14.03          AS14.23
AS14.05          AS14.24
AS14.06          AS14.25
AS14.07          AS14.26
AS14.08          AS14.27
AS14.09          AS14.28
AS14.12          AS14.29
AS14.13          AS14.30
AS14.14          AS14.31
AS14.15          AS14.32
AS14.16          AS14.33
AS14.17

Table 8: RMSE of the naive approach.

From this table we also see that the RNNs actually perform much better than the naive approach. All things considered, the naive approach does not seem appropriate to adopt and can be considered a 'clueless' method: if we had no idea how to approach the problem, it would be the standard worst-case scenario for producing predictions. The naive approach can therefore indeed be considered a benchmark model.

(a) AS14.01 (b) AS14.02 (c) AS14.03
Fig. 13: Actual values vs predictions for each individual (1)
Conclusion
Hotel recommendation systems are a popular research field. This paper provides a comprehensive pipeline for researchers to build such a system, from the raw data to a specific application. Although the results show that the two methods achieve a successful prediction system, they are only basic machine learning approaches, and many other approaches are worth further exploration. For instance, evolutionary approaches have been applied in many areas [3]; neuroevolution has been applied to evolving neural networks for real-time computer vision [4] and to evolutionary robotics [5,6,7,8]. Convolutional neural networks generally achieve remarkable performance in many areas [9], and we aim to apply them in this pipeline. Knowledge graphs are a popular method applied in many domains [10,11], such as finance, medicine, biology, question answering, storing research information and, in particular, recommendation systems; we will therefore use knowledge graphs to design the hotel recommendation system in the future. In addition, signal compression [12,13,14] is an interesting technology for pre-processing raw data. These approaches are the directions in which we aim to extend this pipeline.
References
1. Dunren Che, Mejdl Safran, and Zhiyong Peng. From big data to big data mining: challenges, issues, and opportunities. In International Conference on Database Systems for Advanced Applications, pages 1–15. Springer, 2013.
2. G. Lan, J. Benito-Picazo, D. M. Roijers, E. Domínguez, and A. E. Eiben. Real-time robot vision on low-performance computing hardware. In , pages 1959–1965, Nov 2018.
3. Gongjin Lan, Jakub M. Tomczak, Diederik M. Roijers, and A. E. Eiben. Time efficiency in optimization with a Bayesian-evolutionary algorithm, 2020.
4. G. Lan, L. de Vries, and S. Wang. Evolving efficient deep neural networks for real-time object recognition. In , pages 2571–2578, Dec 2019.
5. G. Lan, J. Chen, and A. E. Eiben. Simulated and real-world evolution of predator robots. In , pages 1974–1981, Dec 2019.
6. Gongjin Lan, Matteo De Carlo, Fuda van Diggelen, Jakub M. Tomczak, Diederik M. Roijers, and A. E. Eiben. Learning directed locomotion in modular robots with evolvable morphologies. ArXiv, abs/2001.07804, 2019.
7. Gongjin Lan, Jiunhan Chen, and A. E. Eiben. Evolutionary predator-prey robot systems: from simulation to real world. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 123–124. ACM, 2019.
8. Gongjin Lan, Milan Jelisavcic, Diederik M. Roijers, Evert Haasdijk, and A. E. Eiben. Directed locomotion for modular robots with evolvable morphologies. In Parallel Problem Solving from Nature – PPSN XV, pages 476–487. Springer, 2018.
9. Guocheng Liu, J. Liang, G. Lan, Q. Hao, and M. Chen. Convolutional neural network enhanced binary sensor network for human activity recognition. In , pages 1–3, Oct 2016.
10. Ting Liu, K. Anton Feenstra, Jaap Heringa, and Zhisheng Huang. Influence of gut microbiota on mental health via neurotransmitters: a review. Journal of Artificial Intelligence for Medical Sciences, 1:1–14, 2020.
11. Ting Liu and Zhisheng Huang. Evidence-based analysis of neurotransmitter modulation by gut microbiota. In International Conference on Health Information Science, pages 238–249. Springer, 2019.
12. G. Lan, X. Hu, and Q. Hao. A Bayesian approach for targets localization using binary sensor networks. In , volume 2, pages 453–456, Dec 2015.
13. G. Lan, J. Liang, G. Liu, and Q. Hao. Development of a smart floor for target localization with Bayesian binary sensing. In , pages 447–453, March 2017.
14. Gongjin Lan, Ziyun Luo, and Qi Hao. Development of a virtual reality teleconference system using distributed depth sensors. In 2016 2nd IEEE International Conference on Computer and Communications (ICCC).