Influences of Temporal Factors on GPS-based Human Mobility Lifestyle
Tran Phuong Thao
The University of Tokyo
Tokyo, [email protected]
Abstract: Analysis of human mobility from GPS trajectories has become crucial in many areas, such as policy planning for urban citizens, location-based service recommendation and prediction, and especially mitigating the spread of biological and mobile viruses. In this paper, we propose a method to find temporal factors affecting the human mobility lifestyle. We collected GPS data from 100 smartphone users in Japan. We designed a model that consists of 13 temporal patterns. We then applied a multiple linear regression and found that people tend to keep their mobility habits on Thursdays and on the days in the second week of a month but tend to lose their habits on Fridays. We also explain some reasons behind these findings.
Index Terms: Human Behaviour, Movement Lifestyle, Location-based Recommendation, GPS History, Multiple Linear Regression, Student t-test Statistics.
I. INTRODUCTION
Understanding individual human mobility plays an important role, especially now that the geographic spread of the virus that causes COVID-19 has taken the world into uncharted territory. It is also a critical factor in policy planning [1], [2], travel demand forecasting [3], [4], location-based recommendation and service advertising [6], and location-based personal authentication [5]. M. Gonzalez et al. [25] showed that human mobility follows a high degree of regularity. Several sophisticated models have therefore been proposed to determine the factors that influence whether people keep or lose their mobility lifestyle. The factors can be classified into spatial, temporal, and social ones, among which the temporal factors have been shown to be the most important. However, the temporal factors found in existing work are still coarse-grained (e.g., weekend versus weekday, without clarifying which specific days of the week or which week of the month).

In this paper, we investigate the recurrence and temporal periodicity inherent to human mobility, inferred from mobile phone data, with more fine-grained factors. We collected GPS data from 100 random smartphone users in Japan. We designed a model consisting of 13 temporal factors from 3 pattern categories (i.e., days of the week, quarters of the month, and holidays including weekends and national public holidays) as independent variables. We also propose an algorithm to compute the probability (i.e., the similarity score) that users re-visit locations they visited before, which serves as the target outcome. We then applied a multiple linear regression and performed a t-test. We found that people tend to keep their mobility habits on Thursdays and on the days in the second week of a month but tend to lose their habits on Fridays. We also discuss some reasons for and applications of these findings.

The rest of this paper is organized as follows. Section II introduces related work. Section III presents our proposed methodology. Section IV gives the experiment and our findings. Section V discusses applications and limitations of our method. Section VI concludes the paper.

II. RELATED WORK
In this section, we introduce related work on factors affecting location habits. The work can be classified into three research directions.
A. Spatial Factors
S. Zhao et al. [10] observed that 80% of successive checked-in POIs (Points-of-Interest) lie within 32 kilometers of each other. They explained that people often act around their home or office, so even when it is independent of the last check-in, the successive check-in can still happen in the same activity area. S. Yali et al. [11] analyzed the two location-based social networks Foursquare and Gowalla. They found that the probabilities for distances within 5 km are greater than 40%, which decrease to 17% and 8% within 10 km on the two datasets, respectively. Most users checked in within 20 km. T. Thao et al. [13], [14] leveraged the idea that locations recorded at close points in time are closer in physical distance than locations recorded far apart in time, since a person needs a period of time to move gradually from one location to another. The experimental results showed that the extracted distance coherence features, along with the longitudes and latitudes, could improve the authentication accuracy.

While the papers [10], [11], [14] all focused on the fact that closer locations have a higher probability of being visited by users, Y. Hongzhi et al. [12] raised a more challenging problem: people travelling to a new city where they have no activity history. They showed that people tend to travel a limited distance when visiting venues and attending events. Furthermore, the activity records in non-home cities amount to only 0.47% of the activity records in home cities. To solve the problem, the authors analyzed two factors: user interest (e.g., kids pay more attention to playgrounds while young ladies may be more interested in cosmetics stores) and local preference (e.g., people are more likely to visit local sightseeing attractions and attend popular events when they travel to an unfamiliar city). They found that these factors also affect the decision to visit an unfamiliar location.

B. Temporal Factors
G. Huiji et al. [7] extracted the correlations between the check-in time and the corresponding check-in preferences of a user. They found that weekly patterns (7 days of the week) and weekday/weekend patterns can capture the temporal check-in preferences of a user. However, the results only indicate general patterns and do not clearly show which day of the week, weekday, or weekend is the affecting factor. S. Zhao et al. [23] examined the day-of-week check-in pattern at different hours: users check in more in the late afternoon and evening, from 4:00 p.m. to 3:00 a.m., on weekends than on weekdays. Saturday and Sunday follow a similar pattern, while the days from Monday to Friday follow a similar pattern that differs from that of the weekends. This suggests that weekdays and weekends have two different types of effect on the check-in behavior of a user. J. Bao et al. [24] split a week into two parts, weekdays and weekends. For each part, they split a day into hourly time bins. A total of 2 × 24 time bins are used to express the temporal patterns. M. Gonzalez et al. [25] measured the return probability for each individual. They noted that, for a two-dimensional random walk, the probability that a user returns to the position where the user was first observed after $t$ hours should follow $1/(t \ln^2(t))$. Instead, the measured return probability is characterized by several peaks at 24h, 48h, and 72h, which indicates a strong tendency of humans to return to locations they visited before. M. Xie et al. [8] explored the importance of spatial, temporal, and social factors and found that they can be ranked as follows: temporal effect > content effect > spatial effect. This indicates that the temporal factors may provide the most information, although combining all of them is of course the best solution.

C. Social and Content Factors
H. Wang et al. [15] showed that the social link is an important factor affecting people's choices when deciding which new place to visit. The authors analyzed the Gowalla dataset and found that more than 30% of the new places visited by a user had been visited in the past by a friend or a friend-of-a-friend. Following the same observation that social friends tend to have similar check-in behavior, several papers [16]–[19] also extracted the similarity score between users derived from social friendships. The experimental results showed that it could enhance the accuracy. Besides the links of friends and friends-of-friends, H. Bagci et al. [20] showed that local experts are also a factor affecting the place to visit. J. Bao et al. [21] pointed out that users who visit many high-quality locations tend to have good knowledge of the vicinity. Similarly, if a particular location is visited by many high-quality users (i.e., experts), it is more probable that the location is a quality location. L. Kai et al. [22] focused on service locations only, such as restaurants and fitness centers. They found that factors including demographics, preferences, and service levels (e.g., price range, whether a discount is offered, advertisements) can increase the probability of mobile users visiting the service locations.

III. METHODOLOGY
In this section, we present our proposed methodology, including the data collection and the model design.
A. Data Collection
A navigation application named MITHRA (Multi-factor Identification/auTHentication ReseArch) was created to collect GPS information from Android smartphone users. One hundred users were randomly recruited and thus live and work in random areas. The data consists of timestamps and GPS information (longitude and latitude). The application collects the data every 5 minutes. The users have different data collection periods because the period depends on when each user starts running the application. The entire data collected from all the users spans January to April 2017. Timestamps are recorded to the second. The precision of the longitudes and latitudes is six decimal places (e.g., 36.xxxxxx), which corresponds to about 0.1 meters. Regarding data privacy, a privacy consent is shown to the users during the installation process. The application can only be successfully installed if the users accept the terms and conditions agreement. We do not collect any personal information such as name, age, date of birth, or gender, except the address, which is used for user identity. Our project was reviewed by the Ethics Review Committee of the Graduate School of Information Science and Technology, the University of Tokyo.
B. Model
First, we briefly describe how linear regression works. Linear regression is a statistical method used for measuring whether a set of factors affects (or can be used to predict) a certain outcome. It can model the relationship between one or more independent variables (features) and one dependent (output) variable. The value of the target function is expected to be a linear combination of the features. Formally, let $f^*$ denote the predicted value:

$f^*(c, x) \sim c_0 + c_1 x_1 + \cdots + c_n x_n$ (1)

where $X = \{x_1, x_2, \cdots, x_n\}$ denotes the set of features, $n$ denotes the number of features, $C = \{c_1, c_2, \cdots, c_n\}$ denotes the set of coefficients, and $c_0$ denotes the intercept. $c_0$ is a constant representing the expected mean value of $f^*$ when $x_i = 0$ for all $i \in \{1, \cdots, n\}$. There are several methods to solve the regression (e.g., Ridge Regression, Lasso), but we use the most common one, Ordinary Least Squares (OLS), which minimizes the residual sum of squares between the observed targets $f$ in the dataset and the targets predicted by the linear approximation:

$\min_{C} \|Xc - f\|_2^2$ (2)

When $x_1, x_2, \cdots, x_n$ are correlated and the columns of the design matrix $X$ are approximately linearly dependent, $X$ becomes close to singular, which makes the estimated coefficients unstable.

TABLE I. EXAMPLE OF SIMILARITY SCORE CALCULATION FOR A USER U
[Table I: one column group per hourly period in H (00:00-00:59, 01:00-01:59, ...). The D_learn rows list, for each period, the unique pairs (lon, lat) in U_learn with their weights in W_learn. The D_test rows list, for each date, the representative r and the similarity score s of each period.]

We are now ready to define our model for the regression. For each user $U$, the model is defined as:

$score \sim wdays + mquar + hdays$ (3)

where $score$ represents the target function; $wdays$, $mquar$, and $hdays$ represent the variables related to the days of the week, the quarters of the month, and holidays, respectively.
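The OLS fit in Eqs. (1) and (2) can be illustrated with a small sketch. This is not the implementation used in the experiment: it is a plain-Python illustration that solves the normal equations $(A^T A) c = A^T f$ by Gaussian elimination, where $A$ is the design matrix with a leading column of ones for the intercept. The function names `fit_ols` and `predict` and the toy data are hypothetical.

```python
# Minimal OLS sketch (assumption: small, well-conditioned problems only).
# fit_ols and predict are illustrative names, not part of any library.

def fit_ols(rows, targets):
    """Return [c0, c1, ..., cn] minimising the residual sum of squares."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 carries the intercept c0
    m = len(design[0])
    # Augmented normal-equation system [A^T A | A^T f].
    aug = [[sum(a[i] * a[j] for a in design) for j in range(m)]
           + [sum(a[i] * t for a, t in zip(design, targets))]
           for i in range(m)]
    for col in range(m):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            # Linearly dependent features make the system singular.
            raise ValueError("design matrix is (close to) singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][m] / aug[i][i] for i in range(m)]


def predict(coefs, row):
    """Evaluate Eq. (1) for one feature row."""
    return coefs[0] + sum(c * x for c, x in zip(coefs[1:], row))


# Toy data generated from f = 3 + 2*x1 - 1*x2, so the fit should recover it.
rows = [(1, 0), (0, 1), (1, 1), (2, 1), (0, 0)]
targets = [3 + 2 * x1 - 1 * x2 for x1, x2 in rows]
coefs = fit_ols(rows, targets)
```

The `ValueError` branch mirrors the remark about linearly dependent columns: when two features are identical, the elimination reaches a zero pivot and no unique solution exists.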
1) Target Function (Dependent Variable):
In this part, we explain the algorithm used to calculate the similarity score, which measures the probability of a user re-visiting a location that he/she visited before. The scores also represent the mobility lifestyle pattern of a user. For each user $U$, the data is split into two parts based on the data collection time period. The data from the first half of the time period is denoted by $D_{learn}$ and the data from the latter half by $D_{test}$. The similarity score between $D_{learn}$ and $D_{test}$ is used as the target function. The procedure to calculate the similarity score is as follows.

a) Measuring Template from $D_{learn}$: First, the longitude and latitude in each data record $d_i \in D_{learn}$ are rounded from the original 6 decimal places to 2 decimal places, since the location accuracy of people's movement is often within a 1 km square. Let $dat_i$, $tim_i$, $lon_i$, $lat_i$ denote the date (year, month, day), the time (hour, minute, second), and the rounded longitude and latitude of $d_i$, respectively. Let $H = \{$00:00-00:59, 01:00-01:59, $\cdots$, 23:00-23:59$\}$ be the 24 hourly time periods. Each period is denoted by $h_\alpha \in H$ where $\alpha \in [0, 23]$. The records in $D_{learn}$ are grouped into 24 subsets according to $h_\alpha$. For each $\alpha$, the following sets are constructed:
• $T_{learn_\alpha} = \{(lon_i, lat_i)\}$: the set contains the longitude and latitude of all the records $d_i$ such that $tim_i \in h_\alpha$, regardless of $dat_i$.
• $U_{learn_\alpha} = \{(lon_{uniq_j}, lat_{uniq_j})\} \subset T_{learn_\alpha}$: the set contains only the unique pairs of longitude and latitude. For all $j, j' \in [0, |U_{learn_\alpha}|]$ with $j \neq j'$, $(lon_{uniq_j} \neq lon_{uniq_{j'}}) \vee (lat_{uniq_j} \neq lat_{uniq_{j'}})$ (note that this is an OR, not an AND, operation).
• $W_{learn_\alpha} = \{weight_j\}$: the set contains the corresponding weight of each pair $(lon_{uniq_j}, lat_{uniq_j}) \in U_{learn_\alpha}$. $U_{learn_\alpha}$ and $W_{learn_\alpha}$ have the same length. The weight is calculated as the percentage of time that the user $U$ stays at the coordinate $(lon_{uniq_j}, lat_{uniq_j})$, that is, the ratio between the number of occurrences of the pair $(lon_{uniq_j}, lat_{uniq_j})$ in $T_{learn_\alpha}$ and the length of $T_{learn_\alpha}$:

$weight_j = \frac{\#(lon_{uniq_j}, lat_{uniq_j})}{|T_{learn_\alpha}|}$ (4)

b) Extracting Representatives from $D_{test}$: In $D_{learn}$, we grouped the data into 24 hours regardless of the date. For $D_{test}$, we consider each date separately before grouping the data of that date into 24 hours. For each unique date $\delta$ in $D_{test}$ and for each $\alpha \in [0, 23]$, we construct $T_{test_\alpha} = \{(lon_i, lat_i)\}$ in the same way as $T_{learn_\alpha}$ but with $dat_i = \delta$. We determine the representative $r_{\delta\alpha}$ for $T_{test_\alpha}$ by extracting the element $(lon_i, lat_i) \in T_{test_\alpha}$ at which the user $U$ stays for the longest period of time on the date $\delta$. The entire $D_{test}$ thus yields one representative for each pair of date $\delta$ and period $\alpha$.

c) Matching to Calculate Similarity Scores: For each date $\delta$ in $D_{test}$ and for each $\alpha \in [0, 23]$, if the representative $r_{\delta\alpha}$ exists in $U_{learn_\alpha}$, the similarity score $s_{\delta\alpha}$ is set to the corresponding weight from $W_{learn_\alpha}$. If not, $s_{\delta\alpha}$ is set to zero. An example is given in Table I. After the weights for the 24 hours of each day $\delta$ are computed, all the weights in $D_{test}$ for the user $U$ are summed up and used as the final value of $score$. Each user $U$ therefore has one corresponding similarity score. For the example in Table I, the final score for $U$ is the sum of the matched weights, where a weight that is matched on several dates is counted once for each match.
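The three steps above (template, representatives, matching) can be sketched as follows. This is a hedged illustration rather than the authors' code. It assumes records are `(datetime, lon, lat)` tuples and, because the application samples at a fixed 5-minute interval, approximates "the location where the user stays longest" by the most frequent rounded coordinate in an hour. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime


def rounded(lon, lat):
    """Round coordinates to 2 decimal places (roughly a 1 km grid)."""
    return (round(lon, 2), round(lat, 2))


def build_template(learn_records):
    """Step a): {hour: {(lon, lat): weight}}, with weights as in Eq. (4)."""
    per_hour = defaultdict(Counter)
    for ts, lon, lat in learn_records:
        per_hour[ts.hour][rounded(lon, lat)] += 1
    return {hour: {loc: n / sum(counts.values()) for loc, n in counts.items()}
            for hour, counts in per_hour.items()}


def representatives(test_records):
    """Step b): {(date, hour): representative location}.

    Assumption: with fixed-interval sampling, the most frequent rounded
    coordinate stands in for the longest stay.
    """
    per_slot = defaultdict(Counter)
    for ts, lon, lat in test_records:
        per_slot[(ts.date(), ts.hour)][rounded(lon, lat)] += 1
    return {slot: counts.most_common(1)[0][0] for slot, counts in per_slot.items()}


def similarity_score(learn_records, test_records):
    """Step c): sum the template weights of all matched representatives."""
    template = build_template(learn_records)
    return sum(template.get(hour, {}).get(loc, 0.0)
               for (_, hour), loc in representatives(test_records).items())


# Toy example: three of four learning records at 09:xx fall in the same cell.
learn = [
    (datetime(2017, 1, 10, 9, 0), 139.761234, 35.681111),
    (datetime(2017, 1, 10, 9, 5), 139.762001, 35.679900),
    (datetime(2017, 1, 11, 9, 0), 139.758900, 35.684000),
    (datetime(2017, 1, 11, 9, 5), 139.700100, 35.660200),
]
test = [
    (datetime(2017, 3, 1, 9, 0), 139.760400, 35.680300),  # known cell, weight 0.75
    (datetime(2017, 3, 1, 9, 5), 139.760900, 35.679800),
    (datetime(2017, 3, 2, 9, 0), 139.800000, 35.700000),  # unseen cell, scores 0
]
score = similarity_score(learn, test)
```

In this toy case only the first test date matches the template, so the score equals the single weight 0.75. Hours without any test record simply contribute nothing, which is one possible reading of the procedure.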
2) Variables:
For each user $U$ and each day $\delta$ mentioned above, the following binary variables are extracted. The first group consists of 7 binary variables corresponding to the 7 days of the week (i.e., is $\delta$ a Monday, $\cdots$, is $\delta$ a Sunday), denoted by {mon, tue, $\cdots$, sun}. The second group consists of 4 variables corresponding to the 4 weeks of the month (i.e., is $\delta$ in the first week, $\cdots$, is $\delta$ in the fourth week), denoted by {wk1, wk2, wk3, wk4}. The third group consists of 2 variables related to holidays (i.e., is $\delta$ a national holiday and is $\delta$ a weekend day), denoted by {natl, wknd}. These 13 binary variables are summed over all the days $\delta$ of each user $U$. $wdays$, $mquar$, and $hdays$ represent the summed variables of the first, second, and third group, respectively. Let $D_P$ denote the final data used for the regression, which consists of 100 samples with 13 variables.

IV. EXPERIMENT
The program is written in Python 3.7.4 and run on a MacBook Pro with a 2.8 GHz Intel Core i7 processor and 16 GB of RAM. The multiple (linear) regression model is executed using the scikit-learn package, version 0.21. The t-test is computed using the statsmodels package, version 0.11.
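As an illustration of how the 13 variables of Section III-B2 might be derived in Python, the sketch below flags each day and sums the flags per user. It is not the authors' code. The week-of-month rule (days 1-7 are wk1, days 8-14 are wk2, days 15-21 are wk3, and day 22 onward is wk4) is an assumption, since the paper does not spell out how days 29-31 are assigned, and the set of national holidays is a hypothetical input supplied by the caller.

```python
from datetime import date

DAY_NAMES = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
ALL_NAMES = DAY_NAMES + ["wk1", "wk2", "wk3", "wk4", "natl", "wknd"]


def day_flags(day, national_holidays=frozenset()):
    """Return the 13 binary variables for a single day."""
    flags = {name: 0 for name in ALL_NAMES}
    flags[DAY_NAMES[day.weekday()]] = 1           # weekday(): Monday == 0
    week = min((day.day - 1) // 7 + 1, 4)         # assumption: day 22+ counts as wk4
    flags["wk{}".format(week)] = 1
    flags["wknd"] = int(day.weekday() >= 5)       # Saturday or Sunday
    flags["natl"] = int(day in national_holidays)
    return flags


def user_variables(days, national_holidays=frozenset()):
    """Sum the binary variables over all test days of one user."""
    totals = {name: 0 for name in ALL_NAMES}
    for day in days:
        for name, value in day_flags(day, national_holidays).items():
            totals[name] += value
    return totals


# Toy example: a Monday marked as a holiday, a Friday, and a Saturday,
# all falling in the second week of January 2017.
holidays = {date(2017, 1, 9)}  # example holiday entry supplied by the caller
sample = user_variables(
    [date(2017, 1, 9), date(2017, 1, 13), date(2017, 1, 14)], holidays)
```

One row of $D_P$ would then consist of these 13 totals for one user, paired with that user's similarity score.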
TABLE II. VARIABLES DISTRIBUTION
[Table II: columns no., var., mean, SD, kurtosis, skew, min, max; one row for each of mon, tue, wed, thu, fri, sat, sun, wk1, wk2, wk3, wk4, natl, wknd, and score.]

TABLE III. JARQUE-BERA TEST FOR RESIDUALS
metrics     entire 100 samples    excluded (1, 21, 94)    excluded (1, 21, 30)
p-value     0.03                  0.05                    0.05
kurtosis    -0.64                 -0.57                   -0.59
skew        2.64                  2.57                    2.69

A. Distribution of Variables and Normality of Residuals
The distribution of the 13 variables and the target score is given in Table II. While the independent variables ($wdays$, $mquar$, $hdays$) and the dependent variable ($score$) do not need to be normally distributed, normality is required for the residuals. The entire preprocessed data ($D_P$, as mentioned in Section III-B2) has 100 samples, corresponding to 100 users, with 13 variables. We performed a Jarque-Bera test, and the result is shown in the second column of Table III. The p-value is less than 0.05, which indicates that the residuals are not normally distributed. Therefore, we conducted an analysis of data outliers, described in the next part.

Fig. 1. Z-Score Plotting for 100 Datapoints
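For readers unfamiliar with the Jarque-Bera test, the following sketch shows how its statistic and p-value are obtained from the residuals. It is a plain-Python illustration under the standard definition $JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)$, where $S$ is the sample skewness and $K$ the sample kurtosis. Under normality, $JB$ is asymptotically chi-square distributed with 2 degrees of freedom, whose survival function is $e^{-JB/2}$. The function name is hypothetical, and the experiment itself presumably relied on a library routine.

```python
import math


def jarque_bera(residuals):
    """Return (JB statistic, p-value, skewness, kurtosis) for a list of residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    # Biased central moments, as used in the usual definition of the test.
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2          # equals 3 for a normal distribution
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    p_value = math.exp(-jb / 2.0)  # chi-square survival function with 2 dof
    return jb, p_value, skew, kurt


# A small symmetric sample and a strongly right-skewed one.
jb_sym, p_sym, skew_sym, kurt_sym = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])
jb_skw, p_skw, skew_skw, kurt_skw = jarque_bera([0.0] * 20 + [10.0])
```

A p-value below 0.05, as for the skewed sample here, leads to rejecting the hypothesis that the residuals are normally distributed, which is the situation reported for the entire 100 samples.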
B. Outlier Identification
First, we measured the z-score of each of the 13 variables for the 100 samples. According to the empirical rule (the so-called 68-95-99.7 rule or three-sigma rule) [9], any z-score that is greater than 3 or less than -3 is considered an outlier. Under this rule, almost all of the data (99.7%) should lie within three standard deviations of the mean, that is, 99.7% of the z-scores should fall within the range (-3, +3). We therefore scanned all the z-scores and found six samples in which at least one of the 13 variables has a z-score greater than 3 or less than -3. The 6 outliers are the 1st, 4th, 8th, 22nd, 30th, and 82nd samples in $D_P$; we denote this set by $outlier_{(-3,+3)}$. Our aim is to remove the smallest number of outliers such that the p-value of the residuals increases to 0.05 or more. We therefore ran an algorithm that performs the Jarque-Bera test after removing each k-combination of the elements in the set $outlier_{(-3,+3)}$, where $k$ is chosen in ascending order from 1 to the length $n = |outlier_{(-3,+3)}| = 6$. Note that we do not need to check all the $\sum_{k=1}^{n} \binom{n}{k}$ combinations: if we find a p-value that is equal to or greater than 0.05 at a certain $k = k_p$, it is unnecessary to check the combinations with $k > k_p$. Unfortunately, no combination of removed outliers passed the Jarque-Bera test. We then reduced the outlier range from (-3, +3) to (-2.9, +2.9), which yields a set of 8 samples, denoted by $outlier_{(-2.9,+2.9)}$. We performed the Jarque-Bera test in the same way and, fortunately, found two combinations at $k = 3$ whose removal raises the p-value: $C_1 = \{1, 21, 94\}$ and $C_2 = \{1, 21, 30\}$. The z-scores of all 100 samples are plotted in Fig. 1, where the 13 colors of the datapoints represent the 13 variables. All the data belonging to the 4 outlier samples $s_1$, $s_{21}$, $s_{30}$, and $s_{94}$ lie along the 4 red lines. The results of the tests are summarized in the last two columns of Table III.
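The outlier search just described can be sketched in two parts: flagging samples by z-score, and trying k-combinations in ascending order of k until the normality test passes. The code below is an illustration, not the authors' implementation. It uses the population standard deviation (the paper does not say which estimator was used), and `p_value_after_removal` is a hypothetical callback that would refit the regression without the given samples and return the Jarque-Bera p-value of the residuals.

```python
from itertools import combinations
from statistics import mean, pstdev


def zscore_outliers(samples, threshold=3.0):
    """Indices of samples with |z| > threshold in at least one variable."""
    columns = list(zip(*samples))
    stats = [(mean(col), pstdev(col)) for col in columns]
    flagged = []
    for idx, row in enumerate(samples):
        # Constant columns (sd == 0) carry no z-score and are skipped.
        if any(sd > 0 and abs((value - mu) / sd) > threshold
               for value, (mu, sd) in zip(row, stats)):
            flagged.append(idx)
    return flagged


def smallest_removals(candidates, p_value_after_removal, alpha=0.05):
    """Return every k-combination reaching p >= alpha at the smallest such k."""
    for k in range(1, len(candidates) + 1):
        hits = [combo for combo in combinations(candidates, k)
                if p_value_after_removal(set(combo)) >= alpha]
        if hits:
            return hits  # larger k need not be checked
    return []


# Toy data: 19 identical samples and one extreme sample in the first variable.
toy_samples = [[0.0, 1.0]] * 19 + [[100.0, 1.0]]
toy_outliers = zscore_outliers(toy_samples)


def toy_p(removed):
    """Stand-in for the real test: passes only if samples 1 and 21 are removed."""
    return 0.06 if {1, 21} <= removed else 0.01


toy_hits = smallest_removals([1, 4, 21, 30], toy_p)
```

Stopping at the first k with a passing combination keeps as many samples as possible, which matches the stated aim of removing the smallest number of outliers.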
Let $D_{C_1} = D_P \setminus C_1$ and $D_{C_2} = D_P \setminus C_2$ denote the data after removing the outliers in $C_1$ and $C_2$, respectively. The Quantile-Quantile (QQ) plots of $D_P$, $D_{C_1}$, and $D_{C_2}$ are given in Figures 2, 3, and 4, respectively. It can be observed that the datapoints from $D_{C_1}$ and $D_{C_2}$ are closer to the straight 45-degree reference lines than those from $D_P$.

This may raise the question of why we do not simply remove all the data outliers. First, removing all the outliers does not mean that the p-value for the residuals increases. We ran a test in which we removed all 6 samples in $outlier_{(-3,+3)}$ and all 8 samples in $outlier_{(-2.9,+2.9)}$; in both cases the p-values became even worse. Second, keeping as many samples as possible preserves the nature of human behavior. That is why we balance the trade-off by finding the combinations of outliers as above.

C. Factor Extraction
We now apply the multiple linear regression to $D_{C_1}$ and $D_{C_2}$. The affecting factors are determined based on the p-values at 3 significance levels, listed from the strictest threshold to the loosest:
• significant affecting factors
• nearly-significant affecting factors, marked (**) in Table IV
• normal affecting factors, marked (*) in Table IV

The result is given in Table IV. For $D_{C_1}$, we found two normal factors, thu and fri, with positive and negative coefficients, respectively. This indicates that people tend to keep their movement lifestyle on Thursdays but tend to lose it on Fridays. For $D_{C_2}$, we found one nearly-significant factor, fri, with a negative coefficient as for $D_{C_1}$, and one normal factor, wk2, with a positive coefficient. This indicates that people tend to lose their movement lifestyle on Fridays and tend to keep it on the days in the second week of the month.

[Fig. 2. QQ-Plot of Entire Data (vertical axis: Sample Quantiles)]
[Fig. 3. QQ-Plot After Removing (1, 21, 94) (vertical axis: Sample Quantiles)]
[Fig. 4. QQ-Plot After Removing (1, 21, 30) (vertical axis: Sample Quantiles)]

V. DISCUSSION
The result can be heuristically explained by the fact that Thursday and the second week of the month fall in the middle of the week and of the month, respectively. Human behavior (and even human mood and social interaction) is then more stable than on the first days of the week (or in the first week of the month) and on the days close to the weekend (or in the weeks close to the end of the month). In contrast, the result for Friday may be caused by "nomikai", a drinking party (often held on Fridays with co-workers) that is particular to Japanese culture. Even so, a deeper analysis and a formal proof of these results remain future work.

Our findings can help advance the psychological and behavioural science of human mobility, which is important for urban planning, traffic forecasting, and modelling the spread of biological and mobile viruses. They can also help enhance the effectiveness of location-based recommendation and location-based prediction, and enable advertisers to design and present their location services to targeted customers. For example, if a restaurant knows that a returning customer tends to lose the habit of visiting on Fridays, it can offer more discounts on Fridays than on the other days of the week.

In this paper, weekly temporal patterns were analyzed. Future work can examine daily temporal patterns, that is, different time frames during a day, such as four 6-hour intervals or eight 3-hour intervals (e.g., { $\cdots$, 21:01-24:00 }). Combining weekly and daily temporal patterns is also a promising approach to determine in which time bins people tend to visit (or tend to lose the habit of visiting) their usual locations.

VI. CONCLUSION
In this paper, we aimed to find which temporal factors affect the human mobility lifestyle. We collected GPS data including longitude, latitude, and timestamp from 100 random participants in Japan using a smartphone application. We designed a regression model that uses 13 weekly temporal factors as independent variables, categorized into 3 pattern types: days of the week, quarters of the month, and holidays. We proposed an algorithm to compute the similarity score between the location history and the most recent location log. We applied a multiple linear regression with a t-test and found that people tend to keep their mobility habits on Thursdays and on the days in the second week of the month but tend to lose them on Fridays.

REFERENCES
[1] M. W. Horner and M. E. O'Kelly, “Embedding economies of scale concepts for hub network design”. In: Journal of Transport Geography, vol. 9, no. 4, 2001, pp. 255–265.
[2] Y. Long, H. Han, Y. Tu, and X. Shu, “Evaluating the effectiveness of urban growth boundaries using human mobility and activity records”. In: Cities, vol. 46, 2015, pp. 76–84.
[3] R. Kitamura, C. Chen, R. M. Pendyala, and R. Narayaran, “Micro-simulation of daily activity-travel patterns for travel demand forecasting”. In: Transportation, vol. 27, 2000, pp. 25–51.
[4] W. Zheng, X. Huang, and Y. Lic, “Understanding the tourist mobility using GPS: Where is the next place?”. In: Tourism Management, vol. 59, 2017, pp. 267–280.
[5] T. P. Thao, M. Irvan, R. Kobayashi, R. S. Yamaguchi, and T. Nakata, “Self-enhancing GPS-based authentication using corresponding address”. In: Data and Applications Security and Privacy XXXIV (DBSec’20), Lecture Notes in Computer Science, vol. 12122, Springer, Cham, 2020, pp. 333–344. DOI: https://doi.org/10.1007/978-3-030-49669-2_19
[6] J. Yuan, Y. Zheng, and X. Xie, “Discovering regions of different functions in a city using human mobility and POIs”. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’12), 2012, pp. 186–194.
[7] G. Huiji, T. Jiliang, H. Xia, and L. Huan, “Exploring temporal effects for location recommendation on location-based social networks”. In: Proceedings of the 7th ACM Conference on Recommender Systems (RecSys’13), 2013, pp. 93–100.

TABLE IV. EVALUATION RESULT
case                no.  variable  coef     SE      t      p           CI [0.025, 0.975]
Exclude outlier C1  1    mon       -20.37   32.36   -0.63  0.53         -84.72    43.97
                    2    tue        -8.38   41.38   -0.20  0.84         -90.65    73.89
                    3    wed       -16.20   39.87   -0.41  0.69         -95.48    63.07
                    4    thu                               (*)
                    5    fri       -85.02   34.67   -2.45  0.02 (*)    -153.95   -16.08
                    6    sat
                    7    sun       -24.26   27.93   -0.87  0.39         -79.80    31.27
                    8    wk1
                    9    wk2
                    10   wk3
                    11   wk4
                    12   natl       -6.28   14.36   -0.44  0.66         -34.83    22.27
                    13   wknd
Exclude outlier C2  1    mon       -30.81   32.21   -0.96  0.34         -94.85    33.23
                    2    tue        -5.68   40.62   -0.14  0.89         -86.45    75.08
                    3    wed       -14.43   38.46   -0.38  0.71         -90.89    62.04
                    4    thu
                    5    fri       -87.45   34.38   -2.54  0.01 (**)   -155.81   -19.09
                    6    sat
                    7    sun       -24.76   27.68   -0.90  0.37         -79.79    30.26
                    8    wk1
                    9    wk2                               (*)
                    10   wk3
                    11   wk4
                    12   natl      -14.40   15.55   -0.93  0.36         -45.32    16.53
                    13   wknd