STAN: Spatio-Temporal Attention Network for Next Location Recommendation
Yingtao Luo, Qiang Liu*, Zhaocheng Liu
University of Washington; Center for Research on Intelligent Perception and Computing, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Renmin University of China
[email protected], [email protected], [email protected]
ABSTRACT
Next location recommendation is at the core of various location-based applications. Current state-of-the-art models have attempted to solve spatial sparsity with hierarchical gridding and to model temporal relations with explicit time intervals, while some vital questions remain unsolved. Non-adjacent locations and non-consecutive visits provide non-trivial correlations for understanding a user's behavior but were rarely considered. To aggregate all relevant visits from the user trajectory and recall the most plausible candidates from weighted representations, here we propose a Spatio-Temporal Attention Network (STAN) for location recommendation. STAN explicitly exploits relative spatiotemporal information of all the check-ins with self-attention layers along the trajectory. This improvement allows point-to-point interaction between non-adjacent locations and non-consecutive check-ins with explicit spatio-temporal effect. STAN uses a bi-layer attention architecture that first aggregates spatiotemporal correlation within the user trajectory and then recalls the target with consideration of personalized item frequency (PIF). By visualization, we show that STAN is in line with the above intuition. Experimental results unequivocally show that our model outperforms the existing state-of-the-art methods by 9-17%.
CCS CONCEPTS
• Information systems → Location based services; Data mining; • Human-centered computing → Ubiquitous and mobile computing design and evaluation methods.

KEYWORDS
Point-of-Interest; recommendation; attention; spatiotemporal
ACM Reference Format:
Yingtao Luo, Qiang Liu, and Zhaocheng Liu. 2021. STAN: Spatio-Temporal Attention Network for Next Location Recommendation. In Proceedings of the Web Conference 2021 (WWW '21), April 19–23, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3442381.3449998

* Corresponding author.

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW '21, April 19–23, 2021, Ljubljana, Slovenia
© 2021 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-8312-7/21/04.
https://doi.org/10.1145/3442381.3449998
1 INTRODUCTION

Next Point-of-Interest (POI) recommendation has attracted intensive study in recent years owing to the growth of location-based services such as Yelp, Foursquare, and Uber. The large volume of historical check-in data gives service providers invaluable information for understanding user preferences on next movements, as historical trajectories reveal the user's behavioral pattern behind every decision. Meanwhile, such a system can also help users decide where to go and how to plan the day, based on previous visits as well as current status [5, 7, 9, 23, 41].

Previous approaches have extensively studied various aspects of personalized recommendation. Early models mainly focus on sequential transitions, such as Markov chains [26]. Later, recurrent neural networks (RNNs) with memory mechanisms improved recommendation precision, inspiring follow-up works [4, 11, 27, 45] to propose RNN variants that better extract the long-periodic and short-sequential features of user trajectories. Besides sequential regularities, researchers have exploited temporal and spatial relations to assist sequential recommendation [22]. Recent state-of-the-art models fed time intervals and/or spatial distances between two consecutive visits to explicitly represent the effect of the spatiotemporal gap between movements. Prior works have also addressed the sparsity of spatiotemporal information by discretizing time into hours and partitioning spatial areas with hierarchical grids [18, 34, 42]. Besides, they modified neural architectures [28, 35, 43] or stacked extra modules [2, 8, 27] to integrate this additional information.

While continuously upcoming novel models push forward our understanding of mobility prediction, several key problems remain unsolved. 1) First, the correlations between non-adjacent locations and non-contiguous visits have not been learned effectively.
A user's mobility may depend more on relevant locations visited a few days ago than on irrelevant locations visited just now. Moreover, it is not rare for a user to visit distant locations that are functionally relevant or similar. In the example shown in Figure 1, a user always dines at a certain restaurant near the workplace on Friday evening, goes to some shopping malls on Saturday morning, and dines at a random restaurant near a mall on Saturday evening. In this case, the user has de facto made two non-consecutive visits to non-adjacent restaurants, where the explicit spatial distances between home and the shopping malls and the explicit temporal interval between meals provide non-trivial information for predicting the exact location of Saturday dinner. However, most current models focus on spatial and/or temporal differences between the current and future steps while ignoring spatiotemporal
Figure 1: A trajectory example showing the relation between non-consecutive visits and non-adjacent locations. The map shows the spatial distribution of visited locations, numbered from 0 to 6. The timeline shows the temporal distribution of visited locations from Friday to Monday. Solid marks represent restaurants. Hollow marks 0, 1, 2 represent home, workplace, and shopping mall, respectively. Restaurants 3, 4, 5 and 6 are functionally relevant but temporally non-successive and spatially distanced.

correlation within the trajectory. 2) Second, the previously practiced hierarchical gridding for spatial discretization is insensitive to spatial distance. A gridding-based attention network aggregates neighboring locations but cannot perceive spatial distance: grids that are close to each other look no different from grids that are far apart, discarding a lot of spatial information. 3) Third, previous models extensively overlooked personalized item frequency (PIF) [12, 25, 30]. Repeated visits to the same place reflect its frequency, which signals the importance of the repeated locations and the possibility of the user revisiting. Previous RNN-based models and self-attention models can hardly reflect PIF, due to the memory mechanism and the normalization operation, respectively.

To this end, we propose STAN, a Spatio-Temporal Attention Network for next location recommendation. In STAN, we design a self-attention layer for aggregating important locations within the historical trajectory and another self-attention layer for recalling the most plausible candidates, both with consideration of a point-to-point explicit spatiotemporal effect. Self-attention layers can assign different weights to each visit within the trajectory, which overcomes the long-term dependency problem of the commonly used recurrent layers. The bi-layer system allows effective aggregation that considers PIF.
We employ linear interpolation for the embedding of the spatiotemporal transition matrix to address the sparsity problem; unlike GPS gridding, this method is sensitive to spatial distance. STAN can learn correlations between non-adjacent locations and non-contiguous visits because the spatiotemporal effects of all check-ins are fed into the model. Our code is available at https://github.com/yingtaoluo/Spatial-Temporal-Attention-Network-for-POI-Recommendation.

To summarize, our contributions are listed as follows:
• We propose STAN, a spatiotemporal bi-attention model, to fully consider the spatiotemporal effect for aggregating relevant locations. To the best of our knowledge, STAN is the first model in POI recommendation that explicitly incorporates spatiotemporal correlation to learn the regularities between non-adjacent locations and non-contiguous visits.
• We replace GPS gridding with a simple linear interpolation technique for spatial discretization, which recovers spatial distances and reflects user spatial preference instead of merely aggregating neighbors. We integrate this method into STAN for more accurate representation.
• We specifically propose a bi-attention architecture for PIF. The first layer aggregates relevant locations within the trajectory into an updated representation, so that the second layer can match the target against all check-ins, including repetitions.
• Experiments on four real-world datasets evaluate the performance of the proposed method. The results show that STAN outperforms state-of-the-art models in accuracy by more than 10%.
2 RELATED WORK

In this section, we briefly review work on sequential recommendation and next POI recommendation. Next POI recommendation can be viewed as a special sub-task of sequential recommendation with spatial information.
2.1 Sequential Recommendation

Sequential recommendation has mainly been modeled by two schools of models: Markov-based models and deep learning-based models. Markov-based models predict the probability of the next behavior via a transition matrix. Due to the sparsity of sequential data, a Markov model can hardly capture the transitions of intermittent visits. Matrix factorization models [15, 26] were proposed to approach this problem, with further extensions [3, 10] finding that explicit spatial and temporal information substantially improves recommendation performance. In general, Markov-based models mainly focus on the transition probability between two consecutive visits. Challenged by the flaws of Markov models, deep learning-based models arose to replace them. Among them, models based on RNNs [40] are representative and quickly developed into strong baselines. They have achieved satisfactory performance on a variety of tasks, such as session-based recommendation [11, 16], next-basket recommendation [37], and next-item recommendation [1, 44]. Meanwhile, time intervals between adjacent behaviors were incorporated into RNN-based recommendation models [20, 45] to better preserve the dynamic characteristics of user history. Besides RNNs, other deep learning methods have also been considered: metric embedding algorithms [5, 6], convolutional neural networks [28, 31, 39], reinforcement learning algorithms [24], and graph networks [33, 38] were proposed one after another for sequential recommendation. Recently, researchers have extensively used self-attention [29] for sequential recommendation, where a model named SASRec [14] was proposed. Based on SASRec, time intervals within the user sequence were considered [17, 36]. Moreover, as discussed in [12], Personalized Item Frequency (PIF) is very important for sequential recommendation. RNN-based sequential recommenders have been
proven unable to effectively capture PIF. In models based on self-attention, PIF is also hard to capture due to the normalization in attention modules: after normalization, the representation of previous history is reduced to a single vector of the embedding dimension, and matching each candidate with this representation can hardly reflect PIF information.
2.2 Next Location Recommendation

Most existing next POI recommendation models are based on RNNs. STRNN [22] uses temporal and spatial intervals between every two consecutive visits as explicit information to improve model performance, and has also been applied in public security evaluation [32]. SERM [35] jointly learns temporal and semantic contexts that reflect user preference. DeepMove [4] combines an attention layer for learning long-term periodicity with a recurrent layer for learning short-term sequential regularity, and learns from highly correlated trajectories. Regarding the use of spatiotemporal information in next location recommendation, many previous works only used explicit spatiotemporal intervals between two successive visits in a recurrent layer. STRNN [22] directly uses spatiotemporal intervals between successive visits in a recurrent neural network. Time-LSTM [45] adds time gates to the LSTM structure to better adapt to the spatiotemporal effect. STGN [43] further enhances the LSTM structure by adding spatiotemporal gates. ATST-LSTM [13] uses an attention mechanism to help the LSTM assign different weights to each check-in; it begins to use attention but still only considers successive visits. LSTPM [27] proposes a geo-dilated RNN that aggregates recently visited locations, but only for short-term preference. Inspired by sequential item recommendation [14], GeoSAN [18] uses a self-attention model for next location recommendation that allows point-to-point interaction within the trajectory. However, GeoSAN ignores the explicit modeling of time intervals and spatial distances, as its gridding method for spatial discretization cannot capture exact distances well. In other words, all previous methods have failed to effectively consider the non-trivial correlations between non-adjacent locations and non-contiguous visits; moreover, these models also have problems modeling PIF information.
3 PRELIMINARIES

In this section, we give problem formulations and term definitions. We denote the sets of users, locations, and times as U = {u_1, u_2, ..., u_|U|}, L = {l_1, l_2, ..., l_|L|}, and T = {t_1, t_2, ..., t_|T|}, respectively.

Historical Trajectory. The trajectory of user u_i is its temporally ordered check-ins. Each check-in r_k within the trajectory of user u_i is a tuple (u_i, l_k, t_k), in which l_k is the location and t_k is the timestamp. Each user may have a variable-length trajectory tra(u_i) = {r_1, r_2, ..., r_{m_i}}. We transform each trajectory into a fixed-length sequence seq(u_i) = {r_1, r_2, ..., r_n}, with n as the maximum length we consider. If n < m_i, we only consider the most recent n check-ins; if n > m_i, we pad zeros to the right until the sequence length is n and mask off the padding items during calculation.

Trajectory Spatio-Temporal Relation Matrix. We model time intervals and geographical distances as the explicit spatio-temporal relation between two visited locations. We denote the temporal interval between the i-th and j-th visits as Δt_ij = |t_i − t_j|, and the spatial distance between the GPS locations of the i-th and j-th visits as Δs_ij = Haversine(GPS_i, GPS_j). The trajectory temporal relation matrix Δt ∈ R^{n×n} and the trajectory spatial relation matrix Δs ∈ R^{n×n} are separately represented as:

    Δ^{t,s} = [ Δ^{t,s}_ij ],  i, j = 1, ..., n    (1)

Candidate Spatio-Temporal Relation Matrix. Besides the internal explicit relation, we also consider a candidate spatiotemporal matrix. It calculates the distance between each location candidate i ∈ [1, L] and the location of each check-in j ∈ [1, n] as N^s_ij = Haversine(GPS_i, GPS_j), and represents the time intervals between t_{m+1} and {t_1, t_2, ..., t_m}, repeated L times to expand into 2D, as N^t_ij = |t_{m+1} − t_j|.
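As a concrete illustration, the two pairwise relation matrices above can be built directly from timestamps and GPS coordinates. Below is a minimal NumPy sketch; the function names and the kilometer-based Haversine formula are our own choices, not taken from the paper's released code:

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two GPS points (broadcastable)."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dp = p2 - p1
    dl = np.radians(lon2) - np.radians(lon1)
    a = np.sin(dp / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dl / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

def relation_matrices(times, coords):
    """Trajectory relation matrices of Eq. (1): pairwise time intervals
    |t_i - t_j| and pairwise haversine distances for n check-ins."""
    t = np.asarray(times, dtype=float)
    dt = np.abs(t[:, None] - t[None, :])            # (n, n) temporal intervals
    lat, lon = np.asarray(coords, dtype=float).T    # split (n, 2) into lat, lon
    ds = haversine(lat[:, None], lon[:, None], lat[None, :], lon[None, :])
    return dt, ds
```

The candidate matrix N^{t,s} follows the same pattern, pairing the L candidates against the n check-ins instead of check-ins against each other.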
The candidate spatial relation matrix N^s ∈ R^{L×n} and the candidate temporal relation matrix N^t ∈ R^{L×n} are separately represented as:

    N^{t,s} = [ N^{t,s}_ij ],  i = 1, ..., L,  j = 1, ..., n    (2)

Mobility Prediction. Given the user trajectory (r_1, r_2, ..., r_m), the location candidates L = {l_1, l_2, ..., l_|L|}, the trajectory spatio-temporal relation matrix Δ^{t,s}, and the candidate spatio-temporal matrix N^{t,s}, our goal is to find the location l of the next check-in r_{m+1}.

4 PROPOSED MODEL

Our proposed
Spatio-Temporal Attention Network (STAN) consists of: 1) a multimodal embedding module that learns dense representations of user, location, time, and the spatiotemporal effect; 2) a self-attention aggregation layer that aggregates important relevant locations within the user trajectory to update the representation of each check-in; 3) an attention matching layer that computes, from the weighted check-in representations, a softmax probability for each location candidate to be the next location; 4) a balanced sampler that uses one positive sample and several negative samples to compute the cross-entropy loss. The neural architecture of the proposed STAN is shown in
Figure 2.

4.1 Multimodal Embedding Module

The multimodal embedding module consists of two parts: a trajectory embedding layer and a spatio-temporal embedding layer.
Figure 2: The architecture of the proposed STAN model. [diagram omitted: the historical trajectory (u_i, l_k, t_k) and GPS locations (l_k, lon_k, lat_k) pass through the multimodal embedding; the attention aggregation layer uses E(Δt) and E(Δs); the attention matching layer scores the candidates using E(Nt) and E(Ns); a balanced sampler with one positive and s negative samples produces the matching loss]
A multimodal embedding layer encodes user, location, and time into latent representations, denoted e^u ∈ R^d, e^l ∈ R^d, and e^t ∈ R^d, respectively. The embedding module is incorporated into the other modules to transform scalars into dense vectors, reducing computation and improving representation. Here, the continuous timestamp is discretized into 7 × 24 = 168 hours of the week, so the numbers of embeddings for e^u, e^l, and e^t are |U|, |L|, and 168, respectively. The output of the user trajectory embedding layer for each check-in r is the sum e^r = e^u + e^l + e^t ∈ R^d. The embedding of each user sequence seq(u_i) = {r_1, r_2, ..., r_n} is denoted E(u_i) = {e^{r_1}, e^{r_2}, ..., e^{r_n}} ∈ R^{n×d}.

A unit embedding layer provides dense representations of the spatial and temporal differences, with one hour and one hundred meters as basic units, respectively. Recall that if we regarded the maximum space or time interval as the number of embeddings and discretized all intervals, the relation encoding could easily become sparse. Instead, this layer multiplies the space and time intervals by unit embedding vectors e_{Δs} and e_{Δt}, respectively. The unit embedding vectors reflect the continuous spatiotemporal context through the basic units and avoid sparse encoding thanks to the dense dimensions. In particular, this distance-sensitive technique can replace the hierarchical gridding method, which only aggregates adjacent locations and cannot represent spatial distance. Mathematically, the spatiotemporal difference embedding e_{Δ,ij} ∈ R^d is:

    e_{Δt,ij} = Δt_ij × e_{Δt},    e_{Δs,ij} = Δs_ij × e_{Δs}    (3)

Inspired by [19, 21, 22], we may also consider an alternative interpolation embedding layer that sets an upper-bound unit embedding vector and a lower-bound unit embedding vector and represents the explicit intervals by linear interpolation, which is an approximation to the unit embedding layer.
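The unit embedding of Eq. (3) is simply a linear scaling of two learnable vectors by the explicit intervals, so distance information is preserved linearly rather than binned away. A minimal NumPy sketch, with random stand-ins for the learnable unit vectors (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50                      # embedding dimension
e_t_unit = rng.normal(size=d)   # learnable temporal unit vector (random stand-in)
e_s_unit = rng.normal(size=d)   # learnable spatial unit vector (random stand-in)

def unit_embed(dt_hours, ds_hundred_m):
    """Eq. (3): scale the unit vectors by the explicit intervals.

    dt_hours, ds_hundred_m: (n, n) interval matrices in basic units
    (hours and hundreds of meters); returns two (n, n, d) embeddings."""
    return dt_hours[..., None] * e_t_unit, ds_hundred_m[..., None] * e_s_unit

# Tiny example: two check-ins 5 hours and 250 meters apart.
dt = np.array([[0.0, 5.0], [5.0, 0.0]])
ds = np.array([[0.0, 2.5], [2.5, 0.0]])
E_dt, E_ds = unit_embed(dt, ds)
```

Here E_dt[0, 1] equals 5.0 × e_t_unit, so a pair twice as far apart in time gets an embedding exactly twice as long in the same direction.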
In experiments, the two methods have similar efficiency. The interpolation embedding is calculated as:

    e_{Δt,ij} = [ e^{sup}_{Δt} (Upper(Δt) − Δt) + e^{inf}_{Δt} (Δt − Lower(Δt)) ] / (Upper(Δt) − Lower(Δt))
    e_{Δs,ij} = [ e^{sup}_{Δs} (Upper(Δs) − Δs) + e^{inf}_{Δs} (Δs − Lower(Δs)) ] / (Upper(Δs) − Lower(Δs))    (4)

This layer processes two matrices, the trajectory spatio-temporal relation matrix and the candidate spatio-temporal relation matrix, as described in the preliminaries. Their embeddings are E(Δt) ∈ R^{n×n×d}, E(Δs) ∈ R^{n×n×d}, E(Nt) ∈ R^{L×n×d}, and E(Ns) ∈ R^{L×n×d}. We take a weighted sum over the last dimension and add the spatial and temporal embeddings together:

    E(Δ) = Sum(E(Δt)) + Sum(E(Δs)) ∈ R^{n×n},    E(N) = Sum(E(Nt)) + Sum(E(Ns)) ∈ R^{L×n}    (5)

4.2 Self-Attention Aggregation Layer

Inspired by self-attention mechanisms, we propose an extended module that considers the different spatial distances and time intervals between any two visits in a trajectory. This module aggregates relevant visited locations and updates the representation of each visit. A self-attention layer can capture long-term dependency and assign different weights to each visit within the trajectory; this point-to-point interaction lets the layer assign more weight to relevant visits, and the explicit spatio-temporal intervals can easily be incorporated into the interaction. Given the embedded user trajectory matrix E(u) with non-padding length m′ and the spatio-temporal relation matrix embedding E(Δ), this layer first constructs a mask matrix M ∈ R^{n×n} whose upper-left m′ × m′ elements are ones and all other elements are zeros.
Then the layer computes a new sequence S after converting the inputs through distinct parameter matrices W_Q, W_K, W_V ∈ R^{d×d}:

    S(u) = Attention(E(u) W_Q, E(u) W_K, E(u) W_V, E(Δ), M)    (6)

with

    Attention(Q, K, V, Δ, M) = ( M ∗ softmax( (Q K^T + Δ) / √d ) ) V    (7)
Here, only the mask and the softmax attention are multiplied element-wise, while the rest use matrix multiplication. For causality, only the first m′ visits in the trajectory are fed into the model when predicting the (m′ + 1)-st location. Therefore, during training, we use every m′ ∈ [1, m] to mask the input sequence, paired with the corresponding label. We obtain S(u) ∈ R^{n×d} as the updated representation of the user trajectory. An alternative implementation is to feed the explicit spatio-temporal intervals into both E(u) W_K and E(u) W_V, as TiSASRec [17] did; in experiments, the two methods have similar performance, and our implementation is more concise, using only matrix multiplication instead of element-wise calculation.

4.3 Attention Matching Layer

This module recalls the most plausible candidates from all L locations by matching them with the updated representation of the user trajectory. Given the updated trajectory representation S(u) ∈ R^{n×d}, the embedded location candidates E(l) = {e^{l_1}, e^{l_2}, ..., e^{l_L}} ∈ R^{L×d}, and the embedding of the candidate spatio-temporal relation matrix E(N) ∈ R^{L×n}, this layer computes the probability of each location candidate being the next location as:

    A(u) = Matching(E(l), S(u), E(N))    (8)

with

    Matching(Q, K, N) = Sum( softmax( (Q K^T + N) / √d ) )    (9)

Here, the Sum operation is a weighted sum over the last dimension, converting the dimension of A(u) to R^L. In Eq. (8), the updated representations of all check-ins participate in the matching of each candidate location, unlike other self-attention models that reduce the PIF information. This is due to the bi-layer design that first aggregates relevant locations and then recalls from the representations with consideration of PIF.
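Eqs. (6)-(9) can be sketched together in a few lines of NumPy. The weights below are random stand-ins for learned parameters, and we read the Sum of Eq. (9) as a weighted sum with a learnable weight vector over the n check-ins; both are our assumptions rather than details confirmed by the paper's released code:

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return z / z.sum(axis=axis, keepdims=True)

def st_attention(E_u, delta, mask, Wq, Wk, Wv):
    """Eqs. (6)-(7): self-attention with the spatio-temporal bias delta (n, n)
    added to QK^T; mask zeroes out attention beyond the m' real visits."""
    d = E_u.shape[-1]
    q, k, v = E_u @ Wq, E_u @ Wk, E_u @ Wv
    w = mask * softmax((q @ k.T + delta) / np.sqrt(d))  # element-wise mask
    return w @ v                                        # (n, d) updated check-ins

def st_matching(E_l, S_u, N, w_agg):
    """Eqs. (8)-(9): match all L candidates against the n updated check-ins,
    then take a weighted sum (weights w_agg) over the trajectory axis, so
    repeated visits to the same place accumulate (PIF)."""
    d = S_u.shape[-1]
    scores = softmax((E_l @ S_u.T + N) / np.sqrt(d))    # (L, n)
    return scores @ w_agg                               # (L,) candidate scores
```

Because every check-in contributes its own column of scores, a location visited three times contributes three terms to its candidate score, which is how the bi-layer design preserves PIF.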
4.4 Balanced Sampler

Due to the unbalanced scale of positive and negative samples in A(u), optimizing the ordinary cross-entropy loss is no longer efficient, as the loss places little weight on the momentum that pushes forward the correct prediction. It is then normal to observe that as the loss goes down, the recall rate also goes down. Given user i's sequence seq(u_i), the matching probability of each candidate location a_j ∈ A(u_i) for j ∈ [1, L], and the label l_k with index k in the location set L, the ordinary cross-entropy loss is written as:

    − Σ_i Σ_{m_i} [ log σ(a_k) + Σ_{j=1, j≠k}^{L} log(1 − σ(a_j)) ]    (10)

In this form, for every positive sample a_k, we need to compute L − 1 negative terms, which is inefficient when L is large.

Table 1: Basic dataset statistics.
Gowalla  TKY  SIN  NYC

Instead, the balanced sampler randomly selects s negative samples for each positive sample:

    − Σ_i Σ_{m_i} [ log σ(a_k) + Σ_{(j_1, j_2, ..., j_s) ∈ [1, L], ≠ k} log(1 − σ(a_j)) ]    (11)

5 EXPERIMENTS

In this section, we present empirical results for a fair quantitative comparison with other models: a table of datasets, a table of recommendation performance evaluated by top-k recall rates, figures of model stability, and a visualization of the attention weights in STAN aggregation.

5.1 Datasets

We evaluate our proposed STAN model on four real-world datasets: Gowalla, SIN, TKY, and NYC. The numbers of users, locations, and check-ins in each dataset are shown in Table 1. In experiments, we use the original raw datasets, which contain only the GPS coordinates of each location and the user check-in records, and pre-process them following each baseline's protocol. Regarding pre-processing, many previous works used trajectories sliced with a fixed-length window or a maximum time interval; we follow each work's setup, although this could prevent the model from learning long-term dependency. For each user with m check-ins, we divide the data into training, validation, and test sets. The training set has m − 3 samples, with the first m′ ∈ [1, m − 3] check-ins as the input sequence and the (m′ + 1)-st visited location as the label; the validation set uses the first m − 2 check-ins as input and the (m − 1)-st visited location as the label; the test set uses the first m − 1 check-ins as input and the m-th visited location as the label. This split follows causality: no future data is used in the prediction of future data.

5.2 Baselines

We compare our STAN with the following baselines:
• STRNN [22]: an RNN variant that incorporates spatio-temporal features between consecutive visits.
• DeepMove [4]: a state-of-the-art model with recurrent and attention layers to capture periodicity.

(Gowalla dataset: http://snap.stanford.edu/data/loc-gowalla.html)
Table 2: Recommendation performance comparison with baselines.
            Gowalla              TKY                  SIN                  NYC
            Recall@5  Recall@10  Recall@5  Recall@10  Recall@5  Recall@10  Recall@5  Recall@10
STRNN       0.1664    0.2567     0.1836    0.2791     0.1791    0.2016     0.2365    0.2802
DeepMove    0.1959    0.2699     0.2684    0.3509     0.2389    0.3155     0.3268    0.4014
STGN        0.1528    0.2422     0.1940    0.2710     0.2292    0.2727     0.2439    0.3015
ARNN        0.1810    0.2745     0.1852    0.2696     0.1817    0.2538     0.1970    0.3483
LSTPM       0.2015    0.2701     0.2568    0.3310     0.2579    0.3327     0.2791    0.3564
TiSASRec    0.2411    0.3546     0.3031    0.3693     0.2963    0.3753     0.3664    0.5020
GeoSAN      0.2764    0.3645     0.2957    0.3740     0.3397    0.3943     0.4006    0.5267
STAN
Improvement 9.12%     9.68%      17.04%    14.01%     10.42%    9.08%      16.55%    13.20%

• STGN [43]: a state-of-the-art model that adds time and distance interval gates to LSTM.
• ARNN [8]: a state-of-the-art model that uses semantic and spatial information to construct a knowledge graph and improve the performance of a sequential LSTM model.
• LSTPM [27]: a state-of-the-art model that combines long-term and short-term sequential models for recommendation.
• TiSASRec [17]: a state-of-the-art model that uses self-attention layers with explicit time intervals for sequential recommendation, but no spatial information.
• GeoSAN [18]: a state-of-the-art model that uses hierarchical gridding of GPS locations for spatial discretization and self-attention layers for matching, without explicit spatio-temporal intervals.
5.3 Evaluation Metrics

We adopt the top-k recall rates Recall@5 and Recall@10 to evaluate recommendation performance. Recall@k counts the rate of true positive samples among all positive samples, which in our case means the rate at which the label appears among the top-k probability samples. For evaluation, we drop the balanced sampler module and directly recall the target from A, the output of the attention matching layer. The larger the Recall@k, the better the performance.

5.4 Settings

There are two kinds of hyperparameters: (i) common hyperparameters shared by all models; (ii) unique hyperparameters that depend on each model's framework. We tune the common hyperparameters on a simple recurrent neural network and then apply them to all models, which reduces the training burden. We set the embedding dimension d to 50 for the TKY, SIN, and NYC datasets and to 10 for the Gowalla dataset. We use the Adam optimizer with default betas, a learning rate of 0.003, a dropout rate of 0.2, 50 training epochs, and a maximum trajectory sequence length of 100. Fixing these common hyperparameters, we fine-tune the unique hyperparameters of each model. In our model, the number of negative samples in the balanced sampler is optimal at 10.

5.5 Recommendation Performance

Table 2 shows the recommendation performance of our model and the baselines on the four datasets. All differences between methods are statistically significant (p < 0.01): using a t-test on performance averaged over 10 runs, we reject the null hypothesis, so the improvement of STAN is statistically significant. Our model unequivocally outperforms all compared models with a 9%-17% improvement in recall rates. We show in Figures 3 and 4 that the model is stable under hyperparameter tuning. Among the baselines, self-attention models such as TiSASRec and GeoSAN clearly outperform the RNN-based models.
This is not a surprise, since previous RNN-based models often use sliced short trajectories instead of long ones, which discards long-term periodicity and can hardly capture the exact influence of each visit on the next movement. Note that we do not use any semantic information to construct a knowledge graph for meta-paths in ARNN, as semantic analysis is not performed by the other baselines in the comparison. Among the RNN-based models, LSTPM and DeepMove perform relatively better, due to their consideration of periodicity. Among the self-attention models, TiSASRec uses temporal intervals and GeoSAN considers geographical partitions. Only STAN fully considers the spatio-temporal intervals within the sequences to model non-consecutive visits and non-adjacent locations, and modifies the attention architecture to adapt to PIF instead of inheriting the transformer [29] structure directly. In addition, because STRNN and TiSASRec both use temporal intervals, comparing their performances evaluates the improvement provided by self-attention modules over recurrent layers. We can also refer to
Table 3, where the −ALL model is a variant of STAN without spatio-temporal intervals and the balanced sampler. The −ALL model differs from ordinary self-attention models only in the bi-layer system, which considers PIF information. −ALL performs slightly worse than GeoSAN on the recall rates of the four datasets, but slightly better than TiSASRec and much better than the RNN-based models. This tells us that the bi-layer system considering PIF is approximately as important as the time intervals incorporated into the attention systems.
5.6 Ablation Study

To analyze the different modules of our model, we conduct an ablation study in this section. We denote the base model as STAN, with
Table 3: Ablation Analysis, in which we compare different modules in STAN.
STAN       Gowalla              TKY                  SIN                  NYC
           Recall@5  Recall@10  Recall@5  Recall@10  Recall@5  Recall@10  Recall@5  Recall@10
-TIM-BS    0.2835    0.3718     0.3006    0.3819     0.3416    0.3873     0.4126    0.5245
-EWTI-BS   0.2794    0.3717     0.3052    0.3781     0.3404    0.3890     0.4083    0.5272
-TIM       0.2946    0.3925     0.3315    0.4099     0.3643    0.4176     0.4495    0.5814
-SIM-BS    0.2823    0.3729     0.3123    0.3865     0.3337    0.3901     0.4126    0.5299
-EWSI-BS   0.2812    0.3724     0.3132    0.3794     0.3313    0.3916     0.4124    0.5277
-SIM       0.2977    0.3908     0.3405    0.4141     0.3636    0.4165     0.4502    0.5860
-ALL       0.2645    0.3531     0.2867    0.3660     0.3239    0.3776     0.3896    0.5094
Figure 3: Impact of embedding dimension. Four panels plot Recall against the embedding dimension (10 to 60): (a) Gowalla, (b) TKY, (c) SIN, (d) NYC.
spatio-temporal intervals and a balanced sampler. We drop different components to form variants. The components are listed as:
• SIM (Spatial Intervals in Matrix): This denotes the explicit spatial intervals we use within the trajectory as a matrix.
• EWSI (Element-Wise Spatial Intervals): This denotes the element-wise spatial intervals following the structure of TiSASRec [17].
• TIM (Temporal Intervals in Matrix): This denotes the explicit temporal intervals we use within the trajectory as a matrix.
• EWTI (Element-Wise Temporal Intervals): This denotes the element-wise temporal intervals following the structure of TiSASRec [17].
• BS (Balanced Sampler): The balanced sampler for calculating loss.
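To make the SIM and TIM components concrete, the explicit interval matrices can be built directly from one check-in sequence. The function name, the haversine helper, and the units below are our own illustration, not the authors' released code; it is a minimal sketch of what an entry-(i, j) interval matrix computes:

```python
import numpy as np

def interval_matrices(lats, lons, timestamps):
    """Build explicit spatial (km) and temporal (hours) interval matrices
    for one trajectory. Entry (i, j) holds the interval between the i-th
    and j-th check-ins, so non-adjacent pairs get explicit values too."""
    lat = np.radians(np.asarray(lats))
    lon = np.radians(np.asarray(lons))
    # Haversine great-circle distance between every pair of check-ins.
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat[:, None]) * np.cos(lat[None, :]) * np.sin(dlon / 2) ** 2)
    spatial = 2 * 6371.0 * np.arcsin(np.sqrt(a))           # kilometres
    t = np.asarray(timestamps, dtype=float)                # seconds
    temporal = np.abs(t[:, None] - t[None, :]) / 3600.0    # hours
    return spatial, temporal

# Two check-ins roughly 2 km apart, one day apart in time.
sp, tm = interval_matrices([46.05, 46.06], [14.50, 14.52], [0, 86400])
```

Both matrices are symmetric with zero diagonals; a model can embed or linearly interpolate these continuous values instead of discretizing them with hierarchical grids.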
Table 3 shows the results of the ablation study. We find that a balanced sampler is crucial for improving the recommendation performance, providing a nearly 5-12% increase in recall rates. Spatial and temporal intervals can explicitly express the correlation between non-consecutive visits and non-adjacent locations. Adding spatial distances and temporal intervals each provides a nearly 4-8% increase in recall rates. We also find that our method of introducing spatio-temporal correlations is equivalent to the method used in TiSASRec [17], while our method is easier to implement and computationally convenient due to its matrix form. The worst condition is that neither the spatio-temporal intervals nor the balanced
Figure 4: Impact of number of negative samples. Four panels plot Recall against the number of negative samples: (a) Gowalla, (b) TKY, (c) SIN, (d) NYC.
sampler is used, in which Recall@5 and Recall@10 decrease drastically. Even so, this -ALL ablated model still outperforms previously reported RNN-based models such as DeepMove, STRNN, and STGN. The -ALL model with the bi-layer system can still consider PIF information, which explains why -ALL retains a better performance over TiSASRec and RNN-based models. This tells us that the bi-layer system which considers PIF is as important as the time intervals incorporated into self-attention systems.
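A toy sketch may help explain why a bi-layer matching stage preserves PIF. This is not the authors' exact recall layer; the function name, scoring rule, and toy embeddings below are our own illustration. The key point is that each candidate is matched against all weighted check-in representations and the matched terms are summed, so a location visited k times contributes k terms, unlike matching against a single final hidden state:

```python
import numpy as np

def recall_scores(candidates, trajectory):
    """Score each candidate location against every check-in representation
    in the trajectory and SUM the matches. Summing (rather than taking only
    the last state) lets repeated visits accumulate, retaining personalized
    item frequency (PIF)."""
    logits = candidates @ trajectory.T / np.sqrt(candidates.shape[-1])
    return logits.sum(axis=-1)

traj = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # location A visited twice, B once
cands = np.array([[1.0, 0.0], [0.0, 1.0]])             # candidate embeddings for A and B
scores = recall_scores(cands, traj)                    # A outscores B due to frequency
```

With equally strong matches per visit, the twice-visited location receives twice the score, which is exactly the frequency signal an ordinary softmax-normalized attention readout would wash out.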
We vary the dimension of embedding 𝑑 in the multimodal embedding module from 10 to 60 with step 10. Figure 3 shows that 𝑑 = 50 is the best dimension for trajectory and spatio-temporal embedding. In general, the recommendation performance of our model is insensitive to the hyperparameter 𝑑, with less than a 6% change rate for the Gowalla dataset and less than a 2% change rate for the other datasets. As long as 𝑑 is larger than 30, the change in recommendation performance is less than 0.5%, which can be ignored. We experiment with a series of numbers of negative samples 𝑠 = [ , , , , , ] in the balanced sampler. Figure 4 shows that numbers of negative samples less than 20 can all produce stable recommendations for all datasets. STAN is specifically insensitive to the number of negative samples
for the Gowalla dataset, which has as many as 121,944 locations. This indicates that the larger the dataset, the larger the optimal number of negative samples. As the number of negative samples increases, the balanced loss tends to the ordinary cross-entropy loss. In
Table 3, we found that the balanced sampler is crucial for improving recommendation performance. If the number of negative samples is above the threshold, the recall rate will drop drastically.
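The limiting behavior noted above (the balanced loss tending to ordinary cross-entropy as negatives grow) can be seen in a small sketch. This is our own illustration of a sampled-softmax-style loss over one target plus 𝑠 drawn negatives, not the paper's exact sampler; the function name and toy scores are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_softmax_loss(scores, target, num_negatives):
    """Cross-entropy over the target plus `num_negatives` sampled
    negatives. When num_negatives reaches the full item count, this
    equals the ordinary cross-entropy over all locations."""
    num_items = scores.shape[0]
    negatives = rng.choice(np.delete(np.arange(num_items), target),
                           size=num_negatives, replace=False)
    idx = np.concatenate(([target], negatives))
    logits = scores[idx]
    logits = logits - logits.max()                   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]                             # target sits at index 0

scores = np.array([2.0, 0.5, 0.1, -1.0, 0.3])        # toy scores over 5 locations
loss_small = sampled_softmax_loss(scores, target=0, num_negatives=2)
loss_full = sampled_softmax_loss(scores, target=0, num_negatives=4)
```

With all four negatives included, `loss_full` coincides with the full cross-entropy log(Σ exp(scores)) − scores[target], matching the limiting behavior described in the text.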
To understand the mechanism of STAN, the aggregation of non-consecutive visits and non-adjacent locations performed by the self-attention aggregation layer is at the core. We visualize the correlation matrix 𝐶𝑜𝑟 of the attention weights in Figure 5. Each element 𝐶𝑜𝑟_{𝑖,𝑗} of the matrix represents the weighted influence of the 𝑗-th visited location on the 𝑖-th visited location. The correlation matrix is calculated as the softmax of the multiplication of query and key in the self-attention aggregation layer. The value of each element in this correlation matrix tends to either 1 or 0, as a result of the softmax operation. Multiplying the original check-in embeddings by the correlation matrix, we can update the representations of the trajectory. Figure 5 is based on a slice of a real user trajectory example that is discussed in the Introduction section and Figure 1. Here, different locations are classified and named by numbers from 0 to 6. By querying the exact GPS coordinates, we find that locations 0, 1, and 2 are home, workplace, and a shopping mall, respectively. Locations 3, 4, 5 and 6 are restaurants. Figure 5(a) shows the spatial correlation of visited locations that is attained by Figure 5(b), where locations with the yellow-colored marks and locations within the range of the same dark circles are aggregated together. This shows that not only adjacent locations but also non-adjacent locations are correlated. Locations 3, 4, 5 and 6 are all restaurants and are often visited at meal times. We can tell from the correlation matrix that they are relevant, despite being spatially distant. The temporal order of this trajectory example is shown in the timeline of Figure 1. This is a sliced sparse trajectory, as we cut off the irrelevant visits to focus on the correlation of restaurants. The times and order in which these restaurants are visited are not consecutive, but they are still aggregated together. This gives evidence that temporally non-consecutive visits may be correlated. Both pieces of evidence, in space and time, demonstrate our motivation.
In this work, we propose a spatio-temporal attention network, abbreviated as STAN. We use a real trajectory example to illustrate the functional relevance between non-adjacent locations and non-consecutive visits, and propose to learn the explicit spatio-temporal correlations within the trajectory using a bi-attention system. This architecture first aggregates spatio-temporal intervals within the trajectory and then recalls the target. Because all the representations of the trajectory are weighted, the recall of the target fully considers the effect of personalized item frequency (PIF). We propose a balanced sampler for calculating the cross-entropy loss, which outperforms the commonly practiced binary and/or ordinary cross-entropy loss. We perform a comprehensive ablation study, stability study, and interpretability study in the experimental section. We prove an improvement of recall rates by the proposed
Figure 5: The mechanism of STAN. (a) An example map showing the aggregation of visited locations. The locations with the same colored marks and locations within the range of the same dark circles are aggregated. This gives solid evidence that non-adjacent locations may be correlated and aggregated in our model. (b) The correlation matrix. Here, we take the softmax of the multiplication of query and key in the self-attention aggregation layer as a correlation matrix, which is used to update the representation of check-ins.
components and very robust stability against hyperparameter variation. We also propose to replace the hierarchical gridding method for spatial discretization with a simple linear interpolation technique, which can reflect continuous spatial distance while providing dense representation. Experimental comparison with baseline models unequivocally demonstrates the superiority of our model, as STAN improves recall rates to new records that surpass the state-of-the-art models by 9-17%.
ACKNOWLEDGMENTS
This work is supported by the National Key Research and Development Program (2018YFB1402605, 2018YFB1402600), the National Natural Science Foundation of China (U19B2038, 61772528), and the Beijing National Natural Science Foundation (4182066).
REFERENCES
[1] Xu Chen, Hongteng Xu, Yongfeng Zhang, Jiaxi Tang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2018. Sequential recommendation with user memory networks. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 108–116.
[2] Yile Chen, Cheng Long, Gao Cong, and Chenliang Li. 2020. Context-aware deep model for joint mobility and time prediction. In Proceedings of the 13th International Conference on Web Search and Data Mining. 106–114.
[3] Chen Cheng, Haiqin Yang, Michael R. Lyu, and Irwin King. 2013. Where You Like to Go Next: Successive Point-of-Interest Recommendation. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (Beijing, China) (IJCAI '13). AAAI Press, 2605–2611.
[4] Jie Feng, Yong Li, Chao Zhang, Funing Sun, Fanchao Meng, Ang Guo, and Depeng Jin. 2018. DeepMove: Predicting human mobility with attentional recurrent networks. In Proceedings of the 2018 World Wide Web Conference. 1459–1468.
[5] Shanshan Feng, Xutao Li, Yifeng Zeng, Gao Cong, and Yeow Meng Chee. 2015. Personalized ranking metric embedding for next new POI recommendation. In IJCAI'15 Proceedings of the 24th International Conference on Artificial Intelligence. ACM, 2069–2075.
[6] Shanshan Feng, Lucas Vinh Tran, Gao Cong, Lisi Chen, Jing Li, and Fan Li. 2020. HME: A Hyperbolic Metric Embedding Approach for Next-POI Recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1429–1438.
[7] Huiji Gao, Jiliang Tang, Xia Hu, and Huan Liu. 2013. Exploring Temporal Effects for Location Recommendation on Location-Based Social Networks. In Proceedings of the 7th ACM Conference on Recommender Systems (Hong Kong, China). Association for Computing Machinery, New York, NY, USA, 93–100.
[8] Qing Guo, Zhu Sun, Jie Zhang, and Yin-Leng Theng. 2020. An Attentional Recurrent Neural Network for Personalized Next Location Recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 83–90.
[9] Peng Han, Zhongxiao Li, Yong Liu, Peilin Zhao, Jing Li, Hao Wang, and Shuo Shang. 2020. Contextualized Point-of-Interest Recommendation. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, Christian Bessiere (Ed.). International Joint Conferences on Artificial Intelligence Organization, 2484–2490. https://doi.org/10.24963/ijcai.2020/344
[10] Ruining He and Julian McAuley. 2016. Fusing similarity models with Markov chains for sparse sequential recommendation. IEEE, 191–200.
[11] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
[12] Haoji Hu, Xiangnan He, Jinyang Gao, and Zhi-Li Zhang. 2020. Modeling Personalized Item Frequency Information for Next-basket Recommendation. arXiv preprint arXiv:2006.00556 (2020).
[13] Liwei Huang, Yutao Ma, Shibo Wang, and Yanbo Liu. 2019. An attention-based spatiotemporal LSTM network for next POI recommendation. IEEE Transactions on Services Computing (2019).
[14] Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. IEEE, 197–206.
[15] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
[16] Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. 2017. Neural attentive session-based recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 1419–1428.
[17] Jiacheng Li, Yujie Wang, and Julian McAuley. 2020. Time Interval Aware Self-Attention for Sequential Recommendation. In Proceedings of the 13th International Conference on Web Search and Data Mining. 322–330.
[18] Defu Lian, Yongji Wu, Yong Ge, Xing Xie, and Enhong Chen. 2020. Geography-Aware Sequential Location Recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2009–2019.
[19] Qiang Liu, Zhaocheng Liu, and Haoli Zhang. 2020. An empirical study on feature discretization. arXiv preprint arXiv:2004.12602 (2020).
[20] Qiang Liu, Shu Wu, Diyi Wang, Zhaokang Li, and Liang Wang. 2016. Context-aware sequential recommendation. IEEE, 1053–1058.
[21] Qiang Liu, Shu Wu, and Liang Wang. 2017. Multi-behavioral sequential prediction with recurrent log-bilinear model. IEEE Transactions on Knowledge and Data Engineering 29, 6 (2017), 1254–1267.
[22] Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan. 2016. Predicting the Next Location: A Recurrent Model with Spatial and Temporal Contexts. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (Phoenix, Arizona) (AAAI '16). AAAI Press, 194–200.
[23] Yong Liu, Wei Wei, Aixin Sun, and Chunyan Miao. 2014. Exploiting geographical neighborhood characteristics for location recommendation. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. 739–748.
[24] David Massimo and Francesco Ricci. 2018. Harnessing a generalised user behaviour model for next-POI recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems. 402–406.
[25] Pengjie Ren, Zhumin Chen, Jing Li, Zhaochun Ren, Jun Ma, and Maarten de Rijke. 2019. RepeatNet: A repeat aware neural recommendation machine for session-based recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 4806–4813.
[26] Steffen Rendle. 2010. Factorization machines. IEEE, 995–1000.
[27] Ke Sun, Tieyun Qian, Tong Chen, Yile Liang, Quoc Viet Hung Nguyen, and Hongzhi Yin. 2020. Where to Go Next: Modeling Long- and Short-Term User Preferences for Point-of-Interest Recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 214–221.
[28] Jiaxi Tang and Ke Wang. 2018. Personalized top-n sequential recommendation via convolutional sequence embedding. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 565–573.
[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[30] Chenyang Wang, Min Zhang, Weizhi Ma, Yiqun Liu, and Shaoping Ma. 2019. Modeling item-specific temporal dynamics of repeat consumption for recommender systems. In The World Wide Web Conference. 1977–1987.
[31] Jingyi Wang, Qiang Liu, Zhaocheng Liu, and Shu Wu. 2019. Towards Accurate and Interpretable Sequential Prediction: A CNN & Attention-Based Feature Extractor. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 1703–1712.
[32] Shu Wu, Qiang Liu, Ping Bai, Liang Wang, and Tieniu Tan. 2016. SAPE: A system for situation-aware public security evaluation. In AAAI.
[33] Shu Wu, Yuyuan Tang, Yanqiao Zhu, Liang Wang, Xing Xie, and Tieniu Tan. 2019. Session-based recommendation with graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence.
[34] Dingqi Yang, Benjamin Fankhauser, Paolo Rosso, and Philippe Cudre-Mauroux. 2020. Location Prediction over Sparse User Mobility Traces Using RNNs: Flashback in Hidden States!. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20. 2184–2190.
[35] Di Yao, Chao Zhang, Jianhui Huang, and Jingping Bi. 2017. SERM: A recurrent model for next location prediction in semantic trajectories. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 2411–2414.
[36] Wenwen Ye, Shuaiqiang Wang, Xu Chen, Xuepeng Wang, Zheng Qin, and Dawei Yin. 2020. Time Matters: Sequential Recommendation with Complex Temporal Information. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1459–1468.
[37] Feng Yu, Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan. 2016. A dynamic recurrent model for next basket recommendation. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. 729–732.
[38] Feng Yu, Yanqiao Zhu, Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan. 2020. TAGNN: Target attentive graph neural networks for session-based recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1921–1924.
[39] Fajie Yuan, Alexandros Karatzoglou, Ioannis Arapakis, Joemon M Jose, and Xiangnan He. 2019. A simple convolutional generative network for next item recommendation. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. 582–590.
[40] Yuyu Zhang, Hanjun Dai, Chang Xu, Jun Feng, Taifeng Wang, Jiang Bian, Bin Wang, and Tie-Yan Liu. 2014. Sequential click prediction for sponsored search with recurrent neural networks. arXiv preprint arXiv:1404.5772 (2014).
[41] Zhiqian Zhang, Chenliang Li, Zhiyong Wu, Aixin Sun, Dengpan Ye, and Xiangyang Luo. 2017. NEXT: A Neural Network Framework for Next POI Recommendation. CoRR abs/1704.04576 (2017). arXiv:1704.04576 http://arxiv.org/abs/1704.04576
[42] Kangzhi Zhao, Yong Zhang, Hongzhi Yin, Jin Wang, Kai Zheng, Xiaofang Zhou, and Chunxiao Xing. 2020. Discovering Subsequence Patterns for Next POI Recommendation. (2020).
[43] Pengpeng Zhao, Haifeng Zhu, Yanchi Liu, Jiajie Xu, Zhixu Li, Fuzhen Zhuang, Victor S Sheng, and Xiaofang Zhou. 2019. Where to go next: A spatio-temporal gated network for next POI recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5877–5884.
[44] Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5941–5948.
[45] Yu Zhu, Hao Li, Yikang Liao, Beidou Wang, Ziyu Guan, Haifeng Liu, and Deng Cai. 2017. What to Do Next: Modeling User Behaviors by Time-LSTM. In