Knowledge Adaption for Demand Prediction based on Multi-task Memory Neural Network
Can Li, The University of New South Wales, Sydney, [email protected]
Lei Bai, The University of New South Wales, Sydney, [email protected]
Wei Liu, The University of New South Wales, Sydney, [email protected]
Lina Yao, The University of New South Wales, Sydney, [email protected]
S Travis Waller, The University of New South Wales, Sydney, [email protected]
ABSTRACT
Accurate demand forecasting of different public transport modes (e.g., buses and light rails) is essential for public service operation. However, the development level of various modes often varies significantly, which makes it hard to predict the demand of the modes with insufficient knowledge and sparse station distribution (i.e., station-sparse modes). Intuitively, different public transit modes may exhibit shared demand patterns temporally and spatially in a city. As such, we propose to enhance the demand prediction of station-sparse modes with the data from a station-intensive mode and design a Memory-Augmented Multi-task Recurrent Network (MATURE) to derive the transferable demand patterns from each mode and boost the prediction of station-sparse modes through adapting the relevant patterns from the station-intensive mode. Specifically, MATURE comprises three components: 1) a memory-augmented recurrent network for strengthening the ability to capture the long- and short-term information and storing the temporal knowledge of each transit mode; 2) a knowledge adaption module to adapt the relevant knowledge from a station-intensive source to station-sparse sources; 3) a multi-task learning framework to incorporate all the information and forecast the demand of multiple modes jointly. The experimental results on a real-world dataset covering four public transport modes demonstrate that our model can improve the demand forecasting performance for the station-sparse modes.
CCS CONCEPTS

• Computing methodologies → Neural networks; • Applied computing → Transportation; Forecasting.

KEYWORDS

Demand Prediction; Memory-based Recurrent Network; Multi-task Learning
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Conference'17, July 2017, Washington, DC, USA
© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn
ACM Reference Format:
Can Li, Lei Bai, Wei Liu, Lina Yao, and S Travis Waller. 2020. Knowledge Adaption for Demand Prediction based on Multi-task Memory Neural Network. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION

Each public transport mode, such as buses, trains, light rails, and ferries, often plays an irreplaceable and integral role in the public transport system and operation of a city. How to design efficient and reliable public transport services is a fundamental and critical problem for cities, and estimating the travel demand of various modes of public transit is a critical component of addressing this problem. Better demand forecasting allows one to better accommodate the public transit demand (e.g., provide sufficient service frequency and design stop/station spacing to reduce waiting and crowding and improve the attractiveness of public transit services) and improve public transit service efficiency (e.g., efficient and effective routing and scheduling of transit services).

However, the development stages/levels of different public transit modes in a city or region often differ. For instance, Figure 1(a) displays the proportion of stations for the four different modes collected from the Greater Sydney area. The number of bus stations is far larger than that of the other three modes. Moreover, Figure 1(b) displays an example of the geographical distribution of some public transportation stations. Consistent with the proportions shown in Figure 1(a), the urban area covered by the bus stations contains the areas covered by stations of the other public transit modes, and the stations of those modes are distributed sparsely. This sparseness of the train, light rail, and ferry stations often limits the accuracy of demand prediction. From the spatial view, a sparse distribution of stations lacks the features to characterize the local spatial dependencies among the stations. From the semantic view, with limited stations, it is harder to model correlations among stations sharing similar temporal patterns.
Thus, we propose to utilize the rich demand data from buses to improve demand prediction for the other three modes, considering that demand patterns of different modes may exhibit a certain level of similarity in the same or similar areas. On the one hand, the demand is influenced by the functions of the regions, so the stations of different transportation modes could share similar demand patterns. On the other hand, nearby regions may have similar demand patterns [28], which leads to analogous demand trends for various modes. This inspires us to utilize the massive data from station-intensive modes (e.g., buses) to improve the demand prediction for station-sparse sources in our work.

Figure 1: Stations of Four Modes of Public Transport. (a) represents the proportions of different types of public transit stations/stops; (b) represents their geographical distribution (orange: bus stations; blue: train stations; red: light rail stations; green: ferry stations).
In the literature, there has been a long line of studies on traffic data prediction. Traditional methods in this area employ time-series models such as the Auto-Regressive Integrated Moving Average (ARIMA) and its variants for traffic flow forecasting [12, 15]. These models are less capable of taking non-linear temporal relations into account. In recent years, deep learning models have been explored for capturing non-linear temporal dependencies to forecast traffic demand. These methods mainly utilize Long-Short Term Memory (LSTM) [1, 3, 17, 23, 28], the Gated Recurrent Unit (GRU) [16], or temporal convolution modules [2] to capture the complex non-linear temporal dependencies. However, these methods only focus on one target transport mode with intensive stations/regions. It is hard for these works to handle station-sparse sources for precise prediction, and they are not able to explore the correlations among different transport modes, which have great potential to improve forecasting performance. Although Ye et al. [29] co-predicted the demand of two transportation modes, their method requires the same level of region coverage for the sources, which makes it inapplicable to predicting the demand of station-sparse sources. In practice, since the development levels/stages of different cities and modes of transport are uneven, the sparseness of stations/regions for some modes may lead to less accurate and unsatisfactory demand forecasting.
Different from all of these existing works, we focus on forecasting the demand of station-sparse transportation modes and aim to boost the prediction accuracy by considering the demand patterns of different transport modes in one target city and adapting the knowledge learned from the station-intensive source to the station-sparse sources.

To take advantage of the information from a station-intensive source for demand prediction for station-sparse sources, this study proposes the Memory-Augmented Multi-task Recurrent Network (MATURE) for station-level demand forecasting in public transit systems. Specifically, a general multi-task learning framework is constructed for multi-mode demand co-prediction based on LSTM. To capture more accurate temporal correlations and enable knowledge adaption, we further augment an external memory module to the LSTM of each public transit mode to enhance the ability to capture long- and short-term information. The extracted temporal knowledge can be stored in the external module and shared with other transport modes, which the classic LSTM cannot achieve efficiently. After that, utilizing the information stored in the memory module, we design a knowledge adaption mechanism consisting of a boost vector and an eliminate vector to drop unnecessary features and decide which information should be adapted from the station-intensive transport mode to the station-sparse transport modes. By optimizing the sources jointly,
MATURE is capable of adapting appropriate information from the station-intensive mode (e.g., bus) to station-sparse modes (e.g., train, light rail, and ferry) and thus improving the prediction performance of station-sparse transport modes. We validate MATURE on a large-scale real-world dataset collected from the Greater Sydney area including four types of public transport, i.e., bus, train, light rail, and ferry, where the bus mode has much denser station coverage. Comprehensive comparisons with several state-of-the-art methods demonstrate the effectiveness of the proposed model.

The main contributions of this paper are summarized as follows:
(1) To the best of our knowledge, this is the first study to utilize the data of a station-intensive public transit mode to enhance the demand prediction of station-sparse modes in a city, which opens a new direction for improving the prediction performance of station-sparse transport modes.
(2) This study proposes a novel multi-task deep-learning model, MATURE, to improve the prediction performance of the station-sparse transport modes. MATURE can learn more accurate temporal correlations in the historical transit data with the augmented memory modules and adapt the learned knowledge from the station-intensive mode to the station-sparse modes with our knowledge adaption module under the multi-task learning framework.
(3) This study conducts extensive experiments on a large-scale real-world public transport dataset collected in a large metropolitan area. The results show that the proposed model significantly outperforms the tested existing methods.

The rest of this paper is organized as follows. First, we introduce the related works in Section 2 and define the demand prediction problem in Section 3. Then, Section 4 presents the proposed MATURE model and its technical details. The evaluation of our method and the comparison to other methods are presented in Section 5. Finally, we conclude the paper in Section 6.
2 RELATED WORK

In this section, we review relevant studies on travel demand forecasting and knowledge adaption methods.
2.1 Travel Demand Forecasting

The earliest models for demand forecasting are based on traditional time-series regression models such as ARIMA, the Kalman Filter, and their variants [12, 15, 26]. Lippi et al. [12] adopted the combination of the Seasonal ARIMA (SARIMA) model and the Kalman filter for good-quality demand predictions. ARIMA and time-varying Poisson models were coupled by Moreira et al. [15] to predict the spatial distribution of taxi passengers over a short-term time horizon using streaming data. The IMM filter algorithm was applied to individual forecasting models by Xue et al. [26] for dynamic passenger demand prediction in the next interval. However, these strategies often struggled to capture the non-linear temporal correlations needed for precise prediction.

Recently, a number of deep learning methods have shown success in time-series forecasting, such as the Fully Connected Layer (FCL), basic RNN, and LSTM [7, 17, 25, 30]. Yi et al. [30] proposed a deep neural network-based approach consisting of a spatial transformation component and a deep distributed fusion network for predicting time-series information. Dual-stage attention, including an input attention mechanism and a temporal attention mechanism, was added to an RNN by Qin et al. [17] for capturing long-term temporal dependencies and forecasting.

Convolutional Neural Networks (CNN) [10] and Graph Convolutional Networks (GCN) [8] have also been adapted to capture spatial correlations in combination with temporal features for better forecasting [2, 19, 21, 28, 31]. For example, Li et al. [11] introduced the Diffusion Convolutional Recurrent Neural Network (DCRNN), based on a directed graph, which captures spatial dependency through bidirectional random walks for traffic forecasting. Seo et al. [19] integrated arbitrary graph structures into the classical RNN to identify spatial structures for sequence prediction. To take advantage of both temporal and spatial features, a Deep Multi-View Spatial-Temporal Network (DMVST-Net) was proposed by Yao et al. [28] to model correlations among regions sharing similar temporal patterns for taxi demand prediction. Multi-step citywide passenger demands were predicted by Zhou et al. [31] through an encoder-decoder framework based on convolutional and ConvLSTM units. More recently, Bai et al. [2] used a hierarchical graph convolutional structure to capture spatial and temporal correlations simultaneously to predict passenger demand. Wang et al. [21] illustrated a grid embedding method for both geographical and semantic neighborhoods to capture spatial correlations and then predict origin-destination taxi demand.

Although the aforementioned works achieved good performance on predicting traffic demand, all of them were based on station-intensive sources with only one forecasting target. They can hardly capture adequate temporal information from station-sparse sources for precise prediction, and they missed the chance to utilize correlations among different sources for performance improvement. Ye et al. [29] focused on two transportation modes with a satisfactory level of region coverage, which is hard to apply to station-sparse data analysis. Overall, most previous works performed traffic prediction based on LSTM or RNN, which has limited capability to store knowledge from one source that can be adapted to other sources, and they lacked the ability to accurately forecast demand without intensive station/region data. In practice, due to unbalanced urban development, limited data leads to various station-sparse public transportation modes, and works focusing on only one domain are less able to predict the demand of station-sparse sources. Different from the previous works, we aim to adapt the knowledge learned from the station-intensive public transport mode to the station-sparse modes to improve their forecasting performance, based on the proposed multi-task framework.
2.2 Knowledge Adaption

As discussed in Section 1, it is often hard to provide high-quality demand forecasts for station-sparse sources based only on their own information. Thus, many methods adopt auxiliary data to enhance prediction performance [3, 25]. For instance, an end-to-end multi-task model for demand prediction was proposed by Bai et al. [3], where a CNN was used to extract spatial correlations and external factors, including weather conditions, were incorporated to enhance prediction accuracy. Similarly, external relevant information including weather and time was applied to an LSTM for predicting future taxi requests by Xu et al. [25]. Unfortunately, such additional data/information is often hard to access (e.g., public weather data often only describes the overall situation of a city while different regions in the city may experience different weather conditions in practice), which limits the applicability of these approaches to a certain extent.

Due to the insufficiency of sources and external data, some works utilized previously developed learning methods to adapt knowledge to new tasks or domains for prediction [20, 24, 27]. An inter-city region matching function was learned by Wang et al. [20] to match two similar regions from the source domain to the target domain for crowd flow prediction. For air quality prediction, a Flexible multi-modal transfer Learning (FLORAL) method was proposed by Wei et al. [24], which learns semantically related dictionaries from the source domain and adapts them to the target domain. Yao et al. [27] adopted meta-learning to take advantage of knowledge from multiple cities to increase the stability of transfer for spatial-temporal prediction in the target city. These methods aimed at solving the data-scarcity problem of lacking training samples (e.g., only three days of data [20]), which differs from our setting. Moreover, they did not train the sources jointly, making it hard to adapt or share useful knowledge during the training process. Models designed for data-scarcity problems cannot be applied to our problem of enhancing the prediction performance for station-sparse sources.

Compared with the earlier studies, external factors are not available in our setting, so they are not incorporated in our work for prediction enhancement. Different from strategies analyzing the demand patterns of sources with limited training samples, we focus on improving the demand forecasting performance of sources with sparse stations by adapting the useful knowledge learned from the station-intensive source.
3 PROBLEM DEFINITION

In this section, we first introduce the dataset used in our work. Then we list the mathematical notations and formally define the problem.
Dataset. The dataset is collected from Sydney covering the main public transport services: buses, trains, ferries, and light rails, from 01/Apr/2017 to 30/Jun/2017, including 24 hours a day and covering 6.37 million users. We use all lines' information, including tap-on and tap-off location (e.g., name, longitude, and latitude of the station) and the number of passengers getting on and off, in our experiments. The proportion of stations of each public transportation mode is shown in Figure 1(a). Considering the amount and distribution of stations of each public transportation mode, we use the bus as the station-intensive source and the other three public transportation modes (i.e., train, light rail, and ferry) as station-sparse sources.
Demand Series. For a public transportation mode $D$ (e.g., bus) with $N_D$ stations, we denote the demand of station $i$ at time-step $t$ as a scalar $x^{D,i}_t$, which is the number of passengers during the time period $t-1 \sim t$; the unit of a time-step is the duration over which demand is counted. The demand of all the stations of transport mode $D$ at time-step $t$ can then be represented as a vector $X^D_t = \{x^{D,1}_t, x^{D,2}_t, \cdots, x^{D,i}_t, \cdots, x^{D,N_D}_t\}$. Further, the demand series of transport mode $D$ along time can be denoted as a multivariate time series $X^D = \{X^D_1, X^D_2, \cdots, X^D_t, \cdots, X^D_T\}$, where $T$ is the total number of time-steps.

Station-level Demand Forecasting Problem. Suppose we have the intensive demand data $X^R$ of mode $R$ and a set of sparse demand data $X^{S_k}$ of modes $S_k$. Given a sequence of demand $\{X^R_{T-\tau+1}, \cdots, X^R_T\}$ of the station-intensive mode $R$ and a sequence of demand $\{X^{S_k}_{T-\tau+1}, \cdots, X^{S_k}_T\}$ of a station-sparse mode $S_k$, our aim is to forecast the demand of each station of the station-intensive mode and the station-sparse mode at the future time-step $T+1$:

$$\hat{X}^R_{T+1}, \hat{X}^{S_k}_{T+1} = \Gamma\{X^R_{T-\tau+1}, \cdots, X^R_{T-1}, X^R_T, X^{S_k}_{T-\tau+1}, \cdots, X^{S_k}_{T-1}, X^{S_k}_T\} \quad (1)$$

where $\tau$ is the length of the input time-steps and $\Gamma(\cdot)$ is the prediction function learned by MATURE.

4 METHODOLOGY

In this section, we first introduce the basic version of our multi-task framework for enhancing station-sparse transport mode demand forecasting. Then, we elaborate on how we upgrade the basic framework to the powerful
MATURE with two components: a Memory-Augmented Recurrent Network (MARN) to enhance temporal feature capturing and store useful information for each transportation mode, and a Knowledge Adaption Module to incorporate and adapt the knowledge from the station-intensive source to station-sparse sources. The overall structure of the proposed model is shown in Figure 2 and is explained in detail in the following.
4.1 Basic Multi-task Learning Framework

To better describe the method for multi-task enhanced forecasting, we first describe the basic multi-task learning framework. As described in Section 1, the forecasting performance of the station-sparse sources can be enhanced by knowledge from the station-intensive source. Thus, a multi-task framework is designed to jointly optimize the station-intensive source and the station-sparse sources, which can learn representations capturing the underlying factors shared between the sources for further demand forecasting.

To capture the nonlinear temporal relationships, Recurrent Neural Networks, which can process sequences of arbitrary length, have been widely used in NLP tasks and, more recently, in time-series forecasting [5, 22]. However, for the classic RNN, the components of the gradient vector grow or decay exponentially over long sequences [13]. LSTM has therefore become a strong tool for time-series forecasting due to its capability of learning long-term dependencies while avoiding exploding or vanishing gradients through a memory unit and a gate mechanism [6, 16]. It learns temporal correlations by maintaining an internal memory cell $c_t$ at time-step $t$. Thus, the data from the station-intensive mode and the station-sparse mode are first fed into separate LSTMs to model their own temporal correlations.

In detail, LSTM has an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, an internal memory cell $c_t$, and a hidden state $h_t \in \mathbb{R}^m$.
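For reference, the standard LSTM update formalised in Formula (2) below can be sketched in numpy as follows; the parameter dictionary `P` and all sizes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM update following Formula (2); P holds the weight
    matrices W_*, U_* and bias vectors b_* for the four gates."""
    i_t = sigmoid(P["Wi"] @ x_t + P["Ui"] @ h_prev + P["bi"])
    f_t = sigmoid(P["Wf"] @ x_t + P["Uf"] @ h_prev + P["bf"])
    o_t = sigmoid(P["Wo"] @ x_t + P["Uo"] @ h_prev + P["bo"])
    theta_t = np.tanh(P["Wt"] @ x_t + P["Ut"] @ h_prev + P["bt"])
    c_t = f_t * c_prev + i_t * theta_t   # element-wise products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Tiny example with n = 3 input features and m = 4 hidden units.
rng = np.random.default_rng(0)
n, m = 3, 4
P = {}
for g in ("i", "f", "o", "t"):
    P["W" + g] = rng.standard_normal((m, n)) * 0.1
    P["U" + g] = rng.standard_normal((m, m)) * 0.1
    P["b" + g] = np.zeros(m)
h, c = np.zeros(m), np.zeros(m)
h, c = lstm_step(rng.standard_normal(n), h, c, P)
```

In practice a deep learning framework's built-in LSTM cell would be used; the sketch only makes the gate arithmetic explicit.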
We adopt the LSTM cell in our study as the basic temporal-correlation capturing framework, which is defined as follows with the input vector $x_t \in \mathbb{R}^n$ at the current time-step:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$\theta_t = \tanh(W_\theta x_t + U_\theta h_{t-1} + b_\theta)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \theta_t$$
$$h_t = o_t \odot \tanh(c_t) \quad (2)$$

where $\odot$ represents element-wise multiplication, $\sigma$ denotes the logistic sigmoid function $\sigma(u) = 1/(1+e^{-u})$, $U_i, U_f, U_o, U_\theta \in \mathbb{R}^{h \times m}$ and $W_i, W_f, W_o, W_\theta \in \mathbb{R}^{h \times n}$ are weight matrices, and $b_i, b_f, b_o, b_\theta \in \mathbb{R}^h$ are bias vectors.

After extracting temporal features for each source, a simple manner of achieving knowledge adaption is to merge the data from different sources together. Therefore, the extracted features are concatenated and fed to the output module (e.g., two fully connected layers) to find the relations between the station-intensive source and the station-sparse source for further demand forecasting.

Figure 2: Overall Structure of the Proposed Model: MATURE. $X_t$ denotes the demand data, $h_t$ represents the hidden state, and $M_t$ is the external augmented memory. $R$ represents the station-intensive source while $S$ represents the station-sparse source. (The ferry is used as an example.)

4.2 Memory-Augmented Recurrent Network

We now introduce the external augmented memory module, which extracts and stores information from the demand series based on the recurrent neural network. It enhances the capability of capturing long- and short-term information and of sharing the extracted information with other tasks, which the internal memory cell of LSTM cannot achieve efficiently.

As introduced in Section 4.1, the internal memory cell of LSTM can capture long- and short-term relations for one forecasting task, but it is hard to share with other tasks to improve the forecasting performance for station-sparse sources. In recent years, recurrent neural networks augmented with external memory have been studied for their ability to preserve and share useful information [18]. They are able to learn algorithmic solutions to different complex tasks and have been used for language modeling and machine translation [14, 18]. Motivated by the success of memory mechanisms in the NLP area for modeling long-term temporal correlations, we adopt an external augmented memory module to extract the historical information of each transportation mode and boost the capacity to deal with long- and short-term information. The extracted information can be stored in the augmented memory module and further shared with other transportation modes by the knowledge adaption module, which will be discussed in Section 4.3, for prediction. The architecture of the Memory-Augmented Recurrent Network (MARN) is shown in Figure 3.

In detail, we first define the external augmented memory module as $M_t \in \mathbb{R}^{K \times S}$ at time-step $t$, where $K$ denotes the number of memory segments and $S$ denotes the size of each segment.

In order to read useful information from the memory module, we introduce a vector $k_t \in \mathbb{R}^S$ emitted by the LSTM at each time-step $t$:

$$k_t = \tanh(W_k h_t + b_k) \quad (3)$$

where $W_k$ and $b_k$ represent the weight matrix and bias vector respectively. Then, $\alpha_t \in \mathbb{R}^K$ decides what information is read from the memory module $M_{t-1}$ and written to $M_t$ at the next time-step, and $r_t \in \mathbb{R}^S$ describes the operation of reading effective knowledge from $M_{t-1}$:

$$\alpha_{t,k} = \mathrm{softmax}(f(M_{t-1,k}, k_{t-1})) \quad (4)$$
$$r_t = M_{t-1}^{T} \alpha_t \quad (5)$$

where $f(x, y)$ is a function comparing the similarity between vectors $x$ and $y$; we use $f(x, y) = \mathrm{cosine}(x, y)$, the cosine similarity.

Figure 3: The Architecture of MARN. $\otimes$, $\oplus$, and $\odot$ represent element-wise multiplication, addition, and matrix multiplication respectively. $\sigma$ denotes the logistic sigmoid function. $c_t$ represents the internal memory cell and $k_t$ is a vector emitted by the recurrent network. $b_t$ is the boost vector while $l_t$ is the eliminate vector; $i_t$, $f_t$, and $o_t$ represent the input gate, forget gate, and output gate respectively.

The next step is to update effective information to $M_t$ based on the previous time-step. Adopting an erase vector $e_t \in \mathbb{R}^S$ and an add vector $a_t \in \mathbb{R}^S$, we formulate $M_t$ as:

$$M_t = M_{t-1} \odot (1 - \alpha_t e_t^{T}) + \alpha_t a_t^{T} \quad (6)$$

where

$$e_t = \sigma(W_e h_t + b_e), \quad a_t = \tanh(W_a h_t + b_a) \quad (7)$$

and $T$ denotes the transpose operation, $W_e, W_a$ are weight matrices, and $b_e, b_a$ are bias vectors.

To combine the external augmented memory module $M_t$ computed in Formula (6) and the internal memory cell $c_t$ obtained in Formula (2), a deep fusion strategy is introduced, instead of simply adding or concatenating, to avoid conflicts between the internal and external memory. The hidden state $h_t$ is thus formulated as:

$$h_t = o_t \odot \tanh(c_t + \sigma(W_r r_t + W_c c_t) \odot (W_h r_t)) \quad (8)$$

where $W_r, W_c, W_h$ are weight matrices.

In conclusion, the Memory-Augmented Recurrent Network enhances the ability to handle long- and short-term information, which improves demand prediction accuracy for all sources. Its external memory module can also store useful knowledge that can be shared with other transportation modes.

4.3 Knowledge Adaption Module

Based on the augmented memory module introduced in Section 4.2, this section introduces the proposed knowledge adaption mechanism, which adapts useful knowledge from the station-intensive source to station-sparse sources. The framework of this module is shown in Figure 4.
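Before turning to the details of the adaption module, the MARN memory read and write steps of Formulas (3)-(8) can be sketched in numpy; the memory sizes and random parameter values are purely illustrative:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cosine(M, k):
    """Cosine similarity between each memory segment M[j] and key k."""
    return (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)

def memory_read(M_prev, k_prev):
    """Formulas (4)-(5): address the memory and read r_t."""
    alpha_t = softmax(cosine(M_prev, k_prev))   # shape (K,)
    r_t = M_prev.T @ alpha_t                    # shape (S,)
    return alpha_t, r_t

def memory_write(M_prev, alpha_t, e_t, a_t):
    """Formula (6): erase with e_t, then add a_t."""
    return M_prev * (1.0 - np.outer(alpha_t, e_t)) + np.outer(alpha_t, a_t)

# Toy example: K = 3 memory segments of size S = 4.
rng = np.random.default_rng(1)
K, S = 3, 4
M = rng.standard_normal((K, S))
k_t = np.tanh(rng.standard_normal(S))           # key as in Formula (3)
alpha, r = memory_read(M, k_t)
e_t = sigmoid(rng.standard_normal(S))           # erase vector, Formula (7)
a_t = np.tanh(rng.standard_normal(S))           # add vector, Formula (7)
M_next = memory_write(M, alpha, e_t, a_t)
```

In the full model, $k_t$, $e_t$, and $a_t$ are produced from the LSTM hidden state by learned linear maps rather than drawn at random.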
Figure 4: Knowledge Adaption Module
Due to the small number of stations, as described in Section 1, it is hard to obtain accurate demand forecasts based on a station-sparse source alone. Thus, we propose to adapt useful information from the station-intensive source to station-sparse sources to enhance prediction performance, since the knowledge extracted from the station-intensive source can better describe the demand patterns in particular areas, which helps station-sparse sources characterize their own demand patterns in the same areas. We adopt the external augmented memory obtained by Formula (6) for each source. At time-step $t$, we use $M^R_t$ and $M^S_t$ to denote the external memory of the station-intensive source and the station-sparse source, respectively. Directly adapting $M^R_t$ to $M^S_t$ through concatenation or addition may suffer from feature redundancy for two reasons: on the one hand, it could bring unnecessary task-specific features to the station-sparse source; on the other hand, some adaptive features may be mixed into the private space. Therefore, we propose a knowledge adaption mechanism to incorporate the external memories from different sources.

The first step is to apply an align function to $M^R_t$ and $M^S_t$ at time-step $t$ to compare the information extracted from the two sources, which is further used to decide what knowledge needs to be adapted:

$$g(M^R_{t,k}, M^S_{t,k}) = v^{T} \tanh(W_g [M^R_{t,k}; M^S_{t,k}]) \quad (9)$$

where $W_g$ is a parameter matrix and $v$ is a parameter vector. The corresponding control parameter that decides how much knowledge from the station-intensive source should be adapted to the station-sparse source is computed as:

$$\beta_{t,k} = \mathrm{softmax}(g(M^R_{t-1,k}, M^S_{t-1,k})) \quad (10)$$

Knowing the connection between the two sources, we design an adaptive function to fuse the memories $M^R$ and $M^S$, which adopts a boost vector $b_t \in \mathbb{R}^S$ and an eliminate vector $l_t \in \mathbb{R}^S$.
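The align score of Formula (9) and the control parameter of Formula (10) can be sketched as follows; the segment sizes, the hidden dimension `d`, and the random values of $W_g$ and $v$ are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def align(M_R, M_S, W_g, v):
    """Formula (9): score each concatenated segment pair [M^R_k; M^S_k]."""
    K = M_R.shape[0]
    scores = np.empty(K)
    for k in range(K):
        pair = np.concatenate([M_R[k], M_S[k]])
        scores[k] = v @ np.tanh(W_g @ pair)
    return scores

# Toy example: K = 3 segments of size S = 4; W_g and v are learnable.
rng = np.random.default_rng(2)
K, S, d = 3, 4, 6
M_R = rng.standard_normal((K, S))   # station-intensive memory
M_S = rng.standard_normal((K, S))   # station-sparse memory
W_g = rng.standard_normal((d, 2 * S)) * 0.1
v = rng.standard_normal(d)
beta = softmax(align(M_R, M_S, W_g, v))   # control parameter, Formula (10)
```

Each $\beta_{t,k}$ weights how strongly segment $k$ of the station-intensive memory is transferred in the fusion step that follows.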
The eliminate vector $l_t$ limits the task-specific information read from $M^R$, while the boost vector $b_t$ adapts effective features into the new memory matrix. The function can be written as:

$$M^{new}_t = M^R_{t-1} \odot (1 - \beta_t l_t^{T}) + \beta_t b_t^{T} \quad (11)$$

where

$$b_t = \tanh(W_b b_{t-1} + b_b), \quad l_t = \sigma(W_l l_{t-1} + b_l) \quad (12)$$

At last, the adapted augmented memory for the station-sparse source, $M^S_t$, is computed as:

$$M^S_t = \gamma \times M^S_{t-1} + (1 - \gamma) \times M^{new}_t \quad (13)$$

where $\gamma$ is a hyperparameter that decides the proportion of $M^{new}_t$ in the augmented memory. The adapted memory $M^S_t$ of the station-sparse source is then fed into the Memory-Augmented Recurrent Network for hidden-state computation and further demand prediction.

Predicting the demand of several public transportation modes independently makes it hard to adapt useful information from the station-intensive source to the station-sparse source during training. Thus, we optimize the demand forecasting of the station-intensive source and the station-sparse source jointly. As shown in Figure 2, the last step of the multi-task learning framework is to concatenate the hidden states of the two sources and then send them into different fully connected layers for further demand forecasting.

In the training process of
MATURE, the objective is to minimize the error between the true demand and the predicted values of the station-intensive source and the station-sparse source simultaneously. The loss function is defined as the mean squared error over the time-step length $\tau$:

$$L(\theta) = \epsilon \times \sum_{i=T+1}^{T+\tau} ||\hat{X}^R_i - X^R_i||^2 + (1-\epsilon) \times \sum_{i=T+1}^{T+\tau} ||\hat{X}^S_i - X^S_i||^2 \quad (14)$$

where $\theta$ denotes all the learnable parameters in the proposed MATURE model and $\epsilon$ is a hyperparameter to balance the loss between the two sources. Our model can be trained in an end-to-end manner via back-propagation and the Adam optimizer.

5 EXPERIMENTS
In this section, we first introduce the experiment settings, evaluation metrics, and baselines for comparison. Next, we present the experimental results from two perspectives: overall comparison and ablation study. We then discuss the comparison among several public transportation stations with very different demand ranges and distributions. The last part focuses on parameter sensitivity (e.g., the hyperparameter $\gamma$ in Formula (13)).

Dataset Setting. The demand data is normalized by Min-Max normalization for training and re-scaled to the actual values for evaluating the prediction performance. To test the performance of our model, the last 27 days' data are used for testing and the rest for training and validation. In each experiment, we use one station-sparse source for testing. We choose one hour as the unit time-step. Since the data volume of the bus mode is very large and contains some uninformative records (e.g., stations where more than 80% of the time-steps in one day have zero demand), we drop the bus stations with an average of fewer than five passengers per hour. The numbers of bus, train, light rail, and ferry stations are 1573, 310, 23, and 38, respectively. We use the previous 12 time-steps (12 hours) to predict the public transport demand in the next time-step.
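The Min-Max normalization, re-scaling, and sparse-station filtering described above can be sketched as follows; the demand values and the three-station example are synthetic stand-ins for the real dataset:

```python
import numpy as np

def minmax_fit(X):
    """Per-station Min-Max statistics from the training demand series."""
    return X.min(axis=0), X.max(axis=0)

def minmax_apply(X, lo, hi):
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def minmax_invert(Xn, lo, hi):
    """Re-scale predictions back to actual demand for evaluation."""
    return Xn * np.where(hi > lo, hi - lo, 1.0) + lo

def keep_busy_stations(X, min_avg=5.0):
    """Drop stations averaging fewer than `min_avg` passengers per hour,
    mirroring the bus-station filtering described above."""
    return X[:, X.mean(axis=0) >= min_avg]

# Synthetic hourly demand for 3 stations over 100 time-steps.
rng = np.random.default_rng(3)
X = rng.poisson(lam=[1.0, 8.0, 20.0], size=(100, 3)).astype(float)
X = keep_busy_stations(X)        # the near-empty first station is dropped
lo, hi = minmax_fit(X)
Xn = minmax_apply(X, lo, hi)     # values now lie in [0, 1]
```

In the experiments, the fit statistics would come from the training split only, so that the test demand is scaled consistently.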
Evaluation Metrics.
Two evaluation metrics are used to evaluate the proposed model: Root Mean Square Error (RMSE) and Mean Absolute Error (MAE).
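For reference, the two metrics computed on re-scaled predictions are:

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean Absolute Error
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    # Root Mean Square Error
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```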
Network Implementation.
The batch size is set to 64. The proposed model is tuned over the learning rate, ϵ in the loss function (from 0 to 1), and γ in Formula (13) (from 0 to 1). The learning rate is set to 0.002 for the train; the best values of ϵ and γ for each source are selected on the validation set. We compare the proposed model with the following methods:
• Historical Average (HA): The predicted demand is computed as the average of the historical demand at the same time interval of every day.
• Linear Regression (LR): It models the relations between variables and minimizes the sum of squared errors for prediction.
• eXtreme Gradient Boosting (XGBoost): XGBoost was proposed by Chen and Guestrin [4] based on the gradient boosting tree, which incorporates the advantages of Bagging ensemble learning methods in the evolution process.
• Multilayer Perceptron (MLP): The neural network contains four fully connected layers; the first hidden layer has 256 units.
• Long-Short Term Memory (LSTM): LSTMs are used to model long- and short-term dependencies and are directly applied to predict the demand of each transportation mode.
• Graph Convolutional Recurrent Network (GCRN) [19]: It combines a CNN on graphs to identify spatial structures with an RNN to find temporal patterns, improving forecasting accuracy by simultaneously capturing graph-structured spatial information and the dynamics of the data.
• Long- and Short-term Time-series network (LSTnet) [9]: It employs a recurrent-skip network with a convolutional layer to capture long-term dependence patterns and discover local dependency patterns for forecasting.
• Dual-stage Attention-based Recurrent Neural Network (DA-RNN) [17]: It has two components: an input attention mechanism to extract relevant driving series at each time-step and a temporal attention mechanism to select relevant encoder hidden states across all time-steps. The original structure is designed for univariate forecasting; we adapt it to multivariate forecasting.
• MT-LSTM: This is the basic multi-task model introduced in Section 4.1.
We use LSTM layers to extract temporal correlations and adopt fully connected layers to analyze the implicit relations for demand forecasting.
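As an example of the simplest baseline, HA with hourly time-steps reduces to averaging the same hour across training days. A minimal sketch, assuming a 24-hour day layout:

```python
import numpy as np

def historical_average(train_demand, hours_per_day=24):
    """train_demand: (T, S) hourly demand; returns (hours_per_day, S) forecasts.

    The prediction for hour h of any future day is the mean of the training
    demand observed at hour h of every training day.
    """
    T, S = train_demand.shape
    days = T // hours_per_day
    daily = train_demand[: days * hours_per_day].reshape(days, hours_per_day, S)
    return daily.mean(axis=0)
```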
For the tested approaches, we tune the model parameters on the validation dataset to locate the best parameters and report the forecasting results on the testing dataset. A summary of the results of the different models is given in Table 1. Since our aim is to enhance the forecasting accuracy of station-sparse sources with the help of useful knowledge learned from the station-intensive source, we list the MAE and RMSE of the three station-sparse sources (train, light rail, and ferry); for the basic multi-task model MT-LSTM and our model MATURE, the reported bus values are the averages over the three experiments (one per station-sparse source).

Table 1: Overall Comparison between the Proposed Method and Existing Methods

Method        | MAE: Bus / Train / Light Rail / Ferry / Average    | RMSE: Bus / Train / Light Rail / Ferry / Average
HA            | – / – / – / – / –                                  | – / – / – / – / –
LR            | – / – / – / – / –                                  | – / – / – / – / –
XGBoost [4]   | 12.6638 / 21.7030 / 12.2439 / 18.3773 / 16.2470    | 17.2568 / 35.0718 / 19.6542 / 28.7830 / 25.1915
MLP           | – / – / – / – / –                                  | – / – / – / – / –
LSTM          | – / – / – / – / –                                  | – / – / – / – / –
GCRN [19]     | 9.7909 / 20.3683 / 10.3356 / 15.8446 / 14.0849     | 13.1624 / 30.2009 / 15.4056 / 21.6866 / 20.1139
LSTnet [9]    | – / – / – / – / –                                  | – / – / – / – / –
DA-RNN [17]   | 10.7181 / 23.8095 / 11.5292 / 16.3595 / 15.6041    | 14.2535 / 33.8811 / 17.9019 / 24.0820 / 22.5296
MT-LSTM       | – / – / – / – / –                                  | – / – / – / – / –
MATURE        | – / – / – / – / –                                  | – / – / – / – / –

From Table 1, we make several observations. First, the classical machine-learning models HA and LR perform worse than the other techniques in terms of both MAE and RMSE, although XGBoost has an advantage on some sources (e.g., the train), where it outperforms some deep learning methods. Among the deep learning methods, MLP performs relatively poorly since it can hardly extract temporal relationships; the strategies based on recurrent networks (e.g., LSTM) or convolutional networks can handle this and thus improve forecasting accuracy. The three listed state-of-the-art strategies (GCRN, LSTnet, and DA-RNN) obtain better results than the classic LSTM, except that DA-RNN falls behind on the train and bus: it was proposed for univariate forecasting, so it predicts accurately when the number of variables is small but struggles when it is large. GCRN outperforms the other state-of-the-art strategies, which implies that capturing spatial and temporal features simultaneously does improve the forecasting results; this motivates us to consider memory-based spatial-temporal correlations for demand forecasting in our future work.

Moreover, MT-LSTM yields better results than LSTM, which indicates that the different sources are indeed connected: it is effective to concatenate several sources to find their possible relations and share the extracted knowledge for further prediction. However, since LSTM is not as powerful as the other state-of-the-art methods, the MAE and RMSE values obtained by MT-LSTM are larger than theirs. Compared to MT-LSTM, our method gains 7.75% relative improvement in MAE and 14.98% relative improvement in RMSE on the average of the four modes, which proves the effectiveness of the external memory module and the knowledge adaption strategy. Although our method has slightly larger MAE and RMSE on the bus than GCRN and LSTnet, jointly training the sources with MATURE obtains the highest accuracy on all metrics for the three station-sparse sources. Also, compared to the best state-of-the-art method, our method improves the average MAE and RMSE over the four modes. Such results indicate that, in the same city and based on our multi-task learning framework, extracting information from the station-intensive source to characterize the demand patterns of the target areas with the external memory-augmented recurrent network introduced in Section 4.2, and adapting the useful knowledge to the station-sparse sources with the knowledge adaption module discussed in Section 4.3, can enhance the forecasting performance of station-sparse sources. The bus still achieves good performance, which shows that our method does not destroy the temporal correlations of the station-intensive source.
To study the effects of the different components of the proposed model, we further evaluate models with various combinations of components. Table 2 lists the comparison between our method and five variants. We only report the MAE and RMSE of the three station-sparse sources (train, light rail, and ferry), since our aim is to verify that the demand patterns extracted from the station-intensive source can enhance the performance of station-sparse sources. The tested architectures are described as follows:
• C-LSTM: The concatenation of the two sources is sent into an LSTM layer, and two fully connected layers then analyze the obtained matrices for demand prediction.
• MT-LSTM: As described in Section 5.2, two LSTM layers extract temporal correlations for the two sources separately. The extracted knowledge is concatenated and then sent into two fully connected layers for further prediction.
• MARN: The strategy described in Section 4.2. We train the model to predict the demand of each public transportation mode independently with MARN.
• MARN-S: This architecture first adopts MARN to extract temporal correlations for each source. The hidden states are then combined and followed by fully connected layers for further prediction.
• MARN-C: This architecture also adopts MARN to extract knowledge from the sources independently. Knowledge adaption for the station-sparse source is performed by concatenating the two external memories; the concatenated matrix is sent to a fully connected layer to construct a new memory module M_t^S for the station-sparse source. The hidden states of the two sources are also combined and sent into fully connected layers to predict the demand.

Table 2: Comparison with Different Variants

Method      | MAE: Train / Light Rail / Ferry    | RMSE: Train / Light Rail / Ferry
C-LSTM      | – / – / –                          | – / – / –
MT-LSTM     | – / – / –                          | – / – / –
MARN        | – / – / –                          | – / – / –
MARN-S      | – / – / –                          | – / – / –
MARN-C      | – / – / –                          | – / – / –
Our Model   | 19.9445 / 9.6914 / 15.3570         | 30.1061 / 14.4222 / 21.5486
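For concreteness, the concatenation-based adaption used by the MARN-C variant, concatenating the two external memories and projecting them back to the sparse-source memory shape through one fully connected layer, can be sketched as follows. The concatenation axis, projection shape, and tanh activation are assumptions:

```python
import numpy as np

def marn_c_adapt(M_R, M_S, W, b):
    """M_R, M_S: (N, d) external memories; W: (2*d, d) projection; b: (d,).

    Builds a new sparse-source memory from the row-wise concatenation of the
    two memories via a single fully connected layer.
    """
    concat = np.concatenate([M_R, M_S], axis=1)  # (N, 2*d)
    return np.tanh(concat @ W + b)               # (N, d) new M_t^S
```

Unlike the learned erase/add gates of Eqs. (11)–(13), this variant has no mechanism to suppress source-specific features, which matches its weaker light-rail results below.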
From the forecasting results in Table 2, we find that the first combination, C-LSTM, is less accurate than LSTM, which suggests that concatenating the two sources may destroy the temporal correlations of the original data and thus harm demand forecasting. In contrast, the second structure, MT-LSTM, improves the prediction performance over LSTM. This implies that it is better to first analyze the temporal relations of each mode and then find their underlying relations, which improves the results. These experiments motivate us to take advantage of the station-intensive source to enhance the performance of station-sparse sources. MARN achieves better performance than LSTM on the three sources, which illustrates that
MARN could enhance the capability to model long- and short-term information for prediction.

Figure 5: Forecasting Demand vs. True Demand ((a) Train, (b) Light Rail, (c) Ferry)
Figure 6: Parameter Sensitivity of γ ((a) Train, (b) Light Rail, (c) Ferry)

The strategy MARN-S, based on the external augmented memory module, yields better performance on demand prediction than
MARN, which indicates that analyzing the implicit correlations among the temporal information extracted from the two sources can improve the forecasting accuracy. Although MARN-S has a larger MAE than MT-LSTM on the light rail, it has a lower RMSE, which means MARN-S works better on larger demand values. MARN-C, with external memory adaption, obtains lower MAE and RMSE on the train and ferry, which implies the significance of adapting the information stored in memory M_t. However, it performs worse on the light rail, which suggests that a pure concatenation operation can introduce noise into the station-sparse sources and hurt performance. Furthermore, our model outperforms all the listed strategies, which indicates that the knowledge from the station-intensive source can strengthen the demand forecasting of station-sparse sources; it also demonstrates that our knowledge adaption module adapts knowledge from M_t^R to M_t^S more effectively than pure concatenation, which can hardly avoid meaningless source-specific features. The proposed method effectively decreases the impact of noise and increases the impact of useful knowledge from the station-intensive source on the station-sparse sources.

We now illustrate the effectiveness of the proposed method on selected stations with different ranges of demand. Specifically, we choose three stations for the three station-sparse sources and predict their demand. The selected stations have very different average values and standard deviations of passenger demand. We plot the true demand and the prediction results obtained by our model in Figure 5, where the solid lines denote the true data and the dashed lines denote the forecast demand. In general, the proposed model MATURE can precisely predict the hourly demand over different demand ranges; the predicted curves and the real curves are highly coincident. For the train, our model accurately predicts the peaks except on three occasions where the demand values are smaller than those of the previous two days and appear to be irregular. The predicted demand on the last day is less accurate than on the other days because of a drop that is hard to detect. For the light rail, several peaks are likewise not fully captured because of such drops. The ferry demand is often affected by weather disturbances, so its accuracy is slightly lower than that of the other two public transit modes. For instance, the demand on the fourth day is much lower than on the other six days, an irregular case that is hard to predict without additional inputs or information. Such phenomena motivate us to focus on capturing demand peaks and drops in our future work. In summary, the proposed method can predict ferry demand accurately except in cases that are strongly affected by external factors.
To study the influence of the hyperparameters of MATURE, we train the proposed model on our dataset with different values of γ in Formula (13). Specifically, γ is varied from 0 to 1 while the learning rate and ϵ in the loss function are fixed. Figure 6 summarizes the results, where the left y-axis represents MAE and the right y-axis represents RMSE. When we set γ to 0, the adaptive function only contains the matrix M_t^new computed by the knowledge adaption module. When we set γ to 1, the adaptive procedure is disabled and the method reduces to MARN-S; the results indeed match the MAE and RMSE of MARN-S. As the value of γ changes, the MAE values do not fluctuate significantly, indicating that our model is not sensitive to this hyperparameter. Overall, setting γ to around 0.3 or 0.4 yields good performance.

CONCLUSION

In this paper, we propose a novel external memory-based multi-task model, the Memory-Augmented Multi-task Recurrent Network (MATURE), for demand forecasting of station-sparse public transit modes with the help of a station-intensive mode. Specifically, the method learns an external memory based on a recurrent network for each data source, including the station-intensive mode and the station-sparse modes, which strengthens the ability to capture temporal information and stores useful knowledge for further prediction. Since the knowledge extracted from the station-intensive source often captures the demand patterns/features of the selected areas well, we then introduce a knowledge adaption module to adapt this knowledge from the station-intensive source to the station-sparse sources, which enhances the prediction performance of the station-sparse public transit modes. When evaluated on a real-world dataset covering four public transit modes (bus, train, light rail, and ferry), our approach achieves better performance than the state-of-the-art baselines. This research provides a new tool and insights for the study of public transport demand prediction by extracting knowledge from various modes and sharing/adapting useful features to the modes that need them. In future work, we will explore spatial correlations and spatio-temporal relations for demand prediction based on our model.
REFERENCES
[1] Lei Bai, Lina Yao, Salil S Kanhere, Xianzhi Wang, Wei Liu, and Zheng Yang. 2019. Spatio-Temporal Graph Convolutional and Recurrent Networks for Citywide Passenger Demand Prediction. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2293–2296.
[2] Lei Bai, Lina Yao, Salil S Kanhere, Xianzhi Wang, and Quan Z Sheng. 2019. STG2seq: spatial-temporal graph to sequence model for multi-step passenger demand forecasting. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, 1981–1987.
[3] Lei Bai, Lina Yao, Salil S Kanhere, Zheng Yang, Jing Chu, and Xianzhi Wang. 2019. Passenger demand forecasting with multi-task convolutional recurrent neural networks. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 29–42.
[4] Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 785–794.
[5] Razvan-Gabriel Cirstea, Darius-Valer Micu, Gabriel-Marcel Muresan, Chenjuan Guo, and Bin Yang. 2018. Correlated time series forecasting using multi-task deep neural networks. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 1527–1530.
[6] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[7] Siteng Huang, Donglin Wang, Xuehan Wu, and Ao Tang. 2019. DSANet: Dual Self-Attention Network for Multivariate Time Series Forecasting. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2129–2132.
[8] Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[9] Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. 2018. Modeling long- and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 95–104.
[10] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
[11] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2018. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. In International Conference on Learning Representations.
[12] Marco Lippi, Matteo Bertini, and Paolo Frasconi. 2013. Short-term traffic flow forecasting: An experimental comparison of time-series analysis and supervised learning. IEEE Transactions on Intelligent Transportation Systems 14, 2 (2013), 871–882.
[13] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence. 2873–2879.
[14] Pengfei Liu, Xipeng Qiu, and Xuan-Jing Huang. 2016. Deep Multi-Task Learning with Shared Memory for Text Classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 118–127.
[15] Luis Moreira-Matias, Joao Gama, Michel Ferreira, Joao Mendes-Moreira, and Luis Damas. 2013. Predicting taxi–passenger demand using streaming data. IEEE Transactions on Intelligent Transportation Systems 14, 3 (2013), 1393–1402.
[16] Yan Qi, Chenliang Li, Han Deng, Min Cai, Yunwei Qi, and Yuming Deng. 2019. A Deep Neural Framework for Sales Forecasting in E-Commerce. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 299–308.
[17] Yao Qin, Dongjin Song, Haifeng Cheng, Wei Cheng, Guofei Jiang, and Garrison W Cottrell. 2017. A dual-stage attention-based recurrent neural network for time series prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. 2627–2633.
[18] Jack Rae, Jonathan J Hunt, Ivo Danihelka, Timothy Harley, Andrew W Senior, Gregory Wayne, Alex Graves, and Timothy Lillicrap. 2016. Scaling memory-augmented neural networks with sparse reads and writes. In Advances in Neural Information Processing Systems. 3621–3629.
[19] Youngjoo Seo, Michaël Defferrard, Pierre Vandergheynst, and Xavier Bresson. 2018. Structured sequence modeling with graph convolutional recurrent networks. In International Conference on Neural Information Processing. Springer, 362–373.
[20] Leye Wang, Xu Geng, Xiaojuan Ma, Feng Liu, and Qiang Yang. 2019. Cross-city Transfer Learning for Deep Spatio-temporal Prediction. In IJCAI International Joint Conference on Artificial Intelligence. 1893.
[21] Yuandong Wang, Hongzhi Yin, Hongxu Chen, Tianyu Wo, Jie Xu, and Kai Zheng. 2019. Origin-destination matrix prediction via graph convolution: a new perspective of passenger demand modeling. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1227–1235.
[22] Yunbo Wang, Jianjin Zhang, Hongyu Zhu, Mingsheng Long, Jianmin Wang, and Philip S Yu. 2019. Memory In Memory: A Predictive Neural Network for Learning Higher-Order Non-Stationarity from Spatiotemporal Dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9154–9162.
[23] Zheng Wang, Kun Fu, and Jieping Ye. 2018. Learning to estimate the travel time. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 858–866.
[24] Ying Wei, Yu Zheng, and Qiang Yang. 2016. Transfer knowledge between cities. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1905–1914.
[25] Jun Xu, Rouhollah Rahmatizadeh, Ladislau Bölöni, and Damla Turgut. 2017. Real-time prediction of taxi demand using recurrent neural networks. IEEE Transactions on Intelligent Transportation Systems 19, 8 (2017), 2572–2581.
[26] Rui Xue, Daniel Jian Sun, and Shukai Chen. 2015. Short-term bus passenger demand prediction based on time series model and interactive multiple model approach. Discrete Dynamics in Nature and Society (2015).
[27] In The World Wide Web Conference. 2181–2191.
[28] Huaxiu Yao, Fei Wu, Jintao Ke, Xianfeng Tang, Yitian Jia, Siyu Lu, Pinghua Gong, Jieping Ye, and Zhenhui Li. 2018. Deep multi-view spatial-temporal network for taxi demand prediction. In Thirty-Second AAAI Conference on Artificial Intelligence.
[29] Junchen Ye, Leilei Sun, Bowen Du, Yanjie Fu, Xinran Tong, and Hui Xiong. 2019. Co-Prediction of Multiple Transportation Demands Based on Deep Spatio-Temporal Neural Network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 305–313.
[30] Xiuwen Yi, Junbo Zhang, Zhaoyuan Wang, Tianrui Li, and Yu Zheng. 2018. Deep distributed fusion network for air quality prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 965–973.
[31] Xian Zhou, Yanyan Shen, Yanmin Zhu, and Linpeng Huang. 2018. Predicting multi-step citywide passenger demands using attention-based neural networks. In