Multi-stream RNN for Merchant Transaction Prediction

Zhongfang Zhuang ([email protected]), Chin-Chia Michael Yeh ([email protected]), Liang Wang ([email protected]), Wei Zhang ([email protected]), Junpeng Wang ([email protected])
Visa Research, Palo Alto, California
ABSTRACT
Recently, digital payment systems have significantly changed people's lifestyles. New challenges have surfaced in monitoring and guaranteeing the integrity of payment processing systems. One important task is to predict the future transaction statistics of each merchant. These predictions can thus be used to steer other tasks, ranging from fraud detection to recommendation. This problem is challenging as we need to predict not only multivariate time series but also multiple steps into the future. In this work, we propose a multi-stream RNN model for multi-step merchant transaction prediction tailored to these requirements. The proposed multi-stream RNN summarizes transaction data at different granularities and makes predictions for multiple steps in the future. Our extensive experimental results demonstrate that the proposed model is capable of outperforming existing state-of-the-art methods.
Topic Area:
Application - Monitoring, Forecasting
KEYWORDS
Transaction, Multivariate, Time Series, Forecasting, Prediction
ACM Reference Format:
Zhongfang Zhuang, Chin-Chia Michael Yeh, Liang Wang, Wei Zhang, and Junpeng Wang. 2020. Multi-stream RNN for Merchant Transaction Prediction. In KDD Workshop on Machine Learning in Finance, Aug 24, 2020, San Diego, CA.
ACM, New York, NY, USA, 7 pages.
1 INTRODUCTION

Advanced digital payment technology has enabled billions of transactions to be processed every second. One challenge these systems face is detecting the irregular transaction behaviors of merchants that deviate from the historical data. While many of these deviations may be harmless, some may signal serious underlying issues, ranging from connection issues between point-of-sale (POS) terminals and the payment processing centers to money laundering. Thus, detecting deviating behaviors is not only an important task for both merchants and payment processing companies but also a responsibility for every participant of the payment ecosystem.

To detect deviating behaviors, one crucial step is to study the transaction data of every merchant from the past and estimate the future.
KDD Workshop, Aug 24, 2020, San Diego, CA. © 2020 Copyright held by the owner/author(s). This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in
KDD Workshop on Machine Learning in Finance, Aug 24, 2020, San Diego, CA.
Figure 1: An example of time series prediction in the context of the merchant transaction history. Four features exhibit daily recurring behaviors.

Transaction data in real-world applications are often stored as multivariate time series where each dimension is a time-varying feature. These features, such as the hourly transaction amount, the hourly number of transactions and the hourly averaged transaction amount, are extracted and aggregated for analysis. We illustrate one merchant's transaction history in a week in Figure 1.

Time series prediction is a well-studied problem [1]. However, predicting multiple time series features in real-world scenarios remains challenging as real-world data is dynamic and may be impacted by various unpredictable factors, such as traffic and weather. Recent works on time series prediction [7, 8] take a new look at this problem from the neural network perspective. Specifically, Taieb and Atiya [7] propose to train a neural network with a target consisting of multiple steps into the future; Wen et al. [8] approach this problem from a sequence-to-sequence perspective. However, in these methods [7, 8], each feature is treated independently without assuming any dependencies among them, which contradicts our domain knowledge in transaction data, as different features in transaction data are inter-dependent.

In this work, we propose a multi-stream RNN model to tackle the merchant transaction prediction task, where each stream is responsible for the data in each day of the week based on observed merchant behaviors. Our approach first aggregates transaction data at a fine granularity (e.g., day) and then augments it with another RNN to capture the pattern at a higher level. Different from the existing work in computer vision research [9], where multi-stream RNNs are used to predict each time step for action labels, the goal of using a multi-stream RNN in this work is to summarize and preserve as much transaction information as possible for future long-term predictions.

Figure 2: An overview of our proposed model. Components are explained in Section 2.

We summarize the contributions of this work as follows:
• We propose and analyze the problem of multivariate multi-step merchant transaction prediction.
• We design a multi-stream RNN model to tackle the multi-step prediction task.
• We demonstrate the effectiveness of our proposed model by comparing it with various baseline methods on real-world aggregated transaction data.
2 MODEL

In this section, we first use Figure 2 to explain the architecture of the proposed model in Section 2.1. Next, we discuss design variations for some of the model components in Section 2.2.
2.1 Model Architecture

The input to our model is the hourly aggregated transaction data of 168 hours (T_1 ∼ T_168) in the form of a multivariate time series, and the output of our model is the predicted multivariate time series for the next 24 hours (T_169 ∼ T_192). The core components of the multi-stream RNN are:

(1) Daily RNNs. Each daily RNN is tasked to process one day's transaction data (e.g., components a–c in Figure 2).
(2) Merge layers (i.e., d). Each merge layer is responsible for aggregating the information from the hourly outputs of the daily RNNs. The hourly outputs associated with the same hour of the day are aggregated together.
(3) The weekly RNN (i.e., e) connects to the preceding 24 merge layers to capture the high-level patterns in the input time series.
(4) The fully-connected layers (i.e., f) make predictions for each of the 24 temporal steps individually.

One main component of our model is the daily RNN, whose goal is to summarize the daily merchant time series patterns. In particular, we use either GRU or LSTM as the daily RNN. We denote each daily RNN as:

$$ h_i^{(t)} = \mathrm{RNN}_i(x_t) \qquad (1) $$

where x_t denotes the values of the merchant features at time t, RNN_i is the i-th RNN in the multi-stream schema, and h_i^{(t)} is the latent representation of the time series of the i-th day of the week (at time t). For example, RNN_1 is only responsible for processing the T_1 ∼ T_24 transaction data.

Each RNN used here has the same number of hidden dimensions. A dropout layer is appended after each RNN with a fixed 0.2 dropout rate.
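To make the daily-stream computation of Eq. (1) concrete, the sketch below shows one plausible PyTorch realization. The paper publishes no code, so this is a minimal sketch under our own assumptions; the names DailyStreams, n_features and hidden_dim are ours.

```python
import torch
import torch.nn as nn

class DailyStreams(nn.Module):
    """Seven independent daily RNNs (Eq. 1): stream i only sees day i's 24 hours."""
    def __init__(self, n_features: int = 4, hidden_dim: int = 256, dropout: float = 0.2):
        super().__init__()
        self.rnns = nn.ModuleList(
            [nn.GRU(n_features, hidden_dim, batch_first=True) for _ in range(7)]
        )
        self.dropout = nn.Dropout(dropout)  # fixed 0.2 rate appended after each RNN

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 168, n_features) -- one week of hourly features
        days = x.view(x.size(0), 7, 24, x.size(-1))  # split into 7 day-slices
        # h[i][:, t] corresponds to h_i^{(t)}: stream i's latent state at hour t
        h = [self.dropout(self.rnns[i](days[:, i])[0]) for i in range(7)]
        return torch.stack(h, dim=1)  # (batch, 7, 24, hidden_dim)
```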
Next, the hidden states h_i^{(t)} are aggregated by the merge layers d, where each merge layer captures the patterns across every day of the week for a given hour:

$$ s_t = h_1^{(t)} + h_2^{(t)} + \cdots + h_7^{(t)} \qquad (2) $$

where t ∈ [1, 24] and h_1^{(t)} ∼ h_7^{(t)} contain the information associated with a given hour t for each day of the week.

On a higher level, we utilize e to learn the patterns across the outputs of the merge layers. e can be written as:

$$ h_s^{(t)} = \mathrm{RNN}_s(s_t) \qquad (3) $$

Lastly, we utilize h_s^{(t)} to predict the future values of each feature by appending fully connected layers after RNN_s (i.e., f). Instead of predicting the values of all features together, we utilize separate linear fully connected layers to predict each k-th feature individually. These fully connected layers are:

$$ v_k^{(t)} = W_k h_s^{(t)} + b_k \qquad (4) $$

where W_k is the weight matrix and b_k is the bias term of the k-th fully connected layer.
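Putting Eqs. (1)–(4) together, a minimal end-to-end sketch might look as follows. It continues the previous sketch (same imports, reuses DailyStreams); MSRNN and horizon are again our own illustrative names, not the authors' published code.

```python
class MSRNN(nn.Module):
    """Daily streams -> hourly merge (Eq. 2) -> weekly RNN (Eq. 3) -> per-feature heads (Eq. 4)."""
    def __init__(self, n_features: int = 4, hidden_dim: int = 256, horizon: int = 24):
        super().__init__()
        self.daily = DailyStreams(n_features, hidden_dim)
        self.weekly = nn.GRU(hidden_dim, hidden_dim, batch_first=True)  # RNN_s
        # one separate linear head per feature (W_k, b_k)
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, 1) for _ in range(n_features)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.daily(x)        # (batch, 7, 24, hidden_dim)
        s = h.sum(dim=1)         # Eq. 2: sum over the 7 days -> (batch, 24, hidden_dim)
        h_s, _ = self.weekly(s)  # Eq. 3: (batch, 24, hidden_dim)
        # Eq. 4: predict each feature individually at each of the 24 steps
        return torch.cat([head(h_s) for head in self.heads], dim=-1)  # (batch, 24, n_features)
```

For example, `MSRNN()(torch.randn(8, 168, 4)).shape` yields `torch.Size([8, 24, 4])`: one 24-hour forecast of all four features per input week.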
2.2 Design Variations

Stacked RNN. In the original design, we only use one RNN layer per stream, as depicted in Figure 2. However, a time series that contains more complicated patterns requires a more complicated RNN structure, such as a stacked RNN, to capture those patterns. Thus, one variation of our model replaces the daily RNN layers (e.g., a ∼ c) with stacked RNN layers (g), as shown in Figure 3; see the sketch below.

Figure 3: Stacked RNN for complex time series input.
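In PyTorch, and assuming the sketches above, this variation is essentially a one-argument change: stacking is controlled by num_layers on the recurrent module. Carrying the 0.2 dropout rate between stacked layers is our assumption.

```python
# Stacked daily stream: two recurrent layers instead of one.
stacked_daily_rnn = nn.GRU(
    input_size=4,    # number of merchant features
    hidden_size=256,
    num_layers=2,    # the "stacked RNN" variation
    dropout=0.2,     # applied between the stacked layers (our assumption)
    batch_first=True,
)
```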
Shrink Stacked RNN for Output. Another variation of our model is in the higher-level weekly RNN layer (i.e., e). Different from the vanilla design of the weekly RNN layer, we now use multiple RNN layers and gradually reduce their dimensionality (denoted as h in Figure 4). We shrink the dimension of each output layer using the following rule:

$$ d_{i+1} = \begin{cases} m, & \text{if } \lfloor d_i / r \rfloor = 0 \\ \lfloor d_i / r \rfloor, & \text{otherwise} \end{cases} \qquad (5) $$

where d_{i+1} and d_i are the dimensions of the (i+1)-th and i-th layers, respectively, r represents the rate of shrinkage, and m is the minimal dimension of the stack, used when ⌊d_i / r⌋ = 0. This ensures that the last output RNN layer has a valid number of dimensions.
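A small helper makes the shrink rule of Eq. (5) concrete. The function name and the example values r = 2 and m = 4 are illustrative assumptions; the paper's chosen r and m are not legible in this copy.

```python
from math import floor

def shrink_dims(d0: int, r: int, m: int, n_layers: int) -> list[int]:
    """Dimension schedule for the shrink stacked RNN (Eq. 5)."""
    dims = [d0]
    for _ in range(n_layers - 1):
        nxt = floor(dims[-1] / r)
        dims.append(nxt if nxt > 0 else m)  # fall back to the minimal dimension m
    return dims

# e.g., shrink_dims(256, r=2, m=4, n_layers=5) -> [256, 128, 64, 32, 16]
```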
Figure 4: Shrink stacked RNN for output.
3 EXPERIMENTS

3.1 Datasets

We have organized four different datasets where each dataset consists of merchants from one of the following categories: department store, restaurants, sports facility, and medical services, denoted as C1 ∼ C4, respectively. For each category, we randomly select 2,000 merchants located within California, United States. The time series datasets consist of four features, each produced by computing the hourly aggregation of the following statistics: number of approved transactions, number of unique cards, sum of the transaction amount, and rate of approved transactions. The training data consists of time series data during November 1–23, 2018 and the test data consists of time series during November 24–30, 2018.

As mentioned in Section 1, the goal of the system is to predict the next 24 hours given the last 168 hours (i.e., seven days). We predict every 24 hours in the test data by supplying the latest 168 hours to the system. For example, the transaction data of the 168 hours in Week-10 is used to predict the values of the 24 hours on the Monday of Week-11.
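This evaluation protocol amounts to a sliding 168-hour window with a 24-hour stride. A minimal numpy sketch of the windowing (function and variable names are ours) could be:

```python
import numpy as np

def make_windows(series: np.ndarray, history: int = 168, horizon: int = 24):
    """Slice an (hours, features) series into (input, target) pairs.

    Each input is the latest `history` hours; each target is the next
    `horizon` hours, advancing by `horizon` between evaluations.
    """
    xs, ys = [], []
    for start in range(0, len(series) - history - horizon + 1, horizon):
        xs.append(series[start:start + history])
        ys.append(series[start + history:start + history + horizon])
    return np.stack(xs), np.stack(ys)

# e.g., 30 days of hourly data with 4 features:
month = np.random.rand(720, 4)
x, y = make_windows(month)  # x: (23, 168, 4), y: (23, 24, 4)
```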
Model Parameters. In the recurrent layers, we use 256 as the number of hidden dimensions. Based on our experiments, we choose sigmoid as the activation function of the input RNN layers and relu as the activation function of the output RNN layers. We use rmsprop with a learning rate of 0.001 as the optimizer. The batch size is set to 256 as a balance between computing-resource utilization and model quality. We chose r = … and m = ….
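Under the stated settings, a training-loop configuration might look like the following sketch; it continues the earlier sketches (same imports, assumes the MSRNN module from Section 2.1), and the use of F.mse_loss follows the training goal described next.

```python
import torch.nn.functional as F

model = MSRNN(n_features=4, hidden_dim=256)
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)  # rmsprop, lr = 0.001

def train_step(x_batch, y_batch):
    """One mini-batch update; batches of size 256 as in the paper."""
    optimizer.zero_grad()
    pred = model(x_batch)             # (256, 24, 4)
    loss = F.mse_loss(pred, y_batch)  # mean squared error (Eq. 6)
    loss.backward()
    optimizer.step()
    return loss.item()
```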
Training Goal. The training goal is to minimize the mean squared error between the predicted and the true values of the features v_k^{(t)} for the next 24 hours within a mini-batch. The training goal can be written as:

$$ \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{4} \sum_{t=1}^{24} \left( v_k^{(t)} - \widehat{v}_k^{(t)} \right)^2 \qquad (6) $$

where n is the mini-batch size, v̂_k^{(t)} denotes the corresponding ground-truth value, k = 1, ..., 4 indexes the features, and t = 1, ..., 24 indexes the prediction steps.
Table 1: Experiment result summary.
RMSE
Method                          Department Store  Restaurants  Sports   Medical  Average
Linear Model                    3.7006            3.7096       4.3325   3.7536   3.8741
Linear Model (L2-regularized)   3.6999            3.7094       4.3324   3.7532   3.8737
Nearest Neighbor                5.9784            5.7083       8.6402   5.4327   6.4399
Random Forest                   3.6918            4.2933       4.8370   3.9903   4.2031

Normalized RMSE
Method                          Department Store  Restaurants  Sports   Medical  Average
Linear Model                    10.5704           5.5397       10.2111  9.9309   9.0630
Linear Model (L2-regularized)   10.5669           5.5393       10.2107  9.9300   9.0617
Nearest Neighbor                11.0063           6.8305       11.2740  9.5840   9.6737
Random Forest                   …                 …            …        …        …
…
Figure 5: Non-normalized and normalized RMSE results with various batch sizes.

Baseline Methods. We compare the proposed model against the following baselines:

• Linear Model is one of the simpler models for solving regression problems. It predicts a value by linearly combining the input vector. Since we are predicting the four features for the next 24 hours (i.e., 96 values in total), we train 96 linear models where each model predicts one value. The input to each of the models is a 672-sized vector consisting of the four time series from the last 168 hours. We also test linear models under both L2-regularized and non-regularized settings. When L2 regularization is used, the parameter controlling the strength of the regularization is found using three-fold cross-validation.

• Nearest Neighbor is another simple method for solving regression problems. It predicts the future by finding the nearest neighbor of the current time series in the past. Once the nearest neighbor is located, the 24 hours following the nearest neighbor can be used as the prediction for the current time series. This method is one of the more prevalent methods in time series data mining [6]; see the sketch after this list.

• Random Forest is an ensemble method utilizing both bootstrap aggregating and the random subspace method [2, 5] to train a set of decision trees. The input to the model is once again a 672-sized vector consisting of the four time series from the last 168 hours, and the output of the model is a 96-sized vector consisting of the four time series for the next 24 hours. The hyper-parameters of the random forest are found based on the estimated error of three-fold cross-validation.

• Recurrent Neural Network is a popular artificial neural network model for modeling sequence data. We use one or two layers of RNN to encode the input time series, then we use a multi-layer perceptron (MLP) to predict the time series for the next 24 hours, following the state-of-the-art [7, 8]. Similar to the proposed method, we test the model with both LSTM [4] and GRU [3] recurrent architectures.
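As an illustration of the nearest-neighbor baseline described above, a minimal sketch under our reading of the setup (function and variable names are ours):

```python
import numpy as np

def nn_forecast(history: np.ndarray, query: np.ndarray, horizon: int = 24) -> np.ndarray:
    """Predict the next `horizon` hours by copying what followed the most
    similar 168-hour window in the past (flattened Euclidean distance)."""
    h = len(query)  # 168 hours, 4 features
    best_dist, best_end = np.inf, None
    for start in range(len(history) - h - horizon + 1):
        window = history[start:start + h]
        dist = np.linalg.norm((window - query).ravel())
        if dist < best_dist:
            best_dist, best_end = dist, start + h
    return history[best_end:best_end + horizon]  # the neighbor's "next 24 hours"
```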
We have evaluated the proposed method and the baseline methods on the four datasets described in Section 3.1. In addition to the more conventional root mean square error (RMSE), we also measure the performance in normalized RMSE. We compute the normalized RMSE by z-normalizing [6] each dimension of the predicted time series before computing the RMSE. This performance measure focuses on how much the shape of the predicted time series differs from the ground-truth time series. The experiment results are summarized in Table 1. For the proposed method, we tested the performance with either a 1-layer LSTM or a 1-layer GRU.
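A minimal numpy sketch of the normalized RMSE as described above, under the assumption that both the prediction and the ground truth are z-normalized per feature dimension before the error is taken (function names are ours):

```python
import numpy as np

def z_normalize(x: np.ndarray) -> np.ndarray:
    """Z-normalize each feature dimension (column) of a (time, features) array."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

def normalized_rmse(pred: np.ndarray, truth: np.ndarray) -> float:
    """RMSE after z-normalization, so only the *shape* mismatch is measured."""
    return float(np.sqrt(np.mean((z_normalize(pred) - z_normalize(truth)) ** 2)))
```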
Figure 6: Non-normalized and normalized RMSE results with various hidden dimensions.
When we consider only the baseline methods, a simple L2-regularized linear model outperforms all the other non-deep-learning methods and is comparable to the deep learning-based methods in averaged RMSE. For averaged normalized RMSE, random forest outperforms all other baseline methods, even the state-of-the-art RNN-based solution, which shows how normalized RMSE can provide an alternative view on the predictions produced by different methods. Nevertheless, both the averaged RMSE and the averaged normalized RMSE suggest that MS-RNN with either LSTM or GRU outperforms all baseline methods.

When we look closely at the per-merchant-type results, MS-RNN outperforms the RNN on two out of four merchant types for RMSE and on three out of four merchant types for normalized RMSE. Coupled with the fact that the improvement in RMSE is 3.4% while the improvement in normalized RMSE is 5.4%, MS-RNN is more capable of modeling the shape of the time series compared to the state-of-the-art RNN method. This improvement can be attributed to the model architecture, as the multi-stream design helps the model attend more to the details of the input time series.
In each set of experiments, we test one parameter associated with the model architecture or the training process: the batch size, the number of hidden dimensions, the number of RNN layers, and the shrink stacked RNN.
Batch Size.
Three different settings of the batch size (i.e., 128, 256 and 512) are tested for the proposed MS-RNN with either LSTM or GRU. As in the previous experiments, we report both the RMSE and the normalized RMSE. The result is shown in Figure 5. There is no apparent trend in the performance of the model with respect to the batch size. In other words, the proposed architecture is not sensitive to this parameterization of the optimization process.
Number of Hidden Dimensions.
We have considered three different settings for the number of hidden dimensions (i.e., 64, 128 and 256). The experiment results are summarized in Figure 6. There is no clear trend in the performance with respect to the number of hidden dimensions, which suggests that setting the number of hidden dimensions to 64 is sufficient for these datasets.
Figure 7: Effect of number of layers (shared y-axis).
Figure 8: The effectiveness of the shrink stacked RNN with various parameter settings.

Number of RNN layers.
We compare the performance difference between the 1-layer and the 2-layer variants of MS-RNN with both LSTM and GRU; the result is presented in Figure 7. Increasing the number of layers helps our model capture the shape better (lower normalized RMSE); however, it also slightly degrades the RMSE on all datasets for LSTM and greatly degrades the RMSE on medical merchants for GRU. The choice for this parameter depends on whether RMSE is more important than normalized RMSE.
Shrink Stacked RNN.
We tested the effectiveness of the shrink stacked RNN under 6 different architecture settings (LSTM or GRU, each with 3 different numbers of hidden dimensions) on the 4 datasets. In other words, we compare MS-RNN with and without the shrink stacked RNN under 24 different settings; the comparison is organized in Figure 8. The effect of adding the shrink stacked RNN to MS-RNN is positive. When we look closely at the non-normalized RMSE results, MS-RNN with the shrink stacked RNN outperforms its counterpart without it in 19 out of 24 setups. When the normalized RMSE results are considered, MS-RNN with the shrink stacked RNN achieves better performance than its counterpart in 17 out of 24 setups. The shrink stacked RNN design does help MS-RNN converge to a better solution.
In addition to the quantitative analysis of the MS-RNN model, we perform a qualitative analysis of the time series predictions output by the model. In particular, we plot the predicted time series of MS-RNN along with both the ground truth and the state-of-the-art RNN methods; the specific variants that we plot for both methods are the 1-layer GRU and the 1-layer LSTM.
Figure 9: The true and predicted time series of a randomly chosen merchant for the immediate future (i.e., hour 0), near future (i.e., hour 12) and far future (i.e., hour 23). All values are normalized to the [0, 1] range.
Figure 10: A density plot of the last prediction step (t + 24) of all merchants in the restaurant category. All values are normalized to the [0, 1] range.

In Figure 9, we plot the predicted time series of a randomly chosen merchant for the immediate future (i.e., hour 0), near future (i.e., hour 12) and far future (i.e., hour 23). We show the time series for the GRU variants of each model in the top two rows and for the LSTM variants in the bottom two rows of Figure 9; each row within the respective group corresponds to a different feature. We can see that MS-RNN with either GRU or LSTM is better than the state-of-the-art RNN at all three time points and for both features, especially at hour 23 for the first feature. When we focus on the latter two rows, which show the prediction results for the LSTM variants of both MS-RNN and RNN, the conclusion is similar: MS-RNN produces predictions of higher quality compared to RNN. Such a visual examination confirms that the prediction from the MS-RNN model is better than that of the state-of-the-art RNN model.

To provide a dataset-wise holistic visualization of the predictions of MS-RNN, we present a density plot of our model against the ground truth in Figure 10. We discretize the plot region into grid cells and record the number of points falling into each cell when overlaying the hour-23 predictions of all restaurant merchants. The color from white to red/blue reflects the increasing count from 0 to the maximum number of samples for either the ground truth or MS-RNN. The density plot for MS-RNN matches that of the ground truth, which further assures the prediction capability of MS-RNN.
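The density plot described above can be reproduced with a simple 2D histogram. The following sketch is our own illustration of that procedure (numpy/matplotlib; function names and the 50-cell grid are assumptions), counting the (feature-1, feature-2) points per grid cell:

```python
import numpy as np
import matplotlib.pyplot as plt

def density_plot(truth: np.ndarray, pred: np.ndarray, bins: int = 50):
    """Discretize the [0, 1] x [0, 1] plot region into grid cells and count
    the (feature-1, feature-2) points of all merchants falling into each cell."""
    fig, axes = plt.subplots(1, 2, figsize=(8, 4), sharex=True, sharey=True)
    for ax, data, cmap, title in [(axes[0], truth, "Reds", "Ground truth"),
                                  (axes[1], pred, "Blues", "MS-RNN")]:
        counts, _, _ = np.histogram2d(data[:, 0], data[:, 1],
                                      bins=bins, range=[[0, 1], [0, 1]])
        # white -> red/blue reflects increasing counts, as in Figure 10
        ax.imshow(counts.T, origin="lower", extent=[0, 1, 0, 1], cmap=cmap)
        ax.set_title(title)
    return fig
```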
4 CONCLUSION

In this work, we identified a real-world problem of predicting multiple features over multiple future consecutive time steps. We proposed a novel and flexible multi-stream RNN-based solution to capture merchant transaction patterns. Our extensive experiments have demonstrated that the proposed framework outperforms existing baseline methods under various configurations.

Predicting multiple steps into the future on multivariate time series is an interesting topic in the FinTech industry. One limitation of this work is that we use only one week's data to predict the following day's transactions. A possible future elaboration of this work could be to modify the network to utilize transaction data from a longer period (e.g., a month).

Another interesting research direction would be to explore non-recurrent neural networks, such as convolutional neural networks (CNNs). CNNs are potentially useful for this problem for two reasons: (1) a CNN has a simpler structure compared to an RNN, which simplifies the training process; (2) intuitively, a CNN may also fit well with the sliding-window problem setting in time series research.
REFERENCES
[1] George Box, Gwilym M. Jenkins, and G. Reinsel. 1970. Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco.
[2] Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5–32.
[3] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
[4] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. 1999. Learning to forget: Continual prediction with LSTM. (1999).
[5] Tin Kam Ho. 1995. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Vol. 1. IEEE, 278–282.
[6] Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Brandon Westover, Qiang Zhu, Jesin Zakaria, and Eamonn Keogh. 2012. Searching and mining trillions of time series subsequences under dynamic time warping. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 262–270.
[7] Souhaib Ben Taieb and Amir F. Atiya. 2015. A bias and variance analysis for multistep-ahead time series forecasting. IEEE Transactions on Neural Networks and Learning Systems 27, 1 (2015), 62–76.
[8] Ruofeng Wen, Kari Torkkola, Balakrishnan Narayanaswamy, and Dhruv Madeka. 2017. A multi-horizon quantile recurrent forecaster. arXiv preprint arXiv:1711.11053 (2017).
[9] Chuxu Zhang, Dongjin Song, Yuncong Chen, Xinyang Feng, Cristian Lumezanu, Wei Cheng, Jingchao Ni, Bo Zong, Haifeng Chen, and Nitesh V. Chawla. 2019. A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33.