Predicting next shopping stage using Google Analytics data for E-commerce applications
Mihai Cristian Pîrvu
MorphL, UPB
Alexandra Anghel
MorphL

May 30, 2019

Abstract
E-commerce web applications are almost ubiquitous in our day-to-day life. However, as useful as they are, most of them have little to no adaptation to user needs, which in turn can cause both lower conversion rates and unsatisfied customers. We propose a machine learning system which learns the user's behaviour from multiple previous sessions and predicts useful metrics for the current session. In turn, these metrics can be used by the applications to customize and better target the customer, which can mean anything from offering better deals on specific products to targeted notifications or smart ad placement. The data used by the learning algorithm is extracted from Google Analytics Enhanced E-commerce, which is enabled by most e-commerce websites, so the system can be used by any such merchant. In order to learn the user's patterns, only behavioural features were used; these do not include names, gender or any other personal information that could identify the user. The learning model is a double recurrent neural network which learns both intra-session and inter-session features. For each session, the model predicts a probability score for each of the defined target classes.

1 Introduction

The internet is crowded with e-commerce businesses, with estimates ranging from 12 to 24 million websites in 2019 [1]. E-commerce websites are commonly structured around item lists, item details, a shopping cart and a checkout process, with an optional step for paying online. Most of the time, the shopping experience doesn't take into account the user's needs or history. Personalization is employed at a basic level, by profiling users into general categories based on gender, age, location, mobile / desktop device and so on. On the other hand, customers receive the same type of notifications and offers, with little or no attention to their specific requirements.
General discount rates or rule-based calls to action are sent, such as email notifications for finalizing a purchase when an item was already added to the cart. This general approach misses a huge opportunity to increase conversion rates by engaging users with a personalized experience. Clearly, if an e-commerce application wants to maximize its profits, it should adapt to the user rather than offer a one-size-fits-all experience. In this article, we present a learning algorithm which predicts the outcome of a user's browsing session, using the information gathered from their previous sessions as well as the data that the user provides during the current one. The data that the user generates is gathered from the Google Analytics Reporting API v4. The same model can be applied to Google Analytics 360 / BigQuery. Google Analytics is very useful for generating activity reports or segmenting users. In addition, the Enhanced E-commerce section offers an overview of the sales funnel. However, this funnel doesn't offer any insights into why some users convert and some do not.

https://developers.google.com/analytics/devguides/collection/analyticsjs/enhanced-ecommerce
https://developers.google.com/analytics/devguides/reporting/core/v4
Our proposed method enables merchants to enhance the user experience by knowing beforehand the behaviour of the user during a particular session. The behaviour is defined as a probability vector over the most common actions that a user can take: visiting a regular page, visiting an item details page, adding an item to the online cart, visiting the checkout page and actually making a successful transaction. These actions correspond to the purchase funnel, available as a report in the Google Analytics dashboard. For each of these classes, based on the user's history and the user's features in the current session, a probability vector is output. This vector doesn't include the value of the predicted transaction, just the probability that the user will make a purchase; the same applies to all classes. It becomes straightforward for retailers to use this information to target the user with better ads or notifications, or even offer incentives, such as gifts or discounts, while allowing the shop to also make a profit. This becomes a win-win situation, using only the data at hand. In Section 3 we present how the data is exported, which features are kept and how they are processed in order to enable learning. Then, in Section 4, we present the learning algorithm, the architecture of the model and other information related to training. Lastly, in Section 5, we provide quantitative and qualitative results, as well as an experiment on real-world data that compares the trained model against purely statistical solutions.

2 Related Work

In the domain of website personalization via learning, two classical approaches are usually taken. The first is learning from the data alone using various regression or classification models, such as linear regression, SVMs or neural networks. These methods can also be extended to include larger time lines as context, resulting in recurrent models.
The second direction is reinforcement learning, where the goal is optimizing some non-differentiable criterion, such as maximizing the number of clicks on an ad, maximizing profit or minimizing churn. For recommender systems, classical approaches use either collaborative filtering or content-based filtering with supervised learning. Matrix factorization [2] is a prime example of collaborative filtering; however, its main problem is that it can suffer from the cold-start issue with new users. This is partially solved using recurrent neural networks on the data alone, such as in [3]. Another classical problem is the task of intent prediction, which aims at finding the intention of a user during the current session. Recent methods include predicting whether the user will buy during the current session using recurrent neural networks [4]. Other approaches involve splitting the intent of a user into multiple disjoint classes, such as informational, transactional, considerational or navigational, corresponding to the marketing funnel. In [5], we predicted these classes using only the search query that the user employed to land on the page. We used a semi-supervised approach: we first created an auto-labeling process, annotated a large amount of queries from a big corpus and trained a partial model. Then, we fine-tuned the model using a small subset of manually labeled queries, which proved better than using either step individually. One of the pioneering works in the field of reinforcement learning for website personalization uses contextual bandits [6, 7]. This algorithm tries to optimize a given criterion, say the number of clicks on an ad, in a one-action system using the historical data for each action. The classical non-contextual bandits only use the statistics about previous clicks; then, after computing scores for each action and using a choice strategy (a classical example of exploration vs. exploitation), an ad is shown.
The contextual bandits extend this problem by adding user features into the mix, such as the previous page, the user's history, how the user arrived at the current page and so on. This kind of feature is what our learning algorithm tries to leverage as well, but we are limited to the features offered by Google Analytics. Using Google Analytics data for machine learning purposes is a relatively new and rarely used method, perhaps because the analytics platform's original focus is on aggregated data. The User Explorer report, which provides insights at the user level, is relatively new. In [8], they predict the number of people visiting a city based on the website traffic of various touristic websites. In [9], they try to identify the demography of the users of a website, studying the evolution of location as well as the devices used over some time frames. In [10], they analyzed website traffic to sort the most visited pages of an online library; based on the traffic analysis, they put more emphasis on the most visited pages, promoting them to the main page more often, which decreased the bounce rate by a large margin. In [11], they analyze the performance of various e-commerce sites using statistics computed from the features offered by Google Analytics.

https://support.google.com/analytics/answer/6014872?hl=en
https://support.google.com/analytics/answer/6339208?hl=en
We can see that most of these articles focus on human-made assessments and simple statistics, while our method uses the data to train a recurrent neural network, which in turn provides an insight that the website can use programmatically in order to improve its performance.
3 Data

The learning algorithm leverages data that is readily available to any e-commerce website that has enabled Google Analytics Enhanced E-commerce. However, in order to implement a proof of concept for the task of predicting the next shopping stage, we took a small sample over a predefined time frame from an online e-commerce retailer. All the computed experiments and statistics are only valid for that particular website. Therefore, in order to apply the same learning algorithm to a new website, the statistics must be recomputed and the model must be retrained and fine-tuned for it. The Google Analytics Reporting API v4 enables administrators to export the logged data in JSON format, which can be processed, turned into a tabular form with numerical features and used for training machine learning methods. The data is divided into three main components: user data, session data and hits data. User data represents the information about each user, meaning a unique Client ID, device (mobile or desktop), browser and user type (New vs. Returning Visitor). It should be noted that the Client ID refers to a browser, not to a user account, thus it doesn't contain any personal data. It is possible to associate the Client ID with a user account (across devices) by providing an authentication feature in the e-commerce application; however, in this particular use case, all Client IDs refer to browsers and have no correlation with real-life names or any information that could be used to identify the user. Session data contains the information about each session, such as duration, number of transactions, number of searches, etc. Finally, hits data represents the intra-session information, such as the time spent on each page during the current session, the time when the page was accessed or the fact that the user looked at the details of a product. The basic user and session data is generally available for all websites that include the Google Analytics script, without additional setup.
However, e-commerce shopping stages, such as visualizing item lists and item details, adding a product to the cart and the checkout process, require additional code that must be added by the webmaster. These actions are then sent to Google Analytics via the DataLayer. The setup process is described at length in the "Enhanced E-commerce (UA) Developer Guide". In addition, in order to allow exporting the user-level, session-level and hits-level data via the Google Analytics API, custom dimensions must be added, uniquely identifying each user / browser, each session and each hit / timestamp. Next, we'll talk about each feature in particular, giving a small description of each and trying to understand how they influence the decision of the user at the end of each session. For a complete list of features included in the Google Analytics API, one can visit the official page.

User features.
The user features are invariant for each user. If the same person uses two browsers or devices, then he or she will be counted as two different users by this system.

Feature | Description
ClientID | Unique Client ID in Google Analytics, used to track the user across multiple sessions using a browser cookie. It must be set as a custom dimension in order to export user data from the API.
User Type | Whether this user is a new visitor or a returning visitor
Device Category | The category of the device: mobile, tablet or desktop
Browser Name | The raw name of the browser
Browser Revenue per Transaction | Browser feature computed as the revenue of each browser divided by the total number of transactions made using that browser
Device Name | The raw name of the mobile device
Device Revenue per Transaction | Device feature computed as the revenue of each device divided by the total number of transactions made using that device

The ClientID is used for joining the tables together. Browser Name and Device Name cannot be used directly as features, because they are categorical features that can grow indefinitely. However, by including them, we can compute a statistical number regarding how much revenue a transaction brings for each particular browser and device.

https://developers.google.com/tag-manager/enhanced-ecommerce
https://developers.google.com/analytics/devguides/reporting/core/dimsmets

This is computed
by summing all the transaction revenues for each device or browser and then dividing by the number of transactions for that item. A statistical analysis of these features can be seen in Figures 1 & 2.

Figure 1: Users histograms based on browsers

Figure 2: Users histograms based on devices

We can see that while the number of users using a particular device or browser is dominated by a small number of items, the actual revenue per transaction differs vastly. The final feature we include from these insights is the last histogram, where the raw number is used, thus prioritizing high-selling devices over low-selling ones.
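The per-browser (or per-device) revenue-per-transaction statistic can be sketched in plain Python. The field names here are hypothetical, since the actual export schema depends on the configured custom dimensions:

```python
from collections import defaultdict

def revenue_per_transaction(sessions):
    """Compute revenue per transaction for each browser.

    `sessions` is a list of dicts with hypothetical keys
    'browser', 'revenue' and 'transactions', one entry per session.
    """
    revenue = defaultdict(float)
    count = defaultdict(int)
    for s in sessions:
        revenue[s["browser"]] += s["revenue"]
        count[s["browser"]] += s["transactions"]
    # Browsers with no transactions get a statistic of 0.
    return {b: (revenue[b] / count[b] if count[b] else 0.0)
            for b in revenue}

sessions = [
    {"browser": "Chrome", "revenue": 100.0, "transactions": 2},
    {"browser": "Chrome", "revenue": 50.0, "transactions": 1},
    {"browser": "Safari", "revenue": 0.0, "transactions": 0},
]
stats = revenue_per_transaction(sessions)  # {'Chrome': 50.0, 'Safari': 0.0}
```

The same aggregation is applied per device; the resulting number replaces the unbounded categorical Browser Name / Device Name columns.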
Session features.
The session features are those which remain invariant during a session, as opposed to hits features, which change with each page view.
Feature | Description
ClientID | Unique Client ID in Google Analytics, used to track the user across multiple sessions using a browser cookie
SessionID | Unique Session ID in Google Analytics. Set as a custom dimension, composed of a random string and a timestamp.
Session Duration | The duration of the current session
Unique Pageviews | Number of unique pages that were visited during the current session
Transactions | Number of transactions that took place during this session
Revenue | The amount of money spent
Unique Purchases | The number of unique items that were purchased
Days Since Last Session | Integer representing the number of days between the previous session and the current one
Site Search Status | Boolean representing whether the internal website search function was used
Results Pageviews | Number of times the results page of an internal search was accessed
Total Unique Searches | Total number of unique internal searches. If the same keyword is searched multiple times, it is counted only once.
Search Depth | The total number of subsequent page views made after a use of the site's internal search feature
Search Refinements | The total number of times a refinement (transition) occurs between internal search keywords within a session
Shopping Stage | The shopping stage at the end of each session. This is the target feature for this use case.

The first two features are only used to join the hits and user features together. All the other columns are transformed into numerical values and used as-is in the learning process. It may not be obvious that all columns are useful for predicting the next shopping stage, but some of them, such as revenue or the number of transactions, definitely have an influence. It should be noted that for predicting the current shopping stage, the features of the previous session are used directly, because otherwise the transactions or revenue columns would directly tell us whether a user made a payment or just visited the website.
This also enables the system to be used as a real-time application, where the features of the previous sessions are used in addition to the hits of the current session in order to predict the shopping stage of the current session. More about this will be detailed in Section 4.
Hits features.
The hits features are those which are updated for each page view of each session.

Feature | Description
ClientID | Unique Client ID in Google Analytics, used to track the user across multiple sessions using a browser cookie
SessionID | Unique Session ID in Google Analytics
Date Hour and Minute | The hour and minute when the hit was made
Time on Page | The amount of time the hit took
Product Detail Views | Whether the user looked at the details page of a product

We can see that there are only a few usable features, which is a downside of using Google Analytics in its current form. However, if more hits were enabled, this would only increase the quality of the data and, in turn, the quality of the results. In Figure 3, we can see a correlation matrix for all the features, using Pearson's correlation coefficient.
Figure 3: Features correlation matrix

The last column is the target column, and we can see that most, if not all, features have some correlation with it. The strongest ones are the number of transactions, the revenue and the unique purchases in the session. This is the reason why we concatenate the features of the previous session to the current session instead of using the current ones: in a real-life situation, we'd have no idea how many transactions a user has made at the start of a session, and those features give away the shopping stage. It is also important to observe that all features provide some sort of correlation, meaning that, while this problem is starved of features, all of them are somewhat relevant. One final thought is that adding more custom features would increase the quality of the results.
4 Proposed method

Our initial intention was to simply predict the shopping stage as a classification problem over the 6 possible classes. One issue that we had to solve is that Google Analytics doesn't offer us the real-time shopping stage, in which case we would get the status of the user for each hit (page view). Instead, it offers us, at the end of the session, all the shopping stages the user went through, without knowing which hit caused which stage. Therefore, after removing outliers and aggregating unique paths, we are left with the following classes:

• All Visits
• All Visits -> Product View
• All Visits -> Product View -> Add to Cart
• All Visits -> Product View -> Add to Cart -> Checkout
• All Visits -> Product View -> Checkout
• Transaction
These are the 6 disjoint and standard paths a user usually takes when visiting an e-commerce website. The first one corresponds to simply visiting a miscellaneous page, such as the contact page or information about the retailer. The second one is the opposite, where the user visits a page that contains a product or the list of all products. The next ones correspond to the user adding an item to the shopping cart and checking out a shopping cart. Finally, the last one is the most important one for this use case, where a user has successfully bought a product.

However, we quickly realized that, because of data imbalance, the classifier would always predict the most dominant classes (the first two), so we changed the problem formulation from a classification problem to a regression problem. Now, the model has to predict a probability for each of the 6 classes. The main problem then was deciding how to model this probability, as we'd like it to be as close as possible to the actual probability of a user making a Transaction, or making an All Visits hit and so on. Our solution was to compute it using two attribution modelings, namely linear and time decaying.

Linear attribution model. This attribution model simply gives equal contribution for a transaction to all sessions, regardless of the time that passed between the first and last sessions. Let's consider a user that has 5 sessions, two of which contain transactions. We can view this as a binary vector: v = [0, 1, 0, 0, 1]. Using a linear attribution model, this user has a transaction probability of 2/5 = 0.4. However, since we are dealing with a recurrent neural network, we'd like to learn this user's whole behaviour, starting from his first session up until his fifth. Thus, his transaction probability up until his n-th session is the partial sum of his transactions starting from the first session all the way to the n-th one.
Formally, t(n) = (Σ_{i=1}^{n} v(i)) / n, for all n ≤ N, where N is the total number of sessions for this user. The resulting vector t for this particular user is t = [0/1, 1/2, 1/3, 1/4, 2/5] = [0, 0.5, 0.33, 0.25, 0.4]. This is the ground-truth vector that the recurrent model must learn at each time step. The same process is repeated for all 6 classes, not just transactions. We can see that all the sessions indeed contribute equally, giving a large value to the transaction that was made in the 2nd session all the way up until the 5th session. The next attribution model changes this, by giving more weight to the sessions that are closer to the last session.

Time-decay attribution model. The second attribution modeling tries to give more value to the latest sessions, by applying a half-life weight based on how far away each session is from the current one: at step n, session i receives the weight w(i) = 0.5^(n-i). Given the same example as before, with the transactions vector v = [0, 1, 0, 0, 1], the weight vector at the 5th session is w = [1/16, 1/8, 1/4, 1/2, 1] = [0.0625, 0.125, 0.25, 0.5, 1]. Making a parallel to the linear case, the weight vector there is simply 1 for each position. Then, the partial transaction probability can be formulated as:

t(n) = (w(1:n) · v(1:n)) / (Σ_{i=1}^{n} w(i)), for all n ≤ N,

where · is the dot product between two vectors and : is the partial vector operator. Technically, the same formula applies to the linear case: there the weight vector contains just ones, so the denominator sums to n and the numerator simplifies to the sum above. For this particular case, the transaction probability vector is:

t = [0/1, (0.5·0 + 1·1)/(0.5 + 1), (0.25·0 + 0.5·1 + 1·0)/(0.25 + 0.5 + 1), (0.125·0 + 0.25·1 + 0.5·0 + 1·0)/(0.125 + 0.25 + 0.5 + 1), (0.0625·0 + 0.125·1 + 0.25·0 + 0.5·0 + 1·1)/(0.0625 + 0.125 + 0.25 + 0.5 + 1)] = [0, 0.67, 0.29, 0.13, 0.58].

We can see that this method puts much more value on the more recent sessions. Had the first transaction been made 1 session closer to the last one, the transaction probability value of the last session would have become 0.65 according to the same formula.
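The two label constructions can be reproduced with a short Python sketch (the function names are ours, not from the paper's code):

```python
def linear_labels(v):
    """Linear attribution: partial mean of the binary event vector v."""
    return [sum(v[:n]) / n for n in range(1, len(v) + 1)]

def time_decay_labels(v, base=0.5):
    """Time-decay attribution: half-life weighted partial average.

    At prediction step n, session i (1-indexed) gets weight base**(n - i),
    so the most recent session always has weight 1.
    """
    out = []
    for n in range(1, len(v) + 1):
        w = [base ** (n - i) for i in range(1, n + 1)]
        out.append(sum(wi * vi for wi, vi in zip(w, v)) / sum(w))
    return out

v = [0, 1, 0, 0, 1]  # two transactions in five sessions
print([round(x, 2) for x in linear_labels(v)])      # [0.0, 0.5, 0.33, 0.25, 0.4]
print([round(x, 2) for x in time_decay_labels(v)])  # [0.0, 0.67, 0.29, 0.13, 0.58]
```

Both outputs match the worked example above; repeating the computation for each of the 6 classes yields the per-session ground-truth vectors.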
The model proposed to tackle this problem is a double recurrent neural network, where both recurrent layers are implemented using LSTM cells. The high-level architecture can be seen in Figure 4.

https://support.google.com/analytics/answer/1662518?hl=en
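A minimal PyTorch sketch of this double-LSTM architecture, processing one user per forward pass, is shown below. The feature sizes are hypothetical; the hidden size of 30 and the 6 sigmoid outputs follow the text:

```python
import torch
import torch.nn as nn

class DoubleLSTM(nn.Module):
    """Sketch of the double recurrent model: a hits-level LSTM feeding
    a session-level LSTM, then two fully connected layers."""
    def __init__(self, n_hits_feats, n_session_feats, n_user_feats,
                 hidden=30, n_classes=6):
        super().__init__()
        self.hits_lstm = nn.LSTM(n_hits_feats, hidden, batch_first=True)
        self.session_lstm = nn.LSTM(hidden + n_session_feats, hidden,
                                    batch_first=True)
        self.fc1 = nn.Linear(hidden + n_user_feats, hidden)
        self.fc2 = nn.Linear(hidden, n_classes)

    def forward(self, hits, prev_sessions, user):
        # hits: (n_sessions, n_hits, n_hits_feats) for one user
        # prev_sessions: (n_sessions, n_session_feats), first row zeroed
        # user: (n_user_feats,)
        _, (h, _) = self.hits_lstm(hits)           # h: (1, n_sessions, hidden)
        h = h.squeeze(0)                           # one H_i vector per session
        x = torch.cat([h, prev_sessions], dim=1).unsqueeze(0)
        out, _ = self.session_lstm(x)              # (1, n_sessions, hidden)
        out = out.squeeze(0)
        u = user.expand(out.shape[0], -1)          # repeat user features
        out = torch.relu(self.fc1(torch.cat([out, u], dim=1)))
        return torch.sigmoid(self.fc2(out))        # (n_sessions, n_classes)

model = DoubleLSTM(n_hits_feats=3, n_session_feats=12, n_user_feats=5)
probs = model(torch.randn(4, 7, 3), torch.randn(4, 12), torch.randn(5))
# probs: one 6-way probability vector per session
```

Per the paper, these outputs would be trained with an MSE loss against the attribution-model targets; padding of variable-length hit sequences is omitted for brevity.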
Figure 4: Model Architecture

The data flow is as follows: for each user, we divide the entries of the user based on all his sessions. For each session of this user we have a variable number of hits, denoted by the blue box in the figure. After passing all the hits features through the first LSTM, we get a hidden state vector, noted H_i, for each session. This feature vector is concatenated with the features of the previous session, S_{i-1}, with the special case that the first session is concatenated with a zeroed vector, as there is no history beforehand, resulting in the combined feature vector of sessions and hits. All these session + hits vectors are then passed through the second LSTM, resulting in another feature vector, S_{i-1}H_i, that combines the features of the current session with the history of the previous sessions, for all sessions, denoted in green in the figure. To each of these feature vectors we also concatenate the user features and then pass them through two fully connected layers, after which we get 6 probabilities, representing the probability of each possible output, at every session. The activation function for the first FC layer is ReLU, while the second one uses the element-wise sigmoid. This is denoted in pink in the figure. These outputs are compared with the ground-truth probability vectors, computed as described earlier depending on the chosen attribution model, using MSE as the loss function of the model. The only hyperparameter of this model is the number of hidden units of the LSTM hidden states, which was empirically chosen as 30, due to the low feature space of the hits. The total number of trainable parameters for this model is just 15,246, which can be considered very lightweight compared to other deep learning models.

5 Experiments

The experiments were run on a private dataset from an e-commerce website, which was exported exactly as described in Section 3.
Both attribution models were used (linear and time decaying), as well as two normalization methods (min-max normalization and data standardization). The results were fairly similar, as we can see in Table 1. The model was implemented using PyTorch, using only standard layers for the neural network.

Table 1: Loss, Accuracy (0.5/2.0) and Accuracy (0.8/1.25) for each combination of attribution model (linear, time decaying) and normalization (min-max, standardization).

The accuracy is computed with a renormalization scheme that defines interval thresholds based on the real number of transactions and the predicted one. For example, if at the 5th session a user has made 3 transactions, then, for the (0.5/2.0) interval, we say that the prediction is correct if the predicted number of transactions is more than half (1.5) and less than twice (6) the real one. This is evaluated only for the last session, based on the output probability and the number of sessions of the user. A similar logic is applied to the time-decaying model as well.
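Under this interpretation, the interval-based accuracy check reduces to a small helper (hypothetical, not from the paper's code):

```python
def within_factor(real, predicted, low=0.5, high=2.0):
    """True if `predicted` lies within [low * real, high * real].

    Mirrors the (0.5/2.0) accuracy interval; (0.8/1.25) is the
    stricter variant.
    """
    return low * real <= predicted <= high * real

# A user with 3 real transactions at the last session:
print(within_factor(3, 1.5))   # True  (exactly half)
print(within_factor(3, 7))     # False (more than twice)
print(within_factor(3, 4, low=0.8, high=1.25))  # False, stricter interval
```

The predicted count itself would be obtained by multiplying the output probability by the user's number of sessions, as the renormalization scheme describes.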
We can observe that the linear model learns much better than the time-decaying one, perhaps because applying an exponential law to the distance between sessions as a label target is too complicated given the small number of features that we could gather using Google Analytics. In what follows, we take these models and apply them to a real-life situation, where we'd like to target users, with an ad, a promotion or anything else targeting might mean.
Given a subset of the validation set, with a segment of 5347 users, we want to target as many as possible of these users in their last session (which can be thought of as the active session). We know that only 32 users performed a transaction in the last session; however, we have the history of transactions of all of them across all of their sessions. These 32 transactions accounted for a revenue of 29579 (in some currency). We analyze four targeting methods: two based on the two machine learning models, one using a random scheme and one using a statistical method similar to the linear attribution model. We want to see, for each method, what the percentages of true positives (correctly targeted users) and false positives (incorrectly targeted users) are. In real-life situations, each targeted user costs an amount of revenue (the cost of an ad, for example), and the breaking point represents the maximum amount that can be spent per target while still making a profit.

Our experiment can be described as follows. Given each user and their history (sessions, hits, transactions, etc.), we compute a transaction probability 0 ≤ p_u ≤ 1. Then, for this user, we apply a binomial sampling and, if the result is positive, we target him. Targeting has a cost, but that cost can be recovered if the user makes a transaction after being targeted. If not, then the user was targeted for nothing, and we lose money from the total revenue. Thus, for the 4 methods (random, statistical, time decaying and linear), we'd like to know how many users (on average) we correctly target (true positives) and how many users we wrongly target (false positives). The random method gives a random probability to all users without any bias. The statistical method is a replica of the linear attribution model, where we use all the sessions besides the last one for each user and apply a random uniform noise of 0.1. The last two methods use the values reported by the trained models.
In Table 2, we can see the results of this experiment, for this particular validation set.

Table 2: Mean and standard deviation of true positives and false positives (counts and percentages), together with the breaking cost point, for each of the 4 targeting methods (random, statistical, time decaying, linear).

The profit is defined as profit = Σ_{i ∈ TP} revenue(i) - (TP + FP) · cost, where the revenue vector represents the amount of money each particular user spent in the last session. The breaking cost point is obtained by setting the profit to 0, resulting in BP = Σ_{i ∈ TP} revenue(i) / (TP + FP). In Figure 5, we can observe the linear and log-scale plots of this function, for all 4 methods.
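The targeting protocol and the profit / breaking-point formulas above can be simulated with a short sketch (the numbers are illustrative, not the paper's dataset):

```python
import random

def simulate_targeting(probs, buyers, revenue, cost, seed=0):
    """Target each user via binomial sampling on probs[u], then compute
    true/false positives, profit and the breaking cost point."""
    rng = random.Random(seed)
    tp = fp = 0
    gained = 0.0
    for u, p in enumerate(probs):
        if rng.random() < p:          # user u is targeted
            if buyers[u]:
                tp += 1
                gained += revenue[u]  # revenue of the user's last session
            else:
                fp += 1
    profit = gained - (tp + fp) * cost
    breaking_point = gained / (tp + fp) if tp + fp else 0.0
    return tp, fp, profit, breaking_point

# Tiny example: 4 users, one of whom converts in the last session.
probs = [0.9, 0.1, 0.8, 0.05]
buyers = [True, False, False, False]
revenue = [100.0, 0.0, 0.0, 0.0]
tp, fp, profit, bp = simulate_targeting(probs, buyers, revenue, cost=10.0)
```

Setting profit to zero recovers the breaking cost point BP = Σ revenue(i) / (TP + FP): any per-target cost below BP keeps the campaign profitable.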
Figure 5: Revenue per targeting cost (linear and log scale)

We observe that, if the cost of targeting is very low (close to 0), then the random method would give us the highest profit. However, this cost is seldom so low, because it usually represents the cost of an ad campaign, which typically has much higher values. Also, if we target users that don't want to buy something now, we risk losing them forever. This is why it is important to only target relevant users, who might convert with a high probability based on their behaviour. As the cost increases, the profit drops drastically for the random scheme. The other 3 methods prove more resilient, with the linear model having the best result, with a breaking cost point of over 100.
6 Conclusions

In this paper we explained how to use Google Analytics to export relevant data from an e-commerce website and how to use that data to train a recurrent neural network model in order to predict the probability of a user performing an action. These actions were defined purely using statistics, by taking the 6 most visited paths; the most important one is the transaction class. Using the trained models, we provided a targeting experiment, in which we checked whether the model performs better than a random or statistics-based targeting rule by computing the breaking cost point. We observed that the linear attribution model performs the best, much better than the other 3 methods.
References

[1] Syed Muneeb Ul Hasan. The top e-commerce companies of 2019. https://magenticians.com/top-ecommerce-companies-in-the-world/.
[2] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
[3] Yong Kiam Tan, Xinxing Xu, and Yong Liu. Improved recurrent neural networks for session-based recommendations. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pages 17–22. ACM, 2016.
[4] Humphrey Sheil. Discovering User Intent In E-commerce Clickstreams. PhD thesis, Cardiff University, 2019.
[5] Mihai Cristian Pîrvu, Alexandra Anghel, Ciprian Borodescu, and Alexandru Constantin. Predicting user intent from search queries using both CNNs and RNNs. arXiv preprint arXiv:1812.07324, 2018.
[6] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.
[7] Alberto Bietti, Alekh Agarwal, and John Langford. A contextual bandit bake-off. arXiv preprint arXiv:1802.04064, 2018.
[8] Ulrich Gunter and Irem Önder. Forecasting city arrivals with Google Analytics. Annals of Tourism Research, 61:199–212, 2016.
[9] David Durden. Identifying user demographics in digital collections with Google Analytics. 2016.
[10] Amy Vecchione, Deana Brown, Elizabeth Allen, and Amanda Baschnagel. Tracking user behavior with Google Analytics events on an academic library web site. Journal of Web Librarianship, 10(3):161–175, 2016.
[11] Beatriz Plaza. Google Analytics for measuring website performance.