Hierarchical Dynamic Modeling for Individualized Bayesian Forecasting
Anna K. Yanchenko, Di Daniel Deng, Jinglan Li, Andrew J. Cron, Mike West
Department of Statistical Science, Duke University, Durham, NC 27708-0251, U.S.A., and 84.51°, 100 West 5th Street, Cincinnati, OH 45202, U.S.A.

January 12, 2021
Abstract
We present a case study and methodological developments in large-scale hierarchical dynamic modeling for personalized prediction in commerce. The context is supermarket sales, where improved forecasting of customer/household-specific purchasing behavior informs decisions about personalized pricing and promotions on a continuing basis. This is a big data, big modeling and forecasting setting involving many thousands of customers and items on sale, requiring sequential analysis, addressing information flows at multiple levels over time, and with heterogeneity of customer profiles and item categories. Models developed are fully Bayesian, interpretable and multi-scale, with hierarchical forms overlaid on the inherent structure of the retail setting. Customer behavior is modeled at several levels of aggregation, and information flows from aggregate to individual levels. Forecasting at an individual household level infers price sensitivity to inform personalized pricing and promotion decisions. Methodological innovations include extensions of Bayesian dynamic mixture models, their integration into multi-scale systems, and forecast evaluation with context-specific metrics. The use of simultaneous predictors from multiple hierarchical levels improves forecasts at the customer-item level of main interest. This is evidenced across many different households and items, indicating the utility of the modeling framework for this and other individualized forecasting applications.
Keywords:
Bayesian state space models, big dynamic data, commercial forecasting systems, consumer sales forecasting, decouple/recouple, forecast assessment, multi-scale hierarchical models, personalized marketing, probabilistic forecasting, supermarket sales forecasting
Introduction

Practical forecasting to inform decision making in large retail and allied commercial demand/sales settings presents enormous challenges to statistics. Problems involve big, complex and heterogeneous data, sparsity of information, and inherently hierarchical structures in multiple dimensions (time, geography, customers/households, and items on sale). Our work focuses on such challenges in a supermarket sales setting; a primary goal is to forecast customer behavior at the individual (customer/household) and item (specific product on sale) level, in each store in a large national supermarket system, and continually over time. Forecasts inform downstream decision making about prices and promotions. The fine-scale decisions lie at the level of individual items potentially purchased each week by individual customers/households, involving assessments of price sensitivity so as to best customize/individualize offers of item-specific discounts and other marketing decisions.

Analysis involves large data sets on past purchasing behavior of many thousands of households on a weekly basis. The data typically includes item purchase information by household, with details of item prices and promotions, for the many items in the retail chain and all customers. Purchasing behavior is hugely heterogeneous; many households will not purchase a specific item for multiple weeks, if at all, making the data sparse and very variable across items. Households and items are nested within intersecting hierarchies, and understanding purchasing trends at each level is central. Sharing information across households is a key interest, but is very challenging in terms of both computational demands and data heterogeneity. At a strategic level, the following desiderata arise in the applied context; models for personalized forecasting should be

• able to incorporate predictor information such as price and promotions,
• adaptable to time-varying trends, regression effects and unforeseen temporal changes,
• interpretable and open to intervention by users and downstream decision makers,
• fully probabilistic to properly characterize forecast uncertainties and allow formal model and forecast assessment under multiple metrics,
• adapted to hierarchical settings, and
• amenable to automated, computationally efficient sequential learning and forecasting.

Statistical challenges raised by these needs are shared by increasingly fine-scale prediction problems in various fields due to the growing availability of individual level data.

Related and Relevant Work
Our developments link to interests in individualized forecasting in a variety of retail settings (e.g. Chen et al., 2018), while the broader statistical challenges relate to personalization as a main goal of recommendation systems in a variety of application areas. Applications range from image recommendation systems (Niu et al., 2018) to music (Wang et al., 2013). Personalized recommendation systems typically rely on methods such as collaborative filtering and matrix factorization (Su and Khoshgoftaar, 2009; Du et al., 2018) or Bayesian personalized ranking (Rendle et al., 2009). Recent approaches have leveraged deep learning to scale up to larger and more complex data (He et al., 2018; Niu et al., 2018; Naumov et al., 2019). The majority of such approaches are inherently non-dynamic, though matrix factorization has been extended to a dynamic setting (Jerfel et al., 2017) and temporal features can be leveraged (Chu and Park, 2009; Hu et al., 2015). Recommendation system approaches typically aim to match users to items, rather than aiming to forecast which items an individual will purchase. In medical applications, a core focus on personalized prediction arose with the advent of genomics (e.g. Nevins et al., 2003; Pittman et al., 2004; West et al., 2006); the field has grown and seen advances based on statistics and machine learning in areas such as Alzheimer's (Peterson et al., 2017; Fisher et al., 2019) and glaucoma progression prediction (Kazemian et al., 2018). Such approaches have exploited various techniques, including traditional Kalman filtering (Kazemian et al., 2018), Gaussian processes (Peterson et al., 2017) and variants of Boltzmann machines (Fisher et al., 2019). The goal is usually forecasting a single quantity of interest (e.g. glaucoma progression) rather than forecasting across related data "dimensions" such as households and items.

In the retail domain, individualized modeling has generally involved either (i) traditional random effects models, which are often not explicitly dynamic and can be challenging to fit to many time series, or (ii) black-box machine learning approaches that generally lack interpretability and probabilistic structure. Random effects models (Lichman and Smyth, 2018; Kazemian et al., 2018; Thai-Nghe et al., 2011) are inherently hierarchical, but standard formulations do not meet the desiderata for dynamics and computational scalability. Machine learning approaches such as Sen et al. (2019) have utilized global matrix factorization for information sharing across time series with temporal convolution networks. Salinas et al. (2019) used Gaussian copula processes with a focus on retail forecasting, Flunkert et al. (2017) and Wang et al. (2019) proposed deep models with probabilistic forecasts, while Chen et al. (2018) used recurrent neural networks for predicting next-arrival times of customers. These deep learning approaches are sequential and can be probabilistic (Salinas et al., 2019; Chen et al., 2018) as well as computationally efficient, but lack interpretability and openness to communication and intervention by downstream decision makers. For example, it is unclear how to interpret the impact of price discounts on purchasing behavior, making it very challenging to determine which households to send discount coupons to for which items, a primary driver of the interest in personalized models.
Modeling Perspective and Framework
We address the desiderata laid out above in a holistic Bayesian forecasting framework using tried-and-tested dynamic modeling components and concepts. The hierarchical structure of the problem naturally invites extensions of multi-scale modeling ideas for efficient and effective sharing of information. We build on extensions of dynamic generalized linear models (DGLMs: West and Harrison, 1997, chap. 14) to forecast individual household purchasing behavior. DGLMs and related dynamic Bayesian time series models have been widely applied with much success, and recent extensions have focused on tailoring these approaches to count-valued time series (Berry and West, 2020; Berry et al., 2020) and on increasing computational efficiency in hierarchical multivariate settings (Lavine et al., 2020). Such developments have been partly motivated by retail forecasting applications at a more aggregate level. Berry and West (2020) introduced the dynamic count mixture model (DCMM), a mixture of Bernoulli and Poisson DGLMs to model count time series with a high proportion of zeros for retail demand forecasting. These models can also account for over-dispersion in the Poisson DGLM and, importantly, leverage aggregate information via a multi-scale modeling approach. Berry and West (2020), and the extensions to joint transaction and sales forecasting in Berry et al. (2020), have demonstrated the value of multi-scale modeling in a variety of settings as well as the competitive forecasting performance of linked systems of Bayesian DGLMs.

We extend the multi-scale approach to hierarchically compose sets of models on different aspects of household consumer behavior, while maintaining computational efficiency. This involves a broad perspective on hierarchical, multi-scale modeling to share information across both items and households at the individual level; fusing tiered sets of conditional models at different levels of the hierarchy, this can yield improved forecasting for individual households. Methodological innovation also includes a new class of dynamic linear mixture models (DLMMs) for mixed binary/continuous time series, relevant for components of the multi-scale system. Importantly, computational tractability and efficiency are at the forefront in methodological developments. All model components are interpretable, fully Bayesian/probabilistic, and defined for sequential learning and forecasting, making the approach ideally suited to general personalized forecasting applications.

Section 2 discusses data and application goals, followed by models and computational details in Section 3. Section 4 discusses decision analytic perspectives on metrics and evaluation methods, and Section 5 presents empirical forecast accuracy results at both aggregate and individual levels. Concluding comments appear in Section 6.
Data and Application Goals

The data inform on individual household purchasing histories in a large supermarket system. Data records selected for this case study provide information on the weekly purchasing behavior of over 500,000 households across 112 weeks for over 200 unique items on sale. A "household" is considered an "individual", and primary interest lies in forecasting weekly household purchasing trends. Available covariates include the total US$ spend for each household each week, the number and types of items purchased, and the discount percentage (discount amount divided by the regular price), if applicable. Different households can be offered different discounts for the same item in the same week, due to coupons and e-marketing, for example.
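To fix ideas, the following is a minimal sketch of the kind of weekly record just described; the field names and the example values are hypothetical, as the paper does not specify a schema, but the discount-percentage calculation follows the definition above.

```python
# Minimal sketch of a weekly household-item record; field names are
# hypothetical illustrations, not the actual data schema.
from dataclasses import dataclass

@dataclass
class WeeklyRecord:
    household_id: int
    week: int                # 1..112 in this case study
    item_id: str
    quantity: int            # units of this item purchased that week (0 if none)
    total_spend: float       # total US$ spend by the household that week
    regular_price: float
    discount_amount: float   # 0.0 if no promotion applied to this household

    @property
    def discount_pct(self) -> float:
        """Discount percentage: discount amount divided by the regular price."""
        return self.discount_amount / self.regular_price if self.regular_price else 0.0

rec = WeeklyRecord(household_id=17, week=42, item_id="A", quantity=2,
                   total_spend=85.30, regular_price=4.00, discount_amount=1.00)
print(rec.discount_pct)  # 0.25, i.e. a 25% discount for this household-item-week
```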
General buying trends differ considerably by household (Figure 1, Table 1) and there is additional industry interest in categorizing households by various demographic information and purchasing trends. We define three household groups based on total items purchased over the course of the 112 weeks:

• Household Group 1: high spending and purchasing households
• Household Group 2: moderate spending and purchasing households
• Household Group 3: lower spending and purchasing households

Each household group consists of 2,000 households; the methodology is scalable to many more households. Additional demographic information can be incorporated into models or groupings for evaluation, but is not available here. The household groupings are used to explore forecast accuracy across a variety of households and to demonstrate that purchasing trends do vary significantly by household.

Table 1: Household group summaries. Proportion of weeks that each household returns, with summary spend information. There is variation in patterns of return and total spend habits of households across groups.
Household Group | Proportion Return | Mean Spend | Median Spend | SD Spend
1 | 0.96 | $– | $– | $–
2 | – | $– | $– | $–
3 | – | $– | $– | $–

Purchasing trends also differ significantly by item for individual households (Figure 2). In addition to forecasting at the household level, a main goal is to understand price sensitivity of households for each item. We select six specific items to exemplify this: items that have dynamic discounts by household over different weeks, and thus have "potential" for price sensitivity. That is, aggregated over all 112 weeks, if an increase in discount amount leads to an increase in amount purchased for a majority of households, then many households tend to be price-sensitive for that item (at least in aggregate). We categorize households based on their promotion scenarios and purchasing behaviors, with particular interest in households that are sensitive to promotions. We then select items that have a high number of households that are price-sensitive in aggregate, choosing the six items with the largest proportion of price-sensitive households.
Figure 1: Purchasing trends for three different households in (a) the high-spending household group (Group 1) and (b) the low-spending household group (Group 3), for one specific item.

These six items are purchased by a large number of households in each household group, making the items very relevant exemplars. Item A, shown in Figure 1, is the main item of focus for subsequent results; it is the highest selling item of the six considered, while the additional items, B–F, are discussed in more detail in the Appendix.

For many weeks, the households in Figure 2 do not purchase any of the items considered. It is possible that these households did not return to the store, or that they did not specifically purchase any of the three items considered during a visit. We thus need to separate "no return to store" from "non-purchase" as a model component.

Aggregate price sensitivity for selected items varies by household. Figure 3 displays item quantity purchased vs. discount amount offered for Item A across all weeks for three different households. Some households (both higher spending and lower spending) purchase more of Item A when the discount amount is increased, while other households purchase the same item quantity regardless of the discount amount offered. Many households do not purchase the item, whatever the discount. This exemplifies heterogeneity by household in terms of price sensitivity, and analyzing this in a dynamic fashion is a main point of modeling focus.
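The aggregate item-selection heuristic described above can be made concrete with a small sketch. The correlation-sign screen used here is an illustrative assumption; the paper describes the idea only qualitatively.

```python
import numpy as np

def price_sensitive(discounts: np.ndarray, quantities: np.ndarray) -> bool:
    """Crude aggregate screen: does purchased quantity tend to rise with discount?

    Uses the sign of the sample correlation across all 112 weeks; this
    particular measure is an illustrative stand-in for the paper's
    qualitative description of aggregate price sensitivity.
    """
    if discounts.std() == 0 or quantities.std() == 0:
        return False  # no discount variation, or never purchased: cannot tell
    return np.corrcoef(discounts, quantities)[0, 1] > 0

def proportion_sensitive(item_data: dict) -> float:
    """item_data maps household_id -> (discounts, quantities) weekly arrays."""
    flags = [price_sensitive(d, q) for d, q in item_data.values()]
    return float(np.mean(flags))

# Items would then be ranked by proportion_sensitive, and the six items with
# the largest proportions of price-sensitive households chosen as exemplars.
```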
Nested Item Hierarchy

Trends at the individual level tend to be very heterogeneous by household and item, and even week-to-week for a specific household-item pair. However, there is a nested hierarchy of item information that provides opportunity to identify more stable trends at more aggregate levels. The item hierarchy consists of category, sub-category and finally item information. The category is the highest and broadest level of aggregation; for example, a specific category might include all soda. The sub-category level is more specific, and may include diet sodas, for example. Finally, the item level is the most specific, and would represent Diet Coke, for example. There are multiple items in each sub-category and multiple sub-categories in each category. Purchasing trends tend to be much more stable and persistent when aggregated at the category level, especially for high spending households in household Group 1 (Figure 4). Exploiting this aggregate information in modeling at the individual level will be important.
Figure 2: Purchasing trends for three different items for one individual household in (a) the high-spending household group (Group 1) and (b) the low-spending household group (Group 3). Purchasing trends vary greatly by item for a specific household and, again, many items are not purchased for many weeks.
Figure 3: Item quantity purchased by discount amount for Item A for three different households in (a) household Group 1 and (b) household Group 3. Each point represents an individual week. Price sensitivity is heterogeneous by household, for a specific item.
Forecasting Goals

The overarching goal is to collectively forecast purchasing trends at the household-item level. Several related sub-goals inform our modeling perspective. Understanding price sensitivity at an individual household level is especially important, as we aim to identify households that are likely to be sensitive to targeted discounts for specific items. Related to this is the importance of probabilistic forecasting at all levels of the hierarchy. Individual household forecasts are intended for use within a larger forecasting and promotion system, requiring probabilistic forecasts for downstream decision making. Then, due to the volume of data and the number of households, computational efficiency is essential. Methodology must scale to many hundreds of thousands of households and items.
Models

Our hierarchical modeling decomposition is multi-scale, allowing for the sharing of information without a need for complex dependency structures; forecast information from higher levels in the model informs lower-level forecasts. At each level, we use dynamic generalized linear models (DGLMs) (West and Harrison, 1997) and variants (Berry and West, 2020) to ensure that the overall modeling approach is sequential, interpretable, probabilistic and computationally efficient. We use a new DGLM variant, the dynamic linear mixture model, to address key aspects of sparsity in household data.
Figure 4: Purchasing trends for (a) three different households in the high-spending household group (Group 1) and (b) one household in Group 1 across three different categories. Purchasing trends are more stable and consistent at the category level.
The approach to forecasting specific items that an individual household will purchase in a given week is motivated by concepts of rare-event modeling using nested sets of conditional probability models. Trends at the individual household-item level tend to include many weeks of no purchase for the majority of items (Figure 1), making this a rare-event forecasting problem. However, trends at the category level, for example, in the nested item hierarchy tend to be more stable and persistent (Figure 4). This concept applies also to sub-categories within categories, and then items within sub-categories. Hence the rare event of an individual household-item purchase can be framed through conditional models of less rare events at the sub-category and then category levels. This multi-scale idea conditions out "no return to store" in lower levels of the model and allows for the use of simultaneous information across levels. Figure 5 is a schematic of the main model structure for one household and one item.

Figure 5: Multi-scale model decomposition for one household-item pair.

Here "Return" is the indicator of whether the household shopped at all that week, (log) "Total Spend" represents the (log) total $ amount spent by the household in that week (whether zero or positive) at global, category and sub-category levels, and "discount perc." denotes the % discount offered to the household on the nominal price of the item. At the target item level, we predict how many of that item the household purchases conditional on spending a non-zero $ amount in the sub-category the item belongs to. At this final stage, the outcome is a typically small non-negative integer.

Each component of the model in Figure 5 is dynamic across weeks and is fit to each household and item conditionally independently based on conditioning information. Dependencies are captured through information sharing by conditioning on aggregate information from higher levels in the hierarchy. This extends the multi-scale view that has been demonstrated to improve forecasting accuracy (Berry and West, 2020). The decomposition also "conditions out" sparsity in the item level data, yielding improved forecast accuracy; if we predict that a household does not return in a given week, we do not need to proceed further to the item level. Then, importantly, the decomposition allows for simultaneous predictors. For example, after modeling the total spend in a given week, the forecast value of total spend can be used as a simultaneous predictor in the model for total spend at the category level for that week. Such simultaneous predictors from higher levels in the decomposition can greatly improve forecast accuracy in the lower-level models. Further, generating ensembles of higher-level quantities enables full probabilistic uncertainty propagation to finer levels.

The building blocks at every level in Figure 5 are based on DGLMs, so are fully probabilistic, naturally analyzed sequentially and with computational efficiency. Coupled with the hierarchical decomposition into components at each level, this meets the desiderata of Section 1.

Dynamic Generalized Linear Models

In a general notation, let $y_t$ be a univariate time series observed at discrete times $t = 1, \ldots, T$. Denote the available information at time $t$ by $\mathcal{D}_t = \{y_t, \mathcal{D}_{t-1}, \mathcal{I}_{t-1}\}$, where $\mathcal{I}_{t-1}$ represents any additional relevant information beyond the observed data.
Each $y_t$ follows a distribution in the exponential family with linear predictor $\lambda_t$ that evolves over time as
$$\lambda_t = F_t' \theta_t \quad \text{where} \quad \theta_t = G_t \theta_{t-1} + \omega_t \quad \text{and} \quad \omega_t \sim (0, W_t), \qquad (1)$$
where

• $F_t$ is a vector of known covariates at time $t$,
• $\theta_t$ is the state vector, which evolves via a first-order Markov process,
• $G_t$ is a known state evolution matrix,
• $\omega_t$ is the stochastic innovation vector, or evolution "noise", with $E(\omega_t \mid \mathcal{D}_{t-1}, \mathcal{I}_{t-1}) = 0$ and $V(\omega_t \mid \mathcal{D}_{t-1}, \mathcal{I}_{t-1}) = W_t$, independently over time.

Model components in the hierarchy involve Bernoulli logistic DGLMs for p(Return), normal dynamic linear models (DLMs), with potentially time-varying conditional variances, for p(log Total Spend | Return), and Poisson loglinear DGLMs for item level counts.
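To make the sequential analysis concrete, the following is a minimal sketch of one forecast/update cycle for the normal DLM special case of (1), using a standard discount-factor specification of $W_t$. This is illustrative only and not the case study implementation (which uses the PyBats package); the discount factor value and the example inputs are invented.

```python
import numpy as np

def dlm_step(m, C, F, G, y, s, delta=0.95):
    """One forecast/update cycle for a normal DLM: y_t = F' theta_t + noise.

    m, C : posterior mean/covariance of the state at time t-1
    F, G : regression vector and state evolution matrix at time t
    y, s : observed outcome and known observation variance
    delta: discount factor implicitly defining the innovation variance W_t
    """
    # Evolution: theta_t = G theta_{t-1} + omega_t, omega_t ~ (0, W_t)
    a = G @ m
    R = (G @ C @ G.T) / delta          # discounting inflates prior uncertainty
    # One-step forecast distribution: y_t ~ N(f, q)
    f = float(F @ a)
    q = float(F @ R @ F) + s
    # Kalman-style update on observing y_t
    A = (R @ F) / q                    # adaptive coefficient vector
    m_new = a + A * (y - f)
    C_new = R - np.outer(A, A) * q
    return m_new, C_new, f, q

# Example: local level plus one dynamic regression coefficient.
F = np.array([1.0, 0.2])               # intercept and, e.g., a spend predictor
G = np.eye(2)                          # random-walk state evolution
m, C = np.zeros(2), np.eye(2)
m, C, f, q = dlm_step(m, C, F, G, y=3.1, s=0.5)
print(f, q)                            # one-step forecast mean and variance
```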
Dynamic Count Mixture Models

Many items will not be purchased in individual weeks even when a household spends on other items. Further, in some weeks a household will purchase more quantities than typical for a specific item. Thus item quantity by household can exhibit both zero-inflation and over-dispersion. Dynamic count mixture models (DCMMs) were introduced explicitly to address these non-Poisson (dynamic) features, and have very competitive forecasting accuracy compared to a variety of other models (Berry and West, 2020). Hence, DCMMs are integrated into the hierarchical model structure at the final stage to predict the item quantity purchased by the specific household, conditional on information from the earlier levels of the hierarchy together with item pricing and discount information.

If $y_t$ is a non-negative count time series, let $z_t = \mathbb{1}(y_t > 0)$, where $\mathbb{1}(\cdot)$ is the indicator function. A DCMM has the form
$$z_t \sim \textrm{Bernoulli}(\pi_t) \quad \text{and} \quad (y_t \mid z_t) = \begin{cases} 0, & \text{if } z_t = 0, \\ 1 + s_t, \; s_t \sim \textrm{Poisson}(\mu_t), & \text{if } z_t = 1, \end{cases} \qquad (2)$$
with $\textrm{logit}(\pi_t) = F_t' \xi_t$ and $\log(\mu_t) = F_t^{+\prime} \theta_t$, where $\xi_t$ and $\theta_t$ are state vectors following the usual linear evolutions, and $F_t$ and $F_t^+$ are known regression vectors. The DCMM models $\mu_t$ and $\pi_t$ independently, and $F_t$ and $F_t^+$ can be distinct, depending on the application. Sequential learning and forecasting proceeds similarly as in the DGLM, with the Poisson model only being updated if $z_t = 1$; see the Appendix and full details in Berry and West (2020).

Dynamic Linear Mixture Models

At the category and sub-category levels there are also many zero values, indicating no purchase for that week. This requires extensions of the traditional normal DLM for (log) $ spend to account for weeks with zero spend. To address this, we map the concept underlying the DCMM to a new class of dynamic linear mixture models (DLMMs). This treats zero spend in a given week as a binary time series and, conditional on a non-zero spend, applies a DLM to the log $ amount spent. The proportion of weeks a purchase is made by a household can vary greatly by category and sub-category, so the binary component here is key. Some sub-categories, in particular, are purchased very infrequently and exhibit a high proportion of zeros, necessitating this new mixture modeling approach to capture conditionally (log) normally modeled outcomes combined with possibly many zeros. DLMMs apply to the log total spend at both the category and sub-category levels, defining household-item specific forms for p(Category log Total Spend | Return, Global log Total Spend) and p(Sub-Cat. log Total Spend | Return Category, Category log Total Spend).

Specifically, denote by $x_t$ a continuous valued outcome (log total spend at one level) and let $z_t = \mathbb{1}(x_t \neq 0)$. A DLMM has the form
$$z_t \sim \textrm{Bernoulli}(\pi_t) \quad \text{and} \quad (x_t \mid z_t) = \begin{cases} 0, & \text{if } z_t = 0, \\ x_t \sim N(F_t' \theta_t, V_t), & \text{if } z_t = 1, \end{cases} \qquad (3)$$
where again $\pi_t$ is defined by a binary DGLM with its own state vector, and $\theta_t$ follows a separate, independent DLM evolution and defines the model for non-zero $ spend involving the known regression vectors $F_t$. As with the DCMM, these new DLMMs provide modeling flexibility in a fully Bayesian, computationally efficient extension of traditional DLMs; more details are noted in the Appendix.
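A minimal sketch of one-step forecast simulation from a DCMM as in (2) and a DLMM as in (3) follows. The two components are simulated independently, as in the definitions; the parameter values are placeholders, and in practice $\pi_t$, together with the Poisson and normal forecast parameters, would come from the component DGLMs' predictive distributions rather than being fixed constants.

```python
import numpy as np

rng = np.random.default_rng(0)

def forecast_dcmm(pi, mu, nsamps=10_000):
    """Draws from the DCMM one-step predictive: 0 w.p. 1-pi, else 1 + Poisson(mu)."""
    z = rng.random(nsamps) < pi
    return np.where(z, 1 + rng.poisson(mu, nsamps), 0)

def forecast_dlmm(pi, f, q, nsamps=10_000):
    """Draws from the DLMM one-step predictive: 0 w.p. 1-pi, else N(f, q) log spend."""
    z = rng.random(nsamps) < pi
    return np.where(z, rng.normal(f, np.sqrt(q), nsamps), 0.0)

# Illustrative values: a 30% purchase probability with shifted-Poisson mean 0.8,
# and a 30% non-zero-spend probability with log-spend mean 2.0 and variance 0.25.
counts = forecast_dcmm(pi=0.3, mu=0.8)
spend = forecast_dlmm(pi=0.3, f=2.0, q=0.25)
print(counts.mean(), np.median(spend))
```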
Forecast Evaluation

As highlighted in Berry and West (2020), a forecast is a full predictive distribution, as understanding uncertainty is critical for downstream decision making. Any point forecasts used for model evaluation and comparison should be justified by an appropriate loss function in a Bayesian decision theoretic perspective. Results can depend critically on the loss functions selected, so it is important to carefully consider which loss functions are used to evaluate models.
Loss Functions and Point Forecasts

Common choices of loss functions are squared error, absolute deviation and absolute percentage error, yielding mean or average values denoted by MSE, MAD and MAPE. The latter is particularly common in commercial settings as it can putatively be compared across contexts, though it is restricted to positive outcomes. While forecast medians are optimal under MAD, the optimal point forecast $f$ of $y \sim p(y)$ under MAPE is the $(-1)$-median, i.e., the median of $g(y) \propto p(y)/y$, always less than or equal to the median and sometimes much lower. MAPE has been extended to the class of zero-adjusted APE (ZAPE) loss functions (Berry, 2019; West, 2020). While there are several variants, the form of ZAPE used here for a non-negative outcome $y$ and point forecast $f$ is
$$L_{\textrm{ZAPE}}(y, f) = \mathbb{1}(y = 0)\, f/(1 + f) + \mathbb{1}(y > 0)\, |y - f|/y.$$
The ZAPE optimal forecast is always less than or equal to the $(-1)$-median; with forecast distributions heavily favoring zero or low values, it can often be zero. Numerical optimization for ZAPE forecasts is easy and detailed in the Appendix.

Calibration and Coverage

We stress and routinely use calibration and coverage assessments for probabilistic forecasting, building on the increasing impact of full probabilistic assessment in commercial forecasting settings (Berry and West, 2020; Berry et al., 2020). For binary outcomes that are central in this application, frequency calibration is a core assessment concept. To calculate realized calibration characteristics, we bin the probability scale and evaluate calibration within each probability bin. In particular, we consider calibration plots related to predicting when a household will return to shop, and how well we are modeling the sparsity in the data at several levels. Probability coverage comparing predicted versus realized coverage over a forecasting time period uses predictive credible intervals (HPD, highest probability density), allowing assessments of under- or over-coverage across the probability scale via simple coverage plots. This is used, for example, in assessment of components of the model forecasting household $ spend on an overall weekly basis and then within categories and sub-categories of items.
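As a concrete illustration of the point forecast discussion above, the following is a minimal sketch of selecting a loss-optimal point forecast from Monte Carlo predictive samples. The grid search over small non-negative integers is an assumption suited to count outcomes such as item quantities; the zero-heavy example distribution is synthetic.

```python
import numpy as np

def zape_loss(y, f):
    """ZAPE loss: 1(y=0) f/(1+f) + 1(y>0) |y-f|/y, for a scalar forecast f."""
    y = np.asarray(y, dtype=float)
    return np.where(y == 0, f / (1.0 + f), np.abs(y - f) / np.maximum(y, 1e-12))

def optimal_point_forecast(samples, loss=zape_loss, grid=None):
    """Minimize Monte Carlo expected loss over a grid of candidate forecasts."""
    if grid is None:
        grid = np.arange(0, samples.max() + 1)  # small integer grid for counts
    risks = [loss(samples, f).mean() for f in grid]
    return grid[int(np.argmin(risks))]

# A zero-heavy predictive distribution often yields a ZAPE-optimal forecast
# of zero, as noted above; the MAD-optimal forecast is the sample median.
rng = np.random.default_rng(1)
z = rng.random(5000) < 0.25
samples = np.where(z, 1 + rng.poisson(0.6, 5000), 0)
print(optimal_point_forecast(samples))  # likely 0 here
print(np.median(samples))               # MAD-optimal forecast
```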
Empirical Results

Highlights summarized here focus on forecasting at the category, sub-category and item levels, with additional modeling results at all levels of the modeling decomposition given in the Appendix.

Modeling Details

We apply the hierarchical modeling decomposition of Figure 5 to households in each of the three groups (Group 1 = high spenders, Group 2 = moderate spenders, Group 3 = low spenders) to forecast item quantity purchased by household. Each household is modeled conditionally independently, but with information shared via conditioning variables, across items and households at the item level, as detailed in Section 5.4. At the global level, p(Return) and p(log Total Spend | Return) are modeled via a Bernoulli DGLM and a (log) normal DLM, respectively. Here, we focus on results at (i) the category and sub-category levels using DLMMs, and (ii) the item level using DCMMs. Unless otherwise noted, results are presented for Item A, involving a high proportion of potentially price-sensitive households, and Item A's category and sub-category. Additional items, sub-categories and categories are noted in the Appendix; results presented here generalize to a range of different items. All models are implemented in the PyBats package (Lavine and Cron, 2020) with relevant extensions for the DLMM.

In Sections 5.2–5.5, we treat simultaneous predictors from higher modeling levels as known. This allows for an evaluation of specific aspects of each model at each modeling level. However, the modeling decomposition is designed for fully simultaneous modeling, and in Section 5.6 we use forecast values from higher levels in the modeling hierarchy as the simultaneous predictors, with only lagged predictors treated as known. That is, the simultaneous predictors at the category, sub-category and item levels, as well as the prediction of return or not at all modeling levels, are the results of forecasts from the relevant higher modeling levels. Additionally, while we focus on one-step ahead forecasts via simulation here, multi-step ahead forecasts also naturally proceed either via simulation or analytically (Berry and West, 2020; Lavine et al., 2020).

Evaluations have included comparison with the DCMMs that have been found to be competitive in terms of forecasting accuracy when compared to a wide range of competing models (Berry and West, 2020). We find that our full modeling decomposition does improve forecasting results, essentially uniformly across examples studied, and regard this as a substantial recommendation given the already proven utility of DCMMs. There are, as far as we are aware, no other relevant approaches for comparison that satisfy all of the practically motivated desiderata: interpretable models that are hierarchical and fully probabilistic, analyzed and used sequentially, and parallelizable as well as inherently computationally scalable.
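Since the case study models are implemented in PyBats, a brief sketch of fitting a single Poisson DGLM with that package follows. The call mirrors the package's documented `analysis` entry point, but the exact argument names should be treated as an assumption to check against the installed version; the data, discount effect and discount factors here are invented for illustration.

```python
import numpy as np
from pybats.analysis import analysis          # assumed entry point per PyBats docs
from pybats.point_forecast import median

# Synthetic stand-in for one household's weekly item counts with a
# discount-percentage predictor; real inputs come from the data of Section 2.
T = 112
rng = np.random.default_rng(2)
X = rng.uniform(0, 0.4, (T, 1))               # discount percent each week
Y = rng.poisson(np.exp(-0.5 + 2.0 * X[:, 0])) # counts responding to discounts

# Sequentially fit a Poisson DGLM and draw 1-step forecast samples for weeks
# 80-111; deltrend/delregn are discount factors on trend and regression states.
mod, samples = analysis(Y, X, family="poisson",
                        forecast_start=80, forecast_end=111, k=1,
                        nsamps=500, prior_length=10,
                        deltrend=0.98, delregn=0.98)
point_forecasts = median(samples)             # MAD-optimal point forecasts
```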
Simultaneous Predictors

We have found that simultaneous predictors can contribute very substantially to forecast accuracy, especially at lower levels in the modeling hierarchy. Predictors "one level up" in the modeling hierarchy are typically most useful and supersede information from higher levels. For example, predicted $ spend at the sub-category level leads to improved forecasts at the item level relative to predictors based on spend at the category or global level. This empirically reflects a natural conditional independence structure in the hierarchical decomposition. Specific examples at the category, sub-category and item levels are now discussed.

Category Level

A DLMM defines p(log Total Spend by Category | Return, Global log Total Spend). We compare models with 3 different predictors to evaluate the utility of the simultaneous predictors as compared to lagged predictors, with all covariates initially treated as known; a sketch of the corresponding regression vectors follows below. Denote the models by M1, M2, M3. All models have a random-walk local level (trend term) and one model-specific dynamic regression term with predictor variable as follows: M1 has a lagged local predictor given as the log total spend in the category at the last return; M2 has a lagged global predictor given as the log total spend across all categories at the last return; M3 has a simultaneous global predictor given as the log total spend across all categories for the current week. Aggregate results within each of the three household groups are given in Table 2. For each metric, the point forecast is optimal under that loss function. MAPE is only evaluated at non-zero outcomes.

Across household groups and metrics, M3 with the simultaneous predictor outperforms the models with lagged predictors. The ZAPE metric, in particular, explicitly evaluates how well the DLMM predicts zeros, or the occurrence of no-spend in the category. M3 also improves forecast calibration across all households and weeks, relative to the models with the lagged predictors (Figure 6). The modeling decomposition generally improves calibration at this modeling level, as we have already conditioned out zeros that are a result of no return to purchase at all. While calibration of the new DLMM is quite good across all models and all weeks (Figure 6), predicting the occurrence of zeros for a specific week can still be challenging, and thus conditioning out zeros representing no return at higher levels in the modeling hierarchy can improve forecasts at lower levels.
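To make the M1–M3 distinction concrete, a minimal sketch of the regression vectors ($F_t$ in (1)) for the three category-level model variants; the variable names and array layout are illustrative assumptions.

```python
import numpy as np

def design_vector(model, week, cat_spend, global_spend, last_return):
    """Build F_t for the category-level DLMM variants compared above.

    cat_spend, global_spend: arrays of log total spend by week (0 = no spend).
    last_return: index of the most recent return week before `week`.
    All three models share a local-level intercept; they differ only in
    the single dynamic regression predictor.
    """
    if model == "M1":    # lagged local: category log spend at the last return
        x = cat_spend[last_return]
    elif model == "M2":  # lagged global: total log spend at the last return
        x = global_spend[last_return]
    elif model == "M3":  # simultaneous global: total log spend this week
        x = global_spend[week]   # treated as known here; forecast in Section 5.6
    else:
        raise ValueError(model)
    return np.array([1.0, x])

# Toy usage with three weeks of (log) spends.
cat = np.log1p(np.array([12.0, 0.0, 8.5]))
glob = np.log1p(np.array([55.0, 0.0, 60.2]))
print(design_vector("M3", week=2, cat_spend=cat, global_spend=glob, last_return=0))
```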
Figure 6: Frequency calibration summaries for 3 DLMMs predicting $ spend at the category level, averaged across the 2,000 households within each household group; panel (a) shows household Group 1 and panel (b) household Group 3.

Table 2: Forecast summaries for 3 DLMMs predicting $ spend at the category level, giving median (and 25th, 75th percentile) values across the 2,000 households within each household group.

HH Group | Metric | M1: Lagged Cat. | M2: Lagged Global | M3: Simultaneous Global
1 | MAD | 4.31 (2.83, 6.03) | 4.21 (2.76, 5.83) | – (2.42, 4.37)
1 | MAPE | 0.51 (0.45, 0.58) | 0.51 (0.44, 0.57) | – (0.37, 0.47)
1 | ZAPE | 0.61 (0.54, 0.69) | 0.60 (0.54, 0.68) | – (0.43, 0.54)
2 | MAD | 4.14 (3.27, 5.35) | 3.47 (2.62, 4.46) | – (2.27, 3.39)
2 | MAPE | 0.53 (0.47, 0.60) | 0.48 (0.44, 0.53) | – (0.37, 0.45)
2 | ZAPE | 0.68 (0.60, 0.91) | 0.62 (0.56, 0.70) | – (0.45, 0.53)
3 | MAD | 2.88 (2.14, 3.78) | 2.77 (2.09, 3.60) | – (1.83, 2.77)
3 | MAPE | 0.46 (0.41, 0.52) | 0.45 (0.40, 0.50) | – (0.35, 0.43)
3 | ZAPE | 0.65 (0.58, 0.76) | 0.63 (0.57, 0.74) | – (0.45, 0.53)

Sub-Category Level

At the next level in the hierarchy, a DLMM defines p(log Total Spend by Sub-Category | Return in Category, Category log Total Spend). Simultaneous predictors again improve forecasting accuracy. Candidate sub-category models M1, M2, M3 each have a random-walk local level and one model-specific dynamic regression term. Here M1 has log total spend at the last return at the sub-category level, M2 has log total spend at the last return at the category level, while M3 has log total spend for the current week at the category level. As shown in Table 3, use of the simultaneous predictor in M3 very substantially improves forecast accuracy at the sub-category level. Calibration (not shown) is also improved by M3.

Table 3: Forecast summaries for 3 DLMMs predicting $ spend at the sub-category level, with details as in Table 2.

HH Group | Metric | M1: Lagged Sub-Cat. | M2: Lagged Cat. | M3: Simultaneous Cat.
1 | MAD | 3.69 (2.52, 5.36) | 3.67 (2.51, 5.36) | – (1.63, 2.72)
1 | MAPE | 0.53 (0.43, 0.63) | 0.52 (0.43, 0.63) | – (0.27, 0.37)
1 | ZAPE | 0.67 (0.56, 1.00) | 0.67 (0.56, 0.99) | – (0.33, 0.46)
2 | MAD | 2.64 (1.97, 3.48) | 2.61 (1.98, 3.43) | – (1.53, 2.31)
2 | MAPE | 0.43 (0.37, 0.50) | 0.43 (0.37, 0.49) | – (0.26, 0.35)
2 | ZAPE | 0.63 (0.54, 0.80) | 0.63 (0.54, 0.80) | – (0.33, 0.45)
3 | MAD | 2.81 (2.18, 3.65) | 2.80 (2.18, 3.64) | – (1.37, 2.17)
3 | MAPE | 0.49 (0.42, 0.58) | 0.49 (0.41, 0.58) | – (0.24, 0.34)
3 | ZAPE | 0.62 (0.51, 1.94) | 0.62 (0.51, 1.96) | – (0.32, 0.46)

Item Level

At the finest level of individual household-item pairs, a DCMM defines p(Item Quantity | Return Sub-Category, Sub-Category log Total Spend, Discount Percent). Again we find that simultaneous predictors improve forecast accuracy. The 3 models evaluated extend the comparisons made at higher levels, each having a trend term as well as the predictors log total spend and discount percent. M1 includes lagged predictors at the sub-category level, M2 includes lagged predictors at the item level, while M3 considers simultaneous predictors at the sub-category level. Forecast accuracy metrics show the comparisons in Table 4, where the improvements under M3 are very substantial. Again, similar findings emerge for calibration at the item level.

Table 4: Forecast summaries for 3 DCMMs predicting item quantity at the final level, with details as in Table 2.
HH Group | Metric | M1: Lagged Sub-Cat. | M2: Lagged Item | M3: Simultaneous Sub-Cat.
1 | MAD | 1.46 (1.00, 2.01) | 1.46 (0.99, 2.01) | – (0.64, 1.40)
1 | MAPE | 0.54 (0.38, 0.71) | 0.53 (0.39, 0.71) | – (0.25, 0.44)
1 | ZAPE | 0.55 (0.44, 0.68) | 0.54 (0.43, 0.67) | – (0.31, 0.48)
2 | MAD | 1.14 (0.83, 1.52) | 1.12 (0.82, 1.51) | – (0.60, 1.17)
2 | MAPE | 0.45 (0.30, 0.61) | 0.45 (0.30, 0.60) | – (0.23, 0.39)
2 | ZAPE | 0.48 (0.38, 0.59) | 0.48 (0.38, 0.58) | – (0.30, 0.45)
3 | MAD | 1.11 (0.80, 1.49) | 1.11 (0.80, 1.49) | – (0.53, 1.09)
3 | MAPE | 0.48 (0.31, 0.66) | 0.48 (0.31, 0.65) | – (0.19, 0.36)
3 | ZAPE | 0.50 (0.40, 0.61) | 0.50 (0.40, 0.61) | – (0.27, 0.43)

Hierarchical versus Direct Models

In addition to enabling sharing of simultaneous predictors across levels, the modeling decomposition induces a partial tree-like structure related to zero sales events, and this yields computational and statistical efficiency. If a specific household does not return in a given week, the problem of predicting sales of any item is obviated. Further, there can be zeros for multiple weeks for specific items as noted in Section 2; then the interest is in forecasting when there will be a non-zero sale, how many items will be purchased then, and how price-sensitive the household is. The modeling decomposition facilitates these goals by utilizing the structure of the data to condition out zeros at lower levels of the model, and this ultimately improves forecast accuracy.

Examples are given for the low spending Household Group 3, in which the proportion of zero purchases is highest. We compare the full analysis with that based on non-hierarchical, individual household-item level models, referred to as "direct" models. For fair comparison, all models use only lagged predictors. Two direct models M1 and M2 are DCMMs for p(Item Quantity); they each include a local intercept but use different lagged predictors of log total spend and discount percent: M1 uses these at the sub-category level, and M2 at the item level. Corresponding hierarchical models, M3 and M4, define p(Item Quantity | Return Sub-Category, Sub-Category log Total Spend, Discount Percent) with intercept terms and lagged predictors as in M1 and M2, respectively. Table 5 shows typical results for 3 of the items, indicative of the benefits of the hierarchical structure in outperforming direct models.

Table 5: Forecast summaries for the households in Group 3 using DCMMs at the item level. Hierarchical models M3 and M4 exploit the modeling decomposition; direct models M1 and M2 do not. Format as in earlier tables.
Item | M1: Direct Sub-Cat. | M2: Direct Item | M3: Decomp. Sub-Cat. | M4: Decomp. Item
A | 0.60 (0.47, 0.75) | 0.61 (0.47, 0.75) | – (0.40, 0.61) | – (0.40, 0.61)
C | 0.62 (0.41, 0.81) | 0.62 (0.40, 0.81) | – (0.24, 0.50) | – (0.25, 0.52)
E | 0.87 (0.59, 1.00) | 0.85 (0.58, 1.00) | – (0.39, 0.70) | – (0.42, 0.67)

Multi-Scale Discount Information

We now turn to models that integrate item-specific price discount information at household and household group levels, extending the hierarchical modeling framework to share information across households in predicting spend on specific items. This defines a multi-scale approach that shares information across items and households via price discount information at the item level. Specific predictors based on price discounts are constructed as follows. The potential discount amount is (i) the observed discount amount for each week the item is purchased, or (ii) an imputed value of the discount on offer if the item was not purchased by the household that week. This predictor is available in external simultaneous form, as discounts due to specific promotions are set weeks in advance. We compare two DCMMs, M1 and M2, for p(Item Quantity | Return Sub-Category, Sub-Category log Total Spend); each has a local intercept, a predictor given by the simultaneous log total spend at the sub-category level, and a predictor based on potential discount percentage. The discount information in M1 is specific to the item and household, while the discount information in M2 is the average across all households in the group. Hence M2 shares information across items via the simultaneous log total spend at the sub-category level predictor, and across households via the aggregate discount percentage predictor. The summary comparisons in Table 6 for Item A are typical of the results. For the majority of households, including multi-scale discount information (M2) improves forecast accuracy across metrics; this more general multi-scale view of sharing information across data "dimensions" can lead to improvements in forecasting at the individual level. These results bear out the utility of the approach, and suggest that future extensions could consider sharing multi-scale information across other dimensions, including across time.

Table 6: Forecast summaries using hierarchical, item-level DCMMs to predict sales of Item A, comparing price discount predictors in two models: M1 has household-specific discount percent while M2 has a multi-scale aggregate discount percent predictor. Format as in earlier tables.
HH Group | Metric | M1: HH Discount | M2: Multi-Scale Discount
1 | MAD | 1.06 (0.71, 1.49) | – (0.66, 1.30)
1 | MAPE | 0.39 (0.26, 0.50) | – (0.25, 0.42)
1 | ZAPE | 0.42 (0.33, 0.51) | – (0.31, 0.44)
2 | MAD | 0.91 (0.65, 1.19) | – (0.61, 1.10)
2 | MAPE | 0.34 (0.23, 0.43) | – (0.22, 0.38)
2 | ZAPE | 0.39 (0.31, 0.46) | – (0.30, 0.42)
3 | MAD | 0.82 (0.57, 1.11) | – (0.55, 1.05)
3 | MAPE | 0.30 (0.19, 0.40) | – (0.19, 0.37)
3 | ZAPE | 0.36 (0.28, 0.44) | – (0.28, 0.41)

Price Sensitivity by Household

The discussion so far has demonstrated aspects of forecast improvements based on the modeling decomposition, with results aggregated across households. We now turn to the key question of item-specific price sensitivity for each individual household. Here we compare models with and without discount information as a dynamic predictor. Households showing improved forecast accuracy with the inclusion of discount information are identified as price-sensitive; a sketch of this screening step appears at the end of this subsection. Resulting inferences will feed into decision processes for individualized discount offers. Further, as price sensitivity is modeled dynamically, we are able to monitor if and how the sensitivities change over time.

Summaries come from comparison of two item-level DCMMs for p(Item Quantity | Return Sub-Category, Sub-Category log Total Spend). Both models include a local level and a simultaneous predictor for log total spend at the sub-category level. The models differ only in that one includes the aggregate discount percentage across households, found to be a useful predictor in Section 5.4, while the other does not. Household-specific MAD and ZAPE measures are shown in Figure 7, in which each point represents an individual household. For households below the diagonal, including discount information improves forecast accuracy; these households have the potential to be price-sensitive and so are perhaps candidates for more customized promotions.

Figures 8 and 9 show summary results for Item A purchases of two price-sensitive, high spending households from Group 1. The visual presentation shows improvements with the inclusion of discount information. For each household, the state vector coefficient on discount percentage (the discount sensitivity in these models) is inferred to be positive with high probability across the time
period, indicating that higher discount percentages are associated with increased items purchased for these households. These are therefore households that may respond positively to additional discount offers.

Figure 7: MAD and ZAPE metrics for individual households in Household Group 1 from DCMMs with and without a discount predictor for Item A. Points represent households; those below the diagonal have forecasts that are improved by including discount information.

Figure 8: (a) MAD optimal point forecasts and 90% prediction intervals for Item A purchases of one price-sensitive household, with and without discount information; (b) online posterior mean and 90% intervals for the state vector element corresponding to the discount predictor.

Figure 9: ZAPE optimal point forecasts and other summaries for a second price-sensitive household, with format as in Figure 8.

A third Group 1 household that is not price-sensitive provides contrast; see Figure 10. Item A forecasts for the model without discount information are more accurate than the corresponding forecasts for the model with discount information; correspondingly, the discount sensitivity state element is inferred as insignificant over time. This household is unlikely to be responsive to discount offers in the normal ranges.

Figure 10: ZAPE optimal point forecasts and other summaries for a third household that is not price-sensitive, with format as in Figure 8.

The results across all households, and details for each item-household pair, show the value of models with simultaneous predictors; they generally improve predictions while identifying individual households that are price-sensitive for specific items, estimating how this price sensitivity changes over time. This opens the door to direct discount interventions for price-sensitive households, enabled by the use of interpretable models at each modeling level. The framework also facilitates decision-theoretic approaches to selection of the optimal discount percent to offer to each price-sensitive household, and to updating the strategy over time.
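As referenced above, a minimal sketch of the household screening implied by Figure 7: compare per-household forecast loss with and without the discount predictor and flag the households whose forecasts improve. The optional improvement margin is an illustrative assumption, not part of the paper's procedure.

```python
import numpy as np

def flag_price_sensitive(loss_without: dict, loss_with: dict, margin: float = 0.0):
    """Return household ids falling 'below the diagonal' in Figure 7.

    loss_without / loss_with: household_id -> MAD (or ZAPE) of item-level
    forecasts from the DCMM without / with the discount predictor.
    margin: optional minimum improvement required before flagging.
    """
    return [hh for hh in loss_without
            if loss_with[hh] < loss_without[hh] - margin]

# Example with three hypothetical households: only 101 and 103 improve.
no_disc = {101: 1.20, 102: 0.80, 103: 1.05}
with_disc = {101: 0.95, 102: 0.85, 103: 0.90}
print(flag_price_sensitive(no_disc, with_disc))  # [101, 103]
```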
Simultaneous Multi-Scale Forecasting

The analysis at lower levels of the hierarchy involves simultaneous predictors from higher levels. Forecast information on these predictors cascades down from the global to finer levels, defining the coupling of sets of models in the hierarchy. This can be done using full ensembles of synthetic values generated from higher level predictive distributions, and/or using point forecasts. We detail this further assuming the simultaneous forecasts are in terms of medians or means. For one household-item pair in one week, predicting one week ahead begins with the Bernoulli DGLM for p(Return) at the global level. This defines a forecast that is propagated to serve as a predictor in the DLMM for p(Global log Total Spend | Return), along with the known predictor of lagged global log total spend. This DLMM provides a forecast of global spend that, together with the forecast Return, is projected down to the category level DLMM of p(Category log Total Spend | Return, Global log Total Spend). The process continues with forecast values projected to the sub-category level DLMM of p(Sub-Category log Total Spend | Return Category, Category log Total Spend). Then, finally, the forecast from the sub-category model is projected to define the DCMM of p(Item Quantity | Sub-Category Return, Sub-Category log Total Spend); a simulation sketch of this cascade is given at the end of this subsection.

Some summary evaluations are presented for Item A by household group. We focus on aspects of prediction accuracy for one of the most challenging components of the analysis, that of predicting household return at each of the global, category, sub-category and item levels. Calibration plots (Figure 11) and confusion matrices (Table 7) speak to this. Across all households and weeks, predicting global return remains the key challenge, while calibration at subsequent levels is generally very good. The confusion matrices also show that, overall, the modeling decomposition tends to do a good job of predicting return for each household. However, at lower levels in the modeling decomposition, where more predictors are propagated via forecasts, projecting forecast means tends to over-predict return, while projecting forecast medians tends to under-predict return (Table 7). Nevertheless, the overall forecast accuracy using projected forecasts for simultaneous predictors compares very well with that based on the unobtainable "gold standard" of assuming the simultaneous predictors are known (at their future realized values); see Table 8. This is a strong testament to the utility of the hierarchical model decomposition and the overall approach.

Figure 11: Frequency calibration within each of the model levels for the high-spending (Group 1) and low-spending (Group 3) household groups for Item A, using forecast means for projected simultaneous predictors.

Table 7: Confusion matrices in predicting return at each modeling level for household Group 1 and Item A, using forecast means or medians for projected simultaneous predictors. Entries are the proportions of households and weeks falling into each cell of the confusion matrix; $f_t$ denotes the point forecast and $y_t$ the observed value; $f_t = 0$ or $y_t = 0$ indicates no return (forecast or actual).

 | Return: $f_t > 0$, $f_t = 0$ | Category: $f_t > 0$, $f_t = 0$ | Sub-Category: $f_t > 0$, $f_t = 0$
Mean: $y_t > 0$ | –, – | –, – | –, –
Mean: $y_t = 0$ | 0.03, 0.01 | 0.21, 0.07 | 0.29, 0.13
Median: $y_t > 0$ | –, – | –, – | –, –
Median: $y_t = 0$ | 0.03, 0.01 | 0.10, 0.17 | 0.10, 0.31
This is a strong testamentto the utility of the hierarchical model decomposition and the overall approach.23able 8: Median (25%,75%) forecast accuracy assessments. M1 and M2 use forecasts of simultane-ous predictors– the forecast mean and median, respectively; M3 is the hypothetical gold-standardusing realized values. HH Group Metric M1: Simul. Mean M2: Simul. Median M3: Known1 MAD 0.90, (0.40, 1.48) 1.19, (0.79, 1.72) 0.99, (0.63, 1.40)MAPE 0.43, (0.28, 0.55) 0.46, (0.34, 0.57) 0.35, (0.25, 0.44)ZAPE 0.44, (0.27, 0.56) 0.50, (0.40, 0.59) 0.40, (0.31, 0.48)2 MAD 0.71, (0.39, 1.06) 0.94, (0.65, 1.24) 0.87, (0.59, 1.16)MAPE 0.37, (0.25, 0.49) 0.40, (0.29, 0.50) 0.31, (0.22, 0.39)ZAPE 0.39, (0.25, 0.50) 0.45, (0.36, 0.53) 0.37, (0.29, 0.45)3 MAD 0.45, (0.23, 0.73) 0.74, (0.45, 1.00) 0.77, (0.53, 1.09)MAPE 0.32, (0.20, 0.45) 0.35, (0.23, 0.47) 0.28, (0.19, 0.36)ZAPE 0.29, (0.16, 0.42) 0.40, (0.30, 0.48) 0.35, (0.26, 0.43)
Concluding Comments

We have developed and summarized a case study in Bayesian modeling and forecasting of large, heterogeneous, individual-level time series data using a decouple/recouple strategy. The applied setting and extended case study is defined by central and topical issues of consumer behavior modeling and sales forecasting in a retail context, with intimate links to the day-to-day issues of supply chain management and interests in personalized (household-customer level) prediction for decisions.

The applied advances are based on customized, multi-scale models that overlay the inherent hierarchical nature of the context: consumers decide to visit a store; when they are there, they spend; and spending outcomes cascade through broad categories of goods, to refined sub-categories, and ultimately to specific items on sale. How much an individual household spends overall is a useful predictor of potential outcomes (spend and numbers of items at the finest level), so that cascading information down the hierarchy is key to decoupling levels. The recoupling is then inherently defined through simultaneous outcomes; the amount spent at one level defines a predictor for finer levels. We have presented selected summaries of results that particularly highlight the role and relevance of the model decomposition reflecting these simultaneous predictors and the ability to increase forecast accuracy. We have also exemplified the use of the model in identifying individual households that are price-sensitive, opening the path to integrating these interpretable probabilistic models with decision analysis focused on (dynamic) selection of personalized pricing/discounts at the individual item level. Importantly, we find that our positive results generalize across very heterogeneous groups of households and items.

In terms of empirical forecast accuracy, a main challenge is that of predicting return or no return at each modeling level. Extended or alternative models could be considered for these binary prediction components. There is also opportunity for additional multi-scale extensions. We might, for example, consider aggregation of household purchases over time, or over households at multiple modeling levels that involve linkages based on household demographics or past behavior.

The selective analysis of forecast accuracy in simultaneous modeling in Section 5.6 is based on projecting point forecasts of simultaneous predictors from higher to lower levels of the hierarchy. We have noted that this is trivially extended to project full forecast ensembles (samples from predictive distributions at each level) that will, of course, more formally represent the uncertainties as they propagate, and typically increase, down the hierarchy. While not shown here, this underlies the holistic Bayesian approach with the resulting ability to average over the uncertain simultaneous predictors (extending and building on Berry and West, 2020, and Berry et al., 2020, for example). This may in some cases have practical impact on point forecast accuracy and calibration at the finest level, though the impact is context-specific and to be explored case-by-case in empirical studies. One main concern in generating large ensembles is, of course, the implied computational burden; when the full multi-scale model is to be run at each time point with large Monte Carlo ensembles of simultaneous predictors at each level, this quickly becomes a challenge.
On this theme, some recentdevelopments in alternative, partially analytic, approaches to cascading forecast uncertainties inmulti-scale frameworks (Lavine et al., 2020) will be worth exploring for adaptation to the currentsetting.
Acknowledgements
The research reported here was partly supported by 84.51°. Our research has benefited from discussions with 84.51° Research Scientist Christoph Hellmayr. Any opinions, findings and conclusions or recommendations expressed in this paper do not necessarily reflect the views of 84.51°.

References
L. R. Berry and M. West. Bayesian forecasting of many count-valued time series. Journal of Business and Economic Statistics, 38:872–887, 2020. doi: 10.1080/07350015.2019.1604372.

L. R. Berry, P. Helman, and M. West. Probabilistic forecasting of heterogeneous consumer transaction-sales time series. International Journal of Forecasting, 36:552–569, 2020. doi: 10.1016/j.ijforecast.2019.07.007.

Lindsay R. Berry. Bayesian Dynamic Modeling and Forecasting of Count Time Series. PhD thesis, Department of Statistical Science, Duke University, 2019.

Chen, B. Keng, and J. Moreno. Multivariate arrival times with recurrent neural networks for personalized demand forecasting. In , pages 810–819, 2018.

Wei Chu and Seung-Taek Park. Personalized recommendation on dynamic content using predictive bilinear models. In Proceedings of the 18th International Conference on World Wide Web, WWW '09, pages 691–700, New York, NY, USA, 2009. Association for Computing Machinery. doi: 10.1145/1526709.1526802.

Chao Du, Chongxuan Li, Yin Zheng, Jun Zhu, and Bo Zhang. Collaborative filtering with user-item co-autoregressive models. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, February 2018. Association for the Advancement of Artificial Intelligence.

Charles K. Fisher, Aaron M. Smith, Jonathan R. Walsh, et al. Machine learning for comprehensive forecasting of Alzheimer's disease progression. Scientific Reports, 9(13622), 2019. doi: 10.1038/s41598-019-49656-2.

Valentin Flunkert, David Salinas, and Jan Gasthaus. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. arxiv.org/abs/1704.04110, 2017.

Xiangnan He, Zhankui He, Xiaoyu Du, and Tat-Seng Chua. Adversarial personalized ranking for recommendation. In SIGIR '18: 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, Ann Arbor, MI, July 2018.

Y. Hu, Q. Peng, X. Hu, and R. Yang. Web service recommendation based on time series forecasting and collaborative filtering. In , pages 233–240, 2015.

Ghassen Jerfel, Mehmet Basbug, and Barbara Engelhardt. Dynamic collaborative filtering with compound Poisson factorization. In Proceedings of Machine Learning Research, volume 54, pages 738–747, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.

Pooyan Kazemian, Mariel S. Lavieri, Mark P. Van Oyen, Chris Andrews, and Joshua D. Stein. Personalized prediction of Glaucoma progression under different target intraocular pressure levels using filtered forecasting methods. Ophthalmology, 125(4):569–577, April 2018.

Isaac Lavine and Andrew Cron. PyBATS: A Python package for Bayesian analysis of time series and Bayesian forecasting, 2020. pypi.org/project/pybats/.

Isaac Lavine, Andrew J. Cron, and Mike West. Bayesian computation in dynamic latent factor models. Technical Report, Department of Statistical Science, Duke University. arxiv.org/abs/2007.04956, 2020.

Moshe Lichman and Padhraic Smyth. Prediction of sparse user-item consumption rates with zero-inflated Poisson regression. In WWW '18: Proceedings of the 2018 World Wide Web Conference, pages 719–728, April 2018.

Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G. Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu, Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira, Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong, and Misha Smelyanskiy. Deep learning recommendation model for personalization and recommendation systems. arxiv.org/abs/1906.00091, 2019.

J. R. Nevins, E. S. Huang, H. Dressman, J. L. Pittman, A. T. Huang, and M. West. Towards integrated clinico-genomic models for personalized medicine: Combining gene expression signatures and clinical factors in breast cancer outcomes prediction. Human Molecular Genetics, 12:153–157, 2003.

Wei Niu, James Caverlee, and Haokai Lu. Neural personalized ranking for image recommendation. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM 2018). ACM, 2018.

Kelly Peterson, Ognjen Rudovic, Ricardo Guerrero, and Rosalind W. Picard. Personalized Gaussian processes for future prediction of Alzheimer's disease progression. arxiv.org/abs/1712.00181, 2017.

J. L. Pittman, E. S. Huang, H. K. Dressman, C. F. Horng, S. H. Cheng, M. H. Tsou, C. M. Chen, A. Bild, E. S. Iversen, A. T. Huang, J. R. Nevins, and M. West. Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes. Proceedings of the National Academy of Sciences, 101:8431–8436, 2004.

Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI 2009), pages 452–461, Montreal, Quebec, Canada, 2009. AUAI Press.

David Salinas, Michael Bohlke-Schneider, Laurent Callot, Roberto Medico, and Jan Gasthaus. High-dimensional multivariate forecasting with low-rank Gaussian copula processes. In Advances in Neural Information Processing Systems 32, pages 6827–6837. Curran Associates, Inc., 2019.

Rajat Sen, Hsiang-Fu Yu, and Inderjit Dhillon. Think globally, act locally: A deep neural network approach to high-dimensional time series forecasting. arxiv.org/abs/1905.03806, 2019.

Xiaoyuan Su and Taghi M. Khoshgoftaar. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, Jan 2009. doi: 10.1155/2009/421425.

N. Thai-Nghe, T. Horváth, and L. Schmidt-Thieme. Personalized forecasting student performance. In , pages 412–414, 2011.

Xinxi Wang, Yi Wang, David Hsu, and Ye Wang. Exploration in interactive personalized music recommendation: A reinforcement learning approach. ACM Trans. Multimedia Comput. Commun. Appl., 2(3), 2013.

Yuyang Wang, Alex Smola, Danielle Maddix, Jan Gasthaus, Dean Foster, and Tim Januschowski. Deep factors for forecasting. In Proceedings of Machine Learning Research, volume 97, pages 6607–6617, Long Beach, California, USA, 09–15 Jun 2019. PMLR.

M. West. Bayesian decision analysis and constrained forecasting. Technical Report, Department of Statistical Science, Duke University. arxiv.org/abs/2007.11037, 2020.

M. West, A. T. Huang, G. S. Ginsberg, and J. R. Nevins. Embracing the complexity of genomic data for personalized medicine. Genome Research, 16:559–566, 2006.

Mike West and P. Jeff Harrison. Bayesian Forecasting and Dynamic Models. Springer-Verlag, New York, 2nd edition, 1997.
A Sequential Learning and Forecasting
A.1 DGLMs: Dynamic Generalized Linear Models
Sequential learning for the DGLM (West and Harrison, 1997) proceeds via the following time $t-1$ to time $t$ evolve-predict-update cycle:

1. Posterior at $t-1$: $(\theta_{t-1} \mid \mathcal{D}_{t-1}, \mathcal{I}_{t-1}) \sim (m_{t-1}, C_{t-1})$.
2. Prior at $t$: $(\theta_t \mid \mathcal{D}_{t-1}, \mathcal{I}_{t-1}) \sim (a_t, R_t)$ with $a_t = G_t m_{t-1}$ and $R_t = G_t C_{t-1} G_t' + W_t$.
3. Conjugate prior for the natural parameter: $(\eta_t \mid \mathcal{D}_{t-1}, \mathcal{I}_{t-1}) \sim CP(\alpha_t, \beta_t)$, with $p(\eta_t \mid \mathcal{D}_{t-1}, \mathcal{I}_{t-1}) = c(\alpha_t, \beta_t)\exp\{\alpha_t \eta_t - \beta_t a(\eta_t)\}$, where $c(\cdot,\cdot)$ is a known function of the hyperparameters that depends on the exponential family form.
4. Evaluate the hyperparameters $\alpha_t$ and $\beta_t$ such that $E(\lambda_t \mid \mathcal{D}_{t-1}, \mathcal{I}_{t-1}) = f_t = F_t' a_t$ and $V(\lambda_t \mid \mathcal{D}_{t-1}, \mathcal{I}_{t-1}) = q_t = F_t' R_t F_t$.
5. Forecast $y_t$: $p(y_t \mid \mathcal{D}_{t-1}, \mathcal{I}_{t-1}) = b(y_t, \phi)\, c(\alpha_t, \beta_t) / c(\alpha_t + \phi y_t, \beta_t + \phi)$.
6. Posterior for $\eta_t$: $(\eta_t \mid \mathcal{D}_t) \sim CP(\alpha_t + \phi y_t, \beta_t + \phi)$.
7. Map back to the linear predictor $\lambda_t = g(\eta_t)$: posterior mean $g_t = E(\lambda_t \mid \mathcal{D}_t)$ and variance $p_t = V(\lambda_t \mid \mathcal{D}_t)$.
8. Posterior at time $t$: $(\theta_t \mid \mathcal{D}_t) \sim (m_t, C_t)$ given by $m_t = a_t + R_t F_t (g_t - f_t)/q_t$ and $C_t = R_t - R_t F_t F_t' R_t' (1 - p_t/q_t)/q_t$.

This completes the time $t-1$ to $t$ evolve-predict-update cycle. For all results presented, we specify the state space priors $m_0 = 0$ and $C_0 = I$, where $I$ is the identity matrix.
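As a concrete illustration of steps 1-8, the following Python sketch runs the evolve-predict-update cycle for a Bernoulli DGLM, for which the conjugate form $CP(\alpha_t, \beta_t)$ is a beta distribution and the linear predictor is the logit of the success probability. This is a minimal didactic reimplementation, not the production code of the case study (which relates to the PyBATS package of Lavine and Cron, 2020); in particular, the moment-matching of steps 3-4 is done here with a generic root-finder rather than a tailored recursion.

```python
import numpy as np
from scipy.special import digamma, polygamma
from scipy.optimize import fsolve

def bern_dglm_step(m, C, F, G, W, y):
    """One evolve-predict-update cycle of a Bernoulli DGLM, a sketch of
    steps 1-8 in Appendix A.1 using the beta conjugate form."""
    # Step 2: evolve to the time-t prior (a_t, R_t).
    a = G @ m
    R = G @ C @ G.T + W
    f = F @ a                    # prior mean of the linear predictor
    q = F @ R @ F                # prior variance of the linear predictor
    # Steps 3-4: choose Beta(alpha, beta) for the success probability so
    # the implied logit has mean f and variance q.
    def match(x):
        al, be = np.exp(x)       # log-scale parameterization for positivity
        return [digamma(al) - digamma(be) - f,
                polygamma(1, al) + polygamma(1, be) - q]
    al, be = np.exp(fsolve(match, [0.0, 0.0]))
    # Step 5: one-step forecast probability (marginal beta-Bernoulli).
    p1 = al / (al + be)
    # Step 6: conjugate posterior after observing y in {0, 1}.
    al, be = al + y, be + 1 - y
    # Step 7: posterior mean/variance back on the linear predictor scale.
    g = digamma(al) - digamma(be)
    p = polygamma(1, al) + polygamma(1, be)
    # Step 8: linear Bayes update of the state posterior (m_t, C_t).
    RF = R @ F
    m_new = a + RF * (g - f) / q
    C_new = R - np.outer(RF, RF) * (1 - p / q) / q
    return m_new, C_new, p1

# Local-level example: theta_t is a single trend state, priors m0 = 0, C0 = I.
m, C = np.zeros(1), np.eye(1)
F, G, W = np.array([1.0]), np.eye(1), 0.01 * np.eye(1)
for y in [1, 1, 0, 1, 1]:
    m, C, p1 = bern_dglm_step(m, C, F, G, W, y)
    print(f"forecast p(y=1) = {p1:.3f}")
```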
A.2 Forecasting in Dynamic Count Mixture Models

The forecast distribution at time $t+k$ for the DCMM is a mixture of the forecast distributions for the independent Bernoulli and shifted Poisson DGLMs. That is, the marginal forecast distribution is
$$p(y_{t+k} \mid \mathcal{D}_t, \mathcal{I}_t, \pi_{t+k}) = (1 - \pi_{t+k})\,\delta(y_{t+k}) + \pi_{t+k}\, h_{t,t+k}(y_{t+k}),$$
where $(\pi_{t+k} \mid \mathcal{D}_t, \mathcal{I}_t) \sim \mathrm{Beta}(\alpha_t(k), \beta_t(k))$, $\delta(y)$ is the Dirac delta function at zero, and $h_{t,t+k}(y_{t+k})$ is the density of $y_{t+k} = 1 + x_{t+k}$ with $(x_{t+k} \mid \mathcal{D}_t, \mathcal{I}_t) \sim \mathrm{NegBinom}\!\big(\alpha_t^+(k),\, \beta_t^+(k)/(1 + \beta_t^+(k))\big)$. Here $\alpha_t(k)$ and $\beta_t(k)$ are computed from the binary DGLM, and $\alpha_t^+(k)$, $\beta_t^+(k)$ from the shifted Poisson DGLM (Berry and West, 2020). We primarily focus on one-step ahead forecasts in the main results, in which case $k = 1$.

A.3 Forecasting in Dynamic Linear Mixture Models

In the DLMM, the forecast distribution at time $t+k$ is nearly identical to that of the DCMM, with the exception that $h_{t,t+k}(y_{t+k})$ is now a Student-t density, arising when a beta-gamma stochastic volatility model is specified for the observation precision, following West and Harrison (1997).
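The mixture form makes simulation-based forecasting straightforward: draw the Bernoulli component, then add a shifted negative binomial draw when a purchase occurs. A sketch follows, with illustrative hyperparameter values only (not values from the case study); note that numpy's negative binomial is parameterized by the per-trial success probability, matching the $\beta^+/(1+\beta^+)$ form above.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_dcmm_forecast(alpha, beta, alpha_plus, beta_plus, n=10_000):
    """Draws from the DCMM k-step forecast mixture of Appendix A.2:
    zero with probability 1 - pi, else a 1 + NegBinom shifted count.
    (alpha, beta) come from the binary DGLM; (alpha_plus, beta_plus)
    from the shifted Poisson DGLM, as in Berry and West (2020)."""
    pi = rng.beta(alpha, beta, size=n)          # purchase probabilities
    purchase = rng.random(n) < pi               # Bernoulli component
    x = rng.negative_binomial(alpha_plus, beta_plus / (1 + beta_plus), size=n)
    return np.where(purchase, 1 + x, 0)         # shifted Poisson component

# Illustrative hyperparameters only.
draws = sample_dcmm_forecast(alpha=2.0, beta=6.0, alpha_plus=3.0, beta_plus=1.5)
print(f"P(y = 0) ~ {np.mean(draws == 0):.2f}, mean = {draws.mean():.2f}")
```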
B Metric Derivations

B.1 Log-T Distributions
In our modeling context, as often occurs in demand forecasting, we are interested in modeling a log quantity, specifically the log total spend at the global, category and sub-category levels. We model the log total spend with a DLM, so the predictive distribution is a Student-t distribution. However, we often want to evaluate our models directly on the original dollar scale, rather than the log scale. That is, if $\log(y) \sim T_k(m, v)$, then $y \sim LT_k(m, v)$, a heavy-tailed log-T distribution with p.d.f. (West, 2020)
$$p(y) \propto y^{-1}\left(k + (\log(y) - m)^2/v\right)^{-(k+1)/2}, \qquad y > 0.$$
This p.d.f. decays as an inverse power of $\log(y)$ as $y \to \infty$ and has a pole at zero. As a result, none of the moments of the log-T distribution exist, and expected losses under commonly used loss functions likewise fail to exist (West, 2020); for example, MAPE and ZAPE losses do not have finite expectations. In practice, to calculate optimal point forecasts with log-T predictive densities, we can truncate the log-T distribution to bounded values (away from 0 and up to a finite value) to obtain finite expected losses (West, 2020).
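A simple way to see the truncation device in practice is Monte Carlo: draw from the log-T predictive by exponentiating t draws, discard draws outside chosen bounds, and average the loss. A minimal sketch, with assumed values of $(k, m, v)$ and assumed truncation bounds:

```python
import numpy as np

rng = np.random.default_rng(2)

def truncated_logt_expected_ape(f, k, m, v, lo=0.01, hi=1e4, n=200_000):
    """Monte Carlo expected absolute percentage error |1 - f/y| when
    log(y) ~ T_k(m, v), with y truncated to [lo, hi] so the expectation
    is finite (the untruncated log-T has no moments)."""
    y = np.exp(m + np.sqrt(v) * rng.standard_t(df=k, size=n))
    y = y[(y >= lo) & (y <= hi)]    # truncate away from 0 and infinity
    return np.mean(np.abs(1.0 - f / y))

# Expected APE as a function of the point forecast f, on the dollar scale.
for f in [10.0, 20.0, 40.0]:
    print(f"f = {f:5.1f}: E[APE] ~ {truncated_logt_expected_ape(f, k=10, m=3.0, v=0.25):.3f}")
```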
B.2 Optimal ZAPE Forecasts

The optimal ZAPE forecast can be derived following the similar MAPE derivation in Berry (2019). Let $y$ be a continuous quantity with point mass $\pi_0 = P(y = 0) > 0$, and let $p(y)$ and $P(y)$ be the p.d.f. and c.d.f. of $y$, respectively. Let $g(y) = c\,y^{-1} p(y)$ on $y > 0$, with c.d.f. $G(y)$, where $c$ is the normalizing constant defined by $c^{-1} = \int_{0+}^{\infty} y^{-1} p(y)\,dy$. The optimal forecast under the ZAPE loss function can then be derived as follows:
$$L_{ZAPE}(y, f) = \frac{f}{1+f}\,\mathbb{1}(y = 0) + \mathbb{1}(y > 0)\left|1 - \frac{f}{y}\right|,$$
$$R(f) = \int_0^{\infty} L_{ZAPE}(y, f)\,p(y)\,dy = \frac{f}{1+f}\,\pi_0 + \int_{0+}^{\infty} \left|1 - \frac{f}{y}\right| p(y)\,dy,$$
where
$$\left|1 - \frac{f}{y}\right| = \begin{cases} 1 - f/y, & y \ge f,\\ f/y - 1, & y < f. \end{cases}$$
Splitting the integral and using $f\,y^{-1} p(y) = (f/c)\,g(y)$,
$$R(f) = \frac{f}{1+f}\,\pi_0 + \int_{0+}^{f} \left(\frac{f}{y} - 1\right) p(y)\,dy + \int_{f}^{\infty} \left(1 - \frac{f}{y}\right) p(y)\,dy$$
$$= \frac{f}{1+f}\,\pi_0 + \int_{0+}^{f} \frac{f}{c}\,g(y)\,dy - \left[P(f) - P(0)\right] + \left[1 - P(f)\right] - \int_{f}^{\infty} \frac{f}{c}\,g(y)\,dy$$
$$= \frac{f}{1+f}\,\pi_0 + P(0) + 1 - 2P(f) + \frac{2f}{c}\,G(f) - \frac{f}{c}.$$
Differentiating, and noting that the term $(2f/c)\,g(f) = 2p(f)$ cancels the $-2p(f)$ arising from $-2P(f)$,
$$\frac{\partial R(f)}{\partial f} = \frac{\pi_0}{(1+f)^2} + \frac{2}{c}\,G(f) - \frac{1}{c},$$
so that
$$\frac{\partial R(f)}{\partial f} = 0 \implies G(f) = \frac{1}{2} - \frac{\pi_0\,c}{2}\,\frac{1}{(1+f)^2}.$$
Since $G$ is a c.d.f., a solution exists only when the right-hand side lies in $[0, 1/2]$, i.e., when $\pi_0\,c\,(1+f)^{-2} \in [0, 1]$. A simple procedure to compute the optimal forecast $f^*$ is then:

1. Solve for $\tilde{f}$ via gradient descent: iterate $\tilde{f} \leftarrow \tilde{f} - \alpha\,\partial R(\tilde{f})/\partial \tilde{f}$ with step size $\alpha$, using the gradient above.
2. If $\pi_0\,c\,(1 + \tilde{f})^{-2} \ge 1$, then $f^* = 0$.
3. Otherwise, $f^* = \tilde{f}$.

If $y$ is a non-negative count, we can follow a similar procedure, but using a grid search to minimize the risk over values between 0 and the $(-1)$-median (the median of the distribution $g(y)$ defined above), inclusive. Thus, for positive counts, the ZAPE optimal forecast is always less than or equal to the MAPE optimal forecast, which is less than or equal to the MAD optimal forecast.
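For count outcomes, the grid-search recipe above is easy to run against Monte Carlo draws from a forecast distribution. The sketch below uses toy zero-inflated draws (not the case study's forecasts) and approximates the $(-1)$-median by the $1/y$-weighted median of the positive draws.

```python
import numpy as np

def zape_optimal_forecast(samples):
    """Grid-search ZAPE-optimal point forecast for non-negative count
    forecast samples, following the Appendix B.2 recipe: search f from 0
    up to the (-1)-median, the median of g(y) proportional to p(y)/y."""
    samples = np.asarray(samples)
    pos = samples[samples > 0]
    pi0 = np.mean(samples == 0)
    # (-1)-median: weight each positive draw by 1/y and take the weighted
    # median, which upper-bounds the ZAPE-optimal forecast.
    order = np.argsort(pos)
    w = 1.0 / pos[order]
    ymed = pos[order][np.searchsorted(np.cumsum(w) / w.sum(), 0.5)]
    best_f, best_risk = 0, np.inf
    for f in range(0, int(ymed) + 1):
        # ZAPE risk: f/(1+f) charged at zeros, |1 - f/y| at positives.
        risk = pi0 * f / (1 + f) + (1 - pi0) * np.mean(np.abs(1 - f / pos))
        if risk < best_risk:
            best_f, best_risk = f, risk
    return best_f

rng = np.random.default_rng(3)
y = np.where(rng.random(20_000) < 0.3, 0, rng.poisson(5, 20_000))  # toy zero-inflated draws
print("ZAPE-optimal point forecast:", zape_optimal_forecast(y))
```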
C Item Selection Details

Because demographic information about households is unavailable, we develop the following categorization of households, based on the promotion circumstances and buying behaviors defined for every household-item combination. This categorization is then used to select the items on which to focus the modeling.

For every household-item pair $(i, h)$, $i = 1{:}I$, $h = 1{:}H$, with $I$ and $H$ the total numbers of items sold and households recorded, define:

• Discount Offered Percentage (DOP): over the span of the 112 weeks recorded, the proportion of weeks in which promotions were offered to household $h$ for item $i$;
• Discounted Purchase Percentage (DPP): among the weeks when item $i$ was discounted for household $h$, the proportion of weeks in which household $h$ made a purchase;
• Regular Purchase Percentage (RPP): among the weeks when item $i$ was at regular price for household $h$, the proportion of weeks in which household $h$ made a purchase.

These three quantities together define a household space for each item, whose domain is a unit cube anchored at the origin, with different regions of the cube corresponding to different purchasing behaviors. We can then form four household categories based on this overall purchasing behavior for each item $i$, with possible discount actions to take for each category (a computational sketch follows Table 9):

1. Habit and loyalty for item $i$ are established. Actions: maintain the relationship and occasionally compensate for the household's loyalty to item $i$.
2. Promotion sensitivity and interest in item $i$ are detectable; this is the ideal group of customers for modeling price sensitivity. Actions: find the amount of promotion of item $i$ that generates the most profit, which depends on the distribution of sales and the quantity being optimized.
3. Promotions are not available. Actions: explore and experiment with these customers by delivering promotions of item $i$.
4. Lack of interest in item $i$, or disregard for the promotions, is noticeable. Actions: check the validity of the promotions sent out; if they are disregarded, stop the promotions of item $i$.

After this categorization, we select items with a large proportion of households in category 2 above, indicating that many households are price sensitive for these specific items. Additional summary statistics about each item selected are given in Table 9.

Table 9: Sub-Category and Category for each item modeled. Also, the number of households (out of 2000 for each household group) in each household group which return to purchase each item in more than 10 of the 112 weeks; these are the households modeled for each item.

Item    Sub-Category    Category
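The sketch below computes the three statistics and one possible category assignment for a household-item pair; the boolean inputs, threshold values and decision rules are illustrative assumptions, not the rules used in the study.

```python
import numpy as np

def household_item_profile(discount_offered, purchased):
    """DOP, DPP, RPP for one household-item pair over the 112 recorded
    weeks, from boolean week-indexed arrays as defined in Appendix C."""
    discount_offered = np.asarray(discount_offered, dtype=bool)
    purchased = np.asarray(purchased, dtype=bool)
    dop = discount_offered.mean()
    dpp = purchased[discount_offered].mean() if discount_offered.any() else np.nan
    rpp = purchased[~discount_offered].mean() if (~discount_offered).any() else np.nan
    return dop, dpp, rpp

def categorize(dop, dpp, rpp, thresh=0.5):
    """Illustrative assignment to the four categories of Appendix C; the
    cutoffs here are assumptions, not the case study's actual rules."""
    if dop == 0:
        return 3        # no promotions offered: explore and experiment
    if dpp > thresh and rpp > thresh:
        return 1        # buys at any price: habit and loyalty
    if dpp > rpp + thresh / 2:
        return 2        # buys mainly on discount: price sensitive
    return 4            # ignores the item and/or the promotions

rng = np.random.default_rng(4)
offered = rng.random(112) < 0.3
bought = rng.random(112) < np.where(offered, 0.6, 0.1)  # responds to discounts
print(categorize(*household_item_profile(offered, bought)))  # expect category 2
```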
D Additional Results
D.1 Global Return
To model whether a household returns to purchase any item in a given week, we model p(Return) with a Bernoulli DGLM, with a trend term and, as covariate, the log total spend across all items in the previous week. We also compare multi-step ahead forecasts at this modeling level, with forecast horizons of $k = 1$ (one week ahead), $k = 4$ (one month ahead) and $k = 8$ (two months ahead). As we are modeling a binary outcome, we primarily focus on calibration plots to evaluate the forecasts, and find the calibration to be good for all household groups and across all three forecast horizons (Figure 12). Additionally, we consider point forecasts under mean squared error, the area under the curve (AUC) and the F1 score (the harmonic mean of precision and recall), evaluating the binary forecasts as a binary classification problem (a computational sketch follows Table 10). Across all three of these metrics, forecast accuracy persists across forecast horizons, with little drop in accuracy even for two-month ahead forecasts ($k = 8$; Table 10).

Figure 12: Multi-step ahead forecast calibration for a Bernoulli DGLM modeling p(Return) for household (a) Group 1, (b) Group 2 and (c) Group 3. The forecasts are well calibrated across all forecast horizons considered; $k$ is the forecast horizon in weeks.

Table 10: AUC, F1 scores and MSE values for multi-step ahead forecasts with a Bernoulli DGLM fit to return to store. All metrics are evaluated using the mean forecasts for each household for each week; $k$ is the forecast horizon in weeks.

HH Group  Metric    k = 1  k = 4  k = 8
1         AUC       0.68   0.68   0.67
          F1 Score  0.96   0.96   0.95
          MSE       0.07   0.07   0.08
2         AUC       0.64   0.64   0.63
          F1 Score  0.95   0.95   0.95
          MSE       0.08   0.08   0.08
3         AUC       0.63   0.62   0.62
          F1 Score  0.90   0.90   0.89
          MSE       0.14   0.14   0.15
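For reference, metrics of the kind in Table 10 can be computed from forecast probabilities and binary outcomes as in this self-contained sketch; the AUC uses the rank statistic, and the F1 score thresholds at 0.5, an assumption since the classification threshold is not stated above.

```python
import numpy as np

def binary_forecast_metrics(p_hat, y):
    """MSE, AUC and F1 for probabilistic return/no-return forecasts,
    evaluating them as a binary classification problem."""
    p_hat, y = np.asarray(p_hat, float), np.asarray(y, int)
    mse = np.mean((p_hat - y) ** 2)              # Brier-style squared error
    # AUC as the rank statistic P(score_pos > score_neg), with tie correction.
    pos, neg = p_hat[y == 1], p_hat[y == 0]
    auc = (np.mean(pos[:, None] > neg[None, :])
           + 0.5 * np.mean(pos[:, None] == neg[None, :]))
    yhat = (p_hat >= 0.5).astype(int)            # assumed 0.5 threshold
    tp = np.sum((yhat == 1) & (y == 1))
    prec = tp / max(yhat.sum(), 1)
    rec = tp / max(y.sum(), 1)
    f1 = 2 * prec * rec / max(prec + rec, 1e-12) # harmonic mean
    return mse, auc, f1

rng = np.random.default_rng(5)
p = rng.beta(8, 2, size=1000)              # mostly-returning households
y = (rng.random(1000) < p).astype(int)     # outcomes consistent with forecasts
print("MSE %.3f, AUC %.3f, F1 %.3f" % binary_forecast_metrics(p, y))
```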
D.2 Global Log Total Spend

At the next level in the modeling decomposition, p(Global log Total Spend | Return) is modeled using a DLM with a trend term and, as covariate, the global log total spend at the last return. These DLMs exhibit good coverage across all three household groups (Figure 13), indicating that the uncertainty associated with these one-step ahead forecasts is well calibrated.
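Coverage plots of the kind in Figure 13 compare nominal central-interval coverage against the observed frequency with which outcomes fall inside those intervals. A minimal sketch, using synthetic predictive draws rather than the case study's forecasts:

```python
import numpy as np

def empirical_coverage(samples, y_obs, levels=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Observed central-interval coverage across one-step forecasts:
    samples is a (T, n) array of predictive draws, y_obs the T outcomes."""
    samples, y_obs = np.asarray(samples), np.asarray(y_obs)
    out = {}
    for lvl in levels:
        lo = np.quantile(samples, (1 - lvl) / 2, axis=1)
        hi = np.quantile(samples, (1 + lvl) / 2, axis=1)
        out[lvl] = float(np.mean((y_obs >= lo) & (y_obs <= hi)))
    return out

rng = np.random.default_rng(6)
draws = rng.standard_t(df=15, size=(200, 2000)) * 0.4 + 3.0  # t predictive draws
obs = rng.standard_t(df=15, size=200) * 0.4 + 3.0            # well-calibrated outcomes
print(empirical_coverage(draws, obs))  # observed coverage near nominal levels
```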
D.3 Category Level
At the category level, we model p(log Total Spend by Category | Return, Global log Total Spend) with a DLMM. We compare models with three different predictors to evaluate the utility of simultaneous predictors relative to lagged predictors, with all covariates treated as known.
Figure 13: Coverage plots for a DLM modeling p(Global log Total Spend | Return) for household (a) Group 1, (b) Group 2 and (c) Group 3, for one-step ahead forecasts. The coverage plots evaluate the forecast uncertainty and indicate that the observed coverage aligns well with the expected theoretical coverage.

All models have a trend term and one additional dynamic predictor: for M1, the log total spend in the category at the last return (a lagged, local predictor); for M2, the global log total spend across all categories at the last return (a lagged predictor); and for M3, the global log total spend (across all categories) for the current week (the simultaneous predictor); a code sketch of these three covariates follows Table 12. Aggregate results for all three household groups are given for two additional categories in Table 11 and Table 12; the results presented in the main paper were fit to the category containing Items A-D. For each metric, the point forecast is the optimal forecast under that loss function, and MAPE is evaluated only at non-zero observations.

Table 11: Median forecast accuracy across all households in each household group for DLMMs fit at the category level for the category containing Item E. The values in parentheses are the 25th and 75th percentiles across households.
HH Group  Metric  M1: Lagged Cat.     M2: Lagged Global   M3: Simultaneous Global
1         MAD     3.08 (1.96, 4.70)   3.04 (1.97, 4.50)   – (1.83, 3.64)
          MAPE    0.47 (0.39, 0.58)   0.47 (0.39, 0.55)   – (0.32, 0.48)
          ZAPE    0.57 (0.48, 0.68)   0.57 (0.47, 0.65)   – (0.35, 0.55)
2         MAD     2.98 (1.84, 4.40)   3.16 (2.10, 4.51)   – (2.08, 4.51)
          MAPE    0.53 (0.44, 0.66)   0.55 (0.45, 0.68)   0.57 (0.44, 0.71)
          ZAPE    0.68 (0.58, 0.93)   0.71 (0.61, 1.01)   – (0.54, 0.95)
3         MAD     1.62 (0.82, 2.66)   1.54 (0.76, 2.50)   1.55 (0.81, 2.46)
          MAPE    0.58 (0.45, 0.74)   0.55 (0.44, 0.70)   – (0.42, 0.71)
          ZAPE    0.72 (0.59, 1.11)   0.70 (0.58, 1.02)   – (0.52, 0.89)

Table 12: Median forecast accuracy across all households in each household group for DLMMs fit at the category level for the second additional category; layout as in Table 11.

HH Group  Metric  M1: Lagged Cat.     M2: Lagged Global   M3: Simultaneous Global
1         MAD     2.56 (1.45, 4.03)   2.54 (1.45, 3.97)   2.62 (1.51, 4.04)
          MAPE    0.51 (0.40, 0.66)   0.51 (0.40, 0.66)   0.53 (0.42, 0.68)
          ZAPE    0.68 (0.60, 0.87)   0.68 (0.59, 0.89)   – (0.59, 0.85)
2         MAD     1.43 (0.70, 2.42)   1.39 (0.67, 2.24)   – (0.67, 2.11)
          MAPE    0.42 (0.34, 0.55)   0.40 (0.33, 0.48)   – (0.31, 0.44)
          ZAPE    0.60 (0.53, 0.72)   0.58 (0.52, 0.67)   – (0.46, 0.57)
3         MAD     1.38 (0.59, 2.31)   0.95 (0.39, 1.68)   0.95 (0.39, 1.63)
          MAPE    0.51 (0.38, 0.67)   0.42 (0.35, 0.56)   – (0.32, 0.51)
          ZAPE    0.67 (0.59, 1.07)   0.60 (0.53, 0.73)   – (0.46, 0.59)
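To clarify how the three competing covariates differ, the sketch below builds M1-M3 predictor columns from weekly log-spend series. Treating the last nonzero week as the "last return" is a simplifying assumption, as are the synthetic series; in the multi-scale cascade the simultaneous M3 column is supplied by the higher-level model rather than observed.

```python
import numpy as np
import pandas as pd

def build_predictors(cat_spend, global_spend):
    """Covariates for the three category-level DLMMs compared in Tables
    11-12: M1 lagged category spend, M2 lagged global spend, M3 the
    simultaneous (same-week) global spend. Inputs are weekly log-spend
    series; 'last return' is approximated by the last nonzero week."""
    def at_last_return(x):
        out, last = np.zeros(len(x)), 0.0
        for t in range(1, len(x)):
            if x[t - 1] > 0:
                last = x[t - 1]      # carry forward the most recent return
            out[t] = last
        return out
    return pd.DataFrame({
        "M1_lagged_category": at_last_return(cat_spend),
        "M2_lagged_global": at_last_return(global_spend),
        "M3_simultaneous_global": global_spend,
    })

rng = np.random.default_rng(7)
g = np.where(rng.random(10) < 0.8, rng.gamma(9, 0.5, 10), 0.0)      # global log spend
c = np.where(g > 0, (0.5 * g + rng.normal(0, 0.3, 10)).clip(min=0), 0)  # category log spend
print(build_predictors(c, g).round(2))
```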
D.4 Sub-Category Level

At the sub-category level, we again model p(log Total Spend by Sub-Category | Return in Category, Category log Total Spend) with a DLMM. At this level of modeling, simultaneous predictors again serve to improve forecast accuracy. We compare three models, each with a trend term and one additional predictor: M1, the log total spend at the last return at the sub-category level (lagged); M2, the log total spend at the last return at the category level (lagged); and M3, the log total spend for the current week at the category level (simultaneous). Aggregate point forecast results are given in Table 13 for the sub-category containing Item C and in Table 14 for the sub-category containing Item F. Again, at the sub-category level, the use of simultaneous predictors greatly improves forecast accuracy.

Table 13: Median forecast accuracy across all households in each household group for DLMMs fit at the sub-category level for the sub-category containing Item C.
HH Group  Metric  M1: Lagged Sub-Cat.  M2: Lagged Cat.     M3: Simultaneous Cat.
1         MAD     0.69 (0.23, 1.43)    0.70 (0.23, 1.43)   – (0.18, 1.17)
          MAPE    0.47 (0.35, 0.68)    0.46 (0.34, 0.67)   – (0.27, 0.52)
          ZAPE    0.60 (0.49, 0.86)    0.59 (0.50, 0.85)   – (0.40, 0.52)
2         MAD     0.41 (0.14, 0.93)    0.41 (0.14, 0.92)   – (0.13, 0.89)
          MAPE    0.48 (0.32, 0.71)    0.44 (0.30, 0.64)   – (0.27, 0.57)
          ZAPE    0.55 (0.41, 0.67)    0.53 (0.41, 0.65)   – (0.36, 0.52)
3         MAD     0.43 (0.17, 0.90)    0.42 (0.17, 0.90)   – (0.14, 0.78)
          MAPE    0.54 (0.37, 0.74)    0.55 (0.37, 0.75)   – (0.28, 0.63)
          ZAPE    0.59 (0.48, 1.15)    0.60 (0.47, 1.15)   – (0.33, 0.51)

Table 14: Median forecast accuracy across all households in each household group for DLMMs fit at the sub-category level for the sub-category containing Item F.

HH Group  Metric  M1: Lagged Sub-Cat.  M2: Lagged Cat.     M3: Simultaneous Cat.
1         MAD     1.86 (1.23, 2.55)    1.87 (1.27, 2.55)   – (0.72, 1.71)
          MAPE    0.45 (0.33, 0.59)    0.45 (0.33, 0.60)   – (0.21, 0.36)
          ZAPE    0.62 (0.51, 1.18)    0.62 (0.51, 1.19)   – (0.32, 0.51)
2         MAD     1.82 (1.22, 2.49)    1.82 (1.21, 2.48)   – (0.63, 1.55)
          MAPE    0.48 (0.34, 0.61)    0.48 (0.34, 0.62)   – (0.37, 0.43)
          ZAPE    0.63 (0.52, 1.46)    0.63 (0.52, 1.46)   – (0.31, 0.51)
3         MAD     1.62 (1.01, 2.31)    1.62 (1.00, 2.30)   – (0.52, 1.44)
          MAPE    0.49 (0.35, 0.62)    0.49 (0.35, 0.63)   – (0.21, 0.41)
          ZAPE    0.62 (0.51, 1.48)    0.62 (0.51, 1.48)   – (0.31, 0.52)

Figure 14: Calibration plots for households in Group 1 on (a) Item C, (b) Item D, and (c) Item E for DLMMs with and without discount information. Both models perform well across households in terms of calibration for all three items.
D.5 Item Level
Calibration plots for DLMMs fit with and without discount information for Items C, D and E are presented for household Group 1 in Figure 14. Additional individual results for price-sensitive households in household Group 2 (Figure 15) and household Group 3 (Figure 16) for Item A are presented below. For each individual household, the inclusion of discount information improves forecast accuracy, and the state vector element corresponding to the aggregate discount information is positive over time, representing the price sensitivity of each household.
Figure 15: (a) MAD optimal point forecasts and 90% prediction intervals for Item A purchases of one price-sensitive household in Group 2, with and without discount information; (b) on-line posterior mean and 90% intervals for the state vector element corresponding to the discount predictor.

Figure 16: As in Figure 15 ((a) forecasts; (b) state vector element for the discount predictor), for one price-sensitive household in Group 3 and Item A.