FlashP: An Analytical Pipeline for Real-time Forecasting of Time-Series Relational Data
Shuyuan Yan, Bolin Ding, Wei Guo, Jingren Zhou, Zhewei Wei, Xiaowei Jiang, Sheng Xu
Shuyuan Yan (Alibaba), Bolin Ding (Alibaba)*, Wei Guo (Alibaba), Jingren Zhou (Alibaba), Zhewei Wei (Renmin University of China), Xiaowei Jiang (Alibaba), Sheng Xu (Alibaba)
ABSTRACT
Interactive response time is important in analytical pipelines for users to explore a sufficient number of possibilities and make informed business decisions. We consider a forecasting pipeline with large volumes of high-dimensional time series data. Real-time forecasting can be conducted in two steps. First, we specify the part of data to be focused on and the measure to be predicted by slicing, dicing, and aggregating the data. Second, a forecasting model is trained on the aggregated results to predict the trend of the specified measure. While there are a number of forecasting models available, the first step is the performance bottleneck. A natural idea is to utilize sampling to obtain approximate aggregations in real time as the input to train the forecasting model. Our scalable real-time forecasting system FlashP (Flash Prediction) is built based on this idea, with two major challenges to be resolved in this paper: first, we need to figure out how approximate aggregations affect the fitting of forecasting models and the forecasting results; and second, accordingly, what sampling algorithms we should use to obtain these approximate aggregations and how large the samples should be. We introduce a new sampling scheme, called GSW sampling, and analyze error bounds for estimating aggregations using GSW samples. We introduce how to construct compact GSW samples with the existence of multiple measures to be analyzed. We conduct experiments to evaluate our solution and compare it with alternatives on real data.
PVLDB Reference Format:
Shuyuan Yan, Bolin Ding, Wei Guo, Jingren Zhou, Zhewei Wei, Xiaowei Jiang, and Sheng Xu. FlashP: An Analytical Pipeline for Real-time Forecasting of Time-Series Relational Data. PVLDB, 14(5): XXX-XXX, 2021. doi:10.14778/3446095.3446096

* Bolin Ding is the corresponding author.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 14, No. 5, ISSN 2150-8097. doi:10.14778/3446095.3446096
1 INTRODUCTION

Large volumes of high-dimensional data are generated on eCommerce platforms every day, from data sources about, e.g., sales and browsing activities of online customers. Forecasting is among the most important BI analytical tasks to utilize such data, as sound prediction of future trends helps make informed business decisions.
Motivating scenario and challenges.
An online advertising platform enables advertisers to target certain user visits and submit their bids for displaying advertisements on these visits during a time window (i.e., targeting ads campaigns). To make the right decision about which customers to target and the bid prices, it is imperative for an advertiser to access reliable forecasts of important measures about the targeted customers and their activities, e.g., Impression, the number of times an advertisement can be shown to such customers, and ViewTime, the time a customer spends on each visit.

Time-series data about customers can be very high-dimensional, with tens or hundreds of attributes about customers' profiles and activities, including their demographics, devices (e.g., mobile/PC), machine-learned tags (e.g., their preferences and intents of visits), and so on. An advertiser may decide to target a group of customers for any combination of the attributes and values, based on the merchandise in the advertisement and fine-grained forecasts after different ways of slicing and dicing in the attribute space. For example, she may decide to target 20-30 year old females interested in sports and located in some cities for skirt sets in the advertisement.

Consider a time series of relational data in Figure 1. A forecasting task, e.g., the one in Figure 2, asks to predict the total number of impressions by female customers under age 30.

Unlike in traditional forecasting applications such as airline planning, real-time response is critical in our scenario. First, it is an exploratory process to find the right attribute combination for targeting. An advertiser may try a number of combinations and read the forecasting results within a short period of time in order to support the decision. High latency can easily make advertisers impatient during the exploration. Second, real-time bidding for targeting ads campaigns is a very competitive and dynamic market. Slow response may result in loss of displaying opportunities and higher prices. Thus, it is important to make our platform interactive.

It is challenging to provide real-time response to such forecasting tasks. The volume of input time-series data to the analytical pipeline is huge, with hundreds of millions of rows per day, and commonly several months of historical data are used to forecast a future point.

Age a^(1) | Gender a^(2) | Location a^(3) | Impression m^(1) | ViewTime m^(2) | TimeStamp t
30 | F | WA | 5  | 1.6min | 20200301
60 | M | WA | 1  | 1.8min | 20200301
20 | F | NY | 10 | 3.2min | 20200301
40 | M | NY | 20 | 6.3min | 20200302

Figure 1: A time series of relational data (yellow cells are relevant to the task)
FORECAST SUM(Impression) FROM T WHERE Age <= 30 AND Gender = F
USING (20200101, 20200331)

SELECT SUM(Impression) FROM T WHERE Age <= 30 AND Gender = F AND t = 20200101 ⇒ M_{20200101}
SELECT SUM(Impression) FROM T WHERE Age <= 30 AND Gender = F AND t = 20200102 ⇒ M_{20200102}
......
SELECT SUM(Impression) FROM T WHERE Age <= 30 AND Gender = F AND t = 20200331 ⇒ M_{20200331}

Figure 2: Processing a real-time forecasting task (aggregation queries → fitting the model (training) → forecasting)

Figure 3: Forecasting example (Impression over time: true values, estimated aggregations, and forecasts)

Moreover, as a result of the high dimensionality of our data, there could be trillions of possible attribute/value combinations; a forecasting task is given online with any combination, and it would be too expensive to precompute and store all possible combinations and the corresponding time series during an offline phase.
The task and solution phases.
Our forecasting system FlashP is designed to process real-time forecasting tasks, in which we specify: i) a constraint specifying the portion of data to be focused on (e.g., targeting customers "Age <= 30 AND Gender = F" in the example in Figure 2); ii) the aggregated measure to be predicted (e.g., SUM(Impression)); iii) historical data points to be used to train a forecasting model (e.g., from 20200101 to 20200331).

We give an overview of the solution. Consider the example in Figure 2. A forecasting task is specified in a SQL-like language, with the goal to predict the total number of impressions by female customers under age 30. To prepare training data for a forecasting model, we need to know the total number of impressions on each day (M_t), which can be written as an aggregation query. As the time-series data is partitioned on time, we can process these 91 aggregation queries with one scan of the data. After that, we have 91 data points to fit (train) the forecasting model. Processing the 91 aggregation queries (or, equivalently, a query with GROUP BY) is the bottleneck. A natural idea is to utilize sampling (e.g., [21]) or sample-based approximate query processing (e.g., [19, 34]) to estimate their answers. Figure 3 shows an example of prediction for the next 7 days: the red line shows estimated aggregations; the estimations are used to train the model, which then produces forecasts (green line) with confidence intervals (green dashed lines).
Contributions and organization.
FlashP is implemented in Alibaba's advertising system. The first technical question is how the errors in sample-based estimations affect the fitting of forecasting models (Section 3). Error bounds of estimating aggregations using samples are well studied, e.g., for priority sampling [37], which has been shown to be optimal for aggregating SUM. However, it is unclear what the implications of such error bounds are on prediction results. We provide both analytical and experimental evidence showing that aggregation errors add up to the forecasting model's noise (from historical data points), and they together decide how confident we could be about the prediction. We give a formal analysis for a concrete forecasting model (i.e., ARMA(1, 1), defined later), and experimental results for more complicated models (e.g., LSTM).

The second question is about the space budget needed for samples (Section 4). Uniform sampling provides unbiased estimations for aggregations, but the estimation error is proportional to the range of a measure (max − min) [28]. Weighted sampling schemes offer optimal estimations, with much better error bounds that are independent of the range [37]; however, as the sampling distribution is decided by the measure values [10, 21], we have to draw one weighted sample per measure independently. When there are a number of measures (as in our scenario), the total space consumption could be prohibitive. We propose a new sampling scheme called GSW (Generalized Smoothed Weighted) sampling, with sampling distributions that can be arbitrarily specified. We analyze estimation error bounds by quantifying the correlation between the sampling distribution and measure values. We introduce how to use this scheme to generate a compact sample which takes care of multiple measures, using a sampling distribution that averages distributions of different measures, and analyze its estimation error bounds.

We describe how FlashP is implemented in Section 5. We report experimental results on real datasets in Section 6, and discuss related work in Section 7. All missing proofs are deferred to Appendix A.
Time-series data model.
The input to our forecasting pipeline is a time series of relation T, i.e., a sequence of observed rows at successive times. We assume that time is a discrete variable here. A row in the table T is specified by a pair (i, t): an item i and a time stamp t. An item is associated with multiple dimensions, each denoted by a, which are used to filter data (e.g., Age and Location), and multiple measures, each denoted by m, which we want to analyze and forecast (e.g., Impression and ViewTime). We use a_{i,t} and m_{i,t} to denote the values of a dimension and a measure, respectively, on each row; or, simply, a_i and m_i, if the time stamp t is clear from the context or not important. The schema of T is (a^(1), a^(2), ..., a^(d_a); m^(1), m^(2), ..., m^(d_m); t).

Real-time forecasting task. A forecasting task is specified as:

    FORECAST SUM(m) FROM T WHERE C USING (t_s, t_e)    (1)
    OPTION (MODEL = 'model_x', FORE_PERIOD = t_future)

Here, m is the measure we want to forecast from data source T on a given set of rows satisfying the constraint C. FlashP allows C to be any logical expression on the dimension values of a^(1), ..., a^(d_a). We want to use historical data from time stamp t_s to t_e to fit the forecasting model. We can also specify the forecasting model we want to use in the OPTION clause with the parameter MODEL, and the number of future time stamps we want to predict with the parameter FORE_PERIOD (e.g., 7 days). In most of our application scenarios, we care about SUM aggregation, e.g., the total number of impressions from a certain group of users; FlashP can also support COUNT and AVG.

FlashP facilitates the following class of forecasting models. Let M_t be the value of metric M (= SUM(m) in our case) at time t. A model is specified by a time-dependent function f_t with order K:

    M_t = f_t(M_{t-1}, M_{t-2}, ..., M_{t-K}).    (2)

The model f_t is fitted on historical data M_1, M_2, ..., M_t with training data tuples, each in the form of (M_t'; M_{t'-1}, ..., M_{t'-K}), with M_{t'-1}, ..., M_{t'-K} as the input to f_t and M_t' as the output, for t' = t, t-1, ..., K+1. The fitted model can then be used to forecast future values M_{t+1}, M_{t+2}, ... as M̂_{t+h|t} for h = 1, 2, ..., iteratively (i.e., M̂_{t+1|t} can be used to forecast M_{t+2}).

For the forecasting models supported by FlashP, we now give a brief introduction to both classic ones, e.g., ARMA [26], and those based on recurrent neural networks and LSTM [29].
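Before turning to the concrete models, here is a minimal Python sketch of how the training data tuples above can be assembled from a daily series (the function name and the toy data are our own illustration, not FlashP's API):

```python
import numpy as np

def make_training_tuples(M, K):
    """Build tuples (M_t; M_{t-1}, ..., M_{t-K}) from a series M_1, ..., M_T.

    M is 0-indexed here, so tuples exist for the last len(M) - K points,
    matching t' = K+1, ..., T in the text.
    """
    X, y = [], []
    for t in range(K, len(M)):
        X.append(M[t - K:t][::-1])  # inputs M_{t-1}, ..., M_{t-K}, most recent first
        y.append(M[t])              # output M_t
    return np.array(X), np.array(y)

# Example: 91 daily aggregates and a model of order K = 7 -> 84 training tuples
M = np.random.rand(91)
X, y = make_training_tuples(M, K=7)
print(X.shape, y.shape)             # (84, 7) (84,)
```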
Forecasting using ARMA.
An ARMA (autoregressive moving average) [26] model uses stochastic processes to model how M_t is generated and evolves over time. It assumes that M_t is a noisy linear combination of the previous p values; each u_t is an independent identically distributed zero-mean random noise at time t, and historical noise at the previous q time stamps impacts M_t too:

    ARMA(p, q): M_t = Σ_{i=1}^p α_i M_{t-i} + u_t + Σ_{i=1}^q β_i u_{t-i}.    (3)

For example, an ARMA(2, 1) model has the form M_t = α_1 M_{t-1} + α_2 M_{t-2} + u_t + β_1 u_{t-1}. It falls into the form of (2), and is parameterized by α_1, ..., α_p and β_1, ..., β_q. The model can be fitted on M_1, ..., M_t, using, e.g., least squares regression, to find the values of α_i and β_j which minimize the error term, and used to forecast future values. We can estimate forecast intervals, i.e., confidence intervals for forecasts M̂_{t+h|t}: with a confidence level (probability) γ, the true future value M_{t+h} is within [M̂_{t+h|t} − l_γ, M̂_{t+h|t} + r_γ].

When there are deterministic trends over time, the differencing method can be used: the first order difference is defined as ∇M_t = M_t − M_{t-1}, the second order as ∇²M_t = ∇M_t − ∇M_{t-1}, and, in general, the d-th order as ∇^d M_t = ∇^{d-1} M_t − ∇^{d-1} M_{t-1}. If {∇^d M_t}_t is an ARMA(p, q) model, {M_t}_t is an ARIMA(p, d, q) model.
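FlashP's ARIMA component is an auto-tuned library built on X-13ARIMA-SEATS (Section 5); as a hedged stand-in, the sketch below shows the same fit-and-forecast flow with the statsmodels library, where the order (2, 1, 1), the synthetic series, and the 95% level are illustrative assumptions:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# 150 noisy daily aggregates standing in for estimated SUM(Impression)
rng = np.random.default_rng(0)
M_hat = 1000 + 5 * np.arange(150) + rng.normal(0, 20, 150)

# Fit ARIMA(p, d, q); FlashP tunes (p, d, q) automatically, here fixed by hand
res = ARIMA(M_hat, order=(2, 1, 1)).fit()

# Forecast the next 7 days with forecast intervals
fc = res.get_forecast(steps=7)
print(fc.predicted_mean)            # point forecasts M̂_{t+h|t}
print(fc.conf_int(alpha=0.05))      # 95% forecast intervals [lower, upper]
```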
Forecasting using LSTM. LSTM (long short-term memory) [29] is a network architecture that extends the memory of recurrent neural networks using a cell. It is natural to use LSTM to learn and memorize trends and patterns of time series for the purpose of forecasting. In a typical LSTM-based forecasting model, e.g., [35], an LSTM unit with dimensionality d in the output space takes (M_{t-1}, ..., M_{t-K}) as the input; the LSTM unit is then connected to a d × 1 fully-connected layer which outputs the forecast of M_t. This model also falls into the general form (2). Here, the cell state is evolving over time and, at each time stamp t, is encoded in f_t, which can be learned from training data tuples (M_t; M_{t-1}, ..., M_{t-K}) in order.
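A minimal Keras sketch of this architecture follows (Keras is what FlashP deploys, per Section 5, but the values of K and d and the random training data below are placeholders of our own; the paper's defaults are not recoverable from the text):

```python
import numpy as np
from tensorflow import keras

K, d = 7, 16                            # order K and LSTM output dimensionality d (assumed)

model = keras.Sequential([
    keras.Input(shape=(K, 1)),          # K past values M_{t-1}, ..., M_{t-K}
    keras.layers.LSTM(d),               # LSTM unit; output vector h_t of size d
    keras.layers.Dense(1),              # d x 1 fully-connected layer -> forecast of M_t
])
model.compile(optimizer="adam", loss="mse")

X = np.random.rand(84, K, 1)            # e.g., windows built as in the earlier sketch
y = np.random.rand(84)
model.fit(X, y, epochs=10, verbose=0)
```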
Our system FlashP works in two online phases to process a forecasting task: preparing training data points by issuing aggregation queries, and fitting the forecasting model using the training data.

• Aggregation query.
In a forecasting task, specified in (1), to predict SUM(m), we have t_e − t_s + 1 historical data points:

    M_{t_s} = SELECT SUM(m) FROM T WHERE C AND t = t_s
    ......
    M_{t_e} = SELECT SUM(m) FROM T WHERE C AND t = t_e    (4)

Each data point is given by an aggregation query with constraint C, which can be any logical expression on the dimension values of a^(1), ..., a^(d_a). We would compute them in the online phase (a sketch of how such queries can be generated follows this list).

• Forecasting.
In the next online phase, we use M_{t_s}, ..., M_{t_e} as training data to fit the forecasting model. And we use the model to predict future aggregations, M_{t_e+1}, ..., M_{t_e+t_future}.
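As an illustration of the first phase, the sketch below expands a task into the per-time-stamp queries of (4); it is a hypothetical stand-in for FlashP's Query Rewriter (Section 5), with time stamps simplified to integers:

```python
def rewrite_to_aggregations(measure, table, constraint, t_s, t_e):
    """Expand a forecasting task into one SUM query per time stamp, as in (4)."""
    return [
        f"SELECT SUM({measure}) FROM {table} WHERE {constraint} AND t = {t}"
        for t in range(t_s, t_e + 1)
    ]

# The task of Figure 2: 91 queries, one per day
queries = rewrite_to_aggregations(
    "Impression", "T", "Age <= 30 AND Gender = 'F'", 1, 91)
print(len(queries), queries[0])
```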
Performance bottleneck. Suppose we have N rows in T for each time stamp, and use a history of t_0 = t_e − t_s + 1 time stamps to train a forecasting model with size (number of weights) s. The total cost of processing a forecasting task is O(t_0 · N) + Train(t_0, s), where O(t_0 · N) is the cost of processing t_0 aggregation queries by scanning the table T, and Train(t_0, s) is the time needed to train a forecasting model with size s using t_0 training data tuples. Train(t_0, s) is in the form of, e.g., O(t_0 · s · iter), where iter is the number of iterations for the model training to converge. We typically have t_0 in hundreds (t_0 = 365 if we use one year's history for training with one data point per day), N (the number of rows in T per day) in tens or hundreds of millions in our application, and s in tens. Therefore, as t_0, s ≪ N, processing of aggregation queries is the performance bottleneck (even in comparison to training a complex model).

Real-time forecasting on approximate aggregations.
In order to process a forecasting task in an interactive way in FlashP, we propose to estimate M_{t_s}, ..., M_{t_e} as M̂_{t_s}, ..., M̂_{t_e} from offline samples drawn from T, and use these estimates to form training data tuples and to fit the model (2). There are several questions to be answered: i) if estimates M̂_i, instead of M_i, are used to fit the forecasting model, how much would the prediction deviate; ii) how to draw these samples efficiently (preferably in a distributed manner); iii) how much space we need to store these samples for multiple measures.

We first analyze how sampling and approximate aggregations impact model fitting and, consequently, forecasts. We give an analytical result for a special case of the ARMA model, and will conduct an experimental study for more complex models, e.g., LSTM-based ones, in Section 6. It is not surprising that there is a tradeoff between sampling rate (or, the quality of estimated aggregations) and forecast accuracy.
Required properties of sampling and estimates. In FlashP, we have no access to the accurate value of M_t; instead, we have M̂_t estimated from offline samples. We require the estimates to have two essential properties for the fitting of forecasting models:

• (Unbiasedness) M̂_t = M_t + ε_t with E[ε_t] = 0;
• (Independence) ε_t's for different time stamps t are independent.

GSW sampling, introduced in Section 4, offers unbiased estimates, with bounded variance of ε_t. GSW samplers are run independently on the data for each t — that is how we get independence.

Impact on ARMA(p, q).
When an ARMA(p, q) model has to be trained only on noisy metric values {M̂_t}, we rewrite (3) as

    M̂_t = Σ_{i=1}^p α_i M̂_{t-i} + (u_t + ε_t)    (5)
          + { Σ_{i=1}^p (β_i u_{t-i} − α_i ε_{t-i}) + Σ_{i=p+1}^q β_i u_{t-i}   (0 < p ≤ q)
            { Σ_{i=1}^q (β_i u_{t-i} − α_i ε_{t-i}) − Σ_{i=q+1}^p α_i ε_{t-i}   (0 < q ≤ p)

In both cases, it is a combination of an autoregressive model of order p, and a moving average model of order max{p, q}. The model differs from the ARMA model in (3) only in the additional zero-mean error terms ε_t, which are independent of the other terms in the model and have known variance (for fixed sampling and estimation methods). Thus, the model can be fitted on {M̂_t} using, e.g., a maximum likelihood estimator, as normal ARMA models. ε_t increases the model's uncertainty, and, together with the model noise terms u_t, decides the confidence of the model prediction.

For a fixed confidence level, the narrower the forecast intervals are, the more confident we are about the prediction. Their widths are proportional to the standard deviations of forecasts of M̂_t, which in turn depend on the variances of the noise terms u_t's and ε_t's.

In comparison to the original ARMA model in (3), the additional noise term ε_t will indeed incur wider forecast intervals. However, with a proper sampling-estimation scheme, if ε_t's variance is negligible in comparison to u_t's, ε_t will have little impact on the forecast error/interval, which will be demonstrated in our experiments later. Here, we give a formal analysis for the ARMA(1, 1) model:

PROPOSITION 1. Suppose we have a time series {M_t} satisfying ARMA(1, 1). Let M̂_t = M_t + ε_t be an estimation of M_t satisfying unbiasedness and independence. Then, Var[M̂_t] = a · σ_u² + σ_ε², where σ_u² = Var[u_t], σ_ε² = Var[ε_t], and a = (1 + 2α_1β_1 + β_1²)/(1 − α_1²) is a constant decided by the parameters in ARMA(1, 1).

We can use normal random variables to approximate u_t and ε_t, and estimate forecast intervals for a given confidence level.
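Proposition 1 is easy to check numerically. The following Monte-Carlo sketch simulates an ARMA(1, 1) series, adds independent zero-mean estimation noise, and compares the empirical variance of M̂_t with a · σ_u² + σ_ε²; all parameter values are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha1, beta1, sigma_u, sigma_eps = 0.6, 0.4, 1.0, 0.5

T = 100_000
u = rng.normal(0, sigma_u, T)
M = np.zeros(T)
for t in range(1, T):                    # M_t = alpha1*M_{t-1} + u_t + beta1*u_{t-1}
    M[t] = alpha1 * M[t - 1] + u[t] + beta1 * u[t - 1]
M_hat = M + rng.normal(0, sigma_eps, T)  # unbiased, independent estimates

a = (1 + 2 * alpha1 * beta1 + beta1**2) / (1 - alpha1**2)
print(np.var(M_hat[1000:]))              # empirical variance (burn-in dropped)
print(a * sigma_u**2 + sigma_eps**2)     # predicted: a*sigma_u^2 + sigma_eps^2
```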
Impact on LSTM-based model. LSTM can be naturally applied to forecasting tasks thanks to its ability to memorize trends and patterns of time series. Figure 4 depicts such a forecasting model and illustrates where noise in the estimates impacts the model fitting.

At time stamp t, we want to forecast M_t with the previous K metric values M_{t-1}, ..., M_{t-K} and the "memory" (c_{t-1}, h_{t-1}). However, we have only estimates M̂_{t-i} = M_{t-i} + ε_{t-i} available; these estimates are fed into the LSTM unit as inputs. LSTM then generates an output vector h_t and updates its memory cell from c_{t-1} to c_t. h_t can be interpreted as the current status of LSTM, and it is used as the input to a fully-connected layer for forecasting M_t. Again, we have only M̂_t = M_t + ε_t available as an approximation, which we learn to fit with LSTM and the fully-connected layer. c_t and h_t are used to deliver memory to the next time stamp.

Noise terms ε_t's may affect how the weight matrices in LSTM and the fully-connected layer are learned and the values of c_t and h_t derived (via linear transformations and activation functions). We conjecture that the LSTM-based forecasting model performs well as long as the estimates M̂_t's are accurate enough (or the ε_t's are small enough). It is difficult to derive any formal analytical result here, but we will evaluate it experimentally in Section 6.
Figure 4: An LSTM-based forecasting model with noisy inputs
We now focus on how to draw samples for estimating the results of aggregation queries. An aggregation query in (4) essentially computes the sum of a subset of measure values in a relation T at time t; the subset is decided online by the constraint C specified in the forecasting task. W.l.o.g., suppose the subset of rows satisfying C is [n] = {1, ..., n}; we want to estimate the metric M = Σ_{i=1}^n m_i, for a measure m = [m_i]. The (offline) sampling algorithm should be independent of C, but could depend on m.

Existing samplers.
There are two categories of sampling schemes for estimating subset sums: uniform sampling and weighted sampling. In uniform sampling, each row in the relation is drawn into the sample with equal probability; an unbiased estimation for M is simply the rescaled sum of values in the sample. It has been used in online aggregation extensively, but its main deficiency is that the error bound is proportional to the difference between the maximum and minimum (or, the range of) measure values [28].

In weighted sampling, each row i is drawn with probability proportional to m_i, so that we can remove the dependency of the estimation's error bound on the range of the measure values. More concretely, [17] and [19] give efficient implementations of such a sampling distribution: for some fixed constant τ (deciding the sample size), if m_i < τ, the probability of drawing row i is m_i/τ, and if m_i ≥ τ, multiple (roughly m_i/τ) copies of i are included in the sample. Threshold sampling [20] and priority sampling [10, 21] differ from the above in that, if m_i ≥ τ, exactly one copy of i is included. Priority sampling has been shown to be optimal [37] in terms of sampling efficiency, with relative standard deviation √(Var[estimation])/M = √(1/(sample_size − 1)).

Requirements for the sampler in FlashP.
Weighted sampling offers much better sampling efficiency than uniform sampling, especially for heavy-tailed distributions, which are common in practice. However, in all the above weighted sampling schemes, the sampling distribution is decided by the measure's values. In FlashP, we have d_m different measures to forecast; and thus, we have to draw d_m such weighted samples independently; when d_m is large, the storage cost of all the samples (even with a small sampling rate) is prohibitive in memory. Therefore, the requirement here is: whether we can generate a compact sample which takes care of multiple measures and still has accuracy guarantees.

To this end, we propose GSW (Generalized Smoothed Weighted) sampling. We introduce its sampling distribution and analyze its sampling efficiency in Section 4.1. We utilize the "generality" of its sampling distribution to generate a compact sample which takes care of multiple measures, and analyze under which conditions it gives provable accuracy guarantees in Section 4.2.

4.1 Generalized Smoothed Sampling

The GSW sampling scheme is parameterized by a positive constant Δ and positive sampling weights w = [w_i]: each row i in the given relation T is drawn into the sample S_Δ with probability w_i/(Δ + w_i), independently. For fixed sampling weights, Δ decides the sample size; and the choice of w decides the estimation accuracy.

Inspired by the Horvitz-Thompson estimator [30], we define the calibrated measure to be m̂_i = m_i(Δ + w_i)/w_i if row i is drawn into the sample S_Δ, and m̂_i = 0 otherwise. We would associate m̂_i with each sample row i and store it in S_Δ. Formally, we have

    Pr[i ∈ S_Δ ∧ m̂_i = m_i(Δ + w_i)/w_i] = w_i/(Δ + w_i),
    Pr[i ∉ S_Δ ∧ m̂_i = 0] = 1 − w_i/(Δ + w_i) = Δ/(Δ + w_i).    (6)

We can estimate M as M̂ = Σ_{i∈S_Δ} m̂_i, which is unbiased since

    E[M̂] = Σ_{i=1}^n E[m̂_i] = Σ_{i=1}^n m_i(Δ + w_i)/w_i · w_i/(Δ + w_i) = M.
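A minimal sketch of a GSW sampler and the estimator M̂ (the heavy-tailed data, the value of Δ, and the predicate standing in for the online constraint C are illustrative assumptions):

```python
import numpy as np

def gsw_sample(m, w, Delta, rng):
    """Keep row i with prob w_i/(Delta + w_i); return kept indices and
    calibrated measures m_hat_i = m_i * (Delta + w_i) / w_i, as in (6)."""
    keep = rng.random(len(w)) <= w / (Delta + w)
    m_hat = m[keep] * (Delta + w[keep]) / w[keep]
    return np.flatnonzero(keep), m_hat

rng = np.random.default_rng(2)
m = rng.pareto(2.0, 1_000_000) + 1           # heavy-tailed measure values
w = m.copy()                                 # optimal GSW: w = m (Corollary 4 below)
idx, m_hat = gsw_sample(m, w, Delta=1e4, rng=rng)

# Estimate the subset sum for an online constraint C (here: an arbitrary predicate)
in_C = idx % 10 == 0                         # sampled rows satisfying C
print(m_hat[in_C].sum())                     # estimate M̂ of the subset sum
print(m[np.arange(len(m)) % 10 == 0].sum())  # exact M, for comparison
```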
Simple and efficient implementations. A GSW sampler can be easily implemented in a distributed manner: each row i generates a uniformly random number p_i from [0, 1], independently; according to (6), if p_i ≤ w_i/(Δ + w_i), the row i is put into the sample S_Δ.

A GSW sample S_Δ can be maintained in an incremental way. Suppose we have drawn S_Δ from rows [n] for some fixed Δ. If more rows n+1, ..., n+k are coming, we want to increase Δ to Δ′ and obtain a GSW sample S_Δ′ from [n+k] with size comparable to |S_Δ|. Note that row i is in S_Δ if and only if Δ ≤ (1/p_i − 1)w_i. Suppose rows in S_Δ are sorted by (1/p_i − 1)w_i in an ascending order; we only need to delete those with Δ ≤ (1/p_i − 1)w_i < Δ′ from S_Δ, and insert those with Δ′ ≤ (1/p_i − 1)w_i, for i = n+1, ..., n+k. In this way, we update S_Δ to S_Δ′ without touching any row in [n] − S_Δ.
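A sketch of this incremental update; it relies on the observation above that row i survives at threshold Δ iff Δ ≤ (1/p_i − 1)w_i (the data layout and function names are our own):

```python
def survival_key(p, w):
    # Largest Delta for which a row with draw p and weight w stays sampled:
    # p <= w/(Delta + w)  <=>  Delta <= (1/p - 1) * w
    return (1.0 / p - 1.0) * w

def grow_sample(sample, new_rows, Delta, Delta_new):
    """Update S_Delta to S_Delta' (Delta' >= Delta) when rows n+1..n+k arrive.

    `sample` and `new_rows` are lists of (p_i, w_i, row) triples; only rows
    already in the sample and the new rows are touched. Keeping `sample`
    sorted by survival_key makes the deletion a prefix removal.
    """
    assert Delta_new >= Delta
    kept = [r for r in sample if survival_key(r[0], r[1]) >= Delta_new]
    added = [r for r in new_rows if survival_key(r[0], r[1]) >= Delta_new]
    return kept + added
```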
We now analyze estimation errors of the class of GSW-based estimators M̂, instantiated by (Δ, w). We consider relative standard deviation (RSTD) and relative error (RE). As M̂ is unbiased,

    RSTD(M̂) ≜ √(E[((M̂ − M)/M)²]) ≥ E[|M̂ − M|/M] ≜ RE(M̂).

The goal of our analysis is to establish a relationship between the sample size and the (expected) RSTD and RE. Intuitively, when the sampling weight w_i is "consistent" with the measure m_i, the estimation error is minimized (for a fixed sample size). We introduce the following notation to quantify the "consistency".

DEFINITION 2 ((θ, θ̄)-consistency). Sampling weights w = [w_i] are (θ, θ̄)-consistent with measure m = [m_i], iff θ = min_i(m_i/w_i) and θ̄ = max_i(m_i/w_i). θ* ≜ θ̄/θ is called the consistency scale.

The above notion of "consistency" says, for any row i, θ ≤ m_i/w_i ≤ θ̄. It allows m and w to differ in scale (e.g., both θ and θ̄ could be large) but measures their similarity in patterns and trends. For example, if m = c · w for a constant c, then θ = θ̄ = c and θ* = 1. In general, we have θ* ≥ 1, and the following theorem shows that the relative error has an upper bound proportional to √θ*.

THEOREM 3 (Sampling efficiency of GSW). Suppose the sampling weights w used in GSW sampling are (θ, θ̄)-consistent with the measure values m; then the estimate M̂ is unbiased and has expected relative standard deviation and error bounded by:

    RE(M̂) ≤ RSTD(M̂) ≤ √((θ̄/θ) / E[|S_Δ|]) = √(θ* / E[|S_Δ|]).

An open question raised by Alon et al. when introducing priority sampling [10] is: whether we can provide any error bound if the subset sum on a measure (m) is estimated using a priority sample drawn based on a different measure (w). Theorem 3 answers this question in GSW sampling by specifying under what condition ((θ, θ̄)-consistency) there is an error bound. We can choose w to minimize the variance of the estimation, while the expected sample size is bounded. Please refer to Appendix A.3 for a precise formulation and the optimal solution. From Theorem 3, a nearly-optimal solution is w = m, as in this case, w is (1, 1)-consistent with m (θ* = 1). We call it the optimal GSW sampler. Directly from Theorem 3, we have
COROLLARY 4 (Optimal GSW sampler). If we use w = m (w_i = m_i for each row i) as the sampling weights, we have

    RE(M̂) ≤ RSTD(M̂) ≤ √(1 / E[|S_Δ|]).
The optimal GSW sampler has sampling efficiency comparable to the best known sampler for estimating subset sums, e.g., priority sampling [10] with RSTD = √(1/(sample_size − 1)). We will compare their empirical performance in Section 6.

The size of a GSW sample can be controlled by the parameter Δ. When there is only one measure m, we draw an optimal GSW sample (setting w = m). A more common scenario is that we have multiple measures (e.g., in Figure 1) in one relation. We can draw one optimal GSW sample per measure, which, however, increases the space consumption significantly. The question is whether we can use one sample to take care of multiple measures.

Suppose there are k measures in a relation to be aggregated and predicted, m^(1), ..., m^(k). For each measure j, we want to estimate M^(j) = Σ_{i=1}^n m_i^(j) for rows in a set [n] (satisfying the constraint C in a forecasting task). A GSW sample S_Δ is drawn using weights w = [w_i]. The calibrated measure on each sample row i for each measure j is m̂_i^(j) = m_i^(j)(Δ + w_i)/w_i. From Theorem 3, M̂^(j) = Σ_{i∈S_Δ} m̂_i^(j) is an unbiased estimation of M^(j).

Intuitively, if the chosen sampling weight vector w centers around m^(1), ..., m^(k) and is not too far away from any of the k, from Theorem 3, the error can be better bounded. We can find such centers by taking the averages of the measures: on each row, the geometric mean of the k measure values, or their arithmetic mean. We now analyze how the error can be bounded for these two choices.

Geometric compressed GSW sampling.
We can use the geometric mean of the k measures as the sampling weights:

    w×_i = (Π_{j=1}^k m_i^(j))^{1/k}.    (7)

Among the k measures to be grouped, define the trend deviation between any two measures m^(p), m^(q) (for p, q ∈ [k]):

    ρ̄_{p,q} ≜ max_{i∈[n]} (m_i^(p)/m_i^(q)),  ρ_{p,q} ≜ min_{i∈[n]} (m_i^(p)/m_i^(q)),  and  ρ*_{p,q} ≜ ρ̄_{p,q}/ρ_{p,q},    (8)

which measures how similar the trends (instead of the absolute values) of the two measures are. The smaller ρ*_{p,q} is, the more similar m^(p) and m^(q) are. For example, if m^(p) = c · m^(q) for a constant c, ρ*_{p,q} = c/c = 1. Define ρ* ≜ max_{p,q∈[k]} ρ*_{p,q} to be the maximum deviation among the k measures. From Theorem 3,

COROLLARY 5. If we use w× = [w×_i] as the sampling weights for a relation with k measures, with a GSW sample S_Δ, we can estimate M^(p) as M̂^(p) for each measure p with error

    RE(M̂^(p)) ≤ RSTD(M̂^(p)) ≤ √(Π_{j:j≠p} ρ*_{p,j}^{1/k} / E[|S_Δ|]) ≤ √((ρ*)^{(k−1)/k} / E[|S_Δ|]).
GSW sampling.
Another choice is to usethe arithmetic mean as the sampling weights: 𝑤 + 𝑖 = 𝑘 𝑘 Õ 𝑗 = 𝑚 ( 𝑗 ) 𝑖 . (9)Define the range deviation 𝛿 among 𝑘 measures: for each row 𝑖 ,consider the ratio between the maximum measure and the minimumone; 𝛿 is the maximum ratio among all rows: 𝛿 , max 𝑖 ∈[ 𝑛 ] max 𝑗 ∈[ 𝑘 ] 𝑚 ( 𝑗 ) 𝑖 min 𝑗 ∈[ 𝑘 ] 𝑚 ( 𝑗 ) 𝑖 ! . (10)From the definitions, for any measure 𝑝 , we have / 𝛿 ≤ 𝑚 ( 𝑝 ) 𝑖 / 𝑤 + 𝑖 ≤ 𝛿 . Thus, directly from Theorem 3, we have the following bound.C OROLLARY If we use w + = [ 𝑤 + 𝑖 ] as the sampling weightsfor a relation with 𝑘 measures, with a GSW sample 𝑆 Δ , we canestimate 𝑀 ( 𝑝 ) as ˆ 𝑀 ( 𝑝 ) for each measure 𝑝 with error RE ( ˆ 𝑀 ( 𝑝 ) ) ≤ RSTD ( ˆ 𝑀 ( 𝑝 ) ) ≤ s 𝛿 E [|S Δ |] . How to group measures?
How to group measures? In the above, we have shown, for chosen sampling weights, how the error can be bounded with some parameters (ρ* and δ) decided by the data about a group of measures. When there are many measures, ρ* and δ could be huge, and thus the above error bounds are not informative. We want to partition measures into small groups of size 4-5 based on their correlation, so that within each group one GSW sample gives good estimations for the measures. To this end, we establish a relationship between (θ, θ̄)-consistency and L1 distance as follows. (Footnote: or, equivalently, m^(q) is (ρ_{p,q}, ρ̄_{p,q})-consistent with m^(p).)

Figure 5: Aggregation error and L1 distance. (a) Aggregation error (%) of Impression, Click, Favorite, and Cart under three ways to group the four measures (imp = Impression, clk = Click, fav = Favorite, cart = Cart): g1:imp-clk/g2:fav-cart, g1:imp-fav/g2:clk-cart, g1:imp-cart/g2:clk-fav. (b) L1 distance between measure vectors and the sampling weight vector (the arithmetic mean of the measure vectors in a group).

PROPOSITION 7 (consistency and L1 distance). Suppose sampling weights w = [w_i] are (θ, θ̄)-consistent with measure m = [m_i]. We normalize w as w′ = [w′_i = w_i/Σ_j w_j]_i, and similarly, m as m′. Let θ* = θ̄/θ. We have ‖m′ − w′‖₁ ≤ θ* − 1.

(θ, θ̄)-consistency is a "worst-case" notion about all the rows. It is not a good metric for grouping measures, because first, it is expensive to compute θ*, and second, an aggregation query may not touch all the rows. Instead, we use L1 distance (after normalization) as in Proposition 7 to quantify the correlation between two measures. Intuitively, it quantifies how similar two measures are in trends and patterns regardless of their absolute values.

We consider a formulation based on the KCENTER problem [27]. The goal is to partition the measures into g groups so that the max L1 distance (after normalization) between any measure and the center of the group it belongs to is minimized. Here, the L1 distance between two measures can be estimated using a sample of rows. We can apply the standard greedy algorithm [27] to find a 2-approximation. Within each group, we use the geometric/arithmetic mean as the sampling weight vector. It is a heuristic strategy without any formal guarantee, but from Proposition 7 and the triangle inequality in L1, at least we know that we had better not group two measures that are far away (e.g., at distance > 2(θ* − 1)) together, as there is no sampling weight vector that is consistent with both of them at the same time.
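The greedy algorithm mentioned above is the textbook 2-approximation for k-center; a sketch over normalized L1 distances (our illustration, not FlashP's exact code):

```python
import numpy as np

def l1_normalized(u, v):
    u = u / u.sum()
    v = v / v.sum()
    return np.abs(u - v).sum()

def greedy_kcenter(measures, g):
    """Partition measure vectors into g groups: repeatedly make the point
    farthest from the current centers a new center, then assign to nearest."""
    centers = [0]                       # arbitrary first center
    dist = np.array([l1_normalized(x, measures[0]) for x in measures])
    while len(centers) < g:
        far = int(dist.argmax())        # farthest measure becomes a center
        centers.append(far)
        dist = np.minimum(
            dist, [l1_normalized(x, measures[far]) for x in measures])
    return [min(centers, key=lambda c: l1_normalized(x, measures[c]))
            for x in measures]          # center index per measure
```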
We conduct a preliminary evaluation of the above correlation metric, L1 distance (after normalization). Using the same datasets and workloads (with selectivity ranging from 0.5% to 5%) as Exp-I in Section 6, the experimental results are plotted in Figure 5. Here, the four measures, Impression, Click, Favorite, and Cart, are partitioned into two equal-size groups (there are a total of three ways of partitioning). For example, the first way in Figure 5 is group1 = {Impression, Click} and group2 = {Favorite, Cart}. For each group, we pick the sampling weight vector as the arithmetic mean of the measure vectors in this group. Figure 5(b) plots the L1 distance between each measure vector and the sampling weight vector, and Figure 5(a) plots the aggregation error using this sampling weight vector in GSW. It can be seen that aggregation errors and L1 distances have similar trends. That is, if a sampling weight vector is close to a measure vector, it gives better estimation; thus, if two measure vectors are close in L1, it is better to put them in one group (their mean can be used as the sampling weight vector). In Section 6, we focus on experimental evaluation after the grouping is fixed. We will leave the development and analysis of better grouping strategies as future work.
Figure 6: Deployment of the system FlashP
We have implemented and deployed FlashP in Alibaba's ads analytical platform. Figure 6 shows the important components of FlashP.

Offline Sample Preprocessor of FlashP is built on Alibaba's distributed data storage and analytics service, MaxCompute [1]. Time series of relations, partitioned by time, are stored in MaxCompute's data warehouse. Relations can be joined, e.g., tables UserProfile and AdTraffic are joined on UserID. The GSW sampler is implemented as UDFs (user-defined functions), and draws samples from one relation or the view of joined relations. A set of samples of different sizes (with increasing values of Δ) are drawn and stored as Multi-layer Samples of Relations, for a response time-prediction accuracy tradeoff.

Online Forecasting Service first pulls Multi-layer Samples of Relations into Alibaba's in-memory OLAP engine, Hologres [1], to enable real-time response. There are 30 servers in the cluster to support this service, each with 96 CPU cores and 512G memory. Sample data is partitioned by time in the OLAP engine.

Users submit forecasting tasks to an Application Server via a Web UI. For a task, the aggregation queries in (4) are generated by the Query Rewriter and processed on samples in the OLAP engine to obtain estimated answers. These estimations are used as training data to fit the Forecasting Model, which is built on a Python server. Two models are implemented. One is ARIMA. An open-source library [2] (built on X-13ARIMA-SEATS [3]) that trains ARIMA and automatically tunes for the best values of the parameters p, d, q is used. We also support an LSTM-based model (in Figure 4), which is implemented using Keras [4]. We use the LSTM and fully-connected layers in Keras, with fixed default values of K and d. Other forecasting models can be plugged in here, too. After fitting the model, we send forecasts back to users.
We evaluate our system FlashP under the implementation and hardware specified in Section 5, on a real-life dataset with 11 dimensions about users' profiles and 4 numeric measures to be predicted, including Favorite, Impression, Click, and Cart. There are around 15 million rows per day, and 200 days of data.
Table 1: A summary of results — average forecast errors of Full, PIM, Uniform, Opt-GSW, and C-GSW on the measures Favorite, Impression, Click, and Cart (0.1% sample, Opt-GSW = Optimal GSW, C-GSW = Arithmetic compressed GSW)
We compare different samplers: Uniform sampling, which is also used in [7], is the baseline; Priority, the optimal weighted sampler [21]; and our Optimal GSW and Arithmetic/Geometric compressed GSW introduced in Section 4. We also compare our sampling-based methods with PIM (Partwise Independence Model) [7], which is based on a Bayesian model assuming partial independence.

In the following, forecasting tasks are randomly picked with different measures to be predicted and different combinations of dimensions in their constraints, with some (approximately) fixed selectivity (the fraction of rows satisfying the constraint). By default, we use 150 days' data to fit the model and predict the next 7 days; we report relative aggregation errors (average of the 150 days), relative forecast error, and forecast intervals (average of the next 7 days), taking the average of 400 independent runs of different tasks, together with one standard deviation, for each measure and for each value of selectivity (on independent samples).
Exp-I: A summary of results.
We first give a brief summary of the experimental results. Table 1 reports the average forecast errors on 20 random tasks with selectivity from 0.5% to 5% using ARIMA. "Full" stands for the result when we use the full data to process aggregation queries for training. With a sampling rate of 0.1%, our optimal GSW and compressed GSW in FlashP perform consistently better than Uniform and PIM in terms of forecast errors, and are sometimes very close to Full (the best we can do). These two also offer interactive response time (tens of milliseconds). In the rest of this section, we will report more detailed results about the response time and the performance of different sampling-based methods.
Figure 7: E2E response time (with ARIMA)

Exp-II: Real-time response.
The end-to-end response time is reported in Figure 7, partitioned into the portion for processing (estimating) aggregation queries, and the portion for forecasting (model fitting + prediction using ARIMA). It can be seen that the portion for aggregation queries is the bottleneck, but with sampling, the response time can be reduced from around 20 sec on the full data to 30 ms on a 0.1% sample, which still gives reasonable predictions, as will be shown later. If LSTM is used, the model fitting is much more expensive, but we still have an interactive response time of around 1 sec if a sample is used.

Exp-III: Varying number of time stamps used in training.
We consider different numbers of training data points used to fit the model. We focus on tasks with a fixed selectivity to forecast Impression. The average forecast errors, with one standard deviation, for varying sampling rates using our optimal GSW are plotted in Figure 8. It shows that the number of time stamps used to fit the model has an obvious impact on the forecast error, with 150 (days) giving the most accurate and stable prediction for both ARIMA and LSTM. It motivates us to speed up the processing of aggregation queries, as more time stamps mean more aggregation queries. We also test different selectivities on other measures. The trends are similar.
Figure 8: Number of time stamps used in training (30, 60, 90, 150) v.s. forecast error (Impression); (a) ARIMA, (b) LSTM; sampling rates 100%, 1%, 0.1%, 0.05%, 0.02%

Figure 9: Aggregation error of different sampling methods for varying selectivity and sampling rate (Favorite); (a) selectivity 0.5%, (b) selectivity 5%

Exp-IV: Varying sampling rate and selectivity.
We compare different samplers in FlashP, for varying sampling rate and selectivity. Note that for Arithmetic/Geometric compressed GSW, one sample suffices for the relation; Priority and Optimal GSW are measure-dependent, and thus using either of them we need four samples (one per measure), with the total space consumption four times that of Uniform and compressed GSW for a fixed sampling rate.

For tasks on Favorite with selectivity 0.5%-5%, Figure 9 reports aggregation errors, and Figures 10-11 report forecast errors when using ARIMA and LSTM as forecasting models; the results on Impression are plotted in Figures 13-14. In terms of both aggregation errors and forecast errors, Priority and Optimal GSW are very close and better than the others (indeed, at the cost of storing four samples). In some cases, Optimal GSW is even slightly better than Priority (which is theoretically optimal). This is because, in Priority, if the measure is above some threshold, a row is included in the sample deterministically, which favors the long tail; however, the long tail may or may not satisfy the constraint specified online in the forecasting task. Uniform is the worst one, which is consistent with its analytical error bound [28]. Arithmetic/Geometric compressed GSW needs only one sample, too; they are better than Uniform, and get very close to Priority and Optimal GSW as the sampling rate grows. For larger selectivity, with more rows satisfying the constraint included in the samples, every sampler gets better.

Figure 12(a) reports widths of forecast intervals of ARIMA with a fixed confidence level on measure Favorite for varying sampling rate.
Figure 10: Forecast error of different sampling methods for varying sampling rate (selectivity 0.5%, Favorite); (a) ARIMA, (b) LSTM

Figure 11: Forecast error of different sampling methods for varying sampling rate (selectivity 5%, Favorite); (a) ARIMA, (b) LSTM

Figure 12: Forecast intervals (ARIMA) with different sampling methods for varying sampling rate (selectivity 0.5%, Favorite); (a) varying sampling rate, (b) forecast intervals of a real query on 0.02% sampling
With a larger sampling rate, every sampler gives narrower forecast intervals, meaning predictions with more confidence. Uniform gives the widest ones, and Priority and Optimal GSW give the narrowest ones. Figure 12(b) shows the forecast intervals (dashed lines) for one particular task using different samplers.

In terms of forecasting, LSTM performs consistently better than ARIMA, at the cost of a longer response time (as discussed in Exp-II). It is observed that, with increasing sampling rates, when the aggregation error is small enough (e.g., when the sampling rate is 1% in Figures 9-13), it will have little impact on the forecast error/interval in comparison to the case when we use the full data (sampling rate = 100%), because it is negligible in comparison to the model and data's noise (e.g., u_t in (3) for ARIMA). Forecast errors and forecast intervals have similar trends as aggregation errors. Both observations are consistent with our analytical results in Section 3.

Exp-V: Space cost under the same accuracy requirement.
We evaluate the space cost needed to achieve the same accuracy for different samplers. Since Priority and Optimal GSW perform similarly, we focus on Optimal GSW and compare it with Arithmetic compressed GSW. We fix the sample size of Arithmetic compressed GSW (from 0.02% to 1%). And for each measure, we choose the size of an Optimal GSW sample so that it gives approximately the same aggregation error as Arithmetic compressed GSW does. In Figure 15(a), we report the total size of the four Optimal GSW samples (the portions for different measures are stacked and labeled with different colors), and the size of the Arithmetic compressed GSW sample; the x-axis is the max aggregation error of Arithmetic compressed GSW, with the corresponding parameter Δ in brackets. It shows that, under the same accuracy requirement, the total size of the Optimal GSW samples is around 1.8 times the size of the Arithmetic compressed GSW sample. Figure 15(b) reports forecast errors (of ARIMA) using the samples chosen above on different measures. Due to the way we choose the sizes of the Optimal GSW samples, Optimal GSW and Arithmetic compressed GSW give very close forecast errors.

Figure 13: Forecast error of different sampling methods for varying sampling rate (selectivity 0.5%, Impression); (a) ARIMA, (b) LSTM

Figure 14: Forecast error of different sampling methods for varying sampling rate (selectivity 5%, Impression); (a) ARIMA, (b) LSTM

Figure 15: Space cost for given accuracy requirement; (a) Optimal v.s. Compressed (total sample size of Optimal GSW on Impression, Click, Favorite, and Cart v.s. Arithmetic compressed GSW), (b) forecast error (selectivity 5%; Optimal GSW 1.8% v.s. Arithmetic compressed GSW 1%, 0.18% v.s. 0.1%, 0.09% v.s. 0.05%, 0.035% v.s. 0.02%)

Approximate query processing and other samplers.
An orthogonal line of work is approximate query processing (AQP), which has been studied extensively during the last few decades. One can refer to [16] for a comprehensive review. There are two major lines of AQP techniques. i) Online aggregation [18, 28, 33] and online sampling-based AQP [31] either assume that the data is randomly ordered, or need to draw random rows from the data table via random I/O accesses, which is inefficient in our setting. ii) Offline sampling-based AQP draws offline samples before queries come: some are based on historical workloads [5, 8, 14, 15, 22, 32, 36], and some are workload-independent [5, 6, 11, 13, 19, 34].

The most relevant part in the line of AQP techniques is the samplers proposed to estimate aggregations. We have reviewed such samplers at the beginning of Section 4. There are other samplers, such as universe (hashed) sampling and stratified sampling, introduced in AQP systems [8, 31, 34]. These samplers were proposed to handle orthogonal aspects such as missing groups in GroupBy and joins. They can also be used in our system if we want to extend the task class and data schema we support.
Precomputing aggregations.
Another choice is to precompute aggregations or summaries using techniques such as view materialization [9], data cubes [25], histograms [23], and wavelets [12, 23, 39]. These techniques either have too large a space overhead (super-linear) to be applicable for enterprise-scale high-dimensional datasets, or cannot support complex constraints in our forecasting tasks.
Aggregation-forecasting analytics. [7] studies a very similar problem of forecasting multi-dimensional time series. It considers capturing correlations across the high-dimensional space using either Bayesian models or uniform sampling, and shows that the one based on uniform sampling (which is also evaluated in our experiments) offers much better forecast accuracy. Another relevant line of work is about aggregation-disaggregation techniques [38, 40] in the forecasting literature. These techniques share some similarity with the Bayesian model in [7], but focus on one-time offline analysis with smaller-scale and lower-dimensional data.

For multi-dimensional time series, there are works on how to conduct fast similarity search [24], but they are less relevant here.
We introduce a real-time forecasting system FlashP for analyzing time-series relations. There are two core technical contributions, which enable FlashP to handle forecasting tasks on enterprise-scale high-dimensional time series interactively. First, we analyze how sampling and estimated aggregations on time-series relations enable interactive and accurate predictions. Second, we propose GSW sampling, a new sampling scheme; we establish a connection from the difference between GSW's sampling weights and measure values to the accuracy of aggregation estimations. Using this scheme, we can effectively reduce the space (memory) consumption of weighted samples by letting multiple measures share one GSW sample, with the means of these measures' values as sampling weights.
REFERENCES
[5] S. Acharya, P. B. Gibbons, and V. Poosala. Congressional samples for approximate answering of group-by queries. In SIGMOD, pages 487–498, 2000.
[6] S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. The Aqua approximate query answering system. In SIGMOD, pages 574–576, 1999.
[7] D. Agarwal, D. Chen, L. Lin, J. Shanmugasundaram, and E. Vee. Forecasting high-dimensional data. In SIGMOD, pages 1003–1012, 2010.
[8] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: queries with bounded errors and bounded response times on very large data. In EuroSys, pages 29–42, 2013.
[9] S. Agrawal, S. Chaudhuri, and V. R. Narasayya. Automated selection of materialized views and indexes in SQL databases. In VLDB, pages 496–505, 2000.
[10] N. Alon, N. G. Duffield, C. Lund, and M. Thorup. Estimating arbitrary subset sums with few probes. In PODS, pages 317–325, 2005.
[11] B. Babcock, S. Chaudhuri, and G. Das. Dynamic sample selection for approximate query processing. In SIGMOD, pages 539–550, 2003.
[12] K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim. Approximate query processing using wavelets. VLDBJ, 10(2-3):199–223, 2001.
[13] S. Chaudhuri, G. Das, M. Datar, R. Motwani, and V. Narasayya. Overcoming limitations of sampling for aggregation queries. In ICDE, pages 534–542, 2001.
[14] S. Chaudhuri, G. Das, and V. Narasayya. A robust, optimization-based approach for approximate answering of aggregate queries. In SIGMOD, pages 295–306, 2001.
[15] S. Chaudhuri, G. Das, and V. Narasayya. Optimized stratified sampling for approximate query processing. TODS, 32(2):9, 2007.
[16] S. Chaudhuri, B. Ding, and S. Kandula. Approximate query processing: No silver bullet. In SIGMOD, pages 511–519, 2017.
[17] S. Chaudhuri, R. Motwani, and V. Narasayya. On random sampling over joins. In SIGMOD, pages 263–274, 1999.
[18] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. MapReduce online. In NSDI, pages 313–328, 2010.
[19] B. Ding, S. Huang, S. Chaudhuri, K. Chakrabarti, and C. Wang. Sample + Seek: Approximating aggregates with distribution precision guarantee. In SIGMOD, pages 679–694, 2016.
[20] N. G. Duffield, C. Lund, and M. Thorup. Learn more, sample less: control of volume and variance in network measurement. IEEE Trans. Information Theory, 51(5):1756–1775, 2005.
[21] N. G. Duffield, C. Lund, and M. Thorup. Priority sampling for estimation of arbitrary subset sums. J. ACM, 54(6):32, 2007.
[22] V. Ganti, M.-L. Lee, and R. Ramakrishnan. ICICLES: Self-tuning samples for approximate query answering. In VLDB, page 187, 2000.
[23] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. Optimal and approximate computation of summary statistics for range aggregates. In PODS, pages 227–236, 2001.
[24] X. Gong, Y. Xiong, W. Huang, L. Chen, Q. Lu, and Y. Hu. Fast similarity search of multi-dimensional time series via segment rotation. In DASFAA, pages 108–124, 2015.
[25] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Min. Knowl. Discov., 1(1):29–53, 1997.
[26] J. D. Hamilton. Time Series Analysis. Princeton University Press, 1994.
[27] S. Har-Peled. Geometric Approximation Algorithms. American Mathematical Society, 2011.
[28] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD, pages 171–182, 1997.
[29] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, 1997.
[30] D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
[31] S. Kandula, A. Shanbhag, A. Vitorovic, M. Olma, R. Grandl, S. Chaudhuri, and B. Ding. Quickr: Lazily approximating complex adhoc queries in bigdata clusters. In SIGMOD, pages 631–646, 2016.
[32] C. Olston, E. Bortnikov, K. Elmeleegy, F. Junqueira, and B. Reed. Interactive analysis of web-scale data. In CIDR, 2009.
[33] N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online aggregation for large mapreduce jobs. PVLDB, 4(11):1135–1145, 2011.
[34] Y. Park, B. Mozafari, J. Sorenson, and J. Wang. VerdictDB: Universalizing approximate query processing. In SIGMOD, pages 1461–1476, 2018.
[35] J. Schmidhuber, D. Wierstra, and F. J. Gomez. Evolino: Hybrid neuroevolution/optimal linear search for sequence learning. In IJCAI, pages 853–858, 2005.
[36] L. Sidirourgos, M. L. Kersten, and P. A. Boncz. SciBORQ: Scientific data management with bounds on runtime and quality. In CIDR, pages 296–301, 2011.
[37] M. Szegedy. The DLT priority sampling is essentially optimal. In STOC, pages 150–158, 2006.
[38] J. Tobias and A. Zellner. A note on aggregation, disaggregation and forecasting performance. Journal of Forecasting, 19(5):457–469, 2000.
[39] J. S. Vitter and M. Wang. Approximate computation of multidimensional aggregates of sparse data using wavelets. In SIGMOD, pages 193–204, 1999.
[40] M. West and J. Harrison. Bayesian Forecasting and Dynamic Models. Springer-Verlag, 1997.
A MISSING DETAILS

A.1 Proof of Proposition 1
We can write the noisy version of ARMA(1, 1) as

    M̂_t = α_1 M̂_{t-1} + (u_t + β_1 u_{t-1}) + (ε_t − α_1 ε_{t-1}).

Let Var[u_t] = σ_u² and Var[ε_t] = σ_ε². We have

    E[u_t M̂_t] = E[u_t²] + E[u_t(α_1 M̂_{t-1} + β_1 u_{t-1} + ε_t − α_1 ε_{t-1})] = σ_u² + 0 = σ_u²,

from the fact that, among the terms on the right-hand side, u_t is correlated only with M̂_t itself. Similarly, E[ε_t M̂_t] = σ_ε². Taking the variance of both sides of the model, we have

    Var[M̂_t] = Var[u_t + ε_t] + Var[α_1 M̂_{t-1} + β_1 u_{t-1} − α_1 ε_{t-1}]
             = σ_u² + σ_ε² + α_1² Var[M̂_{t-1}] + β_1² σ_u² + α_1² σ_ε²
               + 2 α_1 β_1 Cov[M̂_{t-1}, u_{t-1}] − 2 α_1² Cov[M̂_{t-1}, ε_{t-1}] − 2 β_1 α_1 Cov[u_{t-1}, ε_{t-1}].

Since u_t and ε_t are independent, Cov[M̂_t, u_t] = E[u_t M̂_t], and Cov[M̂_t, ε_t] = E[ε_t M̂_t], we have

    Var[M̂_t] = α_1² Var[M̂_{t-1}] + (1 + 2α_1β_1 + β_1²) σ_u² + (1 − α_1²) σ_ε².

Assuming M̂_t is weakly stationary (Var[M̂_t] = Var[M̂_{t-1}]),

    Var[M̂_t] = ((1 + 2α_1β_1 + β_1²) σ_u² + (1 − α_1²) σ_ε²) / (1 − α_1²)    (11)
             = (1 + 2α_1β_1 + β_1²)/(1 − α_1²) · σ_u² + σ_ε².

Therefore, the variances of forecasts are linear combinations of σ_u² and σ_ε², with σ_u² decided by the underlying data distribution, and σ_ε² decided by the sampling scheme and sampling rate. □

A.2 Proof of Theorem 3
We can first calculate the variance of each m̂_i in (6):

    Var[m̂_i] = (m_i(Δ + w_i)/w_i)² · (w_i/(Δ + w_i)) · (Δ/(Δ + w_i)) = Δ m_i²/w_i.

Since the random variables m̂_i's are independent, we have

    Var[M̂] = Σ_{i=1}^n Var[m̂_i] = Σ_{i=1}^n Δ m_i²/w_i.    (12)

Define W = Σ_{i=1}^n w_i. If w is (θ, θ̄)-consistent with m,

    Var[M̂] = Σ_{i=1}^n Δ m_i²/w_i ≤ Σ_{i=1}^n Δ m_i²/(m_i/θ̄) = θ̄ Δ M.

From (6) and (θ, θ̄)-consistency, the expected sample size is

    E[|S_Δ|] = Σ_{i=1}^n w_i/(Δ + w_i) ≤ W/Δ ≤ (Σ_{i=1}^n m_i/θ)/Δ = M/(θ Δ).    (13)

From the fact that M̂ is unbiased and the above two inequalities,

    E[((M̂ − M)/M)²] ≤ θ̄ Δ M/M² = (θ̄/θ) · (θ Δ/M) ≤ (θ̄/θ)/E[|S_Δ|].    (14)

We can conclude the proof from the above inequality. □

A.3 Finding the Optimal Weight
We can choose w to minimize the variance of the estimation in (12), while the expected sample size in (13) is bounded. That is, solving

    min_w Var[M̂] = Σ_{i=1}^n Δ m_i²/w_i
    s.t. E[|S_Δ|] = Σ_{i=1}^n w_i/(Δ + w_i) ≤ B.

Using the method of Lagrange multipliers, we can solve the above problem and obtain the optimal solution w*_i satisfying:

    −Δ m_i²/(w*_i)² + λ Δ/(Δ + w*_i)² = 0,

where λ is the Lagrange multiplier. If we instead consider the more restrictive constraint

    Σ_{i=1}^n w_i/Δ ≤ B,

we have −Δ m_i²/(w*_i)² + λ/Δ = 0, and thus w*_i = Δ m_i/√λ.

A.4 Proof of Corollary 5
If we use w× as the sampling weights for a particular measure m^(p), for any row i ∈ [n], we have

    m_i^(p)/w×_i = (Π_{j=1}^k m_i^(p))^{1/k} / (Π_{j=1}^k m_i^(j))^{1/k} = Π_{j:j≠p} (m_i^(p)/m_i^(j))^{1/k},

and thus,

    max_i (m_i^(p)/w×_i) ≤ Π_{j:j≠p} (max_i (m_i^(p)/m_i^(j)))^{1/k} = Π_{j:j≠p} ρ̄_{p,j}^{1/k} ≜ θ̄;

similarly,

    min_i (m_i^(p)/w×_i) ≥ Π_{j:j≠p} ρ_{p,j}^{1/k} ≜ θ.

Therefore, θ ≤ m_i^(p)/w×_i ≤ θ̄ for every row i, where θ and θ̄ are decided by the ρ_{p,q}'s and ρ̄_{p,q}'s as above, respectively; these bounds are all that the proof of Theorem 3 uses, so we can derive the following error bound:

    RE(M̂^(p)) ≤ RSTD(M̂^(p)) ≤ √(Π_{j:j≠p} ρ*_{p,j}^{1/k} / E[|S_Δ|]) ≤ √(((ρ*)^{1/k})^{k−1} / E[|S_Δ|]),

which concludes the proof. □

A.5 Proof of Proposition 7
Since w = [w_i] is (θ, θ̄)-consistent with the measure m = [m_i],

    m′_i = m_i/Σ_j m_j ≤ θ̄ w_i / Σ_j (θ w_j) = (θ̄/θ) · w′_i = θ* · w′_i,

where θ* = θ̄/θ. Similarly, m′_i ≥ (1/θ*) · w′_i. Putting them together, we have

    |m′_i − w′_i| ≤ max{(θ* − 1), (1 − 1/θ*)} · w′_i.

Since θ* ≥ 1, we have θ* − 1 ≥ 1 − 1/θ*, and thus |m′_i − w′_i| ≤ (θ* − 1) · w′_i. Summing up,

    ‖m′ − w′‖₁ = Σ_i |m′_i − w′_i| ≤ (θ* − 1) · Σ_i w′_i = θ* − 1. □