Estimating Latent Demand of Shared Mobility through Censored Gaussian Processes
Daniele Gammelli, Inon Peled, Filipe Rodrigues, Dario Pacino, Haci A. Kurtaran, Francisco C. Pereira
A Preprint
Daniele Gammelli
Danmarks Tekniske Universitet (DTU), Kgs. Lyngby, Denmark, 2800
[email protected]

Inon Peled
Danmarks Tekniske Universitet (DTU), Kgs. Lyngby, Denmark, 2800
[email protected]

Filipe Rodrigues
Danmarks Tekniske Universitet (DTU), Kgs. Lyngby, Denmark, 2800
[email protected]

Dario Pacino
Danmarks Tekniske Universitet (DTU), Kgs. Lyngby, Denmark, 2800
[email protected]

Haci A. Kurtaran
Head of BI and Finances, Donkey Republic, Copenhagen, Denmark
[email protected]

Francisco C. Pereira
Danmarks Tekniske Universitet (DTU), Kgs. Lyngby, Denmark, 2800
[email protected]
February 18, 2020

Abstract
Transport demand is highly dependent on supply, especially for shared transport services where availability is often limited. As observed demand cannot be higher than available supply, historical transport data typically represents a biased, or censored, version of the true underlying demand pattern. Without explicitly accounting for this inherent distinction, predictive models of demand would necessarily represent a biased version of true demand, thus less effectively predicting the needs of service users. To counter this problem, we propose a general method for censorship-aware demand modeling, for which we devise a censored likelihood function. We apply this method to the task of shared mobility demand prediction by incorporating the censored likelihood within a Gaussian Process model, which can flexibly approximate arbitrary functional forms. Experiments on artificial and real-world datasets show how taking into account the limiting effect of supply on demand is essential in the process of obtaining an unbiased predictive model of user demand behavior.

Keywords: Demand modeling · Censoring · Gaussian Processes · Bayesian Inference · Shared Mobility
Being able to understand and predict demand is essential in the planning and decision-making processes of any given transport service, allowing service providers to make decisions coherently with user behavior and needs. Having reliable models of demand is especially relevant in shared transport modes, such as car-sharing and bike-sharing, where the high volatility of demand and the flexibility of supply modalities (e.g., infinitely many possible configurations of a car-sharing fleet) require that decisions be made in strong accordance with user needs. If, for instance, we consider the bike-sharing scenario, service providers face a great variety of complex decisions on how to satisfy user demand. To name a few, concrete choices must be made concerning capacity positioning (i.e., where to deploy the service), capacity planning (i.e., dimensioning the fleet size), rebalancing (i.e., where and when to reallocate idle supply), and expansion planning (i.e., if and how to expand the reach of the service).
Demand modeling uses statistical methods to capture user demand behavior based on recorded historical data. However, historical transport service data is usually highly dependent on the historical supply offered by the provider itself. In particular, supply represents an upper limit on our ability to observe realizations of the true demand. For example, in a bike-sharing service, we might observe a usage (i.e., demand) equal to the number of available bikes, even though the actual demand might have been higher. This leads to a situation in which historical data is in fact representing a biased, or censored, version of the underlying demand pattern in which we are truly interested. More importantly, using censored data to build demand models will, as a natural consequence, produce a biased estimate of demand and an inaccurate understanding of user needs, which will ultimately result in non-optimal operational decisions for the service provider.

To address these problems, we propose a general approach for building models that are aware of the supply-censoring issue, and which are ultimately more reliable in reflecting user behavior. To this end, we formulate a censored likelihood function and apply it within a Gaussian Process model of user demand. Using both synthetic and real-world datasets of shared mobility services, we pit this model against non-censored models and analyze the conditions under which it is better capable of recovering true demand.
There exists a long line of literature on demand forecasting, with research mainly focusing on obtaining predictions of demand based on historical data in combination with some exogenous input, such as weather and temporal information. The bike-sharing context alone already provides many such examples, as follows. [1, 2] use Random Forest models for rental predictions, whereas [3] approach the demand forecasting problem through the use of Generalized Linear Models (GLMs) of counts, namely, Poisson and Negative Binomial regression models. [4] apply multivariate regression using data gathered from multiple bike-sharing systems.

[5] define a hierarchical prediction model, based on Gradient Boosting Regression Trees, for both rentals and returns at a city-aggregation level. They model rental proportions for clusters of bike stations through a multi-similarity based inference and predict returns consistently with rentals by learning transition probabilities (i.e., where bikes end up after being rented). A different approach is found in [6], where the proposed prediction model is based on log-log regression and refers to Origin-Destination (O-D) trip demand, rather than rentals and returns in specific locations or areas. [7, 8, 9] instead propose deep learning-based approaches to model, respectively, bike in-flow and out-flow for bike-sharing system areas and station-level demand.
Data censorship involves two corresponding sets of values: true and observed, such that each observed value may be clipped above or below the respective true value. Observations that have not been clipped are called non-censored, and correspond exactly with true values. All other observations are called censored, as they have been clipped at their observed values, therefore giving only a limited account of the corresponding true values (e.g., observed demand not corresponding with true demand). We next review methods for modeling censored data, namely, censored modeling.

An early form of censored modeling is the Probit model [10] for binary (0/1) observations, which assumes that the probability of observing a positive outcome depends on a linear function of the given explanatory features. James Tobin extended this to a model for censored regression [11], now called Tobit, which assumes that the latent variable depends on the explanatory features linearly with Gaussian noise, and that observations are censored according to a fixed and known threshold. Because the censored observations divert Least Squares estimators away from the true, latent values, Tobit linear coefficients are instead commonly estimated via Maximum Likelihood techniques, such as Expectation Maximization or Heckman's Two-Step estimator [12].

The Tobit model has since become a cornerstone of censored modeling in multiple research fields, including econometrics and survival analysis [13]. It has been extended through multiple variations [14], such as: multiple latent variables, e.g., one variable determines which observations are non-censored while another determines their values [15]; Tobit-like models of censored count data [16]; Tobit Quantile Regression [17]; dynamic, autoregressive Tobit [18]; and combination with the Kalman Filter [19].
Other methods of censored regression have also been suggested, predominantly for survival analysis, such as: Proportional Hazard models [20]; Accelerated Failure Time models [21]; Regularized Linear Regression [22]; and Deep Neural Networks [23, 24].

As reflected in all the above works, research into censored modeling commonly aims to reconstruct the latent process that yields the true values. In this work too, we aim to predict true values, not the occurrence of clipping nor the actually observed values. Similarly to Tobit, we too assume that all observations are independent and that each observation is known to be either censored or non-censored. However, contrary to all the above works, our method employs Gaussian Processes, which are non-parametric and do not assume any specific functional relationship between the explanatory features and the latent variable. A theoretical treatment of non-parametric censored regression appears in [25], where the authors derive estimators for the latent variable with fixed censoring, analyze them mathematically, and apply them to a single, simulated dataset. In contrast, this work proposes a likelihood function and applies it to multiple, real-world datasets with dynamic censoring.
Within the above stream of research, the censoring problem discussed in Section 1 is widely accepted. However, to the best of our knowledge, there has been no extensive study on how observed demand can differ from the true, underlying demand and how these differences could impact predictive models. To address this issue, common approaches rely on various data cleaning techniques. For example, [3, 26, 27, 28] attempt to avoid the bias induced by censored observations by filtering out the time periods where censoring might have occurred, before modeling. As a further example, [27] substitute the censored observations with the mean of the historical (non-censored) observations regarding the same period. A different but related approach is proposed by [29, 30], who focus on obtaining an unbiased estimate of the arrival rate by omitting historical instances where bikes were unavailable.

These common approaches manage, to some degree, to correct the bias characterizing observed demand. However, they also highlight that this problem represents an important gap which requires further study in order to obtain reliable predictions for shared transport and other fields. We believe there are two main reasons why a more structured view of the censoring problem is needed: (i) the approaches described above might not be applicable in many real-world scenarios in which the number of censored observations is very high, leading either to an excessive elimination of useful data or to an inadequate imputation of the censored data; (ii) rather than resorting to cleaning and imputation procedures such as the ones above, it would be desirable to equip the forecasting model with some awareness of the censoring problem, so that the model can utilize the entire information captured in the observations to coherently adjust its predictions.
In this Section, we incrementally describe the building blocks of our proposed censored models. First, we introduce several general concepts: the Tobit likelihood, Gaussian Processes, and the role of kernels. Then, we combine these concepts by defining Censored Gaussian Processes, along with a corresponding inference procedure.
As a reference point for developing our censored likelihood function, let us now elaborate on the likelihood function of the popular Tobit censored regression model, described in Section 2.2. For each observation $y_i$, let $y_i^*$ be the corresponding true value. For instance, in a shared transport demand setting, $y_i^*$ is the true, latent demand for shared mobility, while $y_i$ is the observed demand; if $y_i$ is non-censored then $y_i = y_i^*$, otherwise $y_i$ is censored so that $y_i < y_i^*$. We are also given binary censorship labels $l_i$, so that $l_i = 1$ if $y_i$ is censored and $l_i = 0$ otherwise (e.g., labels could be recovered by comparing observed demand to available supply).

Tobit parameterizes the dependency of $y_i^*$ on explanatory features $x_i$ through a linear relationship with parameters $\beta$ and noise term $\varepsilon_i$, where all $\varepsilon_i$ are independently and normally distributed with mean zero and variance $\sigma^2$, namely:

$$y_i^* = \beta^T x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2). \tag{1}$$

There are multiple variations of the Tobit model, depending on where and when censoring arises. In this work, without loss of generality, we deal with upper censorship, also known as Type I, where $y_i$ is upper-bounded by a given threshold $y_u$, so that:

$$y_i = \begin{cases} y_i^*, & \text{if } y_i^* < y_u \\ y_u, & \text{if } y_i^* \geq y_u. \end{cases} \tag{2}$$

The likelihood function in this case can be derived from Eqs. 1 and 2, as follows.

1. If $l_i = 0$, then $y_i$ is non-censored, and so its likelihood is:

$$\frac{1}{\sigma} \, \phi\!\left(\frac{y_i - \beta^T x_i}{\sigma}\right), \tag{3}$$

where $\phi$ is the standard Gaussian probability density function.

2. Otherwise, i.e., if $l_i = 1$, then $y_i$ is censored, and so its likelihood is:

$$1 - \Phi\!\left(\frac{y_i - \beta^T x_i}{\sigma}\right), \tag{4}$$

where $\Phi$ is the standard Gaussian cumulative distribution function.

Because all observations are assumed to be independent, their joint likelihood is:

$$\prod_i \left\{ \frac{1}{\sigma} \, \phi\!\left(\frac{y_i - \beta^T x_i}{\sigma}\right) \right\}^{1 - l_i} \left\{ 1 - \Phi\!\left(\frac{y_i - \beta^T x_i}{\sigma}\right) \right\}^{l_i}, \tag{5}$$

which is a function of $\beta$ and $\sigma$.

Gaussian Processes (GPs) [31] are an extremely powerful and flexible tool belonging to the field of probabilistic machine learning [32]. GPs have been applied successfully to both classification and regression tasks in various transport-related scenarios, such as travel times [33, 34], congestion models [35], crowdsourced data [36, 37], traffic volumes [38], etc. For example, given a finite set of points for regression, there are typically infinitely many functions which fit the data, and GPs offer an elegant approach to this problem by assigning a probability to every possible function. Moreover, GPs implicitly adopt a fully probabilistic approach, thus enabling the structured quantification of the confidence – or equivalently, the uncertainty – in the predictions of a GP model. This ease of uncertainty quantification is one of the principal reasons why we chose to use GPs for demand prediction in the shared mobility domain. Indeed, transport service providers are not only interested in more accurate demand models, but also, and maybe most importantly, wish to make operational decisions based on the measure with which the model is confident of its predictions.

Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ with $n$ input vectors $x_i$ and scalar outputs $y_i$, the goal of GPs is to learn the underlying data distribution by defining a distribution over functions.
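The Tobit likelihood of Eqs. 3–5 is straightforward to evaluate in log-space. The following is a minimal illustration using SciPy; the function and variable names are our own, not from the paper:

```python
import numpy as np
from scipy.stats import norm

def tobit_log_likelihood(y, X, l, beta, sigma):
    """Joint log-likelihood of Eq. 5 for upper-censored (Type I) Tobit.

    y: observed responses; X: feature matrix; l: binary censorship labels
    (l_i = 1 means y_i was clipped at the threshold); beta, sigma: parameters.
    """
    mu = X @ beta                       # linear predictor beta^T x_i
    z = (y - mu) / sigma
    # Non-censored points contribute a Gaussian density term (Eq. 3),
    # censored points a Gaussian tail probability (Eq. 4).
    ll_nc = norm.logpdf(z) - np.log(sigma)
    ll_c = norm.logsf(z)                # log(1 - Phi(z)), numerically stable
    return np.sum((1 - l) * ll_nc + l * ll_c)
```

Maximizing this quantity over `beta` and `sigma` (e.g., with a gradient-based optimizer) recovers the Maximum Likelihood Tobit estimates mentioned above.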
GPs model this distribution by placing a multivariate Gaussian distribution, defined by a mean function $m(x)$ and a covariance function $k(x, x')$ (between all possible pairwise combinations of input points $\{x, x'\}$ in the dataset), or kernel, on a latent variable $f$. More concretely, a GP can be seen as a collection of random variables which have a joint Gaussian distribution. GPs therefore embody a Bayesian approach which assumes a priori that function values behave according to:

$$p(f \mid x_1, \ldots, x_n) = \mathcal{N}(m, K), \tag{6}$$

where $f = [f_1, \ldots, f_n]^T$ is the vector of latent random variables, $m$ is a mean vector, and $K$ is a covariance matrix with entries defined by a covariance function $k$, so that $K_{i,j} = k(x_i, x_j)$. As is customary in GP modeling, we shall assume without loss of generality that the joint Gaussian distribution is centered on $m \equiv 0$. Also, because the response variable is continuous, we model each $y_i$ as generated from a Gaussian distribution centered on $f_i$ with noise variance $\sigma^2$.

A fundamental step in Gaussian Process modeling is the choice of the covariance matrix $K$. This not only describes the shape of the underlying multivariate Gaussian distribution, but also determines the characteristics of the function we want to model. Intuitively, because entry $K_{i,j}$ defines the correlation between the $i$-th and the $j$-th random variables, the kernel describes a measure of similarity between points, ultimately controlling the shape that the GP function distribution adopts.

Multiple kernels can be combined, e.g., by addition and multiplication, to generate more complex covariance functions, thus allowing for great flexibility and better encoding of domain knowledge in the regression model. For the datasets in our experiments (Section 4), we use different combinations of the following three kernels, where $\|x - x'\|$ denotes the Euclidean distance between vectors $x, x' \in \mathbb{R}^n$:

1.
Squared Exponential Kernel (SE)

$$k_{SE}(x, x') = \lambda^2 \exp\!\left(-\frac{\|x - x'\|^2}{2\tau^2}\right) \tag{7}$$

The SE kernel (also called RBF: Radial Basis Function) intuitively encodes the idea that nearby points should be correlated, therefore generating relatively smooth functions. The kernel is parameterized by variance $\lambda^2$ and length scale $\tau$: for larger $\tau$, farther points are considered more similar; $\lambda$ acts as an additional scaling factor, so that larger $\lambda$ corresponds to a higher spread of functions around the mean.

2.
Periodic Kernel

$$k_{Per}(x, x') = \lambda^2 \exp\!\left(-\frac{2 \sin^2(\pi \|x - x'\| / \rho)}{\tau^2}\right) \tag{8}$$

The Periodic kernel allows modeling functions with repeated patterns, and can thus extrapolate seasonality in time-series data. Parameters $\tau$ and $\lambda$ have the same effect as for the SE kernel, while parameter $\rho$ corresponds to the period length of patterns.

3. Matérn Kernel

$$k_{Mat}(x, x') = \lambda^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\,\|x - x'\|}{\tau}\right)^{\!\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\,\|x - x'\|}{\tau}\right), \tag{9}$$

where $\Gamma$ is the Gamma function, $K_\nu$ is the modified Bessel function [39], and, as before, $\tau$ is the length scale and $\lambda^2$ is the variance. The Matérn kernel can be considered a generalization of the SE kernel, parameterized by positive parameters $\nu$ and $\tau$, where lower values of $\nu$ yield less smooth functions.

As is common, we select kernel parameters in our experiments through Type-II Maximum Likelihood, also known as Empirical Bayes, whereby latent variables are integrated out.

We have so far presented two separate modeling approaches: Tobit models for censored data, and non-parametric Gaussian Processes for flexible regression of arbitrarily complex patterns. For non-parametric censored regression, we now combine the two approaches within one model by defining a censorship-aware likelihood function for GPs:

$$\prod_i \left\{ \frac{1}{\sigma} \, \phi\!\left(\frac{y_i - f_i}{\sigma}\right) \right\}^{1 - l_i} \left\{ 1 - \Phi\!\left(\frac{y_i - f_i}{\sigma}\right) \right\}^{l_i}. \tag{10}$$

Eq. 10 is obtained from Eq. 5 by replacing $\beta^T x_i$, the Tobit prediction, with $f_i$, the GP prediction. We next propose an inference procedure for obtaining the posterior distribution of the Censored GP with this likelihood function.

In a Bayesian setting, we are interested in computing the posterior over the latent $f$, given the response $y$ and features $X$. Bayes' rule defines this posterior exactly:

$$p(f \mid y, X) = \frac{p(f) \, p(y \mid f, X)}{p(y \mid X)}. \tag{11}$$

However, exact calculation of the denominator in Eq.
11 is intractable because of the Censored GP likelihood (Eq. 10), as also explained in [31]. The posterior distribution thus needs to be estimated approximately, which we do in this study via Expectation Propagation (EP) [40]. EP alleviates the denominator intractability by approximating the likelihood of each observation through an un-normalized Gaussian in the latent $f_i$, as:

$$p(y_i \mid f_i) \simeq t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) \triangleq \tilde{Z}_i \, \mathcal{N}(f_i \mid \tilde{\mu}_i, \tilde{\sigma}_i^2), \tag{12}$$

where $t_i$ is the $i$-th factor, or site, with site parameters $\tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2$. The likelihood is then approximated as a product of the $n$ independent local likelihood approximations:

$$\prod_{i=1}^n t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = \mathcal{N}(\tilde{\mu}, \tilde{\Sigma}) \prod_{i=1}^n \tilde{Z}_i, \tag{13}$$

where $\tilde{\mu} = [\tilde{\mu}_1, \ldots, \tilde{\mu}_n]$, and $\tilde{\Sigma}$ is a diagonal covariance matrix with $\tilde{\Sigma}_{ii} = \tilde{\sigma}_i^2$. Consequently, the posterior $p(f \mid y, X)$ is approximated by

$$q(f \mid y, X) \triangleq \frac{1}{Z_{EP}} \, p(f \mid X) \prod_{i=1}^n t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = \mathcal{N}(\mu, \Sigma), \tag{14}$$

where $\Sigma = (K^{-1} + \tilde{\Sigma}^{-1})^{-1}$, $\mu = \Sigma \tilde{\Sigma}^{-1} \tilde{\mu}$, and $Z_{EP}$ is the EP approximation to the marginal likelihood.

The key idea in EP is to update the single site approximations $t_i$ sequentially, by iterating the following four steps. Firstly, given a current approximation of the posterior, the current $t_i$ is left out from this approximation, defining a cavity distribution:

$$q_{-i}(f_i) \propto \int p(f \mid X) \prod_{j \neq i} t_j(f_j \mid \tilde{Z}_j, \tilde{\mu}_j, \tilde{\sigma}_j^2) \, df_j, \qquad q_{-i}(f_i) \triangleq \mathcal{N}(f_i \mid \mu_{-i}, \sigma_{-i}^2), \tag{15}$$

where $\mu_{-i} = \sigma_{-i}^2 (\sigma_i^{-2} \mu_i - \tilde{\sigma}_i^{-2} \tilde{\mu}_i)$ and $\sigma_{-i}^2 = (\sigma_i^{-2} - \tilde{\sigma}_i^{-2})^{-1}$.

The second step is the computation of the target non-Gaussian marginal, combining the cavity distribution with the exact likelihood $p(y_i \mid f_i)$:

$$q_{-i}(f_i) \, p(y_i \mid f_i). \tag{16}$$

Thirdly, a Gaussian approximation $\hat{q}(f_i)$ to the non-Gaussian marginal is chosen such that it best approximates the product of the cavity distribution and the exact likelihood:

$$\hat{q}(f_i) \triangleq \hat{Z}_i \, \mathcal{N}(\hat{\mu}_i, \hat{\sigma}_i^2) \simeq q_{-i}(f_i) \, p(y_i \mid f_i). \tag{17}$$

To achieve this, EP uses the property that if $\hat{q}(f_i)$ is Gaussian, then the distribution which minimizes the Kullback-Leibler divergence $\mathrm{KL}(q_{-i}(f_i) \, p(y_i \mid f_i) \,\|\, \hat{q}(f_i))$ is the distribution whose first and second moments match those of $q_{-i}(f_i) \, p(y_i \mid f_i)$. In addition, given that $\hat{q}(f_i)$ is un-normalized, EP also imposes the condition that the zero-th moments should match. Formally then, EP minimizes the following objective:

$$\mathrm{KL}(q_{-i}(f_i) \, p(y_i \mid f_i) \,\|\, \hat{q}(f_i)) = \mathbb{E}_{q_{-i}(f_i) \, p(y_i \mid f_i)} \left[ \log\!\left(\frac{q_{-i}(f_i) \, p(y_i \mid f_i)}{\hat{q}(f_i)}\right) \right]. \tag{18}$$

Lastly, EP chooses the $t_i$ for which the posterior has the marginal from the third step.

The implementation of EP in our experiments is based on GPy (https://sheffieldml.github.io/GPy/), an open source Gaussian Processes package in Python. We have implemented the censored likelihood and defined the following moments for the EP inference procedure:

$$\hat{Z}_i = \int q_{-i}(f_i) \, p(y_i \mid f_i) \, df_i, \qquad \hat{\mu}_i = \mathbb{E}_{\hat{q}}[f_i], \qquad \hat{\sigma}_i^2 = \mathbb{E}_{\hat{q}}\!\left[(f_i - \mathbb{E}_{\hat{q}}[f_i])^2\right]. \tag{19}$$

The moments of our censored likelihood (Eq. 10) can be defined analytically, depending on the censoring upper limit $y_u$, as follows (for a detailed derivation, see [41]).

• If $y_i < y_u$, then:

$$\hat{Z}_i = \frac{1}{\sqrt{2\pi(\sigma^2 + \sigma_{-i}^2)}} \exp\!\left(-\frac{(y_i - \mu_{-i})^2}{2(\sigma^2 + \sigma_{-i}^2)}\right), \qquad \hat{\mu}_i = \mu_{-i} + \sigma_{-i}^2 \left(\frac{y_i - \mu_{-i}}{\sigma^2 + \sigma_{-i}^2}\right), \qquad \hat{\sigma}_i^2 = \sigma_{-i}^2 - \frac{\sigma_{-i}^4}{\sigma^2 + \sigma_{-i}^2}. \tag{20}$$

• Otherwise, $y_i = y_u$, and:

$$\hat{Z}_i = \Phi(\bar{z}_i), \qquad \hat{\mu}_i = \mu_{-i} + \frac{\sigma_{-i}^2 \, \mathcal{N}(\bar{z}_i)}{\Phi(\bar{z}_i) \sqrt{\sigma^2 + \sigma_{-i}^2}}, \qquad \hat{\sigma}_i^2 = \sigma_{-i}^2 - \frac{\sigma_{-i}^4 \, \mathcal{N}(\bar{z}_i)}{\Phi(\bar{z}_i)(\sigma^2 + \sigma_{-i}^2)} \left(\bar{z}_i + \frac{\mathcal{N}(\bar{z}_i)}{\Phi(\bar{z}_i)}\right), \tag{21}$$

where $\Phi$ is the Gaussian Cumulative Distribution Function, $\mathcal{N}$ denotes the standard Gaussian density, and $\bar{z}_i = (\mu_{-i} - y_u) \big/ \sqrt{\sigma^2 + \sigma_{-i}^2}$.
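The analytic moments above translate directly into code. The following is a minimal sketch of the per-observation moment computation (Eqs. 20–21) as we understand it; the function signature and names are our own, not the paper's GPy implementation:

```python
import numpy as np
from scipy.stats import norm

def ep_moments(y_i, mu_cav, var_cav, noise_var, censored):
    """Moments (Z, mu, var) of the tilted distribution q_{-i}(f_i) p(y_i | f_i)
    for the censored likelihood, following Eqs. 20-21.

    mu_cav, var_cav : cavity mean and variance (mu_{-i}, sigma^2_{-i})
    noise_var       : observation noise variance sigma^2
    censored        : False if y_i < y_u, True if y_i equals the threshold y_u
    """
    s2 = noise_var + var_cav                      # sigma^2 + sigma^2_{-i}
    if not censored:
        # Gaussian observation: standard product-of-Gaussians moments (Eq. 20).
        Z = np.exp(-0.5 * (y_i - mu_cav) ** 2 / s2) / np.sqrt(2.0 * np.pi * s2)
        mu = mu_cav + var_cav * (y_i - mu_cav) / s2
        var = var_cav - var_cav ** 2 / s2
    else:
        # Clipped observation (y_i = y_u): probit-like tail moments (Eq. 21).
        z_bar = (mu_cav - y_i) / np.sqrt(s2)
        Z = norm.cdf(z_bar)
        ratio = norm.pdf(z_bar) / Z               # N(z_bar) / Phi(z_bar)
        mu = mu_cav + var_cav * ratio / np.sqrt(s2)
        var = var_cav - (var_cav ** 2 * ratio / s2) * (z_bar + ratio)
    return Z, mu, var
```

Note how a censored observation pulls the tilted mean upward (toward and beyond the threshold) while still reducing the variance relative to the cavity, which is exactly the behavior that lets the model place probability mass above the clipped values.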
In this Section, we conduct experiments on several datasets, as follows. First, to convey a high-level intuition of how censored models can have an advantage over non-censored models, we start with two relatively simple datasets: synthetically generated data in Section 4.2, and real-world motorcycle accident data in Section 4.3. Equipped with this intuition, we then proceed to more elaborate real-world datasets. In Section 4.4, we build models to predict bike pickups for a major bike-sharing service provider in Copenhagen, Denmark. Then, in Section 4.5, we use data about yellow taxis in New York City. For both bike-sharing and taxis, pickups represent mobility demand, which is affected by the censoring phenomenon introduced in previous sections because of finite supply.
As introduced in previous sections, the focus of our experiments is the comparison between Censored and Non-Censored models in the estimation of true demand patterns. We thus compare three GP models:

(i) Non-Censored Gaussian Process (NCGP): represents the Gaussian Process model most commonly used in the literature, i.e., with a Gaussian observation likelihood. NCGP is trained on the entire dataset, consisting of both censored and non-censored observations, without discerning between them.

(ii) Non-Censored Gaussian Process, Aware of censorship (NCGP-A): functionally equivalent to NCGP, but uses information on censoring as a pre-processing step. That is, NCGP-A is trained only on non-censored points, thus avoiding exposure to a biased version of the true demand (because of censoring). This, however, comes at the cost of ignoring relevant information embedded in the censored data.

(iii) Censored Gaussian Process (CGP): this model considers all observations – censored and non-censored – through the likelihood function defined in Eq. 10 (Section 3.4).
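To make the distinction between (i) and (ii) concrete, the following minimal numpy sketch (our own illustration, not the paper's GPy-based implementation) fits an exact GP posterior mean once on all observations (NCGP) and once on the non-censored subset only (NCGP-A); the toy data and kernel parameters are illustrative assumptions:

```python
import numpy as np

def se_kernel(A, B, lam=1.0, tau=1.0):
    """Squared Exponential kernel: lam^2 * exp(-||x - x'||^2 / (2 tau^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return lam ** 2 * np.exp(-0.5 * d2 / tau ** 2)

def gp_posterior_mean(X, y, X_star, noise_var=0.1):
    """Exact GP regression posterior mean: K_*^T (K + sigma^2 I)^{-1} y."""
    K = se_kernel(X, X) + noise_var * np.eye(len(X))
    return se_kernel(X, X_star).T @ np.linalg.solve(K, y)

# Toy censored data (noiseless for clarity): demand sin(x), clipped at y_u = 0.8.
X = np.linspace(0.0, 4.0, 30)[:, None]
y_true = np.sin(X[:, 0])
y_obs = np.minimum(y_true, 0.8)            # supply-censored observations
labels = (y_true > 0.8).astype(int)        # censorship labels l_i

X_test = np.array([[np.pi / 2]])           # a point deep in the censored region
ncgp = gp_posterior_mean(X, y_obs, X_test)                 # (i): all points
mask = labels == 0
ncgp_a = gp_posterior_mean(X[mask], y_obs[mask], X_test)   # (ii): filtered
```

In this sketch, NCGP's estimate at the peak is pulled toward the clipped value rather than the true demand, while NCGP-A simply has no data there; CGP (iii) instead keeps the clipped points as lower-bound evidence via the censored likelihood.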
To demonstrate the capabilities of our approach, we begin with a simple, synthetically generated dataset, where the censoring pattern has a relevant and meaningful structure. We define a latent function for $x \in \mathbb{R}$:

$$f^*(x) = \frac{2 + \sin(2x)}{2 + x}, \tag{22}$$

which we wish to recover from censored, noisy observations. We generate these observations over a fixed input interval by repeatedly evaluating $f^*$ and adding a noise term, which is independently sampled from $\mathcal{N}(0, \sigma^2)$. As seen in Fig. 1a, this censoring process generates a cloud of points far from the true dynamics of the latent $f^*$ at the peaks of the sinusoidal oscillation.

We fit NCGP, NCGP-A and CGP to the observations and measure how well each of them recovers the true $f^*$. The resulting predictions of the three models in Figs. 1b, 1c, 1d highlight interesting aspects worth underlining:

• Fig. 1b: NCGP predicts an approximately linear trend for the observed data, which, if we did not know the true function behind our observations, would seem like a reasonable conclusion. Nevertheless, the predictions are definitely far from the true data-generating process.

• Fig. 1c: By discarding the censored observations, NCGP-A is able to correctly assess the behavior of the underlying function in regions close to observable data. However, NCGP-A is not able to generalize to regions affected by censoring.

• Fig. 1d: CGP, on the other hand, is able to exploit its knowledge of censoring to make better sense of the observable data. Through the use of a censored likelihood (Eq. 10), CGP assigns higher probability to plausible functional patterns, which are coherent not only with the observable data but also with the censored function.

Having demonstrated our approach on an artificial dataset, we now make use of the openly accessible Motorcycle Accident dataset [42], used in the literature for a wide variety of statistical learning tasks. The data consists of acceleration measurements at consecutive time points (in ms).
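Returning for a moment to the synthetic setup of Eq. 22, its data-generating process can be sketched as follows. Note that the specific form of $f^*$, the input interval, the noise level, and the clipping threshold below are our illustrative assumptions, not the paper's exact values:

```python
import numpy as np

rng = np.random.default_rng(42)

def f_star(x):
    # Assumed form of the latent function in Eq. 22.
    return (2.0 + np.sin(2.0 * x)) / (2.0 + x)

x = np.linspace(0.0, 10.0, 200)            # illustrative input interval
y_latent = f_star(x) + rng.normal(0.0, 0.1, size=x.shape)  # eps ~ N(0, sigma^2)
y_u = 0.9                                  # illustrative clipping threshold
y_obs = np.minimum(y_latent, y_u)          # upper (Type I) censoring
labels = (y_latent >= y_u).astype(int)     # l_i = 1 where the draw was clipped
```

The clipped draws near the peaks form exactly the "cloud of points far from the true dynamics" described above, and `labels` plays the role of the censorship labels $l_i$ fed to the models.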
For this experiment, we aim to explore how variations in the severity of the censoring process can impact the models' performance (source code available at https://github.com/DanieleGammelli/CensoredGP). In particular, we focus on two axes of variation: the percentage of censored points, which controls how often the true function is observable, and the intensity of censoring, which controls how far censored observations are from true values.

Figure 1: Synthetically generated data (a) and model fits for NCGP (b), NCGP-A (c) and CGP (d). The dashed line is the latent true function $f^*$; crosses and circles are, respectively, censored and non-censored observations; the continuous line and the shaded area are, respectively, the posterior mean over the GP latent variable $f$ and the corresponding posterior confidence interval.

Concretely, we apply the following censoring process in the experiments with Motorcycle data $y_1^*, \ldots, y_n^*$, given any $0 \leq p \leq 1$ and $0 \leq a < b \leq 1$:

1. Initialize $y_i = y_i^*$ for all $i = 1..n$.

2. Define the set of censored observations $N_c$ by randomly selecting $\lceil pn \rceil$ elements from $1..n$.

3. For each $i \in N_c$, independently sample a censorship intensity $p_c^{(i)} \sim U[a, b]$, and censor the $i$-th observation as $y_i = (1 - p_c^{(i)}) \, y_i^*$.

We explore moderate to extreme censoring through increasing percentages of censored points $p$ and increasing censorship-intensity intervals $[a, b]$. Figs. 2, 3 and 4 show how each of NCGP, NCGP-A and CGP, respectively, reacts to these variations: the percentage of censored observations increases from top to bottom, and censorship intensity increases from left to right. In each plot, the horizontal and vertical axes correspond to time and acceleration, respectively; we omit axis labels to reduce visual clutter, as the specifics of this dataset are less important in the context of this experiment. We see that:

• Fig. 2: Despite being able to successfully recover the latent function for small censorship intensities (first row), NCGP is then exposed to an excessive amount of bias because of the increased censoring.
This results in completely distorted predictions, as the remaining non-censored points are essentially considered outliers by the model.

• Fig. 3: On the other hand, NCGP-A manages to avoid the bias coming from censored points by ignoring them, as long as the non-censored observations characterize the latent function well. When this condition no longer holds, NCGP-A is unable to generalize to unobserved domains, so that it yields biased predictions with high uncertainty.

• Fig. 4: Lastly, CGP not only manages to avoid the bias introduced by censored observations in moderate to high censoring scenarios, but is also able to couple information coming from both censored and non-censored observations to achieve better predictions in regions of extreme censoring.
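The randomized censoring process described above (random selection of $\lceil pn \rceil$ points, each clipped by an intensity drawn from $U[a, b]$) can be sketched as follows; the function name is our own:

```python
import numpy as np

def censor_observations(y_true, p, a, b, rng):
    """Apply the randomized censoring process from the Motorcycle experiment:
    select ceil(p * n) observations, draw an intensity p_c ~ U[a, b] for each,
    and clip the i-th selected observation to (1 - p_c) * y_true[i]."""
    y = y_true.astype(float).copy()
    n = len(y)
    n_c = int(np.ceil(p * n))
    idx = rng.choice(n, size=n_c, replace=False)   # the censored set N_c
    intensity = rng.uniform(a, b, size=n_c)        # per-observation intensity
    y[idx] = (1.0 - intensity) * y_true[idx]
    labels = np.zeros(n, dtype=int)
    labels[idx] = 1                                # censorship labels l_i
    return y, labels
```

Running this for increasing `p` and increasingly severe `[a, b]` intervals reproduces the moderate-to-extreme censoring grid explored in Figs. 2–4.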
Figure 2: NCGP fit for intensifying censorship of the motorcycle accident data.

Figure 3: NCGP-A fit for intensifying censorship of the motorcycle accident data.
Figure 4: CGP fit for intensifying censorship of the motorcycle accident data.
In this Section, we deal with the problem of building a demand prediction model for a bike-sharing system. We use real-world data provided by Donkey Republic, a major bike-sharing provider in the city of Copenhagen, Denmark. Donkey Republic can be considered a hub-based service, meaning that users of the service are not free to pick up or drop off a bike in any location, but are restricted to a certain number of virtual hubs around Copenhagen. Our objective is to model daily rental demand in the hub network.
The given data consists of individual records of users renting and returning bikes in hubs during 379 days: from 1 March 2018 until 14 March 2019. Hence, before modeling daily rentals, we aggregate the data both spatially and temporally. Spatially, hubs were aggregated into three super-hubs by selecting three main service areas (such as the central station and main tourist attractions) and considering a 200 m radius around these points of interest (Fig. 5). The choice of radius is justified by Donkey Republic business insights, whereby users typically agree to walk at most 200 m to rent a bike and typically give up if the bike is farther. Temporally, the data at our disposal allowed us to retrieve the time series of daily rental pickups for the three super-hubs, which represent the target of our prediction model.

The spatial aggregation of individual hubs into super-hubs is an important modeling step in building the prediction model, for two main reasons:

1. Time series for individual hubs reveal excessive fluctuations in rental behavior. Hence, separate treatment of individual hubs could likely hide most of the valuable regularities in the data and ultimately expose the predictive models to an excessive amount of noise.

2. As seen in Fig. 5, the individual hubs are very close to one another, especially in central areas of the city. Demand patterns between neighboring hubs can thus be well correlated, and a bike-sharing provider would actually benefit more from understanding the demand in an entire area rather than in a single hub. Moreover, bike-sharing demand conceptually covers an entire area, as bike-sharing users would likely walk a few dozen meters more to rent a bike. Hence, from here on, we assume that if no bikes are available at some hub, then users who wish to rent a bike are willing to walk to any other hub in the same super-hub.
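The paper does not specify how the 200 m-radius aggregation is implemented; one straightforward sketch assigns each hub to the first super-hub center within 200 m using great-circle distance (the coordinates in the example below are synthetic, not Donkey Republic's actual hub locations):

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = p2 - p1
    dlmb = np.radians(lon2) - np.radians(lon1)
    h = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(h))

def assign_super_hub(hub, centers, radius_m=200.0):
    """Return the index of the first super-hub center within radius_m of
    `hub` (a (lat, lon) pair), or None if the hub falls outside every radius."""
    for j, c in enumerate(centers):
        if haversine_m(hub[0], hub[1], c[0], c[1]) <= radius_m:
            return j
    return None
```

Hubs assigned to the same center then have their daily pickups summed to form the super-hub time series modeled in this Section.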
PREPRINT - February 18, 2020

Figure 5: Map with: 1) marked locations of Donkey Republic hubs, 2) the three super-hubs in our experiments, as big circles around constituent hubs.
Ideally, before modeling, we would like to have access to the true bike-sharing demand, free of any real-world censorship. However, this ideal setting is impossible, as historical data records are necessarily censored to some extent. Consequently, and for the sake of experimentation, we assume that the given historical data represents true demand (which is what we would ideally like to predict). This further allows us to censor the data manually and examine the effects of such censorship.

We apply manual censorship to the time series of each super-hub in two stages. In the first stage, for each day $i$ in $N = \{1, \dots, 379\}$, we let $\delta_i \in \{0, 1\}$ indicate whether at any moment during day $i$ there were no bikes available in the entire super-hub, and define accordingly the sets of censored and non-censored observations:

$$N_c = \{ i \in 1..379 : \delta_i = 1 \}, \qquad (23)$$
$$N_{nc} = \{ i \in 1..379 : \delta_i = 0 \} = N \setminus N_c. \qquad (24)$$

We then fix binary censorship labels as follows: $l_i = 1$ for $i \in N_c$ and $l_i = 0$ for $i \in N_{nc}$. The reason for doing so is that for every day in $N_c$, there was a moment with zero bike availability, so there may have been additional demand which the service could not satisfy and which was thus not recorded.

Having fixed the censorship labels, the second stage of censorship can be executed multiple times for different censorship intensities. That is, given a censorship intensity $0 \le c \le 1$, we censor each observation for which $l_i = 1$ to $(1 - c)$ of its original value. For example, Fig. 6 shows true demand (red) as well as the result of censoring it (grey) for each day in $N_c$ (blue markers).

As in previous experiments, we train models on the manually censored demand and then evaluate them on the observations without censorship, as this is the measure of actual interest: it represents true demand. Evaluation is done by comparing the true values to the predicted posterior mean through two commonly used measures: the coefficient of determination ($R^2$) and the Root Mean Squared Error (RMSE), as detailed in Appendix A.1.

As defined in Section 4.1, the three models are NCGP, NCGP-A and CGP. We equip each model with a kernel that consists of a combination of SE and Periodic kernels on the temporal index feature and a Matérn kernel on weather data. Namely, the covariance function between the $i$'th and the $j$'th sample is:

$$k(\mathbf{x}_i, \mathbf{x}_j) = k_{SE}(x_{i,t}, x_{j,t}) + k_{Per}(x_{i,t}, x_{j,t}) + k_{Mat}(\mathbf{x}_{i,w}, \mathbf{x}_{j,w}), \qquad (25)$$

where $x_{i,t} = [t]$ is the temporal index, and $\mathbf{x}_{i,w}$ consists of weather features as detailed in Table 1. Concretely, we estimate kernel parameters through Type-II Maximum Likelihood using Adam (CGP) and BFGS (NCGP, NCGP-A) to
perform gradient ascent. This choice of optimization routines performed best in practice for each respective model.

Figure 6: Super-hub 3 data with manually applied censorship intensity.

Table 1: Weather features for bike-sharing data, collected in Copenhagen by [43].
- Solar irradiance: global short wave [W/m²], short wave horizontal [W/m²], short wave direct normal [W/m²], downwelling horizontal long wave [W/m²]
- Wind: direction min./max./avg. [°], speed avg./max. [m/s]
- Air: temperature [°C], relative humidity [%], pressure [hPa]
- Rain: fall [mm], duration [s], intensity [mm/h]

The main goal of our experiments is to analyze how well these models can recover the true demand pattern after being trained on a censored version of the same demand. In particular, we are interested in investigating to what degree the censored model (CGP) is able to recover the underlying demand pattern compared to standard regression models (NCGP, NCGP-A), and how this comparison evolves under different censorship intensities. Our experiments thus employ incremental censorship intensities $c = 0, 0.1, 0.2, \dots, 1$, so that $c$ ranges between two extremes: from absence of censoring (full observability of the true demand) to complete censoring (no demand observed in historical data).

This section presents results for the predictive models implemented on each of the three time series with cross-validation. Tables 2, 3, 4 in Appendix A.2 detail the evaluation for each of the three super-hubs. We now concentrate on the results for super-hub 1, as presented in Fig. 7, because they are representative also of the results for the other two super-hubs. The plots in Fig. 7 are a visual representation of Table 2 and compare the performance of NCGP, NCGP-A and CGP for different censorship intensities. We distinguish between evaluating model performance on the entire dataset (consisting of both censored and non-censored observations) and only on non-censored observations.

First, we compare the models that do not discard any observations, namely CGP and NCGP. Considering that a predictive model is better the closer its RMSE is to 0 and the closer its $R^2$ is to 1, the plots show that the two models are comparable under low degrees of censoring.
However, as the censorship intensifies, NCGP becomes strongly biased towards the censored observations, whereas CGP recovers the underlying demand much more consistently.

Next, we compare NCGP-A and CGP, and see that NCGP-A achieves reasonable predictive accuracy, which is nevertheless mostly worse than that of CGP. As outlined previously in Sections 4.2 and 4.3, NCGP-A accuracy depends highly on the extent to which the observable data characterizes the full behavior of the latent function (in this case, true demand). Here, the share of points affected by censoring is moderate for all three super-hubs, so NCGP-A retains acceptable observability of the true demand. Even so, CGP outperforms NCGP-A also on just the non-censored data; this suggests that using a censored likelihood not only allows models to avoid predictive bias on censored data, but also yields a consistent understanding of the data generating process, ultimately leading to increased performance also on observable data.
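As a concrete illustration of how a censored likelihood treats the two kinds of observations, the following minimal sketch (our own, using NumPy/SciPy names, not the paper's code) evaluates the log-likelihood with Gaussian density terms for non-censored points and upper-tail probabilities for censored ones:

```python
import numpy as np
from scipy.stats import norm

def censored_gaussian_loglik(y, f, sigma, labels):
    """Log censored likelihood: for non-censored points, the Gaussian
    density (1/sigma) * phi((y - f)/sigma); for censored points, the
    right-tail probability 1 - Phi((y - f)/sigma), since the observed
    value is only a lower bound on the latent demand."""
    y, f, labels = (np.asarray(a, dtype=float) for a in (y, f, labels))
    z = (y - f) / sigma
    ll_nc = norm.logpdf(z) - np.log(sigma)  # log[(1/sigma) phi(z)]
    ll_c = norm.logsf(z)                    # log[1 - Phi(z)], numerically stable
    return np.where(labels == 1, ll_c, ll_nc).sum()

# With no censored points, the censored likelihood reduces to the
# standard Gaussian likelihood:
y = np.array([10.0, 12.0, 9.0])
f = np.array([10.5, 11.0, 9.5])
no_cens = censored_gaussian_loglik(y, f, sigma=1.0, labels=[0, 0, 0])
print(np.isclose(no_cens, norm.logpdf(y, loc=f, scale=1.0).sum()))  # True
```

A censored observation lying below the latent mean contributes a near-zero log-penalty, which is what keeps a model using this likelihood from being dragged towards censored values.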
Figure 7: Censoring performance analysis on the entire dataset and only on non-censored observations, regarding $R^2$ and RMSE for super-hub 1.

In conclusion, and as outlined in previous Sections, the non-parametric nature of the Censored GP allows it to effectively exploit the concept of censoring, thus preventing censored observations from biasing the entire demand model. In other words, the Censored GP is capable of activating censoring-awareness depending on the data alone.

It is worth emphasizing that the Non-Censored GP is a special case of the Censored GP in the case where all observations are assumed to be non-censored. This becomes evident when juxtaposing the two likelihood functions:

$$\mathcal{L}_C = \prod_{i \in N_{nc}} \frac{1}{\sigma}\,\phi\!\left(\frac{y_i - f_i}{\sigma}\right) \prod_{i \in N_c} \left(1 - \Phi\!\left(\frac{y_i - f_i}{\sigma}\right)\right), \qquad (26)$$

$$\mathcal{L}_{NC} = \prod_{i \in N} \frac{1}{\sigma}\,\phi\!\left(\frac{y_i - f_i}{\sigma}\right), \qquad (27)$$

which shows that in the absence of censoring (i.e., when $N_c = \emptyset$), $\mathcal{L}_C \equiv \mathcal{L}_{NC}$.

For zero censorship intensity, we still see in Fig. 7 some difference in performance. To realize why this happens, recall that by our experimental design (Section 4.4.2), censorship labels $l_i$ are pre-determined regardless of censorship intensity, hence some observations are labeled as censored even when the censorship intensity is 0. Furthermore, the true demand in our experiments originates from historical data, which is itself intrinsically censored. It follows that the models are never exposed to an utter absence of censored observations. Conversely, in a real-world scenario, zero censorship intensity corresponds to a complete absence of censored observations, so that the Censored and Non-Censored GPs are equivalent.

The previous Section dealt with long-term transport demand under deterministic censorship per available supply. In what follows, we complement this with a case study of short-term transport demand under stochastic censorship per available supply. To this end, we apply GP modeling to stochastically censored pickups of yellow, hail-based taxis [44] within a small square area near LaGuardia airport in New York City. The true pickup counts, $\mathbf{y}^* = [y^*_1, \dots, y^*_n]$, are aggregated over consecutive 15 min intervals in the first week of June 2010. At that time, NYC taxis were almost entirely of the yellow type and thus accounted for virtually all ride-hailing demand.
Taxi demand is influenced by the availability of vacant taxis. To approximate the number of vacant taxis, we use the number of dropoffs observed in the previous lag. That is, we use $d_{i-1}$, the observed total dropoffs in time step $i-1$, to stochastically censor $y^*_i$, the true pickups in time step $i$, as follows. First, we randomly select the censorship label $l_i \in \{0, 1\}$ with probability depending on $y^*_i$, $d_{i-1}$, and a parameter $\gamma \in (0, 1)$. As explained in Fig. 8, approximately a fraction $\gamma$ of all $y^*_i$ are thus labeled as censored ($l_i = 1$). Next, we censor from above each observation for which $l_i = 1$ by setting $y_i$ to $(1 - c)\, y^*_i$, where $c \in [0, 1]$ is a given parameter. $c$ is thus the fraction of pickups that are unobserved and so represents unsatisfied ride-hailing demand. This two-step procedure, named RandDropoff, is rigorously defined in Algorithm 1. Fig. 9 illustrates an example of RandDropoff for one combination of $\gamma, c$.
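A runnable sketch of this two-step procedure (our own reconstruction; the logistic form of the censorship probability is our reading of Eq. (28), and the names are hypothetical) could look like:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

def rand_dropoff(y_true, d_prev, gamma, c):
    """Sketch of RandDropoff: label interval i censored with a logistic
    probability that equals gamma when pickups match the previous
    interval's dropoffs and grows as pickups exceed them, then scale
    labeled pickups to (1 - c) of their true value.

    Assumes y_true[i] > 0 for all i."""
    y_true = np.asarray(y_true, dtype=float)
    d_prev = np.asarray(d_prev, dtype=float)
    # Logistic censorship probability with zero-intercept parameter gamma.
    logit = np.log((1.0 - gamma) / gamma) - (y_true - d_prev) / y_true
    p = 1.0 / (1.0 + np.exp(logit))
    labels = rng.binomial(1, p)          # l_i ~ Bernoulli(p_i)
    y = y_true * (1.0 - c) ** labels     # censor only the labeled intervals
    return y, labels

# Toy series of 15-min pickup counts and lagged dropoff counts.
y_true = np.array([20.0, 35.0, 50.0, 40.0])
d_prev = np.array([22.0, 30.0, 20.0, 45.0])
y, labels = rand_dropoff(y_true, d_prev, gamma=0.3, c=0.5)
```

With this form, an interval where pickups equal the previous dropoffs is censored with probability exactly gamma, matching the "zero-intercept" description of Fig. 8.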
Algorithm 1: RandDropoff Censorship

input: $0 \le \gamma \le 1$, $0 \le c \le 1$, $\mathbf{y}^* = [y^*_1, \dots, y^*_n]^T \in \mathbb{R}^n$, $\mathbf{d} = [d_0, \dots, d_{n-1}]^T \in \mathbb{R}^n$.
foreach $i = 1..n$ do
  Sample $l_i \in \{0, 1\}$ as $l_i \sim \mathrm{Bernoulli}(p_{i,\gamma})$, where
  $$p_{i,\gamma} = \left\{ 1 + \exp\!\left( \ln\!\left(\frac{1-\gamma}{\gamma}\right) - \frac{y^*_i - d_{i-1}}{y^*_i} \right) \right\}^{-1}. \qquad (28)$$
  $y_i \leftarrow y^*_i (1 - c)^{l_i}$.
end
return $\mathbf{y} = [y_1, \dots, y_n]$, $\mathbf{l} = [l_1, \dots, l_n]$.

Figure 8: In RandDropoff, the censorship probability $p_{i,\gamma}$ depends on a zero-intercept parameter $\gamma$ and on the difference between pickups $y^*_i$ and dropoffs $d_{i-1}$. The higher $\gamma$ is, and the higher $y^*_i$ is above $d_{i-1}$, the more likely $y^*_i$ is to be censored.

We experiment with RandDropoff censorship for $\gamma = 0.1, 0.2, 0.3, 0.4$ and $c = 0, 0.1, \dots, 1$. For these values of $\gamma$, the average percentages of censored observations are close to the respective expected values. We will later see that these values of $\gamma$ suffice for reasoning about the behavior of each model as more observations become censored. The censorship magnitude increases with $c$, so that $c = 0$ and $c = 1$ represent limiting cases: when $c$ tends to 0, censored observations become very close to their true values, whereas when $c$ tends to 1, censored observations are zeroed.

The models are again NCGP, NCGP-A and CGP, and fitting is done via BFGS with at most 1000 iterations, starting from the same initial hyper-parameters. Because RandDropoff is stochastic, we run each $\gamma, c$ combination independently multiple times. In each independent experiment, we fit and evaluate through cross-validation with 21 time-consecutive folds, each consisting of 32 observations over 8 hours. For each $\gamma$ and $c$, we evaluate by comparing $\mathbf{y}^*$, the latent count of pickups, to the predicted means over all experiments, as detailed in Appendix A.1.

For all models, the explanatory features are $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_n]$, where for all time steps $t = 1 \dots n$, $\mathbf{x}_t$ consists of: $t$, the corresponding hour-of-day, and the corresponding day-of-week (this increases similarity between data points in the same day and improves fit accuracy). For NCGP-A and CGP, $\mathbf{x}_t$ also contains the binary censorship label $l_t$. Because weather conditions in the first week of June 2010 were quite stable, they do not serve here as features. The GP kernel for all three models is thus:

$$k(\mathbf{x}, \mathbf{x}') = k_{SE}(\mathbf{x}, \mathbf{x}') + k_{Per}(\mathbf{x}, \mathbf{x}'). \qquad (29)$$

We evaluate the models both on the entire dataset and exclusively on non-censored observations. The results are illustrated in Fig. 10 and detailed in Table 5 in Appendix A.3. Next, we analyze the behavior of each model per Fig. 10.

As $\gamma$ and $c$ increase, so do the percentage of censored observations and the intensity of censorship. Consequently, we see in Fig. 10 that NCGP deteriorates rapidly as the censored observations draw it downwards, away from the latent pickups. Conversely, NCGP-A maintains a rather stable performance even as $\gamma$ increases, which suggests that it has
Figure 9: Example of RandDropoff censorship for one combination of $\gamma$ and $c$.

enough non-censored points for constructing a stable fit. Note that as NCGP-A ignores censored observations, its performance does not depend on $c$ and thus traces a horizontal line for each $\gamma$.

For the limiting case of the highest $\gamma$ and $c = 0$, NCGP is the best performing model, slightly better than CGP. This, however, may be expected, because labeling many observations as censored without actually censoring them is misleading for any censorship-aware model, including CGP. In all other cases, CGP outperforms NCGP and NCGP-A, whether on the entire dataset or on only non-censored observations. This suggests that CGP is more reliable not only in coping with censorship, but also in situations where observations are known to accurately reflect the latent ground truth. Moreover, CGP performance follows a pattern, which becomes more pronounced as $\gamma$ increases: CGP starts close to NCGP for $c = 0$, improves as $c$ increases up to a point, then decreases and converges.

Finally, we note that we have also experimented with a similarly sized spatial area in the middle of Manhattan, NYC. Contrary to the time series used in this Section, the Manhattan cell exhibited a repetitive and regular demand pattern, for which NCGP-A and CGP performed quite closely. The noticeably better performance of CGP in this Section thus suggests that the advantages of censored modeling emerge in more challenging settings, where the ability to extract meaningful information from censored observations is indeed essential for capturing the underlying demand pattern.

Building a model for demand prediction naturally relies on extrapolating knowledge from historical data. This is usually done by implementing different types of regression models, both to explain past demand behavior and to compute reliable predictions for the future, a fundamental building block for a great number of decision-making processes.
However, we have shown how a reliable predictive model must take censoring into consideration, especially in those cases in which demand is implicitly limited by supply. More importantly, we stressed the fact that, in the context of shared transport demand modeling, there is a need for models that can deal with censoring in a meaningful way, rather than resorting to ad-hoc data cleaning techniques.

To deal with the censoring problem, we have constructed models that incorporate a censored likelihood function within a flexible, non-parametric Gaussian Process (GP). We compared this model to commonly used GP models, which incorporate a standard Gaussian likelihood, through a series of experiments on synthetic and real-world datasets. These experiments highlight how standard regression models are prone to return a biased model of demand under data censorship, whereas the proposed Censored GP model yields consistent predictions even under severe censorship.

The experimental results thus confirm the importance of censoring in demand modeling, especially in the transport scenario, where demand and supply are naturally interdependent. More generally, our results support the idea of building more knowledgeable models instead of relying on case-dependent data cleaning techniques. This can be done by feeding the demand models insights on how the demand patterns actually behave, so that the models can adjust automatically to the available data.

For future work, we plan to study settings where censorship labels are partly or even completely unknown. We also note that shared mobility demand prediction can benefit from utilizing spatio-temporal correlations, as commutes between areas take place regularly and as nearby areas are likely to exhibit similar demand patterns. Hence, also for future work, we plan to implement models which predict censored demand jointly for multiple areas.
Figure 10: Performance of models in RandDropoff experiments: (a) evaluation over all observations; (b) evaluation over only non-censored observations. Extreme values of NCGP are omitted for easier comparison between NCGP-A and CGP.
Acknowledgement
The research leading to these results has received funding from the People Programme (Marie Curie Actions) of the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Individual Fellowship H2020-MSCA-IF-2016, ID number 745673.
References

[1] Zidong Yang, Ji Hu, Yuanchao Shu, Peng Cheng, Jiming Chen, and Thomas Moscibroda. Mobility modeling and prediction in bike-sharing systems. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, MobiSys '16, pages 165–178. ACM, 2016.
[2] Yuchuan Du, Fuwen Deng, and Feixiong Liao. A model framework for discovering the spatio-temporal usage patterns of public free-floating bike-sharing system. Transportation Research Part C: Emerging Technologies, 103:39–55, 2019.
[3] Christian Rudloff and Bettina Lackner. Modeling demand for bikesharing systems: Neighboring stations as source for demand and reason for structural breaks. Transportation Research Record, 2430(1):1–11, 2014.
[4] R. Alexander Rixey. Station-level forecasting of bikesharing ridership: Station network effects in three U.S. systems. Transportation Research Record, 2387(1):46–55, 2013.
[5] Yexin Li, Yu Zheng, Huichu Zhang, and Lei Chen. Traffic prediction in a bike-sharing system. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, SIGSPATIAL '15, pages 33:1–33:10. ACM, 2015.
[6] Divya Singhvi, Somya Singhvi, Peter I. Frazier, Shane G. Henderson, Eoin O'Mahony, David B. Shmoys, and Dawn B. Woodard. Predicting bike usage for New York City's bike sharing system. In AAAI Workshop: Computational Sustainability, 2015.
[7] Junbo Zhang, Yu Zheng, and Dekang Qi. Deep spatio-temporal residual networks for citywide crowd flows prediction. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 2016.
[8] Chengcheng Xu, Junyi Ji, and Pan Liu. The station-free sharing bike demand forecasting with a deep learning approach and large-scale datasets. Transportation Research Part C: Emerging Technologies, 95:47–60, 2018.
[9] L. Lin, Z. He, and S. Peeta. Predicting station-level hourly demand in a large-scale bike-sharing network: A graph convolutional neural network approach. Transportation Research Part C: Emerging Technologies, 97:258–276, 2018.
[10] John H. Aldrich, Forrest D. Nelson, and E. Scott Adler. The linear probability model. In Linear Probability, Logit, and Probit Models, volume 45, chapter 1, pages 9–27. Sage, 1984.
[11] James Tobin. Estimation of relationships for limited dependent variables. Econometrica, 26(1):24–36, 1958.
[12] Takeshi Amemiya. Tobit models: A survey. Journal of Econometrics, 24(1-2):3–61, 1984.
[13] Viv Bewick, Liz Cheek, and Jonathan Ball. Statistics review 12: Survival analysis. Critical Care (London, England), 8(5):389–394, 2004.
[14] William H. Greene. Censored data and truncated distributions. SSRN Electronic Journal, pages 695–734, 2011.
[15] Takeshi Amemiya. Tobit models. In Advanced Econometrics, page 387. Harvard University Press, 1985.
[16] Joseph V. Terza. A tobit-type estimator for the censored Poisson regression model. Economics Letters, 18(4):361–365, 1985.
[17] James L. Powell. Censored regression quantiles. Journal of Econometrics, 32(1):143–155, 1986.
[18] Steven X. Wei. A Bayesian approach to dynamic tobit models. Econometric Reviews, 18(4):417–439, 1999.
[19] Bethany Allik, Cory Miller, Michael J. Piovoso, and Ryan Zurakowski. The tobit Kalman filter: An estimator for censored measurements. IEEE Transactions on Control Systems Technology, 24(1):365–371, 2015.
[20] Richard Kay. Proportional hazard regression models and the analysis of censored survival data. Journal of the Royal Statistical Society: Series C (Applied Statistics), 26(3):227–237, 1977.
[21] Lee-Jen Wei. The accelerated failure time model: A useful alternative to the Cox regression model in survival analysis. Statistics in Medicine, 11(14-15):1871–1879, 1992.
[22] Yan Li, Bhanukiran Vinzamuri, and Chandan K. Reddy. Regularized weighted linear regression for high-dimensional censored data. In Proceedings of the 2016 SIAM International Conference on Data Mining, pages 45–53. SIAM, 2016.
[23] Elia Biganzoli, Patrizia Boracchi, and Ettore Marubini. A general framework for neural network models on censored survival data. Neural Networks, 15(2):209–218, 2002.
[24] Wush Wu, Mi-Yen Yeh, and Ming-Syan Chen. Deep censored learning of the winning price in the real time bidding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2526–2535. ACM, 2018.
[25] Arthur Lewbel and Oliver Linton. Nonparametric censored and truncated regression. Econometrica, 70(2):765–779, 2002.
[26] Eoin O'Mahony and David B. Shmoys. Data analysis and optimization for (Citi)Bike sharing. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, pages 687–694. AAAI Press, 2015.
[27] Szymon Albiński, Pirmin Fontaine, and Stefan Minner. Performance analysis of a hybrid bike sharing system: A service-level-based approach under censored demand observations. Transportation Research Part E: Logistics and Transportation Review, 116:59–69, 2018.
[28] Chong Goh, Chiwei Yan, and Patrick Jaillet. Estimating primary demand in bike-sharing systems. SSRN Electronic Journal, 2019.
[29] N. Jian, D. Freund, H. M. Wiberg, and S. G. Henderson. Simulation optimization for a large-scale bike-sharing system. In Proceedings of the 2016 Winter Simulation Conference, pages 602–613, 2016.
[30] Daniel Freund, Shane G. Henderson, and David B. Shmoys. Bike sharing. In Ming Hu, editor, Sharing Economy: Making Supply Meet Demand, pages 435–459. Springer International Publishing, Cham, 2019.
[31] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
[32] Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Technical report, University of Cambridge, 2015.
[33] Filipe Rodrigues, Stanislav S. Borysov, Bernardete Ribeiro, and Francisco Camara Pereira. A Bayesian additive model for understanding public transport usage in special events. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11):2113–2126, 2017.
[34] Tsuyoshi Ide and Sei Kato. Travel-time prediction using Gaussian process regression: A trajectory-based approach. In Proceedings of the 2009 SIAM International Conference on Data Mining, pages 1183–1194, 2009.
[35] Siyuan Liu, Yisong Yue, and Ramayya Krishnan. Adaptive collective routing using Gaussian process dynamic congestion models. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, pages 704–712, New York, NY, USA, 2013. ACM.
[36] Filipe Rodrigues and Francisco Camara Pereira. Heteroscedastic Gaussian processes for uncertainty modeling in large-scale crowdsourced traffic data. Transportation Research Part C: Emerging Technologies, 95:636–651, 2018.
[37] Filipe Rodrigues, Kristian Henrickson, and Francisco C. Pereira. Multi-output Gaussian processes for crowdsourced traffic data imputation. IEEE Transactions on Intelligent Transportation Systems, 20(2):594–603, 2019.
[38] Yuanchang Xie, Kaiguang Zhao, Ying Sun, and Dawei Chen. Gaussian processes for short-term traffic volume forecasting. Transportation Research Record, 2165(1):69–78, 2010.
[39] Milton Abramowitz and Irene A. Stegun, editors. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, sec. 9.6. Dover Publications, Inc., New York, 1965.
[40] Thomas P. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 362–369. Morgan Kaufmann Publishers Inc., 2001.
[41] Perry Groot and Peter J. F. Lucas. Gaussian process regression with censored data using expectation propagation. In Proceedings of PGM 2012: the Sixth European Workshop on Probabilistic Graphical Models, 2012.
[42] B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society, Series B (Methodological), 47(1):1–52, 1985.
[43] DTU. Weather data (02-03-2018/13-03-2019), 2019. DTU Climate Station, Department of Civil Engineering, Technical University of Denmark.
[44] Brian Donovan and Dan Work. New York City taxi & limousine data. https://uofi.app.box.com/v/NYCtaxidata/folder/2332218797, 2014. Accessed: 2019-Apr-29.
A Appendix
A.1 Evaluation Measures
Models in our experiments are evaluated with the following measures:

$$R^2 = 1 - \frac{\sum_{i=1}^{m} (\hat{\mu}_i - y^*_i)^2}{\sum_{i=1}^{m} (\bar{y}^* - y^*_i)^2}, \qquad (30)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (\hat{\mu}_i - y^*_i)^2}, \qquad (31)$$

where $y^*_1, \dots, y^*_m$ are the true demand values, $\hat{\mu}_1, \dots, \hat{\mu}_m$ are the corresponding evaluations of the posterior mean, and

$$\bar{y}^* = \frac{1}{m} \sum_{i=1}^{m} y^*_i \qquad (32)$$

is the mean true demand.

A.2 Experimental Results for Section 4.4
Tables 2, 3, 4 hereby provide the results for the experiments of Section 4.4, with best values highlighted in bold. Each table reports RMSE and $R^2$ for NCGP, NCGP-A and CGP at censorship intensities 0, 0.1, 0.2, ..., 1, evaluated both on the entire dataset and on only the non-censored observations.

Table 2: Model Performance for Super-hub 1.

Table 3: Model Performance for Super-hub 2.

Table 4: Model Performance for Super-hub 3.
A.3 Experimental Results for Section 4.5
Table 5 hereby details the experimental results for Section 4.5, with best values highlighted in bold.
Table 5: Model Performance in RandDropoff Experiments. The table reports RMSE and $R^2$ for NCGP, NCGP-A and CGP, evaluated both on the entire dataset and on only the non-censored observations, for $\gamma \in \{0.1, 0.2, 0.3, 0.4\}$ and $c \in \{0, 0.1, \dots, 1\}$.