A Comparison of Statistical and Machine Learning Algorithms for Predicting Rents in the San Francisco Bay Area
Paul Waddell [email protected]
Arezoo Besharati-Zadeh [email protected]
2020
Abstract
Urban transportation and land use models have used theory and statistical modeling methods to develop model systems that are useful in planning applications. Machine learning methods have been considered too 'black box', lacking interpretability, and their use has been limited within the land use and transportation modeling literature. We present a use case in which predictive accuracy is of primary importance, and compare the use of random forest regression to multiple regression using ordinary least squares, to predict rents per square foot in the San Francisco Bay Area using a large volume of rental listings scraped from the Craigslist website. We find that we are able to obtain useful predictions from both models using almost exclusively local accessibility variables, though the predictive accuracy of the random forest model is substantially higher.
Keywords: Modeling, Hedonic, Machine Learning, Random Forest

Introduction
Development of urban transportation and land use models has traditionally relied extensively on domain knowledge, theory, and statistical methods such as multiple regression and discrete choice models. Although machine learning methods have been available for many years and demonstrated to produce more accurate predictions than statistical models such as multiple regression, they have not been widely adopted within the urban modeling literature. One of the main reasons for this is that in research using statistical models (whether frequentist or Bayesian), the applications are often motivated by a need to be able to interpret the coefficients of the model within the context of domain theory about their sign and significance. By contrast, researchers used to statistical modeling paradigms have been wary about the perceived lack of interpretability of models developed using machine learning methods such as neural networks. Further, models developed for planning or policy applications are often motivated by a need to undertake counter-factual analysis of the potential impacts of different policy inputs, in order to undertake ex-ante evaluation of the policies. This requires some degree of causal inference, or at least a model with a theoretical structure that the researcher can argue is suitable for counter-factual analysis. By contrast, again, machine learning methods tend to focus on maximizing predictive accuracy rather than on counter-factual analysis for policy or planning.

In this paper we examine a use case that lends itself to the use of machine learning methods, since the predictions are used mainly to bootstrap a structural model.
The application is hedonic modeling of rents, to be used as starting values for a model that is a structural microsimulation of demand and supply of housing, and which incorporates a short-term market clearing component that adjusts prices until the demand for housing would clear all submarkets, meaning that predicted demand is less than or equal to available supply in all submarkets. For this purpose, it is valuable to obtain the most accurate possible initial prediction of rents or prices, since that predicted value will influence the demand predictions, and a poor prediction of prices or rents will generate a lower quality prediction of demand. If the estimated parameters of the demand model were sufficiently robust with respect to price and amenities of housing, then one might hope that the market clearing algorithm would adjust prices to more accurately reflect true demand. But in the presence of poor predictions of prices, one might have less confidence that the estimated parameters of the demand model are sufficiently robust. More accurate price and rent predictions should help achieve robust estimation results from the demand model, and more efficient convergence of the market clearing algorithm.

We develop a hedonic regression model of rent per square foot, first using Ordinary Least Squares regression [8], and subsequently using random forest regression, a decision tree method within machine learning [4, 3]. The literature on hedonic modeling of housing prices is voluminous, dating to at least the work of Griliches on the automobile market [9], and early application to modeling housing rents [7]. The theoretical formulation of hedonic modeling is generally attributed to Rosen [13], and is grounded in Lancaster's theory of consumer demand [11].
Housing prices and rents have also been examined previously using random forest regression, and compared to multiple regression, for example in the context of Ljubljana, Slovenia [5], and broader comparisons of multiple regression and random forest regression for evaluating variable importance are also available [10]. Our paper contributes to the small emerging literature on the use of machine learning methods such as random forest for analyzing housing prices and rents in the context of land use and transportation modeling. It is also novel in using volunteered geographic information from Craigslist rental listings, leveraging prior work to scrape rental listings [2].
The context of this study is the San Francisco Bay Area, with a population of over seven million and encompassing over one hundred municipalities across nine counties. It is home to Silicon Valley and, owing in part to its robust technology sector, it is the most expensive metropolitan housing market in the United States.

Table 1: Statistical Profile of Variables After Clipping Outlier Values

variable             count    mean      std       min    25%       50%       75%       max
rent_sqft            363010   3.0       1.0       0.0    2.0       3.0       4.0       11.0
res_sqft_per_unit    363010   994.0     430.0     212.0  710.0     904.0     1150.0    3600.0
units_500_walk       363010   664.0     662.0     0.0    193.0     437.0     876.0     2317.0
sqft_unit_500_walk   363010   1455.0    712.0     0.0    1059.0    1436.0    1803.0    3699.0
rich_500_walk        363010   133.0     148.0     0.0    27.0      81.0      166.0     528.0
singles_500_walk     363010   201.0     254.0     0.0    35.0      101.0     228.0     868.0
elderly_hh_500_walk  363010   92.0      102.0     0.0    21.0      56.0      117.0     363.0
children_500_walk    363010   226.0     189.0     0.0    79.0      186.0     327.0     755.0
jobs_500_walk        363010   759.0     1295.0    0.0    43.0      220.0     748.0     5247.0
jobs_1500_walk       363010   6589.0    8770.0    0.0    1206.0    3110.0    7220.0    32501.0
jobs_10000           363010   165285.0  117970.0  0.0    74380.0   127551.0  236962.0  412326.0
jobs_25000           363010   498022.0  229898.0  37.0   322181.0  584284.0  696465.0  787748.0
pop_10000            363010   333207.0  191209.0  0.0    183445.0  300216.0  459446.0  763247.0
pop_black_10000      363010   14010.0   18451.0   0.0    2709.0    5754.0    20794.0   90219.0
pop_hisp_10000       363010   57468.0   42489.0   0.0    27776.0   45772.0   81072.0   201053.0
pop_asian_10000      363010   106511.0  77819.0   0.0    37199.0   93097.0   175019.0  282688.0
Data

For this study we scraped Craigslist rental listings from November 2016 through July 2018, and filtered and cleaned the data, adapting the methods used in [2]. The result of the data collection and cleaning yielded over 350,000 rental listings which contained information on the listing date, the location (latitude, longitude), asking rent, square footage, number of bedrooms, and number of bathrooms. Since the objective of this project was to generate a rental model that could be used in an integrated microsimulation of the Bay Area real estate market at a building level, we used only the location, rent, and square footage information from the listings.

To augment the listing attributes we developed a series of locational attributes and associated them with the listings data. We employed street networks representing the walking network and a driving network containing tertiary streets and higher capacity roads to measure accessibility, using the OSMnx library [1] to create and clean the networks, and the Pandana library [6] to compute localized accessibility measures. We developed a synthetic population using Synthpop, a library adapted from PopGen [16]. Data for parcels, buildings, and employment by address were obtained for the 9-county Bay Area from the Metropolitan Transportation Commission. We computed a series of localized accessibility measures on the walk and drive networks to provide localized and more regional context measures. Each listing was assigned to its nearest node on the two networks, and each parcel and building was similarly assigned to its nearest node on each of the two networks. The localized measures were generally computed within 500 meters as network distances, either as a simple sum or an average of the variables of interest.
From the buildings database we computed the average residential square feet per residential unit for those buildings containing residential units. We used the following attributes of households from the synthetic population: household income, household size, age of householder, presence of children, race of householder, and an indicator of whether the householder was Hispanic. We also used data on jobs by location to compute accessibilities to employment within 500 meters, 1500 meters, 10 kilometers, and 25 kilometers, all measured as distances on the network, with those 3 kilometers or below measured on a walking network, and those above 3 kilometers measured on a driving network with tertiary streets or higher.
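The localized accessibility measures above are computed with Pandana along the street networks. As a self-contained illustration of the idea (not the paper's actual pipeline), the sketch below sums a point attribute within a fixed radius of each listing, using straight-line rather than network distance; all names and data are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def radius_sums(listing_xy, poi_xy, poi_values, radius):
    """Sum poi_values within `radius` (same units as the coordinates) of each listing.

    Simplified stand-in for Pandana's network aggregations: it uses Euclidean
    distance rather than shortest-path distance along the street network.
    """
    tree = cKDTree(poi_xy)
    neighbors = tree.query_ball_point(listing_xy, r=radius)
    return np.array([poi_values[idx].sum() for idx in neighbors])

# Toy example: jobs within 500 m of two listings (coordinates in meters).
listings = np.array([[0.0, 0.0], [2000.0, 0.0]])
jobs_xy = np.array([[100.0, 0.0], [300.0, 300.0], [2100.0, 50.0]])
jobs = np.array([50.0, 20.0, 10.0])
print(radius_sums(listings, jobs_xy, jobs, 500.0))  # [70. 10.]
```

A network-based version would substitute shortest-path distances on the walk or drive graph for the Euclidean query, which is exactly the service Pandana provides at scale.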
Methods
Prior to estimating the models on training subsamples of our data, we examined the data to identify potential problems that might adversely impact the quality of the models. In particular, outliers are well known to influence model parameters in ordinary least squares regression, so we used the Pandas clip function to recode values above the 99th percentile on all accessibility variables in order to reduce their impact on the model.

We also examined the distribution of the variables, and used log transformations to normalize them. Most had significant skewness prior to the transformation. We did not, for purposes of this paper, undertake further data cleaning, in order to simplify the exposition and focus on the main objective of the paper: to compare two very different methods to predict rental prices.
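The clipping and log-transformation steps can be sketched as follows; the column name and data are hypothetical, but the pattern mirrors the pandas clip call described above.

```python
import numpy as np
import pandas as pd

# Hypothetical accessibility column with a long right tail.
df = pd.DataFrame({"jobs_500_walk": [0.0, 40.0, 220.0, 750.0, 5247.0, 90000.0]})

# Recode values above the 99th percentile using pandas clip,
# reducing the influence of outliers on the OLS estimates.
upper = df["jobs_500_walk"].quantile(0.99)
df["jobs_500_walk"] = df["jobs_500_walk"].clip(upper=upper)

# Log-transform to reduce skewness; log1p handles the zeros that occur
# in several accessibility variables (see the minimums in Table 1).
df["log_jobs_500_walk"] = np.log1p(df["jobs_500_walk"])
print(df["jobs_500_walk"].max() <= upper)  # True
```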
The hedonic price model we estimate using the Ordinary Least Squares (OLS) method of minimizing the sum of the squared errors is represented as Y = f(Sβ, Nγ) + ε, where Y is a vector of rents per square foot in our rental listings, S is a vector of structural characteristics, and N is a vector of neighborhood and accessibility characteristics surrounding each rental listing. This can be specified as a linear model Y = Xβ + ε, with ε assumed to be independently and identically distributed (iid). Using this assumption on the distribution of ε, we can estimate the β parameters using OLS by computing β̂ = (XᵀX)⁻¹XᵀY. We use the StatsModels Python library [14] to compute the model estimation and predictions for model evaluation.

The development of random forests as an ensemble classification and regression approach by Breiman [3] has produced considerable interest owing to its robust predictive capabilities and minimal tuning requirements. We summarize the method here using the exposition of [15]. A random forest is a collection of tree predictors h(x; θ_k), k = 1, ..., K, where x is an input vector of length p, and the random vectors X and θ_k are independently and identically distributed (iid). We subset the data into a training and a testing subsample, and take independent draws from the training data, which is a joint distribution of (X, Y). The random forest regression prediction is an average over the collection: h̄(x) = (1/K) Σ_{k=1}^K h(x; θ_k). We use the Scikit-Learn library [12] to train the random forest model on our data.

Results

In this section we present results from estimating the multiple regression model using ordinary least squares, and training the random forest regression. One of the standard practices used in machine learning is to split the observed data into training and testing samples, and to use only the training sample to train the model.
Both for consistency, and to adopt the valuable practice of out-of-sample validation, we use the sameapproach for the OLS regression as we do for training the random forest regression: we split the observeddata, using two thirds of the data for estimation (training) and separating one third of the data to use forout-of-sample prediction and cross-validation.
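A minimal sketch of this estimation strategy, using synthetic data with known coefficients in place of the listings (all values hypothetical): split the data two-thirds/one-third, estimate β̂ by least squares on the training share, and evaluate out of sample.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the listings: a log-log relationship with known
# coefficients and a sample size matching Table 1.
n = 363_010
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([2.0, -0.5, 0.3])
y = X @ beta_true + rng.normal(scale=0.67, size=n)

# Two-thirds for estimation (training), one-third held out, as in the paper.
perm = rng.permutation(n)
cut = (2 * n) // 3
train, test = perm[:cut], perm[cut:]

# OLS: beta_hat = (X'X)^{-1} X'y, computed via a stable least-squares solver.
beta_hat, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

# Out-of-sample RMSE on the held-out third.
rmse = np.sqrt(np.mean((y[test] - X[test] @ beta_hat) ** 2))
print(len(train), len(test))  # 242006 121004
```

In the paper the estimation itself is done with StatsModels; the closed-form solution above is the same estimator computed by hand.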
We used the training sample of 242,000 rental listings to estimate a hedonic model of rent per square foot from the Craigslist rental listings data collected from November 2016 through July 2018 for the San Francisco Bay Area. As is common in the literature, we used a log transformation of the dependent variable, and also log-transformed all explanatory variables.

Table 2: Results of OLS Estimation

The estimated model appears to fit the training data reasonably well considering that the only information included about the unit besides accessibility variables is its square footage. Even with such a limited number of attributes, the model has an Adjusted R-squared of 0.63. The Mean Squared Error is 0.45, and the RMSE is 0.67. Key variables have the correct sign and are significant. For clarity of exposition we will not repeat that all variables are log transformed, which also lends itself to a straightforward interpretation of the coefficients as elasticities, with a one percent change in an explanatory variable being associated with a percentage change in the rent per square foot as indicated by its estimated coefficient.

Keeping in mind that the dependent variable is expressed as a monthly rent per square foot, the size of the unit in square feet is negatively associated with rent per square foot, consistent with a diminishing marginal utility as square footage increases. The density of housing within 1/2 kilometer is positively associated with rent per square foot, and the size of units in square feet within 1/2 kilometer is negatively correlated with rent per square foot. The number of households within 1/2 kilometer with incomes above $

Random Forest

Figure 1: One tree from random forest ensemble of regression trees

We turn next to the random forest model and examine its training results on the same training dataset, using the same variables used in the OLS regression.
Results from random forest training are quite different from those from OLS estimation, since the underlying algorithms are fundamentally different. Random forest leverages regression trees, and averages over an ensemble of regression trees to make its predictions. The depth of the trees and the number of samples drawn to generate different trees are controlled by the researcher, and provide a means to tune the degree to which the model trades off bias and variance, in order to avoid over-fitting.

Figure 1 represents one regression tree of the ensemble of regression trees generated by random forest on the training dataset. It is included to help illustrate how regression trees use decision trees to split variables recursively in order to capture nonlinear patterns within the data.

Figure 2: Variable Importance Ranking from Random Forest Regression

Figure 2 depicts the relative importance of the most important variables contributing to its predictions. It is not directly comparable to OLS coefficients, but it does provide insight into which variables most influence the prediction, and in this way is more interpretable than some machine learning methods. In terms of fit to the training data, the random forest model has an accuracy score on the training data of 0.96, which is comparable to the R-squared of OLS. The Mean Squared Error on the training data is 0.0, and the RMSE is 0.02.
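A small sketch of random forest training with Scikit-Learn on hypothetical data (not the paper's actual model), showing the depth and ensemble-size tuning knobs discussed above, and the feature importances behind a ranking like Figure 2.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
# Nonlinear response; the third column is irrelevant noise.
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=n)

# Tree depth and the number of trees are the main knobs for trading off
# bias and variance, as described in the text.
rf = RandomForestRegressor(n_estimators=100, max_depth=8, random_state=0)
rf.fit(X, y)

# feature_importances_ provides the relative-importance ranking; the
# irrelevant third feature should receive importance near zero.
print(rf.feature_importances_.round(2))
```

The importances sum to one, so they are read as relative shares of the reduction in prediction error, not as marginal effects like OLS coefficients.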
Out-of-Sample Validation

In this section we examine how both models performed when the estimated (trained) model is applied to a set of observations not used in training the models. This practice is commonly used to evaluate whether over-fitting has occurred, a situation in which the model is excessively tuned to the training dataset and does not perform very well when it is generalized to other data. For this purpose, we use the 1/3 of the original data that was split into a testing dataset. It shares no observations with the training data.

The first result we examine is a comparison of the distribution of the residuals when we predict the model on these new data and compare the predicted values to the observed values of rent per square foot. These results are shown in Figure 3. While both plots indicate little to no bias in the models, the random forest plot clearly shows a much lower variance of the residuals, indicating a superior set of out-of-sample predictions compared to the predictions from OLS.

(a) Ordinary Least Squares (b) Random Forest
Figure 3: Distribution of Residuals of OLS and Random Forest

Figure 4 displays predicted values plotted against observed values for the test dataset for both models. Again it is clear that the random forest model fits the test data much better than the predictions of the model estimated with OLS. The predictions of random forest map closely to the 45 degree line of a perfect fit, and show no signs of distortion at lower or higher values of the observed data. By contrast, the OLS predictions show some artifacts, with a cloud of points being over-predicted at low ranges of observed rent per square foot, while at high ranges of observed rent per square foot the OLS predictions appear to be skewed downwards, towards under-prediction. The higher dispersion of the errors in the OLS predictions is also evident.

(a) Ordinary Least Squares (b) Random Forest
Figure 4: Predicted vs Observed Values for OLS and Random Forest

Figure 5 displays the results by plotting the residuals from each model against the predicted values from that model for the testing dataset. This plot can be useful in detecting nonlinear patterns in the errors over the range of the predictions. In this case the patterns that emerged in Figure 4 are still in evidence, with a significant portion of the cloud of points in the OLS plot spreading and drifting downwards at higher predicted values. By contrast, the pattern of residuals vs predicted values is much tighter and does not exhibit a comparable drift in the random forest results.

(a) Ordinary Least Squares (b) Random Forest
Figure 5: Residuals vs Predicted Values for OLS and Random Forest

The final assessment of the residuals from our analysis of the testing dataset is to map the spatial pattern of the residuals from both models to examine them for spatial clustering. In Figure 6 we see that there are clusters of under-prediction (blues) in the core of San Francisco and in the Silicon Valley area, and over-predictions in parts of Oakland and Berkeley and in parts of the South Bay, probably reflecting omitted variables and non-linearities we were not able to capture in the OLS model. By contrast, the map of residuals from the random forest regression appears random, lacking obvious clustering of under- or over-predictions.

(a) Ordinary Least Squares (b) Random Forest
Figure 6: Spatial Pattern of Residuals for OLS and Random Forest
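The out-of-sample comparison in this section can be sketched end to end on synthetic data (hypothetical, not the Craigslist listings): a nonlinear data-generating process, a two-thirds/one-third split, and a comparison of the held-out residuals from OLS and random forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 3000
X = rng.normal(size=(n, 2))
# A nonlinear data-generating process that a linear model cannot capture.
y = np.sin(2 * X[:, 0]) + X[:, 1] + 0.1 * rng.normal(size=n)

cut = (2 * n) // 3  # two-thirds train, one-third test, as in the paper
X_tr, y_tr, X_te, y_te = X[:cut], y[:cut], X[cut:], y[cut:]

ols = LinearRegression().fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Out-of-sample residuals: both are roughly centered on zero (little bias),
# but the random forest residuals have the smaller spread.
res_ols = y_te - ols.predict(X_te)
res_rf = y_te - rf.predict(X_te)
print(res_ols.std() > res_rf.std())  # True
```

On data of this kind, the qualitative pattern matches Figures 3 through 5: comparable centering, much tighter dispersion for the random forest.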
Conclusions

The most widely used method for undertaking hedonic regression modeling of housing prices and rents, multiple linear regression estimated using ordinary least squares, has a clear advantage over random forest regression in the interpretability of the estimated model coefficients. If the objective of the application is to evaluate the impact of a specific variable of focus, as is often the case in the hedonic regression literature, then the lower predictive quality of OLS compared to random forest regression is an appropriate trade-off to make. However, there are other applications in which predictive accuracy is more important than the ability to interpret a specific coefficient, and in such cases, machine learning methods that have higher predictive accuracy than models estimated with OLS are worth considering. The random forest algorithm was designed in a way that overcomes several limitations of ordinary least squares regression [3]: 1) it is designed to handle non-linear relationships between the dependent and independent variables; 2) it is invariant to scaling or translation; and 3) it is robust to irrelevant or highly correlated variables.

Our intended application for the hedonic model is to use the predicted rents as initial values for an integrated land use and transportation model system that is structural. Hedonic regression models are by construction reduced form models that provide an estimate of the partial influence of a number of independent variables on housing prices or rents, under the assumption that the market is in equilibrium.
However, it may be an appropriate starting point for a structural model in which we predict the demand for each location using discrete choice models with predicted rents on the right hand side, and evaluate the total demand at each location by summing the predicted location probabilities across all choosers at each location, and then iteratively adjusting the rents to account for demand-supply imbalances, to reflect short-term disequilibrium in housing markets. The second and related use of the rent predictions is as an input into real estate supply models that use pro forma financial models to predict the development profitability or feasibility of a site, given the costs of developing the site and the expected revenue from constructing a project. The revenue expectation is heavily informed by predicted rents, and if these predictions are poor, then the quality of the supply model is adversely affected. In this context, improving the predictive accuracy of the hedonic model is mainly motivated by the need to improve inputs to the demand and supply models, and not so much by its use to evaluate specific coefficients and interpret them. This approach puts hedonic regression into a support role within a structural microsimulation of demand and supply. Within contexts such as the use case outlined here, and in most other applications where the accuracy of predictions is more important than the interpretability of a single coefficient, it would appear that machine learning methods such as random forest and the closely related gradient boosting algorithms may have substantial value in improving models.

Finally, we close with a reiteration of our main substantive finding: using only a modest number of local accessibility variables, in addition to the square footage of the rental unit as the only unit- or building-specific attribute, both of the methods used in this study are able to predict rents per square foot with a high degree of accuracy in out-of-sample predictions.
Local density, social composition, and job accessibility are powerful explanatory factors.
Author Contributions

The authors confirm contribution to the paper as follows: study conception and design: Paul Waddell; model development using OLS: Paul Waddell; model development using machine learning: Arezoo Besharati-Zadeh; analysis and interpretation of results: Paul Waddell and Arezoo Besharati-Zadeh; draft manuscript preparation: Paul Waddell. All authors reviewed the results and approved the final version of the manuscript.
References

[1] Geoff Boeing. OSMnx: New methods for acquiring, constructing, analyzing, and visualizing complex street networks. Computers, Environment and Urban Systems, 65:126-139, 2017.

[2] Geoff Boeing and Paul Waddell. New Insights into Rental Housing Markets across the United States: Web Scraping and Analyzing Craigslist Rental Listings. Journal of Planning Education and Research, 37(4):457-476, 2017.

[3] Leo Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.

[4] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Chapman & Hall/CRC, 1984.

[5] Marjan Čeh, Milan Kilibarda, Anka Lisec, and Branislav Bajat. Estimating the Performance of Random Forest versus Multiple Regression for Predicting Prices of the Apartments. ISPRS International Journal of Geo-Information, 7(5):168, 2018.

[6] Fletcher Foti, Paul Waddell, and Dennis Luxen. A generalized computational framework for accessibility: from the pedestrian to the metropolitan scale. In Proceedings of the 4th TRB Conference on Innovations in Travel Modeling. Transportation Research Board, 2012.

[7] Robert Gillingham and David Lund. A Hedonic Approach to Rent Determination. In Proceedings of the Business and Economic Statistics Section of the American Statistical Association, volume 69, pages 184-192, 1970.

[8] William H. Greene. Econometric Analysis. Pearson Education, 5th edition, 2002.

[9] Zvi Griliches. Hedonic price indexes for automobiles: An econometric analysis of quality change. In The Price Statistics of the Federal Government, pages 173-196. NBER, 1961.

[10] Ulrike Grömping. Variable Importance Assessment in Regression: Linear Regression versus Random Forest. The American Statistician, 63(4):308-319, 2009.

[11] Kelvin J. Lancaster. A New Approach to Consumer Theory. Journal of Political Economy, 74(2):132-157, 1966.

[12] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

[13] S. Rosen. Hedonic Prices and Implicit Markets: Product Differentiation in Pure Competition. Journal of Political Economy, 82(1):34-55, 1974.

[14] Skipper Seabold and Josef Perktold. Statsmodels: Econometric and statistical modeling with Python. In