Developing a real estate yield investment device using granular data and machine learning
Monica Azqueta-Gavaldon, Gonzalo Azqueta-Gavaldon, Inigo Azqueta-Gavaldon, Andres Azqueta-Gavaldon
Azqueta-Gavaldon, Monica † AstraZeneca Computational Pathology
Azqueta-Gavaldon, Gonzalo ‡ University of Strathclyde
Azqueta-Gavaldon, Inigo § Technische Universität München
Azqueta-Gavaldon, Andres ¶ University of Glasgow
August 7, 2020
Abstract
This project aims at creating an investment device to help investors determine which real estate units have a higher return to investment in Madrid. To do so, we gather data from Idealista.com, a real estate web-page with millions of real estate units across Spain, Italy and Portugal. In this preliminary version, we present the road map on how we gather the data; descriptive statistics of the 8,121 real estate units gathered (rental and sale); build a return index based on the difference in prices of rental and sale units (per neighborhood and size); and introduce machine learning algorithms for rental real estate price prediction.

Keywords:
Investment device, Real Estate, Webscraping, Machine Learning
JEL classifications:
C44; C58; L85; R31

∗ Acknowledgments: We thank Diego Azqueta-Oyarzun and Guillermina Gavaldon Hernandez for valuable comments. We alone are responsible for any errors. † E-mail: [email protected] ‡ E-mail: [email protected] § E-mail: [email protected] ¶ E-mail: [email protected]

Introduction
The rising uncertainties in the current and future economic outlook, and low interest rates likely to remain so for the next couple of years, often lead to a low and risky return to investment. Moreover, rising inequalities often lead rental prices to increase by more than sale prices, which offers a higher rental yield (the return earned when renting out a purchased property). Recent research shows that this yield ranges from 4.40% to 5.15% in Barcelona and Madrid and, most importantly, that it has been increasing during the last couple of years ([Delmendo, 2020]). This preliminary work examines the rental yield of several real estate units in Madrid by making use of information gathered from Idealista.com and machine learning algorithms.

Real estate value is usually estimated by taking into account factors such as the size of a property, its location, the pricing of similar neighboring properties, etc. A would-be buyer or seller is thus influenced and restricted by the available real estate information that can be retrieved. Knowing the exact value of a property is paramount for all the parties involved in its transaction, and basing it on the available information alone can lead to biases. A seller can, for instance, dictate a property's value based on its size, amenities, number of rooms, bathrooms, etc. However, many latent factors (factors that are hard to take into account) influence the market value of the property, such as views, neighborhood appeal, price by area, etc. With large amounts of real estate data and using statistical and novel machine learning algorithms, these latent features can be used to overcome biases and generate a more accurate picture of a property's value.

In Spain, there is a lower proportion of people living in rental houses compared to other EU members. However, in recent years there has been an increasing trend of living in rental houses rather than in owned property.
More specifically, there has been an increase in medium to long term rental contracts, that is, non-tourist tenants, as the work of [Lopez Rodriguez and Matea, 2019] shows. This increase has been especially strong in cities like Barcelona and Madrid. The estimation of a property's rental price thus gains importance given this trend.

In our work, we first present how we obtain and prepare (merge and clean) the data from Spain's biggest real estate portal, Idealista.com. We then offer an overview of the data and variables that we obtained, and produce a rental yield index for each neighborhood and property size in Madrid. To do so, we use the rental prices and the most likely mortgage payment for the sale units. In other words, we present a method to evaluate the profitability of different neighborhoods and property sizes in Madrid based on the average purchase and average rental prices. We find that the highest index can be found in the neighborhood of Opanel (south-west of Madrid city centre) across those units between 30 to 60 square meters. When it comes to bigger apartments, those between 60 to 90 square meters, we find that the neighborhood of Los Angeles (south side of Madrid) displays the highest yield.

We then present different machine learning algorithms trained to predict rental prices and evaluate them. As a benchmark we use multivariate linear regressions, which can explain around 62% of the variance of the dependent variable, rental price, by including only three variables: the size, whether or not the apartment is exterior, and the floor number. Once we include all available variables, the R² rises to 0.88 and the Root Mean Square Error (the error between predicted values and actual values) comes to 359 euros. We then test Random Forests and support vector regression (SVR) algorithms; the latter is a common sophisticated machine learning algorithm which uses properties similar to support vector machines for classification to predict the values.
Preliminary results show the advantage of using Random Forests and SVR for complex models that are more likely to suffer from non-linear relationships between the explanatory variables and the dependent one.

The rest of the paper proceeds as follows: the next section offers a description of the related literature. Section 3 describes the data and methods used throughout this work. Section 4 introduces a neighborhood return index. Section 5 shows the rental price estimation using multivariate linear regression and support vector regressions (SVR) across four different models, and Section 6 offers a preliminary conclusion and steps towards future work on developing the complete index.

Related literature
With the ever increasing amount of information about real estate available online, price prediction of property has become an interesting topic of investigation in recent years. Developments in machine learning have also enabled such predictions to be made faster and more accurately.

In their work, [Ma et al., 2018] use a dataset of 2,462 warehouse listings in the area of Beijing. Each entry of the dataset contains information about the location, size, distance to the city center and second-hand house price based on location. With this labelled data, they train four different machine learning algorithms to predict the price of unseen data. From the four models, Linear Regression, Regression Tree, Random Forest Regression and Gradient Boosting Regression Trees, Random Forest Regression yielded the best performance. Feature importance describes which information or variables had the greatest impact on predicting the warehouse price. They find that distance from the city centre is the most influential variable, followed by the size of the warehouses and the price of nearby houses.

In order to determine whether the price of real estate displays non-linear relationships with the independent variables, [Limsombunchai, 2004] predicts house prices using two models, a Neural Network and a hedonic regression model. They use a set of 200 houses with information such as their size, age, number of rooms, bathrooms, toilets, garages and the availability of amenities in their vicinity. They show that a Neural Network yields more accurate results than the hedonic regression model. This finding is also corroborated in the work of [Selim, 2009], which also compares a hedonic regression model with a Neural Network and finds the latter to be more accurate. A reason for this is that there exists heteroscedasticity between house price and the independent variables.
This non-linear behavior affects the quality of the predictions and, as [Ma et al., 2018] also show, makes non-linear models better suited for such tasks. Similar results are also shown by [Tabales et al., 2013] and [Hamzaoui and Perez, 2011].

Moreover, there is a number of studies that make use of data from Idealista.com. For example, [Casas-Rosal et al., 2018] download data from this website to analyse the evolution of the supply side of the real estate, commercial premises and industrial warehouses markets between November 1st, 2016 and May 1st, 2017 in the city of Cordoba, Spain. Later on, [Casas-Rosal et al., 2019] introduce a software for statistical analysis of real estate units built using Java and R for the same city, Cordoba. Their interface displays variables such as real estate prices, the geographic location, and several of the characteristics contained in the web-page.

Besides, there is a number of studies that focus on real estate prices in an aggregate or macroeconomic set-up. For example, [Hott and Monnin, 2008] estimate prices based on models of a no-arbitrage condition between renting and buying for the USA, UK, Japan, Switzerland and the Netherlands. They find that observed prices deviate substantially and for long periods from their estimated fundamental values in the short run. [Born and Pyhrr, 1994] make use of cycle valuation models that use aggregate cycle measures such as demand and supply cycles, inflation cycles, or rent rate catch-up cycles to evaluate equilibrium real estate prices.
Data and methods

We use data from Idealista.com, a real estate platform offering timely data on rents and sales of real estate in Spain, Italy and Portugal. Due to the limited data available to download each month, we narrow down the search to properties in the area of Madrid, and exclude non-habitable premises, i.e. offices, garages, commercial premises and warehouses.
In order to download the real estate data, we use the API provided by Idealista. Idealista provides the user with a password and a key in order to be able to execute queries from their API. A URL is created for these queries, specifying several variables that define the characteristics of the search. These variables include operation (rent/sale), center (longitude and latitude coordinates), radius (size of radius of search from the center), type of property (house, flat, chalet), etc. Once the query has been executed, the API returns a JSON file with the retrieved information. An illustration of the road-map can be seen in Figure 1.
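The query variables described above can be sketched as follows. The parameter names and endpoint behaviour here are illustrative assumptions mirroring the description in the text, not the official Idealista API specification.

```python
# Hypothetical sketch of assembling an Idealista-style search query.
# Field names mirror the variables described in the text (operation,
# center, radius, property type) but are assumptions, not official fields.
def build_search_params(operation, lat, lon, radius_m, property_type):
    """Assemble the query variables that define the search."""
    return {
        "operation": operation,          # "rent" or "sale"
        "center": f"{lat},{lon}",        # latitude/longitude of the search centre
        "distance": radius_m,            # search radius from the centre, in metres
        "propertyType": property_type,   # e.g. "homes"
    }

# The Madrid search used in the text: a 60 km radius around (40.4167, -3.70325).
params = build_search_params("rent", 40.4167, -3.70325, 60_000, "homes")
```

In practice these parameters are attached to an authenticated HTTP request, and the JSON response is then parsed as described next.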
In order to be able to work with the extracted data, it must first be prepared and stored in a database. The JSON file downloaded from the API contains the information about all the data points (each individual property) as a long string, where each property, along with all its corresponding information, is enclosed in nested brackets to differentiate it from other data points. Using the
Regex module of Python to perform regular expressions, the JSON data is cleaned. Brackets, commas, spaces, tabs, new-line expressions, etc. are first removed, while expressions such as "u00f3" (which in this example represents the accentuated letter ó) are replaced by their corresponding non-accentuated letters. Once the data is cleaned, each property and its corresponding information is stored as a dictionary. This makes it easier to then use the Python module
Pandas to create a data-frame from a list of dictionaries containing all data entries. Having a data-frame containing the data facilitates its use for modelling and training purposes.

To acquire the data through the API, we set a point along the coordinates and the radius we want to reach. To focus on the Madrid area, we set the central point to 40.4167 and -3.70325, which corresponds to the city centre of the capital, and set a radius of 60 km. In total, we retrieve 8,121 houses in Madrid, of which 3,737 are on sale while 4,384 are for rent. Figure 2 shows the distribution in prices for the data extracted. The mean sale price is 970,000 euros, while the standard deviation is 1.02 million euros. Meanwhile, the average rental price in our sample is 1,912 euros, with a standard deviation of 1,683 euros.

Figure 4 displays a fraction of the houses analyzed here. To build the map we use the longitude and latitude of each house, which we can easily locate on the map using the leaflet library. Leaflet is one of the most popular open-source JavaScript libraries for interactive maps, where one can not only geo-locate objects but also access rich information about shops, schools, bus stations, gas stations, and a wide range of other services. Using the GeoJSON information about each neighborhood in Madrid, we can extract geo-information about specific areas and also visualize them, as shown in Figure 3b.

Figure 1: Transition from JSON file to data-frame
Notes:
Example of the pipeline used to transform JSON strings into data-frames. At the top, the JSON string returned by the API query. Below it, a list of properties extracted using regular expression operations. At the bottom, the data-frame created with the sorted information.
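The JSON-to-data-frame pipeline of Figure 1 can be sketched as below. The escape-sequence map and the field names are illustrative assumptions, not the exact API output.

```python
import re

import pandas as pd

# Illustrative sketch of the Figure 1 pipeline: clean the raw strings,
# store each property as a dictionary, and build a DataFrame from the list.
def clean(raw):
    """Replace escaped accented letters (e.g. "u00f3" for ó) by plain ones
    and collapse whitespace noise (tabs, new lines, repeated spaces)."""
    for esc, plain in {"u00e1": "a", "u00e9": "e", "u00ed": "i",
                       "u00f3": "o", "u00fa": "u"}.items():
        raw = raw.replace(esc, plain)
    return re.sub(r"\s+", " ", raw).strip()

# Each cleaned property becomes a dictionary; a list of them feeds pandas.
props = [{"address": clean("habitaciu00f3n  cu00e9ntrica"),
          "price": 1200, "size": 55}]
df = pd.DataFrame(props)
```

Having the data in a single DataFrame makes the later modelling and training steps straightforward.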
Figure 2: Distributions of prices and sizes

Panels: sale price in millions (n = 3,737); sale dataset size in m² (n = 3,737); rent price per month (n = 4,384); rent dataset size in m² (n = 4,384). Vertical axes show frequencies.

Notes: The top-left and bottom-left graphs show the distributions of sale and rental prices of houses, respectively. The top-right and bottom-right graphs show the size distributions of houses for sale and for rent, respectively.
Notes: (a) Example of a map of the city centre of Madrid where the blue pop-outs are a sample of the locations of some of the houses analysed, using the library leaflet. (b) Example of a map of Madrid where each neighborhood of the Madrid area is shaded in dark grey and delimited by dark continuous lines.

Neighborhood return
In this section, the estimation of the rent-to-mortgage ratio per neighborhood is presented. This ratio creates an indicator of house profitability, that is, the return of a purchased property in a certain area, given a specific mortgage and assuming the property is rented out. This index is thus calculated using two quantities: the monthly average mortgage per neighborhood, and the monthly average renting price per neighborhood. We take the standard mortgage formula and, taking into consideration the transaction costs and down payment, obtain:

M = [(P + c × P) − DP] × r(1 + r)^n / ((1 + r)^n − 1)

where M is the monthly mortgage sum; (P + c × P) is the total cost of the property: the total price of the house P plus the transaction costs (c × P); DP is the down payment (set to 30% of the price); r is the monthly interest rate; and n is the total number of months that the mortgage will last. In the benchmark set-up we set n = 360 and the transaction costs c to 6.7% of the total price. Following this formula we would obtain a monthly payment, over 30 years, of 423 euros for a unit of real estate that costs 150,800 euros (or 160,903 euros once the transaction costs are taken into account).

Our data contains houses that range from 30 m² to over 1,200 m², and there is some disparity in the distribution of the houses for sale and for rent with respect to size, as can be seen in Figure 2. In the graph of the size distribution of the houses for sale, it can be seen that the distribution does not drop as abruptly as the size increases as it does for the rental distribution. A reasonable assumption for this disparity is that it is more likely that a house in a wealthy suburb of Madrid with 1,200 m² is sold rather than rented out. Furthermore, house sale and rental prices can show heteroskedasticity with respect to size.
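The mortgage formula above translates directly into a small function. This is a minimal sketch: c, the down-payment share and n follow the benchmark set-up, while the default monthly rate r below is an illustrative placeholder, not the paper's benchmark value.

```python
def monthly_mortgage(price, c=0.067, dp_share=0.30, r=0.0015, n=360):
    """Monthly payment M = [(P + c*P) - DP] * r(1+r)^n / ((1+r)^n - 1).

    c: transaction costs as a share of the price (benchmark: 6.7%)
    dp_share: down payment as a share of the price (benchmark: 30%)
    r: monthly interest rate (illustrative placeholder)
    n: number of monthly payments (benchmark: 360, i.e. 30 years)
    """
    loan = (price + c * price) - dp_share * price  # financed amount
    return loan * r * (1 + r) ** n / ((1 + r) ** n - 1)

# For the 150,800-euro example in the text, the financed amount is
# 160,903.60 (price plus transaction costs) minus the 45,240 down payment.
payment = monthly_mortgage(150_800)
```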
Thus, the standard deviation across the entire size spectrum will be larger for houses for sale than for rent.

In order to account for these differences in the size distribution of property for sale and rent, we calculate the average mortgage paid per neighborhood and for a given size interval. For example, an instance of this calculation would be the average mortgage paid in the neighborhood of "Prosperidad" for all houses that have a size between 30 m² and 60 m², which is calculated using 36 samples and yields 1,581.86 euros per month. Then, the average rent per neighborhood and for the different size intervals is also calculated. In the previous example, the average monthly rent paid in "Prosperidad" for properties between 30 m² and 60 m² is 1,371.95 euros per month, calculated with 41 samples. By dividing the average monthly rent by the average monthly mortgage of the instances that belong to the same neighborhood and size interval, an index of the profitability or return of real estate in a given area and for a given size range is obtained. Using again the example of "Prosperidad" and properties of a size between 30 m² and 60 m², this index is 0.867. Generally, a value of 1 would be expected, since renting prices adjust to housing prices over time. Thus, the higher the index value, the higher the return of real estate, since there is a larger gap between the rental price and the mortgage payment. An index value smaller than 1, like the one in the example, points at an area with negative return.

Table 3 shows this index computed for each neighborhood and size, and Figure 4 shows the same index as a heat map for all neighborhoods of Madrid with available data and for four different size ranges. As can be seen from Figure 4, there are more houses in the market within the size range of 30 m² to 60 m². The supply of houses decreases for bigger size ranges, which can be seen in the grayed-out neighborhoods in the figure, where no data points were available.
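A sketch of this per-neighborhood, per-size-interval index with pandas; the column names and the toy "Prosperidad" frames below are illustrative, not the actual sample.

```python
import pandas as pd

# Neighborhood return index: average monthly rent divided by the average
# monthly mortgage within each (neighborhood, size interval) cell.
SIZE_BINS = [30, 60, 90, 120, 1200]  # m2 intervals used in the text

def return_index(rent_df, mortgage_df):
    rent_df = rent_df.assign(size_bin=pd.cut(rent_df["size"], SIZE_BINS))
    mortgage_df = mortgage_df.assign(size_bin=pd.cut(mortgage_df["size"], SIZE_BINS))
    avg_rent = rent_df.groupby(["neighborhood", "size_bin"],
                               observed=True)["rent"].mean()
    avg_mortgage = mortgage_df.groupby(["neighborhood", "size_bin"],
                                       observed=True)["mortgage"].mean()
    return avg_rent / avg_mortgage  # values above 1 signal a positive return

rents = pd.DataFrame({"neighborhood": ["Prosperidad"] * 2,
                      "size": [45, 50], "rent": [1371.95, 1371.95]})
mortgages = pd.DataFrame({"neighborhood": ["Prosperidad"] * 2,
                          "size": [40, 55], "mortgage": [1581.86, 1581.86]})
index = return_index(rents, mortgages)
```

With these toy figures the "Prosperidad" cell for 30-60 m² reproduces the 0.867 value discussed above.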
Figure 4: Sample of the proportion of houses' geo-location

Notes:
Profitability index shown for each neighborhood for different size intervals of real estate.

Rent price prediction models
In this section, we would like to explore a different approach to obtaining returns for each neighborhood. So far we have been looking at the differences between the average price of real estate for sale (and therefore its average monthly mortgage payment) and the average rental price of real estate for rent per neighborhood. For this approach to be valid, we have to assume identical characteristics between the rental and sale real estate markets per neighborhood (neighborhood homogeneity).

With this in mind, we now want to train several models which will assess the renting price that a real estate unit for sale is likely to obtain. The models will be trained on the set of rental properties in order to learn which specific characteristics explain their monthly price. We will then evaluate the performance of each model (accuracy) by splitting the sample into training and testing data sets. After this evaluation we will take each of the real estate units for sale and calculate their rental price using the model that yielded the best results. We can therefore obtain the rental market price that a real estate unit for sale would get given its characteristics. Nonetheless, we will have to assume that hidden characteristics of the rental set (e.g. age) are similar to those of the selling set, although this no longer has to hold at the neighborhood level.

We use the rental price per month as our variable to predict, based on several characteristics of the real estate unit.
These characteristics are: I) the average price by area; II) the floor on which the real estate unit is located (empty for houses); III) whether it is an exterior or interior flat (facing the main street or, on the contrary, an inner patio); IV) whether or not the building has a lift; V) whether or not it includes parking; VI) whether or not it is a new development; VII) the number of photos (a proxy for the interest in selling the flat); VIII) the property type (chalet, duplex, flat or penthouse); IX) the size in square meters; X) the status (good, new development, or renewed); and XI) the number of bathrooms divided by the number of square meters. (We divide the number of bathrooms by the size of the real estate unit in order to remove multicollinearity between the two variables: the bigger the flat, the more likely it is to have more bathrooms.) As our benchmark model we will use a simple linear regression to predict house prices; we will then use Random Forests and support vector regressions.
Multivariate linear regression

We start our analysis with the simplest of all forms: multivariate linear regression. Our dependent variable is the rental price and the controls or independent variables are the variables previously mentioned. Recall that in linear regression the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data. The linear prediction function is the best fitted line under the Ordinary Least Squares (OLS) criterion, which minimizes the sum of squared prediction errors. In other words, we are imposing linear associations between the control variables and our dependent variable, rental price.

Table 1 displays the results of several multivariate regressions that we performed on rental price. We perform several regressions in order to check for possible multicollinearity among our regressors. We start with a simple regression consisting only of the size (in m²), whether or not the apartment is exterior (versus interior), and the floor (column 1). The adjusted R-squared of this simple regression indicates a good fit: 0.62, meaning that 62% of the variability in the dependent variable is explained by these controls alone. These results indicate that for every additional square meter, the rental price increases by 11 euros on average. In addition, exterior flats are on average 176 euros more expensive to rent (ceteris paribus), and for every additional floor, the rental price increases by 15 euros on average. In column 2 we introduce whether or not the flat has a lift, indicating that if that is the case, the rental price is on average 98 euros more expensive. In the next column, we introduce price by area, a continuous variable that captures the average square-meter rental price per neighborhood. It takes an average value of 16 and a maximum value of 84. This variable is very significant and its coefficient states that for every additional euro per square meter of price by area, the rental price increases by almost 90 euros. This highlights the heterogeneity of rental prices per area in Madrid. Finally, column 4 includes several other regressors, such as the status (as a dummy variable), the number of bathrooms per square meter, and the number of photos attached to the ad.

Table 1: Linear regression results

Dependent variable:
This highlights the heterogeneity ofrental prices per area in Madrid.Finally, column 4, includes several other reggressors such as the status (as a dummyvariable), number of bathrooms per square meter, the number of photos attached to the ad,14able 1: Linear regression Results Dependent variable:
Rental price(1) (2) (3) (4)Intercept 145 . ∗∗∗ . ∗∗∗ − , . ∗∗∗ − , . ∗∗∗ (56 . . . . . ∗∗∗ . ∗∗∗ . ∗∗∗ . ∗∗∗ (0 . . . . . ∗∗∗ − . ∗∗∗ . ∗∗∗ . . . . . . ∗∗∗ . ∗∗∗ .
987 0 . . . . . . ∗∗ . ∗∗∗ . ∗∗∗ (42 . . . . ∗∗∗ . ∗∗∗ (1 . . . . . . . . , . , . . . . . . ∗∗∗ (39 . . ∗∗ (1 . Note:
In this table, we regress rental price on several real estate characteristic variables. Status indicates the description that the owner gives to the property: good, new development, or renewed. Standard errors are reported in parentheses. *, **, and *** indicate statistical significance at the 10%, 5%, and 1% levels, respectively.
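As a sketch of the benchmark just described, the snippet below fits OLS on synthetic data whose coefficients mimic column (1) of Table 1 (size, exterior, floor); scikit-learn is an assumed tooling choice, not one named in the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.uniform(30, 200, n),   # size in m2
    rng.integers(0, 2, n),     # exterior dummy
    rng.integers(0, 10, n),    # floor number
])
# Synthetic rents built from the column (1) coefficients plus noise.
y = 11 * X[:, 0] + 176 * X[:, 1] + 15 * X[:, 2] + rng.normal(0, 300, n)

# 70/30 train/test split, as in the text; RMSE is computed out of sample.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
ols = LinearRegression().fit(X_tr, y_tr)
rmse = mean_squared_error(y_te, ols.predict(X_te)) ** 0.5
```

On real data the fitted coefficients would be read off exactly as the Table 1 columns are interpreted above.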
Figure 5: Multivariate linear regression predictors

Regression 1: RMSE = 1,191 euros, n = 3,026; Regression 2: RMSE = 611 euros, n = 2,734; Regression 3: RMSE = 346 euros, n = 2,734; Regression 4: RMSE = 359 euros, n = 946.

Notes: scatter plots between the predicted price (y-axes) and the actual price (x-axes) for each of the four regressions in Table 1.

Random forests

Random forests are built on the same fundamental principles as decision trees and bagging. The concept of bagging, also known as bootstrap aggregation, is to create several subsets of the data by randomly selecting data points from the training data with replacement (some observations may be repeated). Decision trees are then trained on the subsets, and their results averaged. Bagging trees introduce a random component into the tree-building process that reduces the variance of a single tree's prediction and improves predictive performance. However, the trees in bagging are not completely independent of each other, since all the original predictors are considered at every split of every tree. Rather, trees from different bootstrap samples typically have a similar structure to each other (especially at the top of the tree) due to underlying relationships. For example, if we create several decision trees with different bootstrapped samples of our set, most likely all trees will have a very similar structure at the top (see, for example, https://uc-r.github.io/random_forests). This characteristic can be mitigated, first, through the bootstrap re-sampling process, where each tree is grown on a bootstrap re-sampled data set in order to de-correlate the trees. On the other hand, we can use the split-variable randomization process, where the search for the split variable is limited to a random subset of m of the p variables (for regression trees a typical default value is m = p/3, but this should be considered a tuning parameter).

Random forests have a handful of hyperparameters that need to be tuned during training. Typically, the primary concern when starting out is tuning the number of candidate variables to select from at each split, plus a few additional parameters: the number of trees; the number of variables to randomly sample as candidates at each split; the number of samples to train on; the minimum number of samples within the terminal nodes; and the maximum number of terminal nodes.

For our training, we focused on tuning the number of candidate variables and the number of trees, setting the rest of the parameters to constant values. In the case of the tree depth, we set the model to expand the trees until all the leaves are pure. The number of variables to randomly sample at each split is set randomly with a given seed so that the results are always reproducible. When training the model, it was observed that outliers in the data (e.g. houses with rental prices of 30,000 euros or houses with a size of 10,000 m²) greatly impacted the results. For this reason, we implemented outlier elimination using Z-scores. In order to find the best combination of number of trees, number of candidate variables and size of the Z-score, we performed a grid search spanning 10 to 500 trees, four to ten variables and a 0.5 to 10 Z-score. Results consistently showed that a performance peak was achieved when using between 100 to 125 decision trees and a Z-score of 1.5. Figure 6 shows the results of testing the model on four different variable subsets, trained with 100 decision trees and a Z-score of 1.5. The subsets of variables used for the training are the same as those used for the training of the multivariate linear regressions shown in Table 1.
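The training routine just described can be sketched as follows: Z-score outlier elimination, then a forest grown until the leaves are pure. The data here are synthetic, and scikit-learn/scipy are assumed tools; the text only states that Python was used.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(30, 200, (400, 4))
y = 12 * X[:, 0] + rng.normal(0, 100, 400)
y[:5] = 30_000                        # planted outliers (e.g. 30,000-euro rents)

# Z-score filter with the 1.5 threshold found by the grid search above.
keep = np.abs(stats.zscore(y)) < 1.5
X, y = X[keep], y[keep]

rf = RandomForestRegressor(
    n_estimators=100,   # the grid search peaked at 100-125 trees
    max_depth=None,     # expand trees until all leaves are pure
    random_state=0,     # fixed seed so results are reproducible
).fit(X, y)
```

The grid search over tree counts and candidate variables would wrap this fit in a loop (or `GridSearchCV`) over those two hyperparameters.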
The first training uses the size (in m²), whether the apartment is exterior or not, and what floor it is on. For the second training we introduce whether or not the house has a lift. For the third, price by area is taken into account, and for the last training, the status of the house (good, new development or renewed), whether or not it has a parking space, the type of house (duplex, flat), and the number of bathrooms per square meter. Due to incomplete data, the more variables are introduced, the fewer complete data samples there are. Thus, models that include more training variables are trained with less data. Figure 6 shows how the results dramatically improve in the fourth training, where more explanatory variables are introduced, yielding an impressive RMSE of 84 euros. This points to the fact that the state of the house, as well as its type and the number of bathrooms per square meter, play a crucial role in determining the rental price. The fourth model is trained with 1,212 samples, as opposed to the first three models, which are trained with over 3,500 samples each. However, there is no indication that this smaller subset is skewed or falls outside the larger super-sets, since all the common variables have roughly the same distributions.

Figure 6: Random Forest predictors

RF 1: RMSE = 531 euros
RF 2: RMSE = 552 euros; RF 3: RMSE = 543 euros; RF 4: RMSE = 84 euros.

Notes: scatter plots between the predicted price (y-axes) and the actual price (x-axes) for each of the four models, trained with 100 trees. From top left to lower right, the 1st model is trained with 3,652 samples, the 2nd and 3rd with 3,598, and the 4th with 1,212.

Support Vector Regressions

Support vector machine (SVM) analysis is a popular machine learning tool for classification and regression, developed in 1992 by [Boser et al., 1992]. SVMs belong to the family of generalized linear classifiers in the sense that they are prediction tools that use machine learning theory to maximize predictive accuracy while automatically avoiding over-fitting to the data ([Jakkula, 2006]). Support vector machines can be defined as systems which use a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory. To illustrate this point, consider the data points presented in Figure 7, where we have an inner circle of data that belongs to one cluster and outer data points that belong to a different cluster. It would be impossible to correctly classify both clusters by tracing a line. For this reason, we resort to a 3D transformation of the data points (right panel), where we can, with no major complications, separate the two data clusters using only a straight line (or, more specifically, a linear plane).

Figure 7: Support Vector Machines
Notes:
Image obtained from https://medium.com/@zachary.bedell/support-vector-machines-explained-73f4ec363f13
Support Vector Regression (SVR) works on principles similar to those of Support Vector Machine (SVM) classification, in the sense that SVR is the adapted form of SVM for when the dependent variable is numerical rather than categorical. The main benefit of using SVR (or SVM) is that it is a non-parametric technique; therefore we do not assume certain conditions or parameters in the data (e.g. linear combinations or a lack of heteroskedasticity in our sample). SVR uses the principle of maximal margin, meaning that we do not care so much about the prediction error as long as it is less than a certain value ε. In other words, the maximal margin allows viewing SVR as a convex optimization problem. Besides, the regression can also be penalized using a cost parameter (explained later in more detail), which helps to avoid over-fitting.

Given that we are not using too many features (control variables) to predict the rental price, we do not resort to feature reduction methods such as Recursive Feature Elimination (RFE) or Principal Component Analysis (PCA). Neither do we use a log price-transformation, something which might increase the accuracy of the model. In this sense, we feed raw data to the support vector regressions and test for accuracy using the Root Mean Square Error (RMSE). Just as before, we split the data into training and testing sets (0.7 and 0.3 respectively) and apply the test on the testing data set only. We consider four different kernel functions to map lower-dimensional data into a higher-dimensional space: i) linear: u·v; ii) polynomial: (γ u·v + coef0)^degree; iii) radial basis: exp(−γ ‖u − v‖²); and iv) sigmoid: tanh(γ u·v + coef0); where u and v are the vectors representing the inputs in the vector space and γ is a weighting factor that scales the amount of influence that two data points have on each other. Besides, we set the cost of constraint violation to the default, cost = 1.
This is the 'C' constant of the regularization term in the Lagrange formulation or, put in other words, how much you want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Finally, note that the number of support vectors is determined by the model and usually ranges from 260 to 2,900.

Table 2 shows the RMSE across each of the four models and the four different kernel functions. For the first model, where we only have size, exterior and floor as explanatory variables, the RMSE using a polynomial kernel is 1,191. This is the same error as when using a multivariate linear regression approach. Once we include the variable lift in our model, the RMSE drops to 690 under the radial basis kernel, which is slightly less accurate than the multivariate linear regression: 611. The next specification adds price by area, and it is in this model that we start seeing a much higher accuracy for the SVR compared to the multivariate linear regression. Under the polynomial kernel, the RMSE turns out to be only 74, whereas that of the multivariate linear regression was 346. Recall that the multivariate linear regression kept under-predicting the price of the most expensive units, which illustrates its limited ability to adjust to non-linearities in the data. This is not the case for the SVR, where, independently of the price, the predictions lie along a straight line against the actual values (see Figure 8). This is also the case for the last specification, where we incorporate all the variables that we have. Although the RMSE is slightly higher for the SVR than for the multivariate linear regression (385 and 359, respectively), the predicted price and the actual price tend to lie on a straight line (see bottom-right panel of Figure 8).
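A toy illustration of the point about C and the model-determined number of support vectors: refitting the same SVR under different cost values changes how many training points end up as support vectors. The data below are synthetic (a single size-like feature), not the paper's sample, and serve only to show that the count is an output of the fit rather than a user-set parameter.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(30, 200, (300, 1))            # one illustrative feature, e.g. size
y = 8 * X[:, 0] + rng.normal(0, 100, 300)     # noisy linear price (assumption)

# The number of support vectors is determined by the fitted model;
# it varies with the regularization constant C (the 'cost' parameter).
for C in [0.1, 1.0, 100.0]:
    model = SVR(kernel="rbf", C=C).fit(X, y)
    print(C, len(model.support_))
```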
All in all, we can conclude that the multivariate linear regression performs better for the simpler models with fewer variables, while the SVR gains the advantage once the specification includes variables that introduce non-linearities a linear regression cannot capture.

Table 2: Support Vector Regression Results, Root Mean Square Errors (RMSE)
[Table layout: one row per kernel (linear, polynomial, radial, sigmoid) and one column per specification (SVR 1 to SVR 4); the cell values are not recoverable from this extraction.]
Notes: This table displays the Root Mean Square Errors of the different models and kernels using Support Vector Regression. Note that the additional model parameters, such as cost, gamma and epsilon, are left at their default values.
Figure 8: Scatter plots of the predicted price (y-axis) against the actual price (x-axis) for each of the four support vector regressions considered; see Table 2. Panels: SVR 1: RMSE = 1,294 euros, k = poly; SVR 2: RMSE = 690 euros, k = radial; SVR 3: RMSE = 74 euros, k = poly; SVR 4: RMSE = 385 euros, k = linear. k stands for the type of kernel used (the one with the best prediction power).

Conclusions and future work
In this preliminary work, we have motivated a tool for investment in real estate units. In particular, we pave the road towards developing a rental yield algorithm to infer the most likely rental price of each real estate unit for sale. The rental yield will be the difference between the monthly mortgage payment of the unit and the most likely monthly rental price. In this project we present preliminary work on how we gather the data and the type of algorithms that we use. For future work, we would like to expand the number of control variables using geo-location data such as closeness to sports facilities, shopping areas, transport or medical facilities.

References

Born, W. and Pyhrr, S. (1994). Real estate valuation: the effect of market and property cycles.
Journal of Real Estate Research, 9(4):455-485.

Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144-152.

Casas-Rosal, J. C., del Rosal, D. E. C., Caridad, J. M., Tabales, J. M. N., et al. (2019). Mercado inmobiliario de Espana: una herramienta para el analisis de la oferta [The Spanish real estate market: a tool for analysing supply]. Cuadernos de Economia: Spanish Journal of Economics and Finance, 42(120):207-218.

Casas-Rosal, J. C., Tabales, J. M. N., and del Rio, L. C. L. (2018). Una mirada al mercado de locales comerciales a la venta de la ciudad de Cordoba [A look at the market for commercial premises for sale in the city of Cordoba]. International Journal of Scientific Management and Tourism, 4(1):61-71.

Delmendo, L. (2020). Spain's house price rises decelerating, but outlook remains upbeat. Global Property Guide.

Hamzaoui, Y. E. and Perez, J. A. H. (2011). Application of artificial neural networks to predict the selling price in the real estate valuation process. Pages 175-181.

Hott, C. and Monnin, P. (2008). Fundamental real estate prices: an empirical estimation with international data. The Journal of Real Estate Finance and Economics, 36(4):427-450.

Jakkula, V. (2006). Tutorial on support vector machine (SVM). School of EECS, Washington State University, 37.

Limsombunchai, V. (2004). House price prediction: hedonic price model vs. artificial neural network. In New Zealand Agricultural and Resource Economics Society Conference, pages 25-26.

Lopez Rodriguez, D. and Matea, M. d. l. L. (2019). Recent developments in the rental housing market in Spain. Banco de Espana Article, forthcoming.

Ma, Y., Zhang, Z., Ihler, A., and Pan, B. (2018). Estimating warehouse rental price using machine learning techniques. International Journal of Computers, Communications & Control, 13(2).

Selim, H. (2009). Determinants of house prices in Turkey: hedonic regression versus artificial neural network. Expert Systems with Applications, 36(2):2843-2852.

Tabales, J. M. N., Caridad, J. M., Carmona, F. J. R., et al. (2013). Artificial neural networks for predicting real estate price. Revista de Metodos Cuantitativos para la Economia y la Empresa, 15:29-44.
Return Index per size
Neighborhood (30-60) (60-90) (90-120) (120-150) (>150) Average
Lavapies-Embajadores 0.91 0.91 0.93 1.21 1.21 1.03
Legazpi 1.16 1.08 1.30 1.18
Lista 0.90 0.95 1.05 1.12 1.14 1.03
Los Angeles 2.19 2.49 2.34
Los Rosales 1.02 1.02
Lucero 1.53 1.16 0.99 1.23
Malasana-Universidad 0.94 1.09 1.12 0.81 0.79 0.95
Marazuela-El Torreon 0.62 0.73 1.16 1.16 1.16 0.96
Marroquina 1.18 1.18 1.18
Media Legua 1.76 1.62 1.69
Mirasierra 0.55 0.60 0.73 0.78 0.78 0.69
Molino de la Hoz 0.89 1.14 1.14 1.14 1.14 1.09
Montealina 1.07 1.07 1.07 1.07 1.07 1.07
Montecarmelo 1.36 1.36 1.36
Montecillo-Pinar de las Rozas 0.60 0.66 0.70 1.09 1.09 0.83
Moscardo 1.69 1.69
Nueva Espana 0.50 0.55 0.69 0.67 0.66 0.62
Nuevos Ministerios-Rios Rosas 0.84 0.94 0.96 0.84 0.85 0.89
Numancia 1.84 1.29 1.57
Opanel 2.93 2.93
Orcasitas 2.07 2.07
Pacifico 1.16 1.31 1.30 1.26
Palacio 0.78 0.62 0.53 0.58 0.54 0.61
Palomas 0.93 0.88 1.12 1.20 1.20 1.07
Palomeras Bajas 1.93 1.93
Palomeras sureste 1.22 1.28 1.25
Palos de Moguer 1.12 1.11 1.12
Parque Lisboa-LaPaz 1.39 1.39 1.39
Parque Mayor 1.20 1.20
Parque Ondarreta-Urtinsa 0.82 0.82
Pavones 1.46 1.46 1.46
Penagrande 0.89 1.00 0.98 1.22 1.22 1.06
Pilar 1.60 1.33 1.25 1.15 1.15 1.30
Pinar del Rey 1.07 0.91 0.84 0.94
Portazgo 2.44 2.44
Prado de Santo Domingo 0.84 0.62 0.73
Prado de Somosaguas 0.99 0.99 0.99 1.00 1.00 0.99
Pradolongo 1.53 1.53
Prosperidad 0.87 0.90 0.77 0.84
Pueblo Nuevo 1.53 1.26 1.10 1.30
Puerta Bonita 1.11 1.11
Puerta del Angel 1.68 1.79 1.61 1.66 1.66 1.68
Quintana 1.12 1.15 1.19 1.15
Recoletos 0.52 0.52 0.57 0.59 0.59 0.56
Rejas 1.04 0.99 0.85 0.79 0.79 0.89
Rosas 0.97 0.99 0.98
Salvador 1.63 1.54 1.89 1.34 1.34 1.55
San Andres 1.63 1.63
Sanchinarro 0.99 1.06 1.07 1.09 1.09 1.06
San Diego 1.99 1.99
San Fermin 1.17 1.33 1.25
San Isidro 1.17 1.72 1.43 1.43 1.44
San Juan Bautista 1.24 1.02 0.92 0.87 0.78 0.97
San Pascual 0.63 0.70 0.66 1.08 1.08 0.83
Santa Eugenia 1.38 1.38 1.38
Simancas 1.23 1.00 0.80 1.01
Sol 0.96 1.00 1.35 1.05 1.05 1.08
Somosaguas 1.07 1.07 1.07 1.03 1.03 1.05
Timon 0.70 0.81 0.61 0.71
Trafalgar 0.90 0.89 0.97 0.91 0.90 0.91
Tres Olivos-Valverde 0.77 1.42 1.42 1.42 1.42 1.29
Valdeacederas 1.43 1.59 1.51
Valdebebas-Valdefuentes 1.19 1.19 1.23 1.37 1.37 1.27
Valdemarin 0.62 0.63 0.67 0.68 0.66 0.65
Valdezarza 1.24 0.93 0.73 1.01 1.01 0.98
Vallehermoso 0.80 0.90 0.91 1.00 1.03 0.93
Ventas 1.83 1.53 1.68
Ventilla-Almenara 1.01 0.91 1.08 1.00 1.00
Virgen del Cortijo-Manoteras 1.06 1.12 0.94 0.79 0.79 0.94
Vista Alegre 1.58 0.63 1.11
Zarzaquemada 1.69 1.69
Zona Avenida Europa 0.99 0.99 1.02 1.01 1.01 1.00
Zona Estacion 0.64 0.74 0.78 0.95 1.05 0.83
Zonanorte 0.91 0.91 0.99 1.05 1.05 0.98
Zona Pueblo 0.95 0.88 1.00 0.85 0.85 0.91
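A return-index table of this shape (neighborhood by size bracket) can be assembled with a pandas pivot over the scraped rental and sale listings. The sketch below is one plausible construction under our own assumptions, not necessarily the paper's exact formula: average rent and sale prices per neighborhood and size bracket, an annualized rent-to-price ratio, then normalization by the overall mean so that values above 1 flag above-average yield. The records are toy data, with hypothetical column names.

```python
import pandas as pd

# Hypothetical toy listings; the real table is built from the scraped Idealista units.
units = pd.DataFrame({
    "neighborhood": ["Sol", "Sol", "Recoletos", "Recoletos"],
    "size_bracket": ["30-60", "30-60", "30-60", "30-60"],
    "listing_type": ["rent", "sale", "rent", "sale"],
    "price":        [1200.0, 300000.0, 1500.0, 600000.0],  # euros/month vs euros
})

# Average rent and sale price per (neighborhood, size bracket).
avg = units.pivot_table(index=["neighborhood", "size_bracket"],
                        columns="listing_type", values="price", aggfunc="mean")

# Annualized gross yield, normalized by the city-wide mean (assumed index definition).
gross_yield = 12 * avg["rent"] / avg["sale"]
return_index = gross_yield / gross_yield.mean()
print(return_index.unstack("size_bracket"))
```

Under this toy input, Sol ends up above 1 and Recoletos below 1, qualitatively matching their positions in the table above.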