Predicting House Prices in Taiwan: A Bayesian Approach

Bayesian methods are, besides frequentist methods, one of the main approaches for estimating statistical models. In Bayesian models, prior information can be included via a suitable prior distribution; the resulting model is therefore a combination of the data and the prior information. The package brms is used here for fitting Bayesian models with Stan. The package allows for modelling generalized nonlinear multivariate models, and every parameter of the response distribution can be modelled in terms of linear or nonlinear terms. This allows for a very flexible modelling strategy. Bayesian methods excel with datasets that have a hierarchical/multilevel structure and/or are relatively small. For a large number of observations with a relatively simple structure, the results of Bayesian and frequentist models are very similar. The main drawback, besides the choice of the priors, is the computation time: Bayesian models need to be simulated using MCMC, while classical methods can be estimated by optimizing a function.
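
As a small sketch of how prior information enters a brms model: regression coefficients belong to the class "b", and a prior for them can be set with set_prior. The normal(0, 10) prior here is purely illustrative, not the prior used in the models below.

library(brms)
# illustrative weakly informative prior for all population-level effects
example_prior=set_prior("normal(0, 10)", class="b")
# it would then be passed to brm() via the prior argument, e.g.
# brm(House_price~House_age, data=real_estate, prior=example_prior)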

The data is a historical data set of real estate valuations from Taipei. The dataset can be found here. The dataset is transformed into the sf format, which is used for spatial data. An sf object is similar to a data.frame, but includes an additional column for the geometry of each observation. The dataset includes the transaction date, the age of the house, the number of convenience stores, the distance to the nearest metro station (MRT), the geographic coordinates, and the house price, measured in 1000 New Taiwan Dollar per ping. Ping is a local areal unit corresponding to roughly 3.3 square metres.
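
The conversion itself is a one-liner with sf; a minimal sketch, assuming the raw file was read into a data.frame called real_estate_df (a hypothetical name) with Longitude and Latitude columns:

library(sf)
# keep the coordinate columns, since they are used as variables later on
real_estate=st_as_sf(real_estate_df, coords=c("Longitude","Latitude"),
                     crs=4326, remove=FALSE)  # WGS84 geographic coordinates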

mapviewOptions(basemaps = "OpenStreetMap")
mapview(real_estate,zcol=c("Transaction_date","House_age","Distance_MRT","Number_convenience_stores",
                           "House_price","Latitude","Longitude"))

To illustrate the spatial structure, the package mapview is used. Most properties are located near the city centre, while some lie further outside. The number of convenience stores is, unsurprisingly, higher in the city centre. A similar relationship holds for the distance to the MRT: distances outside the city are larger than those in the city centre.

ggpairs(real_estate,columns = c(2,3,4,5,8))

Interestingly enough, the age of the house seems to be only weakly correlated with the price. Only the distance to the MRT and the number of convenience stores show a clear correlation with the house price. The distance exhibits a nonlinear relationship with the house price, so modelling it with a spline might be a suitable strategy.

summary(real_estate$House_price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.60   27.70   38.45   37.98   46.60  117.50

The maximum and minimum house prices are quite far from the other observations; these observations could be outliers.
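
A quick way to inspect these two observations directly (a minimal sketch):

# rows with the smallest and largest house price
real_estate[c(which.min(real_estate$House_price),
              which.max(real_estate$House_price)),]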

model0=brm(House_price~Transaction_date+House_age+Distance_MRT+Number_convenience_stores,
           data=real_estate,cores=7,chains=5,iter=600,
           silent=TRUE,refresh=0,file="./model0")

model0
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: House_price ~ Transaction_date + House_age + Distance_MRT + Number_convenience_stores 
##    Data: real_estate (Number of observations: 414) 
## Samples: 5 chains, each with iter = 600; warmup = 300; thin = 1;
##          total post-warmup samples = 1500
## 
## Population-Level Effects: 
##                            Estimate Est.Error  l-95% CI u-95% CI Rhat
## Intercept                 -11684.46   3233.79 -18114.45 -5317.07 1.01
## Transaction_date               5.83      1.61      2.66     9.02 1.01
## House_age                     -0.25      0.04     -0.33    -0.17 1.00
## Distance_MRT                  -0.01      0.00     -0.01    -0.00 1.00
## Number_convenience_stores      1.26      0.20      0.87     1.65 1.01
##                           Bulk_ESS Tail_ESS
## Intercept                     1045      724
## Transaction_date              1045      724
## House_age                      666      685
## Distance_MRT                  1164     1037
## Number_convenience_stores      837      817
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     9.14      0.31     8.59     9.76 1.00     1678     1055
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

The brm function is similar to other regression functions in R such as lm. The options silent and refresh are used to suppress messages from the compiler and sampler. The Rhat, or potential scale reduction factor, is a measure of the convergence of the Markov chains. It is based on the within-chain and between-chain variances. An Rhat equal to 1 indicates that the chains have converged. The effective sample size takes into account that the draws are not independent samples from the posterior distribution: the autocorrelation of the chains is used to compute an effective number of samples. A large number of convenience stores increases the house price, while a higher house age reduces it. The transaction date has a positive influence. The number of convenience stores and the distance to the metro are, in a way, proxies for the location: houses with many nearby convenience stores and a short distance to the metro are located in the city centre.
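
Both diagnostics can also be extracted directly from the fitted model; a short sketch using the accessors exported by brms:

rhat(model0)        # potential scale reduction factor per parameter
neff_ratio(model0)  # effective sample size relative to the total draws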

model1=brm(House_price~Transaction_date+House_age+s(Distance_MRT,bs="ps")+Number_convenience_stores,
           data=real_estate,cores=7,chains=5,iter=700,
           control=list(adapt_delta=0.95),file="./model1")

model1
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: House_price ~ Transaction_date + House_age + s(Distance_MRT, bs = "ps") + Number_convenience_stores 
##    Data: real_estate (Number of observations: 414) 
## Samples: 5 chains, each with iter = 700; warmup = 350; thin = 1;
##          total post-warmup samples = 1750
## 
## Smooth Terms: 
##                      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## sds(sDistance_MRT_1)     3.25      1.64     1.17     7.39 1.01      551
##                      Tail_ESS
## sds(sDistance_MRT_1)      788
## 
## Population-Level Effects: 
##                            Estimate Est.Error  l-95% CI u-95% CI Rhat
## Intercept                 -14032.56   2953.81 -19956.89 -8334.83 1.00
## Transaction_date               6.99      1.47      4.16     9.93 1.00
## House_age                     -0.25      0.04     -0.33    -0.17 1.00
## Number_convenience_stores      0.52      0.19      0.15     0.90 1.01
## sDistance_MRT_1              -53.33      9.66    -72.15   -34.81 1.00
##                           Bulk_ESS Tail_ESS
## Intercept                     1881     1180
## Transaction_date              1881     1180
## House_age                     2925     1208
## Number_convenience_stores     2069     1418
## sDistance_MRT_1               1018     1245
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     8.33      0.30     7.77     8.93 1.00     2497     1213
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).
marginal_smooths(model1)

marginal_smooths plots the splines of the model. A clear nonlinear relationship is visible, which indicates that nonlinear terms are useful for this model.

pp_check(model0,nsamples = 30)

The posterior predictive distribution can be used for model checking: if the posterior distribution captures our knowledge of the data well, replicated predictions should be close to the observed data. pp_check from the bayesplot package performs such posterior predictive checks. The model does not capture the dataset perfectly. In addition, the model predicts negative values for house prices, which are not possible.
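
The negative predictions can also be quantified directly; a possible sketch is to compute the fraction of negative posterior predictive draws, or to compare the replicated minima with the observed minimum:

# fraction of negative draws (should be zero for house prices)
mean(posterior_predict(model0, nsamples=100)<0)
# distribution of the replicated minima versus the observed minimum
pp_check(model0, type="stat", stat="min", nsamples=100)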

pp_check(model1,nsamples = 30)

What about the model with a spline for the distance to the MRT? It seems to fit the data better, especially for lower values. However, similar to the model without splines, negative predictions are made, which is not useful.

model2=brm(House_price~Transaction_date+House_age+Distance_MRT+Number_convenience_stores,
           data=real_estate,cores=7,chains=5,iter=600,
           family=lognormal(),control=list(max_treedepth=17),file="./model2")

model2
##  Family: lognormal 
##   Links: mu = identity; sigma = identity 
## Formula: House_price ~ Transaction_date + House_age + Distance_MRT + Number_convenience_stores 
##    Data: real_estate (Number of observations: 414) 
## Samples: 5 chains, each with iter = 600; warmup = 300; thin = 1;
##          total post-warmup samples = 1500
## 
## Population-Level Effects: 
##                           Estimate Est.Error l-95% CI u-95% CI Rhat
## Intercept                  -310.88     84.70  -477.75  -136.99 1.00
## Transaction_date              0.16      0.04     0.07     0.24 1.00
## House_age                    -0.01      0.00    -0.01    -0.00 1.01
## Distance_MRT                 -0.00      0.00    -0.00    -0.00 1.00
## Number_convenience_stores     0.03      0.00     0.02     0.04 1.00
##                           Bulk_ESS Tail_ESS
## Intercept                      412      240
## Transaction_date               412      240
## House_age                     1350      955
## Distance_MRT                   916     1021
## Number_convenience_stores      499      475
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.24      0.01     0.22     0.25 1.01      925      887
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

To ensure only positive values for the house price, a lognormal model is used now. The distribution of the response variable can be changed via the family argument.
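
Other distributions with positive support could be used in the same way; as an illustrative sketch (model2b is a hypothetical name and is not fitted here), a Gamma model with a log link would be:

model2b=brm(House_price~Transaction_date+House_age+Distance_MRT+Number_convenience_stores,
            data=real_estate,family=Gamma(link="log"))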

model3=brm(House_price~Transaction_date+House_age+s(Distance_MRT,bs="ps")+Number_convenience_stores,
           data=real_estate,cores=7,chains=5,iter=800,
           family=lognormal(),control=list(adapt_delta=0.95,max_treedepth=17),file="./model3")

model3
##  Family: lognormal 
##   Links: mu = identity; sigma = identity 
## Formula: House_price ~ Transaction_date + House_age + s(Distance_MRT, bs = "ps") + Number_convenience_stores 
##    Data: real_estate (Number of observations: 414) 
## Samples: 5 chains, each with iter = 800; warmup = 400; thin = 1;
##          total post-warmup samples = 2000
## 
## Smooth Terms: 
##                      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## sds(sDistance_MRT_1)     0.15      0.09     0.04     0.40 1.01      363
##                      Tail_ESS
## sds(sDistance_MRT_1)      749
## 
## Population-Level Effects: 
##                           Estimate Est.Error l-95% CI u-95% CI Rhat
## Intercept                  -366.36     74.70  -511.85  -217.35 1.00
## Transaction_date              0.18      0.04     0.11     0.26 1.00
## House_age                    -0.01      0.00    -0.01    -0.00 1.00
## Number_convenience_stores     0.02      0.01     0.01     0.03 1.00
## sDistance_MRT_1              -2.07      0.36    -2.83    -1.40 1.00
##                           Bulk_ESS Tail_ESS
## Intercept                     1994     1276
## Transaction_date              1993     1276
## House_age                     2078     1583
## Number_convenience_stores     2577     1353
## sDistance_MRT_1                795      511
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.22      0.01     0.21     0.24 1.00     1906     1039
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).
marginal_smooths(model3)

waic(model0,model1,model2,model3)
## Output of model 'model0':
## 
## Computed from 1500 by 414 log-likelihood matrix
## 
##           Estimate   SE
## elpd_waic  -1508.8 41.2
## p_waic        11.7  6.0
## waic        3017.6 82.4
## Warning: 2 (0.5%) p_waic estimates greater than 0.4. We recommend trying
## loo instead.
## 
## Output of model 'model1':
## 
## Computed from 1750 by 414 log-likelihood matrix
## 
##           Estimate   SE
## elpd_waic  -1473.9 44.5
## p_waic        16.9  7.3
## waic        2947.7 88.9
## Warning: 6 (1.4%) p_waic estimates greater than 0.4. We recommend trying
## loo instead.
## 
## Output of model 'model2':
## 
## Computed from 1500 by 414 log-likelihood matrix
## 
##           Estimate   SE
## elpd_waic  -1469.7 35.3
## p_waic        10.7  4.3
## waic        2939.4 70.6
## Warning: 2 (0.5%) p_waic estimates greater than 0.4. We recommend trying
## loo instead.
## 
## Output of model 'model3':
## 
## Computed from 2000 by 414 log-likelihood matrix
## 
##           Estimate   SE
## elpd_waic  -1448.7 40.4
## p_waic        19.8  6.6
## waic        2897.4 80.8
## Warning: 7 (1.7%) p_waic estimates greater than 0.4. We recommend trying
## loo instead.
## 
## Model comparisons:
##        elpd_diff se_diff
## model3   0.0       0.0  
## model2 -21.0       9.2  
## model1 -25.2      39.1  
## model0 -60.1      39.4

The Watanabe-Akaike information criterion (WAIC) can be used for model selection instead of performing a cross-validation. Cross-validation can be very time-consuming for a large number of models; WAIC is an approximation of leave-one-out cross-validation, and the two are asymptotically equal. Standard errors can be computed for the WAIC, so differences between models can be tested for statistical significance. This is not possible with classical information criteria such as AIC or BIC.
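
Since the warnings above recommend loo, the same comparison can be run with PSIS-LOO; the call mirrors waic (a sketch, output omitted):

loo(model0,model1,model2,model3)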

pp_check(model3,nsamples=30)+ggtitle("Log-normal model with splines")

prediction0=posterior_predict(model0)
prediction1=posterior_predict(model1)
prediction2=posterior_predict(model2)
prediction3=posterior_predict(model3)
ppc_error_scatter_avg(real_estate$House_price,prediction0)+ggtitle("Normal model with linear terms")

ppc_error_scatter_avg(real_estate$House_price,prediction1)+ggtitle("Normal model with splines")

ppc_error_scatter_avg(real_estate$House_price,prediction2)+ggtitle("Log-normal model with linear terms")

ppc_error_scatter_avg(real_estate$House_price,prediction3)+ggtitle("Log-normal model with splines")

For every model, two outliers are visible. In addition, the variance is higher for large values of the house price. The models so far do not seem to be too bad, but there are a couple of issues: the spatial structure has not been included in the model, and the average prediction error shows a slight positive linear relationship as well as heteroscedasticity. In the next post, a model with more spatial components will be developed. In addition, the heteroscedasticity needs to be dealt with. Finally, a more detailed model selection will be done to assess the quality of the models.
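
As a teaser for the next post: in brms, sigma itself can be given its own formula, which is one possible way to address the heteroscedasticity. A sketch with an illustrative predictor (model_sigma is a hypothetical name, not the final model):

model_sigma=brm(bf(House_price~Transaction_date+House_age+s(Distance_MRT,bs="ps")+Number_convenience_stores,
                   sigma~Distance_MRT),
                data=real_estate,family=lognormal())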