socioeconomic disparities: a geospatial approach to house price prediction



introduction

in mecklenburg county, predicting house prices isn't just about looking at the trends; it's about understanding the story each data point tells. our goal? build a super model (literally) to capture the complexity of the housing market using a hedonic model. this model breaks down home value into its core elements, considering everything from neighborhood amenities to internal features.

we faced challenges in picking the right variables and cleaning messy data (think: converting tricky numerical data into useful categories). after a few hiccups and many random errors, we did it. the result? a model that’s ready to strut its stuff.


data collection

we pulled data from multiple sources:
  • official datasets from mecklenburg county covering transit, parks, colleges, and neighborhood attributes
  • census tract data fetched using the tidycensus package in r

after wrangling all these layers of data, we started scoring and ranking the features to ensure they told the right story about home values.


feature engineering



we calculated the ‘average nearest neighbor distance’ from each home to key spots like transit stops, parks, and churches. this spatial twist helped us capture how the proximity of amenities impacts house prices. then, we transformed certain features (e.g., ‘story height,’ ‘heated fuel type,’ and ‘foundation type’) into categories to better fit our model.

we also dropped collinear independent variables (pairs with a correlation coefficient r > 0.75 or < -0.75). as a result, we kept shape length, distance to the nearest 3 transit stops, and percent with a graduate degree, and removed shape area, percent with a bachelor's degree, and the other nearest-transit-stop distance measures.
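a minimal sketch of the two steps above, written in plain python for illustration only (the project itself worked in r). it assumes projected x/y coordinates, straight-line distance, and a pearson correlation screen at |r| > 0.75; the function names, column names, and numbers are made up and are not project data.

```python
import math

def avg_knn_distance(home, amenities, k=3):
    """Mean straight-line distance from one home to its k nearest amenities."""
    nearest = sorted(math.dist(home, a) for a in amenities)[:k]
    return sum(nearest) / len(nearest)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def collinear_pairs(columns, threshold=0.75):
    """Return variable pairs whose |r| exceeds the threshold."""
    names = list(columns)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson_r(columns[a], columns[b])) > threshold
    ]

# hypothetical coordinates in a projected system (units arbitrary)
home = (0.0, 0.0)
transit_stops = [(3.0, 4.0), (6.0, 8.0), (0.0, 1.0), (30.0, 40.0)]
knn3 = avg_knn_distance(home, transit_stops, k=3)

# hypothetical feature columns
columns = {
    "shape_length": [1, 2, 3, 4, 5],
    "shape_area": [1, 4, 9, 16, 25],
    "pct_graduate": [3, 1, 4, 1, 5],
}
flagged = collinear_pairs(columns)
print(knn3, flagged)
```

for each flagged pair, one variable is kept and the other dropped; which one to keep is a judgment call the screen itself does not make.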





we chose four factors that we thought were most relevant to house price, but schools appear to have little effect on home prices. it remains to be seen how this changes in the multi-factor model.


the first map shows home sale prices in mecklenburg county. the yellow areas indicate high prices, while purple shows low prices. high-price areas are mainly concentrated in the north and south of the county.
below are three maps of notable independent variables. their spatial distributions are similar, but the density varies in certain areas. average income is concentrated, while heated area has high values in the northern part of the city.




methodology

we approached this like a machine learning project: data wrangling, exploratory analysis, feature engineering, selection, and model validation.

first, we trained our model on 70% of the data and kept 30% for testing. using metrics like mean absolute error (mae) and mean absolute percentage error (mape), we fine-tuned the predictions. we also used k-fold cross-validation to measure how well our model generalizes to new data, splitting the dataset into 10 folds to ensure robustness.
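the split and the error metrics described above can be sketched as follows. this is an illustrative python version under simple assumptions (a seeded random 70/30 split, strided folds), not the code used in the project; all names and prices are hypothetical.

```python
import random

def mae(observed, predicted):
    """Mean absolute error, in the same units as the prices."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def mape(observed, predicted):
    """Mean absolute percentage error, as a fraction of the observed price."""
    return sum(abs(o - p) / o for o, p in zip(observed, predicted)) / len(observed)

def train_test_split(rows, train_share=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_share))
    return shuffled[:cut], shuffled[cut:]

def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) so every row is tested exactly once."""
    idx = list(range(n))
    for fold in range(k):
        test = idx[fold::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        yield train, test

# hypothetical sale prices and predictions
observed = [100000, 200000, 400000]
predicted = [110000, 180000, 400000]
print(mae(observed, predicted), mape(observed, predicted))

train, test = train_test_split(range(100))
folds = list(k_fold_indices(25, k=10))
print(len(train), len(test), len(folds))
```

in practice the rows would be shuffled before building the folds, and the model would be refit on each training fold before scoring the held-out fold.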

to address spatial autocorrelation, we tested the model errors with moran’s i, which showed that price errors clustered in space, and we added a spatial lag feature to capture it. this helped us refine the model further.
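one common way to build a spatial lag is to average a value (a price or a model error) over each home's k nearest neighbors. the python sketch below assumes that definition with straight-line distance; the choice of k, the function name, and the sample points are assumptions, not details taken from the project.

```python
import math

def spatial_lag(points, values, k=5):
    """For each point, average the values of its k nearest other points."""
    lags = []
    for i, p in enumerate(points):
        # (distance, index) pairs for every other point, nearest first
        neighbors = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        lags.append(sum(values[j] for _, j in neighbors) / len(neighbors))
    return lags

# hypothetical homes along a street, with a made-up value at each
points = [(0, 0), (1, 0), (2, 0), (10, 0)]
values = [1, 2, 3, 10]
lags = spatial_lag(points, values, k=2)
print(lags)
```

if a home's own value tends to move with its lag, nearby observations are not independent, which is the pattern the clustering test looks for.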


results

our first ols regression model featured 23 predictor variables, including both internal characteristics and spatial factors. we quickly realized floor height wasn’t significant, so we dropped it, which improved model performance.

the refined model explained about 77% of the variation in house prices (adjusted r-squared = 0.77). we found strong predictors like:
  • the number of full baths and heated area size
  • building grade (custom or excellent grades had the highest impact)
  • median household income and proximity to transit stops

cross-validation showed a root mean square error (rmse) of 76,670, suggesting solid predictive accuracy. however, errors varied across different data folds, indicating room for improvement in handling outliers.
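rmse squares each error before averaging, so a few large misses move it much more than they move mae. the python sketch below illustrates this with made-up numbers (not project results): two sets of predictions with the same mae but very different rmse.

```python
import math

def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root mean square error; large errors are weighted more heavily."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))

# hypothetical fold: four homes that each sold for 300,000
observed = [300000, 300000, 300000, 300000]
steady = [310000, 310000, 310000, 310000]   # every prediction off by 10,000
spiky = [300000, 300000, 300000, 340000]    # one prediction off by 40,000

print(mae(observed, steady), rmse(observed, steady))
print(mae(observed, spiky), rmse(observed, spiky))
```

a fold that happens to contain a few outliers can therefore show a much higher rmse than the others even when typical errors are similar.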


mapping insights

we visualized sale prices across mecklenburg county, highlighting high-value clusters in the north and south. additionally, maps of median income and heated area distributions showed clear spatial patterns, with income concentrated in specific neighborhoods.


spatial autocorrelation

even the best regression models can leave spatial patterns unexplained. our spatial lag analysis indicated that errors in home price predictions tend to cluster, reinforcing the need for spatial features in the model. 




moran’s i confirmed significant clustering with a value of 0.31 (p = 0.001). a p-value of 0.001 means the observed moran’s i is higher than the value from every one of the 999 random permutations (1 / (999 + 1) = 0.001), so the clustering is statistically significant.
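the permutation logic can be sketched in a few lines of python. this is an illustrative version, assuming a simple binary neighbor matrix and the usual pseudo p-value (count + 1) / (permutations + 1); the weights, values, and function names are hypothetical and do not reproduce the project's 0.31 result.

```python
import random

def morans_i(values, weights):
    """Moran's I for a list of values and a square spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in z)
    return (n / s0) * num / den

def permutation_test(values, weights, n_perm=999, seed=0):
    """Return observed I and the one-sided pseudo p-value for clustering."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, weights) >= observed:
            as_extreme += 1
    # the observed arrangement counts as one of the n_perm + 1 cases
    return observed, (as_extreme + 1) / (n_perm + 1)

# hypothetical example: 8 locations in a row, each linked to its neighbors
n = 8
weights = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]
errors = [1, 2, 3, 4, 5, 6, 7, 8]  # smoothly increasing, so strongly clustered

observed_i, p_value = permutation_test(errors, weights)
print(observed_i, p_value)
```

with 999 permutations the smallest possible pseudo p-value is 1 / 1000 = 0.001, which is reached when no shuffled arrangement is as clustered as the observed one.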




the r-squared of our new neighborhood model is in the range of 0.82-0.84, which is a satisfactory result. the absolute error (abserror) and absolute percentage error (ape) both decrease, indicating that the neighborhood effects model is more accurate on both a dollar and percentage basis. predicted prices are plotted against observed prices. the purple line represents a perfect fit, while the yellow line shows the predicted fit.

there is no strong relationship between mape and average housing price when mape is measured by blocks.

generalizability

the fit across two partitions is consistent, likely due to the model's strong effect. we’re not entirely sure if this is an ideal model since we included census data factors, which might have a similar spatial lag. we're debating whether using census data for spatial generalizability testing was a good choice.

the fit tends to be better in high-income areas, making the model potentially less accurate for low-income groups if used for taxation or predicting housing prices.
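one way to check this kind of gap is to compute mape separately for each group. the python sketch below assumes each record carries a group label (for example an income class); the labels, prices, and function name are made up for illustration and are not project data.

```python
from collections import defaultdict

def mape_by_group(records):
    """records: iterable of (group, observed_price, predicted_price)."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of APE, count]
    for group, observed, predicted in records:
        totals[group][0] += abs(observed - predicted) / observed
        totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}

# hypothetical sales tagged by the income class of their census tract
records = [
    ("high_income", 500000, 490000),
    ("high_income", 400000, 420000),
    ("low_income", 100000, 130000),
    ("low_income", 150000, 120000),
]
result = mape_by_group(records)
print(result)
```

a large difference between groups would mean the same dollar error is a much bigger share of the price for lower-priced homes, which matters if the predictions feed into taxation.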

discussion

we were satisfied with converting numerical data to categorical data types, and the final model fit well. however, we couldn’t explore other interesting variables like crime rate due to limited data availability. some data only exist as geometry divided by administrative areas, and we struggled to incorporate these into the model.

we didn’t have time for further stepwise testing, so the current prediction model still includes 23 independent variables. based on the plots of mape against mean house price, there is no strong association between the two, suggesting that our neighborhood factor had a positive impact.

conclusions

by incorporating spatial data and refining our feature set, we managed to build a predictive model that captures much of the nuance in house prices. however, the presence of spatial autocorrelation in residuals suggests there’s more to explore in terms of local neighborhood dynamics. going forward, we recommend further exploration of spatial features and continued testing across diverse datasets to improve generalizability. our super model may have strutted its stuff, but there's always room for a little more runway.


copyright @liuhaobing, hit the link to github