risk modeling & bias
geospatial crime risk prediction in chicago
introduction
the project begins with wrangling data on burglaries and various risk factors in chicago, transforming them into geospatial features. i correlate these features and build models to predict the latent risk of burglaries. validation is done by comparing predictions with a standard measure of geospatial crime risk. your challenge? develop a model for a different crime type, likely one with more selection bias than burglary. pick chicago or any other city with ample open data. iterate with new features until you find a model that optimizes accuracy and generalizability.
data wrangling
the data comes from the chicago data portal, providing geospatial layers like neighborhoods, crime incidents, and risk factor attributes. i focus on 'robbery - armed - handgun' as the dependent variable, selecting data from 2019 due to outdated risk factors after 2018. the spatial distribution of robberies shows clustering in the city’s north-central area, with sparse points in the southeast.
the five most frequent crime types in the data are theft, battery, criminal damage, assault, and deceptive practice. the dependent variable, however, is armed robbery with a handgun. a fishnet grid aggregates these robbery incidents, omitting o’hare airport to avoid skewing the data. i then sum the incidents within each grid cell to analyze their density.
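as a minimal sketch of the aggregation step (not the code used for this project), incidents can be binned into square cells and counted. the coordinates, cell size, and excluded cell below are hypothetical, and `fishnet_counts` is an illustrative helper name.

```python
def fishnet_counts(points, x_min, y_min, cell_size, n_cols, n_rows, excluded=frozenset()):
    """Count points per square grid cell; cells are keyed by (col, row).

    Cells listed in `excluded` (e.g. an airport footprint) are dropped,
    and points that fall in them or outside the grid are ignored.
    """
    counts = {
        (col, row): 0
        for col in range(n_cols)
        for row in range(n_rows)
        if (col, row) not in excluded
    }
    for x, y in points:
        col = int((x - x_min) // cell_size)
        row = int((y - y_min) // cell_size)
        if (col, row) in counts:
            counts[(col, row)] += 1
    return counts


# hypothetical projected coordinates, not real incident locations
incidents = [(50, 50), (120, 80), (130, 90), (450, 450), (260, 10)]
counts = fishnet_counts(
    incidents, x_min=0, y_min=0, cell_size=100, n_cols=5, n_rows=5,
    excluded={(4, 4)},  # stand-in for the omitted airport cell
)
print(counts[(1, 0)])
```

cells with no incidents keep a count of zero, which matters later because a count model needs the empty cells as well as the busy ones.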
modeling spatial features
to predict robbery hotspots, i engineered features like:
- proximity to cash depositories (atms and banks)
- proximity to bars and liquor stores
- abandoned buildings
- streetlight outages
these features are logically linked to robbery risk—cash points attract offenders, and victims leaving bars may be less attentive. but these variables might also introduce bias if they correlate with socioeconomic factors unrelated to crime.
feature engineering
each risk factor is aggregated to the fishnet grid cells, and a nearest-neighbor feature is created for each factor. the final data set includes variables like crime counts, risk factor density, and spatial relationships.
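one common way to build a nearest-neighbor feature is the mean distance from a cell centroid to its k closest risk-factor points. the sketch below assumes k = 3 and uses hypothetical coordinates; the write-up does not state the value of k that was used.

```python
import math


def mean_knn_distance(origin, targets, k=3):
    """Mean distance from `origin` to its k nearest points in `targets`."""
    if not targets:
        raise ValueError("targets must contain at least one point")
    distances = sorted(math.dist(origin, t) for t in targets)
    nearest = distances[:k]  # fewer than k targets: use all of them
    return sum(nearest) / len(nearest)


# hypothetical cell centroid and risk-factor locations
centroid = (0.0, 0.0)
liquor_stores = [(3.0, 4.0), (6.0, 8.0), (0.0, 1.0), (10.0, 0.0)]

nn3 = mean_knn_distance(centroid, liquor_stores, k=3)
nn1 = mean_knn_distance(centroid, liquor_stores, k=1)
print(round(nn3, 3), nn1)
```

a small value means the cell sits close to several risk-factor locations, which gives a smoother exposure measure than a raw count within the cell.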
local moran’s i
local moran’s i helps identify clusters of high-risk and low-risk areas. in chicago, significant hotspots appear in two distinct areas, one on the central north side and one on the south side, both with a high incidence of handgun-armed robberies. cold spots are concentrated on the city’s peripheries, at its northernmost and southernmost edges.
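for reference, one common form of the statistic multiplies each cell's deviation from the mean by the average deviation of its neighbors, scaled by the variance. the sketch below assumes row-standardized weights and a hypothetical chain of six cells; it omits the permutation test normally used to judge significance.

```python
def local_morans_i(values, neighbors):
    """Local Moran's I per cell, with row-standardized neighbor weights.

    `neighbors[i]` lists the indices of the cells adjacent to cell i.
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(d * d for d in z) / n  # variance (population form)
    if m2 == 0:
        raise ValueError("values are constant; the statistic is undefined")
    result = []
    for i in range(n):
        nbrs = neighbors[i]
        lag = sum(z[j] for j in nbrs) / len(nbrs) if nbrs else 0.0
        result.append(z[i] * lag / m2)
    return result


# hypothetical counts along a chain of six cells: high on the left, low on the right
robbery_counts = [9, 8, 7, 1, 0, 2]
chain = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
local_i = local_morans_i(robbery_counts, chain)
print([round(v, 2) for v in local_i])
```

positive values mark cells that resemble their neighbors (high next to high, or low next to low), while values near zero or below mark transitions and outliers.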
poisson regression
the poisson regression model fits well: the spatial distribution of predicted risk resembles that of the observed robbery counts. scatterplots reveal correlations between the engineered features and robbery counts.
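to make the model form concrete, a poisson regression assumes the log of the expected count is linear in the features. the sketch below fits an intercept and a single feature by newton-raphson on hypothetical data; it illustrates the technique only and is not the fitting routine used for this project.

```python
import math


def fit_poisson(x, y, max_iter=50, tol=1e-10):
    """Fit log(E[y]) = b0 + b1 * x by Newton-Raphson; returns (b0, b1)."""
    b0 = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # score: gradient of the Poisson log-likelihood
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # entries of the 2x2 information matrix
        s0 = sum(mu)
        s1 = sum(mi * xi for mi, xi in zip(mu, x))
        s2 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = s0 * s2 - s1 * s1
        # Newton step: inverse information matrix times the score
        d0 = (s2 * g0 - s1 * g1) / det
        d1 = (s0 * g1 - s1 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# hypothetical data: a risk-factor exposure score and robbery counts per cell
exposure = [0, 1, 2, 3, 4]
robberies = [1, 1, 2, 4, 7]
b0, b1 = fit_poisson(exposure, robberies)
fitted = [math.exp(b0 + b1 * xi) for xi in exposure]
print(round(b0, 3), round(b1, 3))
```

because the link is logarithmic, `exp(b1)` reads as the multiplicative change in expected robberies per unit of the feature, and the fitted counts can never be negative.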
spatial cross-validation
using a spatial leave-one-group-out (logo-cv) approach, i validate the model’s performance. despite some variance across holdout folds, the spatial cross-validation method generally outperforms the random k-fold approach, suggesting improved generalizability for geospatial predictions.
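the splitting logic behind leave-one-group-out validation can be sketched as follows. it assumes the groups are neighborhoods, uses hypothetical counts, and stands in a mean-of-training-cells baseline for the actual regression so the example stays short.

```python
def logo_folds(groups):
    """Yield (held_out_group, train_indices, test_indices) for each group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


def mean_absolute_error(observed, predicted):
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# hypothetical grid cells: neighborhood label and robbery count per cell
neighborhoods = ["a", "a", "b", "b", "c"]
robbery_counts = [2, 4, 0, 1, 6]

fold_mae = {}
for held_out, train, test in logo_folds(neighborhoods):
    # stand-in model: predict the mean count of the training cells
    baseline = sum(robbery_counts[i] for i in train) / len(train)
    observed = [robbery_counts[i] for i in test]
    fold_mae[held_out] = mean_absolute_error(observed, [baseline] * len(test))

print(fold_mae)
```

each fold is scored only on cells from a neighborhood the model never saw, so the spread of `fold_mae` across groups is itself a measure of how evenly the model generalizes.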
accuracy and generalizability
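the model’s mean absolute error (mae) is lower with spatial cross-validation, demonstrating better handling of spatial dependencies. however, the mean error by racial context shows disparities. because mae cannot be negative, the signed figures below are mean errors, and their opposite signs show the model erring in opposite directions in the two contexts:
- majority non-white neighborhoods: mean error is -0.18
- majority white neighborhoods: mean error is +0.15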
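a signed mean error and a mean absolute error answer different questions, so it can help to compute both per group. the sketch below assumes errors are defined as predicted minus observed (the write-up does not state its sign convention) and uses hypothetical group labels and values.

```python
def error_summary(observed, predicted):
    """Signed mean error and mean absolute error.

    Assumption: error = predicted - observed, so a negative mean error
    means the model under-predicts on average.
    """
    errors = [p - o for o, p in zip(observed, predicted)]
    n = len(errors)
    return {
        "mean_error": sum(errors) / n,
        "mae": sum(abs(e) for e in errors) / n,
    }


# hypothetical cells: (group label, observed count, predicted count)
cells = [
    ("group_1", 3, 2.0), ("group_1", 2, 2.0), ("group_1", 4, 3.0),
    ("group_2", 1, 2.0), ("group_2", 0, 0.0), ("group_2", 2, 2.0),
]

by_group = {}
for label in sorted({c[0] for c in cells}):
    observed = [o for g, o, _ in cells if g == label]
    predicted = [p for g, _, p in cells if g == label]
    by_group[label] = error_summary(observed, predicted)

print(by_group)
```

the mae is non-negative by construction, while the mean error keeps its sign and so reveals whether a group is systematically under- or over-predicted.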
kernel density comparison
kernel density estimation (kde) serves as a baseline for comparison with the risk prediction model. kde centers a smooth curve over each crime point, so its predictions rest solely on spatial autocorrelation. while kde provides a useful baseline, the risk model captures additional variance by incorporating contextual factors.
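the idea of centering a smooth curve over each point can be sketched with a two-dimensional gaussian kernel. the bandwidth and coordinates below are hypothetical, and the kernel choice is an assumption, since the write-up does not name the one used.

```python
import math


def gaussian_kde_2d(location, points, bandwidth):
    """Kernel density at `location` from a 2-D Gaussian centered on each point."""
    total = 0.0
    for px, py in points:
        d2 = (location[0] - px) ** 2 + (location[1] - py) ** 2
        total += math.exp(-d2 / (2 * bandwidth ** 2))
    # normalize so the surface integrates to 1 over the plane
    return total / (len(points) * 2 * math.pi * bandwidth ** 2)


# hypothetical robbery locations: a tight cluster plus one isolated point
robberies = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (9.0, 9.0)]

near_cluster = gaussian_kde_2d((1.0, 1.0), robberies, bandwidth=1.0)
far_away = gaussian_kde_2d((5.0, 5.0), robberies, bandwidth=1.0)
print(near_cluster > far_away)
```

the density depends only on where past incidents occurred, which is why kde works as a baseline: any gain from the risk model has to come from the contextual features.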
model performance
the model’s fit is compared against the kernel density method for predicting 2019 robberies. results show that while the risk prediction model performs well, its accuracy remains uncertain due to possible selection bias. the ambiguity in how features were selected and weighted affects confidence in the model’s generalizability.
discussion
i wouldn’t recommend deploying this model into production just yet. while the model performs well and generalizes across different neighborhoods, i can’t rule out the influence of selection bias. the choice of features and their potential socioeconomic implications may skew results, especially when applied to diverse communities.
the current model includes four key risk factors, but richer data on features like banks, atms, and shopping malls could enhance predictive accuracy for armed robbery. however, without addressing the inherent biases in the data, we risk amplifying disparities rather than providing fair predictions.
conclusion
this geospatial risk model offers valuable insights and demonstrates strong predictive capabilities, but careful consideration of feature selection and bias mitigation is crucial before practical application. future iterations should focus on refining the feature set and validating the model across varied contexts to ensure equity and fairness.
copyright @liuhaobing