Crime Risk Prediction in Chicago
Chicago, known for its vibrant history, faces significant challenges with crime and public safety. Predicting crime, much like the premise of Minority Report, is difficult because human behavior is unpredictable and community dynamics vary. Inspired by recent advances in crime prediction algorithms (Rotaru et al., 2022; Kim et al., 2021), this project builds a machine learning model on Chicago crime data from 2015-2019, incorporating both social and spatial factors to improve predictive accuracy.
Project Structure
This project is divided into five key parts:
- Data wrangling: cleaned, joined, and transformed the crime dataset, integrating socioeconomic data.
- Exploratory analysis: examined crime trends across communities, focusing on type, location, time, and socioeconomic context.
- Feature engineering: used KDE, clustering, and KNN algorithms to uncover spatial crime patterns and relationships with amenities.
- Predictive modeling: employed a random forest model to forecast crime rates based on selected features.
- Evaluation: assessed model performance using metrics like MAE, identifying areas for improvement.
Data Wrangling
The primary dataset consists of 1,303,648 crime records from the Chicago Data Portal (2015-2019), including detailed information on crime type, location, and time. The Chicago neighborhood dataset, covering 77 communities, was merged using JSON data from the API. Preprocessing involved formatting date and time, categorizing variables, and calculating additional indices such as weapon count, arrest index, and domestic violence rate.
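The preprocessing steps above can be illustrated with a minimal pandas sketch. It is not the project's actual code: the column names are assumed to follow the Chicago Data Portal export, the index definitions (share of incidents with an arrest, share flagged as domestic) are one plausible reading of "arrest index" and "domestic violence rate", and the socioeconomic values are made up.

```python
import pandas as pd

# Toy records. Column names are assumed to match the Chicago Data Portal export.
crimes = pd.DataFrame({
    "Date": ["01/05/2015 11:30:00 PM", "07/14/2016 02:15:00 AM",
             "07/20/2016 06:45:00 PM", "12/01/2019 09:00:00 AM"],
    "Primary Type": ["THEFT", "BATTERY", "ASSAULT", "THEFT"],
    "Arrest": [False, True, False, False],
    "Domestic": [False, True, False, False],
    "Community Area": [25, 25, 67, 67],
})

# Format date and time, then derive calendar features.
crimes["Date"] = pd.to_datetime(crimes["Date"], format="%m/%d/%Y %I:%M:%S %p")
crimes["year"] = crimes["Date"].dt.year
crimes["month"] = crimes["Date"].dt.month
crimes["hour"] = crimes["Date"].dt.hour

# Categorize the crime type.
crimes["Primary Type"] = crimes["Primary Type"].astype("category")

# Per-community indices (assumed definitions: simple shares of incidents).
community_stats = (
    crimes.groupby("Community Area")
    .agg(incidents=("Arrest", "size"),
         arrest_index=("Arrest", "mean"),
         domestic_rate=("Domestic", "mean"))
    .reset_index()
)

# Hypothetical socioeconomic table with made-up values, joined by community.
socio = pd.DataFrame({"Community Area": [25, 67], "poverty_rate": [0.27, 0.32]})
merged = community_stats.merge(socio, on="Community Area", how="left")
print(merged)
```

A left join keeps every community that has crime records even if its socioeconomic row is missing, which makes gaps visible as NaN instead of silently dropping communities.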
Exploratory Analysis
Analysis revealed distinct crime patterns across Chicago’s communities. Using interactive maps created with Folium, cumulative crime events were visualized, highlighting high concentrations in West and South Side neighborhoods such as Austin, West Englewood, and the Near West Side. Crime trends were further explored using Plotly and Altair charts:
- Crime type and location: theft, battery, and assault are the most frequent offenses. Streets are the most common crime locations, followed by residences and apartments.
- Seasonal trends: crime peaks during summer months (July-August), correlating with higher social activity and favorable weather conditions.
- District analysis: heatmaps of 31 districts show high frequencies of assault, burglary, narcotics, and motor vehicle theft. Specific crimes, like prostitution, are concentrated in districts 7 and 11, indicating localized issues.
These findings suggest the influence of both environmental and socioeconomic factors on crime distribution.
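The seasonal and district summaries above reduce to two aggregations, sketched here with pandas on toy data. The column names and values are assumptions for illustration; the district-by-type table is the kind of matrix a heatmap would be drawn from.

```python
import pandas as pd

# Toy events with made-up dates, types, and districts.
events = pd.DataFrame({
    "Date": pd.to_datetime(["2016-01-10", "2016-07-04", "2016-07-19",
                            "2016-08-02", "2016-11-30"]),
    "Primary Type": ["THEFT", "BATTERY", "THEFT", "ASSAULT", "BATTERY"],
    "District": [11, 7, 11, 7, 11],
})

# Seasonal trend: number of incidents per calendar month.
monthly_counts = events.groupby(events["Date"].dt.month).size()
peak_month = monthly_counts.idxmax()

# District analysis: incident counts per district and crime type.
district_by_type = pd.crosstab(events["District"], events["Primary Type"])

print(monthly_counts)
print("Peak month:", peak_month)
print(district_by_type)
```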
Feature Engineering
To capture spatial patterns, kernel density estimation (KDE) was used to analyze the density of nine representative crime types (e.g., robbery, narcotics, assault). The KDE results showed theft concentrated downtown, while narcotics offenses clustered on the West Side.
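As a rough illustration of spatial KDE, the sketch below fits scikit-learn's KernelDensity to synthetic coordinates. Whether the project used this estimator is not stated, so treat the library choice, the bandwidth, and the hotspot location as assumptions.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Synthetic (latitude, longitude) points scattered around a made-up hotspot.
hotspot = np.array([41.88, -87.63])
points_deg = hotspot + rng.normal(scale=0.01, size=(200, 2))

# The haversine metric expects [lat, lon] in radians and needs a ball tree.
# Bandwidth is in radians: 0.0005 rad is roughly 3 km on the Earth's surface.
kde = KernelDensity(kernel="gaussian", bandwidth=0.0005,
                    metric="haversine", algorithm="ball_tree")
kde.fit(np.radians(points_deg))

# Compare the estimated log-density at the hotspot and at a distant location.
query_deg = np.array([[41.88, -87.63], [41.70, -87.90]])
log_density = kde.score_samples(np.radians(query_deg))
print(log_density)
```

Fitting one estimator per crime type and evaluating each on a common grid would give comparable density surfaces. The values are most useful for relative comparison between locations.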
K-means clustering grouped communities by crime frequency, revealing five distinct clusters with varying average incident rates. The analysis showed:
- Cluster 0: 31 communities with moderate crime rates (~10,000 incidents).
- Cluster 3: a single high-crime community (~77,000 incidents), indicating an area of significant concern.
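A minimal sketch of this clustering step, assuming one total incident count per community as the only feature. The counts are made up, and the cluster ids K-means assigns are arbitrary, so they will not necessarily match the numbering above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up incident totals, one row per community.
incident_totals = np.array(
    [[2_000], [2_500], [9_500], [10_500], [20_000],
     [21_000], [40_000], [42_000], [77_000]],
    dtype=float,
)

# Five clusters, as in the analysis described above.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(incident_totals)

# Summarize each cluster: how many communities, and their mean incident count.
for cluster_id in range(5):
    members = incident_totals[labels == cluster_id, 0]
    print(cluster_id, len(members), members.mean())
```

An extreme outlier, such as the community with about 77,000 incidents, tends to end up alone in its own cluster because merging it with any other group would sharply increase within-cluster variance.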
KNN analysis was performed to measure the proximity of crime locations to amenities like schools, grocery stores, and subway stations. The results indicated strong correlations between crime occurrences and certain points of interest (e.g., depository locations, abandoned buildings).
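One way to turn amenity proximity into a feature is a nearest-neighbor query on geographic coordinates. The sketch below uses scikit-learn's BallTree with the haversine metric. The coordinates are made up, and the project's actual KNN setup may differ.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0

# Made-up (latitude, longitude) pairs in degrees.
stations = np.array([[41.8857, -87.6309],
                     [41.8784, -87.6298],
                     [41.7800, -87.6155]])
crime_sites = np.array([[41.8860, -87.6300],
                        [41.7500, -87.7000]])

# The haversine metric works on radians and returns great-circle distances
# on the unit sphere, so multiply by the Earth's radius to get kilometers.
tree = BallTree(np.radians(stations), metric="haversine")
dist_rad, nearest_idx = tree.query(np.radians(crime_sites), k=1)
dist_km = dist_rad[:, 0] * EARTH_RADIUS_KM

for site, idx, d in zip(crime_sites, nearest_idx[:, 0], dist_km):
    print(site, "-> nearest station", idx, f"({d:.2f} km)")
```

Raising k and averaging the distances gives a smoother measure of local amenity density than the single nearest neighbor.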
Predictive Modeling: Random Forest Regression
A random forest model was built using sklearn’s ensemble methods, splitting the dataset into 70% training and 30% testing data. The model incorporated features such as population, median rent, distance to key amenities, and socioeconomic indices.
The random forest achieved a score of 0.54 (presumably R², which is the default score for sklearn regressors), indicating moderate predictive performance:
- Feature importance analysis: top predictors included poverty rate, median contract rent, and distance to subway stations, highlighting the impact of socioeconomic factors on crime rates.
- Limitations: the model struggled in high-crime areas like downtown, suggesting the need for additional features (e.g., traffic data, weather conditions) for improved predictions.
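The modeling workflow can be sketched as follows: a 70/30 split, a random forest regressor, the held-out score, and ranked feature importances. The data is synthetic and the feature names, hyperparameters, and target relationship are illustrative assumptions, so the numbers it prints say nothing about the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500

# Synthetic features with hypothetical names.
X = pd.DataFrame({
    "population": rng.integers(1_000, 100_000, n),
    "median_rent": rng.uniform(500, 2_500, n),
    "poverty_rate": rng.uniform(0, 0.6, n),
    "dist_to_subway_km": rng.uniform(0, 10, n),
})
# Synthetic target: an arbitrary made-up relationship plus noise.
y = 200 * X["poverty_rate"] - 5 * X["dist_to_subway_km"] + rng.normal(0, 10, n)

# 70% training, 30% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# For regressors, score() returns the coefficient of determination (R^2).
r2 = model.score(X_test, y_test)

# Impurity-based importances, which sum to 1 across features.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(f"R^2 on held-out data: {r2:.2f}")
print(importances)
```

Impurity-based importances can overstate high-cardinality features, so permutation importance on the test set is a common cross-check.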
Evaluation and Interpretation
Model evaluation using mean absolute error (MAE) revealed higher errors in downtown and West Side areas, where crime rates are elevated. Errors were visualized using hexagonal grid maps, with darker colors indicating larger discrepancies. Further analysis showed:
- Disparities by time and location: higher error rates were linked to periods of increased social activity (e.g., nighttime incidents) and densely populated areas.
- Socioeconomic influences: neighborhoods with higher economic hardship indices and crowded housing conditions showed strong correlations with increased crime rates, as visualized in the hexgrid maps.
These findings suggest that incorporating additional data sources could enhance the model’s accuracy, particularly in high-risk areas.
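MAE is the mean of |observed - predicted|, and breaking it down by area shows where the model is weakest. The sketch below uses made-up values and hypothetical area labels. In the project the grouping unit would be a hexagonal grid cell instead of a named area.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Made-up observed and predicted counts for three hypothetical areas.
results = pd.DataFrame({
    "area": ["downtown", "downtown", "west", "west", "north", "north"],
    "observed": [120, 150, 90, 110, 30, 40],
    "predicted": [100, 120, 80, 115, 32, 37],
})

# Overall MAE across all rows.
overall_mae = mean_absolute_error(results["observed"], results["predicted"])

# Per-area MAE: mean absolute error within each group, worst first.
results["abs_error"] = (results["observed"] - results["predicted"]).abs()
mae_by_area = (
    results.groupby("area")["abs_error"].mean().sort_values(ascending=False)
)

print(f"Overall MAE: {overall_mae:.2f}")
print(mae_by_area)
```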
References
- scikit-learn documentation: scikit-learn.org
- Chicago crime data: data.cityofchicago.org
- Lee, J., & King, G. (2022). Predicting crime with machine learning. Proceedings of the National Academy of Sciences.
- Rotaru, V., et al. (2022). Event-level prediction of urban crime. Nature Human Behaviour.
- Tamir, A., & Watson, E. (2021). Crime prediction using machine learning. IJCSIT.
- Kim, S., et al. (2018). Crime analysis through machine learning. IEMCON 2018.
Copyright @liuhaobing