Environment Modeling: Flood Inundation Probability Forecast



Introduction

Floods are among the most destructive natural disasters, causing significant damage to communities and infrastructure. As climate change increases the frequency and intensity of extreme weather events, effective flood risk management is increasingly important. This project develops a predictive model to estimate the likelihood of flood inundation in Calgary, Alberta, Canada. We then apply the model to Pittsburgh, Pennsylvania, U.S., a comparable riverine city, to assess its generalizability and predictive power.




Motivation

Calgary is at high risk of flooding during the spring and summer, especially when heavy rainfall coincides with snowmelt from the Rocky Mountains. Rapid water flow through steep, rocky terrain can lead to severe flooding in southern Alberta watersheds. As a river city, Calgary must proactively monitor flood risks and continuously improve its flood forecasting capabilities.


This analysis provides valuable insights for city planners in Calgary, helping inform decisions on land use, infrastructure development, and emergency preparedness. Applying the model to Pittsburgh offers a validation case for understanding its performance in a different geographic context.


Data

The analysis incorporates four key datasets for Calgary, obtained from open geospatial sources:
  1. Hydrology: detailed information on water bodies and watercourses in Calgary.
  2. Digital elevation model (DEM): derived from aerial LiDAR, this dataset provides a high-resolution (2 m) representation of ground surface topography.
  3. Land cover: a composite dataset from 2015 that categorizes land cover types, including natural areas, permeable surfaces, impermeable surfaces, and storm ponds.
  4. Soil data: obtained from the Alberta Soil Information Center, this dataset describes soil materials, which are crucial for understanding water absorption and runoff patterns.

Feature Engineering

In ArcGIS Pro, we developed a fishnet grid to divide the study area into smaller, manageable units, facilitating local-level analysis. Spatial features such as distance to rivers, slope degree, land cover permeability, and soil type were then joined to the fishnet cells.
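Outside ArcGIS, the core of the fishnet join can be sketched as a coordinate-to-cell lookup. The 200 m cell size and grid origin below are illustrative placeholders, not the project's actual grid parameters:

```python
def fishnet_cell(x, y, x_min, y_min, cell_size):
    """Return the (column, row) index of the fishnet cell containing
    projected point (x, y), given the grid origin and cell size in metres."""
    col = int((x - x_min) // cell_size)
    row = int((y - y_min) // cell_size)
    return col, row

# Example: a hypothetical 200 m fishnet anchored at the study-area origin.
print(fishnet_cell(x=450.0, y=1250.0, x_min=0.0, y_min=0.0, cell_size=200.0))  # (2, 6)
```

Each observation's features (distance to river, slope, permeability, soil type) would then be aggregated per (column, row) key.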


Dependent Variables

Using satellite imagery, we identified inundated areas along Calgary's major rivers (the Bow and the Elbow). The darker fishnet cells indicate areas with confirmed flood events.

Independent Variables

We hypothesize that flood risk is influenced by:
  • Distance to rivers: proximity to the Bow and Elbow Rivers increases flood susceptibility, especially during heavy rainfall and snowmelt.
  • Average slope degree: steeper terrain is associated with faster water flow, increasing flood risk.
  • Land cover permeability: permeable surfaces (e.g., grasslands) help absorb water, while impermeable surfaces (e.g., roads) increase runoff.
  • Soil materials: clay and silt soils have low permeability, exacerbating flood risks in low-lying areas.




Regression Model


We fitted a logistic regression using a 70% training / 30% testing split. The model estimates the probability of flood inundation from the selected features. Key coefficients include:
  • Distance to river: negative, indicating reduced flood risk as distance from the river increases.
  • Slope degree: positive, suggesting higher flood risk in steeper areas.
  • Land cover permeability: positive, showing increased flood risk in areas with impermeable surfaces.
  • Soil material: negative, highlighting the mitigating effect of permeable soils on flood risk.
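A minimal sketch of how a fitted logistic model turns these features into a probability. The coefficient values below are placeholders chosen only to match the signs reported above; they are not the estimated parameters:

```python
import math

# Placeholder coefficients (intercept, distance, slope, impermeable, soil
# permeability) with the signs reported above -- NOT the fitted estimates.
COEFS = (-1.0, -0.002, 0.05, 0.8, -0.6)

def inundation_probability(dist_to_river, slope_deg, impermeable, soil_perm, coefs=COEFS):
    """Logistic link: p = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4)))."""
    b0, b1, b2, b3, b4 = coefs
    z = b0 + b1 * dist_to_river + b2 * slope_deg + b3 * impermeable + b4 * soil_perm
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical cell 150 m from the river, 10-degree slope, impermeable cover.
p = inundation_probability(dist_to_river=150, slope_deg=10, impermeable=1, soil_perm=0)
```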

The model achieved a corrected AIC (AICc) of 400.5 and a log-likelihood of -195.206.

The resulting heatmap visualizes the Pearson correlation coefficients among the numeric variables. Each tile's color represents the strength and direction of the correlation: olive green indicates a strong negative correlation, white signifies no correlation, and dark blue a strong positive correlation.

From the heatmap, we observe that soil material has a significant negative relationship with flood inundation, suggesting that more permeable soils help reduce flood risk. Distance to rivers also shows a negative correlation, as areas farther from water bodies tend to have lower inundation probabilities. In contrast, average slope degree has a significant positive correlation with inundation.
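The Pearson coefficient behind each heatmap tile can be computed directly; a stdlib-only sketch:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences:
    covariance divided by the product of the standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

Values near -1 (olive green) and +1 (dark blue) mark the strongly correlated variable pairs; values near 0 (white) mark uncorrelated pairs.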



Evaluation

To classify the predictions, we created a variable called predOutcome that labels any predicted probability greater than 0.50 as an inundation event. A 50% threshold is a reasonable starting point for binary classification.
  • Predicted = 0, observed = 0 → true negative: the model correctly identified instances as not inundated.
  • Predicted = 1, observed = 1 → true positive: the model correctly identified instances as inundated.
  • Predicted = 1, observed = 0 → false positive: the model incorrectly classified non-inundated instances as inundated.
  • Predicted = 0, observed = 1 → false negative: the model incorrectly classified inundated instances as non-inundated.

Sensitivity (true positive rate): the proportion of actual positives correctly identified by the model.
Specificity (true negative rate): the proportion of actual negatives correctly identified by the model.

From the confusion matrix, we observe:
  • True negatives: 157 instances
  • True positives: 42 instances
  • False positives: 15 instances
  • False negatives: 29 instances

The model's sensitivity is 0.5915, indicating that it correctly identified 59.15% of the positive cases. The specificity is 0.9128, indicating that it correctly identified 91.28% of the negative cases.
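Both rates follow directly from the confusion-matrix counts above:

```python
tn, tp, fp, fn = 157, 42, 15, 29        # counts from the confusion matrix
sensitivity = tp / (tp + fn)            # 42 / 71: share of inundated cells caught
specificity = tn / (tn + fp)            # 157 / 172: share of dry cells correctly kept
print(round(sensitivity, 4), round(specificity, 4))  # 0.5915 0.9128
```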

The overall accuracy of the model is 0.8189, meaning it correctly classified 81.89% of the observations. The 95% confidence interval (CI) suggests that the true accuracy lies between 0.7646 and 0.8652. The model significantly outperforms the no-information rate (NIR), as indicated by a p-value of 0.00004793.

The kappa statistic is 0.5353, measuring agreement between predicted and actual classes beyond chance: a kappa of 1 indicates perfect agreement, while 0 indicates agreement no better than chance. A value of 0.5353 suggests moderate agreement, reflecting reasonably strong performance.
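The accuracy and kappa values can be reproduced from the same confusion-matrix counts, with chance agreement derived from the marginal totals:

```python
tn, tp, fp, fn = 157, 42, 15, 29
total = tn + tp + fp + fn                       # 243 test cells
accuracy = (tp + tn) / total                    # observed agreement (p_o)
# Chance agreement (p_e): product of the marginal totals for each class.
pe = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
kappa = (accuracy - pe) / (1 - pe)
print(round(accuracy, 4), round(kappa, 4))      # 0.8189 0.5353
```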

The ROC curve showed an AUC of 0.8867, suggesting high predictive power. Cross-validation with 100 folds gave an average accuracy of 83% and a kappa statistic of 0.55.

To assess the model's robustness, we applied 100-fold cross-validation on 812 training samples. Accuracy scores range from 0.5 to 1.0, with an average of 0.8; the kappa statistic spans -0.23 to 1.0, averaging 0.54. These results indicate substantial variability in performance across folds, with some achieving perfect agreement (kappa = 1). This wide range suggests potential overfitting to certain subsets of the data, leading to inconsistent performance across the folds.
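The fold construction can be sketched as follows. With 812 samples split into 100 folds, each held-out fold contains only 8 or 9 cells, which helps explain the wide per-fold variance in accuracy and kappa:

```python
def kfold_indices(n_samples, k):
    """Partition sample indices 0..n_samples-1 into k roughly equal,
    contiguous folds (in practice the indices would be shuffled first)."""
    folds, start = [], 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold_indices(812, 100)  # each held-out fold has only 8-9 cells
```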




Prediction Map

The flood prediction map for Calgary highlights high-risk areas concentrated along the Bow and Elbow Rivers. High-risk cells align with steep slopes and impermeable surfaces near water bodies.




Conclusion

This project developed a predictive model for flood risk in Calgary and validated its performance in a comparable city (Pittsburgh). Key takeaways include:
  • Strong predictive power: the model predicts flood risk well from hydrological, topographical, and land cover features.
  • Importance of spatial features: distance to rivers and land cover permeability were critical predictors.
  • Generalizability: applying the model to Pittsburgh demonstrated its potential for use in other riverine cities, offering a valuable tool for urban flood risk management.


Improvements

  • Feature simplification: future work could reduce the number of predictors, possibly using decision tree algorithms to identify the most critical features.
  • Addressing overfitting: while the model performed well overall, some cross-validation folds suggest overfitting. Additional testing and regularization methods could enhance robustness.
  • Enhancing spatial analysis: further spatial features, such as infrastructure age and urban development patterns, may help refine predictions, particularly in high-variance areas.


References

1. Ken Steif (2021). Public Policy Analytics: Code & Context for Data Science in Government.
2. Alberta Soil Information Center. Soil Data for Flood Risk Analysis. Retrieved from the Alberta soil data archives.


Copyright @liuhaobing