spatio-temporal modeling: forecasting metro train delays in and around nyc and new jersey


introduction

this project focuses on predicting delays for nj transit and amtrak trains operating in and around new york city. nj transit is a complex rail network with 11 lines and 162 stations, serving thousands of commuters daily. delays disrupt schedules, cause frustration, and impact passenger satisfaction. 

our goal is to develop a reliable predictive model that helps passengers anticipate delays, giving them enough time to adjust their travel plans. we aim to offer insights into the key factors contributing to train delays, helping operators enhance scheduling and management decisions.



use case: Trainspotting app

train delays are a recurring issue on the nj transit system, impacting travelers daily. from the nj transit website, it’s evident that delays occur frequently, causing passenger frustration and retention issues. to address this, we created trainspotting, an app that predicts delays and provides real-time train information. our target users include:
  • daily commuters relying on trains for work.
  • tourists navigating the city.
  • business travelers who need timely transportation.

several kinds of professionals work together to support the app. the datasets come from the train companies. professional data analysts clean, analyze, and visualize the data so that users can understand it easily. the transportation department can then improve traffic policy based on the analysis results.

the goal of our app is to provide users with predicted train arrival times so that they can adjust their travel plans. we hope this gives users more confidence in traveling by train.

a reliable prediction of train delays offers users a better ride experience, which can increase rail ridership. the analysis report also gives train operators a better understanding of the reasons for train delays, so they can adapt train management to different conditions.

in addition, to put a machine learning model into the hands of a non-technical decision maker, the app should be easy to use. it has a user-friendly interface that lets decision makers access and use the model without technical expertise, along with clear instructions on how to use the model and interpret its predictions. the app also includes visualizations and summary statistics that help decision makers understand the predictions, and it allows them to adjust parameters to meet their specific needs.

    here is our user interface design. 

    the main page shows the real-time status and the predicted arrival time of each train. users can zoom in and out on the map to get an intuitive view of train movements, and clicking the alert button gives a quick look at all delayed-train notifications. the app also lets users access a report of their trips, including predicted delays, historical arrival times, and weather at stops.




    methodology 

    because train delays exhibit very strong space-time dependencies, we build a space-time geospatial machine learning model. the dependent variable is train delay time. after the exploratory data analysis, we engineer both spatial and temporal features.

    in addition, a strong time series model requires an understanding of the underlying temporal process of train operations. spatial effects such as inclement weather are also initial features for predicting delay times. specifically, we use linear regression (the number of records is large enough) with fixed effects and with time lags and station lags as predictors. we then compare the predicted delays with the observed delays and examine the spatial pattern of mean absolute errors (maes) in the test set. finally, we use cross-validation to assess whether the model is accurate and generalizable.
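
as a sketch only, the specification described above can be written as a single equation. the notation is our own assumption and is not taken from the project code: s indexes a stop along a train's stop sequence, t indexes a 60-minute interval, \alpha_s is a station fixed effect, \tau and \lambda are hour-of-day and day-of-week effects, w_t is a vector of weather measures, and the two sums are the station lags and time lags (three of each, as described in the feature engineering section). the second line is the standard definition of the mean absolute error used to compare models.

```latex
\[
\text{delay}_{s,t} = \beta_0 + \alpha_s + \tau_{h(t)} + \lambda_{d(t)}
  + \sum_{k=1}^{3} \gamma_k \,\text{delay}_{s-k,\,t}
  + \sum_{j=1}^{3} \delta_j \,\text{delay}_{s,\,t-j}
  + \mathbf{w}_t^{\top}\boldsymbol{\theta} + \varepsilon_{s,t}
\]
\[
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
\]
```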


    data collection 

    for this project, we gathered a comprehensive dataset that combines historical and real-time information to make accurate predictions about train delays. the datasets used include:
    • train trip data: granular performance data from over 667,000 nj transit and amtrak train trips during october to december 2019 (pre-pandemic period), covering busy travel periods like thanksgiving and christmas. we sourced this dataset from pranav badami’s nj transit and amtrak performance records.
    • station data: latitude and longitude details for nj transit and amtrak stations were merged with the train trip dataset to create a geospatial data frame (trip.sf). this allowed us to analyze spatial patterns and integrate station-level information into the model.
    • census data: demographic and transportation data from the nyc and nj census were included. we used this data to analyze commuting patterns, population density, and transportation usage. the spatial information was utilized for mapping and joining purposes, creating a geospatial object (sf) that includes geometry.
    • transportation data: infrastructure statistics from the bureau of transportation statistics, including metrics like the percentage of bridges in poor condition and the proportion of residents commuting by public transit. these data points were extracted from the national transportation atlas database (ntad), incorporating federal and state-level transportation information.
    • weather data: real-time weather conditions for both new york city and newark, nj, were analyzed. we used riem_measures to import weather data from the newark liberty international airport. the weather dataset (njweather.panel) includes hourly measures of temperature, precipitation, wind speed, and visibility, covering the october to december period.


    data wrangling and preparation

    the datasets mentioned above were cleaned and merged to form a final comprehensive panel (final.panel) that includes every possible combination of spatial and temporal observations. specific details of each dataset are as follows:
    1. train trips data: we selected data from october to december 2019, a period representing typical pre-pandemic train operations, including holiday traffic. this dataset required extensive cleaning due to inconsistencies in temporal data. we sorted the records by 60-minute intervals to standardize the time series.
    2. train stations data: the station dataset was integrated into the main train trips dataset by joining latitude and longitude information. this allowed us to create a geospatial data frame (trip.sf), providing accurate mapping and spatial analysis of train movements.
    3. census data integration: we imported census data for both new york and new jersey, focusing on variables related to transportation, such as the number of commuters and the total use of public transit. this data was used to assign origin-destination pairs to the train trips, enriching the dataset with demographic context. the census tracts were extracted as geospatial objects for mapping and analysis.
    4. transportation data analysis: sourced from the ntad, this dataset includes information on bridge conditions and commuting trends. it contains 22 columns with variables like the percentage of poor-condition bridges and the proportion of workers commuting within the county. this data helped link infrastructure quality to delay patterns in our analysis.
    5. weather data processing: we divided the weather data into two panels:
      • nyc weather panel: focused on weather conditions at new york city destination stations, accounting for factors like fog, wind gusts, and precipitation that may influence train delays.
      • nj weather panel: used data from the newark liberty international airport, capturing hourly weather variables such as temperature, wind speed, and precipitation for the new jersey area. this panel dataset was crucial for identifying correlations between adverse weather conditions and train delays.
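
the panel construction described above can be sketched in a few lines. this is a minimal illustration in plain python, not the project's own pipeline (which builds final.panel elsewhere). the function names, the record layout, and the column names interval60, trips, and mean_delay are hypothetical. it floors each timestamp to its 60-minute interval, creates one row for every station and hour combination, and joins hourly weather by interval.

```python
from datetime import datetime, timedelta
from itertools import product


def floor_to_hour(ts):
    """Truncate a datetime to the start of its 60-minute interval."""
    return ts.replace(minute=0, second=0, microsecond=0)


def build_panel(records, weather_by_hour):
    """Build a station x hour panel with mean delay and joined weather.

    records: iterable of (station, timestamp, delay_minutes) tuples
    weather_by_hour: dict mapping interval start -> dict of weather measures
    (both layouts are assumptions made for this sketch)
    """
    # Aggregate observed delays per (station, hour) cell.
    sums = {}
    for station, ts, delay in records:
        key = (station, floor_to_hour(ts))
        total, count = sums.get(key, (0.0, 0))
        sums[key] = (total + delay, count + 1)

    # Enumerate every station and every hour between the first and last one.
    stations = sorted({station for station, _ in sums})
    first = min(hour for _, hour in sums)
    last = max(hour for _, hour in sums)
    n_hours = int((last - first) / timedelta(hours=1)) + 1
    hours = [first + timedelta(hours=i) for i in range(n_hours)]

    panel = []
    for station, hour in product(stations, hours):
        total, count = sums.get((station, hour), (0.0, 0))
        row = {
            "station": station,
            "interval60": hour,
            "trips": count,
            # Cells with no trips keep a row but have no observed delay.
            "mean_delay": total / count if count else None,
        }
        row.update(weather_by_hour.get(hour, {}))
        panel.append(row)
    return panel


records = [
    ("newark penn", datetime(2019, 10, 1, 8, 15), 4.0),
    ("newark penn", datetime(2019, 10, 1, 8, 45), 6.0),
    ("secaucus", datetime(2019, 10, 1, 10, 5), 2.0),
]
weather = {datetime(2019, 10, 1, 8): {"precip_in": 0.1}}
panel = build_panel(records, weather)
print(len(panel))  # 2 stations x 3 hours = 6 rows
```

keeping the empty cells matters because the time lags are computed per station across consecutive intervals, so a missing hour must still occupy a row.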

    feature engineering

    to improve the accuracy of our prediction model, we implemented several feature engineering techniques, focusing on spatial, temporal, and external factors. this process helped capture key dependencies and correlations influencing train delays.

    • station lags: delays at one station often ripple through to subsequent stops, making it challenging for trains to regain punctuality once delayed. to account for this, we introduced station lag features, which measure the influence of delays at preceding stations. we sorted the train data by stop sequence and calculated delay minutes for the three previous stations. this “3-lag range” captures the impact of upstream delays on subsequent stops, enhancing the model's ability to predict cascading disruptions.

    • time lags: train delays exhibit strong temporal dependencies, particularly during peak travel times. we engineered time lag features to capture serial autocorrelation in delay patterns. the dataset was grouped by 60-minute intervals, and we created lag variables that represent delay minutes from previous time periods (e.g., lag 1 hour, lag 2 hours, lag 3 hours). these time lags help the model detect recurring delay patterns over time.

    • panel: after creating station and time lags, we merged these features into a comprehensive final.panel dataset. this panel includes station lags, time lags, train trip data, census data, and weather information, forming a robust framework for our predictive model.

    • serial autocorrelation analysis: our analysis revealed a positive linear relationship between delay times and time lag features. visual inspection of the plots confirmed that delays tend to persist over time, indicating strong temporal autocorrelation. similarly, station lag features showed a clear linear correlation, particularly highlighting the significant influence of delays at previous stations on the next stop’s punctuality.


    • weather correlation: extreme weather conditions can severely impact train schedules, leading to widespread delays. we examined the relationship between various weather factors (e.g., temperature, wind speed, precipitation, visibility, gusts, and ice accretion) and delay times. the analysis indicated that high wind speeds and heavy precipitation at departure stations in new jersey are strongly correlated with longer delay minutes. incorporating weather data as features improved the model’s ability to predict delays under adverse conditions.

    • correlation matrix: to identify and remove redundant features, we visualized the relationships between variables using a correlation matrix. strong correlations were observed between certain demographic and transportation features. for example, the number of resident workers who work from home was closely related to the number of resident workers commuting within the county. based on these findings, we excluded highly correlated variables like total_pop and number.of.resident.workers.who.work.at.home to reduce multicollinearity and enhance the model’s efficiency.
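
the station and time lags described in this section can be sketched as follows. this is an illustrative plain-python version under assumed names: the keys trip_id, stop_sequence, delay_minutes, interval60, and mean_delay and both function names are hypothetical, and the project's own implementation may differ. station lags look back along the stop sequence of one trip. time lags look back over earlier 60-minute intervals at the same station.

```python
from datetime import datetime, timedelta


def add_station_lags(stops, n_lags=3):
    """Attach delays observed at the previous n_lags stops of the same trip."""
    ordered = sorted(stops, key=lambda r: (r["trip_id"], r["stop_sequence"]))
    out = []
    history = []
    current_trip = None
    for row in ordered:
        if row["trip_id"] != current_trip:
            # A new trip starts with no upstream delays.
            current_trip = row["trip_id"]
            history = []
        new_row = dict(row)
        for k in range(1, n_lags + 1):
            new_row[f"station_lag{k}"] = history[-k] if len(history) >= k else None
        history.append(row["delay_minutes"])
        out.append(new_row)
    return out


def add_time_lags(panel, n_lags=3):
    """Attach the mean delay at the same station 1..n_lags hours earlier."""
    lookup = {(r["station"], r["interval60"]): r["mean_delay"] for r in panel}
    out = []
    for row in panel:
        new_row = dict(row)
        for k in range(1, n_lags + 1):
            earlier = row["interval60"] - timedelta(hours=k)
            new_row[f"time_lag{k}h"] = lookup.get((row["station"], earlier))
        out.append(new_row)
    return out


stops = [
    {"trip_id": "a", "stop_sequence": seq, "delay_minutes": delay}
    for seq, delay in [(1, 1.0), (2, 2.0), (3, 3.0), (4, 5.0)]
]
lagged_stops = add_station_lags(stops)
print(lagged_stops[3]["station_lag1"])  # delay at the stop just before the 4th stop

hourly = [
    {"station": "x", "interval60": datetime(2019, 10, 1, h), "mean_delay": d}
    for h, d in [(8, 1.0), (9, 2.0), (10, 4.0)]
]
lagged_hours = add_time_lags(hourly)
```

lags that fall before the first stop or the first interval are left as None rather than filled with zero, so that a missing upstream observation is not mistaken for an on-time train.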


    predictive modeling: regression 

    we tested seven regression models, each incorporating different sets of features:
    • model a: focused solely on time effects (hour and day of the week).
    • model b: included spatial factors (station fixed effects) along with weather data.
    • model c: combined both time and spatial effects.
    • models d & e: incorporated station lags and time lags, respectively.
    • model f: integrated both station lags and time lags.
    • model g: the final model included all features, including census data and transportation variables.

    model g showed the highest predictive accuracy, achieving an r² of 0.789 and a significant reduction in mean absolute error (mae), demonstrating the effectiveness of including both spatial and temporal lags.
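
to make the r² comparison concrete, here is a minimal sketch of fitting a least-squares line and scoring it. it is a one-predictor stand-in (delay regressed on a single station lag) for the multi-feature models a through g, written in plain python. the function names and the toy numbers are hypothetical and are not the project's data or results.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(y, y_hat):
    """Share of the variance in y explained by the predictions y_hat."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - hi) ** 2 for yi, hi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Toy example: delay at a stop versus delay at the previous stop.
station_lag1 = [0.0, 1.0, 2.0, 3.0]
delay = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_simple_ols(station_lag1, delay)
predicted = [intercept + slope * x for x in station_lag1]
score = r_squared(delay, predicted)
```

the same r_squared function can score any of the candidate models, which is what makes a side-by-side comparison of different feature sets possible.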



    evaluation and model performance

    the model’s performance was assessed using several metrics:
    • mae: the inclusion of station lags resulted in a 50% reduction in mae, highlighting their importance in delay prediction.
    • residual analysis: errors were higher for specific stations (e.g., absecon on the atlantic city line) due to potential issues like outdated infrastructure.
    • cross-validation: conducted using weekly partitions, confirming the model’s generalizability across different time periods.
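
the metrics above can be sketched in plain python. this is an illustration under assumptions: the source states only that cross-validation used weekly partitions, so the leave-one-week-out scheme shown here is one possible reading, and the row keys (station, observed, predicted, week, y) and function names are hypothetical.

```python
from collections import defaultdict


def mean_absolute_error(observed, predicted):
    """Average absolute gap between observed and predicted delays."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mae_by_station(rows):
    """Group absolute errors by station to expose spatial error patterns."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["station"]].append(abs(r["observed"] - r["predicted"]))
    return {station: sum(errs) / len(errs) for station, errs in groups.items()}


def leave_one_week_out(rows, fit, predict):
    """Hold out each week in turn, train on the rest, and score the held-out week.

    fit(train_rows) returns a model; predict(model, row) returns a float.
    Returns a dict mapping each week to its held-out MAE.
    """
    weeks = sorted({r["week"] for r in rows})
    scores = {}
    for week in weeks:
        train = [r for r in rows if r["week"] != week]
        test = [r for r in rows if r["week"] == week]
        model = fit(train)
        scores[week] = mean_absolute_error(
            [r["y"] for r in test],
            [predict(model, r) for r in test],
        )
    return scores


# Toy usage with a mean-delay baseline standing in for the regression.
rows = [
    {"week": 1, "y": 2.0},
    {"week": 1, "y": 4.0},
    {"week": 2, "y": 6.0},
    {"week": 3, "y": 0.0},
]
fit_mean = lambda train: sum(r["y"] for r in train) / len(train)
predict_mean = lambda model, row: model
weekly_mae = leave_one_week_out(rows, fit_mean, predict_mean)
```

if the held-out mae is similar across weeks, that supports the claim of generalizability. a week with a much higher mae (for example a holiday week) would point to conditions the model has not learned.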


    despite the strong performance, the model struggled with extreme delays (over 30 minutes), suggesting the need for additional features or adjustments to handle outliers. the following time series line chart compares the predictions of the seven models and confirms the earlier finding that models f and g generally have the best goodness of fit. we therefore chose model g as our final model.


    the following map shows clearly that the mean absolute error of model e decreases sharply once station lags are added to the model. it also indicates that models with station lags, such as models e, f, and g, perform better at predicting train delay times. the map also reveals spatial patterns in the maes.




    conclusion

    this project successfully developed a regression model to predict train delays in the nj transit and amtrak systems, leveraging spatial, temporal, and external features. key predictors included station data, time effects, weather conditions (temperature, precipitation, wind speed, visibility, gust, and ice), time lags, transportation factors, and station lags.

    the introduction of station lags was a significant improvement, reducing the mean absolute error (mae) by 50%. our model effectively captured the ripple effect of delays between stations, enhancing predictive accuracy. mae analysis indicated spatial correlation in errors, with consistent patterns across certain lines and stations. the model performed well on the princeton shuttle but struggled on the atlantic city line, particularly at absecon station, likely due to infrastructure issues and frequent disruptions.

    cross-validation confirmed the model’s generalizability, though challenges remain in accurately predicting severe delays (over 30 minutes), suggesting a need for additional refinements.


    key takeaways:

    • high predictive accuracy: the model effectively forecasts short delays, aiding passengers in making informed travel decisions.
    • spatial error patterns: observed correlations point to specific infrastructure issues, providing actionable insights for targeted improvements.
    • strong generalizability: cross-validation demonstrated consistent model performance across various time periods, supporting its application in real-time scenarios.


    improvement

    while our model achieved notable success, several areas for enhancement remain:
    1. simplifying predictors: currently, the model uses over 20 features. applying methods like decision trees or lasso regression could streamline the feature set, reducing complexity while maintaining accuracy.
    2. addressing potential overfitting: despite the model’s good fit, we are cautious of overfitting due to the extensive feature set. further regularization techniques may help ensure robust performance.
    3. enhancing accuracy for specific lines and stations: the model's performance was weaker for certain lines (e.g., atlantic city line) and stations (e.g., absecon). incorporating additional spatial features or alternative data sources may help mitigate these issues.
    4. handling extreme delays: the model struggled with outliers involving severe delays. exploring advanced techniques like gradient boosting or deep learning could improve predictions for these rare but impactful events.

    overall, our regression model provides a strong foundation for predicting train delays, offering a valuable tool for both passengers and train operators. with further refinements, it holds great potential for improving real-time rail transit management.

    references

    1. Ken Steif (2021), Public Policy Analytics: Code & Context for Data Science in Government
    2. Pranav Badami (2019), NJ Transit and Amtrak (NEC) Rail Performance dataset. Retrieved from https://www.kaggle.com/datasets/pranavbadami/nj-transit-amtrak-nec-performance?select=2018_11.csv
    3. Michael Zhang (2018), What are the chances that NJ Transit will cause you to miss the Dinky? Retrieved from https://medium.com/@mzhang13/what-are-the-chances-that-nj-transit-will-cause-you-to-miss-the-dinky-bfeacd11ebc6
    4. Pranav Badami (2018), The 5 Stages of a System Breakdown on NJ Transit. Retrieved from https://towardsdatascience.com/the-5-stages-of-a-system-breakdown-on-nj-transit-8258127e31e9
    5. Weather data (predictive): identifying factors; Service advisories (prescriptive): what type of infrastructure issues are consistently leading to delays and where. Retrieved from https://medium.com/@pranavbadami/how-data-can-help-fix-nj-transit-c0d15c0660fe
    6. Amtrak data (descriptive, predictive): a threshold number of concurrently running trains. Retrieved from https://medium.com/@pranavbadami/how-data-can-help-fix-nj-transit-c0d15c0660fe

    copyright @liuhaobing, read the markdown script