Railroad Accident Analysis Using Extreme Gradient Boosting
Abstract
Railroads are critical to the economic health of a nation. Unfortunately, railroads lose hundreds of millions of dollars to accidents each year. Trends reveal that derailments consistently account for more than 70% of the U.S. railroad industry’s average annual accident cost. Hence, knowledge of the explanatory factors that distinguish derailments from other accident types can inform more cost-effective and impactful railroad risk management strategies. Five feature scoring methods, including ANOVA and Gini, agreed that the top four explanatory factors in accident type prediction were track class, type of movement authority, excess speed, and territory signalization. Among 11 different types of machine learning algorithms, the extreme gradient boosting method was the most effective at predicting the accident type, with an area under the receiver operating characteristic curve (AUC) of 89%. Principal component analysis revealed that, relative to other accident types, derailments were more strongly associated with lower track classes, non-signalized territories, and movement authorizations within restricted limits. On average, derailments occurred at 16 kph below the speed limit for the track class, whereas other accident types occurred at 32 kph below the speed limit. Railroads can use the integrated data preparation, machine learning, and feature ranking framework presented here to gain additional insights for managing risk, based on their unique operating environments.
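As an illustration of the AUC metric cited above, the following minimal sketch computes AUC as the probability that a randomly chosen derailment receives a higher predicted score than a randomly chosen accident of another type, with ties counted as one half. The function name `roc_auc` and the labels and scores are hypothetical toy values chosen for illustration. They are not the study's data, model output, or implementation.

```python
def roc_auc(labels, scores):
    """Return AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 1 (positive class) or 0 (negative class)
    scores: predicted scores, higher meaning "more likely positive"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (assumption): 1 = derailment, 0 = other accident type.
labels = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.7, 0.2, 0.6]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

In this toy example, 14 of the 16 derailment/other pairs are ranked correctly, giving an AUC of 0.875. A value of 0.5 would correspond to random ranking and 1.0 to perfect separation of the two accident types.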