Application of Data Mining Techniques in Transportation Safety Study
Abstract
Most current transportation safety studies are based on Generalized Linear Models (GLMs), which require several assumptions. These assumptions restrict the kinds of data GLMs can handle and jeopardize model performance on data with complex and nonlinear patterns, many missing values, and a large number of input variables of various data types. Data mining models are known for their strong ability to extract valuable information and detect complex patterns in large, noisy data. However, they are not popular in transportation safety research, because they are criticized for being unable to provide interpretable and practical outputs. In this study, data mining models are tested in transportation safety research to demonstrate their feasibility as alternative models for safety studies. Analyses of variable importance, marginal effects of contributing variables, and model prediction accuracy are further conducted to identify complex and nonlinear patterns in the study dataset, and to respond to the criticism that data mining models do not provide practical outputs. Because of their high fatality rates, two types of crashes are selected as research areas: 1) crash prediction at Highway Rail Grade Crossings (HRGCs); and 2) injury severity in commercial truck-involved crashes. In the HRGC crash likelihood study, three data mining models, Decision Tree (DT), Gradient Boosting (GB), and Neural Network (NN), are tested and shown to perform soundly. In the commercial truck-involved crash injury severity study, the GB model identifies 11 of the 25 studied variables as responsible for more than 80% of injury severity level forecasting, and reveals their nonlinear impact on the severity level.
Several factors, such as trucking company attributes (e.g., company size), safety inspection values, trucking company commerce status (e.g., interstate or intrastate), and registration condition, are found to be significantly associated with crash injury severity. Although most of the identified contributing factors are significant for all four levels of crash severity, their relative importance and marginal effects differ across levels. The findings of this study can help transportation agencies reduce injury severity and develop efficient strategies to improve safety.
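The statement that 11 of 25 variables account for more than 80% of the forecasting can be read as a cumulative share of relative variable importance. The sketch below illustrates that arithmetic only; it is not the study's implementation. The function name `top_contributors`, the variable names, and the scores are all hypothetical and chosen for illustration, and the scores are not results from the study.

```python
def top_contributors(importance, threshold=0.80):
    """Return the smallest importance-ranked set of variables whose
    normalized importance shares sum to at least `threshold`."""
    total = sum(importance.values())
    if total <= 0:
        raise ValueError("importance scores must sum to a positive value")

    # Rank variables from most to least important.
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)

    selected, cumulative = [], 0.0
    for name, score in ranked:
        selected.append(name)
        cumulative += score / total  # normalized share of this variable
        if cumulative >= threshold:
            break
    return selected, cumulative


# Hypothetical importance scores (NOT values from the study).
scores = {
    "company_size": 40,
    "inspection_value": 25,
    "commerce_status": 20,
    "registration": 10,
    "other": 5,
}

selected, share = top_contributors(scores)
print(selected, round(share, 2))
```

With real model output, the `scores` dictionary would be replaced by the relative importance values reported by the fitted model, and the length of `selected` would give the number of variables needed to reach the chosen share.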