Comparison of Classification Rates among Logistic Regression, Neural Network and Support Vector Machines in the Presence of Missing Data
Abstract
Statistical models such as Logistic Regression (LR), Neural Network (NN) and Support Vector Machines (SVM) are often fit to datasets with missing values when making inferences about a population. Missing data can severely skew such inferences and reduce the efficiency of the model. Our objective was to identify the most robust model among LR, NN and SVM in the presence of missing data. Observations were simulated using Monte Carlo methods, and missing data were introduced at random at the 10% level. Missing values were imputed using single mode imputation. Simple random samples of 120, 240 and 500 observations were drawn, and the three models were fit under two scenarios. Results showed that SVM performed far better than the LR or NN models. However, the classification accuracy of SVM gradually decreased as sample size increased.
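The missing-data step described above can be illustrated with a minimal sketch. It is not the study's code. The number of predictors, their category levels, the seeds, and the function names `simulate`, `introduce_missing` and `mode_impute` are all assumptions made for illustration. The abstract specifies only the 10% missingness level, single mode imputation, and the sample sizes of 120, 240 and 500.

```python
import random
from collections import Counter

def simulate(n, n_features=4, levels=(0, 1, 2), seed=0):
    """Draw n rows of categorical predictors (hypothetical design, not the study's)."""
    rng = random.Random(seed)
    return [[rng.choice(levels) for _ in range(n_features)] for _ in range(n)]

def introduce_missing(rows, rate=0.10, seed=1):
    """Set a random `rate` fraction of all cells to None (missing at random)."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    cells = [(i, j) for i in range(n) for j in range(p)]
    masked = set(rng.sample(cells, round(rate * len(cells))))
    return [[None if (i, j) in masked else rows[i][j] for j in range(p)]
            for i in range(n)]

def mode_impute(rows):
    """Replace each missing cell with the most frequent observed value in its column.

    Assumes every column has at least one observed value.
    """
    p = len(rows[0])
    modes = []
    for j in range(p):
        observed = [r[j] for r in rows if r[j] is not None]
        modes.append(Counter(observed).most_common(1)[0][0])
    return [[modes[j] if r[j] is None else r[j] for j in range(p)] for r in rows]

if __name__ == "__main__":
    # Sample sizes taken from the abstract; everything else is illustrative.
    for n in (120, 240, 500):
        complete = simulate(n)
        with_gaps = introduce_missing(complete)
        imputed = mode_impute(with_gaps)
        n_missing = sum(v is None for row in with_gaps for v in row)
        print(f"n={n}: {n_missing} cells masked, then mode-imputed")
```

The imputed rows would then be passed to each of the three classifiers. Mode imputation leaves observed cells unchanged and can only increase the share of the most common category in each column, which is one way it may distort a fitted model.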