Comparing Prediction Accuracies of Cancer Survival Using Machine Learning Techniques and Statistical Methods in Combination with Data Reduction Methods
Abstract
This comparative study of five-year survival prediction for breast, lung, colon, and leukemia cancers using a large SEER dataset along with 10-fold cross-validation provided us with an insight into the relative prediction ability of different machine learning and data reduction methods. Lasso regression and the Boruta algorithm were used for variables selection, and Principal Component Analysis (PCA) was used for dimensionality reduction. We used one statistical method Logistic regression (LR) and several machine learning methods including Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), K Nearest Neighbor (KNN), Artificial Neural Network (ANN), and Naïve Bayes Classifier (NB). For breast cancer, we found LDA, RF, and LR were the best models for five-year survival prediction based on the accuracy, sensitivity, specificity, and area under the curve (AUC) using data reduction method from Z score normalization and the Boruta algorithm. The results for lung cancer indicated the SVM linear, RF, and ANN were the best survival prediction models using data reduction methods from the Z score and max min normalization. The results for colon cancer indicated, ANN, and RF were the best prediction models using the Boruta algorithm and Z score method. The results for leukemia showed ANN, and the RF were the best survival prediction models using the Boruta algorithm and data reduction technique from the Z score. Overall, ANN, RF, and LR were the best prediction models for all cancers using variables selection by the Boruta algorithm.