Extracting Useful Information and Building Predictive Models from Medical and Health-Care Data Using Machine Learning Techniques
Abstract
In healthcare, a large amount of medical data has become available. To use these data effectively to improve healthcare outcomes, clinicians need to identify the relevant measures and apply analysis methods suited to the type of data at hand. In this dissertation, we present various machine learning (ML) and data mining (DM) methods that can be applied to the types of data sets available in the healthcare domain.
The first part of the dissertation investigates DM methods for extracting significant information, in the form of rules, from healthcare and medical data. Class association rule mining, a variant of association rule mining, was used to obtain rules whose consequents are restricted to targeted items or class labels. These rules can be used to improve public awareness of different cancer symptoms and could also help initiate prevention strategies.
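As an illustration of class association rule mining, the following is a minimal sketch in plain Python. It assumes a brute-force search over small antecedents with support and confidence thresholds; the function name, the thresholds, and the toy records (items `symptom_a` to `symptom_c`, labels `case` and `control`) are hypothetical and are not taken from the dissertation.

```python
from itertools import combinations

def class_association_rules(transactions, labels, min_support=0.4,
                            min_confidence=0.8, max_len=2):
    """Return rules (antecedent, class_label, support, confidence).

    Unlike general association rules, the consequent is always a class label.
    Brute-force enumeration: suitable only for small illustrative inputs.
    """
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    rules = []
    for size in range(1, max_len + 1):
        for antecedent in combinations(items, size):
            # Class labels of the records that contain every antecedent item.
            covered = [lab for t, lab in zip(transactions, labels)
                       if set(antecedent) <= t]
            if not covered:
                continue
            for label in sorted(set(covered)):
                both = covered.count(label)
                support = both / n                # P(antecedent and label)
                confidence = both / len(covered)  # P(label | antecedent)
                if support >= min_support and confidence >= min_confidence:
                    rules.append((antecedent, label, support, confidence))
    return rules

# Toy records invented for illustration only (not real patient data).
transactions = [
    {"symptom_a", "symptom_b"},
    {"symptom_a", "symptom_b"},
    {"symptom_a"},
    {"symptom_b", "symptom_c"},
    {"symptom_c"},
]
labels = ["case", "case", "control", "case", "control"]

rules = class_association_rules(transactions, labels)
for antecedent, label, support, confidence in rules:
    print(antecedent, "=>", label, round(support, 2), round(confidence, 2))
```

On real data sets, an Apriori-style or FP-growth-style search with pruning would normally replace the exhaustive enumeration used here.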
In the second part of the dissertation, ML techniques are applied to healthcare and medical data to build predictive models. Three different classification techniques were investigated on a real-world breast cancer risk factor data set. Because the data set is imbalanced, various resampling methods were applied before the classifiers. Performance improved significantly when a resampling technique was applied, compared with applying no resampling.
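The abstract does not name the resampling methods that were used. As one simple possibility, the sketch below shows random oversampling of the minority class in plain Python; the function name `random_oversample` and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes match
    the size of the largest class. Returns new lists; inputs are not modified.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, t in zip(X, y) if t == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))  # sample with replacement
            y_out.append(label)
    return X_out, y_out

# Toy imbalanced data set (hypothetical): 8 negatives, 2 positives.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Resampling would normally be applied to the training split only, so that the evaluation data keep the original class distribution.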
Moreover, the super learning technique, which combines multiple base learners, was investigated to boost the performance of classification models. Two forms of super learner (SL) were investigated: the first uses two base learners, and the second uses three. The models were then evaluated on well-known benchmark data sets from the healthcare domain, and the results showed that the SL models perform better than the individual classifiers and the baseline ensemble.
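The general idea of a super learner with two base learners can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions: the base learners (a nearest-centroid scorer and a k-nearest-neighbour scorer), the meta step (a convex weight chosen by grid search on out-of-fold squared error), and the toy data are all hypothetical choices, since the abstract does not specify the learners used in the dissertation.

```python
def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def fit_centroid(X, y):
    """Base learner A: score from relative distance to the class centroids.
    Assumes both classes (0 and 1) are present in the training rows."""
    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    c0, c1 = centroid(0), centroid(1)
    def predict(x):
        d0, d1 = sq_dist(x, c0) ** 0.5, sq_dist(x, c1) ** 0.5
        return d0 / (d0 + d1) if d0 + d1 > 0 else 0.5
    return predict

def fit_knn(X, y, k=3):
    """Base learner B: fraction of positives among the k nearest rows."""
    def predict(x):
        nearest = sorted(range(len(X)), key=lambda i: sq_dist(x, X[i]))[:k]
        return sum(y[i] for i in nearest) / len(nearest)
    return predict

def fit_super_learner(X, y, fit_a, fit_b, folds=3, steps=10):
    """Return (weight, predict) for a two-learner convex combination."""
    n = len(X)
    oof = [(0.0, 0.0)] * n  # out-of-fold predictions of both base learners
    for f in range(folds):
        train = [i for i in range(n) if i % folds != f]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        a, b = fit_a(Xtr, ytr), fit_b(Xtr, ytr)
        for i in range(f, n, folds):  # rows held out in this fold
            oof[i] = (a(X[i]), b(X[i]))
    def cv_loss(w):
        return sum((w * pa + (1 - w) * pb - t) ** 2
                   for (pa, pb), t in zip(oof, y))
    # Meta step: pick the weight that minimises cross-validated squared error.
    w = min((s / steps for s in range(steps + 1)), key=cv_loss)
    a, b = fit_a(X, y), fit_b(X, y)  # refit base learners on all data
    return w, (lambda x: w * a(x) + (1 - w) * b(x))

# Toy two-class data (hypothetical), classes interleaved so every fold
# contains both labels.
X = [[0, 0], [5, 5], [0, 1], [5, 4], [1, 0], [4, 5],
     [1, 1], [4, 4], [0.5, 0.5], [4.5, 4.5], [1, 0.5], [4, 4.5]]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

weight, predict = fit_super_learner(X, y, fit_centroid, fit_knn)
print(weight, predict([0.2, 0.2]), predict([4.8, 4.8]))
```

A three-learner variant would replace the single weight with a weight vector constrained to sum to one, fitted on the same out-of-fold predictions.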
Finally, we assessed the cancer-relevant genes of prostate cancer that show the most significant correlations with the clinical outcomes of sample type and overall survival. Rules were discovered from the RNA-sequencing data of prostate cancer patients. Moreover, we built a regression model and generated from it rules for predicting the survival time of patients.