Naïve Bayes Classifier: A MapReduce Approach
Abstract
Machine learning algorithms can take advantage of the Hadoop distributed computing platform and the MapReduce programming model to process data in parallel. Many machine learning algorithms have been investigated for adaptation to the MapReduce paradigm so that they can operate on data stored in the Hadoop Distributed File System (HDFS). The Naïve Bayes classifier is one of the supervised learning classification algorithms that can be expressed in MapReduce form. In this study, we build a Naïve Bayes MapReduce model and evaluate its prediction accuracy on five datasets. We also conduct a scalability analysis to measure the speedup in data processing time as the number of nodes in the cluster increases. Results show that running the Naïve Bayes MapReduce model across multiple nodes saves a considerable amount of time compared with running the model on a single node, without sacrificing classification accuracy.
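To illustrate why Naïve Bayes fits the MapReduce model, the following is a minimal single-process sketch in Python, not the implementation evaluated in this study. It assumes categorical features and Laplace smoothing; the function names (map_record, reduce_counts, train, predict) and the toy records are hypothetical. Training reduces to counting: the map step emits a count of 1 for each class label and each (class, feature index, feature value) combination, and the reduce step sums the counts per key. On a Hadoop cluster the same two steps would run in parallel across nodes.

```python
import math
from collections import defaultdict
from itertools import chain


def map_record(record):
    """Map step: emit (key, 1) pairs for one labeled record."""
    features, label = record
    yield ("class", label), 1
    for index, value in enumerate(features):
        yield ("feature", label, index, value), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def train(records):
    """Run map then reduce over all records (sequentially in this sketch)."""
    return reduce_counts(chain.from_iterable(map_record(r) for r in records))


def predict(counts, features, alpha=1.0):
    """Pick the class with the highest smoothed log-posterior score."""
    class_counts = {k[1]: v for k, v in counts.items() if k[0] == "class"}
    total = sum(class_counts.values())

    # Distinct values seen per feature index, used for Laplace smoothing.
    seen_values = defaultdict(set)
    for key in counts:
        if key[0] == "feature":
            seen_values[key[2]].add(key[3])

    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        for index, value in enumerate(features):
            c = counts.get(("feature", label, index, value), 0)
            score += math.log((c + alpha) / (n + alpha * len(seen_values[index])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: (features, label)
records = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
]
counts = train(records)
prediction = predict(counts, ("rainy", "cool"))
```

Because the reduce step is a plain sum, partial counts computed on separate nodes can be merged in any order, which is the property that makes the training phase straightforward to distribute.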