Computer Science Masters Papers
Permanent URI for this collection: hdl:10365/32552
Browsing Computer Science Masters Papers by Subject "Apache Hadoop."
Now showing 1 - 5 of 5
Item: Analyzing Access Logs Data using Stream Based Architecture (North Dakota State University, 2018) Gautam, Nitendra
Over the past decades, the enterprise-level IT infrastructure in many businesses has grown from a few servers to thousands, increasing the digital footprint they produce. These digital footprints include access logs that record events such as usage patterns, network activity, and hostile activity affecting the network. Apache Hadoop is one of the most widely adopted frameworks used by Information Technology (IT) companies to analyze these log files in distributed batch mode with the MapReduce programming model. Because access logs carry important information about security and usage patterns, companies are now looking for architectures that can analyze these logs in real time. To overcome the limitations of Hadoop's MapReduce-based architecture, this paper proposes a new, more efficient data processing architecture using Apache Spark, Kafka, and other technologies that can handle both real-time and batch-based data.

Item: Market Basket Analysis Algorithm with MapReduce Using HDFS (North Dakota State University, 2017) Nuthalapati, Aditya
Market basket analysis techniques are substantially important to everyday business decisions. The traditional single-processor, main-memory-based computing approach is not capable of handling ever-increasing volumes of transactional data. The MapReduce approach has become popular for computing huge volumes of data, and existing sequential algorithms can be converted into the MapReduce framework for big data. This paper presents a Market Basket Analysis (MBA) algorithm with MapReduce on Hadoop that generates the complete set of maximal frequent itemsets. The algorithm sorts the data sets and converts them into (key, value) pairs to fit the MapReduce concept.
The framework sorts the outputs of the maps, which are then input to the "reduce" tasks. The experimental results show that the MapReduce implementation improves performance as more nodes are added, until it reaches saturation.

Item: Mining Association Rules in Cloud (North Dakota State University, 2012) Roy, Pallavi
Association rule mining was implemented in Hadoop. Association rule mining finds relations between items or itemsets in the given data. The performance of the algorithm was evaluated by testing it in the cloud (EC2) while increasing the number of nodes in the test setup. The association rules are developed from the frequent itemsets generated from the data, which were produced using the Apriori algorithm. Because the input data and the number of distinct items in the data set are large, considerable space and memory are required; Hadoop was therefore used, as it provides a parallel, scalable, and robust framework in a distributed environment.

Item: Naïve Bayes Classifier: A MapReduce Approach (North Dakota State University, 2014) Zheng, Songtao
Machine learning algorithms can take advantage of the powerful Hadoop distributed computing platform and the MapReduce programming model to process data in parallel. Many machine learning algorithms have been transformed into the MapReduce paradigm in order to make use of the Hadoop Distributed File System (HDFS). The Naïve Bayes classifier is one of the supervised classification algorithms that can be programmed in the form of MapReduce. In our study, we build a Naïve Bayes MapReduce model and evaluate the classifier on five datasets based on prediction accuracy. A scalability analysis is also conducted to measure the speedup in data processing time as the number of nodes in the cluster increases.
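The Naïve Bayes training step lends itself to MapReduce because it reduces to counting. The following is a hypothetical sketch, not the paper's implementation: the map step emits one pair per class label and per (class, feature, value) observation, the reduce step sums the counts, and prediction combines Laplace-smoothed log-likelihoods. The toy records and feature names are invented for illustration.

```python
import math
from collections import defaultdict

# Hypothetical training records (not the paper's datasets).
train = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rain", "windy": "no"}, "play"),
]

# Map: each record emits (class,) and (class, feature, value) keys with count 1.
pairs = []
for features, label in train:
    pairs.append(((label,), 1))
    for f, v in features.items():
        pairs.append(((label, f, v), 1))

# Reduce: sum the counts per key (the framework groups pairs by key
# between the map and reduce phases).
counts = defaultdict(int)
for key, value in pairs:
    counts[key] += value

def predict(features, labels=("play", "stay"), alpha=1.0, vocab=2):
    """Return the label with the highest Laplace-smoothed log score."""
    total = sum(counts[(l,)] for l in labels)
    best, best_score = None, -math.inf
    for l in labels:
        score = math.log(counts[(l,)] / total)  # log prior
        for f, v in features.items():
            score += math.log((counts[(l, f, v)] + alpha)
                              / (counts[(l,)] + alpha * vocab))
        if score > best_score:
            best, best_score = l, score
    return best

print(predict({"outlook": "rain", "windy": "no"}))  # play
```

Because the reduce step is a plain sum, the same logic distributes cleanly across nodes: each mapper counts its HDFS split and the reducers merge the partial counts.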
Results show that running the Naïve Bayes MapReduce model across multiple nodes saves a considerable amount of time compared with running the model on a single node, without sacrificing classification accuracy.

Item: Parallelization of Particle Swarm Optimization Algorithm Using Hadoop Mapreduce (North Dakota State University, 2016) Ghosh, Priyanka Singh
Particle Swarm Optimization (PSO) has received attention in many research fields and real-world applications for solving optimization problems in the areas of intelligent transportation systems, wireless sensor networks, finance, and engineering. A key factor affecting the performance of PSO is its exploration of a multi-dimensional search space, which can increase execution time quite significantly. Parallel implementation of PSO is one way to address this. In this paper, we implement and compare two parallelizations of PSO using MapReduce programming: 1) all nodes in the cluster work on the same population, and 2) each node in the cluster has its own population. Both parallel implementations are compared on performance and speedup. Parallelizing the PSO algorithm makes it faster and more scalable for finding the best solutions when working with large datasets in high-dimensional search spaces.
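The second parallelization variant, in which each node evolves its own population, can be sketched as follows. This is not the paper's implementation: it is a minimal standard PSO on a toy sphere objective, with the "nodes" simulated by independent seeded swarms (the map step) whose best results are merged (the reduce step). The inertia and acceleration coefficients are common textbook defaults, not values from the paper.

```python
import random

def sphere(x):
    """Toy objective to minimize: sum of squares, optimum at the origin."""
    return sum(v * v for v in x)

def run_swarm(dim=3, particles=10, iters=50, seed=0):
    """Run one independent PSO swarm; returns its global best position."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    pbest = [p[:] for p in pos]                 # each particle's best so far
    gbest = min(pbest, key=sphere)[:]           # swarm-wide best so far
    for _ in range(iters):
        for i in range(particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Standard velocity update: inertia + cognitive + social terms.
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                             + 1.5 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if sphere(pos[i]) < sphere(pbest[i]):
                pbest[i] = pos[i][:]
                if sphere(pos[i]) < sphere(gbest):
                    gbest = pos[i][:]
    return gbest

# "Map": each simulated node runs its own swarm with a different seed;
# "reduce": keep the best solution found across all swarms.
results = [run_swarm(seed=s) for s in range(4)]
best = min(results, key=sphere)
```

In the single-population variant, by contrast, the nodes would have to exchange the global best each iteration, which trades communication overhead for tighter coordination of the search.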