Analyzing Access Logs Data using Stream Based Architecture
Abstract
Over the past decades, enterprise-level IT infrastructure in many businesses has grown from a few servers to thousands, increasing the digital footprint it produces. This footprint includes access logs, which record events such as usage patterns, network activity, and hostile activity affecting the network. Apache Hadoop has been one of the most widely adopted frameworks, and many Information Technology (IT) companies use it to analyze these log files in distributed batch mode with the MapReduce programming model. Because access logs carry important information about security and usage patterns, companies are now looking for an architecture that can analyze them in real time. To overcome the limitations of Hadoop's MapReduce-based architecture, this paper proposes a new and more efficient data processing architecture, built on Apache Spark, Kafka, and other technologies, that can handle both real-time and batch data.