DOM Structure Based Web Pattern Mining

Date

2011

Publisher

North Dakota State University

Abstract

The rapid expansion of the Web has motivated several studies that aim to understand and recognize the implementation structure underlying Web interfaces. Although Web pages may look different, they can share the same semantic structure for organizing information. These common semantic structures are referred to as Web patterns. There are no strict rules for implementing the HTML structure of a Web page, the implementation may be inconsistent across the pages of a single website, and it varies from one website to another. This makes it difficult to recognize the Web patterns used to implement a website. This paper proposes Document Object Model (DOM) structure based Web pattern mining, in which both the HTML structure and the common patterns are represented in DOM structure format. To derive the common Web patterns, the patterns observed across different websites are analyzed and summarized manually. These Web patterns are represented in the Pattern Structure Definition (PSD) format, which is derived from the DTD model. An efficient algorithm is then proposed to recognize Web patterns that match the definition and comply with all the properties defined in the PSD. To recognize pattern structures, a tool was developed that takes a URL as input and recognizes the summarized patterns. Experimental results and evaluation of the tool show the high accuracy of the approach: it achieved 91.35% accuracy in finding the navigation pattern structure in online shopping websites.
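The abstract does not give the PSD syntax or the recognition algorithm, so the following is only a minimal sketch of the general idea of matching a declared structural pattern against a DOM tree. The names `Node`, `Pattern`, `parse`, `matches`, `find_matches`, and `NAV_PATTERN` are hypothetical, and the "tag plus minimum count of matching children" rule is an assumed stand-in for a DTD-like definition, not the PSD format itself. It uses only Python's standard `html.parser`.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional

# Elements that never have a closing tag; they are not pushed on the stack.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


@dataclass
class Node:
    """A simplified DOM element: tag name and child elements only."""
    tag: str
    children: List["Node"] = field(default_factory=list)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML, tolerating unclosed tags."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name, if any.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


@dataclass
class Pattern:
    """Hypothetical DTD-like rule: an element with `tag` that has at
    least `min_count` direct children matching `child`."""
    tag: str
    child: Optional["Pattern"] = None
    min_count: int = 1


def parse(html: str) -> Node:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def matches(node: Node, pattern: Pattern) -> bool:
    if node.tag != pattern.tag:
        return False
    if pattern.child is None:
        return True
    hits = sum(1 for c in node.children if matches(c, pattern.child))
    return hits >= pattern.min_count


def find_matches(root: Node, pattern: Pattern) -> List[Node]:
    """Depth-first search for every subtree that satisfies the pattern."""
    found = [root] if matches(root, pattern) else []
    for child in root.children:
        found.extend(find_matches(child, pattern))
    return found


# Assumed example of a navigation pattern: a <ul> with at least two <li>
# children, each of which directly contains at least one <a> link.
NAV_PATTERN = Pattern("ul", child=Pattern("li", child=Pattern("a")), min_count=2)
```

A tool like the one described would first fetch the page for a given URL and then call something like `find_matches(parse(page_html), NAV_PATTERN)`. Real pages would also need the pattern to account for wrapper elements, attributes, and optional children, which this sketch leaves out.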
