DOM Structure Based Web Pattern Mining
Abstract
The rapid expansion of the Web has motivated several studies that seek to understand and
recognize the implementation structure underlying Web interfaces. Although Web pages may
look different in presentation, they may share the same semantic structure for organizing
information. These common semantic structures are referred to as Web patterns. There are
no strict rules for implementing the HTML structure of a Web page, the implementation may
not be consistent across an entire website, and the HTML implementation of one website
differs from that of others. This makes it difficult to recognize the Web patterns that
were used to implement a website.
In this paper, Web pattern mining based on the Document Object Model ("DOM" hereafter)
structure is proposed, in which both the HTML structure and the common patterns are
represented in DOM structure format. To derive the common Web patterns, the patterns
implemented across different websites are analyzed and summarized manually. These Web
patterns are represented in the Pattern Structure Definition (PSD) format, which is
derived from the DTD model. Then, an efficient
algorithm is proposed to recognize Web patterns that match the definition and comply
with all the properties defined in the PSD. To recognize the pattern structure, a tool
was developed that takes a URL as input and recognizes the summarized patterns. The
experimental results and evaluation of the tool show the high accuracy of the approach:
the implementation achieved 91.35% accuracy in finding the navigation pattern structure
in online shopping websites.
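
As an illustration only, the recognition step summarized above can be sketched in a few lines of Python. The abstract gives neither the PSD syntax nor the algorithm itself, so everything named here is a hypothetical stand-in: the dictionary-based `NAV_PATTERN` (a DTD-like rule requiring a `ul` with at least two `li` children that each contain a link), the `Node` class, and the functions `parse`, `matches`, and `find_patterns`. The sketch also assumes well-nested markup and works on an HTML string rather than fetching a URL.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Minimal DOM element: tag name, parent, and child elements only."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class DomBuilder(HTMLParser):
    """Builds a simplified element tree (assumes well-nested markup)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Only step back up when the closing tag matches the open element.
        if self.current.tag == tag and self.current.parent is not None:
            self.current = self.current.parent


def parse(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# Hypothetical pattern definition, loosely modelled on a DTD content rule:
# a <ul> with at least two <li> children, each holding at least one <a>.
NAV_PATTERN = {
    "tag": "ul",
    "children": [
        {"tag": "li", "min": 2, "children": [{"tag": "a", "min": 1}]},
    ],
}


def matches(node, spec):
    """True if the node has the required tag and enough matching children."""
    if node.tag != spec["tag"]:
        return False
    for child_spec in spec.get("children", []):
        hits = sum(1 for child in node.children if matches(child, child_spec))
        if hits < child_spec.get("min", 1):
            return False
    return True


def find_patterns(root, spec):
    """Walk the whole tree and collect every node that satisfies the spec."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if matches(node, spec):
            found.append(node)
        stack.extend(node.children)
    return found
```

Calling `find_patterns(parse(html), NAV_PATTERN)` returns the subtree roots that satisfy the rule. A real PSD would presumably carry further properties (attributes, ordering, optional elements) that this sketch leaves out.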