Mining Semantic Relationships Between Concepts Across Documents Using Wikipedia Knowledge
Abstract
The rapid and ongoing growth of text data has created an enormous need for fast and efficient text mining algorithms. However, the sparsity and high dimensionality of text data make it difficult to represent the semantics of natural language text. Traditional approaches to document representation are mostly based on the Vector Space Model (VSM), which treats a document as an unordered collection of words and records only document-level statistical information (e.g., document frequency, inverse document frequency). Because VSM computes relatedness between words solely from statistics collected from the documents themselves, it does not capture the semantics of the text, and this inherent limitation becomes apparent in certain tasks, especially fine-grained information discovery applications such as mining relationships between concepts. In this dissertation, we present a new framework that attempts to address these problems by using background knowledge to provide a better semantic representation of any text. This is accomplished by leveraging Wikipedia, currently the world’s largest human-built encyclopedia. This integration also complements the information already contained in the text corpus and facilitates the construction of a more comprehensive representation and retrieval framework.
Specifically, we present 1) Semantic Path Chaining (SPC), a new text mining model that automatically discovers semantic relationships between concepts across multiple documents (a task for which traditional search paradigms such as search engines offer little help) and effectively integrates various evidence sources from Wikipedia; 2) kernel methods that provide a more appropriate estimate of semantic relatedness between concepts and make better use of Wikipedia background knowledge in our defined query contexts; and 3) Concept Association Graph (CAG), a graph-based mining prototype system interfaced directly to Wikipedia that enables fast and customizable concept relationship search using Wikipedia resources. The effectiveness of the proposed techniques has been evaluated on different data sets. The experimental results demonstrate that search performance is significantly improved in terms of accuracy and coverage compared with several baseline models. In particular, existing state-of-the-art related work, namely Srinivasan’s closed text mining algorithm, Explicit Semantic Analysis (ESA) [19], and the RelFinder system [26, 27, 41], is used for comparison.