N-gram-based Search Procedure
Abstract
Efficient querying and discovery of meaningful patterns in data are becoming
increasingly important as the volume of data published on the Internet grows
every day. Tree pruning-based algorithms used in most popular search programs have
trouble with infrequent query strings, which limits the number of returned
results that might be of interest to the user. Furthermore, existing tools cannot
efficiently find data patterns that tell the user how often, and where, a specific
set of words occurs in large, user-defined sets of textual data.
In this paper, we present a new search tool based on n-grams and
modern software technologies. The tool efficiently indexes the word n-grams found
in large sets of user-defined textual data and then assists users in querying
the text corpus, helping them find hidden patterns and their locations in the input
data. We describe an algorithm for extracting word n-grams with the
parameter "n" equal to two, three, and four, and demonstrate how the end user
of the search tool can leverage it to mine data in a new way. The presented tool
offers a unique feature that allows the user to search a set of n-grams extracted from
abstracts of biomedical publications obtained from the U.S. National Library of
Medicine (NLM) and to filter the search results by words that exist in the English
language. The data tier of the search tool is based on Microsoft SQL Server 2008,
supported by a set of Common Language Runtime (CLR) functions and Transact-SQL
(T-SQL) stored procedures, whereas the business
logic and the user interface use C# .NET 3.5 libraries to support regular
expression patterns, database connections (LINQ to SQL), and multithreaded system
operations.
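
The idea of indexing word n-grams for n equal to two, three, and four, and then querying them by pattern, can be illustrated with a minimal sketch. It is written in Python rather than the tool's C# and T-SQL stack, and it is not the paper's implementation. The tokenizer, the in-memory dictionary index, and the names `tokenize`, `word_ngrams`, `build_index`, and `search` are all assumptions made for illustration. The `vocabulary` argument is one possible reading of "filtering by words that exist in the English language".

```python
import re
from collections import defaultdict

# Assumed tokenizer: runs of letters and digits, lowercased.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def word_ngrams(tokens, n):
    """Yield (position, n-gram) for every contiguous run of n tokens."""
    for i in range(len(tokens) - n + 1):
        yield i, tuple(tokens[i:i + n])


def build_index(documents, sizes=(2, 3, 4)):
    """Map each n-gram to the list of (doc_id, token_position) where it occurs.

    `documents` is a dict of doc_id -> text (a stand-in for the database tier).
    """
    index = defaultdict(list)
    for doc_id, text in documents.items():
        tokens = tokenize(text)
        for n in sizes:
            for pos, gram in word_ngrams(tokens, n):
                index[gram].append((doc_id, pos))
    return index


def search(index, pattern, vocabulary=None):
    """Return {n-gram: occurrences} for n-grams fully matching a regex.

    The regex is matched against the n-gram's words joined by single spaces.
    If `vocabulary` (a set of words) is given, n-grams containing any word
    outside it are dropped.
    """
    regex = re.compile(pattern)
    results = {}
    for gram, hits in index.items():
        if vocabulary is not None and not all(w in vocabulary for w in gram):
            continue
        if regex.fullmatch(" ".join(gram)):
            results[gram] = hits
    return results


if __name__ == "__main__":
    # Made-up sample texts, not data from the NLM corpus.
    docs = {
        "a1": "Gene expression in tumor cells",
        "a2": "Tumor cells and gene expression",
    }
    idx = build_index(docs)
    for gram, hits in search(idx, r"tumor \w+").items():
        print(" ".join(gram), len(hits), hits)
```

In this sketch the frequency of an n-gram is the length of its occurrence list, and each occurrence records the document and the token offset. A database-backed design such as the one described above would store the same information in tables instead of a dictionary.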