N-gram-based Search Procedure

Woznica, Szymon

Files

Woznica, Szymon_Computer Science MS_2009.pdf (1.26 MB)

Date

2009

Authors

Woznica, Szymon

Publisher

North Dakota State University

Abstract

Efficient querying and discovery of meaningful patterns in data are becoming increasingly important as the volume of data published on the Internet every day continues to grow. Tree pruning-based algorithms used in the most popular search programs have trouble with infrequent query strings, which limits the number of returned results that might be of interest to the user. Furthermore, existing tools cannot efficiently find data patterns that would inform the user about the frequency of occurrence and the location of a specific set of words in large, user-defined sets of textual data. In this paper, we present a new search tool based on n-grams and modern software technologies. Our tool can efficiently index the word n-grams found in large sets of user-defined textual data and then assist users in querying the text corpus, helping them find hidden patterns and their locations in the input data. We describe an algorithm for extracting word n-grams with the parameter "n" equal to two, three, and four, and demonstrate how the end user of the search tool can leverage it to mine data in a new way. The presented tool offers a unique feature that allows the user to search a set of n-grams extracted from abstracts of biomedical publications obtained from the U.S. National Library of Medicine (NLM) and to filter the search results by words that exist in the English language. The data tier of the search tool is based on Microsoft SQL Server 2008, supported by a set of Common Language Runtime (CLR) functions and Transact-SQL (T-SQL) stored procedures, whereas the business logic and the user interface use C# .NET 3.5 libraries to support regular expression patterns, database connectivity (LINQ to SQL), and multithreaded system operations.
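The abstract describes extracting word n-grams for n = 2, 3, and 4 and recording how often and where each one occurs. A minimal sketch of that idea follows. It is an illustration only, not the paper's implementation, which is built on SQL Server and C#. The function name `extract_word_ngrams`, the tokenization rule, and the sample sentence are all assumptions introduced here.

```python
import re
from collections import defaultdict


def extract_word_ngrams(text, sizes=(2, 3, 4)):
    """Map each word n-gram to the word offsets at which it starts.

    Hypothetical helper: tokenization (lowercase runs of letters and
    digits) is an assumption, not the rule used in the paper.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    index = defaultdict(list)
    for n in sizes:
        # Slide a window of n words across the token list.
        for start in range(len(words) - n + 1):
            ngram = " ".join(words[start:start + n])
            index[ngram].append(start)
    return dict(index)


# Made-up sample text, used only to exercise the function.
sample = "gene expression in cancer cells and gene expression in yeast"
index = extract_word_ngrams(sample)

# Frequency is the number of recorded positions for an n-gram.
for ngram in ("gene expression", "gene expression in"):
    print(ngram, len(index[ngram]), index[ngram])
```

Storing start offsets rather than bare counts keeps both pieces of information the abstract mentions: the length of each list gives the frequency, and the entries give the locations.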

URI

https://hdl.handle.net/10365/33813

Collections

Computer Science Masters Papers
