N-gram-based Search Procedure

Woznica, Szymon

dc.contributor.author	Woznica, Szymon
dc.description.abstract	Efficient querying and discovery of meaningful patterns in data is becoming increasingly important as the volume of data published on the Internet every day continues to grow. The tree-pruning-based algorithms used in most popular search programs have trouble with infrequent query strings, which limits the number of returned results that might be of interest to the user. Furthermore, existing tools cannot efficiently find data patterns that inform the user about the frequency and location of a specific set of words in large, user-defined sets of textual data. In this paper, we present a new search tool based on n-grams and modern software technologies. The tool efficiently indexes the word n-grams in large, user-defined sets of textual data and then assists users in querying the text corpus, helping them find hidden patterns and their locations in the input data. We describe an algorithm for extracting word n-grams with the parameter "n" equal to two, three, and four, and demonstrate how the end user of the search tool can leverage it to mine data in a new way. The tool offers a unique feature that lets the user search a set of n-grams extracted from abstracts of biomedical publications obtained from the U.S. National Library of Medicine (NLM) and filter the search results by words that exist in the English language. The data tier of the search tool is based on Microsoft SQL Server 2008, supported by a set of Common Language Runtime (CLR) functions and Transact-SQL (T-SQL) stored procedures, whereas the business logic and the user interface use C# .NET 3.5 libraries to support regular expression patterns, database connection (LINQ to SQL), and multithreaded system operations.	en_US
dc.publisher	North Dakota State University	en_US
dc.rights	NDSU policy 190.6.2	en_US
dc.title	N-gram-based Search Procedure	en_US
dc.type	Master's Paper	en_US
dc.date.accessioned	2024-05-07T21:38:09Z
dc.date.available	2024-05-07T21:38:09Z
dc.date.issued	2009
dc.identifier.uri	https://hdl.handle.net/10365/33813
dc.subject.lcsh	Information retrieval.	en_US
dc.subject.lcsh	Database searching.	en_US
dc.subject.lcsh	Internet searching.	en_US
dc.rights.uri	https://www.ndsu.edu/fileadmin/policy/190.pdf	en_US
ndsu.degree	Master of Science (MS)	en_US
ndsu.college	Science and Mathematics	en_US
ndsu.department	Computer Science	en_US
ndsu.program	Computer Science	en_US
ndsu.advisor	Denton, Anne
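
The abstract above mentions an algorithm for extracting word n-grams with "n" equal to two, three, and four, and an index that reports how often and where a set of words occurs. This record does not reproduce the paper's implementation, which is described as using SQL Server CLR functions, T-SQL stored procedures, and C#. The following is only a minimal Python sketch of the general technique. The function names (tokenize, extract_ngrams, build_index), the tokenization rule, the lowercase normalization, and the in-memory dictionary index are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed tokenization rule: runs of letters/digits, optionally joined by
# an internal hyphen or apostrophe. The paper's actual rule is not given here.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")


def tokenize(text):
    """Split text into lowercase word tokens (hypothetical helper)."""
    return [word.lower() for word in WORD_RE.findall(text)]


def extract_ngrams(text, sizes=(2, 3, 4)):
    """Yield (n, start_position, ngram) for every word n-gram in text."""
    tokens = tokenize(text)
    for n in sizes:
        # A sliding window of n consecutive tokens gives one n-gram per start.
        for start in range(len(tokens) - n + 1):
            yield n, start, " ".join(tokens[start:start + n])


def build_index(documents, sizes=(2, 3, 4)):
    """Map each n-gram to the list of (doc_id, start_position) occurrences."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for _n, start, gram in extract_ngrams(text, sizes):
            index[gram].append((doc_id, start))
    return index


if __name__ == "__main__":
    # Made-up sample texts, not data from the paper.
    docs = {
        "a": "Gene expression in breast cancer cells",
        "b": "Breast cancer gene expression",
    }
    idx = build_index(docs)
    hits = idx["breast cancer"]
    print(len(hits), hits)  # frequency and locations of the bigram
```

Looking up an n-gram in the resulting dictionary returns both its frequency (the length of the list) and its locations (document identifier and token offset), which is the kind of query the abstract describes. A persistent store such as the SQL Server data tier named in the abstract would replace the dictionary in a real system.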

Files in this item

Name: Woznica, Szymon_Computer Science ...
Size: 1.259Mb
Format: PDF
Description: N-gram-based Search Procedure

This item appears in the following Collection(s)

Computer Science Masters Papers
