Frequent Substring-Based Sequence Classification Using Reduced Alphabets

Chitraranjan, Charith Devinda

dc.contributor.author	Chitraranjan, Charith Devinda
dc.description.abstract	In recent years, various disciplines have generated large quantities of sequence data, which has necessitated automated techniques for classifying these sequences into different categories of interest. Especially with the rapid rate at which biological sequence data has been emerging out of high-throughput sequencing efforts, the need to interpret these large quantities of raw sequence data and gain deeper insights into them has become an essential part of modern biological research. Understanding the functions, localization and structure of newly identified protein sequences in particular has become a major challenge and is seeking the aid of computational techniques to keep up with the pace. In this thesis, we evaluate frequent pattern-based algorithms for predicting the aforementioned attributes of proteins from their primary structure (amino acid sequence). We also apply our algorithms to datasets containing wheat Expressed Sequence Tags (ESTs) as an attempt to predict ESTs that are likely to be located near the centromere of their respective chromosomes. We use frequent substrings mined from the training sequences as features to train a classifier. Our evaluation includes SVM and association rule-based classifiers. Some amino acids have similar properties and may substitute one another without altering the topology or function of a protein. Therefore, we use a combination of reduced amino acid alphabets in an attempt to capture patterns that may contain such substitutions. Frequent substrings mined from different alphabets are treated as features resulting from multiple sources and we evaluate both feature fusion and classifier fusion approaches towards multiple source prediction. We compare the performance of the different approaches using protein sub-cellular location, protein function and EST chromosomal location datasets.
Pair-wise sequence-alignment-based Nearest Neighbor and basic SVM k-gram classifiers are also included as baseline algorithms in the comparison. Results show that frequent pattern-based SVM classifiers demonstrate better performance compared to other classifiers on the sub-cellular location datasets and they perform competitively with the nearest neighbor classifier on the protein function datasets. Our results also show that the use of reduced alphabets provides statistically significant performance improvements for the SVM-based classifier fusion algorithm, for half of the classes studied.	en_US
dc.publisher	North Dakota State University	en_US
dc.rights	NDSU policy 190.6.2	en_US
dc.title	Frequent Substring-Based Sequence Classification Using Reduced Alphabets	en_US
dc.type	Thesis	en_US
dc.date.accessioned	2023-12-26T21:55:03Z
dc.date.available	2023-12-26T21:55:03Z
dc.date.issued	2011
dc.identifier.uri	https://hdl.handle.net/10365/33463
dc.subject.lcsh	Bioinformatics.	en_US
dc.subject.lcsh	Computational biology.	en_US
dc.subject.lcsh	Pattern recognition systems.	en_US
dc.rights.uri	https://www.ndsu.edu/fileadmin/policy/190.pdf	en_US
ndsu.degree	Master of Science (MS)	en_US
ndsu.college	Engineering	en_US
ndsu.department	Computer Science	en_US
ndsu.program	Computer Science	en_US
ndsu.advisor	Denton, Anne M.
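
The abstract describes mining frequent substrings from training sequences, repeating this over reduced amino acid alphabets, and fusing the resulting features. The following is a minimal Python sketch of that idea under stated assumptions, not the thesis implementation. The residue grouping in `REDUCED_GROUPS`, the support threshold, the substring length bounds, and all function names are hypothetical choices made only for illustration. The sketch covers feature fusion only and leaves out classifier training.

```python
from collections import Counter

# Hypothetical reduced alphabet: each symbol stands for a group of amino acids.
# This grouping is illustrative only and is NOT taken from the thesis.
REDUCED_GROUPS = {
    "h": "AVLIMC",  # assumed hydrophobic group
    "a": "FWYH",    # assumed aromatic group
    "p": "STNQ",    # assumed polar group
    "+": "KR",      # assumed positively charged group
    "-": "DE",      # assumed negatively charged group
    "s": "GP",      # assumed small/special group
}
TO_REDUCED = {aa: sym for sym, aas in REDUCED_GROUPS.items() for aa in aas}


def reduce_sequence(seq):
    """Rewrite a sequence in the reduced alphabet; unknown residues become 'x'."""
    return "".join(TO_REDUCED.get(aa, "x") for aa in seq)


def substrings(seq, min_len, max_len):
    """Return the set of distinct substrings of seq with length in [min_len, max_len]."""
    return {
        seq[i:i + k]
        for k in range(min_len, max_len + 1)
        for i in range(len(seq) - k + 1)
    }


def mine_frequent_substrings(seqs, min_support, min_len=2, max_len=4):
    """Keep substrings that occur in at least min_support (a fraction) of seqs."""
    counts = Counter()
    for s in seqs:
        # Each sequence contributes at most once per substring.
        counts.update(substrings(s, min_len, max_len))
    threshold = min_support * len(seqs)
    return sorted(p for p, c in counts.items() if c >= threshold)


def featurize(seq, patterns):
    """Binary presence/absence vector of the given patterns in seq."""
    return [1 if p in seq else 0 for p in patterns]


def fused_features(seq, orig_patterns, reduced_patterns):
    """Feature fusion: concatenate original-alphabet and reduced-alphabet features."""
    return featurize(seq, orig_patterns) + featurize(
        reduce_sequence(seq), reduced_patterns
    )


# Toy training sequences (made up for this example).
train = ["MKVLA", "MKILA", "GGDEF"]
orig_patterns = mine_frequent_substrings(train, min_support=0.6)
reduced_patterns = mine_frequent_substrings(
    [reduce_sequence(s) for s in train], min_support=0.6
)
```

In this toy setup, a sequence such as `MRVLA` no longer contains the original-alphabet pattern `MK`, but it still matches every reduced-alphabet pattern because `K` and `R` fall in the same assumed group. This is the kind of substitution the reduced alphabets are meant to capture. The fused vectors could then be passed to any classifier, such as an SVM.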

Files in this item

Name: Chitraranjan, Charith Devinda_ ...
Size: 1.648Mb
Format: PDF
Description: Frequent Substring-Based Sequence ...

This item appears in the following Collection(s)

Computer Science Masters Theses