dc.contributor.authorChitraranjan, Charith Devinda
dc.description.abstractIn recent years, various disciplines have generated large quantities of sequence data which has necessitated automated techniques for classifying these sequences into different categories of interest. Especially with the rapid rate at which biological sequence data has been emerging out of high throughput sequencing efforts, the need to interpret these large quantities of raw sequence data and gain deeper insights into them has become an essential part of modern biological research. Understanding the functions, localization and structure of newly identified protein sequences in particular has become a major challenge and is seeking the aid of computational techniques to keep up with the pace. In this thesis, we evaluate frequent pattern-based algorithms for predicting aforementioned attributes of proteins from their primary structure (amino acid sequence). We also apply our algorithms to datasets containing wheat Expressed Sequence Tags (ESTs) as an attempt to predict ESTs that are likely to be located near the centromere of their respective chromosomes. We use frequent substrings mined from the training sequences as features to train a classifier. Our evaluation includes SVM and association rule-based classifiers. Some amino acids have similar properties and may substitute one another without altering the topology or function of a protein. Therefore, we use a combination of reduced amino acid alphabets in an attempt to capture patterns that may contain such substitutions. Frequent substrings mined from different alphabets are treated as features resulting from multiple sources and we evaluate both feature fusion and classifier fusion approaches towards multiple source prediction. We compare the performance of the different approaches using protein sub-cellular location, protein function and EST chromosomal location datasets.
Pair-wise sequence-alignment-based Nearest Neighbor and basic SVM k-gram classifiers are also included as baseline algorithms in the comparison. Results show that frequent pattern-based SVM classifiers demonstrate better performance compared to other classifiers on the sub-cellular location datasets and they perform competitively with the nearest neighbor classifier on the protein function datasets. Our results also show that the use of reduced alphabets provides statistically significant performance improvements for the SVM-based classifier fusion algorithm, for half of the classes studied.en_US
dc.publisherNorth Dakota State Universityen_US
dc.rightsNDSU policy 190.6.2en_US
dc.titleFrequent Substring-Based Sequence Classification Using Reduced Alphabetsen_US
dc.typeThesisen_US
dc.date.accessioned2023-12-26T21:55:03Z
dc.date.available2023-12-26T21:55:03Z
dc.date.issued2011
dc.identifier.urihttps://hdl.handle.net/10365/33463
dc.subject.lcshBioinformatics.en_US
dc.subject.lcshComputational biology.en_US
dc.subject.lcshPattern recognition systems.en_US
dc.rights.urihttps://www.ndsu.edu/fileadmin/policy/190.pdfen_US
ndsu.degreeMaster of Science (MS)en_US
ndsu.collegeEngineeringen_US
ndsu.departmentComputer Scienceen_US
ndsu.programComputer Scienceen_US
ndsu.advisorDenton, Anne M.

