Frequent Substring-Based Sequence Classification Using Reduced Alphabets
Abstract
In recent years, various disciplines have generated large quantities of sequence
data, necessitating automated techniques for classifying these sequences into
categories of interest. In particular, with the rapid rate at which biological
sequence data has emerged from high-throughput sequencing efforts, the need
to interpret these large quantities of raw sequence data and gain deeper insights into
them has become an essential part of modern biological research. Understanding the
functions, localization, and structure of newly identified protein sequences has
become a major challenge, one that increasingly relies on computational techniques
to keep pace. In this thesis, we evaluate frequent pattern-based algorithms
for predicting these attributes of proteins from their primary structure
(amino acid sequence). We also apply our algorithms to datasets containing wheat
Expressed Sequence Tags (ESTs) in an attempt to predict ESTs that are likely to
be located near the centromere of their respective chromosomes. We use frequent
substrings mined from the training sequences as features to train a classifier. Our
evaluation includes SVM and association rule-based classifiers. Some amino acids
have similar properties and may substitute for one another without altering the topology or function of a protein. Therefore, we use a combination of reduced amino acid
alphabets in an attempt to capture patterns that may contain such substitutions.
Frequent substrings mined from different alphabets are treated as features arising
from multiple sources, and we evaluate both feature fusion and classifier fusion
approaches to multiple-source prediction. We compare the performance of the different
approaches using protein sub-cellular location, protein function and EST chromosomal
location datasets. Pairwise sequence-alignment-based nearest neighbor and basic
SVM k-gram classifiers are also included as baseline algorithms in the comparison.
Results show that the frequent pattern-based SVM classifiers outperform the other
classifiers on the sub-cellular location datasets and perform competitively with
the nearest neighbor classifier on the protein function datasets.
Our results also show that the use of reduced alphabets provides statistically significant
performance improvements for the SVM-based classifier fusion algorithm, for half of
the classes studied.
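Although the abstract only summarizes the approach, a minimal sketch may help illustrate the core idea of mapping sequences onto a reduced amino acid alphabet, mining frequent substrings as features, and fusing features from the two alphabets. The three-group alphabet, k-gram length, and support threshold below are illustrative assumptions, not the alphabets or parameters actually evaluated in the thesis.

```python
from collections import Counter

# Hypothetical reduced alphabet: amino acids grouped by broad physicochemical
# class (illustrative only; the thesis may use different groupings).
REDUCED_GROUPS = {
    "h": "AVLIMFWYC",   # hydrophobic
    "p": "STNQGP",      # polar / small
    "c": "DEKRH",       # charged
}
AA_TO_GROUP = {aa: g for g, aas in REDUCED_GROUPS.items() for aa in aas}

def reduce_sequence(seq):
    """Map an amino acid sequence onto the reduced alphabet."""
    return "".join(AA_TO_GROUP.get(aa, "x") for aa in seq)

def frequent_substrings(sequences, k, min_support):
    """Return length-k substrings occurring in at least min_support sequences."""
    doc_freq = Counter()
    for seq in sequences:
        doc_freq.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return [s for s, df in doc_freq.items() if df >= min_support]

def binary_features(seq, patterns):
    """Binary feature vector: 1 if the pattern occurs in the sequence."""
    return [1 if p in seq else 0 for p in patterns]

# Toy usage: mine patterns in both the original and the reduced alphabet,
# then concatenate the two feature vectors (a simple feature-fusion scheme).
train = ["MKVLAADST", "MKILGADSS", "TTRKHDEAV"]
reduced_train = [reduce_sequence(s) for s in train]

patterns_raw = frequent_substrings(train, k=3, min_support=2)
patterns_red = frequent_substrings(reduced_train, k=3, min_support=2)

query = "MKVLGADST"
fused = (binary_features(query, patterns_raw)
         + binary_features(reduce_sequence(query), patterns_red))
print(fused)  # feature vector that could be passed to an SVM
```

In a classifier-fusion setting, one would instead train a separate classifier on each alphabet's feature vector and combine their predictions, for example by majority vote or by weighting classifier outputs.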