Integrative Data Analysis of Microarray and RNA-seq

Wang, Qi

dc.contributor.author	Wang, Qi
dc.description.abstract	Background: Microarray and RNA sequencing (RNA-seq) have been two commonly used high-throughput technologies for gene expression profiling over the past decades. For global gene expression studies, both techniques are expensive, and each has its own advantages and limitations. Integrative analysis of these two types of data would provide increased statistical power, reduced cost, and complementary technical advantages. However, the completely different mechanisms of the two techniques make the two types of data highly incompatible. Methods: Based on their degrees of compatibility, the genes are grouped into clusters using a novel clustering algorithm called Boundary Shift Partition (BSP). For each cluster, a linear model is fitted to the data, and the number of differentially expressed genes (DEGs) is calculated by running a two-sample t-test on the residuals. The optimal number of clusters can be determined using selection criteria that penalize the number of parameters used for model fitting. The method was evaluated on data simulated from various distributions and compared with the conventional K-means clustering method, Hartigan-Wong’s algorithm. The BSP algorithm was then applied to microarray and RNA-seq data obtained from embryonic heart tissues of wild-type mice and Tbx5 mice. The raw data went through multiple preprocessing steps, including data transformation, quantile normalization, linear modeling, principal component analysis, and probe alignment. The differentially expressed genes between wild-type and Tbx5 mice were identified using the BSP algorithm. Results: On the simulated data, the BSP algorithm was more accurate than Hartigan-Wong’s algorithm in the cases with smaller standard deviations, across the five underlying distributions. The BSP algorithm can find the correct number of clusters using the selection criteria.
The BSP method identified 584 differentially expressed genes between the wild-type and Tbx5 mice. A core gene network built from the differentially expressed genes revealed a set of key genes known to be important for heart development. Conclusion: The BSP algorithm is an efficient and robust classification method for integrating data obtained from microarray and RNA-seq.	en_US
dc.publisher	North Dakota State University	en_US
dc.title	Integrative Data Analysis of Microarray and RNA-seq	en_US
dc.type	Dissertation	en_US
dc.date.accessioned	2019-08-03T15:24:31Z
dc.date.available	2019-08-03T15:24:31Z
dc.date.issued	2018	en_US
dc.identifier.uri	https://hdl.handle.net/10365/29968
ndsu.degree	Doctor of Philosophy (PhD)	en_US
ndsu.college	Science and Mathematics	en_US
ndsu.department	Statistics	en_US
ndsu.program	Statistics	en_US
ndsu.advisor	Hyun, Seung Won
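
The abstract lists quantile normalization among the preprocessing steps. As a rough illustration only, the sketch below shows the textbook form of that step in Python with NumPy. It is not taken from the dissertation: the function name `quantile_normalize`, the genes-by-samples layout, and the simple handling of ties (broken by sort order rather than averaged) are all assumptions made for this example.

```python
import numpy as np


def quantile_normalize(expr):
    """Textbook quantile normalization (illustrative sketch, not the author's code).

    expr: 2-D array, rows = genes, columns = samples (assumed layout).
    Returns an array of the same shape in which every column shares the
    same empirical distribution. Ties are broken by sort order.
    """
    expr = np.asarray(expr, dtype=float)

    # Row indices that would sort each column independently.
    order = np.argsort(expr, axis=0)

    # Column-wise sorted values; row i holds the i-th smallest value of each sample.
    sorted_vals = np.take_along_axis(expr, order, axis=0)

    # Reference distribution: mean of the i-th smallest values across samples.
    rank_means = sorted_vals.mean(axis=1)

    # Write the reference value for each rank back to the gene that held that rank.
    normalized = np.empty_like(expr)
    np.put_along_axis(normalized, order, rank_means[:, None], axis=0)
    return normalized


if __name__ == "__main__":
    # Toy matrix: 4 genes (rows) by 3 samples (columns), values are made up.
    toy = np.array([
        [5.0, 4.0, 3.0],
        [2.0, 1.0, 4.0],
        [3.0, 3.0, 6.0],
        [4.0, 2.0, 8.0],
    ])
    print(quantile_normalize(toy))
```

After this step every sample column contains the same set of values, so differences between samples reflect only the rank of each gene within its sample. Whether the dissertation normalized the microarray and RNA-seq data jointly or separately is not stated in this record.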

Files in this item

Name: Wang_ndsu_0157D_12134.pdf
Size: 4.177Mb
Format: PDF
Description: Integrative Data Analysis of ...


This item appears in the following Collection(s)

Statistics Doctoral Work
