Identification of Differentially Expressed Genes When the Distribution of Effect Sizes is Asymmetric in Two Class Experiments
Abstract
High-throughput RNA sequencing (RNA-Seq) has emerged as an innovative and powerful technology for detecting differentially expressed (DE) genes across different conditions. Unlike continuous microarray data, RNA-Seq data consist of discrete read counts mapped to a particular gene. Most proposed methods for detecting DE genes from RNA-Seq data are based on statistics that compare normalized read counts between conditions. However, most of these methods do not take into account potential asymmetry in the distribution of effect sizes. In this dissertation, we propose methods to detect DE genes when the distribution of effect sizes is observed to be asymmetric. These proposed methods improve detection of differential expression compared to existing methods. Chapter 3 proposes two new methods that modify an existing nonparametric method, Significance Analysis of Microarrays with emphasis on RNA-Seq data (SAMseq), to account for asymmetry in the distribution of effect sizes. Results of the simulation studies indicate that the proposed methods identify more DE genes than SAMseq while adequately controlling the false discovery rate (FDR). Furthermore, the use of the proposed methods is illustrated by analyzing a real RNA-Seq data set containing samples from two different mouse strains. In Chapter 4, additional simulation studies are performed to show that one of the proposed methods, compared with other existing methods, provides better power for identifying truly DE genes or controls the FDR more adequately in most settings where asymmetry is present. Chapter 5 compares the performance of the parametric methods DESeq2, NBPSeq, and edgeR when effect sizes are asymmetric and the analysis takes this asymmetry into account. Through simulation studies, the performance of these methods is compared to that of the traditional Benjamini-Hochberg (BH) and q-value methods in the identification of DE genes.
This research proposes a new method that modifies these parametric methods to account for asymmetry in the distribution of effect sizes. Likewise, the use of these parametric methods and the proposed method is illustrated by analyzing a real RNA-Seq data set containing samples from two different mouse strains. Lastly, overall conclusions are given in Chapter 6.