Statistics Doctoral Work
Permanent URI for this collection: hdl:10365/32399
Now showing 1 - 20 of 32
Item: Adaptive Two-Stage Optimal Design for Estimating Multiple EDps under the 4-Parameter Logistic Model (North Dakota State University, 2018). Zhang, Anqing.

In dose-finding studies, c-optimal designs provide the most efficient design for studying a target dose of interest. However, there is no guarantee that a c-optimal design that works best for estimating one specific target dose still performs well for estimating other target doses. Considering the demand for estimating multiple target dose levels, the robustness of the optimal design becomes important. In this study, the 4-parameter logistic model is adopted to describe dose-response curves. Under nonlinear models, the optimal design truly depends on the pre-specified nominal parameter values; if the pre-specified values are not close to the true values, optimal designs become far from optimum. In this research, I study an optimal design that works well for estimating multiple EDps and for unknown parameter values. To address this parameter uncertainty, a two-stage design technique is adopted using two different approaches: one utilizes design augmentation at the second stage, and the other applies a Bayesian paradigm to find the optimal design at the second stage. For the Bayesian approach, one challenging task is the heavy numerical computation required when searching for the Bayesian optimal design; to overcome this problem, a clustering method can be applied. These two-stage design strategies are applied to construct a robust optimal design for estimating multiple EDps. Through a simulation study, the proposed two-stage optimal designs are compared with the traditional uniform design and the enhanced uniform design to see how well they perform in estimating multiple EDps when the parameter values are mis-specified.

Item: Bayesian Lasso Models – With Application to Sports Data (North Dakota State University, 2018). Gao, Di.

Several statistical models have been proposed to correctly predict the winners of sports games, for example, the generalized linear model (Magel & Unruh, 2013) and the probability self-consistent model (Shen et al., 2015). This work studied Bayesian Lasso generalized linear models. A hybrid estimation approach combining full and empirical Bayesian methods was proposed, together with a simple and efficient method for the EM step that does not require sample means from the random samples: the expectation step reduces to deriving the theoretical expectation directly from the conditional marginal. The findings of this work suggest that future applications will significantly cut down the computational load. Owing to the desirable geometric property of the Lasso (Tibshirani, 1996), the Lasso method provides sharp power in selecting significant explanatory variables and has become very popular for big-data problems over the last 20 years. This work was constructed with a Lasso structure and hence is also a good fit for dimension reduction, which is necessary when the number of observations is less than the number of parameters or when the design matrix is not of full rank. A simulation study was conducted to test the power of dimension reduction and the accuracy and variation of the estimates. As an application of Bayesian Lasso probit linear regression to live data, the NCAA March Madness tournament (Men's Basketball Division I) was considered. In the end, the predicted bracket was compared with the real tournament results, and model performance was evaluated by a bracket scoring system (Shen et al., 2015).
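The Bayesian Lasso probit model above is only described at a high level; as a minimal, non-Bayesian sketch of the same general idea, the example below fits an L1-penalized logistic classifier to simulated game-level features with scikit-learn. The feature names and all settings are hypothetical placeholders, not the covariates or priors used in the dissertation.

```python
# Minimal sketch: L1-penalized logistic regression as a frequentist stand-in
# for the Bayesian Lasso probit model described above (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical game-level features: differences between the two teams.
n_games = 500
X = rng.normal(size=(n_games, 3))          # e.g., seed_diff, off_eff_diff, def_eff_diff
true_beta = np.array([1.5, 0.8, 0.0])      # third feature is irrelevant
p_win = 1.0 / (1.0 + np.exp(-(X @ true_beta)))
y = rng.binomial(1, p_win)                 # 1 = first team wins

# The L1 penalty shrinks some coefficients exactly to zero (variable selection).
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
model.fit(X, y)

print("coefficients:", model.coef_.ravel())
print("cv accuracy:", cross_val_score(model, X, y, cv=5).mean())
```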
Item: Boundary Estimation (North Dakota State University, 2015). Mu, Yingfei.

Existing statistical methods do not provide a satisfactory solution for determining the spatial pattern in spatially referenced data, which is often required by research in many areas, including geology, agriculture, forestry, marine science, and epidemiology, for identifying the source of the unusual environmental factors associated with a certain phenomenon. This work provides a novel algorithm which can be used to delineate the boundary of an area of hot spots accurately and efficiently. First of all, our algorithm does not assume any pre-specified geometric shape for the change-curve. Secondly, the computational complexity of our algorithm for change-curve detection is of order O(n^2), which is much smaller than the 2^{O(n^2)} required by the CUSP algorithm proposed in Müller & Song [8] and by Carlstein's [2] estimators. Furthermore, our algorithm yields a consistent estimate of the change-curve as well as of the underlying distribution mean of the observations in the regions. We also study the hypothesis test of the existence of the change-curve under independence of the spatially referenced data. We then provide simulation studies as well as a real case study to compare our algorithm with the popular boundary estimation method, the spatial scan statistic.

Item: Comparing Prediction Accuracies of Cancer Survival Using Machine Learning Techniques and Statistical Methods in Combination with Data Reduction Methods (North Dakota State University, 2022). Mostofa, Mohammad.

This comparative study of five-year survival prediction for breast, lung, colon, and leukemia cancers used a large SEER dataset along with 10-fold cross-validation and provided insight into the relative prediction ability of different machine learning and data reduction methods. Lasso regression and the Boruta algorithm were used for variable selection, and Principal Component Analysis (PCA) was used for dimensionality reduction. We used one statistical method, logistic regression (LR), and several machine learning methods, including Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), K Nearest Neighbor (KNN), Artificial Neural Network (ANN), and the Naïve Bayes classifier (NB). For breast cancer, we found LDA, RF, and LR were the best models for five-year survival prediction based on accuracy, sensitivity, specificity, and area under the curve (AUC), using data reduction based on Z-score normalization and the Boruta algorithm. The results for lung cancer indicated that linear SVM, RF, and ANN were the best survival prediction models, using data reduction based on Z-score and max-min normalization. The results for colon cancer indicated that ANN and RF were the best prediction models, using the Boruta algorithm and the Z-score method. The results for leukemia showed that ANN and RF were the best survival prediction models, using the Boruta algorithm and Z-score data reduction. Overall, ANN, RF, and LR were the best prediction models for all cancers using variable selection by the Boruta algorithm.
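A minimal sketch of the kind of cross-validated comparison described above, using scikit-learn on synthetic data: several of the named classifiers are scored by 10-fold cross-validated AUC. The data, features, and tuning settings are placeholders rather than the SEER analysis.

```python
# Minimal sketch: compare several classifiers by 10-fold cross-validated AUC,
# in the spirit of the survival-prediction comparison described above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (binary: survived five years or not).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=6,
                           random_state=0)

models = {
    "LR":  LogisticRegression(max_iter=1000),
    "RF":  RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="linear")),  # Z-score then linear SVM
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "NB":  GaussianNB(),
}

for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
    print(f"{name}: mean 10-fold AUC = {auc:.3f}")
```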
Item: Comparing Several Modeling Methods on NCAA March Madness. (North Dakota State University, 2015). Hua, Su.

This year (2015), according to the American Gaming Association's (AGA) research, nearly 40 million people filled out about 70 million March Madness brackets (Moyer, 2015). Their objective is to correctly predict the winners of each game. This paper used the probability self-consistent (PSC) model (Shen, Hua, Zhang, Mu, & Magel, 2015) to predict all 63 games in the NCAA Men's Division I Basketball Tournament. The PSC model was first introduced by Zhang (2012), where the logit link was used to connect only five covariates with the conditional probability of a team winning a game given its rival team. In this work, we incorporated fourteen covariates into the model. In addition, we used another link function, the cauchit link, to make the predictions. Empirical results show that the PSC model with the cauchit link has better average performance under both simple and double scoring than the logit link over the last three years of tournament play. In the generalized linear model, maximum likelihood estimation is a popular method for estimating the parameters; however, convergence failures may happen when a large number of covariates are used in the model (Griffiths, Hill, & Pope, 1987). Therefore, in the second phase of this study, Bayesian inference is used for estimating the parameters in the prediction model. Bayesian estimation incorporates prior information, such as experts' opinions and historical results, into the model. Predictions for three years of March Madness using the model obtained from Bayesian estimation with the logit link are compared to predictions using the model obtained from maximum likelihood estimation.

Item: A Comparison of False Discovery Rate Method and Dunnett's Test for a Large Number of Treatments (North Dakota State University, 2015). Gomez, Kayeromi Donoukounmahou.

It has become quite common nowadays to perform multiple tests simultaneously in order to detect differences in a certain trait among groups. This often leads to an inflated probability of at least one Type I error, a rejection of a null hypothesis when it is in fact true. This inflation generally leads to a loss of power of the test, especially in multiple testing and multiple comparisons. The aim of this research is to use simulation to address what a researcher should do to determine which treatments are significantly different from the control when there is a large number of treatments and the number of replicates in each treatment is small. We examine two situations in this simulation study: when the number of replicates per treatment is 3 and when it is 5; in each situation, we simulated from a normal distribution and from a mixture of normal distributions. The total number of simulated treatments was progressively increased from 50 to 100, then 150, and finally 300. The goal is to measure the change in the performance of the False Discovery Rate method and Dunnett's test, in terms of Type I error and power, as the total number of treatments increases. We report two ways of examining Type I error and power: first, we look at the performance of the two tests in relation to all other comparisons in our simulation study, and second, per simulated sample. In the first assessment, the False Discovery Rate method appears to have higher power while keeping its Type I error in the same neighborhood as Dunnett's test; in the latter, both tests have similar power and the False Discovery Rate method has a higher Type I error. Overall, the results show that when the objective of the researcher is to detect as many of the differences as possible, the FDR method is preferred. However, if Type I error is more detrimental to the outcomes of the research, Dunnett's test offers a better alternative.
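A minimal sketch of a single simulation replicate comparing the two procedures discussed above: per-treatment t-tests against the control are adjusted by the Benjamini-Hochberg FDR procedure (statsmodels), and Dunnett's many-to-one test is run with scipy.stats.dunnett (available in SciPy 1.11 and later). The number of treatments, replicates, and effect sizes are arbitrary placeholders, not the dissertation's simulation settings.

```python
# One simulation replicate: FDR (Benjamini-Hochberg) vs. Dunnett's test for
# comparing many treatments against a control with few replicates each.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
n_trt, n_rep = 50, 3                      # 50 treatments, 3 replicates each
shifts = np.zeros(n_trt)
shifts[:10] = 2.0                         # first 10 treatments truly differ

control = rng.normal(0, 1, size=n_rep)
treatments = [rng.normal(shift, 1, size=n_rep) for shift in shifts]

# FDR approach: per-treatment two-sample t-tests, then BH adjustment.
pvals = np.array([stats.ttest_ind(t, control).pvalue for t in treatments])
reject_fdr = multipletests(pvals, alpha=0.05, method="fdr_bh")[0]

# Dunnett's many-to-one comparison (requires SciPy >= 1.11).
dunnett_res = stats.dunnett(*treatments, control=control)
reject_dunnett = dunnett_res.pvalue < 0.05

print("FDR rejections:    ", int(reject_fdr.sum()))
print("Dunnett rejections:", int(reject_dunnett.sum()))
```

Repeating this over many replicates and tallying rejections among the truly shifted and unshifted treatments yields power and Type I error comparisons of the kind described in the abstract.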
Item: A Conditional Random Field (CRF) Based Machine Learning Framework for Product Review Mining (North Dakota State University, 2019). Ming, Yue.

The task of opinion mining from product reviews has typically been approached with rule-based methods or generative learning models such as hidden Markov models (HMMs). This paper introduced a discriminative model using linear-chain Conditional Random Fields (CRFs), which can naturally incorporate arbitrary, non-independent features of the input without requiring conditional independence among the features or distributional assumptions about the inputs. The framework first performs part-of-speech (POS) tagging over each word in the sentences of the review text. The performance is evaluated on three criteria: precision, recall, and F-score. The results show that this approach is effective for this type of natural language processing (NLP) task. The framework then extracts the keywords associated with each product feature and summarizes them into concise lists that are simple and intuitive for people to read.

Item: Conditional Random Field with Lasso and its Application to the Classification of Barley Genes Based on Expression Level Affected by Fungal Infection (North Dakota State University, 2019). Liu, Xiyuan.

The classification of gene expression levels, more specifically gene expression analysis, is a major research area in statistics. There are several classical methods for solving the classification problem. To apply the Logistic Regression Model (LRM) and other classical methods, the observations in the dataset should satisfy the independence assumption: the observations must be independent of each other, and the predictors (independent variables) should also be independent. These assumptions are usually violated in gene expression analysis. Although the classical Hidden Markov Model (HMM) can address the dependence among observations, it requires the independent variables in the dataset to be discrete and independent. Unfortunately, gene expression level is a continuous variable. To solve the classification problem for gene expression level data, the Conditional Random Field (CRF) is introduced. Finally, the Least Absolute Shrinkage and Selection Operator (LASSO) penalty, a dimension reduction method, is introduced to improve the CRF model.

Item: Distributed Inference for Degenerate U-Statistics with Application to One and Two Sample Test (North Dakota State University, 2020). Atta-Asiamah, Ernest.

In many hypothesis testing problems, such as one-sample and two-sample test problems, the test statistics are degenerate U-statistics. One of the challenges in practice is the computation of U-statistics for a large sample size. Besides, for degenerate U-statistics, the limiting distribution is a mixture of weighted chi-squares involving the eigenvalues of the kernel of the U-statistic; as a result, it is not straightforward to construct the rejection region based on this asymptotic distribution. In this research, we aim to reduce the computational complexity of degenerate U-statistics and propose an easy-to-calibrate test statistic by using the divide-and-conquer method. Specifically, we randomly partition the full n data points into k_n disjoint groups of equal size, compute the U-statistic on each group, and combine them by averaging to get a statistic T_n. We prove that the statistic T_n has the standard normal distribution as its limiting distribution. In this way, the running time is reduced from O(n^m) to O(n^m / k_n^m), where m is the order of the one-sample U-statistic. Besides, for a given significance level α, it is easy to construct the rejection region. We apply our method to the goodness-of-fit test and the two-sample test. The simulation and real data analysis show that the proposed test can achieve high power and fast running time for both one- and two-sample tests.
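A minimal sketch of the divide-and-conquer construction described above: the data are randomly split into k_n equal groups, an order-2 U-statistic is computed on each group, and the group statistics are averaged and studentized. The kernel h(x, y) = (x^2 - 1)(y^2 - 1), which is degenerate under a standard normal null, is a toy choice for illustration, not the kernel studied in the dissertation.

```python
# Minimal sketch of the divide-and-conquer construction for a degenerate
# U-statistic: compute the group-level U-statistics and average them.
import numpy as np

def u_stat_order2(x, kernel):
    """Order-2 U-statistic: average of the kernel over all unordered pairs."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += kernel(x[i], x[j])
    return total / (n * (n - 1) / 2)

def divide_and_conquer_u(x, k_n, kernel, rng):
    """Randomly partition x into k_n equal groups; average the group U-statistics."""
    x = rng.permutation(x)
    groups = np.array_split(x, k_n)
    u_vals = np.array([u_stat_order2(g, kernel) for g in groups])
    t_n = u_vals.mean()
    # Studentize with the empirical spread of the group statistics.
    z = t_n / (u_vals.std(ddof=1) / np.sqrt(k_n))
    return t_n, z

# Toy degenerate kernel under the N(0, 1) null: E[h(X, y)] = 0 for every y.
kernel = lambda x, y: (x**2 - 1.0) * (y**2 - 1.0)

rng = np.random.default_rng(2)
x = rng.normal(size=2000)                 # data generated under the null
t_n, z = divide_and_conquer_u(x, k_n=20, kernel=kernel, rng=rng)
print(f"T_n = {t_n:.4f}, studentized statistic = {z:.2f}")
```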
Item: Identification of Differentially Expressed Genes When the Distribution of Effect Sizes is Asymmetric in Two Class Experiments (North Dakota State University, 2017). Kotoka, Ekua Fesuwa.

High-throughput RNA sequencing (RNA-Seq) has emerged as an innovative and powerful technology for detecting differentially expressed (DE) genes across different conditions. Unlike continuous microarray data, RNA-Seq data consist of discrete read counts mapped to a particular gene. Most proposed methods for detecting DE genes from RNA-Seq are based on statistics that compare normalized read counts between conditions. However, most of these methods do not take into account potential asymmetry in the distribution of effect sizes. In this dissertation, we propose methods to detect DE genes when the distribution of the effect sizes is observed to be asymmetric; these proposed methods improve the detection of differential expression compared to existing methods. Chapter 3 proposes two new methods that modify an existing nonparametric method, Significance Analysis of Microarrays with emphasis on RNA-Seq data (SAMseq), to account for the asymmetry in the distribution of the effect sizes. Results of the simulation studies indicate that the proposed methods, compared to the SAMseq method, identify more DE genes while adequately controlling the false discovery rate (FDR). Furthermore, the use of the proposed methods is illustrated by analyzing a real RNA-Seq data set containing samples from two different mouse strains. In Chapter 4, additional simulation studies show that one of the proposed methods, compared with other existing methods, provides better power for identifying truly DE genes or more sufficiently controls the FDR in most settings where asymmetry is present. Chapter 5 compares the performance of the parametric methods DESeq2, NBPSeq, and edgeR when asymmetric effect sizes exist and the analysis takes this asymmetry into account. Through simulation studies, the performance of these methods is compared to that of the traditional BH and q-value methods for identifying DE genes. This research proposes a new method that modifies these parametric methods to account for asymmetry found in the distribution of effect sizes. Likewise, the use of these parametric methods and the proposed method is illustrated by analyzing a real RNA-Seq data set containing samples from two different mouse strains. Lastly, overall conclusions are given in Chapter 6.
Item: Integrative Data Analysis of Microarray and RNA-seq (North Dakota State University, 2018). Wang, Qi.

Background: Microarray and RNA sequencing (RNA-seq) have been two commonly used high-throughput technologies for gene expression profiling over the past decades. For global gene expression studies, both techniques are expensive, and each has its unique advantages and limitations. Integrative analysis of these two types of data would provide increased statistical power, reduced cost, and complementary technical advantages. However, the completely different mechanisms of the two high-throughput techniques make the two types of data highly incompatible. Methods: Based on the degrees of compatibility, the genes are grouped into different clusters using a novel clustering algorithm, called Boundary Shift Partition (BSP). For each cluster, a linear model is fitted to the data and the number of differentially expressed genes (DEGs) is calculated by running a two-sample t-test on the residuals. The optimal number of clusters can be determined using a selection criterion that penalizes the number of parameters used in model fitting. The method was evaluated using data simulated from various distributions and compared with the conventional K-means clustering method, Hartigan-Wong's algorithm. The BSP algorithm was applied to microarray and RNA-seq data obtained from embryonic heart tissues from wild-type mice and Tbx5 mice. The raw data went through multiple preprocessing steps, including data transformation, quantile normalization, linear modeling, principal component analysis, and probe alignment. The differentially expressed genes between wild type and Tbx5 were identified using the BSP algorithm. Results: The accuracies of the BSP algorithm on the simulated data are higher than those of Hartigan-Wong's algorithm for the cases with smaller standard deviations across the five different underlying distributions, and the BSP algorithm can find the correct number of clusters using the selection criterion. The BSP method identified 584 differentially expressed genes between the wild-type and Tbx5 mice. A core gene network developed from the differentially expressed genes showed a set of key genes known to be important for heart development. Conclusion: The BSP algorithm is an efficient and robust classification method for integrating data obtained from microarray and RNA-seq.
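The BSP algorithm itself is not spelled out in the abstract, so the following is only a rough sketch of the described workflow, with K-means standing in for the BSP clustering step: genes are grouped by the compatibility of their microarray and RNA-seq measurements, a per-gene linear model is fitted within each cluster, and a two-sample t-test is run on the residuals. All data, labels, and settings are synthetic placeholders.

```python
# Rough sketch of the integrative workflow described above; K-means stands in
# for the BSP clustering step, and all data are synthetic placeholders.
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
n_genes, n_samples = 500, 12
group = np.array([0] * 6 + [1] * 6)            # 0 = wild type, 1 = mutant (toy labels)

# Two platforms measuring the same genes, with varying agreement between them.
micro = rng.normal(8, 1, size=(n_genes, n_samples))
noise = rng.normal(0, rng.uniform(0.2, 2.0, size=(n_genes, 1)), size=(n_genes, n_samples))
seq = micro + noise
seq[:50, group == 1] += 2.0                    # first 50 genes truly differential

# Step 1: cluster genes by platform compatibility (per-gene correlation).
compat = np.array([np.corrcoef(micro[g], seq[g])[0, 1] for g in range(n_genes)])
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(compat.reshape(-1, 1))

# Step 2: within each cluster, fit a per-gene linear model (seq ~ micro) and
# run a two-sample t-test on the residuals between the two conditions.
for c in np.unique(clusters):
    n_de = 0
    for g in np.where(clusters == c)[0]:
        slope, intercept, *_ = stats.linregress(micro[g], seq[g])
        resid = seq[g] - (intercept + slope * micro[g])
        p = stats.ttest_ind(resid[group == 0], resid[group == 1]).pvalue
        n_de += p < 0.05
    print(f"cluster {c}: {n_de} genes flagged at p < 0.05")
```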
Item: Measuring Performance of United States Commercial and Domestic Banks and its Impact on 2007-2009 Financial Crisis (North Dakota State University, 2019). Sakouvogui, Kekoura.

In the analysis of efficiency measures, the statistical Stochastic Frontier Analysis (SFA) and linear-programming Data Envelopment Analysis (DEA) estimators have been widely applied. This dissertation is centered on two main goals. First, it addresses the individual limitations of the SFA and DEA models in Chapters 2 and 3, respectively, using Monte Carlo (MC) simulations. Motivated by the lack of justification for the choice of inefficiency distributions in MC simulations, Chapter 2 develops the statistical parameters, i.e., the mean and standard deviation, of the inefficiency distributions: truncated normal, half normal, and exponential. MC simulation results show that, within both the conventional and proposed approaches, misspecification of the inefficiency distribution matters. More precisely, within the proposed approach, the misspecified truncated normal SFA model provides the smallest mean absolute deviation and mean squared error when the inefficiency distribution is half normal. Chapter 3 examines several misspecifications of the DEA efficiency measures while accounting for the stochastic inefficiency distributions of truncated normal, half normal, and exponential derived in Chapter 2. MC simulations were conducted to examine the performance of the DEA model under two different data-generating processes, logarithm and level, and across five different scenarios: inefficiency distributions, sample sizes, production functions, input distributions, and the curse of dimensionality. The results caution DEA practitioners concerning the accuracy of their estimates and the implications within the proposed and conventional approaches to the inefficiency distributions. Second, Chapter 4 presents an empirical assessment of the liquidity and solvency financial factors on the cost efficiency measures of U.S. banks while accounting for regulatory, macroeconomic, and bank-internal factors. The results suggest that the liquidity and solvency financial factors negatively impacted the cost efficiency measures of U.S. banks from 2005 to 2017. Moreover, during the financial crisis, U.S. banks were inefficient in comparison to the tranquil period, and the solvency financial factor had an insignificant impact on the cost efficiency measures. In addition, U.S. banks' liquidity financial factor collapsed due to contagion during the financial crisis.

Item: Model Validation and Diagnostics in Right Censored Regression (North Dakota State University, 2013). Miljkovic, Tatjana.

When censored data are present in the linear regression setting, the Expectation-Maximization (EM) algorithm and the Buckley and James (BJ) method are two algorithms that can be implemented to fit the regression model. We focus our study on the EM algorithm because it is easier to implement than the BJ algorithm and it uses common assumptions in regression theory, such as normally distributed errors. The BJ algorithm, however, is used for comparison purposes in benchmarking the EM parameter estimates, their variability, and model selection. In this dissertation, validation and influence diagnostic tools are proposed for right-censored regression using the EM algorithm. These tools include a reconstructed coefficient of determination, a test for outliers based on the reconstructed jackknife residual, and influence diagnostics with one-step deletion. To validate the proposed methods, extensive simulation studies are performed to compare the performance of the EM and BJ algorithms in parameter estimation for data with different error distributions, proportions of censored data, and sample sizes. A sensitivity analysis for the reconstructed coefficient of determination is developed to show how the EM algorithm can be used in model validation for different amounts of censoring and locations of the censored data. Additional simulation studies show the capability of the EM algorithm to detect outliers for different types of outliers (uncensored and censored), proportions of censored data, and locations of the outliers. The proposed formula for the one-step deletion method is validated with an example and a simulation study. Additionally, this research proposes a novel application of the EM algorithm for modeling right-censored regression in the area of actuarial science. Both the EM and BJ algorithms are utilized in modeling health benefit data provided by the North Dakota Department of Veterans Affairs (ND DVA). The proposed model validation and diagnostic tools are applied using the EM algorithm. Results of this study can be of great benefit to government policy makers and pricing actuaries.
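A minimal sketch of an EM-type algorithm for right-censored linear regression with normal errors, the setting benchmarked above: the E-step replaces each censored response by its conditional first and second moments under the current fit, and the M-step refits ordinary least squares and updates the error variance. This is a generic textbook construction on simulated data, not the dissertation's implementation or its diagnostic tools.

```python
# Minimal EM sketch for right-censored linear regression with normal errors.
import numpy as np
from scipy.stats import norm

def em_censored_regression(X, y, censored, n_iter=200):
    """X: design matrix (with intercept column); y: observed response or
    censoring threshold; censored: boolean, True = right-censored at y."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]       # naive OLS start
    sigma2 = np.var(y - X @ beta)
    for _ in range(n_iter):
        mu, sigma = X @ beta, np.sqrt(sigma2)
        a = (y - mu) / sigma                          # standardized thresholds
        lam = norm.pdf(a) / norm.sf(a)                # inverse Mills ratio
        # E-step: conditional first and second moments of the latent response.
        ey = np.where(censored, mu + sigma * lam, y)
        ey2 = np.where(censored, mu**2 + sigma2 + sigma * lam * (mu + y), y**2)
        # M-step: complete-data OLS for beta, then the variance update.
        beta = np.linalg.lstsq(X, ey, rcond=None)[0]
        mu = X @ beta
        sigma2 = np.mean(ey2 - 2 * mu * ey + mu**2)
    return beta, sigma2

# Simulated example with right censoring.
rng = np.random.default_rng(4)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y_latent = X @ np.array([1.0, 2.0]) + rng.normal(0, 1, size=n)
c = rng.uniform(1, 4, size=n)                         # censoring times
censored = y_latent > c
y = np.where(censored, c, y_latent)

beta_hat, sigma2_hat = em_censored_regression(X, y, censored)
print("beta:", beta_hat, "sigma^2:", sigma2_hat)
```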
Item: Nonparametric Test for Nondecreasing Order Alternatives in Randomized Complete Block and Balanced Incomplete Block Mixed Design (North Dakota State University, 2020). Osafo, Mamfe.

Nonparametric tests are used to test hypotheses when the data at hand violate one or more of the assumptions of parametric test procedures. An ordered (nondecreasing) alternative is used when there is prior information about the data; it assumes that the underlying distributions are of the same type and therefore differ only in location. For example, in dose-response studies, animals are assigned to k groups corresponding to k doses of an experimental drug, and the effect of the drug on the animals is likely to increase or decrease with increasing doses. In this case, the ordered alternative is appropriate for the study. In this paper, we propose eight new nonparametric tests useful for testing against nondecreasing ordered alternatives in a mixed design involving a randomized complete block design and a balanced incomplete block design. These tests involve various modifications of the Jonckheere-Terpstra test (Jonckheere, 1954; Terpstra, 1952) and Alvo and Cabilio's test (1995). Three, four, and five treatments were considered with different location parameters under different scenarios. For three and four treatments, 6, 12, and 18 blocks were used for the simulation, while 10, 20, and 30 blocks were used for five treatments. Different tests performed best under different block combinations, but overall the standardized last statistic for Alvo outperformed the other tests as the number of treatments and the number of missing observations per block increased. A simulation study was conducted comparing the powers of the various modifications of the Jonckheere-Terpstra and Alvo and Cabilio tests under different scenarios, and recommendations are made.

Item: Nonparametric Tests for the Non-Decreasing and Alternative Hypotheses for the Incomplete Block and Completely Randomized Mixed Design (North Dakota State University, 2014). Ndungu, Alfred Mungai.

This research study proposes a solution for dealing with missing observations, a common problem in real-world datasets. A nonparametric approach is used because of its ease of use relative to the parametric approach, which burdens the user with firm assumptions. The study assumes the data are in an Incomplete Block Design (IBD) and Completely Randomized Design (CRD) mixed design. The scope of this research was limited to three, four, and five treatments. Mersenne Twister (2014) simulations were used to vary the design and to estimate the powers of the test statistics. Two test statistics are proposed for the case where the user expects a non-decreasing order of differences in treatment means; both are applicable in the cited mixed design. The tests combine Alvo and Cabilio (1995) and Jonckheere-Terpstra (Jonckheere, 1954; Terpstra, 1952) in two ways: standardizing the sum of the standardized statistics, and standardizing the sum of the unstandardized statistics. Results showed that the former is better. Three tests are proposed for the umbrella alternative. The first, Mungai's test, is only applicable in an IBD. The other two tests combine Mungai's test and Mack-Wolfe (1981) using the same two methods described above. The same conclusion holds, except when the size of the IBD sample was equal to or greater than a quarter of that of the CRD.
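Several of the tests above are built from the Jonckheere-Terpstra statistic for nondecreasing ordered alternatives. The sketch below implements only the plain completely-randomized version (pairwise Mann-Whitney counts summed over ordered group pairs, standardized with the no-ties null mean and variance); the block-design modifications and combined statistics studied in these dissertations are not reproduced here.

```python
# Minimal sketch: the (unmodified) Jonckheere-Terpstra statistic for a
# nondecreasing ordered alternative across k independent groups.
import numpy as np
from scipy.stats import norm

def jonckheere_terpstra(groups):
    """groups: list of 1-D arrays ordered by hypothesized increasing effect."""
    k = len(groups)
    j_stat = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            xi, xj = groups[i], groups[j]
            # Mann-Whitney count of pairs where the later group is larger (ties count 1/2).
            j_stat += np.sum(xj[None, :] > xi[:, None]) + 0.5 * np.sum(xj[None, :] == xi[:, None])
    n_i = np.array([len(g) for g in groups])
    n = n_i.sum()
    mean = (n**2 - np.sum(n_i**2)) / 4.0
    var = (n**2 * (2 * n + 3) - np.sum(n_i**2 * (2 * n_i + 3))) / 72.0  # no-ties variance
    z = (j_stat - mean) / np.sqrt(var)
    return j_stat, z, norm.sf(z)         # one-sided p-value (increasing trend)

rng = np.random.default_rng(5)
groups = [rng.normal(loc, 1, size=8) for loc in (0.0, 0.4, 0.8)]  # increasing locations
j_stat, z, p = jonckheere_terpstra(groups)
print(f"J = {j_stat:.1f}, z = {z:.2f}, one-sided p = {p:.4f}")
```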
Item: Nonparametric Tests for the Umbrella Alternative in a Mixed Design for a Known Peak (North Dakota State University, 2020). Asare, Boampong Adu.

When an assumption of a parametric test cannot be verified, a nonparametric test provides a simple way of conducting a test on populations. The motivation behind conducting a test of hypothesis is to examine the effect of a treatment, or of multiple treatments against one another. For example, in dose-response studies, monkeys are assigned to k groups corresponding to k doses of an experimental drug. The effect of the drug on these monkeys is likely to increase or decrease with increasing and decreasing doses; the drug's effect may be an increasing function of dosage up to a certain level, after which its effect decreases with further increases in dose. An umbrella alternative, in this case, is considered the most appropriate hypothesis for these kinds of studies. Test statistics are proposed to test for the umbrella alternative in mixed designs consisting of combinations of a Completely Randomized Design (CRD), a Randomized Complete Block Design (RCBD), an Incomplete Block Design (IBD), and a Balanced Incomplete Block Design (BIBD). Powers were obtained for a variety of cases. Different proportions of blocks relative to the sample sizes of the Completely Randomized Design portion were considered, with equal sample sizes for all treatments in the Completely Randomized Design. Furthermore, equal numbers of blocks for the Randomized Complete Block Design, the Incomplete Block Design, and the Balanced Incomplete Block Design were considered. Monte Carlo simulation studies were conducted using SAS to vary the design and to estimate and compare the powers of the test statistics. The underlying distributions considered were normal, t, and exponential.

Item: On Lasso Estimation of Linear Mixed Model for High Dimensional Longitudinal Data (North Dakota State University, 2021). Wen, Qian.

With the advancement of technology in data collection, repeated measurements with high-dimensional covariates have become increasingly common. The classical statistical approach for modeling data of this kind is the linear mixed model with temporally correlated errors. However, most of the research reported in the literature on variable selection is for independent response data. In this study, the proposed algorithm employs Expectation-Maximization (EM) and Least Absolute Shrinkage and Selection Operator (LASSO) approaches under the linear mixed model scheme with the assumption of Gaussianity, an approach that works for data with interdependence. Our algorithm involves two steps: (1) variance-covariance component estimation by EM; and (2) variable selection by LASSO. The crucial challenge arises from the fact that linear mixed models usually allow structured variance-covariance matrices, which in turn renders their estimation complex: in general, there are no explicit maxima in the M-step of the EM algorithm. Our EM algorithm uses one iteration of the projected gradient descent method, which turns out to be quite computationally efficient compared with the classical EM algorithm because it obviates the process of finding the maxima of the variance-covariance components in the M-step. With the estimates of the variance-covariance components obtained from step 1, the LASSO estimation is executed on the full log-likelihood function with an L1 regularization imposed. The LASSO method has the effect of shrinking all coefficients towards zero, which plays a variable selection role. We apply the gradient descent algorithm to find the LASSO estimates and pathwise coordinate descent to set the tuning parameter for the penalized log-likelihood function. The simulation studies are carried out under the assumption that the measurement errors of each subject have a first-order autoregressive AR(1) correlation structure. The numerical results show that the variance-covariance parameter estimates from our method are comparable to those of the classical Newton-Raphson (NR) method in the simple case and outperform the NR method when the variance-covariance matrix has a complex structure. Moreover, our method successfully identifies all the relevant explanatory variables and most of the redundant explanatory variables. The proposed method is also applied to a real-life data set, and the results are very reasonable.
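Step 2 above amounts to L1-penalized estimation after accounting for the within-subject AR(1) correlation. As a simplified stand-in (not the dissertation's EM or pathwise coordinate descent algorithm), the sketch below whitens each subject's repeated measurements with an assumed AR(1) correlation estimate and then applies an ordinary Lasso; the AR(1) parameter, penalty level, and data are placeholders.

```python
# Simplified stand-in for step 2: whiten each subject's repeated measurements
# with an (assumed already estimated) AR(1) correlation, then run the Lasso.
import numpy as np
from sklearn.linear_model import Lasso

def ar1_whitener(n_times, rho):
    """Whitening matrix W such that W @ R @ W.T = I for an AR(1) correlation R."""
    R = rho ** np.abs(np.subtract.outer(np.arange(n_times), np.arange(n_times)))
    L = np.linalg.cholesky(R)
    return np.linalg.inv(L)

rng = np.random.default_rng(6)
n_subj, n_times, p = 60, 5, 30
rho_hat = 0.6                                  # pretend this came from the EM step
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]               # only three relevant covariates

W = ar1_whitener(n_times, rho_hat)
Xw_list, yw_list = [], []
for _ in range(n_subj):
    X_i = rng.normal(size=(n_times, p))
    # Stationary AR(1) errors within the subject.
    e = np.zeros(n_times)
    e[0] = rng.normal()
    for t in range(1, n_times):
        e[t] = rho_hat * e[t - 1] + rng.normal(scale=np.sqrt(1 - rho_hat**2))
    y_i = X_i @ beta_true + e
    Xw_list.append(W @ X_i)                    # decorrelate within subject
    yw_list.append(W @ y_i)

lasso = Lasso(alpha=0.1).fit(np.vstack(Xw_list), np.concatenate(yw_list))
print("selected coefficients:", np.nonzero(lasso.coef_)[0])
```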
Item: Performance of Permutation Tests Using Simulated Genetic Data (North Dakota State University, 2022). Soumare, Ibrahim.

Disease statuses and biological conditions are known to be greatly affected by differences in gene expression levels. A common challenge in RNA-seq data analysis is to identify genes whose mean expression levels change across different groups of samples or, more generally, are associated with one or more variables of interest. Such analysis is called differential expression analysis. Many tools have been developed for analyzing differential gene expression (DGE) in RNA-seq data. RNA-seq data are represented as counts. Typically, a generalized linear model with a log link and a negative binomial response is fit to the count data for each gene, and DE genes are identified by testing, for each gene, whether a model parameter or a linear combination of model parameters is zero. We conducted a simulation study to compare the performance of our proposed modified permutation test to DESeq2, edgeR, Limma, LFC, and Voom when applied to RNA-seq data. We considered different combinations of sample sizes and underlying distributions. In this simulation study, we first simulated data using Monte Carlo simulation in SAS and assessed the true detection rate and false positive rate for each model involved. We then simulated data from real RNA-seq data using the SimSeq algorithm and compared the performance of our proposed model to DESeq2, edgeR, Limma, LFC, and Voom. The simulation results suggest that permutation tests are a competitive alternative to traditional parametric methods for analyzing RNA-seq data when sample sizes are sufficient. Specifically, the results show that the permutation test controlled the Type I error fairly well and had comparable power. Moreover, for sample sizes n ≥ 10, the simulations exhibited a comparable true detection rate and consistently kept the false positive rate very low when sampling from Poisson and negative binomial distributions. Likewise, the results from SimSeq confirm that permutation tests do a better job of keeping the false positive rate low.
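A minimal sketch of a per-gene two-group permutation test on count data: group labels are shuffled, a difference-in-means statistic is recomputed for each shuffle, and the permutation p-value is the fraction of shuffled statistics at least as extreme as the observed one. This is a generic permutation test on simulated negative binomial counts, not the modified test proposed in the dissertation.

```python
# Minimal per-gene two-group permutation test on simulated RNA-seq-like counts.
import numpy as np

def permutation_pvalue(x, labels, n_perm, rng):
    """Two-sided permutation p-value for a difference-in-means statistic."""
    observed = abs(x[labels == 1].mean() - x[labels == 0].mean())
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(labels)
        stat = abs(x[perm == 1].mean() - x[perm == 0].mean())
        count += stat >= observed
    return (count + 1) / (n_perm + 1)          # add-one correction

rng = np.random.default_rng(7)
n_genes, n_per_group = 200, 10
labels = np.array([0] * n_per_group + [1] * n_per_group)

# Negative binomial counts; the first 20 genes are truly differentially expressed.
mu = np.full((n_genes, 2 * n_per_group), 50.0)
mu[:20, labels == 1] *= 2.0
counts = rng.negative_binomial(n=5, p=5 / (5 + mu))

pvals = np.array([permutation_pvalue(counts[g], labels, 1000, rng) for g in range(n_genes)])
print("genes called at p < 0.05:", int((pvals < 0.05).sum()))
```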
Item: Predicting the Outcomes of NCAA Women's Sports (North Dakota State University, 2017). Wang, Wenting.

Sports competitions provide excellent opportunities for model building and for using basic statistical methodology in an interesting way. More attention has been paid to, and more research conducted on, men's sports than women's sports. This paper focuses on three NCAA women's sports: basketball, volleyball, and soccer. Several ordinary least squares models were developed to help explain the variation in the point spread of a women's basketball, volleyball, or soccer game based on in-game statistics. Several logistic models were also developed to help estimate the probability that a particular team will win a game in the women's basketball, volleyball, and soccer tournaments. Ordinary least squares models for Round 1, Round 2, and Rounds 3-6, with point spread as the dependent variable, were developed using differences in ranks of seasonal averages and differences of seasonal averages to predict the winners of games in each of those rounds of the women's basketball, volleyball, and soccer tournaments. Logistic models for Round 1, Round 2, and Rounds 3-6 that estimate the probability of a team winning a game were likewise developed using differences in ranks of seasonal averages and differences of seasonal averages. The prediction models were validated before being used for prediction. For basketball, the least squares model developed using differences in ranks of seasonal averages with a double scoring system variable correctly predicted the results of 76.2% of the games in the entire tournament, with all predictions made before the start of the tournament. For volleyball, the logistic model developed using differences of seasonal averages correctly predicted 65.1% of the games in the entire tournament. For soccer, the logistic regression model developed using differences of seasonal averages correctly predicted 45% of all games in the tournament when all six rounds were predicted before the tournament began. In this case, a team predicted to win in the second round or later might not even have made it to that round, since the predictions were made ahead of time.

Item: Profile Matching in Observational Studies With Multilevel Data (North Dakota State University, 2022). McGrath, Brenda.

Matching is a popular method for using observational data to replicate desired features of a randomized controlled trial. A common problem encountered in observational studies is the lack of common support, or limited overlap, of the covariate distributions across treatment groups. A new approach, cardinality matching, leverages mathematical optimization to directly balance observed covariates: the user specifies the tolerable balance constraints on individual covariates and the desired number of matched controls, and the algorithm finds the largest possible match given these constraints. Profile matching is a newly proposed method that uses cardinality matching, in which the user specifies a target profile directly and finds the largest cardinality match that is balanced with respect to the target profile. We developed an R package called ProfileMatchit that implements profile matching. We employed the new package in the setting of hospital quality assessment using a real-world dataset. Profile matching has not yet been used in hospital quality assessment but may be an improvement over current approaches, which are limited in their ability to find sufficient matches in a heterogeneous sample.
This application represents the culmination of our work to develop an improved version of cardinality matching, and it provides both a new application of profile matching and a better approach to hospital quality assessment.
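A minimal sketch of the optimization at the heart of cardinality and profile matching as described above: select as many control units as possible subject to mean-balance constraints around a user-specified target profile, posed as a small mixed-integer linear program with scipy.optimize.milp (SciPy 1.9 and later). The covariates, target profile, and tolerance are placeholders, and this is not the ProfileMatchit implementation.

```python
# Minimal sketch: cardinality matching toward a target profile as a small MILP.
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

rng = np.random.default_rng(8)
n_controls, n_cov = 300, 3
X = rng.normal(size=(n_controls, n_cov)) + np.array([0.3, -0.2, 0.1])  # control pool
target = np.zeros(n_cov)            # target profile (e.g., treated-group means)
tol = 0.05                          # allowed absolute deviation in matched means

# Binary indicator z_j: control j is selected. Maximize the number selected
# subject to |mean_selected(X_p) - target_p| <= tol for every covariate p.
c = -np.ones(n_controls)            # milp minimizes, so negate to maximize the count
rows, lb, ub = [], [], []
for p in range(n_cov):
    dev = X[:, p] - target[p]
    rows.append(dev - tol)          # sum_j z_j * (dev_j - tol) <= 0
    lb.append(-np.inf)
    ub.append(0.0)
    rows.append(dev + tol)          # sum_j z_j * (dev_j + tol) >= 0
    lb.append(0.0)
    ub.append(np.inf)
constraints = LinearConstraint(np.array(rows), lb, ub)

res = milp(c, constraints=constraints, integrality=np.ones(n_controls),
           bounds=Bounds(0, 1))
selected = res.x.round().astype(bool)
print("matched controls:", int(selected.sum()))
print("matched means:   ", X[selected].mean(axis=0).round(3))
```

Adding more covariates, per-covariate tolerances, or a cap on the number of matches only changes the constraint rows; the same maximize-the-match-size structure carries over.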