Statistics Masters Theses
Permanent URI for this collection: hdl:10365/32401
Recent Submissions
Item
Graph two-sample test via empirical likelihood (North Dakota State University, 2024) Zou, Ting
In the past two decades, there has been a notable surge in network data. This proliferation has spurred significant advancements in methods for analyzing networks across various disciplines, including computer science, information sciences, biology, bioinformatics, physics, economics, sociology, and health science. Graph two-sample hypothesis testing, aimed at discerning differences between two populations of networks, arises naturally in diverse scenarios. In this paper, we delve into the essential yet intricate task of testing for equivalence between two networks. Many testing procedures are available; for instance, the t-test based on subgraph counts is one such method. We propose a new test based on the empirical likelihood, run extensive simulations to evaluate the performance of the proposed method, and apply it to a real-world network. In both the simulation experiments and the real data application, the empirical likelihood test consistently outperforms existing subgraph count tests.

Item
A Comparison of the Ansari-Bradley Test and the Moses Test for the Variances (North Dakota State University, 2011) Yuni, Chen
This paper aims to compare the power and significance level of two well-known nonparametric tests, the Ansari-Bradley test and the Moses test, both when the equal-median assumption is satisfied and when it is violated. R code is used to generate random data from several distributions: the normal distribution, the exponential distribution, and the t-distribution with three degrees of freedom. The power and significance level of each test were estimated for each situation based on 10,000 iterations.
Situations with equal sample sizes of 10, 20, and 30, and unequal sample sizes of 10 and 20, 20 and 10, and 20 and 30, were considered for a variety of location parameter shifts. The study shows that when the two location parameters are equal, the Ansari-Bradley test is generally more powerful than the Moses test regardless of the underlying distribution; when the two location parameters differ, the Moses test is generally preferred. The study also shows that when the underlying distribution is symmetric, the Moses test with a large subset size k generally has higher power than the test with smaller k; when the underlying distribution is not symmetric, the Moses test with larger k is more powerful for relatively small sample sizes, and the Moses test with medium k has higher power for relatively large sample sizes.

Item
Estimating the Number of Genes That Are Differentially Expressed in Two Dependent Experiments or Analyses (North Dakota State University, 2022) Lawal, Hammed
Many researchers have used the intersection method to compare the results of differential expression analysis between two or more gene expression experiments. Some methods have been proposed to estimate the number of genes commonly differentially expressed in two independent gene expression experiments or analyses, but no method other than the intersection method has been available for dependent experiments or analyses. In this thesis project, we propose a method for estimating the number of differentially expressed genes in two dependent experiments or analyses. Simulation studies are performed to compare the proposed method to existing methods, and an analysis of a real gene expression data set illustrates the use of the proposed method.

Item
An Analysis of the NBA Draft: Are Teams Drafting Better and Does College Experience Truly Matter (North Dakota State University, 2022) Wolfe, Kevin
This thesis attempts to answer two questions.
Are NBA organizations doing a reasonable job of drafting players and getting better at the process, and does college experience play a significant role in a player's performance during their early NBA career (first three seasons)? Regarding these two questions, our research determined that NBA organizations are not showing any significant improvement in their ability to draft the best available players. This is surprising given the increase in scouting data teams currently have access to. We suspected that this lack of drafting improvement might be related to players entering the NBA with less college experience. However, after determining that college experience does not appear to play a large role in a player's early-career NBA performance, we concluded that experience does not appear to be the reason teams aren't doing a better job of drafting.

Item
A Comparative Multiple Simulation Study for Parametric and Nonparametric Methods in the Identification of Differentially Expressed Genes (North Dakota State University, 2021) Palmer, Daniel Grant
RNA-seq data simulated from a negative binomial distribution, sampled without replacement, or modified from read counts were analyzed to compare differential gene expression analysis methods in terms of false discovery rate control and power. The goals of the study were to determine the optimal sample sizes and proportions of differential expression needed to adequately control the false discovery rate, and to determine which differential gene expression methods performed best under the given simulation methods. Parametric tools such as edgeR and limma-voom tended to be conservative in controlling the false discovery rate from a negative binomial distribution as the proportion of differential expression increased.
For the nonparametric simulation methods, many differential gene expression methods did not adequately control the false discovery rate, and results varied greatly when different reference data sets were used for the simulations.

Item
Robustness of the Eigenvalue Test for Community Structure (North Dakota State University, 2021) Rose, Matthew Allan
Networks can take on many different forms, such as the network of people at the university you attend. Within these networks, community structure may exist: a clustering of nodes that share a common characteristic. Many algorithms exist to extract communities from a network, but these methods depend on the assumption that structure exists within the network, and statistical tests have been proposed to test this assumption. In practice, networks may contain measurement error, usually in the form of missing data or other faults. As a result, a network may not tell the full story at surface level: nodes or edges may be absent from the data, or present when they should not be. We wish to observe the effectiveness of the largest-eigenvalue test for community structure when error is introduced into the network.

Item
A Spectral Two Sample Test (North Dakota State University, 2021) Xu, Yangshuang
Two-sample tests are commonly needed across many topics in statistics, and differentiating populations is a crucial prerequisite for further analysis. A number of statistical tests have been developed for this hypothesis testing problem. In this thesis, we propose a novel method to perform a two-sample test. Specifically, we use the two samples to form a matrix and adopt its largest eigenvalue as the test statistic. Under the null hypothesis, this test statistic follows the Tracy-Widom law as its limiting distribution. We evaluate the performance of the proposed method through an extensive simulation study and a real data application.
The type I error is consistently controlled at the nominal level asymptotically. Our test exhibits competitive power and lower computational cost compared with several well-known two-sample test methods. The real data application also shows the advantage of the proposed method.

Item
Survival Analysis of Treatment Effect For Brain Cancer Based on The Surveillance, Epidemiology, and End Results Database (North Dakota State University, 2020) Mathiason, Madison Jane
Cancer is one of the leading causes of death in the United States. The Surveillance, Epidemiology, and End Results (SEER) data from the National Cancer Institute is a population-based cancer registry that geographically covers 34.6% of the US population. The SEER database was used to model survival time for 21,524 patients with primary malignant brain tumors. Kaplan-Meier survival curves and the log-rank test were used to compare the effect of treatment within each grade. The Cox proportional hazards model was used to show the simultaneous effect of treatment, sex, and age on the risk of death for patients in each grade. Elderly patients had the lowest survival time, while adults had the highest. The risk of death for males was slightly higher than for females. The results demonstrate that the survival curves of the three treatment groups differ significantly only among patients with grade 4 primary brain tumors.

Item
Propensity Score and Survival Analysis for Lung Cancer (North Dakota State University, 2020) Mostofa, Mohammad Gulam
Propensity scores were used to assess covariate balance between black and white groups within each lung cancer stage of a large data set. Pairwise log-rank tests were used to test the equality of survival distributions for treatment and race groups. Cox regression models were used to estimate the hazard ratios for each treatment in all stages. In stage one, radiation combined with surgery was found to be the best treatment. In stage two, chemotherapy was found to be the best option.
Radiation and chemotherapy were found to be the best treatment combination in stage three. Based on hazard ratios, chemotherapy was the best treatment for stage four. Statistically significant differences in survival curves were found between different gender and race combinations in stages one and three, but not in stages two or four.

Item
The Determinants of Aeronautical Charges of U.S. Airports: A Spatial Analysis (North Dakota State University, 2020) Karanki, Fecri
Using U.S. airport data from 2009 through 2016, this thesis examines the determinants of aeronautical charges at large and medium hub airports, accounting for the spatial dependence of neighboring airports in a spatial panel regression model. The major findings of this thesis are: (1) U.S. airports' aeronautical charges are spatially dependent, and neighboring airports' charges are positively correlated; (2) there is evidence of airport cost recovery through non-aeronautical revenues; (3) airports that share non-aeronautical revenues with airlines charge lower aeronautical fees than their peers that do not share revenues; and (4) aeronautical charges increase with higher delays.

Item
A Comparison of Filtering and Normalization Methods in the Statistical Analysis of Gene Expression Experiments (North Dakota State University, 2020) Speicher, Mackenzie Rosa Marie
Both microarray and RNA-seq technologies are powerful tools commonly used in differential expression (DE) analysis. Gene expression levels are compared across treatment groups to determine which genes are differentially expressed. With both technologies, filtering and normalization are important steps in data analysis. In this thesis, real data sets are used to compare current analysis methods for two-color microarray and RNA-seq experiments. A variety of filtering, normalization, and statistical approaches are evaluated.
The results of this study show that although there is still no widely accepted method for the analysis of these types of experiments, the choice of method can largely impact the number of genes declared differentially expressed.

Item
Comparing Performance of ANOVA to Poisson and Negative Binomial Regression When Applied to Count Data (North Dakota State University, 2020) Soumare, Ibrahim
Analysis of variance (ANOVA) is one of the simplest and most widely used models in statistics. ANOVA, however, requires a set of assumptions for the model to be a valid choice and for the inferences to be accurate. Among others, ANOVA assumes the data in question are normally distributed with equal variances. However, data from many disciplines do not meet the assumptions of normality and/or equal variance. Regrettably, researchers do not always check whether the assumptions are met, and if these assumptions are violated, inferences may well be wrong. We conducted a simulation study to compare the performance of standard ANOVA to Poisson and negative binomial models when applied to count data. We considered different combinations of sample sizes and underlying distributions. In this simulation study, we first assessed the type I error for each model involved, and then compared power as well as the quality of the estimated parameters across the models.

Item
Investigating Gender Bias Among Grant Applicants (North Dakota State University, 2020) Heim, Michael Thomas
An ongoing debate in society concerns the existence of a wage gap between genders and society's alleged preference to hire a man over an equally qualified woman. This debate extends from the commercial employment world into the funding of research grants. Given data collected at North Dakota State University between 2012 and 2018, have women who sought federal funding for their research experienced a gender bias? To investigate, a logistic regression model is fit to determine whether gender affects funding probability.
Other characteristics, such as the primary investigator's college, the requested amount, and the research team's makeup of tenured and Caucasian members, are also investigated. It was found that there is no gender bias toward faculty at NDSU. Naturally, there was a bias toward researchers from different colleges and toward proposals requesting less funding. Surprisingly, a bias toward research projects with a higher proportion of Caucasian members was found.

Item
Empirical Comparison of Statistical Tests of Dense Subgraph in Network Data (North Dakota State University, 2020) Fornshell, Caleb Joseph
Network analysis is useful in modeling the structures of different phenomena. A fundamental question in the analysis of network data is whether a network contains community structure. One type of community structure of interest is a dense subgraph. Statistically deciding whether a network contains a dense subgraph can be formulated as a hypothesis test: under the null hypothesis there is no community structure, and under the alternative hypothesis the network contains a dense subgraph. One way of performing this hypothesis test is by counting the frequency of shapes created by all three-node subgraphs. In this study, three different test statistics based on the frequency of three-node subgraph shapes are compared in their ability to detect a dense subgraph in simulated networks of varying size and characteristics.

Item
Factors Associated with Teacher Preparedness and Career Satisfaction in First Year Teachers (North Dakota State University, 2020) Buth, Kevin Ross
The objective of this study is to determine the potential association between teaching state, subject taught, perceived preparation provided by teacher preparedness programs, perceived support from administration and colleagues, and teachers' overall happiness and satisfaction with the university education program they attended.
We use generalized Fisher's exact tests, two-sample t-tests, linear regression, and logistic regression to accomplish this objective. State and subject have very little effect on teacher satisfaction. Teacher support systems are associated with both how well a teacher perceives they were prepared and the satisfaction they experience in their career. How well a teacher feels they were prepared is also associated with teacher satisfaction.

Item
Linear Modeling of Election Results for U.S. House of Representatives Candidates and State Executive Offices for Iowa, Minnesota, and North Dakota (North Dakota State University, 2020) McEwen, Christopher
Better understanding the relationship between results for the U.S. House of Representatives and for state executive offices could be useful in predicting outcomes, provided a significant relationship is present and one has more information about either the U.S. House of Representatives election or the state executive office election. To better understand this relationship, election results for three upper Midwest states (Iowa, Minnesota, and North Dakota) were analyzed using regression models to compare the outcomes of state executive office elections and U.S. House of Representatives elections. Additionally, median income was included in the models to see whether it affected the relationship. Each state had a statistically significant relationship between the results of the state executive offices and the U.S. House of Representatives. Median income was either not statistically significant or not practically significant in its overall effect on the relationship.

Item
A Comparison of Two Scaling Techniques to Reduce Uncertainty in Predictive Models (North Dakota State University, 2020) Todd, Austin Luke
This research examines the use of two scaling techniques to accurately transfer information from small-scale data to large-scale predictions in a handful of nonlinear functions.
The two techniques are (1) using random draws from distributions that represent smaller time scales and (2) using a single draw from a distribution representing the mean over all time represented by the model. This research used simulation to create the underlying distributions for the variable and parameters of the chosen functions, which were then scaled accordingly. Once scaled, the variable and parameters were plugged into the chosen functions to give an output value. Using simulation, output distributions were created for each combination of scaling technique, underlying distribution, variable bounds, and parameter bounds. These distributions were then compared using a variety of statistical tests, measures, and graphical plots.

Item
Type I Error Assessment and Power Comparison of ANOVA and Zero-Inflated Methods on Zero-Inflated Data (North Dakota State University, 2019) Young, Lucas Blackmore
Many tests for the analysis of continuous data (e.g., ANOVA, regression) rest on the underlying assumption that the data in question follow a normal distribution. Within certain research topics, it is common to end up with a data set that has a disproportionately high number of zero values but is otherwise relatively normal. Such data sets are often referred to as "zero-inflated," and their analysis can be challenging; one area in which they arise is plant science. We conducted a simulation study to compare the performance of zero-inflated models to a standard ANOVA model on different types of zero-inflated data. Underlying distributions, experimental design scenarios, sample sizes, and percentages of zeros were the variables of consideration.
In this study, we conduct a type I error assessment followed by a power comparison between the models.

Item
Bayesian Sparse Factor Analysis of High Dimensional Gene Expression Data (North Dakota State University, 2019) Zhao, Jingjun
This work closely studied the fundamental techniques of the Bayesian sparse factor analysis model: constrained least squares regression, Bayesian lasso regression, and some popular sparsity-inducing priors. In Appendix A, we introduce each of the fundamental techniques in a coherent manner and provide detailed proofs of important formulas and definitions; we consider this introduction and the accompanying proofs, which are very helpful in learning Bayesian sparse factor analysis, a contribution of this work. We also systematically studied BicMix, a computationally tractable biclustering approach for identifying co-regulated genes, by deriving all point estimates of the parameters and by running the method on both simulated data sets and a real high-dimensional gene expression data set. The previously missing derivations of all point estimates in BicMix are provided for a better understanding of the variational expectation maximization (VEM) algorithm, and the performance of the method in identifying true biclusters is analyzed using the experimental results.

Item
Empirical Study of Two Hypothesis Test Methods for Community Structure in Networks (North Dakota State University, 2019) Nan, Yehong
Many real-world network data can be formulated as graphs, where a binary relation exists between nodes. One of the fundamental problems in network data analysis is community detection: clustering the nodes into different groups. Statistically, this problem can be formulated as hypothesis testing: under the null hypothesis there is no community structure, while under the alternative hypothesis community structure exists. One method uses the largest eigenvalue of the scaled adjacency matrix, proposed by Bickel and Sarkar (2016), which works for dense graphs.
Another is the subgraph counting method proposed by Gao and Lafferty (2017a), which is valid for sparse networks. In this paper, we first empirically study the BS and GL methods to see whether either of them works for moderately sparse networks; second, we propose a subsampling method to reduce the computation of the BS method and run simulations to evaluate its performance.
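Several abstracts above reference the largest-eigenvalue (Bickel-Sarkar) test for community structure. As a rough illustration only, a minimal Python sketch of that kind of statistic under an Erdos-Renyi null might look like the following; the function name is invented here, the centering/scaling follows the commonly cited form of the test, and the approximate Tracy-Widom (TW1) critical value mentioned in the comment is an assumption taken from published tables, not from any of these theses:

```python
import numpy as np

def largest_eigenvalue_statistic(A):
    """Largest-eigenvalue test statistic for community structure.

    A is the 0/1 symmetric adjacency matrix of an undirected network.
    Under an Erdos-Renyi null, the centered and scaled adjacency matrix
    has largest eigenvalue near 2, and n^(2/3) * (lambda_1 - 2) is
    approximately Tracy-Widom (TW1) distributed.
    """
    n = A.shape[0]
    # Estimate the null edge probability from the upper triangle.
    p_hat = A[np.triu_indices(n, k=1)].mean()
    # Center and scale the adjacency matrix under the null.
    J = np.ones((n, n)) - np.eye(n)
    A_tilde = (A - p_hat * J) / np.sqrt((n - 1) * p_hat * (1 - p_hat))
    # eigvalsh returns eigenvalues in ascending order; take the largest.
    lam1 = np.linalg.eigvalsh(A_tilde)[-1]
    return n ** (2.0 / 3.0) * (lam1 - 2.0)

# Simulate an Erdos-Renyi graph (no community structure) as a sanity check.
rng = np.random.default_rng(42)
n, p = 200, 0.1
upper = np.triu(rng.random((n, n)) < p, k=1)
A = (upper | upper.T).astype(float)

stat = largest_eigenvalue_statistic(A)
# A test at roughly the 5% level would reject when stat exceeds the
# TW1 0.95 quantile (approximately 0.98, per published tables).
```

A two-block stochastic block model with distinct within- and between-block probabilities would, in the same sketch, tend to push the statistic far above that cutoff, which is the behavior the eigenvalue-based tests above exploit.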