Statistics Masters Theses
Permanent URI for this collection: hdl:10365/32401
Browsing Statistics Masters Theses by department "Statistics", showing items 1-20 of 65.
Item: Analysis of Bootstrap Techniques for Loss Reserving (North Dakota State University, 2015). Chase, Taryn Ruth.
Insurance companies must have an appropriate method of estimating future reserve amounts. These values will directly influence the rates that are charged to the customer. This thesis analyzes stochastic reserving techniques that use bootstrap methods to obtain variability estimates of predicted reserves. Bootstrapping techniques are of interest because they usually do not require advanced statistical software to implement. Some bootstrap techniques have incorporated generalized linear models in order to produce results. To analyze how well these methods perform, data with known future losses were obtained from the National Association of Insurance Commissioners. Analysis of these data shows that most bootstrapping methods produce results that are comparable to one another and to the trusted Chain Ladder method. The methods are then applied to loss data from a small Midwestern insurance company to predict the variation of their future reserve amounts.

Item: An Analysis of Factors Contributing to Wins in the National Hockey League (North Dakota State University, 2013). Roith, Joseph Michael.
This thesis looks at the common factors that have the largest impact on winning games in the NHL. Data were collected from regular season games for all teams in the NHL over seven seasons. Logistic and least squares regressions were performed to create a win probability model and a goal margin model to predict the outcome of games. Discriminant analysis was also used to determine significant factors over the course of an entire season. Save percentage margin, shot margin, block margin, short-handed shot margin, short-handed faceoff percentage, and even-handed faceoff percentage were found to be significant influences on individual game wins. Total goals, total goals against, and takeaway totals for a season were enough to correctly predict whether a team made the playoffs 87% of the time. The accuracy of the models was then tested by predicting the outcomes of games from the 2012 NHL regular season.

Item: Analysis of Salary for Major League Baseball Players (North Dakota State University, 2014). Hoffman, Michael Glenn.
This thesis examines the salaries of Major League Baseball (MLB) players and whether players are paid based on their on-the-field performance. Each salary was examined in relation to both the yearly production and the overall career production of the player. Several different production statistics were collected for the 2010-2012 MLB seasons. A random sample of players was selected from each season, and separate models were created for position players and pitchers. Significant production statistics that were helpful in predicting salary were selected for each model. The models were deemed good, each having a predictive R-squared value of at least 0.70. After the regression models were found, they were tested for accuracy by predicting the salaries of a random sample of players from the 2013 MLB season.
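A predictive R-squared of the kind used as the benchmark in the salary models above can be computed from the PRESS statistic of a fitted least squares model. The following is a minimal sketch only; the data frame hitters and the predictors hr, rbi, and avg are hypothetical placeholders, not the thesis data.

```r
# Hedged sketch: predictive R-squared via the PRESS statistic for an
# ordinary least squares salary model. 'hitters' and the predictors
# hr, rbi, avg are hypothetical placeholders, not the thesis data.
fit <- lm(salary ~ hr + rbi + avg, data = hitters)
press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)  # leave-one-out residuals
pred_r2 <- 1 - press / sum((hitters$salary - mean(hitters$salary))^2)
pred_r2  # the abstract above cites 0.70 as the threshold for a good model
```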
Item: Analysis of Significant Factors in Division I Men's College Basketball and Development of a Predictive Model (North Dakota State University, 2013). Unruh, Samuel Paul.
While a number of statistics are collected during an NCAA Division I men's college basketball game, it is potentially of interest to universities, coaches, players, and fans which of these variables are most significant in determining wins and losses. To this end, statistics were collected from two seasons of games and analyzed using logistic and least squares regression methods. The differences between the two competing teams in four common statistics were found to be significant in determining victory: assists, free throw attempts, defensive rebounds, and turnovers. The logistic and least squares models were then used with data from the 2011-2012 season to verify the accuracy of the models. To determine the accuracy of the models in predicting future game outcomes, median statistics from the four prior games were collected for teams competing in a sample of games from 2011-2012, with the differences taken and used in the models.

Item: An Analysis of the NBA Draft: Are Teams Drafting Better and Does College Experience Truly Matter (North Dakota State University, 2022). Wolfe, Kevin.
This thesis attempts to answer two questions: are NBA organizations doing a reasonable job of drafting players and getting better at the process, and does college experience play a significant role in a player's performance during their early NBA career (first three seasons)? With regard to these two questions, we determined through our research that NBA organizations are not showing any significant improvement in their ability to draft the best available players. This is surprising given the increase in available scouting data that teams currently have access to. We suspected, however, that this lack of improvement in drafting might be related to players entering the NBA with less college experience. After finding that college experience does not appear to play a large role in a player's early-career NBA performance, we concluded that experience does not appear to be the reason why teams are not doing a better job of drafting.

Item: Analyzing and Controlling Biases in Student Rating of Instruction (North Dakota State University, 2019). Zhou, Yue.
Many colleges and universities have adopted the student ratings of instruction (SROI) system as one of their measures of instructional effectiveness. This study aims to establish a predictive model and address two questions related to SROI: first, whether gender bias against female instructors exists at North Dakota State University (NDSU), and second, how other factors related to students, instructors, and courses affect the SROI. In total, 30,303 SROI from seven colleges at NDSU for the 2013-2014 academic year are studied. Our results demonstrate that there is a significant association between student gender and instructor gender in the rating scores. Therefore, we cannot determine how the gender of an instructor affects the course rating unless we know the gender composition of the students in that class. Predictive proportional odds models for the students' ordinal categorical ratings are established.
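Proportional odds models of the kind mentioned in the SROI item above can be fit in R with MASS::polr. The sketch below is a minimal illustration only; the data frame sroi and its columns (rating as an ordered factor, student_gender, instructor_gender, college) are hypothetical names, not the NDSU data.

```r
# Hedged sketch: proportional odds model for ordinal rating scores with a
# student-gender by instructor-gender interaction. 'sroi' and its columns
# are hypothetical placeholders, not the actual NDSU data set.
library(MASS)
fit <- polr(rating ~ student_gender * instructor_gender + college,
            data = sroi, Hess = TRUE)
summary(fit)    # coefficients on the cumulative log-odds scale
exp(coef(fit))  # odds ratios associated with each predictor
```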
Item: Bayesian Sparse Factor Analysis of High Dimensional Gene Expression Data (North Dakota State University, 2019). Zhao, Jingjun.
This work closely studied fundamental techniques of the Bayesian sparse factor analysis model: constrained least squares regression, Bayesian Lasso regression, and some popular sparsity-inducing priors. In Appendix A, we introduce each of the fundamental techniques in a coherent manner and provide detailed proofs for important formulas and definitions. We consider this introduction and the detailed proofs, which are very helpful in learning Bayesian sparse factor analysis, to be a contribution of this work. We also systematically studied BicMix, a computationally tractable biclustering approach for identifying co-regulated genes, by deriving all point estimates of the parameters and by running the method on both simulated data sets and a real high-dimensional gene expression data set. The derivations of all point estimates in BicMix, which were previously missing, are provided for a better understanding of the variational expectation maximization (VEM) algorithm. The performance of the method in identifying true biclusters is analyzed using the experimental results.

Item: Bracketing the NCAA Women's Basketball Tournament (North Dakota State University, 2014). Wang, Wenting.
This paper presents a bracketing method for all 63 games in the NCAA Division I Women's basketball tournament. Least squares models and logistic regression models for Round 1, Round 2, and Rounds 3-6 were developed to predict the winners of basketball games in each of those rounds of the NCAA Women's Basketball tournament. For the first round, three-point goals, free throws, blocks, and seed were found to be significant; for the second round, field goals and average points were found to be significant; for the third and higher rounds, assists, steals, and seed were found to be significant. A complete bracket was filled out in 2014 before any game was played. When the differences of the seasonal averages for both teams in all previously mentioned variables were entered into the least squares models, the models had approximately a 76% chance of correctly predicting the winner of a basketball game.

Item: Comparative Analysis of Traditional and Modified DECODE Method in Small Sample Gene Expression Experiments (North Dakota State University, 2018). Neset, Katie.
Background: The DECODE method integrates differential co-expression and differential expression analysis methods to better understand the biological functions of genes and their associations with disease. The DECODE method was originally designed to analyze large-sample gene expression experiments; however, most gene expression experiments consist of small sample sizes. This paper proposes a modified test statistic to replace the traditional test statistic in the DECODE method. Using three simulation studies, we compare the performance of the modified and traditional DECODE methods using measures of sensitivity, positive predictive value (PPV), false discovery rate (FDR), and overall error rate for genes found to be highly differentially expressed and highly differentially co-expressed. Results: In comparisons of sensitivity and PPV, a minor increase is seen when using the modified DECODE method, along with a minor decrease in FDR and overall error rate. Thus, the modified DECODE method is recommended for small sample sizes.
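The performance measures used in the DECODE comparison above can all be computed from a cross-tabulation of simulation truth against method calls. A minimal sketch, assuming hypothetical logical vectors truth (genes simulated as differentially expressed or co-expressed) and called (genes flagged by the method):

```r
# Hedged sketch: sensitivity, PPV, FDR, and overall error rate from
# hypothetical logical vectors 'truth' and 'called' of equal length.
tp <- sum(called & truth)
fp <- sum(called & !truth)
fn <- sum(!called & truth)
sensitivity <- tp / (tp + fn)
ppv         <- tp / (tp + fp)
fdr         <- fp / max(tp + fp, 1)       # guard against zero calls
error_rate  <- (fp + fn) / length(truth)
```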
Item: Comparative Classification of Prostate Cancer Data using the Support Vector Machine, Random Forest, DualKS and k-Nearest Neighbours (North Dakota State University, 2015). Sakouvogui, Kekoura.
This paper compares four classification tools, the Support Vector Machine (SVM), Random Forest (RF), DualKS, and k-Nearest Neighbors (kNN), which are based on different statistical learning theories. The dataset used is a microarray gene expression dataset of 596 male patients with prostate cancer. After treatment, the patients were classified into one phenotype group with three levels: PSA (Prostate-Specific Antigen), Systematic, and NED (No Evidence of Disease). The purpose of this research is to determine the performance rate of each classifier by selecting the optimal kernels and parameters that give the best prediction rate of the phenotype. The paper begins with a discussion of previous implementations of the tools and their mathematical theories. The results showed that three of the classifiers achieved comparable, above-average performance, while DualKS did not. We also observed that SVM outperformed the kNN, RF, and DualKS classifiers.

Item: A Comparative Multiple Simulation Study for Parametric and Nonparametric Methods in the Identification of Differentially Expressed Genes (North Dakota State University, 2021). Palmer, Daniel Grant.
RNA-seq data simulated from a negative binomial distribution, sampled without replacement, or modified from read counts were analyzed to compare differential gene expression analysis methods in terms of false discovery rate control and power. The goals of the study were to determine the optimal sample sizes and proportions of differential expression needed to adequately control the false discovery rate, and to determine which differential gene expression methods performed best with the given simulation methods. Parametric tools like edgeR and limma-voom tended to be conservative in controlling the false discovery rate for data simulated from a negative binomial distribution as the proportion of differential expression increased. For the nonparametric simulation methods, many differential gene expression methods did not adequately control the false discovery rate, and results varied greatly when different reference data sets were used for the simulations.

Item: Comparing Accuracies of Spatial Interpolation Methods on 1-Minute Ground Magnetometer Readings (North Dakota State University, 2017). Campbell, Kathryn Mary.
Geomagnetic disturbances caused by external solar events can create geomagnetically induced currents (GIC) throughout conducting networks on Earth's surface. GIC can cause disruption that scales from minor to catastrophic. However, systems can implement preemptive measures to mitigate the effects of GIC with the use of GIC forecasting. Accurate forecasting depends on accurate modeling of Earth's geomagnetic field. Unfortunately, it is not currently possible to have a measurement at every point of Earth's field, so spatial interpolation methods can be implemented to fill in the unmeasured space. The performance of two spatial interpolation methods, Inverse Distance Weighting and Kriging, is assessed to determine which better predicts the unmeasured space. Error testing shows the two methods to be comparable, with the caveat that Kriging gives tighter precision in its predictions.
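Of the two interpolation methods compared in the magnetometer item above, inverse distance weighting is simple enough to express directly. The following is a minimal sketch of an IDW predictor at a single unmeasured location, with the power parameter p = 2 chosen only for illustration; the station coordinates and readings are hypothetical inputs.

```r
# Hedged sketch: inverse distance weighting at a single unmeasured point.
# x, y are station coordinates and z the observed readings; all hypothetical.
idw_predict <- function(x0, y0, x, y, z, p = 2) {
  d <- sqrt((x - x0)^2 + (y - y0)^2)
  if (any(d == 0)) return(z[which.min(d)])  # exact hit on a station
  w <- 1 / d^p                              # closer stations get more weight
  sum(w * z) / sum(w)
}
```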
Item: Comparing Dunnett's Test with the False Discovery Rate Method: A Simulation Study (North Dakota State University, 2013). Kubat, Jamie.
Recently, the idea of multiple comparisons has been criticized because of its lack of power in datasets with a large number of treatments. Many family-wise error corrections are far too restrictive when large quantities of comparisons are being made. At the other extreme, a test like the least significant difference does not control the family-wise error rate and is therefore not restrictive enough to identify true differences. A solution lies in multiple testing. The false discovery rate (FDR) uses a simple algorithm and can be applied to datasets with many treatments. The current research compares the FDR method to Dunnett's test using agronomic data from a study with 196 varieties of dry beans. Simulated data are used to assess the type I error and power of the tests. In general, the FDR method provides higher power than Dunnett's test while maintaining control of the type I error rate.

Item: Comparing Performance of ANOVA to Poisson and Negative Binomial Regression When Applied to Count Data (North Dakota State University, 2020). Soumare, Ibrahim.
Analysis of Variance (ANOVA) is one of the simplest and most widely used models in statistics. ANOVA, however, requires a set of assumptions for the model to be a valid choice and for the inferences to be accurate. Among others, ANOVA assumes the data in question are normally distributed and homogeneous. However, data from most disciplines do not meet the assumptions of normality and/or equal variance. Regrettably, researchers do not always check whether these assumptions are met, and if they are violated, inferences may well be wrong. We conducted a simulation study to compare the performance of standard ANOVA to Poisson and negative binomial models when applied to count data. We considered different combinations of sample sizes and underlying distributions. In this simulation study, we first assessed the type I error of each model involved. We then compared power as well as the quality of the estimated parameters across the models.

Item: Comparing Total Hip Replacement Drug Treatments for Cost and Length of Stay (North Dakota State University, 2015). Huebner, Blake James.
The objective of this study is to identify the potential effects that anticoagulants, spinal blocks, and antifibrinolytics have on overall cost, length of stay, and re-admission rates for total hip replacement patients. We use ordinary least squares regression, multiple comparison testing, logistic regression, and chi-square tests to fulfill this objective. The combination of warfarin and enoxaparin is associated with the highest cost and length of stay among the anticoagulants studied. There is no clear combination of spinal blocks associated with the highest cost and length of stay. Tranexamic acid is associated with a reduction in length of stay and in the likelihood of receiving a blood transfusion, while not increasing overall cost. No drug combination in any category is associated with a change in re-admission rates.

Item: A Comparison of Filtering and Normalization Methods in the Statistical Analysis of Gene Expression Experiments (North Dakota State University, 2020). Speicher, Mackenzie Rosa Marie.
Both microarray and RNA-seq technologies are powerful tools that are commonly used in differential expression (DE) analysis. Gene expression levels are compared across treatment groups to determine which genes are differentially expressed. With both technologies, filtering and normalization are important steps in data analysis. In this thesis, real datasets are used to compare current analysis methods for two-color microarray and RNA-seq experiments. A variety of filtering, normalization, and statistical approaches are evaluated. The results of this study show that, although there is still no widely accepted method for the analysis of these types of experiments, the method chosen can largely impact the number of genes that are declared to be differentially expressed.
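As one concrete example of the filtering and normalization steps discussed in the preceding item, the sketch below shows a common edgeR-based RNA-seq workflow. The objects counts (a gene-by-sample matrix) and group are hypothetical, and this is only one of several reasonable pipelines a comparison like this one might include.

```r
# Hedged sketch: one common RNA-seq filtering and normalization pipeline.
# 'counts' (gene-by-sample matrix) and 'group' are hypothetical objects.
library(edgeR)
y <- DGEList(counts = counts, group = group)
keep <- filterByExpr(y)                  # drop consistently low-count genes
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)                  # TMM normalization factors
```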
Item: A Comparison of Methods Taking into Account Asymmetry when Evaluating Differential Expression in Gene Expression Experiments (North Dakota State University, 2018). Tchakounte Wakem, Seguy.
Gene expression technologies allow expression levels to be compared across treatments for thousands of genes simultaneously. Asymmetry in the empirical distribution of the test statistics from the analysis of a gene expression experiment is often observed. Statistical methods exist for identifying differentially expressed (DE) genes while controlling multiple testing error and taking into account the asymmetry of the distribution of the effect sizes. This paper compares three statistical methods (the Modified Q-value, Modified SAM, and Asymmetric Local False Discovery Rate) used to identify DE genes that take such patterns into account while controlling the false discovery rate (FDR). The results of the simulation studies performed suggest that the Modified Q-value method outperforms the other methods most of the time and also better controls the FDR.

Item: Comparison of Proposed K Sample Tests with Dietz's Test for Nondecreasing Ordered Alternatives for Bivariate Normal Data (North Dakota State University, 2011). Zhao, Yanchun.
There are many situations in which researchers want to consider a set of response variables simultaneously rather than just one response variable. For instance, a researcher may wish to determine the effects of an exercise and diet program on both the cholesterol levels and the weights of obese subjects. Dietz (1989) proposed two multivariate generalizations of the Jonckheere test for ordered alternatives. In this study, we propose k-sample tests for nondecreasing ordered alternatives for bivariate normal data and compare their powers with Dietz's sum statistic. The proposed k-sample tests are based on transformations of the bivariate data to univariate data. The transformations considered are the sum, maximum, and minimum functions; the ideas for these transformations come from Leconte, Moreau, and Lellouch (1994). After the underlying bivariate normal data are reduced to univariate data, the Jonckheere-Terpstra (JT) test (Terpstra, 1952; Jonckheere, 1954) and the Modified Jonckheere-Terpstra (MJT) test (Tryon and Hettmansperger, 1973) are applied to the univariate data. A simulation study is conducted to compare the proposed tests with Dietz's test for k bivariate normal populations (k = 3, 4, 5). A variety of sample sizes and various location shifts are considered, and two different correlations are used for the bivariate normal distributions. The simulation results show that the Dietz test generally performs best for the situations considered with an underlying bivariate normal distribution. The estimated powers of the MJT-sum and JT-sum tests are often close, with the MJT-sum generally having slightly higher power. The sum transformation was the best of the three transformations to use for bivariate normal data.
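The sum transformation and the Jonckheere-Terpstra statistic from the preceding item can be sketched directly. In the sketch below, y1 and y2 are the two response variables and grp holds ordered group labels 1 through k; all of these names are hypothetical.

```r
# Hedged sketch: reduce bivariate responses to univariate data with the sum
# transformation, then compute the Jonckheere-Terpstra statistic by hand.
# 'y1', 'y2', and 'grp' (ordered labels 1..k) are hypothetical objects.
u <- y1 + y2
k <- length(unique(grp))
jt_stat <- 0
for (i in 1:(k - 1)) {
  for (j in (i + 1):k) {
    ui <- u[grp == i]; uj <- u[grp == j]
    jt_stat <- jt_stat + sum(outer(ui, uj, "<")) + 0.5 * sum(outer(ui, uj, "=="))
  }
}
jt_stat  # large values support the nondecreasing ordered alternative
```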
Item: A Comparison of the Ansari-Bradley Test and the Moses Test for the Variances (North Dakota State University, 2011). Yuni, Chen.
This paper aims to compare the powers and significance levels of two well-known nonparametric tests, the Ansari-Bradley test and the Moses test, both when the equal-median assumption is satisfied and when it is violated. R code is used to generate random data from several distributions: the normal distribution, the exponential distribution, and the t-distribution with three degrees of freedom. The power and significance level of each test were estimated for a given situation based on 10,000 iterations. Situations with equal samples of size 10, 20, and 30, and unequal samples of sizes 10 and 20, 20 and 10, and 20 and 30 were considered for a variety of different location parameter shifts. The study shows that when the two location parameters are equal, the Ansari-Bradley test is generally more powerful than the Moses test regardless of the underlying distribution; when the two location parameters are different, the Moses test is generally preferred. The study also shows that when the underlying distribution is symmetric, the Moses test with a large subset size k generally has higher power than the test with a smaller k; when the underlying distribution is not symmetric, the Moses test with larger k is more powerful for relatively small sample sizes, and the Moses test with medium k has higher power for relatively large sample sizes.

Item: A Comparison of Two Scaling Techniques to Reduce Uncertainty in Predictive Models (North Dakota State University, 2020). Todd, Austin Luke.
This research examines the use of two scaling techniques to accurately transfer information from small-scale data to large-scale predictions in a handful of nonlinear functions. The two techniques are (1) using random draws from distributions that represent smaller time scales and (2) using a single draw from a distribution representing the mean over all time represented by the model. This research used simulation to create the underlying distributions for the variable and parameters of the chosen functions, which were then scaled accordingly. Once scaled, the variable and parameters were plugged into the chosen functions to give an output value. Using simulation, output distributions were created for each combination of scaling technique, underlying distribution, variable bounds, and parameter bounds. These distributions were then compared using a variety of statistical tests, measures, and graphical plots.
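A minimal sketch of the two scaling techniques described in the preceding item, applied to a simple nonlinear function f(x) = exp(x). The lognormal small-scale distribution, the number of small-scale draws, and the simulation size are illustrative choices only, not the settings used in the thesis.

```r
# Hedged sketch: contrast (1) pushing many small-scale draws through the
# function and aggregating, with (2) pushing a single draw of the mean
# through the function. All distributional choices are illustrative.
set.seed(1)
f <- function(x) exp(x)
n_sim <- 10000; n_small <- 365
out_draws <- replicate(n_sim, mean(f(rlnorm(n_small, 0, 0.5))))  # technique (1)
out_mean  <- replicate(n_sim, f(mean(rlnorm(n_small, 0, 0.5))))  # technique (2)
summary(out_draws); summary(out_mean)  # compare the two output distributions
```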