Statistics Masters Theses
Permanent URI for this collection: hdl:10365/32401
Browsing Statistics Masters Theses by program "Statistics"
Now showing 1 - 20 of 51
Item: Analysis of Bootstrap Techniques for Loss Reserving (North Dakota State University, 2015). Chase, Taryn Ruth.
Insurance companies must have an appropriate method of estimating future reserve amounts. These values directly influence the rates that are charged to customers. This thesis analyzes stochastic reserving techniques that use bootstrap methods to obtain variability estimates of predicted reserves. Bootstrapping techniques are of interest because they usually do not require advanced statistical software to implement. Some bootstrap techniques incorporate generalized linear models in order to produce results. To analyze how well these methods perform, data with known future losses was obtained from the National Association of Insurance Commissioners. Analysis of this data shows that most bootstrapping methods produce results that are comparable to one another and to the trusted Chain Ladder method. The methods are then applied to loss data from a small Midwestern insurance company to predict the variation of its future reserve amounts.
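
A sketch can make the kind of bootstrap described above concrete. The R snippet below (an illustration on an invented toy triangle, not the thesis's own code or data) resamples Pearson residuals from an over-dispersed Poisson chain-ladder fit to estimate the variability of the predicted reserve:

```r
# Minimal sketch of a residual bootstrap for loss reserving; the triangle
# is invented, and NA marks the future cells to be predicted.
set.seed(1)
tri <- expand.grid(origin = factor(1:4), dev = factor(1:4))
tri$inc <- c(100, 110, 120, 130,   # incremental claims, development year 1
              60,  65,  70,  NA,   # development year 2
              30,  32,  NA,  NA,   # development year 3
              10,  NA,  NA,  NA)   # development year 4
past <- !is.na(tri$inc)

# An over-dispersed Poisson GLM reproduces the chain-ladder fitted values
fit <- glm(inc ~ origin + dev, family = quasipoisson, data = tri[past, ])
mu  <- fitted(fit)
res <- (tri$inc[past] - mu) / sqrt(mu)             # Pearson residuals

reserves <- replicate(999, {
  y_star <- pmax(mu + sample(res, replace = TRUE) * sqrt(mu), 0)
  f_star <- glm(y_star ~ origin + dev, family = quasipoisson,
                data = tri[past, ])
  sum(predict(f_star, newdata = tri[!past, ], type = "response"))
})
c(mean = mean(reserves), sd = sd(reserves))   # reserve and its variability
```
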
Item: An Analysis of Factors Contributing to Wins in the National Hockey League (North Dakota State University, 2013). Roith, Joseph Michael.
This thesis looks at the common factors that have the largest impact on winning games in the NHL. Data was collected from regular-season games for all teams in the NHL over seven seasons. Logistic and least squares regressions were performed to create a win probability model and a goal margin model for predicting the outcome of games. Discriminant analysis was also used to determine significant factors over the course of an entire season. Save percentage margin, shot margin, block margin, short-handed shot margin, short-handed faceoff percentage, and even-handed faceoff percentage were found to be significant influences on individual game wins. Total goals, total goals against, and takeaway totals for a season were enough to correctly predict whether a team made the playoffs 87% of the time. The accuracy of the models was then tested by predicting the outcomes of games from the 2012 NHL regular season.

Item: Analysis of Salary for Major League Baseball Players (North Dakota State University, 2014). Hoffman, Michael Glenn.
This thesis examines the salary of Major League Baseball (MLB) players and whether players are paid based on their on-the-field performance. Salary was examined against both the yearly production and the overall career production of each player. Several production statistics were collected for the 2010-2012 MLB seasons. A random sample of players was selected from each season, and separate models were created for position players and pitchers. Production statistics that were helpful in predicting salary were selected for each model. The models were deemed good, each having a predictive R-squared value of at least 0.70. After the regression models were found, they were tested for accuracy by predicting the salaries of a random sample of players from the 2013 MLB season.

Item: Analysis of Significant Factors in Division I Men's College Basketball and Development of a Predictive Model (North Dakota State University, 2013). Unruh, Samuel Paul.
While a number of statistics are collected during an NCAA Division I men's college basketball game, it is of potential interest to universities, coaches, players, and fans which of these variables are most significant in determining wins and losses. To this end, statistics were collected from two seasons of games and analyzed using logistic and least squares regression methods. The differences between the two competing teams in four common statistics were found to be significant in determining victory: assists, free throw attempts, defensive rebounds, and turnovers. The logistic and least squares models were then used with data from the 2011-2012 season to verify the accuracy of the models. To determine the accuracy of the models in predicting future game outcomes, medians of each statistic from the four prior games were collected for teams competing in a sample of games from 2011-2012, with the differences taken and used in the models.

Item: An Analysis of the NBA Draft: Are Teams Drafting Better and Does College Experience Truly Matter (North Dakota State University, 2022). Wolfe, Kevin.
This thesis attempts to answer two questions: are NBA organizations doing a reasonable job of drafting players and getting better at the process, and does college experience play a significant role in a player's performance during the early NBA career (first three seasons)? With regard to the first question, we determined through our research that NBA organizations are not showing any significant improvement in their ability to draft the best available players, which is surprising given the increase in scouting data that teams currently have access to. We suspected that this lack of improvement might be related to players entering the NBA with less college experience. However, after determining that college experience does not appear to play a large role in a player's early-career NBA performance, we concluded that experience does not appear to be the reason why teams aren't drafting better.

Item: Bayesian Sparse Factor Analysis of High Dimensional Gene Expression Data (North Dakota State University, 2019). Zhao, Jingjun.
This work closely studied the fundamental techniques of the Bayesian sparse factor analysis model: constrained least squares regression, Bayesian Lasso regression, and some popular sparsity-inducing priors. Appendix A introduces each of the fundamental techniques in a coherent manner and provides detailed proofs of important formulas and definitions; this introduction and the accompanying proofs, which are very helpful in learning Bayesian sparse factor analysis, are a contribution of this work. We also systematically studied BicMix, a computationally tractable biclustering approach for identifying co-regulated genes, by deriving all point estimates of the parameters and by running the method on both simulated data sets and a real high-dimensional gene expression data set. The previously undocumented derivations of the point estimates in BicMix are provided for a better understanding of the variational expectation maximization (VEM) algorithm, and the method's performance in identifying true biclusters is analyzed using the experimental results.

Item: Bracketing the NCAA Women's Basketball Tournament (North Dakota State University, 2014). Wang, Wenting.
This paper presents a bracketing method for all 63 games in the NCAA Division I Women's basketball tournament. Least squares models and logistic regression models were developed for Round 1, Round 2, and Rounds 3-6, to predict the winners of basketball games in each of those rounds of the tournament. For the first round, three-point goals, free throws, blocks, and seed were found to be significant; for the second round, field goals and average points were found to be significant; for the third and higher rounds, assists, steals, and seed were found to be significant. A complete bracket was filled out in 2014 before any game was played. When the differences of the seasonal averages for both teams on all previously mentioned variables were entered into the least squares models, the models had approximately a 76% chance of correctly predicting the winner of a basketball game.
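
Several of the sports entries above fit logistic regression models to game outcomes. The sketch below shows the general shape of such a win-probability model in R, using hypothetical round-one predictors (differences in seed, three-point goals, free throws, and blocks) and simulated data rather than anything from the theses:

```r
# Generic win-probability sketch; the data frame and its columns are
# hypothetical, not from any of the theses above.
set.seed(2)
games <- data.frame(
  win        = rbinom(200, 1, 0.5),  # 1 = first-listed team won
  seed_diff  = sample(-15:15, 200, replace = TRUE),
  three_diff = rnorm(200),           # difference in three-point goals
  ft_diff    = rnorm(200),           # difference in free throws
  blk_diff   = rnorm(200)            # difference in blocks
)
fit <- glm(win ~ seed_diff + three_diff + ft_diff + blk_diff,
           family = binomial, data = games)
summary(fit)                         # which differences are significant?

# Estimated probability that the first-listed team wins a new matchup
new_game <- data.frame(seed_diff = -4, three_diff = 1.1,
                       ft_diff = 2.3, blk_diff = 0.5)
predict(fit, newdata = new_game, type = "response")
```
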
Item: Comparative Analysis of Traditional and Modified DECODE Method in Small Sample Gene Expression Experiments (North Dakota State University, 2018). Neset, Katie.
Background: The DECODE method integrates differential co-expression and differential expression analysis methods to better understand the biological functions of genes and their associations with disease. The DECODE method was originally designed to analyze large-sample gene expression experiments; however, most gene expression experiments have small sample sizes. This paper proposes a modified test statistic to replace the traditional test statistic in the DECODE method. Using three simulation studies, we compare the performance of the modified and traditional DECODE methods with measures of sensitivity, positive predictive value (PPV), false discovery rate (FDR), and overall error rate for genes found to be highly differentially expressed and highly differentially co-expressed. Results: The modified DECODE method shows a minor increase in sensitivity and PPV, along with a minor decrease in FDR and overall error rate. Thus, the modified DECODE method is recommended for small sample sizes.

Item: Comparative Classification of Prostate Cancer Data using the Support Vector Machine, Random Forest, DualKS and k-Nearest Neighbours (North Dakota State University, 2015). Sakouvogui, Kekoura.
This paper compares four classification tools based on different statistical learning theories: Support Vector Machine (SVM), Random Forest (RF), DualKS, and k-Nearest Neighbors (kNN). The dataset used is a microarray gene expression study of 596 male patients with prostate cancer. After treatment, the patients were classified into one phenotype with three levels: PSA (Prostate-Specific Antigen), Systematic, and NED (No Evidence of Disease). The purpose of this research is to determine the performance of each classifier by selecting the optimal kernels and parameters that give the best prediction rate for the phenotype. The paper begins with a discussion of previous implementations of the tools and their mathematical theories. The results showed that three of the classifiers achieved comparable, above-average performance, while DualKS did not; SVM outperformed the kNN, RF, and DualKS classifiers.

Item: A Comparative Multiple Simulation Study for Parametric and Nonparametric Methods in the Identification of Differentially Expressed Genes (North Dakota State University, 2021). Palmer, Daniel Grant.
RNA-seq data simulated from a negative binomial distribution, sampled without replacement, or modified from read counts were analyzed to compare differential gene expression analysis methods in terms of false discovery rate control and power. The goals of the study were to determine the optimal sample sizes and proportions of differential expression needed to adequately control the false discovery rate, and to determine which differential gene expression methods performed best under the given simulation methods. Parametric tools like edgeR and limma-voom tended to be conservative in controlling the false discovery rate on data from a negative binomial distribution as the proportion of differential expression increased. Under the nonparametric simulation methods, many differential gene expression methods did not adequately control the false discovery rate, and results varied greatly when different reference data sets were used for the simulations.
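
The parametric arm of the simulation design above is straightforward to sketch. In the R snippet below, counts are drawn from a negative binomial distribution with a planted fraction of truly differentially expressed genes; every setting (means, dispersion, fold change) is invented for illustration:

```r
# Minimal sketch of simulating RNA-seq counts from a negative binomial
# distribution with known truth; all settings are invented.
set.seed(42)
n_genes <- 5000; n_rep <- 5
mu   <- rgamma(n_genes, shape = 2, rate = 1/200)  # baseline mean counts
disp <- 0.2                                       # common NB dispersion
de   <- rbinom(n_genes, 1, 0.10) == 1             # 10% truly DE
fc   <- ifelse(de, 2, 1)                          # two-fold change if DE

counts <- cbind(
  matrix(rnbinom(n_genes * n_rep, mu = mu,      size = 1/disp), n_genes),
  matrix(rnbinom(n_genes * n_rep, mu = mu * fc, size = 1/disp), n_genes)
)
# 'counts' can now be passed to edgeR, limma-voom, or a nonparametric
# method; comparing each method's discoveries to 'de' estimates FDR/power.
```
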
Item: Comparing Dunnett's Test with the False Discovery Rate Method: A Simulation Study (North Dakota State University, 2013). Kubat, Jamie.
Recently, the idea of multiple comparisons has been criticized because of its lack of power in datasets with a large number of treatments. Many family-wise error corrections are far too restrictive when large numbers of comparisons are being made. At the other extreme, a test like the least significant difference does not control the family-wise error rate, and is therefore not restrictive enough to identify true differences. A solution lies in multiple testing. The false discovery rate (FDR) method uses a simple algorithm and can be applied to datasets with many treatments. The current research compares the FDR method to Dunnett's test using agronomic data from a study with 196 varieties of dry beans. Simulated data are used to assess the type I error and power of the tests. In general, the FDR method provides higher power than Dunnett's test while maintaining control of the type I error rate.
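
The two procedures being compared can be run side by side on a toy example. The sketch below assumes the multcomp package for Dunnett's test and uses base R's p.adjust for the Benjamini-Hochberg FDR step; the data are simulated, not the dry-bean data:

```r
# Toy comparison: Dunnett's test (control vs. each treatment, family-wise
# error control) against per-comparison t-tests with BH adjustment.
# Simulated data; Dunnett's test assumes the 'multcomp' package.
set.seed(7)
dat <- data.frame(trt = factor(rep(1:20, each = 5)),
                  y   = rnorm(100) + rep(c(0, 0.9, rep(0, 18)), each = 5))

library(multcomp)                    # for glht() and mcp()
summary(glht(aov(y ~ trt, data = dat), linfct = mcp(trt = "Dunnett")))

# FDR route: one t-test per treatment vs. control, then BH adjustment
p <- sapply(levels(dat$trt)[-1], function(g)
  t.test(dat$y[dat$trt == "1"], dat$y[dat$trt == g])$p.value)
which(p.adjust(p, method = "BH") < 0.05)   # discoveries at FDR = 0.05
```
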
Item: Comparing Total Hip Replacement Drug Treatments for Cost and Length of Stay (North Dakota State University, 2015). Huebner, Blake James.
The objective of this study is to identify the potential effects of anticoagulants, spinal blocks, and antifibrinolytics on overall cost, length of stay, and readmission rates for total hip replacement patients. We use ordinary least squares regression, multiple comparison testing, logistic regression, and chi-square tests to meet this objective. The combination of warfarin and enoxaparin is associated with the highest cost and length of stay among the anticoagulants studied. No combination of spinal blocks is clearly associated with the highest cost and length of stay. Tranexamic acid is associated with a reduction in length of stay and in the likelihood of receiving a blood transfusion, without increasing overall cost. No drug combination in any category is associated with a change in readmission rates.

Item: A Comparison of Methods Taking into Account Asymmetry when Evaluating Differential Expression in Gene Expression Experiments (North Dakota State University, 2018). Tchakounte Wakem, Seguy.
Gene expression technologies allow expression levels to be compared across treatments for thousands of genes simultaneously. Asymmetry in the empirical distribution of the test statistics from the analysis of a gene expression experiment is often observed. Statistical methods exist for identifying differentially expressed (DE) genes that control multiple testing error while taking into account the asymmetry of the distribution of effect sizes. This paper compares three such methods (Modified Q-value, Modified SAM, and Asymmetric Local False Discovery Rate) for identifying DE genes while controlling the false discovery rate (FDR). The results of the simulation studies performed suggest that the Modified Q-value method outperforms the other methods most of the time and also controls the FDR better.

Item: Comparison of Proposed K Sample Tests with Dietz's Test for Nondecreasing Ordered Alternatives for Bivariate Normal Data (North Dakota State University, 2011). Zhao, Yanchun.
There are many situations in which researchers want to consider a set of response variables simultaneously rather than just one response variable, for instance, when determining the effects of an exercise and diet program on both the cholesterol levels and the weights of obese subjects. Dietz (1989) proposed two multivariate generalizations of the Jonckheere test for ordered alternatives. In this study, we propose k-sample tests for nondecreasing ordered alternatives for bivariate normal data and compare their powers with Dietz's sum statistic. The proposed k-sample tests are based on transformations of the bivariate data to univariate data; the transformations considered, following Leconte, Moreau, and Lellouch (1994), are the sum, maximum, and minimum functions. After the underlying bivariate normal data are reduced to univariate data, the Jonckheere-Terpstra (JT) test (Terpstra, 1952; Jonckheere, 1954) and the Modified Jonckheere-Terpstra (MJT) test (Tryon and Hettmansperger, 1973) are applied to the univariate data. A simulation study is conducted to compare the proposed tests with Dietz's test for k bivariate normal populations (k = 3, 4, 5). A variety of sample sizes and location shifts are considered, along with two different correlations for the bivariate normal distributions. The simulation results show that the Dietz test generally performs best for the situations considered with an underlying bivariate normal distribution. The estimated powers of the MJT-sum and JT-sum tests are often close, with the MJT-sum generally having slightly higher power. The sum transformation was the best of the three transformations for bivariate normal data.
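
The core computation in the entry above is simple enough to write out in base R: reduce each bivariate observation to a univariate value (here, the sum transformation) and accumulate the Jonckheere-Terpstra statistic over pairs of groups. The data below are invented:

```r
# Base-R sketch: sum transformation plus the Jonckheere-Terpstra statistic
# for a nondecreasing ordered alternative. Invented data with k = 3 groups
# and a nondecreasing location shift.
set.seed(3)
n <- 10
grp <- lapply(c(0, 0.4, 0.8), function(shift)
  rowSums(matrix(rnorm(2 * n, mean = shift), nrow = n)))  # bivariate -> sum

# JT = sum over group pairs i < j of the Mann-Whitney counts #{u_i < u_j}
jt <- 0
k  <- length(grp)
for (i in 1:(k - 1)) for (j in (i + 1):k)
  jt <- jt + sum(outer(grp[[i]], grp[[j]], "<"))
jt   # large values support the nondecreasing ordered alternative
```
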
Item: A Comparison of the Ansari-Bradley Test and the Moses Test for the Variances (North Dakota State University, 2011). Yuni, Chen.
This paper aims to compare the powers and significance levels of two well-known nonparametric tests, the Ansari-Bradley test and the Moses test, both where the equal-median assumption is satisfied and where it is violated. R code is used to generate random data from several distributions: the normal distribution, the exponential distribution, and the t-distribution with three degrees of freedom. The power and significance level of each test were estimated for a given situation based on 10,000 iterations. Situations with equal sample sizes of 10, 20, and 30, and unequal sample sizes of 10 and 20, 20 and 10, and 20 and 30 were considered for a variety of location parameter shifts. The study shows that when the two location parameters are equal, the Ansari-Bradley test is generally more powerful than the Moses test regardless of the underlying distribution; when the two location parameters differ, the Moses test is generally preferred. The study also shows that when the underlying distribution is symmetric, the Moses test with a large subset size k generally has higher power than the test with a smaller k; when the underlying distribution is not symmetric, the Moses test with a larger k is more powerful for relatively small sample sizes, and the Moses test with a medium k has higher power for relatively large sample sizes.

Item: A Comparison of Two Scaling Techniques to Reduce Uncertainty in Predictive Models (North Dakota State University, 2020). Todd, Austin Luke.
This research examines the use of two scaling techniques to accurately transfer information from small-scale data to large-scale predictions in a handful of nonlinear functions. The two techniques are (1) using random draws from distributions that represent smaller time scales and (2) using a single draw from a distribution representing the mean over all time represented by the model. Simulation was used to create the underlying distributions for the variable and parameters of the chosen functions, which were then scaled accordingly. Once scaled, the variable and parameters were plugged into the chosen functions to give an output value. Using simulation, output distributions were created for each combination of scaling technique, underlying distribution, variable bounds, and parameter bounds. These distributions were then compared using a variety of statistical tests, measures, and graphical plots.

Item: D-optimal Design for the 5PL-1P Model in Chemical Toxicity Assessment (North Dakota State University, 2016). MacDonald, Jenna Lynn.
The five-parameter logistic minus one-parameter (5PL-1P) model is a hybrid between the five-parameter logistic model and the four-parameter logistic model used to describe the relationship between concentration and response. The four-parameter model includes the maximum and minimum response, the slope, and the median effective concentration EC50. The five-parameter model adds an asymmetry factor, which is important because the sigmoid curve may be asymmetric; this model, however, is more difficult to fit because of the additional parameter. The 5PL-1P model is therefore used so that the asymmetry factor is taken into account with one fewer parameter. For the 5PL-1P model, D-optimal designs are obtained to estimate the model parameters effectively. We then compare the D-optimal designs to designs used to study the 5PL-1P model in real toxicity assessments and show that they outperform the original designs by comparing their efficiencies and their MSEs through simulation studies.
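
The model behind this design problem can be written down directly. The sketch below uses the common five-parameter logistic form; which of the five parameters the 5PL-1P variant fixes is an assumption here (the lower asymptote), and the thesis may constrain a different one:

```r
# The five-parameter logistic curve; the 5PL-1P variant drops one parameter.
# Fixing the lower asymptote d = 0 is an assumption for illustration only.
fivePL <- function(x, a, b, c, d, g) d + (a - d) / (1 + (x / c)^b)^g
fivePL1P <- function(x, a, b, c, g) fivePL(x, a, b, c, d = 0, g = g)

# Hypothetical concentration-response data and a nonlinear least squares fit
set.seed(5)
conc <- c(0.1, 0.3, 1, 3, 10, 30, 100)
resp <- fivePL1P(conc, a = 100, b = 1.2, c = 2, g = 0.8) + rnorm(7, sd = 2)
nls(resp ~ fivePL1P(conc, a, b, c, g),
    start = list(a = 90, b = 1, c = 1.5, g = 1))
# A D-optimal design would instead choose the concentrations so as to
# maximize the determinant of the information matrix for (a, b, c, g).
```
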
Item: Demographic Analysis of Student Evaluations (North Dakota State University, 2015). Huebner, Lucas James.
Data was collected from North Dakota State University's student rating of instructors forms during the fall of 2013 and the spring of 2014. This thesis investigates differences between male and female instructors' ratings and attempts to describe outcomes using other demographics. T-tests were performed comparing the means of class averages for male and female instructors for each question on the student evaluation. There was no difference in the mean class averages between male and female instructors when the whole university was considered or when only the College of Science and Math was considered. The analysis also shows that male students tend to rate male instructors higher and female students tend to rate female instructors higher.

Item: The Determinants of Aeronautical Charges of U.S. Airports: A Spatial Analysis (North Dakota State University, 2020). Karanki, Fecri.
Using U.S. airport data from 2009 through 2016, this thesis examines the determinants of the aeronautical charges of large and medium hub airports and accounts for the spatial dependence of neighboring airports in a spatial panel regression model. The major findings of this thesis are that (1) U.S. airports' aeronautical charges are spatially dependent, with neighboring airports' charges positively correlated; (2) there is evidence of airport cost recovery through non-aeronautical revenues; (3) airports sharing non-aeronautical revenues with airlines charge lower aeronautical fees than their peers that do not share revenues; and (4) aeronautical charges increase with higher delays.

Item: Development of a Prediction Model for the NCAA Division-I Football Championship Subdivision (North Dakota State University, 2013). Long, Joseph.
This thesis investigates which in-game team statistics are most significant in determining the outcome of an NCAA Division-I Football Championship Subdivision (FCS) game. The data were analyzed using logistic and ordinary least squares regression techniques to create models that explained the outcomes of past games. The models were then used to predict games where the actual in-game statistics were unknown. A random sample of games from the 2012 FCS regular season was used to test the accuracy of the models in predicting future games. Various techniques were used to estimate the in-game statistics for each team in order to predict future games. The most accurate technique used three-game medians of total yards gained by the teams in question. This technique correctly predicted 78.85% of the games in the sample data set when used with the logistic regression model.
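
The most accurate technique reported above, three-game medians, amounts to a small piece of feature construction before the logistic model is applied. A hypothetical sketch follows (invented numbers and a stand-in model, not the thesis's fitted coefficients):

```r
# Sketch of the 'three-game medians' idea: estimate a team's in-game
# statistic for an upcoming game by the median of its three most recent
# games, then feed the difference into a fitted logistic model.
med3 <- function(x) median(tail(x, 3))      # three-game median

team_yards <- c(410, 388, 450, 372, 430)    # most recent game last
opp_yards  <- c(350, 405, 360, 398, 365)
yard_diff  <- med3(team_yards) - med3(opp_yards)

# Stand-in for the thesis's model, trained here on simulated past games
set.seed(9)
hist_games <- data.frame(win = rbinom(100, 1, 0.5),
                         yard_diff = rnorm(100, sd = 60))
fit <- glm(win ~ yard_diff, family = binomial, data = hist_games)
predict(fit, data.frame(yard_diff = yard_diff), type = "response")
```
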