Statistics Masters Theses
Permanent URI for this collection: hdl:10365/32401
Browsing Statistics Masters Theses by Issue Date
Now showing 1 - 20 of 65
Item: Mass Spectrum Analysis of a Substance Sample Placed into Liquid Solution (North Dakota State University, 2011). Wang, Yunli.
Mass spectrometry is an analytical technique commonly used for determining the elemental composition of a substance sample. For this purpose, the sample is placed into a liquid solution called the liquid matrix. Unfortunately, the spectrum of the sample is not observable separately from that of the solution, so it is desirable to distinguish the sample spectrum. The analysis is usually based on comparing the mixed spectrum with that of the solution alone. Introducing the missing information about the origin of observed spectrum peaks, the author obtains a classic setup for the Expectation-Maximization (EM) algorithm. The author proposes a mixture model for the spectrum of the liquid solution as well as that of the sample, with a bell-shaped probability mass function, obtained by discretizing the univariate Gaussian probability density function, serving as a mixture component. The E- and M-steps were derived under the proposed model, and the corresponding R program was written and tested on a small but challenging simulation example. Varying the number of mixture components for the liquid matrix and the sample, the author found the correct model according to the Bayesian Information Criterion. The initialization of the EM algorithm is a difficult standalone problem that was successfully resolved for this case. The author presents the findings and provides results from the simulation example, along with illustrations supporting the conclusions.

Item: Comparison of Proposed K Sample Tests with Dietz's Test for Nondecreasing Ordered Alternatives for Bivariate Normal Data (North Dakota State University, 2011). Zhao, Yanchun.
There are many situations in which researchers want to consider a set of response variables simultaneously rather than just one response variable, for instance, when a researcher wishes to determine the effects of an exercise and diet program on both the cholesterol levels and the weights of obese subjects. Dietz (1989) proposed two multivariate generalizations of the Jonckheere test for ordered alternatives. In this study, we propose k-sample tests for nondecreasing ordered alternatives for bivariate normal data and compare their powers with Dietz's sum statistic. The proposed k-sample tests are based on transformations of bivariate data to univariate data; the transformations considered are the sum, maximum, and minimum functions, following ideas from Leconte, Moreau, and Lellouch (1994). After the underlying bivariate normal data are reduced to univariate data, the Jonckheere-Terpstra (JT) test (Terpstra, 1952; Jonckheere, 1954) and the Modified Jonckheere-Terpstra (MJT) test (Tryon and Hettmansperger, 1973) are applied to the univariate data. A simulation study is conducted to compare the proposed tests with Dietz's test for k bivariate normal populations (k = 3, 4, 5), considering a variety of sample sizes, various location shifts, and two different correlations for the bivariate normal distributions. The simulation results show that the Dietz test generally performs best for the situations considered under the underlying bivariate normal distribution. The estimated powers of the MJT sum and JT sum tests are often close, with the MJT sum generally having slightly higher power. The sum transformation was the best of the three transformations for bivariate normal data.
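As a rough sketch of the transformation idea in the Zhao thesis above, the following base R code reduces simulated bivariate responses to univariate scores with the sum function and applies a permutation version of the Jonckheere-Terpstra test. The data, group sizes, and effect sizes are invented for illustration, not taken from the thesis.

```r
# Sketch: sum-transform bivariate data, then a permutation Jonckheere-Terpstra test.
# Data are simulated here purely for illustration.
set.seed(1)
k <- 3; n <- 10                               # k ordered groups, n bivariate pairs each
g <- rep(1:k, each = n)                       # group labels in hypothesized nondecreasing order
x <- matrix(rnorm(2 * k * n, mean = rep(1:k, each = n)), ncol = 2)  # bivariate responses
u <- rowSums(x)                               # the "sum" transformation to univariate data

# JT statistic: sum of Mann-Whitney counts over all ordered group pairs
jt_stat <- function(u, g) {
  lev <- sort(unique(g)); s <- 0
  for (i in 1:(length(lev) - 1)) for (j in (i + 1):length(lev)) {
    a <- u[g == lev[i]]; b <- u[g == lev[j]]
    s <- s + sum(outer(a, b, "<")) + 0.5 * sum(outer(a, b, "=="))
  }
  s
}

obs  <- jt_stat(u, g)
perm <- replicate(2000, jt_stat(u, sample(g)))  # permutation null distribution
mean(perm >= obs)                               # one-sided p-value for an increasing trend
```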
Item: A Comparison of the Ansari-Bradley Test and the Moses Test for the Variances (North Dakota State University, 2011). Yuni, Chen.
This paper aims to compare the powers and significance levels of two well-known nonparametric tests, the Ansari-Bradley test and the Moses test, both where the equal-median assumption is satisfied and where it is violated. R code is used to generate random data from several distributions: the normal distribution, the exponential distribution, and the t-distribution with three degrees of freedom. The power and significance level of each test were estimated for a given situation based on 10,000 iterations. Situations with equal sample sizes of 10, 20, and 30, and unequal sample sizes of 10 and 20, 20 and 10, and 20 and 30 were considered for a variety of location parameter shifts. The study shows that when the two location parameters are equal, the Ansari-Bradley test is generally more powerful than the Moses test regardless of the underlying distribution; when the two location parameters differ, the Moses test is generally preferred. The study also shows that when the underlying distribution is symmetric, the Moses test with a large subset size k generally has higher power than the test with smaller k; when the underlying distribution is not symmetric, the Moses test with larger k is more powerful for relatively small sample sizes and the Moses test with medium k has higher power for relatively large sample sizes.

Item: A Nonparametric Test for the Non-Decreasing Alternative in an Incomplete Block Design (North Dakota State University, 2011). Ndungu, Alfred Mungai.
The purpose of this paper is to present a new nonparametric test statistic for testing against ordered alternatives in a Balanced Incomplete Block Design (BIBD). This test is then compared with the Durbin test, which tests for differences between treatments in a BIBD but without regard to order. For the comparison, Monte Carlo simulations were used to generate the BIBD. Random samples were simulated from the normal distribution, the exponential distribution, and the t-distribution with three degrees of freedom. The numbers of treatments considered were three, four, and five, with all the possible combinations necessary for a BIBD. Small sample sizes were 20 or less and large sample sizes were 30 or more. The powers and alpha values were estimated after 10,000 repetitions. The results of the study show that the proposed test is more powerful than the Durbin test: regardless of the distribution, sample size, or number of treatments, the new test tended to have higher power.
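Both of the block-design theses in this listing benchmark against the Durbin test, so a minimal base R sketch of its statistic may be helpful. The BIBD layout and simulated responses below are invented for illustration, and the p-value uses the standard large-sample chi-squared approximation.

```r
# Sketch: Durbin test for a balanced incomplete block design (BIBD).
# Layout: nt = 4 treatments, b = 4 blocks of size k = 3, r = 3 replications.
set.seed(2)
blocks <- list(c(1, 2, 3), c(1, 2, 4), c(1, 3, 4), c(2, 3, 4))
nt <- 4; b <- length(blocks); k <- 3; r <- 3

# Simulated responses with a mild increasing treatment effect (illustrative only)
y <- lapply(blocks, function(trt) rnorm(k, mean = 0.5 * trt))

# Within-block ranks, summed by treatment
R <- numeric(nt)
for (i in seq_len(b)) {
  R[blocks[[i]]] <- R[blocks[[i]]] + rank(y[[i]])
}

# Durbin statistic, approximately chi-squared with nt - 1 df under H0
D <- 12 * (nt - 1) / (r * nt * (k - 1) * (k + 1)) * sum((R - r * (k + 1) / 2)^2)
p <- pchisq(D, df = nt - 1, lower.tail = FALSE)
c(statistic = D, p.value = p)
```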
Item: Nonparametric Test for the Umbrella Alternative in a Randomized Complete Block and Balanced Incomplete Block Mixed Design (North Dakota State University, 2012). Hemmer, Michael Toshiro.
Nonparametric tests have served as robust alternatives to traditional statistical tests with rigid underlying assumptions. If a researcher expects the treatment effects to follow an umbrella alternative, then the test developed in this research (Hemmer's test) will be applicable in the Balanced Incomplete Block Design. It is hypothesized that Hemmer's test will prove more powerful than the Durbin test when the umbrella alternative is true. A mixed design consisting of a Balanced Incomplete Block Design and a Randomized Complete Block Design is also considered, and two additional test statistics are developed for the umbrella alternative. Monte Carlo simulation studies were conducted using SAS to estimate powers. Various underlying distributions were used with 3, 4, and 5 treatments and a variety of peaks and mean parameter values. For the mixed design, different ratios of complete to incomplete blocks were considered. Recommendations are given.

Item: Entropy as a Criterion for Variable Reduction in Cluster Data (North Dakota State University, 2012). Olson, Christopher.
Entropy is a measure of the randomness of a system state. This quantity gives us a measure of the uncertainty associated with each particular observation belonging to a specific cluster. We examine this property and its potential use in analyzing high-dimensional datasets. Entropy proves most useful in identifying dimensions that do not contribute meaningful classification to the clusters present. We can remove the least important dimension(s) and generalize this idea into a procedure. After identifying all the dimensions that should be eliminated from the dataset, we compare the reduced data's ability to recover the true classification of the observations against the estimated classification of the data. From the results obtained and shown in this paper, it is clear that entropy is a good candidate for a variable-reduction criterion.

Item: On K-Means Clustering Using Mahalanobis Distance (North Dakota State University, 2012). Nelson, Joshua.
A problem that arises quite frequently in statistics is that of identifying groups, or clusters, of data within a population or sample. The most widely used procedure to identify clusters in a set of observations is known as K-Means. The main limitation of this algorithm is that it uses the Euclidean distance metric to assign points to clusters; hence, it operates well only if the covariance structures of the clusters are nearly spherical and homogeneous. To remedy this shortfall in the K-Means algorithm, the Mahalanobis distance metric was used to capture the variance structure of the clusters. The issue with using Mahalanobis distances is that the accuracy of the distance is sensitive to initialization. If this method serves as a significant improvement over its competitors, then it will provide a useful tool for analyzing clusters.
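A minimal sketch of the Mahalanobis-distance variant of K-Means described in the Nelson abstract above, assuming ordinary Euclidean kmeans() output as the initialization (the abstract notes that sensitivity to initialization is the main difficulty). The elongated Gaussian clusters are simulated for illustration, and empty-cluster handling is omitted.

```r
# Sketch: K-Means-style clustering with Mahalanobis distance.
# Two elongated Gaussian clusters; initialized from ordinary kmeans().
set.seed(3)
x <- rbind(matrix(rnorm(200, sd = c(3, 0.5)), ncol = 2, byrow = TRUE),
           matrix(rnorm(200, mean = 4, sd = c(3, 0.5)), ncol = 2, byrow = TRUE))
K <- 2
cl <- kmeans(x, K)$cluster            # Euclidean initialization

for (iter in 1:20) {
  d <- sapply(1:K, function(j) {
    xs <- x[cl == j, , drop = FALSE]
    mahalanobis(x, colMeans(xs), cov(xs))  # squared Mahalanobis distance to cluster j
  })
  new_cl <- max.col(-d)               # assign each point to its nearest cluster
  if (all(new_cl == cl)) break        # stop when assignments stabilize
  cl <- new_cl                        # (sketch: degenerate/empty clusters not handled)
}
table(cl)
```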
Item: Examining Influential Factors and Predicting Outcomes in European Soccer Games (North Dakota State University, 2013). Melnykov, Yana.
Models are developed using least squares regression and logistic regression to predict outcomes of European soccer games based on four variables related to each team's past k games, with the following values of k considered: 4, 6, 8, 10, and 12. Soccer games from the European soccer leagues of England, Italy, and Spain are considered for the 2011-2012 season. Each league has 20 teams, each pair of teams playing two games with each other: one at home and one away. There are 38 rounds in each league. The first 33 rounds are used to develop models to predict outcomes of games, and predictions are made for the last 5 rounds in each league. We correctly predicted 76% of the results for the last 5 rounds using the least squares regression model and 77% using the logistic regression model.

Item: Development of a Prediction Model for the NCAA Division-I Football Championship Subdivision (North Dakota State University, 2013). Long, Joseph.
This thesis investigates which in-game team statistics are most significant in determining the outcome of an NCAA Division-I Football Championship Subdivision (FCS) game. The data were analyzed using logistic and ordinary least squares regression techniques to create models that explained the outcomes of past games. The models were then used to predict games where the actual in-game statistics were unknown. A random sample of games from the 2012 NCAA Division-I Football Championship Subdivision regular season was used to test the accuracy of the models when predicting future games. Various techniques were used to estimate the in-game statistics in the models for each individual team in order to predict future games. The most accurate technique consisted of using three-game medians with respect to total yards gained by the teams in consideration. This technique correctly predicted 78.85% of the games in the sample data set when used with the logistic regression model.

Item: Predicting Recessions in the U.S. with Yield Curve Spread (North Dakota State University, 2013). Huang, Di.
This paper proposes a hidden Markov model for the signal of U.S. recessions. The model uses as predictors the interest rate spread between the 10-year Treasury bond and the 3-month Treasury bill, together with other financial indicators: real M2 growth, the change in the Standard and Poor's 500 index of stock prices, and the difference between the 6-month commercial paper and 6-month Treasury bill rates. The hidden Markov model accounts for temporal dependence between the recession signals and provides an estimate of the long-term probability of recessions. The empirical results indicate that the hidden Markov model predicts the signal of U.S. recessions well.

Item: Comparing Dunnett's Test with the False Discovery Rate Method: A Simulation Study (North Dakota State University, 2013). Kubat, Jamie.
Recently, the idea of multiple comparisons has been criticized because of its lack of power in datasets with a large number of treatments. Many family-wise error corrections are far too restrictive when large quantities of comparisons are being made. At the other extreme, a test like the least significant difference does not control the family-wise error rate and therefore is not restrictive enough to identify true differences. A solution lies in multiple testing: the false discovery rate (FDR) uses a simple algorithm and can be applied to datasets with many treatments. The current research compares the FDR method to Dunnett's test using agronomic data from a study with 196 varieties of dry beans. Simulated data are used to assess the type I error and power of the tests. In general, the FDR method provides higher power than Dunnett's test while maintaining control of the type I error rate.
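The FDR method in the Kubat comparison above is the Benjamini-Hochberg adjustment, available in base R through p.adjust(). The sketch below estimates its power on invented data with a few truly shifted treatments, loosely mirroring the abstract's simulation setup; Dunnett's test itself requires a contributed package (e.g., multcomp) and is omitted here.

```r
# Sketch: estimating power of the Benjamini-Hochberg FDR method by simulation.
# 50 treatments vs a control; the first 5 treatments have a true shift of 1.
set.seed(4)
n <- 10; n_trt <- 50; shifted <- 1:5
reps <- 1000
hits <- replicate(reps, {
  ctrl <- rnorm(n)
  p <- sapply(1:n_trt, function(j) {
    mu <- if (j %in% shifted) 1 else 0
    t.test(rnorm(n, mean = mu), ctrl)$p.value   # treatment-vs-control t-test
  })
  padj <- p.adjust(p, method = "BH")            # Benjamini-Hochberg adjustment
  mean(padj[shifted] < 0.05)                    # fraction of true shifts detected
})
mean(hits)                                      # estimated average power
```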
Item: An Analysis of Factors Contributing to Wins in the National Hockey League (North Dakota State University, 2013). Roith, Joseph Michael.
This thesis looks at common factors that have the largest impact on winning games in the NHL. Data were collected from regular season games for all teams in the NHL over seven seasons. Logistic and least squares regressions were performed to create a win-probability model and a goal-margin model to predict the outcomes of games. Discriminant analysis was also used to determine significant factors over the course of an entire season. Save percentage margin, shot margin, block margin, short-handed shot margin, short-handed faceoff percentage, and even-handed faceoff percentage were found to be significant influences on individual game wins. Total goals, total goals against, and takeaway totals for a season were enough to correctly predict whether a team made the playoffs 87% of the time. The accuracies of the models were then tested by predicting the outcomes of games from the 2012 NHL regular season.

Item: Analysis of Significant Factors in Division I Men's College Basketball and Development of a Predictive Model (North Dakota State University, 2013). Unruh, Samuel Paul.
While a number of statistics are collected during an NCAA Division I men's college basketball game, it is potentially of interest to universities, coaches, players, and fans which of these variables are most significant in determining wins and losses. To this end, statistics were collected from two seasons of games and analyzed using logistic and least squares regression methods. The differences between the two competing teams in four common statistics were found to be significant in determining victory: assists, free throw attempts, defensive rebounds, and turnovers. The logistic and least squares models were then used with data from the 2011-2012 season to verify the accuracy of the models. To determine the accuracy of the models in predicting future game outcomes, four prior-game median statistics were collected for teams competing in a sample of games from 2011-2012, with the differences taken and used in the models.

Item: Using Imputed MicroRNA Regulation Based on Weighted Ranked Expression and Putative MicroRNA Targets and Analysis of Variance to Select MicroRNAs for Predicting Prostate Cancer Recurrence (North Dakota State University, 2014). Wang, Qi.
Imputed microRNA regulation based on weighted ranked expression and putative microRNA targets (IMRE) is a method to predict microRNA regulation from genome-wide gene expression. A false discovery rate (FDR) for each microRNA is calculated using the expression of the microRNA's putative targets to analyze the regulation between different conditions; the FDR is calculated to identify differences in gene expression. The dataset used in this research consists of microarray gene expression profiles of 596 patients with prostate cancer and includes three phenotypes: PSA (prostate-specific antigen recurrence), Systemic (systemic disease progression), and NED (no evidence of disease). We used the IMRE and ANOVA methods to analyze the dataset and identified several microRNA candidates that can be used to predict PSA recurrence and systemic disease progression in prostate cancer patients.
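Several of the sports theses in this listing (hockey, basketball, football, soccer) share the same modeling core: regress a binary game outcome on statistic margins with a binomial GLM. A toy version on simulated data follows; the margin variables are invented stand-ins, not any thesis's actual predictors.

```r
# Sketch: a logistic win-probability model on simulated game-level margins.
# Variable names (shot_margin, faceoff_margin) are invented for illustration.
set.seed(5)
n <- 500
games <- data.frame(shot_margin    = rnorm(n, sd = 8),
                    faceoff_margin = rnorm(n, sd = 5))
eta <- 0.10 * games$shot_margin + 0.05 * games$faceoff_margin
games$win <- rbinom(n, 1, plogis(eta))          # simulate outcomes from a known model

fit <- glm(win ~ shot_margin + faceoff_margin,
           family = binomial, data = games)     # win-probability model
summary(fit)$coefficients

# Predicted win probability for a hypothetical game: +10 shots, -2 faceoffs
predict(fit, data.frame(shot_margin = 10, faceoff_margin = -2), type = "response")
```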
Item: Identification of Differentially Expressed Genes and Gene Sets Using a Modified Q-Value (North Dakota State University, 2014). Bentil, Ekua Fesuwa.
Gene expression technologies allow expression levels to be compared across treatments for thousands of genes simultaneously. Statistical methods exist for identifying differentially expressed (DE) genes and gene sets while controlling multiple testing error, but most do not take into account the distribution of effect sizes or the overrepresentation of observed patterns. This paper compares a recently proposed modified q-value method that takes such patterns into account with a traditional q-value method for experiments with three treatments. The results of the simulation studies performed suggest that the proposed methods improve upon the traditional method in identifying DE genes in certain settings but are outperformed by the traditional method in other settings. Analyses of real microarray data sets are also presented.

Item: Bracketing the NCAA Women's Basketball Tournament (North Dakota State University, 2014). Wang, Wenting.
This paper presents a bracketing method for all 63 games in the NCAA Division I Women's basketball tournament. Least squares models and logistic regression models for Round 1, Round 2, and Rounds 3-6 were developed to predict winners of basketball games in each of those rounds. For the first round, three-point goals, free throws, blocks, and seed were found to be significant; for the second round, field goals and average points; for the third and higher rounds, assists, steals, and seed. A complete bracket was filled out in 2014 before any game was played. When the differences in the seasonal averages of all previously mentioned variables for the two teams were entered into the least squares models, the models had approximately a 76% chance of correctly predicting the winner of a basketball game.

Item: T-Optimal Designs for Model Discrimination in Probit Models (North Dakota State University, 2014). Ming, Yue.
When dose-response functions have a downturn, one interesting feature to study is the significance of the downturn. This feature can be studied through model discrimination between two rival models: a model describing dose-response functions with a downturn versus a model describing only the increasing part of the response function. In this article, we study T-optimal designs that can best discriminate between these two rival models. Three different sets of model parameter values are considered to demonstrate various shapes of dose-response functions. Under the different sets of parameter values, the T-optimal designs are obtained, and their performance is compared, through Monte Carlo simulation, to two other designs known for model discrimination (the Ds-optimal design and the uniform design).
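The discrimination problem in the Ming abstract above contrasts a monotone probit dose-response model with one allowing a downturn. The sketch below fits the two rival models and compares them with a likelihood-ratio test; this is a stand-in for illustrating the rival-model setup, not the thesis's T-optimal design computation, and the doses and response counts are invented.

```r
# Sketch: the two rival probit dose-response models behind the discrimination
# problem -- monotone (dose only) vs downturn (dose + dose^2).
# Doses, group sizes, and response counts are invented for illustration.
dose <- c(0, 0.5, 1, 2, 4, 8)
n    <- rep(30, length(dose))
resp <- c(3, 8, 14, 22, 20, 12)                  # counts suggesting a downturn

m1 <- glm(cbind(resp, n - resp) ~ dose,
          family = binomial(link = "probit"))    # increasing-only model
m2 <- glm(cbind(resp, n - resp) ~ dose + I(dose^2),
          family = binomial(link = "probit"))    # model allowing a downturn

anova(m1, m2, test = "Chisq")    # likelihood-ratio check of the downturn term
```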
Item: Analysis of Salary for Major League Baseball Players (North Dakota State University, 2014). Hoffman, Michael Glenn.
This thesis examines the salaries of Major League Baseball (MLB) players and whether players are paid based on their on-the-field performance. Each salary was examined with respect to both the yearly production and the overall career production of the player. Several production statistics were collected for the 2010-2012 MLB seasons. A random sample of players was selected from each season, and separate models were created for position players and pitchers. Significant production statistics that were helpful in predicting salary were selected for each model. The models were deemed good, each having a predictive r-squared value of at least 0.70. After the regression models were found, they were tested for accuracy by predicting the salaries of a random sample of players from the 2013 MLB season.

Item: Frost Depth Prediction (North Dakota State University, 2014). Luo, Meng.
The purpose of this research project is to develop a model that can accurately predict frost depth on a particular date using available information. Frost depth prediction is useful in many applications across several domains; in agriculture, for example, knowing frost depth early is crucial for farmers to determine when and how deep they should plant. In this study, data were collected primarily from the NDAWN (North Dakota Agricultural Weather Network) Fargo station for historical soil temperature at depth and weather information. Lasso regression is used to model the frost depth. Since soil temperature is clearly seasonal, with an obvious correlation between temperature and the day of the year, the model must handle residual correlations arising not only from the time domain but also from the space domain, since temperatures at different depths should also be correlated. Root mean square error (RMSE) is used to evaluate the goodness of fit of the model.

Item: Robust c-Optimal Design for Estimating the EDp (North Dakota State University, 2014). Zhang, Anqing.
Optimal design provides the most efficient way to study dose-response functions, and the four-parameter logistic model is often adopted to describe dose-response relationships in dose-finding trials. Under the four-parameter logistic model, optimal designs for accurately estimating the EDp, the dose achieving 100p% of the maximum treatment effect, are presented. The c-optimal design works best for estimating the EDp, but the value of p must be predetermined in order to obtain it. Here we investigate the efficiency of the c-optimal design for estimating the EDp across different values of p and present a robust c-optimal design that works well under changes in the value of p. Five values of p are considered in this study, corresponding to the ED10, ED30, ED50, ED70, and ED90. The performance of the robust c-optimal design is obtained and compared to the c-optimal designs and traditional uniform designs.
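The frost-depth thesis above fits a lasso regression; in R this is commonly done with the glmnet package (an assumption here, as the abstract names the method but not the software). The predictors below are invented stand-ins for the NDAWN series, and RMSE is computed to match the evaluation metric mentioned in the abstract.

```r
# Sketch: lasso regression for frost depth on simulated weather predictors.
# Requires the glmnet package; feature names are invented stand-ins for NDAWN data.
library(glmnet)
set.seed(6)
n <- 300
X <- cbind(air_temp    = rnorm(n, -5, 8),
           snow_depth  = pmax(rnorm(n, 10, 6), 0),
           day_of_year = sample(1:365, n, replace = TRUE))
y <- 50 - 1.5 * X[, "air_temp"] - 0.8 * X[, "snow_depth"] + rnorm(n, sd = 5)

cvfit <- cv.glmnet(X, y, alpha = 1)      # alpha = 1 selects the lasso penalty
coef(cvfit, s = "lambda.min")            # coefficients at the CV-chosen lambda

# RMSE on the training data, the goodness-of-fit measure named in the abstract
pred <- predict(cvfit, newx = X, s = "lambda.min")
sqrt(mean((y - pred)^2))
```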