Statistics
Permanent URI for this community: hdl:10365/32398
Research from the Department of Statistics. The department website may be found at https://www.ndsu.edu/statistics/
Proceedings for the annual Red River Valley Statistical Conferences may be found at http://hdl.handle.net/10365/26113
Browsing Statistics by Issue Date
Now showing 1 - 20 of 123
Item: Comparing Tests for a Mixed Design with Block Effect (North Dakota State University, 2009) Zhao, Hui
Tests Comb and Comb II are used to test the equality of means in a mixed design, which is a combination of a randomized complete block design and a completely randomized design. The powers of Comb and Comb II for a mixed design have already been compared with Page's test (Magel, Terpstra, Wen (2009)) when there was little or no block effect added to the portion that was analyzed as a completely randomized design. In this paper, we wish to compare the tests when the portion of the design analyzed as a completely randomized design actually has a block effect. A Monte Carlo simulation study was conducted to compare the power of the three tests, where Page's test was used only on data from the randomized complete block portion. A variety of situations were considered. Three underlying distributions were included in the simulation study: the normal distribution, the exponential distribution, and the t distribution with three degrees of freedom. For every distribution, 16, 32 and 40 blocks were used in the randomized complete block design portion, where the equal sample size of the completely randomized data portion was 1/8, 1/4 and 1/2 the number of blocks considered. Unequal sample sizes for the completely randomized design portion were also considered. Powers were estimated for different location parameter arrangements for 3, 4 and 5 populations. Two variances, 0.25 and 1, for the block effect were used. The block factor added into the completely randomized design portion did not change the test with the highest rejection percentage for the equal sample size cases, although the powers of the two tests for the mixed design decreased. For most of the unequal sample size cases, Page's test had the highest rejection percentage. Overall, it was concluded that it was better to use one of the two tests for the mixed design instead of Page's test when there were equal sample sizes for the portion analyzed as a completely randomized design. When the sample sizes were unequal, but the first sample size was twice the size of the others, it was generally better to use Comb over Page's test unless the number of populations became very large or there was a large block effect variance.

Item: A Pilot Study of Module Interconnectedness (North Dakota State University, 2010) Vanguru, Prasanth
Complexity plays an important role in understanding and working with a program, and has been measured in many different ways for software applications. The use of statistical analysis is one way to predict the pattern of complexity among the modules present in a software application. A random sample of twelve software applications was selected for this study to examine complexity. A single pair of complexity measures was evaluated: the in-degrees and out-degrees for each module of an application. The next step was to try to fit suitable statistical distributions to the in-degrees and to the out-degrees. Using various statistical distributions, such as the normal, log-normal, exponential, geometric, uniform, Poisson and chi-square, we try to determine the type of distribution for the in-degrees and the type of distribution for the out-degrees of the modules present in the software applications so that the pattern of complexity can be derived. The chi-square goodness-of-fit test was used to test various null hypotheses about the distributions for the in-degrees and for the out-degrees. Results showed that the pattern of in-degrees and the pattern of out-degrees both followed chi-square distributions.
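The pilot study above relies on chi-square goodness-of-fit tests to check candidate distributions against observed in-degree and out-degree counts. A minimal sketch of that kind of check in R, assuming hypothetical degree data and a Poisson candidate (the study's actual applications and fitted distributions are not reproduced here):

```r
# Hypothetical in-degree counts for the modules of one application
in_degrees <- c(0, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 7)

# Fit a candidate distribution (here Poisson, estimated by its sample mean)
lambda_hat <- mean(in_degrees)

# Bin the observations as 0, 1, 2, 3, 4+ and get expected probabilities under the fit
observed <- table(cut(in_degrees, breaks = c(-1, 0, 1, 2, 3, Inf)))
probs    <- c(dpois(0:3, lambda_hat), ppois(3, lambda_hat, lower.tail = FALSE))

# Chi-square goodness-of-fit test of binned counts against the fitted probabilities.
# Note: chisq.test does not reduce the degrees of freedom for the estimated parameter,
# and this tiny illustrative sample will trigger a small-expected-count warning.
chisq.test(as.vector(observed), p = probs)
```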
Item: Robust Tests for Cointegration with Application to Statistical Arbitrage Trading Strategies (North Dakota State University, 2010) Hanson, Thomas Alan
This study proposes two new cointegration tests that employ rank-based and least absolute deviation techniques to create a robust version of the Engle-Granger cointegration test. Critical values are generated through a Monte Carlo simulation over a range of error distributions, and the performance of the tests is then compared against the Engle-Granger and Johansen tests. The robust procedures underperform slightly for normally distributed error terms but outperform for fatter-tailed distributions. This characteristic suggests the robust tests are more appropriate for many applications where departures from normality are common. One particular example discussed here is statistical arbitrage, a stock trading strategy based on cointegration and mean reversion. In a simple example, the rank-based procedure produces additional profits over the Engle-Granger procedure.

Item: Mass Spectrum Analysis of a Substance Sample Placed into Liquid Solution (North Dakota State University, 2011) Wang, Yunli
Mass spectrometry is an analytical technique commonly used for determining elemental composition in a substance sample. For this purpose, the sample is placed into a liquid solution called a liquid matrix. Unfortunately, the spectrum of the sample is not observable separately from that of the solution. Thus, it is desired to distinguish the sample spectrum. The analysis is usually based on the comparison of the mixed spectrum with that of the sole solution. Introducing the missing information about the origin of observed spectrum peaks, the author obtains a classic setup for the Expectation-Maximization (EM) algorithm. The author proposed a mixture model for the spectrum of the liquid solution as well as that of the sample. A bell-shaped probability mass function obtained by discretization of the univariate Gaussian probability density function was proposed for serving as a mixture component. The E- and M-steps were derived under the proposed model. The corresponding R program was written and tested on a small but challenging simulation example. Varying the number of mixture components for the liquid matrix and sample, the author found the correct model according to the Bayesian Information Criterion. The initialization of the EM algorithm is a difficult standalone problem that was successfully resolved for this case. The author presents the findings and provides results from the simulation example as well as corresponding illustrations supporting the conclusions.

Item: Comparison of Proposed K Sample Tests with Dietz's Test for Nondecreasing Ordered Alternatives for Bivariate Normal Data (North Dakota State University, 2011) Zhao, Yanchun
There are many situations in which researchers want to consider a set of response variables simultaneously rather than just one response variable. For instance, a possible example is when a researcher wishes to determine the effects of an exercise and diet program on both the cholesterol levels and the weights of obese subjects. Dietz (1989) proposed two multivariate generalizations of the Jonckheere test for ordered alternatives. In this study, we propose k-sample tests for nondecreasing ordered alternatives for bivariate normal data and compare their powers with Dietz's sum statistic. The proposed k-sample tests are based on transformations of bivariate data to univariate data. The transformations considered are the sum, maximum and minimum functions. The ideas for these transformations come from Leconte, Moreau, and Lellouch (1994). After the underlying bivariate normal data are reduced to univariate data, the Jonckheere-Terpstra (JT) test (Terpstra, 1952 and Jonckheere, 1954) and the Modified Jonckheere-Terpstra (MJT) test (Tryon and Hettmansperger, 1973) are applied to the univariate data. A simulation study is conducted to compare the proposed tests with Dietz's test for k bivariate normal populations (k = 3, 4, 5). A variety of sample sizes and various location shifts are considered in this study. Two different correlations are used for the bivariate normal distributions. The simulation results show that generally the Dietz test performs the best for the situations considered with the underlying bivariate normal distribution. The estimated powers of the MJT sum and JT sum tests are often close, with the MJT sum generally having slightly higher power. The sum transformation was the best of the three transformations to use for bivariate normal data.
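The bivariate ordered-alternatives study above reduces each bivariate observation to a univariate score and then applies the Jonckheere-Terpstra test. A minimal sketch of that idea in R, using the sum transformation and a hand-rolled JT statistic with its large-sample normal approximation (the data and group sizes are invented; the MJT variant and Dietz's test are not shown):

```r
# Jonckheere-Terpstra statistic: for each ordered pair of groups (i < j),
# count cross-pairs in which the group-j value exceeds the group-i value.
jt_test <- function(x, g) {
  g <- factor(g); k <- nlevels(g)
  n <- tabulate(g)
  jt <- 0
  for (i in 1:(k - 1)) for (j in (i + 1):k) {
    xi <- x[g == levels(g)[i]]; xj <- x[g == levels(g)[j]]
    jt <- jt + sum(outer(xj, xi, ">")) + 0.5 * sum(outer(xj, xi, "=="))
  }
  N   <- sum(n)
  mu  <- (N^2 - sum(n^2)) / 4                                      # null mean
  sig <- sqrt((N^2 * (2 * N + 3) - sum(n^2 * (2 * n + 3))) / 72)   # null SD (no ties)
  z <- (jt - mu) / sig
  c(JT = jt, z = z, p.value = pnorm(z, lower.tail = FALSE))        # one-sided, nondecreasing
}

# Hypothetical bivariate data for 3 ordered groups, reduced by the sum transformation
set.seed(42)
grp <- rep(1:3, each = 15)
y1  <- rnorm(45, mean = 0.3 * grp)
y2  <- rnorm(45, mean = 0.3 * grp)
jt_test(y1 + y2, grp)
```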
Item: A Comparison of the Ansari-Bradley Test and the Moses Test for the Variances (North Dakota State University, 2011) Yuni, Chen
This paper aims to compare the powers and significance levels of two well-known nonparametric tests, the Ansari-Bradley test and the Moses test, both in situations where the equal-median assumption is satisfied and where the equal-median assumption is violated. R code is used to generate the random data from several distributions: the normal distribution, the exponential distribution, and the t-distribution with three degrees of freedom. The power and significance level of each test was estimated for a given situation based on 10,000 iterations. Situations with equal samples of size 10, 20, and 30, and unequal samples of size 10 and 20, 20 and 10, and 20 and 30 were considered for a variety of different location parameter shifts. The study shows that when the two location parameters are equal, the Ansari-Bradley test is generally more powerful than the Moses test regardless of the underlying distribution; when the two location parameters are different, the Moses test is generally preferred. The study also shows that when the underlying distribution is symmetric, the Moses test with a large subset size k generally has higher power than the test with smaller k; when the underlying distribution is not symmetric, the Moses test with larger k is more powerful for relatively small sample sizes and the Moses test with medium k has higher power for relatively large sample sizes.

Item: Comparison of Proposed K Sample Tests with Dietz's Test for Nondecreasing Ordered Alternatives for Bivariate Exponential Data (North Dakota State University, 2011) Pothana, Jyothsnadevi
Comparison of powers is essential to determine the best test that can be used for data under certain specific conditions. Likewise, several nonparametric methods have been developed for testing against ordered alternatives. The Jonckheere-Terpstra (JT) test and the Modified Jonckheere-Terpstra (MJT) test are for testing nondecreasing ordered alternatives for univariate data. The Dietz test is for testing nondecreasing alternatives based on bivariate data. This paper compares various tests when testing for nondecreasing alternatives, specifically when the underlying distributions are bivariate exponential. The JT test and the MJT test are applied to univariate data that is derived by reducing the bivariate data using various transformations. A Monte Carlo simulation study is conducted comparing the estimated powers of the JT and MJT tests (based on a variety of transformed univariate data) with the estimated powers of the Dietz test (based on bivariate data) under a variety of location shifts and sample sizes. The results are compared with Zhao's (2011) results for bivariate normal data. The overall best test statistic for bivariate data ordered alternatives is discussed in this paper.
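Most of the comparisons in this listing estimate power by simulating data repeatedly and recording the rejection rate. A minimal sketch of that Monte Carlo approach in R, applied to the Ansari-Bradley comparison above under assumed normal data with a pure scale difference (this is not the thesis code, and the Moses test, which is not in base R, is omitted):

```r
# Monte Carlo power estimate for the Ansari-Bradley scale test (base R: ansari.test)
set.seed(2011)
n1 <- 20; n2 <- 30        # unequal sample sizes, as in the study
ratio <- 2                # assumed ratio of scale parameters under the alternative
n_iter <- 10000           # the listed studies use 10,000 iterations
alpha <- 0.05

reject <- replicate(n_iter, {
  x <- rnorm(n1, mean = 0, sd = 1)
  y <- rnorm(n2, mean = 0, sd = ratio)                 # equal medians, different scales
  ansari.test(x, y, exact = FALSE)$p.value < alpha     # normal approximation keeps the loop fast
})
mean(reject)              # estimated power at this sample size and scale ratio
```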
Item: A Nonparametric Test for the Non-Decreasing Alternative in an Incomplete Block Design (North Dakota State University, 2011) Ndungu, Alfred Mungai
The purpose of this paper is to present a new nonparametric test statistic for testing against ordered alternatives in a Balanced Incomplete Block Design (BIBD). This test will then be compared with the Durbin test, which tests for differences between treatments in a BIBD but without regard to order. For the comparison, Monte Carlo simulations were used to generate the BIBD. Random samples were simulated from the normal distribution, the exponential distribution, and the t distribution with three degrees of freedom. The numbers of treatments considered were three, four and five, with all the possible combinations necessary for a BIBD. Small sample sizes were 20 or less and large sample sizes were 30 or more. The powers and alpha values were then estimated after 10,000 repetitions. The results of the study show that the new proposed test is more powerful than the Durbin test. Regardless of the distribution, sample size or number of treatments, the new test tended to have higher powers than the Durbin test.

Item: Factors Influencing Carbon Sequestration in Northern Great Plains Grasslands (North Dakota State University, 2011) Annam
Soil development is influenced by the five soil-forming factors: parent material, climate, landscape, organisms and time. This study was designed to examine the effects of landscape and organisms (vegetation) on carbon (C) in Conservation Reserve Program (CRP), restored grasslands, and undisturbed grasslands across the northern Great Plains of the U.S. using statistical methods. The effects of vegetation, slope, and aspect on C sequestered in the surface 30 cm of the soil were evaluated for 997 sites sampled across portions of Iowa, Minnesota, Montana, and North and South Dakota. A partial F-test was used to evaluate models to determine the significance of factors and their interaction effects. For the vegetation component of these models, cool-season grasses with or without legumes showed higher levels of soil organic C than warm-season grasses with or without legumes or mixed cool- and warm-season grass regimes. When slopes were evaluated, slopes less than 3% showed higher levels of sequestered C than slopes greater than 3%. Southern and western aspects showed higher soil C levels than other aspects.

Item: A Proposed Nonparametric Test for Simple Tree Alternative in a BIBD Design (North Dakota State University, 2011) Wang, Zhuangli
A nonparametric test is proposed to test for the simple tree alternative in a Balanced Incomplete Block Design (BIBD). The details of the test statistic when the null hypothesis is true are given. The paper also introduces the calculations of the means and variances under a variety of situations. A Monte Carlo simulation study based on SAS is conducted to compare the powers of the new proposed test and the Durbin test. The simulation study is used to generate the BIBD data from three distributions: the normal distribution, the exponential distribution, and the Student's t distribution with three degrees of freedom. The powers of the proposed test and the Durbin test are both estimated based on 10,000 iterations for three, four, and five treatments, and for different location shifts. According to the results of the simulation study, the Durbin test is better when at least one treatment mean is close to or equal to the control mean; otherwise, the proposed test is better.
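The grassland carbon study above uses a partial F-test to judge whether adding factors and their interactions improves a model. In R this is the standard nested-model comparison from anova(); a minimal sketch with an invented stand-in data set and made-up variable names (c_stock, vegetation, slope, aspect), not the study's actual data or model:

```r
# Simulated stand-in for the study's data (names and values are purely illustrative)
set.seed(1)
soil <- data.frame(
  c_stock    = rnorm(200, mean = 50, sd = 10),                     # soil organic C
  vegetation = factor(sample(c("cool", "warm", "mixed"), 200, TRUE)),
  slope      = factor(sample(c("<3%", ">3%"), 200, TRUE)),
  aspect     = factor(sample(c("N", "S", "E", "W"), 200, TRUE))
)

# Nested models: the reduced model omits aspect and the vegetation-by-slope interaction
reduced <- lm(c_stock ~ vegetation + slope, data = soil)
full    <- lm(c_stock ~ vegetation + slope + aspect + vegetation:slope, data = soil)

# Partial F-test: does the fuller model explain significantly more variation?
anova(reduced, full)
```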
Item: Optimal Designs for the Hill Model with Three Parameters (North Dakota State University, 2012) Dockter, Travis Jon
Optimal designs specify which design points to use and how to distribute subjects over these design points in the most efficient manner. The Hill model with three parameters is often used to describe sigmoid dose-response functions. In our paper, we study optimal designs under the Hill model. The first is the D-optimal design, which works best for fitting the model to the data. Next is the c-optimal design, which works best for studying a target dose level, such as ED50, the dose level with 50% of the maximum treatment effect. The third is a two-stage optimal design, which considers both D-optimality and c-optimality. In order to compare the optimal designs, their design efficiencies are compared.

Item: On K-Means Clustering Using Mahalanobis Distance (North Dakota State University, 2012) Nelson, Joshua
A problem that arises quite frequently in statistics is that of identifying groups, or clusters, of data within a population or sample. The most widely used procedure to identify clusters in a set of observations is known as K-Means. The main limitation of this algorithm is that it uses the Euclidean distance metric to assign points to clusters. Hence, this algorithm operates well only if the covariance structures of the clusters are nearly spherical and homogeneous in nature. To remedy this shortfall in the K-Means algorithm, the Mahalanobis distance metric was used to capture the variance structure of the clusters. The issue with using Mahalanobis distances is that the accuracy of the distance is sensitive to initialization. If this method serves as a significant improvement over its competitors, then it will provide a useful tool for analyzing clusters.

Item: Entropy as a Criterion for Variable Reduction in Cluster Data (North Dakota State University, 2012) Olson, Christopher
Entropy is a measure of the randomness of a system state. This quantity gives us a measure of the uncertainty associated with each particular observation belonging to a specific cluster. We examine this property and its potential use in analyzing high-dimension datasets. Entropy proves most interesting in identifying possible dimensions that do not contribute meaningful classification to the clusters present. We can remove the dimension(s) found to be least important and generalize this idea to a procedure. After identifying all the dimensions that should be eliminated from the dataset, we then compare its ability in recovering the true classification of the observations versus the estimated classification of the data. From the results obtained and shown in this paper, it is clear that entropy is a good candidate for a criterion in variable reduction.
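The K-Means item above swaps Euclidean distance for Mahalanobis distance so that elongated, non-spherical clusters are handled sensibly. A minimal sketch of one way to do that in R, re-estimating each cluster's covariance between assignment steps (a simplified illustration, not the thesis's exact algorithm; initialization is delegated to ordinary k-means):

```r
# K-means-style clustering with Mahalanobis distances (simplified illustration)
mahalanobis_kmeans <- function(X, k, max_iter = 50) {
  X <- as.matrix(X)
  cluster <- kmeans(X, centers = k)$cluster            # ordinary k-means as initialization
  for (iter in seq_len(max_iter)) {
    parts   <- split.data.frame(X, cluster)            # rows of X by current cluster
    centers <- lapply(parts, colMeans)
    covs    <- lapply(parts, cov)
    # squared distance of every point to every cluster, using that cluster's covariance
    d <- sapply(seq_len(k), function(j) mahalanobis(X, centers[[j]], covs[[j]]))
    new_cluster <- max.col(-d)                          # nearest cluster per point
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
  }
  cluster
}

# Hypothetical elongated two-cluster data
set.seed(3)
X <- rbind(cbind(rnorm(100), rnorm(100, sd = 4)),
           cbind(rnorm(100, mean = 6), rnorm(100, mean = 6, sd = 4)))
table(mahalanobis_kmeans(X, k = 2))
```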
Item: Nonparametric Test for the Umbrella Alternative in a Randomized Complete Block and Balanced Incomplete Block Mixed Design (North Dakota State University, 2012) Hemmer, Michael Toshiro
Nonparametric tests have served as robust alternatives to traditional statistical tests with rigid underlying assumptions. If a researcher expects the treatment effects to follow an umbrella alternative, then the test developed in this research (Hemmer's test) will be applicable in the Balanced Incomplete Block Design. It is hypothesized that Hemmer's test will prove to be more powerful than the Durbin test when the umbrella alternative is true. A mixed design consisting of a Balanced Incomplete Block Design and a Randomized Complete Block Design will also be considered, where two additional test statistics are developed for the umbrella alternative. Monte Carlo simulation studies were conducted using SAS to estimate powers. Various underlying distributions were used with 3, 4, and 5 treatments, and a variety of peaks and mean parameter values. For the mixed design, different ratios of complete to incomplete blocks were considered. Recommendations are given.

Item: Assessing Changes in Within Individual Variation Over Time for Nutritional Intake Data Using 24 Hour Recalls from the National Health and Examination Survey (North Dakota State University, 2012) Brandt, Kyal Scott
Nutritional surveys often use 24-hour recalls to assess the nutritional intake of certain populations. The National Health and Examination Survey (NHANES) collects two 24-hour recalls for each individual in the study. This small sampling can lead to a great deal of variation due to day-to-day differences in an individual's intake, making it difficult to assess "usual intake." The ISU method, implemented in the PC-Side software package, breaks the observed variation into two components: within-individual variation (WIV) and between-individual variation (BIV). In this paper, we will use the PC-Side software to get WIV estimates for several different age groups, genders, and nutrients from NHANES nutrition data. We will look at how WIV estimates change over time and use past WIV estimates to get a "usual intake" distribution and the calculated proportion below an estimated average requirement (EAR).

Item: Examining Influential Factors and Predicting Outcomes in European Soccer Games (North Dakota State University, 2013) Melnykov, Yana
Models are developed using least squares regression and logistic regression to predict outcomes of European soccer games based on four variables related to the past k games of each team playing, with the following values of k considered: 4, 6, 8, 10, and 12. Soccer games from the European soccer leagues of England, Italy, and Spain are considered for the 2011-2012 year. Each league has 20 teams, and each pair of teams plays two games: one at home and one away. There are 38 rounds in each league. The first 33 rounds are used to develop models to predict outcomes of games. Predictions are made for the last 5 rounds in each league. We were able to correctly predict 76% of the results for the last 5 rounds using the linear regression model and 77% of the results using the logistic regression model.
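The soccer study above fits a logistic regression on summaries of each team's recent games and then predicts the held-out rounds. A minimal sketch of that workflow in R with invented predictor names x1..x4 and a simplified binary home-win outcome (the study's four variables, league data, and reported 76-77% accuracies come from the abstract, not from this code):

```r
# Hypothetical training data: one row per game, with four rolling summaries of each
# team's past k games folded into invented predictors x1..x4.
set.seed(2013)
n <- 330                                  # e.g. 33 rounds with 10 games each
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
train$home_win <- rbinom(n, 1, plogis(0.4 * train$x1 - 0.3 * train$x2))

# Fit the logistic model on the earlier rounds
fit <- glm(home_win ~ x1 + x2 + x3 + x4, data = train, family = binomial)

# Predict the later rounds and classify by the estimated probability
test <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50), x4 = rnorm(50))
pred_prob <- predict(fit, newdata = test, type = "response")
pred_win  <- as.integer(pred_prob > 0.5)
head(pred_win)
```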
Item: Development of a Prediction Model for the NCAA Division-I Football Championship Subdivision (North Dakota State University, 2013) Long, Joseph
This thesis investigates which in-game team statistics are most significant in determining the outcome of an NCAA Division-I Football Championship Subdivision (FCS) game. The data were analyzed using logistic and ordinary least squares regression techniques to create models that explained the outcomes of past games. The models were then used to predict games where the actual in-game statistics were unknown. A random sample of games from the 2012 NCAA Division-I Football Championship Subdivision regular season was used to test the accuracy of the models when used to predict future games. Various techniques were used to estimate the in-game statistics in the models for each individual team in order to predict future games. The most accurate technique consisted of using three-game medians with respect to total yards gained by the teams in consideration. This technique correctly predicted 78.85% of the games in the sample data set when used with the logistic regression model.

Item: Predicting Recessions in the U.S. with Yield Curve Spread (North Dakota State University, 2013) Huang, Di
This paper proposes a hidden Markov model for the signal of U.S. recessions. The model uses the spread of interest rates between the 10-year Treasury bond and the 3-month Treasury bill as a predictor, together with other financial indicators: real M2 growth, the change in the Standard and Poor's 500 index of stock prices, and the difference between the 6-month commercial paper and 6-month Treasury bill rates. The hidden Markov model accounts for temporal dependence between the recession signals and provides an estimate of the long-term probability of recessions. The empirical results indicate that the hidden Markov model predicts the signal of recessions in the U.S. well.

Item: Bracketing NCAA Men's Division I Basketball Tournament (North Dakota State University, 2013) Zhang, Xiao
This paper presents a new bracketing method for all 63 games in the NCAA Division I basketball tournament. This method, based on logistic conditional probability models, is self-consistent in terms of constructing winning probabilities for each game. Empirical results show that this method outperforms the ordinal logistic regression and expectation method with restriction (Restricted OLRE model) proposed by West (2006).

Item: Model Validation and Diagnostics in Right Censored Regression (North Dakota State University, 2013) Miljkovic, Tatjana
When censored data are present in the linear regression setting, the Expectation-Maximization (EM) algorithm and the Buckley and James (BJ) method are two algorithms that can be implemented to fit the regression model. We focus our study on the EM algorithm because it is easier to implement than the BJ algorithm and it uses common assumptions in regression theory, such as normally distributed errors. The BJ algorithm, however, is used for comparison purposes in benchmarking the EM parameter estimates, their variability, and model selection. In this dissertation, validation and influence diagnostic tools are proposed for right censored regression using the EM algorithm.
These tools include a reconstructed coefficient of determination, a test for outliers based on the reconstructed jackknife residual, and influence diagnostics with one-step deletion. To validate the proposed methods, extensive simulation studies are performed to compare the performance of the EM and BJ algorithms in parameter estimation for data with different error distributions, proportions of censored data, and sample sizes. Sensitivity analysis for the reconstructed coefficient of determination is developed to show how the EM algorithm can be used in model validation for different amounts of censoring and locations of the censored data. Additional simulation studies show the capability of the EM algorithm to detect outliers of different types (uncensored and censored), for different proportions of censored data and locations of the outliers. The proposed formula for the one-step deletion method is validated with an example and a simulation study. Additionally, this research proposes a novel application of the EM algorithm for modeling right censored regression in the area of actuarial science. Both the EM and BJ algorithms are utilized in modeling health benefit data provided by the North Dakota Department of Veterans Affairs (ND DVA). The proposed model validation and diagnostic tools are applied using the EM algorithm. Results of this study can be of great benefit to government policy makers and pricing actuaries.
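The dissertation above fits right-censored linear regression with the EM algorithm under normal errors. A minimal sketch of that general idea, assuming an E-step built from the truncated-normal first and second moments and an OLS-style M-step (the function name em_censored_lm and the convergence rule are invented; this is not the dissertation's implementation and omits the BJ comparison, the reconstructed coefficient of determination, and the diagnostics):

```r
# Minimal EM sketch for linear regression with right-censored responses and normal errors
em_censored_lm <- function(X, y, cens, tol = 1e-6, max_iter = 500) {
  # X: model matrix; y: observed value or censoring point; cens: TRUE if right-censored
  beta  <- solve(crossprod(X), crossprod(X, y))           # naive OLS start
  sigma <- sd(as.vector(y - X %*% beta))
  for (iter in seq_len(max_iter)) {
    mu  <- as.vector(X %*% beta)
    z   <- (y - mu) / sigma
    lam <- dnorm(z) / pmax(pnorm(z, lower.tail = FALSE), 1e-12)   # inverse Mills ratio
    # E-step: expected response and squared response for censored cases, from the
    # normal distribution truncated below at the censoring point
    ey  <- ifelse(cens, mu + sigma * lam, y)
    ey2 <- ifelse(cens, mu^2 + sigma^2 + sigma * (y + mu) * lam, y^2)
    # M-step: regression on expected responses, then variance from expected squares
    beta_new <- solve(crossprod(X), crossprod(X, ey))
    mu_new   <- as.vector(X %*% beta_new)
    sigma    <- sqrt(mean(ey2 - 2 * mu_new * ey + mu_new^2))
    done <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (done) break
  }
  list(coefficients = as.vector(beta), sigma = sigma, iterations = iter)
}

# Hypothetical usage on simulated data with the top 30% of responses right-censored
set.seed(10365)
n <- 200; x <- rnorm(n)
y_full <- 1 + 2 * x + rnorm(n)
cutoff <- quantile(y_full, 0.7)
fit <- em_censored_lm(cbind(1, x), pmin(y_full, cutoff), y_full > cutoff)
fit$coefficients      # roughly recovers the true values (1, 2) in this simulation
```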