Statistics Masters Papers
Permanent URI for this collection: hdl:10365/32400
Recent Submissions
Item Comparing Tests for a Mixed Design with Block Effect (North Dakota State University, 2009) Zhao, Hui
Tests Comb and Comb II are used to test the equality of means in a mixed design, which is a combination of a randomized complete block design and a completely randomized design. The powers of Comb and Comb II for a mixed design have already been compared with Page's test (Magel, Terpstra, Wen (2009)) when there was little or no block effect added to the portion that was analyzed as a completely randomized design. In this paper, we wish to compare the tests when the portion of the design analyzed as a completely randomized design actually has a block effect. A Monte Carlo simulation study was conducted to compare the power of the three tests, where Page's test was used only on data from the randomized complete block portion. A variety of situations were considered. Three underlying distributions were included in the simulation study: the normal distribution, the exponential distribution, and the t distribution with 3 degrees of freedom. For every distribution, 16, 32 and 40 blocks were used in the randomized complete block design portion, where the equal sample size of the completely randomized data portion was 1/8, 1/4 and 1/2 the number of blocks considered. Unequal sample sizes for the completely randomized design portion were also considered. Powers were estimated for different location parameter arrangements for 3, 4 and 5 populations. Two variances, 0.25 and 1, for the block effect were used. The block factor added into the completely randomized design portion didn't change the test with the highest rejection percentage for the equal sample size cases, although the powers of the two tests for the mixed design decreased. For most of the unequal sample size cases, Page's test had the highest rejection percentage.
Overall, it was concluded that it was better to use one of the two tests for the mixed design instead of Page's test when there were equal sample sizes for the portion analyzed as a completely randomized design. When the sample sizes were not equal, but the first sample size was twice the size of the others, it was generally better to use Comb over Page's test unless the number of populations became very large or there was a large block effect variance.
Item Robust Tests for Cointegration with Application to Statistical Arbitrage Trading Strategies (North Dakota State University, 2010) Hanson, Thomas Alan
This study proposes two new cointegration tests that employ rank-based and least absolute deviation techniques to create a robust version of the Engle-Granger cointegration test. Critical values are generated through a Monte Carlo simulation over a range of error distributions, and the performance of the tests is then compared against the Engle-Granger and Johansen tests. The robust procedures underperform slightly for normally distributed error terms but outperform for fatter-tailed distributions. This characteristic suggests the robust tests are more appropriate for many applications where departures from normality are common. One particular example discussed here is statistical arbitrage, a stock trading strategy based on cointegration and mean reversion. In a simple example, the rank-based procedure produces additional profits over the Engle-Granger procedure.
Item A Pilot Study of Module Interconnectedness (North Dakota State University, 2010) Vanguru, Prasanth
Complexity plays an important role in understanding and working with a program, and has been measured in many different ways for software applications. Statistical analysis is one way to predict the pattern of complexity among the modules present in a software application. A random sample of twelve software applications was selected for this study to examine complexity.
A single pair of complexity measures was evaluated: the in-degrees and out-degrees for each module of an application. The next step was to try to fit suitable statistical distributions to the in-degrees and to the out-degrees. Using various statistical distributions such as the normal, log-normal, exponential, geometric, uniform, Poisson and chi-square, we try to determine the type of distribution for the in-degrees and the type of distribution for the out-degrees of the modules present in the software applications so that the pattern of complexity can be derived. The chi-square goodness of fit test was used to test various null hypotheses about the distributions for the in-degrees and for the out-degrees. Results showed that the pattern of in-degrees and the pattern of out-degrees both followed chi-square distributions.
Item Factors Influencing Carbon Sequestration in Northern Great Plains Grasslands (North Dakota State University, 2011) Annam
Soil development is influenced by the five soil forming factors: parent material, climate, landscape, organisms and time. This study was designed to examine the effects of landscape and organisms (vegetation) on carbon (C) in Conservation Reserve Program (CRP), restored grasslands, and undisturbed grasslands across the northern Great Plains of the U.S. using statistical methods. The effects of vegetation, slope, and aspect on C sequestered in the surface 30 cm of the soil for 997 sites sampled across portions of Iowa, Minnesota, Montana, and North and South Dakota were evaluated. A partial F-test was used to evaluate models to determine the significance of factors and their interaction effects. For the vegetation component of these models, cool season grasses with or without legumes showed higher levels of soil organic C than warm season grasses with or without legumes or mixed cool and warm season grass regimes.
When slopes were evaluated, slopes less than 3 % showed higher levels of sequestered C than slopes greater than 3 %. Southern and western aspects showed higher soil C levels than other aspects.
Item Optimizing Prediction Power of RNA-seq on Intrinsic Characteristics in Breast Cancer (North Dakota State University, 2022) Liu, Yuan
Breast cancer is the most common cancer in women worldwide, and accurate and early detection of breast cancer is vital in characterizing the disease. Abundant tumor and cell state information is embedded in transcriptomic expression. However, selecting a good pipeline for applying mRNA expression is critical in downstream characteristics prediction. We designed a study that focused on determining the best combinations of preprocessing steps for prediction. We tested six normalization methods, two gene selection methods, and over ten machine learning algorithms. Using appropriate evaluation metrics, we recommend the FPKM normalization method combined with either gene selection method, employing RF for the purpose of breast cancer downstream prediction.
Item The Influence of Race, Age, Comorbidities, and BMI on Disability Following Stroke in Elderly People Living in Their Own Home (North Dakota State University, 2020) Endo, Izumi
Stroke is one of the major health issues in the United States. I explored different aspects of disability based on a history of stroke, race, comorbidities, age, and body mass index for the population of community dwelling stroke survivors. Analysis was performed using a dataset drawn from the first wave of the longitudinal study of the National Social Life, Health, and Aging Project (Waite et al., 2019). The dataset consists of a nationally representative sample of 3,005 community dwelling people between the ages of 57 and 86 years old at the time of recruitment.
The results demonstrated that a history of stroke, the presence of comorbidities such as arthritis, chronic obstructive pulmonary disease, asthma, and heart failure, age, and body mass index significantly influenced the amount of disability an elderly person had. Screening for and addressing these issues is essential to lower the amount of disability in the elderly population.
Item Rutin Extraction and Content in Buckwheat (Fagopyrum esculentum) Bran-Fortified Pasta (North Dakota State University, 2019) Kaiser, Amber Christine
The objectives of this study were to optimize extraction of rutin from buckwheat bran and buckwheat bran-fortified spaghetti and to determine the stability of rutin during spaghetti production and preparation. Aqueous ethanol and ethanol at 50, 60, 70, 80, and 90 % were used with Soxhlet or ultrasound-assisted extraction methods, and 80 % methanol extraction was evaluated with or without papain treatment. The optimal extraction treatment (80 % methanol using ultrasound-assisted extraction without enzyme treatment) was used to determine rutin content in buckwheat bran-fortified spaghetti dried at low (40 °C) or high (90 °C) temperature. Rutin content was evaluated in raw, hydrated, extruded, dried, and cooked pasta. High temperature drying reduced rutin content more than low temperature drying, and total reduction in rutin content from raw pasta mix to cooked pasta was 25 – 30 %.
Item Comparison of Proposed K Sample Tests with Dietz's Test For Nondecreasing Ordered Alternatives for Bivariate Exponential Data (North Dakota State University, 2011) Pothana, Jyothsnadevi
Comparison of powers is essential to determine the best test that can be used for data under certain specific conditions. Several nonparametric methods have been developed for testing ordered alternatives. The Jonckheere-Terpstra (JT) test and the Modified Jonckheere-Terpstra (MJT) test are for testing nondecreasing ordered alternatives for univariate data.
The Dietz test is for testing nondecreasing alternatives based on bivariate data. This paper compares various tests when testing for nondecreasing alternatives, specifically when the underlying distributions are bivariate exponential. The JT test and the MJT test are applied to univariate data derived by reducing bivariate data to univariate data using various transformations. A Monte Carlo simulation study is conducted comparing the estimated powers of the JT tests and MJT tests (based on a variety of transformed univariate data) with the estimated powers of the Dietz test (based on bivariate data) under a variety of location shifts and sample sizes. The results are compared with Zhao's (2011) results for bivariate normal data. The overall best test statistic for bivariate data ordered alternatives is discussed in this paper.
Item A Proposed Nonparametric Test for Simple Tree Alternative in a BIBD Design (North Dakota State University, 2011) Wang, Zhuangli
A nonparametric test is proposed to test for the simple tree alternative in a Balanced Incomplete Block Design (BIBD). The details of the test statistic when the null hypothesis is true are given. The paper also introduces the calculations of the means and variances under a variety of situations. A Monte Carlo simulation study based on SAS is conducted to compare the powers of the new proposed test and the Durbin test. The simulation study is used to generate the BIBD data from three distributions: the normal distribution, the exponential distribution, and the Student's t distribution with three degrees of freedom. The powers of the proposed test and the Durbin test are both estimated based on 10,000 iterations for three, four, and five treatments, and for different location shifts.
According to the results of the simulation study, the Durbin test is better when at least one treatment mean is close to or equal to the control mean; otherwise, the proposed test is better.
Item Investment Behavior Analysis Based on Tail Risk Management (North Dakota State University, 2018) Sun, Yu
As behavioral finance becomes more prevalent in academia, a study is worth conducting to pinpoint investors’ preferences through managing the tail risk of asset portfolios. This study investigates investors’ investment behaviors by modeling their investment personalities based on tail risk management. We incorporate the CVaR approach to model traditional and non-traditional investment behaviors by reshaping the tails of portfolio return. To be specific, we build models that maximize left-tail CVaR, minimize right-tail CVaR, and minimize left-tail CVaR, and a mixed model that maximizes left-tail CVaR and minimizes right-tail CVaR simultaneously, based on various groups of rational and irrational investors. Our work incorporates empirical historical data and Monte Carlo simulation to compare these models with the classical Markowitz approach along different dimensions. We contribute to filling the gap by conducting a more comprehensive study that incorporates investors’ psychological factors and explores economic information regarding the asset pricing puzzle and long-run risk.
Item Robust D-Optimal Design for Multiple Nominal Parameter Values under the 5PL-1P Model (North Dakota State University, 2018) Liang, Cuiping
A robust D-optimal design that works well for multiple nominal parameter values is presented in this paper. In general, D-optimal design works very well for estimating the model parameters, but it is very sensitive to multiple nominal model parameter values when the response is modeled by nonlinear models. The 5PL-1P model is considered in this study to describe a dose-response function.
The sensitivity of the D-optimal design to the model parameter values under the 5PL-1P model is studied. A robust D-optimal design that can reduce the impact of the multiple nominal model parameter values is proposed using the Bayesian technique. Lastly, we compare the performance of the proposed design to other well-known designs for estimating the model parameters under the 5PL-1P model.
Item Exploring Associations between Lifestyles and Metabolic Syndrome in Middle-Age Chinese Population (North Dakota State University, 2018) Zhou, Xiaoyi
Metabolic Syndrome (MetS) now affects many middle-age people in China. MetS is associated with the risk of type 2 diabetes and cardiovascular disease. Identifying the potential risk factors that contribute to MetS is very important for preventing cardiovascular disease. The associations between lifestyles and prevalence of MetS are extensively studied by researchers. A cross-sectional study conducted by Strand, MA, surveyed 659 subjects in Yuci, China in 2012. The proportional odds model was applied to determine the associations between lifestyles and MetS in three Chinese middle-age groups. The results demonstrated that doing daily exercise was one of the best methods to treat MetS. Moderate alcohol consumption could prevent MetS in the age group born in 1956. Occasional milk consumption could prevent MetS in the age group born in 1964, while it did not help the age groups born in 1960-1961 and in 1956.
Item Forecasting Point Spread for Women’s Volleyball (North Dakota State University, 2016) Zhang, Deling
Volleyball has become a well-known and competitive sport with physical and technical performances over the years. The game results are determined by some important factors, such as the players and the team’s skills to succeed in a championship. In this research, we propose to analyze volleyball data by using a multiple linear regression model and a logistic regression model.
We develop a multiple regression model using in-game statistics that explain the point spread of a volleyball game. We also develop a logistic regression model that estimates the probability of a team winning the game based on the in-game statistics. Both models are validated, and the point spread model is then used to predict the results of a volleyball game, replacing the in-game statistics with the averages of the in-game statistics from the two previous matches of both teams. Results are given.
Item Prediction of the World Cup Soccer Winner: Using Two Statistical Methods (North Dakota State University, 2016) Sylla, Mohamed Dit Mody
Soccer is considered the most popular sport on earth, and applying statistical models to analyze small soccer data has been of keen interest to modern researchers. Statistical modeling of soccer data also provides guidance and assistance to stakeholders. The goal of this paper is to establish a consistent statistical approach to help in the prediction of future World Cup championships. Ordinary least squares regression is used to develop models which predict the goal margin of games, and logistic regression is used to develop models which estimate the probability of a team winning the game. Discriminant analysis was also used to determine which variables significantly influence individual game wins. The Fisher classification procedure allows for interpretability while providing a robust approach to classifying the 32 contestants of the 2014 World Cup using the previous data from the 2006 and 2010 World Cup Championships.
Item Clustering Algorithm Comparison for Ellipsoidal Data (North Dakota State University, 2015) Loeffler, Shane Robert
Cluster analysis is the statistical technique of identifying data points and assigning them to meaningful clusters.
The purpose of this paper is to compare different types of clustering algorithms to find the clustering algorithm that performs the best for varying complexities in Gaussian data. The clustering algorithms used include Partitioning Around Medoids (PAM), K-means, and Hierarchical with different linkages (Ward’s linkage, Single linkage, Complete linkage, Average linkage, McQuitty’s method, Gower’s method, and Centroid method). The different types of complexities include different numbers of dimensions, average pairwise overlap between clusters, and number of points simulated from each cluster. After the data is simulated, the Adjusted Rand Index will be used to gauge the performance of the clusters. From that, a t-test will also be used to see if there are any clustering algorithms that perform as well as other clustering algorithms.
Item Identifying Significant Factors Influencing Metabolic Syndrome In China (North Dakota State University, 2015) Gu, Xiaoxue
Metabolic Syndrome occurs when a person’s body does not properly use and store energy. The disease has five criteria: abdominal obesity, insulin resistance, hypertension, dyslipidemia, and impaired glucose regulation. The purpose of this paper was to analyze longitudinal data obtained from China. The data was collected using surveys in 2008 and 2012. To find the factors that contributed significantly to the development of Metabolic Syndrome, a marginal model was applied. To fit the marginal model, the Generalized Estimating Equation method was used.
The developed model did not have high accuracy in presenting the proportion of true results (Metabolic Syndrome observed and no Metabolic Syndrome observed).
Item Comparison of Classification Rates among Logistic Regression, Neural Network and Support Vector Machines in the Presence of Missing Data (North Dakota State University, 2014) Upadhyaya, Sudhi
Statistical models such as Logistic Regression (LR), Neural Network (NN) and Support Vector Machines (SVM) often use datasets with missing values while making inferences regarding the population. When inferences are made based on the data set used, the presence of missing data can severely skew the results and distort the efficiency of the model. Our objective was to identify a robust model among LR, NN, and SVM in the presence of missing data. The study was conducted by simulating observations based on Monte Carlo methods, and missing data was introduced randomly at the 10% level. Single mode imputation was used to impute missing values. Simple random samples of 120, 240 and 500 observations were chosen, and these three models were fit for two scenarios. Results showed that the performance of SVM was far superior compared to LR or NN models. However, the classification accuracy of SVM gradually decreased as sample size increased.
Item An Application of Simplicial Intercept Depth (SID) Method for Fitting Linear Models (North Dakota State University, 2014) Sun, Zhongxing
This paper presents an application based on the Simplicial Intercept Depth method introduced by Liu (2004). We use this method to get the best linear fit of the phenotypic data for spot blotch resistant reaction of two different barley groups. The Simplicial Intercept Depth method is generalized by Simplicial Depth, also proposed by Liu in 1990. It provides a robust way for data analysis when outliers appear.
In this paper, we use the bootstrapping method, introduced by Bradley Efron (1979), to resample from the original dataset to get a distribution of the estimates. We also compare the SID with least squares regression and the Theil-type estimate introduced by Shen (2009). The result shows that the SID is a robust method for estimating the coefficients of the linear regression model.
Item Ds-Optimal Design for Model Discrimination in a Probit Model (North Dakota State University, 2014) Liu, Ruifeng
In toxicology studies, dose response functions with a downturn at higher doses are often observed. For such response functions, researchers often want to see if the downturn of the response is significant. A probit model with a quadratic term is adopted to demonstrate the dose response with a downturn. Under the probit model, we obtain optimal designs to study the significance of the downturn, and their efficiencies are compared. Our approach identifies the upper bound of the number of optimal design points and searches for the optimal design numerically based on the upper bound.
Item Bracketing NCAA Men's Division I Basketball Tournament (North Dakota State University, 2013) Zhang, Xiao
This paper presents a new bracketing method for all 63 games in the NCAA Division I basketball tournament. This method, based on the logistic conditional probability models, is self-consistent in terms of constructing winning probabilities of each game. Empirical results show that this method outperforms the ordinal logistic regression and expectation method with restriction (Restricted OLRE model) proposed by West (2006).