Statistics
Permanent URI for this community: hdl:10365/32398
Research from the Department of Statistics. The department website may be found at https://www.ndsu.edu/statistics/
Proceedings for the annual Red River Valley Statistical Conferences may be found at http://hdl.handle.net/10365/26113
Browse
Browsing Statistics by Title
Now showing 1 - 20 of 123
Item: Adaptive Two-Stage Optimal Design for Estimating Multiple EDps under the 4-Parameter Logistic Model (North Dakota State University, 2018). Zhang, Anqing.
In dose-finding studies, c-optimal designs provide the most efficient design for studying a target dose of interest. However, there is no guarantee that a c-optimal design that works best for estimating one specific target dose still performs well for estimating other target doses. Given the demand for estimating multiple target dose levels, the robustness of the optimal design becomes important. In this study, the 4-parameter logistic model is adopted to describe dose-response curves. Under nonlinear models, the optimal design depends on the pre-specified nominal parameter values. If the pre-specified parameter values are not close to the true values, optimal designs become far from optimal. In this research, I study an optimal design that works well for estimating multiple EDps under unknown parameter values. To address this parameter uncertainty, a two-stage design technique is adopted using two different approaches. One approach is to utilize a design augmentation at the second stage; the other is to apply a Bayesian paradigm to find the optimal design at the second stage. For the Bayesian approach, one challenging task is the heavy numerical computation required when searching for the Bayesian optimal design. To overcome this problem, a clustering method can be applied. These two-stage design strategies are applied to construct a robust optimal design for estimating multiple EDps. Through a simulation study, the proposed two-stage optimal designs are compared with the traditional uniform design and the enhanced uniform design to see how well they perform in estimating multiple EDps when the parameter values are mis-specified.
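For reference, the 4-parameter logistic dose-response curve named in the title can be written down directly. The short Python sketch below uses a common parameterization (lower and upper asymptotes, slope, and ED50); the exact form and the parameter values are assumptions for illustration, since the abstract does not spell out the parameterization used in the thesis.

    import numpy as np

    def four_pl(dose, lower, upper, slope, ed50):
        """Common 4-parameter logistic dose-response curve (assumed parameterization)."""
        dose = np.asarray(dose, dtype=float)
        return lower + (upper - lower) / (1.0 + (dose / ed50) ** (-slope))

    # Example: mean response on a small dose grid for illustrative parameter values.
    print(four_pl([0.5, 1.0, 2.0, 4.0], lower=0.1, upper=0.9, slope=1.5, ed50=2.0))

At dose = ED50 the curve sits exactly halfway between the two asymptotes, which is what makes ED50-type target doses natural quantities to design for.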
Item: Analysis of Bootstrap Techniques for Loss Reserving (North Dakota State University, 2015). Chase, Taryn Ruth.
Insurance companies must have an appropriate method of estimating future reserve amounts. These values directly influence the rates that are charged to the customer. This thesis analyzes stochastic reserving techniques that use bootstrap methods to obtain variability estimates of predicted reserves. Bootstrapping techniques are of interest because they usually do not require advanced statistical software to implement. Some bootstrap techniques incorporate generalized linear models in order to produce results. To assess how well these methods perform, data with known future losses was obtained from the National Association of Insurance Commissioners. Analysis of this data shows that most bootstrapping methods produce results that are comparable to one another and to the trusted Chain Ladder method. The methods are then applied to loss data from a small Midwestern insurance company to predict the variation of its future reserve amounts.

Item: An Analysis of Factors Contributing to Wins in the National Hockey League (North Dakota State University, 2013). Roith, Joseph Michael.
This thesis looks at the common factors that have the largest impact on winning games in the NHL. Data was collected from regular season games for all teams in the NHL over seven seasons. Logistic and least squares regressions were performed to create a win probability model and a goal margin model to predict the outcome of games. Discriminant analysis was also used to determine significant factors over the course of an entire season. Save percentage margin, shot margin, block margin, short-handed shot margin, short-handed faceoff percentage, and even-handed faceoff percentage were found to be significant influences on individual game wins. Total goals, total goals against, and takeaway totals for a season were enough to correctly predict whether a team made the playoffs 87% of the time. The accuracy of the models was then tested by predicting the outcomes of games from the 2012 NHL regular season.

Item: Analysis of Salary for Major League Baseball Players (North Dakota State University, 2014). Hoffman, Michael Glenn.
This thesis examines the salary of Major League Baseball (MLB) players and whether players are paid based on their on-the-field performance. Salary was examined against both the yearly production and the overall career production of the player. Several different production statistics were collected for the 2010-2012 MLB seasons. A random sample of players was selected from each season, and separate models were created for position players and pitchers. Significant production statistics that were helpful in predicting salary were selected for each model. These models were deemed to be good models, having a predictive r-squared value of at least 0.70 for each of the different models. After the regression models were found, they were tested for accuracy by predicting the salaries of a random sample of players from the 2013 MLB season.

Item: Analysis of Significant Factors in Division I Men's College Basketball and Development of a Predictive Model (North Dakota State University, 2013). Unruh, Samuel Paul.
While a number of statistics are collected during an NCAA Division I men's college basketball game, it is potentially of interest to universities, coaches, players, and fans which of these variables are most significant in determining wins and losses. To this end, statistics were collected from two seasons of games and analyzed using logistic and least squares regression methods. The differences between the two competing teams in four common statistics were found to be significant in determining victory: assists, free throw attempts, defensive rebounds, and turnovers. The logistic and least squares models were then used with data from the 2011-2012 season to verify the accuracy of the models. To determine the accuracy of the models in predicting future game outcomes, four-prior-game median statistics were collected for teams competing in a sample of games from 2011-2012, with the differences taken and used in the models.
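Several of the sports theses above fit logistic regression models to differences in team statistics to estimate a win probability. A minimal Python sketch of that general idea follows; the particular features, the synthetic data, and the use of scikit-learn are illustrative assumptions, not the models actually fit in these theses.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Synthetic "team A minus team B" differences in a few game statistics
    # (e.g., assists, free throw attempts, defensive rebounds, turnovers).
    n = 500
    X = rng.normal(size=(n, 4))
    # Assumed true effects for the simulation: turnovers hurt, the rest help.
    beta = np.array([0.6, 0.3, 0.5, -0.7])
    p_win = 1.0 / (1.0 + np.exp(-(X @ beta)))
    y = rng.binomial(1, p_win)

    model = LogisticRegression().fit(X, y)

    # Estimated win probability for a new game's statistic differences.
    new_game = np.array([[2.0, -1.0, 3.0, -2.0]])
    print(model.predict_proba(new_game)[0, 1])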
Item: An Analysis of the NBA Draft: Are Teams Drafting Better and Does College Experience Truly Matter (North Dakota State University, 2022). Wolfe, Kevin.
This thesis attempts to answer two questions: are NBA organizations doing a reasonable job of drafting players and getting better at the process, and does college experience play a significant role in a player's performance during their early NBA career (first 3 seasons)? With regard to these two questions, we determined through our research that NBA organizations are not showing any significant improvement in their ability to draft the best available players, which is surprising given the increase in scouting data teams currently have access to. We suspected, however, that this lack of drafting improvement might be related to players entering the NBA with less college experience. However, after we determined that college experience does not appear to play a large role in a player's early-career NBA performance, we concluded that experience does not appear to be the reason why teams aren't doing a better job of drafting.

Item: Analyzing and Controlling Biases in Student Rating of Instruction (North Dakota State University, 2019). Zhou, Yue.
Many colleges and universities have adopted the student ratings of instruction (SROI) system as one of the measures of instructional effectiveness. This study aims to establish a predictive model and address two questions related to SROI: firstly, whether gender bias against female instructors at North Dakota State University (NDSU) exists and, secondly, how other factors related to students, instructors, and courses affect the SROI. In total, 30,303 SROI from seven colleges at NDSU for the 2013-2014 academic year are studied. Our results demonstrate that there is a significant association between students' gender and instructors' gender in the rating scores. Therefore, we cannot determine how the gender of an instructor affects the course rating unless we know the gender composition of the students in that class. Predictive proportional odds models for the students' ordinal categorical ratings are established.

Item: An Application of Simplicial Intercept Depth (SID) Method for Fitting Linear Models (North Dakota State University, 2014). Sun, Zhongxing.
This paper presents an application based on the Simplicial Intercept Depth (SID) method introduced by Liu (2004). We use this method to obtain the best linear fit of the phenotypic data for the spot blotch resistance reaction of two different barley groups. The Simplicial Intercept Depth method is a generalization of simplicial depth, also proposed by Liu in 1990. It provides a robust way to analyze data when outliers appear. In this paper, we use the bootstrapping method introduced by Bradley Efron (1979) to resample from the original dataset and obtain a distribution of the estimates. We also compare the SID with least squares regression and with the Theil-type estimate introduced by Shen (2009). The results show that the SID is a robust method for estimating the coefficients of the linear regression model.
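The comparison of a robust fit with ordinary least squares in the presence of outliers, as described above, can be illustrated with standard tools. The sketch below uses SciPy's Theil-Sen slope estimator as a stand-in, since simplicial intercept depth itself is not available in common libraries, and the contaminated data are synthetic; it shows only the robust-versus-OLS contrast, not the SID method.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # Synthetic straight-line data with a few gross outliers.
    x = np.linspace(0, 10, 50)
    y = 2.0 + 0.8 * x + rng.normal(scale=0.3, size=x.size)
    y[:5] += 15.0  # contaminate a handful of observations

    # Ordinary least squares slope and intercept.
    ols = stats.linregress(x, y)

    # Theil-Sen slope and intercept (median-based, robust to the outliers).
    ts_slope, ts_intercept, _, _ = stats.theilslopes(y, x)

    print("OLS:       slope %.2f intercept %.2f" % (ols.slope, ols.intercept))
    print("Theil-Sen: slope %.2f intercept %.2f" % (ts_slope, ts_intercept))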
Item: Assessing Changes in Within Individual Variation Over Time for Nutritional Intake Data Using 24 Hour Recalls from the National Health and Examination Survey (North Dakota State University, 2012). Brandt, Kyal Scott.
Nutritional surveys often use 24-hour recalls to assess the nutritional intake of certain populations. The National Health and Examination Survey (NHANES) collects two 24-hour recalls for each individual in the study. This small sampling can lead to a great deal of variation due to day-to-day differences in an individual's intake, making it difficult to assess "usual intake." The ISU method, implemented in the PC-Side software package, breaks the observed variation into two components: within-individual variation (WIV) and between-individual variation (BIV). In this paper, we use the PC-Side software to obtain WIV estimates for several age groups, genders, and nutrients from NHANES nutrition data. We will look at how WIV estimates change over time and at using past WIV estimates to obtain a "usual intake" distribution and to calculate the proportion below an estimated average requirement (EAR).

Item: Bayesian Lasso Models – With Application to Sports Data (North Dakota State University, 2018). Gao, Di.
Several statistical models have been proposed by researchers to correctly predict the winners of sports games, for example, the generalized linear model (Magel & Unruh, 2013) and the probability self-consistent model (Shen et al., 2015). This work studied Bayesian Lasso generalized linear models. A hybrid estimation approach combining full and empirical Bayes was proposed. A simple and efficient method in the EM step, which does not require the sample mean from the random samples, was also introduced. The expectation step was reduced to deriving the theoretical expectation directly from the conditional marginal. The findings of this work suggest that future applications will significantly cut down the computational load. Due to the desirable geometric property of the Lasso (Tibshirani, 1996), the Lasso method provides sharp power in selecting significant explanatory variables and has become very popular for big data problems over the last 20 years. This work was constructed with the Lasso structure and hence can also serve well for dimension reduction. Dimension reduction is necessary when the number of observations is less than the number of parameters or when the design matrix is not of full rank. A simulation study was conducted to test the power of dimension reduction and the accuracy and variation of the estimates. For an application of the Bayesian Lasso probit linear regression to live data, NCAA March Madness (Men's Basketball Division I) was considered. In the end, the predicted bracket was compared with the real tournament result, and the model performance was evaluated by the bracket scoring system (Shen et al., 2015).
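The variable-selection behavior of the Lasso penalty mentioned above (some coefficients shrunk exactly to zero) is easy to demonstrate. The sketch below uses scikit-learn's ordinary Lasso on synthetic data as an illustration only, not the Bayesian hybrid estimator developed in the thesis; the data sizes and penalty level are assumptions.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(2)

    # 50 observations, 20 candidate predictors, only 3 truly nonzero effects.
    n, p = 50, 20
    X = rng.normal(size=(n, p))
    beta = np.zeros(p)
    beta[[0, 5, 12]] = [2.0, -1.5, 1.0]
    y = X @ beta + rng.normal(scale=0.5, size=n)

    fit = Lasso(alpha=0.1).fit(X, y)

    # The L1 penalty sets most of the irrelevant coefficients exactly to zero.
    print("selected predictors:", np.flatnonzero(fit.coef_))

This also illustrates why a Lasso-type structure helps when the number of predictors approaches or exceeds the number of observations: the penalty makes the fit well defined while discarding uninformative columns.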
Item: Bayesian Sparse Factor Analysis of High Dimensional Gene Expression Data (North Dakota State University, 2019). Zhao, Jingjun.
This work closely studied fundamental techniques of the Bayesian sparse factor analysis model: constrained least squares regression, Bayesian Lasso regression, and some popular sparsity-inducing priors. In Appendix A, we introduce each of the fundamental techniques in a coherent manner and provide detailed proofs of important formulas and definitions. We consider this introduction and the detailed proofs, which are very helpful in learning Bayesian sparse factor analysis, a contribution of this work. We also systematically studied BicMix, a computationally tractable biclustering approach for identifying co-regulated genes, by deriving all point estimates of the parameters and by running the method on both simulated data sets and a real high-dimensional gene expression data set. The missing derivations of all point estimates in BicMix are provided for a better understanding of the variational expectation maximization (VEM) algorithm. The performance of the method in identifying true biclusters is analyzed using the experimental results.

Item: Boundary Estimation (North Dakota State University, 2015). Mu, Yingfei.
Existing statistical methods do not provide a satisfactory solution to determining the spatial pattern in spatially referenced data, which is often required by research in many areas, including geology, agriculture, forestry, marine science, and epidemiology, for identifying the source of unusual environmental factors associated with a certain phenomenon. This work provides a novel algorithm which can be used to delineate the boundary of an area of hot spots accurately and efficiently. Our algorithm, first of all, does not assume any pre-specified geometric shape for the change-curve. Secondly, the computational complexity of our novel algorithm for change-curve detection is of order O(n²), which is much smaller than the 2^O(n²) required by the CUSP algorithm proposed in Müller & Song [8] and by Carlstein's [2] estimators. Furthermore, our novel algorithm yields a consistent estimate of the change-curve as well as of the underlying distribution mean of observations in the regions. We also study the hypothesis test of the existence of the change-curve under independence of the spatially referenced data. We then provide some simulation studies as well as a real case study to compare our algorithm with a popular boundary estimation method, the spatial scan statistic.

Item: Bracketing NCAA Men's Division I Basketball Tournament (North Dakota State University, 2013). Zhang, Xiao.
This paper presents a new bracketing method for all 63 games in the NCAA Division I basketball tournament. This method, based on logistic conditional probability models, is self-consistent in constructing the winning probability of each game. Empirical results show that this method outperforms the ordinal logistic regression and expectation method with restriction (Restricted OLRE model) proposed by West (2006).

Item: Bracketing the NCAA Women's Basketball Tournament (North Dakota State University, 2014). Wang, Wenting.
This paper presents a bracketing method for all 63 games in the NCAA Division I Women's basketball tournament. Least squares models and logistic regression models for Round 1, Round 2, and Rounds 3-6 were developed to predict the winners of basketball games in each of those rounds of the NCAA Women's Basketball tournament. For the first round, three-point goals, free throws, blocks, and seed were found to be significant; for the second round, field goals and average points were found to be significant; for the third and higher rounds, assists, steals, and seed were found to be significant. A complete bracket was filled out in 2014 before any game was played. When the differences of the seasonal averages for both teams for all previously mentioned variables were considered for entry in the least squares models, the models had approximately a 76% chance of correctly predicting the winner of a basketball game.

Item: Clustering Algorithm Comparison for Ellipsoidal Data (North Dakota State University, 2015). Loeffler, Shane Robert.
The main objective of cluster analysis is the statistical technique of identifying data points and assigning them to meaningful clusters. The purpose of this paper is to compare different types of clustering algorithms to find the clustering algorithm that performs best for varying complexities of Gaussian data. The clustering algorithms used include Partitioning Around Medoids (PAM), K-means, and hierarchical clustering with different linkages (Ward's linkage, single linkage, complete linkage, average linkage, McQuitty's method, Gower's method, and the centroid method). The different types of complexity include the number of dimensions, the average pairwise overlap between clusters, and the number of points simulated from each cluster. After the data are simulated, the Adjusted Rand Index is used to gauge the performance of the clusterings. From that, a t-test is also used to see if any clustering algorithms perform as well as the other clustering algorithms.
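The Adjusted Rand Index comparison described in the clustering study above can be sketched with standard tools. The Python code below simulates two elongated Gaussian clusters and scores K-means against an agglomerative (Ward) clustering; the cluster shapes, sizes, and the restriction to two of the compared algorithms are choices made purely for illustration.

    import numpy as np
    from sklearn.cluster import KMeans, AgglomerativeClustering
    from sklearn.metrics import adjusted_rand_score

    rng = np.random.default_rng(3)

    # Two elongated (ellipsoidal) Gaussian clusters in 2 dimensions.
    a = rng.multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=150)
    b = rng.multivariate_normal([5, 3], [[4.0, 1.5], [1.5, 1.0]], size=150)
    X = np.vstack([a, b])
    truth = np.repeat([0, 1], 150)

    kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    ward_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

    # Adjusted Rand Index: 1 means perfect agreement with the true labels.
    print("K-means ARI:", adjusted_rand_score(truth, kmeans_labels))
    print("Ward    ARI:", adjusted_rand_score(truth, ward_labels))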
Item: Community detection in censored hypergraph (North Dakota State University, 2024). Bin, Zhao.
Networks, or graphs, represent relationships between entities in various applications, such as social networks, biological systems, and communication networks. A common feature in network data is the presence of community structure, where groups of nodes exhibit higher connectivity within themselves than with other groups. Identifying these community structures, a task known as community detection, is essential for gaining valuable insights in diverse applications, including uncovering hidden relationships in social networks, detecting functional modules in biological systems, and identifying vulnerabilities in communication networks. However, real-world network data may have missing values, which significantly impact the network's structural properties. Existing community detection methods primarily focus on networks without missing values, leaving a gap in the analysis of censored networks. This study addresses the community detection problem in censored m-uniform hypergraphs. First, utilizing an information-theoretic approach, we obtain a threshold that enables the exact recovery of the community structure. Then, we propose a two-stage polynomial-time algorithm, comprising a spectral algorithm complemented by a refinement step, that aims to achieve exact recovery. Moreover, we introduce a semi-definite relaxation algorithm and study its performance as a standalone community detection algorithm, without the integration of a refinement step. Lastly, in consideration of the effect of imputation methods on censored hypergraphs, we propose several methods grounded in network properties and employ simulation to assess their performance. Finally, we apply the proposed algorithm to real-world data, showcasing its practical utility in various settings.
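A very small sketch of the spectral idea behind the first stage of such algorithms is given below: for an ordinary (2-uniform) graph drawn from a two-block stochastic block model, the sign pattern of the second eigenvector of the adjacency matrix recovers most community labels. The block-model parameters and the use of a plain, fully observed graph rather than a censored hypergraph are simplifying assumptions for illustration, not the algorithm developed in the thesis.

    import numpy as np

    rng = np.random.default_rng(4)

    # Two-block stochastic block model: within-block edge prob. 0.25, between-block 0.05.
    n, half = 200, 100
    labels = np.repeat([0, 1], half)
    P = np.where(labels[:, None] == labels[None, :], 0.25, 0.05)
    upper = np.triu(rng.random((n, n)) < P, k=1)
    A = (upper | upper.T).astype(float)  # symmetric adjacency matrix, no self-loops

    # Spectral step: split nodes by the sign of the second-largest eigenvector.
    eigvals, eigvecs = np.linalg.eigh(A)
    second = eigvecs[:, -2]  # eigh returns eigenvalues in ascending order
    estimate = (second > 0).astype(int)

    # Agreement with the truth, up to relabeling of the two communities.
    agree = (estimate == labels).mean()
    print("fraction correctly recovered:", max(agree, 1 - agree))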
Item: Comparative Analysis of Traditional and Modified DECODE Method in Small Sample Gene Expression Experiments (North Dakota State University, 2018). Neset, Katie.
Background: The DECODE method integrates differential co-expression and differential expression analysis to better understand the biological functions of genes and their associations with disease. The DECODE method was originally designed to analyze large-sample gene expression experiments; however, most gene expression experiments consist of small samples. This paper proposes a modified test statistic to replace the traditional test statistic in the DECODE method. Using three simulation studies, we compare the performance of the modified and traditional DECODE methods with measures of sensitivity, positive predictive value (PPV), false discovery rate (FDR), and overall error rate for genes found to be highly differentially expressed and highly differentially co-expressed. Results: In comparisons of sensitivity and PPV, a minor increase is seen when using the modified DECODE method, along with a minor decrease in FDR and overall error rate. Thus, the modified DECODE method is recommended for small sample sizes.

Item: Comparative Classification of Prostate Cancer Data using the Support Vector Machine, Random Forest, Dualks and k-Nearest Neighbours (North Dakota State University, 2015). Sakouvogui, Kekoura.
This paper compares four classification tools, the Support Vector Machine (SVM), Random Forest (RF), DualKS, and k-Nearest Neighbors (kNN), which are based on different statistical learning theories. The dataset used is a microarray gene expression dataset of 596 male patients with prostate cancer. After treatment, the patients were classified into one phenotype group with three levels: PSA (Prostate-Specific Antigen), Systematic, and NED (No Evidence of Disease). The purpose of this research is to determine the performance rate of each classifier by selecting the optimal kernels and parameters that give the best prediction rate of the phenotype. The paper begins with a discussion of previous implementations of the tools and their mathematical theories. The results show that three classifiers achieved comparable performance that was above average, while DualKS did not. We also observed that SVM outperformed the kNN, RF, and DualKS classifiers.

Item: A Comparative Multiple Simulation Study for Parametric and Nonparametric Methods in the Identification of Differentially Expressed Genes (North Dakota State University, 2021). Palmer, Daniel Grant.
RNA-seq data simulated from a negative binomial distribution, sampled without replacement, or modified from read counts were analyzed to compare differential gene expression analysis methods in terms of false discovery rate control and power. The goals of the study were to determine the optimal sample sizes and proportions of differential expression needed to adequately control the false discovery rate, and to determine which differential gene expression methods performed best under the given simulation methods. Parametric tools like edgeR and limma-voom tended to be conservative in controlling the false discovery rate from a negative binomial distribution as the proportion of differential expression increased. For the nonparametric simulation methods, many differential gene expression methods did not adequately control the false discovery rate, and results varied greatly when different reference data sets were used for the simulations.
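False discovery rate control, the criterion used in the simulation study above, is most commonly implemented with the Benjamini-Hochberg step-up procedure. A small self-contained sketch follows, with made-up p-values; it illustrates the procedure itself, not any specific tool evaluated in the thesis.

    import numpy as np

    def benjamini_hochberg(pvalues, alpha=0.05):
        """Return a boolean mask of rejections under the Benjamini-Hochberg procedure."""
        p = np.asarray(pvalues, dtype=float)
        m = p.size
        order = np.argsort(p)
        ranked = p[order]
        # Largest k with p_(k) <= (k / m) * alpha; reject the k smallest p-values.
        thresholds = alpha * np.arange(1, m + 1) / m
        below = ranked <= thresholds
        reject = np.zeros(m, dtype=bool)
        if below.any():
            k = np.max(np.nonzero(below)[0])
            reject[order[: k + 1]] = True
        return reject

    # Example: a few significant-looking p-values mixed with uniform-looking noise.
    pvals = np.array([0.001, 0.004, 0.019, 0.03, 0.2, 0.41, 0.6, 0.74, 0.88, 0.95])
    print(benjamini_hochberg(pvals, alpha=0.05))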
Item: Comparing Accuracies of Spatial Interpolation Methods on 1-Minute Ground Magnetometer Readings (North Dakota State University, 2017). Campbell, Kathryn Mary.
Geomagnetic disturbances caused by external solar events can create geomagnetically induced currents (GIC) throughout conducting networks on Earth's surface. GIC can cause disruption ranging from minor to catastrophic. However, systems can implement preemptive measures to mitigate the effects of GIC with the use of GIC forecasting. Accurate forecasting depends on accurate modeling of Earth's geomagnetic field. Unfortunately, it is not currently possible to have a measurement at every point of Earth's field, so spatial interpolation methods can be used to fill in the unmeasured space. The performance of two spatial interpolation methods, Inverse Distance Weighting and Kriging, is assessed to determine which better predicts the unmeasured space. Error testing shows the two methods to be comparable, with the caveat that Kriging gives tighter precision on its predictions.
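Of the two interpolation methods compared above, Inverse Distance Weighting is simple enough to sketch in a few lines. The version below uses a power-2 weighting and made-up station coordinates and readings, which are assumptions for illustration; Kriging additionally requires fitting a variogram and is not shown.

    import numpy as np

    def idw(stations, values, targets, power=2.0):
        """Inverse Distance Weighting: weighted average with weights 1 / distance**power."""
        stations = np.asarray(stations, dtype=float)
        values = np.asarray(values, dtype=float)
        targets = np.asarray(targets, dtype=float)
        # Pairwise distances between each target point and each station.
        d = np.linalg.norm(targets[:, None, :] - stations[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)  # avoid division by zero at a station location
        w = 1.0 / d ** power
        return (w * values).sum(axis=1) / w.sum(axis=1)

    # Hypothetical magnetometer station coordinates (x, y) and field readings.
    stations = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    readings = [120.0, 135.0, 128.0, 150.0]
    grid_points = [[0.5, 0.5], [0.9, 0.1]]
    print(idw(stations, readings, grid_points))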