Search Results

Now showing 1 - 10 of 35
  • Item
    Proposed Methods for the Nondecreasing Order-Restricted Alternative in a Mixed Design
    (North Dakota State University, 2020) Alnssyan, Badr Suliman
    Nonparametric statistics are commonly used in the field of statistics because of their robustness when the underlying assumptions of the usual parametric statistics are violated. In this dissertation, we propose eight nonparametric methods to test for a nondecreasing ordered alternative in a mixed design consisting of a combination of a completely randomized design (CRD) and a randomized complete block design (RCBD). Four nonparametric tests, based on the Jonckheere-Terpstra test and modifications of it, were employed to construct the proposed methods. A Monte Carlo simulation study was conducted in SAS to investigate the performance of the proposed tests under a variety of nondecreasing location shifts among three, four and five populations, and to compare their powers with each other and with the powers of the test statistics introduced by Magel et al. (2009). Three underlying distributions were used in the study: the standard normal distribution, the standard exponential distribution, and Student's t-distribution with 3 degrees of freedom. We considered three scenarios for the proportion of the number of blocks in the RCBD portion relative to the sample size in the CRD portion, namely that the RCBD portion is larger than, equal to, or smaller than the CRD portion. Moreover, both equal and unequal sample sizes were considered for the CRD portion. The results of the simulation study indicate that all the proposed methods maintain their Type I error rates and that at least one of the proposed methods performs better than the tests of Magel et al. (2009) in terms of estimated power. In general, situations are found in which the proposed methods have higher powers and situations are found in which the tests of Magel et al. (2009) have higher powers.
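    A minimal sketch of the Jonckheere-Terpstra statistic underlying the tests described above, computed for a completely randomized design with synthetic data; it is not the dissertation's mixed-design statistic, and the group sizes and shifts below are illustrative only.
```python
import numpy as np

def jonckheere_terpstra(groups):
    """Sum of pairwise Mann-Whitney counts over ordered group pairs i < j."""
    J = 0.0
    for i in range(len(groups) - 1):
        for j in range(i + 1, len(groups)):
            xi = np.asarray(groups[i])[:, None]
            yj = np.asarray(groups[j])[None, :]
            J += np.sum(xi < yj) + 0.5 * np.sum(xi == yj)  # ties count one half
    return J

rng = np.random.default_rng(0)
# Three hypothetical treatment groups with a nondecreasing location shift
samples = [rng.normal(loc=shift, size=10) for shift in (0.0, 0.3, 0.6)]
print(jonckheere_terpstra(samples))
```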
  • Item
    Conditional Random Field with Lasso and its Application to the Classification of Barley Genes Based on Expression Level Affected by Fungal Infection
    (North Dakota State University, 2019) Liu, Xiyuan
    The classification problem for gene expression levels, more specifically gene expression analysis, is a major research area in statistics. There are several classical methods for solving the classification problem. To apply the Logistic Regression Model (LRM) and other classical methods, the observations in the dataset should satisfy the assumption of independence; that is, the observations should be independent of each other, and the predictors (independent variables) should be independent as well. These assumptions are usually violated in gene expression analysis. Although the classical Hidden Markov Model (HMM) can address the dependence among observations, it requires that the independent variables in the dataset be discrete and independent. Unfortunately, gene expression level is a continuous variable. To solve the classification problem for gene expression level data, the Conditional Random Field (CRF) is introduced. Finally, the Least Absolute Shrinkage and Selection Operator (LASSO) penalty, a dimension reduction method, is introduced to improve the CRF model.
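    A minimal sketch, assuming the third-party sklearn-crfsuite package, of a linear-chain CRF trained with an L1 penalty, which plays the role of the LASSO penalty described above; the feature names and labels are hypothetical and are not those used in the dissertation.
```python
import sklearn_crfsuite

# Each sequence is a list of per-position feature dicts (here: hypothetical
# expression features for neighboring genes), with one label per position.
X_train = [[{"expr": 1.2, "expr_prev": 0.4}, {"expr": 0.1, "expr_prev": 1.2}]]
y_train = [["affected", "unaffected"]]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",
    c1=0.5,   # L1 (LASSO-type) penalty: drives many feature weights to zero
    c2=0.0,   # no L2 penalty in this sketch
    max_iterations=100,
)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```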
  • Item
    Comparing Prediction Accuracies of Cancer Survival Using Machine Learning Techniques and Statistical Methods in Combination with Data Reduction Methods
    (North Dakota State University, 2022) Mostofa, Mohammad
    This comparative study of five-year survival prediction for breast, lung, colon, and leukemia cancers, using a large SEER dataset along with 10-fold cross-validation, provided insight into the relative predictive ability of different machine learning and data reduction methods. Lasso regression and the Boruta algorithm were used for variable selection, and Principal Component Analysis (PCA) was used for dimensionality reduction. We used one statistical method, logistic regression (LR), and several machine learning methods including Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), K Nearest Neighbor (KNN), Artificial Neural Network (ANN), and Naïve Bayes Classifier (NB). For breast cancer, we found LDA, RF, and LR to be the best models for five-year survival prediction based on accuracy, sensitivity, specificity, and area under the curve (AUC), using Z-score normalization and the Boruta algorithm for data reduction. The results for lung cancer indicated that linear SVM, RF, and ANN were the best survival prediction models using Z-score and min-max normalization for data reduction. The results for colon cancer indicated that ANN and RF were the best prediction models using the Boruta algorithm and the Z-score method. The results for leukemia showed that ANN and RF were the best survival prediction models using the Boruta algorithm and Z-score normalization. Overall, ANN, RF, and LR were the best prediction models for all cancers using variable selection by the Boruta algorithm.
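    A minimal sketch, using scikit-learn and synthetic data, of comparing several of the classifiers named above under 10-fold cross-validation with Z-score normalization; the SEER variables and the Boruta selection step are not reproduced here.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

# Synthetic stand-in for a survival-status classification dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "LDA": LinearDiscriminantAnalysis(),
    "SVM (linear)": SVC(kernel="linear"),
}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)  # Z-score normalization
    scores = cross_val_score(pipe, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```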
  • Item
    Integrative Data Analysis of Microarray and RNA-seq
    (North Dakota State University, 2018) Wang, Qi
    Background: Microarray and RNA sequencing (RNA-seq) have been two commonly used high-throughput technologies for gene expression profiling over the past decades. For global gene expression studies, both techniques are expensive, and each has its unique advantages and limitations. Integrative analysis of these two types of data would provide increased statistical power, reduced cost, and complementary technical advantages. However, the completely different mechanisms of the two high-throughput techniques make the two types of data highly incompatible. Methods: Based on their degrees of compatibility, the genes are grouped into clusters using a novel clustering algorithm called Boundary Shift Partition (BSP). For each cluster, a linear model is fitted to the data and the number of differentially expressed genes (DEGs) is calculated by running a two-sample t-test on the residuals. The optimal number of clusters can be determined using a selection criterion that penalizes the number of parameters in the model fit. The method was evaluated on data simulated from various distributions and compared with the conventional K-means clustering method using Hartigan-Wong's algorithm. The BSP algorithm was then applied to microarray and RNA-seq data obtained from embryonic heart tissues of wild type and Tbx5 mice. The raw data went through multiple preprocessing steps, including data transformation, quantile normalization, linear modeling, principal component analysis, and probe alignment. The differentially expressed genes between wild type and Tbx5 mice were identified using the BSP algorithm. Results: The accuracies of the BSP algorithm on the simulated data are higher than those of Hartigan-Wong's algorithm for cases with smaller standard deviations across the five different underlying distributions. The BSP algorithm can find the correct number of clusters using the selection criterion. The BSP method identified 584 differentially expressed genes between the wild type and Tbx5 mice. A core gene network developed from the differentially expressed genes revealed a set of key genes known to be important for heart development. Conclusion: The BSP algorithm is an efficient and robust classification method for integrating data obtained from microarray and RNA-seq.
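    A minimal sketch of the per-cluster step described above, fitting a simple linear model within one already-formed cluster and running a two-sample t-test on the residuals; the BSP clustering itself is the dissertation's novel algorithm and is not reproduced, and the data and condition labels below are synthetic.
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Paired platform measurements for one cluster (synthetic)
microarray = rng.normal(size=100)
rnaseq = 1.5 * microarray + rng.normal(scale=0.3, size=100)
condition = np.repeat(["wild_type", "tbx5"], 50)    # hypothetical labels

# Fit a simple linear model relating the two platforms and take residuals
slope, intercept = np.polyfit(microarray, rnaseq, deg=1)
residuals = rnaseq - (slope * microarray + intercept)

# Two-sample t-test on the residuals between the two conditions
t_stat, p_value = stats.ttest_ind(residuals[condition == "wild_type"],
                                  residuals[condition == "tbx5"])
print(t_stat, p_value)
```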
  • Item
    Identification of Differentially Expressed Genes When the Distribution of Effect Sizes is Asymmetric in Two Class Experiments
    (North Dakota State University, 2017) Kotoka, Ekua Fesuwa
    High-throughput RNA Sequencing (RNA-Seq) has emerged as an innovative and powerful technology for detecting differentially expressed (DE) genes across different conditions. Unlike continuous microarray data, RNA-Seq data consist of discrete read counts mapped to particular genes. Most proposed methods for detecting DE genes from RNA-Seq are based on statistics that compare normalized read counts between conditions. However, most of these methods do not take into account potential asymmetry in the distribution of effect sizes. In this dissertation, we propose methods to detect DE genes when the distribution of the effect sizes is observed to be asymmetric. These proposed methods improve detection of differential expression compared to existing methods. Chapter 3 proposes two new methods that modify an existing nonparametric method, Significance Analysis of Microarrays with emphasis on RNA-Seq data (SAMseq), to account for the asymmetry in the distribution of the effect sizes. Results of the simulation studies indicate that the proposed methods, compared to the SAMseq method, identify more DE genes while adequately controlling the false discovery rate (FDR). Furthermore, the use of the proposed methods is illustrated by analyzing a real RNA-Seq data set containing samples from two different mouse strains. In Chapter 4, additional simulation studies show that one of the proposed methods, compared with other existing methods, provides better power for identifying truly DE genes or more adequately controls FDR in most settings where asymmetry is present. Chapter 5 compares the performance of the parametric methods DESeq2, NBPSeq, and edgeR when asymmetric effect sizes exist and the analysis takes this asymmetry into account. Through simulation studies, the performance of these methods is compared to the traditional BH and q-value methods for identifying DE genes. This research proposes a new method that modifies these parametric methods to account for asymmetry in the distribution of effect sizes. Likewise, the use of these parametric methods and the proposed method is illustrated by analyzing a real RNA-Seq data set containing samples from two different mouse strains. Lastly, overall conclusions are given in Chapter 6.
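    A minimal sketch of the Benjamini-Hochberg (BH) step-up procedure referred to above, applied to a vector of hypothetical per-gene p-values; the proposed asymmetry-adjusted methods themselves are not reproduced here.
```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()   # largest i with p_(i) <= i * alpha / m
        rejected[order[: k + 1]] = True
    return rejected

p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.350, 0.900]
print(benjamini_hochberg(p_vals, alpha=0.05))
```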
  • Item
    A Study of Influential Statistics Associated with Success in the National Football League
    (North Dakota State University, 2015) Roith, Joseph Michael
    This dissertation considers the most important aspects of success in the National Football League (NFL). Success is defined, for this paper, as winning individual games in the short term and making the playoffs over the course of a season in the long term. Data were collected for 750 regular season games over the course of five NFL seasons and used to create models that identify the factors most significant for winning at both the short-term and long-term levels. A point spread model was developed using ordinary least squares regression with a stepwise selection technique to reduce the number of variables included. Logistic regression models were also created to estimate the probability that a team will win an individual game and the probability that a team will make the playoffs at the end of the season. Discriminant analysis was performed to compare the significant variables in our models and determine which had the largest influence. We considered the relationship between offense and defense in the NFL to conclude whether or not one area has a significant advantage over the other. We also fit a proportional odds model to the data set to distinguish blowout games from those that are close at the end. The overwhelming presence of turnover margin, passing efficiency, first down margin, and sack yardage in all of our models is clear evidence that a handful of statistics can explain success in the NFL. Using the statistics from games, we were able to correctly identify the winner around 88% of the time. Finally, when we used simulations and historical team performances to forecast future game outcomes, our models classified the actual winner with 71% accuracy. Analytics are slowly gaining momentum in football, and the advantages are clear. Quantifying success in the NFL can benefit both individual teams and the league as a whole in presenting the best possible product to their audiences.
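    A minimal sketch, with synthetic game statistics, of the two kinds of models described above: an ordinary least squares model for point spread and a logistic regression model for the probability of winning; the predictors are hypothetical stand-ins for the margins used in the dissertation.
```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 750
X = np.column_stack([
    rng.normal(size=n),   # turnover margin
    rng.normal(size=n),   # passing efficiency differential
    rng.normal(size=n),   # first down margin
])
point_spread = 4.0 * X[:, 0] + 2.5 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=7, size=n)
win = (point_spread > 0).astype(int)

X_const = sm.add_constant(X)
ols_fit = sm.OLS(point_spread, X_const).fit()        # point spread model
logit_fit = sm.Logit(win, X_const).fit(disp=False)   # win probability model
print(ols_fit.params, logit_fit.predict(X_const)[:5])
```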
  • Item
    Predicting the Outcomes of NCAA Women’s Sports
    (North Dakota State University, 2017) Wang, Wenting
    Sports competitions provide excellent opportunities for model building and for using basic statistical methodology in an interesting way. More attention has been paid, and more research conducted, pertaining to men's sports than to women's sports. This paper focuses on three NCAA women's sports: basketball, volleyball and soccer. Several ordinary least squares models were developed that help explain the variation in the point spread of a women's basketball, volleyball or soccer game based on in-game statistics. Several logistic models were also developed that estimate the probability that a particular team will win the game in the women's basketball, volleyball and soccer tournaments. Ordinary least squares models for Round 1, Round 2 and Rounds 3-6, with point spread as the dependent variable and using differences in ranks of seasonal averages and differences of seasonal averages, were developed to predict the winners of games in each of those rounds for the women's basketball, volleyball and soccer tournaments. Logistic models for Round 1, Round 2 and Rounds 3-6 that estimate the probability of a team winning the game, using differences in ranks of seasonal averages and differences of seasonal averages, were developed for the same purpose. The prediction models were validated before being used for prediction. For basketball, the least squares model developed using differences in ranks of seasonal averages with a double scoring system variable predicted the results of 76.2% of the games for the entire tournament, with all predictions made before the start of the tournament. For volleyball, the logistic model developed using differences of seasonal averages predicted 65.1% of the games for the entire tournament. For soccer, the logistic regression model developed using differences of seasonal averages predicted 45% of all games in the tournament correctly when all six rounds were predicted before the tournament began. In this case, a team predicted to win in the second round or later might not have even made it to that round, since the predictions were made ahead of time.
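    A minimal sketch, with hypothetical teams and statistics, of constructing the "differences of seasonal averages" and "differences in ranks of seasonal averages" predictors described above for a single matchup; the variables and values are illustrative only.
```python
import pandas as pd

season = pd.DataFrame(
    {"points_per_game": [78.1, 71.4, 69.9, 75.0],
     "rebounds_per_game": [40.2, 38.5, 42.0, 36.7]},
    index=["Team A", "Team B", "Team C", "Team D"],
)

ranks = season.rank(ascending=False)          # rank 1 = best seasonal average
home, away = "Team A", "Team C"

diff_of_averages = season.loc[home] - season.loc[away]
diff_of_ranks = ranks.loc[home] - ranks.loc[away]
print(diff_of_averages, diff_of_ranks, sep="\n")
```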
  • Item
    Measuring Performance of United States Commercial and Domestic Banks and its Impact on 2007-2009 Financial Crisis
    (North Dakota State University, 2019) Sakouvogui, Kekoura
    In the analysis of efficiency measures, the statistical Stochastic Frontier Analysis (SFA) and linear programming Data Envelopment Analysis (DEA) estimators have been widely applied. This dissertation is centered around two main goals. First, it addresses the individual limitations of the SFA and DEA models in chapters 2 and 3, respectively, using Monte Carlo (MC) simulations. Motivated by the lack of justification for the choice of inefficiency distributions in MC simulations, chapter 2 derives the statistical parameters, i.e., the mean and standard deviation, of the inefficiency distributions: truncated normal, half normal, and exponential. The MC simulation results show that, within both the conventional and proposed approaches, misspecification of the inefficiency distribution matters. More precisely, within the proposed approach, the misspecified truncated normal SFA model provides the smallest mean absolute deviation and mean square error when the true inefficiency distribution is half normal. Chapter 3 examines several misspecifications of the DEA efficiency measures while accounting for the stochastic inefficiency distributions of truncated normal, half normal, and exponential derived in chapter 2. MC simulations were conducted to examine the performance of the DEA model under two different data generating processes, logarithm and level, and across five different scenarios: inefficiency distributions, sample sizes, production functions, input distributions, and the curse of dimensionality. The results caution DEA practitioners concerning the accuracy of their estimates and the implications within the proposed and conventional approaches to the inefficiency distributions. Second, chapter 4 presents an empirical assessment of the effects of the liquidity and solvency financial factors on the cost efficiency measures of U.S. banks while accounting for regulatory, macroeconomic, and bank internal factors. The results suggest that the liquidity and solvency financial factors negatively impacted the cost efficiency measures of U.S. banks from 2005 to 2017. Moreover, during the financial crisis, U.S. banks were inefficient in comparison to the tranquil period, and the solvency financial factor had an insignificant impact on the cost efficiency measures. In addition, U.S. banks' liquidity financial factor collapsed due to contagion during the financial crisis.
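    A minimal sketch of one Monte Carlo draw for a stochastic frontier model with composed error v - u, where the inefficiency term u follows one of the distributions named above (half normal here); the frontier form and parameter values are illustrative and are not those derived in chapter 2.
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(1.0, 10.0, size=n)                  # single input
v = rng.normal(scale=0.2, size=n)                   # symmetric noise term
u = stats.halfnorm.rvs(scale=0.3, size=n, random_state=rng)  # inefficiency, u >= 0

# Cobb-Douglas frontier in logs: ln y = b0 + b1 * ln x + v - u
log_y = 0.5 + 0.7 * np.log(x) + v - u
print(log_y[:5])
```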
  • Item
    Boundary Estimation
    (North Dakota State University, 2015) Mu, Yingfei
    Existing statistical methods do not provide a satisfactory solution for determining the spatial pattern in spatially referenced data, which is often required by research in many areas, including geology, agriculture, forestry, marine science and epidemiology, for identifying the source of unusual environmental factors associated with a certain phenomenon. This work provides a novel algorithm which can be used to delineate the boundary of an area of hot spots accurately and efficiently. Our algorithm, first of all, does not assume any pre-specified geometric shape for the change-curve. Secondly, the computational complexity of our novel algorithm for change-curve detection is of order O(n²), which is much smaller than the 2^{O(n²)} required by the CUSP algorithm proposed in Müller & Song [8] and by Carlstein's [2] estimators. Furthermore, our novel algorithm yields a consistent estimate of the change-curve as well as of the underlying distribution mean of observations in the regions. We also study the hypothesis test of the existence of the change-curve under independence of the spatially referenced data. We then provide simulation studies as well as a real case study to compare our algorithm with the popular boundary estimation method: the spatial scan statistic.
  • Item
    A Conditional Random Field (CRF) Based Machine Learning Framework for Product Review Mining
    (North Dakota State University, 2019) Ming, Yue
    The task of opinion mining from product reviews has been achieved by employing rule-based approaches or generative learning models such as hidden Markov models (HMMs). This paper introduces a discriminative model using linear-chain Conditional Random Fields (CRFs), which can naturally incorporate arbitrary, non-independent features of the input without requiring conditional independence among the features or distributional assumptions about the inputs. The framework first performs part-of-speech (POS) tagging over each word in the sentences of the review text. The performance is evaluated based on three criteria: precision, recall and F-score. The results show that this approach is effective for this type of natural language processing (NLP) task. The framework then extracts the keywords associated with each product feature and summarizes them into concise lists that are simple and intuitive for people to read.
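    A minimal sketch, using scikit-learn, of the evaluation step described above: flattening predicted and gold POS tag sequences and computing precision, recall and F-score; the sentences and tags are hypothetical, and the CRF training itself would follow the same pattern as the earlier sklearn-crfsuite sketch.
```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical gold-standard and predicted POS tag sequences
gold = [["DT", "NN", "VBZ", "JJ"], ["PRP", "VBP", "DT", "NN"]]
pred = [["DT", "NN", "VBZ", "RB"], ["PRP", "VBP", "DT", "NN"]]

y_true = [tag for sent in gold for tag in sent]   # flatten the sequences
y_pred = [tag for sent in pred for tag in sent]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"precision={precision:.2f} recall={recall:.2f} F-score={f1:.2f}")
```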