Search Results
Now showing 1 - 10 of 123

Predicting Outcomes of NBA Basketball Games (North Dakota State University, 2016) Jones, Eric Scot
A stratified random sample of 144 NBA basketball games was taken over a three-year period, between 2008 and 2011. Models were developed to predict point spread and to estimate the probability of a specific team winning based on various in-game statistics. Statistics significant in the model were field-goal shooting percentage, three-point shooting percentage, free-throw shooting percentage, offensive rebounds, assists, turnovers, and free-throws attempted. Models were verified using exact in-game statistics for a random sample of 50 NBA games taken during the 2011-2012 season with 88-94% accuracy. Three methods were used to estimate in-game statistics of future games so that the models could be used to predict a winner in games played by Team A and Team B. Models using these methods had accuracies of approximately 62%. Seasonal averages for these in-game statistics were used in the model developed to predict the winner of each game for the 2013-2016 NBA Championships.

A Visualization Technique for Course Evaluations and Other Likert Scale Data (North Dakota State University, 2018) Saho, Muhammed
Course evaluation is one of the primary ways of collecting feedback from students at NDSU. Since almost every student in every course submits one at the end of the semester, it generates a lot of data. The data is summarized into text-based reports with emphasis on the average rating of each question. At one page per course, analyzing these reports can be overwhelming. Furthermore, it is very difficult to identify patterns in the text reports. We combine heat maps and small multiples to introduce a visualization of the data that allows for easier comparison between courses, departments, etc. We defined a data format for storing and transmitting the data. We built an interactive web application that consumes the aforementioned data format and generates the visualizations.
We simulated reference data to facilitate interpretation of the visualizations. Finally, we discussed how our research can be applied more generally to Likert scale data.

Proposed Methods for the Nondecreasing Order-Restricted Alternative in a Mixed Design (North Dakota State University, 2020) Alnssyan, Badr Suliman
Nonparametric statistics are commonly used in the field of statistics due to their robustness when the underlying assumptions are violated for the usual parametric statistics. In this dissertation, we proposed eight nonparametric methods to test for a nondecreasing ordered alternative for a mixed design consisting of a combination of a completely randomized design (CRD) and a randomized complete block design (RCBD). Four nonparametric tests, based on the Jonckheere-Terpstra test and modifications of it, were employed to propose these nonparametric methods. A Monte Carlo simulation study was conducted using SAS to investigate the performance of the proposed tests under a variety of nondecreasing location shifts among three, four and five populations, and then to compare these powers to each other and with the powers of the test statistics introduced by Magel et al. (2009). Three underlying distributions are used in the study: the standard normal distribution, the standard exponential distribution and Student's t-distribution (3 degrees of freedom). We considered three scenarios for the proportion of the number of blocks in the RCBD portion to the sample size in the CRD portion, namely, that the number of blocks in the RCBD portion is larger than, equal to, or smaller than the sample size in the CRD portion. Moreover, equal and unequal sample sizes were both considered for the CRD portion. The results of the simulation study indicate that all the proposed methods maintain their type I error and also indicate that at least one of the proposed methods did better compared to the tests of Magel et al.
(2009) in terms of the estimated powers. In general, situations are found in which the proposed methods have higher powers and situations are found in which tests in Magel et al. (2009) have higher powers.

Rutin Extraction and Content in Buckwheat (Fagopyrum esculentum) Bran-Fortified Pasta (North Dakota State University, 2019) Kaiser, Amber Christine
The objectives of this study were to optimize extraction of rutin from buckwheat bran and buckwheat bran-fortified spaghetti and to determine the stability of rutin during spaghetti production and preparation. Aqueous ethanol and ethanol at 50, 60, 70, 80, and 90 % were used with Soxhlet or ultrasound-assisted extraction methods and 80 % methanol extraction was evaluated with or without papain treatment. Optimal extraction treatment (80 % methanol using ultrasound-assisted extraction without enzyme treatment) was used to determine rutin content in buckwheat bran-fortified spaghetti dried at low (40 °C) or high (90 °C) temperature. Rutin content was evaluated in raw, hydrated, extruded, dried, and cooked pasta. High temperature drying reduced rutin content more than low temperature drying, and total reduction in rutin content from raw pasta mix to cooked pasta was 25 – 30 %.

Conditional Random Field with Lasso and its Application to the Classification of Barley Genes Based on Expression Level Affected by Fungal Infection (North Dakota State University, 2019) Liu, Xiyuan
The classification problem of gene expression level, more specifically, gene expression analysis, is a major research area in statistics. There are several classical methods to solve the classification problem. To apply the Logistic Regression Model (LRM) and other classical methods, the observations in the dataset should satisfy the assumption of independence. That is, the observations in the dataset are independent of each other, and the predictors (independent variables) should be independent.
These assumptions are usually violated in gene expression analysis. Although the classical Hidden Markov Model (HMM) can solve the independence of observation problem, the classical HMM requires that the independent variables in the dataset be discrete and independent. Unfortunately, the gene expression level is a continuous variable. To solve the classification problem of gene expression level data, the Conditional Random Field (CRF) is introduced. Finally, the Least Absolute Shrinkage and Selection Operator (LASSO) penalty, a dimension reduction method, is introduced to improve the CRF model.

Comparing Total Hip Replacement Drug Treatments for Cost and Length of Stay (North Dakota State University, 2015) Huebner, Blake James
The objective of this study is to identify the potential effect anticoagulants, spinal blocks, and antifibrinolytics have on overall cost, length of stay, and re-admission rates for total hip replacement patients. We use ordinary least squares regression, multiple comparison testing, logistic regression, and chi-square tests to fulfill this objective. The combination of warfarin and enoxaparin is associated with the highest cost and length of stay out of the anticoagulants studied. There is no clear combination of spinal blocks associated with the highest cost and length of stay. Tranexamic acid is associated with a reduction in length of stay and likelihood of receiving a blood transfusion, while not increasing overall cost.
No drug combination in any category is associated with a change in re-admission rates.

Comparing Prediction Accuracies of Cancer Survival Using Machine Learning Techniques and Statistical Methods in Combination with Data Reduction Methods (North Dakota State University, 2022) Mostofa, Mohammad
This comparative study of five-year survival prediction for breast, lung, colon, and leukemia cancers using a large SEER dataset along with 10-fold cross-validation provided us with an insight into the relative prediction ability of different machine learning and data reduction methods. Lasso regression and the Boruta algorithm were used for variable selection, and Principal Component Analysis (PCA) was used for dimensionality reduction. We used one statistical method, logistic regression (LR), and several machine learning methods including Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), K Nearest Neighbor (KNN), Artificial Neural Network (ANN), and Naïve Bayes Classifier (NB). For breast cancer, we found LDA, RF, and LR were the best models for five-year survival prediction based on the accuracy, sensitivity, specificity, and area under the curve (AUC) using data reduction methods from Z score normalization and the Boruta algorithm. The results for lung cancer indicated that the linear SVM, RF, and ANN were the best survival prediction models using data reduction methods from the Z score and max-min normalization. The results for colon cancer indicated that ANN and RF were the best prediction models using the Boruta algorithm and Z score method. The results for leukemia showed that ANN and RF were the best survival prediction models using the Boruta algorithm and the data reduction technique from the Z score.
Overall, ANN, RF, and LR were the best prediction models for all cancers using variable selection by the Boruta algorithm.

Using Imputed MicroRNA Regulation Based on Weighted Ranked Expression and Putative MicroRNA Targets and Analysis of Variance to Select MicroRNAs for Predicting Prostate Cancer Recurrence (North Dakota State University, 2014) Wang, Qi
Imputed microRNA regulation based on weighted ranked expression and putative microRNA targets (IMRE) is a method to predict microRNA regulation from genome-wide gene expression. A false discovery rate (FDR) for each microRNA is calculated using the expression of the microRNA putative targets to analyze the regulation between different conditions. FDR is calculated to identify the differences of gene expression. The dataset used in this research is the microarray gene expression of 596 patients with prostate cancer. This dataset includes three different phenotypes: PSA (Prostate-Specific Antigen recurrence), Systemic (Systemic Disease Progression) and NED (No Evidence of Disease). We used the IMRE and ANOVA methods to analyze the dataset and identified several microRNA candidates that can be used to predict PSA recurrence and systemic disease progression in prostate cancer patients.

Integrative Data Analysis of Microarray and RNA-seq (North Dakota State University, 2018) Wang, Qi
Background: Microarray and RNA sequencing (RNA-seq) are two commonly used high-throughput technologies for gene expression profiling for the past decades. For global gene expression studies, both techniques are expensive, and each has its unique advantages and limitations. Integrative analysis of these two types of data would provide increased statistical power, reduced cost, and complementary technical advantages. However, the completely different mechanisms of the high-throughput techniques make the two types of data highly incompatible.
Methods: Based on the degrees of compatibility, the genes are grouped into different clusters using a novel clustering algorithm, called Boundary Shift Partition (BSP). For each cluster, a linear model is fitted to the data and the number of differentially expressed genes (DEGs) is calculated by running a two-sample t-test on the residuals. The optimal number of clusters can be determined using selection criteria that are penalized on the number of parameters for model fitting. The method was evaluated using data simulated from various distributions and it was compared with the conventional K-means clustering method, Hartigan-Wong’s algorithm. The BSP algorithm was applied to the microarray and RNA-seq data obtained from the embryonic heart tissues from wild type mice and Tbx5 mice. The raw data went through multiple preprocessing steps including data transformation, quantile normalization, linear model, principal component analysis and probe alignments. The differentially expressed genes between wild type and Tbx5 are identified using the BSP algorithm. Results: The accuracies of the BSP algorithm for the simulation data are higher than those of Hartigan-Wong’s algorithm for the cases with smaller standard deviations across the five different underlying distributions. The BSP algorithm can find the correct number of clusters using the selection criteria. The BSP method identifies 584 differentially expressed genes between the wild type and Tbx5 mice. A core gene network developed from the differentially expressed genes showed a set of key genes that were known to be important for heart development.
Conclusion: The BSP algorithm is an efficient and robust classification method to integrate the data obtained from microarray and RNA-seq.

Identification of Differentially Expressed Genes When the Distribution of Effect Sizes is Asymmetric in Two Class Experiments (North Dakota State University, 2017) Kotoka, Ekua Fesuwa
High-throughput RNA Sequencing (RNA-Seq) has emerged as an innovative and powerful technology for detecting differentially expressed (DE) genes across different conditions. Unlike continuous microarray data, RNA-Seq data consist of discrete read counts mapped to a particular gene. Most proposed methods for detecting DE genes from RNA-Seq are based on statistics that compare normalized read counts between conditions. However, most of these methods do not take into account potential asymmetry in the distribution of effect sizes. In this dissertation, we propose methods to detect DE genes when the distribution of the effect sizes is observed to be asymmetric. These proposed methods improve detection of differential expression compared to existing methods. Chapter 3 proposes two new methods that modify an existing nonparametric method, Significance Analysis of Microarrays with emphasis on RNA-Seq data (SAMseq), to account for the asymmetry in the distribution of the effect sizes. Results of the simulation studies indicate that the proposed methods, compared to the SAMseq method, identify more DE genes while adequately controlling the false discovery rate (FDR). Furthermore, the use of the proposed methods is illustrated by analyzing a real RNA-Seq data set containing two different mouse strain samples. In Chapter 4, additional simulation studies are performed to show that one of the proposed methods, compared with other existing methods, provides better power for identifying truly DE genes or more sufficiently controls FDR in most settings where asymmetry is present.
Chapter 5 compares the performance of the parametric methods DESeq2, NBPSeq and edgeR when there exist asymmetric effect sizes and the analysis takes into account this asymmetry. Through simulation studies, the performance of these methods is compared to the traditional BH and q-value methods in the identification of DE genes. This research proposes a new method that modifies these parametric methods to account for asymmetry found in the distribution of effect sizes. Likewise, the use of these parametric methods and the proposed method is illustrated by analyzing a real RNA-Seq data set containing two different mouse strain samples. Lastly, overall conclusions are given in Chapter 6.