Statistics Doctoral Work

Permanent URI for this collection: hdl:10365/32399

Recent Submissions

Now showing 1 - 20 of 35
  • Item
    Community detection in censored hypergraph
    (North Dakota State University, 2024) Bin, Zhao
    Networks, or graphs, represent relationships between entities in various applications, such as social networks, biological systems, and communication networks. A common feature of network data is the presence of community structures, where groups of nodes exhibit higher connectivity within themselves than with other groups. Identifying these community structures, a task known as community detection, is essential for gaining valuable insights in diverse applications, including uncovering hidden relationships in social networks, detecting functional modules in biological systems, and identifying vulnerabilities in communication networks. However, real-world network data may have missing values that significantly impact the network's structural properties. Existing community detection methods primarily focus on networks without missing values, leaving a gap in the analysis of censored networks. This study addresses the community detection problem in censored m-uniform hypergraphs. First, using an information-theoretic approach, we obtain a threshold that enables exact recovery of the community structure. Next, we propose a two-stage polynomial-time algorithm, comprising a spectral algorithm complemented by a refinement step, that aims to achieve exact recovery. Moreover, we introduce a semi-definite relaxation algorithm and study its performance as a standalone community detection method, without a refinement step. Lastly, in view of the effect of imputation on censored hypergraphs, we propose several imputation methods grounded in network properties and assess their performance by simulation. Finally, we apply the proposed algorithms to real-world data, showcasing their practical utility in various settings.
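    As an illustration of the spectral stage, here is a minimal sketch in Python, assuming an ordinary 2-community stochastic block model graph rather than the censored m-uniform hypergraph treated in the dissertation; the refinement step and the censoring mechanism are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_partition(A, k=2, seed=0):
    """Cluster nodes by k-means on the k leading eigenvectors of A."""
    vals, vecs = np.linalg.eigh(A)                 # eigenvalues in ascending order
    lead = vecs[:, np.argsort(np.abs(vals))[-k:]]  # eigenvectors of largest |eigenvalue|
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(lead)

# toy planted 2-block model: dense within blocks, sparse between
rng = np.random.default_rng(0)
n = 60
truth = np.repeat([0, 1], n // 2)
P = np.where(truth[:, None] == truth[None, :], 0.6, 0.1)
A = np.triu(rng.binomial(1, P), 1)
A = A + A.T                                        # symmetric adjacency, zero diagonal
print(spectral_partition(A))
```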
  • Item
    Profile Matching in Observational Studies With Multilevel Data
    (North Dakota State University, 2022) McGrath, Brenda
    Matching is a popular method for using observational data to replicate desired features of a randomized controlled trial. A common problem encountered in observational studies is the lack of common support, or limited overlap, of the covariate distributions across treatment groups. A new approach, cardinality matching, leverages mathematical optimization to directly balance observed covariates: the user specifies the tolerable balance constraints on individual covariates and the desired number of matched controls, and the algorithm finds the largest possible match given these constraints. Profile matching is a newly proposed method built on cardinality matching in which the user specifies a target profile directly and finds the largest cardinality match that is balanced to that profile. We developed an R package, ProfileMatchit, that implements profile matching, and we applied it to hospital quality assessment using a real-world dataset. Profile matching has not yet been used in hospital quality assessment but may be an improvement over current approaches, which are limited in their ability to find sufficient matches in a heterogeneous sample. This application is the culmination of our work to develop an improved version of cardinality matching, provide a new application of profile matching, and offer a better approach to hospital quality assessment.
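    The core optimization behind cardinality matching can be sketched as a small integer program. The sketch below, written with the PuLP solver on hypothetical data, maximizes the number of matched controls subject to linearized mean-balance constraints around a target profile; it illustrates the idea and is not the ProfileMatchit implementation.

```python
import numpy as np
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary, PULP_CBC_CMD

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))          # controls' covariates (hypothetical data)
target = np.array([0.2, -0.1, 0.0])    # target profile, one entry per covariate
tol = 0.05                             # tolerable imbalance per covariate

prob = LpProblem("profile_matching", LpMaximize)
m = [LpVariable(f"m{j}", cat=LpBinary) for j in range(len(X))]
prob += lpSum(m)                       # objective: match as many controls as possible
for p in range(X.shape[1]):
    dev = lpSum(float(X[j, p] - target[p]) * m[j] for j in range(len(X)))
    prob += dev <= tol * lpSum(m)      # linearization of |mean(matched X_p) - target_p| <= tol
    prob += dev >= -tol * lpSum(m)

prob.solve(PULP_CBC_CMD(msg=False))
matched = [j for j in range(len(X)) if m[j].value() == 1]
print(len(matched), "controls matched; means:", X[matched].mean(axis=0).round(3))
```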
  • Item
    Performance of Permutation Tests Using Simulated Genetic Data
    (North Dakota State University, 2022) Soumare, Ibrahim
    Disease statuses and biological conditions are known to be greatly impacted by differences in gene expression levels. A common challenge in RNA-seq data analysis is to identify genes whose mean expression levels change across different groups of samples or, more generally, are associated with one or more variables of interest. Such analysis is called differential expression analysis. Many tools have been developed for analyzing differential gene expression (DGE) in RNA-seq data. RNA-seq data are represented as counts. Typically, a generalized linear model with a log link and a negative binomial response is fit to the count data for each gene, and DE genes are identified by testing, for each gene, whether a model parameter or linear combination of model parameters is zero. We conducted a simulation study to compare the performance of our proposed modified permutation test with DESeq2, edgeR, limma, LFC, and voom when applied to RNA-seq data, considering different combinations of sample sizes and underlying distributions. We first simulated data using Monte Carlo simulation in SAS and assessed the true detection rate and false positive rate for each model. We then simulated data from real RNA-seq data using the SimSeq algorithm and again compared the performance of our proposed model to DESeq2, edgeR, limma, LFC, and voom. The simulation results suggest that permutation tests are a competitive alternative to traditional parametric methods for analyzing RNA-seq data when sample sizes are sufficient. Specifically, the permutation test controlled the Type I error rate fairly well and had comparable power. Moreover, for sample sizes n ≥ 10, it exhibited a comparable true detection rate and consistently kept the false positive rate very low when sampling from Poisson and negative binomial distributions. Likewise, the results from SimSeq confirm that permutation tests do a better job of keeping the false positive rate low.
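    A minimal sketch of a two-group permutation test for a single gene's counts, assuming a difference-in-means statistic; the dissertation's modified permutation test and the SAS/SimSeq pipelines are not reproduced here.

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=10000, seed=0):
    """Two-sided permutation p-value for a difference in mean expression."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y]).astype(float)
    observed = abs(pooled[:len(x)].mean() - pooled[len(x):].mean())
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                       # relabel samples at random
        count += abs(pooled[:len(x)].mean() - pooled[len(x):].mean()) >= observed
    return (count + 1) / (n_perm + 1)             # add-one correction keeps p > 0

# toy counts for one gene under two conditions (hypothetical data)
rng = np.random.default_rng(42)
group1 = rng.negative_binomial(5, 0.3, size=10)
group2 = rng.negative_binomial(5, 0.2, size=10)
print(permutation_pvalue(group1, group2))
```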
  • Item
    A Study of Locally D-optimal Designs for the Emax Model with Heteroscedasticity
    (North Dakota State University, 2022) Zhang, Xiao
    The classical theory of locally optimal designs is developed for the center+error model assuming Gaussianity and homoscedasticity of the random error, under which the maximum likelihood estimator (MLE) turns out to be the most efficient for model parameter estimation. However, these assumptions are typically absent in practice. In this work, we study the locally D-optimal design based on our new oracle second-order least squares estimator (SLSE). We compare the asymptotic efficiency of locally D-optimal designs obtained via the SLSE, the maximum quasi-likelihood estimator (MqLE), and the maximum Gaussian likelihood estimator (MGLE) in the case where the underlying probability distribution of the response is non-Gaussian and heteroscedastic. We find that even with less stringent assumptions, the asymptotic efficiency of the locally D-optimal designs obtained via the MqLE is comparable to that of the oracle SLSE in some cases, albeit lower in general. As a demonstration of how the locally D-optimal design is found numerically, we apply our feasibility-based particle swarm optimization algorithm to the locally D-optimal design based on the original SLSE.
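    To make the design criterion concrete, the sketch below evaluates the locally D-optimal criterion, log det of the Fisher information at guessed parameter values, for the three-parameter Emax model, with a crude random search standing in for the feasibility-based particle swarm optimizer; homoscedastic errors and equal design weights are assumed.

```python
import numpy as np

E0, Emax, ED50 = 1.0, 2.0, 5.0                 # local parameter guesses (hypothetical)

def grad(x):
    """Gradient of E0 + Emax*x/(ED50 + x) with respect to (E0, Emax, ED50)."""
    return np.array([1.0, x / (ED50 + x), -Emax * x / (ED50 + x) ** 2])

def log_det_info(doses, weights):
    """D-criterion: log det of the normalized Fisher information of the design."""
    M = sum(w * np.outer(grad(x), grad(x)) for x, w in zip(doses, weights))
    sign, logdet = np.linalg.slogdet(M)
    return logdet if sign > 0 else -np.inf

rng = np.random.default_rng(0)
best_doses, best_val = None, -np.inf
for _ in range(20000):                          # crude random search over 3-point designs
    doses = np.sort(rng.uniform(0.0, 50.0, size=3))
    val = log_det_info(doses, np.full(3, 1 / 3))  # equal weights at three support points
    if val > best_val:
        best_doses, best_val = doses, val
print("near-optimal doses:", np.round(best_doses, 2), "log det:", round(best_val, 3))
```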
  • Item
    Comparing Prediction Accuracies of Cancer Survival Using Machine Learning Techniques and Statistical Methods in Combination with Data Reduction Methods
    (North Dakota State University, 2022) Mostofa, Mohammad
    This comparative study of five-year survival prediction for breast, lung, colon, and leukemia cancers, using a large SEER dataset along with 10-fold cross-validation, provides insight into the relative predictive ability of different machine learning and data reduction methods. Lasso regression and the Boruta algorithm were used for variable selection, and principal component analysis (PCA) was used for dimensionality reduction. We used one statistical method, logistic regression (LR), and several machine learning methods, including decision trees (DT), random forests (RF), support vector machines (SVM), linear discriminant analysis (LDA), K-nearest neighbors (KNN), artificial neural networks (ANN), and the naïve Bayes classifier (NB). For breast cancer, LDA, RF, and LR were the best models for five-year survival prediction based on accuracy, sensitivity, specificity, and area under the curve (AUC), using Z-score normalization and the Boruta algorithm for data reduction. For lung cancer, linear SVM, RF, and ANN were the best survival prediction models, using Z-score and max-min normalization for data reduction. For colon cancer, ANN and RF were the best prediction models, using the Boruta algorithm and the Z-score method. For leukemia, ANN and RF were the best survival prediction models, using the Boruta algorithm and Z-score normalization. Overall, ANN, RF, and LR were the best prediction models across all cancers when variables were selected by the Boruta algorithm.
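    A minimal sketch of the comparison protocol: 10-fold cross-validated accuracy and AUC for a few of the classifiers named above, on synthetic stand-in data (the SEER dataset and the exact preprocessing steps are not reproduced).

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler  # plays the Z-score normalization role

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "LDA": make_pipeline(StandardScaler(), LinearDiscriminantAnalysis()),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
    print(f"{name}: accuracy={acc:.3f}, AUC={auc:.3f}")
```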
  • Item
    Proposed Nonparametric Tests for the Umbrella Alternative in a Mixed Design Testing for Location and Location-Scale
    (North Dakota State University, 2022) Alotaibi, Eid Sadun
    Researchers sometimes use the umbrella alternative when testing for differences in treatment effects, where the parameters increase up to a point and decrease after that point. Different treatment effects may result in changes to location parameters only, to scale parameters only, or to both. In this study, we considered tests for three distinct scenarios; the tests in each scenario were compared based on estimated power under different underlying distributions and for different known umbrella peaks based on 3, 4, or 5 populations. For all three scenarios, recommendations are given for which test is better in a variety of cases. In scenario one, this research investigates existing test statistics proposed by Magel et al. (2010) for detecting umbrella alternatives when the peak is known and the underlying design consists of a completely randomized design (CRD) and a randomized complete block design (RCBD). We investigate the powers of the tests relative to each other when testing for location in this design when the variance of the CRD portion is 2, 4, or 9 times larger than the variance of the RCBD portion. Three underlying distributions, a variety of location shifts, and different ratios of the sample size in the CRD portion to the number of blocks in the RCBD portion are considered. In the second scenario, three nonparametric tests are proposed for a CRD with k populations to test for the umbrella alternative with known peak, p, for both location and scale parameters. A simulation study was implemented to see whether the proposed tests maintained their significance levels. The proposed tests were also compared based on estimated powers for sample sizes of 15 and a variety of location and scale shifts. In the third scenario, we propose nonparametric test statistics for an umbrella pattern in both location and scale for a mixed design. Powers were estimated for different ratios of the sample size in the CRD to the number of blocks in the RCBD and for equal variance ratios between the CRD and the RCBD, as well as for changes in the location and scale parameters.
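    The building block for such tests is the known-peak Mack-Wolfe umbrella statistic for a CRD, sketched below; the RCBD component, the standardization, and the location-scale extensions studied in the dissertation are omitted.

```python
import numpy as np

def mw_count(x, y):
    """U_xy = number of pairs with x_i < y_j, counting ties as 1/2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sum(x[:, None] < y[None, :]) + 0.5 * np.sum(x[:, None] == y[None, :])

def mack_wolfe(samples, peak):
    """A_p: sum of U-counts increasing up to group `peak` (1-based), decreasing after."""
    A = 0.0
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            if j + 1 <= peak:            # both groups on the upslope
                A += mw_count(samples[i], samples[j])
            elif i + 1 >= peak:          # both groups on the downslope
                A += mw_count(samples[j], samples[i])
    return A

# toy data with an umbrella shape peaking at the second of three groups
rng = np.random.default_rng(3)
groups = [rng.normal(mu, 1, 12) for mu in (0.0, 1.0, 0.3)]
print(mack_wolfe(groups, peak=2))
```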
  • Item
    Comparing Prediction Methods of Wheat Grain Quality With the Area Under the Receiver Operating Characteristic Curves
    (North Dakota State University, 2021) Lin, Ying
    A widely used breeding method is genomic selection, which uses genome-wide marker coverage to predict genotypic values for quantitative traits. Genomic selection combines molecular and phenotypic data in a training population to obtain the genomic estimated breeding values of individuals in a testing population that have been genotyped but not phenotyped. One popular method for this estimation is G-BLUP. To further simplify data collection efforts and costs, we developed models based on the linear model, the Bayesian linear model, K-nearest neighbors, and random forests to predict quality traits, and we compared the predictive ability of this new approach with G-BLUP using the Pearson correlation and the area under the receiver operating characteristic curve. The goal of this approach is to enable the analysis of large-scale datasets and provide relatively accurate estimates of quality traits without the time and cost of marker analysis. Applying the methods to predict quality traits in spring wheat breeding data reveals that, compared with G-BLUP, the proposed methods perform better for loaf volume prediction, worse for flour extraction and bake absorption prediction, and acceptably for mixograph prediction.
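    A hedged sketch of the G-BLUP baseline: genomic estimated breeding values from a genomic relationship matrix via ridge-type shrinkage, assuming all individuals are genotyped, training individuals are phenotyped, and the variance ratio lambda is known (in practice it is estimated, e.g., by REML).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 150, 500
M = rng.binomial(2, 0.5, size=(n, p)).astype(float)   # marker matrix coded 0/1/2
M -= M.mean(axis=0)                                   # center markers
G = M @ M.T / p                                       # genomic relationship matrix

beta = rng.normal(0, 0.1, p)
y = M @ beta + rng.normal(0, 1, n)                    # simulated quantitative trait

train, test = np.arange(100), np.arange(100, n)
lam = 1.0                                             # sigma_e^2 / sigma_u^2, assumed known
# BLUP of breeding values: u_test = G[test, train] (G[train, train] + lam I)^{-1} (y - ybar)
alpha = np.linalg.solve(G[np.ix_(train, train)] + lam * np.eye(len(train)),
                        y[train] - y[train].mean())
gebv_test = G[np.ix_(test, train)] @ alpha

print("predictive ability (Pearson r):", np.corrcoef(gebv_test, y[test])[0, 1])
```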
  • Item
    A Simulation Study Using a Mixed Model Framework to Analyze the Impact of Sample Size and Variability on Type I Error and Power
    (North Dakota State University, 2021) Gu, Xiaoxue
    Repeated measures designs (longitudinal studies) are common in many research fields, especially pharmaceutical clinical trials, agricultural research, and psychology. PROC MIXED (SAS Inc.) is a well-known standard tool for analyzing repeated measures data. The MIXED procedure is based on the standard linear mixed model, which estimates parameters by maximizing the restricted likelihood. The usual assumption for a standard linear mixed model is normality. However, real-world data often depart from this: they may be non-smooth, asymmetric, or heavy-tailed, and sample sizes may be small. Therefore, this simulation study was conducted to check the validity of the mixed model's statistical inference when the underlying assumption of normality of the random errors is violated [Scheffe, 1959], under two design features: unbalanced group sizes and unequal error variances [Scheffe, 1959]. We compare the Type I error rate under different combinations of settings with the Type I error rate under the normal distribution. Power rates are also provided to check robustness. The main results show that the mixed model is reasonably robust to modest violations of the normality assumption. However, when a small group size is combined with a large variance, Type I error rates are severely inflated, which undermines the mixed model's performance. When Type I errors were found to be inflated, the GROUP= option was often found to help, or a sub-sampling procedure could sometimes be used.
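    The simulation logic can be sketched as follows: generate repeated-measures data with no group effect, fit a random-intercept mixed model, and record how often the group effect is falsely declared significant. statsmodels' MixedLM stands in for PROC MIXED here, and only a random intercept is used for brevity.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def type1_rate(n_subj=20, n_times=4, n_sim=200, alpha=0.05, seed=0):
    """Empirical Type I error for the group effect when the null is true."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sim):
        subj = np.repeat(np.arange(n_subj), n_times)
        group = (subj < n_subj // 2).astype(int)       # two groups; y ignores group
        y = rng.normal(0, 1, n_subj)[subj] + rng.normal(0, 1, n_subj * n_times)
        df = pd.DataFrame({"y": y, "group": group, "subj": subj})
        fit = smf.mixedlm("y ~ group", df, groups=df["subj"]).fit(reml=True)
        rejections += fit.pvalues["group"] < alpha
    return rejections / n_sim

print("empirical Type I error:", type1_rate())
```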
  • Item
    On Lasso Estimation of Linear Mixed Model for High Dimensional Longitudinal Data
    (North Dakota State University, 2021) Wen, Qian
    With the advancement of data collection technology, repeated measurements with high-dimensional covariates have become increasingly common. The classical statistical approach for modeling data of this kind is the linear mixed model with temporally correlated error. However, most of the variable selection research reported in the literature is for independent response data. In this study, the proposed algorithm employs Expectation-Maximization (EM) and the Least Absolute Shrinkage and Selection Operator (LASSO) under the linear mixed model scheme with the assumption of Gaussianity, an approach that works for data with interdependence. Our algorithm involves two steps: (1) variance-covariance component estimation by EM; and (2) variable selection by LASSO. The crucial challenge arises from the fact that linear mixed models usually allow a structured variance-covariance matrix, which in turn complicates its estimation: in general there is no explicit maximizer in the M-step of the EM algorithm. Our EM algorithm uses one iteration of the projected gradient descent method, which turns out to be quite computationally efficient compared with the classical EM algorithm because it obviates finding the maximizer of the variance-covariance components in the M-step. With the estimates of the variance-covariance components obtained from step 1, the LASSO estimation is executed on the full log-likelihood function with an L1 regularization imposed. The LASSO method shrinks all coefficients towards zero, which plays a variable selection role. We apply the gradient descent algorithm to find the LASSO estimates and pathwise coordinate descent to set the tuning parameter for the penalized log-likelihood function. The simulation studies are carried out under the assumption that the measurement errors of each subject follow a first-order autoregressive, AR(1), correlation structure. The numerical results show that the variance-covariance parameter estimates from our method are comparable to those of the classical Newton-Raphson (NR) method in the simple case and outperform the NR method when the variance-covariance matrix has a complex structure. Moreover, our method successfully identifies all the relevant explanatory variables and most of the redundant explanatory variables. The proposed method is also applied to real-life data, and the results are very reasonable.
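    A hedged sketch of step 2 only: LASSO estimation of the fixed effects by proximal gradient descent (ISTA) on the Gaussian log-likelihood, with the variance-covariance matrix V treated as already estimated by the EM step of step 1, which is not reproduced. An AR(1) structure is assumed, as in the dissertation's simulations, and the pathwise coordinate descent tuning is omitted.

```python
import numpy as np

def ar1_cov(t, rho, sigma2=1.0):
    """AR(1) covariance matrix: sigma2 * rho^|i-j|."""
    idx = np.arange(t)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

def lasso_gls(X, y, V, lam, n_iter=500):
    """min_b 0.5*(y-Xb)' V^{-1} (y-Xb) + lam*||b||_1, solved by ISTA."""
    Vinv = np.linalg.inv(V)
    L = np.linalg.eigvalsh(X.T @ Vinv @ X).max()           # Lipschitz constant
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        g = X.T @ Vinv @ (X @ b - y)                       # gradient step
        z = b - g / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return b

rng = np.random.default_rng(0)
t, p = 40, 10
X = rng.normal(size=(t, p))
beta = np.zeros(p); beta[:3] = (2.0, -1.5, 1.0)            # sparse truth
V = ar1_cov(t, rho=0.6)
y = X @ beta + rng.multivariate_normal(np.zeros(t), V)
print(np.round(lasso_gls(X, y, V, lam=2.0), 2))
```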
  • Item
    Proposed Nonparametric Tests for Equality of Location and Scale Against Ordered Alternatives
    (North Dakota State University, 2021) Zhu, Tiwei
    Ordered alternatives tests are sometimes used in life-testing experiments and drug-screening studies. An ordered alternative test is sometimes used to gain power when the researcher expects the parameters, if they differ, to be ordered in a certain way. This research focuses on developing new nonparametric tests for the nondecreasing ordered alternative problem for k (k ≥ 3) populations when testing for differences in both location and scale. Six nonparametric tests are proposed for the nondecreasing ordered alternative when testing for a difference in either location or scale. The six tests are various combinations of a well-known ordered alternatives test for location and a test based on the Moses test technique for detecting differences in scale. A simulation study is conducted to determine how well the proposed tests maintain their significance levels. Powers are estimated for the proposed tests under a variety of conditions for three, four, and five populations. Several parameter configurations are considered: location parameters different and scale parameters equal; location parameters equal and scale parameters different; and both location and scale parameters different. Equal and unequal sample sizes of 18 and 30 are considered. Subgroup sizes of 3 and 6 are both used when applying the Moses test technique. Recommendations are given for which test should be used in various situations.
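    A sketch of how the two building blocks fit together, on hypothetical data: Moses-style dispersion scores (sums of squared deviations within random subgroups) are fed into the Jonckheere-Terpstra statistic, so that an ordered-location test detects an ordered-scale alternative. The exact combinations proposed in the dissertation may differ.

```python
import numpy as np

def moses_scores(x, subgroup_size, rng):
    """Randomly split x into subgroups; return each subgroup's SS of deviations."""
    x = rng.permutation(x)
    n_sub = len(x) // subgroup_size
    subs = x[: n_sub * subgroup_size].reshape(n_sub, subgroup_size)
    return np.sum((subs - subs.mean(axis=1, keepdims=True)) ** 2, axis=1)

def jonckheere_terpstra(samples):
    """J: sum over i<j of Mann-Whitney counts; large under an increasing order."""
    J = 0.0
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            x, y = samples[i], samples[j]
            J += np.sum(x[:, None] < y[None, :]) + 0.5 * np.sum(x[:, None] == y[None, :])
    return J

rng = np.random.default_rng(2)
groups = [rng.normal(0, sd, 18) for sd in (1.0, 1.5, 2.0)]   # ordered scales
scores = [moses_scores(g, subgroup_size=3, rng=rng) for g in groups]
print(jonckheere_terpstra(scores))
```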
  • Item
    Nonparametric Test for Nondecreasing Order Alternatives in Randomized Complete Block and Balanced Incomplete Block Mixed Design
    (North Dakota State University, 2020) Osafo, Mamfe
    Nonparametric tests are used to test hypotheses when the data at hand violate one or more of the assumptions of parametric test procedures. An ordered (nondecreasing) alternative is used when there is prior information about the ordering of treatment effects; it assumes that the underlying distributions are of the same type and therefore differ only in location. For example, in dose-response studies, animals are assigned to k groups corresponding to k doses of an experimental drug, and the effect of the drug on the animals is likely to increase or decrease with increasing doses; in this case, the ordered alternative is appropriate. In this paper, we propose eight new nonparametric tests for testing against nondecreasing ordered alternatives in a mixed design involving a randomized complete block design and a balanced incomplete block design. These tests involve various modifications of the Jonckheere-Terpstra test (Jonckheere, 1952; Terpstra, 1954) and Alvo and Cabilio's test (1995). Three, four, and five treatments were considered with different location parameters under different scenarios. For three and four treatments, 6, 12, and 18 blocks were used in the simulation, while 10, 20, and 30 blocks were used for five treatments. Different tests performed best under different block combinations, but overall the standardized-last modification of Alvo and Cabilio's test outperformed the other tests as the number of treatments and the number of missing observations per block increased. A simulation study was conducted comparing the powers of the various modifications of the Jonckheere-Terpstra (Jonckheere, 1952; Terpstra, 1954) and Alvo and Cabilio (1995) tests under different scenarios, and recommendations are made.
  • Item
    Distributed Inference for Degenerate U-Statistics with Application to One and Two Sample Test
    (North Dakota State University, 2020) Atta-Asiamah, Ernest
    In many hypothesis testing problems, such as one-sample and two-sample test problems, the test statistics are degenerate U-statistics. One of the challenges in practice is the computation of U-statistics for a large sample size. Moreover, for degenerate U-statistics, the limiting distribution is a mixture of weighted chi-squares involving the eigenvalues of the kernel of the U-statistic. As a result, it is not straightforward to construct the rejection region based on this asymptotic distribution. In this research, we aim to reduce the computational complexity of degenerate U-statistics and propose an easy-to-calibrate test statistic using the divide-and-conquer method. Specifically, we randomly partition the full n data points into k_n disjoint groups of equal size, compute the U-statistic on each group, and combine them by averaging to get a statistic T_n. We prove that the statistic T_n has a standard normal limiting distribution. In this way, the running time is reduced from O(n^m) to O(n^m / k_n^m) per group, where m is the order of the one-sample U-statistic. Moreover, for a given significance level α, it is easy to construct the rejection region. We apply our method to the goodness-of-fit test and the two-sample test. The simulation and real data analysis show that the proposed test achieves high power and fast running time for both one- and two-sample tests.
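    A minimal sketch of the divide-and-conquer construction: partition the sample into k disjoint groups, compute an order-2 U-statistic on each, and average. The kernel h(x, y) = xy is degenerate when E[X] = 0; the standardization needed for the normal calibration is only indicated, not derived.

```python
import numpy as np

def u_stat(x, h):
    """Order-2 one-sample U-statistic: average of h over all unordered pairs."""
    n = len(x)
    total = sum(h(x[i], x[j]) for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)

def divide_and_conquer(x, k, h):
    """T_n: average of per-group U-statistics; pair count drops from O(n^2) to O(n^2/k)."""
    return np.mean([u_stat(g, h) for g in np.array_split(x, k)])

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 1000)            # under the null E[X] = 0, h(x, y) = xy is degenerate
h = lambda a, b: a * b
print("full-sample U:   ", u_stat(x, h))
print("distributed T_n: ", divide_and_conquer(x, k=20, h=h))
```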
  • Item
    Nonparametric Tests for the Umbrella Alternative in a Mixed Design for a Known Peak
    (North Dakota State University, 2020) Asare, Boampong Adu
    When an assumption of a parametric test cannot be verified, a nonparametric test provides a simple way of testing hypotheses about populations. The motivation behind a hypothesis test is to examine the effect of a treatment or of multiple treatments against one another. For example, in dose-response studies, monkeys are assigned to k groups corresponding to k doses of an experimental drug. The drug's effect may be an increasing function of dosage up to a certain level, after which its effect decreases with further increases in dose. An umbrella alternative, in this case, is the most appropriate hypothesis for such studies. Test statistics are proposed for the umbrella alternative in mixed designs consisting of combinations of a completely randomized design (CRD), a randomized complete block design (RCBD), an incomplete block design (IBD), and a balanced incomplete block design (BIBD). Powers were obtained for a variety of cases. Different proportions of the number of blocks to the sample sizes of the CRD portion were considered, with equal sample sizes across treatments in the CRD portion and an equal number of blocks in the RCBD, IBD, and BIBD portions. Monte Carlo simulation studies were conducted in SAS to vary the design and to estimate the powers of the test statistics relative to each other. The underlying distributions considered were normal, t, and exponential.
  • Item
    Proposed Nonparametric Tests for the Umbrella Alternative in a Mixed Design
    (North Dakota State University, 2020) Al-Thubaiti, Samah Abdullah
    Several nonparametric tests are proposed for a mixed design consisting of a randomized complete block design (RCBD) and a completely randomized design (CRD) under the umbrella hypothesis with a known and an unknown peak. The two component statistics are combined using two different methods. A simulation study was conducted to investigate the performance of the proposed mixed design tests under many different cases. For both the known and the unknown peak umbrella hypotheses, the estimated power of the first combination method is better than that of the second method in all situations. We use a squared distance as a weight when assessing the power of the proposed test statistics for the known peak umbrella hypothesis. The squared-distance modification increases the test's power, in particular if the peak is indistinct from the first location parameter for four and five treatments, or if a location parameter on the left (upward) side of the umbrella is greater than all the location parameters on the right (downward) side, as in (0.8, 1.0, 0.75, 0.2) or (0.75, 0.8, 0.6, 0.4, 0.2). The modification also improves the test's power for five treatments with the peak at 3 when the underlying distribution is symmetric, as long as the peak of the umbrella is distinct. In general, for the unknown peak umbrella hypothesis, the power differs only slightly between the modified and unmodified cases, although some cases can be distinguished by the type of underlying distribution. With a symmetric distribution, the squared-distance modification is much better than the unmodified test statistics in some cases with four and five treatments. With three treatments, the estimated power of the proposed test statistics with the squared-distance modification (3.3.15), (3.3.16) differs only slightly from that of the unmodified test statistics (3.3.13), (3.3.14) for both symmetric and skewed underlying distributions.
  • Item
    Proposed Methods for the Nondecreasing Order-Restricted Alternative in a Mixed Design
    (North Dakota State University, 2020) Alnssyan, Badr Suliman
    Nonparametric statistics are commonly used because of their robustness when the assumptions underlying the usual parametric statistics are violated. In this dissertation, we propose eight nonparametric methods to test for a nondecreasing ordered alternative in a mixed design consisting of a combination of a completely randomized design (CRD) and a randomized complete block design (RCBD). Four nonparametric tests, based on the Jonckheere-Terpstra test and modifications of it, were employed to construct these methods. A Monte Carlo simulation study was conducted in SAS to investigate the performance of the proposed tests under a variety of nondecreasing location shifts among three, four, and five populations, and to compare their powers with each other and with the powers of the test statistics introduced by Magel et al. (2009). Three underlying distributions were used in the study: the standard normal distribution, the standard exponential distribution, and Student's t-distribution (3 degrees of freedom). We considered three scenarios for the proportion of the number of blocks in the RCBD portion to the sample size in the CRD portion, namely the number of blocks being larger than, equal to, and smaller than the CRD sample size. Moreover, both equal and unequal sample sizes were considered for the CRD portion. The results of the simulation study indicate that all the proposed methods maintain their Type I error rates, and that at least one of the proposed methods performs better than the tests of Magel et al. (2009) in terms of estimated power. In general, situations are found in which the proposed methods have higher powers and situations are found in which the tests in Magel et al. (2009) have higher powers.
  • Item
    Proposed Nonparametric Tests for the Simple Tree Alternative for Location and Scale Testing
    (North Dakota State University, 2020) Alsubie, Abdelaziz Qasi
    Location-scale problems arise in many fields, such as bioinformatics, climate dynamics, finance, and medicine (Marozzi, 2013). This research focuses on developing tests to determine whether one or more treatment effects differ from the control. It is assumed that when a treatment effect differs from the control effect, it is greater in mean and/or variance (the simple tree alternative), and that a difference in treatment effect results in a change of the location and/or scale parameters between two population distributions. This research considers nonparametric tests for determining whether one or more of the treatment effects is larger than the control. Five nonparametric tests are proposed for the simple tree alternative. A simulation study will be conducted to determine how well the proposed tests maintain their significance levels. Powers will be estimated for the proposed tests under a variety of conditions for two, three, and four populations. Three types of parameter configurations will be considered: location parameters different and scale parameters equal; location parameters equal and scale parameters different; and both location and scale parameters different. Recommendations will be given as to which test does better under particular conditions.
  • Item
    Proposed Nonparametric Tests for the Umbrella Alternative in a Mixed Design for Both Known and Unknown Peak
    (North Dakota State University, 2019) Alsuhabi, Hassan Rashed
    In several situations, researchers might test for an umbrella alternative among treatment effects. The need for an umbrella alternative arises, for instance, in evaluating the response to drug dosage: the response might increase as the dosage increases, with a downturn occurring after the optimal dosage is exceeded. A test statistic for the umbrella alternative in a completely randomized design was proposed by Mack and Wolfe (1981). An extension of the Mack-Wolfe test to the randomized complete block design, introducing a blocking factor, was proposed by Kim and Kim (1992). This thesis proposes two nonparametric test statistics for mixed design data with k treatments when the peak is known and four statistics when the peak is unknown, where the data are a mixture of a CRD and an RCBD. A Monte Carlo simulation is conducted to compare the power of the first two proposed tests when the peak is known, each of which is compared with the tests proposed by Magel et al. (2010), and to compare the power of the last four proposed tests when the peak is unknown. In this study, we simulate from the exponential, normal, and t (3 degrees of freedom) distributions. For every distribution, equal sample sizes are selected for the CRD portion so that the sample size, n, is 6, 10, 16, or 20. The number of blocks for the RCBD is taken to be half, equal to, and twice the sample size for each treatment. Furthermore, a variety of location parameter configurations are considered for three, four, and five populations. Powers were estimated for both the known and unknown peak cases. In both cases, the results of the simulation study show that the proposed tests that standardize first generally perform better than those that standardize second. This thesis also shows that adding the distance modification to the Mack-Wolfe and Kim-Kim statistics gives the proposed test statistics more power than their counterparts without the distance modification.
  • Item
    Integrative Data Analysis of Microarray and RNA-seq
    (North Dakota State University, 2018) Wang, Qi
    Background: Microarray and RNA sequencing (RNA-seq) have been the two most commonly used high-throughput technologies for gene expression profiling over the past decades. For global gene expression studies, both techniques are expensive, and each has unique advantages and limitations. Integrative analysis of these two types of data would provide increased statistical power, reduced cost, and complementary technical advantages. However, the completely different mechanisms of the two high-throughput techniques make the two types of data highly incompatible. Methods: Based on their degrees of compatibility, the genes are grouped into clusters using a novel clustering algorithm called Boundary Shift Partition (BSP). For each cluster, a linear model is fitted to the data, and the number of differentially expressed genes (DEGs) is calculated by running a two-sample t-test on the residuals. The optimal number of clusters can be determined using a selection criterion that penalizes the number of parameters in the model fitting. The method was evaluated using data simulated from various distributions and compared with the conventional K-means clustering method, Hartigan-Wong's algorithm. The BSP algorithm was then applied to microarray and RNA-seq data obtained from embryonic heart tissues of wild-type mice and Tbx5 mice. The raw data went through multiple preprocessing steps, including data transformation, quantile normalization, linear modeling, principal component analysis, and probe alignment. The differentially expressed genes between wild type and Tbx5 were identified using the BSP algorithm. Results: The accuracies of the BSP algorithm on the simulated data were higher than those of Hartigan-Wong's algorithm in the cases with smaller standard deviations across the five underlying distributions, and the BSP algorithm found the correct number of clusters using the selection criterion. The BSP method identified 584 differentially expressed genes between the wild-type and Tbx5 mice. A core gene network developed from the differentially expressed genes showed a set of key genes known to be important for heart development. Conclusion: The BSP algorithm is an efficient and robust classification method for integrating microarray and RNA-seq data.
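    One plausible reading of the per-cluster testing step, as a hedged sketch: within a cluster, fit a linear model relating RNA-seq to microarray expression, then run a two-sample t-test on a gene's residuals between conditions. The BSP clustering itself and the preprocessing pipeline are not reproduced, and the data below are synthetic stand-ins.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_samples = 40
condition = np.repeat([0, 1], n_samples // 2)          # wild type vs. Tbx5 samples
microarray = rng.normal(8, 1, n_samples)               # platform 1 expression
rnaseq = 1.2 * microarray + 0.8 * condition + rng.normal(0, 0.3, n_samples)

# cluster-level linear model relating the two platforms (slope + intercept)
slope, intercept = np.polyfit(microarray, rnaseq, 1)
residuals = rnaseq - (slope * microarray + intercept)

# call the gene differentially expressed if its residuals differ by condition
t, p = stats.ttest_ind(residuals[condition == 0], residuals[condition == 1])
print(f"t = {t:.2f}, p = {p:.3g}")
```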
  • Item
    Conditional Random Field with Lasso and its Application to the Classification of Barley Genes Based on Expression Level Affected by Fungal Infection
    (North Dakota State University, 2019) Liu, Xiyuan
    The classification of gene expression levels, and gene expression analysis more generally, is a major research area in statistics. There are several classical methods for solving such classification problems. To apply the logistic regression model (LRM) and other classical methods, the observations in the dataset should satisfy the independence assumption: the observations must be independent of each other, and the predictors (independent variables) should be independent as well. These assumptions are usually violated in gene expression analysis. Although the classical hidden Markov model (HMM) can address the dependence among observations, it requires the independent variables in the dataset to be discrete and independent; unfortunately, gene expression level is a continuous variable. To solve the classification problem for gene expression level data, the conditional random field (CRF) is introduced. Finally, the Least Absolute Shrinkage and Selection Operator (LASSO) penalty, a dimension reduction method, is introduced to improve the CRF model.
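    A hedged sketch of an L1-penalized linear-chain CRF, assuming the sklearn-crfsuite package, whose c1 parameter imposes the LASSO-type penalty that shrinks feature weights to exactly zero. The barley expression data and the dissertation's feature construction are not reproduced; the sequences below are synthetic stand-ins with one irrelevant feature for the penalty to drop.

```python
import numpy as np
import sklearn_crfsuite

rng = np.random.default_rng(0)

def make_sequence(length=30):
    """Toy sequence of gene positions with continuous expression features."""
    labels, feats = [], []
    state = "control"
    for _ in range(length):
        if rng.random() < 0.2:                    # sticky states give chain structure
            state = "infected" if state == "control" else "control"
        shift = 1.5 if state == "infected" else 0.0
        # 'noise' is irrelevant to the label; the L1 penalty can zero it out
        feats.append({"expr": 2.0 + shift + rng.normal(), "noise": rng.normal()})
        labels.append(state)
    return feats, labels

data = [make_sequence() for _ in range(50)]
X, y = [d[0] for d in data], [d[1] for d in data]

# c1 is the L1 (LASSO-type) penalty weight; c2 is the L2 weight
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.5, c2=0.0, max_iterations=100)
crf.fit(X, y)
print(crf.predict(X[:1])[0][:10])
```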
  • Item
    A Conditional Random Field (CRF) Based Machine Learning Framework for Product Review Mining
    (North Dakota State University, 2019) Ming, Yue
    The task of opinion mining from product reviews has previously been addressed with rule-based approaches or generative learning models such as hidden Markov models (HMMs). This paper introduces a discriminative model using linear-chain conditional random fields (CRFs), which can naturally incorporate arbitrary, non-independent features of the input without requiring conditional independence among the features or distributional assumptions on the inputs. The framework first performs part-of-speech (POS) tagging over each word in the sentences of the review text. Performance is evaluated on three criteria: precision, recall, and F-score, and the results show that this approach is effective for this type of natural language processing (NLP) task. The framework then extracts the keywords associated with each product feature and summarizes them into concise lists that are simple and intuitive for people to read.
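    A minimal sketch of the first stage, assuming the sklearn-crfsuite package: word-level features feed a linear-chain CRF POS tagger, evaluated by precision, recall, and F-score. The toy sentences stand in for the product review corpus, and the feature template is illustrative, not the paper's.

```python
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def word_features(sent, i):
    """A small illustrative feature template for one token."""
    w = sent[i]
    return {"word.lower": w.lower(), "word.isdigit": w.isdigit(),
            "suffix2": w[-2:], "BOS": i == 0, "EOS": i == len(sent) - 1}

train = [(["the", "battery", "lasts", "long"], ["DET", "NOUN", "VERB", "ADV"]),
         (["a", "screen", "cracks", "fast"], ["DET", "NOUN", "VERB", "ADV"])]
X = [[word_features(s, i) for i in range(len(s))] for s, _ in train]
y = [tags for _, tags in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
pred = crf.predict(X)
print(metrics.flat_classification_report(y, pred))  # precision / recall / F-score
```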