Computer Science Doctoral Work
Permanent URI for this collection: hdl:10365/32551
Browsing Computer Science Doctoral Work by program "Computer Science"
Now showing 1 - 20 of 44
Item: Adaptive Regression Testing Strategies for Cost-Effective Regression Testing (North Dakota State University, 2013). Schwartz, Amanda Jo
Regression testing is an important but expensive part of the software development life-cycle. Many different techniques have been proposed for reducing the cost of regression testing. To date, much research has been performed comparing regression testing techniques, but very little research has been performed to aid practitioners and researchers in choosing the most cost-effective technique for a particular regression testing session. One recent study investigated this problem and proposed Adaptive Regression Testing (ART) strategies to aid practitioners in choosing the most cost-effective technique for a specific version of a software system. The results of this study showed that the techniques chosen by the ART strategy were more cost-effective than techniques that did not consider system lifetime and testing processes. This work has several limitations, however. First, it only considers one ART strategy. There are many other strategies which could be developed and studied that could be more cost-effective. Second, the ART strategy used the Analytical Hierarchy Process (AHP). The AHP method is sensitive to the subjective weights assigned by the decision maker. The AHP method is also very time consuming because it requires many pairwise comparisons. Pairwise comparisons also limit the scalability of the approach and are often found to be inconsistent. This work proposes three new ART strategies to address these limitations. One strategy utilizing the fuzzy AHP method is proposed to address imprecision in the judgments made by the decision maker. A second strategy utilizing a fuzzy expert system is proposed to reduce the time required by the decision maker, eliminate inconsistencies due to pairwise comparisons, and increase scalability. A third strategy utilizing the Weighted Sum Model is proposed to study the performance of a simple, low-cost strategy. A series of empirical studies is then performed to evaluate the new strategies. The results of the studies show that the strategies proposed in this work are more cost-effective than the strategy presented in the previous study.

Item: Addressing Challenges in Data Privacy and Security: Various Approaches to Secure Data (North Dakota State University, 2021). Pattanayak, Sayantica
Emerging neural-network-based machine learning techniques such as deep learning and its variants have shown tremendous potential in many application domains. However, neural network models raise serious privacy concerns due to the risk of leakage of highly privacy-sensitive data. In this dissertation, we propose various techniques to hide sensitive information and also evaluate the performance and efficacy of our proposed models. In our first research work, we propose a model that can both encrypt and decrypt data. Our model is based on symmetric key encryption and a back-propagation neural network; it takes decimal values, converts them to ciphertext, and then converts the ciphertext back to decimal values. In our second research work, we propose a remote password authentication scheme using a neural network. In this model, we show how a user can communicate securely with more than one server. A user registers with a trusted authority and receives a user id and a password. The user uses the password and the user id to log in to one or more servers. The servers can validate the legitimacy of the user.
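As a rough illustration of the multi-server registration and validation flow just described, the following sketch uses an HMAC-derived verifier issued by the trusted authority; this is a hypothetical stand-in for the dissertation's neural-network-based validation, and all names and values are illustrative.

```python
import hmac, hashlib, secrets

# Hypothetical sketch only: a trusted authority issues a verifier at
# registration, and any server holding it can check a login attempt.
# An HMAC stands in for the dissertation's neural-network validator, and
# sharing AUTHORITY_KEY with the servers is a deliberate simplification.
AUTHORITY_KEY = secrets.token_bytes(32)

def register(user_id: str, password: str) -> bytes:
    """Trusted authority derives the verifier handed to the servers."""
    return hmac.new(AUTHORITY_KEY, f"{user_id}:{password}".encode(), hashlib.sha256).digest()

def server_validate(user_id: str, password: str, stored_verifier: bytes) -> bool:
    """A server recomputes the verifier and compares it in constant time."""
    attempt = hmac.new(AUTHORITY_KEY, f"{user_id}:{password}".encode(), hashlib.sha256).digest()
    return hmac.compare_digest(attempt, stored_verifier)

verifier = register("alice", "s3cret")                 # issued at registration
print(server_validate("alice", "s3cret", verifier))    # True
print(server_validate("alice", "wrong", verifier))     # False
```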
Our experiments use different classifiers to evaluate the accuracy and efficiency of our proposed model. In our third research paper, we develop a technique to securely send patient information to different organizations. Our technique uses different fuzzy membership functions to hide sensitive information about patients. In our fourth research paper, we introduce an approach that substitutes sensitive attributes with non-sensitive attributes. We divide the data set into three different subsets: desired, sensitive, and non-sensitive. The output of the denoising autoencoder contains only the desired and non-sensitive subsets; the sensitive subsets are hidden by the non-sensitive subsets. We evaluate the efficacy of our predictive model using three different flavors of autoencoders and measure the F1-score of our model against each of the three. Because our predictive model is privacy-oriented, we also use a Generative Adversarial Network (GAN) to show to what extent our model is secure.

Item: Analyses of People’s Perceptions Toward Risks in Genetically Modified Organisms (North Dakota State University, 2016). Dass, Pranav
This research aims to analyze people’s perceptions of the potential risks associated with the presence of genetically modified organisms (GMOs) in food products. We formulated research questions and hypotheses based on parameters including age, gender, and state of residence to analyze these perceptions. We conducted an online nationwide survey across the United States and recruited participants from the general population to understand their perceptions of risks for GMOs and GM foods. We formulated a set of questions regarding the effects of GMOs on food products (including both pre- and post-study questions) and investigated the changes in people’s perceptions after reading selected news releases about GMOs. The survey responses were collected and categorized according to the research parameters, and statistical assessments were conducted to test the hypotheses. Additionally, we introduced a novel approach to analyze the responses by creating a mind-map framework for both the pre- and post-study responses. We found that people residing in the southern region of the United States responded more positively toward GMOs than individuals residing in the northeast, west, and mid-west regions. We also found that people’s perceptions of GMOs were not significantly different whether they resided in states with Republican or Democrat/non-partisan party affiliations. Further, we observed that male participants responded more negatively than female participants across the nation. We compared the results obtained from respondents in the general population to those from a group of Computer Science students at North Dakota State University who completed the same survey, and found that students considered GMOs less risky than the general population did. A third research study compared participants from the general population to a second group, also recruited from the general population, who did not read the news releases that separated the survey’s pre- and post-study questions.
We observed that the news releases impacted the participants in the first group and, eventually, changed their perceptions of GMOs, whereas participants in the second group showed few or no perception changes.

Item: Blockchain-Based Trust Model: Alleviating the Threat of Malicious Cyber-Attacks (North Dakota State University, 2020). Bugalwi, Ahmed Youssef
Online communities provide a unique environment where interactions are performed among subscribers who share interests. Members of these virtual communities are typically classified as trustworthy or untrustworthy. Trust and reputation have become indispensable properties due to the rapid growth of uncertainty and risk, a risk that results from cyber-attacks carried out by untrustworthy actors. A malicious attack may produce misleading information, making the community unreliable. A trust mechanism is a substantial instrument for enabling safe functioning within a community. Most virtual communities are centralized, which implies that they own, manage, and control trust information without permission from its legitimate owner. The problem of ownership arises because actors may lose their reputations if the community decides to shut down its business. Sharing information is another valuable feature that aids in lessening the impact of dishonest behavior. A new trust model called “TrustMe” was developed in this research as a reliable mechanism that generates precise trust information for virtual communities. TrustMe consists of several factors that aim to confuse untrustworthy actors and to make the generated trust score hard to reverse. A blockchain-based trust model is also developed to address the problem of ownership and to offer a decentralized information-sharing mechanism through a distributed application called “DATTC.” The efficiency of the proposed models was evaluated by conducting various analytic experimental studies. An unsupervised machine learning method (density-based clustering) was applied using two different datasets. Graph analysis was also conducted to study the evolution of communities and trust by finding connections between graph metrics and trust scores generated by TrustMe. Finally, a set of simulations using stochastic models evaluated the accuracy and success rates of TrustMe, and another simulation set mimicked the blockchain model in alleviating the influence of Sybil attacks. The relationships among actors were modeled by dividing actors into trustworthy and untrustworthy groups performing cooperative interactions and malicious attacks, respectively. The results of the study show that TrustMe is promising and support the first hypothesis, as TrustMe outperformed other trust models. Additionally, the results confirm that the blockchain-based trust model efficiently mitigates malicious cyber-attacks by employing cross-community trust and preserves the ownership property.

Item: Computational Methods for Predicting Protein-Nucleic Acids Interaction (North Dakota State University, 2015). Cheng, Wen
Since the inception of various proteomic projects, protein structures with unknown functions have been discovered at a rapid pace. Proteins regulate many important biological processes by interacting with nucleic acids, which include DNA and RNA. Traditional wet-lab methods for protein function discovery are too slow to handle this rapid increase of data. Therefore, there is a need for computational methods that can predict the interaction between proteins and nucleic acids.
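As a loose illustration of the knowledge-based scoring idea developed in the following paragraphs, the sketch below computes log-odds scores from observed counts of protein atom types near a DNA component versus background frequencies; the counts are toy values and this is not the dissertation's SSSM implementation.

```python
import math
from collections import Counter

# Illustrative knowledge-based potential: atom types that contact a DNA
# component more often than the background composition suggests get a
# positive score, under-represented types get a negative score.
def log_odds_scores(observed_counts: Counter, background_counts: Counter) -> dict:
    n_obs = sum(observed_counts.values())
    n_bg = sum(background_counts.values())
    scores = {}
    for atom_type, bg in background_counts.items():
        obs = observed_counts.get(atom_type, 0) + 1                  # add-one smoothing
        p_obs = obs / (n_obs + len(background_counts))
        p_bg = (bg + 1) / (n_bg + len(background_counts))
        scores[atom_type] = math.log(p_obs / p_bg)                   # favorable contacts score > 0
    return scores

observed = Counter({"N": 40, "O": 35, "C": 25})      # toy contact counts near one DNA base
background = Counter({"N": 20, "O": 20, "C": 60})    # toy background atom composition
print(log_odds_scores(observed, background))
```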
There are two related problems in predicting protein-nucleic acid interactions. One problem is to identify nucleic acid-binding sites on protein structures, and the other is to predict the 3-D structure of the complex that proteins and nucleic acids form during interaction. The second problem can be further divided into two steps: the first step is to generate potential structures for the protein-nucleic acid complex, and the second step is to assign scores to the poses generated in the first step. This dissertation presents two computational methods that we developed to predict protein-nucleic acid interactions. The first method is a scoring function that can discriminate native structures of protein-DNA complexes from non-native poses, which are also known as docking decoys. We analyze the distribution of protein atoms around each structural component of the DNA and develop spatial-specific scoring matrices (SSSMs) based on the observed distribution. We show that the SSSMs can be used as a knowledge-based energy function to discriminate native protein-DNA structures from various decoys. Our second method discovers sub-graphs that are enriched on protein-nucleic acid interfaces and then uses these sub-graphs to predict RNA-binding sites on protein structures and to assign scores to protein-RNA poses. First, the interface area of each RNA-binding protein is represented as a graph, where each node represents an interface residue. Then, common sub-graphs that are abundant in these graphs are identified. The method is able to identify RNA-binding sites on the protein surface with high accuracy. We also demonstrate that the common sub-graphs can be used as a scoring function to rank protein-RNA poses. Our method is computationally simple, and its results are easy to interpret in biological contexts.

Item: Contributing Factors Promoting Success for Females in Computing: A Comparative Study (North Dakota State University, 2022). Gronneberg, Bethlehem
Despite the growing global demand for Computer Science (CS) professionals, their high earning potential, and diversified career paths (U.S. BLS 2021, UNESCO 2017), a critical gap exists between enrollment and graduation rates among female students in computing fields across the world (Raigoza 2017, Hailu 2018, UNESCO 2017, Bennedsen and Caspersen 2007). The largest dropout point occurs during the first two years of their CS studies (Giannakos, et al., 2017). The purpose of this convergent parallel mixed-methods research, conducted at two public universities in the U.S. and Ethiopia, was to comparatively investigate, describe, and analyze factors correlated with the experiences and perceptions of female undergraduates as they relate to persistence in CS/Software Engineering (SE) degrees. Anchored in Tinto’s theory of retention, the quantitative part of the study examined three possible predictive factors of success for students who were enrolled in the first two CS/SE courses and evaluated differences between genders and institutions on those factors. Pearson’s correlation coefficient tests were applied to test the hypothesis that the perceptions of Degree’s Usefulness (DU), Previously Acquired Knowledge (PAK), and Cognitive Attitude (CA) correlate with the decision to persist for the research participants. The results showed a statistically significant positive correlation between perceptions of DU, the influence of PAK, and the decision to persist.
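A minimal sketch of the kind of Pearson correlation test reported above, using scipy with hypothetical survey-style values (the actual study data and coding scheme are not reproduced here):

```python
from scipy.stats import pearsonr

# Toy data: Likert-style ratings of perceived Degree Usefulness (DU) and a
# binary coding of the decision to persist. Values are illustrative only.
du_scores = [4, 5, 3, 4, 2, 5, 4, 3, 5, 4]
persistence = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

r, p_value = pearsonr(du_scores, persistence)
print(f"r = {r:.3f}, p = {p_value:.3f}")  # a significant positive r supports the hypothesis
```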
Two-sample t-tests revealed gender and institutional differences in the influence of PAK and CA. The qualitative part of the study reported 12 contributing factors of success for a graduating class of females in CS/SE, using sentiment analysis and topic modeling from the domain of Natural Language Processing (NLP) to interpret auto-transcribed interview responses.

Item: A Data Mining Approach to Radiation Hybrid Mapping (North Dakota State University, 2014). Seetan, Raed
The task of mapping markers from Radiation Hybrid (RH) mapping experiments is typically viewed as equivalent to the traveling-salesman problem, which has combinatorial complexity. As an additional problem, experiments commonly result in some unreliable markers that reduce the overall map quality. Due to the large numbers of markers in current radiation hybrid populations, the use of data mining techniques becomes increasingly important for reducing both the computational complexity and the impact of noise in the original data. In this dissertation, a clustering-based approach is proposed for addressing both the problem of filtering unreliable markers (framework maps) and the problem of mapping large numbers of markers (comprehensive maps) efficiently. Traditional approaches for eliminating unreliable markers use resampling of the full data set, which has an even higher computational complexity than the original mapping problem. In contrast, the proposed algorithms use a divide-and-conquer strategy to construct framework maps based on clusters that exclude unreliable markers. The clusters of markers are ordered using parallel processing and are then combined to form the complete map. Three algorithms are presented that explore the trade-off between the number of markers included in the framework map and placement accuracy. Since the mapping problem is susceptible to noise, it is often beneficial to remove markers that are not trustworthy. Traditional mapping techniques for building comprehensive maps process all markers together, including unreliable markers, in a single-iteration approach, and the accuracy of the constructed maps may be reduced. In this research work, two-stage algorithms are proposed that map most markers by first creating a framework map of the reliable markers and then incrementally adding the remaining markers to construct high-quality comprehensive maps. All proposed algorithms have been evaluated on several human chromosomes using radiation hybrid datasets of varying sizes, and their performance is compared with state-of-the-art RH mapping software. Overall, the proposed algorithms are not only much faster than the comparative approaches, but the quality of the resulting maps is also much higher.

Item: Detecting Insider and Masquerade Attacks by Identifying Malicious User Behavior and Evaluating Trust in Cloud Computing and IoT Devices (North Dakota State University, 2019). Kambhampaty, Krishna Kanth
There are a variety of communication mediums and devices for interaction, and users hop from one medium to another frequently. Though the increase in the number of devices brings convenience, it also raises security concerns. Providing a platform to users is as important as securing it. In this dissertation we propose a security approach that captures user behavior for identifying malicious activities. System users exhibit certain behavioral patterns while utilizing the resources.
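A hypothetical sketch of the feature-based trust scoring developed in the following paragraphs; the feature names, weights, and formula are illustrative assumptions rather than the dissertation's exact model.

```python
# Illustrative only: a trust score in [0, 1] that starts fully trusted and is
# reduced by a weighted penalty for each suspicious behavioral feature observed.
SUSPICIOUS_WEIGHTS = {
    "new_files_accessed": 0.10,
    "multiple_accounts_same_device": 0.15,
    "proxy_server_used": 0.10,
    "incorrect_logins": 0.05,   # per failed attempt
}

def trust_score(observations: dict) -> float:
    """Higher counts of suspicious features give a lower trust value, clamped to [0, 1]."""
    penalty = sum(SUSPICIOUS_WEIGHTS[k] * v for k, v in observations.items())
    return max(0.0, 1.0 - penalty)

print(trust_score({"new_files_accessed": 1, "incorrect_logins": 0,
                   "multiple_accounts_same_device": 0, "proxy_server_used": 0}))  # ~0.90
print(trust_score({"new_files_accessed": 3, "incorrect_logins": 5,
                   "multiple_accounts_same_device": 1, "proxy_server_used": 1}))  # ~0.20
```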
User behaviors include device location, accessing certain files on a server, using a designated or specific user account, and so on. If this behavior is captured and compared with normal users’ behavior, anomalies can be detected. In our model, we identify malicious users and assign a trust value to each user accessing the system. When a user accesses files on the servers that have not been previously accessed, or accesses multiple accounts from the same device, that user is considered suspicious. If this behavior continues, they are categorized as ingenuine. A trust value is assigned to users, and this value determines the trustworthiness of a user. Genuine users get a higher trust value and ingenuine users get a lower trust value. The trust value ranges from zero to one, with one being the highest trustworthiness and zero being the lowest. In our model, we have sixteen different features to track user behavior. These features evaluate users’ activities: from the time users log in to the system until they log out, users are monitored based on these sixteen features, which determine whether the user is malicious. For instance, features such as accessing too many accounts, using proxy servers, and too many incorrect logins indicate suspicious activity. The more of these features a user exhibits, the more suspicious the user is and the lower the trust value. Identifying malicious users could prevent and/or mitigate attacks, enabling timely action against these users before they perform any unauthorized or illegal actions. This could prevent insider and masquerade attacks. This application could be utilized in mobile, cloud, and pervasive computing platforms.

Item: A Distributed Linear Programming Model in a Smart Grid (North Dakota State University, 2013). Ranganathan, Prakash
Advances in computing and communication have resulted in large-scale distributed environments in recent years. They are capable of storing large volumes of data and often have multiple compute nodes. However, the inherent heterogeneity of data components, the dynamic nature of distributed systems, the need for information synchronization and data fusion over a network, and security and access-control issues make the problem of resource management and monitoring a tremendous challenge in the context of a Smart grid. Unfortunately, the concept of cloud computing and the deployment of distributed algorithms have been overlooked in the electric grid sector. In particular, centralized methods for managing resources and data may not be sufficient to monitor a complex electric grid; most electric grid management, including generation, transmission, and distribution, is by and large handled by centralized control. In this dissertation, I present a distributed algorithm for resource management which builds on the traditional simplex algorithm used for solving large-scale linear optimization problems. The distributed algorithm is exact, meaning its results are identical to those obtained in a centralized setting. More specifically, I discuss a distributed decision model in which a large-scale electric grid is decomposed into many sub-models that can support the resource assignment, communication, computation, and control functions necessary to provide robustness and to prevent incidents such as cascading blackouts.
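As an illustration of the kind of linear program a single sub-model might solve, the sketch below minimizes generation cost for a toy two-generator node using scipy; the numbers are hypothetical, and the Dantzig-Wolfe coordination described next is not shown.

```python
from scipy.optimize import linprog

# Toy LP subproblem (not the dissertation's IEEE-bus formulation): minimize
# generation cost while exactly meeting a local demand of 150 MW, subject to
# per-generator capacity limits.
cost = [20, 35]                # $/MWh for two hypothetical generators
A_eq = [[1, 1]]                # total generation must equal demand
b_eq = [150]
bounds = [(0, 100), (0, 120)]  # capacity limits in MW

res = linprog(c=cost, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
print(res.x, res.fun)          # e.g. [100. 50.] at cost 3750.0
```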
The key contribution of this dissertation is to design, develop, and test a resource-allocation process through a decomposition principle in a Smart grid. I have implemented and tested the Dantzig-Wolfe decomposition process in standard IEEE 14-bus and 30-bus systems. The dissertation provides details about how to formulate, implement, and test such an LP-based design to study the dynamic behavior and impact of an electrical network while considering its failure and repair rates. The computational benefits of the Dantzig-Wolfe approach for finding an optimal solution and its applicability to IEEE bus systems are presented.

Item: Extracting Useful Information and Building Predictive Models from Medical and Health-Care Data Using Machine Learning Techniques (North Dakota State University, 2020). Kabir, Md Faisal
In healthcare, a large amount of medical data has emerged. To effectively use these data to improve healthcare outcomes, clinicians need to identify the relevant measures and apply the correct analysis methods for the type of data at hand. In this dissertation, we present various machine learning (ML) and data mining (DM) methods that can be applied to the types of data sets available in the healthcare area. The first part of the dissertation investigates DM methods on healthcare or medical data to find significant information in the form of rules. Class association rule mining, a variant of association rule mining, was used to obtain rules with targeted items or class labels. These rules can be used to improve public awareness of different cancer symptoms and could also be useful for initiating prevention strategies. In the second part of the thesis, ML techniques are applied to healthcare or medical data to build predictive models. Three different classification techniques were investigated on a real-world breast cancer risk factor data set. Due to the imbalanced characteristics of the data set, various resampling methods were used before applying the classifiers, and a significant improvement in performance was observed when a resampling technique was applied compared to no resampling. Moreover, a super learning technique that uses multiple base learners was investigated to boost the performance of classification models. Two different forms of super learner were investigated: the first uses two base learners while the second uses three. The models were then evaluated against well-known benchmark data sets related to the healthcare domain, and the results showed that the SL model performs better than the individual classifiers and the baseline ensemble. Finally, we assessed cancer-relevant genes of prostate cancer with the most significant correlations to the clinical outcomes of sample type and overall survival. Rules were discovered from the RNA sequencing data of prostate cancer patients. Moreover, we built a regression model, and from that model rules for predicting the survival time of patients were generated.

Item: Foundational Algorithms Underlying Horizontal Processing of Vertically Structured Big Data Using pTrees (North Dakota State University, 2016). Hossain, Mohammad
For Big Data, the time taken to process a data mining algorithm is a critical issue. Many reliable algorithms are unusable in the big data environment because the processing takes an unacceptable amount of time. Therefore, increasing the speed of processing is very important.
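A minimal sketch of the vertical bit-slicing idea elaborated in the following paragraphs (illustrative only, without the pTree compression itself):

```python
import numpy as np

# An 8-bit attribute column is split into bit slices, and a range predicate is
# answered with bitwise logic over slices instead of scanning records.
values = np.array([13, 200, 7, 255, 64, 130], dtype=np.uint8)

# One boolean slice per bit position, most significant bit first.
slices = [(values >> bit) & 1 for bit in range(7, -1, -1)]

# The predicate "value >= 128" is simply the top bit slice; counting matches is
# a horizontal operation over one bit column rather than a vertical scan.
mask_ge_128 = slices[0].astype(bool)
print(mask_ge_128)             # [False  True False  True False  True]
print(int(mask_ge_128.sum()))  # 3 records satisfy the predicate
```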
To address the speed issue, we use horizontal processing of vertically structured data rather than the ubiquitous vertical (scan) processing of horizontal (record) data. pTree technology represents and processes data differently from traditional horizontal data technologies. In pTree technology, the data is structured column-wise (into bit slices) and the columns are processed horizontally (typically across a few to a few hundred bit-level columns), while in horizontal technologies, data is structured row-wise and those rows are processed vertically. pTrees are lossless, compressed, and data-mining-ready data structures. pTrees are lossless because the vertical bit-wise partitioning used in pTree technology guarantees that all information is retained completely; there is no loss of information in converting horizontal data to this vertical format. pTrees are data-mining ready because the fast, horizontal data mining processes involved can be done without the need to reconstruct the original form of the data. This technique has been exploited in various domains and data mining algorithms, ranging from classification and clustering to association rule mining and other data mining algorithms. In this research work, we evaluate and compare the speeds of various foundational algorithms required for using pTree technology in many data mining tasks.

Item: Harnessing User Generated Multimedia Content in the Creation of Collaborative Classification Structures and Retrieval Learning Games (North Dakota State University, 2015). Borchert, Otto Jerome
This paper describes a software tool, the Classification, Identification, and Retrieval-based Collaborative Learning Environment (CIRCLE), that assists groups of people in the classification and identification of real-world objects. A thorough literature review identified current pedagogical theories that were synthesized into a series of five tasks: gathering, elaboration, classification, identification, and reinforcement through game play. This approach is detailed as part of an included peer-reviewed paper. Motivation is increased through the use of formative and summative gamification: earning points for completing important portions of the tasks and playing retrieval-learning-based games, respectively; this work is also included as a peer-reviewed conference proceedings paper. Collaboration is integrated into the experience through specific tasks and communication mediums. Implementation focused on a REST-based client-server architecture. The client is a series of web-based interfaces for completing each of the tasks, supporting formal classroom interaction through faculty accounts and student tracking, and providing a module for peers to help each other. The server, developed using an in-house JavaMOO platform, stores relevant project data and serves data through a series of messages implemented as a JavaScript Object Notation Application Programming Interface (JSON API). Through a series of two beta tests and two experiments, it was discovered that the second task, elaboration, requires considerable support. While students were able to properly suggest experiments and make observations, the subtask of cleaning the data for use in CIRCLE required extra support. When supplied with more structured data, students were enthusiastic about the classification and identification tasks, showing marked improvement in usability scores and in open-ended survey responses.
CIRCLE tracks a variety of educationally relevant variables, facilitating support for instructors and researchers. Future work will revolve around material development, software refinement, and theory building. Curricula, lesson plans, and instructional materials need to be created to seamlessly integrate CIRCLE into a variety of courses. Further refinement of the software will focus on improving the elaboration interface and developing further game templates to add to the motivation and retrieval learning aspects of the software. Data gathered from CIRCLE experiments can be used to develop and strengthen theories on teaching and learning.

Item: Heuristic Clustering with Secured Routing in Two Tier Sensor Networks (North Dakota State University, 2013). Gagneja, Kanwalinderjit Kaur
This study addresses the management of Heterogeneous Sensor Networks (HSNs) in an area of interest. The use of sensors in our day-to-day life has increased dramatically, and in ten to fifteen years sensor nodes may cover the whole world and be accessible through the Internet. Currently, sensors are in use for such things as vehicular movement tracking, nuclear power plant monitoring, fire incident reporting, traffic control, and environmental monitoring. There is vast potential for further applications, such as entertainment, drug trafficking detection, border surveillance, crisis management, underwater environment monitoring, and smart spaces, so this research area has a lot of potential. Sensors have limited resources, and researchers have devised methods to deal with the related issues, but security, routing, and clustering in sensor networks have been handled separately by past researchers. Since route selection directly depends on the position of the nodes, and the sets of resources may change dynamically, cumulative and coordinated activities are essential to maintain the organizational structure of the deployed sensors. To conserve sensor network energy, it is therefore better to follow a holistic approach that takes care of both clustering and secure routing. In this research, we have developed an efficient key management approach with an improved tree routing algorithm for clustered heterogeneous sensor networks. The simulation results show that this scheme offers good security and uses less computation with substantial savings in memory requirements when compared with other key management, clustering, and routing techniques. The low-end nodes are simple and low cost, while the high-end nodes are costly but provide significantly more processing power. In this type of sensor network, the low-end nodes are clustered and report to a high-end node, which in turn uses a network backbone to route data to a base station. Initially, we partition the given area into Voronoi clusters; Voronoi diagrams generate polygonal clusters using Euclidean distance. Since sensor network routing is multi-hop, we apply a tabu search to adjust some of the nodes in the Voronoi clusters, so that the clusters then work with hop counts instead of distance. When an event occurs in the network, low-end nodes gather and forward data to cluster heads using the Secure Improved Tree Routing approach. The routing among the low-end nodes, high-end nodes, and the base station is made secure and efficient by applying a 2-way handshaking secure Improved Tree Routing (ITR) technique. The secure ITR data routing procedure improves the energy efficiency of the network by reducing the number of hops needed to reach the base station.
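A minimal sketch of the initial Voronoi-style partitioning described above, in which each low-end node joins the cluster of its nearest high-end node; the coordinates are hypothetical, and the tabu-search hop-count adjustment and the secure ITR handshake are not shown.

```python
import numpy as np

# Assign each low-end node to the nearest high-end node (cluster head) by
# Euclidean distance, which induces a Voronoi partition of the deployment area.
high_end = np.array([[10.0, 10.0], [50.0, 40.0], [80.0, 15.0]])            # cluster heads
low_end = np.array([[12.0, 14.0], [48.0, 35.0], [70.0, 20.0], [32.0, 28.0]])

dists = np.linalg.norm(low_end[:, None, :] - high_end[None, :, :], axis=2)
cluster_of = dists.argmin(axis=1)   # index of the nearest cluster head per node
print(cluster_of)                   # [0 1 2 1]
```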
We gain robustness and energy efficiency by reducing the vulnerability points in the network and by employing alternatives to shortest-path tree routing. In this way, a complete solution is provided for data traveling in a two-tier heterogeneous sensor network by reducing the hop count and making the network secure and energy efficient. Empirical evaluations show how the described algorithm performs with respect to delivery ratio, end-to-end delays, and energy usage.

Item: Improved Genetic Programming Techniques For Data Classification (North Dakota State University, 2014). Al-Madi, Naila Shikri
Evolutionary algorithms are one category of optimization techniques inspired by processes of biological evolution. Evolutionary computation is applied to many domains, and one of the most important is data mining. Data mining is a relatively broad field that deals with automatic knowledge discovery from databases, and it is one of the most developed fields in the area of artificial intelligence. Classification is a data mining method that assigns items in a collection to target classes, with the goal of accurately predicting the target class for each item in the data. Genetic programming (GP) is one of the effective evolutionary computation techniques for solving classification problems. GP solves classification problems as optimization tasks, searching for the best solution with the highest accuracy. However, GP suffers from some weaknesses, such as long execution time and the need to tune many parameters for each problem. Furthermore, GP cannot obtain high accuracy for multiclass classification problems as opposed to binary problems. In this dissertation, we address these drawbacks and propose approaches to overcome them. Adaptive GP variants are proposed in order to automatically adapt the parameter settings and shorten the execution time. Moreover, two approaches are proposed to improve the accuracy of GP when applied to multiclass classification problems. In addition, a segment-based approach is proposed to accelerate GP execution time for the data classification problem. Furthermore, a parallelization of the GP process using the MapReduce methodology is proposed, which aims to shorten the GP execution time and to provide the ability to use large population sizes, leading to faster convergence. The proposed approaches are evaluated using different measures, such as accuracy, execution time, sensitivity, specificity, and statistical tests. Comparisons of the proposed approaches with standard GP and with other classification techniques were performed, and the results showed that these approaches overcome the drawbacks of standard GP by successfully improving the accuracy and execution time.

Item: Incorporating Sliding Window-Based Aggregation for Evaluating Topographic Variables in Geographic Information Systems (North Dakota State University, 2019). Gomes, Rahul
The resolution of spatial data has increased over the past decade, making it more accurate in depicting landform features. From 60 m resolution Landsat imagery to resolutions close to a meter provided by data from Unmanned Aerial Systems, the number of pixels per area has increased drastically. Topographic features derived from high-resolution remote sensing are relevant to measuring agricultural yield. However, conventional algorithms in Geographic Information Systems (GIS) used for processing digital elevation models (DEM) have severe limitations.
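As an illustration of window-based aggregation over a DEM, related to the multi-scale aggregation idea developed in the following paragraphs, the sketch below uses a summed-area table so that the mean elevation of any window is obtained in constant time; this is an assumption-laden stand-in, not the dissertation's exact algorithm.

```python
import numpy as np

# Toy DEM of elevations in meters; after building cumulative sums once, the
# mean over any w-by-w window costs four lookups instead of re-scanning w*w cells.
dem = np.random.default_rng(0).uniform(200, 400, size=(500, 500))
sat = dem.cumsum(axis=0).cumsum(axis=1)      # summed-area table
sat = np.pad(sat, ((1, 0), (1, 0)))          # zero border simplifies indexing

def window_mean(r: int, c: int, w: int) -> float:
    """Mean elevation of the w-by-w window whose top-left cell is (r, c)."""
    total = sat[r + w, c + w] - sat[r, c + w] - sat[r + w, c] + sat[r, c]
    return total / (w * w)

print(window_mean(100, 200, 3))    # small window, conventional scale
print(window_mean(100, 200, 65))   # much larger window at the same per-query cost
```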
Typically, 3-by-3 window sizes are used for evaluating slope, aspect, and curvature. Since this window size is very small relative to the resolution of the DEM, DEMs are mostly resampled to a lower resolution to match the size of typical topographic features and to decrease processing overheads. This results in low accuracy and limits the predictive ability of any model using such DEM data. In this dissertation, landform attributes were derived over multiple scales using the concept of sliding window-based aggregation. Using aggregates from the previous iteration increases the efficiency from linear to logarithmic, thereby addressing scalability issues. The usefulness of DEM-derived topographic features within Random Forest models that predict agricultural yield was examined. The model utilized these derived topographic features and achieved the highest accuracy of 95.31% in predicting Normalized Difference Vegetation Index (NDVI), compared to 51.89% for a 3-by-3 window in the conventional method. The efficacy of partial dependence plots (PDP) in terms of interpretability was also assessed. This aggregation methodology could serve as a suitable replacement for conventional landform evaluation techniques, which mostly rely on reducing the DEM data to a lower resolution prior to data processing.

Item: Knowledge Discovery and Management within Service Centers (North Dakota State University, 2016). Zaman, Nazia
These days, most enterprise service centers deploy Knowledge Discovery and Management (KDM) systems to address the challenge of timely delivery of resourceful service request resolutions while efficiently utilizing the huge amount of available data. These KDM systems facilitate prompt responses to critical service requests and, if possible, try to prevent service requests from being triggered in the first place. Nevertheless, in most cases, the information required for a request resolution is dispersed and buried under a mountain of irrelevant information on the Internet, in unstructured and heterogeneous formats. These heterogeneous data sources and formats complicate access to reusable knowledge and increase the response time required to reach a resolution. Moreover, state-of-the-art methods neither support effective integration of domain knowledge with KDM systems nor promote the assimilation of reusable knowledge or Intellectual Capital (IC). With the goal of providing improved service request resolution within the shortest possible time, this research proposes an IC Management System. The proposed tool efficiently utilizes domain knowledge in the form of semantic web technology to extract the most valuable information from raw unstructured data and uses that knowledge to formulate a service resolution model as a combination of efficient data search, classification, clustering, and recommendation methods. Our proposed solution also handles the technology categorization of a service request, which is crucial in the request resolution process. The system has been extensively evaluated with several experiments and has been used in a real enterprise customer service center.

Item: Mapreduce-Enabled Scalable Nature-Inspired Approaches for Clustering (North Dakota State University, 2014). Aljarah, Ibrahim Mithgal
The increasing volume of data to be analyzed imposes new challenges on data mining methodologies. Traditional data mining methods such as clustering do not scale well with larger data sizes and are computationally expensive in terms of memory and time.
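A generic illustration of expressing one clustering step as map and reduce functions, in the spirit of the MapReduce formulations discussed in the following paragraphs; this is a plain nearest-centroid step with toy data, not the dissertation's particle swarm or glowworm variants.

```python
from collections import defaultdict

def map_point(point, centroids):
    """Map: emit (nearest centroid id, point) for one data point."""
    cid = min(range(len(centroids)),
              key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))
    return cid, point

def reduce_cluster(cid, points):
    """Reduce: recompute the centroid of all points assigned to one cluster."""
    n = len(points)
    return cid, tuple(sum(dim) / n for dim in zip(*points))

data = [(1.0, 1.0), (1.2, 0.8), (8.0, 9.0), (7.5, 9.5)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

groups = defaultdict(list)
for point in data:                                   # map phase
    cid, p = map_point(point, centroids)
    groups[cid].append(p)
new_centroids = [reduce_cluster(cid, pts) for cid, pts in sorted(groups.items())]  # reduce phase
print(new_centroids)   # approximately [(0, (1.1, 0.9)), (1, (7.75, 9.25))]
```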
Clustering large data sets has received attention in the last few years in several application areas, such as document categorization, which is in urgent need of scalable approaches. Swarm intelligence algorithms have self-organizing features, which are used to share knowledge among swarm members to locate the best solution. These algorithms have been successfully applied to clustering; however, they suffer from scalability issues when large data sets are involved. In order to satisfy these needs, new parallel, scalable clustering methods need to be developed. The MapReduce framework has become a popular model for parallelizing data-intensive applications due to features such as fault tolerance, scalability, and usability. However, the challenge is to formulate the tasks as map and reduce functions. This dissertation first presents a scalable particle swarm optimization clustering algorithm (MR-CPSO) that is based on the MapReduce framework. Experimental results reveal that the proposed algorithm scales very well with increasing data set sizes while maintaining good clustering quality. Moreover, a parallel intrusion detection system using MR-CPSO is introduced; this system has been tested on a real large-scale intrusion data set to confirm its scalability and detection quality. In addition, the MapReduce framework is utilized to implement a parallel glowworm swarm optimization (MR-GSO) algorithm to optimize difficult multimodal functions, and the experiments demonstrate that MR-GSO can achieve high function peak capture rates. Moreover, this dissertation presents a new clustering algorithm based on GSO (CGSO). CGSO takes into account the multimodal search capability to locate optimal centroids in order to enhance clustering quality without the need to provide the number of clusters in advance. The experimental results demonstrate that CGSO outperforms other well-known clustering algorithms. In addition, a MapReduce GSO clustering (MRCGSO) algorithm version is introduced to evaluate the algorithm's scalability with large-scale data sets. MRCGSO achieves good speedup and utilization when more computing nodes are used.

Item: Measurement of Non-Technical Skills of Software Development Teams (North Dakota State University, 2014). Bender, Lisa Louise
Software development managers recognize that project team dynamics are a key component of the success of any project. Managers can have a project with well-defined goals, an adequate schedule, technically skilled people, and all the necessary tools, but if the project team members cannot communicate and collaborate effectively with each other and with end users, then project success is at risk. Common problems with non-technical skills include dysfunctional communication, negative attitudes, uncooperativeness, mistrust, avoidance, and ineffective negotiations between team members and users. Such problems must be identified and addressed to improve individual and team performance. There are tools available that assist in measuring the effectiveness of the technical skills and processes that teams use to execute projects, but there are no proven tools to effectively measure the non-technical skills of software developers. Other industries (e.g., airline and medical) are also finding that teamwork issues are related to non-technical skills as well as to a lack of technical expertise. These industries are beginning to use behavioral marker systems to structure individual and team assessments.
Behavioral markers are observable behaviors that impact individual or team performance. This dissertation explores and develops a behavioral marker system tool, adapted from models in other industries, to assist managers in assessing the non-technical skills of project team individuals within groups. An empirical study was also conducted to validate the tool, and its report is included in this study. We also developed and report upon empirical work that assesses how social sensitivity (a non-technical skill) impacts team performance. There are four components to this work: developing a useful non-technical skills taxonomy; developing a behavioral marker system for software developers based on that taxonomy; validating the software developer behavioral marker system; and investigating specifically the effect of social sensitivity on team performance. The evaluation is based on data collected from experiments. The overall goal of this work is to provide software development team managers with a methodology to evaluate and provide feedback on the non-technical skills of software developers, and to investigate whether a particular non-technical skill can positively affect team performance.

Item: Metrics and Tools to Guide Design of Graphical User Interfaces (North Dakota State University, 2014). Alemerien, Khalid Ali
User interface design metrics help developers evaluate interface designs in early phases, before delivering the software to end users. This dissertation presents a metric-based tool called GUIEvaluator for evaluating the complexity of the user interface based on its structure. The metrics model consists of five modified structural measures of interface complexity: alignment, grouping, size, density, and balance. The results of GUIEvaluator are discussed in comparison with subjective evaluations of interface layouts and existing complexity metrics models. To extend this metrics model, the Screen-Layout Cohesion (SLC) metric has been proposed for evaluating the usability of user interfaces. The SLC metric has been developed based on aesthetic, structural, and semantic aspects of GUIs. To provide the SLC calculation, a complementary tool called GUIExaminer has been developed. This dissertation demonstrates the potential of incorporating automated complexity and cohesion metrics into the user interface design process. The findings show a strong positive correlation between the subjective evaluations and both GUIEvaluator and GUIExaminer, at a significance level of 0.05. Moreover, the findings provide evidence of the effectiveness of GUIEvaluator and GUIExaminer in predicting the best user interface design among a set of alternatives, and show that the tools can measure some usability aspects of a given user interface. The metrics validation further demonstrates the usefulness of GUIEvaluator and GUIExaminer for evaluating user interface designs.

Item: Mining for Significant Information from Unstructured and Structured Biological Data and Its Applications (North Dakota State University, 2012). Al-Azzam, Omar Ghazi
Massive amounts of biological data are being accumulated in science. Searching for significant, meaningful information and patterns in different types of data is necessary for gaining knowledge from these large amounts of data available to users. However, data mining techniques do not normally deal with significance.
Integrating data mining techniques with standard statistical procedures provides a way to mine statistically significant, interesting information from both structured and unstructured data. In this dissertation, different algorithms for mining significant biological information from both unstructured and structured data are proposed. A weighted-density-based approach is presented for mining item data from unstructured textual representations. Different algorithms in the area of radiation hybrid mapping are developed for mining significant information from structured binary data. The proposed algorithms have applications in the ordering problem in radiation hybrid mapping, including identifying unreliable markers and building solid framework maps. The effectiveness of the proposed algorithms in improving map stability is demonstrated, with map stability determined based on resampling analysis. The proposed algorithms deal effectively and efficiently with multidimensional data and also reduce computational cost dramatically. Evaluation shows that the proposed algorithms outperform comparative methods in terms of both accuracy and computation cost.
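As a small illustration of attaching statistical significance to a mined pattern, the sketch below applies a chi-square test to a toy contingency table of pattern and class co-occurrence; this is a generic example, not the dissertation's weighted-density-based method.

```python
from scipy.stats import chi2_contingency

# Toy counts: does the mined pattern co-occur with class A more often than chance?
#                 class A   class B
table = [[30, 5],           # pattern present
         [20, 45]]          # pattern absent

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")  # a small p marks the pattern as statistically significant
```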