Computer Science
Recent Submissions
Item Open Access AN EMPIRICAL EXPLORATION OF ARTIFICIAL INTELLIGENCE FOR SOFTWARE DEFECT PREDICTION IN SOFTWARE ENGINEERING (East Carolina University, July 2024) Cahill, Elaine; Madhusudan Srinivasan; Nic Herndon; Rui Wu
Artificial Intelligence (AI) is an important topic in software engineering, not only for data analysis and pattern recognition but also for the opportunity to find solutions to problems that may not have explicit rules or instructions. Reliable prediction methods are needed because we cannot prove that there are no defects in software. Deep learning and machine learning have been applied to software defect prediction in an attempt to generate valid software engineering practices since at least 1971. Avoiding safety-critical or expensive system failures can save lives and reduce the economic burden of maintaining systems by preventing failures in systems such as aviation software, medical devices, and autonomous vehicles. This thesis contributes to the field of software defect prediction by empirically evaluating the performance of various machine learning models, including Logistic Regression, Random Forest, Support Vector Machine, and a stacking classifier combining these models. The findings highlight the importance of model selection and feature engineering in achieving accurate predictions. We followed this with a stacking classifier that combines Logistic Regression, Random Forest, and Support Vector Machine (SVM) to see whether it improved predictive performance. We compared our results with previous work and analyzed which features or attributes appeared to be effective in predicting defects. We end by discussing potential next steps for further research based on our results.

Item Open Access Time Series Forecasting Using Generative Adversarial Networks (East Carolina University, 2023-05-04) Mamua, Sharon Sone; Wu, Rui; Computer Science
Time series data is prevalent in many fields, such as finance, weather forecasting, and economics.
Predicting future values of a time series can provide valuable insights for decision-making, such as identifying trends, detecting anomalies, and improving resource allocation. Recently, Generative Adversarial Networks (GANs) have been used to learn from these features to aid in time-series forecasting. We propose a novel framework that utilizes the unsupervised paradigm of a GAN based on related research called TimeGAN. Instead of using the discriminator as a classification model, we employ it as a regression model to learn both temporal and static features. This framework can help generate synthetic data and facilitate forecasting. Our model outperforms TimeGAN, which only preserves temporal dynamics and uses the discriminator as a classifier to distinguish between synthetic and real datasets.

Item Open Access QPE: A System For Deconstructing SQL Queries (East Carolina University, 2023-04-26) Bullard, Connor D; Gudivada, Venkat N; Computer Science
Research on converting natural language to machine-readable code has attracted great interest over the last decade; however, studies on converting machine-readable code into natural language are sparse. The applications of translating spoken or written languages into code are well-established, such as allowing a novice or non-technical user to interact with a program or database with ease. The benefits of such applications are readily observable and are likely to grow as software systems continue to increase in complexity and capability. Likewise, translating code into natural language produces benefits whose potential gain in utility and knowledge has yet to be fully realized. This thesis identifies opportunities for deploying solutions that provide a natural language explanation of programming languages, specifically with Structured Query Language (SQL) and database interfacing.
A novel solution is proposed in the form of an application named Query Purpose Extractor (QPE), which utilizes existing open-source libraries to aid in the process of translating SQL statements into English sentences.

Item Open Access Warfarin Sensitivity is Associated with Increased Hospital Mortality in Critically Ill Patients (2022-05-05) Wang, Ping; et al

Item Open Access Performance analysis of machine learning algorithms to predict mobile applications' star ratings via its user interface features (East Carolina University, 2022-05-05) Navaei, Maryam; Tabrizi, M. H. N; Computer Science
The first part of this thesis concludes with an overall summary of the publications so far on Machine Learning techniques applied in different phases of the Software Development Life Cycle, including Requirements Analysis, Design, Implementation, Testing, and Maintenance. We performed a systematic review of the research studies published from 2015 to 2021 and found that the Software Requirements Analysis phase has the fewest papers published; in contrast, Software Testing is the phase with the most papers published. The second part of this thesis compares multiple Machine Learning algorithms for predicting mobile application star ratings from their user interface features. User interface features offer a great source of information that can be utilized by various Machine Learning algorithms to generate this prediction. To do so, we developed and selected multiple user interface features extracted from the largest publicly available mobile user interface design dataset, the RICO repository. We initially applied the Machine Learning algorithms to a subset of RICO and then compared our results against the actual dataset using the same algorithms. Furthermore, we calculated Accuracy, Recall, and Precision for each algorithm before and after cross-validation, and presented our results in various charts.
The final results demonstrate that our methodology can predict the star rating of Android mobile applications using the features we extracted from the RICO dataset.

Item Open Access ADVANCED DRIVER ASSISTANCE SYSTEM CAR FOLLOWING MODEL OPTIMIZATION FRAMEWORK USING GENETIC ALGORITHM IMPLEMENTED IN SUMO TRAFFIC SIMULATION (East Carolina University, 2022-04-25) Carroll, Matthew; Wu, Rui; Computer Science
As advanced driver-assistance systems (ADAS) such as smart cruise control and lane keeping have become common technologies, and self-driving systems above SAE level 3 are being competitively developed by major automobile manufacturers, autonomous vehicles (AVs) will prevail in the traffic network of the near future. In particular, evasive action algorithms with sensor-based collision detection and faster braking response will enable AVs to drive with a shorter gap at higher speeds, which has not been possible with human drivers. Such technologies will be able to improve current traffic performance as long as rising safety concerns are addressed. Therefore, there have been efforts to improve understanding between stakeholders such as regulatory authorities and developers to reach a consensus on autonomous driving standards and regulations. Meanwhile, a mixed traffic network with human-driven vehicles and AVs will show transient system behavior that depends on the penetration rate of AVs, thereby requiring different optimal AV settings. We are interested in understanding this system behavior over the transitional period to achieve optimal traffic performance with safety as a hard constraint. We investigate the system behavior with agent-based simulation at different penetration rates by mixing human-driven and AV vehicle models, identify the key parameters of ADAS algorithms for traffic flow, and find the optimal parameter set per penetration rate using a genetic algorithm (GA).
Simulation results with optimal parameter values reveal improvement in average traffic performance measures such as flow (5.6% increase), speed (4.9% increase), density (15.9% decrease), and waiting time (48.2% decrease). We provide simulation examples and discuss the implications of the optimal parameter values for both traffic control authorities and AV developers during the transitional period.

Item Open Access Analysis of the Impact of Tags on Stack Overflow Questions (East Carolina University, 2022-04-26) Ithipathachai, Von; Hills, Mark; Computer Science
User queries on Stack Overflow commonly suffer from either inadequate length or inadequate clarity with regard to the languages and/or tools they are meant for. Although the site makes use of a tagging system for classifying questions, tags are used minimally (if at all). To investigate the impact of tags on the quality of results returned by the queries, in this research we propose a new query expansion solution. Our technique assigns tags to queries based on how well they match the queries' topics. We evaluated our technique on eight sets of queries categorized by overall length and programming language. We examined the retrieval results by adding varying numbers of tags to the queries, and monitored the recall and precision rates. Our results indicate that queries yield considerably higher recall and precision rates with extra tags than without.
We further conclude that tags are a particularly effective means of enhancement when the original queries do not already return sufficient results.

Item Open Access Using n-Grams to Identify Time Periods of Cultural Influence (2016-11-03) Tabrizi, Nasseh

Item Open Access Customer Reviews Analysis with Deep Neural Networks for E-Commerce Recommender Systems (2019-01-01) Tabrizi, Nasseh

Item Open Access IMPLEMENTATION OF BERT BASED MACHINE LEARNING MODEL TO EXTRACT CANCER –MIRNA RELATIONSHIP FROM RESEARCH LITERATURE (East Carolina University, 2021-04-15) Sundharam, Arunprasad; Ding, Qin; Computer Science
Text mining is a widely popular and growing branch of information technology in which we extract useful information from a given pile of text data. There are thousands of research papers in medical science pertaining to the study of how microRNAs (miRNAs) can assist or impede the development of various types of cancers. miRCancer is a repository that provides the details of this cancer-miRNA association by analyzing 6500+ research papers using text mining techniques. It would be helpful to create a machine learning model that can analyze the title and abstract of a research paper and extract the cancer-miRNA association details if they are available in the given text. In this thesis work, we propose a solution for creating a machine learning model using the open-source NLP framework BERT, provided by Google, which can identify the cancer-miRNA relationship in the given abstract text. BERT is a deep learning model that is pretrained on the Wikipedia text corpus and has built-in knowledge of English language usage. As part of this work, we designed and implemented a machine learning model using the BERT framework and prepared the dataset required to train the model to identify cancer-miRNA relationships in the given text.
The machine learning model developed in this thesis work achieved an overall accuracy of 90.3% in retrieving the required information from the research papers of the test dataset, and hence it can be leveraged to review the results of the existing miRCancer text mining implementation.

Item Restricted Tactile Demographics: Predicting Demographic Information Using Touch Data from Mobile Devices (East Carolina University, 2021-04-20) Williams, Baylea; Tabrizi, M. H. N; Computer Science
The research conducted in this thesis serves as a baseline for determining which human demographics are most likely to be predictable through touch screen interactions. In addition, it served as a way of finding which machine learning models are best suited to a larger-scale experiment on this phenomenon. We were able to reliably predict both the age and race of participants, and we show that the best machine learning models used were Random Forest Decision Trees and Naïve Bayes, which produced higher classification accuracy than the other classifiers tested. While the sample size used during this study was small, due to the ongoing Covid-19 pandemic, the results of this study indicate that research in this area is worthy of significant exploration.

Item Open Access Auto-Count Symbols in Portable Document Format (PDF) (2021-04-22) Florian, Andrew; Hills, Mark; Computer Science
Estimating electrical costs often involves counting symbols in a PDF document. Existing software has sped up this process compared to manual counting, but there is room for further improvement. The proposed solution builds on open source components to efficiently search a PDF document for the outlines of all symbols, including letters or numbers, used by electrical engineers to differentiate between otherwise similar symbols. It then sorts these outlines into groups and counts each occurrence. Symbol for symbol, it takes less than half the time required by two leading competitors.
Unfortunately, current settings often produce numerous sub-groups which need to be combined to provide meaningful totals. K-means and other improved clustering methods are being explored. The proposed concept could also be helpful in other similar applications that identify symbols or text in images.

Item Open Access Developing Concept Enriched Models for Big Data Processing Within the Medical Domain (2020-07) Gudivada, Akhil; Tabrizi, Nasseh; Philips, James
Within the past few years, the medical domain has endeavored to incorporate artificial intelligence, including cognitive computing tools, to develop enriched models for processing and synthesizing knowledge from Big Data. Due to the rapid growth in published medical research, the ability of medical practitioners to keep up with research developments has become a persistent challenge. Despite this challenge, using data-driven artificial intelligence to process large amounts of data can overcome this difficulty. This research summarizes cognitive computing methodologies and applications utilized in the medical domain. Likewise, this research describes the development process for a novel, concept-enriched model using the IBM Watson service and a publicly available diabetes dataset and knowledge-base. Finally, reflection is offered on the strengths and limitations of the model and enhancements for future experiments. This work thus provides an initial framework for those interested in effectively developing, maintaining, and using cognitive models to enhance the quality of healthcare.

Item Open Access A FRAMEWORK FOR AUTOMATICALLY GENERATING QUESTIONS FOR TOPICS IN DISCRETE MATHEMATICS (East Carolina University, 2020-11-17) Houshvand, Salar; Gudivada, Venkat N; Computer Science
Automated question generation is critical for realizing personalized learning. Also, learning research shows that answering questions is a more effective method than rereading the textbook multiple times.
However, creating different types of questions is intellectually challenging and time-intensive. This creates a need for an automated way to generate questions and evaluate them. In this research, after analyzing the existing approaches to automated question generation, we conclude that most current systems use natural language processing techniques to extract questions from text; as a result, other topics such as mathematics lack an automated question generation system that could help learners assess their knowledge. In this research we present a novel framework that automatically generates unlimited numbers of questions for different topics in discrete mathematics. We created multiple algorithms for various questions in four main topics using Python. Our final product is presented as an application programming interface (API) using the Flask library, which makes it easy to access and use this system in any future developments. Finally, we discuss the potential extensions that can be added to our framework as future contributions. The repository for this framework is freely available at https://github.com/SalarHoushvand/discrete-math-restfulAPI.

Item Open Access Software Engineering for Real-Time NoSQL Systems-centric Big Data Analytics (East Carolina University, 2020-12-02) Clark, William F; Gudivada, Venkat N; Computer Science
Recent advances in Big Data Analytics (BDA) have stimulated widespread interest in integrating BDA capabilities into all aspects of a business. Before these advances, companies spent time optimizing the software development process and best practices associated with application development. These processes include project management structures and how to deliver new features of an application to its customers efficiently. While these processes are significant for application development, they cannot be utilized effectively for the software development of Big Data Analytics.
Instead, some practices and technologies enable automation and monitoring across the full lifecycle of productivity from design to deployment and operations of Analytics. This paper builds on those practices and technologies and introduces a highly scalable framework for Big Data Analytics development operations. This framework builds on top of the best-known processes associated with DevOps. These best practices are then shown using a NoSQL cloud-based platform that consumes and processes structured and unstructured real-time data. As a result, the framework produces scalable, timely, and accurate analytics in real-time, which can be easily adjusted or enhanced to meet the needs of a business and its customers.

Item Open Access Studies on Gopala-Hemachandra Codes and their Applications (East Carolina University, 2020-11-16) Childers, Logan; Gopalakrishnan, Krishnan; Computer Science
Gopala-Hemachandra codes are a variation of the Fibonacci universal code and have applications in data compression and cryptography. We study a specific parameterization of Gopala-Hemachandra codes and present several results pertaining to these codes. We show that GH_{a}(n) always exists for any n >= 1, when -2 >= a >= -4, meaning that these are universal codes. We develop two new algorithms to determine whether a GH code exists for a given a and n, and to construct them if they exist. We also prove that when a = -(4+k), where k >= 1, there are at most k consecutive integers for which GH codes do not exist. In 2014, Nalli and Ozyilmaz proposed a stream cipher based on GH codes.
We show that this cipher is insecure and provide experimental results on the performance of our program that cracks this cipher.

Item Open Access Applied Machine Learning for Cybersecurity in Spam Filtering and Malware Detection (East Carolina University, 2020-11-17) Sokolov, Mark; Herndon, Nic; Computer Science
Machine learning is one of the fastest-growing fields and its application to cybersecurity is increasing. To protect people from malicious attacks, several machine learning algorithms have been used to predict such attacks. This research emphasizes two vulnerable areas of cybersecurity that could be easily exploited. First, we show that spam filtering is a well-known problem that has been addressed by many authors, yet it still has vulnerabilities. Second, with the increase of malware threats, many companies use AutoAI to help protect their systems. Nonetheless, AutoAI is not perfect, and data scientists can still design better models. In this thesis I show that although there are efficient mechanisms to prevent malicious attacks, there are still vulnerabilities that could be easily exploited. In the visual spoofing experiment, we show that using a classifier trained on data in the Latin alphabet to classify a message that combines Latin and Cyrillic letters leads to much lower classification accuracy. In the malware prediction experiment, our model was able to predict malware attacks on Microsoft computers and achieved higher accuracy than any well-known AutoAI.

Item Open Access An Empirical Exploration of Python Machine Learning API Usage (East Carolina University, 2020-11-16) Vilkomir, Aleksei; Hills, Mark; Computer Science
Machine learning is becoming an increasingly important part of many domains, both inside and outside of computer science.
With this has come an increase in developers learning to write machine learning applications in languages like Python, using application programming interfaces (APIs) such as pandas and scikit-learn. However, given the complexity of these APIs, they can be challenging to learn, especially for new programmers. To create better tools for assisting developers with machine learning APIs, we need to understand how these APIs are currently used. In this thesis, we present a study of machine learning API usage in Python code in a corpus of machine learning projects hosted on Kaggle, a machine learning education and competition community site. We analyzed the most frequently used machine learning related libraries and the sub-modules of those libraries. Next, we studied the usage of different calls used by the developers to solve machine learning tasks. We also found information about which libraries are used in combination and discovered a number of cases where the libraries were imported but never used. We end by discussing potential next steps for further research and development based on our results.

Item Restricted A FRAMEWORK FOR QUESTION ANSWERING SYSTEM USING DYNAMIC CO-ATTENTION NETWORKS (East Carolina University, 2020-06-22) Busireddy, Swetha; Gudivada, Venkat N
Question answering (QA) systems have evolved exponentially over the past few years and have reached a reliable human standard. Attention mechanisms, as well as other methods of deep learning, paved the way for this development. However, because of their single-pass nature, they are incapable of recovering from local maxima corresponding to incorrect answers. The dynamic coattention network (DCN) addresses this issue, but because it has only one layer, its ability to compose diverse input representations is limited. We propose a few modifications to the DCN to overcome these limitations. First, we used a bidirectional long short-term memory network (biLSTM) to encode the question and document.
Next, we applied the concept of self-attention to the DCN by using multiple coattention layers. This helps the encoder generate richer input representations. Lastly, we combined the outputs from these layers, which improves the modeling of long-range dependencies. We built a question answering system based on this multi-attention DCN and tested it on one of our course documents. On the Stanford Question Answering Dataset (SQuAD), this system raises the mean F1 score on the validation set to 79.9%, up from the previous state of the art of 75.6%.

Item Open Access ENVIRONMENTAL MODEL ACCURACY IMPROVEMENT FRAMEWORK USING STATISTICAL TECHNIQUES AND A NOVEL TRAINING APPROACH (East Carolina University, 2020-06-22) Matta, Rekesh; Wu, Rui
It is challenging to predict environmental behaviors because of extreme events, such as heatwaves, typhoons, droughts, tsunamis, torrential downpours, wind ramps, or hurricanes. In this thesis, we propose a novel framework to improve environmental model accuracy with a novel training approach. Extreme event detection algorithms are surveyed, selected, and applied in our proposed framework. The application of statistics in extreme event detection is quite diverse and leads to diverse formulations, which need to be designed for a specific problem. Each formula needs to be tailored specifically to work with the available data in the given situation. This diversity is one of the driving forces of this research towards identifying the most common mixture of components utilized in the analysis of extreme event detection. Besides the extreme event detection algorithm, we also integrated the sliding window approach to see how well our models predict future events. To test the proposed framework, we collected coastal data from various sources; using our approach, we improved the predictive accuracy of various machine learning models, with a 20% to 25% increase in R2 value.
In addition, we organized the discussion around different extreme event detection types, presented a few outlier definitions, and briefly introduced their techniques. We also summarized the statistical methods involved in the detection of environmental extremes, such as wind ramps and climatic events.
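
The sliding window evaluation and the R2 comparison mentioned in the last abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the toy series, and the two baseline forecasters (`last_value`, `linear_extrapolation`) are hypothetical and chosen only to show how each window of past values is paired with the value that follows it and how R2 is computed from the resulting predictions.

```python
from typing import Callable, List, Sequence, Tuple


def sliding_windows(series: Sequence[float], width: int) -> List[Tuple[List[float], float]]:
    """Pair each window of `width` past values with the value that follows it."""
    return [
        (list(series[i:i + width]), series[i + width])
        for i in range(len(series) - width)
    ]


def r2_score(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        # R2 is undefined when every target value is identical.
        raise ValueError("R2 is undefined for constant targets")
    return 1.0 - ss_res / ss_tot


def evaluate(series: Sequence[float], width: int,
             model: Callable[[List[float]], float]) -> float:
    """Score a one-step-ahead forecaster over all sliding windows."""
    pairs = sliding_windows(series, width)
    actual = [target for _, target in pairs]
    predicted = [model(window) for window, _ in pairs]
    return r2_score(actual, predicted)


def last_value(window: List[float]) -> float:
    """Hypothetical baseline: repeat the most recent observation."""
    return window[-1]


def linear_extrapolation(window: List[float]) -> float:
    """Hypothetical baseline: continue the trend of the last two points."""
    return 2 * window[-1] - window[-2]


if __name__ == "__main__":
    toy_series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # illustrative data only
    print(evaluate(toy_series, 2, last_value))
    print(evaluate(toy_series, 2, linear_extrapolation))
```

Comparing the R2 of two forecasters on the same windows, as the last two lines do, mirrors the kind of before-and-after comparison the abstract reports, although the models and data there are different.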