Computer Science
Permanent URI for this collection: http://hdl.handle.net/10342/42
Recent Submissions
Item Open Access
METAMORPHIC TESTING PRIORITIZATION FOR FAIRNESS EVALUATION IN LARGE LANGUAGE MODELS (East Carolina University, December 2024) Giramata, Suavis
Detecting fairness-related faults in large language models (LLMs) is challenging because of the oracle problem: it is difficult to define correct outputs for all scenarios. This research applies metamorphic testing (MT) as a solution, focusing on the prioritization of metamorphic relations (MRs) based on their diversity scores to maximize fault detection efficiency. The study hypothesizes that MRs with high diversity scores, which indicate significant differences between source and follow-up test cases, are more likely to reveal faults related to fairness and bias in LLMs. To test this, several diversity metrics, including cosine similarity, sentiment analysis, and named entity recognition, are used to quantify differences between test cases. The proposed approach is evaluated on two popular LLMs, GPT and LLaMA, and compared against random, fault-based, and distance-based MR ordering strategies. The results indicate that prioritizing high-diversity MRs significantly improves fault detection speed and effectiveness, particularly for identifying biases across sensitive attributes. Specifically, our proposed Total Diversity Score-based approach shows a 91.6% improvement in fault detection over the Random-Based approach at the first MR, gradually reducing to 21.05% by the fifth MR. Additionally, compared to the Distance-Based method, our approach achieves an initial 130% improvement in fault detection rate, decreasing to 1.61% by the ninth MR before performance levels stabilize.
Notably, our approach also performs close to Fault-Based prioritization, offering a balanced and effective method for uncovering faults efficiently.

Item Open Access
ANALYZING STYLE TRANSFER ALGORITHMS FOR SEGMENTED IMAGES (East Carolina University, December 2024) Seyed, Seyedhadi
The recently developed Segment Anything Model has made it easier than before to extract semantically meaningful regions of an image. This enables new applications that were not previously possible. This thesis investigates integrating the Segment Anything Model with style transfer. Specifically, it proposes Partial Convolution as a way to improve style transfer for segmented regions. Additionally, it investigates how different style transfer techniques are affected by factors such as mask size and image statistics.

Item Open Access
Towards Automated Garment Measurements In the Wild Using Landmark and Depth Estimation (East Carolina University, December 2024) Zbavitel, Cris Ian
This research introduces an approach to automating garment measurements from photos, combining depth estimation and landmark detection to address the high return rates in the fashion industry caused by inaccurate sizing. Using the DeepFashion2 dataset and a custom set of images, we employ DepthAnything for depth estimation and Keypoint R-CNN for landmark estimation, advancing previous methodologies by offering a scalable and accurate solution for the fashion industry. Initial findings suggest promising avenues for reducing returns and improving garment fitting processes.

Item Open Access
AN EMPIRICAL EXPLORATION OF ARTIFICIAL INTELLIGENCE FOR SOFTWARE DEFECT PREDICTION IN SOFTWARE ENGINEERING (East Carolina University, July 2024) Cahill, Elaine
Artificial Intelligence (AI) is an important topic in software engineering, not only for data analysis and pattern recognition but also for the opportunity to find solutions to problems that may not have explicit rules or instructions.
Reliable prediction methods are needed because we cannot prove that software is free of defects. Machine learning and deep learning have been applied to software defect prediction since at least 1971 in an attempt to establish valid software engineering practices. Preventing failures in safety-critical or expensive systems, such as aviation software, medical devices, and autonomous vehicles, can save lives and reduce the economic burden of maintaining those systems. This thesis contributes to the field of software defect prediction by empirically evaluating the performance of various machine learning models, including Logistic Regression, Random Forest, Support Vector Machine (SVM), and a stacking classifier combining these models. The findings highlight the importance of model selection and feature engineering in achieving accurate predictions. We followed the individual models with a stacking classifier that combines Logistic Regression, Random Forest, and SVM to see whether it improved predictive performance. We compared our results with previous work and analyzed which features or attributes appeared to be effective in predicting defects. We end by discussing potential next steps for further research based on our results.

Item Open Access
Time Series Forecasting Using Generative Adversarial Networks (East Carolina University, 2023-05-04) Mamua, Sharon Sone
Time series data is prevalent in many fields, such as finance, weather forecasting, and economics. Predicting future values of a time series can provide valuable insights for decision-making, such as identifying trends, detecting anomalies, and improving resource allocation. Recently, Generative Adversarial Networks (GANs) have been used to learn from time series features to aid in forecasting. We propose a novel framework that builds on the unsupervised GAN paradigm of a related model, TimeGAN.
Instead of using the discriminator as a classification model, we employ it as a regression model to learn both temporal and static features. This framework can help generate synthetic data and facilitate forecasting. Our model outperforms TimeGAN, which only preserves temporal dynamics and uses the discriminator as a classifier to distinguish between synthetic and real datasets.

Item Open Access
QPE: A System For Deconstructing SQL Queries (East Carolina University, 2023-04-26) Bullard, Connor D
Research on converting natural language to machine-readable code has attracted great interest over the last decade; however, studies on converting machine-readable code into natural language are sparse. The applications of translating spoken or written languages into code are well established, such as allowing a novice or non-technical user to interact with a program or database with ease. The benefits of such applications are readily observable and are likely to grow as software systems continue to increase in complexity and capability. Likewise, translating code into natural language offers benefits whose potential gain in utility and knowledge has yet to be fully realized. This thesis identifies opportunities for deploying solutions that provide natural language explanations of programming languages, specifically for Structured Query Language (SQL) and database interfacing.
A novel solution is proposed in the form of an application named Query Purpose Extractor (QPE), which uses existing open-source libraries to aid in translating SQL statements into English sentences.

Item Open Access
Warfarin Sensitivity is Associated with Increased Hospital Mortality in Critically Ill Patients (2022-05-05) Wang, Ping; et al

Item Open Access
Performance analysis of machine learning algorithms to predict mobile applications' star ratings via its user interface features (East Carolina University, 2022-05-05) Navaei, Maryam
The first part of this thesis summarizes the publications to date on Machine Learning techniques applied in the different phases of the Software Development Life Cycle, including Requirements Analysis, Design, Implementation, Testing, and Maintenance. We performed a systematic review of the research studies published from 2015 to 2021 and found that the Software Requirements Analysis phase has the fewest published papers, whereas Software Testing has the most. The second part of this thesis compares multiple Machine Learning algorithms for predicting mobile application star ratings from their user interface features. User interface features are a rich source of information that various Machine Learning algorithms can use to generate this prediction. To do so, we developed and selected multiple user interface features extracted from the RICO repository, the largest publicly available mobile user interface design dataset. We initially applied the Machine Learning algorithms to a subset of RICO and then compared our results against the actual dataset using the same algorithms. Furthermore, we calculated Accuracy, Recall, and Precision for each algorithm before and after cross-validation, and presented our results in various charts.
The final results demonstrate that our methodology can predict the star rating of Android mobile applications using the features we extracted from the RICO dataset.

Item Open Access
ADVANCED DRIVER ASSISTANCE SYSTEM CAR FOLLOWING MODEL OPTIMIZATION FRAMEWORK USING GENETIC ALGORITHM IMPLEMENTED IN SUMO TRAFFIC SIMULATION (East Carolina University, 2022-04-25) Carroll, Matthew
As advanced driver-assistance systems (ADAS) such as smart cruise control and lane keeping have become common technologies, and as major automobile manufacturers compete to develop self-driving capability above SAE Level 3, autonomous vehicles (AVs) will prevail in the traffic network of the near future. In particular, evasive action algorithms with sensor-based collision detection and faster braking response will enable AVs to drive with shorter gaps at higher speeds, which has not been possible with human drivers. Such technologies will be able to improve current traffic performance as long as rising safety concerns are addressed. Therefore, there have been efforts to improve understanding between stakeholders such as regulatory authorities and developers in order to reach a consensus on autonomous driving standards and regulations. Meanwhile, a mixed traffic network of human-driven vehicles and AVs will show transient system behavior that depends on the penetration rate of AVs, thereby requiring different optimal AV settings. We are interested in understanding this system behavior over the transitional period in order to achieve optimal traffic performance with safety as a hard constraint. We investigate the system behavior with agent-based simulation at different penetration rates by mixing human-driven and AV vehicle models, identify the key parameters of ADAS algorithms for traffic flow, and find the optimal parameter set for each penetration rate using a genetic algorithm (GA).
Simulation results with optimal parameter values reveal improvements in average traffic performance measures such as flow (5.6% increase), speed (4.9% increase), density (15.9% decrease), and waiting time (48.2% decrease). We provide simulation examples and discuss the implications of the optimal parameter values for both traffic control authorities and AV developers during the transitional period.

Item Open Access
Analysis of the Impact of Tags on Stack Overflow Questions (East Carolina University, 2022-04-26) Ithipathachai, Von
User queries on Stack Overflow commonly suffer from either inadequate length or inadequate clarity with regard to the languages and/or tools they are meant for. Although the site uses a tagging system for classifying questions, tags are used minimally (if at all). To investigate the impact of tags on the quality of the results returned by queries, we propose a new query expansion solution. Our technique assigns tags to queries based on how well they match the queries' topics. We evaluated our technique on eight sets of queries categorized by overall length and programming language. We examined the retrieval results after adding varying numbers of tags to the queries and monitored the recall and precision rates. Our results indicate that queries yield considerably higher recall and precision rates with extra tags than without.
We further conclude that tags are a particularly effective means of enhancement when the original queries do not return sufficient results to begin with.

Item Open Access
Using n-Grams to Identify Time Periods of Cultural Influence (2016-11-03) Tabrizi, Nasseh

Item Open Access
Customer Reviews Analysis with Deep Neural Networks for E-Commerce Recommender Systems (2019-01-01) Tabrizi, Nasseh

Item Open Access
IMPLEMENTATION OF BERT BASED MACHINE LEARNING MODEL TO EXTRACT CANCER –MIRNA RELATIONSHIP FROM RESEARCH LITERATURE (East Carolina University, 2021-04-15) Sundharam, Arunprasad
Text mining is a widely popular and growing branch of information technology in which useful information is extracted from a given body of text data. There are thousands of research papers in medical science on how microRNAs (miRNAs) can assist or impede the development of various types of cancer. miRCancer is a repository that provides the details of this cancer-miRNA association by analyzing 6500+ research papers using text mining techniques. It would be helpful to create a machine learning model that can analyze the title and abstract of a research paper and extract the cancer-miRNA association details if they are available in the given text. In this thesis, we propose a solution for creating a machine learning model using BERT, the open-source NLP framework provided by Google, that can identify the cancer-miRNA relationship in a given abstract. BERT is a deep learning model that is pretrained on a Wikipedia text corpus and has built-in knowledge of English language usage. As part of this work, we designed and implemented a machine learning model using the BERT framework and prepared the dataset required to train the model to identify cancer-miRNA relationships in the given text.
The machine learning model developed in this thesis achieved an overall accuracy of 90.3% in retrieving the required information from the research papers of the test dataset, and hence it can be used to review the results of the existing miRCancer text mining implementation.

Item Restricted
Tactile Demographics: Predicting Demographic Information Using Touch Data from Mobile Devices (East Carolina University, 2021-04-20) Williams, Baylea
The research conducted in this thesis serves as a baseline for determining which human demographics are most likely to be predictable through touch screen interactions. In addition, it served as a way of finding which machine learning models are best suited to a larger-scale experiment on this phenomenon. We were able to reliably predict both the age and race of participants, and we show that the best-performing machine learning models were Random Forest Decision Trees and Naïve Bayes, which achieved higher classification accuracy than the other classifiers tested. While the sample size used in this study was small because of the ongoing Covid-19 pandemic, the results indicate that research in this area is worthy of significant exploration.

Item Open Access
Auto-Count Symbols in Portable Document Format (PDF) (2021-04-22) Florian, Andrew
Estimating electrical costs often involves counting symbols in a PDF document. Existing software has sped up this process compared to manual counting, but there is room for further improvement. The proposed solution builds on open-source components to efficiently search a PDF document for the outlines of all symbols, including the letters or numbers that electrical engineers use to differentiate between otherwise similar symbols. It then sorts these outlines into groups and counts each occurrence. Symbol for symbol, it takes less than half the time required by two leading competitors.
Unfortunately, current settings often produce numerous sub-groups that need to be combined to provide meaningful totals. K-means and other improved clustering methods are being explored. The proposed concept could also be helpful in similar applications that identify symbols or text in images.

Item Open Access
Developing Concept Enriched Models for Big Data Processing Within the Medical Domain (2020-07) Gudivada, Akhil; Tabrizi, Nasseh; Philips, James
Within the past few years, the medical domain has endeavored to incorporate artificial intelligence, including cognitive computing tools, to develop enriched models for processing and synthesizing knowledge from Big Data. Due to the rapid growth in published medical research, keeping up with research developments has become a persistent challenge for medical practitioners. Using data-driven artificial intelligence to process large amounts of data can help overcome this difficulty. This research summarizes cognitive computing methodologies and applications used in the medical domain. It also describes the development process for a novel, concept-enriched model using the IBM Watson service and a publicly available diabetes dataset and knowledge base. Finally, it reflects on the strengths and limitations of the model and on enhancements for future experiments. This work thus provides an initial framework for those interested in effectively developing, maintaining, and using cognitive models to enhance the quality of healthcare.

Item Open Access
A FRAMEWORK FOR AUTOMATICALLY GENERATING QUESTIONS FOR TOPICS IN DISCRETE MATHEMATICS (East Carolina University, 2020-11-17) Houshvand, Salar
Automated question generation is critical for realizing personalized learning. Learning research also shows that answering questions is a more effective study method than rereading the textbook multiple times.
However, creating different types of questions is intellectually challenging and time-intensive. This underscores the need for an automated way to generate and evaluate questions. After analyzing the existing approaches to automated question generation, we conclude that most current systems use natural language processing techniques to extract questions from text; as a result, topics such as mathematics lack an automated question generation system that could help learners assess their knowledge. In this research we present a novel framework that automatically generates an unlimited number of questions for different topics in discrete mathematics. We created multiple algorithms in Python for various questions in four main topics. Our final product is presented as an application programming interface (API) built with the Flask library, which makes the system easy to access and use in future developments. Finally, we discuss potential extensions that can be added to our framework as future contributions. The repository for this framework is freely available at https://github.com/SalarHoushvand/discrete-math-restfulAPI.

Item Open Access
Software Engineering for Real-Time NoSQL Systems-centric Big Data Analytics (East Carolina University, 2020-12-02) Clark, William F
Recent advances in Big Data Analytics (BDA) have stimulated widespread interest in integrating BDA capabilities into all aspects of a business. Before these advances, companies spent time optimizing the software development process and the best practices associated with application development. These processes include project management structures and ways to deliver new application features to customers efficiently. While these processes are significant for application development, they cannot be used effectively for the software development of Big Data Analytics.
Instead, other practices and technologies enable automation and monitoring across the full analytics lifecycle, from design to deployment and operations. This paper builds on those practices and technologies and introduces a highly scalable framework for Big Data Analytics development operations. The framework builds on the best-known processes associated with DevOps. These best practices are then demonstrated using a cloud-based NoSQL platform that consumes and processes structured and unstructured real-time data. As a result, the framework produces scalable, timely, and accurate analytics in real time, which can be easily adjusted or enhanced to meet the needs of a business and its customers.

Item Open Access
Studies on Gopala-Hemachandra Codes and their Applications (East Carolina University, 2020-11-16) Childers, Logan
Gopala-Hemachandra (GH) codes are a variation of the Fibonacci universal code and have applications in data compression and cryptography. We study a specific parameterization of GH codes and present several results pertaining to these codes. We show that GH_{a}(n) always exists for any n >= 1 when -4 <= a <= -2, meaning that these are universal codes. We develop two new algorithms to determine whether a GH code exists for a given a and n, and to construct it if it exists. We also prove that when a = -(4+k), where k >= 1, there are at most k consecutive integers for which GH codes do not exist. In 2014, Nalli and Ozyilmaz proposed a stream cipher based on GH codes. We show that this cipher is insecure and provide experimental results on the performance of our program that cracks this cipher.

Item Open Access
Applied Machine Learning for Cybersecurity in Spam Filtering and Malware Detection (East Carolina University, 2020-11-17) Sokolov, Mark
Machine learning is one of the fastest-growing fields, and its application to cybersecurity is increasing.
To protect people from malicious attacks, several machine learning algorithms have been used to predict such attacks. This research emphasizes two vulnerable areas of cybersecurity that could be easily exploited. First, we show that spam filtering is a well-known problem that has been addressed by many authors, yet it still has vulnerabilities. Second, with the increase of malware threats, many companies use AutoAI to help protect their systems. Nonetheless, AutoAI is not perfect, and data scientists can still design better models. In this thesis I show that although there are efficient mechanisms to prevent malicious attacks, there are still vulnerabilities that could be easily exploited. In the visual spoofing experiment, we show that using a classifier trained on data in the Latin alphabet to classify a message that combines Latin and Cyrillic letters leads to much lower classification accuracy. In the malware prediction experiment, our model was able to predict malware attacks on Microsoft computers and achieved higher accuracy than any well-known AutoAI.
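
The visual spoofing experiment described above depends on messages that mix Latin and Cyrillic letters. As a minimal sketch, and not the method used in the thesis, the following Python snippet flags tokens that mix the two scripts using only the standard library. The function names `scripts_in` and `mixed_script_tokens` and the sample message are hypothetical and chosen purely for illustration; the script check relies on Unicode character names, which is an assumption that holds for ordinary Latin and Cyrillic letters.

```python
from __future__ import annotations

import unicodedata


def scripts_in(token: str) -> set[str]:
    """Return which of the scripts LATIN and CYRILLIC the letters of a token use."""
    found = set()
    for ch in token:
        if not ch.isalpha():
            continue  # digits and punctuation carry no script information here
        # Unicode character names begin with the script name, for example
        # "LATIN SMALL LETTER A" or "CYRILLIC SMALL LETTER A".
        name = unicodedata.name(ch, "")
        for script in ("LATIN", "CYRILLIC"):
            if name.startswith(script):
                found.add(script)
    return found


def mixed_script_tokens(message: str) -> list[str]:
    """Return the whitespace-separated tokens that mix Latin and Cyrillic letters."""
    return [tok for tok in message.split() if len(scripts_in(tok)) > 1]


# Hypothetical sample: the word "free" written with two Cyrillic "е" (U+0435)
# characters in place of the Latin "e".
sample = "Get your fr\u0435\u0435 prize now"
print(mixed_script_tokens(sample))
```

A check like this could serve as a preprocessing step or an extra feature in front of a classifier trained only on Latin-alphabet text; whether it would recover the accuracy lost in the experiment is not something the abstract states.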