AN EMPIRICAL EXPLORATION OF ARTIFICIAL INTELLIGENCE FOR SOFTWARE DEFECT PREDICTION IN SOFTWARE ENGINEERING by Elaine Cahill July, 2024 Director of Thesis: Madhusudan Srinivasan, PhD Major Department: Computer Science Artificial Intelligence (AI) is an important topic in software engineering not only for data analysis and pattern recognition, but for the opportunity of finding solutions to problems that may not have explicit rules or instructions. Reliable prediction methods are needed because we cannot prove that there are no defects in software. Deep learning and machine learning have been applied to software defect prediction in the attempt to generate valid software engineering practices since at least 1971. Avoiding safety-critical or expensive system failures can save lives and reduce the economic burden of maintaining systems by preventing failures in systems such as aviation software, medical devices, and autonomous vehicles. This thesis contributes to the field of software defect prediction by empirically evaluating the performance of various machine learning models, including Logistic Regression, Random Forest, Support Vector Machine, and a stacking classifier combining these models. The findings highlight the importance of model selection and feature engineering in achieving accurate predictions. We followed this with a stacking classifier that combines Logistic Regression, Random Forest, and Support Vector Machine (SVM) to see if that improved predictive performance. We compared our results with previous work and analyzed which features or attributes appeared to be effective in predicting defects. We end by discussing potential next steps for further research based on our work results. AN EMPIRICAL EXPLORATION OF ARTIFICIAL INTELLIGENCE FOR SOFTWARE DEFECT PREDICTION IN SOFTWARE ENGINEERING A Thesis Presented to The Faculty of the Department of Computer Science East Carolina University In Partial Fulfillment of the Requirements for the Degree Master of Science in Software Engineering by Elaine Cahill July, 2024 Director of Thesis: Madhusudan Srinivasan, PhD Thesis Committee Members: Nic Herndon, PhD Wu Rui, PhD Copyright Elaine Cahill, 2024 DEDICATION This work is dedicated to the memory of my husband, James Cahill. ACKNOWLEDGEMENTS I would like to express love and thanks to my son, John, for all the support and encouragement throughout my studies. I could not have got through them without your help. Sincere thanks to Rolf, who has the patience of a saint, and for his unlimited support and love. My brother, John Vickers, and his wife Debra for believing in me. Sincere thanks to Dr. Mark Hills who started this journey with me, and to Dr. N. Herndon, Dr. R. Wui, Dr. M.N.H. Tabrizi and especially Dr. M. Srinivasan for getting me over the finish line. Table of Contents LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x CHAPTER 1: INTRODUCTION . . . . . . . . . . . . . . . . . . . . . 1 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Motivations for this Study . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Purpose of this Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.5 Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.6 Research Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . 
8 CHAPTER 2: RELATED WORK . . . . . . . . . . . . . . . . . . . . . 9 2.1 Supervised Machine Learning . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Ensemble Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Unsupervised Machine Learning . . . . . . . . . . . . . . . . . . . . . 13 2.4 Deep Learning Models . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.5 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 CHAPTER 3: METHODOLOGY . . . . . . . . . . . . . . . . . . . . . 16 3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.1.1 Choice of Model . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2 Dataset Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.2.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.2.2 Data Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.2.3 Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.3 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.3.1 One-hot Encoding . . . . . . . . . . . . . . . . . . . . . . . . 23 3.3.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.3.3 Feature Importance . . . . . . . . . . . . . . . . . . . . . . . . 24 3.4 Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.5 Hyperparameter Tuning . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.5.1 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . 29 3.5.2 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.5.3 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . 30 3.5.4 Stacking Ensemble . . . . . . . . . . . . . . . . . . . . . . . . 30 3.6 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.7 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.8 Model Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 CHAPTER 4: RESULTS AND ANALYSIS . . . . . . . . . . . . . . . 35 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.2 RQ1: What are the key indicators in software code that machine learning models can use to predict defects? . . . . . . . . . . . . . . . 36 4.2.1 Correlation and Multicollinearity . . . . . . . . . . . . . . . . 36 4.2.2 Feature Importance . . . . . . . . . . . . . . . . . . . . . . . . 39 4.3 RQ2: Which classifier among logistic regression, random forest, and sup- port vector machine provides the most accurate predictions of software defects? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.3.1 ROC AUC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.3.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.3.3 Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.3.4 Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.3.5 F1 score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.4 RQ3: Can an ensemble using a stacking classifier that combines lo- gistic regression, random forest, and support vector machine improve predictive performance? . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.5 How do our results compare with similar research? . . . . . . . . . . . 51 4.6 Threats to Validity - Internal . . . . . . . . . . . . . . . . . . . . . . 56 4.7 Threats to Validity - External Validity . . . . . . . . . . . . . . . . . 
56 CHAPTER 5: CONCLUSIONS AND FUTURE WORK . . . . . . . 58 5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 APPENDIX A:LITERATURE REVIEW SUMMARY . . . . . . . . 66 A.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 A.2 Search Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 A.2.1 ACM Digital Library . . . . . . . . . . . . . . . . . . . . . . . 67 A.2.2 IEEE Xplore . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 A.2.3 Springer Link . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 A.3 Machine Learning Models . . . . . . . . . . . . . . . . . . . . . . . . 68 A.4 Defect Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 A.5 Repositories of Defect Datasets . . . . . . . . . . . . . . . . . . . . . 71 LIST OF TABLES 3.1 Overview of MDP ′′ Defect Datasets . . . . . . . . . . . . . . . . . . 19 3.2 Dataset Attribute Existence Verification . . . . . . . . . . . . . . . . 22 3.3 Dataset CM1 Column Values Before and After Scaling . . . . . . . . 23 3.4 Training Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.5 Test Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.1 90th Percentile of Coefficients . . . . . . . . . . . . . . . . . . . . . . 40 4.2 Accuracy By Test Dataset . . . . . . . . . . . . . . . . . . . . . . . . 46 4.3 Precision By Test Dataset . . . . . . . . . . . . . . . . . . . . . . . . 47 4.4 Recall By Test Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.5 F1 - score By Test Dataset . . . . . . . . . . . . . . . . . . . . . . . . 49 4.6 Stacking Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.7 Comparison of SVM with SVM using FILTER . . . . . . . . . . . . . 53 4.8 Comparison of Random Forest: This study v. Soe et al. . . . . . . . . 54 4.9 Comparison of Ali et al. Proposed Ensemble for dataset PC1 . . . . . 55 A.1 Repositories of Defect Datasets . . . . . . . . . . . . . . . . . . . . . 73 LIST OF FIGURES 1.1 Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) 4 1.2 Support Vector Machine Hyperplane . . . . . . . . . . . . . . . . . . 5 2.1 Deep Learning Models Usage . . . . . . . . . . . . . . . . . . . . . . 14 3.1 Machine Learning Workflow . . . . . . . . . . . . . . . . . . . . . . . 17 3.2 Correlation Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.3 Stacking Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.1 Correlation Heatmaps . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.2 Distribution of Absolute Values of Coefficients . . . . . . . . . . . . . 39 4.3 Random Forest Feature Importance . . . . . . . . . . . . . . . . . . . 42 4.4 ROC Curves Comparison . . . . . . . . . . . . . . . . . . . . . . . . . 45 A.1 Papers Per Year by Database . . . . . . . . . . . . . . . . . . . . . . 66 A.2 Top Ten Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 A.3 Deep Learning Models Usage . . . . . . . . . . . . . . . . . . . . . . 70 A.4 Top Ten Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 A.5 Top Ten Repositories . . . . . . . . . . . . . . . . . . . . . . . . . . . 
72 Chapter 1 Introduction 1.1 Background Predictive software maintenance and defect prediction studies have been documented since at least 1971 [1]. Artificial Intelligence (AI) is an important topic in software engineering not only for data analysis and pattern recognition, but for the opportunity of finding solutions to problems that may not have explicit rules or instructions. Deep learning and machine learning have been applied to software defect prediction (SDP) in the attempt to generate valid software engineering practices. The ultimate goal for study in the area of defect prediction was proposed by Srivastava et al. [33] “A mathematical theory of software failure needs to be developed which allows for the development of provably correct algorithms for detecting, diagnosing, predicting, and mitigating the adverse events of software issues.” Whilst this paper does not present such a theory, and we still cannot prove that there are no defects in software, we do try to add to the body of work by presenting results from our research. For clarification, the following definitions of error, fault and failure are from “International Standard - Systems and software engineering - Systems and software assurance - Part 1: Concepts and vocabulary” [15]. 1. Failure (3.4.9) is a termination of the ability of a system to perform a re- quired function, or its inability to perform within previously specified limits; an externally visible deviation from the system’s specification. 2. A fault (3.4.6) is a defect in a system or a representation of a system that if executed or activated can potentially result in an error. 3. An error (3.4.5) is a discrepancy between a computed, observed or measured value or condition and the true, specified or theoretically correct value or condition. Errors can also be human mistakes where an action produces an incorrect result. Using machine learning to predict defects or faults is an attempt to discover them before a failure occurs. Static analysis of code can warn of modules that could benefit from refactoring. If modules exist with no unit test coverage, it does not automatically mean that errors will occur in that module, but it would increase confidence in the software if code coverage was closer to 100%. Another approach could be to monitor the condition of the system by running tests continuously, or checking that deployed code has not been changed unknowingly. Also, sampling or testing to verify that bad data has not been added to the system that would cause issues when selected or updated would be useful. An attempt to measure the economic impact of poor quality software can be found in the report to the Consortium for Information & Software Quality [18] where Krasner shows that the cost of finding and fixing defects in the United States in 2022 was $607 billion. Reliable prediction methods can be used to avoid safety-critical or expensive system failures. “Defects, like quality, can be defined in many different ways but are more commonly defined as deviations from specifications or expectations which might lead to failures in operation.” [6]. Complexity and size metrics have been used for many years in an attempt to predict the number of defects that will be found in testing or operating in production. Reliability models have been developed to predict failure 2 rates based on the expected operational usage profile of the system, for example, 1. predicting the number of defects in the system, 2. 
estimating the reliability of the system in terms of time to failure and, 3. understanding the impact of design and testing processes on defect counts and failure densities. Approaching software defect prediction as a binary classification problem, we first use three machine learning algorithms, Random Forest (RF), Support Vector Ma- chines(SVM), and Logistic Regression (LR). After measuring performance individually we implement a stacking classifier to compare performance against individual results. Logistic regression predicts the probability of an event occurring such as whether or not a software module is defective or not i.e., the dependent variable is categorical. It is easy to implement and interpret, making it a good starting point for experiments. Results can be also be displayed graphically such as with a Receiver Operating Characteristic or ROC curve, as shown in Figure 1.1. This plots the true positive rate against the false positive rate, the blue line, and shows that area under the curve (AUC) in this example for dataset KC1 after being trained with dataset CM1 is 0.67. The dashed line labelled “No Skill” is what we expect for random guessing. Random Forest (RF) is an ensemble learning technique used in machine learning. It uses a collection of decision trees. Decision trees can be used to map the possible outcomes of a decision and each node of a decision tree represents a possible outcome. Percentages are assigned to nodes based on the likelihood of the outcome occurring. Voting is the final step to combine the predictions of all the trees to make the final prediction. By having multiple decision trees in the random forest algorithm, predictions can be more accurate by accounting for variability in the data, therefore reducing the risk of bias and overfitting. 3 Figure 1.1: Receiver Operating Characteristic (ROC) Area Under the Curve (AUC): Example ROC curve. Support Vector Machines (SVM) are supervised learning models used for classifi- cation and regression analysis. It finds the hyperplane that best separates the classes and tries to maximize the margin between the classes. An example is shown in Figure 1.2 where two classes, the blue and orange dots, are separated by a central solid line with dotted lines to represent the widest margin of separation. The dots, or nodes, nearest to the margins are called the support vectors. 4 Figure 1.2: Support Vector Machine Hyperplane: The dots, or nodes, nearest to the margins are called the support vectors. Source: https://scikit-learn.org/stable/modules/svm.html. 1.2 Motivations for this Study Reliable prediction methods can be used to avoid safety-critical or expensive system failures. Preventing system failure can not only reduce the economic impact of events, but also save lives. Consider air craft disasters such as, the Ethiopian Airlines flight 30 on March 10, 2019 where all 157 passengers and crew on board died. The accident involved the Boeing 737 Max 1. The Ethiopian investigation [5] documents the problem with the angle of attack in the flight software called Maneuvering Characteristics Augmentation System (MCAS). The investigation’s report was published in 2022 and 1https://www.ntsb.gov/news/press-releases/Pages/NR20190314.aspx 5 it explains that the design of MCAS pushed the jet’s nose down, even as the pilot attempted to pull it up, and was a major cause of the accident. 
A well documented example of how software issues can cause expensive failures was Knight Capital2 that lost over $460 million one day on August 1 2012 when its automated system, Smart Market Access Routing System (SMARS), a high-speed order router for equity orders failed. According to the Securities and Exchange Commission (SEC) [39], “Knight did not have adequate controls in SMARS to prevent the entry of erroneous orders.” One feature of SMARS was to receive orders from upstream components in the trading platform, and then, as needed based on the available liquidity and price, send one or more orders downstream for execution. The SEC report describes how upstream orders were processed by defective, legacy ‘Power Peg’ code, and SMARS sent millions of downstream orders, resulting in erroneous executions in stocks and shares in approximately 45 minutes. Knight was unable to cover the unintended positions, resulting in $460 million loss and were ordered to pay a $12 million penalty by the SEC. 1.3 Purpose of this Study The purpose of this study is to contribute to the work on code quality, specifically software defect prediction and maintenance in the fields of software engineering and AI. We build upon previous work to advance the understanding and application of machine learning to software defect prediction by experimenting with different machine learning models, including LR, RF, SVM, and a stacking classifier that combines these models. Reliable defect prediction can have far-reaching impacts beyond immediate software quality and so we also contribute to the broader goal of development of reliable prediction methods that can minimize the impact of defects in software costs 2https://www.wsj.com/articles/SB10000872396390443866404577565402442234764 6 associated with fixing defects post-deployment. It can prevent disasters and save lives in safety-critical systems e.g., aviation and medical devices, and reduce costs associated with software maintenance and technical debt. Another purpose of this study is to identify and analyze the key indicators in software code that machine learning models can use to predict defects. Understanding which features are most predictive of defects can focus efforts on the most critical aspects of the code, thereby improving code quality and reducing the incidence of defects. 1.4 Research Questions In order to add to the body of work on software defect prediction and the progress towards the main idea of reaching provably correct algorithms [1], we consider the following research questions: • RQ1: What are the key indicators in software code (attribute/feature importance) that machine learning models can use to predict defects? • RQ2: Which classifier among logistic regression, random forest, and support vector machine provides the most accurate predictions of software defects? • RQ3: Can an ensemble using a stacking classifier that combines logistic regression, random forest, and support vector machine improve predictive performance? Answering these questions gives us an insight into how machine learning could be applied in practice and consider where to focus further research. 1.5 Structure of the Thesis The rest of the thesis is structured as follows. Related work is described in Chapter 2. Chapter 3 describes the research methodology, data preparation, performance measures and training setup. Chapter 4 presents the results and analysis of our work 7 as it applies to answer the research questions. We conclude our work in Chapter 5 and propose future work. 
1.6 Research Contribution To the best of our knowledge no one has trained a stacking classifier specifically with RF, SVM and LR on software defect data. In this paper we use the Scikit library [27] to fill this gap in current research by implementing a novel application of stacking classifier to stack the predictions of three supervised learning base models RF, SVM and LR, and compute a final prediction. We contribute to the field of software engineering and AI by presenting results aimed at the development of more reliable and cost-effective software systems. For example, we identify and analyze the key indicators in software code that machine learning models can use to predict defects. We benchmark the performance of LR, RF, and SVM models against each other, as well as against the proposed stacking classifier. Providing this comparison provides insight into the strengths and weaknesses and a view of the most effective approach for different datasets and context. Our evaluation of models on datasets from different projects written in the same programming language (C/C++) provides an assessment of model performance and ability to generalize. Cross-dataset evaluation demonstrates the applicability and limitations of the models across software projects. The importance of investing in reliable defect prediction methods can lead to substantial economic savings and enhanced safety in critical systems and justifies further research and development in this area. Based on the findings of this study, we also suggest areas where further work and refinement of models could yield even better results. 8 Chapter 2 Related Work In this chapter, we discuss existing approaches in applying artificial intelligence (AI), machine learning in particular, to software defect prediction (SDP). In the sections relating to supervised machine learning and ensemble learning we highlight how they are different from this study. For a detailed summary of the results from performing a systematic review of software defect prediction research, see the appendix. A complete list of repositories used in the period is provided in the appendix A.1. There have been a number of studies relating to the use of AI in the form of using machine learning and deep learning models to predict defects in a piece of software using the NASA MDP datasets. They have attempted to predict which module is defective and needs attention. The current literature applies many machine learning models to software defect prediction. The most popular machine learning algorithms are Random Forests (RF), Näıve Bayes (NB), and Support Vector Machine (SVM). The top ten most used algorithms accounted for 41% of all models and techniques used in the research papers in our final selection. Figure A.2 lists these top ten models used by researchers for the search period. Traditional software defect prediction links defects with module size and complexity. Some techniques found in the review that are not directly classified as machine learning or traditional SDP were class dependency networks and Augmented- Code Property Graph (CPG). For example, Xu et al. [42, 43] in their papers on CPG and augmented code graph for defect prediction (ACGDP) presented graphical neural networks (GNN) for obtaining defect characteristics. Under the broader category of data mining techniques, Zang and Ren [47] propose a new method for software defect prediction based on self-organizing data mining (SODM) specifically applied to finance software systems. 
They used SVM as the binary classification model but focused on predicting if a change to the system was a faulty change or non-faulty change rather than if a module was faulty as we do in this paper. Atomic Rule Mining was used along with Random Forest by Thapa et al. [38] and Association Rule Mining by Wu et al. [41]. Our work also includes supervised machine learning with two of the most popular algorithms SVM and RF. Although we use RF in this work, it is in the conventional machine learning way, without rule mining. We do not employ rule mining or data mining because the data we use already exists as preprocessed datasets. 2.1 Supervised Machine Learning Classification with supervised learning has been well-explored but requires a set of data that has already been labelled or classified. In the context of software defect prediction, researchers have used sets of data where the code modules are already labelled as defective or not, often using a collection of history of failures to predict future software failures. Soe et al. [31] used the Random Forest algorithm and showed a maximum accuracy of prediction up to 99.59 with the minimum accuracy of 85.96 by using a hundred trees for software defect prediction. They concluded that around a hundred trees was more stable to get high accuracy and the details on the hyperparameter tuning are restricted to the number of trees only. Our work employs multiple hyperparameters over a grid search in attempt to find a good Random Forest 10 model, and we compare our results with those of Soe et al. In “Effective Software Defect Prediction Using Support Vector Machines (SVMs)” Goyal [8] proposed a filtering technique for effective defect prediction. Their FILTER method aimed to remove data points from the majority class (non-faulty instances) that were in close proximity to the faulty instances therefore creating a more balanced dataset. Although the features are listed, there is no discussion of feature selection or engineering by any other method before applying the novel filter. They concluded that using the FILTER with an SVM based software defect prediction model enhanced performance by 16.73%, 16.80% and 7.65% in terms of accuracy, AUC and F-measure respectively. We compare our SVM-Linear Kernel with no filtering with the SVM-Linear Kernel with filtering. 2.2 Ensemble Learning Ensemble learning includes stacking, boosting, bagging and voting. It is not unusual in this area of research to see multiple techniques combined as an ensemble to compare and contrast research results. Dr. R. Wu (Machine Learning lecture, June 16, 2021) discussed ensembling, “The main principle (assumption) behind the ensemble model is that a group of weak learners come together to form a strong learner, thus increasing the accuracy of the model.” Some work has used an ensemble approach to consider the class imbalance problem of binary classification that is not usually an even split. For example, the number of defective instances are less than the number of non- defective instances. Ali et al. [2] proposed an ensemble model that they trained on NASA datasets from the PROMISE repository. In their proposed model bagging and voting are used. They observed that their ensemble learning model produced better results than SVM, K-Nearest Neighbors (KNN), KNN, Decision Trees (DT) 11 or RF. However, the high accuracy reported for the ensemble model, 99.27%, raises concerns about potential overfitting, and there is no discussion on hyperparameter tuning. 
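To make the intuition behind this kind of proximity-based filtering concrete, the short sketch below drops non-defective (majority-class) instances that lie close to defective ones. It is only a generic illustration under our own assumptions (a fixed Euclidean radius and a scikit-learn NearestNeighbors search); it is not a reimplementation of Goyal's FILTER.

```python
# Illustrative sketch only: a generic proximity-based undersampling of the
# majority (non-defective) class, in the spirit of filtering approaches that
# re-balance the data. The radius value is an arbitrary assumption.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def drop_majority_near_minority(X, y, radius=0.5):
    """Remove majority-class (y == 0) samples within `radius` of any
    minority-class (y == 1) sample, reducing class overlap."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=1).fit(X[y == 1])
    dist, _ = nn.kneighbors(X)                  # distance to nearest defective instance
    keep = (y == 1) | (dist.ravel() > radius)   # keep all defects; drop nearby non-defects
    return X[keep], y[keep]
```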
There is also no mention of cross-validation other than it appearing in Figure 2, but with no discussion on how cross-validation was performed, such as the number of folds used. It is stated that a “Chi-squared statistic method was applied to evaluate the importance of the feature and selection.” which is different to our approach of correlation. We compare our stacking classifier results with the proposed ensemble. Chen et al. [4] used seven base classifiers re-ensembled with a neural network to propose a class-imbalance solution they called dual ensemble software defect prediction (DE-SDP). They concluded that the improved G-mean scores, equation 2.1, indicated a “higher classification accuracy on defective data and is more robust to imbalanced data.” Stacking is an ensemble learning technique that combines the predictions of multiple base models using another model called a meta-learner. The base models’ predictions serve as features for the meta-learner, which learns how to best combine them to make the final prediction. In this paper we use the Scikit library [27] implementation StackingClassifier to stack the predictions of three supervised learning base models RF, SVM and Logistic Regression (LR), and compute a final prediction. Boosting is a machine learning ensemble technique where multiple weak learners are combined sequentially to form a strong learner. In boosting, each new learner focuses on the examples that previous learners struggled with, gradually improving overall performance. AdaBoost, Gradient Boosting, and XGBoost are examples of boosting implementations. Bagging (Bootstrap Aggregating) is an ensemble technique where multiple models are trained independently on different subsets of the training data, typically created by sampling with replacement. The final prediction is often made by averaging or voting on the predictions of all models. Voting is a simple ensemble technique where multiple models make predictions on a given input, and the final 12 prediction is determined by a majority vote for classification tasks (or averaging for regression tasks) of the individual predictions. Random Forests are an example of an ensemble method built on decision trees. These techniques are commonly used in ensemble learning to improve the overall performance and robustness of machine learning models. 2.3 Unsupervised Machine Learning Unsupervised learning means that already labelled or classified data is not available, such as on new projects where there is not yet a history of failures to draw from. In this paper, we use supervised learning in our experiments because our data is already labelled as defective or non-defective. Data from different projects is not easy to reuse in a cross-project or generalized way. Marjuni [20] attempts to address this issue of a lack of a training dataset with spectral classifier. Also, work performed by Ha [9] provided a framework for an unsupervised classifier that was made available to other researchers. We use the NASA MDP datasets from different projects and could therefore argue that we attempt cross-project defect prediction by trying the final model fitted and tested on one project’s data on a different project’s data. 2.4 Deep Learning Models Deep learning is a subset of machine learning and uses neural networks to see patterns from large amounts of data and was used in 35% of the papers surveyed. 
We found Artificial Neural Networks (ANN) to be the most popular deep learning technique, appearing in 23 of 96 (24%) papers that applied a deep learning model. ANN also appeared in the top ten machine learning models overall, as seen in Figure A.2. In “Automated Parameter Tuning of Artificial Neural Networks for Software Defect Prediction” [44] the main objective was to validate that ANN defect prediction models with tuned parameter settings outperform models with default parameter settings. The results showed that the models trained with optimized parameter settings did outperform the models trained with default parameter settings. However, in this paper we do not use deep learning models as they are suited to larger datasets than the NASA MDP datasets.

Figure 2.1: Deep Learning Models Usage: Deep learning is a subset of machine learning and was found in 35% of the papers surveyed.

2.5 Metrics
The performance evaluation criteria for our work are presented in Section 3.8 and include widely accepted evaluation metrics. The most used performance metrics in the literature review are accuracy, Equation (3.9), precision, Equation (3.10), and recall, Equation (3.11), along with the F1-score and the Receiver Operating Characteristic (ROC) curve and area under the ROC curve (AUC), which are defined in Section 3.8. Other metrics applied for evaluating classification models include the Matthews Correlation Coefficient (MCC) and G-mean. We do not apply MCC in this paper, primarily because we want to compare our results with the metrics most used in related work. The Matthews Correlation Coefficient is a measure of the quality of binary classification models. It takes into account true positives, true negatives, false positives, and false negatives, and provides a value between -1 and 1. A coefficient of 1 represents a perfect classifier, 0 indicates a random classifier, and -1 indicates a completely incorrect classifier. It was introduced in 1975 by B.W. Matthews [21], a biochemist looking for a way to assess the quality of secondary structure predictions in proteins. It is now also used as a performance metric for binary classification models [26, 32, 45]. The geometric mean (G-mean) is a mathematical concept equal to the nth root of the product of a group of n numbers. G-mean, Equation (2.1), ranges from 0 to 1, with 1 being the best possible score.

G-mean = √(precision × recall)    (2.1)

By calculating the geometric mean of precision, Equation (3.10), and recall (the true positive rate), Equation (3.11), it provides an overall measure of classification performance. In addition to Chen et al. [4] mentioned earlier, Tabassum et al. [36] use G-mean to demonstrate improvements in Cross-Project (CP) and Just-In-Time (JIT-SDP) software defect prediction.

Chapter 3
Methodology

3.1 Overview
Using data with a specific algorithm creates a machine learning model. In the following sections we describe the steps taken to create our learning models and provide details about selecting the hyperparameters, training, making predictions, and assessing the models’ performance. Testing and further validation are performed using the trained or fitted model to make predictions on unseen data that was held back from training. We use a publicly available collection of datasets derived from the NASA Metrics Data Program (MDP). It contains metrics gathered from various software projects developed within NASA, and these datasets are used extensively in software defect prediction research.
For example, dataset CM1 from the collection was found in 31% of our literature survey papers. Our methodology follows a standard workflow for applying a supervised machine learning model to labelled data. An overview of the steps is shown in Figure 3.1 and the steps are detailed in the following chapters of this paper. 3.1.1 Choice of Model It is generally noted that no single algorithm is universally the best for all scenarios, and it’s often beneficial to experiment with multiple algorithms to find the one that performs best for a specific problem. Therefore, we chose four training algorithms Start Data Collection Data Cleaning Feature Engineering Split Data (Train/Test) Train Model Evaluate Model Tuning Required? Retrain Model Final Predictions Yes No Figure 3.1: Machine Learning Workflow: Overview of Process 17 Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), and in the final step, used a stacking classifier comprised of the LR, RF and SVM to see if predictions outperformed the individual models. LR is easy to interpret because it directly models the probability of a binary outcome, such as defect prediction. This made it a good starting point for our experiments. RF was chosen because combining multiple decision trees through ensemble learning, random forests reduce overfitting. SVM works to maximize the margin between classes while minimizing the classification error. This regularizes the model and prevents over-fitting. We felt this was important as we were working with small datasets. The individual results from LR, RF and SVM were used to answer RQ2: Which classifier among Logistic Regression (LR), random forest (RF), and support vector machine (SVM) provides the most accurate predictions of software defects? 3.2 Dataset Preparation First, data has to be either collected specifically or obtained from an existing source. If it does not already exist the cost and time needed to gather it can be expensive and prohibitive. Once data is obtained, any quality issues must be resolved such as duplication, missing data, values that are outside of a useful range and so on. This is referred to as sanitizing or cleaning the data. Feature engineering involves selecting, transforming, or creating new features from data attributes to improve the performance of machine learning algorithms. Finally, data is split into training and test datasets. It is essential that some data is set by to test the trained model so that it does not just repeat the results learned on the training data. These steps are described in more detail in the following sub sections. 18 Table 3.1: Overview of MDP ′′ Defect Datasets Dataset Project Total Defective % Defective Instances Instances CM1 Spacecraft instrumentation. 327 42 12.84% JM1 Real-time predictive ground sys- tem that uses simulations to gen- erate predictions for missions. 7,782 1,672 21.49% KC1 Storage management system for receiving and processing ground data. 1,183 314 26.54% PC1 Flight software for earth orbiting satellite. 705 61 8.65% 3.2.1 Data Collection For this study we obtained the publicly available cleaned NASA Metric Data Program MDP ′′ collection1 which is also popular in recent papers [3, 25, 37]. The authors state that they used the original versions of the data sets from the NASA MDP repository. We chose only to use the datasets created from projects written in C/C++ to avoid adding to the complexity of also combining Java and Perl projects. 
Starting with a common language base could allow for a more systematic approach to model development and evaluation. Also, we only use datasets where we could trace originals and information such as language used and project name. Table 3.1 provides an overview of the final four datasets. Note that the % Defective is calculated as the ratio of Defective instances to total instances. This is useful as a way of measuring the class-imbalance in that the Defective and non-Defective instances are not evenly split. 1https://figshare.com/collections/NASA_MDP_Software_Defects_Data_Sets/4054940 19 3.2.2 Data Cleaning To clean the original versions of the data sets from the NASA MDP repository MDP ′′ datasets, Sheppard et al. [30] performed initial preprocessing. The preprocessing involved binarization of the Defective class variable by calling Defective if the error count was ≥ 1. The ‘unique module identifier’ was removed as it gave no information toward the defectiveness of a module, along with other error data attributes such as age of error and error density. The remaining steps were: 1. Remove cases with implausible values. 2. Remove cases with conflict feature values. 3. Remove identical cases. 4. Remove inconsistent cases. 5. Remove cases with missing values. 6. Remove constant features. 7. Remove identical features. During our exploratory analysis, we observed that not every attribute appeared in all datasets so we performed a comparison of the MDP ′′ attributes in each dataset file. Table 3.2 shows the results of this analysis where the highlighted rows show the attributes that existed across all datasets. These common attributes became the n features in the features matrix. The Defective field states whether or not the module has one or more reported defects and became our one-dimensional target array. All 21 features are described in Section 3.3. The NASA data instances are referred to as a module to represent the unit of functionality. The name can differ among 20 languages with modules, functions or methods referred to as units of functionality. The original data was provided in Attribute-Relation file format (.arff), therefore, to avoid repetitively loading the file and dropping unwanted features during code execution, our final preprocessing step was to write out only the data for common attributes to .csv files. 3.2.3 Scaling Some algorithms, such as LR fits the data to a curve, so scaling was necessary. Where appropriate we applied the StandardScaler from Scikit-0.60 to transform each column. For example, as shown in Table 3.3, dataset CM1 in the column for branch count has a minimum of three and a maximum of 162, with a standard deviation of 16.8 and a mean of 13.0. After transforming the data to have a mean value of zero and a standard deviation of one, the minimum and maximum are -0.6 and 8.86. RF is an example where the data did not need to be scaled because it is an ensemble of decision trees, making decisions by splitting nodes based on the order of the data rather than their specific value. 3.3 Feature Engineering Feature engineering involves making decisions about which features to include in the model or if more data for new features is needed. We attempt to assess the usefulness of features to simplify the model and decrease complexity, and to make the output easier to analyze and interpret. 
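As a concrete illustration of the dataset preparation steps in Section 3.2 (loading the cleaned .arff files, keeping only the common attributes, binarizing the Defective label, and standardizing columns as in Section 3.2.3), a minimal sketch follows. The file paths, column names, and the shortened attribute list are placeholders rather than the exact ones used in this study.

```python
# Minimal sketch of the dataset preparation in Section 3.2. File names and
# the (shortened) attribute list below are illustrative placeholders.
import pandas as pd
from scipy.io import arff
from sklearn.preprocessing import StandardScaler

COMMON_FEATURES = ["BRANCH_COUNT", "CYCLOMATIC_COMPLEXITY", "HALSTEAD_VOLUME"]  # 3 of the 21

def arff_to_csv(arff_path, csv_path):
    """Load a cleaned MDP .arff file, keep the common attributes plus the
    Defective label (mapped from Y/N to 1/0), and write the result to .csv."""
    data, _meta = arff.loadarff(arff_path)
    df = pd.DataFrame(data)
    df["Defective"] = (df["Defective"].str.decode("utf-8") == "Y").astype(int)
    df[COMMON_FEATURES + ["Defective"]].to_csv(csv_path, index=False)

def scale_features(df):
    """Standardize feature columns to zero mean and unit variance (Section 3.2.3)."""
    scaled = StandardScaler().fit_transform(df[COMMON_FEATURES])
    return pd.DataFrame(scaled, columns=COMMON_FEATURES)
```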
This part of the study will allow us to answer RQ1: What are the key indicators in software code (attribute/feature importance) that machine learning models can use to predict defects?

Table 3.2: Dataset Attribute Existence Verification. A highlighted row (all 1s) indicates that the attribute exists in all four datasets.
File attribute                     CM1  JM1  KC1  PC1
Branch count                       1    1    1    1
Call pairs                         1    0    0    1
Condition count                    1    0    0    1
Cyclomatic complexity              1    1    1    1
Cyclomatic density                 1    0    0    1
Decision count                     1    0    0    1
Decision density                   1    0    0    1
Defective {Y,N}                    1    1    1    1
Design complexity                  1    1    1    1
Design density                     1    0    0    1
Edge count                         1    0    0    1
Essential complexity               1    1    1    1
Essential density                  1    0    0    1
Global data complexity             0    0    0    0
Global data density                0    0    0    0
Halstead content                   1    1    1    1
Halstead difficulty                1    1    1    1
Halstead effort                    1    1    1    1
Halstead error est                 1    1    1    1
Halstead length                    1    1    1    1
Halstead level                     1    1    1    1
Halstead prog time                 1    1    1    1
Halstead volume                    1    1    1    1
Loc blank                          1    1    1    1
Loc code and comment               1    1    1    1
Loc comments                       1    1    1    1
Loc executable                     1    1    1    1
Loc total                          1    1    1    1
Maintenance severity               1    0    0    1
Modified condition count           1    0    0    1
Multiple condition count           1    0    0    1
Node count                         1    0    0    1
Normalized cyclomatic complexity   1    0    0    1
Num operands                       1    1    1    1
Num operators                      1    1    1    1
Num unique operands                1    1    1    1
Num unique operators               1    1    1    1
Number of lines                    1    0    0    1
Parameter count                    1    0    0    1
Percent comments                   1    0    0    1

Table 3.3: Dataset CM1 Column Values Before and After Scaling
Branch Count   Original     Scaled
Mean           13.015291    0
Std            16.843098    1
Min            3            -0.60
Max            162          8.86

3.3.1 One-hot Encoding
One-hot encoding is a technique we used to convert the Defective target attribute from Y, N to 0 and 1, creating a binary vector where 1 indicates that a defect is present and 0 that no defect is present.

3.3.2 Correlation
Pearson’s correlation coefficient (r). British mathematician Karl Pearson’s work on the correlation coefficient was published in 1896 [34]; r is calculated by dividing the covariance of the variables by the product of their standard deviations, as shown in Equation (3.1).

r = cov(X, Y) / (σX · σY)    (3.1)

where cov(X, Y) is the covariance of variables X and Y, and σX and σY are the standard deviations of X and Y, respectively. The r value indicates the strength and direction of the linear relationship between the features. It ranges from -1 to 1, with 1 being a perfect positive linear relationship, -1 a perfect negative linear relationship, and 0 indicating no linear relationship. Figure 3.2 shows a sample of r values from a correlation matrix created for dataset CM1. The diagonal of all 1s is simply where the chart columns and rows are the same feature. Multicollinearity means one variable can be linearly predicted from the others. In our work this could make it difficult to isolate the individual effect of the predictor features on the module being Defective or not. In Chapter 4 we show results of using the correlation matrix of each dataset to create correlation heatmaps on a color scale.

Figure 3.2: Correlation Matrix: Example from dataset CM1
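The correlation analysis of Section 3.3.2 reduces to a few lines once the cleaned data is in a DataFrame. The sketch below, which assumes pandas and seaborn and uses an illustrative 0.9 threshold (not a rule we applied to drop features outright), produces the correlation matrix, a heatmap of the kind shown in Figure 3.2, and a list of highly correlated feature pairs that may signal multicollinearity.

```python
# Sketch of the Pearson correlation analysis in Section 3.3.2. The 0.9
# threshold below is illustrative; features were not dropped on this alone.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def correlation_report(df: pd.DataFrame, threshold: float = 0.9):
    corr = df.corr(method="pearson")                      # r for every pair of features
    sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1)   # heatmap as in Figure 3.2 / 4.1
    plt.tight_layout()
    # Flag strongly correlated pairs that may cause multicollinearity.
    pairs = [(a, b, round(corr.loc[a, b], 2))
             for i, a in enumerate(corr.columns)
             for b in corr.columns[i + 1:]
             if abs(corr.loc[a, b]) >= threshold]
    return corr, pairs
```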
3.3.3 Feature Importance
Dimensionality reduction is the process of reducing the number of features in a dataset while preserving the important information. We performed feature importance analysis to assess if some features could be removed from the dataset. Feature importance indicates the strength and direction of the relationship between the features and the target variable. Our initial training session included evaluating to what extent a particular feature impacts results. One option is to re-train the models using only features that reached a certain importance level, to see if predictions improve. For larger datasets this could also be a way of reducing training time and costs by not having to analyze as many possible combinations. Also, additional low-value features can create noise that obscures the underlying patterns, leading to inaccuracies or decreased model performance. One should assess whether the less important features are truly noise or contribute to the model, by applying domain knowledge and using methods such as L1 regularization (Lasso). Regularization limits how much weight is placed on any single feature. For Logistic Regression and SVM we were able to access the coefficients assigned to each feature, and we use these to calculate the mean feature importance. The sklearn.ensemble.RandomForestClassifier we used has a built-in attribute that returns an array of feature importances.

Control structures contribute to program complexity, and there are measures including McCabe’s cyclomatic, design, and essential complexity that were introduced by McCabe in his 1976 paper “A Complexity Measure” [22]. Overly complex code becomes hard to maintain, error prone, and requires more effort to manage. Halstead was one of the early researchers to provide options for estimating programmer effort in his 1975 paper “Toward a theoretical basis for estimating programming effort” [10]. A description of each of the quality metric attributes included as features in our exploratory analysis is given here:

1. Branch count tallies the number of decision points in the control flow that can take different paths [22]. A larger number of branches means more testing resources are required to cover all cases.
2. Cyclomatic complexity measures the number of linearly independent paths through a program’s decision structure. Introduced by McCabe [22], it can also be used to determine the minimum number of tests needed for code coverage.
3. Design complexity is the cyclomatic complexity of the reduced graph of a program. The reduction is performed to eliminate any complexity which does not influence the interrelationship between design modules [23].
4. Essential complexity indicates the extent to which a program’s flow graph can be reduced by removing or decomposing all the sub flow graphs (subroutines) with unique entry and exit nodes [22].
5. Halstead content is formally Intelligent Content and is defined in the McCabe IQ Research Library (http://www.mccabe.com/iq_research_metrics.htm). Halstead content is another way of referring to the complexity of a given algorithm independent of the programming language used to express the algorithm. Halstead Content = Halstead Volume / Halstead Difficulty (3.2)
6. Halstead difficulty (D) is a measure of how difficult it is for a programmer to comprehend the code [24, 14]. D = (n1 / 2) · (N2 / n2) (3.3)
7. Halstead effort (E) is an estimate of the mental effort to work on a module, calculated as the product of Halstead difficulty and Halstead volume [10]. E = D · V (3.4)
8. Halstead error est is the estimated number of errors in a module.
9. Halstead length (N) is the program length: the total number of operators plus the total number of operands [10]. N = N1 + N2 (3.5)
10. Halstead level (L) is the inverse of the program’s Halstead difficulty and also measures the program’s ability to be comprehended [10]. L = 1 / D (3.6)
11. Halstead programming time is the estimated time to develop the module or implement an algorithm [10]: Halstead programming time = E / S (3.7), where S = 18 represents the number of moments per second of effective mental discriminations of the human brain; Halstead chose this value based on psychological research by Stroud [35].
12. Halstead volume (V) is the minimum number of bits required for coding the program (McCabe IQ Research Library). V = (N1 + N2) · log2(n1 + n2) (3.8)
13. Loc blank is whitespace only; useful for human readability.
14. Loc code and comment: the lines that contain both code and comment, or Halstead’s line count of mixed code and comments.
15. Loc comments: the number of lines in a module; this particular metric includes all blank lines, comment lines, and source code lines (http://promise.site.uottawa.ca/SERepository/datasets/kc1-class-level-Numericdefect.arff).
16. Loc executable, or McCabe line count: the lines of code that contain only code and white space.
17. Loc total, or Halstead’s line count: total lines of code.
18. Num operands (N2): the total operands used [10].
19. Num operators (N1): the total operators used [10].
20. Num unique operands (n2): the unique or distinct number of operands used [10].
21. Num unique operators (n1): the unique or distinct number of operators used [10].
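Because several of these measures are simple functions of the operator and operand counts, Equations (3.2) to (3.8) can be written out directly. The sketch below assumes the four counts n1, n2, N1 and N2 have already been extracted for a module; it illustrates the relationships rather than the tooling that originally produced the NASA MDP metrics.

```python
# Sketch of the Halstead relationships in Equations (3.2)-(3.8) for one
# module, given its operator/operand counts. S = 18 is the Stroud number.
import math

def halstead_metrics(n1, n2, N1, N2, S=18):
    length = N1 + N2                        # (3.5) program length
    volume = length * math.log2(n1 + n2)    # (3.8) volume
    difficulty = (n1 / 2) * (N2 / n2)       # (3.3) difficulty
    level = 1 / difficulty                  # (3.6) level
    effort = difficulty * volume            # (3.4) effort
    time = effort / S                       # (3.7) programming time, in seconds
    content = volume / difficulty           # (3.2) intelligent content
    return {"length": length, "volume": volume, "difficulty": difficulty,
            "level": level, "effort": effort, "time": time, "content": content}

# Example: 10 unique operators, 8 unique operands, 60 total operators, 45 total operands.
print(halstead_metrics(n1=10, n2=8, N1=60, N2=45))
```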
3.4 Model Validation
In preparation for validating our models, instead of creating a training and test split within each dataset, we chose to use datasets CM1 and JM1 for training, Table 3.4, and KC1 and PC1 for testing, as listed in Table 3.5. This resulted in close to an 80/20 percent split of the total instances. The importance of a training and test split is that evaluating the model only on samples it has already seen would result in a near-perfect score while failing to predict anything useful on yet-unseen data. A too-perfect performance score is a symptom of over-fitting, and a solution to over-fitting is using hold-out sets of data.

Table 3.4: Training Datasets
Dataset   Instances   Percent Defective
CM1       327         12.84%
JM1       7782        21.49%

Table 3.5: Test Datasets
Dataset   Instances   Percent Defective
KC1       1183        26.54%
PC1       705         8.65%

3.5 Hyperparameter Tuning
In machine learning, a hyperparameter is a parameter that can be set before training begins to adjust the learning process and influence the model’s ability to generalize from training data to unseen data. In this section we discuss the hyperparameter selection for each model and the grid search employed in this study.

3.5.1 Logistic Regression
As the starting point for our experiment we used the Scikit-learn implementation of LR for our binary classification problem, LogisticRegression. The liblinear solver performs regularization by default, but we can still adjust the C parameter [27], which controls the inverse of the regularization strength. A smaller value of C indicates stronger regularization, which can help prevent overfitting by penalizing large coefficients. For example, we used values 0.1, 1, and 10 for C with LR, which allowed a grid search to be performed over those three values.

3.5.2 Random Forest
Random Forest is an ensemble learning method that combines multiple decision trees to improve predictive accuracy. It also reduces over-fitting by averaging the predictions of multiple trees. The implementation we used is RandomForestClassifier [27]. The parameters we tuned were:
• clf__n_estimators is the number of decision trees in the random forest; more trees generally improve performance but increase computational cost.
• clf__max_depth defines the maximum depth of each decision tree. Limiting the depth can prevent overfitting, and ‘None’ allows the trees to grow to their potential maximum depth.
• clf__min_samples_split is how many samples need to be at a node before it can be split into child nodes.
• clf__min_samples_leaf sets the minimum number of samples allowed in a leaf node.
• clf__bootstrap specifies whether bootstrap samples are used when building trees. If set to True, each tree in the forest is built on a bootstrap sample (sampling with replacement). If set to False, the entire dataset is used to build each tree. Using bootstrap samples can introduce randomness, which can improve generalization.

3.5.3 Support Vector Machine
For SVM we used the SVC class from Scikit-learn [27] for binary classification to separate defective and non-defective software modules. It is considered memory efficient because it uses only those data points that just touch the margins (the support vectors). We used a linear kernel with SVM, which simply computes the dot product between two input vectors, V1 · V2, and is suitable for linearly separable data. For SVM the regularization parameter C regularizes the model and prevents overfitting by controlling the strictness of the margin: a large C means that data points cannot lie inside the margin, while smaller values specify stronger regularization. We trained with low and higher C values [0.01, 0.1, 1, 10]. Other kernel functions are available to handle non-linear relationships in the data if needed.

3.5.4 Stacking Ensemble
A stacking classifier was used to take the best-estimator output of the individual base classifiers, LR, RF and SVM, and use it as the input of a final estimator. Figure 3.3 illustrates this, with the three models used as inputs to the final model. Stacking combines the strengths of the base classifiers; by aggregating the predictions, it may achieve higher accuracy, be more robust, and generalize better. The Scikit-learn StackingClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html) is used to answer RQ3: Can a stacking classifier that combines LR, RF, and SVM improve predictive performance?

Figure 3.3: Stacking Classifier: Uses the strength of each individual estimator by using their output as input of a final estimator.

3.6 Training
Training the model involves applying the individual algorithm to the data with the selected hyperparameters and features to obtain a best fit. We first work with the classifiers LR, RF and SVM separately. Model validation essentially involves fitting the model to the training data and comparing the predictions to the known values. Model validation and the search for the best parameters were also aided by executing a cross-validated grid search over a parameter grid. We used the default of 5 cross-validation splits (folds/iterations) with the Scikit-learn implementation GridSearchCV (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to fit the model and report the results of using various hyperparameters of the individual estimators. As our estimators are classifiers and Y, the defective/non-defective target vector, is binary, StratifiedKFold is used. StratifiedKFold is “a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.” Finally, we combine these using a Stacking Classifier, as illustrated in Figure 3.3. The best estimators from fitting the models are supplied to the Stacking Classifier to provide final predictions that can be compared to the individual model output.
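To summarize how the pieces of Sections 3.5 and 3.6 fit together, a condensed sketch of the tuning and stacking workflow is given below. The parameter grids are abbreviated, the synthetic data stands in for the CM1 and JM1 training instances, and the variable names are placeholders; it illustrates the workflow rather than reproducing the exact scripts used in this study.

```python
# Condensed sketch of the workflow in Sections 3.5-3.6: a cross-validated
# grid search per base model, then a stacking classifier built from the best
# estimators. Grids are abbreviated and the data below is a synthetic
# placeholder for the CM1 + JM1 training instances.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X_train, y_train = make_classification(n_samples=500, n_features=21,
                                       weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=5)   # folds preserve the defective/non-defective ratio

searches = {
    "lr": GridSearchCV(LogisticRegression(solver="liblinear"),
                       {"C": [0.1, 1, 10]}, cv=cv),
    "rf": GridSearchCV(RandomForestClassifier(),
                       {"n_estimators": [100, 300], "max_depth": [None, 10]}, cv=cv),
    "svm": GridSearchCV(SVC(kernel="linear"),
                        {"C": [0.01, 0.1, 1, 10]}, cv=cv),
}
best = {name: gs.fit(X_train, y_train).best_estimator_ for name, gs in searches.items()}

# Stack the tuned base models; a logistic regression acts as the meta-learner.
stack = StackingClassifier(estimators=list(best.items()),
                           final_estimator=LogisticRegression(), cv=cv)
stack.fit(X_train, y_train)        # this model is then evaluated on the held-out KC1 and PC1 data
```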
3.7 Testing
Finally, we use the trained model to make predictions on the held-out test datasets. The model that was fitted to the training data is applied to unseen data using the best hyperparameters and most important features. Performance measures on the outcomes are then used to evaluate effectiveness in predicting whether or not modules contain defects.

3.8 Model Performance
In this section, we describe the performance measures used for evaluating the effectiveness of each model. As we are predicting whether or not a module is Defective, statistical binary classification measures are used.

Definitions
1. True Positive (TP): a defect was predicted and it actually was a defect.
2. False Positive (FP): a defect was predicted but there actually was no defect.
3. True Negative (TN): a defect was not predicted and there actually was no defect.
4. False Negative (FN): no defect was predicted but there actually was a defect.
5. Accuracy is the proportion of the number of correctly classified instances (TP and TN) to the total number of instances present (TP, TN, FP, and FN). Accuracy = (TP + TN) / (TP + FP + TN + FN) (3.9)
6. Precision is the ratio of Defective instances that are classified correctly (TP) to the total number of instances that are classified as Defective (TP and FP). Precision = TP / (TP + FP) (3.10)
7. Recall is the ratio of Defective instances that are classified correctly (TP) to the total number of instances that are actually Defective (TP and FN). Recall = TP / (TP + FN) (3.11)
8. F-measure, F-score, and F1 score are often used interchangeably; they refer to the same metric used in binary classification. It combines precision and recall as the harmonic mean to provide a single value that represents the model’s accuracy in capturing both positive and negative instances. The F1-score ranges from 0 to 1, with 1 being the best possible score, indicating perfect precision and recall, and 0 indicating the worst possible performance. F1 = 2 · (Precision × Recall) / (Precision + Recall) (3.12)
9. Receiver Operating Characteristic Area Under the Curve. The AUC of the ROC measures the overall performance of the classifier across all possible thresholds. A higher AUC value (ranging from 0 to 1) indicates better discrimination and model performance. Huang et al. [13] give a good overview in their 2005 paper “Using AUC and Accuracy in Evaluating Learning Algorithms” and state that “the area under the ROC curve, or simply AUC, provides a good summary” of the performance of the ROC curves.

In the next chapter we present results and analysis based on these metrics to measure the accuracy and performance of our models and answer the research questions that we proposed in Chapter 1.

Chapter 4
Results and Analysis

4.1 Introduction
In this chapter we answer the research questions posed in Chapter 1 by reviewing the overall performance of the models as applied to unseen test datasets and different training data setups. We present results from our attempts to understand the differences in feature importance between training datasets and compare our results with existing work.

• RQ1: What are the key indicators in software code (attribute/feature importance) that machine learning models can use to predict defects?
• RQ2: Which classifier among logistic regression, random forest, and support vector machine provides the most accurate predictions of software defects?

• RQ3: Can an ensemble using a stacking classifier that combines logistic regression, random forest, and support vector machine improve predictive performance?

• RQ4: How do our results compare to existing work?

Several tables in this chapter have a ‘Class’ column, and the classes are:

• Class 0: Non-defective

• Class 1: Defective

4.2 RQ1: What are the key indicators in software code that machine learning models can use to predict defects?

It is a crucial part of machine learning to tailor feature selection and engineering efforts appropriately for each model and dataset. A tailored approach can improve model accuracy and provide deeper insights into the factors that drive predictions in different software engineering contexts. The key indicators identified in this study for predicting software defects include Halstead content, Halstead volume, Halstead effort, and cyclomatic complexity. These metrics, which measure code complexity and the mental effort required to understand the code, were consistently significant predictors across the datasets. Halstead volume captures the size of the program; larger volumes typically correlate with higher complexity, as can be seen on the heatmaps and RF feature importance charts, and hence with higher defect rates. This suggests that features capturing the intricacy and comprehensibility of code play a vital role in defect prediction. Features such as branch count and the number of operands and operators were also important, highlighting the role of code structure and complexity in defect prediction. Further investigation into additional features and their interactions could provide deeper insights into improving model accuracy.

4.2.1 Correlation and Multicollinearity

To illustrate how each feature correlates with the others in terms of influencing the outcome of the software defect, Figure 4.1 shows a correlation heatmap for each dataset: (a) CM1, (b) JM1, (c) KC1, and (d) PC1. A value of 1 is deep red, 0 is white, and -1 is dark blue. Heatmaps visually show the correlation and can be used to zoom in on highly correlated features that may cause multicollinearity to occur when modelling. The red diagonals with value 1 are where the features are cross-referenced with themselves. The red squares with value 1 that are not on the diagonal, however, suggest very high correlations, or multicollinearity. For example, num_operands, num_operators, Halstead volume, and Halstead length have strong positive correlations with each other across all datasets. These positive correlations indicate that increases in one of the metrics are typically associated with increases in the others. This is not unexpected because Halstead length is the sum of operands and operators per Equation (3.5). Including all of these features in a regression model could lead to multicollinearity issues, making it difficult to isolate the individual effect of each predictor feature on the response variable (defective). The complexity metrics (cyclomatic complexity, design complexity, essential complexity) and branch count have moderate to high positive correlations with each other, suggesting that as the branching in code increases, so does the complexity. A sketch of this multicollinearity check follows.
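The check can be made concrete with a few lines of pandas. The sketch below is illustrative only: it assumes the cleaned dataset sits in a DataFrame named df whose columns are the numeric metrics plus a binary defective target, and the 0.9 cut-off is an assumed example threshold rather than a value fixed by this study.

```python
# Sketch: flag highly correlated feature pairs that may cause multicollinearity.
# `df` is assumed to hold the numeric metric columns plus a binary 'defective'
# target column; the 0.9 threshold is illustrative.
import numpy as np
import pandas as pd

def highly_correlated_pairs(df: pd.DataFrame, target: str = "defective",
                            threshold: float = 0.9) -> pd.DataFrame:
    """List feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    features = df.drop(columns=[target])
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair is reported once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().reset_index()
    pairs.columns = ["feature_1", "feature_2", "abs_corr"]
    return pairs[pairs["abs_corr"] >= threshold].sort_values("abs_corr", ascending=False)

# A heatmap like Figure 4.1 can then be drawn with, e.g., seaborn.heatmap(corr, cmap="coolwarm").
```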
On the other end of the color scale, loc_blank has a weak correlation with most other metrics, indicating that the number of blank lines in the code does not significantly impact other metrics. Halstead level also shows weak to moderate negative correlations with most other metrics, suggesting an inverse relationship. A higher Halstead level indicates simpler code, which likely corresponds with fewer lines of code and lower complexity metrics. As Equation (3.6) shows, L = 1/D. This inverse relationship is easily verified by looking at the values in the dark blue bands on all four heatmaps.

Removing features from the data on strong correlation alone is not usually sufficient. Domain expertise is needed to understand that even correlated features may both be needed, and keeping both may make the model easier to interpret. However, reducing the feature space can help in managing the complexity of the model, improve training times, and reduce overfitting. In such cases, dropping some highly correlated features might be beneficial after thorough analysis. We continue with results and further analysis for feature selection.

Figure 4.1: Correlation Heatmaps for (a) CM1, (b) JM1, (c) KC1, and (d) PC1: shows the correlation between features on a color scale, with red indicating stronger correlations and blue weaker correlations.

4.2.2 Feature Importance

To further analyze the potential impact of the models’ coefficients on each dataset, we plotted the distribution of the coefficients to try to establish a threshold above which features might be considered for removal. Taking the absolute value puts the focus on the strength of the correlation without regard to whether it is positive or negative. From Figure 4.2 it can be seen that even though the two training datasets (CM1 and JM1) have the same attributes, the distributions are not uniform.

Figure 4.2: Distribution of Absolute Values of Coefficients for LR and SVM, shown for (a) LR on CM1, (b) LR on JM1, (c) SVM on CM1, and (d) SVM on JM1: the presence of a few larger coefficients hints at some features having a greater impact on the model outcome compared to most.

With LR the distribution is more even in JM1 (b) than in CM1 (a), with a peak near 0.0 but also significant representation around 0.3 to 0.4 and another peak around 0.6. This could imply that a wider range of features contribute variably to the model, possibly indicating a more complex relationship in the JM1 data. In contrast, in CM1 LR there is a high frequency of coefficients around 0.0 to 0.3, with a significant peak near 0.1 and few coefficients in higher ranges above 0.6. This distribution suggests that while most features have a moderate influence, there are a few features with very high or very low influence. The SVM distributions in JM1 show a high peak at 0.0 and another significant peak close to 0.4, with few instances near 1.0. The presence of these few larger coefficients hints at some features having a greater impact on the model outcome compared to most. The distribution for CM1 with SVM peaks strongly at 0.04 and 0.1, with most coefficients being below 0.15. This suggests that features generally have a low influence on the model or that the relationships are less complex.

Table 4.1: 90th Percentile of Coefficients
Dataset   Logistic Regression   SVM
CM1       0.4897                0.1952
JM1       0.6194                0.3602
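The coefficient analysis behind Table 4.1 can be expressed in a few lines. The sketch below is illustrative only and assumes lr and svm are the already-fitted LogisticRegression and linear-kernel SVC estimators from Chapter 3, with feature_names listing the corresponding dataset columns.

```python
# Sketch: rank features by absolute coefficient magnitude and derive a
# percentile threshold, as used for Table 4.1. `lr`, `svm` and `feature_names`
# are assumed to come from the fitted models described in Chapter 3.
import numpy as np
import pandas as pd

def coefficient_summary(model, feature_names, percentile=90):
    """Rank features by |coefficient| and return the given percentile as a threshold."""
    # Both LogisticRegression and SVC(kernel="linear") expose coef_ of shape (1, n_features).
    coefs = pd.Series(np.abs(model.coef_).ravel(), index=feature_names)
    threshold = np.percentile(coefs, percentile)  # e.g. the values reported in Table 4.1
    return threshold, coefs.sort_values(ascending=False)

# Illustrative usage:
# thr_lr,  ranked_lr  = coefficient_summary(lr,  feature_names)
# thr_svm, ranked_svm = coefficient_summary(svm, feature_names)
# For Random Forest, the analogous ranking comes from rf.feature_importances_ (Figure 4.3).
```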
The 90th percentiles listed in Table 4.1 are considered as possible values to use as a threshold for features to remove. LR tends to use higher magnitude coefficients than SVM across both datasets. The higher coefficients in LR could imply that the model is more sensitive to changes in the input features. This can be beneficial for learning complex patterns but may also risk overfitting if not managed with techniques like regularization. On the other hand, the lower coefficients observed in SVM might suggest that the model is more robust to changes in the input data. This might offer better generalization on unseen data, which can be advantageous in scenarios where data is expected to vary considerably. Overall, JM1 seems to involve more complex relationships among features, as indicated by the broader spread of significant coefficients in both models compared to CM1.

For RF, we evaluated the feature importances using the feature_importances_ property of the algorithm and plotted the results as shown in Figure 4.3. Both datasets show high importance for Halstead content and Halstead effort, which suggests these metrics are generally valuable across different contexts in assessing the models’ predictions. num_operators is also significant in both datasets, though possibly with contextual differences in its role between CM1 and JM1. In JM1, the broader spread of importance across many features could indicate a more complex interplay of factors that determine the target variable, suggesting that predictions in JM1 are less about a single dominant characteristic of the data. The high ranking of loc_comments in CM1 could suggest a scenario where the volume of comments within the code significantly correlates with the target variable, possibly indicating the complexity or maintainability of the code.

Figure 4.3: Random Forest Feature Importance for (a) CM1 and (b) JM1: JM1 seems to involve more complex relationships among features, as indicated by the broader spread of importance values compared to CM1.

Answer to RQ1: What are the key indicators in software code that machine learning models can use to predict defects? The key indicators identified in this study for predicting software defects include Halstead content and Halstead effort. These metrics, which measure code complexity and the mental effort required to understand the code, were consistently significant predictors across the datasets. Halstead effort measures the mental effort required to understand the code, so the significance of this metric across different datasets suggests that the cognitive load on developers is a strong indicator of potential defects. Halstead volume captures the size of the program; larger volumes typically correlate with higher complexity, as can be seen on the heatmaps and RF feature importance charts, and hence with higher defect rates. Features such as branch count and the number of operands and operators were also important, highlighting the role of code structure and complexity in defect prediction.

4.3 RQ2: Which classifier among logistic regression, random forest, and support vector machine provides the most accurate predictions of software defects?

Among the individual classifiers tested, RF consistently showed better performance in predicting software defects compared to LR and SVM.
This aligns with previous studies [19, 31, 38] that highlight the robustness and accuracy of random forest in handling complex datasets with numerous features. By averaging the results of multiple trees, RF resists overfitting, and having an inbuilt method for selecting important features might have contributed to its better performance.

4.3.1 ROC AUC

The AUC values indicate the overall quality of the models’ predictions, where 0.5 represents no discrimination ability (similar to random guessing), illustrated by the “No Skill” dotted line, and 1.0 represents perfect discrimination. Figure 4.4 shows ROC curves for LR and SVM. There is a clear difference in performance between LR and SVM and the datasets they were trained on, suggesting that both model selection and the nature of the dataset significantly influence outcomes.

In the LR graph all curves overlap and indicate almost identical performance, with AUC = 0.66 or 0.67 for all curves. This suggests a moderate ability to discriminate between the defective and non-defective classes: better than a random guess (No Skill line) but not highly accurate. The overlapping nature indicates that changes in the training dataset (CM1 vs. JM1) or test dataset (KC1 vs. PC1) do not affect the performance, which may suggest similar characteristics or distributions in the datasets or robustness across the models. In contrast, the SVM ROC curves show distinct performances for different combinations of training and test datasets.

• KC1 trained on CM1: AUC = 0.52 (blue solid line)
• PC1 trained on CM1: AUC = 0.72 (green dashed line)
• KC1 trained on JM1: AUC = 0.64 (red dashed line)
• PC1 trained on JM1: AUC = 0.70 (green dotted line)

The best performance is observed for PC1 trained on CM1, with an AUC of 0.72, suggesting good discriminatory ability. The worst performance is for KC1 trained on CM1, with an AUC close to 0.52, which is barely above the No Skill line, indicating a performance close to random guessing. The improvement in AUC for PC1 across both training datasets (CM1 and JM1) compared to KC1 suggests that the model or features used in PC1 are more effective for this task.

Figure 4.4: ROC Curves Comparison for (a) LR and (b) SVM: the LR curves overlap, while the SVM curves show distinct performances for different combinations of datasets.

4.3.2 Accuracy

Table 4.2 lists accuracy scores for the three classification models LR, RF, and SVM trained using CM1 and JM1 and tested on datasets KC1 and PC1. RF and SVM both performed best on PC1, at 0.91 and 0.90, after training on CM1 and JM1. LR performed consistently at 0.73 and 0.74 in all four test scenarios. The KC1 tests were similar across LR, RF and SVM at 0.73 to 0.74. KC1 only reached 0.34 accuracy after training on JM1 and 0.5 after training with CM1.

Table 4.2: Accuracy By Test Dataset
Training Dataset   Test Dataset   Logistic Regression   Random Forest   Support Vector Machine
CM1                KC1            0.73                  0.73            0.73
JM1                KC1            0.74                  0.74            0.73
CM1                PC1            0.73                  0.91            0.91
JM1                PC1            0.74                  0.90            0.91

4.3.3 Precision

Precision scores are presented in Table 4.3. Class 0 (non-defective, the negative class) generally has much higher precision scores across all models compared to Class 1 (defective, the positive class). This indicates that all models are better at identifying true negatives than true positives, which could imply an imbalance in the dataset where negatives are more frequent or easier to predict. The uniform scores of 0.73 for Class 0 in the CM1/KC1 setup across all models suggest potential overfitting.
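The per-class scores reported in this section can be reproduced with scikit-learn’s standard metric utilities. A minimal sketch, assuming model is one of the fitted classifiers and X_test, y_test hold one of the hold-out test datasets (KC1 or PC1):

```python
# Sketch: compute the metrics defined in Section 3.8 for one train/test combination.
# `model`, `X_test` and `y_test` are assumed to come from the experiments above.
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

def evaluate(model, X_test, y_test):
    """Report accuracy, per-class precision/recall/F1, and ROC AUC."""
    y_pred = model.predict(X_test)
    print("Accuracy:", round(accuracy_score(y_test, y_pred), 2))
    # Per-class precision, recall and F1 (class 0 = non-defective, class 1 = defective).
    print(classification_report(y_test, y_pred,
                                target_names=["non-defective", "defective"]))
    # ROC AUC needs a continuous score: use probabilities when available,
    # otherwise the decision function (e.g. an SVC trained without probability=True).
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X_test)[:, 1]
    else:
        scores = model.decision_function(X_test)
    print("ROC AUC:", round(roc_auc_score(y_test, scores), 2))
```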
Generally, models trained on JM1 seem to perform slightly better on Class 0 but worse on Class 1 when compared to those trained on CM1. Test dataset PC1 allows for higher precision in Class 0 predictions across all models compared to KC1. This could indicate that PC1 is either a less challenging dataset or better aligned with the features or distribution of the training data.

Table 4.3: Precision By Test Dataset
Training Dataset   Test Dataset   Class   Logistic Regression   Random Forest   Support Vector Machine
CM1                KC1            0       0.73                  0.73            0.73
CM1                KC1            1       0.00                  0.00            0.00
JM1                KC1            0       0.74                  0.75            0.73
JM1                KC1            1       0.69                  0.56            0.00
CM1                PC1            0       0.73                  0.91            0.91
CM1                PC1            1       0.00                  1.00            0.00
JM1                PC1            0       0.74                  0.93            0.91
JM1                PC1            1       0.69                  0.32            0.00

4.3.4 Recall

The recall data presented in Table 4.4 suggest a strong need for the models to improve in detecting Class 1 (defective) instances without sacrificing performance on Class 0. All models show high recall scores for Class 0 in most cases, indicating that they are very effective at identifying negative instances; in particular, LR, RF, and SVM consistently achieve perfect recall scores (1.00) in many scenarios. While maintaining high recall for Class 0, the RF and SVM models struggle with Class 1, rarely identifying more than a small fraction of positive cases, with few exceptions where Random Forest reaches up to 0.18 recall.

Table 4.4: Recall By Test Dataset
Training Dataset   Test Dataset   Class   Logistic Regression   Random Forest   Support Vector Machine
CM1                KC1            0       1.00                  1.00            1.00
CM1                KC1            1       0.00                  0.00            0.00
JM1                KC1            0       0.99                  0.97            1.00
JM1                KC1            1       0.04                  0.09            0.00
CM1                PC1            0       1.00                  1.00            1.00
CM1                PC1            1       0.00                  0.02            0.00
JM1                PC1            0       0.99                  0.96            1.00
JM1                PC1            1       0.04                  0.18            0.00

4.3.5 F1 score

The F1 scores in Table 4.5 reflect a balance between precision and recall for each model and are especially useful for evaluating performance on datasets with imbalanced classes. The high F1 scores for Class 0 across most models indicate that the models are better tuned to predict negatives, possibly due to a higher prevalence of non-defective instances. In contrast, the consistently low F1 scores for Class 1 across all models suggest a substantial difficulty in predicting positives.

Table 4.5: F1-score By Test Dataset
Training Dataset   Test Dataset   Class   Logistic Regression   Random Forest   Support Vector Machine
CM1                KC1            0       0.85                  0.85            0.85
CM1                KC1            1       0.00                  0.00            0.00
JM1                KC1            0       0.85                  0.85            0.85
JM1                KC1            1       0.07                  0.16            0.00
CM1                PC1            0       0.85                  0.96            0.95
CM1                PC1            1       0.00                  0.03            0.00
JM1                PC1            0       0.85                  0.94            0.95
JM1                PC1            1       0.07                  0.23            0.00

Answer to RQ2: Which classifier among logistic regression, random forest, and support vector machine provides the most accurate predictions of software defects? The random forest classifier provided the most accurate predictions of software defects, with an average accuracy of 0.87 across the datasets. It outperformed logistic regression and support vector machine in terms of precision, recall, and F1-score, demonstrating its robustness and ability to handle complex interactions between features.

4.4 RQ3: Can an ensemble using a stacking classifier that combines logistic regression, random forest, and support vector machine improve predictive performance?

The motivation for applying a stacking classifier arose from the expectation that taking the best of each base model would build an improved model for more accurate predictions. Table 4.6 lists results from the stacking classifier using the best parameters from each of the three base models, LR, RF and SVM.
The performance metrics provided in Table 4.6 indicate that the stacking classifier, combining logistic regression, random forest, and support vector machine, did not improve predictive performance across the different training and test datasets. Therefore, the ensemble method does not reliably enhance predictive performance in this context.

Table 4.6: Stacking Classifier
Training Dataset   Test Dataset   AUC    Accuracy   Class   Precision   Recall   F1-score
CM1                KC1            0.50   0.73       0       0.73        1.00     0.85
                                                    1       0.00        0.00     0.00
CM1                PC1            0.53   0.92       0       0.92        1.00     0.96
                                                    1       0.67        0.07     0.12
JM1                KC1            0.54   0.34       0       0.87        0.12     0.22
                                                    1       0.28        0.95     0.43
JM1                PC1            0.50   0.09       0       1.00        0.00     0.01
                                                    1       0.09        1.00     0.16

The stacking classifier shows markedly lower accuracy than the individual classifiers in the scenarios trained on JM1, dropping to 0.34 on KC1 and 0.09 on PC1. This might indicate that the stacking approach, as currently configured, is not effective. Further investigation could be undertaken to consider whether it suffers from issues such as overfitting the training data or not generalizing well to the test data. Investigation could also include re-examining the base classifiers used, their hyperparameters, and how they are combined. Accuracy varies widely depending on the dataset combination, from as low as 0.09 to as high as 0.92. The accuracy metrics are generally higher when Class 0 is predicted correctly, as is especially noticeable when training on CM1 and testing on PC1 (0.92 accuracy). However, the low accuracy in some combinations (e.g., training on JM1 and testing on PC1, 0.09) indicates severe misclassification, with almost all Class 0 instances labelled as Class 1. The low AUC scores and the varied performance across classes imply that the current model complexity might not be suitable for the underlying data structure and distribution. The AUC values range from 0.50 to 0.54, close to 0.50, indicating that the classifier performs no better than random guessing on these dataset combinations. The stacking classifier did not perform better on AUC than either LR or SVM, whose scores were mostly higher, in the range 0.64 to 0.72, with the exception of SVM tested on KC1 after training on CM1 (0.52). When trained on JM1 and tested on PC1, the stacking classifier shows complete recall for Class 1 (1.00) but fails entirely for Class 0 (0.00). Choosing RandomForestClassifier as the estimator for SelectFromModel might have selected features that are good for decision trees but not optimal for logistic regression, another reason why the current model configuration might not suit the underlying data structure and distribution.

Answer to RQ3: Can an ensemble using a stacking classifier that combines logistic regression, random forest, and support vector machine improve predictive performance? While some models achieved high accuracy for certain classes, others showed poor results, particularly in recall and F1-score for the minority class. Specifically, the AUC values range from 0.50 to 0.54, suggesting that the model’s ability to distinguish between classes is no better than random guessing for most dataset pairs. The feature selection method might have selected features that are good for decision trees but not optimal for logistic regression. A more sophisticated ensemble method, or a neural network if appropriate for the size of the dataset, could be considered.

4.5 How do our results compare with similar research?
In this section, we compare the results of our software defect prediction models with those from three notable research papers: “Effective software defect prediction using support vector machines (SVMs)” [8], “Software Defect Prediction Using Random Forest Algorithm” [31], and “An Ensemble Model for Software Defect Prediction” [2].

Goyal’s work on SVMs highlighted the enhancement brought by their FILTER technique, claiming an improvement in accuracy of 16.73% and an improvement of 7.65% in F-measure (F1 score). We compare our SVM with a linear kernel (no filtering) against Goyal’s SVM-Linear Kernel with FILTER in Table 4.7.

Table 4.7: Comparison of SVM (this study) with SVM using FILTER (Goyal)
Dataset   Accuracy (avg), this study   FILTER accuracy
KC1       0.73                         0.83
PC1       0.91                         0.89
Dataset   AUC (avg), this study        FILTER AUC
KC1       0.58                         0.829
PC1       0.71                         0.776
Dataset   F1-score, this study         FILTER F1-score
KC1       0.85                         0.901
PC1       0.95                         0.934

• Accuracy: The SVM-Linear Kernel with filtering shows higher accuracy for the KC1 dataset (0.83 compared to 0.73) and comparable accuracy for the PC1 dataset (0.89 compared to 0.91). This indicates that the filtering technique generally improves or maintains accuracy across different datasets.

• AUC (Area Under the Curve): The AUC values indicate a significant improvement for the KC1 dataset (0.829 compared to 0.58) and a moderate improvement for the PC1 dataset (0.776 compared to 0.71) when using the filtering technique. This suggests that the filtering technique enhances the model’s ability to discriminate between classes.

• F1-score: The F1-score, which balances precision and recall, is higher with the filtering technique on KC1 (0.901 compared to 0.85), while on PC1 our model’s score was slightly higher (0.95 compared to 0.934). On balance, filtering improves or roughly maintains the F1-score.

The values provided suggest that Goyal’s FILTER technique for enhancing SVM models outperforms our standard SVM-Linear Kernel model on most of the metrics reported, most clearly AUC. Therefore, incorporating the filtering technique could be beneficial for improving the effectiveness of SVM models in classification tasks, and should be considered in future work.

Soe et al.’s research focused on using Random Forest (RF) algorithms with NASA metrics data and the Arçelik (AR) datasets from the PROMISE repository created in 2007. RF models are known for their high accuracy and robustness in handling large datasets. In Table 4.8 we compare accuracy for datasets KC1 and PC1. It is noted that Soe et al. obtained their highest accuracy on dataset PC2 (0.9959), which was not in our test set, and did so using 1,000 trees, whereas in our study 200 trees gave the best estimator.

Table 4.8: Comparison of Random Forest: This study v. Soe et al.
Dataset   This study Accuracy (200 trees)   Soe et al. Accuracy (1,000 trees)
KC1       0.73                              0.8682
PC1       0.91                              0.9324

Soe et al.’s final conclusion was that the number of “trees in the forest should be around a hundred because it is more stable for defect prediction to get the high accuracy”, which aligns closely with our finding that 200 trees were optimal in our experiments. This convergence suggests that while the exact number of trees may vary, a relatively high, but not excessively large, number of trees is beneficial for achieving stable and accurate results.

Ali et al. proposed an ensemble model for defect prediction.
Ensemble methods typically enhance prediction performance by combining the strengths of multiple classifiers. We compare our stacking classifier results with Ali et al.’s proposed ensemble for dataset PC1.

Table 4.9: Comparison with Ali et al.’s Proposed Ensemble for dataset PC1
Metric      This study   Ali et al.
Accuracy    0.5          0.9927
Precision   0.91         0.9986
Recall      0.5          0.9935
F-score     0.46         0.9513

The accuracy of Ali et al.’s model (0.9927) is significantly higher than the accuracy of the stacking classifier for PC1 after training on CM1 in this study (0.92). As shown in Table 4.6, the accuracy on PC1 when using the model trained on JM1 was only 0.09, bringing the average accuracy down to only 0.5. The recall of Ali et al.’s model (0.9935) is also significantly higher than the recall of the stacking classifier from this study (0.5). Higher recall means that Ali et al.’s model is more effective at identifying actual defects, thus reducing false negatives. Combining multiple classifiers, as done in Ali et al.’s proposed ensemble model, can significantly enhance prediction performance in defect prediction tasks, and their model outperforms the stacking classifier from this study across all metrics: accuracy, precision, recall, and F-score.

In summary, we note that Goyal’s FILTER technique could be beneficial for improving the effectiveness of SVM models in classification tasks and should be considered in future work. Our finding that 200 trees worked best with RF aligns with Soe et al.’s conclusion recommending around a hundred trees for optimal results. Our study also contributes to the body of knowledge by demonstrating the potential of stacking classifiers and identifying areas for future improvement. We highlight the effectiveness of Ali et al.’s ensemble method and suggest that further refinement and optimization of our stacking classifier could yield better results. These comparisons indicate that while our stacking classifier shows potential, further refinement and optimization are necessary to match or exceed the performance of advanced ensemble techniques.

4.6 Threats to Validity - Internal

Internal validity concerns the reliability of the methodology; by providing details of the methodology and process in Chapter 3, we allow other researchers to reproduce the same results. Conducting more thorough feature engineering could improve model performance and interpretability by ensuring that the features chosen are truly predictive for all models in the stacking ensemble. One possibility could be to include the use of principal component analysis (PCA) to address feature multicollinearity. The grid search for hyperparameter tuning was conducted with a limited set of values due to computational constraints and may not be extensive enough. Therefore, more extensive hyperparameter optimization using other techniques such as random search, together with greater processing power, could potentially yield better-performing models.

4.7 Threats to Validity - External Validity

External validity concerns the generalizability of our findings to other contexts. This study used separate projects for training and testing, but comprehensive cross-project evaluations were not our goal. Future studies could aim to perform extensive cross-project validations to better understand how well models generalize across different project environments and settings. This study relied on publicly available NASA datasets, which may not fully represent the diversity and complexity of real-world software projects.
The limited availability of high volumes of quality data from real-world production systems can limit the feasibility of studies. Although using publicly available data made our study easy to reproduce, data from more recent real-world projects would make the findings more directly applicable to current practice. Barriers to access include privacy concerns at both the personal and the industry level, and collecting such datasets would require funding, time and expertise.

Chapter 5

Conclusions and Future Work

We conclude that research to find reliable prediction methods is still needed because we cannot prove that there are no defects in software. Artificial intelligence, including deep learning and machine learning, is an area worthy of continued study, and this thesis contributes to the understanding of machine learning applications in software defect prediction and highlights areas for further research. The findings reinforce the importance of model selection and feature engineering. Our empirical validation of various machine learning models on publicly available datasets adds rigor to the study, offering reproducible and verifiable results.

5.1 Conclusions

This thesis contributes to the field of software defect prediction by empirically evaluating the performance of various machine learning models, including Logistic Regression, Random Forest, Support Vector Machine, and a stacking classifier combining these models. The findings highlight the importance of model selection and feature engineering to achieve accurate predictions.

Despite the demonstrated potential of the stacking classifier, it did not consistently outperform individual models, underscoring the need for further exploration of ensemble methods and more complex models. A more sophisticated ensemble method, or a neural network if appropriate for the size of the dataset, could be considered. Additionally, the comparative analysis with existing models, such as Ali et al.’s proposed ensemble, revealed areas where improvements can be made, particularly in terms of recall and F1-score.

While this study provides some insights into the application of machine learning for software defect prediction, it also identifies several avenues for future research. By addressing the identified threats to validity and exploring advanced techniques and more diverse datasets, future work can further enhance the reliability and usefulness of defect prediction models across different domains. Ensuring that defects are identified and addressed before deployment can save lives and reduce the economic burden of system failures.

5.2 Future Work

Using AI systems that translate natural language to code should be explored in future work. OpenAI Codex¹, for instance, claims to be trained specifically to understand and generate code. Future research could involve continued experimentation with different algorithms and automated hyperparameter tuning to optimize model performance. There is a need to investigate the potential for overfitting within our models. Some strategies to consider include implementing a different cross-validation strategy, such as Repeated or Randomized Stratified KFold, to reduce variance and ensure each class is properly represented (a brief sketch follows below). Also, implementing more sophisticated feature engineering methods, such as automatic feature selection, feature extraction using unsupervised learning, and incorporating additional context-specific metrics, could improve model accuracy and interpretability.
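As a concrete illustration of the cross-validation strategy suggested above, a minimal sketch assuming clf is any of the classifiers from Chapter 3 and X, y are the feature matrix and defect labels; the 5 folds x 10 repeats configuration is an example choice, not a recommendation from our experiments.

```python
# Sketch: repeated stratified cross-validation to reduce variance in the estimate.
# `clf`, `X` and `y` are assumed; the fold/repeat counts are illustrative.
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```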
Although this study used separate projects for training and testing, comprehensive cross-project evaluations were not our goal; they would, however, be an important area for future study. Conducting cross-project validation studies can evaluate and improve the generalizability of the models by testing them on datasets from projects with different programming languages, characteristics and domains. Exploring other advanced machine learning models, such as deep learning techniques, could provide new insights and potentially improve prediction performance. We found Artificial Neural Networks (ANN) to be the most popular deep learning technique in our literature review, and they could be particularly useful for handling complex software metrics. Acquiring and utilizing larger datasets from modern software projects across various domains would help validate the models’ applicability and robustness in different contexts. Collaborations with industry partners could facilitate access to such data. In industries where software failure can lead to loss of life or significant economic damage, prioritizing recall can significantly reduce risks. For instance, in the aviation industry, accurate defect prediction can prevent incidents like the Ethiopian Airlines flight 302 crash [5], which was partially attributed to software issues in the aircraft’s control system. Future advancements in this field hold the potential to significantly improve the quality and dependability of software, particularly in environments where safety is paramount.

¹ https://openai.com/index/openai-codex/

BIBLIOGRAPHY

[1] Akiyama, F. An Example of Software System Debugging. IFIP Congress (1) (1971), 353–359.

[2] Ali, A. R., Ur Rehman, A., Nawaz, A., Ali, T. M., and Abbas, M. An Ensemble Model for Software Defect Prediction. In 2022 2nd International Conference on Digital Futures and Transformative Technologies, ICoDT2 2022 (2022), Institute of Electrical and Electronics Engineers Inc.

[3] Aljamaan, H., and Alazba, A. Software defect prediction using tree-based ensembles. In Proceedings of the 16th ACM International Conference on Predictive Models and Data Analytics in Software Engineering (New York, NY, USA, 2020), PROMISE 2020, Association for Computing Machinery, pp. 1–10.

[4] Chen, J., Xu, J., Cai, S., Wang, X., Gu, Y., and Wang, S. An efficient dual ensemble software defect prediction method with neural network. In Proceedings - 2021 IEEE International Symposium on Software Reliability Engineering Workshops, ISSREW 2021 (2021), Institute of Electrical and Electronics Engineers Inc., pp. 91–98.

[5] Democratic Republic of Ethiopia Ministry of Transport and Logistics. Investigation Report on Accident to the B737-MAX8 Reg. ET-AVJ Operated By Ethiopian Airlines. Tech. rep., Aircraft Accident Investigation Bureau, 12 2022.

[6] Fenton, N., and Neil, M. A critique of software defect prediction models. IEEE Transactions on Software Engineering 25, 5 (1999), 675–689.

[7] Gong, L., Jiang, S., and Jiang, L. Conditional Domain Adversarial Adaptation for Heterogeneous Defect Prediction. IEEE Access 8 (2020), 150738–150749.

[8] Goyal, S. Effective software defect prediction using support vector machines (SVMs). International Journal of System Assurance Engineering and Management 13, 2 (4 2022), 681–696.

[9] Ha, D.-A., Chen, T.-H., and Yuan, S.-M. Unsupervised Methods for Software Defect Prediction.
In Proceedings of the 10th International Symposium on Information and Communication Technology (New York, NY, USA, 2019), SoICT ’19, Association for Computing Machinery, pp. 49–55.

[10] Halstead, M. H. Toward a theoretical basis for estimating programming effort. In Proceedings of the 1975 Annual Conference (New York, NY, USA, 1975), ACM ’75, Association for Computing Machinery, pp. 222–224.

[11] Herbold, S. On the Costs and Profit of Software Defect Prediction. IEEE Transactions on Software Engineering 47, 11 (11 2021), 2617–2631.

[12] Herzig, K., Just, S., and Zeller, A. It’s not a bug, it’s a feature: How misclassification impacts bug prediction. In 2013 35th International Conference on Software Engineering (ICSE) (2013), pp. 392–401.

[13] Huang, J., and Ling, C. X. Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering 17, 3 (3 2005), 299–310.

[14] Huda, S., Alyahya, S., Mohsin Ali, M., Ahmad, S., Abawajy, J., Al-Dossari, H., and Yearwood, J. A Framework for Software Defect Prediction and Metric Selection. IEEE Access 6 (12 2017), 2844–2858.

[15] ISO/IEC/IEEE. ISO/IEC/IEEE International Standard - Systems and software engineering – Systems and software assurance – Part 1: Concepts and vocabulary. ISO/IEC/IEEE 15026-1:2019(E) (2019), 1–38.

[16] Jureczko, M., and Madeyski, L. Towards Identifying Software Project Clusters with Regard to Defect Prediction. In Proceedings of the 6th International Conference on Predictive Models in Software Engineering (New York, NY, USA, 2010), PROMISE ’10, Association for Computing Machinery.

[17] Kamei, Y., Shihab, E., Adams, B., Hassan, A. E., Mockus, A., Sinha, A., and Ubayashi, N. A large-scale empirical study of just-in-time quality assurance. IEEE Transactions on Software Engineering 39, 6 (2013), 757–773.

[18] Krasner, H. The Cost of Poor Software Quality in the US: A 2022 report. Tech. rep., Consortium for Information & Software Quality, 12 2022.

[19] Li, R., Zhou, L., Zhang, S., Liu, H., Huang, X., and Sun, Z. Software Defect Prediction Based on Ensemble Learning. In Proceedings of the 2019 2nd International Conference on Data Science and Information Technology (New York, NY, USA, 7 2019), vol. 6, ACM, pp. 1–6.

[20] Marjuni, A., Adji, T. B., and Ferdiana, R. Unsupervised software defect prediction using signed Laplacian-based spectral classifier. Soft Computing 23, 24 (12 2019), 13679–13690.

[21] Matthews, B. W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 405, 2 (10 1975), 442–451.

[22] McCabe, T. J. A Complexity Measure. IEEE Transactions on Software Engineering SE-2, 4 (1976), 308–320.

[23] McCabe, T. J., and Butler, C. W. Design complexity measurement and testing. Commun. ACM 32, 12 (12 1989), 1415–1425.

[24] Menzies, T., Di Stefano, J., Chapman, M., and McGill, K. Metrics that matter. In 27th Annual NASA Goddard/IEEE Software Engineering Workshop, 2002. Proceedings. (12 2002), IEEE Comput. Soc, pp. 51–57.

[25] Mori, T., and Uchihira, N. Balancing the trade-off between accuracy and interpretability in software defect prediction. Empirical Software Engineering 24, 2 (4 2019), 779–825.

[26] NezhadShokouhi, M. M., Majidi, M. A., and Rasoolzadegan, A. Software defect prediction using over-sampling and feature extraction based on Mahalanobis distance. Journal of Supercomputing 76, 1 (1 2020), 602–635.
[27] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.

[28] Peters, F., and Menzies, T. Privacy and utility for defect prediction: Experiments with MORPH. In 2012 34th International Conference on Software Engineering (ICSE) (6 2012), IEEE, pp. 189–199.

[29] Phan, V. A. Learning Stretch-Shrink Latent Representations With Autoencoder and K-Means for Software Defect Prediction. IEEE Access 10 (2022), 117827–117835.

[30] Shepperd, M., Song, Q., Sun, Z., and Mair, C. Data Quality: Some Comments on the NASA Software Defect Datasets. IEEE Transactions on Software Engineering 39, 9 (9 2013), 1208–1215.

[31] Soe, Y. N., Santosa, P. I., and Hartanto, R. Software Defect Prediction Using Random Forest Algorithm. In 2018 12th South East Asian Technical University Consortium (SEATUC) (2018), vol. 1, pp. 1–5.

[32] Song, Q., Guo, Y., and Shepperd, M. A Comprehensive Investigation of the Role of Imbalanced Learning for Software Defect Prediction. IEEE Transactions on Software Engineering 45, 12 (12 2019), 1253–1269.

[33] Srivastava, A. N., and Schumann, J. The case for software he