AN EMPIRICAL EXPLORATION OF ARTIFICIAL INTELLIGENCE FOR SOFTWARE DEFECT PREDICTION IN SOFTWARE ENGINEERING by Elaine Cahill July, 2024 Director of Thesis: Madhusudan Srinivasan, PhD Major Department: Computer Science Artificial Intelligence (AI) is an important topic in software engineering not only for data analysis and pattern recognition, but for the opportunity of finding solutions to problems that may not have explicit rules or instructions. Reliable prediction methods are needed because we cannot prove that there are no defects in software. Deep learning and machine learning have been applied to software defect prediction in the attempt to generate valid software engineering practices since at least 1971. Avoiding safety-critical or expensive system failures can save lives and reduce the economic burden of maintaining systems by preventing failures in systems such as aviation software, medical devices, and autonomous vehicles. This thesis contributes to the field of software defect prediction by empirically evaluating the performance of various machine learning models, including Logistic Regression, Random Forest, Support Vector Machine, and a stacking classifier combining these models. The findings highlight the importance of model selection and feature engineering in achieving accurate predictions. We followed this with a stacking classifier that combines Logistic Regression, Random Forest, and Support Vector Machine (SVM) to see if that improved predictive performance. We compared our results with previous work and analyzed which features or attributes appeared to be effective in predicting defects. We end by discussing potential next steps for further research based on our work results. AN EMPIRICAL EXPLORATION OF ARTIFICIAL INTELLIGENCE FOR SOFTWARE DEFECT PREDICTION IN SOFTWARE ENGINEERING A Thesis Presented to The Faculty of the Department of Computer Science East Carolina University In Partial Fulfillment of the Requirements for the Degree Master of Science in Software Engineering by Elaine Cahill July, 2024 Director of Thesis: Madhusudan Srinivasan, PhD Thesis Committee Members: Nic Herndon, PhD Wu Rui, PhD Copyright Elaine Cahill, 2024 DEDICATION This work is dedicated to the memory of my husband, James Cahill. ACKNOWLEDGEMENTS I would like to express love and thanks to my son, John, for all the support and encouragement throughout my studies. I could not have got through them without your help. Sincere thanks to Rolf, who has the patience of a saint, and for his unlimited support and love. My brother, John Vickers, and his wife Debra for believing in me. Sincere thanks to Dr. Mark Hills who started this journey with me, and to Dr. N. Herndon, Dr. R. Wui, Dr. M.N.H. Tabrizi and especially Dr. M. Srinivasan for getting me over the finish line. Table of Contents LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x CHAPTER 1: INTRODUCTION . . . . . . . . . . . . . . . . . . . . . 1 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Motivations for this Study . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Purpose of this Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.5 Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.6 Research Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . 
8 CHAPTER 2: RELATED WORK . . . . . . . . . . . . . . . . . . . . . 9 2.1 Supervised Machine Learning . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Ensemble Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Unsupervised Machine Learning . . . . . . . . . . . . . . . . . . . . . 13 2.4 Deep Learning Models . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.5 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 CHAPTER 3: METHODOLOGY . . . . . . . . . . . . . . . . . . . . . 16 3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.1.1 Choice of Model . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2 Dataset Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.2.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.2.2 Data Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.2.3 Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.3 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.3.1 One-hot Encoding . . . . . . . . . . . . . . . . . . . . . . . . 23 3.3.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.3.3 Feature Importance . . . . . . . . . . . . . . . . . . . . . . . . 24 3.4 Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.5 Hyperparameter Tuning . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.5.1 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . 29 3.5.2 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.5.3 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . 30 3.5.4 Stacking Ensemble . . . . . . . . . . . . . . . . . . . . . . . . 30 3.6 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.7 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.8 Model Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 CHAPTER 4: RESULTS AND ANALYSIS . . . . . . . . . . . . . . . 35 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.2 RQ1: What are the key indicators in software code that machine learning models can use to predict defects? . . . . . . . . . . . . . . . 36 4.2.1 Correlation and Multicollinearity . . . . . . . . . . . . . . . . 36 4.2.2 Feature Importance . . . . . . . . . . . . . . . . . . . . . . . . 39 4.3 RQ2: Which classifier among logistic regression, random forest, and sup- port vector machine provides the most accurate predictions of software defects? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.3.1 ROC AUC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.3.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.3.3 Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.3.4 Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.3.5 F1 score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.4 RQ3: Can an ensemble using a stacking classifier that combines lo- gistic regression, random forest, and support vector machine improve predictive performance? . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.5 How do our results compare with similar research? . . . . . . . . . . . 51 4.6 Threats to Validity - Internal . . . . . . . . . . . . . . . . . . . . . . 56 4.7 Threats to Validity - External Validity . . . . . . . . . . . . . . . . . 
56 CHAPTER 5: CONCLUSIONS AND FUTURE WORK . . . . . . . 58 5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 APPENDIX A:LITERATURE REVIEW SUMMARY . . . . . . . . 66 A.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 A.2 Search Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 A.2.1 ACM Digital Library . . . . . . . . . . . . . . . . . . . . . . . 67 A.2.2 IEEE Xplore . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 A.2.3 Springer Link . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 A.3 Machine Learning Models . . . . . . . . . . . . . . . . . . . . . . . . 68 A.4 Defect Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 A.5 Repositories of Defect Datasets . . . . . . . . . . . . . . . . . . . . . 71 LIST OF TABLES 3.1 Overview of MDP ′′ Defect Datasets . . . . . . . . . . . . . . . . . . 19 3.2 Dataset Attribute Existence Verification . . . . . . . . . . . . . . . . 22 3.3 Dataset CM1 Column Values Before and After Scaling . . . . . . . . 23 3.4 Training Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.5 Test Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.1 90th Percentile of Coefficients . . . . . . . . . . . . . . . . . . . . . . 40 4.2 Accuracy By Test Dataset . . . . . . . . . . . . . . . . . . . . . . . . 46 4.3 Precision By Test Dataset . . . . . . . . . . . . . . . . . . . . . . . . 47 4.4 Recall By Test Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.5 F1 - score By Test Dataset . . . . . . . . . . . . . . . . . . . . . . . . 49 4.6 Stacking Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.7 Comparison of SVM with SVM using FILTER . . . . . . . . . . . . . 53 4.8 Comparison of Random Forest: This study v. Soe et al. . . . . . . . . 54 4.9 Comparison of Ali et al. Proposed Ensemble for dataset PC1 . . . . . 55 A.1 Repositories of Defect Datasets . . . . . . . . . . . . . . . . . . . . . 73 LIST OF FIGURES 1.1 Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) 4 1.2 Support Vector Machine Hyperplane . . . . . . . . . . . . . . . . . . 5 2.1 Deep Learning Models Usage . . . . . . . . . . . . . . . . . . . . . . 14 3.1 Machine Learning Workflow . . . . . . . . . . . . . . . . . . . . . . . 17 3.2 Correlation Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.3 Stacking Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.1 Correlation Heatmaps . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.2 Distribution of Absolute Values of Coefficients . . . . . . . . . . . . . 39 4.3 Random Forest Feature Importance . . . . . . . . . . . . . . . . . . . 42 4.4 ROC Curves Comparison . . . . . . . . . . . . . . . . . . . . . . . . . 45 A.1 Papers Per Year by Database . . . . . . . . . . . . . . . . . . . . . . 66 A.2 Top Ten Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 A.3 Deep Learning Models Usage . . . . . . . . . . . . . . . . . . . . . . 70 A.4 Top Ten Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 A.5 Top Ten Repositories . . . . . . . . . . . . . . . . . . . . . . . . . . . 
72 Chapter 1 Introduction 1.1 Background Predictive software maintenance and defect prediction studies have been documented since at least 1971 [1]. Artificial Intelligence (AI) is an important topic in software engineering not only for data analysis and pattern recognition, but for the opportunity of finding solutions to problems that may not have explicit rules or instructions. Deep learning and machine learning have been applied to software defect prediction (SDP) in the attempt to generate valid software engineering practices. The ultimate goal for study in the area of defect prediction was proposed by Srivastava et al. [33] “A mathematical theory of software failure needs to be developed which allows for the development of provably correct algorithms for detecting, diagnosing, predicting, and mitigating the adverse events of software issues.” Whilst this paper does not present such a theory, and we still cannot prove that there are no defects in software, we do try to add to the body of work by presenting results from our research. For clarification, the following definitions of error, fault and failure are from “International Standard - Systems and software engineering - Systems and software assurance - Part 1: Concepts and vocabulary” [15]. 1. Failure (3.4.9) is a termination of the ability of a system to perform a re- quired function, or its inability to perform within previously specified limits; an externally visible deviation from the system’s specification. 2. A fault (3.4.6) is a defect in a system or a representation of a system that if executed or activated can potentially result in an error. 3. An error (3.4.5) is a discrepancy between a computed, observed or measured value or condition and the true, specified or theoretically correct value or condition. Errors can also be human mistakes where an action produces an incorrect result. Using machine learning to predict defects or faults is an attempt to discover them before a failure occurs. Static analysis of code can warn of modules that could benefit from refactoring. If modules exist with no unit test coverage, it does not automatically mean that errors will occur in that module, but it would increase confidence in the software if code coverage was closer to 100%. Another approach could be to monitor the condition of the system by running tests continuously, or checking that deployed code has not been changed unknowingly. Also, sampling or testing to verify that bad data has not been added to the system that would cause issues when selected or updated would be useful. An attempt to measure the economic impact of poor quality software can be found in the report to the Consortium for Information & Software Quality [18] where Krasner shows that the cost of finding and fixing defects in the United States in 2022 was $607 billion. Reliable prediction methods can be used to avoid safety-critical or expensive system failures. “Defects, like quality, can be defined in many different ways but are more commonly defined as deviations from specifications or expectations which might lead to failures in operation.” [6]. Complexity and size metrics have been used for many years in an attempt to predict the number of defects that will be found in testing or operating in production. Reliability models have been developed to predict failure 2 rates based on the expected operational usage profile of the system, for example, 1. predicting the number of defects in the system, 2. 
estimating the reliability of the system in terms of time to failure and, 3. understanding the impact of design and testing processes on defect counts and failure densities. Approaching software defect prediction as a binary classification problem, we first use three machine learning algorithms, Random Forest (RF), Support Vector Ma- chines(SVM), and Logistic Regression (LR). After measuring performance individually we implement a stacking classifier to compare performance against individual results. Logistic regression predicts the probability of an event occurring such as whether or not a software module is defective or not i.e., the dependent variable is categorical. It is easy to implement and interpret, making it a good starting point for experiments. Results can be also be displayed graphically such as with a Receiver Operating Characteristic or ROC curve, as shown in Figure 1.1. This plots the true positive rate against the false positive rate, the blue line, and shows that area under the curve (AUC) in this example for dataset KC1 after being trained with dataset CM1 is 0.67. The dashed line labelled “No Skill” is what we expect for random guessing. Random Forest (RF) is an ensemble learning technique used in machine learning. It uses a collection of decision trees. Decision trees can be used to map the possible outcomes of a decision and each node of a decision tree represents a possible outcome. Percentages are assigned to nodes based on the likelihood of the outcome occurring. Voting is the final step to combine the predictions of all the trees to make the final prediction. By having multiple decision trees in the random forest algorithm, predictions can be more accurate by accounting for variability in the data, therefore reducing the risk of bias and overfitting. 3 Figure 1.1: Receiver Operating Characteristic (ROC) Area Under the Curve (AUC): Example ROC curve. Support Vector Machines (SVM) are supervised learning models used for classifi- cation and regression analysis. It finds the hyperplane that best separates the classes and tries to maximize the margin between the classes. An example is shown in Figure 1.2 where two classes, the blue and orange dots, are separated by a central solid line with dotted lines to represent the widest margin of separation. The dots, or nodes, nearest to the margins are called the support vectors. 4 Figure 1.2: Support Vector Machine Hyperplane: The dots, or nodes, nearest to the margins are called the support vectors. Source: https://scikit-learn.org/stable/modules/svm.html. 1.2 Motivations for this Study Reliable prediction methods can be used to avoid safety-critical or expensive system failures. Preventing system failure can not only reduce the economic impact of events, but also save lives. Consider air craft disasters such as, the Ethiopian Airlines flight 30 on March 10, 2019 where all 157 passengers and crew on board died. The accident involved the Boeing 737 Max 1. The Ethiopian investigation [5] documents the problem with the angle of attack in the flight software called Maneuvering Characteristics Augmentation System (MCAS). The investigation’s report was published in 2022 and 1https://www.ntsb.gov/news/press-releases/Pages/NR20190314.aspx 5 it explains that the design of MCAS pushed the jet’s nose down, even as the pilot attempted to pull it up, and was a major cause of the accident. 
A well documented example of how software issues can cause expensive failures was Knight Capital2 that lost over $460 million one day on August 1 2012 when its automated system, Smart Market Access Routing System (SMARS), a high-speed order router for equity orders failed. According to the Securities and Exchange Commission (SEC) [39], “Knight did not have adequate controls in SMARS to prevent the entry of erroneous orders.” One feature of SMARS was to receive orders from upstream components in the trading platform, and then, as needed based on the available liquidity and price, send one or more orders downstream for execution. The SEC report describes how upstream orders were processed by defective, legacy ‘Power Peg’ code, and SMARS sent millions of downstream orders, resulting in erroneous executions in stocks and shares in approximately 45 minutes. Knight was unable to cover the unintended positions, resulting in $460 million loss and were ordered to pay a $12 million penalty by the SEC. 1.3 Purpose of this Study The purpose of this study is to contribute to the work on code quality, specifically software defect prediction and maintenance in the fields of software engineering and AI. We build upon previous work to advance the understanding and application of machine learning to software defect prediction by experimenting with different machine learning models, including LR, RF, SVM, and a stacking classifier that combines these models. Reliable defect prediction can have far-reaching impacts beyond immediate software quality and so we also contribute to the broader goal of development of reliable prediction methods that can minimize the impact of defects in software costs 2https://www.wsj.com/articles/SB10000872396390443866404577565402442234764 6 associated with fixing defects post-deployment. It can prevent disasters and save lives in safety-critical systems e.g., aviation and medical devices, and reduce costs associated with software maintenance and technical debt. Another purpose of this study is to identify and analyze the key indicators in software code that machine learning models can use to predict defects. Understanding which features are most predictive of defects can focus efforts on the most critical aspects of the code, thereby improving code quality and reducing the incidence of defects. 1.4 Research Questions In order to add to the body of work on software defect prediction and the progress towards the main idea of reaching provably correct algorithms [1], we consider the following research questions: • RQ1: What are the key indicators in software code (attribute/feature importance) that machine learning models can use to predict defects? • RQ2: Which classifier among logistic regression, random forest, and support vector machine provides the most accurate predictions of software defects? • RQ3: Can an ensemble using a stacking classifier that combines logistic regression, random forest, and support vector machine improve predictive performance? Answering these questions gives us an insight into how machine learning could be applied in practice and consider where to focus further research. 1.5 Structure of the Thesis The rest of the thesis is structured as follows. Related work is described in Chapter 2. Chapter 3 describes the research methodology, data preparation, performance measures and training setup. Chapter 4 presents the results and analysis of our work 7 as it applies to answer the research questions. We conclude our work in Chapter 5 and propose future work. 
1.6 Research Contribution To the best of our knowledge no one has trained a stacking classifier specifically with RF, SVM and LR on software defect data. In this paper we use the Scikit library [27] to fill this gap in current research by implementing a novel application of stacking classifier to stack the predictions of three supervised learning base models RF, SVM and LR, and compute a final prediction. We contribute to the field of software engineering and AI by presenting results aimed at the development of more reliable and cost-effective software systems. For example, we identify and analyze the key indicators in software code that machine learning models can use to predict defects. We benchmark the performance of LR, RF, and SVM models against each other, as well as against the proposed stacking classifier. Providing this comparison provides insight into the strengths and weaknesses and a view of the most effective approach for different datasets and context. Our evaluation of models on datasets from different projects written in the same programming language (C/C++) provides an assessment of model performance and ability to generalize. Cross-dataset evaluation demonstrates the applicability and limitations of the models across software projects. The importance of investing in reliable defect prediction methods can lead to substantial economic savings and enhanced safety in critical systems and justifies further research and development in this area. Based on the findings of this study, we also suggest areas where further work and refinement of models could yield even better results. 8 Chapter 2 Related Work In this chapter, we discuss existing approaches in applying artificial intelligence (AI), machine learning in particular, to software defect prediction (SDP). In the sections relating to supervised machine learning and ensemble learning we highlight how they are different from this study. For a detailed summary of the results from performing a systematic review of software defect prediction research, see the appendix. A complete list of repositories used in the period is provided in the appendix A.1. There have been a number of studies relating to the use of AI in the form of using machine learning and deep learning models to predict defects in a piece of software using the NASA MDP datasets. They have attempted to predict which module is defective and needs attention. The current literature applies many machine learning models to software defect prediction. The most popular machine learning algorithms are Random Forests (RF), Näıve Bayes (NB), and Support Vector Machine (SVM). The top ten most used algorithms accounted for 41% of all models and techniques used in the research papers in our final selection. Figure A.2 lists these top ten models used by researchers for the search period. Traditional software defect prediction links defects with module size and complexity. Some techniques found in the review that are not directly classified as machine learning or traditional SDP were class dependency networks and Augmented- Code Property Graph (CPG). For example, Xu et al. [42, 43] in their papers on CPG and augmented code graph for defect prediction (ACGDP) presented graphical neural networks (GNN) for obtaining defect characteristics. Under the broader category of data mining techniques, Zang and Ren [47] propose a new method for software defect prediction based on self-organizing data mining (SODM) specifically applied to finance software systems. 
They used SVM as the binary classification model but focused on predicting if a change to the system was a faulty change or non-faulty change rather than if a module was faulty as we do in this paper. Atomic Rule Mining was used along with Random Forest by Thapa et al. [38] and Association Rule Mining by Wu et al. [41]. Our work also includes supervised machine learning with two of the most popular algorithms SVM and RF. Although we use RF in this work, it is in the conventional machine learning way, without rule mining. We do not employ rule mining or data mining because the data we use already exists as preprocessed datasets. 2.1 Supervised Machine Learning Classification with supervised learning has been well-explored but requires a set of data that has already been labelled or classified. In the context of software defect prediction, researchers have used sets of data where the code modules are already labelled as defective or not, often using a collection of history of failures to predict future software failures. Soe et al. [31] used the Random Forest algorithm and showed a maximum accuracy of prediction up to 99.59 with the minimum accuracy of 85.96 by using a hundred trees for software defect prediction. They concluded that around a hundred trees was more stable to get high accuracy and the details on the hyperparameter tuning are restricted to the number of trees only. Our work employs multiple hyperparameters over a grid search in attempt to find a good Random Forest 10 model, and we compare our results with those of Soe et al. In “Effective Software Defect Prediction Using Support Vector Machines (SVMs)” Goyal [8] proposed a filtering technique for effective defect prediction. Their FILTER method aimed to remove data points from the majority class (non-faulty instances) that were in close proximity to the faulty instances therefore creating a more balanced dataset. Although the features are listed, there is no discussion of feature selection or engineering by any other method before applying the novel filter. They concluded that using the FILTER with an SVM based software defect prediction model enhanced performance by 16.73%, 16.80% and 7.65% in terms of accuracy, AUC and F-measure respectively. We compare our SVM-Linear Kernel with no filtering with the SVM-Linear Kernel with filtering. 2.2 Ensemble Learning Ensemble learning includes stacking, boosting, bagging and voting. It is not unusual in this area of research to see multiple techniques combined as an ensemble to compare and contrast research results. Dr. R. Wu (Machine Learning lecture, June 16, 2021) discussed ensembling, “The main principle (assumption) behind the ensemble model is that a group of weak learners come together to form a strong learner, thus increasing the accuracy of the model.” Some work has used an ensemble approach to consider the class imbalance problem of binary classification that is not usually an even split. For example, the number of defective instances are less than the number of non- defective instances. Ali et al. [2] proposed an ensemble model that they trained on NASA datasets from the PROMISE repository. In their proposed model bagging and voting are used. They observed that their ensemble learning model produced better results than SVM, K-Nearest Neighbors (KNN), KNN, Decision Trees (DT) 11 or RF. However, the high accuracy reported for the ensemble model, 99.27%, raises concerns about potential overfitting, and there is no discussion on hyperparameter tuning. 
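To make the intuition behind this kind of proximity-based filtering concrete, the short sketch below drops non-defective (majority-class) instances that lie close to defective ones. It is only a generic illustration under our own assumptions (a fixed Euclidean radius and a scikit-learn NearestNeighbors search); it is not a reimplementation of Goyal's FILTER.

```python
# Illustrative sketch only: a generic proximity-based undersampling of the
# majority (non-defective) class, in the spirit of filtering approaches that
# re-balance the data. The radius value is an arbitrary assumption.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def drop_majority_near_minority(X, y, radius=0.5):
    """Remove majority-class (y == 0) samples within `radius` of any
    minority-class (y == 1) sample, reducing class overlap."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=1).fit(X[y == 1])
    dist, _ = nn.kneighbors(X)                  # distance to nearest defective instance
    keep = (y == 1) | (dist.ravel() > radius)   # keep all defects; drop nearby non-defects
    return X[keep], y[keep]
```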
There is also no mention of cross-validation other than it appearing in Figure 2, but with no discussion on how cross-validation was performed, such as the number of folds used. It is stated that a “Chi-squared statistic method was applied to evaluate the importance of the feature and selection.” which is different to our approach of correlation. We compare our stacking classifier results with the proposed ensemble. Chen et al. [4] used seven base classifiers re-ensembled with a neural network to propose a class-imbalance solution they called dual ensemble software defect prediction (DE-SDP). They concluded that the improved G-mean scores, equation 2.1, indicated a “higher classification accuracy on defective data and is more robust to imbalanced data.” Stacking is an ensemble learning technique that combines the predictions of multiple base models using another model called a meta-learner. The base models’ predictions serve as features for the meta-learner, which learns how to best combine them to make the final prediction. In this paper we use the Scikit library [27] implementation StackingClassifier to stack the predictions of three supervised learning base models RF, SVM and Logistic Regression (LR), and compute a final prediction. Boosting is a machine learning ensemble technique where multiple weak learners are combined sequentially to form a strong learner. In boosting, each new learner focuses on the examples that previous learners struggled with, gradually improving overall performance. AdaBoost, Gradient Boosting, and XGBoost are examples of boosting implementations. Bagging (Bootstrap Aggregating) is an ensemble technique where multiple models are trained independently on different subsets of the training data, typically created by sampling with replacement. The final prediction is often made by averaging or voting on the predictions of all models. Voting is a simple ensemble technique where multiple models make predictions on a given input, and the final 12 prediction is determined by a majority vote for classification tasks (or averaging for regression tasks) of the individual predictions. Random Forests are an example of an ensemble method built on decision trees. These techniques are commonly used in ensemble learning to improve the overall performance and robustness of machine learning models. 2.3 Unsupervised Machine Learning Unsupervised learning means that already labelled or classified data is not available, such as on new projects where there is not yet a history of failures to draw from. In this paper, we use supervised learning in our experiments because our data is already labelled as defective or non-defective. Data from different projects is not easy to reuse in a cross-project or generalized way. Marjuni [20] attempts to address this issue of a lack of a training dataset with spectral classifier. Also, work performed by Ha [9] provided a framework for an unsupervised classifier that was made available to other researchers. We use the NASA MDP datasets from different projects and could therefore argue that we attempt cross-project defect prediction by trying the final model fitted and tested on one project’s data on a different project’s data. 2.4 Deep Learning Models Deep learning is a subset of machine learning and uses neural networks to see patterns from large amounts of data and was used in 35% of the papers surveyed. 
We found Artificial Neural Networks (ANN) to be the most popular deep learning technique, appearing in 23 of 96 (24%) papers that applied a deep learning model. ANN also appeared in the top ten machine learning models overall, as seen in Figure A.2. In “Automated Parameter Tuning of Artificial Neural Networks for Software Defect Prediction” [44] the main objective was to validate that ANN defect prediction models with tuned parameter settings outperform models with default parameter settings. The results showed that the models trained with optimized parameter settings did outperform the models trained with default parameter settings. However, in this paper we do not use deep learning models as they are suited to larger datasets than the NASA MDP datasets.

Figure 2.1: Deep Learning Models Usage: Deep learning is a subset of machine learning and was found in 35% of the papers surveyed.

2.5 Metrics
The performance evaluation criteria for our work are presented in Section 3.8 and include widely accepted evaluation metrics. The most used performance metrics in the literature review are accuracy, Equation (3.9), precision, Equation (3.10), and recall, Equation (3.11), along with the F1-score and the Receiver Operating Characteristic (ROC) curve and area under the ROC curve (AUC), which are defined in Section 3.8. Other metrics applied for evaluating classification models include the Matthews Correlation Coefficient (MCC) and G-mean. We do not apply MCC in this paper, primarily because we want to compare our results with the metrics most used in related work. The Matthews Correlation Coefficient is a measure of the quality of binary classification models. It takes into account true positives, true negatives, false positives, and false negatives, and provides a value between -1 and 1. A coefficient of 1 represents a perfect classifier, 0 indicates a random classifier, and -1 indicates a completely incorrect classifier. It was introduced in 1975 by B.W. Matthews [21], a biochemist looking for a way to assess the quality of secondary structure predictions in proteins. It is now also used as a performance metric for binary classification models [26, 32, 45]. The geometric mean (G-mean) is a mathematical concept equal to the nth root of the product of a group of n numbers. G-mean, Equation (2.1), ranges from 0 to 1, with 1 being the best possible score.

G-mean = √(precision × recall)    (2.1)

By calculating the geometric mean of precision, Equation (3.10), and recall (the true positive rate), Equation (3.11), it provides an overall measure of classification performance. In addition to Chen et al. [4] mentioned earlier, Tabassum et al. [36] use G-mean to demonstrate improvements in Cross-Project (CP) and Just-In-Time (JIT-SDP) software defect prediction.

Chapter 3
Methodology

3.1 Overview
Using data with a specific algorithm creates a machine learning model. In the following sections we describe the steps taken to create our learning models and provide details about selecting the hyperparameters, training, making predictions, and assessing the models’ performance. Testing and further validation are performed using the trained or fitted model to make predictions on unseen data that was held back from training. We use a publicly available collection of datasets derived from the NASA Metrics Data Program (MDP). It contains metrics gathered from various software projects developed within NASA, and these datasets are used extensively in software defect prediction research.
For example, dataset CM1 from the collection was found in 31% of our literature survey papers. Our methodology follows a standard workflow for applying a supervised machine learning model to labelled data. An overview of the steps is shown in Figure 3.1 and the steps are detailed in the following chapters of this paper. 3.1.1 Choice of Model It is generally noted that no single algorithm is universally the best for all scenarios, and it’s often beneficial to experiment with multiple algorithms to find the one that performs best for a specific problem. Therefore, we chose four training algorithms Start Data Collection Data Cleaning Feature Engineering Split Data (Train/Test) Train Model Evaluate Model Tuning Required? Retrain Model Final Predictions Yes No Figure 3.1: Machine Learning Workflow: Overview of Process 17 Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), and in the final step, used a stacking classifier comprised of the LR, RF and SVM to see if predictions outperformed the individual models. LR is easy to interpret because it directly models the probability of a binary outcome, such as defect prediction. This made it a good starting point for our experiments. RF was chosen because combining multiple decision trees through ensemble learning, random forests reduce overfitting. SVM works to maximize the margin between classes while minimizing the classification error. This regularizes the model and prevents over-fitting. We felt this was important as we were working with small datasets. The individual results from LR, RF and SVM were used to answer RQ2: Which classifier among Logistic Regression (LR), random forest (RF), and support vector machine (SVM) provides the most accurate predictions of software defects? 3.2 Dataset Preparation First, data has to be either collected specifically or obtained from an existing source. If it does not already exist the cost and time needed to gather it can be expensive and prohibitive. Once data is obtained, any quality issues must be resolved such as duplication, missing data, values that are outside of a useful range and so on. This is referred to as sanitizing or cleaning the data. Feature engineering involves selecting, transforming, or creating new features from data attributes to improve the performance of machine learning algorithms. Finally, data is split into training and test datasets. It is essential that some data is set by to test the trained model so that it does not just repeat the results learned on the training data. These steps are described in more detail in the following sub sections. 18 Table 3.1: Overview of MDP ′′ Defect Datasets Dataset Project Total Defective % Defective Instances Instances CM1 Spacecraft instrumentation. 327 42 12.84% JM1 Real-time predictive ground sys- tem that uses simulations to gen- erate predictions for missions. 7,782 1,672 21.49% KC1 Storage management system for receiving and processing ground data. 1,183 314 26.54% PC1 Flight software for earth orbiting satellite. 705 61 8.65% 3.2.1 Data Collection For this study we obtained the publicly available cleaned NASA Metric Data Program MDP ′′ collection1 which is also popular in recent papers [3, 25, 37]. The authors state that they used the original versions of the data sets from the NASA MDP repository. We chose only to use the datasets created from projects written in C/C++ to avoid adding to the complexity of also combining Java and Perl projects. 
Starting with a common language base could allow for a more systematic approach to model development and evaluation. Also, we only use datasets where we could trace originals and information such as language used and project name. Table 3.1 provides an overview of the final four datasets. Note that the % Defective is calculated as the ratio of Defective instances to total instances. This is useful as a way of measuring the class-imbalance in that the Defective and non-Defective instances are not evenly split. 1https://figshare.com/collections/NASA_MDP_Software_Defects_Data_Sets/4054940 19 3.2.2 Data Cleaning To clean the original versions of the data sets from the NASA MDP repository MDP ′′ datasets, Sheppard et al. [30] performed initial preprocessing. The preprocessing involved binarization of the Defective class variable by calling Defective if the error count was ≥ 1. The ‘unique module identifier’ was removed as it gave no information toward the defectiveness of a module, along with other error data attributes such as age of error and error density. The remaining steps were: 1. Remove cases with implausible values. 2. Remove cases with conflict feature values. 3. Remove identical cases. 4. Remove inconsistent cases. 5. Remove cases with missing values. 6. Remove constant features. 7. Remove identical features. During our exploratory analysis, we observed that not every attribute appeared in all datasets so we performed a comparison of the MDP ′′ attributes in each dataset file. Table 3.2 shows the results of this analysis where the highlighted rows show the attributes that existed across all datasets. These common attributes became the n features in the features matrix. The Defective field states whether or not the module has one or more reported defects and became our one-dimensional target array. All 21 features are described in Section 3.3. The NASA data instances are referred to as a module to represent the unit of functionality. The name can differ among 20 languages with modules, functions or methods referred to as units of functionality. The original data was provided in Attribute-Relation file format (.arff), therefore, to avoid repetitively loading the file and dropping unwanted features during code execution, our final preprocessing step was to write out only the data for common attributes to .csv files. 3.2.3 Scaling Some algorithms, such as LR fits the data to a curve, so scaling was necessary. Where appropriate we applied the StandardScaler from Scikit-0.60 to transform each column. For example, as shown in Table 3.3, dataset CM1 in the column for branch count has a minimum of three and a maximum of 162, with a standard deviation of 16.8 and a mean of 13.0. After transforming the data to have a mean value of zero and a standard deviation of one, the minimum and maximum are -0.6 and 8.86. RF is an example where the data did not need to be scaled because it is an ensemble of decision trees, making decisions by splitting nodes based on the order of the data rather than their specific value. 3.3 Feature Engineering Feature engineering involves making decisions about which features to include in the model or if more data for new features is needed. We attempt to assess the usefulness of features to simplify the model and decrease complexity, and to make the output easier to analyze and interpret. 
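As a concrete illustration of the dataset preparation steps in Section 3.2 (loading the cleaned .arff files, keeping only the common attributes, binarizing the Defective label, and standardizing columns as in Section 3.2.3), a minimal sketch follows. The file paths, column names, and the shortened attribute list are placeholders rather than the exact ones used in this study.

```python
# Minimal sketch of the dataset preparation in Section 3.2. File names and
# the (shortened) attribute list below are illustrative placeholders.
import pandas as pd
from scipy.io import arff
from sklearn.preprocessing import StandardScaler

COMMON_FEATURES = ["BRANCH_COUNT", "CYCLOMATIC_COMPLEXITY", "HALSTEAD_VOLUME"]  # 3 of the 21

def arff_to_csv(arff_path, csv_path):
    """Load a cleaned MDP .arff file, keep the common attributes plus the
    Defective label (mapped from Y/N to 1/0), and write the result to .csv."""
    data, _meta = arff.loadarff(arff_path)
    df = pd.DataFrame(data)
    df["Defective"] = (df["Defective"].str.decode("utf-8") == "Y").astype(int)
    df[COMMON_FEATURES + ["Defective"]].to_csv(csv_path, index=False)

def scale_features(df):
    """Standardize feature columns to zero mean and unit variance (Section 3.2.3)."""
    scaled = StandardScaler().fit_transform(df[COMMON_FEATURES])
    return pd.DataFrame(scaled, columns=COMMON_FEATURES)
```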
This part of the study will allow us to answer RQ1: What are the key indicators in software code (attribute/feature importance) that machine learning models can use to predict defects?

Table 3.2: Dataset Attribute Existence Verification. A highlighted row (all 1s) indicates that the attribute exists in all four datasets.
File attribute                     CM1  JM1  KC1  PC1
Branch count                       1    1    1    1
Call pairs                         1    0    0    1
Condition count                    1    0    0    1
Cyclomatic complexity              1    1    1    1
Cyclomatic density                 1    0    0    1
Decision count                     1    0    0    1
Decision density                   1    0    0    1
Defective {Y,N}                    1    1    1    1
Design complexity                  1    1    1    1
Design density                     1    0    0    1
Edge count                         1    0    0    1
Essential complexity               1    1    1    1
Essential density                  1    0    0    1
Global data complexity             0    0    0    0
Global data density                0    0    0    0
Halstead content                   1    1    1    1
Halstead difficulty                1    1    1    1
Halstead effort                    1    1    1    1
Halstead error est                 1    1    1    1
Halstead length                    1    1    1    1
Halstead level                     1    1    1    1
Halstead prog time                 1    1    1    1
Halstead volume                    1    1    1    1
Loc blank                          1    1    1    1
Loc code and comment               1    1    1    1
Loc comments                       1    1    1    1
Loc executable                     1    1    1    1
Loc total                          1    1    1    1
Maintenance severity               1    0    0    1
Modified condition count           1    0    0    1
Multiple condition count           1    0    0    1
Node count                         1    0    0    1
Normalized cyclomatic complexity   1    0    0    1
Num operands                       1    1    1    1
Num operators                      1    1    1    1
Num unique operands                1    1    1    1
Num unique operators               1    1    1    1
Number of lines                    1    0    0    1
Parameter count                    1    0    0    1
Percent comments                   1    0    0    1

Table 3.3: Dataset CM1 Column Values Before and After Scaling
Branch Count   Original     Scaled
Mean           13.015291    0
Std            16.843098    1
Min            3            -0.60
Max            162          8.86

3.3.1 One-hot Encoding
One-hot encoding is a technique we used to convert the Defective target attribute from Y, N to 0 and 1, creating a binary vector where 1 indicates that a defect is present and 0 that no defect is present.

3.3.2 Correlation
Pearson’s correlation coefficient (r). British mathematician Karl Pearson’s work on the correlation coefficient was published in 1896 [34]; r is calculated by dividing the covariance of the variables by the product of their standard deviations, as shown in Equation (3.1).

r = cov(X, Y) / (σX · σY)    (3.1)

where cov(X, Y) is the covariance of variables X and Y, and σX and σY are the standard deviations of X and Y, respectively. The r value indicates the strength and direction of the linear relationship between the features. It ranges from -1 to 1, with 1 being a perfect positive linear relationship, -1 a perfect negative linear relationship, and 0 indicating no linear relationship. Figure 3.2 shows a sample of r values from a correlation matrix created for dataset CM1. The diagonal of all 1s is simply where the chart columns and rows are the same feature. Multicollinearity means one variable can be linearly predicted from the others. In our work this could make it difficult to isolate the individual effect of the predictor features on the module being Defective or not. In Chapter 4 we show results of using the correlation matrix of each dataset to create correlation heatmaps on a color scale.

Figure 3.2: Correlation Matrix: Example from dataset CM1
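The correlation analysis of Section 3.3.2 reduces to a few lines once the cleaned data is in a DataFrame. The sketch below, which assumes pandas and seaborn and uses an illustrative 0.9 threshold (not a rule we applied to drop features outright), produces the correlation matrix, a heatmap of the kind shown in Figure 3.2, and a list of highly correlated feature pairs that may signal multicollinearity.

```python
# Sketch of the Pearson correlation analysis in Section 3.3.2. The 0.9
# threshold below is illustrative; features were not dropped on this alone.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def correlation_report(df: pd.DataFrame, threshold: float = 0.9):
    corr = df.corr(method="pearson")                      # r for every pair of features
    sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1)   # heatmap as in Figure 3.2 / 4.1
    plt.tight_layout()
    # Flag strongly correlated pairs that may cause multicollinearity.
    pairs = [(a, b, round(corr.loc[a, b], 2))
             for i, a in enumerate(corr.columns)
             for b in corr.columns[i + 1:]
             if abs(corr.loc[a, b]) >= threshold]
    return corr, pairs
```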
3.3.3 Feature Importance
Dimensionality reduction is the process of reducing the number of features in a dataset while preserving the important information. We performed feature importance analysis to assess if some features could be removed from the dataset. Feature importance indicates the strength and direction of the relationship between the features and the target variable. Our initial training session included evaluating to what extent a particular feature impacts results. One option is to re-train the models using only features that reached a certain importance level, to see if predictions improve. For larger datasets this could also be a way of reducing training time and costs by not having to analyze as many possible combinations. Also, additional low-value features can create noise that obscures the underlying patterns, leading to inaccuracies or decreased model performance. One should assess whether the less important features are truly noise or contribute to the model, by applying domain knowledge and using methods such as L1 regularization (Lasso). Regularization limits how much weight is placed on any single feature. For Logistic Regression and SVM we were able to access the coefficients assigned to each feature, and we use these to calculate the mean feature importance. The sklearn.ensemble.RandomForestClassifier we used has a built-in attribute that returns an array of feature importances.

Control structures contribute to program complexity, and there are measures including McCabe’s cyclomatic, design, and essential complexity that were introduced by McCabe in his 1976 paper “A Complexity Measure” [22]. Overly complex code becomes hard to maintain, error prone, and requires more effort to manage. Halstead was one of the early researchers to provide options for estimating programmer effort in his 1975 paper “Toward a theoretical basis for estimating programming effort” [10]. A description of each of the quality metric attributes included as features in our exploratory analysis is given here:

1. Branch count tallies the number of decision points in the control flow that can take different paths [22]. A larger number of branches means more testing resources are required to cover all cases.
2. Cyclomatic complexity measures the number of linearly independent paths through a program’s decision structure. Introduced by McCabe [22], it can also be used to determine the minimum number of tests needed for code coverage.
3. Design complexity is the cyclomatic complexity of the reduced graph of a program. The reduction is performed to eliminate any complexity which does not influence the interrelationship between design modules [23].
4. Essential complexity indicates the extent to which a program’s flow graph can be reduced by removing or decomposing all the sub flow graphs (subroutines) with unique entry and exit nodes [22].
5. Halstead content is formally Intelligent Content and is defined in the McCabe IQ Research Library (http://www.mccabe.com/iq_research_metrics.htm). Halstead content is another way of referring to the complexity of a given algorithm independent of the programming language used to express the algorithm. Halstead Content = Halstead Volume / Halstead Difficulty (3.2)
6. Halstead difficulty (D) is a measure of how difficult it is for a programmer to comprehend the code [24, 14]. D = (n1 / 2) · (N2 / n2) (3.3)
7. Halstead effort (E) is an estimate of the mental effort to work on a module, calculated as the product of Halstead difficulty and Halstead volume [10]. E = D · V (3.4)
8. Halstead error est is the estimated number of errors in a module.
9. Halstead length (N) is the program length: the total number of operators plus the total number of operands [10]. N = N1 + N2 (3.5)
10. Halstead level (L) is the inverse of the program’s Halstead difficulty and also measures the program’s ability to be comprehended [10]. L = 1 / D (3.6)
11. Halstead programming time is the estimated time to develop the module or implement an algorithm [10]: Halstead programming time = E / S (3.7), where S = 18 represents the number of moments per second of effective mental discriminations of the human brain; Halstead chose this value based on psychological research by Stroud [35].
12. Halstead volume (V) is the minimum number of bits required for coding the program (McCabe IQ Research Library). V = (N1 + N2) · log2(n1 + n2) (3.8)
13. Loc blank is whitespace only; useful for human readability.
14. Loc code and comment: the lines that contain both code and comment, or Halstead’s line count of mixed code and comments.
15. Loc comments: the number of lines in a module; this particular metric includes all blank lines, comment lines, and source code lines (http://promise.site.uottawa.ca/SERepository/datasets/kc1-class-level-Numericdefect.arff).
16. Loc executable, or McCabe line count: the lines of code that contain only code and white space.
17. Loc total, or Halstead’s line count: total lines of code.
18. Num operands (N2): the total operands used [10].
19. Num operators (N1): the total operators used [10].
20. Num unique operands (n2): the unique or distinct number of operands used [10].
21. Num unique operators (n1): the unique or distinct number of operators used [10].
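Because several of these measures are simple functions of the operator and operand counts, Equations (3.2) to (3.8) can be written out directly. The sketch below assumes the four counts n1, n2, N1 and N2 have already been extracted for a module; it illustrates the relationships rather than the tooling that originally produced the NASA MDP metrics.

```python
# Sketch of the Halstead relationships in Equations (3.2)-(3.8) for one
# module, given its operator/operand counts. S = 18 is the Stroud number.
import math

def halstead_metrics(n1, n2, N1, N2, S=18):
    length = N1 + N2                        # (3.5) program length
    volume = length * math.log2(n1 + n2)    # (3.8) volume
    difficulty = (n1 / 2) * (N2 / n2)       # (3.3) difficulty
    level = 1 / difficulty                  # (3.6) level
    effort = difficulty * volume            # (3.4) effort
    time = effort / S                       # (3.7) programming time, in seconds
    content = volume / difficulty           # (3.2) intelligent content
    return {"length": length, "volume": volume, "difficulty": difficulty,
            "level": level, "effort": effort, "time": time, "content": content}

# Example: 10 unique operators, 8 unique operands, 60 total operators, 45 total operands.
print(halstead_metrics(n1=10, n2=8, N1=60, N2=45))
```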
3.4 Model Validation
In preparation for validating our models, instead of creating a training and test split within each dataset, we chose to use datasets CM1 and JM1 for training, Table 3.4, and KC1 and PC1 for testing, as listed in Table 3.5. This resulted in close to an 80/20 percent split of the total instances. The importance of a training and test split is that evaluating the model only on samples it has already seen would result in a near-perfect score while failing to predict anything useful on yet-unseen data. A too-perfect performance score is a symptom of over-fitting, and a solution to over-fitting is using hold-out sets of data.

Table 3.4: Training Datasets
Dataset   Instances   Percent Defective
CM1       327         12.84%
JM1       7782        21.49%

Table 3.5: Test Datasets
Dataset   Instances   Percent Defective
KC1       1183        26.54%
PC1       705         8.65%

3.5 Hyperparameter Tuning
In machine learning, a hyperparameter is a parameter that can be set before training begins to adjust the learning process and influence the model’s ability to generalize from training data to unseen data. In this section we discuss the hyperparameter selection for each model and the grid search employed in this study.

3.5.1 Logistic Regression
As the starting point for our experiment we used the Scikit-learn implementation of LR for our binary classification problem, LogisticRegression. The liblinear solver performs regularization by default, but we can still adjust the C parameter [27], which controls the inverse of the regularization strength. A smaller value of C indicates stronger regularization, which can help prevent overfitting by penalizing large coefficients. For example, we used values 0.1, 1, and 10 for C with LR, which allowed a grid search to be performed over those three values.

3.5.2 Random Forest
Random Forest is an ensemble learning method that combines multiple decision trees to improve predictive accuracy. It also reduces over-fitting by averaging the predictions of multiple trees. The implementation we used is RandomForestClassifier [27]. The parameters we tuned were:
• clf__n_estimators is the number of decision trees in the random forest; more trees generally improve performance but increase computational cost.
• clf__max_depth defines the maximum depth of each decision tree. Limiting the depth can prevent overfitting, and ‘None’ allows the trees to grow to their potential maximum depth.
• clf__min_samples_split is how many samples need to be at a node before it can be split into child nodes.
• clf__min_samples_leaf sets the minimum number of samples allowed in a leaf node.
• clf__bootstrap specifies whether bootstrap samples are used when building trees. If set to True, each tree in the forest is built on a bootstrap sample (sampling with replacement). If set to False, the entire dataset is used to build each tree. Using bootstrap samples can introduce randomness, which can improve generalization.

3.5.3 Support Vector Machine
For SVM we used the SVC class from Scikit-learn [27] for binary classification to separate defective and non-defective software modules. It is considered memory efficient because it uses only those data points that just touch the margins (the support vectors). We used a linear kernel with SVM, which simply computes the dot product between two input vectors, V1 · V2, and is suitable for linearly separable data. For SVM the regularization parameter C regularizes the model and prevents overfitting by controlling the strictness of the margin: a large C means that data points cannot lie inside the margin, while smaller values specify stronger regularization. We trained with low and higher C values [0.01, 0.1, 1, 10]. Other kernel functions are available to handle non-linear relationships in the data if needed.

3.5.4 Stacking Ensemble
A stacking classifier was used to take the best-estimator output of the individual base classifiers, LR, RF and SVM, and use it as the input of a final estimator. Figure 3.3 illustrates this, with the three models used as inputs to the final model. Stacking combines the strengths of the base classifiers; by aggregating the predictions, it may achieve higher accuracy, be more robust, and generalize better. The Scikit-learn StackingClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html) is used to answer RQ3: Can a stacking classifier that combines LR, RF, and SVM improve predictive performance?

Figure 3.3: Stacking Classifier: Uses the strength of each individual estimator by using their output as input of a final estimator.

3.6 Training
Training the model involves applying the individual algorithm to the data with the selected hyperparameters and features to obtain a best fit. We first work with the classifiers LR, RF and SVM separately. Model validation essentially involves fitting the model to the training data and comparing the predictions to the known values. Model validation and the search for the best parameters were also aided by executing a cross-validated grid search over a parameter grid. We used the default of 5 cross-validation splits (folds/iterations) with the Scikit-learn implementation GridSearchCV (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to fit the model and report the results of using various hyperparameters of the individual estimators. As our estimators are classifiers and Y, the defective/non-defective target vector, is binary, StratifiedKFold is used. StratifiedKFold is “a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.” Finally, we combine these using a Stacking Classifier, as illustrated in Figure 3.3. The best estimators from fitting the models are supplied to the Stacking Classifier to provide final predictions that can be compared to the individual model output.
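To summarize how the pieces of Sections 3.5 and 3.6 fit together, a condensed sketch of the tuning and stacking workflow is given below. The parameter grids are abbreviated, the synthetic data stands in for the CM1 and JM1 training instances, and the variable names are placeholders; it illustrates the workflow rather than reproducing the exact scripts used in this study.

```python
# Condensed sketch of the workflow in Sections 3.5-3.6: a cross-validated
# grid search per base model, then a stacking classifier built from the best
# estimators. Grids are abbreviated and the data below is a synthetic
# placeholder for the CM1 + JM1 training instances.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X_train, y_train = make_classification(n_samples=500, n_features=21,
                                       weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=5)   # folds preserve the defective/non-defective ratio

searches = {
    "lr": GridSearchCV(LogisticRegression(solver="liblinear"),
                       {"C": [0.1, 1, 10]}, cv=cv),
    "rf": GridSearchCV(RandomForestClassifier(),
                       {"n_estimators": [100, 300], "max_depth": [None, 10]}, cv=cv),
    "svm": GridSearchCV(SVC(kernel="linear"),
                        {"C": [0.01, 0.1, 1, 10]}, cv=cv),
}
best = {name: gs.fit(X_train, y_train).best_estimator_ for name, gs in searches.items()}

# Stack the tuned base models; a logistic regression acts as the meta-learner.
stack = StackingClassifier(estimators=list(best.items()),
                           final_estimator=LogisticRegression(), cv=cv)
stack.fit(X_train, y_train)        # this model is then evaluated on the held-out KC1 and PC1 data
```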
3.7 Testing
Finally, we use the trained model to make predictions on the held-out test datasets. The model that was fitted to the training data is applied to unseen data using the best hyperparameters and most important features. Performance measures on the outcomes are then used to evaluate effectiveness in predicting whether or not modules contain defects.

3.8 Model Performance
In this section, we describe the performance measures used for evaluating the effectiveness of each model. As we are predicting whether or not a module is Defective, statistical binary classification measures are used.

Definitions
1. True Positive (TP): a defect was predicted and it actually was a defect.
2. False Positive (FP): a defect was predicted but there actually was no defect.
3. True Negative (TN): a defect was not predicted and there actually was no defect.
4. False Negative (FN): no defect was predicted but there actually was a defect.
5. Accuracy is the proportion of the number of correctly classified instances (TP and TN) to the total number of instances present (TP, TN, FP, and FN). Accuracy = (TP + TN) / (TP + FP + TN + FN) (3.9)
6. Precision is the ratio of Defective instances that are classified correctly (TP) to the total number of instances that are classified as Defective (TP and FP). Precision = TP / (TP + FP) (3.10)
7. Recall is the ratio of Defective instances that are classified correctly (TP) to the total number of instances that are actually Defective (TP and FN). Recall = TP / (TP + FN) (3.11)
8. F-measure, F-score, and F1 score are often used interchangeably; they refer to the same metric used in binary classification. It combines precision and recall as the harmonic mean to provide a single value that represents the model’s accuracy in capturing both positive and negative instances. The F1-score ranges from 0 to 1, with 1 being the best possible score, indicating perfect precision and recall, and 0 indicating the worst possible performance. F1 = 2 · (Precision × Recall) / (Precision + Recall) (3.12)
9. Receiver Operating Characteristic Area Under the Curve. The AUC of the ROC measures the overall performance of the classifier across all possible thresholds. A higher AUC value (ranging from 0 to 1) indicates better discrimination and model performance. Huang et al. [13] give a good overview in their 2005 paper “Using AUC and Accuracy in Evaluating Learning Algorithms” and state that “the area under the ROC curve, or simply AUC, provides a good summary” of the performance of the ROC curves.

In the next chapter we present results and analysis based on these metrics to measure the accuracy and performance of our models and answer the research questions that we proposed in Chapter 1.

Chapter 4
Results and Analysis

4.1 Introduction
In this chapter we answer the research questions posed in Chapter 1 by reviewing the overall performance of the models as applied to unseen test datasets and different training data setups. We present results from our attempts to understand the differences in feature importance between training datasets and compare our results with existing work.

• RQ1: What are the key indicators in software code (attribute/feature importance) that machine learning models can use to predict defects?
• RQ2: Which classifier among logistic regression, random forest, and support vector machine provides the most accurate predictions of software defects?

• RQ3: Can an ensemble using a stacking classifier that combines logistic regression, random forest, and support vector machine improve predictive performance?

• RQ4: How do our results compare to existing work?

Several tables in this chapter have a ‘Class’ column, and the classes are:

• Class 0: Non-defective

• Class 1: Defective

4.2 RQ1: What are the key indicators in software code that machine learning models can use to predict defects?

It is a crucial part of machine learning to tailor feature selection and engineering efforts appropriately for each model and dataset. A tailored approach can improve model accuracy and provide deeper insights into the factors that drive predictions in different software engineering contexts. The key indicators identified in this study for predicting software defects include Halstead content, Halstead volume, Halstead effort, and cyclomatic complexity. These metrics, which measure code complexity and the mental effort required to understand the code, were consistently significant predictors across the datasets. Halstead volume captures the size of the program; larger volumes typically correlate with higher complexity, as can be seen on the heatmaps and RF feature importance charts, and hence with higher defect rates. This suggests that features capturing the intricacy and comprehensibility of code play a vital role in defect prediction. Features such as branch count and the number of operands and operators were also important, highlighting the role of code structure and complexity in defect prediction. Further investigation into additional features and their interactions could provide deeper insights into improving model accuracy.

4.2.1 Correlation and Multicollinearity

To illustrate how each feature correlates with the others in terms of influencing the outcome of the software defect, Figure 4.1 shows a correlation heatmap for each dataset: (a) CM1, (b) JM1, (c) KC1, and (d) PC1. A value of 1 is deep red, 0 is white, and -1 is dark blue. Heatmaps visually show the correlation and can be used to zoom in on highly correlated features that may cause multicollinearity to occur when modelling. The red diagonals with value 1 are where the features are cross-referenced with themselves. The red squares with value 1 that are not on the diagonal, however, suggest very high correlations, or multicollinearity. For example, num_operands, num_operators, Halstead volume, and Halstead length have strong positive correlations with each other across all datasets. These positive correlations indicate that increases in one of the metrics are typically associated with increases in the others. This is not unexpected because Halstead length is the sum of operands and operators per Equation (3.5). Including all of these features in a regression model could lead to multicollinearity issues, making it difficult to isolate the individual effect of each predictor feature on the response variable (defective). The complexity metrics (cyclomatic complexity, design complexity, essential complexity) and branch count have moderate to high positive correlations with each other, suggesting that as the branching in code increases, so does the complexity. A sketch of this multicollinearity check follows.
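The check can be made concrete with a few lines of pandas. The sketch below is illustrative only: it assumes the cleaned dataset sits in a DataFrame named df whose columns are the numeric metrics plus a binary defective target, and the 0.9 cut-off is an assumed example threshold rather than a value fixed by this study.

```python
# Sketch: flag highly correlated feature pairs that may cause multicollinearity.
# `df` is assumed to hold the numeric metric columns plus a binary 'defective'
# target column; the 0.9 threshold is illustrative.
import numpy as np
import pandas as pd

def highly_correlated_pairs(df: pd.DataFrame, target: str = "defective",
                            threshold: float = 0.9) -> pd.DataFrame:
    """List feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    features = df.drop(columns=[target])
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair is reported once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().reset_index()
    pairs.columns = ["feature_1", "feature_2", "abs_corr"]
    return pairs[pairs["abs_corr"] >= threshold].sort_values("abs_corr", ascending=False)

# A heatmap like Figure 4.1 can then be drawn with, e.g., seaborn.heatmap(corr, cmap="coolwarm").
```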
On the other end of the color scale, loc_blank has a weak correlation with most other metrics, indicating that the number of blank lines in the code does not significantly impact other metrics. Halstead level also shows weak to moderate negative correlations with most other metrics, suggesting an inverse relationship. A higher Halstead level indicates simpler code, which likely corresponds with fewer lines of code and lower complexity metrics. As Equation (3.6) shows, L = 1/D. This inverse relationship is easily verified by looking at the values in the dark blue bands on all four heatmaps.

Removing features from the data on strong correlation alone is not usually sufficient. Domain expertise is needed to understand that even correlated features may both be needed, and keeping both may make the model easier to interpret. However, reducing the feature space can help in managing the complexity of the model, improve training times, and reduce overfitting. In such cases, dropping some highly correlated features might be beneficial after thorough analysis. We continue with results and further analysis for feature selection.

Figure 4.1: Correlation Heatmaps for (a) CM1, (b) JM1, (c) KC1, and (d) PC1: shows the correlation between features on a color scale, with red indicating stronger correlations and blue weaker correlations.

4.2.2 Feature Importance

To further analyze the potential impact of the models’ coefficients on each dataset, we plotted the distribution of the coefficients to try to establish a threshold above which features might be considered for removal. Taking the absolute value puts the focus on the strength of the correlation without regard to whether it is positive or negative. From Figure 4.2 it can be seen that even though the two training datasets (CM1 and JM1) have the same attributes, the distributions are not uniform.

Figure 4.2: Distribution of Absolute Values of Coefficients for LR and SVM, shown for (a) LR on CM1, (b) LR on JM1, (c) SVM on CM1, and (d) SVM on JM1: the presence of a few larger coefficients hints at some features having a greater impact on the model outcome compared to most.

With LR the distribution is more even in JM1 (b) than in CM1 (a), with a peak near 0.0 but also significant representation around 0.3 to 0.4 and another peak around 0.6. This could imply that a wider range of features contribute variably to the model, possibly indicating a more complex relationship in the JM1 data. In contrast, in CM1 LR there is a high frequency of coefficients around 0.0 to 0.3, with a significant peak near 0.1 and few coefficients in higher ranges above 0.6. This distribution suggests that while most features have a moderate influence, there are a few features with very high or very low influence. The SVM distributions in JM1 show a high peak at 0.0 and another significant peak close to 0.4, with few instances near 1.0. The presence of these few larger coefficients hints at some features having a greater impact on the model outcome compared to most. The distribution for CM1 with SVM peaks strongly at 0.04 and 0.1, with most coefficients being below 0.15. This suggests that features generally have a low influence on the model or that the relationships are less complex.

Table 4.1: 90th Percentile of Coefficients
Dataset   Logistic Regression   SVM
CM1       0.4897                0.1952
JM1       0.6194                0.3602
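The coefficient analysis behind Table 4.1 can be expressed in a few lines. The sketch below is illustrative only and assumes lr and svm are the already-fitted LogisticRegression and linear-kernel SVC estimators from Chapter 3, with feature_names listing the corresponding dataset columns.

```python
# Sketch: rank features by absolute coefficient magnitude and derive a
# percentile threshold, as used for Table 4.1. `lr`, `svm` and `feature_names`
# are assumed to come from the fitted models described in Chapter 3.
import numpy as np
import pandas as pd

def coefficient_summary(model, feature_names, percentile=90):
    """Rank features by |coefficient| and return the given percentile as a threshold."""
    # Both LogisticRegression and SVC(kernel="linear") expose coef_ of shape (1, n_features).
    coefs = pd.Series(np.abs(model.coef_).ravel(), index=feature_names)
    threshold = np.percentile(coefs, percentile)  # e.g. the values reported in Table 4.1
    return threshold, coefs.sort_values(ascending=False)

# Illustrative usage:
# thr_lr,  ranked_lr  = coefficient_summary(lr,  feature_names)
# thr_svm, ranked_svm = coefficient_summary(svm, feature_names)
# For Random Forest, the analogous ranking comes from rf.feature_importances_ (Figure 4.3).
```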
The 90th percentiles listed in Table 4.1 are considered as possible values to use as a threshold for features to remove. LR tends to use higher magnitude coefficients than SVM across both datasets. The higher coefficients in LR could imply that the model is more sensitive to changes in the input features. This can be beneficial for learning complex patterns but may also risk overfitting if not managed with techniques like regularization. On the other hand, the lower coefficients observed in SVM might suggest that the model is more robust to changes in the input data. This might offer better generalization on unseen data, which can be advantageous in scenarios where data is expected to vary considerably. Overall, JM1 seems to involve more complex relationships among features, as indicated by the broader spread of significant coefficients in both models compared to CM1.

For RF, we evaluated the feature importances using the feature_importances_ property of the algorithm and plotted the results as shown in Figure 4.3. Both datasets show high importance for Halstead content and Halstead effort, which suggests these metrics are generally valuable across different contexts in assessing the models’ predictions. num_operators is also significant in both datasets, though possibly with contextual differences in its role between CM1 and JM1. In JM1, the broader spread of importance across many features could indicate a more complex interplay of factors that determine the target variable, suggesting that predictions in JM1 are less about a single dominant characteristic of the data. The high ranking of loc_comments in CM1 could suggest a scenario where the volume of comments within the code significantly correlates with the target variable, possibly indicating the complexity or maintainability of the code.

Figure 4.3: Random Forest Feature Importance for (a) CM1 and (b) JM1: JM1 seems to involve more complex relationships among features, as indicated by the broader spread of importance values compared to CM1.

Answer to RQ1: What are the key indicators in software code that machine learning models can use to predict defects? The key indicators identified in this study for predicting software defects include Halstead content and Halstead effort. These metrics, which measure code complexity and the mental effort required to understand the code, were consistently significant predictors across the datasets. Halstead effort measures the mental effort required to understand the code, so the significance of this metric across different datasets suggests that the cognitive load on developers is a strong indicator of potential defects. Halstead volume captures the size of the program; larger volumes typically correlate with higher complexity, as can be seen on the heatmaps and RF feature importance charts, and hence with higher defect rates. Features such as branch count and the number of operands and operators were also important, highlighting the role of code structure and complexity in defect prediction.

4.3 RQ2: Which classifier among logistic regression, random forest, and support vector machine provides the most accurate predictions of software defects?

Among the individual classifiers tested, RF consistently showed better performance in predicting software defects compared to LR and SVM.
This aligns with previous studies [19, 31, 38] that highlight the robustness and accuracy of random forest in handling complex datasets with numerous features. By averaging the results of multiple trees, RF resists overfitting, and having an inbuilt method for selecting important features might have contributed to its better performance.

4.3.1 ROC AUC

The AUC values indicate the overall quality of the models’ predictions, where 0.5 represents no discrimination ability (similar to random guessing), illustrated by the “No Skill” dotted line, and 1.0 represents perfect discrimination. Figure 4.4 shows ROC curves for LR and SVM. There is a clear difference in performance between LR and SVM and the datasets they were trained on, suggesting that both model selection and the nature of the dataset significantly influence outcomes.

In the LR graph all curves overlap and indicate almost identical performance, with AUC = 0.66 or 0.67 for all curves. This suggests a moderate ability to discriminate between the defective and non-defective classes: better than a random guess (No Skill line) but not highly accurate. The overlapping nature indicates that changes in the training dataset (CM1 vs. JM1) or test dataset (KC1 vs. PC1) do not affect the performance, which may suggest similar characteristics or distributions in the datasets or robustness across the models. In contrast, the SVM ROC curves show distinct performances for different combinations of training and test datasets.

• KC1 trained on CM1: AUC = 0.52 (blue solid line)
• PC1 trained on CM1: AUC = 0.72 (green dashed line)
• KC1 trained on JM1: AUC = 0.64 (red dashed line)
• PC1 trained on JM1: AUC = 0.70 (green dotted line)

The best performance is observed for PC1 trained on CM1, with an AUC of 0.72, suggesting good discriminatory ability. The worst performance is for KC1 trained on CM1, with an AUC close to 0.52, which is barely above the No Skill line, indicating a performance close to random guessing. The improvement in AUC for PC1 across both training datasets (CM1 and JM1) compared to KC1 suggests that the model or features used in PC1 are more effective for this task.

Figure 4.4: ROC Curves Comparison for (a) LR and (b) SVM: the LR curves overlap, while the SVM curves show distinct performances for different combinations of datasets.

4.3.2 Accuracy

Table 4.2 lists accuracy scores for the three classification models LR, RF, and SVM trained using CM1 and JM1 and tested on datasets KC1 and PC1. RF and SVM both performed best on PC1, at 0.91 and 0.90, after training on CM1 and JM1. LR performed consistently at 0.73 and 0.74 in all four test scenarios. The KC1 tests were similar across LR, RF and SVM at 0.73 to 0.74. KC1 only reached 0.34 accuracy after training on JM1 and 0.5 after training with CM1.

Table 4.2: Accuracy By Test Dataset
Training Dataset   Test Dataset   Logistic Regression   Random Forest   Support Vector Machine
CM1                KC1            0.73                  0.73            0.73
JM1                KC1            0.74                  0.74            0.73
CM1                PC1            0.73                  0.91            0.91
JM1                PC1            0.74                  0.90            0.91

4.3.3 Precision

Precision scores are presented in Table 4.3. Class 0 (non-defective, the negative class) generally has much higher precision scores across all models compared to Class 1 (defective, the positive class). This indicates that all models are better at identifying true negatives than true positives, which could imply an imbalance in the dataset where negatives are more frequent or easier to predict. The uniform scores of 0.73 for Class 0 in the CM1/KC1 setup across all models suggest potential overfitting.
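The per-class scores reported in this section can be reproduced with scikit-learn’s standard metric utilities. A minimal sketch, assuming model is one of the fitted classifiers and X_test, y_test hold one of the hold-out test datasets (KC1 or PC1):

```python
# Sketch: compute the metrics defined in Section 3.8 for one train/test combination.
# `model`, `X_test` and `y_test` are assumed to come from the experiments above.
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

def evaluate(model, X_test, y_test):
    """Report accuracy, per-class precision/recall/F1, and ROC AUC."""
    y_pred = model.predict(X_test)
    print("Accuracy:", round(accuracy_score(y_test, y_pred), 2))
    # Per-class precision, recall and F1 (class 0 = non-defective, class 1 = defective).
    print(classification_report(y_test, y_pred,
                                target_names=["non-defective", "defective"]))
    # ROC AUC needs a continuous score: use probabilities when available,
    # otherwise the decision function (e.g. an SVC trained without probability=True).
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X_test)[:, 1]
    else:
        scores = model.decision_function(X_test)
    print("ROC AUC:", round(roc_auc_score(y_test, scores), 2))
```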
Generally, models trained on JM1 seem to perform slightly better on Class 0 but worse on Class 1 when compared to those trained on CM1. Test dataset PC1 allows for higher precision in Class 0 predictions across all models compared to KC1. This could indicate that PC1 is either a less challenging dataset or better aligned with the features or distribution of the training data.

Table 4.3: Precision By Test Dataset
Training Dataset   Test Dataset   Class   Logistic Regression   Random Forest   Support Vector Machine
CM1                KC1            0       0.73                  0.73            0.73
CM1                KC1            1       0.00                  0.00            0.00
JM1                KC1            0       0.74                  0.75            0.73
JM1                KC1            1       0.69                  0.56            0.00
CM1                PC1            0       0.73                  0.91            0.91
CM1                PC1            1       0.00                  1.00            0.00
JM1                PC1            0       0.74                  0.93            0.91
JM1                PC1            1       0.69                  0.32            0.00

4.3.4 Recall

The recall data presented in Table 4.4 suggest a strong need for the models to improve in detecting Class 1 (defective) instances without sacrificing performance on Class 0. All models show high recall scores for Class 0 in most cases, indicating that they are very effective at identifying negative instances; in particular, LR, RF, and SVM consistently achieve perfect recall scores (1.00) in many scenarios. While maintaining high recall for Class 0, the RF and SVM models struggle with Class 1, rarely identifying more than a small fraction of positive cases, with few exceptions where Random Forest reaches up to 0.18 recall.

Table 4.4: Recall By Test Dataset
Training Dataset   Test Dataset   Class   Logistic Regression   Random Forest   Support Vector Machine
CM1                KC1            0       1.00                  1.00            1.00
CM1                KC1            1       0.00                  0.00            0.00
JM1                KC1            0       0.99                  0.97            1.00
JM1                KC1            1       0.04                  0.09            0.00
CM1                PC1            0       1.00                  1.00            1.00
CM1                PC1            1       0.00                  0.02            0.00
JM1                PC1            0       0.99                  0.96            1.00
JM1                PC1            1       0.04                  0.18            0.00

4.3.5 F1 score

The F1 scores in Table 4.5 reflect a balance between precision and recall for each model and are especially useful for evaluating performance on datasets with imbalanced classes. The high F1 scores for Class 0 across most models indicate that the models are better tuned to predict negatives, possibly due to a higher prevalence of non-defective instances. In contrast, the consistently low F1 scores for Class 1 across all models suggest a substantial difficulty in predicting positives.

Table 4.5: F1-score By Test Dataset
Training Dataset   Test Dataset   Class   Logistic Regression   Random Forest   Support Vector Machine
CM1                KC1            0       0.85                  0.85            0.85
CM1                KC1            1       0.00                  0.00            0.00
JM1                KC1            0       0.85                  0.85            0.85
JM1                KC1            1       0.07                  0.16            0.00
CM1                PC1            0       0.85                  0.96            0.95
CM1                PC1            1       0.00                  0.03            0.00
JM1                PC1            0       0.85                  0.94            0.95
JM1                PC1            1       0.07                  0.23            0.00

Answer to RQ2: Which classifier among logistic regression, random forest, and support vector machine provides the most accurate predictions of software defects? The random forest classifier provided the most accurate predictions of software defects, with an average accuracy of 0.87 across the datasets. It outperformed logistic regression and support vector machine in terms of precision, recall, and F1-score, demonstrating its robustness and ability to handle complex interactions between features.

4.4 RQ3: Can an ensemble using a stacking classifier that combines logistic regression, random forest, and support vector machine improve predictive performance?

The motivation for applying a stacking classifier arose from the expectation that taking the best of each base model would build an improved model for more accurate predictions. Table 4.6 lists results from the stacking classifier using the best parameters from each of the three base models, LR, RF and SVM.
The performance metrics provided in Table 4.6 indicate that the stacking classifier, combining logistic regression, random forest, and support vector machine, did not improve predictive performance across the different training and test datasets. Therefore, the ensemble method does not reliably enhance predictive performance in this context.

Table 4.6: Stacking Classifier
Training Dataset   Test Dataset   AUC    Accuracy   Class   Precision   Recall   F1-score
CM1                KC1            0.50   0.73       0       0.73        1.00     0.85
                                                    1       0.00        0.00     0.00
CM1                PC1            0.53   0.92       0       0.92        1.00     0.96
                                                    1       0.67        0.07     0.12
JM1                KC1            0.54   0.34       0       0.87        0.12     0.22
                                                    1       0.28        0.95     0.43
JM1                PC1            0.50   0.09       0       1.00        0.00     0.01
                                                    1       0.09        1.00     0.16

The stacking classifier shows markedly lower accuracy than the individual classifiers in the scenarios trained on JM1, dropping to 0.34 on KC1 and 0.09 on PC1. This might indicate that the stacking approach, as currently configured, is not effective. Further investigation could be undertaken to consider whether it suffers from issues such as overfitting the training data or not generalizing well to the test data. Investigation could also include re-examining the base classifiers used, their hyperparameters, and how they are combined. Accuracy varies widely depending on the dataset combination, from as low as 0.09 to as high as 0.92. The accuracy metrics are generally higher when Class 0 is predicted correctly, as is especially noticeable when training on CM1 and testing on PC1 (0.92 accuracy). However, the low accuracy in some combinations (e.g., training on JM1 and testing on PC1, 0.09) indicates severe misclassification, with almost all Class 0 instances labelled as Class 1. The low AUC scores and the varied performance across classes imply that the current model complexity might not be suitable for the underlying data structure and distribution. The AUC values range from 0.50 to 0.54, close to 0.50, indicating that the classifier performs no better than random guessing on these dataset combinations. The stacking classifier did not perform better on AUC than either LR or SVM, whose scores were mostly higher, in the range 0.64 to 0.72, with the exception of SVM tested on KC1 after training on CM1 (0.52). When trained on JM1 and tested on PC1, the stacking classifier shows complete recall for Class 1 (1.00) but fails entirely for Class 0 (0.00). Choosing RandomForestClassifier as the estimator for SelectFromModel might have selected features that are good for decision trees but not optimal for logistic regression, another reason why the current model configuration might not suit the underlying data structure and distribution.

Answer to RQ3: Can an ensemble using a stacking classifier that combines logistic regression, random forest, and support vector machine improve predictive performance? While some models achieved high accuracy for certain classes, others showed poor results, particularly in recall and F1-score for the minority class. Specifically, the AUC values range from 0.50 to 0.54, suggesting that the model’s ability to distinguish between classes is no better than random guessing for most dataset pairs. The feature selection method might have selected features that are good for decision trees but not optimal for logistic regression. A more sophisticated ensemble method, or a neural network if appropriate for the size of the dataset, could be considered.

4.5 How do our results compare with similar research?
In this section, we compare the results of our software defect prediction models with those from three notable research papers: “Effective software defect prediction using support vector machines (SVMs)” [8], “Software Defect Prediction Using Random Forest Algorithm” [31], and “An Ensemble Model for Software Defect Prediction” [2].

Goyal’s work on SVMs highlighted the enhancement brought by their FILTER technique, claiming an improvement in accuracy of 16.73% and an improvement of 7.65% in F-measure (F1 score). We compare our SVM with a linear kernel (no filtering) against Goyal’s SVM-Linear Kernel with FILTER in Table 4.7.

Table 4.7: Comparison of SVM (this study) with SVM using FILTER (Goyal)
Dataset   Accuracy (avg), this study   FILTER accuracy
KC1       0.73                         0.83
PC1       0.91                         0.89
Dataset   AUC (avg), this study        FILTER AUC
KC1       0.58                         0.829
PC1       0.71                         0.776
Dataset   F1-score, this study         FILTER F1-score
KC1       0.85                         0.901
PC1       0.95                         0.934

• Accuracy: The SVM-Linear Kernel with filtering shows higher accuracy for the KC1 dataset (0.83 compared to 0.73) and comparable accuracy for the PC1 dataset (0.89 compared to 0.91). This indicates that the filtering technique generally improves or maintains accuracy across different datasets.

• AUC (Area Under the Curve): The AUC values indicate a significant improvement for the KC1 dataset (0.829 compared to 0.58) and a moderate improvement for the PC1 dataset (0.776 compared to 0.71) when using the filtering technique. This suggests that the filtering technique enhances the model’s ability to discriminate between classes.

• F1-score: The F1-score, which balances precision and recall, is higher with the filtering technique on KC1 (0.901 compared to 0.85), while on PC1 our model’s score was slightly higher (0.95 compared to 0.934). On balance, filtering improves or roughly maintains the F1-score.

The values provided suggest that Goyal’s FILTER technique for enhancing SVM models outperforms our standard SVM-Linear Kernel model on most of the metrics reported, most clearly AUC. Therefore, incorporating the filtering technique could be beneficial for improving the effectiveness of SVM models in classification tasks, and should be considered in future work.

Soe et al.’s research focused on using Random Forest (RF) algorithms with NASA metrics data and the Arçelik (AR) datasets from the PROMISE repository created in 2007. RF models are known for their high accuracy and robustness in handling large datasets. In Table 4.8 we compare accuracy for datasets KC1 and PC1. It is noted that Soe et al. obtained their highest accuracy on dataset PC2 (0.9959), which was not in our test set, and did so using 1,000 trees, whereas in our study 200 trees gave the best estimator.

Table 4.8: Comparison of Random Forest: This study v. Soe et al.
Dataset   This study Accuracy (200 trees)   Soe et al. Accuracy (1,000 trees)
KC1       0.73                              0.8682
PC1       0.91                              0.9324

Soe et al.’s final conclusion was that the number of “trees in the forest should be around a hundred because it is more stable for defect prediction to get the high accuracy”, which aligns closely with our finding that 200 trees were optimal in our experiments. This convergence suggests that while the exact number of trees may vary, a relatively high, but not excessively large, number of trees is beneficial for achieving stable and accurate results.

Ali et al. proposed an ensemble model for defect prediction.
Ensemble methods typically enhance prediction performance by combining the strengths of multiple classifiers. We compare our stacking classifier results with Ali et al.’s proposed ensemble for dataset PC1.

Table 4.9: Comparison with Ali et al.’s Proposed Ensemble for dataset PC1
Metric      This study   Ali et al.
Accuracy    0.5          0.9927
Precision   0.91         0.9986
Recall      0.5          0.9935
F-score     0.46         0.9513

The accuracy of Ali et al.’s model (0.9927) is significantly higher than the accuracy of the stacking classifier for PC1 after training on CM1 in this study (0.92). As shown in Table 4.6, the accuracy on PC1 when using the model trained on JM1 was only 0.09, bringing the average accuracy down to only 0.5. The recall of Ali et al.’s model (0.9935) is also significantly higher than the recall of the stacking classifier from this study (0.5). Higher recall means that Ali et al.’s model is more effective at identifying actual defects, thus reducing false negatives. Combining multiple classifiers, as done in Ali et al.’s proposed ensemble model, can significantly enhance prediction performance in defect prediction tasks, and their model outperforms the stacking classifier from this study across all metrics: accuracy, precision, recall, and F-score.

In summary, we note that Goyal’s FILTER technique could be beneficial for improving the effectiveness of SVM models in classification tasks and should be considered in future work. Our finding that 200 trees worked best with RF aligns with Soe et al.’s conclusion recommending around a hundred trees for optimal results. Our study also contributes to the body of knowledge by demonstrating the potential of stacking classifiers and identifying areas for future improvement. We highlight the effectiveness of Ali et al.’s ensemble method and suggest that further refinement and optimization of our stacking classifier could yield better results. These comparisons indicate that while our stacking classifier shows potential, further refinement and optimization are necessary to match or exceed the performance of advanced ensemble techniques.

4.6 Threats to Validity - Internal

Internal validity concerns the reliability of the methodology; by providing details of the methodology and process in Chapter 3, we allow other researchers to reproduce the same results. Conducting more thorough feature engineering could improve model performance and interpretability by ensuring that the features chosen are truly predictive for all models in the stacking ensemble. One possibility could be to include the use of principal component analysis (PCA) to address feature multicollinearity. The grid search for hyperparameter tuning was conducted with a limited set of values due to computational constraints and may not be extensive enough. Therefore, more extensive hyperparameter optimization using other techniques such as random search, together with greater processing power, could potentially yield better-performing models.

4.7 Threats to Validity - External Validity

External validity concerns the generalizability of our findings to other contexts. This study used separate projects for training and testing, but comprehensive cross-project evaluations were not our goal. Future studies could aim to perform extensive cross-project validations to better understand how well models generalize across different project environments and settings. This study relied on publicly available NASA datasets, which may not fully represent the diversity and complexity of real-world software projects.
The limited availability of high volumes of quality data from real-world production systems can limit the feasibility of studies. Although using publicly available data made our study easy to reproduce, data from more recent real-world projects would make the findings more directly applicable to current practice. Barriers to access include privacy concerns at both the personal and the industry level, and collecting such datasets would require funding, time and expertise.

Chapter 5

Conclusions and Future Work

We conclude that research to find reliable prediction methods is still needed because we cannot prove that there are no defects in software. Artificial intelligence, including deep learning and machine learning, is an area worthy of continued study, and this thesis contributes to the understanding of machine learning applications in software defect prediction and highlights areas for further research. The findings reinforce the importance of model selection and feature engineering. Our empirical validation of various machine learning models on publicly available datasets adds rigor to the study, offering reproducible and verifiable results.

5.1 Conclusions

This thesis contributes to the field of software defect prediction by empirically evaluating the performance of various machine learning models, including Logistic Regression, Random Forest, Support Vector Machine, and a stacking classifier combining these models. The findings highlight the importance of model selection and feature engineering to achieve accurate predictions.

Despite the demonstrated potential of the stacking classifier, it did not consistently outperform individual models, underscoring the need for further exploration of ensemble methods and more complex models. A more sophisticated ensemble method, or a neural network if appropriate for the size of the dataset, could be considered. Additionally, the comparative analysis with existing models, such as Ali et al.’s proposed ensemble, revealed areas where improvements can be made, particularly in terms of recall and F1-score.

While this study provides some insights into the application of machine learning for software defect prediction, it also identifies several avenues for future research. By addressing the identified threats to validity and exploring advanced techniques and more diverse datasets, future work can further enhance the reliability and usefulness of defect prediction models across different domains. Ensuring that defects are identified and addressed before deployment can save lives and reduce the economic burden of system failures.

5.2 Future Work

Using AI systems that translate natural language to code should be explored in future work. OpenAI Codex¹, for instance, claims to be trained specifically to understand and generate code. Future research could involve continued experimentation with different algorithms and automated hyperparameter tuning to optimize model performance. There is a need to investigate the potential for overfitting within our models. Some strategies to consider include implementing a different cross-validation strategy, such as Repeated or Randomized Stratified KFold, to reduce variance and ensure each class is properly represented (a brief sketch follows below). Also, implementing more sophisticated feature engineering methods, such as automatic feature selection, feature extraction using unsupervised learning, and incorporating additional context-specific metrics, could improve model accuracy and interpretability.
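As a concrete illustration of the cross-validation strategy suggested above, a minimal sketch assuming clf is any of the classifiers from Chapter 3 and X, y are the feature matrix and defect labels; the 5 folds x 10 repeats configuration is an example choice, not a recommendation from our experiments.

```python
# Sketch: repeated stratified cross-validation to reduce variance in the estimate.
# `clf`, `X` and `y` are assumed; the fold/repeat counts are illustrative.
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```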
Although this study used separate projects for training and testing, comprehensive cross-project evaluations were not our goal; they would, however, be an important area for future study. Conducting cross-project validation studies can evaluate and improve the generalizability of the models by testing them on datasets from projects with different programming languages, characteristics and domains. Exploring other advanced machine learning models, such as deep learning techniques, could provide new insights and potentially improve prediction performance. We found Artificial Neural Networks (ANN) to be the most popular deep learning technique in our literature review, and they could be particularly useful for handling complex software metrics. Acquiring and utilizing larger datasets from modern software projects across various domains would help validate the models’ applicability and robustness in different contexts. Collaborations with industry partners could facilitate access to such data. In industries where software failure can lead to loss of life or significant economic damage, prioritizing recall can significantly reduce risks. For instance, in the aviation industry, accurate defect prediction can prevent incidents like the Ethiopian Airlines flight 302 crash [5], which was partially attributed to software issues in the aircraft’s control system. Future advancements in this field hold the potential to significantly improve the quality and dependability of software, particularly in environments where safety is paramount.

¹ https://openai.com/index/openai-codex/

BIBLIOGRAPHY

[1] Akiyama, F. An Example of Software System Debugging. IFIP Congress (1) (1971), 353–359.

[2] Ali, A. R., Ur Rehman, A., Nawaz, A., Ali, T. M., and Abbas, M. An Ensemble Model for Software Defect Prediction. In 2022 2nd International Conference on Digital Futures and Transformative Technologies, ICoDT2 2022 (2022), Institute of Electrical and Electronics Engineers Inc.

[3] Aljamaan, H., and Alazba, A. Software defect prediction using tree-based ensembles. In Proceedings of the 16th ACM International Conference on Predictive Models and Data Analytics in Software Engineering (New York, NY, USA, 2020), PROMISE 2020, Association for Computing Machinery, pp. 1–10.

[4] Chen, J., Xu, J., Cai, S., Wang, X., Gu, Y., and Wang, S. An efficient dual ensemble software defect prediction method with neural network. In Proceedings - 2021 IEEE International Symposium on Software Reliability Engineering Workshops, ISSREW 2021 (2021), Institute of Electrical and Electronics Engineers Inc., pp. 91–98.

[5] Democratic Republic of Ethiopia Ministry of Transport and Logistics. Investigation Report on Accident to the B737-MAX8 Reg. ET-AVJ Operated By Ethiopian Airlines. Tech. rep., Aircraft Accident Investigation Bureau, 12 2022.

[6] Fenton, N., and Neil, M. A critique of software defect prediction models. IEEE Transactions on Software Engineering 25, 5 (1999), 675–689.

[7] Gong, L., Jiang, S., and Jiang, L. Conditional Domain Adversarial Adaptation for Heterogeneous Defect Prediction. IEEE Access 8 (2020), 150738–150749.

[8] Goyal, S. Effective software defect prediction using support vector machines (SVMs). International Journal of System Assurance Engineering and Management 13, 2 (4 2022), 681–696.

[9] Ha, D.-A., Chen, T.-H., and Yuan, S.-M. Unsupervised Methods for Software Defect Prediction.
In Proceedings of the 10th International Symposium on Information and Communication Technology (New York, NY, USA, 2019), SoICT ’19, Association for Computing Machinery, pp. 49–55.

[10] Halstead, M. H. Toward a theoretical basis for estimating programming effort. In Proceedings of the 1975 Annual Conference (New York, NY, USA, 1975), ACM ’75, Association for Computing Machinery, pp. 222–224.

[11] Herbold, S. On the Costs and Profit of Software Defect Prediction. IEEE Transactions on Software Engineering 47, 11 (11 2021), 2617–2631.

[12] Herzig, K., Just, S., and Zeller, A. It’s not a bug, it’s a feature: How misclassification impacts bug prediction. In 2013 35th International Conference on Software Engineering (ICSE) (2013), pp. 392–401.

[13] Huang, J., and Ling, C. X. Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering 17, 3 (3 2005), 299–310.

[14] Huda, S., Alyahya, S., Mohsin Ali, M., Ahmad, S., Abawajy, J., Al-Dossari, H., and Yearwood, J. A Framework for Software Defect Prediction and Metric Selection. IEEE Access 6 (12 2017), 2844–2858.

[15] ISO/IEC/IEEE. ISO/IEC/IEEE International Standard - Systems and software engineering – Systems and software assurance – Part 1: Concepts and vocabulary. ISO/IEC/IEEE 15026-1:2019(E) (2019), 1–38.

[16] Jureczko, M., and Madeyski, L. Towards Identifying Software Project Clusters with Regard to Defect Prediction. In Proceedings of the 6th International Conference on Predictive Models in Software Engineering (New York, NY, USA, 2010), PROMISE ’10, Association for Computing Machinery.

[17] Kamei, Y., Shihab, E., Adams, B., Hassan, A. E., Mockus, A., Sinha, A., and Ubayashi, N. A large-scale empirical study of just-in-time quality assurance. IEEE Transactions on Software Engineering 39, 6 (2013), 757–773.

[18] Krasner, H. The Cost of Poor Software Quality in the US: A 2022 report. Tech. rep., Consortium for Information & Software Quality, 12 2022.

[19] Li, R., Zhou, L., Zhang, S., Liu, H., Huang, X., and Sun, Z. Software Defect Prediction Based on Ensemble Learning. In Proceedings of the 2019 2nd International Conference on Data Science and Information Technology (New York, NY, USA, 7 2019), vol. 6, ACM, pp. 1–6.

[20] Marjuni, A., Adji, T. B., and Ferdiana, R. Unsupervised software defect prediction using signed Laplacian-based spectral classifier. Soft Computing 23, 24 (12 2019), 13679–13690.

[21] Matthews, B. W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 405, 2 (10 1975), 442–451.

[22] McCabe, T. J. A Complexity Measure. IEEE Transactions on Software Engineering SE-2, 4 (1976), 308–320.

[23] McCabe, T. J., and Butler, C. W. Design complexity measurement and testing. Commun. ACM 32, 12 (12 1989), 1415–1425.

[24] Menzies, T., Di Stefano, J., Chapman, M., and McGill, K. Metrics that matter. In 27th Annual NASA Goddard/IEEE Software Engineering Workshop, 2002. Proceedings. (12 2002), IEEE Comput. Soc, pp. 51–57.

[25] Mori, T., and Uchihira, N. Balancing the trade-off between accuracy and interpretability in software defect prediction. Empirical Software Engineering 24, 2 (4 2019), 779–825.

[26] NezhadShokouhi, M. M., Majidi, M. A., and Rasoolzadegan, A. Software defect prediction using over-sampling and feature extraction based on Mahalanobis distance. Journal of Supercomputing 76, 1 (1 2020), 602–635.
[27] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.

[28] Peters, F., and Menzies, T. Privacy and utility for defect prediction: Experiments with MORPH. In 2012 34th International Conference on Software Engineering (ICSE) (6 2012), IEEE, pp. 189–199.

[29] Phan, V. A. Learning Stretch-Shrink Latent Representations With Autoencoder and K-Means for Software Defect Prediction. IEEE Access 10 (2022), 117827–117835.

[30] Shepperd, M., Song, Q., Sun, Z., and Mair, C. Data Quality: Some Comments on the NASA Software Defect Datasets. IEEE Transactions on Software Engineering 39, 9 (9 2013), 1208–1215.

[31] Soe, Y. N., Santosa, P. I., and Hartanto, R. Software Defect Prediction Using Random Forest Algorithm. In 2018 12th South East Asian Technical University Consortium (SEATUC) (2018), vol. 1, pp. 1–5.

[32] Song, Q., Guo, Y., and Shepperd, M. A Comprehensive Investigation of the Role of Imbalanced Learning for Software Defect Prediction. IEEE Transactions on Software Engineering 45, 12 (12 2019), 1253–1269.

[33] Srivastava, A. N., and Schumann, J. The case for software he