TY - JOUR
T1 - COSTE: Complexity-based OverSampling TEchnique to alleviate the class imbalance problem in software defect prediction
AU - Feng, Shuo
AU - Keung, Jacky
AU - Yu, Xiao
AU - Xiao, Yan
AU - Bennin, Kwabena Ebo
AU - Kabir, Md Alamgir
AU - Zhang, Miao
PY - 2021/1
Y1 - 2021/1
N2 - Context: Generally, there are more non-defective instances than defective instances in the datasets used for software defect prediction (SDP), which is referred to as the class imbalance problem. Oversampling techniques are frequently adopted to alleviate the problem by generating new synthetic defective instances. Existing techniques generate either near-duplicated instances which result in overgeneralization (high probability of false alarm, pf) or overly diverse instances which hurt the prediction model's ability to find defects (resulting in low probability of detection, pd). Furthermore, when existing oversampling techniques are applied in SDP, the effort needed to inspect the instances with different complexity is not taken into consideration. Objective: In this study, we introduce Complexity-based OverSampling TEchnique (COSTE), a novel oversampling technique that can achieve low pf and high pd simultaneously. Meanwhile, COSTE also performs better in terms of Norm(popt) and ACC, two effort-aware measures that consider the testing effort. Method: COSTE combines pairs of defective instances with similar complexity to generate synthetic instances, which improves the diversity within the data, maintains the ability of prediction models to find defects, and takes the different testing effort needed for different instances into consideration. We conduct experiments to compare COSTE with Synthetic Minority Oversampling TEchnique, Borderline-SMOTE, Majority Weighted Minority Oversampling TEchnique and MAHAKIL. Results: The experimental results on 23 releases of 10 projects show that COSTE greatly improves the diversity of the synthetic instances without compromising the ability of prediction models to find defects. In addition, COSTE outperforms the other oversampling techniques under the same testing effort. The statistical analysis indicates that COSTE's ability to outperform the other oversampling techniques is significant under the statistical Wilcoxon rank sum test and Cliff's effect size. Conclusion: COSTE is recommended as an efficient alternative to address the class imbalance problem in SDP.
AB - Context: Generally, there are more non-defective instances than defective instances in the datasets used for software defect prediction (SDP), which is referred to as the class imbalance problem. Oversampling techniques are frequently adopted to alleviate the problem by generating new synthetic defective instances. Existing techniques generate either near-duplicated instances which result in overgeneralization (high probability of false alarm, pf) or overly diverse instances which hurt the prediction model's ability to find defects (resulting in low probability of detection, pd). Furthermore, when existing oversampling techniques are applied in SDP, the effort needed to inspect the instances with different complexity is not taken into consideration. Objective: In this study, we introduce Complexity-based OverSampling TEchnique (COSTE), a novel oversampling technique that can achieve low pf and high pd simultaneously. Meanwhile, COSTE also performs better in terms of Norm(popt) and ACC, two effort-aware measures that consider the testing effort. Method: COSTE combines pairs of defective instances with similar complexity to generate synthetic instances, which improves the diversity within the data, maintains the ability of prediction models to find defects, and takes the different testing effort needed for different instances into consideration. We conduct experiments to compare COSTE with Synthetic Minority Oversampling TEchnique, Borderline-SMOTE, Majority Weighted Minority Oversampling TEchnique and MAHAKIL. Results: The experimental results on 23 releases of 10 projects show that COSTE greatly improves the diversity of the synthetic instances without compromising the ability of prediction models to find defects. In addition, COSTE outperforms the other oversampling techniques under the same testing effort. The statistical analysis indicates that COSTE's ability to outperform the other oversampling techniques is significant under the statistical Wilcoxon rank sum test and Cliff's effect size. Conclusion: COSTE is recommended as an efficient alternative to address the class imbalance problem in SDP.
KW - Class imbalance
KW - Effort-aware defect prediction
KW - MAHAKIL
KW - Oversampling
KW - SMOTE
KW - Software defect prediction
UR - https://doi.org/10.17632/nr29ttcxfx
U2 - 10.1016/j.infsof.2020.106432
DO - 10.1016/j.infsof.2020.106432
M3 - Article
AN - SCOPUS:85091712466
SN - 0950-5849
VL - 129
JO - Information and Software Technology
JF - Information and Software Technology
M1 - 106432
ER -