Two staged data preprocessing ensemble model for software fault prediction

Ehsan Elahi, Amber Ayub, Irfan Hussain

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

1 Citation (Scopus)

Abstract

Software fault prediction helps researchers and software testers identify faulty modules at an early stage. Early identification of faulty modules improves software quality and reduces cost. Imbalanced datasets hinder the performance of a software fault prediction model: the model becomes biased towards the majority class and may fail to produce useful results. Moreover, class overlap in the data leads to incorrect predictions. Both problems need to be addressed because the available datasets are highly imbalanced and overlapped. Many fault prediction models based on machine learning classifiers have been proposed in the literature, but there is still room for improvement. The main objective of this study is to train the model on balanced, non-overlapping data and thereby improve its prediction capability. To this end, the dataset is preprocessed in two stages before the model is trained. First, the class overlap problem is addressed using the neighborhood cleaning method; second, the data is balanced using random oversampling. Five publicly available datasets from the PROMISE repository are used. Four base learners are trained, and their results are combined into an ensemble using model averaging. To determine the usefulness of the proposed approach, the results are compared with those obtained using the overlap-handling method alone and the resampling technique alone, and also with an existing approach for handling imbalanced data. The experiments show that the proposed technique outperforms the compared approaches in prediction capability. The performance measure used for evaluation is the area under the curve (AUC). To reduce the effects of randomness and bias, the results are validated using k-fold cross-validation (k = 10).
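
The stages named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. Several things here are assumptions: the function names are hypothetical, the toy data is invented, the cleaning stage is a simplified reading of the neighborhood cleaning rule with k = 3, and, because the abstract does not name the four base learners, model averaging is shown over placeholder probability lists.

```python
import random
from math import dist

def k_nearest(X, i, k):
    """Indices of the k samples closest to X[i], excluding i itself."""
    others = [j for j in range(len(X)) if j != i]
    return sorted(others, key=lambda j: dist(X[i], X[j]))[:k]

def neighbourhood_clean(X, y, majority=0, k=3):
    """Stage 1 (simplified): remove majority-class samples that overlap
    with the minority class, judged by their k nearest neighbours."""
    drop = set()
    for i in range(len(X)):
        nn = k_nearest(X, i, k)
        agree = sum(1 for j in nn if y[j] == y[i])
        if agree * 2 >= k:          # neighbours mostly agree: no overlap here
            continue
        if y[i] == majority:        # misclassified majority sample: drop it
            drop.add(i)
        else:                       # misclassified minority sample:
            drop.update(j for j in nn if y[j] == majority)  # drop its majority neighbours
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

def random_oversample(X, y, rng):
    """Stage 2: duplicate randomly chosen samples of the smaller class
    until every class has as many samples as the largest one."""
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out += rows + extra
        y_out += [label] * target
    return X_out, y_out

def average_probabilities(per_model_probs):
    """Model averaging: mean fault probability across base learners."""
    return [sum(p) / len(p) for p in zip(*per_model_probs)]

def auc(y_true, scores):
    """AUC as the share of (faulty, clean) pairs ranked correctly; ties count half."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented toy data: label 0 = clean (majority), label 1 = faulty (minority).
# The point (5.2, 5.1) is a clean sample sitting inside the faulty cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
     (5, 5), (5, 6), (6, 5), (5.2, 5.1)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 0]

X_clean, y_clean = neighbourhood_clean(X, y)
X_bal, y_bal = random_oversample(X_clean, y_clean, random.Random(0))

# Placeholder outputs of two hypothetical base learners on four test modules.
ensemble = average_probabilities([[0.2, 0.5, 0.3, 0.9], [0.0, 0.3, 0.4, 0.7]])
score = auc([0, 0, 1, 1], ensemble)
```

In a k-fold evaluation, one reasonable design choice is to apply both preprocessing stages to the training folds only, so that duplicated or removed samples never influence the held-out fold; the abstract does not state how this was handled.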

Original language: English
Title of host publication: Proceedings of 18th International Bhurban Conference on Applied Sciences and Technologies, IBCAST 2021
Editors: Muhammad Zafar-Uz-Zaman, Naveed A. Siddiqui, Mazhar Iqbal, Abdur Rauf, Naeem Zafar, Usman Qayyum, Tahir Jamil, Saifullah Khan, Irfan Ali, Qaisar Ahsan, Sajjad Asghar, Mureed Hussian, Shiraz Ahmad, Muhammad Rafique, Naveed Durrani, Shafiq R. Qureshi, Syed Ali Abbas, Naveed Ahsan, Abdul Mueed
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 506-511
Number of pages: 6
ISBN (Electronic): 9780738105352
DOIs
Publication status: Published - 12 Jan 2021
Externally published: Yes
Event: 18th International Bhurban Conference on Applied Sciences and Technologies, IBCAST 2021 - Virtual, Islamabad, Pakistan
Duration: 12 Jan 2021 to 16 Jan 2021

Publication series

Name: Proceedings of 18th International Bhurban Conference on Applied Sciences and Technologies, IBCAST 2021

Conference

Conference: 18th International Bhurban Conference on Applied Sciences and Technologies, IBCAST 2021
Country/Territory: Pakistan
City: Virtual, Islamabad
Period: 12/01/2021 to 16/01/2021

Keywords

  • class overlapping
  • ensemble method
  • random oversampling
  • Software fault prediction
