Imputation of missing data using masked denoising autoencoder with L2-norm regularization in software effort estimation


Bibliographic Details
Main Authors: Marco, Robert, Syed Ahmad, Sharifah Sakinah
Format: Article
Language: en
Published: Intelligent Network and Systems Society 2024
Online Access: http://eprints.utem.edu.my/id/eprint/28448/2/00896311220241139231570.pdf
http://eprints.utem.edu.my/id/eprint/28448/
https://oaji.net/articles/2023/3603-1719548951.pdf
Description
Summary: A frequent problem in building initial software effort estimation (SEE) models is the presence of many missing values in historical software engineering datasets, often caused by damage to software project data through human intervention. The loss of information and the bias introduced into data analysis by missing data are serious problems. This study proposes a method for estimating missing data using a masked denoising autoencoder (MaskedDAE) with L2-norm regularization, which can handle various data types, missing-data patterns, proportions, and distributions. The Cocomo81 and ISBSG-IFPUG datasets from open-source repositories were used. The experiment involved five missing-data techniques, eight missing-data rates (from 10% to 80%), and two missingness mechanisms (MCAR: missing completely at random, and MNAR: missing not at random). The results show that the proposed MaskedDAE method achieves the best imputation performance in terms of imputation error, outperforming DAE, k-nearest neighbor imputation (kNNI), random forest (RF) imputation, multiple imputation by chained equations (MICE), mean imputation, and mode imputation. We find that the prediction error rate increases with the missing-data rate. Furthermore, prediction errors under the MCAR mechanism are lower than those under MNAR. Nevertheless, our method reduces model variance, which results in lower generalization error.
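The abstract only sketches the approach. The following is a minimal, hypothetical NumPy sketch of the general masked-DAE imputation idea, not the authors' implementation: zero-fill missing cells, train a small denoising autoencoder with an L2-norm weight penalty that reconstructs only the observed entries, then fill the missing cells with the network's output. The dataset, network sizes, and hyperparameters here are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy correlated data standing in for a SEE dataset (hypothetical, not Cocomo81/ISBSG).
n, d, k = 200, 8, 3
X_true = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))

# MCAR missingness: each cell is missing independently with probability 0.3.
miss = rng.random((n, d)) < 0.3
obs = ~miss
X_obs = np.where(miss, 0.0, X_true)          # zero-fill missing cells as initial guess

# One-hidden-layer denoising autoencoder with an L2-norm weight penalty.
h, lr, lam = 16, 0.05, 1e-3
W1 = rng.normal(scale=0.1, size=(d, h)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=(h, d)); b2 = np.zeros(d)

losses = []
for _ in range(300):
    # Denoising corruption: additionally mask ~20% of the inputs each epoch.
    X_in = np.where(rng.random((n, d)) < 0.2, 0.0, X_obs)
    H = np.tanh(X_in @ W1 + b1)
    X_hat = H @ W2 + b2
    err = (X_hat - X_obs) * obs              # reconstruction loss on observed cells only
    losses.append(0.5 * (err ** 2).sum() / n)
    # Gradient descent on the masked squared error plus the L2 penalty.
    dH = (err @ W2.T) * (1.0 - H ** 2)
    W2 -= lr * (H.T @ err / n + lam * W2); b2 -= lr * err.mean(axis=0)
    W1 -= lr * (X_in.T @ dH / n + lam * W1); b1 -= lr * dH.mean(axis=0)

# Impute: keep observed values, fill missing cells with the reconstruction.
X_rec = np.tanh(X_obs @ W1 + b1) @ W2 + b2
X_imp = np.where(miss, X_rec, X_obs)
```

In the paper's setting, this imputation step would precede fitting the effort-estimation model; replacing the uniform MCAR mask with a rule that depends on the data values would emulate the MNAR mechanism.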