Imputation of missing data using masked denoising autoencoder with L2-norm regularization in software effort estimation
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | en |
| Published: | Intelligent Network and Systems Society, 2024 |
| Online Access: | http://eprints.utem.edu.my/id/eprint/28448/2/00896311220241139231570.pdf http://eprints.utem.edu.my/id/eprint/28448/ https://oaji.net/articles/2023/3603-1719548951.pdf |
| Summary: | A frequent problem in building initial software effort estimation (SEE) models is the presence of many missing values in historical software engineering datasets, often caused by damage to software project data through human intervention. The loss of information and the bias in data analysis caused by missing data are serious problems. This study proposes a method to estimate missing data using a masked denoising autoencoder (Mask-DAE) with L2-norm regularization, which can handle various data types, missing patterns, proportions, and distributions. The Cocomo81 and ISBSG-IFPUG datasets from open-source repositories were used. The experiment involved five missing-data techniques, eight missing-data rates (from 10% to 80%), and two missingness mechanisms (MCAR: missing completely at random; MNAR: missing not at random). The results show that the proposed Mask-DAE method achieves the best imputation performance in terms of imputation error, outperforming DAE, k-nearest neighbor imputation (kNNI), random forest (RF) imputation, multiple imputation by chained equations (MICE), mean imputation, and mode imputation. We find that the prediction error rate increases with the missing-data rate. Furthermore, prediction errors under the MCAR mechanism are lower than those under MNAR. Nevertheless, our method reduces model variance, which results in lower generalization error. |
|---|---|
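The summary describes the general recipe of masked denoising-autoencoder imputation: fill missing entries with an initial guess, corrupt (mask) random inputs, train a network with an L2 weight penalty to reconstruct the uncorrupted data, then keep only the reconstructed values at the originally missing positions. The sketch below is not the authors' code; it is a minimal illustration of that idea using scikit-learn's `MLPRegressor` as a stand-in autoencoder (its `alpha` parameter is the L2 penalty), on synthetic data with simulated MCAR missingness. All names and parameter values here are illustrative assumptions.

```python
# Hedged sketch of masked denoising-autoencoder imputation with L2 regularization.
# Not the paper's implementation; a minimal stand-in using scikit-learn.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Toy complete dataset (stand-in for Cocomo81 / ISBSG-IFPUG feature matrices).
X_full = rng.normal(size=(200, 5))
X_full[:, 1] = 0.5 * X_full[:, 0] + 0.1 * rng.normal(size=200)  # correlated column

# Simulate MCAR missingness at a 20% rate.
mask_missing = rng.random(X_full.shape) < 0.2
X_obs = X_full.copy()
X_obs[mask_missing] = np.nan

def masked_dae_impute(X, hidden=16, alpha=1e-3, noise_rate=0.2, seed=0):
    """Impute NaNs with a denoising-autoencoder-style MLP.

    `alpha` is scikit-learn's L2 (weight-decay) penalty, playing the role of
    the L2-norm regularization described in the summary.
    """
    local_rng = np.random.default_rng(seed)
    col_means = np.nanmean(X, axis=0)
    X0 = np.where(np.isnan(X), col_means, X)        # initial mean fill
    # "Masked/denoising" step: randomly zero out entries of the input.
    corrupt = local_rng.random(X0.shape) < noise_rate
    X_noisy = np.where(corrupt, 0.0, X0)
    ae = MLPRegressor(hidden_layer_sizes=(hidden,), alpha=alpha,
                      max_iter=2000, random_state=seed)
    ae.fit(X_noisy, X0)                             # reconstruct clean from corrupted
    X_hat = ae.predict(X0)
    # Keep observed values; replace only the originally missing entries.
    return np.where(np.isnan(X), X_hat, X)

X_imputed = masked_dae_impute(X_obs)
rmse = np.sqrt(np.mean((X_imputed[mask_missing] - X_full[mask_missing]) ** 2))
```

Because the ground truth is known for the simulated missing cells, the RMSE on those cells mirrors the imputation-error comparison reported in the summary; in the paper this is computed against held-out true values under both MCAR and MNAR patterns.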
