Staff View: Addressing imbalance in health datasets: A new method NR-clustering SMOTE and distance metric modification

Addressing imbalance in health datasets: A new method NR-clustering SMOTE and distance metric modification

Imbalanced datasets are a common challenge in machine learning, particularly for classification methods. Underrepresented minority classes can result in biased and inaccurate models. The Synthetic Minority Over-Sampling Technique (SMOTE) was developed to address the problem of imbalanced data. Over time, sever...

Bibliographic Details
Main Authors:	Hairani, Hairani; Widiyaningtyas, Triyanna; Prasetya, Didik Dwi; Afrig, Aminuddin
Format:	Article
Language:	English
Published:	Tech Science Press 2025
Subjects:	QA75 Electronic computers. Computer science
Online Access:	http://umpir.ump.edu.my/id/eprint/44043/1/Addressing%20imbalance%20in%20health%20datasets.pdf http://umpir.ump.edu.my/id/eprint/44043/ https://doi.org/10.32604/cmc.2024.060837

id	my.ump.umpir.44043
record_format	eprints
spelling	my.ump.umpir.44043 2025-03-11T06:51:15Z http://umpir.ump.edu.my/id/eprint/44043/ Addressing imbalance in health datasets: A new method NR-clustering SMOTE and distance metric modification Hairani, Hairani Widiyaningtyas, Triyanna Prasetya, Didik Dwi Afrig, Aminuddin QA75 Electronic computers. Computer science Imbalanced datasets are a common challenge in machine learning, particularly for classification methods. Underrepresented minority classes can result in biased and inaccurate models. The Synthetic Minority Over-Sampling Technique (SMOTE) was developed to address the problem of imbalanced data. Over time, several weaknesses of SMOTE in generating synthetic minority class data have been identified, such as overlapping, noise, and small disjuncts. However, existing studies generally address only one of these weaknesses, either noise or overlapping. This study therefore addresses noise and overlapping in SMOTE-generated data simultaneously. It proposes a combined approach of filtering, clustering, and distance modification to reduce the noise and overlapping produced by SMOTE. Filtering uses the k-NN method to remove minority class data (noise) located in majority class regions. Noise Reduction (NR), which removes data considered noise before SMOTE is applied, has a positive impact on overcoming data imbalance. Clustering establishes decision boundaries by partitioning data into clusters, allowing SMOTE with modified distance metrics to generate minority class data within each cluster. This SMOTE clustering and distance modification approach aims to minimize overlap in synthetic minority data that could introduce noise. 
The proposed method, called “NR-Clustering SMOTE,” balances data in three stages: (1) filtering, which removes minority class data close to the majority class (noise) using the k-NN method; (2) clustering with K-means, which establishes decision boundaries by partitioning the data into several clusters; and (3) SMOTE oversampling with Manhattan distance within each cluster. Test results indicate that the proposed NR-Clustering SMOTE method achieves the best performance across all evaluation metrics for classification methods such as Random Forest, SVM, and Naïve Bayes, compared to the original data and traditional SMOTE. The proposed method (NR-Clustering SMOTE) improves accuracy by 15.34% on the Pima dataset and 20.96% on the Haberman dataset compared to SMOTE-LOF. Compared to Radius-SMOTE, this method increases accuracy by 3.16% on the Pima dataset and 13.24% on the Haberman dataset. Meanwhile, compared to RN-SMOTE, the accuracy improvement reaches 15.56% on the Pima dataset and 19.84% on the Haberman dataset. These results suggest that the proposed method consistently improves on traditional SMOTE and its recent variants, such as SMOTE-LOF, Radius-SMOTE, and RN-SMOTE, on imbalanced health data with binary classes. Tech Science Press 2025-02-17 Article PeerReviewed pdf en cc_by_4 http://umpir.ump.edu.my/id/eprint/44043/1/Addressing%20imbalance%20in%20health%20datasets.pdf Hairani, Hairani and Widiyaningtyas, Triyanna and Prasetya, Didik Dwi and Afrig, Aminuddin (2025) Addressing imbalance in health datasets: A new method NR-clustering SMOTE and distance metric modification. Computers, Materials and Continua, 82 (2). pp. 2931-2949. ISSN 1546-2218. (Published) https://doi.org/10.32604/cmc.2024.060837
institution	Universiti Malaysia Pahang Al-Sultan Abdullah
building	UMPSA Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Malaysia Pahang Al-Sultan Abdullah
content_source	UMPSA Institutional Repository
url_provider	http://umpir.ump.edu.my/
language	English
topic	QA75 Electronic computers. Computer science
spellingShingle	QA75 Electronic computers. Computer science Hairani, Hairani Widiyaningtyas, Triyanna Prasetya, Didik Dwi Afrig, Aminuddin Addressing imbalance in health datasets: A new method NR-clustering SMOTE and distance metric modification
description	Imbalanced datasets are a common challenge in machine learning, particularly for classification methods. Underrepresented minority classes can result in biased and inaccurate models. The Synthetic Minority Over-Sampling Technique (SMOTE) was developed to address the problem of imbalanced data. Over time, several weaknesses of SMOTE in generating synthetic minority class data have been identified, such as overlapping, noise, and small disjuncts. However, existing studies generally address only one of these weaknesses, either noise or overlapping. This study therefore addresses noise and overlapping in SMOTE-generated data simultaneously. It proposes a combined approach of filtering, clustering, and distance modification to reduce the noise and overlapping produced by SMOTE. Filtering uses the k-NN method to remove minority class data (noise) located in majority class regions. Noise Reduction (NR), which removes data considered noise before SMOTE is applied, has a positive impact on overcoming data imbalance. Clustering establishes decision boundaries by partitioning data into clusters, allowing SMOTE with modified distance metrics to generate minority class data within each cluster. This SMOTE clustering and distance modification approach aims to minimize overlap in synthetic minority data that could introduce noise. The proposed method, called “NR-Clustering SMOTE,” balances data in three stages: (1) filtering, which removes minority class data close to the majority class (noise) using the k-NN method; (2) clustering with K-means, which establishes decision boundaries by partitioning the data into several clusters; and (3) SMOTE oversampling with Manhattan distance within each cluster. 
Test results indicate that the proposed NR-Clustering SMOTE method achieves the best performance across all evaluation metrics for classification methods such as Random Forest, SVM, and Naïve Bayes, compared to the original data and traditional SMOTE. The proposed method (NR-Clustering SMOTE) improves accuracy by 15.34% on the Pima dataset and 20.96% on the Haberman dataset compared to SMOTE-LOF. Compared to Radius-SMOTE, this method increases accuracy by 3.16% on the Pima dataset and 13.24% on the Haberman dataset. Meanwhile, compared to RN-SMOTE, the accuracy improvement reaches 15.56% on the Pima dataset and 19.84% on the Haberman dataset. These results suggest that the proposed method consistently improves on traditional SMOTE and its recent variants, such as SMOTE-LOF, Radius-SMOTE, and RN-SMOTE, on imbalanced health data with binary classes.
format	Article
author	Hairani, Hairani Widiyaningtyas, Triyanna Prasetya, Didik Dwi Afrig, Aminuddin
author_facet	Hairani, Hairani Widiyaningtyas, Triyanna Prasetya, Didik Dwi Afrig, Aminuddin
author_sort	Hairani, Hairani
title	Addressing imbalance in health datasets: A new method NR-clustering SMOTE and distance metric modification
title_short	Addressing imbalance in health datasets: A new method NR-clustering SMOTE and distance metric modification
title_full	Addressing imbalance in health datasets: A new method NR-clustering SMOTE and distance metric modification
title_fullStr	Addressing imbalance in health datasets: A new method NR-clustering SMOTE and distance metric modification
title_full_unstemmed	Addressing imbalance in health datasets: A new method NR-clustering SMOTE and distance metric modification
title_sort	addressing imbalance in health datasets: a new method nr-clustering smote and distance metric modification
publisher	Tech Science Press
publishDate	2025
url	http://umpir.ump.edu.my/id/eprint/44043/1/Addressing%20imbalance%20in%20health%20datasets.pdf http://umpir.ump.edu.my/id/eprint/44043/ https://doi.org/10.32604/cmc.2024.060837
_version_	1827518435267969024
score	13.251813
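
Illustrative sketch (not part of the catalog record): the three stages named in the description field (k-NN filtering, K-means clustering, SMOTE with Manhattan distance inside each cluster) can be outlined roughly as follows. This is not the authors' implementation. The noise rule (a minority point is dropped when all of its k nearest neighbours are majority points), clustering only the minority points, the proportional split of synthetic samples across clusters, and every function and parameter name are assumptions made for illustration; the article itself should be consulted for the actual procedure.

```python
import random

def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length points."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_noise_filter(X, y, minority=1, k=3):
    """Stage 1: drop minority points whose k nearest neighbours are all majority.
    (Assumed noise rule; the record does not state the exact criterion.)"""
    keep = []
    for i, xi in enumerate(X):
        if y[i] == minority:
            others = [j for j in range(len(X)) if j != i]
            nearest = sorted(others, key=lambda j: manhattan(xi, X[j]))[:k]
            if all(y[j] != minority for j in nearest):
                continue  # surrounded by majority points: treat as noise
        keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

def kmeans(points, n_clusters, rng, n_iter=20):
    """Stage 2: plain Lloyd's K-means; returns one cluster label per point."""
    centers = rng.sample(points, n_clusters)
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(n_clusters),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                  for p in points]
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centre
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def smote_in_cluster(members, n_new, rng, k=3):
    """Stage 3: SMOTE interpolation towards a Manhattan-nearest cluster neighbour."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(members)
        nbrs = sorted((m for m in members if m is not base),
                      key=lambda m: manhattan(base, m))[:k]
        nb = rng.choice(nbrs)
        gap = rng.random()  # position on the segment between base and neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, nb)])
    return synthetic

def nr_clustering_smote(X, y, minority=1, n_clusters=2, k=3, seed=0):
    """Filter noise, cluster the minority class, then oversample per cluster."""
    rng = random.Random(seed)
    X, y = knn_noise_filter(X, y, minority, k)
    minority_pts = [x for x, lab in zip(X, y) if lab == minority]
    deficit = (len(y) - len(minority_pts)) - len(minority_pts)
    if deficit <= 0 or len(minority_pts) < 2:
        return X, y
    n_clusters = min(n_clusters, len(minority_pts))
    labels = kmeans(minority_pts, n_clusters, rng)
    clusters = [[p for p, lab in zip(minority_pts, labels) if lab == c]
                for c in range(n_clusters)]
    usable = [c for c in clusters if len(c) >= 2]  # interpolation needs 2+ points
    if not usable:
        return X, y
    total = sum(len(c) for c in usable)
    new_points, remaining = [], deficit
    for i, c in enumerate(usable):
        # Proportional share per cluster; the last cluster absorbs the remainder.
        n_new = remaining if i == len(usable) - 1 else deficit * len(c) // total
        new_points += smote_in_cluster(c, n_new, rng, k)
        remaining -= n_new
    return X + new_points, y + [minority] * len(new_points)
```

Calling `nr_clustering_smote(X, y)` with `X` as a list of numeric feature lists and `y` as a list of 0/1 labels returns the filtered data plus enough synthetic minority points to match the majority count. Because each synthetic point is interpolated between two members of the same cluster, it stays inside that cluster's region, which is the mechanism the description credits with reducing overlap.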

Similar Items