An enhanced resampling technique for imbalanced data sets

A data set is considered imbalanced if the instances in one class (the majority class) outnumber those in the other class (the minority class). The main problem with binary imbalanced data sets is that classifiers tend to ignore the minority class. Numerous resampling techniques such as undersampl...

Bibliographic Details
Main Author: Maisarah, Zorkeflee
Format: Thesis
Language: en
Published: 2015
Subjects: QA76.76 Fuzzy System.
Online Access: https://etd.uum.edu.my/5330/1/s814594.pdf
https://etd.uum.edu.my/5330/2/s814594_abstract.pdf
https://etd.uum.edu.my/5330/
_version_ 1833436498504974336
author Maisarah, Zorkeflee
building UUM Library
collection Institutional Repository
content_provider Universiti Utara Malaysia
content_source UUM Electronic Theses
continent Asia
country Malaysia
description A data set is considered imbalanced if the instances in one class (the majority class) outnumber those in the other class (the minority class). The main problem with binary imbalanced data sets is that classifiers tend to ignore the minority class. Numerous resampling techniques, such as undersampling, oversampling, and a combination of both, have been widely used. However, undersampling and oversampling suffer from the elimination and addition of relevant data, respectively, which may lead to poor classification results. Hence, this study aims to improve classification metrics by enhancing the undersampling technique and combining it with an existing oversampling technique. To achieve this objective, Fuzzy Distance-based Undersampling (FDUS) is proposed. Entropy estimation is used to produce fuzzy thresholds that categorise the instances in the majority and minority classes into membership functions. FDUS is then combined with the Synthetic Minority Oversampling TEchnique (SMOTE); the combination, known as FDUS+SMOTE, is executed in sequence until a balanced data set is achieved. FDUS and FDUS+SMOTE are compared with four techniques based on classification accuracy, F-measure and G-mean. FDUS achieved better classification accuracy, F-measure and G-mean than the other techniques, with averages of 80.57%, 0.85 and 0.78, respectively. This showed that fuzzy logic, when incorporated into the Distance-based Undersampling technique, was able to reduce the elimination of relevant data. Further, the findings showed that FDUS+SMOTE performed better than the combination of SMOTE and Tomek Links, and of SMOTE and Edited Nearest Neighbour, on benchmark data sets. FDUS+SMOTE minimised the removal of relevant data from the majority class and avoided overfitting. On average, FDUS and FDUS+SMOTE were able to balance categorical, integer and real data sets and enhanced the performance of binary classification. Furthermore, the techniques performed well on small data sets containing approximately 100 to 800 instances.
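The three evaluation measures named in the description (classification accuracy, F-measure and G-mean) can be illustrated with a minimal sketch. It assumes the common definitions: F-measure as the harmonic mean of precision and recall, and G-mean as the square root of sensitivity times specificity. The record does not state which exact formulas the thesis uses, so these definitions, the function name `binary_metrics`, and the toy labels are assumptions made for illustration only.

```python
import math


def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, f_measure, g_mean) for binary labels.

    Assumed definitions (not taken from the thesis):
      f_measure = 2 * precision * recall / (precision + recall)
      g_mean    = sqrt(sensitivity * specificity)
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    g_mean = math.sqrt(recall * specificity)
    return accuracy, f_measure, g_mean


# Toy imbalanced set: 8 majority instances (0) and 2 minority instances (1).
# A classifier that ignores the minority class predicts 0 for everything.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
acc, f1, g = binary_metrics(y_true, y_pred)
print(acc, f1, g)
```

In this toy case accuracy stays high while F-measure and G-mean drop to zero, which is one reason all three measures are commonly reported together for imbalanced data.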
format Thesis
id my.uum.etd-5330
institution Universiti Utara Malaysia
language en
publishDate 2015
record_format eprints
spelling my.uum.etd-5330 2021-04-04T07:31:37Z https://etd.uum.edu.my/5330/ An enhanced resampling technique for imbalanced data sets Maisarah, Zorkeflee QA76.76 Fuzzy System. 2015 Thesis NonPeerReviewed text en https://etd.uum.edu.my/5330/1/s814594.pdf text en https://etd.uum.edu.my/5330/2/s814594_abstract.pdf Maisarah, Zorkeflee (2015) An enhanced resampling technique for imbalanced data sets. Masters thesis, Universiti Utara Malaysia.
title An enhanced resampling technique for imbalanced data sets
topic QA76.76 Fuzzy System.
url https://etd.uum.edu.my/5330/1/s814594.pdf
https://etd.uum.edu.my/5330/2/s814594_abstract.pdf
https://etd.uum.edu.my/5330/
url_provider http://etd.uum.edu.my/