An enhanced resampling technique for imbalanced data sets
A data set is considered imbalanced if the instances in one class (the majority class) outnumber those in the other class (the minority class). The main problem with binary imbalanced data sets is that classifiers tend to ignore the minority class. Numerous resampling techniques such as undersampl...
| Main Author: | Maisarah, Zorkeflee |
|---|---|
| Format: | Thesis |
| Language: | en |
| Published: | 2015 |
| Subjects: | QA76.76 Fuzzy System. |
| Online Access: | https://etd.uum.edu.my/5330/1/s814594.pdf https://etd.uum.edu.my/5330/2/s814594_abstract.pdf https://etd.uum.edu.my/5330/ |
| _version_ | 1833436498504974336 |
|---|---|
| author | Maisarah, Zorkeflee |
| author_facet | Maisarah, Zorkeflee |
| author_sort | Maisarah, Zorkeflee |
| building | UUM Library |
| collection | Institutional Repository |
| content_provider | Universiti Utara Malaysia |
| content_source | UUM Electronic Theses |
| continent | Asia |
| country | Malaysia |
| description | A data set is considered imbalanced if the instances in one class (the majority class) outnumber those in the other class (the minority class). The main problem with binary imbalanced data sets is that classifiers tend to ignore the minority class. Numerous resampling techniques, such as undersampling, oversampling, and combinations of both, have been widely used. However, undersampling and oversampling suffer from the elimination and addition of relevant data, which may lead to poor classification results. Hence, this study aims to improve classification metrics by enhancing the undersampling technique and combining it with an existing oversampling technique. To achieve this objective, Fuzzy Distance-based Undersampling (FDUS) is proposed. Entropy estimation is used to produce fuzzy thresholds that categorise the instances in the majority and minority classes into membership functions. FDUS is then combined with the Synthetic Minority Oversampling TEchnique (SMOTE), a combination known as FDUS+SMOTE, in which the two are executed in sequence until a balanced data set is achieved. FDUS and FDUS+SMOTE are compared with four techniques based on classification accuracy, F-measure and G-mean. FDUS achieved better classification accuracy, F-measure and G-mean than the other techniques, with averages of 80.57%, 0.85 and 0.78, respectively. This showed that fuzzy logic, when incorporated into the Distance-based Undersampling technique, was able to reduce the elimination of relevant data. Further, the findings showed that FDUS+SMOTE performed better than the combination of SMOTE and Tomek Links, and that of SMOTE and Edited Nearest Neighbour, on benchmark data sets. FDUS+SMOTE minimised the removal of relevant data from the majority class and avoided overfitting. On average, FDUS and FDUS+SMOTE were able to balance categorical, integer and real data sets and enhanced the performance of binary classification. Furthermore, the techniques performed well on small data sets with approximately 100 to 800 instances. |
| format | Thesis |
| id | my.uum.etd-5330 |
| institution | Universiti Utara Malaysia |
| language | en |
| publishDate | 2015 |
| record_format | eprints |
| spelling | my.uum.etd-5330 2021-04-04T07:31:37Z https://etd.uum.edu.my/5330/ An enhanced resampling technique for imbalanced data sets Maisarah, Zorkeflee QA76.76 Fuzzy System. 2015 Thesis NonPeerReviewed text en https://etd.uum.edu.my/5330/1/s814594.pdf text en https://etd.uum.edu.my/5330/2/s814594_abstract.pdf Maisarah, Zorkeflee (2015) An enhanced resampling technique for imbalanced data sets. Masters thesis, Universiti Utara Malaysia. |
| spellingShingle | QA76.76 Fuzzy System. Maisarah, Zorkeflee An enhanced resampling technique for imbalanced data sets |
| title | An enhanced resampling technique for imbalanced data sets |
| title_full | An enhanced resampling technique for imbalanced data sets |
| title_fullStr | An enhanced resampling technique for imbalanced data sets |
| title_full_unstemmed | An enhanced resampling technique for imbalanced data sets |
| title_short | An enhanced resampling technique for imbalanced data sets |
| title_sort | enhanced resampling technique for imbalanced data sets |
| topic | QA76.76 Fuzzy System. |
| url | https://etd.uum.edu.my/5330/1/s814594.pdf https://etd.uum.edu.my/5330/2/s814594_abstract.pdf https://etd.uum.edu.my/5330/ |
| url_provider | http://etd.uum.edu.my/ |
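
The abstract compares techniques on classification accuracy, F-measure and G-mean. As a rough illustration of why accuracy alone can mislead on imbalanced data, the sketch below computes all three from a binary confusion matrix using their standard textbook definitions. It is not taken from the thesis: the function name `binary_metrics`, the choice of the minority class as the positive class, and the example counts are assumptions made here for illustration.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Return (accuracy, f_measure, g_mean) from confusion-matrix counts.

    Assumption: the minority class is the positive class.
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    # G-mean: geometric mean of sensitivity and specificity
    g_mean = math.sqrt(recall * specificity)
    return accuracy, f_measure, g_mean


# Hypothetical counts: 10 minority and 90 majority instances.
# A classifier that ignores the minority class entirely:
ignore_minority = binary_metrics(tp=0, fn=10, fp=0, tn=90)
# A classifier that recovers most of the minority class:
finds_minority = binary_metrics(tp=8, fn=2, fp=10, tn=80)

print(ignore_minority)
print(finds_minority)
```

With these made-up counts, the classifier that ignores the minority class still reaches 90% accuracy while its F-measure and G-mean are both zero, which is the behaviour the abstract describes as the main problem with imbalanced data sets.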
