Data sampling methods on imbalanced datasets for pneumonia detection in covid-19 patients
Data classification is an important part of real-world decision support, and it can be severely affected by an imbalanced class distribution in the training data, especially in the medical field. Medical datasets mostly suffer from the imbalanced-dataset problem whi...
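The effect of class imbalance on evaluation can be illustrated with a small calculation. The sketch below is not taken from the thesis: the function name `classification_metrics` and the counts (90 pneumonia images, 10 normal images) are hypothetical and chosen only to show why accuracy alone can mislead when one class dominates.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical test set: 90 pneumonia images and 10 normal images.
# A degenerate classifier labels every image "pneumonia".

# View 1: pneumonia is the positive class.
pneumonia_view = classification_metrics(tp=90, fp=10, fn=0, tn=0)

# View 2: normal is the positive class (the same predictions, relabelled).
normal_view = classification_metrics(tp=0, fp=0, fn=10, tn=90)

print(pneumonia_view)
print(normal_view)
```

Both views report the same accuracy, yet the second shows that no normal image was ever recognised, which is why precision, recall and F1-score are usually reported alongside accuracy on imbalanced data.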
Main Author: | Dzulkefli, Syasya Farina |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2022 |
Subjects: | TK Electrical engineering. Electronics Nuclear engineering |
Online Access: | http://eprints.utm.my/id/eprint/102725/1/SyasyaFarinaDzulkefliMSKE2022.pdf http://eprints.utm.my/id/eprint/102725/ http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:149741 |
id |
my.utm.102725 |
---|---|
record_format |
eprints |
spelling |
my.utm.102725 2023-09-20T03:24:37Z http://eprints.utm.my/id/eprint/102725/ Data sampling methods on imbalanced datasets for pneumonia detection in covid-19 patients Dzulkefli, Syasya Farina TK Electrical engineering. Electronics Nuclear engineering 2022 Thesis NonPeerReviewed application/pdf en http://eprints.utm.my/id/eprint/102725/1/SyasyaFarinaDzulkefliMSKE2022.pdf Dzulkefli, Syasya Farina (2022) Data sampling methods on imbalanced datasets for pneumonia detection in covid-19 patients. Masters thesis, Universiti Teknologi Malaysia. http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:149741 |
institution |
Universiti Teknologi Malaysia |
building |
UTM Library |
collection |
Institutional Repository |
continent |
Asia |
country |
Malaysia |
content_provider |
Universiti Teknologi Malaysia |
content_source |
UTM Institutional Repository |
url_provider |
http://eprints.utm.my/ |
language |
English |
topic |
TK Electrical engineering. Electronics Nuclear engineering |
description |
Data classification is an important part of real-world decision support, and it can be severely affected by an imbalanced class distribution in the training data, especially in the medical field. Medical datasets mostly suffer from the imbalanced-dataset problem, here consisting of a minority of normal samples and a majority of abnormal samples. A current example is the outbreak of the novel coronavirus disease (COVID-19), which began in late 2019 and is still ongoing: new variants are discovered from time to time, which can increase the number of cases around the world. Medical staff can identify patients by checking their symptoms, and one common COVID-19 symptom, pneumonia, is investigated in this research. It is important to detect pneumonia quickly at an early stage to prevent it from becoming more severe. Chest X-ray images can therefore be considered one of the confirmatory approaches, as they are fast to obtain and easily accessible. Diagnosing diseases in general is a significant application of data analysis in medical science. In this research, data sampling methods are explored and implemented for pneumonia detection on imbalanced datasets. An imbalanced dataset of pneumonia X-ray images is obtained from Kaggle, and different existing data sampling methods, as well as newly proposed methods obtained by combining or modifying existing methods, are implemented to balance the images between the majority and minority classes. After a balanced dataset is obtained, a CNN model is implemented to benchmark detection performance in terms of the confusion matrix, precision, accuracy, F1-score, and recall for each method, and the results are compared to determine which method gives the highest accuracy in detecting pneumonia. The best undersampling method is NearMiss with 85.47% accuracy, the best oversampling method is data augmentation with 88.78% accuracy, and the best combination method is SMOTE + Tomek with 83.20% accuracy, compared with 79% accuracy when no sampling method is applied to the imbalanced dataset. Implementing data sampling methods boosts the performance of data classification across applications, especially in detecting pneumonia in COVID-19 patients. |
format |
Thesis |
author |
Dzulkefli, Syasya Farina |
title |
Data sampling methods on imbalanced datasets for pneumonia detection in covid-19 patients |
publishDate |
2022 |
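The description field names NearMiss undersampling and a SMOTE + Tomek combination. As a rough illustration of what these methods do, the following is a minimal sketch in plain Python on toy 2-D points. It is not the thesis implementation: the record does not say which NearMiss variant, library, or feature representation was used, so the NearMiss-1 rule, the function names (`near_miss_1`, `smote_like`, `tomek_links`) and the sample points are all assumptions made for illustration. For real images, each point would be a feature vector rather than a 2-D coordinate.

```python
import math
import random


def nearest(point, pool, k):
    """Return the k points in pool closest to point (Euclidean distance)."""
    return sorted(pool, key=lambda q: math.dist(point, q))[:k]


def near_miss_1(majority, minority, n_keep, k=3):
    """NearMiss-1 style undersampling: keep the n_keep majority samples whose
    mean distance to their k nearest minority samples is smallest."""
    def mean_dist(p):
        neighbours = nearest(p, minority, k)
        return sum(math.dist(p, q) for q in neighbours) / len(neighbours)
    return sorted(majority, key=mean_dist)[:n_keep]


def smote_like(minority, n_new, k=3, seed=0):
    """SMOTE-style oversampling: each synthetic point lies on the segment between
    a minority sample and one of its k nearest minority neighbours.
    Assumes at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [q for q in minority if q is not base]
        neighbour = rng.choice(nearest(base, others, k))
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def tomek_links(class_a, class_b):
    """Return pairs (a, b) from opposite classes that are mutual nearest neighbours."""
    pts = [(p, 0) for p in class_a] + [(p, 1) for p in class_b]

    def nn(i):
        return min((j for j in range(len(pts)) if j != i),
                   key=lambda j: math.dist(pts[i][0], pts[j][0]))

    links = []
    for i in range(len(pts)):
        j = nn(i)
        if i < j and nn(j) == i and pts[i][1] != pts[j][1]:
            links.append((pts[i][0], pts[j][0]))
    return links


# Hypothetical toy data: the majority class stands in for pneumonia images,
# the minority class for normal images.
majority = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0), (9.0, 9.0), (10.0, 10.0)]
minority = [(0.2, 0.1), (2.0, 2.0)]

kept = near_miss_1(majority, minority, n_keep=2, k=1)
synthetic = smote_like(minority, n_new=4, k=1)
links = tomek_links(majority, minority)

print("kept majority samples:", kept)
print("synthetic minority samples:", synthetic)
print("Tomek links:", links)
```

In a SMOTE + Tomek pipeline, the synthetic points would first be added to the minority class and the samples involved in Tomek links would then be removed to clean the class boundary. Maintained implementations of these samplers exist in the imbalanced-learn package; the record does not state whether the thesis used it.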