Data sampling methods on imbalanced datasets for pneumonia detection in COVID-19 patients

Data classification is an important part of real-world decision support, and it can be severely affected by an imbalanced class distribution in the training data, especially in the medical field. Medical datasets are often imbalanced...


Bibliographic Details
Main Author: Dzulkefli, Syasya Farina
Format: Thesis
Language: English
Published: 2022
Subjects: TK Electrical engineering. Electronics Nuclear engineering
Online Access: http://eprints.utm.my/id/eprint/102725/1/SyasyaFarinaDzulkefliMSKE2022.pdf
http://eprints.utm.my/id/eprint/102725/
http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:149741
id my.utm.102725
record_format eprints
spelling my.utm.102725 2023-09-20T03:24:37Z http://eprints.utm.my/id/eprint/102725/ 2022 Thesis NonPeerReviewed application/pdf en Dzulkefli, Syasya Farina (2022) Data sampling methods on imbalanced datasets for pneumonia detection in covid-19 patients. Masters thesis, Universiti Teknologi Malaysia.
institution Universiti Teknologi Malaysia
building UTM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Teknologi Malaysia
content_source UTM Institutional Repository
url_provider http://eprints.utm.my/
language English
topic TK Electrical engineering. Electronics Nuclear engineering
description Data classification is an important part of real-world decision support, and it can be severely affected by an imbalanced class distribution in the training data, especially in the medical field. Medical datasets are often imbalanced, consisting of a minority of normal samples and a majority of abnormal samples. A current example is the outbreak of the novel coronavirus disease, COVID-19, which began in late 2019 and is still ongoing: new variants continue to be discovered, which can increase the number of cases around the world. Medical staff can identify patients by checking their symptoms; the common COVID-19 symptom investigated in this research is pneumonia. Detecting pneumonia at an early stage is important to keep it from becoming more severe. Chest X-ray images can serve as a confirmatory approach because they are fast to obtain and easily accessible. Diagnosing disease is, more generally, an important application of data analysis in medical science. In this research, data sampling methods are explored and implemented for pneumonia detection on imbalanced datasets. An imbalanced dataset of pneumonia X-ray images is obtained from Kaggle, and existing data sampling methods, together with newly proposed methods obtained by combining or modifying existing ones, are applied to balance the images between the majority and minority classes. After the dataset is balanced, a CNN model is used to benchmark detection performance in terms of the confusion matrix, precision, accuracy, F1-score and recall for each method, and the results are compared to identify the method that gives the highest accuracy in detecting pneumonia.
The best undersampling method is NearMiss with 85.47% accuracy, the best oversampling method is data augmentation with 88.78% accuracy, and the best combination method is SMOTE + Tomek with 83.20% accuracy, compared with 79% accuracy when no sampling method is applied to the imbalanced dataset. These results indicate that data sampling methods can improve classification performance, particularly for detecting pneumonia in COVID-19 patients.
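As a rough illustration of the two steps the description mentions, balancing the classes and then scoring a classifier from its confusion matrix, here is a minimal Python sketch. It is not the thesis's implementation: it uses plain random oversampling as a simple stand-in for the NearMiss, data augmentation and SMOTE + Tomek methods named in the description, and the function names (random_oversample, binary_metrics), the toy labels and the predictions are all hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of each smaller class
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


def binary_metrics(y_true, y_pred, positive=1):
    """Confusion-matrix counts and the scores derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy, made-up data: 1 = pneumonia (majority), 0 = normal (minority).
toy_images = ["img%d" % i for i in range(8)]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0]
balanced_images, balanced_labels = random_oversample(toy_images, toy_labels)
print(Counter(balanced_labels))

# Hypothetical predictions, used only to exercise the metric function.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
print(binary_metrics(y_true, y_pred))
```

Reporting precision, recall and F1-score alongside accuracy matters on imbalanced data, because a classifier that always predicts the majority class can reach high accuracy while missing every minority sample.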
format Thesis
author Dzulkefli, Syasya Farina
title Data sampling methods on imbalanced datasets for pneumonia detection in COVID-19 patients
publishDate 2022