An Improved k-NN Classifier using Similarity Distance Plot-Data Reduction and Dask for Big Datasets


Bibliographic Details
Main Author: Abdul Muqtasid, Rushdi
Format: Thesis
Language: en
Published: The International Journal of Computing and Digital Systems (IJCDS) 2025
Subjects:
Online Access:http://ir.unimas.my/id/eprint/51603/5/DOW_Abdul%20Muqtasid.pdf
http://ir.unimas.my/id/eprint/51603/6/Thesis%20Ms_Abdul%20Muqtasid.pdf
http://ir.unimas.my/id/eprint/51603/7/Thesis%20Ms_Abdul%20Muqtasid_24%20pages.pdf
http://ir.unimas.my/id/eprint/51603/
https://journal.uob.edu.bh/items/d9950900-7e0b-4bce-ade3-a99ffdcebff3/full
Description
Summary: The k-Nearest Neighbour (k-NN) algorithm is one of the most widely used Instance-Based Learning methods due to its simplicity and ease of implementation. However, k-NN faces two major challenges, particularly when applied to large datasets. First, it requires substantial memory to store all training instances and compute distances during classification. Second, this results in slower classification speeds, as the algorithm must search through a large number of instances to classify each test sample. To address these challenges, this study proposes a data reduction technique called Similarity Distance Plot–Data Reduction (SDP-DR), which aims to reduce the volume of stored data, thereby improving classification speed. To accommodate large-scale data, the method integrates a parallel computing framework, Dask, during both the reduction and classification processes. The classification of test data follows the conventional k-NN procedure. The performance of SDP-DR is systematically evaluated using benchmark datasets of varying sizes. Comparisons are made against the original k-NN and several advanced data reduction and classification methods, including RIS, DROP3, ATISA1, LHS-FKNN, CQ-EKNN, and GEK-NN. The evaluation criteria include classification accuracy, classification time, data reduction rate, and reduction time. Experimental results demonstrate that the proposed method significantly reduces data storage requirements, improves classification speed, and maintains or even enhances accuracy compared to both the original k-NN and other state-of-the-art methods. Overall, SDP-DR demonstrates strong potential as a scalable and accurate approach for enhancing instance-based classifiers in large-scale data environments.
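To make the bottleneck concrete, the sketch below implements the conventional k-NN voting step the abstract refers to, computing distances over the training set in fixed-size chunks — the same memory-bounding role that Dask's chunked, parallel arrays play at scale in the thesis. This is an illustrative reconstruction, not the thesis's code: the SDP-DR reduction itself and the actual Dask integration are not reproduced here, and the chunked loop stands in for Dask only conceptually.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3, chunk_size=1000):
    """Conventional k-NN majority vote. Distances are computed one chunk
    of training rows at a time so peak memory stays bounded, mimicking
    (sequentially) what a Dask-backed implementation does in parallel."""
    preds = []
    for x in X_test:
        # Running top-k candidates seen so far across all chunks.
        best_d = np.empty(0)
        best_y = np.empty(0, dtype=y_train.dtype)
        for start in range(0, len(X_train), chunk_size):
            Xc = X_train[start:start + chunk_size]
            d = np.linalg.norm(Xc - x, axis=1)  # Euclidean distances
            # Merge this chunk's distances with the running top-k.
            best_d = np.concatenate([best_d, d])
            best_y = np.concatenate([best_y, y_train[start:start + chunk_size]])
            idx = np.argsort(best_d)[:k]
            best_d, best_y = best_d[idx], best_y[idx]
        # Majority vote among the k nearest neighbours.
        labels, counts = np.unique(best_y, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)
```

A data reduction method such as SDP-DR shrinks `X_train` before this step runs, so both the per-chunk distance computation and the number of chunks drop, which is where the reported classification-time gains come from.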