A partition based feature selection approach for mixed data clustering / Ashish Dutt

Bibliographic Details
Main Author: Ashish, Dutt
Format: Thesis
Published: 2020
Online Access: http://studentsrepo.um.edu.my/14481/2/Ashish_Dutt.pdf
http://studentsrepo.um.edu.my/14481/1/Ashish_Dutt.pdf
http://studentsrepo.um.edu.my/14481/
Description
Summary: Presently, educational institutions compile and store huge volumes of data, such as student enrolment and attendance records and examination results. Mining such data yields useful information for the institutions that hold it. The rapid growth of educational data means that distilling it requires a more sophisticated set of algorithms, which led to the emergence of the field of Educational Data Mining (EDM). Traditional data mining algorithms cannot be applied directly to educational problems, as they may have a specific objective and function. A pre-processing algorithm therefore has to be applied first, and only then can specific data mining methods be applied to the problems. One such pre-processing algorithm in EDM is clustering, a widely used data mining method that discovers patterns in the underlying data by analysing its features. A feature contains a measured value, which can be of an atomic type such as categorical (text only) or numerical (number only). A categorical feature can be ordinal (ordered) or nominal (unordered). In either case, the feature is of a univariate data type. In real-world environments, data often consist of both categorical and numerical features; such datasets are called mixed data. The literature offers several clustering methods for purely numerical or purely categorical data, but only a few for mixed data. Clustering mixed data depends on the dissimilarities of its constituent features, and this dependence on data types may influence a clustering solution. Assigning appropriate weights to the features, so as to diminish the influence of data type, may improve the performance of a partition clustering algorithm. In this thesis, a novel weighted feature selection approach on nominal features is proposed for a partition clustering algorithm that can handle mixed data.
The proposed approach exploits the pre-processing nature of the partition clustering algorithm when selecting the weights assigned to nominal features. The benefits of weighting are demonstrated on both simulated and real-world mixed datasets. The experiments show better clustering results when nominal features are weighted in mixed data clustering.
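
As an illustration only, the idea of weighting nominal features inside a mixed-data dissimilarity can be sketched as follows. This is not the method proposed in the thesis, which this record does not describe in detail. It is a generic k-prototypes-style measure (squared Euclidean distance on numerical features plus weighted simple matching on nominal features). The function name, the example records and the weight values are all hypothetical.

```python
def mixed_dissimilarity(a, b, numeric_idx, nominal_idx, nominal_weights):
    """Dissimilarity between two mixed-type records.

    Numerical features contribute their squared difference; each nominal
    feature contributes its weight when the two values differ, else 0.
    Illustrative sketch only, not the weighting scheme of the thesis.
    """
    numeric_part = sum((a[i] - b[i]) ** 2 for i in numeric_idx)
    nominal_part = sum(
        w for i, w in zip(nominal_idx, nominal_weights) if a[i] != b[i]
    )
    return numeric_part + nominal_part


# Hypothetical student records: two scaled scores, then two nominal features.
student_a = (0.9, 0.8, "science", "urban")
student_b = (0.7, 0.8, "arts", "urban")

# Equal weights: a single nominal mismatch dominates the numerical difference.
d_unweighted = mixed_dissimilarity(student_a, student_b, [0, 1], [2, 3], [1.0, 1.0])

# Down-weighting the first nominal feature reduces its influence.
d_weighted = mixed_dissimilarity(student_a, student_b, [0, 1], [2, 3], [0.25, 1.0])
```

With equal weights, the single nominal mismatch contributes far more than the numerical difference, which is the kind of data type influence the summary refers to; a smaller weight on that feature reduces its share of the total.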