Taxonomy learning from Malay texts using artificial immune system based clustering

In taxonomy learning from texts, the features extracted to describe the context of a term are usually erroneous and sparse. Various attempts have been made to overcome data sparseness and noise using clustering algorithms such as Hierarchical Agglomerative Clustering (HAC), Bisecting K-...

Bibliographic Details
Main Author: Ahmad Nazri, Mohd. Zakree
Format: Thesis
Language: English
Published: 2011
Subjects:
Online Access: http://eprints.utm.my/id/eprint/36947/1/MohdZakreeAhmadNazriPFSKSM2011.pdf
http://eprints.utm.my/id/eprint/36947/
id my.utm.36947
record_format eprints
spelling my.utm.36947 2018-05-27T08:15:39Z http://eprints.utm.my/id/eprint/36947/ Taxonomy learning from Malay texts using artificial immune system based clustering Ahmad Nazri, Mohd. Zakree QA75 Electronic computers. Computer science 2011-04 Thesis NonPeerReviewed application/pdf en http://eprints.utm.my/id/eprint/36947/1/MohdZakreeAhmadNazriPFSKSM2011.pdf Ahmad Nazri, Mohd. Zakree (2011) Taxonomy learning from Malay texts using artificial immune system based clustering. PhD thesis, Universiti Teknologi Malaysia, Faculty of Computer Science and Information System.
institution Universiti Teknologi Malaysia
building UTM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Teknologi Malaysia
content_source UTM Institutional Repository
url_provider http://eprints.utm.my/
language English
topic QA75 Electronic computers. Computer science
description In taxonomy learning from texts, the features extracted to describe the context of a term are usually erroneous and sparse. Various attempts have been made to overcome data sparseness and noise using clustering algorithms such as Hierarchical Agglomerative Clustering (HAC), Bisecting K-means and Guided Agglomerative Hierarchical Clustering (GAHC). However, these methods suffer from low recall. Therefore, the purpose of this study is to investigate the application of two hybridized artificial immune system (AIS) algorithms to taxonomy learning from Malay text and to develop a Google-based Text Miner (GTM) for feature selection to reduce data sparseness. Two novel taxonomy learning algorithms are proposed and compared with the benchmark methods (i.e., HAC, GAHC and Bisecting K-means). The first algorithm, GCAINT (Guided Clustering and aiNet for Taxonomy Learning), is a hybrid of GAHC and the Artificial Immune Network (aiNet). GCAINT exploits a Hypernym Oracle (HO) to guide the hierarchical clustering process and produces better results than the benchmark methods. However, the Malay HO introduces erroneous hypernym-hyponym pairs, which affects the results. Therefore, a second novel algorithm, CLOSAT (Clonal Selection Algorithm for Taxonomy Learning), is proposed by hybridizing the Clonal Selection Algorithm (CLONALG) and Bisecting K-means. CLOSAT produces the best results compared with the benchmark methods and GCAINT. The GTM is proposed to reduce sparseness in the obtained dataset. However, the experimental results reveal that GTM introduces too much noise into the dataset, which leads to many false-positive hypernym-hyponym pairs. The effect of different combinations of affinity measures (i.e., Hamming, Jaccard and Rand) on the performance of the developed methods was also studied. Jaccard was found to be better than Hamming and Rand at measuring the similarity between terms. In addition, the use of Particle Swarm Optimization (PSO) for automatic parameter tuning of GCAINT and CLOSAT is proposed. Experimental results demonstrate that, in most cases, PSO-tuned CLOSAT and GCAINT produce better results than the benchmark methods and are able to reduce data sparseness and noise in the dataset.
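The abstract compares Hamming and Jaccard affinity between terms. The sketch below illustrates how the two measures can behave on sparse binary context vectors. It is not taken from the thesis: the terms, the ten-feature vectors, and the function names are hypothetical, and the Rand variant is omitted because the abstract does not define it.

```python
from typing import Sequence


def hamming_affinity(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of positions where two binary vectors agree
    (1 minus the normalized Hamming distance)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if not a:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b))
    return 1.0 - mismatches / len(a)


def jaccard_affinity(a: Sequence[int], b: Sequence[int]) -> float:
    """Shared active features divided by features active in either vector.
    Positions where both vectors are 0 are ignored."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


# Hypothetical sparse context vectors (1 = term was seen with that feature).
kucing = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # "cat"
anjing = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # "dog", shares one feature with "cat"
kereta = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]  # "car", shares no feature with "cat"

for name, other in (("anjing", anjing), ("kereta", kereta)):
    print(name,
          round(hamming_affinity(kucing, other), 3),
          round(jaccard_affinity(kucing, other), 3))
```

With these toy vectors, Hamming affinity barely separates the related pair (0.8) from the unrelated pair (0.7) because shared zeros dominate, while Jaccard gives 1/3 versus 0. This is one plausible reason a measure that ignores joint absences might suit sparse data; the abstract itself does not state why Jaccard performed better.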
format Thesis
author Ahmad Nazri, Mohd. Zakree
title Taxonomy learning from Malay texts using artificial immune system based clustering
publishDate 2011
url http://eprints.utm.my/id/eprint/36947/1/MohdZakreeAhmadNazriPFSKSM2011.pdf
http://eprints.utm.my/id/eprint/36947/