Staff View: A genetic-based HAC technique for parallel clustering of bilingual Malay-English corpora

A genetic-based HAC technique for parallel clustering of bilingual Malay-English corpora

Multilingual corpora, containing the same documents in a variety of languages, are becoming an essential resource for natural language processing. Clustering multilingual corpora provides insight into the differences between languages when term frequency-based Information Retrieval...

Bibliographic Details
Main Authors:	Ng Zhen Wei, Chan Chen Jie, Rayner Alfred, Joe Henry Obit
Format:	Article
Language:	English
Published:	Universal Association of Computer and Electronics Engineers (UACEE) 2012
Subjects:	QA76.75-76.765 Computer software QH426-470 Genetics
Online Access:	https://eprints.ums.edu.my/id/eprint/29029/2/A%20genetic-based%20HAC%20technique%20for%20parallel%20clustering%20of%20bilingual%20Malay-English%20corpora_ABSTRACT.pdf https://eprints.ums.edu.my/id/eprint/29029/3/A%20Genetic-Based%20HAC%20Technique%20for%20Parallel%20Clustering%20of%20Bilingual%20Malay-English%20Corpora%20FULL%20TEXT.pdf https://eprints.ums.edu.my/id/eprint/29029/ https://www.seekdl.org/assets/pdf/20121212_101902.pdf

id	my.ums.eprints.29029
record_format	eprints
spelling	my.ums.eprints.29029 2021-09-10T01:59:33Z https://eprints.ums.edu.my/id/eprint/29029/ Article PeerReviewed. Ng Zhen Wei and Chan Chen Jie and Rayner Alfred and Joe Henry Obit (2012) A genetic-based HAC technique for parallel clustering of bilingual Malay-English corpora. International Journal of Computer Science and its Applications, 2. pp. 161-168. ISSN 2250-3765 https://www.seekdl.org/assets/pdf/20121212_101902.pdf
institution	Universiti Malaysia Sabah
building	UMS Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Malaysia Sabah
content_source	UMS Institutional Repository
url_provider	http://eprints.ums.edu.my/
language	English
topic	QA76.75-76.765 Computer software QH426-470 Genetics
spellingShingle	QA76.75-76.765 Computer software QH426-470 Genetics Ng Zhen Wei Chan Chen Jie Rayner Alfred Joe Henry Obit A genetic-based HAC technique for parallel clustering of bilingual Malay-English corpora
description	Multilingual corpora, containing the same documents in a variety of languages, are becoming an essential resource for natural language processing. Clustering multilingual corpora provides insight into the differences between languages when term frequency-based Information Retrieval (IR) tools are used. It also allows one to use the Natural Language Processing (NLP) and IR tools of one language to implement IR for another language. For instance, the most relevant articles to be translated from Malay to English can be selected after studying the clusters of abstracts in English. In this paper, we report on our work on applying Hierarchical Agglomerative Clustering (HAC) to a large corpus of documents, each of which appears in both Malay and English. We cluster these documents separately for each language and compare the results with respect to the content of the clusters produced. On the data available, the results of clustering in one language resemble those in the other, provided the number of clusters required is relatively small. Further, we study the effects of changing the method used to compute the inter-cluster distance: single link, complete link and average link. Finally, we describe an experiment employing a genetic algorithm to fine-tune the individual term weights in order to reproduce more closely a predefined set of clusters. In this way, clustering becomes a supervised learning technique that is trained to better reproduce known clusters in Malay when applied to the corresponding documents in English. Other possible applications include training the algorithm on a hand-clustered set of documents and subsequently applying it to a superset that includes unseen documents, thereby incorporating expert knowledge about the domain into the clustering algorithm.
format	Article
author	Ng Zhen Wei Chan Chen Jie Rayner Alfred Joe Henry Obit
author_facet	Ng Zhen Wei Chan Chen Jie Rayner Alfred Joe Henry Obit
author_sort	Ng Zhen Wei
title	A genetic-based HAC technique for parallel clustering of bilingual Malay-English corpora
title_short	A genetic-based HAC technique for parallel clustering of bilingual Malay-English corpora
title_full	A genetic-based HAC technique for parallel clustering of bilingual Malay-English corpora
title_fullStr	A genetic-based HAC technique for parallel clustering of bilingual Malay-English corpora
title_full_unstemmed	A genetic-based HAC technique for parallel clustering of bilingual Malay-English corpora
title_sort	genetic-based hac technique for parallel clustering of bilingual malay-english corpora
publisher	Universal Association of Computer and Electronics Engineers (UACEE)
publishDate	2012
url	https://eprints.ums.edu.my/id/eprint/29029/2/A%20genetic-based%20HAC%20technique%20for%20parallel%20clustering%20of%20bilingual%20Malay-English%20corpora_ABSTRACT.pdf https://eprints.ums.edu.my/id/eprint/29029/3/A%20Genetic-Based%20HAC%20Technique%20for%20Parallel%20Clustering%20of%20Bilingual%20Malay-English%20Corpora%20FULL%20TEXT.pdf https://eprints.ums.edu.my/id/eprint/29029/ https://www.seekdl.org/assets/pdf/20121212_101902.pdf
_version_	1760230663678590976
score	13.211869
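
Illustrative note: the description above names Hierarchical Agglomerative Clustering with single-link, complete-link and average-link inter-cluster distances, applied separately to the Malay and English versions of the same documents. The following is a minimal Python sketch of that idea, not the authors' implementation. The cosine distance, the toy term counts, and the names hac, cluster_distance, docs_malay and docs_english are all assumptions made for illustration; the optional weights argument only marks where per-term weights (which the paper tunes with a genetic algorithm) could enter.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def cluster_distance(c1, c2, vectors, linkage):
    """Inter-cluster distance under single, complete or average link."""
    dists = [cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2]
    if linkage == "single":
        return min(dists)
    if linkage == "complete":
        return max(dists)
    if linkage == "average":
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {linkage}")


def hac(vectors, k, linkage="average", weights=None):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    if weights is not None:
        # Hypothetical hook: scale each term by a weight before clustering.
        vectors = [[w * x for w, x in zip(weights, v)] for v in vectors]
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        # Pick the pair of clusters with the smallest inter-cluster distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(
                clusters[p[0]], clusters[p[1]], vectors, linkage
            ),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return sorted(sorted(c) for c in clusters)


# Toy term-count vectors (invented): row i is the same document in each
# language, but the two vocabularies are unrelated.
docs_malay = [[3, 0, 1, 0], [2, 0, 0, 0], [0, 4, 0, 1], [0, 3, 0, 2]]
docs_english = [[2, 0, 0, 1], [4, 1, 0, 0], [0, 2, 1, 0], [0, 5, 0, 0]]

if __name__ == "__main__":
    for link in ("single", "complete", "average"):
        malay = hac(docs_malay, 2, link)
        english = hac(docs_english, 2, link)
        print(link, malay, english, "same partition:", malay == english)
```

Because each clustering is returned as a sorted list of document indices, the Malay and English results can be compared directly for equality. On realistic data a graded agreement measure would likely be more informative than an exact match.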

Similar Items