Text clustering for reducing semantic information in Malay semantic representation
Main Authors: | , , , |
---|---|
Format: | Article |
Language: | English |
Published: | Penerbit Universiti Kebangsaan Malaysia, 2020 |
Online Access: | http://journalarticle.ukm.my/16833/1/02.pdf http://journalarticle.ukm.my/16833/ https://www.ukm.my/apjitm/articles-year.php |
Summary: | Text generation has increased dramatically in this era. Texts can be structured or unstructured. The
enormous amount of unstructured text is easily understood by humans but cannot be readily processed by
computers. Efficient techniques are needed to reduce this information to more valuable vectors. In this
article, we introduce a text clustering method that uses Malay linguistic information to reduce the
unstructured semantic information derived from Wikipedia Bahasa Melayu articles. The proposed method
uses linguistic features of the Malay language to handle the morphological issues of Malay words. We
incorporated semantic information from a semantic lexical resource for Malay, Wikipedia Bahasa
Melayu (WikiBM). An experiment was then conducted to evaluate the effect of text clustering on semantic
similarity values, using the gloss definitions of WikiBM articles. We used Jaccard similarity to calculate the
overlap between vectors derived from WikiBM text. The correlation was then computed using Pearson's
correlation coefficient. The score for the original text definitions was compared with the score for the new
text definitions produced by the text clustering method. From the experiment, we conclude that the
correlation value increased (from 0.39 to 0.43) after the semantic information was reduced to more
valuable vectors using the text clustering method. |
---|
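The two measures named in the summary, Jaccard similarity and Pearson's correlation, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the whitespace tokenization, the function names `jaccard` and `pearson`, and the two toy Malay glosses are assumptions made purely for illustration and are not taken from WikiBM or from the article's data.

```python
from math import sqrt


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0  # assumption: two empty glosses are treated as dissimilar
    return len(a & b) / len(a | b)


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Invented toy glosses (hypothetical, not real WikiBM definitions),
# tokenized by whitespace only.
gloss_a = "kucing ialah haiwan mamalia kecil".split()
gloss_b = "anjing ialah haiwan mamalia".split()

# 3 shared tokens out of 6 distinct tokens
print(jaccard(gloss_a, gloss_b))  # 0.5
```

In an evaluation of the kind the summary describes, `pearson` would be applied to a list of such similarity scores and a second list of reference scores of the same length. The summary does not say what the reference scores are, so none are shown here.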