Malay part of speech tagging using ruled-based approach

The research on part of speech (POS) tagging has been widely applied and used through a variety of approaches, particularly for European languages. But it is more challenging for Asian languages, especially Malay as it has some element of modification from other languages such as English and Arabic....

Bibliographic Details
Main Authors: Nur Ashikin Halid, Nazlia Omar
Format: Article
Language: English
Published: Penerbit Universiti Kebangsaan Malaysia 2017
Online Access: http://journalarticle.ukm.my/11857/1/19146-65044-1-PB.pdf
http://journalarticle.ukm.my/11857/
http://ejournals.ukm.my/apjitm/issue/view/1050
id my-ukm.journal.11857
record_format eprints
spelling my-ukm.journal.11857 2018-07-10T00:13:29Z http://journalarticle.ukm.my/11857/ Malay part of speech tagging using ruled-based approach Nur Ashikin Halid, Nazlia Omar Penerbit Universiti Kebangsaan Malaysia 2017-12 Article PeerReviewed application/pdf en http://journalarticle.ukm.my/11857/1/19146-65044-1-PB.pdf Nur Ashikin Halid and Nazlia Omar (2017) Malay part of speech tagging using ruled-based approach. Asia-Pacific Journal of Information Technology and Multimedia, 6 (2). pp. 90-106. ISSN 2289-2192 http://ejournals.ukm.my/apjitm/issue/view/1050
institution Universiti Kebangsaan Malaysia
building Perpustakaan Tun Sri Lanang Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Kebangsaan Malaysia
content_source UKM Journal Article Repository
url_provider http://journalarticle.ukm.my/
language English
description Part-of-speech (POS) tagging has been widely studied and applied through a variety of approaches, particularly for European languages. It is more challenging for Asian languages, especially Malay, which contains elements adapted from other languages such as English and Arabic. Common issues in POS tagging include ambiguous words and unknown words, and the lack of rules in existing work has become a major problem in Malay POS tagging. This research therefore aims to develop new rules for Malay POS tagging and to compare the performance of the resulting tagger with the existing gold standard. The process begins with the collection and selection of a corpus from secondary data, obtained from online daily news covering several domains. The raw article text is then pre-processed with a sentence splitter and a tokenizer to generate an unlabeled corpus. A POS tag dictionary is also constructed to form a lexicon that consists only of root words. Rule development involves specifying suitable rules for every type of POS tag and finding the best rule ordering for each type. A total of 30 rules, including affixation rules and 16 word-type relations, were developed in this process. The evaluation tests the precision of the developed POS tagger and determines the best rule ordering, and the tagging result is compared with the existing gold standard. Overall, the test showed a good result, with an accuracy of 93.06% compared to the gold standard performance of 77.17%. The research thus achieved better accuracy than the gold standard, and it shows that the addition of new rules and the ordering of rules are among the factors that contributed to the higher precision in tagging a Malay corpus. As an improvement for future studies, compound words should be taken into account because they are widely used in news sources. In addition, corpora from social media can be used because information disseminated through social media is fast and up to date, although the language of such sources is mostly informal and raises noisy-data issues.
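The pipeline outlined in the abstract (sentence splitting, tokenization, a root-word lexicon, ordered affixation rules, and an accuracy check) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tag labels, the tiny `LEXICON`, the six entries in `AFFIX_RULES`, and all function names are hypothetical and stand in for the 30 rules and the dictionary described in the article, which are not reproduced in this record.

```python
import re

# Hypothetical root-word lexicon (assumption; not the article's dictionary).
LEXICON = {
    "saya": "PRON",
    "buku": "NOUN",
    "baca": "VERB",
    "makan": "VERB",
    "itu": "DET",
    "baru": "ADJ",
}

# Illustrative affixation rules (assumption), tried in order; first match wins.
AFFIX_RULES = [
    ("circumfix pe-...-an", re.compile(r"^pe\w+an$"), "NOUN"),
    ("circumfix ke-...-an", re.compile(r"^ke\w+an$"), "NOUN"),
    ("suffix -kan", re.compile(r"^\w+kan$"), "VERB"),
    ("prefix me-", re.compile(r"^me\w+$"), "VERB"),
    ("prefix ber-", re.compile(r"^ber\w+$"), "VERB"),
    ("suffix -an", re.compile(r"^\w+an$"), "NOUN"),
]


def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Return words (hyphenated forms kept whole) and single punctuation marks."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)


def tag_token(token, rules=AFFIX_RULES):
    """Tag one token: punctuation, then lexicon lookup, then ordered affix rules."""
    word = token.lower()
    if not re.search(r"\w", word):
        return "PUNCT"
    if word in LEXICON:
        return LEXICON[word]
    for _name, pattern, tag in rules:
        if pattern.match(word):
            return tag
    return "UNK"  # unknown word: no lexicon entry and no rule fired


def tag_text(text, rules=AFFIX_RULES):
    """Tag every token of every sentence in the text."""
    return [
        [(tok, tag_token(tok, rules)) for tok in tokenize(sentence)]
        for sentence in split_sentences(text)
    ]


def accuracy(predicted, gold):
    """Fraction of predicted tags that equal the reference tags."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("tag sequences must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

Rule order matters in this sketch: "berjalan" matches both the "ber-" prefix rule and the "-an" suffix rule, so it is tagged VERB with the order above and NOUN if the list is reversed. That is one simple way to see why the abstract treats rule ordering as something to be evaluated rather than fixed in advance.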
format Article
author Nur Ashikin Halid
Nazlia Omar
title Malay part of speech tagging using ruled-based approach
publisher Penerbit Universiti Kebangsaan Malaysia
publishDate 2017
url http://journalarticle.ukm.my/11857/1/19146-65044-1-PB.pdf
http://journalarticle.ukm.my/11857/
http://ejournals.ukm.my/apjitm/issue/view/1050