Staff View: Malay part of speech tagging using ruled-based approach

Malay part of speech tagging using ruled-based approach

Part of speech (POS) tagging has been widely studied and applied through a variety of approaches, particularly for European languages. It is more challenging for Asian languages, especially Malay, which contains elements adapted from other languages such as English and Arabic....

Bibliographic Details
Main Authors:	Nur Ashikin Halid, Nazlia Omar
Format:	Article
Language:	English
Published:	Penerbit Universiti Kebangsaan Malaysia 2017
Online Access:	http://journalarticle.ukm.my/11857/1/19146-65044-1-PB.pdf http://journalarticle.ukm.my/11857/ http://ejournals.ukm.my/apjitm/issue/view/1050

id	my-ukm.journal.11857
record_format	eprints
spelling	my-ukm.journal.11857 2018-07-10T00:13:29Z http://journalarticle.ukm.my/11857/ Malay part of speech tagging using ruled-based approach Nur Ashikin Halid, Nazlia Omar Penerbit Universiti Kebangsaan Malaysia 2017-12 Article PeerReviewed application/pdf en http://journalarticle.ukm.my/11857/1/19146-65044-1-PB.pdf Nur Ashikin Halid and Nazlia Omar (2017) Malay part of speech tagging using ruled-based approach. Asia-Pacific Journal of Information Technology and Multimedia, 6 (2). pp. 90-106. ISSN 2289-2192 http://ejournals.ukm.my/apjitm/issue/view/1050
institution	Universiti Kebangsaan Malaysia
building	Perpustakaan Tun Sri Lanang Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Kebangsaan Malaysia
content_source	UKM Journal Article Repository
url_provider	http://journalarticle.ukm.my/
language	English
description	Part of speech (POS) tagging has been widely studied and applied through a variety of approaches, particularly for European languages. It is more challenging for Asian languages, especially Malay, which contains elements adapted from other languages such as English and Arabic. Common issues in POS tagging include ambiguous words and unknown words. In addition, the lack of rules in existing work has become a major problem in Malay POS tagging. Therefore, this research aims to develop new rules for Malay POS tagging and to compare the performance of the resulting tagger with the existing gold standard. The process begins with the collection and selection of a corpus from secondary data, obtained from online daily news covering several domains. Next, the raw article text is pre-processed with a sentence splitter and a tokenizer to generate an unlabeled corpus. A POS tag dictionary is also constructed to form a lexicon that consists only of root words. The rule development process involves specifying suitable rules for every type of POS tag and finding the best rule ordering for each type. A total of 30 rules, including affixation rules and 16 word type relations, were developed in this process. The evaluation process tests the precision of the developed POS tagger and identifies the best rule ordering. The POS tagging result is compared with the existing gold standard. Overall, the test showed good results, with an accuracy of 93.06% compared with the gold standard performance of 77.17%. Hence, this research achieved better accuracy than the gold standard, and it shows that the addition of new rules and rule ordering are among the factors that contributed to the higher precision in tagging the Malay corpus. As an improvement for future studies, compound words should be taken into account because such words are common in most news sources. In addition, a corpus from social media sources could be used because information disseminated through social media is fast and up to date, even though the language used there is mostly informal and raises noisy data issues.
format	Article
author	Nur Ashikin Halid, Nazlia Omar
spellingShingle	Nur Ashikin Halid, Nazlia Omar Malay part of speech tagging using ruled-based approach
author_facet	Nur Ashikin Halid, Nazlia Omar
author_sort	Nur Ashikin Halid
title	Malay part of speech tagging using ruled-based approach
title_short	Malay part of speech tagging using ruled-based approach
title_full	Malay part of speech tagging using ruled-based approach
title_fullStr	Malay part of speech tagging using ruled-based approach
title_full_unstemmed	Malay part of speech tagging using ruled-based approach
title_sort	malay part of speech tagging using ruled-based approach
publisher	Penerbit Universiti Kebangsaan Malaysia
publishDate	2017
url	http://journalarticle.ukm.my/11857/1/19146-65044-1-PB.pdf http://journalarticle.ukm.my/11857/ http://ejournals.ukm.my/apjitm/issue/view/1050
_version_	1643738622953783296
score	13.251813

Similar Items