A category classification algorithm for Indonesian and Malay news documents

Text classification (TC) provides a better way to organize information, since it allows better understanding and interpretation of content. It deals with assigning labels to groups of similar textual documents. However, TC research for Asian language documents is relatively limited comp...

Bibliographic Details
Main Authors: Jaafar, J., Indra, Z., Zamin, N.
Format: Article
Published: Penerbit UTM Press 2016
Online Access: https://www.scopus.com/inward/record.uri?eid=2-s2.0-84988430997&doi=10.11113%2fjt.v78.9549&partnerID=40&md5=ecdbab4a964888b760afd4013033549a
http://eprints.utp.edu.my/25485/
id my.utp.eprints.25485
record_format eprints
Jaafar, J. and Indra, Z. and Zamin, N. (2016) A category classification algorithm for Indonesian and Malay news documents. Jurnal Teknologi, 78 (8-2). pp. 121-132. Article, NonPeerReviewed.
http://eprints.utp.edu.my/25485/
institution Universiti Teknologi Petronas
building UTP Resource Centre
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Teknologi Petronas
content_source UTP Institutional Repository
url_provider http://eprints.utp.edu.my/
description Text classification (TC) provides a better way to organize information, since it allows better understanding and interpretation of content. It deals with assigning labels to groups of similar textual documents. However, TC research on Asian-language documents is limited compared with English documents, and scarcer still for news articles. Moreover, TC research on classifying textual documents in languages with similar morphology, such as Indonesian and Malay, is still scarce. Hence, the aim of this study is to develop an integrated generic TC algorithm that identifies the language of a news document and then classifies its category. Furthermore, a top-n feature selection method is used to improve TC performance and to overcome two challenges of classifying online news corpora: the rapid growth of online news documents and high computational time. Experiments were conducted using 280 Indonesian and 280 Malay online news documents from 2014-2015. The classification method produced good results, with accuracy of up to 95.63 for language identification and 97.5 for category classification. The category classifier works optimally at n = 60, with an average computational time of 35 seconds. This indicates that the integrated generic TC has an advantage over manual classification and is suitable for Indonesian and Malay news classification. © 2016 Penerbit UTM Press. All rights reserved.
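The two-stage idea in the description (identify the language, then classify the category, each over a reduced top-n feature set) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not state which features, weighting, or classifier they used. The sketch assumes raw term frequency as the ranking criterion and a simple overlap count as the classifier; all function names, the toy training sentences, and the category labels are hypothetical, and n is set to 5 only because the toy corpus is tiny (the record reports n = 60 as optimal for the category classifier).

```python
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep purely alphabetic tokens.
    return [t for t in text.lower().split() if t.isalpha()]


def top_n_terms(docs, n):
    # Assumed criterion: the n most frequent terms across the documents.
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return {term for term, _ in counts.most_common(n)}


def build_profiles(labeled_docs, n):
    # Map each label to the top-n term set of its training documents.
    return {label: top_n_terms(docs, n) for label, docs in labeled_docs.items()}


def classify(text, profiles):
    # Choose the label whose top-n term set shares the most tokens with the text.
    tokens = set(tokenize(text))
    return max(profiles, key=lambda label: len(tokens & profiles[label]))


# Toy training data, invented for illustration only.
LANG_TRAIN = {
    "indonesian": ["pemerintah bisa membantu karena harga mobil naik",
                   "dia pergi ke kantor karena rapat"],
    "malay": ["kerajaan boleh membantu kerana harga kereta naik",
              "dia pergi ke pejabat kerana mesyuarat"],
}
CATEGORY_TRAIN = {
    "indonesian": {
        "sports": ["tim sepak bola menang pertandingan", "pemain bola cetak gol"],
        "economy": ["harga saham naik di bursa", "bank menaikkan bunga saham"],
    },
    "malay": {
        "sports": ["pasukan bola sepak menang perlawanan", "pemain bola jaringkan gol"],
        "economy": ["harga saham naik di bursa", "bank menaikkan kadar faedah saham"],
    },
}

N = 5  # toy value; far smaller than the n = 60 reported in the record

lang_profiles = build_profiles(LANG_TRAIN, N)
category_profiles = {lang: build_profiles(cats, N)
                     for lang, cats in CATEGORY_TRAIN.items()}


def classify_news(text):
    # Stage 1: identify the language. Stage 2: classify within that language.
    lang = classify(text, lang_profiles)
    return lang, classify(text, category_profiles[lang])


print(classify_news("pemain bola itu bisa menang karena latihan"))
```

Restricting each profile to n terms keeps the comparison cost bounded as the corpus grows, which is the motivation the description gives for top-n selection; a realistic version would replace raw frequency with a discriminative score and the overlap count with a trained classifier.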
format Article
author Jaafar, J.
Indra, Z.
Zamin, N.
title A category classification algorithm for Indonesian and Malay news documents
publisher Penerbit UTM Press
publishDate 2016
url https://www.scopus.com/inward/record.uri?eid=2-s2.0-84988430997&doi=10.11113%2fjt.v78.9549&partnerID=40&md5=ecdbab4a964888b760afd4013033549a
http://eprints.utp.edu.my/25485/