Semi-Automatic Sentiment Identification for Malay-English Code-Switched Data

Informal online communication on social media platforms like Twitter, Facebook, and YouTube often involves code-switching between languages, notably Malay and English, in Malaysia due to its diverse society. This phenomenon challenges sentiment analysis tasks, as the intermixi...

Bibliographic Details
Main Authors: Afifah, Mohd Shamsuddin; Sarah Flora, Samson Juan; Stephanie, Chua; Arif, Bramantoro
Format: Article
Language: English
Published: Penerbit Akademia Baru 2024
Subjects: QA75 Electronic computers. Computer science
Online Access: http://ir.unimas.my/id/eprint/47390/3/5640-Article%20Text-28397-1-10-20250109.pdf
http://ir.unimas.my/id/eprint/47390/
https://www.akademiabaru.com/submit/index.php/ard/article/view/5640
https://doi.org/10.37934/ard.123.1.198212
id my.unimas.ir-47390
record_format eprints
spelling my.unimas.ir-47390 2025-01-22T04:02:53Z http://ir.unimas.my/id/eprint/47390/ Semi-Automatic Sentiment Identification for Malay-English Code-Switched Data Afifah, Mohd Shamsuddin; Sarah Flora, Samson Juan; Stephanie, Chua; Arif, Bramantoro QA75 Electronic computers. Computer science Informal online communication on social media platforms like Twitter, Facebook, and YouTube often involves code-switching between languages, notably Malay and English, in Malaysia due to its diverse society. This phenomenon challenges sentiment analysis tasks, as the intermixing of languages within sentences or phrases increases the likelihood of inaccurately classified sentiment. Sentiment analysers built with models trained on monolingual data will cause misclassification due to out-of-vocabulary issues, thus limiting the efficacy of these analysers on code-switched data. In addition, code-switched corpora annotated with sentiment labels are scarce, and investigations of sentiment analysis on code-switched social media data are lacking, particularly for Malaysian social media posts. We proposed MESocSentiment, a Malay-English social media corpus with sentiment labels constructed via a semi-automatic sentiment identification method to address this challenge. The framework leveraged existing language identifiers and sentiment analysers to annotate code-switched data. We collected 229,566 tweets containing #Malaysia between August 12, 2022, and May 15, 2023. Using our strategy, we identified 19,714 code-switched posts containing Malay and English words from the collection. Our analysis showed that 78.23% of the corpus had neutral sentiments, while 16.32% were positive and 5.44% negative. Furthermore, the descriptive analysis of tweet length revealed a range spanning 43 words, with a mean of 8.92 and a standard deviation of 5.56 words. This comprehensive framework contributes to a deeper understanding of sentiment expression within code-switched social media data, particularly in the context of Malaysia's linguistic and cultural diversity. Additionally, the MESocSentiment corpus is published on GitHub for future research. Penerbit Akademia Baru 2024-12 Article PeerReviewed text en http://ir.unimas.my/id/eprint/47390/3/5640-Article%20Text-28397-1-10-20250109.pdf Afifah, Mohd Shamsuddin and Sarah Flora, Samson Juan and Stephanie, Chua and Arif, Bramantoro (2024) Semi-Automatic Sentiment Identification for Malay-English Code-Switched Data. Journal of Advanced Research Design, 123 (1). pp. 198-212. ISSN 2289-7984 https://www.akademiabaru.com/submit/index.php/ard/article/view/5640 https://doi.org/10.37934/ard.123.1.198212
institution Universiti Malaysia Sarawak
building Centre for Academic Information Services (CAIS)
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Malaysia Sarawak
content_source UNIMAS Institutional Repository
url_provider http://ir.unimas.my/
language English
topic QA75 Electronic computers. Computer science
description Informal online communication on social media platforms like Twitter, Facebook, and YouTube often involves code-switching between languages, notably Malay and English, in Malaysia due to its diverse society. This phenomenon challenges sentiment analysis tasks, as the intermixing of languages within sentences or phrases increases the likelihood of inaccurately classified sentiment. Sentiment analysers built with models trained on monolingual data will cause misclassification due to out-of-vocabulary issues, thus limiting the efficacy of these analysers on code-switched data. In addition, code-switched corpora annotated with sentiment labels are scarce, and investigations of sentiment analysis on code-switched social media data are lacking, particularly for Malaysian social media posts. We proposed MESocSentiment, a Malay-English social media corpus with sentiment labels constructed via a semi-automatic sentiment identification method to address this challenge. The framework leveraged existing language identifiers and sentiment analysers to annotate code-switched data. We collected 229,566 tweets containing #Malaysia between August 12, 2022, and May 15, 2023. Using our strategy, we identified 19,714 code-switched posts containing Malay and English words from the collection. Our analysis showed that 78.23% of the corpus had neutral sentiments, while 16.32% were positive and 5.44% negative. Furthermore, the descriptive analysis of tweet length revealed a range spanning 43 words, with a mean of 8.92 and a standard deviation of 5.56 words. This comprehensive framework contributes to a deeper understanding of sentiment expression within code-switched social media data, particularly in the context of Malaysia's linguistic and cultural diversity. Additionally, the MESocSentiment corpus is published on GitHub for future research.
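The description above outlines a two-stage idea: find posts that mix Malay and English words, then attach a sentiment label to each. The following is only a minimal sketch of that idea, not the authors' pipeline. The record does not name the language identifiers or sentiment analysers used, so the tiny word lists and the function names (`is_code_switched`, `sentiment_label`, `annotate`) here are hypothetical stand-ins chosen for illustration.

```python
import re

# Hypothetical toy lexicons, for illustration only. The article relies on
# existing language identifiers and sentiment analysers, which are not
# named in this record.
MALAY_WORDS = {"saya", "suka", "sangat", "tak", "makanan", "cantik", "teruk", "ini", "hari"}
ENGLISH_WORDS = {"i", "love", "the", "food", "very", "good", "bad", "today", "is", "this", "nice"}
POSITIVE_WORDS = {"suka", "cantik", "love", "good", "nice"}
NEGATIVE_WORDS = {"teruk", "bad"}


def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_language(token):
    """Return 'ms', 'en', or 'other' for a single token (toy lookup)."""
    if token in MALAY_WORDS:
        return "ms"
    if token in ENGLISH_WORDS:
        return "en"
    return "other"


def is_code_switched(text):
    """A post counts as code-switched if it has both a Malay and an English word."""
    languages = {word_language(t) for t in tokenize(text)}
    return {"ms", "en"} <= languages


def sentiment_label(text):
    """Label a post by counting positive and negative lexicon hits."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE_WORDS for t in tokens) - sum(t in NEGATIVE_WORDS for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def annotate(posts):
    """Keep only code-switched posts and pair each with a sentiment label."""
    return [(p, sentiment_label(p)) for p in posts if is_code_switched(p)]


if __name__ == "__main__":
    sample = [
        "Saya suka this food",
        "I love the food",
        "Makanan ini teruk today",
        "Hari ini is today",
    ]
    for post, label in annotate(sample):
        print(f"{label:8s} {post}")
```

In this toy run the all-English post is filtered out, and the three mixed posts are labelled positive, negative, and neutral respectively. A real system would replace the word-list lookups with trained language identification and sentiment models, with manual checking of uncertain cases supplying the "semi-automatic" part.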
format Article
author Afifah, Mohd Shamsuddin
Sarah Flora, Samson Juan
Stephanie, Chua
Arif, Bramantoro
title Semi-Automatic Sentiment Identification for Malay-English Code-Switched Data
title_sort semi-automatic sentiment identification for malay-english code-switched data
publisher Penerbit Akademia Baru
publishDate 2024
url http://ir.unimas.my/id/eprint/47390/3/5640-Article%20Text-28397-1-10-20250109.pdf
http://ir.unimas.my/id/eprint/47390/
https://www.akademiabaru.com/submit/index.php/ard/article/view/5640
https://doi.org/10.37934/ard.123.1.198212