Semi-Automatic Sentiment Identification for Malay-English Code-Switched Data

Informal online communication on social media platforms like Twitter, Facebook, and YouTube often involves code-switching between languages, notably Malay and English, in Malaysia due to its diverse society. This phenomenon challenges sentiment analysis tasks, as the intermixi...

Full description

Saved in:
Bibliographic Details
Main Authors: Afifah, Mohd Shamsuddin, Sarah Flora, Samson Juan, Stephanie, Chua, Arif, Bramantoro
Format: Article
Language:English
Published: Penerbit Akademia Baru 2024
Subjects:
Online Access:http://ir.unimas.my/id/eprint/47390/3/5640-Article%20Text-28397-1-10-20250109.pdf
http://ir.unimas.my/id/eprint/47390/
https://www.akademiabaru.com/submit/index.php/ard/article/view/5640
https://doi.org/10.37934/ard.123.1.198212
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Informal online communication on social media platforms like Twitter, Facebook, and YouTube often involves code-switching between languages, notably Malay and English, in Malaysia due to its diverse society. This phenomenon challenges sentiment analysis tasks, as the intermixing of languages within sentences or phrasesincreases the likelihood of inaccurately classified sentiment. Sentiment analysers built with models trained with monolingual data will cause misclassification due to out-of-vocabulary issues, thus limiting the efficacy of these analysers on code-switcheddata. Plus, obtaining a code-switched corpus annotated with sentiment labels is scarce, and investigations on sentiment analysis on code-switched social media data are lacking, particularly on Malaysian social media posts. We proposed MESocSentiment, a Malay-English social media corpus with sentiment labels constructed via a semi-automatic sentiment identification method to address this challenge. The framework leveraged existing language identifiers and sentiment analysers to annotate code-switched data. We collected 229,566 tweets containing #Malaysia between August 12, 2022, and May 15, 2023. Using our strategy, we identified 19714 code-switched posts containing Malay and English words from the collection. Our analysis showed that 78.23% of the corpus had neutral sentiments, while 16.32% were positive and 5.44% negative. Furthermore, the descriptive analysis of tweet length revealed a range spanning 43 words, with a mean of 8.92 and a standard deviation of 5.56 words. This comprehensive framework contributes to a deeper understanding of sentiment expression within code-switched social media data, particularly in the context of Malaysia's linguistic and cultural diversity. Additionally, the MESocSentiment corpus is published on GitHub for future research.