Semi-Automatic Sentiment Identification for Malay-English Code-Switched Data
Informal online communication on social media platforms like Twitter, Facebook, and YouTube often involves code-switching between languages, notably Malay and English, in Malaysia due to its diverse society. This phenomenon challenges sentiment analysis tasks, as the intermixi...
Main Authors: | Afifah, Mohd Shamsuddin; Sarah Flora, Samson Juan; Stephanie, Chua; Arif, Bramantoro |
---|---|
Format: | Article |
Language: | English |
Published: | Penerbit Akademia Baru 2024 |
Subjects: | QA75 Electronic computers. Computer science |
Online Access: | http://ir.unimas.my/id/eprint/47390/3/5640-Article%20Text-28397-1-10-20250109.pdf http://ir.unimas.my/id/eprint/47390/ https://www.akademiabaru.com/submit/index.php/ard/article/view/5640 https://doi.org/10.37934/ard.123.1.198212 |
id |
my.unimas.ir-47390 |
---|---|
record_format |
eprints |
spelling |
my.unimas.ir-47390 2025-01-22T04:02:53Z http://ir.unimas.my/id/eprint/47390/ Penerbit Akademia Baru 2024-12 Article PeerReviewed text en http://ir.unimas.my/id/eprint/47390/3/5640-Article%20Text-28397-1-10-20250109.pdf Afifah, Mohd Shamsuddin and Sarah Flora, Samson Juan and Stephanie, Chua and Arif, Bramantoro (2024) Semi-Automatic Sentiment Identification for Malay-English Code-Switched Data. Journal of Advanced Research Design, 123 (1). pp. 198-212. ISSN 2289-7984 https://www.akademiabaru.com/submit/index.php/ard/article/view/5640 https://doi.org/10.37934/ard.123.1.198212 |
institution |
Universiti Malaysia Sarawak |
building |
Centre for Academic Information Services (CAIS) |
collection |
Institutional Repository |
continent |
Asia |
country |
Malaysia |
content_provider |
Universiti Malaysia Sarawak |
content_source |
UNIMAS Institutional Repository |
url_provider |
http://ir.unimas.my/ |
language |
English |
topic |
QA75 Electronic computers. Computer science |
description |
Informal online communication on social media platforms such as Twitter, Facebook, and YouTube often involves code-switching between languages; in Malaysia, with its diverse society, this is notably between Malay and English. This phenomenon challenges sentiment analysis, as the intermixing of languages within sentences or phrases increases the likelihood of inaccurately classified sentiment. Sentiment analysers built on models trained with monolingual data misclassify such text because of out-of-vocabulary issues, which limits their efficacy on code-switched data. In addition, code-switched corpora annotated with sentiment labels are scarce, and investigations of sentiment analysis on code-switched social media data are lacking, particularly for Malaysian social media posts. To address this challenge, we proposed MESocSentiment, a Malay-English social media corpus with sentiment labels constructed via a semi-automatic sentiment identification method. The framework leveraged existing language identifiers and sentiment analysers to annotate code-switched data. We collected 229,566 tweets containing #Malaysia between August 12, 2022, and May 15, 2023. Using our strategy, we identified 19,714 code-switched posts containing Malay and English words from the collection. Our analysis showed that 78.23% of the corpus had neutral sentiment, while 16.32% was positive and 5.44% negative. Furthermore, the descriptive analysis of tweet length revealed a range spanning 43 words, with a mean of 8.92 words and a standard deviation of 5.56 words. This comprehensive framework contributes to a deeper understanding of sentiment expression within code-switched social media data, particularly in the context of Malaysia's linguistic and cultural diversity. Additionally, the MESocSentiment corpus is published on GitHub for future research. |
format |
Article |
author |
Afifah, Mohd Shamsuddin; Sarah Flora, Samson Juan; Stephanie, Chua; Arif, Bramantoro |
author_sort |
Afifah, Mohd Shamsuddin |
title |
Semi-Automatic Sentiment Identification for Malay-English Code-Switched Data |
publisher |
Penerbit Akademia Baru |
publishDate |
2024 |
url |
http://ir.unimas.my/id/eprint/47390/3/5640-Article%20Text-28397-1-10-20250109.pdf http://ir.unimas.my/id/eprint/47390/ https://www.akademiabaru.com/submit/index.php/ard/article/view/5640 https://doi.org/10.37934/ard.123.1.198212 |
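
The description above says the framework combines existing language identifiers and sentiment analysers to annotate code-switched posts, but this record does not name the tools used. The following is only a minimal sketch of that general idea, not the authors' method: the word lists, the polarity scores, and the function names (`tag_languages`, `is_code_switched`, `label_sentiment`, `annotate`) are all hypothetical stand-ins, and a real pipeline would presumably use trained language-identification and sentiment models followed by a manual check.

```python
# Toy lexicons: hypothetical stand-ins for real language identifiers and
# sentiment analysers, which the catalog record does not name.
MALAY_WORDS = {"saya", "suka", "makan", "sangat", "tidak", "cantik", "teruk"}
ENGLISH_WORDS = {"i", "love", "the", "food", "very", "not", "bad", "nice", "view"}
POLARITY = {"suka": 1, "cantik": 1, "love": 1, "nice": 1, "teruk": -1, "bad": -1}


def tag_languages(text):
    """Label each token as 'ms' (Malay), 'en' (English), or 'other'."""
    tags = []
    for token in text.lower().split():
        word = token.strip(".,!?#@")  # drop surrounding punctuation and hashtag marks
        if word in MALAY_WORDS:
            tags.append((word, "ms"))
        elif word in ENGLISH_WORDS:
            tags.append((word, "en"))
        else:
            tags.append((word, "other"))
    return tags


def is_code_switched(text):
    """A post counts as code-switched if it has both Malay and English words."""
    languages = {lang for _, lang in tag_languages(text)}
    return {"ms", "en"} <= languages


def label_sentiment(text):
    """Sum word polarities and map the total to a three-way label."""
    score = sum(POLARITY.get(word, 0) for word, _ in tag_languages(text))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def annotate(posts):
    """Keep only code-switched posts and attach an automatic sentiment label."""
    return [(post, label_sentiment(post)) for post in posts if is_code_switched(post)]


if __name__ == "__main__":
    sample = [
        "Saya suka the food",
        "I love the view",               # English only, filtered out
        "Makan sangat teruk, very bad",
        "saya makan the food",
    ]
    for post, label in annotate(sample):
        print(f"{label:8s} {post}")
```

The sketch filters first and labels second, so monolingual posts never reach the sentiment step. It ignores negation ("tidak", "not") and treats out-of-lexicon words as neutral, which is the kind of gap a human review pass in a semi-automatic workflow would be expected to catch.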