A bi-annotated Malay-English code-switching (Manglish) dataset of X posts for biological gender identification and authorship attribution

Low-resource languages, like Malay, face the threat of extinction when linguistic resources become scarce. This paper addresses the scarcity issue by contributing to the inventory of low-resource languages, specifically focusing on Malay-English, known as Manglish. Manglish speakers are primarily...

Bibliographic Details
Main Authors: Maskat, Ruhaila, Azman, Norazmiera Ayunie, Nulizairos, Nur Shaheera Shastera, Zahidin, Nurul Athirah, Mahadi, Adibah Humairah Mahadi, Norshamsul, Siti Rubaya, Mohd Sharif, Mohd Mukhlis, Mahdin, Hairulnizam
Format: Article
Language: en
Published: Elsevier 2024
Subjects: QA Mathematics
Online Access: http://eprints.uthm.edu.my/11791/1/J17377_a3b15f369ba6e61ca5517eaf40899173.pdf
http://eprints.uthm.edu.my/11791/
https://doi.org/10.1016/j.dib.2024.110034
_version_ 1833419620314251264
author Maskat, Ruhaila
Azman, Norazmiera Ayunie
Nulizairos, Nur Shaheera Shastera
Zahidin, Nurul Athirah
Mahadi, Adibah Humairah Mahadi
Norshamsul, Siti Rubaya
Mohd Sharif, Mohd Mukhlis
Mahdin, Hairulnizam
author_sort Maskat, Ruhaila
building UTHM Library
collection Institutional Repository
content_provider Universiti Tun Hussein Onn Malaysia
content_source UTHM Institutional Repository
continent Asia
country Malaysia
description Low-resource languages, like Malay, face the threat of extinction when linguistic resources become scarce. This paper addresses the scarcity issue by contributing to the inventory of low-resource languages, specifically focusing on Malay-English code-switching, known as Manglish. Manglish speakers are primarily located in Malaysia, Indonesia, Brunei, and Singapore. As global adoption of second languages and social media usage increases, language code-switching, such as Spanglish and Chinglish, becomes more prevalent. To enhance the status of the Malay language and support its transition out of the low-resource category, this unique text corpus, with binary annotations for biological gender and anonymized author identities, is presented. This bi-annotated dataset offers valuable applications in various fields, including the investigation of cyberbullying, combating gender bias, and providing targeted recommendations for gender-specific products. The corpus can be used with either annotation or with both combined. The dataset comprises posts from 50 Malaysian public figures, equally split between biological males and females. It contains a total of 709,012 raw posts from X (formerly Twitter), with a relatively balanced distribution of 53.72% from biological female authors and 46.28% from biological male authors. The posts were collected using the Twitter API. After pre-processing, the total fell to 650,409 posts, widening the gap between the genders to 56.88% from biological female authors and 43.12% from biological male authors. This dataset is a valuable resource for researchers in the field of Malay-English code-switching Natural Language Processing (NLP) and can be used to train or enhance existing and future Manglish language transformers.
format Article
id my.uthm.eprints-11791
institution Universiti Tun Hussein Onn Malaysia
language en
publishDate 2024
publisher Elsevier
record_format eprints
spelling my.uthm.eprints-11791 2025-04-29T02:56:24Z http://eprints.uthm.edu.my/11791/ QA Mathematics Elsevier 2024 Article PeerReviewed text en http://eprints.uthm.edu.my/11791/1/J17377_a3b15f369ba6e61ca5517eaf40899173.pdf Maskat, Ruhaila and Azman, Norazmiera Ayunie and Nulizairos, Nur Shaheera Shastera and Zahidin, Nurul Athirah and Mahadi, Adibah Humairah Mahadi and Norshamsul, Siti Rubaya and Mohd Sharif, Mohd Mukhlis and Mahdin, Hairulnizam (2024) A bi-annotated Malay-English code-switching (Manglish) dataset of X posts for biological gender identification and authorship attribution. Data in Brief, 52. pp. 1-14. https://doi.org/10.1016/j.dib.2024.110034
title A bi-annotated Malay-English code-switching (Manglish) dataset of X posts for biological gender identification and authorship attribution
title_sort bi-annotated malay-english code-switching (manglish) dataset of x posts for biological gender identification and authorship attribution
topic QA Mathematics
url http://eprints.uthm.edu.my/11791/1/J17377_a3b15f369ba6e61ca5517eaf40899173.pdf
http://eprints.uthm.edu.my/11791/
https://doi.org/10.1016/j.dib.2024.110034
url_provider http://eprints.uthm.edu.my/