Data augmentation approach for language identification in imbalanced bilingual code-mixed social media datasets

Addressing the problem of language identification in code-mixed datasets poses notable challenges due to data scarcity and high confusability in bilingual contexts. These challenges are further amplified by the associated imbalance and noise characteristic of social media data, complicating efforts...

Bibliographic Details
Main Authors: Mohd Suhairi, Md Suhaimin; Mohd Hanafi, Ahmad Hijazi; Moung, Ervin Gubin; Mohd Azwan, Mohamad Hamza
Format: Conference or Workshop Item
Language: English
Published: Institute of Electrical and Electronics Engineers Inc. 2023
Online Access: http://umpir.ump.edu.my/id/eprint/40378/1/Data%20augmentation%20approach%20for%20language%20identification.pdf
http://umpir.ump.edu.my/id/eprint/40378/2/Data%20augmentation%20approach%20for%20language%20identification%20in%20imbalanced%20bilingual%20code-mixed%20social%20media%20datasets_ABS.pdf
http://umpir.ump.edu.my/id/eprint/40378/
https://doi.org/10.1109/IICAIET59451.2023.10292108
id my.ump.umpir.40378
record_format eprints
spelling my.ump.umpir.40378 2024-04-16T04:18:57Z
Conference or Workshop Item, PeerReviewed
Mohd Suhairi, Md Suhaimin and Mohd Hanafi, Ahmad Hijazi and Moung, Ervin Gubin and Mohd Azwan, Mohamad Hamza (2023) Data augmentation approach for language identification in imbalanced bilingual code-mixed social media datasets. In: 5th IEEE International Conference on Artificial Intelligence in Engineering and Technology, IICAIET 2023, 12-14 September 2023, Kota Kinabalu. pp. 257-261. (193996). ISBN 979-835030415-2 https://doi.org/10.1109/IICAIET59451.2023.10292108
institution Universiti Malaysia Pahang Al-Sultan Abdullah
building UMPSA Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Malaysia Pahang Al-Sultan Abdullah
content_source UMPSA Institutional Repository
url_provider http://umpir.ump.edu.my/
language English
topic QA75 Electronic computers. Computer science
QA76 Computer software
T Technology (General)
TA Engineering (General). Civil engineering (General)
description Addressing the problem of language identification in code-mixed datasets poses notable challenges due to data scarcity and high confusability in bilingual contexts. These challenges are further amplified by the associated imbalance and noise characteristic of social media data, complicating efforts to optimize performance. This paper introduces an augmentation approach designed to enhance language identification in bilingual code-mixed social media data. By incorporating reverse translation, semantic similarity, and sampling techniques alongside customized reprocessing strategies, our approach offers a comprehensive solution to these complex issues. To evaluate the effectiveness of the proposed approach, experiments were conducted on language identification at both the sentence and word levels. The results demonstrated the potential of the approach in optimizing language identification performance, offering a compelling combination of generation techniques for addressing the challenges of language identification in code-mixed data.
format Conference or Workshop Item
author Mohd Suhairi, Md Suhaimin
Mohd Hanafi, Ahmad Hijazi
Moung, Ervin Gubin
Mohd Azwan, Mohamad Hamza
title Data augmentation approach for language identification in imbalanced bilingual code-mixed social media datasets
publisher Institute of Electrical and Electronics Engineers Inc.
publishDate 2023