Staff View: Data augmentation approach for language identification in imbalanced bilingual code-mixed social media datasets

Addressing the problem of language identification in code-mixed datasets poses notable challenges due to data scarcity and high confusability in bilingual contexts. These challenges are further amplified by the associated imbalance and noise characteristic of social media data, complicating efforts...

Bibliographic Details
Main Authors:	Mohd Suhairi, Md Suhaimin; Mohd Hanafi, Ahmad Hijazi; Moung, Ervin Gubin; Mohd Azwan, Mohamad Hamza
Format:	Conference or Workshop Item
Language:	English
Published:	Institute of Electrical and Electronics Engineers Inc. 2023
Subjects:	QA75 Electronic computers. Computer science; QA76 Computer software; T Technology (General); TA Engineering (General). Civil engineering (General)
Online Access:	http://umpir.ump.edu.my/id/eprint/40378/1/Data%20augmentation%20approach%20for%20language%20identification.pdf http://umpir.ump.edu.my/id/eprint/40378/2/Data%20augmentation%20approach%20for%20language%20identification%20in%20imbalanced%20bilingual%20code-mixed%20social%20media%20datasets_ABS.pdf http://umpir.ump.edu.my/id/eprint/40378/ https://doi.org/10.1109/IICAIET59451.2023.10292108

id	my.ump.umpir.40378
record_format	eprints
spelling	my.ump.umpir.403782024-04-16T04:18:57Z http://umpir.ump.edu.my/id/eprint/40378/ Data augmentation approach for language identification in imbalanced bilingual code-mixed social media datasets Mohd Suhairi, Md Suhaimin Mohd Hanafi, Ahmad Hijazi Moung, Ervin Gubin Mohd Azwan, Mohamad Hamza QA75 Electronic computers. Computer science QA76 Computer software T Technology (General) TA Engineering (General). Civil engineering (General) Addressing the problem of language identification in code-mixed datasets poses notable challenges due to data scarcity and high confusability in bilingual contexts. These challenges are further amplified by the associated imbalance and noise characteristic of social media data, complicating efforts to optimize performance. This paper introduces an augmentation approach designed to enhance language identification in bilingual code-mixed social media data. By incorporating reverse translation, semantic similarity, and sampling techniques alongside customized reprocessing strategies, our approach offers a comprehensive solution to these complex issues. To evaluate the effectiveness of the proposed approach, experiments were conducted on language identification at both the sentence and word levels. The results demonstrated the potential of the approach in optimizing language identification performance, offering a compelling combination of generation techniques for addressing the challenges of language identification in code-mixed data. Institute of Electrical and Electronics Engineers Inc. 
2023 Conference or Workshop Item PeerReviewed pdf en http://umpir.ump.edu.my/id/eprint/40378/1/Data%20augmentation%20approach%20for%20language%20identification.pdf pdf en http://umpir.ump.edu.my/id/eprint/40378/2/Data%20augmentation%20approach%20for%20language%20identification%20in%20imbalanced%20bilingual%20code-mixed%20social%20media%20datasets_ABS.pdf Mohd Suhairi, Md Suhaimin and Mohd Hanafi, Ahmad Hijazi and Moung, Ervin Gubin and Mohd Azwan, Mohamad Hamza (2023) Data augmentation approach for language identification in imbalanced bilingual code-mixed social media datasets. In: 5th IEEE International Conference on Artificial Intelligence in Engineering and Technology, IICAIET 2023, 12-14 September 2023, Kota Kinabalu. pp. 257-261. (193996). ISBN 979-835030415-2 https://doi.org/10.1109/IICAIET59451.2023.10292108
institution	Universiti Malaysia Pahang Al-Sultan Abdullah
building	UMPSA Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Malaysia Pahang Al-Sultan Abdullah
content_source	UMPSA Institutional Repository
url_provider	http://umpir.ump.edu.my/
language	English
topic	QA75 Electronic computers. Computer science; QA76 Computer software; T Technology (General); TA Engineering (General). Civil engineering (General)
spellingShingle	QA75 Electronic computers. Computer science QA76 Computer software T Technology (General) TA Engineering (General). Civil engineering (General) Mohd Suhairi, Md Suhaimin Mohd Hanafi, Ahmad Hijazi Moung, Ervin Gubin Mohd Azwan, Mohamad Hamza Data augmentation approach for language identification in imbalanced bilingual code-mixed social media datasets
description	Addressing the problem of language identification in code-mixed datasets poses notable challenges due to data scarcity and high confusability in bilingual contexts. These challenges are further amplified by the associated imbalance and noise characteristic of social media data, complicating efforts to optimize performance. This paper introduces an augmentation approach designed to enhance language identification in bilingual code-mixed social media data. By incorporating reverse translation, semantic similarity, and sampling techniques alongside customized reprocessing strategies, our approach offers a comprehensive solution to these complex issues. To evaluate the effectiveness of the proposed approach, experiments were conducted on language identification at both the sentence and word levels. The results demonstrated the potential of the approach in optimizing language identification performance, offering a compelling combination of generation techniques for addressing the challenges of language identification in code-mixed data.
format	Conference or Workshop Item
author	Mohd Suhairi, Md Suhaimin; Mohd Hanafi, Ahmad Hijazi; Moung, Ervin Gubin; Mohd Azwan, Mohamad Hamza
author_facet	Mohd Suhairi, Md Suhaimin; Mohd Hanafi, Ahmad Hijazi; Moung, Ervin Gubin; Mohd Azwan, Mohamad Hamza
author_sort	Mohd Suhairi, Md Suhaimin
title	Data augmentation approach for language identification in imbalanced bilingual code-mixed social media datasets
title_short	Data augmentation approach for language identification in imbalanced bilingual code-mixed social media datasets
title_full	Data augmentation approach for language identification in imbalanced bilingual code-mixed social media datasets
title_fullStr	Data augmentation approach for language identification in imbalanced bilingual code-mixed social media datasets
title_full_unstemmed	Data augmentation approach for language identification in imbalanced bilingual code-mixed social media datasets
title_sort	data augmentation approach for language identification in imbalanced bilingual code-mixed social media datasets
publisher	Institute of Electrical and Electronics Engineers Inc.
publishDate	2023
url	http://umpir.ump.edu.my/id/eprint/40378/1/Data%20augmentation%20approach%20for%20language%20identification.pdf http://umpir.ump.edu.my/id/eprint/40378/2/Data%20augmentation%20approach%20for%20language%20identification%20in%20imbalanced%20bilingual%20code-mixed%20social%20media%20datasets_ABS.pdf http://umpir.ump.edu.my/id/eprint/40378/ https://doi.org/10.1109/IICAIET59451.2023.10292108
_version_	1822924226039906304
score	13.232414

Similar Items