Data augmentation approach for language identification in imbalanced bilingual code-mixed social media datasets
Main Authors: | , , , |
---|---|
Format: | Conference or Workshop Item |
Language: | English |
Published: | Institute of Electrical and Electronics Engineers Inc. 2023 |
Online Access: | http://umpir.ump.edu.my/id/eprint/40378/1/Data%20augmentation%20approach%20for%20language%20identification.pdf http://umpir.ump.edu.my/id/eprint/40378/2/Data%20augmentation%20approach%20for%20language%20identification%20in%20imbalanced%20bilingual%20code-mixed%20social%20media%20datasets_ABS.pdf http://umpir.ump.edu.my/id/eprint/40378/ https://doi.org/10.1109/IICAIET59451.2023.10292108 |
Summary: | Addressing the problem of language identification in code-mixed datasets poses notable challenges due to data scarcity and high confusability in bilingual contexts. These challenges are further amplified by the associated imbalance and noise characteristic of social media data, complicating efforts to optimize performance. This paper introduces an augmentation approach designed to enhance language identification in bilingual code-mixed social media data. By incorporating reverse translation, semantic similarity, and sampling techniques alongside customized reprocessing strategies, our approach offers a comprehensive solution to these complex issues. To evaluate the effectiveness of the proposed approach, experiments were conducted on language identification at both the sentence and word levels. The results demonstrated the potential of the approach in optimizing language identification performance, offering a compelling combination of generation techniques for addressing the challenges of language identification in code-mixed data. |