Optimizing the impact of data augmentation for low-resource grammatical error correction

Bibliographic Details
Main Authors: Aiman Solyman, Marco Zappatore, Wang Zhenyu, Zeinab Mahmoud, Ali Alfatemi, Ashraf Osman Ibrahim Elsayed, Lubna Abdelkareim Gabralla
Format: Article
Language:en
Published: Elsevier B.V. 2023
Subjects:
Online Access:https://eprints.ums.edu.my/id/eprint/45150/1/FULL%20TEXT.pdf
https://eprints.ums.edu.my/id/eprint/45150/
https://doi.org/10.1016/j.jksuci.2023.101572
Description
Summary: Grammatical Error Correction (GEC) refers to the automatic identification and amendment of grammatical, spelling, punctuation, and word-positioning errors in monolingual texts. Neural Machine Translation (NMT) is nowadays one of the most valuable techniques used for GEC, but it may suffer from scarcity of training data and domain shift, depending on the addressed language. However, current techniques that tackle the data sparsity problem associated with NMT (e.g., tuning pre-trained language models or developing spell-confusion methods without focusing on language diversity) create mismatched data distributions. This paper proposes new aggressive transformation approaches that augment data during training and extend the distribution of authentic data. In particular, augmented data are used as auxiliary tasks to provide new contexts when the target prefix is not helpful for next-word prediction. This enhances the encoder and steadily increases its contribution by forcing the GEC model to pay more attention to the encoder's text representations during decoding. The impact of these approaches was investigated using a Transformer-based model for the low-resource GEC task, with Arabic GEC as a case study. GEC models trained with our data rely more on source information, are more robust to domain shift, and hallucinate less under tiny training datasets and domain shift. Experimental results showed that the proposed approaches outperformed the baseline, the most common data augmentation methods, and classical synthetic data approaches. In addition, a combination of the three best approaches (Misspelling, Swap, and Reverse) achieved the best score on two benchmarks and outperformed previous Arabic GEC approaches.
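The three transformations named in the abstract (Misspelling, Swap, and Reverse) can be sketched as simple token-level corruptions of the source sentence. The sketch below is a hypothetical illustration of what such augmenters might look like; the function names, probabilities, and character-level edit operations are assumptions, not the paper's actual implementation.

```python
import random


def misspell(tokens, p=0.1, seed=None):
    """Misspelling: randomly drop, duplicate, or transpose a character in a token."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if len(tok) > 1 and rng.random() < p:
            i = rng.randrange(len(tok) - 1)
            op = rng.choice(["drop", "dup", "transpose"])
            if op == "drop":
                tok = tok[:i] + tok[i + 1:]
            elif op == "dup":
                tok = tok[:i + 1] + tok[i] + tok[i + 1:]
            else:  # transpose adjacent characters
                tok = tok[:i] + tok[i + 1] + tok[i] + tok[i + 2:]
        out.append(tok)
    return out


def swap(tokens, p=0.1, seed=None):
    """Swap: exchange adjacent token pairs with probability p."""
    rng = random.Random(seed)
    out = list(tokens)
    i = 0
    while i < len(out) - 1:
        if rng.random() < p:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return out


def reverse(tokens):
    """Reverse: invert the token order of the whole source sentence."""
    return tokens[::-1]


# Example: corrupting a source sentence to create an auxiliary training pair.
sentence = ["the", "model", "corrects", "errors"]
print(reverse(sentence))        # ['errors', 'corrects', 'model', 'the']
print(swap(sentence, p=1.0))    # ['model', 'the', 'errors', 'corrects']
```

In an NMT-style GEC pipeline, such corrupted sentences would serve as additional source-side inputs paired with the original (clean) targets, broadening the training distribution beyond the authentic parallel data.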