Language Modelling for a Low-Resource Language in Sarawak, Malaysia

Bibliographic Details
Main Authors: Sarah Flora Samson Juan, Muhamad Fikri Bin Che Ismail, Hamimah Binti Ujir, Irwandi Hipni Bin Mohamad Hipiny
Format: Book Chapter
Language: English
Published: Springer, Singapore 2019
Online Access:http://ir.unimas.my/id/eprint/28716/1/Language%20Modelling%20for%20a%20Low-Resource%20Language%20in%20Sarawak%2C%20Malaysia.pdf
http://ir.unimas.my/id/eprint/28716/
https://link.springer.com/chapter/10.1007/978-981-15-1289-6_14
Description
Summary: This paper explores state-of-the-art techniques for creating language models in a low-resource setting. It is known that building a good statistical language model requires a large amount of data; models trained on a low-resource language therefore suffer from poor performance. We conducted a study of current language modelling techniques, such as n-gram and recurrent neural network (RNN) models, to observe their outcomes on data from a language in Sarawak, Malaysia. The target language is Iban, a widely spoken language in this region. We collected news data from an online source to build an Iban text corpus. After normalising the data, we trained trigram and RNN language models and tested them on automatic speech recognition data. Based on our results, we observed that the RNN language models did not significantly outperform the trigram language models. A slight improvement in the RNN model was seen after the size of the training data was increased. We also experimented with merging n-gram and RNN language models, and obtained a 32.33% improvement using a trigram-RNN language model.
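The merging of n-gram and RNN models described in the summary is commonly realised as linear interpolation of the two models' word probabilities. The sketch below is not the authors' code: it builds an add-one smoothed trigram model over a few toy token strings (stand-ins for the Iban news corpus), uses a uniform distribution as a placeholder for the RNN probability, and interpolates the two with an illustrative weight. All names, sentences, and parameters here are assumptions for illustration.

```python
import math
from collections import Counter

# Toy sentences standing in for the normalised Iban news corpus.
corpus = [
    "berita ari sarawak",
    "berita ari malaysia",
    "jaku iban ari sarawak",
]

BOS, EOS = "<s>", "</s>"

def ngrams(tokens, n):
    """Pad a token list and return its n-grams as tuples."""
    padded = [BOS] * (n - 1) + tokens + [EOS]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

# Count trigrams and their bigram histories over the corpus.
tri_counts, bi_counts = Counter(), Counter()
vocab = {BOS, EOS}
for line in corpus:
    toks = line.split()
    vocab.update(toks)
    for tri in ngrams(toks, 3):
        tri_counts[tri] += 1
        bi_counts[tri[:2]] += 1

def p_trigram(w, h1, h2, alpha=1.0):
    """Add-alpha smoothed trigram probability P(w | h1 h2)."""
    return (tri_counts[(h1, h2, w)] + alpha) / (bi_counts[(h1, h2)] + alpha * len(vocab))

def p_rnn(w, h1, h2):
    """Placeholder for an RNN LM's conditional probability; uniform here."""
    return 1.0 / len(vocab)

def p_interp(w, h1, h2, lam=0.5):
    """Linear interpolation of trigram and RNN probabilities (the 'merged' model)."""
    return lam * p_trigram(w, h1, h2) + (1.0 - lam) * p_rnn(w, h1, h2)

def perplexity(sent, prob_fn):
    """Per-token perplexity of a sentence under the given probability function."""
    toks = sent.split()
    logp = sum(math.log(prob_fn(w, h1, h2)) for h1, h2, w in ngrams(toks, 3))
    return math.exp(-logp / (len(toks) + 1))  # +1 accounts for </s>

print(perplexity("berita ari sarawak", p_trigram))
print(perplexity("berita ari sarawak", p_interp))
```

In practice the interpolation weight `lam` would be tuned on held-out data, and `p_rnn` would query a trained recurrent network rather than a uniform distribution; the structure of the combination is the same.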