Language Modelling for a Low-Resource Language in Sarawak, Malaysia
| Main Authors: | , , , |
|---|---|
| Format: | Book Chapter |
| Language: | en |
| Published: | Springer, Singapore, 2019 |
| Online Access: | http://ir.unimas.my/id/eprint/28716/1/Language%20Modelling%20for%20a%20Low-Resource%20Language%20in%20Sarawak%2C%20Malaysia.pdf http://ir.unimas.my/id/eprint/28716/ https://link.springer.com/chapter/10.1007/978-981-15-1289-6_14 |
| Summary: | This paper explores state-of-the-art techniques for creating language models in a low-resource setting. Building a good statistical language model is known to require a large amount of data, so models trained on a low-resource language suffer from poor performance. We conducted a study of current language modelling techniques, such as n-gram and recurrent neural network (RNN) models, to observe their outcomes on data from a language in Sarawak, Malaysia. The target language is Iban, a widely spoken language in this region. We collected news data from an online source to build an Iban text corpus. After normalising the data, we trained trigram and RNN language models and tested them on automatic speech recognition data. Based on our results, the RNN language models did not significantly outperform the trigram language models. A slight improvement in the RNN model was seen after the size of the training data was increased. We also experimented with merging n-gram and RNN language models, obtaining a 32.33% improvement with a trigram-RNN language model. |
|---|---|
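The summary mentions training a trigram model and combining it with an RNN via model merging. A common way to merge is linear interpolation of the two models' probabilities. The sketch below is illustrative only, not the authors' implementation: the add-alpha smoothing, the interpolation weight, and all function names are assumptions.

```python
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their bigram contexts from tokenised sentences."""
    tri, bi = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(len(toks) - 2):
            tri[tuple(toks[i:i + 3])] += 1
            bi[tuple(toks[i:i + 2])] += 1
    return tri, bi

def trigram_prob(tri, bi, w1, w2, w3, vocab_size, alpha=1.0):
    """Add-alpha smoothed trigram probability P(w3 | w1, w2)."""
    return (tri[(w1, w2, w3)] + alpha) / (bi[(w1, w2)] + alpha * vocab_size)

def interpolate(p_ngram, p_rnn, lam=0.5):
    """Linear interpolation of n-gram and RNN probabilities for one word."""
    return lam * p_ngram + (1 - lam) * p_rnn
```

In practice the RNN probability would come from a trained network and the weight `lam` would be tuned on held-out data; here both are placeholders to show the combination step.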
