Corpus-driven Malay language tweet normalization / Mohammad Arshi Saloot
The rapid spread of blogs, microblogs, and social network services has accelerated the use of casual written language, known as user-generated content (UGC). UGC diverges from standard writing conventions because of coding strategies such as phonetic transcriptions (a...
Main Author: | Mohammad Arshi Saloot |
---|---|
Format: | Thesis |
Published: | 2018 |
Subjects: | QA75 Electronic computers. Computer science |
Online Access: | http://studentsrepo.um.edu.my/8982/1/Mohammad_Arshi.pdf http://studentsrepo.um.edu.my/8982/6/arshi.pdf http://studentsrepo.um.edu.my/8982/ |
id |
my.um.stud.8982 |
---|---|
record_format |
eprints |
spelling |
my.um.stud.8982 2021-03-16T00:55:50Z Corpus-driven Malay language tweet normalization / Mohammad Arshi Saloot Mohammad Arshi , Saloot QA75 Electronic computers. Computer science 2018-07 Thesis NonPeerReviewed application/pdf http://studentsrepo.um.edu.my/8982/1/Mohammad_Arshi.pdf application/pdf http://studentsrepo.um.edu.my/8982/6/arshi.pdf Mohammad Arshi , Saloot (2018) Corpus-driven Malay language tweet normalization / Mohammad Arshi Saloot. PhD thesis, University of Malaya. http://studentsrepo.um.edu.my/8982/ |
institution |
Universiti Malaya |
building |
UM Library |
collection |
Institutional Repository |
continent |
Asia |
country |
Malaysia |
content_provider |
Universiti Malaya |
content_source |
UM Student Repository |
url_provider |
http://studentsrepo.um.edu.my/ |
topic |
QA75 Electronic computers. Computer science |
spellingShingle |
QA75 Electronic computers. Computer science Mohammad Arshi , Saloot Corpus-driven Malay language tweet normalization / Mohammad Arshi Saloot |
description |
The rapid spread of blogs, microblogs, and social network services has accelerated the use of casual written language, known as user-generated content (UGC). UGC diverges from standard writing conventions because of coding strategies such as phonetic transcriptions (are → r), digit phonemes (me too → me2), misspellings (misappropriate → missapropriate), vowel drops (double → dble), and missing or incorrect punctuation marks (In that situation, I'd possibly come. → In that situation Id possibly come). These modifications stem from three main factors: 1) limited message length (e.g. 140 characters per Tweet); 2) miniature keyboards; and 3) the extensive use of UGC in unofficial and informal communication. The resulting abundance of out-of-vocabulary (OOV) words, also known as unknown words, substantially disrupts standard natural language processing (NLP) systems. NLP research has therefore increasingly focused on the text normalization task, in which OOV words are converted into their context-appropriate standard forms. While diverse normalization approaches exist for English, the problem has been neglected for other languages, such as Malay. In this work, Malay is chosen because of its considerable presence on Twitter, where it is the fourth most-used language. A rule-based approach to normalizing Malay Twitter messages is proposed, based on corpus-driven analysis. Corpus-driven analysis relies on frequencies in deriving word-frequency lists, concordances, clusters, and keywords. To design the normalization system, three analysis tasks were performed on the Malay Twitter corpus and a standard Malay corpus: 1) frequency of unknown words; 2) abbreviation patterns; and 3) letter repetition. A Malay Twitter corpus, the Malay Chat-style Corpus (MCC), is constructed.
The MCC encompasses 1 million Twitter messages and consists of 14,484,384 word instances, 646,807 unique vocabulary items, and metadata such as the Twitter client application used, the posting time, and the type of Twitter message (simple Tweet, Retweet, Reply). To build the MCC, which represents Malay Twitter lingo, the following corpus-compilation criteria were considered: sampling, representativeness, machine readability, balance, and size of data. A portion of the MCC is manually annotated for use in the development and testing stages of the normalization system. The architecture of the Malay normalization system contains seven primary modules: (1) enhanced tokenization; (2) in-vocabulary (IV) detection; (3) colloquial dictionary lookup; (4) repeated letter elimination; (5) abbreviation normalizer; (6) English word translation; and (7) de-tokenization. The normalization modules are formulated from the results of the MCC analysis and implemented as rule-based state machines. The accuracy of the system is evaluated in terms of BLEU score. The result is encouraging: the system achieves a BLEU score of 0.91 against a baseline BLEU score of 0.46. To compare the accuracy of the system with a probabilistic approach on the identical Malay dataset, a statistical machine translation (SMT) normalization system was implemented, trained, and evaluated. The experimental results show that the proposed architecture, which is designed from the results of the corpus-driven analysis, achieves higher accuracy. |
format |
Thesis |
author |
Mohammad Arshi , Saloot |
title |
Corpus-driven Malay language tweet normalization / Mohammad Arshi Saloot |
title_sort |
corpus-driven malay language tweet normalization / mohammad arshi saloot |
publishDate |
2018 |
url |
http://studentsrepo.um.edu.my/8982/1/Mohammad_Arshi.pdf http://studentsrepo.um.edu.my/8982/6/arshi.pdf http://studentsrepo.um.edu.my/8982/ |
_version_ |
1738506210424913920 |
score |
13.223943 |
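
The module sequence named in the description (tokenization, IV detection, colloquial dictionary lookup, repeated letter elimination, de-tokenization) can be pictured with a minimal Python sketch. This is not the thesis implementation: the record does not give the actual rules, dictionaries, or state machines, so the tiny word lists, the function names, the lowercasing step, and the "try two letters, then one" repeat-collapsing heuristic are all illustrative assumptions. The abbreviation normalizer and the English word translation module are omitted.

```python
import re

# Hypothetical toy lexicons for illustration only; the thesis's real
# standard vocabulary and colloquial dictionary are not part of this record.
STANDARD_VOCAB = {"saya", "tidak", "suka", "maaf", "sangat"}
COLLOQUIAL_DICT = {"sy": "saya", "x": "tidak", "tak": "tidak", "sgt": "sangat"}

# Words (letters/digits) or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into lowercase word and punctuation tokens (assumed scheme)."""
    return TOKEN_RE.findall(text.lower())


def collapse_repeats(token):
    """Shrink runs of a repeated letter to two, then to one, letters.

    Returns the first candidate found in the standard vocabulary,
    or None if neither candidate is a known word.
    """
    for keep in (2, 1):
        candidate = re.sub(r"(.)\1+", lambda m: m.group(1) * keep, token)
        if candidate in STANDARD_VOCAB:
            return candidate
    return None


def normalize_token(token):
    """Apply IV detection, dictionary lookup, then repeat elimination."""
    if not token.isalpha() or token in STANDARD_VOCAB:
        return token  # punctuation, mixed tokens, and IV words pass through
    if token in COLLOQUIAL_DICT:
        return COLLOQUIAL_DICT[token]
    collapsed = collapse_repeats(token)
    return collapsed if collapsed is not None else token


def detokenize(tokens):
    """Join tokens with spaces and re-attach common punctuation marks."""
    return re.sub(r"\s+([.,!?])", r"\1", " ".join(tokens))


def normalize(text):
    return detokenize([normalize_token(t) for t in tokenize(text)])


print(normalize("sy x sukaaaa"))
print(normalize("maaaf sgt!"))
```

Trying the two-letter collapse before the one-letter collapse lets a word with a genuine double letter, such as "maaf", survive, while an elongated form such as "sukaaaa" still falls back to "suka". A token that matches nothing is returned unchanged rather than guessed at.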