Generic named-entity recognition for indigenous languages of Sarawak (NERSIL)
Main Author: | |
---|---|
Format: | Thesis |
Language: | English |
Published: | Universiti Malaysia Sarawak (UNIMAS), 2013 |
Subjects: | |
Online Access: | http://ir.unimas.my/id/eprint/8340/3/Generic%20Named-Entity%20Recognition%20For%20Indigenous%20Languages%20of%20Sarawak%20%28NERSIL%29%20%28full%29.pdf http://ir.unimas.my/id/eprint/8340/ |
Summary: | The aim of this research is to create the first Named Entity Recognition (NER) system for the Sarawak Indigenous Languages (SILs), hereinafter called NERSIL. The main goal of NERSIL is to achieve good accuracy in identifying and classifying named entities (NEs). The NEs considered in this research are Person, Location, Organisation, Date, Time, Monetary and Percentage. Generally, all these NEs carry important information about the text itself; thus, they are targets for extraction. NER approaches can be categorised broadly as rule-based, machine learning-based, and hybrid. The rule-based approach relies on hand-crafted linguistic grammars. The machine learning-based approach needs a large amount of annotated training data, which is unavailable for SILs. The hybrid approach combines the rule-based and machine learning-based approaches. NERSIL requires special attention because the existing NER approaches cannot be applied to SILs directly. In this thesis, an NER system built by extending and modifying the existing NER approaches is presented. There are three main processes: the non-modified ANNIE (A Nearly-New IE system) NER, the adaptation of ANNIE to SILs, and finally the context investigation. Firstly, the input texts are submitted to an English NER, in this case ANNIE, on the assumption that some NEs that appear in English texts will also occur in SIL texts. At that stage, the rules for unrecognised NEs are distinguished from the rules for recognised NEs. Next, new rules for unrecognised NEs are written and new gazetteers for SILs are built in order to identify more NEs. However, the first two processes are not enough to provide good accuracy in recognising all NEs. Thus, context investigation is needed. Context investigation includes frequency analysis, triggered word filtering, and concordance analysis. The context of an NE (the words to its left or right) is investigated. Finally, an NER system designed for SILs will be an advancement of world knowledge. In addition, the design can be improved by incorporating machine translation and WordNet and by adding more noise filtering (e.g. context filtering and morphological filtering). With more research and future studies, this NER system is expected to reach a high level of performance comparable to that of English NER systems. |
---|
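
The summary mentions gazetteer lookup followed by context investigation (frequency and concordance analysis of the words beside an NE). The following is only a minimal sketch of that general idea in plain Python, not the thesis's implementation and not ANNIE's API. The gazetteer entries, the English toy sentence, and all function names are hypothetical and chosen for illustration.

```python
from collections import Counter

# Hypothetical toy gazetteer; these entries are placeholders, not data from the thesis.
GAZETTEER = {"kuching": "Location", "sibu": "Location", "unimas": "Organisation"}


def tag_with_gazetteer(tokens):
    """Return (index, token, label) for every token found in the gazetteer."""
    return [
        (i, tok, GAZETTEER[tok.lower()])
        for i, tok in enumerate(tokens)
        if tok.lower() in GAZETTEER
    ]


def concordance(tokens, index, window=2):
    """Return the left and right context (up to `window` tokens each) of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left, right


def left_context_frequencies(tokens, matches, window=1):
    """Count the words that occur within `window` tokens to the left of matched entities."""
    counts = Counter()
    for index, _, _ in matches:
        left, _ = concordance(tokens, index, window)
        counts.update(word.lower() for word in left)
    return counts


# Hypothetical English toy sentence, used only to exercise the functions.
tokens = ["She", "moved", "to", "Kuching", "and", "later", "to", "Sibu"]
matches = tag_with_gazetteer(tokens)
frequencies = left_context_frequencies(tokens, matches)
```

In this sketch, a word that frequently precedes known entities (here `to`) would be a candidate trigger word for spotting NEs that the gazetteer misses. How the thesis actually selects and filters such words is not stated in this record.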