Generating Pashto Clitics

Bibliographic Details
Main Author: Aziz, Ud Din
Format: Thesis
Language: English
Published: Universiti Malaysia Sarawak (UNIMAS) 2017
Online Access: http://ir.unimas.my/id/eprint/31749/1/Aziz%20Ud%20Din%20ft.pdf
http://ir.unimas.my/id/eprint/31749/
Description
Summary: Automatically generating natural text that is perceived as grammatically correct remains a challenging task. The generated text must at least be coherent, accurate, and understandable. This research concerns the automatic generation of clitics in Pashto texts, since native Pashto speakers use clitics extensively in everyday conversation and writing. A clitic is a word or particle that cannot bear accent or stress and leans phonetically on an adjacent accented word. Pashto, spoken in Pakistan and Afghanistan, is one of several languages featuring clitics. There are two main types of clitics in Pashto: Second Position (2P) clitics and endoclitics. The linguistic behaviours of these clitics are studied and formalised into rules. The design of the Pashto clitic generation system is approached in two ways. In the first approach, the system generates cliticised sentences from the semantic representations of the sentences. This system has been implemented using Combinatory Categorial Grammar (CCG). The second approach operates on the surface representation of sentences. It uses syntactic pattern-matching rules to identify and generate clitics at the sentence level. In this system, a text can be generated separately, and the clitic generation rules are then applied to its sentences as a post-processing step. This system has been implemented in Python. The main advantage of this method is the separation of the clitic generation task from the text generation task. The evaluation of the proposed solutions has been constrained mainly by the lack of a morphosyntactically annotated corpus and of language processing tools for Pashto. Nevertheless, two independent corpora were developed. The first corpus contained semantic representations for generating 12 sentences based on the Pashto CCG grammar. The second corpus consisted of 256 syntactically annotated sentences to evaluate the Python-based clitic generation system.
The system is capable of generating all Pashto clitics, including endoclitics, the most challenging type because of the many constraints on their generation. All of the target sentences are successfully realised by the CCG grammar. The Python-based Pashto clitic generator achieves an accuracy of 89.62% on the test corpus. The sentences generated incorrectly by the Python-based generator were fed to the CCG generator to evaluate the agreement between the two systems. The accuracy achieved in this case is 87.5%.
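The post-processing idea in the second approach can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `place_2p_clitic`, the tag set `HOST_TAGS`, and the placeholder tokens are all hypothetical, and the rule is a deliberate simplification in which a second-position clitic is inserted after the first eligible constituent of an already generated, syntactically tagged sentence.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, syntactic tag)

# Hypothetical tag set: constituents assumed able to host a 2P clitic.
HOST_TAGS = {"NP", "PP", "ADV", "V"}


def place_2p_clitic(sentence: List[Token], clitic: str) -> List[Token]:
    """Insert `clitic` after the first token whose tag is in HOST_TAGS.

    Simplified stand-in for a pattern-matching rule; real rules would
    need many more constraints (stress, endoclitic hosts, and so on).
    """
    for i, (_, tag) in enumerate(sentence):
        if tag in HOST_TAGS:
            return sentence[: i + 1] + [(clitic, "CL")] + sentence[i + 1 :]
    # No eligible host found: return an unchanged copy of the sentence.
    return list(sentence)


# Placeholder tokens, not real Pashto words.
generated = [("w1", "NP"), ("w2", "NP"), ("w3", "V")]
cliticised = place_2p_clitic(generated, "CL2P")
print(" ".join(form for form, _ in cliticised))
```

Because the rule takes a finished sentence as input, the text generator and the clitic generator stay independent, which is the separation the summary describes.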