Authenticating sensitive diacritical texts using residual, data representation and pattern matching methods / Saqib Iqbal Hakak
Main Author: | Saqib Iqbal Hakak |
---|---|
Format: | Thesis |
Published: | 2018 |
Subjects: | |
Online Access: | http://studentsrepo.um.edu.my/10408/1/Saqib_Iqbal_Hakak.pdf http://studentsrepo.um.edu.my/10408/2/Saqib_Iqbal_Hakak_%E2%80%93_Thesis.pdf http://studentsrepo.um.edu.my/10408/ |
Summary: | Diacritics play an important role in interpreting the meaning of a sentence through proper pronunciation. Any text that relies on diacritics is sensitive, because any disarrangement of diacritics (intentional or unintentional) will result in complete misinterpretation of the text. Different diacritics, such as punctuation symbols, extended letters (e.g. kashidas) and other symbols, can easily be tampered with to alter the original meaning of the text. Few studies have focused on the authentication of such sensitive diacritical content (SDC). Most of them remove the diacritics before authentication, which makes the process questionable. In addition, the proliferation of such sensitive content in different languages and formats on the internet has further exacerbated the problem of authentication, which involves search and retrieval phases. To address these issues, this thesis presents three methods for authenticating SDC, with the aim of improving the search and retrieval phases. The first method is based on a residual approach that authenticates any two similar sample texts written in different styles using one common database. It minimizes the overhead associated with maintaining multiple databases. This objective is achieved using logical operations and character segmentation. The second method is based on the representation of the diacritical text within the database, and improves retrieval performance when authenticating a single sentence (verse). This objective is achieved by creating individual nodes based on the total number of characters and placing each diacritical verse within its respective node. The last method is based on a pattern matching approach, in which multiple input patterns are authenticated against a given text. The purpose of exploring the pattern matching approach is to authenticate multiple diacritical verses with improved time and space efficiency. 
The proposed method works by splitting the given pattern into two halves and searching for the respective halves. The halves are searched using two different algorithms, based on the split approach and the parallel approach respectively. To show the practicality of the proposed methods, they are tested on sensitive diacritical text, which includes the Arabic Digital Holy Quran (DHQ). The first reason for selecting the DHQ for evaluation is its availability in different styles, such as the Uthmani and plain Arabic styles, which makes evaluation of the first method possible. The second reason is the complexity of the diacritics within the DHQ and of its encoding scheme, which decreases authentication performance when data representation and search/retrieval strategies are inefficient. This made the evaluation of the second proposed method feasible and practical. Finally, to evaluate the pattern matching based approach, different sensitive texts in Arabic, French, Italian, English and Chinese were used. The findings show that the first method manages to convert Uthmani and Plain Quranic verses into one common style with an accuracy of about 87%. Similarly, the second method manages to authenticate a single DHQ verse with an improvement in search time of approximately 70% over the existing methods. Lastly, the third method successfully authenticates multiple verses of different sensitive diacritical texts with improved computational time and memory consumption. |
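The second and third methods in the summary can be pictured with a short Python sketch. It is not taken from the thesis; it is only one plausible reading of the abstract. All names (`normalize`, `LengthBucketIndex`, `is_authentic`, `find_by_halves`) are hypothetical. It assumes that "total number of characters" means Unicode code points after NFC normalization, with diacritics counted, and that a pattern can be located by finding its first half and then checking the second half in place. The thesis's own split and parallel algorithms may work quite differently.

```python
import unicodedata
from collections import defaultdict


def normalize(text):
    # NFC puts combining marks in canonical order so equal strings compare
    # equal. Diacritics are kept, never stripped.
    return unicodedata.normalize("NFC", text)


class LengthBucketIndex:
    """Hypothetical index: one bucket ("node") per verse length.

    Assumption: length is counted in code points, diacritics included.
    """

    def __init__(self, verses):
        self.buckets = defaultdict(set)
        for verse in verses:
            v = normalize(verse)
            self.buckets[len(v)].add(v)

    def is_authentic(self, candidate):
        # Only the bucket for the candidate's own length is consulted,
        # so verses of any other length are never compared.
        c = normalize(candidate)
        return c in self.buckets.get(len(c), set())


def find_by_halves(text, pattern):
    """Return every start offset of `pattern` in `text`.

    Assumption: the pattern is split in the middle, the first half is
    searched for, and the second half is then verified right after it.
    """
    if not pattern:
        return []
    mid = len(pattern) // 2
    left, right = pattern[:mid], pattern[mid:]
    hits = []
    start = text.find(left)
    while start != -1:
        if text.startswith(right, start + mid):
            hits.append(start)
        start = text.find(left, start + 1)
    return hits
```

With an index built from a single fully vocalized word, a copy in which one vowel mark has been swapped falls into the same length bucket but fails the exact comparison, and a copy with all marks stripped falls into a different bucket altogether. Both are rejected, which is the behaviour the summary asks for when it criticizes approaches that remove diacritics before authentication.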