Authenticating sensitive diacritical texts using residual, data representation and pattern matching methods / Saqib Iqbal Hakak

Bibliographic Details
Main Author: Saqib Iqbal Hakak
Format: Thesis
Published: 2018
Online Access: http://studentsrepo.um.edu.my/10408/1/Saqib_Iqbal_Hakak.pdf
http://studentsrepo.um.edu.my/10408/2/Saqib_Iqbal_Hakak_%E2%80%93_Thesis.pdf
http://studentsrepo.um.edu.my/10408/
Institution: Universiti Malaya
Subjects: QA75 Electronic computers. Computer science
Description: Diacritics play an important role in conveying the meaning of a sentence by fixing its pronunciation. Any text that depends on diacritics is sensitive, because any disarrangement of the diacritics, intentional or unintentional, can lead to a complete misinterpretation of the text. Diacritical elements such as punctuation symbols, extended letters (e.g. kashidas) and other symbols can easily be tampered with to alter the original meaning of the text. Few studies have focused on the authentication of such sensitive diacritical content (SDC), and most of them remove the diacritics before authentication, which makes the process questionable. In addition, the proliferation of such sensitive content on the internet, in different languages and formats, has further exacerbated the authentication problem in the search and retrieval phases. To address these issues, this thesis presents three methods for authenticating SDC, with the aim of improving the search and retrieval phases. The first method is based on a residual approach that authenticates any two similar sample texts written in different styles using one common database, which minimizes the overhead of maintaining multiple databases. This is achieved using logical operations and character segmentation. The second method is based on the representation of the diacritical text within the database and improves retrieval performance when authenticating a single sentence (verse). This is achieved by creating individual nodes based on the total number of characters and placing each diacritical verse within its respective node. The last method is based on a pattern matching approach, in which multiple input patterns are authenticated against a given text. The purpose of exploring pattern matching is to authenticate multiple diacritical verses with improved time and space efficiency.
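The node layout of the second method can be pictured with a small sketch. The abstract only states that nodes are created from the total number of characters and that each verse is placed in its node; everything else here is an assumption made for illustration, including the class name `LengthBucketIndex`, the use of a Python dictionary of sets, and the sample words (which are single Arabic words, not Quranic verses).

```python
from collections import defaultdict


class LengthBucketIndex:
    """Hypothetical index: one node per character count, diacritics included."""

    def __init__(self):
        # Maps a character count to the set of reference texts of that length.
        self._nodes = defaultdict(set)

    def add(self, verse: str) -> None:
        # Diacritics are kept, so every combining mark counts as a character.
        self._nodes[len(verse)].add(verse)

    def is_authentic(self, candidate: str) -> bool:
        # Only the node matching the candidate's length is consulted.
        return candidate in self._nodes.get(len(candidate), set())


# Sample strings written with escapes so the code points are unambiguous.
original = "\u0639\u064e\u0644\u0650\u0645\u064e"  # 'alima, "he knew"
tampered = "\u0639\u064f\u0644\u0650\u0645\u064e"  # 'ulima, "it was known"
stripped = "\u0639\u0644\u0645"                    # same letters, no diacritics

index = LengthBucketIndex()
index.add(original)

print(index.is_authentic(original))
print(index.is_authentic(tampered))
print(index.is_authentic(stripped))
```

In this sketch the tampered word lands in the same node as the original (same length) and is rejected by exact comparison, while the stripped word is rejected because no node exists for its length. It also shows why removing diacritics before comparison is risky: the original and tampered words would become identical.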
The proposed pattern matching method works by splitting the given pattern into two halves and searching for each half. The halves are searched using two different algorithms, based on a split approach and a parallel approach respectively. To show the practicality of the proposed methods, they are tested on sensitive diacritical text, including the Arabic Digital Holy Quran (DHQ). The first reason for selecting the DHQ is its availability in different styles, such as the Uthmani and plain Arabic styles, which makes it possible to evaluate the first method. The second reason is the complexity of the diacritics and the encoding scheme within the DHQ, which degrades authentication performance when data representation and search/retrieval strategies are inefficient; this made the DHQ a practical test case for the second method. Finally, the pattern matching approach was evaluated on different sensitive texts, including Arabic, French, Italian, English and Chinese. The findings show that the first method converts Uthmani and plain Quranic verses into one common style with an accuracy of about 87%. The second method authenticates a single DHQ verse with search time improved by approximately 70% over existing methods. The final method successfully authenticates multiple verses of different sensitive diacritical texts with improved computational time and memory consumption.
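The idea of splitting a pattern into two halves can be sketched as follows. This is only one plausible reading of the abstract, which does not describe the thesis's split and parallel algorithms in detail: here the first half is located with Python's built-in `str.find`, and a hit is accepted only if the second half follows immediately. The function names `find_by_halves` and `authenticate_all` and the French sample text are assumptions made for illustration.

```python
def find_by_halves(text: str, pattern: str) -> list[int]:
    """Return the start offsets of pattern in text, matching one half at a time."""
    if not pattern:
        return []
    mid = len(pattern) // 2
    first, second = pattern[:mid], pattern[mid:]
    hits = []
    start = text.find(first)
    while start != -1:
        # Accept the candidate only if the second half follows the first half.
        if text.startswith(second, start + mid):
            hits.append(start)
        start = text.find(first, start + 1)
    return hits


def authenticate_all(text: str, patterns: list[str]) -> dict[str, bool]:
    """Report, for each pattern, whether it occurs in text with diacritics intact."""
    return {p: bool(find_by_halves(text, p)) for p in patterns}


# "le pecheur a commis un peche" with its accents, written with escapes.
text = "le p\u00eacheur a commis un p\u00e9ch\u00e9"
patterns = [
    "p\u00e9ch\u00e9",  # "sin", present with these accents
    "p\u00eache",       # present as the start of the word for "fisherman"
    "peche",            # accents removed, so it must not match
]

print(authenticate_all(text, patterns))
```

Because the comparison is done on the accented characters themselves, the unaccented pattern is rejected even though it differs from the other two only in its diacritics.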
Citation: Saqib Iqbal Hakak (2018) Authenticating sensitive diacritical texts using residual, data representation and pattern matching methods / Saqib Iqbal Hakak. PhD thesis, University of Malaya. Deposited 2018-07, not peer reviewed. http://studentsrepo.um.edu.my/10408/