Relevance detection and summarizing strategies identification algorithm using linguistic measures / Seyed Asadollah Abdiesfandani

Summarization is a process to select important information from a source text. Summarizing strategies are the core of the cognitive processes involved in the summarization activity. Summarizing strategies include a set of conscious tasks that are used to determine important information and extrac...

Bibliographic Details
Main Author: Seyed Asadollah, Abdiesfandani
Format: Thesis
Published: 2016
Subjects: QA75 Electronic computers. Computer science
Online Access: http://studentsrepo.um.edu.my/6400/4/seyed.pdf
http://studentsrepo.um.edu.my/6400/
id my.um.stud.6400
record_format eprints
spelling my.um.stud.6400 2019-10-23T19:06:07Z Relevance detection and summarizing strategies identification algorithm using linguistic measures / Seyed Asadollah Abdiesfandani Seyed Asadollah, Abdiesfandani QA75 Electronic computers. Computer science 2016 Thesis NonPeerReviewed application/pdf http://studentsrepo.um.edu.my/6400/4/seyed.pdf Seyed Asadollah, Abdiesfandani (2016) Relevance detection and summarizing strategies identification algorithm using linguistic measures / Seyed Asadollah Abdiesfandani. PhD thesis, University of Malaya. http://studentsrepo.um.edu.my/6400/
institution Universiti Malaya
building UM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Malaya
content_source UM Student Repository
url_provider http://studentsrepo.um.edu.my/
topic QA75 Electronic computers. Computer science
spellingShingle QA75 Electronic computers. Computer science
Seyed Asadollah, Abdiesfandani
Relevance detection and summarizing strategies identification algorithm using linguistic measures / Seyed Asadollah Abdiesfandani
description Summarization is the process of selecting important information from a source text. Summarizing strategies are the core of the cognitive processes involved in summarization: a set of conscious tasks used to determine important information and extract the main idea of a source text. In this research project, we conducted a study of students’ summaries. The findings show a strong relationship between students’ summary-writing proficiency and the summarizing strategies they used. We then developed a new algorithm to address the problem of identifying summarizing strategies. The algorithm simulates two tasks that human experts frequently perform to identify the strategies used to produce summary sentences: 1) sentence relevance identification; and 2) summarizing strategy identification. The sentence relevance identification module uses a statistical approach such as the vector space model (VSM) to represent sentences, and computes the similarity between source sentences and summary sentences using the cosine similarity measure. It then integrates semantic and syntactic similarity measures in a linear equation to capture meaning when comparing two sentences. The aim is to distinguish two sentences that have the same surface form or share a similar bag-of-words (BOW) but differ in meaning. The module also employs a word semantic similarity measure to overcome the vocabulary mismatch problem in sentence comparison; this bridges the lexical gap between semantically similar contexts that are expressed in different wording. In addition, the sentence relevance identification module requires some linguistic pre-processing, including part-of-speech (POS) tagging, word stemming and stop-word removal.
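The abstract does not give the linear equation itself, so the following is only a minimal sketch of the idea, under stated assumptions: bag-of-words cosine similarity stands in for the semantic measure, a word-order similarity stands in for the syntactic measure, and the weight `alpha` is a hypothetical parameter. None of the function names come from the thesis.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between term-frequency (bag-of-words) vectors."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(tf_a[w] * tf_b[w] for w in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def word_order_similarity(tokens_a, tokens_b):
    """Assumed syntactic measure: compares the positions of shared words."""
    vocab = sorted(set(tokens_a) | set(tokens_b))

    def order_vector(tokens):
        # 1-based position of the first occurrence, 0 if the word is absent.
        return [tokens.index(w) + 1 if w in tokens else 0 for w in vocab]

    r_a, r_b = order_vector(tokens_a), order_vector(tokens_b)
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(r_a, r_b)))
    total = math.sqrt(sum((x + y) ** 2 for x, y in zip(r_a, r_b)))
    return 1.0 - diff / total if total else 0.0


def combined_similarity(tokens_a, tokens_b, alpha=0.8):
    """Linear combination of the two measures; alpha is a hypothetical weight."""
    return (alpha * cosine_similarity(tokens_a, tokens_b)
            + (1 - alpha) * word_order_similarity(tokens_a, tokens_b))


# Two sentences with the same bag-of-words but different meaning.
source = "dog bites man".split()
summary = "man bites dog".split()
print(cosine_similarity(source, summary))    # bag-of-words alone cannot tell them apart
print(combined_similarity(source, summary))  # the word-order term lowers the score
```

With identical bags of words the cosine score is at its maximum, while the combined score drops below that of an exact match, which is the distinction the abstract describes. A fuller version would apply stemming and stop-word removal before tokenizing and would replace the exact-match term with a word-level semantic similarity measure.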
The summarizing strategy identification module relies on a set of heuristic rules and statistical and linguistic methods, such as the position-based, title-based, cue-phrase and word-frequency methods, to identify the summarizing strategies employed by students. To evaluate the algorithm, we conducted two experiments. The first experiment examined the functionality of the system, that is, whether it can identify the summarizing strategies that students used in summary writing. The results show that the system can identify several summarizing strategies: deletion, sentence combination, paraphrase and topic sentence selection. The system can also detect the copy-verbatim strategy, the strategy most commonly used by students. In addition, the system can identify four methods used within the topic sentence selection strategy: 1) cue method; 2) title method; 3) keyword method; and 4) location method. The second experiment measured the performance of the algorithm against human judgment in identifying summarizing strategies, using precision, recall, F-measure and accuracy. The experimental results show that the proposed algorithm achieved acceptable results in comparison with human judgment: on average, 87% precision, 83% recall, 85% F-score and 82% accuracy.
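The evaluation measures named above can be sketched as follows. This is an illustrative computation only: the label lists are invented toy data, not the thesis data, and `evaluate` and `f_measure` are hypothetical helper names. The F-measure is assumed to be the balanced harmonic mean of precision and recall (F1).

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted, gold, strategy):
    """Score one strategy label against human (gold) judgments."""
    pairs = list(zip(predicted, gold))
    tp = sum(p == strategy and g == strategy for p, g in pairs)
    fp = sum(p == strategy and g != strategy for p, g in pairs)
    fn = sum(p != strategy and g == strategy for p, g in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, f_measure(precision, recall), accuracy


# Toy labels for four summary sentences (hypothetical, for illustration).
gold = ["deletion", "paraphrase", "deletion", "copy-verbatim"]
predicted = ["deletion", "deletion", "deletion", "copy-verbatim"]
print(evaluate(predicted, gold, "deletion"))

# Consistency check on the reported averages.
print(round(f_measure(0.87, 0.83), 2))
```

Plugging the reported averages of 87% precision and 83% recall into the F1 formula gives roughly 85%, in line with the reported F-score. Since the reported figures are averages, this agreement is indicative rather than exact.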
format Thesis
author Seyed Asadollah, Abdiesfandani
author_facet Seyed Asadollah, Abdiesfandani
author_sort Seyed Asadollah, Abdiesfandani
title Relevance detection and summarizing strategies identification algorithm using linguistic measures / Seyed Asadollah Abdiesfandani
title_sort relevance detection and summarizing strategies identification algorithm using linguistic measures / seyed asadollah abdiesfandani
publishDate 2016
url http://studentsrepo.um.edu.my/6400/4/seyed.pdf
http://studentsrepo.um.edu.my/6400/
_version_ 1738505911296589824
score 13.211869