Staff View: An improved framework for content-based spamdexing detection

Spamdexing is one of the biggest threats facing modern search engines (SEs). Spammers use a wide range of content-generation techniques, filling Search Engine Result Pages (SERPs) with low-quality web pages through content spam. Generally, spam web pag...

Bibliographic Details
Main Authors:	Shahzad, Asim, Mahdin, Hairulnizam, Mohd Nawi, Nazri
Format:	Article
Language:	English
Published:	SAI Organization 2020
Subjects:	T Technology (General) QA299.6-433 Analysis
Online Access:	http://eprints.uthm.edu.my/5278/1/AJ%202020%20%28137%29.pdf http://eprints.uthm.edu.my/5278/ https://dx.doi.org/10.14569/IJACSA.2020.0110151

id	my.uthm.eprints.5278
record_format	eprints
spelling	my.uthm.eprints.52782022-01-09T01:43:19Z http://eprints.uthm.edu.my/5278/ An improved framework for content-based spamdexing detection Shahzad, Asim Mahdin, Hairulnizam Mohd Nawi, Nazri T Technology (General) QA299.6-433 Analysis To the modern Search Engines (SEs), one of the biggest threats to be considered is spamdexing. Nowadays spammers are using a wide range of techniques for content generation, they are using content spam to fill the Search Engine Result Pages (SERPs) with low-quality web pages. Generally, spam web pages are insufficient, irrelevant and improper results for users. Many researchers from academia and industry are working on spamdexing to identify the spam web pages. However, so far not even a single universally efficient method is developed for identification of all spam web pages. We believe that for tackling the content spam there must be improved methods. This article is an attempt in that direction, where a framework has been proposed for spam web pages identification. The framework uses Stop words, Keywords Density, Spam Keywords Database, Part of Speech (POS) ratio, and Copied Content algorithms. For conducting the experiments and obtaining threshold values WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets have been used. An excellent and promising F-measure of 77.38% illustrates the effectiveness and applicability of proposed method. SAI Organization 2020 Article PeerReviewed text en http://eprints.uthm.edu.my/5278/1/AJ%202020%20%28137%29.pdf Shahzad, Asim and Mahdin, Hairulnizam and Mohd Nawi, Nazri (2020) An improved framework for content-based spamdexing detection. International Journal of Advanced Computer Science and Applications, 11 (1). pp. 409-420. ISSN 2158-107X https://dx.doi.org/ 10.14569/IJACSA.2020.0110151
institution	Universiti Tun Hussein Onn Malaysia
building	UTHM Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Tun Hussein Onn Malaysia
content_source	UTHM Institutional Repository
url_provider	http://eprints.uthm.edu.my/
language	English
topic	T Technology (General) QA299.6-433 Analysis
spellingShingle	T Technology (General) QA299.6-433 Analysis Shahzad, Asim Mahdin, Hairulnizam Mohd Nawi, Nazri An improved framework for content-based spamdexing detection
description	Spamdexing is one of the biggest threats facing modern search engines (SEs). Spammers use a wide range of content-generation techniques, filling Search Engine Result Pages (SERPs) with low-quality web pages through content spam. Spam web pages are generally insufficient, irrelevant, and improper results for users. Many researchers from academia and industry are working on identifying spam web pages. However, no single universally efficient method has yet been developed for identifying all spam web pages. We believe that tackling content spam requires improved methods. This article is an attempt in that direction: it proposes a framework for identifying spam web pages. The framework uses Stop Words, Keywords Density, Spam Keywords Database, Part of Speech (POS) ratio, and Copied Content algorithms. The WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets were used to conduct the experiments and obtain threshold values. A promising F-measure of 77.38% illustrates the effectiveness and applicability of the proposed method.
format	Article
author	Shahzad, Asim Mahdin, Hairulnizam Mohd Nawi, Nazri
author_facet	Shahzad, Asim Mahdin, Hairulnizam Mohd Nawi, Nazri
author_sort	Shahzad, Asim
title	An improved framework for content-based spamdexing detection
title_short	An improved framework for content-based spamdexing detection
title_full	An improved framework for content-based spamdexing detection
title_fullStr	An improved framework for content-based spamdexing detection
title_full_unstemmed	An improved framework for content-based spamdexing detection
title_sort	improved framework for content-based spamdexing detection
publisher	SAI Organization
publishDate	2020
url	http://eprints.uthm.edu.my/5278/1/AJ%202020%20%28137%29.pdf http://eprints.uthm.edu.my/5278/ https://dx.doi.org/10.14569/IJACSA.2020.0110151
_version_	1738581359478177792
score	13.211869
