Staff View: Term frequency and inverse document frequency with position score and mean value for mining web content outliers

In the past few years, activity in the Web Content Mining area has expanded rapidly. However, the focus has been only on technical aspects, visual design and frequent web content patterns, while less frequent web content patterns, called outliers, have been undervalued. Mining Web Content Outliers is used t...

Bibliographic Details
Main Author:	Wan Zulkifeli, Wan Rusila
Format:	Thesis
Language:	English
Published:	2013
Online Access:	http://psasir.upm.edu.my/id/eprint/39114/1/FSKTM%202013%208%20IR.pdf http://psasir.upm.edu.my/id/eprint/39114/

id	my.upm.eprints.39114
record_format	eprints
spelling	my.upm.eprints.39114 2016-04-07T01:24:10Z http://psasir.upm.edu.my/id/eprint/39114/ Term frequency and inverse document frequency with position score and mean value for mining web content outliers Wan Zulkifeli, Wan Rusila 2013-12 Thesis NonPeerReviewed application/pdf en http://psasir.upm.edu.my/id/eprint/39114/1/FSKTM%202013%208%20IR.pdf Wan Zulkifeli, Wan Rusila (2013) Term frequency and inverse document frequency with position score and mean value for mining web content outliers. Masters thesis, Universiti Putra Malaysia.
institution	Universiti Putra Malaysia
building	UPM Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Putra Malaysia
content_source	UPM Institutional Repository
url_provider	http://psasir.upm.edu.my/
language	English
description	In the past few years, activity in the Web Content Mining area has expanded rapidly. However, the focus has been only on technical aspects, visual design and frequent web content patterns, while less frequent web content patterns, called outliers, have been undervalued. Mining Web Content Outliers is used to detect irrelevant web content within a web portal. Detecting outliers is especially important when a web portal has been hacked. So far, only a few approaches to Mining Web Content Outliers have been suggested, such as the Signed-with-Weight technique and mining through a mathematical approach. The mathematical approach is based on two-way rectangular representations and a correlation method. However, these approaches do not take advantage of a position score or a stemmed domain dictionary. Both are very useful in mining web content outliers because they affect the measured relevance of documents. Therefore, this study was carried out to resolve these problems by combining the strengths of word-based techniques, a position score weighting technique and a stemmed domain dictionary. The existing weighting technique was transformed into the Term Frequency and Inverse Document Frequency with Position Score and Mean Value (TF.IDF.PSM) weighting technique by bringing into Web Content Mining a standard weighting technique from Information Retrieval, Term Frequency and Inverse Document Frequency (TF.IDF), and a weighting technique from Text Categorization, Term Frequency and Relevance Frequency (TF.RF). The technique starts by extracting the web pages, preprocessing them and generating the full word profile. Depending on the length of the word in characters, the respective index in the stemmed domain dictionary is searched. The positive count is incremented by one if the word is present in both the dictionary and the document.
Then the word frequency in a single web page, the word frequency across all web pages, and the position score are counted. Finally, the dissimilarity measure is computed to determine outliers. In the dissimilarity measure, TF.IDF.PSM is used not only to calculate and analyze the relevant words but also to account for the importance of the irrelevant words by assigning a weight based on the word's position in a page. A statistical 'mean' is added to balance the weight of the position score. The technique was tested on 431 web pages from the Course folder of the University of Wisconsin, provided by the World Wide Knowledge Base. The benchmark dataset of 43 items is from the Science Medical folder of the 20 Newsgroups dataset. The TF.IDF weighting technique from Information Retrieval (IR) and the TF.RF weighting technique from Text Categorization were used during the experimental phase, and the results were evaluated by two measures: percentage accuracy and the F1-measure. The experimental results show that the TF.IDF.PSM weighting technique achieves up to 98.95% accuracy, about 3.21% higher than the Signed-with-Weight technique. It also achieves an F1-measure of up to 94.19%, an 18.12% improvement over the Signed-with-Weight technique.
format	Thesis
author	Wan Zulkifeli, Wan Rusila
spellingShingle	Wan Zulkifeli, Wan Rusila Term frequency and inverse document frequency with position score and mean value for mining web content outliers
author_facet	Wan Zulkifeli, Wan Rusila
author_sort	Wan Zulkifeli, Wan Rusila
title	Term frequency and inverse document frequency with position score and mean value for mining web content outliers
title_short	Term frequency and inverse document frequency with position score and mean value for mining web content outliers
title_full	Term frequency and inverse document frequency with position score and mean value for mining web content outliers
title_fullStr	Term frequency and inverse document frequency with position score and mean value for mining web content outliers
title_full_unstemmed	Term frequency and inverse document frequency with position score and mean value for mining web content outliers
title_sort	term frequency and inverse document frequency with position score and mean value for mining web content outliers
publishDate	2013
url	http://psasir.upm.edu.my/id/eprint/39114/1/FSKTM%202013%208%20IR.pdf http://psasir.upm.edu.my/id/eprint/39114/
_version_	1643832329244770304
score	13.211869
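
Illustrative note: the description above builds on the standard TF.IDF weighting and reports results as accuracy and F1-measure. The Python sketch below shows only those two generic building blocks, not the thesis's TF.IDF.PSM formula, which this record does not give. The function names, the log(N / df) form of IDF, and the sample pages are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes raw term count for TF and log(N / df) for IDF,
    which is one common variant among several.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def accuracy_and_f1(predicted, actual):
    """Accuracy and F1 for boolean outlier labels (True = outlier)."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)
    fp = sum(p and not a for p, a in pairs)
    fn = sum(a and not p for p, a in pairs)
    accuracy = sum(p == a for p, a in pairs) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical pages: the third one contains off-topic words.
pages = [
    ["course", "lecture", "exam"],
    ["course", "homework", "exam"],
    ["course", "casino", "bonus"],
]
weights = tf_idf(pages)
```

A word that appears in every page, such as "course" here, gets weight zero under this IDF variant, while words unique to one page get the largest weights. That is why rare, off-topic terms stand out in this kind of weighting.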

Similar Items