Text Extraction Algorithm for Web Text Classification
The explosive growth of web pages on the World Wide Web makes it difficult for search engines and web directories to return results relevant to user requirements. Web pages need automatic classification techniques with high classification accuracy. This study provides a text extraction algorithm for web...
Main Author: | Theab, Mustafa Muwafak |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2010 |
Subjects: | QA71-90 Instruments and machines |
Online Access: | http://etd.uum.edu.my/2164/1/Mustafa_Muwafak_Theab.pdf http://etd.uum.edu.my/2164/ http://lintas.uum.edu.my:8080/elmu/index.jsp?module=webopac-l&action=fullDisplayRetriever.jsp&szMaterialNo=0000757917 |
id |
my.uum.etd.2164 |
---|---|
record_format |
eprints |
spelling |
my.uum.etd.2164 2013-07-24T12:14:42Z http://etd.uum.edu.my/2164/ Text Extraction Algorithm for Web Text Classification Theab, Mustafa Muwafak QA71-90 Instruments and machines 2010 Thesis NonPeerReviewed application/pdf en http://etd.uum.edu.my/2164/1/Mustafa_Muwafak_Theab.pdf Theab, Mustafa Muwafak (2010) Text Extraction Algorithm for Web Text Classification. Masters thesis, Universiti Utara Malaysia. http://lintas.uum.edu.my:8080/elmu/index.jsp?module=webopac-l&action=fullDisplayRetriever.jsp&szMaterialNo=0000757917 |
institution |
Universiti Utara Malaysia |
building |
UUM Library |
collection |
Institutional Repository |
continent |
Asia |
country |
Malaysia |
content_provider |
Universiti Utara Malaysia |
content_source |
UUM Electronic Theses |
url_provider |
http://etd.uum.edu.my/ |
language |
English |
topic |
QA71-90 Instruments and machines |
description |
The explosive growth of web pages on the World Wide Web makes it difficult for search engines and web directories to return results relevant to user requirements. Web pages need automatic classification techniques with high classification accuracy. This study provides a text extraction algorithm for web text classification. The extraction algorithm consists of three phases: web page extraction, rule formulation, and algorithm validation. A text extraction prototype was built using Visual C# 2008 to validate the algorithm. It is a Windows application combined with a web connection protocol. The prototype can create a binary data set as well as a term frequency-inverse document frequency (tf-idf) data set. The experiment was conducted on five English-language educational websites. The created data sets were then classified using the Naive Bayes and C4.5 algorithms provided in the WEKA application. The experimental results show that the Naive Bayes classifier combined with the web text extraction algorithm was the best-performing method for web text classification. |
format |
Thesis |
author |
Theab, Mustafa Muwafak |
title |
Text Extraction Algorithm for Web Text Classification |
publishDate |
2010 |
url |
http://etd.uum.edu.my/2164/1/Mustafa_Muwafak_Theab.pdf http://etd.uum.edu.my/2164/ http://lintas.uum.edu.my:8080/elmu/index.jsp?module=webopac-l&action=fullDisplayRetriever.jsp&szMaterialNo=0000757917 |
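The abstract mentions that the prototype creates a binary data set and a tf-idf data set from extracted page text. The record does not give the thesis's exact weighting or tokenization rules, so the following is only a minimal Python sketch of the general idea, not the author's Visual C# implementation. The function names `tokenize` and `build_datasets`, the lowercase letters-only tokenizer, and the weighting (raw term count times the natural log of N / document frequency) are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, keep runs of ASCII letters only.
    return re.findall(r"[a-z]+", text.lower())


def build_datasets(docs):
    """Return (vocab, binary_rows, tfidf_rows) for a list of page texts.

    binary_rows[i][j] is 1 if term j occurs in document i, else 0.
    tfidf_rows[i][j] is count(term j in doc i) * ln(N / df(term j)),
    one common tf-idf variant (assumed here, not taken from the thesis).
    """
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized for t in set(doc))

    binary_rows, tfidf_rows = [], []
    for doc in tokenized:
        counts = Counter(doc)
        binary_rows.append([1 if counts[t] else 0 for t in vocab])
        tfidf_rows.append(
            [counts[t] * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, binary_rows, tfidf_rows


if __name__ == "__main__":
    pages = ["Course syllabus and course schedule", "Library schedule"]
    vocab, binary, tfidf = build_datasets(pages)
    print(vocab)
    print(binary)
    print(tfidf)
```

Rows in this shape could be written out as one attribute per vocabulary term plus a class label, which is the kind of table a tool such as WEKA expects as input. Note that a term appearing in every document gets a tf-idf weight of zero under this variant.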