Effectiveness of RSS feed item duplication detection using word matching

Users of feed aggregators know that duplicated articles are found occasionally on the feeds they subscribe to. It can be time consuming to read all articles and stumble upon duplicated items they have already read. Our work here is to determine the effectiveness of using basic word matching to remov...

Bibliographic Details
Main Authors: Tan, Ian K. T., Su, Tze-Wei, Khor, Hao-Ming, Ong, Ee-Chun
Format: Article
Language: English
Published: Sunway University 2011
Subjects: QA76 Computer software
Online Access: http://eprints.sunway.edu.my/387/1/SAJ_8_2011_38-53.pdf
http://eprints.sunway.edu.my/387/
id my.sunway.eprints.387
record_format eprints
spelling my.sunway.eprints.387 2016-10-03T10:28:01Z http://eprints.sunway.edu.my/387/ Sunway University 2011 Article PeerReviewed text en cc_by_nd Tan, Ian K. T. and Su, Tze-Wei and Khor, Hao-Ming and Ong, Ee-Chun (2011) Effectiveness of RSS feed item duplication detection using word matching. Sunway Academic Journal, 8. pp. 38-53. ISSN 1823-500X
institution Sunway University
building Sunway Campus Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Sunway University
content_source Sunway Institutional Repository
url_provider http://eprints.sunway.edu.my/
language English
topic QA76 Computer software
description Users of feed aggregators know that duplicated articles occasionally appear in the feeds they subscribe to. Reading every article only to stumble upon duplicates of items already read is time consuming. This work determines the effectiveness of using basic word matching to remove duplicated items and show only the most relevant item, thus saving readers’ time. The method described in this paper removes duplicates using word matching heuristics with an appropriate matching percentage. The duplicated feed items are then ranked so that only the highest ranked article is displayed. Ranking uses the number of search items found for the titles of the news feeds; the article with the highest number returned is considered the highest ranked. Using Malaysian online news feeds, we found that a matching percentage of 40% minimizes duplicates effectively with minimal errors. Further empirical studies using 9 technology blog feeds over a longer period provided better averaged results, and the matching percentage obtained was within the same range. The method has a low processing overhead for detecting duplicates and, with careful selection of the matching percentage, effectively removes the majority of duplicates.
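The word-matching idea in the description can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record does not say how the matching percentage is computed, so the definition used here (shared distinct words divided by the word count of the shorter item) is an assumption, as are the function names `words`, `match_percentage`, `group_duplicates`, and `pick_best`. Only the 40% threshold comes from the abstract. The ranking step is represented by a caller-supplied `score` function, standing in for the search-based count the abstract mentions.

```python
import re


def words(text):
    """Lowercase a string and return its set of distinct alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_percentage(a, b):
    """Percentage of the shorter item's distinct words that also occur in the other.

    Assumed definition; the abstract does not specify the exact formula.
    """
    wa, wb = words(a), words(b)
    if not wa or not wb:
        return 0.0
    return 100.0 * len(wa & wb) / min(len(wa), len(wb))


def group_duplicates(items, threshold=40.0):
    """Greedily group items whose match with a group's first item meets the threshold."""
    groups = []
    for item in items:
        for group in groups:
            if match_percentage(item, group[0]) >= threshold:
                group.append(item)
                break
        else:
            # No existing group matched, so this item starts a new group.
            groups.append([item])
    return groups


def pick_best(group, score):
    """Return the highest ranked item of a duplicate group.

    `score` is a hypothetical ranking function supplied by the caller.
    """
    return max(group, key=score)


if __name__ == "__main__":
    titles = [
        "Prime minister announces new budget for 2011",
        "New budget for 2011 announced by prime minister",
        "Local football team wins national cup final",
    ]
    for group in group_duplicates(titles):
        # len is only a placeholder ranking for the demonstration.
        print(pick_best(group, score=len))
```

Comparing each new item only against the first member of each group keeps the cost low, in the spirit of the low processing overhead the abstract claims, at the price of depending on the order in which items arrive.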
format Article
author Tan, Ian K. T.
Su, Tze-Wei
Khor, Hao-Ming
Ong, Ee-Chun
title Effectiveness of RSS feed item duplication detection using word matching
publisher Sunway University
publishDate 2011
url http://eprints.sunway.edu.my/387/1/SAJ_8_2011_38-53.pdf
http://eprints.sunway.edu.my/387/