Enhancing document clustering by integrating semantic background knowledge and syntactic features into the bag of words representation
Main Authors: | , , |
---|---|
Format: | Research Report |
Language: | English |
Published: | Universiti Malaysia Sabah, 2011 |
Subjects: | |
Online Access: | https://eprints.ums.edu.my/id/eprint/22890/1/Enhancing%20document%20clustering%20by%20integrating%20semantic%20background%20knowledge%20and%20syntactic%20features%20into%20the%20bag%20of%20words%20representation.pdf https://eprints.ums.edu.my/id/eprint/22890/ |
Summary: | The basic Bag of Words (BOW) representation generally used in text document clustering or categorization loses important syntactic and semantic information contained in the documents. This may be particularly problematic when the texts contain many stop words or are short. In this research, we study the contribution of incorporating syntactic features and semantic background knowledge into the representation when clustering a text corpus.
We investigate the quality of the clusters produced when syntactic and semantic information is incorporated into the representation of text documents by analyzing the internal structure of the clusters using the Davies-Bouldin index (DBI). We compare the quality of the clusters produced when four different text representations are used to cluster a text corpus: the standard BOW representation, the standard BOW representation integrated with syntactic features, the standard BOW representation integrated with semantic background knowledge, and the standard BOW representation integrated with both syntactic features and semantic background knowledge. This research helps clarify how the quality of document clustering can be improved by enriching the classic bag of words representation with additional background information. |
---|
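
The summary evaluates clusters with the Davies-Bouldin index (DBI), where lower values indicate more compact and better separated clusters. The record does not say how the authors computed it, so the following is only a minimal sketch of the standard definition, assuming Euclidean distance and centroid-based scatter. The function name `davies_bouldin_index` and the toy 2-D vectors are hypothetical stand-ins for real document representations, not data from the report.

```python
from math import dist  # Euclidean distance, Python 3.8+


def davies_bouldin_index(points, labels):
    """Return the Davies-Bouldin index for a labelled set of vectors.

    Assumes Euclidean distance and at least two clusters whose
    centroids do not coincide (otherwise a division by zero occurs).
    """
    # Group the vectors by cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("DBI needs at least two clusters")

    # Centroid of each cluster: component-wise mean of its members.
    centroids = {
        label: tuple(sum(axis) / len(members) for axis in zip(*members))
        for label, members in clusters.items()
    }
    # Scatter of each cluster: mean distance from members to the centroid.
    scatter = {
        label: sum(dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }

    # For each cluster, take its worst (largest) similarity ratio to any
    # other cluster, then average those worst cases over all clusters.
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Hypothetical toy data: two representations of the same four documents.
labels = [0, 0, 1, 1]
tight = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
loose = [(0.0, 0.0), (0.0, 4.0), (6.0, 0.0), (6.0, 4.0)]

dbi_tight = davies_bouldin_index(tight, labels)
dbi_loose = davies_bouldin_index(loose, labels)
```

In a comparison like the one described in the summary, each of the four representations would yield its own set of document vectors and cluster labels, and the representation with the lowest DBI would be judged to give the best internal cluster structure. Here `dbi_tight` is smaller than `dbi_loose` because the first set of vectors forms more compact, more widely separated clusters.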