Summarizing Text Articles with Dirichlet Distribution
Main Author: | |
---|---|
Format: | Final Year Project |
Language: | English |
Published: | Universiti Teknologi Petronas, 2011 |
Subjects: | |
Online Access: | http://utpedia.utp.edu.my/8730/1/2011%20-%20Summarizing%20text%20articles%20with%20dirichlet%20distribution.pdf http://utpedia.utp.edu.my/8730/ |
Summary: | Latent Dirichlet Allocation (LDA) is based on the hypothesis that a person
writing a document has topics in mind. To write about a topic then means to pick a
word with a certain probability from the pool of words of that topic. A document can
then be represented as a mixture of various topics. LDA is a generative probabilistic
model for a corpus of discrete data, such as the words in a set of documents. LDA
models the words in the documents under the "bag-of-words" assumption, which
ignores the order of the words in the documents. Following this
"exchangeability", the words are independent and
identically distributed conditioned on some parameters. This conditional
independence allows us to build a hierarchical Bayesian model for a corpus of
documents and words. The objective is to develop a text summarization system based
on the LDA method. The system would be used to
determine the accuracy level of the method. This is done by comparing the result
produced by the text summarization system with an existing summary
produced by a human. |
---|
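
The generative process described in the summary can be illustrated with a minimal sketch. It is not taken from the project itself: the vocabulary, the two topic-word distributions, the prior `alpha`, and the helper names `sample_dirichlet` and `generate_document` are all hypothetical, chosen only to show how a document arises as a mixture of topics under the bag-of-words assumption.

```python
import random
from collections import Counter

def sample_dirichlet(alpha, rng):
    # Normalized independent Gamma(alpha_k, 1) draws follow a Dirichlet(alpha) distribution.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, doc_length, rng):
    # Draw the document's topic mixture from the Dirichlet prior.
    theta = sample_dirichlet(alpha, rng)
    # For each word slot, pick a topic according to the mixture ...
    topics = rng.choices(range(len(topic_word)), weights=theta, k=doc_length)
    # ... then pick a word from that topic's pool of words.
    words = [rng.choices(vocab, weights=topic_word[z], k=1)[0] for z in topics]
    return theta, topics, words

# Hypothetical toy setup: six words and two topics (sports, politics).
vocab = ["goal", "match", "team", "vote", "party", "law"]
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # topic 0 only emits sports words
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # topic 1 only emits politics words
]
alpha = [1.0, 1.0]  # assumed symmetric Dirichlet prior

rng = random.Random(0)
theta, topics, words = generate_document(topic_word, vocab, alpha, 20, rng)

# Bag-of-words view: only the counts matter, not the word order.
bag = Counter(words)
print("topic mixture:", theta)
print("word counts:", dict(bag))
```

Fitting LDA runs this process in reverse: given only the word counts of many documents, inference estimates the topic mixtures and topic-word distributions that most plausibly generated them.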