Summarizing Text Articles with Dirichlet Distribution

Bibliographic Details
Main Author: Mohamed, Noor Zalifah
Format: Final Year Project
Language: English
Published: Universiti Teknologi Petronas 2011
Online Access: http://utpedia.utp.edu.my/8730/1/2011%20-%20Summarizing%20text%20articles%20with%20dirichlet%20distribution.pdf
http://utpedia.utp.edu.my/8730/
Description
Summary: Latent Dirichlet Allocation (LDA) is based on the hypothesis that a person writing a document has topics in mind. Writing about a topic then means picking words, each with a certain probability, from the pool of words of that topic. A document can therefore be represented as a mixture of various topics. LDA is a generative probabilistic model for a corpus of discrete data, such as the words in a set of documents. LDA models the words in the documents under the "bag-of-words" assumption, which ignores the order of the words in the documents. Under this "exchangeability" assumption, the words are independent and identically distributed conditioned on some parameters. This conditional independence allows us to build a hierarchical Bayesian model for a corpus of documents and words. The objective is to develop a text summarization system based on the Latent Dirichlet Allocation (LDA) method. The system would be used to determine the accuracy level of the method. This is done by comparing the result produced by the text summarization system with an existing summary produced by a human.
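
The generative story in the summary (draw a topic mixture for the document, then for each word pick a topic and pick a word from that topic) can be illustrated with a minimal Python sketch. This is not the project's implementation. The vocabulary, the two hand-made topics, the Dirichlet parameter `alpha`, and the helper names `sample_dirichlet` and `generate_document` are all assumptions made for illustration, and only the standard library is used.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word_probs, vocab, alpha, n_words, rng):
    """Generate one bag-of-words document following the LDA generative story."""
    # Per-document topic mixture: theta ~ Dirichlet(alpha).
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word position, then a word from that topic.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        word = rng.choices(vocab, weights=topic_word_probs[topic])[0]
        words.append(word)
    return theta, words


# Hypothetical toy vocabulary and two hand-made topics (illustration only).
vocab = ["oil", "gas", "well", "goal", "team", "match"]
topic_word_probs = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # an "energy" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "sports" topic
]

rng = random.Random(0)  # fixed seed so repeated runs give the same draw
theta, words = generate_document(
    topic_word_probs, vocab, alpha=[0.5, 0.5], n_words=12, rng=rng
)
print("topic mixture:", [round(p, 3) for p in theta])
print("words:", " ".join(words))
```

Because each word is drawn independently given the topic mixture, shuffling `words` leaves the document equally likely under the model, which is the bag-of-words exchangeability the summary refers to. A summarization system would run this process in reverse, inferring topic mixtures from observed text. That inference step is not shown here.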