Automated frequency-based statistical and linguistic feature process models for financial news sentiment classification
This thesis addresses sentiment classification of financial news, a task within artificial intelligence, by combining machine learning, linguistic, and statistical methods. The motivation for this approach comes from human emotion and vital information that lies in the finan...
Main Author: | Yazdani, Sepideh Foroozan |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2017 |
Subjects: | |
Online Access: | http://psasir.upm.edu.my/id/eprint/113985/1/113985.pdf http://psasir.upm.edu.my/id/eprint/113985/ http://ethesis.upm.edu.my/id/eprint/18043 |
id |
my.upm.eprints.113985 |
---|---|
record_format |
eprints |
spelling |
my.upm.eprints.113985 2024-12-04T08:26:32Z http://psasir.upm.edu.my/id/eprint/113985/ Automated frequency-based statistical and linguistic feature process models for financial news sentiment classification Yazdani, Sepideh Foroozan 2017-10 Thesis NonPeerReviewed text en http://psasir.upm.edu.my/id/eprint/113985/1/113985.pdf Yazdani, Sepideh Foroozan (2017) Automated frequency-based statistical and linguistic feature process models for financial news sentiment classification. Doctoral thesis, Universiti Putra Malaysia.
http://ethesis.upm.edu.my/id/eprint/18043 Time-series analysis Finance-Mathematical models-Computer programs Eimeria tenella |
institution |
Universiti Putra Malaysia |
building |
UPM Library |
collection |
Institutional Repository |
continent |
Asia |
country |
Malaysia |
content_provider |
Universiti Putra Malaysia |
content_source |
UPM Institutional Repository |
url_provider |
http://psasir.upm.edu.my/ |
language |
English |
topic |
Time-series analysis Finance-Mathematical models-Computer programs Eimeria tenella |
spellingShingle |
Time-series analysis Finance-Mathematical models-Computer programs Eimeria tenella Yazdani, Sepideh Foroozan Automated frequency-based statistical and linguistic feature process models for financial news sentiment classification |
description |
This thesis addresses sentiment classification of financial news, a task within artificial
intelligence, by combining machine learning, linguistic, and statistical methods. The
motivation comes from the human emotion and vital information carried by financial
news, such as news reports, and from their impact on the market. In recent years, a huge
amount of this information has become available in text form for investment and research
analysis, and investors and researchers can easily access it through a variety of
channels on the Internet.
Despite the studies conducted on automated sentiment classification of financial news,
challenges remain in text mining and financial news classification, particularly in the
feature extraction, feature selection, and classification processes. Most existing
literature on financial news sentiment relies on very simple linguistic features such as
Bag-of-Words (BOW), in which each piece of news is represented by its distinct words
with their frequencies as the feature type, and only a few studies have employed more
sophisticated approaches. Not all words are needed to represent a given text. The primary
downside of BOW, or unigrams, is the huge number of linguistic features it produces. The
secondary downside is that the linguistic features carry too much information to all
serve as features, while it is unclear which of them matter for the sentiment
classification of financial news. Furthermore, since words are extracted on the basis of
high frequency, low-frequency linguistic features are typically ignored, although they
can be worthwhile. This research proposes two feature process models, the Ngram-based
and the NgramPOS-based models, for the sentiment classification of financial news.
The Ngram-based model uses statistical approaches for feature processing in order to
classify financial news. This high frequency-based model combines unigrams and bigrams
with Term Frequency-Inverse Document Frequency (TF-IDF), an unsupervised feature
weighting method, and applies the Document Frequency (DF) method with a set threshold
for dimensionality reduction, since DF is suitable for high-dimensional feature spaces.
The NgramPOS-based model can enhance the performance of feature processing in the
Ngram-based model. It employs a combination of statistical and linguistic approaches to
extract sentiment information as features in order to classify financial news. This low
frequency-based model extracts sentiment-rich words and phrases as unigrams and bigrams
using predefined fixed part-of-speech (POS) patterns, weights them with the binary
weighting method, and applies Principal Component Analysis (PCA) as an unsupervised
method to reduce the dimension of the extracted feature space.
Both models use a Radial Basis Function (RBF) Support Vector Machine (SVM) with optimized
parameters (C, γ) to classify financial news as positive or negative. Experiments showed
that, among diverse feature spaces, combining unigrams and bigrams with the TF-IDF and
binary feature weighting methods gave the best financial news classification results in
both models, with accuracies of 97.34% and 67.19% for the two models, respectively. |
format |
Thesis |
author |
Yazdani, Sepideh Foroozan |
author_facet |
Yazdani, Sepideh Foroozan |
author_sort |
Yazdani, Sepideh Foroozan |
title |
Automated frequency-based statistical and linguistic feature process models for financial news sentiment classification |
title_short |
Automated frequency-based statistical and linguistic feature process models for financial news sentiment classification |
title_full |
Automated frequency-based statistical and linguistic feature process models for financial news sentiment classification |
title_fullStr |
Automated frequency-based statistical and linguistic feature process models for financial news sentiment classification |
title_full_unstemmed |
Automated frequency-based statistical and linguistic feature process models for financial news sentiment classification |
title_sort |
automated frequency-based statistical and linguistic feature process models for financial news sentiment classification |
publishDate |
2017 |
url |
http://psasir.upm.edu.my/id/eprint/113985/1/113985.pdf http://psasir.upm.edu.my/id/eprint/113985/ http://ethesis.upm.edu.my/id/eprint/18043 |
_version_ |
1817844692887273472 |
score |
13.222552 |
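
The feature processing that the description attributes to the Ngram-based model (unigrams and bigrams, a Document Frequency threshold, TF-IDF weighting) can be illustrated with a minimal sketch. This is not the thesis implementation: the whitespace tokenizer, the `min_df=2` threshold, the plain `tf * log(N / df)` weighting variant, the toy headlines, and the function names `ngrams` and `tfidf_features` are all assumptions made for illustration. The RBF SVM classification step is omitted.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return every unigram and bigram of one document as space-joined strings."""
    grams = []
    for n in range(1, n_max + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


def tfidf_features(docs, min_df=2):
    """Build TF-IDF vectors over unigrams and bigrams.

    Terms whose document frequency is below min_df are dropped first
    (DF-based dimensionality reduction). The threshold value and the
    tf * log(N / df) weighting are illustrative assumptions.
    """
    # Hypothetical tokenizer: lowercase and split on whitespace.
    doc_grams = [ngrams(doc.lower().split()) for doc in docs]

    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for grams in doc_grams for term in set(grams))
    vocab = sorted(term for term, count in df.items() if count >= min_df)

    n_docs = len(docs)
    vectors = []
    for grams in doc_grams:
        tf = Counter(grams)  # raw term counts; missing terms count as 0
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


# Toy headlines, invented for this example only.
docs = [
    "shares rise on strong earnings",
    "shares fall on weak earnings",
    "strong earnings lift shares",
]
vocab, vectors = tfidf_features(docs, min_df=2)
for doc, vec in zip(docs, vectors):
    print(doc, [round(w, 3) for w in vec])
```

In this toy corpus only terms that occur in at least two headlines survive the DF threshold, and terms that occur in every headline receive a weight of zero under this TF-IDF variant. The resulting vectors are the kind of input that an RBF SVM with tuned (C, γ) would then be trained on.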