Speech emotion recognition using bidirectional echo state network with random projection / Hemin Fatih Ibrahim
Speech is an effective, quick, and important way of communicating and exchanging complex information between humans. Emotions have always been part of normal human conversation, making speech more attractive and more effective. Because of this major role of both speech and emotion i...
Main Author: | Hemin Fatih , Ibrahim |
---|---|
Format: | Thesis |
Published: | 2022 |
Subjects: | QA75 Electronic computers. Computer science |
Online Access: | http://studentsrepo.um.edu.my/14562/1/Hemin_Fatih.pdf http://studentsrepo.um.edu.my/14562/2/Hemin_Fatih.pdf http://studentsrepo.um.edu.my/14562/ |
id |
my.um.stud.14562 |
---|---|
record_format |
eprints |
spelling |
my.um.stud.14562 2023-07-04T23:00:20Z Speech emotion recognition using bidirectional echo state network with random projection / Hemin Fatih Ibrahim Hemin Fatih , Ibrahim QA75 Electronic computers. Computer science 2022-04 Thesis NonPeerReviewed application/pdf http://studentsrepo.um.edu.my/14562/1/Hemin_Fatih.pdf application/pdf http://studentsrepo.um.edu.my/14562/2/Hemin_Fatih.pdf Hemin Fatih , Ibrahim (2022) Speech emotion recognition using bidirectional echo state network with random projection / Hemin Fatih Ibrahim. PhD thesis, Universiti Malaya. http://studentsrepo.um.edu.my/14562/ |
institution |
Universiti Malaya |
building |
UM Library |
collection |
Institutional Repository |
continent |
Asia |
country |
Malaysia |
content_provider |
Universiti Malaya |
content_source |
UM Student Repository |
url_provider |
http://studentsrepo.um.edu.my/ |
topic |
QA75 Electronic computers. Computer science |
description |
Speech is an effective, quick, and important way of communicating and exchanging
complex information between humans. Emotions have always been part of normal human
conversation, making speech more attractive and more effective. Because of this major
role of both speech and emotion in human life, many researchers have been inspired to
study Speech Emotion Recognition (SER), which is considered a key effort in
Human-Computer Interaction (HCI). An accurate SER system can play an effective role in
several services, such as call centers, in-car on-board systems, educational systems,
and child care. The major challenges in SER are extracting the most relevant and
distinctive emotion features from the raw speech signal and building a robust,
computationally cheap model. The main focus of this thesis is to design a model for
emotion recognition from speech signals, an area that still has plenty of challenges,
and to adopt the most relevant features. This thesis tackles these challenges by
providing a multivariate time series classification approach based on reservoir
computing for detecting emotions from speech. Because of the time-series and sparse
nature of emotion in speech, multivariate time series handcrafted features are adopted
as input data. The bidirectional Echo State Network (ESN), a type of reservoir
computing and a special case of the Recurrent Neural Network (RNN), is adopted to avoid
model complexity, owing to its untrained and sparse nature when mapping the features
into a higher-dimensional space. Although the ESN has advantages, some problems still
need to be solved, such as the instability caused by randomly initializing the fixed
weights and the selection of optimal values for the hyperparameters, which have a large
impact on ESN performance. To address these issues, a bidirectional ESN with twin
reservoirs is adopted to capture additional independent information from each
direction. Additionally, late fusion of the same direction from the twin reservoirs
yields a more informative representation and enhances the memorization capability for
SER applications. A truncated normal distribution is used to initialize the random
input connection weights, and the ESN hyperparameters are optimized with Bayesian
optimization and Population Based Training (PBT). Moreover, the high-dimensional sparse
output of a reservoir makes the feature representation suffer from the curse of
dimensionality. For that reason, Sparse Random Projection (SRP) is adopted for
dimensionality reduction: it offers significant computational advantages because it
needs no training, and it removes redundancies with minimal loss of information. With a
speaker-independent strategy, the experiments achieved unweighted average recalls of
89.21%, 70.48%, 76.76%, and 46.34% on the Emo-DB, SAVEE, RAVDESS, and FAU Aibo
datasets, respectively. The results show that the proposed model outperforms a set of
other methods on four publicly available emotional speech datasets.
|
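The pipeline named in the description (fixed random reservoir, forward and backward passes, truncated normal input weights, sparse random projection, unweighted average recall) can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: every function name, the reservoir size, leak rate, density, spectral radius, and the 13-dimensional input are assumptions chosen for illustration, a single reservoir is reused for both directions instead of the twin-reservoir late fusion, and the readout classifier and hyperparameter search are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_normal(shape, std=1.0, bound=2.0):
    """Draw N(0, std^2) samples and redraw any value outside +/- bound*std."""
    w = rng.normal(0.0, std, size=shape)
    mask = np.abs(w) > bound * std
    while mask.any():
        w[mask] = rng.normal(0.0, std, size=mask.sum())
        mask = np.abs(w) > bound * std
    return w

def make_reservoir(n_in, n_res, density=0.1, spectral_radius=0.9):
    """Build fixed (untrained) input and sparse recurrent weights (assumed settings)."""
    w_in = truncated_normal((n_res, n_in), std=0.5)
    w = rng.normal(size=(n_res, n_res)) * (rng.random((n_res, n_res)) < density)
    w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))
    return w_in, w

def run_reservoir(x, w_in, w, leak=0.5):
    """Drive a leaky-tanh reservoir with x of shape (T, n_in); return the last state."""
    h = np.zeros(w.shape[0])
    for x_t in x:
        h = (1 - leak) * h + leak * np.tanh(w_in @ x_t + w @ h)
    return h

def bidirectional_features(x, reservoir):
    """Concatenate the final states of a forward and a time-reversed pass."""
    w_in, w = reservoir
    return np.concatenate([run_reservoir(x, w_in, w),
                           run_reservoir(x[::-1], w_in, w)])

def sparse_random_projection(n_features, n_components):
    """Random matrix with entries +/- sqrt(s/k) (prob. 1/(2s) each) or 0, s = sqrt(n_features)."""
    s = np.sqrt(n_features)
    signs = rng.choice([-1.0, 0.0, 1.0], size=(n_features, n_components),
                       p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
    return signs * np.sqrt(s / n_components)

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    classes = np.unique(y_true)
    return float(np.mean([np.mean(y_pred[y_true == c] == c) for c in classes]))

# Hypothetical utterance: 120 frames of 13 handcrafted features (random stand-in data).
x = rng.normal(size=(120, 13))
reservoir = make_reservoir(n_in=13, n_res=50)
feats = bidirectional_features(x, reservoir)        # shape (100,)
proj = sparse_random_projection(feats.size, 20)     # needs no training
reduced = feats @ proj                              # shape (20,)
```

In a full system the reduced vectors from many utterances would be passed to a trained readout classifier, and unweighted average recall would be computed on held-out speakers. The projection matrix is drawn once and reused for every utterance.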
format |
Thesis |
author |
Hemin Fatih , Ibrahim |
title |
Speech emotion recognition using bidirectional echo state network with random projection / Hemin Fatih Ibrahim |
publishDate |
2022 |
url |
http://studentsrepo.um.edu.my/14562/1/Hemin_Fatih.pdf http://studentsrepo.um.edu.my/14562/2/Hemin_Fatih.pdf http://studentsrepo.um.edu.my/14562/ |