Staff View: Speech emotion recognition using bidirectional echo state network with random projection / Hemin Fatih Ibrahim

Speech is an effective, quick, and important way of communicating and exchanging complex information between humans. Emotions have always been part of normal human conversation, making speech more attractive and more effective. Because of this major role of both speech and emotion i...

Bibliographic Details
Main Author:	Hemin Fatih , Ibrahim
Format:	Thesis
Published:	2022
Subjects:	QA75 Electronic computers. Computer science
Online Access:	http://studentsrepo.um.edu.my/14562/1/Hemin_Fatih.pdf http://studentsrepo.um.edu.my/14562/2/Hemin_Fatih.pdf http://studentsrepo.um.edu.my/14562/

id	my.um.stud.14562
record_format	eprints
spelling	my.um.stud.145622023-07-04T23:00:20Z Speech emotion recognition using bidirectional echo state network with random projection / Hemin Fatih Ibrahim Hemin Fatih , Ibrahim QA75 Electronic computers. Computer science Speech is an effective, quick, and important way of communicating and exchanging complex information between humans. Emotions have always been part of normal human conversation, making speech more attractive and more effective. Because of this major role of both speech and emotion in human life, many researchers have been inspired to study Speech Emotion Recognition (SER), which is considered a key effort in Human-Computer Interaction (HCI). An accurate SER system can play an effective role in several services, such as call center services, in-car board systems, educational systems, and children in care. The major challenges in SER are to capture and extract the most relevant, distinctive emotion features from the raw speech signal and to build a robust and computationally cheap model. The main focus of this thesis is to design a model for emotion recognition from speech signals, an area that still poses many challenges, and to adopt the most relevant features. This thesis tackles these challenges with a multivariate time series classification approach based on reservoir computing for detecting emotions from speech. Because of the time series and sparse nature of emotion in speech, multivariate time series handcrafted features are adopted as input data. The bidirectional Echo State Network (ESN), a type of reservoir computing and a special case of the Recurrent Neural Network (RNN), is adopted to avoid model complexity, because its reservoir is untrained and sparse when mapping the features into a higher dimensional space. Although the ESN has advantages, some problems still need to be solved, such as the instability caused by randomly initializing fixed weights and the selection of optimal hyperparameter values, both of which have a big impact on ESN performance. Therefore, to address these issues, a bidirectional ESN with twin reservoirs is adopted to capture additional independent information from each direction. Additionally, the late fusion of the same direction from the twin reservoirs yields a more informative representation and enhances the memorization capability for SER applications. A truncated normal distribution is used to initialize the random input connection weights, and the ESN hyperparameters are optimized with Bayesian optimization and Population Based Training (PBT). Moreover, the high dimensional sparse output of a reservoir makes the feature representation suffer from the curse of dimensionality. For that reason, Sparse Random Projection (SRP) is adopted for dimensionality reduction, since it offers significant computational advantages: it does not need any training and removes redundancies with minimal loss of information. Experiments with a speaker-independent strategy achieved unweighted average recalls of 89.21%, 70.48%, 76.76%, and 46.34% on the Emo-DB, SAVEE, RAVDESS, and FAU Aibo datasets, respectively. The results show the superior performance of the proposed model over a set of other methods on four publicly available emotional speech datasets. 2022-04 Thesis NonPeerReviewed application/pdf http://studentsrepo.um.edu.my/14562/1/Hemin_Fatih.pdf application/pdf http://studentsrepo.um.edu.my/14562/2/Hemin_Fatih.pdf Hemin Fatih , Ibrahim (2022) Speech emotion recognition using bidirectional echo state network with random projection / Hemin Fatih Ibrahim. PhD thesis, Universiti Malaya. http://studentsrepo.um.edu.my/14562/
institution	Universiti Malaya
building	UM Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Malaya
content_source	UM Student Repository
url_provider	http://studentsrepo.um.edu.my/
topic	QA75 Electronic computers. Computer science
spellingShingle	QA75 Electronic computers. Computer science Hemin Fatih , Ibrahim Speech emotion recognition using bidirectional echo state network with random projection / Hemin Fatih Ibrahim
description	Speech is an effective, quick, and important way of communicating and exchanging complex information between humans. Emotions have always been part of normal human conversation, making speech more attractive and more effective. Because of this major role of both speech and emotion in human life, many researchers have been inspired to study Speech Emotion Recognition (SER), which is considered a key effort in Human-Computer Interaction (HCI). An accurate SER system can play an effective role in several services, such as call center services, in-car board systems, educational systems, and children in care. The major challenges in SER are to capture and extract the most relevant, distinctive emotion features from the raw speech signal and to build a robust and computationally cheap model. The main focus of this thesis is to design a model for emotion recognition from speech signals, an area that still poses many challenges, and to adopt the most relevant features. This thesis tackles these challenges with a multivariate time series classification approach based on reservoir computing for detecting emotions from speech. Because of the time series and sparse nature of emotion in speech, multivariate time series handcrafted features are adopted as input data. The bidirectional Echo State Network (ESN), a type of reservoir computing and a special case of the Recurrent Neural Network (RNN), is adopted to avoid model complexity, because its reservoir is untrained and sparse when mapping the features into a higher dimensional space. Although the ESN has advantages, some problems still need to be solved, such as the instability caused by randomly initializing fixed weights and the selection of optimal hyperparameter values, both of which have a big impact on ESN performance. Therefore, to address these issues, a bidirectional ESN with twin reservoirs is adopted to capture additional independent information from each direction. Additionally, the late fusion of the same direction from the twin reservoirs yields a more informative representation and enhances the memorization capability for SER applications. A truncated normal distribution is used to initialize the random input connection weights, and the ESN hyperparameters are optimized with Bayesian optimization and Population Based Training (PBT). Moreover, the high dimensional sparse output of a reservoir makes the feature representation suffer from the curse of dimensionality. For that reason, Sparse Random Projection (SRP) is adopted for dimensionality reduction, since it offers significant computational advantages: it does not need any training and removes redundancies with minimal loss of information. Experiments with a speaker-independent strategy achieved unweighted average recalls of 89.21%, 70.48%, 76.76%, and 46.34% on the Emo-DB, SAVEE, RAVDESS, and FAU Aibo datasets, respectively. The results show the superior performance of the proposed model over a set of other methods on four publicly available emotional speech datasets.
format	Thesis
author	Hemin Fatih , Ibrahim
author_facet	Hemin Fatih , Ibrahim
author_sort	Hemin Fatih , Ibrahim
title	Speech emotion recognition using bidirectional echo state network with random projection / Hemin Fatih Ibrahim
title_short	Speech emotion recognition using bidirectional echo state network with random projection / Hemin Fatih Ibrahim
title_full	Speech emotion recognition using bidirectional echo state network with random projection / Hemin Fatih Ibrahim
title_fullStr	Speech emotion recognition using bidirectional echo state network with random projection / Hemin Fatih Ibrahim
title_full_unstemmed	Speech emotion recognition using bidirectional echo state network with random projection / Hemin Fatih Ibrahim
title_sort	speech emotion recognition using bidirectional echo state network with random projection / hemin fatih ibrahim
publishDate	2022
url	http://studentsrepo.um.edu.my/14562/1/Hemin_Fatih.pdf http://studentsrepo.um.edu.my/14562/2/Hemin_Fatih.pdf http://studentsrepo.um.edu.my/14562/
_version_	1772811929256984576
score	13.211869
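
Illustrative note: the description names several building blocks (a bidirectional ESN, truncated-normal input weights, sparse random projection, unweighted average recall). The NumPy sketch below shows one plausible reading of each, not the thesis implementation. It assumes a single reservoir shared by both directions and uses only the final reservoir state; the twin reservoirs, late fusion, readout classifier, and hyperparameter search are omitted. All function names and default values (reservoir size, spectral radius, densities) are hypothetical.

```python
import numpy as np


def truncated_normal(rng, shape, std=0.5, bound=2.0):
    """Draw N(0, std^2) samples, redrawing any value beyond bound * std."""
    w = rng.normal(0.0, std, size=shape)
    outside = np.abs(w) > bound * std
    while outside.any():
        w[outside] = rng.normal(0.0, std, size=int(outside.sum()))
        outside = np.abs(w) > bound * std
    return w


def make_reservoir(rng, n_in, n_res, spectral_radius=0.9, density=0.1):
    """Fixed (untrained) weights: truncated-normal input, sparse recurrent."""
    w_in = truncated_normal(rng, (n_res, n_in))
    w = rng.uniform(-1.0, 1.0, size=(n_res, n_res))
    w *= rng.random((n_res, n_res)) < density  # keep about `density` of the links
    radius = np.max(np.abs(np.linalg.eigvals(w)))
    if radius > 0:
        w *= spectral_radius / radius  # rescale to the target spectral radius
    return w_in, w


def final_state(w_in, w, x):
    """Run the reservoir over a (time, features) sequence; return the last state."""
    h = np.zeros(w.shape[0])
    for x_t in x:
        h = np.tanh(w_in @ x_t + w @ h)
    return h


def bidirectional_features(w_in, w, x):
    """Concatenate final states of the forward and time-reversed passes."""
    return np.concatenate([final_state(w_in, w, x), final_state(w_in, w, x[::-1])])


def sparse_random_projection(rng, n_features, n_components, density=None):
    """Random matrix with entries in {-s, 0, +s}; no training involved."""
    if density is None:
        density = 1.0 / np.sqrt(n_features)
    scale = np.sqrt(1.0 / (density * n_components))
    signs = rng.choice([-1.0, 0.0, 1.0], size=(n_features, n_components),
                       p=[density / 2, 1.0 - density, density / 2])
    return scale * signs


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    classes = np.unique(y_true)
    return float(np.mean([np.mean(y_pred[y_true == c] == c) for c in classes]))
```

With these placeholder settings, a 40-frame utterance with 3 features per frame and a 50-unit reservoir gives a 100-dimensional bidirectional feature vector, which a 100-by-20 projection matrix reduces to 20 dimensions before any classifier is applied.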

Similar Items