Speech emotion recognition using bidirectional echo state network with random projection / Hemin Fatih Ibrahim

Bibliographic Details
Main Author: Hemin Fatih, Ibrahim
Format: Thesis
Published: 2022
Subjects: QA75 Electronic computers. Computer science
Online Access: http://studentsrepo.um.edu.my/14562/1/Hemin_Fatih.pdf
http://studentsrepo.um.edu.my/14562/2/Hemin_Fatih.pdf
http://studentsrepo.um.edu.my/14562/
id my.um.stud.14562
record_format eprints
spelling my.um.stud.14562 2023-07-04T23:00:20Z Hemin Fatih, Ibrahim (2022) Speech emotion recognition using bidirectional echo state network with random projection / Hemin Fatih Ibrahim. PhD thesis, Universiti Malaya. 2022-04 Thesis NonPeerReviewed application/pdf http://studentsrepo.um.edu.my/14562/
institution Universiti Malaya
building UM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Malaya
content_source UM Student Repository
url_provider http://studentsrepo.um.edu.my/
topic QA75 Electronic computers. Computer science
description Speech is an effective, quick, and important way of communicating and exchanging complex information between humans. Emotions have always been part of normal human conversation, making speech more engaging and more effective. Because of the major role of both speech and emotion in human life, many researchers have been drawn to Speech Emotion Recognition (SER), which is considered a key effort in Human-Computer Interaction (HCI). An accurate SER system can play an effective role in several services, such as call center services, in-car on-board systems, educational systems, and services for children in care. The major challenges in SER are to extract the most relevant and distinctive emotion features from the raw speech signal and to build a robust, computationally cheap model. The main focus of this thesis is to design a model for emotion recognition from speech signals, an area that still poses many challenges, and to adopt the most relevant features. The thesis tackles these challenges with a multivariate time series classification approach based on reservoir computing for detecting emotions from speech. Because emotion in speech is sparse and unfolds over time, handcrafted multivariate time series features are adopted as input data. The bidirectional Echo State Network (ESN), a type of reservoir computing and a special case of the Recurrent Neural Network (RNN), is adopted to avoid model complexity, since its reservoir is untrained and sparse when mapping the features into a higher-dimensional space.
Although the ESN has advantages, some problems remain, such as instability caused by randomly initializing the fixed weights and the difficulty of selecting optimal hyperparameter values, which have a large impact on ESN performance. To address these issues, a bidirectional ESN with twin reservoirs is adopted to capture additional independent information from each direction. In addition, late fusion of the same direction from the twin reservoirs yields a more informative representation and enhances the memorization capability for SER applications. A truncated normal distribution is used to initialize the random input connection weights, and the ESN hyperparameters are optimized with Bayesian optimization and Population Based Training (PBT). Moreover, the high-dimensional sparse output of a reservoir makes the feature representation suffer from the curse of dimensionality. For that reason, Sparse Random Projection (SRP) is adopted for dimensionality reduction: it offers significant computational advantages because it needs no training, and it removes redundancies with minimal loss of information. With a speaker-independent strategy, the experiments achieved unweighted average recalls of 89.21%, 70.48%, 76.76%, and 46.34% on the Emo-DB, SAVEE, RAVDESS, and FAU Aibo datasets, respectively. The results show that the proposed model outperforms a set of other methods on four publicly available emotional speech datasets.
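The building blocks named in the description can be illustrated with a minimal NumPy sketch. It is not the thesis implementation: it uses a single shared reservoir rather than twin reservoirs, takes the last reservoir state as the utterance representation, omits the trained readout and the hyperparameter search, and every function name, size, and parameter value (reservoir size, leak rate, spectral radius, truncation bound) is an assumption chosen for illustration. The projection matrix follows the common very sparse random projection construction, and unweighted average recall is computed as the mean of per-class recalls.

```python
import numpy as np

def truncated_normal(rng, shape, std=0.5, bound=2.0):
    """Draw N(0, std^2) samples, redrawing any beyond +/- bound * std."""
    w = rng.normal(0.0, std, size=shape)
    outside = np.abs(w) > bound * std
    while outside.any():
        w[outside] = rng.normal(0.0, std, size=int(outside.sum()))
        outside = np.abs(w) > bound * std
    return w

def make_reservoir(rng, n_in, n_res, spectral_radius=0.9, density=0.1):
    """Fixed (untrained) input and sparse recurrent weights."""
    w_in = truncated_normal(rng, (n_res, n_in))
    w = rng.normal(size=(n_res, n_res)) * (rng.random((n_res, n_res)) < density)
    # Rescale so the largest eigenvalue magnitude equals spectral_radius.
    w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))
    return w_in, w

def last_state(w_in, w, x, leak=0.3):
    """Run a leaky-tanh reservoir over x of shape (T, n_in); return the final state."""
    h = np.zeros(w.shape[0])
    for x_t in x:
        h = (1.0 - leak) * h + leak * np.tanh(w_in @ x_t + w @ h)
    return h

def bidirectional_state(reservoir, x):
    """Concatenate the final states of a forward and a time-reversed pass."""
    w_in, w = reservoir
    return np.concatenate([last_state(w_in, w, x), last_state(w_in, w, x[::-1])])

def sparse_random_projection(rng, n_features, n_components):
    """Untrained projection matrix with density 1 / sqrt(n_features)."""
    s = np.sqrt(n_features)
    signs = rng.choice([-1.0, 0.0, 1.0], size=(n_features, n_components),
                       p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
    return signs * np.sqrt(s / n_components)

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy usage: one random "utterance" of 40 frames with 12 features per frame.
rng = np.random.default_rng(0)
x = rng.normal(size=(40, 12))
reservoir = make_reservoir(rng, n_in=12, n_res=50)
features = bidirectional_state(reservoir, x)          # 2 * 50 values
projection = sparse_random_projection(rng, features.size, 20)
reduced = features @ projection                       # 20 values
print(features.shape, reduced.shape)
print(unweighted_average_recall([0, 0, 0, 1], [0, 0, 1, 1]))
```

In a full system, the reduced vectors for all utterances would be passed to a trained classifier; that step is left out here because the description does not specify the readout.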
format Thesis
author Hemin Fatih, Ibrahim
title Speech emotion recognition using bidirectional echo state network with random projection / Hemin Fatih Ibrahim
publishDate 2022