EnhancerNet: A Self-Organizing Map-Based DNA Sequence to Enhancer Motif Activation Map Encoding Method for Enhancer Classification with Convolutional Neural Network Analysis
Main Author: | |
---|---|
Format: | Thesis |
Language: | English |
Published: | UNIMAS Institutional Repository 2023 |
Subjects: | |
Online Access: | http://ir.unimas.my/id/eprint/43083/3/Chia%20Shu%20En_dsva.pdf http://ir.unimas.my/id/eprint/43083/4/Thesis%20Master_Chia%20Shu%20En.ftext.pdf http://ir.unimas.my/id/eprint/43083/ |
Summary: | Convolutional neural networks (CNNs) have achieved significant advancements in biological sequence analysis in recent years. Specifically, they have an edge over traditional feature-based machine learning approaches in deciphering the regulatory properties of sequences. Nevertheless, one technical challenge remains: representing biological sequences as an input matrix suitable for effective CNN learning. To address this challenge, this study proposes a novel sequence encoding approach that models enhancer motifs within DNA sequences using a self-organizing map (SOM)-based template feature map. This two-dimensional template map is constructed by clustering known motifs associated with high enhancer activity, enabling it to act as a motif scanner that detects significant enhancer motifs through similarity measures. The motifs within each node of the template map are ranked by their activation values, so that conserved and significant motifs can be selected as the final feature representation. Consequently, the input DNA sequence is transformed into an activation map, in which the spatial relationships between significant motifs and their activation strengths characterize enhancer motifs in a meaningful way. This activation map is the input to EnhancerNet, a specialized CNN model designed and trained specifically for enhancer classification. By utilizing the information within the activation map, EnhancerNet learns to recognize and extract discriminative features and patterns, which enables it to classify enhancers with high accuracy. The proposed model is validated by visualizing the enhancer motif activation map and the intermediate feature representations within the CNN layers, confirming that the learned representations are meaningful. 
Furthermore, the proposed method is compared against six state-of-the-art sequence encoding methods (one-hot encoding, k-mer counting, random walk, skip-gram, continuous-bag-of-words, and Global Vectors) using the same benchmark input histone datasets. The evaluation, which covers accuracy, precision, recall, F1-score, and AUC score, consistently shows superior performance, with improvements ranging from 0.0283 to 0.0573 across these metrics compared to the other methods. Additionally, a time-accuracy analysis supports the effectiveness of the proposed model in terms of both accuracy and computational efficiency, and a t-test confirms the statistical significance of the performance difference. In conclusion, the evaluation results indicate that EnhancerNet is an effective approach for generating meaningful representations, resulting in significant improvements in the performance of CNN classifiers. This thesis contributes a novel approach for transforming DNA sequences into an enhancer motif activation map that captures spatial relationships, context dependency, and over-represented motifs. The approach capitalizes on the ability of CNNs to model higher-level abstraction features, and it is expected to inspire future designs of DNA sequence representation for CNN modelling. |
---|
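The encoding idea in the summary (a two-dimensional template map used as a motif scanner, producing an activation map for a CNN) can be illustrated with a minimal sketch. This is not the thesis implementation. It assumes that each node of the template map holds a single prototype vector, that similarity is the fraction of matching bases between a sequence window and the prototype, and that each node keeps only its strongest activation over all windows. The motif strings, the 2x2 grid size, and the function names are hypothetical. In the thesis the template map is built by SOM clustering of known enhancer motifs, which this sketch replaces with hand-written prototypes.

```python
from typing import List

ALPHABET = "ACGT"


def one_hot(kmer: str) -> List[float]:
    """Flatten a k-mer into a 4*k one-hot vector (A, C, G, T order)."""
    vec: List[float] = []
    for base in kmer:
        vec.extend(1.0 if base == letter else 0.0 for letter in ALPHABET)
    return vec


def similarity(a: List[float], b: List[float], k: int) -> float:
    """Dot product divided by k; for one-hot inputs this is the
    fraction of positions where the two k-mers agree."""
    return sum(x * y for x, y in zip(a, b)) / k


def activation_map(sequence: str,
                   template: List[List[List[float]]],
                   k: int) -> List[List[float]]:
    """Scan the sequence with a sliding window of length k and record,
    for every template node, its highest similarity to any window."""
    if len(sequence) < k:
        raise ValueError("sequence is shorter than the motif length k")
    windows = [one_hot(sequence[i:i + k])
               for i in range(len(sequence) - k + 1)]
    return [[max(similarity(w, node, k) for w in windows) for node in row]
            for row in template]


# Hypothetical 2x2 template map: toy prototypes standing in for the
# SOM-clustered enhancer motifs described in the summary.
TOY_MOTIFS = [["TGACTCA", "CACGTGA"],
              ["GGAATTC", "TTTAAAC"]]
template = [[one_hot(m) for m in row] for row in TOY_MOTIFS]

# The toy sequence contains the first prototype exactly, so that node
# reaches the maximum activation of 1.0.
amap = activation_map("AATGACTCAGG", template, k=7)
for row in amap:
    print(["%.2f" % v for v in row])
```

The resulting grid has the same shape as the template map, so it could be passed to a CNN as a single-channel image. A trained SOM would supply real-valued node weights and a neighbourhood structure in which similar motifs sit close together on the grid, which is what would give the spatial layout of the activation map its meaning.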