Comparisons of DNA Sequence Representation Methods for Deep Learning Modelling

Bibliographic Details
Main Authors: Shu En, Chia; Lee, Nung Kion
Format: Proceeding
Language: English
Published: 2022
Online Access: http://ir.unimas.my/id/eprint/42246/3/Comparisons%20of%20DNA%20-%20Copy.pdf
http://ir.unimas.my/id/eprint/42246/
https://ieeexplore.ieee.org/document/9936754
Description
Summary: Learning the enhancer sequence grammar from protein-DNA interactions via a computational approach is a challenging task because the features associated with the recognition codes are ill-defined. While sequence features are not the only way to define sequence characteristics, they are the most effective. Deep learning neural networks have become the key technique for modeling those features for the classification task. Nevertheless, for deep learning models to learn effectively, enhancer sequence features must be represented and encoded in a suitable matrix form. The aim of this paper is to evaluate six sequence feature representation/encoding methods for convolutional neural network (CNN) modelling. Using a histone marks dataset as input data, our results indicate that the k-mer feature achieved the best performance, followed by word-based features, which performed better than one-hot encoding. The random-walk feature, however, performed the worst. Moreover, our findings provide strong evidence for using k-mer/word features instead of the popular one-hot encoding for histone sequences in CNN modeling.
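
The summary names one-hot, k-mer, and word-based encodings but does not say how each was implemented. The following is a minimal Python sketch of common forms of these three encodings, not the authors' code. The function names, the value of k, the use of overlapping windows, and the reading of "word-based" as a list of overlapping k-mer tokens are all assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed nucleotide order; any fixed order works


def one_hot(seq):
    """Encode a DNA string as an L x 4 matrix (list of rows).

    Bases outside ALPHABET (e.g. 'N') become all-zero rows; this handling
    is an illustrative assumption, not taken from the paper.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    matrix = []
    for base in seq.upper():
        row = [0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1
        matrix.append(row)
    return matrix


def kmer_counts(seq, k=3):
    """Count overlapping k-mers into a vector of length 4**k.

    Vector positions follow the lexicographic order of k-mers over ALPHABET.
    K-mers containing unknown bases are skipped.
    """
    vocab = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    counts = [0] * len(vocab)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in vocab:
            counts[vocab[kmer]] += 1
    return counts


def kmer_words(seq, k=3):
    """Split a sequence into overlapping k-mer 'words' (hypothetical reading
    of a word-based representation, e.g. as input to an embedding layer)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    example = "ACGTAC"
    print(one_hot(example))
    print(kmer_counts(example, k=2))
    print(kmer_words(example, k=2))
```

In this sketch the one-hot matrix has one row per base, so its size grows with sequence length, whereas the k-mer count vector always has 4^k entries regardless of sequence length.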