Integrated framework with association analysis for gene selection in microarray data classification

Microarray data classification is one of the major interests in health informatics that aims at discovering hidden patterns in gene expression profiles. The main challenge in building this classification system is the curse of dimensionality problem. Therefore, gene selection is an indispensable tas...

Bibliographic Details
Main Author: Ong, Huey Fang
Format: Thesis
Language: English
Published: 2011
Online Access: http://psasir.upm.edu.my/id/eprint/27711/1/FSKTM%202011%2029R.pdf
http://psasir.upm.edu.my/id/eprint/27711/
id my.upm.eprints.27711
record_format eprints
spelling my.upm.eprints.27711 2014-04-10T04:22:58Z http://psasir.upm.edu.my/id/eprint/27711/ 2011-04 Thesis NonPeerReviewed application/pdf en http://psasir.upm.edu.my/id/eprint/27711/1/FSKTM%202011%2029R.pdf Ong, Huey Fang (2011) Integrated framework with association analysis for gene selection in microarray data classification. Masters thesis, Universiti Putra Malaysia. English
institution Universiti Putra Malaysia
building UPM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Putra Malaysia
content_source UPM Institutional Repository
url_provider http://psasir.upm.edu.my/
language English
description Microarray data classification is one of the major interests in health informatics and aims to discover hidden patterns in gene expression profiles. The main challenge in building such a classification system is the curse of dimensionality. Gene selection is therefore an indispensable task in microarray data classification, used to identify smaller sets of relevant genes. However, most existing gene selection methods are statistical analyses based purely on gene expression values to identify differentially expressed genes. As a result, the selected genes might be false positives and might not be biologically meaningful. The purpose of this study was to integrate microarray datasets with additional biological information in order to select genes that are not only differentially expressed but also informative for classifiers. To achieve that, an integrated framework with a new gene selection method was developed to improve classification performance in terms of accuracy and number of selected genes. The proposed gene selection method combined the strengths of a filter method and association analysis to identify a set of discriminative and informative genes. Association analysis was employed to integrate more than one type of biological information in the same transaction database, and also to identify groups of genes that frequently co-occur in target samples. The existing association algorithm for mining frequent itemsets was modified so that genes in each itemset were sorted according to their discriminative scores rather than in lexicographic order. In addition, discriminative scores were used to compute the interestingness of frequent itemsets before ranking them. The proposed integrated framework was tested on colon cancer, leukemia, breast cancer and lung cancer microarray datasets.
Two types of biological information were incorporated in the selection process, namely the Gene Ontology (GO) and KEGG Pathways (KEGG). The experimental results showed that the recommended GO-based, KEGG-based, and GO-KEGG-based models outperformed the expression-only models, attaining better classification accuracies with fewer genes. In the experiments, the leukemia and lung cancer datasets achieved 100% accuracy with all classifiers using as few as three selected genes. The colon cancer and breast cancer datasets, in turn, achieved better classification accuracies than the previous integrated method, at 95.16% and 95.88% respectively. Moreover, the proposed integrated framework was shown to build informative and interpretable microarray classification models. The selected genes can be traced back to their functional annotations and association groups for reasoning and for creating new hypotheses for future investigation.
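The itemset-mining modification described in the abstract can be illustrated with a minimal sketch. This is not the thesis's algorithm: the function name, the brute-force enumeration, and the interestingness formula (support multiplied by the mean discriminative score) are all assumptions made for illustration, since the record does not specify them. The sketch only shows the two ideas the abstract names: items inside each itemset follow descending discriminative score instead of lexicographic order, and the scores feed into the ranking of frequent itemsets.

```python
from itertools import combinations


def mine_ranked_itemsets(transactions, scores, min_support, max_size=3):
    """Enumerate frequent gene itemsets and rank them (hypothetical sketch).

    transactions: list of sets of gene names, one set per target sample.
    scores: dict mapping gene name to a discriminative score (assumed given).
    """
    n = len(transactions)
    # Keep only genes that are frequent on their own; any frequent itemset
    # can only contain such genes.
    frequent_genes = [
        g for g in scores
        if sum(g in t for t in transactions) / n >= min_support
    ]
    # Order genes by descending discriminative score (name breaks ties)
    # rather than lexicographically.
    ordered = sorted(frequent_genes, key=lambda g: (-scores[g], g))

    ranked = []
    for size in range(1, max_size + 1):
        # combinations() preserves input order, so every itemset is
        # already sorted by discriminative score.
        for itemset in combinations(ordered, size):
            support = sum(set(itemset) <= t for t in transactions) / n
            if support >= min_support:
                # Assumed interestingness measure: support x mean score.
                mean_score = sum(scores[g] for g in itemset) / size
                ranked.append((itemset, support, support * mean_score))
    # Most interesting itemsets first.
    ranked.sort(key=lambda r: -r[2])
    return ranked


# Toy data: four samples and three genes with made-up scores.
samples = [{"g1", "g2", "g3"}, {"g1", "g2"}, {"g2", "g3"}, {"g1", "g2", "g3"}]
gene_scores = {"g1": 0.9, "g2": 0.5, "g3": 0.7}
for itemset, support, interest in mine_ranked_itemsets(samples, gene_scores, 0.5):
    print(itemset, support, round(interest, 3))
```

In a realistic setting the transactions would also carry GO and KEGG annotation items alongside genes, and a level-wise algorithm with candidate pruning would replace the exhaustive enumeration used here.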
format Thesis
author Ong, Huey Fang
title Integrated framework with association analysis for gene selection in microarray data classification
publishDate 2011
url http://psasir.upm.edu.my/id/eprint/27711/1/FSKTM%202011%2029R.pdf
http://psasir.upm.edu.my/id/eprint/27711/