Development of compound clustering techniques using hybrid soft-computing algorithms

Databases of molecular structures available to the pharmaceutical industry comprise millions of molecules. With the advent of combinatorial chemistry, a vast number of compounds can be available either physically or virtually, which can make screening all of them infeasible in terms of time and cost...

Bibliographic Details
Main Authors: Salim, Naomie; Shamsuddin, Siti Mariyam; Salleh @ Sallehuddin, Roselina; Alwee, Razana
Format: Monograph
Language: English
Published: Faculty of Computer Science and Information System 2006
Subjects: T Technology (General)
Online Access: http://eprints.utm.my/id/eprint/4139/1/74252.pdf
http://eprints.utm.my/id/eprint/4139/
id my.utm.4139
record_format eprints
institution Universiti Teknologi Malaysia
building UTM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Teknologi Malaysia
content_source UTM Institutional Repository
url_provider http://eprints.utm.my/
language English
topic T Technology (General)
description Databases of molecular structures available to the pharmaceutical industry comprise millions of molecules. With the advent of combinatorial chemistry, a vast number of compounds can be available either physically or virtually, which can make screening all of them infeasible in terms of time and cost. Therefore, only a subset of the database that covers the full range of structural types in the underlying dataset needs to be selected for screening, to maximise the likelihood of finding as many biologically distinct active compounds as possible in a screening experiment. One of the most widely used compound selection methods is cluster-based compound selection, which involves subdividing a set of compounds into clusters and choosing one compound or a small number of compounds from each cluster. Selecting only representative compounds from each cluster rests on the assumption that structurally similar molecules have similar properties. A good clustering method groups similar compounds together, to ensure all activity classes are represented, whilst separating active and inactive compounds into different sets of clusters, to avoid an inactive compound being selected as a cluster representative. Hierarchical clustering methods such as Ward’s and Group Average are considered the industry standard for compound selection purposes. There has been limited previous work on the clustering and classification of biologically active compounds into their activity-based classes using fuzzy and neural network methods. Furthermore, many biologically active molecular structures have been found to exhibit more than one activity, in which case they can be used as drugs for the treatment of more than one disease. However, previous clustering methods for chemical compounds are mostly limited to hard partitioning, which allows a compound to belong to only one cluster.
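The cluster-based selection idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy fingerprints, the Tanimoto similarity measure, the pure-Python group-average merging loop, and the choice of the cluster medoid as representative are all assumptions made for illustration.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def group_average_clusters(fps, n_clusters):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    highest average pairwise similarity until n_clusters remain."""
    clusters = [[i] for i in range(len(fps))]
    while len(clusters) > n_clusters:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            sims = [tanimoto(fps[i], fps[j])
                    for i in clusters[x] for j in clusters[y]]
            avg = sum(sims) / len(sims)
            if best is None or avg > best[0]:
                best = (avg, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return clusters


def pick_representatives(fps, clusters):
    """Assumed selection rule: take the medoid, i.e. the member with the
    highest mean similarity to all members of its cluster."""
    reps = []
    for members in clusters:
        def mean_sim(i, members=members):
            return sum(tanimoto(fps[i], fps[j]) for j in members) / len(members)
        reps.append(max(members, key=mean_sim))
    return reps


# Hypothetical 2-D bit-string fingerprints (sets of on-bit positions).
fingerprints = [
    {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 4, 5},  # one structural family
    {10, 11, 12}, {10, 11, 13},                # a second family
    {20, 21},                                  # a singleton
]
clusters = group_average_clusters(fingerprints, n_clusters=3)
representatives = pick_representatives(fingerprints, clusters)
```

Only the compounds listed in `representatives` would then be sent for screening, one per cluster.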
In this work, neural, fuzzy and hybrid methods are used to cluster biologically active molecular structures into their corresponding activity classes. The methods have been evaluated on MDL’s MDDR, NCI’s AIDS and IDDB drug databases, which contain various biologically active classes of molecular structures. The neural network methods use a number of heuristics to find appropriate parameter values. Initially, these heuristics relied on user intervention to select the values, which gave poor results. To overcome this problem, fuzzy memberships have been employed to find optimal parameters. Since fuzzy clustering methods such as fuzzy c-means and fuzzy G-K are computationally expensive in terms of time and memory, a hierarchical approach has also been used in this work for their implementation. The hierarchical fuzzy clustering algorithm developed in this work assigns overlapping structures (structures having more than one activity) to more than one cluster if their fuzzy membership values are significantly high for those clusters. When compared with industry standard methods, the neural networks show very poor performance when 2-D bit-string descriptors are used. However, their relative performance improves when topological indices are used as descriptors. The fuzzy and fuzzy neural methods show slightly better results than the industry standard methods. The hierarchical fuzzy clustering method developed here is far better than a similar implementation of the hard k-means method. When used for overlapping structures, its performance improves significantly. Although the neural network methods are not very effective in clustering biologically active structures, they perform remarkably well when used as classifiers.
The feed-forward and radial basis function networks show higher learning capabilities than support vector machines and rough set classifiers in the classification of datasets comprising more than two classes. However, their performance is slightly inferior to that of support vector machines for binary classification of chemical structures into drug and non-drug compounds.
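The overlapping assignment described in the abstract, where a structure joins every cluster for which its fuzzy membership is high, can be sketched with plain fuzzy c-means. This is an illustrative sketch only, not the hierarchical algorithm developed in the report: the 2-D toy points, the fixed initial centres, the fuzzifier `m = 2.0`, and the membership threshold of 0.3 are all assumed values.

```python
import math


def memberships_for(points, centers, m):
    """Standard fuzzy c-means membership: u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))."""
    rows = []
    for p in points:
        dists = [max(math.dist(p, c), 1e-12) for c in centers]  # avoid divide-by-zero
        rows.append([
            1.0 / sum((d_k / d_j) ** (2.0 / (m - 1.0)) for d_j in dists)
            for d_k in dists
        ])
    return rows


def fuzzy_c_means(points, centers, m=2.0, n_iter=50):
    """Alternate membership and centre updates for a fixed number of iterations."""
    centers = [list(c) for c in centers]
    dim = len(points[0])
    for _ in range(n_iter):
        u = memberships_for(points, centers, m)
        for k in range(len(centers)):
            weights = [row[k] ** m for row in u]
            total = sum(weights)
            centers[k] = [
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ]
    return centers, memberships_for(points, centers, m)


def overlapping_assignment(u, threshold=0.3):
    """Assign each point to every cluster whose membership reaches the
    (assumed) threshold, so one point may belong to several clusters."""
    return [[k for k, value in enumerate(row) if value >= threshold] for row in u]


# Hypothetical descriptor vectors: two tight groups and one point in between.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (5.3, 5.3),
]
centers, u = fuzzy_c_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
assignments = overlapping_assignment(u)
```

In this toy data the in-between point has a membership close to 0.5 in both clusters, so it is placed in both, while points inside a tight group stay in a single cluster. A hard k-means partition would force the in-between point into exactly one of them.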
spelling my.utm.4139 2010-06-01T03:15:04Z http://eprints.utm.my/id/eprint/4139/ Faculty of Computer Science and Information System 2006-10-31 Monograph NonPeerReviewed application/pdf en http://eprints.utm.my/id/eprint/4139/1/74252.pdf
Salim, Naomie and Shamsuddin, Siti Mariyam and Salleh @ Sallehuddin, Roselina and Alwee, Razana (2006) Development of compound clustering techniques using hybrid soft-computing algorithms. Project Report. Faculty of Computer Science and Information System, Skudai, Johor. (Unpublished)