Chemical Compound Classification Ensemble
Abstract
In health science research, scientists often need to screen large numbers of chemical compounds to find drugs that can treat a disease. Testing the function of these compounds in the laboratory is very time-consuming, so computational methods have been used to accelerate the process. These methods rest on the principle that chemical compounds with similar structures often have similar functions. They maintain a database of chemical compounds whose function has been verified by laboratory experiments. For each compound, the database stores its structural formula, the 3D coordinates of every atom, and whether it has a certain function, e.g., whether it can kill a virus. For a new compound, the programs compare its structure with those in the database and predict whether it has the function based on structural similarity. Predicting the function of a compound is thus a two-class classification problem. In this project, we address this problem using global and local similarity between compounds. Global similarity measures the overall structural resemblance between two compounds. When a group of compounds have the same function, they usually share some common sub-structures, which may correspond to their functional sites. Local similarity is computed from the occurrences of these common sub-structures in the compounds. We built several classification models based on global and local similarity. To improve the classification result, we used an ensemble of those models to predict the function of compounds in NCI cancer data sets. We predict whether a compound can inhibit cancer cell growth, obtaining an AUC higher than 80% on five data sets. We compare our results with other state-of-the-art methods; our classification result is the best on all five data sets.
Our results show that local similarity is more useful than global similarity in predicting compound function. An ensemble method integrating global and local similarity achieves much better performance than any single prediction model.
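The idea of combining global and local similarity can be illustrated with a minimal sketch. It is not the models used in this work: the set-based compound representation, the Tanimoto coefficient as the similarity measure, the similarity-weighted vote, the plain averaging of the two scores, and all names and toy data below are assumptions made only for illustration.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical record: (global structural features, sub-structures, label),
# where label is 1 if the compound has the function and 0 otherwise.
Compound = Tuple[FrozenSet[str], FrozenSet[str], int]


def tanimoto(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_score(query: FrozenSet[str], database: List[Compound], index: int) -> float:
    """Similarity-weighted fraction of active compounds in the database.

    index selects which feature set to compare: 0 = global, 1 = local.
    """
    weights = [tanimoto(query, entry[index]) for entry in database]
    total = sum(weights)
    if total == 0:
        return 0.5  # no similar compound at all: stay undecided
    return sum(w * entry[2] for w, entry in zip(weights, database)) / total


def ensemble_score(query_global: FrozenSet[str], query_local: FrozenSet[str],
                   database: List[Compound]) -> float:
    """Average the scores of the global and the local similarity model."""
    g = similarity_score(query_global, database, 0)
    l = similarity_score(query_local, database, 1)
    return (g + l) / 2


# Toy database with made-up feature names (assumption, not real data).
database: List[Compound] = [
    (frozenset({"C", "N", "O", "ring6"}), frozenset({"amide"}), 1),
    (frozenset({"C", "N", "ring6"}), frozenset({"amide", "phenyl"}), 1),
    (frozenset({"C", "O", "S"}), frozenset({"thiol"}), 0),
]

query_global = frozenset({"C", "N", "ring5"})
query_local = frozenset({"amide"})

score = ensemble_score(query_global, query_local, database)
predicted_active = score >= 0.5  # threshold chosen arbitrarily for the sketch
```

In this toy case the local model is more decisive than the global one, because the query shares a sub-structure only with the active compounds. A score of this kind can be ranked across many query compounds to compute an AUC.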