Computational Methods for Predicting Protein-Nucleic Acids Interaction
Abstract
Since the inception of various proteomic projects, protein structures with unknown functions have been discovered at a rapid pace. Proteins regulate many important biological processes by interacting with nucleic acids, namely DNA and RNA. Traditional wet-lab methods for discovering protein function are too slow to keep up with this rapid increase in data. Therefore, there is a need for computational methods that can predict the interactions between proteins and nucleic acids. Predicting protein-nucleic acid interactions involves two related problems. One is to identify nucleic acid-binding sites on protein structures; the other is to predict the 3-D structure of the complex that the protein and the nucleic acid form during interaction. The second problem can be further divided into two steps. The first step is to generate potential structures (poses) for the protein-nucleic acid complex. The second step is to assign scores to the poses generated in the first step. This dissertation presents two computational methods that we developed to predict protein-nucleic acid interactions. The first method is a scoring function that can discriminate native structures of protein-DNA complexes from non-native poses, which are also known as docking decoys. We analyze the distribution of protein atoms around each structural component of the DNA and develop spatial-specific scoring matrices (SSSMs) based on the observed distribution. We show that the SSSMs can be used as a knowledge-based energy function to discriminate native protein-DNA structures from various decoys. Our second method discovers sub-graphs that are enriched on protein-nucleic acid interfaces and then uses these sub-graphs to predict RNA-binding sites on protein structures and to assign scores to protein-RNA poses. First, the interface area of each RNA-binding protein is represented as a graph, where each node represents an interface residue. Then, common sub-graphs that are abundant in these graphs are identified. The method is able to identify RNA-binding sites on the protein surface with high accuracy. We also demonstrate that the common sub-graphs can be used as a scoring function to rank protein-RNA poses. Our method is computationally simple, and its results are easy to interpret in biological contexts.
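To illustrate the general idea of a knowledge-based energy function derived from observed distributions, the following is a minimal sketch, not the SSSM implementation from the dissertation. It assumes that each observation is a pair of a spatial bin around a DNA component and a protein atom type, and it uses a standard log-odds (inverse Boltzmann) score with a pseudocount. The function names, bin labels, atom-type labels, and toy counts are all hypothetical.

```python
import math
from collections import Counter


def build_scoring_matrix(observations, pseudocount=1.0):
    """Turn observed (spatial_bin, atom_type) pairs into log-odds energies.

    Assumption: a pair seen more often than expected by chance in native
    complexes receives a negative (favorable) energy.
    """
    observations = list(observations)
    total = len(observations)
    pair_counts = Counter(observations)
    bin_counts = Counter(b for b, _ in observations)
    type_counts = Counter(t for _, t in observations)

    n_cells = len(bin_counts) * len(type_counts)
    matrix = {}
    for b in bin_counts:
        for t in type_counts:
            # Smoothed observed frequency of this (bin, atom type) pair.
            observed = (pair_counts[(b, t)] + pseudocount) / (
                total + pseudocount * n_cells
            )
            # Frequency expected if bin and atom type were independent.
            expected = (bin_counts[b] / total) * (type_counts[t] / total)
            matrix[(b, t)] = -math.log(observed / expected)
    return matrix


def score_pose(matrix, contacts, default=0.0):
    """Sum the energies of all contacts in a pose; lower is more native-like."""
    return sum(matrix.get(contact, default) for contact in contacts)


# Toy training data (hypothetical labels and counts, for illustration only).
training = (
    [("major_groove", "N")] * 8
    + [("major_groove", "O")] * 2
    + [("backbone", "O")] * 8
    + [("backbone", "N")] * 2
)
matrix = build_scoring_matrix(training)

native_like = [("major_groove", "N"), ("backbone", "O")]
decoy_like = [("major_groove", "O"), ("backbone", "N")]
native_score = score_pose(matrix, native_like)
decoy_score = score_pose(matrix, decoy_like)
```

In this toy setting the pose whose contacts match the enriched pairs receives the lower energy, which is the behavior a scoring function needs in order to rank native structures ahead of decoys.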