Prediction of interactiveness of proteins and nucleic acids based on feature selections
Yuan, Y.; Shi, X.; Li, X.; Lu, W.; Cai, Y.; Gu, L.; Liu, L.; Li, M.; Kong, X.; Xing, M.
|Molecular Diversity 14 (2010), No.4, pp.627-633|
| Journal Article|
|Fraunhofer SCAI|
Identifying which proteins can interact with nucleic acids is important for protein annotation, since interactions between nucleic acids and proteins are involved in numerous cellular processes such as replication, transcription, splicing, and DNA repair. This research aims to identify proteins that can interact with DNA, RNA, and rRNA, respectively. mRMR (minimum redundancy and maximum relevance), with its elegant mathematical formulation, has been applied widely in processing biological data and feature analysis since its introduction in 2005. mRMR combined with incremental feature selection (IFS) is known to be very efficient in feature selection and analysis, and can improve both the effectiveness and the efficiency of a prediction model. IFS is applied to decide how many features should be selected from the feature list provided by mRMR. In the end, the features selected by mRMR and IFS are further refined by a conventional feature selection method, the forward feature wrapper (FFW), by reordering the features. Each protein is encoded by 132 features, including amino acid compositions and physicochemical properties. After feature selection, the k-nearest neighbor algorithm, the adopted prediction model, is trained and tested. As a result, the optimized prediction accuracies for DNA, RNA, and rRNA are 82.0, 83.4, and 92.3%, respectively. Furthermore, the most important features that contribute to the prediction are identified and analyzed biologically. The predictor developed for this research is available for public access at http://chemdata.shu.edu.cn/protein-na-mrmr/ .
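The incremental feature selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a feature ranking is already available (here a hand-written list standing in for mRMR output), uses a tiny invented data set instead of the 132-feature protein encoding, and scores each candidate subset with leave-one-out accuracy of a plain k-nearest neighbor classifier. The function names `knn_predict`, `loo_accuracy`, and `incremental_feature_selection` are hypothetical.

```python
def knn_predict(train_X, train_y, query, k=1):
    """Predict a label by majority vote among the k nearest training points."""
    # Squared Euclidean distance is enough for ranking neighbors.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, query)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = {}
    for _, label in dists[:k]:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def loo_accuracy(X, y, feature_idx, k=1):
    """Leave-one-out accuracy of k-NN restricted to the given feature columns."""
    correct = 0
    for i in range(len(X)):
        train_X = [[row[j] for j in feature_idx]
                   for n, row in enumerate(X) if n != i]
        train_y = [label for n, label in enumerate(y) if n != i]
        query = [X[i][j] for j in feature_idx]
        if knn_predict(train_X, train_y, query, k) == y[i]:
            correct += 1
    return correct / len(X)


def incremental_feature_selection(X, y, ranked_features, k=1):
    """Evaluate the top-1, top-2, ... ranked features and keep the best prefix."""
    best_n, best_acc, curve = 0, -1.0, []
    for n in range(1, len(ranked_features) + 1):
        acc = loo_accuracy(X, y, ranked_features[:n], k)
        curve.append(acc)
        if acc > best_acc:  # strict '>' keeps the smallest subset on ties
            best_n, best_acc = n, acc
    return ranked_features[:best_n], best_acc, curve


# Invented toy data: column 0 separates the classes, column 1 is misleading.
X = [[0.10, 5.0], [0.20, 1.0], [0.15, 9.0],
     [0.90, 5.2], [1.00, 0.8], [0.95, 9.1]]
y = [0, 0, 0, 1, 1, 1]
ranked = [0, 1]  # assumed ranking, standing in for an mRMR feature list

selected, best_acc, curve = incremental_feature_selection(X, y, ranked, k=1)
print(selected, best_acc, curve)
```

In this sketch the accuracy curve over the ranked prefixes plays the role of the IFS curve: the prefix with the highest accuracy is kept, and a wrapper step such as FFW could then reorder or prune the retained features further.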