Genetic Programming for Biomarker Detection in Classification of Mass Spectrometry Data
Mass spectrometry (MS) is currently the most commonly used technology in biochemical research for proteomic analysis. The primary goal of proteomic profiling using mass spectrometry is the classification of samples from different experimental states. To classify the MS samples, the identification of protein or peptides (biomarker detection) that are expressed differently between the classes, is required. However, due to the high dimensionality of the data and the small number of samples, classification of MS data is extremely challenging. Another important aspect of biomarker detection is the verification of the detected biomarker that acts as an intermediate step before passing these biomarkers to the experimental validation stage. Biomarker detection aims at altering the input space of the learning algorithm for improving classification of proteomic or metabolomic data. This task is performed through feature manipulation. Feature manipulation consists of three aspects: feature ranking, feature selection, and feature construction. Genetic programming (GP) is an evolutionary computation algorithm that has the intrinsic capability for the three aspects of feature manipulation. The ability of GP for feature manipulation in proteomic biomarker discovery has not been fully investigated. This thesis, therefore, proposes an embedded methodology for these three aspects of feature manipulation in high dimensional MS data using GP. The thesis also presents a method for biomarker verification, using GP. The thesis investigates the use of GP for both single-objective and multi-objective feature selection and construction. In feature ranking, the thesis proposes a GP-based method for ranking subsets of features by using GP as an ensemble approach. The proposed algorithm uses GP capability to combine the advantages of different feature ranking metrics and evolve a new ranking scheme for the subset of the features selected from the top ranked features. The capability of GP as a classifier is also investigated by this method. The results show that GP can select a smaller number of features and provide a better ranking of the selected features, which can improve the classification performance of five classifiers. In feature construction, this thesis proposes a novel multiple feature construction method, which uses a single GP tree to generate a new set of high-level features from the original set of selected features. The results show that the proposed new algorithm outperforms two feature selection algorithms. In feature selection, the thesis introduces the first GP multi-objective method for biomarker detection, which simultaneously increase the classification accuracy and reduce the number of detected features. The proposed multi-objective method can obtain better subsets of features than the single-objective algorithm and two traditional multi-objective approaches for feature selection. This thesis also develops the first multi-objective multiple feature construction algorithm for MS data. The proposed method aims at both maximising the classification performance and minimizing the cardinality of the constructed new high-level features. The results show that GP can dis- cover the complex relationships between the features and can significantly improve classification performance and reduce the cardinality. For biomarker verification, the thesis proposes the first GP biomarker verification method through measuring the peptide detectability. The method solves the imbalance problem in the data and shows improvement over the benchmark algorithms. Also, the algorithm outperforms a well-known peptide detection method. The thesis also introduces a new GP method for alignment of MS data as a preprocessing stage, which will further help in improving the biomarker detection process.