Classification on high-dimensional data with thousands to tens of thousands of dimensions is a challenging task due to the high dimensionality and the quality of the feature set. The problem can be addressed by using feature selection to choose only informative features or feature construction to create new high-level features. Genetic programming (GP) using a tree-based representation can be used for both feature construction and implicit feature selection. This work presents a comprehensive study to investigate the use of GP for feature construction and selection on high-dimensional classification problems. Different combinations of the constructed and/or selected features are tested and compared on seven high-dimensional gene expression problems, and different classification algorithms are used to evaluate their performance. The results show that the constructed and/or selected feature sets can significantly reduce the dimensionality and maintain or even increase the classification accuracy in most cases. The cases in which overfitting occurred are analysed via the distribution of features. Further analysis is also performed to show why the constructed feature can achieve promising classification performance.<br><br>This is a post-peer-review, pre-copyedit version of an article published in 'Memetic Computing'. The final authenticated version is available online at: https://doi.org/10.1007/s12293-015-0173-y. The following terms of use apply: https://www.springer.com/gp/open-access/publication-policies/aam-terms-of-use.
Tran, B., Xue, B. & Zhang, M. (2016). Genetic programming for feature construction and selection in classification on high-dimensional data. Memetic Computing, 8(1), 3-15. https://doi.org/10.1007/s12293-015-0173-y