Open Access Te Herenga Waka-Victoria University of Wellington

Evolutionary Computation for Feature Manipulation in Classification on High-dimensional Data

posted on 2021-11-23, 13:22 authored by Tran, Binh Ngan

High-dimensional data is increasingly common in machine learning, especially in classification tasks. With thousands of features, these datasets challenge learning algorithms not only because of the curse of dimensionality but also because they contain many irrelevant and redundant features. Feature selection and feature construction (feature manipulation for short) are therefore essential techniques for preprocessing these datasets. While feature selection aims to select relevant features, feature construction builds high-level features from the original ones to better represent the target concept. Both can reduce dimensionality and improve the performance of learning algorithms in terms of classification accuracy and computation time.

Although feature manipulation has been studied for decades, the task remains challenging on high-dimensional data because of the huge search space. Existing methods often stagnate in local optima and/or require long computation times. Evolutionary computation techniques are well known for their global search ability. Particle swarm optimisation (PSO) and genetic programming (GP) have shown promise in feature selection and feature construction, respectively. However, applying these techniques to high-dimensional data usually requires large amounts of memory and computation time.

The overall goal of this thesis is to investigate new approaches to using PSO for feature selection and GP for feature construction on high-dimensional classification problems. The thesis focuses on incorporating a variety of strategies into the evolutionary process and on developing new PSO and GP representations to improve the effectiveness and efficiency of PSO and GP for feature manipulation on high-dimensional data.

This thesis proposes a new PSO-based feature selection approach for high-dimensional data that incorporates a new local search to balance the global and local search of PSO. A hybrid wrapper and filter evaluation method, which can be sped up in the local search, is proposed to help PSO achieve better performance, scalability and robustness on high-dimensional data. The results show that the proposed method significantly outperforms the compared methods in 80% of the cases, with an increase of up to 16% in average accuracy, while reducing the number of features by one to two orders of magnitude.

This thesis develops the first PSO-based feature selection via discretisation method, which performs multivariate discretisation and feature selection in a single stage to achieve better solutions than applying the two techniques separately in two stages. Two new PSO representations are proposed to evolve cut-points for multiple features simultaneously. The results show that the proposed method selects fewer than 4.6% of the features in all cases and improves classification performance by 5% to 23% in most cases.

This thesis proposes the first clustering-based feature construction method to improve the performance of single-tree GP on high-dimensional data. A new feature clustering method is proposed to automatically group similar features based on a given redundancy level. The results show that, compared with standard GP, the new method selects fewer than half of the features to construct a new high-level feature that achieves significantly better accuracy in most cases. The combination of the single constructed feature and the selected features achieves the best performance among the different feature sets created from a single tree.

This thesis develops the first class-dependent multiple feature construction method using multi-tree GP for high-dimensional data. A new GP representation and a new filter fitness function that combines two filter measures are proposed to evaluate the whole set of constructed features more effectively and efficiently. The results show that in 83% of the cases, with fewer than 10 constructed features, the class-dependent method improves average accuracy by up to 32% compared with using all of the thousands of original features, and 10% compared with using the features constructed by the class-independent method.
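
To make the idea of PSO-based feature selection described above more concrete, the following is a minimal illustrative sketch, not the method proposed in the thesis. It assumes a plain continuous PSO in which each particle holds one value per feature and a feature counts as selected when its value exceeds a threshold. The synthetic data generator, the leave-one-out 1-NN wrapper evaluation, the size penalty of 0.01, and all parameter values (`threshold`, `w`, `c1`, `c2`, swarm size) are assumptions chosen for illustration. The thesis's local search, hybrid wrapper and filter evaluation, and new representations are not reproduced here.

```python
import random

def make_toy_data(n_samples=40, n_features=20, n_relevant=3, seed=0):
    """Synthetic two-class data (assumption): only the first n_relevant
    features carry class signal; the rest are pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(2.0 * label if j < n_relevant else 0.0, 1.0)
               for j in range(n_features)]
        X.append(row)
        y.append(label)
    return X, y

def loo_1nn_accuracy(X, y, subset):
    """Wrapper evaluation: leave-one-out accuracy of a 1-NN classifier
    that only looks at the feature indices in `subset`."""
    if not subset:
        return 0.0
    correct = 0
    for i, xi in enumerate(X):
        best_dist, best_label = float("inf"), None
        for k, xk in enumerate(X):
            if k == i:
                continue
            dist = sum((xi[j] - xk[j]) ** 2 for j in subset)
            if dist < best_dist:
                best_dist, best_label = dist, y[k]
        correct += best_label == y[i]
    return correct / len(X)

def pso_feature_selection(X, y, n_particles=10, n_iters=15, threshold=0.6,
                          w=0.7, c1=1.5, c2=1.5, seed=1):
    """Plain global-best PSO over [0, 1]^n_features; returns the selected
    feature indices and the fitness of the best position found."""
    rng = random.Random(seed)
    n_features = len(X[0])

    def decode(position):
        # A feature is selected when its position value exceeds the threshold.
        return [j for j, p in enumerate(position) if p > threshold]

    def fitness(position):
        subset = decode(position)
        # Accuracy minus a small penalty that favours smaller subsets.
        return loo_1nn_accuracy(X, y, subset) - 0.01 * len(subset) / n_features

    pos = [[rng.random() for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for j in range(n_features):
                r1, r2 = rng.random(), rng.random()
                # Standard velocity update: inertia + cognitive + social terms.
                vel[i][j] = (w * vel[i][j]
                             + c1 * r1 * (pbest[i][j] - pos[i][j])
                             + c2 * r2 * (gbest[j] - pos[i][j]))
                # Clamp the new position to the [0, 1] search range.
                pos[i][j] = min(1.0, max(0.0, pos[i][j] + vel[i][j]))
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return decode(gbest), gbest_fit

if __name__ == "__main__":
    X, y = make_toy_data()
    selected, fit = pso_feature_selection(X, y)
    print("selected features:", selected)
    print("fitness:", round(fit, 4))
```

Each fitness evaluation in this sketch runs a full leave-one-out pass over the data, which illustrates why wrapper-based evaluation becomes expensive as the number of samples and features grows.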


Copyright Date


Date of Award



Te Herenga Waka—Victoria University of Wellington

Rights License

Author Retains Copyright

Degree Discipline

Computer Science

Degree Grantor

Te Herenga Waka—Victoria University of Wellington

Degree Level


Degree Name

Doctor of Philosophy

ANZSRC Type Of Activity code

970108 Expanding Knowledge in the Information and Computing Sciences

Victoria University of Wellington Item Type

Awarded Doctoral Thesis



Victoria University of Wellington School

School of Engineering and Computer Science


Zhang, Mengjie; Xue, Bing