Population-based Ensemble Learning with Tree Structures for Classification
Ensemble learning is one of the most powerful extensions for improving upon individual machine learning models. Rather than using a single model, several models are trained and their predictions combined to make a more informed decision. Such combinations will ideally overcome the shortcomings of any individual member of the ensemble. Most machine learning competition winners feature an ensemble of some sort, and there is also sound theoretical justification for the performance of certain ensembling schemes. The benefits of ensembling are clear in both theory and practice.

Despite this strong performance, ensemble learning is not a trivial task. One of the main difficulties is designing appropriate ensembles. For example, how large should an ensemble be? What members should be included in an ensemble? How should these members be weighted? Our first contribution addresses these concerns using a strongly-typed population-based search (genetic programming) to construct well-performing ensembles, where the entire ensemble (members, hyperparameters, structure) is automatically learnt. The proposed method was found, in general, to be significantly better than all base members and commonly used comparison methods trialled.

Automatically designed ensembles have a range of applications, such as competition entries, forecasting and state-of-the-art predictions. However, these applications often also require additional preprocessing of the input data. The ensemble method above considers only the original training data, whereas many machine learning scenarios require a pipeline (for example, performing feature selection before classification). For the second contribution, a novel automated machine learning method is proposed based on ensemble learning. This method uses a random population-based search of appropriate tree structures and as such is embarrassingly parallel, an important consideration for automated machine learning.
The proposed method is able to achieve equivalent or improved results compared with the current state-of-the-art methods and does so in a fraction of the time (six times as fast).

Finally, while complex ensembles offer great performance, one large limitation is their interpretability. For example, why does a forest of 500 trees predict a particular class for a given instance? In an effort to explain the behaviour of complex models (such as ensembles), several methods have been proposed. However, these approaches tend to suffer from at least one of the following limitations: overly complex in the representation, local in their application, limited to particular feature types (e.g. categorical only), or limited to particular algorithms. For our third contribution, a novel model-agnostic method for interpreting complex black-box machine learning models is proposed. The method is based on strongly-typed genetic programming and overcomes the aforementioned limitations. Multi-objective optimisation is used to generate a Pareto frontier of simple and explainable models which approximate the behaviour of much more complex methods. We found the resulting representations are far simpler than those of existing approaches (an important consideration for interpretability) while providing equivalent reconstruction performance.

Overall, this thesis addresses two of the major limitations of existing ensemble learning, i.e. the complex construction process and the black-box models that are often difficult to interpret. A novel application of ensemble learning in the field of automated machine learning is also proposed. All three methods have shown equivalent or improved performance compared with existing methods.
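
The Pareto frontier mentioned for the third contribution can be illustrated with a minimal sketch. It assumes each candidate surrogate model is summarised by two objectives to be minimised: its complexity (for example, the number of nodes in a tree) and its reconstruction error against the black-box predictions. The names `Candidate`, `dominates` and `pareto_front`, and all numeric values, are hypothetical and chosen only for illustration; the sketch shows the dominance filter alone, not the genetic programming search used in this thesis.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A hypothetical surrogate model summarised by two objectives."""
    name: str
    complexity: int  # e.g. number of tree nodes (lower is simpler)
    error: float     # disagreement with the black box (lower is better)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on both objectives and strictly better on one."""
    no_worse = a.complexity <= b.complexity and a.error <= b.error
    strictly_better = a.complexity < b.complexity or a.error < b.error
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates, simplest first."""
    front = [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]
    return sorted(front, key=lambda c: c.complexity)


if __name__ == "__main__":
    # Made-up values purely for illustration; not results from the thesis.
    population = [
        Candidate("stump", 3, 0.30),
        Candidate("small_tree", 7, 0.12),
        Candidate("bloated_tree", 15, 0.12),
        Candidate("medium_tree", 11, 0.08),
        Candidate("poor_tree", 9, 0.20),
    ]
    for c in pareto_front(population):
        print(c.name, c.complexity, c.error)
```

In this toy population, `bloated_tree` and `poor_tree` are discarded because `small_tree` is both simpler and at least as accurate. The remaining candidates each represent a different trade-off between simplicity and fidelity, which is what lets a user choose how much accuracy to give up for a more readable explanation.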