Open Access Te Herenga Waka-Victoria University of Wellington
Browse
- No file added yet -

Finite Mixture Models - A Divide and Conquer Approach

Download (11.63 MB)
thesis
posted on 2022-10-09, 04:13 authored by Owen, Flynn

Within the field of data clustering, methods are commonly referred to as either 'distance-based' or 'mixture-based', with each having certain limitations. Distance-based methods are fast, but lack an underlying parametric model. Mixture-based methods are interpretable and generalise easily to distributions other than Gaussian, but are slow to estimate, especially when the number of clusters is large. Although generalisable to any distribution within the exponential family, we constrain our focus to only binary data, and thus consider row clustering of data from the Bernoulli distribution. Taking inspiration from divisive hierarchical clustering and develop four 'divide and conquer' mixture-based algorithms to rapidly estimate mixture models with large numbers of components. Our algorithms are: 1) The Hierarchical No Turning Back (HNTB) algorithm, 2) The Hierarchical Single Reallocation (HSR) algorithm, 3) The Hierarchical Reallocation Loop (HRL) algorithm 4) The Hierarchical algorithm that Reallocates During execution (HRD). Our algorithms work by recursively fitting a data matrix with R = 1 and R = 2 mixture components, and then partitioning this data matrix dependent on cluster membership, halting when no sub-components can be partitioned further. We then adapt the 'E-step' from the Expectation-Maximisation (EM) algorithm to reallocate data points across each estimated sub-component of the data matrix. We perform a series of Monte Carlo simulation studies on a series of generated datasets, and compare our results against an adapted version of well known 'Wolfe's method’ (Wolfe, 1971), which maximises the likelihood directly. We establish that 'divide and conquer' mixture modelling approaches are able to identify mixtures with large numbers of components at a much faster rate than our adapted version of Wolfe's method, and approximate both the true coefficients used to simulate the data, as well as the results from our adapted method, to a good degree. Algorithms introduced in this thesis provide suitable methods to address scalability with the ever increasing magnitude of data being stored and processed, while ensuring interpretability with the increasing demand for data-driven insights.

History

Copyright Date

2022-10-09

Date of Award

2022-10-09

Publisher

Te Herenga Waka—Victoria University of Wellington

Rights License

Author Retains Copyright

Degree Discipline

Statistics and Operations Research

Degree Grantor

Te Herenga Waka—Victoria University of Wellington

Degree Level

Masters

Degree Name

Master of Science

ANZSRC Type Of Activity code

2 Strategic basic research

Victoria University of Wellington Item Type

Awarded Research Masters Thesis

Language

en_NZ

Victoria University of Wellington School

School of Mathematics and Statistics

Advisors

Arnold, Richard; McMillan, Louise