Finite Mixture Models - A Divide and Conquer Approach
Within the field of data clustering, methods are commonly referred to as either 'distance-based' or 'mixture-based', each with certain limitations. Distance-based methods are fast but lack an underlying parametric model. Mixture-based methods are interpretable and generalise easily to distributions other than the Gaussian, but are slow to estimate, especially when the number of clusters is large. Although the approach generalises to any distribution within the exponential family, we constrain our focus to binary data, and thus consider row clustering of data from the Bernoulli distribution. Taking inspiration from divisive hierarchical clustering, we develop four 'divide and conquer' mixture-based algorithms to rapidly estimate mixture models with large numbers of components. Our algorithms are: 1) the Hierarchical No Turning Back (HNTB) algorithm, 2) the Hierarchical Single Reallocation (HSR) algorithm, 3) the Hierarchical Reallocation Loop (HRL) algorithm, and 4) the Hierarchical algorithm that Reallocates During execution (HRD). The algorithms work by recursively fitting a data matrix with R = 1 and R = 2 mixture components, partitioning the data matrix according to cluster membership, and halting when no sub-component can be partitioned further. We then adapt the 'E-step' of the Expectation-Maximisation (EM) algorithm to reallocate data points across the estimated sub-components of the data matrix. We perform a series of Monte Carlo simulation studies on generated datasets, and compare our results against an adapted version of the well-known Wolfe's method (Wolfe, 1971), which maximises the likelihood directly. We establish that 'divide and conquer' mixture modelling approaches identify mixtures with large numbers of components much faster than our adapted version of Wolfe's method, and closely approximate both the true coefficients used to simulate the data and the results of the adapted method. The algorithms introduced in this thesis offer a way to address scalability amid the ever-increasing volume of data being stored and processed, while preserving the interpretability demanded by data-driven insights.
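To make the recursive "divide" step concrete, the following is a minimal Python sketch of the splitting stage described above: a binary data matrix is fitted with R = 1 and R = 2 Bernoulli mixture components via EM, split by cluster membership, and recursed on until no sub-component can be partitioned further. It is closest in spirit to the HNTB variant, as it omits the reallocation step. The BIC-based stopping rule, the `min_size` and `max_depth` guards, and all function names are illustrative assumptions, not the thesis's actual implementation.

```python
# Illustrative sketch only: the stopping rule and guards below are assumptions,
# not the algorithms exactly as specified in the thesis.
import numpy as np

def em_bernoulli(X, R, n_iter=200, seed=0):
    """EM for an R-component Bernoulli mixture on a binary matrix X (n rows, p columns)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pi = np.full(R, 1.0 / R)                      # mixing proportions
    theta = rng.uniform(0.25, 0.75, size=(R, p))  # Bernoulli parameters per component
    for _ in range(n_iter):
        # E-step: posterior membership probabilities, computed on the log scale for stability
        log_post = (X @ np.log(theta).T) + ((1 - X) @ np.log(1 - theta).T) + np.log(pi)
        log_post -= log_post.max(axis=1, keepdims=True)
        z = np.exp(log_post)
        z /= z.sum(axis=1, keepdims=True)
        # M-step: update mixing proportions and Bernoulli parameters
        nk = z.sum(axis=0)
        pi = nk / n
        theta = np.clip((z.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
    # Observed-data log-likelihood of the fitted model
    ll = np.sum(np.logaddexp.reduce(
        (X @ np.log(theta).T) + ((1 - X) @ np.log(1 - theta).T) + np.log(pi), axis=1))
    return pi, theta, z, ll

def bic(ll, n_params, n):
    return -2 * ll + n_params * np.log(n)

def divide(X, min_size=20, depth=0, max_depth=6):
    """Recursively split X into sub-components while a 2-component fit is preferred."""
    n, p = X.shape
    if n < 2 * min_size or depth >= max_depth:
        return [X]
    _, _, _, ll1 = em_bernoulli(X, R=1)
    _, _, z2, ll2 = em_bernoulli(X, R=2)
    # Stop splitting this branch when the 1-component model has the lower (better) BIC
    if bic(ll2, 2 * p + 1, n) >= bic(ll1, p, n):
        return [X]
    labels = z2.argmax(axis=1)
    parts = []
    for k in (0, 1):
        part = X[labels == k]
        if part.shape[0] > 0:
            parts.extend(divide(part, min_size, depth + 1, max_depth))
    return parts
```

As a usage illustration, calling `divide(X)` on a binary matrix returns a list of row sub-matrices, whose count serves as a rough estimate of the number of mixture components; the HSR, HRL and HRD variants would then revisit these sub-components with an adapted E-step to reallocate rows, which this sketch does not attempt.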